GitHub - abcsys/libem: An open-source compound AI toolchain for fast and accurate entity matching, powered by LLMs.

Scalable entity matching with human-level accuracy, powered by LLMs and tooling.

[Jul'24] Libem Arena: Online Evaluation and Leaderboard for EM
[Jun'24] Liberal Entity Matching as a Compound AI Toolchain
[Jun'24] Presented at Compound AI Systems Workshop - Data + AI Summit 2024

Libem is an open-source, compound AI toolchain designed to perform and streamline entity matching (EM). EM involves identifying whether two descriptions refer to the same entity, a task crucial in data management and integration. EM methods have evolved from rule-based to LLM-based systems, yet existing LLM-based systems still fall short due to their reliance on static knowledge and rigid, predefined prompts.

Libem addresses these limitations by adopting a modular, tool-oriented approach. It supports dynamic tool use, self-refinement, and optimization, allowing it to adapt and refine its processes based on the dataset and performance metrics. Unlike existing LLM-based EM systems, which usually take the form of Python notebooks, Libem offers a composable and reusable toolchain that can be easily incorporated into applications or used as a service via APIs. Specifically, Libem can be used as a library or a CLI tool and can be configured to use different models, tools, and parameters. Libem supports a variety of models, including GPT-4o, GPT-4o-mini, GPT-4, GPT-3.5-turbo, and Llama3, as well as tools that facilitate entity matching, such as browsing and data preparation.

Installation

To install the Libem library and CLI:

pip install libem

For the latest version, you can install from main:

pip install git+https://github.com/abcsys/libem.git

Alternatively, if you are interested in contributing to Libem or running the benchmarks, you can install from source. First clone the repository, then run the following from the repository root:

make install

After installation, you can run the CLI tool to configure Libem with API key(s):

libem $ libem configure
Enter OPENAI_API_KEY ('sk-****'):

The API key is used to access the OpenAI API. If you don't have an API key, you can get one from the OpenAI website.

You can now validate the installation:

libem $ libem match apple orange
Match: no

Or run through the EM examples in /examples:

make match

Libem Usage

Libem can be used as a library or as a CLI tool. The library provides a simple API to match two entities:

import libem

e1 = "Dyson Hot+Cool AM09 Jet Focus heater and fan, White/Silver"
e2 = "Dyson AM09 Hot + Cool Jet Focus Fan Heater - W/S"

is_match = libem.match(e1, e2)

The CLI tool can be used to match entities from the command line:

libem $ libem match "Dyson Hot+Cool AM09 Jet Focus heater and fan, White/Silver" "Dyson AM09 Hot + Cool Jet Focus Fan Heater - W/S"
Match: yes
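
The library call `libem.match(e1, e2)` compares one pair at a time. As a minimal sketch of how such a pairwise matcher could be applied to a small list of records, the example below takes the matcher as a parameter, so it runs offline with a toy stand-in. The names `to_bool`, `find_matches`, and `toy_match` are hypothetical helpers, not part of Libem. The return type of `libem.match` is not specified in this README, so `to_bool` assumes it may be a bool, a "yes"/"no" string, or a dict with an "answer" key.

```python
from itertools import combinations


def to_bool(result) -> bool:
    """Normalize a matcher result to a boolean.

    Assumption: the result is a bool, a "yes"/"no" string, or a dict
    with an "answer" key. Adjust this to the matcher's real return type.
    """
    if isinstance(result, dict):
        result = result.get("answer")
    if isinstance(result, str):
        return result.strip().lower() in {"yes", "true"}
    return bool(result)


def find_matches(records, match_fn):
    """Return index pairs (i, j), i < j, that match_fn judges to be the same entity."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if to_bool(match_fn(a, b))
    ]


def toy_match(e1: str, e2: str) -> str:
    """Offline stand-in for an LLM matcher: Jaccard token overlap >= 0.5."""
    t1, t2 = set(e1.lower().split()), set(e2.lower().split())
    if not (t1 | t2):
        return "no"
    overlap = len(t1 & t2) / len(t1 | t2)
    return "yes" if overlap >= 0.5 else "no"


records = [
    "dyson am09 hot cool fan heater",
    "dyson am09 hot cool heater fan white",
    "apple iphone 13 128gb",
]
pairs = find_matches(records, toy_match)
print(pairs)
```

With Libem installed and configured, `libem.match` could be passed in place of `toy_match`. Note that this loop compares every pair, so the number of calls grows quadratically with the number of records.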

Both the library and the CLI tool can be configured to use tools, different LLM models, and other parameters and prompts. For example:

import libem

libem.calibrate({
    "libem.match.parameter.model": "gpt-3.5-turbo",
    "libem.match.parameter.tools": ["libem.browse"],
})

This will use the gpt-3.5-turbo model and the libem.browse tool to match entities.
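
To compare several configurations, the dictionaries passed to `libem.calibrate` can be built programmatically. The following is a small sketch under stated assumptions: `make_config` is a hypothetical helper, only the two parameter keys shown above are used, and the model identifier "gpt-4o-mini" is assumed from the list of supported models rather than taken from a documented example.

```python
from typing import Optional

# Parameter keys as they appear in the calibrate example above.
MODEL_KEY = "libem.match.parameter.model"
TOOLS_KEY = "libem.match.parameter.tools"


def make_config(model: str, tools: Optional[list] = None) -> dict:
    """Build a configuration dict in the shape used by the calibrate example."""
    config = {MODEL_KEY: model}
    if tools:
        config[TOOLS_KEY] = list(tools)
    return config


configs = [
    make_config("gpt-4o-mini"),  # assumed identifier string
    make_config("gpt-3.5-turbo", tools=["libem.browse"]),
]

for config in configs:
    # With Libem installed and configured, each dict could be applied
    # with libem.calibrate(config) before calling libem.match.
    print(config)
```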

Libem can be configured to output more information about the matching process, e.g., in the CLI:

libem $ libem match apple orange --cot --confidence
Explanation:
1. **Name Comparison**: The names "apple" and "orange" are different.
2. **Category Comparison**: Both are fruits, but they are distinct types of fruits.
3. **Attributes Comparison**: Apples and oranges have different colors, tastes, and nutritional profiles.
4. **Contextual Usage**: In common language, "apple" and "orange" are used to refer to different fruits.

Match: no
Confidence: 5
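
When the CLI is driven from a script, its text output can be turned into structured data. The sketch below parses output in the layout shown above ("Explanation:", "Match:", "Confidence:"). `parse_match_output` is a hypothetical helper, and the layout is assumed from this one example only; the confidence scale is not documented here, so the value is kept as a plain number.

```python
import re


def parse_match_output(text: str) -> dict:
    """Parse CLI-style output into match, confidence, and explanation fields."""
    match_line = re.search(r"^Match:\s*(\w+)\s*$", text, flags=re.MULTILINE)
    if match_line is None:
        raise ValueError("no 'Match:' line found")
    conf_line = re.search(
        r"^Confidence:\s*(\d+(?:\.\d+)?)\s*$", text, flags=re.MULTILINE
    )
    # The explanation is everything between "Explanation:" and "Match:".
    expl = re.search(
        r"^Explanation:\s*\n(.*?)(?=^Match:)", text, flags=re.MULTILINE | re.DOTALL
    )
    return {
        "match": match_line.group(1).lower() == "yes",
        "confidence": float(conf_line.group(1)) if conf_line else None,
        "explanation": expl.group(1).strip() if expl else None,
    }


sample = """Explanation:
1. **Name Comparison**: The names "apple" and "orange" are different.
2. **Category Comparison**: Both are fruits, but they are distinct types of fruits.

Match: no
Confidence: 5
"""

result = parse_match_output(sample)
print(result["match"], result["confidence"])
```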

Benchmarks and Arena

Libem comes with a benchmarking tool for comparing the performance of different Libem configurations over 10+ common EM datasets such as amazon-google, dblp-acm, and abt-buy. To run these benchmarks, first fetch the datasets from libem-sample-data:

make data

Then, the benchmarking tool can be invoked as:

python -m benchmark.run -n amazon-google

The benchmarking tool accepts several options; see /benchmark for more information.
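
This README does not state which metrics the benchmarking tool reports. Precision, recall, and F1 are commonly used to evaluate entity matching, so the sketch below shows how they can be computed from boolean predictions and ground-truth labels. `precision_recall_f1` is a hypothetical helper for illustration and is not taken from the benchmark code.

```python
def precision_recall_f1(predictions, labels):
    """Compute precision, recall, and F1 for boolean match predictions."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)        # predicted match, is a match
    fp = sum(1 for p, l in pairs if p and not l)    # predicted match, is not
    fn = sum(1 for p, l in pairs if not p and l)    # missed match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy data: True means "the two descriptions refer to the same entity".
labels = [True, True, False, False, True]
predictions = [True, False, False, True, True]

precision, recall, f1 = precision_recall_f1(predictions, labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```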

Libem also comes with an online evaluation tool called Libem Arena, where you can compete with Libem, other EM tools, and human annotators to match entities over a variety of datasets. We track the performance of these tools and annotators over time, and provide a leaderboard of the best-performing tools and annotators.

Citation & Reading More

If you use Libem in a research paper, please cite our work as follows:

@article{fu2024liberal,
  title={Liberal Entity Matching as a Compound AI Toolchain},
  author={Fu, Silvery D and Wang, David and Zhang, Wen and Ge, Kathleen},
  journal={arXiv preprint arXiv:2406.11255},
  year={2024}
}

You can also read more about the research behind Libem in the following manuscripts:

Liberal Entity Matching as a Compound AI Toolchain (Academic Paper, June 2024)
Poster: Liberal Entity Matching as a Compound AI Toolchain (Poster, Compound AI Systems Workshop, San Francisco at Data + AI Summit, June 2024)
Libem Arena (Online evaluation, July 2024)

Please report any issues or feedback to the GitHub repository. We welcome contributions and collaborations!

