- The preprint of the paper is available at arXiv: https://arxiv.org/abs/2312.00326
- The source code, data, and/or other artifacts are available at GitHub: https://github.com/qzc438/ontology-llm
- This paper has been accepted by PVLDB 2025.
- The slide presentation of the PVLDB 2025 paper can be found in the `slide_presentation` folder.
- A production version linked to the PVLDB 2025 paper can be found in the Releases section.
@article{qiang2023agent,
title={{Agent-OM}: Leveraging {LLM} Agents for Ontology Matching},
author={Zhangcheng Qiang and Weiqing Wang and Kerry Taylor},
journal={Proceedings of the {VLDB} Endowment},
year={2024},
volume={18},
number={3},
pages={516--529},
address={London, UK},
publisher={VLDB Endowment},
doi={10.14778/3712221.3712222}
}
- For technical inquiries, please submit a GitHub issue.
- For feature discussion or potential extensions, please join our foundation model discussion group: https://groups.google.com/g/agent-om
- To track the continuous development of LLMs, we propose to use the benchmark with a timestamp tag.
- Our experiment was run on a Dell Alienware Aurora R15.
- Memory: 64.0 GiB
- Processor: 13th Gen Intel® Core™ i9-13900KF × 32
- Graphics: NVIDIA GeForce RTX™ 4090
- Disk Capacity: 6.1 TB
- The operating system is Ubuntu 24.04.1 LTS.
- The CUDA version is 12.2.
- The Nvidia driver version is 535.247.01.
- PostgreSQL: https://www.postgresql.org/download/
- pgAdmin: https://www.pgadmin.org/download/ (optional for GUI access to the database)
- pgvector: https://github.com/pgvector/pgvector
- Create a database and name it `ontology` (see the setup sketch after the installation links below).
- Install PostgreSQL, pgAdmin, and pgvector on Ubuntu and CUDA 12.2:
- Install PostgreSQL: https://www.postgresql.org/download/linux/ubuntu/
- Install pgAdmin: https://www.pgadmin.org/download/pgadmin-4-apt/
- Install pgvector: https://github.com/pgvector/pgvector
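If you prefer to create the `ontology` database and enable the pgvector extension from Python instead of pgAdmin, the following sketch shows one way to do it. The host, user, and password are assumptions based on the troubleshooting step below; adjust them to your own setup.

```python
# Minimal setup sketch, assuming a local PostgreSQL server reachable as user
# "postgres" with password "postgres" (see the password reset step below).
import psycopg2

# Connect to the default maintenance database first to create `ontology`.
conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="postgres")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE ontology;")
conn.close()

# Connect to the new database and enable the pgvector extension.
conn = psycopg2.connect(host="localhost", dbname="ontology",
                        user="postgres", password="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()
```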
- If password authentication fails, reset the postgres password:
psql --version
sudo -u postgres psql
alter user postgres password 'postgres';
\q
- If you see `fatal error: postgres.h: No such file or directory`:
sudo apt install postgresql-server-dev-16 (replace 16 with your PostgreSQL server version)
- Install Python: https://www.python.org/downloads/
- We report our results with Python 3.10.12: https://www.python.org/downloads/release/python-31012/
- Install LangChain packages:
pip install langchain==0.2.10
pip install langchain-openai==0.1.17
pip install langchain-anthropic==0.1.20
pip install langchain_community==0.2.9
- Install other packages:
pip install pandas==2.0.3
pip install rdflib==7.0.0
pip install nltk==3.9.1
pip install python-dotenv==1.0.1
pip install pyenchant==3.2.2
pip install tiktoken==0.7.0
pip install asyncpg==0.28.0
pip install psycopg2_binary==2.9.9
pip install pgvector==0.1.8
pip install commentjson==0.9.0
pip install transformers==4.41.1
pip install colorama==0.4.6
- Install visualisation packages:
pip install matplotlib==3.8.4
pip install notebook
pip install ipyparallel
- Alternatively, you can run the following script to install all required packages:
pip install -r requirements.txt
- If Jupyter Notebook shows a blank page, see: https://stackoverflow.com/questions/55152948/juypter-notebook-shows-blank-page
Author Note: There is a known issue with the Enchant and PyEnchant libraries; we suggest using the hunspell and pyhunspell libraries instead in `util.py`.
sudo apt update
sudo apt install -y build-essential pkg-config libhunspell-dev hunspell
sudo apt install hunspell-en-gb hunspell-en-us
pip install hunspell==0.5.5
import hunspell

# Load the British and American English dictionaries.
uk_dict = hunspell.HunSpell('/usr/share/hunspell/en_GB.dic', '/usr/share/hunspell/en_GB.aff')
us_dict = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')

def change_british_to_american(word):
    # If the word is valid British English but not American English,
    # return the first American spelling suggestion (if any).
    if uk_dict.spell(word) and not us_dict.spell(word):
        suggestions = us_dict.suggest(word)
        return suggestions[0] if suggestions else word
    return word
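A quick usage check (our own example; the exact suggestion returned depends on the installed hunspell dictionaries):

```python
print(change_british_to_american("colour"))  # typically "color"
print(change_british_to_american("paper"))   # valid in both dialects, so returned unchanged
```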
- GitHub link: https://github.com/ollama/ollama
- Install Ollama: https://ollama.com/download
- Ollama FAQs: https://github.com/ollama/ollama/blob/main/docs/faq.md
- Link Ollama to LangChain: https://python.langchain.com/v0.1/docs/integrations/llms/ollama/
- Install PyTorch: https://pytorch.org/get-started/locally/
- Install Open WebUI: https://docs.openwebui.com/getting-started/ (optional for GUI access to LLMs)
- Install Ollama on Ubuntu and CUDA 12.2:
- Install or update Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Start or stop the Ollama service:
sudo systemctl start ollama
sudo systemctl stop ollama
- Install PyTorch:
pip3 install torch torchvision torchaudio
- Ollama models: https://ollama.com/library
- Add a model:
ollama pull <MODEL_NAME>
- Find a model's metadata:
ollama show <MODEL_NAME>
- Remove a model:
ollama rm <MODEL_NAME>
- Check local models:
ollama list
- Update local models (please check ollama/ollama#2633 for further updates):
ollama list | awk 'NR>1 {print $1}' | xargs -I {} sh -c 'echo "Updating model: {}"; ollama pull {}; echo "--"' && echo "All models updated."
At present, multiple LLMs are used in the experiments. Please refer to README.md for the models currently used, together with the API access or download links, pricing, and licensing for each model.
- You will need API keys to interact with API-accessed commercial LLMs.
- Create a file named `.env` and write:
OPENAI_API_KEY = <YOUR_OPENAI_API_KEY>
ANTHROPIC_API_KEY = <YOUR_ANTHROPIC_API_KEY>
- To protect your API keys, please add `.env` to the file `.gitignore`:
.env
git rm --cached .env
- Load API keys into the file `run_config.py`:
import os
import dotenv
dotenv.load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("ANTHROPIC_API_KEY")
- Select one LLM in the file `run_config.py`:
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_community.chat_models import ChatOllama
# load GPT models: https://platform.openai.com/docs/models/
# pricing: https://openai.com/api/pricing/
llm = ChatOpenAI(model_name='gpt-4o-2024-05-13', temperature=0)
llm = ChatOpenAI(model_name='gpt-4o-mini-2024-07-18', temperature=0)
llm = ChatOpenAI(model_name='gpt-3.5-turbo-0125', temperature=0) # old, not included
# load Anthropic models: https://docs.anthropic.com/en/docs/about-claude/models
# pricing: https://www.anthropic.com/pricing#anthropic-api
llm = ChatAnthropic(model="claude-3-opus-20240229", temperature=0) # expensive, not included
llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0)
llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
# load Llama models
llm = ChatOllama(model="llama3:8b", temperature=0)
llm = ChatOllama(model="llama3.1:8b", temperature=0)
# load Qwen models
llm = ChatOllama(model="qwen2:7b", temperature=0)
llm = ChatOllama(model="qwen2.5:7b", temperature=0)
# load Gemma models
llm = ChatOllama(model="gemma2:9b", temperature=0)
# load GLM models
llm = ChatOllama(model="glm4:9b", temperature=0)
Author Note: To ensure consistency and reproducibility, we recommend using API-accessed models with a timestamp tag.
- Select one embeddings service in the file `run_config.py`:
from langchain_openai import OpenAIEmbeddings
# https://platform.openai.com/docs/guides/embeddings/embedding-models
embeddings_service = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_length = 1536
embeddings_service = OpenAIEmbeddings(model="text-embedding-3-small")
vector_length = 1536
embeddings_service = OpenAIEmbeddings(model="text-embedding-3-large")
vector_length = 3072
- It is possible to use embedding models other than OpenAI's. For example, the following works for Llama 3 embedding models.
pip install langchain-ollama==0.1.0
from langchain_ollama import OllamaEmbeddings
embeddings_service = OllamaEmbeddings(model="llama3:8b")
vector_length = 4096
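Whichever embedding service you choose, `vector_length` should match the dimensionality the model actually returns, since it presumably sizes the pgvector column. A quick check, using the standard LangChain `embed_query` call:

```python
# Verify that vector_length matches the embedding dimension of the chosen model.
sample = embeddings_service.embed_query("ontology matching")
assert len(sample) == vector_length, (
    f"vector_length is {vector_length}, but the model returns {len(sample)} dimensions"
)
```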
Author Note: The current settings of Agent-OM for the OAEI 2025 campaign are as follows.
- Agent-OM (Product Version):
llm = ChatOpenAI(model_name='gpt-4o-2024-05-13', temperature=0.0, seed=42, top_p=1.0, presence_penalty=0.0, frequency_penalty=0.0)
embeddings_service = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_length = 1536
- Agent-OM-Lite (Lightweight Version):
llm = ChatOllama(model="llama3:8b", temperature=0.0, seed=42, top_p=1.0, top_k=1, repeat_penalty=1.0)
embeddings_service = OllamaEmbeddings(model="llama3:8b")
vector_length = 4096
- To ensure all the data uses a consistent reference schema, please run the following script:
python fix_inconsistent_reference.py
- Set your alignment in the file `run_config.py`. For example, if you would like to run the CMT-ConfOf alignment, the settings are:
context = "conference"
o1_is_code = False
o2_is_code = False
alignment = "conference/cmt-confof/component/"
- Set your matching hyperparameters in the file `run_config.py`. For example, if you would like to set `similarity_threshold = 0.90` and `top_k = 3`, the settings are:
similarity_threshold = 0.90
top_k = 3
- (Optional) Set `num_matches` in the file `run_config.py`. `num_matches` applies a "limit" to the database queries. We set it to 50 here, but you can adjust this number to fit your database memory (an illustrative sketch of how the three hyperparameters interact follows the setting below).
num_matches = 50
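Conceptually, the three parameters shape candidate retrieval as sketched below. The table and column names are hypothetical placeholders for illustration only; see `om_database_matching.py` for the actual queries used by Agent-OM.

```python
# Illustrative only: how similarity_threshold, top_k, and num_matches could be
# combined in a pgvector-style nearest-neighbour search. The table and column
# names are hypothetical, not the schema used by Agent-OM.
candidate_query = """
    SELECT entity_uri, 1 - (embedding <=> %(query_embedding)s) AS similarity
    FROM target_ontology_entities
    ORDER BY embedding <=> %(query_embedding)s
    LIMIT %(num_matches)s;  -- num_matches caps the rows fetched per query
"""

def select_candidates(rows, similarity_threshold=0.90, top_k=3):
    # Keep candidates at or above the threshold, then return the top_k best.
    kept = [(uri, score) for uri, score in rows if score >= similarity_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]
```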
- Run the script:
python run_config.py
- The alignment result will be stored in the folder `alignment/`. You can run the script `csv_to_alignment_api.py` to convert the CSV format to the Alignment API format: https://moex.gitlabpages.inria.fr/alignapi/format.html
- The performance evaluation will be stored in the file `result.csv`. The first five lines are intermediate results for debugging purposes. For the final result, use the last line, which ends with "llm_with_agent" (a small extraction sketch follows this list).
- The cost evaluation will be stored in the file `cost.csv`. The cost calculation only works for API-accessed commercial LLMs.
- The time evaluation will be stored in the file `time.csv`. The unit of measurement is milliseconds.
- The matching log will be stored in the file `agent.log`. This file will be rewritten when you run a new task.
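If you post-process many runs, a small helper like the one below can extract the final evaluation. It is our own sketch and only assumes, as described above, that the last row of `result.csv` holds the `llm_with_agent` result:

```python
import pandas as pd

# Read the performance evaluation and keep the final row, which holds the
# "llm_with_agent" result; the earlier rows are intermediate debugging output.
results = pd.read_csv("result.csv")
print(results.iloc[-1])
```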
- `data/`: data from OAEI tracks.
- `find_data_vocabulary.py`: summarise the vocabulary used in OAEI tracks.
- `find_reference_only.py`: find the reference alignment only.
- `fix_inconsistent_reference.py`: fix the URI issue of OAEI tracks.
- `alignment/`: alignment results.
- `om_ontology_to_csv.py`: Retrieval Agent Part 1.
- `om_csv_to_database.py`: Retrieval Agent Part 2.
- `om_database_matching.py`: Matching Agent.
- `run_config.py`: main function of the project.
- `run_series_archaeology.py`: run all the archaeology alignments at one time.
- `run_series_conference.py`: run all the conference alignments at one time.
- `run_series_multifarm.py`: run all the multifarm alignments at one time.
- `run_series_similarity.py`: run different similarity thresholds for one alignment at one time.
- `llm_matching.py`: examples of using purely LLMs for general matching tasks.
- `llm_om_only.py`: examples of using LLMs only for ontology matching.
- `llm_om_with_context.py`: examples of using LLMs with context information for ontology matching.
- `util.py`: util component of the project.
Frequently Asked Questions (FAQs):
- Why does the Retrieval Agent have two parts, `om_ontology_to_csv.py` and `om_csv_to_database.py`?
Answer: You can simply combine these two parts together. We decompose this into two parts to make it easy to debug any issues that may occur in the database storage.
- Why does `om_csv_to_database.py` create three additional columns `syntactic_matching`, `lexical_matching`, and `semantic_matching` in the table `ontology_matching`?
Answer: You can simply ignore these columns. We add these columns to debug any issues that may occur in the database storage.
- Why do I find a slight difference for each run?
Answer: LLM outputs are not fully deterministic, even with the same settings; see https://community.openai.com/t/run-same-query-many-times-different-results/140588
- How do I use the files `run_series_archaeology.py`, `run_series_conference.py`, and `run_series_multifarm.py`?
Answer: Please uncomment the following code in the file `run_config.py` (a sketch of how the series scripts use this variable follows the code below).
import os
if os.environ.get('alignment'):
alignment = os.environ['alignment']
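With that code uncommented, `run_config.py` reads the alignment from the `alignment` environment variable, so a series script can loop over several alignments. The sketch below uses a hypothetical alignment list to illustrate the idea; the actual `run_series_*.py` scripts define their own lists.

```python
import os
import subprocess

# Hypothetical list of alignments; the real run_series_*.py scripts define their own.
alignments = [
    "conference/cmt-confof/component/",
    "conference/cmt-conference/component/",
]

for alignment in alignments:
    # Each run picks up its alignment from the environment variable.
    env = dict(os.environ, alignment=alignment)
    subprocess.run(["python", "run_config.py"], env=env, check=True)
```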
- How do I use the file `run_series_similarity.py`?
Answer: Please set the variables in the file `run_series_similarity.py`. For example, if you would like to check the similarity thresholds [1.00, 0.95, ..., 0.55, 0.50], the settings are as follows (a sketch of the resulting threshold list appears after the settings):
start = 1.00
end = 0.50
step = -0.05
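For reference, these three values enumerate the thresholds as sketched below (our own illustration of the intended iteration, not the exact code in `run_series_similarity.py`). Working in integer hundredths avoids floating-point drift such as 0.8499999:

```python
start, end, step = 1.00, 0.50, -0.05

# Enumerate thresholds from start down to end (inclusive) in integer hundredths.
thresholds = [round(i / 100, 2) for i in range(int(start * 100), int(end * 100) - 1, int(step * 100))]
print(thresholds)  # [1.0, 0.95, 0.9, ..., 0.55, 0.5]
```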
- `generate_conference_benchmark.py`: generate the results of the OAEI Conference Track.
- `generate_anatomy_mse_benchmark.py`: generate the results of the OAEI Anatomy Track and MSE Track.
- `benchmark_2022/`: results of OAEI 2022.
- `benchmark_2023/`: results of OAEI 2023.
- `draw_benchmark.ipynb`: visualise the results of the evaluation.
- `draw_ablation_study.ipynb`: visualise the results of the ablation study.
- `result_csv/`: original data of the results.
- `result_figure/`: visualisation of the results.
- Our visualisation is inspired by the following references:
- `campaign/OAEI_2025`: results of the OAEI 2025 campaign.
- We have created a debugging log for this project. Click the link here.
- AI-generated content (AIGC) can contain harmful, unethical, prejudiced, or negative content (https://docs.mistral.ai/capabilities/guardrailing/). However, ontology matching tasks only check the meaning of domain-specific terminologies, and we have not observed such content being generated.
- Our data-driven application architecture is inspired by: https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/cloud-sql/postgres/pgvector/notebooks/pgvector_gen_ai_demo.ipynb
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
