KGrEaT is a framework built to evaluate the performance impact of knowledge graphs (KGs) on multiple downstream tasks. To that end, the framework implements various algorithms to solve tasks like classification, regression, or recommendation of entities. The impact of a given KG is measured by using its information as background knowledge for solving the tasks. To compare the performance of different KGs on downstream tasks, a fixed experimental setup with the KG as the only variable is used.
The hardware requirements of the framework are dominated by the embedding generation step (see the DGL-KE framework for details). To compute embeddings for KGs of the size of DBpedia or YAGO, we recommend using a CPU machine with at least 100 GB of RAM. As of now, the datasets are moderate in size and the implemented algorithms are quite efficient, so the execution of tasks does not consume a large amount of resources.
- In the project root, create a conda environment with `conda env create -f environment.yaml`
- Activate the new environment with `conda activate kgreat`
- Install dependencies with `poetry install`
- Make sure that the `kgreat` environment is activated when using the framework!
- Create a new folder under `kg` which will contain all data related to the graph (input files, configuration, intermediate representations, results, logs). Note that the name of the folder will serve as the identifier for the graph throughout the framework.
- In the folder of your KG:
  - Create a sub-folder `data`. Put the RDF files of the KG in this folder (supported file types are NT, TTL, TSV). You may want to create a download script similar to those of the existing KGs.
  - Create a file `config.yaml` with the evaluation configuration of your KG. You can find explanations for all configuration parameters in the `example_config.yaml` file in the root directory.
In the following, you will prepare and run the three stages Mapping, Preprocessing, and Task. As the later stages depend on the earlier ones, they must be run in this order.
First, pull the Docker images of all stages. Make sure that your `config.yaml` is set up correctly, as the manager only pulls images of the steps defined in the config. In the root directory of the project, run the following commands:
`python . <your-kg-identifier> pull`

We then run the prepare action, which initializes the files required for the actual stages. In particular, it creates an `entity_mapping.tsv` file containing all the URIs and labels of entities to be mapped.

`python . <your-kg-identifier> prepare`

Then we run the actual stages:

`python . <your-kg-identifier> run`

The results of the evaluation runs are put in a `result` folder within your KG directory. The framework creates one TSV result file and one log file per task.
You can use the `result_analysis.ipynb` notebook to explore and compare the results of one or more KGs.
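If you prefer to inspect results programmatically, the following is a minimal sketch that collects the per-task TSV result files of one run into a single pandas DataFrame. The KG identifier, the run folder name, and the assumption that all result files share compatible columns are illustrative, not part of the framework's API:

```python
from pathlib import Path

import pandas as pd

# Hypothetical example: adjust the KG identifier and the run folder to your setup.
result_dir = Path('kg/my-kg/result/run_1')

# Each task writes one TSV result file; concatenate them for a quick overview.
frames = []
for tsv_file in sorted(result_dir.glob('*.tsv')):
    df = pd.read_csv(tsv_file, sep='\t')
    df['task'] = tsv_file.stem  # remember which task produced these rows
    frames.append(df)

results = pd.concat(frames, ignore_index=True)
print(results.head())
```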
If you want to trigger individual stages or steps, you can do so by supplying them as optional arguments. You can trigger steps by supplying the ID of the step as defined in the `config.yaml`. Here are some examples:
Running only the preprocessing stage:

`python . <your-kg-identifier> run --stage preprocessing`

Running the RDF2vec embedding generation step of the preprocessing stage:

`python . <your-kg-identifier> run --stage preprocessing --step embedding-rdf2vec`

Running two specific classification tasks (i.e., steps of the Task stage):

`python . <your-kg-identifier> run --stage task --step dm-aaup_classification dm-cities_classification`

Contributions to the framework are highly welcome, and we appreciate pull requests for additional datasets, tasks, matchers, preprocessors, etc.! Here's how you can extend the framework:
To add a dataset for an existing task type, create a folder in the `dataset` directory with at least the following files:
- `Dockerfile`: Setup of the Docker container including all relevant preparations (import code, install dependencies, ...).
- `dataset`: Dataset in a format of your choice. Have a look at `shared/dm/utils/dataset.py` for already supported dataset formats.
- `entities.tsv`: Labels and URIs of the dataset entities that have to be mapped to the input KG.
- `README.md`: A file describing the dataset as well as any deviations from the general task API.
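For illustration only, the snippet below writes a tiny `entities.tsv` from a list of (label, URI) pairs. The column names used here are an assumption, so check an existing dataset folder for the exact expected layout:

```python
import csv

# Hypothetical example entities: (label, URI) pairs that a mapper can link to the input KG.
entities = [
    ('Germany', 'http://example.org/entity/Germany'),
    ('France', 'http://example.org/entity/France'),
]

with open('entities.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['label', 'uri'])  # assumed header; verify against an existing dataset folder
    writer.writerows(entities)
```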
To run a task using the new dataset, you have to add an entry to your `config.yaml` file in which you define an identifier as well as the necessary parameters for your task. Don't forget to update `example_config.yaml` with information about the new dataset/task!
To define a new task type, add the code to a subfolder below `shared`. If your task type uses Python, you can put it below `shared/dm` and reuse the utility functions in `shared/dm/util`.
The only information a task receives is the environment variable `KGREAT_STEP`, which it can use to identify its configuration in the `config.yaml` of the KG.
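As an illustration, the following minimal sketch shows how a task implemented in Python might pick up its step ID and configuration. The path of the mounted `config.yaml` and the name of the task section in the config are assumptions for this example, not part of the documented API:

```python
import os

import yaml

# KGREAT_STEP is the only piece of information the framework passes to the task.
step_id = os.environ['KGREAT_STEP']

# Hypothetical location of the KG's config inside the task container.
with open('/kg/config.yaml') as f:
    config = yaml.safe_load(f)

# Hypothetical lookup: fetch the parameters defined for this step in the config.
task_config = config['tasks'][step_id]
print(f'Running step {step_id} with parameters {task_config}')
```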
Results should be written to the `result/run_<run_id>` folder of the KG using the existing format.
To define a new mapper, add the code to a subfolder below `shared/mapping`. The mapper should be self-contained and should define its own `Dockerfile` (see the existing mappers for examples).
A mapper should fill gaps in the `source` column of the `entity_mapping.tsv` file in the KG folder (i.e., load the file, fill the gaps, update the file).
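As a rough illustration, a mapper's core loop could look like the sketch below. Apart from the `source` column, the column names, the file location, and the label-based matching strategy are assumptions for this example; check the existing mappers for the actual file layout:

```python
from typing import Optional

import pandas as pd

MAPPING_FILE = '/kg/entity_mapping.tsv'  # hypothetical mount point; the real path may differ


def map_by_label(label: str) -> Optional[str]:
    """Placeholder matching strategy: return a URI of the input KG for the label, or None."""
    return None  # replace with an actual lookup against the KG


mapping = pd.read_csv(MAPPING_FILE, sep='\t')

# Only fill rows whose `source` entry is still missing; keep existing mappings untouched.
# The `label` column is an assumed name for the entity labels in the mapping file.
unmapped = mapping['source'].isna()
mapping.loc[unmapped, 'source'] = mapping.loc[unmapped, 'label'].map(map_by_label)

mapping.to_csv(MAPPING_FILE, sep='\t', index=False)
```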
To use the mapper, add a corresponding entry to the mapping section of your `config.yaml`.
To define a new preprocessing method, add the code to a subfolder below `shared/preprocessing`. The preprocessing method should be self-contained and should define its own `Dockerfile` (see the existing preprocessors for examples).
A preprocessing step can use any data contained in the KG folder and persist artifacts in the same folder. These artifacts may then be used by subsequent preprocessing steps or by tasks.
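To make this contract concrete, here is a minimal, hypothetical preprocessing step that reads triples from the KG's `data` folder and persists a simple artifact (entity degree counts) back into the KG folder. The mount point, the naive N-Triples parsing, and the artifact name and format are illustrative assumptions, not an existing preprocessor:

```python
from collections import Counter
from pathlib import Path

KG_DIR = Path('/kg')  # hypothetical mount point of the KG folder

# Count how often each entity appears as a subject in the KG's N-Triples files.
degree = Counter()
for nt_file in (KG_DIR / 'data').glob('*.nt'):
    with open(nt_file) as f:
        for line in f:
            parts = line.split(maxsplit=2)
            if len(parts) == 3:  # naive handling: subject, predicate, remainder
                degree[parts[0]] += 1

# Persist the artifact in the KG folder so that subsequent steps or tasks can use it.
with open(KG_DIR / 'entity_degrees.tsv', 'w') as out:
    for entity, count in degree.items():
        out.write(f'{entity}\t{count}\n')
```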
To use the preprocessing method, add a corresponding entry to the preprocessing section of your `config.yaml`.