This repository contains the Python code used to generate the Tough Tables (2T) dataset, a benchmark for table annotation algorithms on the CEA and CTA tasks (as defined in the SemTab challenge). The target KG is DBpedia 2016-10.
The 2T dataset is available on Zenodo.
The 2T dataset is compliant with the SemTab 2019 format: any annotation algorithm that produces a results file in the SemTab challenge submission format can be evaluated against it. For details, see SemTab 2019 (CEA, CTA).
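For instance, a CEA submission is a CSV file with one line per annotated cell. Below is a minimal sketch of producing such a file; the table identifier and cell coordinates are hypothetical.

```python
import csv

# Hypothetical CEA submission rows in the SemTab 2019 format:
# (tab_id, col_id, row_id, annotation), one line per annotated cell.
rows = [
    ("EXAMPLE_TABLE", "0", "1", "http://dbpedia.org/resource/Rome"),
    ("EXAMPLE_TABLE", "0", "2", "http://dbpedia.org/resource/Paris"),
]

with open("my_annotations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```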
This work is based on the following paper:
Cutrona, V., Bianchi, F., Jimenez-Ruiz, E. and Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. ISWC 2020, LNCS 12507, pp. 1–16.
The code is developed for Python 3.8.
Install all the required packages listed in the `requirements.txt` file.
```bash
virtualenv -p python3.8 venv  # we suggest creating a virtual environment
source venv/bin/activate
pip install -r requirements.txt
```

The following command reads the tables under the `control` and `tough` directories, and generates the gold standard (GS).
```bash
python tough_tables.py make_gs --output_folder ./gs \
                               --endpoint http://dbpedia.org/sparql
```

Note: the resulting GS may differ across executions, due to the unsorted results of SPARQL queries.
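To illustrate why runs can differ, here is a minimal sketch of an unordered query against the same endpoint (it uses SPARQLWrapper for brevity, which is not necessarily the library used internally): without an `ORDER BY` clause, SPARQL gives no guarantee on result order.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query the same endpoint used by make_gs. Without ORDER BY, the
# result order is not guaranteed, so repeated runs may differ.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?city WHERE { ?city a <http://dbpedia.org/ontology/City> } LIMIT 5
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["city"]["value"])
```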
Starting from the GS tables, the following command generates a) the set of tables to annotate, and b) the ground truth file.
```bash
python tough_tables.py to_cea --input_folder ./gs \
                              --output_tables_folder ./2T/tables \
                              --output_gs_folder ./2T/gt \
                              --output_target_folder ./2T/targets \
                              --endpoint http://dbpedia.org/sparql \
                              --sameas_file dbp_sameas.json
```

The `resources/dbp_sameas.json` file contains the collection of all the sameAs links used to build 2T.
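A minimal sketch of inspecting this collection, assuming the JSON maps each DBpedia URI to a list of equivalent URIs (the actual structure may differ):

```python
import json

# Load the sameAs collection; the dict-of-lists layout is an assumption.
with open("resources/dbp_sameas.json") as f:
    sameas = json.load(f)

uri = "http://dbpedia.org/resource/Rome"  # hypothetical key
print(sameas.get(uri, []))
```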
It is possible to derive the CTA ground truth from the CEA ground truth using a majority-voting strategy (sketched below, after the list of required sources).
```bash
python tough_tables.py cta_from_cea --cea_gs_file ./2T/gt/CEA_2T_gt.csv \
                                    --output_gs_folder ./2T/gt \
                                    --output_target_folder ./2T/targets \
                                    --instance_types_file ./instance_types_en.ttl \
                                    --ontology_file ./dbpedia_2016-10.nt
```

The command requires two external sources:
- the `instance_types_en` file, containing the list of all the DBpedia instances and their types (.ttl)
- the DBpedia ontology (.nt)
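The following sketch illustrates the idea behind majority voting (not the repository's actual implementation): collect the types of the entities linked in a column and elect the most frequent one as the column type.

```python
from collections import Counter

def majority_type(entity_types):
    """entity_types: list of type URIs, one per annotated cell in a column."""
    counts = Counter(t for t in entity_types if t)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical column: three cells, two typed as City, one as Settlement.
column_types = [
    "http://dbpedia.org/ontology/City",
    "http://dbpedia.org/ontology/City",
    "http://dbpedia.org/ontology/Settlement",
]
print(majority_type(column_types))  # http://dbpedia.org/ontology/City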
To score an algorithm, run:
```bash
python tough_tables.py score_cea --annotations_file <your_annotation_file.csv> \
                                 --gs_file ./2T_cea/2T_gt.csv
```

The annotations file format must be the same as the one used in the SemTab 2019 challenge (tab_id, col_id, row_id, annotation).
Along with the overall result (ALL), all the performance metrics are computed for each category of tables.
A radar plot (<your_annotation_file>.pdf) is saved in the submission file directory.
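For reference, a sketch of the SemTab-style metrics behind these scores: precision over submitted annotations, recall over target cells, and F1 as their harmonic mean. The scorer in this repository may differ in details.

```python
# Illustrative computation of SemTab-style metrics (not the repository's scorer).
def score(correct, submitted, targets):
    precision = correct / submitted if submitted else 0.0
    recall = correct / targets if targets else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g., 80 correct annotations out of 90 submitted, 100 target cells.
print(score(correct=80, submitted=90, targets=100))
```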
Other utility commands are available in the script. See the full list by executing:
```bash
python tough_tables.py --help
```

The 2T dataset has been converted into its corresponding Wikidata version, and it has been adopted as part of the SemTab 2020 challenge (Round 4).
NOTE: the new format for CEA is <tab_id, row_id, col_id, entity>. Check out the SemTab 2020 website for more details.
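A minimal sketch of the column reordering implied by the new format, using pandas on hypothetical file names:

```python
import pandas as pd

# Reorder a SemTab 2019-style CEA file (tab_id, col_id, row_id, entity)
# into the SemTab 2020 order (tab_id, row_id, col_id, entity).
df = pd.read_csv("CEA_2T_gt.csv", header=None,
                 names=["tab_id", "col_id", "row_id", "entity"])
df[["tab_id", "row_id", "col_id", "entity"]].to_csv(
    "CEA_2T_gt_2020.csv", header=False, index=False)
```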
The conversion script `to_wikidata.py` requires the following files to be downloaded and put in the `resources` directory to generate a conversion map:
NOTE: commented lines (e.g., "# started 2017-07-06T12:05:32Z") must be removed from the above files.
A pre-computed conversion map is available under the `resources` directory (`db_wd_conversion_map.pickle`).
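A minimal sketch of loading it, assuming the pickle is a plain dict from DBpedia URIs to Wikidata URIs (the structure is an assumption here):

```python
import pickle

# Load the pre-computed DBpedia-to-Wikidata map; the dict layout is assumed.
with open("resources/db_wd_conversion_map.pickle", "rb") as f:
    conversion_map = pickle.load(f)

print(conversion_map.get("http://dbpedia.org/resource/Rome"))  # hypothetical key
```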
Along with the packages listed in `requirements.txt`, this repository uses the tabular-data-semantics-py package to query SPARQL endpoints. We slightly adapted the package to meet our needs; the adapted version is available under the `tabular_semantics` directory.
In previous versions, we used the py-sparql-transformer package for querying the DBpedia SPARQL endpoint.