=Paper=
{{Paper
|id=Vol-2775/paper2
|storemode=property
|title=bbw: Matching CSV to Wikidata via Meta-lookup
|pdfUrl=https://ceur-ws.org/Vol-2775/paper2.pdf
|volume=Vol-2775
|authors=Renat Shigapov,Philipp Zumstein,Jan Kamlah,Lars Oberländer,Jörg Mechnich,Irene Schumm
|dblpUrl=https://dblp.org/rec/conf/semweb/ShigapovZKOMS20
}}
==bbw: Matching CSV to Wikidata via Meta-lookup==
Renat Shigapov [0000-0002-0331-2558], Philipp Zumstein [0000-0002-6485-9434], Jan Kamlah [0000-0002-0417-7562], Lars Oberländer [0000-0001-7102-5369], Jörg Mechnich [0000-0002-6406-4906], and Irene Schumm [0000-0002-0167-3683]

Mannheim University Library, University of Mannheim, Schloss Schneckenhof, 68159 Mannheim, Germany
{firstname}.{lastname}@bib.uni-mannheim.de

Abstract. We present our publicly available semantic annotator bbw (boosted by wiki), tested at the second Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab2020). It annotates a raw CSV table using the entities, types and properties in Wikidata. Our key ideas are meta-lookup over the SearX metasearch API and contextual matching with at least two features. By avoiding the use of dump files, we kept the storage requirements low, used only up-to-date values in Wikidata and ranked third in the challenge.

Keywords: Semantic Annotation · Knowledge Graph · Tabular Data · Meta-lookup · Contextual Matching · Metasearch · Wikidata · SearX

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Motivation. We are developing a domain-specific ontology together with a knowledge graph and reconciliation services in the project "Business and Economics Research Data Center Baden-Württemberg". We joined SemTab2020 in order to design, implement and test our own semantic annotator.

Challenge. SemTab2020 was organized within the 19th International Semantic Web Conference and the 15th International Workshop on Ontology Matching. It consisted of four rounds; the AICrowd platform was used for the first three rounds, whereas an offline evaluation was performed for the fourth round. Many tables in the form of CSV files were given as input. These raw tables have no metadata (e.g. table name or column names) and suffer from misprints, mistakes, encoding problems and data changes. The goal was to annotate the column properties (CPA), column types (CTA) and cell entities (CEA) using the properties, types (classes) and entities in the Wikidata knowledge graph, see Fig. 1.

Fig. 1: Tabular data (black) is annotated with the properties (blue), entities (magenta), and types (green) from Wikidata, as asked in the CPA, CEA, and CTA tasks respectively.

Our contribution. We designed bbw around two ideas: meta-lookup and contextual matching. bbw relies on the Wikidata SPARQL endpoint and a locally deployed SearX (https://github.com/searx/searx) metasearch API, without using the Wikidata dump files. Relying on web APIs relieves us from storing large data sets and processing them locally. Moreover, the queries always run on the most recent data, without the need to develop any update mechanism.

Related work. According to the results of SemTab 2019 [4], the keys to high performance (e.g. MTab (https://github.com/phucty/MTab) [6] and CSV2KG (https://github.com/IBCNServices/CSV2KG) [10]) are the use of 1) multiple lookup services and 2) elaborate lexical matching techniques. To speed up lookups, an ElasticSearch index of the labels and aliases in a knowledge graph is constructed in the Tabularisi system [12], ADOG (https://github.com/danielapoliveira/iswc-annotation-challenge) [7] and DAGOBAH [1]. The importance of contextual matching is shown by MantisTable (https://bitbucket.org/disco_unimib/mantistable-tool-3) [2]. The search engine LOD4ALL (http://lod4all.net/frontend/index/applicationtop) is used in [5].

Structure. The rest of this work is organized as follows: we describe how bbw works, show and discuss our results at SemTab2020, and draw conclusions.
2 Implementation

The annotator bbw is implemented in Python 3 and is distributed as an open-source library under the MIT License (https://pypi.org/project/bbw). We added a command-line interface (CLI) using the argparse library (https://docs.python.org/3/library/argparse.html). A simple graphical user interface (GUI) is built with Streamlit (https://github.com/streamlit/streamlit). All scripts are available at the project page (http://github.com/UB-Mannheim/bbw). The bbw CLI is parallelized using GNU parallel [11] on subsets of the input data. We also implemented a few scripts for merging partial solutions as well as for outputting aggregated information used for consistency checks.

2.1 Key concepts: a CSV file and the data model of Wikidata

Wikidata as a knowledge graph is most easily thought of as a set of simple statements, i.e. triples containing a subject, a property, and an object or literal. The subject and object are Wikidata items identified by a QID (Wikidata Q identifier), and the property is identified by a PID (Wikidata property identifier); both can be prefixed accordingly to obtain a URI. The data types for literals in Wikidata include strings, quantities (i.e. numbers), times and coordinates. The actual data model of Wikidata (https://www.mediawiki.org/wiki/Wikibase/DataModel) is more complex, but we do not focus on it.

We assume that a row in a CSV file contains data from an item in Wikidata. We refer to that item as the subject or subject item. The first cell in a row usually contains the label (rdfs:label) or an alias (skos:altLabel) of the subject. The other (tail) cells in a row correspond to the values of properties of that subject. Such a value can either be a literal (e.g. string, number, date) or the label (or an alias) of the object item. Three kinds of annotation tasks can be defined for each row:

Entity annotation. If a cell contains a label or alias of an item in Wikidata, the cell can be annotated via the QID of this item.

Property annotation. If the values of two cells in a row are related via a property in Wikidata, the relation can be identified by its PID.

Type annotation. If a cell contains a label or alias of an item in Wikidata and the item has a statement with the property P31 (instance of), the cell's type can be annotated via the object's QID.

These annotations may not exist, are not necessarily unique and may be false positives due to mistakes in the values. The CPA and CTA tasks of the challenge are then an aggregation over all rows, such that hopefully all ambiguity or uncertainty vanishes.

Example. In our annotated example from Fig. 1 we see in the first column two subjects, Q2119 (Mannheim) and Q23436 (Edinburgh). The second and fourth columns contain the objects annotated with their QIDs, and the third column contains literals of type number. Between the first column with the subjects and the other columns the property relations are annotated with P206, P2044 and P131. The type annotation for Baden-Württemberg is Q1221156 (state of Germany) and the type annotation for the City of Edinburgh is Q15060255 (Scottish council area), but in the end the CTA for the whole column results in Q12076836 (administrative territorial entity of a single country). Both items Q1221156 and Q15060255 are subclasses of Q12076836.
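To make the three tasks concrete, the minimal sketch below encodes the annotations of the Fig. 1 example as plain Python dictionaries. The QIDs and PIDs are the ones discussed above; the (row, column) coordinates and the dictionary layout are purely illustrative assumptions, not bbw's internal representation or the challenge's submission format.

```python
# Hypothetical encoding of the annotated example from Fig. 1.

# CEA: cell -> entity. Column 0 holds the subject labels.
cea = {
    (1, 0): "Q2119",   # "Monnhem" resolved to Mannheim
    (2, 0): "Q23436",  # "dinbur" resolved to Edinburgh
}

# CPA: (subject column, object column) -> property.
cpa = {
    (0, 1): "P206",   # located in or next to body of water
    (0, 2): "P2044",  # elevation above sea level (numeric literals)
    (0, 3): "P131",   # located in the administrative territorial entity
}

# CTA: per-cell types of the fourth column and the aggregated column type.
cell_types_col3 = ["Q1221156", "Q15060255"]  # state of Germany, Scottish council area
cta = {3: "Q12076836"}  # common superclass: administrative territorial entity of a single country
```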
2.2 Key ideas: meta-lookup and contextual matching

The data in some input tables have been changed by the challenge organizers. For example, the labels can contain typos or other artefacts: "Monnhem" instead of "Mannheim" or "dinbur" instead of "Edinburgh". The values can differ from the ones currently stored in Wikidata, e.g. 9.78 instead of 9, or 1492-09-11 instead of 1492-10-12. Simple lookup and matching mechanisms fail when the data contain such errors. We perform a meta-lookup first and, if needed, several contextual matching steps. Before going into details, we explain these two key concepts here.

Meta-lookup. Meta-lookup is a lookup over many search and metasearch engines. SearX can search over more than 80 engines. The use of some of them is constrained technically or legally, see further information in Subsection 2.4. Thus, we use only a subset of the engines: Wikidata, Wikipedia, Wikibooks, Wikiquote, Wikisource, Wiktionary, Wikiversity, Wikivoyage, Startpage, Bing, DuckDuckGo, DuckDuckGo Definitions, eTools, Erowid, Etymonline, MyMemory, Qwant, Mojeek, Naver and DictZone. We run a Docker container locally with the SearX Docker image (https://hub.docker.com/r/searx/searx).

bbw sends a GET request to the SearX API. SearX forwards the request to the search and metasearch engines, collects the responses, ranks them and returns the search results, suggestions, corrections and infoboxes as JSON data. bbw collects those results and takes the best matches using the edit distance from the difflib library (https://docs.python.org/3/library/difflib.html). To add more candidates, bbw takes the best matches among the suggestions and corrections, sends them to the SearX API and collects the infoboxes. We use a few of these candidates as input for matching. For example, SearX resolves "Monnhem" and "dinbur" to the correct spellings "Mannheim" and "Edinburgh".

Contextual matching. In contrast to simple matching, contextual matching is based on at least two features. For example, these can be 1) the label of a subject item and the label of an object item (or a literal), 2) the label and type of an object item, or 3) the label of an object item (or a literal) and a property. The actual matching is performed as follows: we first try exact matching. If it does not give any results, we try case-insensitive matching for strings. If there are still no results, we match strings via the edit distance. Names are matched ignoring abbreviations. Dates are matched to the nearest value if it lies within six months. Numbers are matched within a 2% range.
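A minimal sketch of both key ideas follows. It assumes a locally running SearX instance exposing its JSON API at http://localhost:8080/search; the endpoint URL, the function names and the fuzzy-matching cutoff are illustrative assumptions, not bbw's exact implementation.

```python
import difflib
from datetime import date

import requests

SEARX_URL = "http://localhost:8080/search"  # assumed local SearX instance


def meta_lookup(label, max_candidates=5):
    """Query the SearX JSON API and rank candidate spellings for a label."""
    data = requests.get(
        SEARX_URL,
        params={"q": label, "format": "json", "language": "en"},
        timeout=30,
    ).json()
    # Candidates come from result titles plus SearX's own suggestions
    # and corrections (e.g. "Monnhem" -> "Mannheim").
    candidates = [r.get("title", "") for r in data.get("results", [])]
    candidates += data.get("suggestions", []) + data.get("corrections", [])
    ranked = sorted(
        {c for c in candidates if c},
        key=lambda c: difflib.SequenceMatcher(None, label.lower(), c.lower()).ratio(),
        reverse=True,
    )
    return ranked[:max_candidates]


def values_match(cell, candidate, cutoff=0.9):
    """Value matching roughly following the rules above: exact, then
    case-insensitive, then fuzzy string matching; numbers within a 2% range;
    dates within about six months. The cutoff of 0.9 is an assumption."""
    if cell == candidate:
        return True
    if isinstance(cell, str) and isinstance(candidate, str):
        if cell.lower() == candidate.lower():
            return True
        return difflib.SequenceMatcher(None, cell.lower(), candidate.lower()).ratio() >= cutoff
    if isinstance(cell, (int, float)) and isinstance(candidate, (int, float)):
        return abs(cell - candidate) <= 0.02 * abs(candidate)
    if isinstance(cell, date) and isinstance(candidate, date):
        return abs((cell - candidate).days) <= 183
    return False


# Example usage:
# print(meta_lookup("Monnhem"))    # e.g. ['Mannheim', ...]
# print(values_match(9.78, 9))     # False: outside the 2% range
# print(values_match(100.5, 100))  # True: within 2%
```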
2.3 Workflow: 7 steps

We implemented bbw as the seven-step workflow illustrated in Fig. 2. Step 2 is performed for all rows, but some rows might not be annotated. These rows are then considered in Step 3, including all annotations we already have after Step 2. The same pattern continues up to Step 6. The postprocessing Step 7 takes the annotations from all steps as input.

Fig. 2: The seven-step workflow in bbw can be divided into a preprocessing Step 1, five contextual matching Steps 2-6, and a postprocessing Step 7. Steps 5 and 6 are only used in rounds 3 and 4, and Steps 3 and 4 only in round 3.

Step 1: Preprocessing. Encoding problems are fixed with the ftfy library [9], and regular expressions (https://docs.python.org/3/library/re.html) are used to predict the number, time, name and string data types. A string is identified as a name if it starts with an abbreviation or has an abbreviation between two words.

Step 2: Meta-lookup of the subject's label and querying the candidates. We use the first cell in the row for a meta-lookup, assuming that it is the subject's label. The resulting candidates are used to query the Wikidata SPARQL endpoint, and we get all triples for the subjects whose label is in the candidate list. We match the value in a tail cell against the object labels in the resulting triples. If the matching is successful, we take the subject's QID, the subject's type and the property. If, additionally, the object is not a literal, we also take the object's QID and the object's type. Step 2 is performed for each row. We use the most frequent property per column as the input property for Step 3.

Step 3: Querying the objects' labels with their properties. An annotated row is a row for which all property annotations have been found; otherwise we say that the row is still unannotated. Step 3 is executed only for unannotated rows. We query the endpoint with the values of the tail cells and their properties inferred from Step 2. We match the value in the first cell to the subject's label using the edit distance. Then, we match the resulting triples with the values in the tail cells as in Step 2.

Step 4: Querying the objects' labels of entity columns. We call a column an entity column if it has at least one type annotation inferred from previous steps. In contrast to Step 3, we do not use properties in Step 4. We query the endpoint with the values of the tail cells from all entity columns. This yields the subject's labels and the types of the subject and the objects. The cell values are matched as in Step 2.

Step 5: Querying the objects' labels with their types. As the type for this step we use the most frequent type per column inferred from previous steps. For every entity column the endpoint is queried using the value of a tail cell and its type. We take the object's QID from the received results.

Step 6: Querying the subject's types. For unannotated rows the endpoint is asked for the labels of all items with a certain type. We do this for the two most frequent type annotations of the first column. This list of labels is matched with the subject's label via the edit distance. We send another query to the endpoint and receive all triples whose subject label equals one of the labels from the candidate items. The values in the tail columns are matched as in Step 2.

Step 7: Postprocessing. After Steps 1-6 we might have several candidates for each task. We choose the most frequent property and entity for the CPA and CEA tasks, respectively. Postprocessing for the CTA task is different. If there is a unique most frequent type, we choose it. If there are multiple most frequent types, we find the first common class w.r.t. the subclass relation using a special query (https://w.wiki/h8X, proposed by the user Pasleim) to the endpoint with the SSSP class (https://blazegraph.com/database/apidocs/com/bigdata/rdf/graph/analytics/SSSP.html) in the com.bigdata.rdf.graph.analytics package (https://blazegraph.com/database/apidocs/com/bigdata/rdf/graph/analytics/package-summary.html). However, if the query returns the generic class (Q35120), we choose the most frequent type. As the final postprocessing step we keep only those annotations which are mentioned in the target files.
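To give a flavour of the endpoint queries used in Steps 2-6, the sketch below retrieves candidate statements for items whose label or alias matches a candidate string, using plain HTTP requests against the public Wikidata SPARQL endpoint. It is a simplified stand-in under stated assumptions, not bbw's actual query, which gathers more context (object labels and types) in a single request.

```python
import requests

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "bbw-example/0.1 (illustrative)"}  # assumed user agent


def triples_for_label(label, lang="en", limit=500):
    """Fetch candidate (item, type, property, value) rows for one label."""
    query = f"""
    SELECT ?item ?itemType ?p ?value WHERE {{
      ?item rdfs:label|skos:altLabel "{label}"@{lang} .
      OPTIONAL {{ ?item wdt:P31 ?itemType . }}
      ?item ?p ?value .
    }} LIMIT {limit}
    """
    response = requests.get(WDQS, params={"query": query, "format": "json"},
                            headers=HEADERS, timeout=65)
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [
        (
            b["item"]["value"],
            b.get("itemType", {}).get("value"),
            b["p"]["value"],
            b["value"]["value"],
        )
        for b in bindings
    ]


# Example: statements for one of the labels resolved by the meta-lookup.
# rows = triples_for_label("Mannheim")
```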
2.4 Limitations

The annotator bbw has both technical and legal constraints related to the use of multiple API services.

Technical constraints. The current limits of the Wikidata SPARQL endpoint are five parallel queries per IP, thirty error queries per minute and one minute of runtime per single query (https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits). For example, in round 3 some of our requests were blocked by the endpoint with the error code 429 (Too Many Requests) due to the use of six parallel processes. We reduced the number of CPUs in round 3 after the deadline and added an extra request after waiting the number of seconds returned by the "Retry-After" header.

SearX can throw an error message due to its own technical reasons or due to the constraints of a search engine, for example when an engine asks to resolve a captcha. We experienced this during our initial tests with Google Search and therefore did not use Google Search for our submissions. We found a bug in SearX in the second round and reported it (https://github.com/searx/searx/issues/2188). It was fixed shortly afterwards and we used the improved code in the following rounds.
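A minimal sketch of honouring a 429 response via the "Retry-After" header is shown below; the function name, retry count and timeout are illustrative assumptions rather than bbw's actual request handling.

```python
import time

import requests

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {
    "Accept": "application/sparql-results+json",
    "User-Agent": "bbw-example/0.1 (illustrative)",  # assumed user agent
}


def query_with_retry(sparql_query, max_retries=3):
    """Send a SPARQL query and wait as instructed on HTTP 429 responses."""
    for attempt in range(max_retries + 1):
        response = requests.get(WDQS, params={"query": sparql_query},
                                headers=HEADERS, timeout=65)
        if response.status_code == 429 and attempt < max_retries:
            # The endpoint tells us how many seconds to back off.
            wait_seconds = int(response.headers.get("Retry-After", "60"))
            time.sleep(wait_seconds)
            continue
        response.raise_for_status()
        return response.json()


# Example usage:
# results = query_with_retry('SELECT ?label WHERE { wd:Q2119 rdfs:label ?label } LIMIT 5')
```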
Legal constraints. Possible legal restrictions on our usage of the search engines may arise from copyrights and sui generis database rights (Art. 7 Directive 96/9 EC and its national implementations, e.g. §§ 4, 87a ff. UrhG). According to the case law of the EUGH (EUGH C-203/02 The British Horseracing Board Rn. 55; EUGH C-304/07 Directmedia Publishing Rn. 53), these rules do not restrict our usage of search engines, because "if the database maker makes the contents of that database accessible to third parties, [...] his sui generis right does not enable him to prevent such third parties from consulting that database for information purposes" (EUGH C-202/12 Innoweb/Wegener Rn. 46). As described above, we "consult [...] the database" (EUGH C-202/12 Innoweb/Wegener Rn. 47) for information purposes rather than building "a parasitical competing product" (EUGH C-202/12 Innoweb/Wegener Rn. 48; ErwG. 42, Directive 96/9 EC), and therefore we do not harm the investment of the database makers. However, the terms of use of the individual search engines may still restrict our usage.

The terms of use of the search engines were examined for restrictions concerning metasearch. Some search engines do not have written terms of use at all (for example DuckDuckGo, eTools, Mojeek and Startpage), so no contractual restrictions affecting their use via metasearch exist; others have terms of use without restrictions on metasearching (for example Bing, Erowid, Etymonline, MyMemory, Qwant, Wikibooks, Wikiquote, Wikisource, Wikivoyage, Wiktionary, Wikiversity, Wikipedia and Wikidata). Google Search is a special case: whereas its former terms explicitly prohibited "metasearch[ing]", its current terms merely demand that the services not be used "abusively". This change can be understood as a waiver of restricting the use of the service for metasearch. Unfortunately, the terms of use of some search engines (for example Yandex and Yahoo) include formulations that can be understood as a refusal of metasearching without permission. However, within the territorial scope of German contract law, even the most extensive of these restrictions could be considered invalid according to § 307 I, II Nr. 1 BGB, as they violate § 307 II in connection with the basic principle of privileging scientific research purposes stated in § 60d UrhG, or are at least unenforceable for privileged scientific use cases according to § 60g I UrhG.

3 Results

Table 1 shows our scores and ranks in the leaderboard of the SemTab2020 challenge. We did not participate in round 1 or in round 2 before the first deadline; we started to implement bbw on the day of the first deadline in round 2. One CPU was used in round 2, six CPUs were used in round 3 before the deadline, and five CPUs were used in round 3 after the deadline and in round 4.

Round              |  CPA: F1 / P / R   |  CTA: AF1 / AP / R  |  CEA: F1 / P / R
1                  |  - / - / -         |  - / - / -          |  - / - / -
2 (<1st deadline)  |  - / - / -         |  - / - / -          |  - / - / -
2 (>1st deadline)  |  0.991 / 0.992 / 4 |  0.914 / 0.929 / 7  |  0.892 / 0.960 / 10
2 (>2nd deadline)  |  0.992 / 0.994 / 3 |  0.964 / 0.978 / 6  |  0.944 / 0.981 / 7
3 (deadline)       |  0.989 / 0.994 / 3 |  0.971 / 0.974 / 4  |  0.976 / 0.985 / 4
4 (AG)             |  0.995 / 0.996 / 2 |  0.980 / 0.980 / 2  |  0.978 / 0.984 / 4
4 (2T)             |  - / - / -         |  0.516 / 0.789 / 6  |  0.863 / 0.927 / 2
4 (avg)            |  0.995 / 0.996 / 2 |  0.762 / 0.884 / 5  |  0.920 / 0.955 / 2

Table 1: Our scores (F1 is the F1-score, P is precision, AF1 is the approximate F1-score and AP is the approximate precision) and rank (R) in the leaderboard of SemTab2020. AG – automatically generated dataset, 2T – tough tables, avg – average over both datasets.

We applied only Steps 1, 2 and 7 of the workflow in round 2, and this already gave us a very high score in the CPA task. Our low CTA score before the second deadline in round 2 has two reasons: 1) both the target and out-of-target annotations were submitted, and 2) a bug in the evaluator at SemTab2020 counted the out-of-target annotations. We interpret our low CEA score as a consequence of using only Step 2 for contextual matching.

Our CPA score dropped in round 3 compared to round 2 due to a bug in bbw leading to incorrect counts of the most frequent property. After fixing it, our CPA scores were comparable to those in round 2. Steps 3-6 of the workflow increased our CTA and CEA scores in round 3. We continued our tests after the deadline and found that disabling Steps 3 and 4 increased our scores further.

Only Steps 2, 5 and 6 were used for contextual matching in round 4. The dataset contained two parts: AG – automatically generated tables and 2T – tough tables [3]. The tough tables were easily separated, because they are not mentioned in the CPA target files and are usually larger than the AG tables. We applied our reduced five-step workflow to the AG tables, obtained our highest scores in all tasks and ranked second in the CPA and CTA tasks. The same five-step workflow was applied to the roughly three quarters of the 2T dataset consisting of relatively small tables; of those, about 70% of the tables were annotated. This resulted in very low CTA scores on the 2T dataset. We decided to improve our CEA scores and to leave the CTA submission as it was: we performed meta-lookup and non-contextual matching with the Wikidata API in order to annotate the 800 most frequent unannotated labels. This improved our CEA score and we ranked second on the 2T dataset. Due to time pressure we did not add the corresponding types to our 2T-CTA submission.
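The separation of the AG and 2T tables described above can be sketched as follows. The file layout, the helper name and the assumption that the table identifier appears in the first column of the CPA target file are illustrative, not bbw's exact code.

```python
import csv
import os


def split_ag_and_2t(table_dir, cpa_target_csv):
    """Split input tables into AG and 2T parts, assuming that 2T tables
    do not appear in the CPA target file (as described in the text)."""
    with open(cpa_target_csv, newline="") as f:
        cpa_tables = {row[0] for row in csv.reader(f) if row}  # table id assumed in first column
    ag, tough = [], []
    for name in os.listdir(table_dir):
        table_id = os.path.splitext(name)[0]
        (ag if table_id in cpa_tables else tough).append(name)
    return ag, tough


# Example usage (hypothetical paths):
# ag_tables, tough_tables = split_ag_and_2t("tables/", "targets/cpa_targets.csv")
```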
4 Conclusions

We presented the semantic annotator bbw for matching tabular data to the Wikidata knowledge graph. The annotator is based on 1) meta-lookup over many search and metasearch engines, and 2) contextual matching with at least two features. We developed bbw continuously between rounds 2 and 4. Starting from low CEA and CTA scores in round 2, we improved the workflow and ranked second in round 4. Thus, meta-lookup is a competitive approach compared to processing the Wikidata dump files. In contrast to the latter, meta-lookup keeps the storage requirements low and uses only up-to-date data in the knowledge graph. Further improvements to bbw can be made by using more features [8], fine-tuning the hyperparameters and optimizing the SPARQL queries (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization).

The annotator bbw is distributed as an open-source Python library with MIT license: https://pypi.org/project/bbw. It can be installed easily via "pip install bbw". A command-line interface, a simple graphical user interface and a script for parallel computations are also provided. Everyone is welcome to adapt bbw at http://github.com/UB-Mannheim/bbw.

Acknowledgments

This work was funded by the Ministry of Science, Research and Arts of Baden-Württemberg through the project "Business and Economics Research Data Center Baden-Württemberg".

References

1. Chabot, Y., Labbé, T., Liu, J., Troncy, R.: DAGOBAH: An end-to-end context-free tabular data semantic annotation system. In: SemTab@ISWC 2019, pp. 41–48 (2019), http://ceur-ws.org/Vol-2553/paper6.pdf
2. Cremaschi, M., Avogadro, R., Chieregato, D.: MantisTable: An automatic approach for the semantic table interpretation. In: SemTab@ISWC 2019, pp. 15–24 (2019), http://ceur-ws.org/Vol-2553/paper3.pdf
3. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables: Carefully evaluating entity linking for tabular data. In: The Semantic Web – ISWC 2020, pp. 328–343 (2020), https://doi.org/10.1007/978-3-030-62466-8_21
4. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In: The Semantic Web – ESWC 2020, pp. 514–530 (2020), https://doi.org/10.1007/978-3-030-49461-2_30
5. Morikawa, H.: Semantic table interpretation using LOD4ALL. In: SemTab@ISWC 2019, pp. 49–56 (2019), http://ceur-ws.org/Vol-2553/paper7.pdf
6. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab: Matching tabular data to knowledge graph with probability models. In: SemTab@ISWC 2019, pp. 191–192 (2019), http://ceur-ws.org/Vol-2553/paper2.pdf
7. Oliveira, D., d'Aquin, M.: ADOG - Annotating data with ontologies and graphs. In: SemTab@ISWC 2019, pp. 1–6 (2019), http://ceur-ws.org/Vol-2553/paper1.pdf
8. Ritze, D., Bizer, C.: Matching Web tables to DBpedia - A feature utility study. In: 20th International Conference on Extending Database Technology, pp. 210–221 (2017), https://doi.org/10.5441/002/edbt.2017.20
9. Speer, R.: ftfy. Zenodo (2019), https://doi.org/10.5281/zenodo.2591652, v. 5.5
10. Steenwinckel, B., Vandewiele, G., Turck, F.D., Ongenae, F.: CVS2KG: Transforming tabular data into semantic knowledge. In: SemTab@ISWC 2019, pp. 33–40 (2019), http://ceur-ws.org/Vol-2553/paper5.pdf
11. Tange, O.: GNU parallel 20200722 ('Privacy Shield') (July 2020), https://doi.org/10.5281/zenodo.3956817
12. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.T., Singh, A., Qasemi, E., Szekely, P.A., Pujara, J.: Entity linking to knowledge graphs to infer column types and properties. In: SemTab@ISWC 2019, pp. 25–32 (2019), http://ceur-ws.org/Vol-2553/paper4.pdf