Towards an Approach based on Knowledge Graph Refinement for Tabular Data to Knowledge Graph Matching

Azanzi Jiomekong (1, *), Brice Foko (1)

(1) Department of Computer Science, University of Yaounde I, Yaounde, Cameroon

Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2022)
* Corresponding author: fidel.jiomekong@facsciences-uy1.cm (A. Jiomekong); fokobrice3@gmail.com (B. Foko). Both authors contributed equally.
Homepage: https://sites.google.com/facsciences-uy1.cm/azanzijiomekong (A. Jiomekong). ORCID: 0000-0002-0877-7063 (A. Jiomekong).

Abstract

This paper presents our contribution to the Accuracy Track of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). We propose an approach based on knowledge graph (KG) refinement for tabular data annotation: internal methods are used to predict the links between cells in the table, and external methods are used to predict missing entities and relations. The approach was applied to the annotation of HardTables and ToughTables using DBpedia and Wikidata, and of GitTables and BiodivTab using DBpedia and Schema.org. During Round 3 of the competition, we ranked third for the annotation of GitTables and second for the annotation of BiodivTab.

Keywords: Tabular Data, Knowledge Graph, Wikidata, DBpedia, Schema.org, Tabular Data to Knowledge Graph Matching, SemTab

1. Introduction

Adding semantic information to tabular data can enhance a wide range of applications such as Web search, question answering, and knowledge graph construction and refinement. For instance, adding semantic information to a food composition table can allow us to determine which ingredient can substitute another in the case of an allergy. More generally, gaining a semantic understanding of food composition tables [1] can improve food data analysis and facilitate food data integration. However, constructing and assigning semantic tags to tabular datasets is often difficult because of incomplete data, erroneous data, incomplete metadata, and ambiguous data and metadata. The SemTab challenge (https://sem-tab-challenge.github.io/2022/) addresses this problem.

To solve the problem of tabular data to knowledge graph matching, we propose an approach based on KG refinement. To increase the utility of a graph, KG completion aims to complete it with missing knowledge such as missing entities, missing entity types, and missing relations between entities. Error detection, on the other hand, aims at identifying errors in the KG; these errors can concern type assertions, relations between individuals, literal values, and interlinks between KGs. To refine a KG, internal methods use knowledge already present in the graph, while external methods use knowledge coming from external sources such as text corpora or existing knowledge graphs [2].

In this research, each tabular file is represented as a graph in which each cell is a node, labeled by the content of the cell, that can be linked to other cells or to a column title, as sketched below. Our aim is to correct misspellings in cell labels, link each cell to its corresponding annotation in the KG, and determine the type of a set of cells as well as the relations between cells.
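To make this representation concrete, the following minimal sketch (illustrative Python; names such as CellNode are not part of our tool, and the choice of linking cells of the same row is one possible reading of the description above) builds one node per cell and links the cells that share a row:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CellNode:
    """One table cell viewed as a labeled graph node (illustrative)."""
    row: int
    col: int
    label: str                          # the raw cell content
    annotation: Optional[str] = None    # KG entity assigned by the CEA task
    neighbors: List["CellNode"] = field(default_factory=list)

def table_to_graph(rows: List[List[str]]) -> List[CellNode]:
    """Create one node per cell and link the cells of the same row."""
    nodes = [CellNode(i, j, v.strip())
             for i, r in enumerate(rows) for j, v in enumerate(r)]
    for node in nodes:
        node.neighbors = [m for m in nodes
                          if m.row == node.row and m is not node]
    return nodes

# Example: a two-column table (person, country)
graph = table_to_graph([["Maurice Kamto", "Cameroon"],
                        ["Paul Biya", "Cameroon"]])
```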
The rest of the paper is structured as follows: Section 2 presents the methodology of our research, Section 3 presents the results, and Section 4 concludes.

2. Research methodology

Taking advantage of our experience in empirical research in software engineering [3] and ontology learning [4], we designed the following research methodology. Section 2.1 presents the research question, Section 2.2 the empirical research methods used, and Section 2.3 the pipeline defined after Round 1, refined during Round 2, and used during Round 3 of the SemTab challenge.

2.1. Research question

To solve the tabular data to knowledge graph matching problem, one must answer the following research question: "How can tabular data be annotated using a knowledge graph?". Answering it requires a system that takes as input a tabular dataset and a knowledge graph, and produces as output the dataset annotated with entities and properties extracted from this KG. To this end, the following questions must be answered (see Fig. 1):

• Which entity from the KG should be used to annotate a cell in the tabular data? This is the Cell Entity Annotation (CEA) task.
• What is the most fine-grained semantic type that should be assigned to a tabular data column? This is the Column Type Annotation (CTA) task.
• Which property from the KG should be used to link two columns that are related in the tabular data? This is the Column Property Annotation (CPA) task.

Figure 1: Tabular data to knowledge graph matching

2.2. Empirical methods

The research methodology combines three empirical research methods from software engineering [5]: case study research, action research, and experimental research.

To study the tabular data to knowledge graph matching problem, the SemTab organizers provided us with seven case studies: (1) annotation of HardTables [6] using Wikidata, (2) annotation of HardTables using DBpedia, (3) annotation of ToughTables [7, 8] using Wikidata, (4) annotation of ToughTables using DBpedia, (5) annotation of BiodivTab [9, 10] using DBpedia, (6) annotation of GitTables [11, 12] using DBpedia, and (7) annotation of GitTables using Schema.org. The aim of studying these cases is to gain a deeper understanding of the tabular data to knowledge graph matching problem, so that the proposed solution can be generalized to any setting.

Given that this was our first participation in the SemTab challenge, during Round 1 and Round 2 we applied action research: the exploration, testing, and evaluation of possible solutions, and the proposition of a reliable solution that can be used to annotate any tabular data using knowledge graphs. The proposed solution was evaluated experimentally during Round 3.

Given that the proposed solution was a software solution, we used the Scrum process [3]. Overall, the solution was built in 14 Sprints, each Sprint involving one or more iterations. The weekly Scrum meeting was used to discuss the results obtained during the week and how to improve the approach, while ad hoc meetings during the week were used to discuss specific problems. For instance, we had problems when querying DBpedia and Wikidata online; this was solved during ad hoc meetings by an algorithm that reduces the number of queries by defining links between cells, so that the information obtained for one cell can be used to annotate other cells (see the sketch below).
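A minimal sketch of this query-reduction idea (illustrative Python, reusing the CellNode graph sketched in the introduction; the lookup parameter stands in for a real SPARQL or Lookup API call, and the propagation rule shown here is a simplification of our actual algorithm):

```python
from typing import Callable, Dict, List, Optional

def annotate_with_cache(nodes: List["CellNode"],
                        lookup: Callable[[str], Optional[str]]) -> None:
    """Annotate every node while querying the KG at most once per
    distinct label: results are cached and reused along cell links."""
    cache: Dict[str, Optional[str]] = {}
    for node in nodes:
        if node.annotation is not None:
            continue
        if node.label not in cache:
            cache[node.label] = lookup(node.label)   # single remote query
        node.annotation = cache[node.label]
        # Propagate the result to linked cells carrying the same label,
        # so no additional remote query is needed for them.
        for m in node.neighbors:
            if m.annotation is None and m.label == node.label:
                m.annotation = node.annotation
```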
2.3. Pipeline

As stated in the introduction, we use the KG refinement approach to solve the tabular data to knowledge graph matching problem. The two main refinement activities are error correction and the completion of the tabular data with missing entities and relations. A deep analysis of the CEA, CTA, and CPA tasks during Round 1 led us to consider the CEA task as the core task: solving it allowed us to improve the performance on the CTA and CPA tasks. The exploration of candidate solutions for annotating tabular datasets, applied and evaluated (by the SemTab organizers), allowed us to define a pipeline that can be used for the annotation of any tabular data. This pipeline, presented in Fig. 2, consists of error correction in the tabular data (cell pre-processing) and completion of the tabular data with missing entities and relations (information retrieval, entity discovery, and link prediction).

Figure 2: Pipeline of the annotation process

2.3.1. Cell Pre-processing

In this approach, we use SPARQL queries to search for entities in the KG. Cell pre-processing therefore consists of transforming each cell into a form that makes the SPARQL query as efficient as possible. It consists of:

1. removing extra spaces at the beginning, at the end, and between words in each cell;
2. removing special characters such as #, (, ), [, ], etc.;
3. correcting Mojibake errors (errors due to UTF-8 mis-decoding).

Once processed, the dataset contains clean cells that can be used to query the KG. Figure 3 presents an example of a cell cleaned during the pre-processing phase.

Figure 3: Example of cell pre-processing
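A minimal sketch of these three cleaning steps (illustrative Python; we assume the ftfy library for Mojibake repair, although the paper does not prescribe a specific tool):

```python
import re
import ftfy  # pip install ftfy -- one possible tool for Mojibake repair

SPECIAL_CHARS = re.compile(r"[#()\[\]]")   # step 2: characters to drop

def preprocess_cell(raw: str) -> str:
    """Clean a cell so that it can be used in a SPARQL query."""
    text = ftfy.fix_text(raw)              # step 3: correct Mojibake errors
    text = SPECIAL_CHARS.sub(" ", text)    # step 2: remove special characters
    text = re.sub(r"\s+", " ", text)       # step 1: collapse extra spaces
    return text.strip()

assert preprocess_cell("  [Maurice]   Kamto ") == "Maurice Kamto"
```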
2.3.2. Completing the tabular data with missing entities and relations

This task relies on the following public endpoints: the DBpedia SPARQL endpoint (https://dbpedia.org/sparql), the Wikidata SPARQL endpoint (https://query.wikidata.org/sparql), the Wikibase API (https://en.wikipedia.org/w/api.php), and the DBpedia Lookup API (https://lookup.dbpedia.org/api/search). The main problem during this task was that only a limited number of queries can be issued against a public SPARQL endpoint: for instance, after a certain number of queries on the Wikidata endpoint, we got the error "429: Too Many Requests". To solve this problem, we exploit the fact that the cells of a table are linked to one another: for each SPARQL query, we extract a subgraph that is then processed locally to determine the CEA, CPA, and CTA annotations of the table (Fig. 4 is an illustration).

Figure 4: Searching for a semantic tag

Entity search

The entity search (CEA task) process starts with misspelling checking using a Google search service (via the googlethis package: https://www.npmjs.com/package/googlethis), which detects and corrects spelling errors. The cleaned cells are then used to build SPARQL queries. It should be noted that in many cases, a cell can have multiple candidate annotations. For instance, the cell "Solomon" can refer to a piece of music (Handel's Solomon, the album Solomon, etc.), a person (Iser Solomon, Jack Solomon, etc.), a place, or a film. To solve this problem, we define the cell context as the elements on the same row of the table, which can be used to lift the ambiguity. Figure 5 illustrates how we proceed for disambiguation, and the first sketch below gives an illustration in code.

Figure 5: Entity disambiguation

Once an entity is identified in the graph, we use a cosine similarity measure to calculate the similarity between two cells. This is the prediction of the isCloseTo relation, the goal being to assign the annotation of cell_i to cell_j if the triple (cell_i isCloseTo cell_j) holds (second sketch below).
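The following sketch illustrates this entity search and disambiguation step (illustrative Python against the public DBpedia endpoint; the scoring heuristic, which counts overlaps between the row context and the candidate's abstract, is a simplification of our actual procedure):

```python
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def candidate_entities(label: str, limit: int = 25) -> list:
    """Retrieve candidate entities whose English label matches the cell.
    The label is assumed to be pre-processed (no quotes left inside)."""
    query = f"""
    SELECT DISTINCT ?e ?abstract WHERE {{
      ?e rdfs:label "{label}"@en .
      OPTIONAL {{ ?e dbo:abstract ?abstract . FILTER(lang(?abstract) = "en") }}
    }} LIMIT {limit}"""
    resp = requests.get(DBPEDIA_SPARQL,
                        params={"query": query,
                                "format": "application/sparql-results+json"})
    return resp.json()["results"]["bindings"]

def disambiguate(label: str, context: list) -> str:
    """Pick the candidate whose abstract best overlaps the row context."""
    def score(binding):
        abstract = binding.get("abstract", {}).get("value", "").lower()
        return sum(term.lower() in abstract for term in context)
    candidates = candidate_entities(label)
    best = max(candidates, key=score, default=None)
    return best["e"]["value"] if best else None

# e.g. disambiguate("Solomon", context=["Handel", "oratorio"])
```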
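A minimal sketch of the isCloseTo prediction, using cosine similarity over character trigrams (the trigram representation and the 0.8 threshold are illustrative assumptions; the paper does not fix them):

```python
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Character trigram counts, with light padding for short labels."""
    text = f"  {text.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: str, b: str) -> float:
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def predict_is_close_to(cell_i: "CellNode", cell_j: "CellNode",
                        threshold: float = 0.8) -> bool:
    """If (cell_i isCloseTo cell_j) holds, cell_j inherits cell_i's annotation."""
    if cosine(cell_i.label, cell_j.label) >= threshold:
        cell_j.annotation = cell_i.annotation
        return True
    return False
```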
Searching for Semantic Types

To search for the semantic type of a column (CTA task), SPARQL queries are used to retrieve the types of each entity found during the CEA task. The candidate types are sorted by frequency and the three most frequent are selected. If one of these types is a parent of the others, it is validated as the CTA; otherwise, the candidate with the highest frequency is selected (see the first sketch below).

Property search

This task consists of using the entities found during the CEA task to search for all the properties that may exist between them, and then selecting the most appropriate one. Figure 6 shows an example of searching for the property linking two columns. It consists of: (1) getting the QIDs of the entities (Q1009 for "Maurice Kamto" in column 0 and Q2410772 for "Cameroon" in column 1); (2) searching for all the properties that link column 0 to column 1, i.e., retrieving all the properties of the entity Q2410772 and identifying among them those that have the entity Q1009 in their range; in our case, we found P27 (country of citizenship). This process is repeated for each row of the table and, at the end, the property with the most occurrences is selected as the CPA (see the second sketch below).

Figure 6: Knowledge graph property search example
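A minimal sketch of this CTA selection rule (illustrative Python; the is_parent test stands in for a KG query such as an rdfs:subClassOf check on DBpedia or a wdt:P279 path on Wikidata):

```python
from collections import Counter
from typing import Callable, List, Optional

def choose_column_type(cell_types: List[List[str]],
                       is_parent: Callable[[str, str], bool]) -> Optional[str]:
    """cell_types holds, for each CEA-annotated cell of the column, the
    list of its KG types. Keep the three most frequent candidates; if one
    is a parent of the others it wins, otherwise the most frequent one."""
    counts = Counter(t for types in cell_types for t in types)
    top3 = [t for t, _ in counts.most_common(3)]
    if not top3:
        return None
    for candidate in top3:
        if all(candidate == other or is_parent(candidate, other)
               for other in top3):
            return candidate
    return top3[0]  # highest frequency
```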
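The sketch below illustrates the property search against the public Wikidata endpoint (illustrative Python; the QIDs are supplied by the caller, the endpoint's predefined wd: prefix is assumed, and a User-Agent header is set as the endpoint requests):

```python
import requests
from collections import Counter
from typing import List, Tuple

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def linking_properties(subj_qid: str, obj_qid: str) -> List[str]:
    """All properties p such that (subj p obj) holds in Wikidata."""
    query = f"SELECT ?p WHERE {{ wd:{subj_qid} ?p wd:{obj_qid} . }}"
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "semtab-sketch/0.1"})
    return [b["p"]["value"] for b in resp.json()["results"]["bindings"]]

def choose_column_property(row_pairs: List[Tuple[str, str]]) -> str:
    """row_pairs: one (subject QID, object QID) pair per table row.
    The property with the most occurrences is selected as the CPA."""
    counts = Counter(p for s, o in row_pairs
                     for p in linking_properties(s, o))
    return counts.most_common(1)[0][0]
```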
The pipeline presented in this section was evaluated during Round 3, where we obtained the second position for the annotation of BiodivTab and the third position for the annotation of GitTables. Figure 7 shows an example of the execution of our program during the annotation process.

Figure 7: A screenshot of an annotation process made by the tool

3. Results

The SemTab 2022 challenge consisted of three rounds that lasted from June 13 to October 15, 2022. In this section, we present the results we obtained during these three rounds.

3.1. Round 1

Round 1 consisted of the annotation of HardTables using the Wikidata KG. The statistics on the dataset provided by the organizers are presented in Table 1.

Table 1: Description of the HardTables dataset for Round 1
  Files: 3,691 | CEA targets: 26,189 | CTA targets: 4,511 | CPA targets: 5,745

We joined the challenge at the end of the first round. Thus, during the last week of this round, we annotated 25 files manually in order to understand the tasks. The following annotations were submitted: 16 CTA out of 4,511 targets, 79 CEA out of 26,189 targets, and 14 CPA out of 5,745 targets. Evaluated by the organizers, we obtained the results presented in Fig. 8.

Figure 8: Round 1 CTA, CEA and CPA results for HardTables

3.2. Round 2

Round 2 consisted of annotating the HardTables (see Table 2) and ToughTables (see Table 3 for the Wikidata annotation and Table 4 for the DBpedia annotation) datasets.

Table 2: Description of the HardTables dataset for Round 2
  Files: 4,649 | CEA targets: 22,009 | CTA targets: 4,534 | CPA targets: 3,954

Table 3: Description of the ToughTablesWD dataset for Round 2
  Files: 144 | CEA targets: 586,118 | CTA targets: 443 | CPA targets: -

Table 4: Description of the ToughTablesDBP dataset for Round 2
  Files: 144 | CEA targets: 486,203 | CTA targets: 429 | CPA targets: -

During Round 2, we used an automatic annotation approach, which gave better results than in Round 1. The results from the SemTab organizers are presented in Fig. 9 for HardTables and Fig. 10 for ToughTables.

Figure 9: Round 2 CTA, CEA and CPA results for HardTables (EXTRA)

Figure 10: Round 2 CTA and CEA results for ToughTables

3.3. Round 3

Our main goal during Round 3 was to evaluate the approach defined during Rounds 1 and 2. Round 3 consisted of annotating the BiodivTab (see Table 5) and GitTables (see Table 6 for the DBpedia annotation and Table 7 for the Schema.org annotation) datasets.

Table 5: Description of the BiodivTab dataset for Round 3
  Files: 45 | CEA targets: 31,942 | CTA targets: 526 | CPA targets: -

Table 6: Description of the GitTablesDBP dataset for Round 3
  Files: 6,892 | CEA targets: - | CTA targets: 6,228 | CPA targets: -

Table 7: Description of the GitTablesSCH dataset for Round 3
  Files: 6,892 | CEA targets: - | CTA targets: 5,411 (4,411 properties and 1,000 classes) | CPA targets: -

The approach defined after Round 2 and presented in Section 2.3 was evaluated on the datasets of Round 3. The analysis of the results by the SemTab organizers placed us in second position for the annotation of BiodivTab and in third position for the annotation of GitTables (see Fig. 11).

Figure 11: Round 3 CTA and CEA results for BiodivTab and GitTables

4. Conclusion

This paper presented the approach we proposed for the annotation of tabular data using knowledge graphs. The approach is based on knowledge graph refinement: error correction puts the cells of the table in a form that can be used to build SPARQL queries and to disambiguate cells, while tabular data completion completes the table with missing entities and relations. To bring in more context and better address the ambiguity problems encountered during the CEA task, we are exploring language models such as BERT.

Online resource

The source code produced during this work is available on GitHub: https://github.com/jiofidelus/tsotsa/tree/SemTab_22.

Acknowledgment

We are grateful to the SemTab organizers for giving us the opportunity to share this work with the community. We are also grateful to Vinsight and neuralearn.ai for the training support.

References

[1] A. Jiomekong, Comparison of food composition tables/databases, 2022. URL: https://orkg.org/comparison/R206121/.
[2] P. Cimiano, H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8 (2017) 489–508. URL: https://doi.org/10.3233/SW-160218. doi:10.3233/SW-160218.
[3] A. Jiomekong, H. Tapamo, G. Camara, Combining Scrum and Model Driven Architecture for the development of the EPICAM platform, in: CARI 2022, Yaounde, Cameroon, 2022. URL: https://hal.archives-ouvertes.fr/hal-03712484.
[4] A. Jiomekong, G. Camara, M. Tchuente, Extracting ontological knowledge from Java source code using hidden Markov models, Open Computer Science 9 (2019) 181–199.
[5] ACM SIGSOFT, Empirical standards, 2020. URL: https://github.com/acmsigsoft/EmpiricalStandards.
[6] O. Hassanzadeh, V. Efthymiou, J. Chen, E. Jiménez-Ruiz, K. Srinivas, SemTab 2021: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets, 2021. URL: https://doi.org/10.5281/zenodo.6154708. doi:10.5281/zenodo.6154708.
[7] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully Evaluating Entity Linking for Tabular Data, 2020. URL: https://doi.org/10.5281/zenodo.4246370. doi:10.5281/zenodo.4246370.
[8] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough tables: Carefully evaluating entity linking for tabular data, in: J. Z. Pan, V. Tamma, C. d'Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, L. Kagal (Eds.), The Semantic Web – ISWC 2020, Springer International Publishing, Cham, 2020, pp. 328–343.
[9] N. Abdelmageed, S. Schindler, B. König-Ries, BiodivTab: A tabular benchmark based on biodiversity research data, in: SemTab@ISWC, 2021.
[10] N. Abdelmageed, S. Schindler, B. König-Ries, fusion-jena/BiodivTab, 2021. URL: https://doi.org/10.5281/zenodo.5584180. doi:10.5281/zenodo.5584180.
[11] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables: A large-scale corpus of relational tables, arXiv preprint arXiv:2106.07258 (2021).
[12] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables benchmark: column type detection, 2021. URL: https://doi.org/10.5281/zenodo.5706316. doi:10.5281/zenodo.5706316.