SemInt at SemTab 2022

Abhisek Sharma1,∗,†, Sumit Dalal1,† and Sarika Jain1,†
1 National Institute of Technology Kurukshetra, India

Abstract
In this paper we present SemInt, our system for the SemTab 2022 challenge of ISWC 2022. This is SemInt's first participation in the challenge. The challenge is about annotating tabular data with publicly available knowledge graphs (such as Wikidata and DBpedia). We propose a model named SemInt that runs iterative SPARQL queries over the Wikidata/DBpedia SPARQL endpoints for each term in a given table. To handle malformed or differing representations of terms or entities in the table, SemInt queries the Wikidata or DBpedia APIs and finds suitable matches for them. It also employs a search engine to address typos in the terms. This year SemInt participated in the CTA task and obtained encouraging results, with a Precision and F-measure of 0.794. We plan to extend it to CEA and CPA as well.

Keywords
Entity annotation, Table interpretation, Knowledge graph, SemInt, SemTab

∗ Corresponding author.
† These authors contributed equally.
abhisek_61900048@nitkkr.ac.in (A. Sharma); sumitdalal9050@gmail.com (S. Dalal); jasarika@nitkkr.ac.in (S. Jain)
ORCID: 0000-0003-1568-2625 (A. Sharma); 0000-0002-8736-2148 (S. Dalal); 0000-0002-7432-8506 (S. Jain)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Web pages contain information of many kinds, and much of this information is present in tables. Tables hold relational data across various fields and are sources of high-quality data with less noise than unstructured text, which makes them useful for tasks such as knowledge graph augmentation [1] and knowledge extraction [2]. Hence, tables cannot be ignored in the move towards Web 3.0. Plain table data (without any annotation) carries little meaning, whereas annotated tables are valuable sources with considerable research value. Semantic annotation of tabular data has gained much attention in recent years, and most works employ probabilistic graphical models for the annotation [3, 4]. Several units of a table can be annotated, such as its cells and columns: a cell can be assigned an entity, a column a type, and the relationship between a pair of columns a property. Although there are many benefits to annotating tables and employing them in knowledge extraction tasks, diverse languages and noisy mentions make it difficult for machines to interpret the semantics of tabular data.

The SemTab challenge has been organized every year since 2019 on matching tabular data to Wikidata or DBpedia [5]. This year's challenge is to match tabular data to Wikidata, DBpedia, or Schema.org properties or classes, depending on the round. A new set of difficulties, such as a larger-scale knowledge graph setting, knowledge graph data shifting, and the noisy schema structure of multiple knowledge graphs, has followed. Additionally, this year's challenge includes a more challenging, manually curated dataset (Tough Tables [6]), which offers more realistic issues than the previous challenge.

Figure 1: Tasks in SemTab 2022. An example table (columns Country, State, Capital; rows India/Rajasthan/Jaipur, USA/California/Sacramento, Germany/Bayern/München, France/Normandy/Rouen) annotated with a type in the KG for each column (CTA), an entity in the KG for each cell (CEA), and a property in the KG for each column pair (CPA).
The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2022) aims at benchmarking tabular data to knowledge graph matching systems. The challenge consists of three tasks: Column Type Annotation (CTA), Cell Entity Annotation (CEA), and Column Property Annotation (CPA). The CTA task assigns a semantic type to a column, the CEA task matches cells to entities in a specific KG, and the CPA task assigns a KG property to the relationship between two columns. These three tasks are illustrated in Figure 1. We propose an approach to solve the CTA task; internally it also produces intermediate results that could serve the CEA task, although we did not participate in CEA individually. For the CTA task, we use the Wikidata/DBpedia SPARQL endpoints to query the individual entities of each column and proceed from there.

Outline. The rest of the paper is organised as follows: Section 2 presents work from previous SemTab challenges. Section 3 defines the proposed approach to solve the CTA task, while Section 4 discusses the results for Round 1. The conclusion and future directions of this work are given in Section 5.

2. Related Work

The MTab tool supports multilingual tables and can process various table formats [7]. The referent entity of a table cell is detected using a graphical model with an iterative probability propagation algorithm in [8]. MTab4Wikidata [9] combines statement search and fuzzy search to handle noisy mentions, which improves entity search. Some works, such as DAGOBAH [10] and MantisTable SE [11], proposed new formulas for ranking the matching results. The MTab system [12] is based on an aggregation of multiple cross-lingual lookup services and a probabilistic graphical model. CSV2KG (IDLab) also uses multiple lookup services to improve matching performance [13]. Tabular ISI implements the lookup part with the Wikidata API and Elastic Search over DBpedia labels and aliases [14]. The ADOG system [15] also uses Elastic Search to index the knowledge graph. LOD4ALL first checks with an ASK SPARQL query whether an entity with a label similar to the table cell is available, and otherwise performs a DBpedia entity search [16]. The DAGOBAH system performs entity linking with a lookup on Wikidata and DBpedia; the authors also used Wikidata entity embeddings to estimate the entity type candidates [17]. MantisTable provides a Web interface and an API for tabular data matching [18].

Figure 2: SemInt Architecture. A flowchart with an outer loop that selects a column of the preprocessed table and an inner loop that selects a cell of that column; terms that return empty results are refined first with the DBpedia/Wikidata API and then with the search engine ("Including results for"), terms that still return nothing are skipped, and the type fetched most frequently for the column's terms is selected as the column type.

3. Proposed Model

This section describes the architecture of our proposed system, named SemInt, whose components are depicted in Figure 2. We participate in SemTab for the first time, and in the CTA task only. SemInt follows a simple majority-voting-based lookup approach that nevertheless yields decent results: cell contents are looked up in the SPARQL endpoint of the target KG and, in case of null results, looked up again on a search engine (DuckDuckGo) to fix typos. The returned entity type with the highest number of votes per column is assigned as the type of that column.1

1 SemInt can be accessed through: https://github.com/abhiseksharma/SemInt
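To make this flow concrete, here is a minimal sketch (in Python with pandas) of the per-column majority vote. It assumes a hypothetical lookup_types(cell) helper that returns the candidate KG types for a cell value and an empty list when the lookup and all refinements fail; it illustrates the idea rather than reproducing our exact implementation.

    from collections import Counter
    from typing import Callable, Dict, List

    import pandas as pd

    def annotate_column_types(table: pd.DataFrame,
                              lookup_types: Callable[[str], List[str]]) -> Dict[str, str]:
        """Assign to each column the candidate type that gathers the most votes."""
        column_types: Dict[str, str] = {}
        for column in table.columns:
            votes: Counter = Counter()
            for cell in table[column].dropna().astype(str):
                candidates = lookup_types(cell)   # KG types for this cell, possibly empty
                if not candidates:                # term skipped when every lookup fails
                    continue
                votes.update(candidates)
            if votes:
                # The most frequent type among the column's cells becomes the column type (CTA).
                column_types[column] = votes.most_common(1)[0][0]
        return column_types

The lookup and term refinement steps that feed such a helper are described in Section 3.2.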
Assumptions. SemInt is developed keeping some assumptions in mind.

1. Assumption 1: The input table is oriented horizontally, i.e., each column represents values of the same type.
2. Assumption 2: The cell and column types defined in Wikidata/DBpedia use rdf:type and are of type owl:Class.

3.1. Loading of Tables and Selection of Terms

A set of files with tables is provided at the beginning. The files are selected one at a time and loaded as dataframes. SemInt then iterates over the columns of the loaded table, selecting one at a time. Terms are then selected from the chosen column.

3.2. Lookup

The chosen term is supplied to a SPARQL query to retrieve its candidate types from the online DBpedia/Wikidata repository. If no result is received from the knowledge graph for a term, that term is passed to the respective API (the DBpedia API or the Wikidata API) to obtain a candidate representation of the term. This is done because an empty result may be caused by a difference between the representation of the term stored in DBpedia/Wikidata and the representation in the table (such as lowercase or camel case, or the use of punctuation). Out of all the returned candidates, the first one is selected, and the query is executed again with the obtained candidate term. If the result is still empty, the term is passed through a search engine (this version of SemInt uses DuckDuckGo) to catch typos, by extracting the "Including results for" part of the search result. This step is not applied first because DBpedia/Wikidata may contain representations that are accurately listed in the table but that can confuse a search engine. After the search engine has corrected any typo, the query is run one last time to look for a non-empty result. If the result is still empty, SemInt skips the term and proceeds to the next one. When a result is not empty, it is saved in a table with the terms in one column and the types returned by the repository in the other.

We have used the following SPARQL query for the above lookup:

    SELECT DISTINCT ?o WHERE { ?s rdfs:label "<term>"@en . ?s wdt:P31 ?o . }

The <term> in the above query is the entry/concept/term in the cell of the dataset, which is queried for its type in DBpedia or Wikidata (depending on the dataset).

3.3. Type Selection

The frequency of the entity types in the saved term-type table is taken into consideration while choosing the column type (for the CTA task). The column type is determined by the entity type with the highest frequency.
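As a concrete illustration of the lookup in Section 3.2 against the Wikidata endpoint, the following sketch injects a cell term into the generic query above using the SPARQLWrapper library. The function name, the user-agent string, and the omission of the API and search-engine refinement steps are assumptions of this sketch, not part of the submitted system.

    from SPARQLWrapper import SPARQLWrapper, JSON

    WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

    def lookup_types(term):
        """Return the wdt:P31 ('instance of') values of entities whose English label is `term`."""
        sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="SemInt-sketch/0.1")
        sparql.setReturnFormat(JSON)
        # Inject the cell content as <term>; rdfs: and wdt: are predefined
        # prefixes at the Wikidata query service, so no PREFIX block is needed.
        escaped = term.replace('\\', '\\\\').replace('"', '\\"')
        sparql.setQuery(
            'SELECT DISTINCT ?o WHERE { '
            f'?s rdfs:label "{escaped}"@en . '
            '?s wdt:P31 ?o . }'
        )
        results = sparql.query().convert()
        return [b["o"]["value"] for b in results["results"]["bindings"]]

A function of this shape can play the role of the lookup helper assumed in the voting sketch of Section 3; for DBpedia the endpoint and the typing predicate would differ.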
4. SemInt Performance and Results

This section presents the performance and results of SemInt at SemTab 2022 for Round 1, the only one of the three rounds for which we submitted results. SemInt was also run on the datasets of Rounds 2 and 3. In Round 2, SemInt obtained partial results locally but was unable to complete execution due to external factors, so we had to skip the submission for that round. For Round 3, SemInt ran completely on the dataset and produced results, but after submission the evaluation scores (F1, recall, precision) came out as 0; we suspect that the output KG types were represented in the wrong format in the submitted CSV file.

Round 1. This year's first round has three tasks: CTA-WD (Column Type Annotation using Wikidata), CEA-WD (Cell Entity Annotation using Wikidata), and CPA-WD (annotating the relationship between two columns with a Wikidata property). SemInt submitted results for the CTA-WD task of Round 1. The comparative results are presented in Table 1.

Table 1
Results of Round 1 for the CTA-WD task

    System        Precision   F1
    DAGOBAH       0.975       0.975
    s-elBat       0.951       0.957
    Kepler-aSI    0.944       0.944
    KGCODE-Tab    0.944       0.942
    JenTab        0.940       0.938
    AMALGAM       0.793       0.786
    Laurent       0.785       0.770
    SemInt        0.794       0.794

5. Conclusion

This paper presented the first version of the SemInt approach; we are participating in this challenge for the first time. We used a combination of strategies to tackle the tasks of SemTab 2022 and achieved encouraging performance: preprocessing, iterative term improvement techniques, and iterative querying over the SPARQL endpoints of Wikidata/DBpedia. SemInt injects the cell contents of a table into a generic SPARQL query. SemInt at SemTab 2022 is a promising approach that will be further improved. Our focus will be on decreasing the space and time requirements of the system. We will try to incorporate Big Data or machine learning approaches to improve data processing. To speed up the process and handle large data, we will employ parallel processing techniques and varying search strategies. Eventually, we want to extend the system to all the tasks, i.e., CTA, CEA, and CPA, over all the data sources.

References

[1] D. Ritze, O. Lehmberg, Y. Oulabi, C. Bizer, Profiling the potential of web tables for augmenting cross-domain knowledge bases, in: Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 251–261.
[2] D. Wang, P. Shiralkar, C. Lockard, B. Huang, X. L. Dong, M. Jiang, TCN: Table convolutional network for web table interpretation, in: Proceedings of the Web Conference 2021, 2021, pp. 4020–4032.
[3] G. Limaye, S. Sarawagi, S. Chakrabarti, Annotating and searching web tables using entities, types and relationships, Proceedings of the VLDB Endowment 3 (2010) 1338–1347.
[4] V. Mulwad, T. Finin, A. Joshi, Semantic message passing for generating linked data from tables, in: International Semantic Web Conference, Springer, 2013, pp. 363–378.
[5] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: European Semantic Web Conference, Springer, 2020, pp. 514–530.
[6] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully evaluating entity linking for tabular data, in: International Semantic Web Conference, Springer, 2020, pp. 328–343.
[7] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular data annotation with MTab tool, in: SemTab@ISWC, 2021, pp. 92–101.
[8] L. Yang, S. Shen, J. Ding, J. Jin, GBMTab: A graph-based method for interpreting noisy semantic table to knowledge graph, in: SemTab@ISWC, 2021, pp. 32–41.
[9] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, MTab4Wikidata at SemTab 2020: Tabular data annotation with Wikidata, SemTab@ISWC 2775 (2020) 86–95.
[10] V.-P. Huynh, J. Liu, Y. Chabot, T. Labbé, P. Monnin, R. Troncy, DAGOBAH: Enhanced scoring algorithms for scalable annotations of tabular data, in: SemTab@ISWC, 2020, pp. 27–39.
[11] M. Cremaschi, R. Avogadro, A. Barazzetti, D. Chieregato, E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, MantisTable SE: An efficient approach for the semantic table interpretation, in: SemTab@ISWC, 2020, pp. 75–85.
[12] P. Nguyen, N. Kertkeidkachorn, R. Ichise, H. Takeda, MTab: Matching tabular data to knowledge graph using probability models, arXiv preprint arXiv:1910.00246 (2019).
[13] B. Steenwinckel, G. Vandewiele, F. De Turck, F. Ongenae, CSV2KG: Transforming tabular data into semantic knowledge, SemTab, ISWC Challenge (2019).
[14] A. Thawani, M. Hu, E. Hu, H. Zafar, N. T. Divvala, A. Singh, E. Qasemi, P. A. Szekely, J. Pujara, Entity linking to knowledge graphs to infer column types and properties, SemTab@ISWC 2019 (2019) 25–32.
[15] D. Oliveira, M. d'Aquin, ADOG: Annotating data with ontologies and graphs, in: SemTab@ISWC, 2019.
[16] H. Morikawa, Semantic table interpretation using LOD4ALL, SemTab@ISWC 2019 (2019) 49–56.
[17] J. Liu, R. Troncy, DAGOBAH: An end-to-end context-free tabular data semantic annotation system, SemTab@ISWC (2019).
[18] M. Cremaschi, R. Avogadro, D. Chieregato, MantisTable: An automatic approach for the semantic table interpretation, SemTab@ISWC 2019 (2019) 15–24.