Yet Another Milestone for Kepler-aSI at SemTab 2022

Wiem Baazouzi1, Marouen Kachroudi2 and Sami Faiz3

1 Université de Manouba, Ecole Nationale des Sciences de l'Informatique, Laboratoire de Recherche en Génie Logiciel, Applications Distribuées, Manouba 2010, Tunis, Tunisie.
2 Université de Tunis El Manar, Faculté des Sciences de Tunis, Informatique Programmation Algorithmique et Heuristique, LR11ES14, 2092, Tunis, Tunisie.
3 Université de Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, Laboratoire de Télédétection et Systèmes d'Information à Référence Spatiale, 99/UR/11-11, 2092, Tunis, Tunisie.

Abstract
In this paper, we present our system, Kepler-aSI, for the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2022). The system participates in this challenge edition for the second time, bringing improvements and new technical aspects. Kepler-aSI analyzes tabular data in order to detect correct matches in Wikidata and DBpedia. It should be noted that each data resource and each round of the challenge imposes a certain number of constraints, requiring advanced techniques. The aforementioned task turns out to be difficult for machines, and demands an additional effort to deploy cognitive capacity in the matching methods. Kepler-aSI [1, 2, 3, 4] still relies on SPARQL queries to semantically annotate tables in Knowledge Graphs (KG), in order to solve the critical problems of the matching tasks. The results obtained during the evaluation phase are encouraging and show the strengths of the proposed system.

Keywords
Tabular Data, Knowledge Graph, Kepler-aSI, SPARQL

1. Introduction
It is evident that the World Wide Web encompasses and conveys very large volumes of textual information, in several forms: unstructured text, semi-structured model-based web pages (which represent data in the widely recognized key-value and list notations), and, of course, tables.
Contact: wiem.baazouzi@ensi-uma.tn (W. Baazouzi); marouen.kachroudi@fst.rnu.tn (M. Kachroudi); sami.faiz@insat.rnu.tn (S. Faiz). ORCID: 0000-0002-6512-7382 (W. Baazouzi); 0000-0002-7536-0428 (M. Kachroudi); 0000-0001-7065-6572 (S. Faiz). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073. SemTab 2022.

In this broad context, methods aiming to extract information from these resources in order to convert it into a structured form have been the subject of several works [5, 6]. As an observation, it is evident that a lack of understanding of the semantic structure can hamper the process of data analysis. This observation reveals a gap between data islands. Indeed, acquiring this semantic reconciliation will therefore be very useful for data integration, data cleansing, data mining, machine learning and knowledge discovery tasks. For example, understanding the data can help assess the appropriate types of transformation. Depending on the use and deployment scenario, tabular data is conveyed to the Web in various formats. The majority of these datasets are available in tabular form, e.g., CSV (Comma-Separated Values). The main reason for the popularity of this format is its simplicity: many common office tools are available to facilitate its generation and use. Tables on the Web are a very valuable data source. Thus, injecting semantic information into tables on the Web has the potential to boost a wide range of applications, such as web search, query answering, and building Knowledge Bases (KB). Research reports various issues with tabular data available on the Web, such as learning with limited labeled data, defining or updating ontologies, exploiting prior knowledge, and/or scaling up existing solutions.
Therefore, this task is often difficult in practice, due to missing, incomplete or ambiguous metadata (e.g., table and column names). In recent years, several works have been proposed that can be mainly classified as supervised (relying on annotated tables to carry out a learning task) [7, 8, 9, 10, 11] or unsupervised (tables whose data is not dedicated to learning) [12, 11]. To solve these problems, we propose a global approach named Kepler-aSI, which addresses the challenge of matching tabular data to knowledge graphs. This method is based on our previous work on ontology alignment [13, 14, 15, 16, 17]. This year's SemTab challenge differs from the last two sessions1 2 in that it deals with Wikidata and DBpedia. In this challenge, the input is a CSV file, but three different tasks had to be addressed:
1. CTA: assign a class of the Wikidata (or possibly DBpedia) KG ontology to a column (Column-Type Annotation).
2. CEA: match a Wikidata or DBpedia entity to the individual cells (Cell-Entity Annotation).
3. CPA: assign a KG (Wikidata or DBpedia) property to the relationship between two columns (Column-Property Annotation).
Data annotation is a fundamental process in tabular data analysis [18, 19]: it allows us to infer the meaning of the information in a table and, from there, to map the table onto a Knowledge Graph. The data we used was based on both Wikidata and DBpedia. It should be noted that, in a broader context, the data used and manipulated obey the triple representation: a subject (𝒮), a predicate (𝒫) and an object (𝒪). This notation ensures semantic navigability through the data and makes all data manipulation more fluid, explicit and reliable. Indeed, Cell Entity Annotation (CEA) matches a cell to a KG entity; at this level, we have to annotate each individual element of the subject (𝒮) and the object (𝒪).
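To illustrate the triple reading of a table row described above, the following sketch uses a hypothetical row (the cell values and property names are illustrative, not drawn from the challenge data): the subject cell pairs with each object cell via a property, yielding (𝒮, 𝒫, 𝒪) triples.

```python
# Minimal sketch: reading a table row as subject/predicate/object triples.
# The property names "country" and "population" are illustrative placeholders
# for the KG properties that CPA would later identify.
row = {"City": "Paris", "Country": "France", "Population": "2148000"}

subject = row["City"]  # subject cell S(i,0)
triples = [
    (subject, "country", row["Country"]),        # links two columns (CPA view)
    (subject, "population", row["Population"]),
]

for s, p, o in triples:
    print(f"({s}, {p}, {o})")
```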
Column Property Annotation (CPA) assigns a KG property to the relationship between two columns; the task is to find the property that connects the two columns in either Wikidata or DBpedia. Column Type Annotation (CTA) assigns a semantic type to a column. Our goal is to design a fast and efficient approach to annotate tabular data with entities from Wikidata or DBpedia. Our approach combines a multitude of NLP, search and filtering strategies, based on text preprocessing techniques. Experiments carried out in the context of SemTab 2022 for all tasks have shown encouraging results.
1 https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2019/
2 https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/
2. Kepler-aSI approach
In this section, we describe in detail the different stages of our system, while presenting some basic notions to highlight the technical issues involved.
2.1. Key notions
• Tabular Data: 𝑆 is a two-dimensional tabular structure made up of an ordered set of N rows and M columns, as depicted by Figure 1. 𝑛𝑖 is a row of the table (i = 1 ... N) and 𝑚𝑗 is a column of the table (j = 1 ... M). The intersection between a row 𝑛𝑖 and a column 𝑚𝑗 is 𝑐𝑖,𝑗, the value of the cell 𝑆𝑖,𝑗. The table contents can have different types (string, date, float, number, etc.).
– Target Table (S): N × M.
– Subject Cell: 𝑆(𝑖,0) (i = 1, 2 ... N).
– Object Cell: 𝑆(𝑖,𝑗) (i = 1, 2 ... N), (j = 1, 2 ... M).

         Col0   …   Col𝑖   …   Col𝑀
Row1   ⎛ 𝑆1,0   …   𝑆1,𝑖   …   𝑆1,𝑀 ⎞
  ⋮    ⎜  ⋮     ⋱    ⋮     ⋱    ⋮   ⎟
Row𝑗   ⎜ 𝑆𝑗,0   …   𝑆𝑗,𝑖   …   𝑆𝑗,𝑀 ⎟
  ⋮    ⎜  ⋮     ⋱    ⋮     ⋱    ⋮   ⎟
Row𝑁   ⎝ 𝑆𝑁,0   …   𝑆𝑁,𝑖   …   𝑆𝑁,𝑀 ⎠

Figure 1: Target Table
• Knowledge Graph: Knowledge Graphs have been in the focus of research since 2012, resulting in a wide variety of published descriptions and definitions, yet lacking a common core, a fact also indicated by Paulheim [20] in 2015.
In his survey of Knowledge Graph refinement, Paulheim listed the minimum set of characteristics that must be present to distinguish Knowledge Graphs from other knowledge collections, which basically restricts the term to any graph-based knowledge representation. In [20], the authors agreed that a more precise definition was hard to find at that point; this statement points out the need for closer investigation and deeper reflection in this area. Färber et al. defined a Knowledge Graph as a Resource Description Framework (RDF) graph and stated that the term KG was coined by Google to describe any graph-based Knowledge Base (KB) [21]. Although this definition is the only formal one, it contradicts more general definitions, as it explicitly requires the RDF data model. In the following, we present a detailed description of our contribution, namely Kepler-aSI.
2.2. System description
In order to address the above-mentioned SemTab challenge tasks, Kepler-aSI is designed according to the workflow depicted by Figure 2. There are three major complementary modules, consisting respectively of Preprocessing, Annotation Context and Tabular Data to KG Matching. These steps are the same for each round, with minimal changes depending on the variations observed in each case.
Figure 2: Kepler-aSI Workflow
As shown in Figure 2, Preprocessing aims to prepare the data inside the considered table, while Annotation Context seeks to create a list of terms denoting the same context.
2.2.1. Preprocessing
It should be noted that the content of each table can be expressed in different types and formats, namely: numeric, character strings, binary data, date/time, boolean, addresses, etc. Given this great diversity of data types, the preprocessing step is crucial: its goal is to ensure that the processing of each table is triggered without errors.
The effort is especially accentuated when the data contains spelling errors; these issues must be resolved before applying our approach. To carry out this step properly, we used several techniques and libraries (TextBlob3, Pyspellchecker4, etc.) to rectify and correct the noisy textual data in the considered tables. For example, we detect punctuation, parentheses, hyphens, apostrophes and stop words, and remove them using the Pandas5 library. As a classic final treatment in this register, we ended this phase by transforming all upper-case letters into lower case.
3 https://textblob.readthedocs.io/en/dev/
4 https://pypi.org/project/pyspellchecker/
5 https://pandas.pydata.org
2.2.2. Annotation context
This phase explicitly extracts the candidates for the annotation process. It starts with a column-content analysis, which aims to understand and delimit a set of regular expressions covering a set of units: area, currency, density, electric current, energy, flow rate, force, frequency, energy efficiency, unit of information, length, mass, numbers, population density, power, pressure, speed, temperature, time, torque, voltage and volume. This step identifies multiple regex types using regular expressions (e.g., numbers, geographic coordinates, addresses, codes, colors, URLs). Since all values of type text are selected, preprocessing for natural languages was performed using the langrid6 library to detect the 26 languages present in our data. This is, by the way, a novelty of this year's SemTab challenge, which makes the task more difficult by introducing natural language barriers. The langrid library is a stand-alone language identification tool, trained on a large number of languages (97 currently). In doing so, correction, data type detection and language detection are performed.
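The cell-cleaning and regex-type detection steps can be sketched as follows. This is a simplified, standard-library-only sketch under our own assumptions: the actual system relies on TextBlob, Pyspellchecker, Pandas and langrid, and covers many more unit and type patterns than the three shown here.

```python
import re
import string

# A few illustrative regex types; the real system covers many more units
# (area, currency, temperature, geographic coordinates, URLs, etc.).
REGEX_TYPES = {
    "number": re.compile(r"^-?\d+([.,]\d+)?$"),
    "url": re.compile(r"^https?://\S+$"),
    "geo_coordinates": re.compile(r"^-?\d+\.\d+\s*,\s*-?\d+\.\d+$"),
}

def preprocess(cell: str) -> str:
    """Strip punctuation and lower-case a cell value (simplified cleaning)."""
    cell = cell.translate(str.maketrans("", "", string.punctuation))
    return cell.strip().lower()

def detect_regex_type(cell: str) -> str:
    """Return the first matching regex type, or 'text' by default."""
    for name, pattern in REGEX_TYPES.items():
        if pattern.match(cell.strip()):
            return name
    return "text"

print(preprocess("Tunis,"))             # -> tunis
print(detect_regex_type("36.8, 10.1"))  # -> geo_coordinates
```

Running the detection once per column, rather than per cell, is what makes the caching discussed next worthwhile.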
This can considerably reduce the effort and cost of executing our approach, by avoiding the massive repetition of these treatments over all the table cells in each subtask.
2.2.3. Assigning a semantic type to a column (CTA)
As depicted by Figure 3, the task is to annotate each entity column with elements from Wikidata (or possibly DBpedia) as the type identified during the preprocessing phase.
Figure 3: CTA task at a glance.
6 https://github.com/openlangrid
Each item is marked with its tag in Wikidata or DBpedia. This treatment allows semantic identification. The CTA task can be performed through the Wikidata or DBpedia APIs, which allow us to search for an item according to its description. The main information collected about a given entity and used in our approach is: the list of instances (expressed by the instanceOf primitive, accessible through code P31), the subclasses (expressed by the subclassOf primitive, accessible through code P279) and the overlaps (expressed by the partOf primitive, accessible through code P361). At this point, we are able to process the CTA task using a SPARQL query. The SPARQL query is our means of interrogation, fed by the main information about the entity that governs the choice of each data type: its list of instances (P31), its subclasses (P279) or its membership of a class (P361). The SPARQL query may return a single type, but in some cases the result contains more than one type; in that case, no annotation is produced for the CTA task.
2.2.4. Matching a cell to a KG entity (CEA)
The CEA task aims to annotate the cells of a given table with a specific entity listed in Wikidata or DBpedia.
Figure 4: Descriptive model of CEA task.
Figure 4 presents the CEA task, which can be performed based on the same principle as the CTA task. Our approach reuses the results of the CTA task by introducing the necessary modifications to the SPARQL query.
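The kind of P31/P279/P361 lookup used for CTA (and adapted for CEA) can be sketched as a SPARQL query built from a cell label. The query shape below is an illustrative sketch, not Kepler-aSI's exact query; it relies on the prefixes (`wdt:`, `rdfs:`) that the Wikidata endpoint predefines.

```python
# Sketch of a CTA-style type lookup against the Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_cta_query(label: str, lang: str = "en") -> str:
    """Build a query for candidate types of an entity matching `label`,
    via instance of (P31), subclass of (P279) and part of (P361)."""
    return f"""
    SELECT DISTINCT ?type WHERE {{
      ?entity rdfs:label "{label}"@{lang} .
      {{ ?entity wdt:P31 ?type . }}
      UNION {{ ?entity wdt:P279 ?type . }}
      UNION {{ ?entity wdt:P361 ?type . }}
    }} LIMIT 10
    """

query = build_cta_query("Tunis")
# The string can then be sent to WIKIDATA_ENDPOINT with any HTTP client
# or a wrapper library such as SPARQLWrapper.
```

If such a query returns several candidate types for a column, the disambiguation step described below comes into play.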
If the operation returns more than one annotation, and since we are conducting a fuzzy search [22, 23], we run a process that examines the context of the considered column, relative to what was obtained with the CTA task, in order to overcome the ambiguity problem.
2.2.5. Matching a property to a KG entity (CPA)
After having annotated the cell values as well as the types of each considered entity, we identify the relationships between two cells appearing in the same row via a property, using a SPARQL query, as flagged by Figure 5. Indeed, the CPA task aims to annotate the relationship between two cells in a row via a property. This latter task can be performed in a manner analogous to the CTA and CEA tasks; the only difference is that the CPA SPARQL query must select both the entity and the corresponding attributes. The properties are fairly easy to match, since we have already determined them during CEA and CTA processing.
Figure 5: A representation of CPA task.
3. Kepler-aSI performance and results
In this section, we present the results of Kepler-aSI for the different matching tasks in the 3 rounds of SemTab 2022. These results highlight the strengths of Kepler-aSI, with encouraging performance despite the multiplicity of issues.
3.1. Round 1
In this first round of SemTab 2022, three tasks are presented: CTA-WD, CEA-WD and CPA-WD. Column Type Annotation (CTA-WD) assigns a Wikidata semantic type (a Wikidata entity) to a column. Cell Entity Annotation (CEA-WD) maps a cell to a KG entity. An annotation should be represented by its full IRI; case is not sensitive. For CTA, each line should include a column identified by a table ID and a column ID, along with the column annotation (a Wikidata item). This means that a row must include three fields: "Table ID", "Column ID" and "IRI Annotation", where:
• "Table ID" is the filename of the table data, without the file extension.
• "Column ID" is the position of the column in the input, starting from 0, i.e., the ID of the first column is 0.
• "IRI Annotation" uses the prefix http://www.wikidata.org/entity/ instead of https://www.wikidata.org/wiki/, the URL prefix of the Wikidata page.
For CEA, i.e., associating a cell with an entity of the Knowledge Graph, the task is to annotate each target cell with an entity from Wikidata; a cell is annotated by an entity with the prefix http://www.wikidata.org/entity/. Each CEA annotation must contain the annotation of a cell identified by a table identifier, a column identifier and a row identifier. Namely, an annotation must have four fields: "Table ID", "Row ID", "Column ID" and "Entity IRI", where:
• "Table ID" is the filename of the table data, without the .csv extension.
• "Column ID" is the position of the column in the table file, starting from 0, i.e., the ID of the first column is 0.
• "Row ID" is the position of the row in the table file, starting from 0, i.e., the ID of the first row is 0.
• "Entity IRI" uses the prefix http://www.wikidata.org/entity/ instead of https://www.wikidata.org/wiki/, the URL prefix of the Wikidata page.
As for Column Property Annotation with Wikidata (CPA-WD), it consists in annotating the relations between the columns of a table with Wikidata properties. Each annotation must contain the annotation of a pair of columns, itself identified by a table identifier, a first column identifier and a second column identifier. Namely, a row must have four fields: "Table ID", "Column ID 1", "Column ID 2" and "Property IRI".
Each pair of columns must be annotated by at most one property, as follows:
• "Table ID" does not include the filename extension.
• "Column ID 1" and "Column ID 2" are the positions of the columns in the table file, starting from 0, i.e., the ID of the first column is 0.
• "Property IRI" uses the prefix http://www.wikidata.org/prop/direct/ instead of https://www.wikidata.org/wiki/, the URL prefix of the Wikidata page.
It should be noted that the CTA-WD, CEA-WD and CPA-WD task data contains 3691 tables. Results are summarized in Table 1.

Table 1: Results for Round 1
Task   APrecision   AF1     Rank
CTA    0.944        0.944   3/10
CEA    —            —       —
CPA    0.937        0.937   4/10

3.2. Round 2
Round 2 includes 3 main families of tests, the results of which are summarized in Table 2:
• HardTables (HT-WD): 4649 tables;
• ToughTablesR2-WD (2T-WD): 114 tables;
• ToughTablesR2-DBP (2T-DBP): 114 tables.

Table 2: Results for Round 2
Task                  APrecision   AF1     Rank
HardTables-CTA-WD     0.881        0.811   3/6
HardTables-CEA-WD     —            —       —
HardTables-CPA-WD     0.912        0.912   3/6
ToughTables-CTA-WD    0.369        0.369   3/6
ToughTables-CEA-WD    —            —       —
ToughTables-CTA-DBP   0.154        0.154   5
ToughTables-CEA-DBP   —            —       —

3.3. Round 3
Round 3 includes 3 main families of tests; metrics are in Table 3:
• GitTables schema: 45 tables;
• GitTables DBP: 6898 tables;
• BiodivTab: 45 tables.

Table 3: Results for Round 3
Task                APrecision   AF1     Rank
BiodivTab-CTA-DBP   0.781        0.731   3/7
BiodivTab-CEA-DBP   0.534        0.534   4/7
GitTables-CTA-DBP   —            —       —
GitTables-CTA-SCH   —            —       —

In Round 3, we realized that there were significant amounts of entity duplication in our results. Thus, the matching process was improved by adding the following features. First, spell checking of misspelled sentences was used. In addition, approaches that resolve content duplications can achieve results without column duplication.
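Detecting such duplicates amounts to a pairwise comparison of rows. A minimal sketch of the idea, using the standard library's difflib as a stand-in for an edit-distance library like FuzzyWuzzy (the threshold value is an assumption for illustration):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Edit-distance-based similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_near_duplicates(rows, threshold=0.9):
    """Compare each row to every other row (pairwise) and return
    index pairs whose joined cell values are near-identical."""
    joined = [" ".join(r) for r in rows]
    pairs = []
    for i in range(len(joined)):
        for j in range(i + 1, len(joined)):
            if similarity(joined[i], joined[j]) >= threshold:
                pairs.append((i, j))
    return pairs

rows = [
    ["Paris", "France"],
    ["paris", "France"],   # near-duplicate of row 0 (case difference only)
    ["Tunis", "Tunisia"],
]
print(find_near_duplicates(rows))  # -> [(0, 1)]
```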
To overcome duplicate columns, we used fuzzy matching in pandas to detect duplicate rows efficiently. In fact, FuzzyWuzzy is an implementation of edit distance, which is a good candidate for constructing a pairwise distance matrix in numpy or similar. To detect "duplicates", or close matches, we have to compare each row to the other rows; otherwise we can never know whether two of them are close to each other.
4. Conclusion
To conclude, we have presented in this paper the second version of our Kepler-aSI approach. Our system is participating in the challenge for the third time; it is approaching maturity and achieving very encouraging performance. We have succeeded in combining several strategies and processing techniques, which is also the strength of our system. We boosted the preprocessing and spell-checking steps that got the system up and running. In addition, despite the quite large data size, we managed to get around this problem by using a kind of local dictionary, which allows us to reuse already-existing matches. We thus realized a considerable saving of time, which allowed us to adjust and rectify after each execution. We also participated in all the tasks without exception, which allowed us to test our system on all facets, i.e., to identify its strengths and weaknesses. In this paper, we presented our contribution to the SemTab 2022 challenge, Kepler-aSI. We tackled all the proposed tasks. Our solution is based on a generic SPARQL query using the cell contents as the description of a given item. In each round, despite the time allocated by the organizers running out, we continued the work and the improvements, with the conviction that each effort counts and brings us closer to mastering the studied field.
References
[1] W. Baazouzi, M. Kachroudi, S.
Faïz, Kepler-asi: Kepler as a semantic interpreter, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 5, 2020, volume 2775, 2020, pp. 50–58. [2] W. Baazouzi, M. Kachroudi, S. Faïz, Kepler-asi at semtab 2021, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103, 2021, pp. 54–67. [3] W. Baazouzi, M. Kachroudi, S. Faiz, Towards an efficient fairification approach of tabular data with knowledge graph models, in: Proceedings of the 26th Knowledge-Based and Intelligent Information Engineering Systems International Conference KES 2022, volume 207, 2022, pp. 2727–2736. [4] W. Baazouzi, M. Kachroudi, S. Faiz, A matching approach to confer semantics over tabular data based on knowledge graphs, in: Proceedings of the 11th International Conference on Model and Data Engineering, Springer, 2023, pp. 236–249. [5] J. Chen, E. Jiménez-Ruiz, I. Horrocks, C. Sutton, Colnet: Embedding the semantics of web tables for column type prediction, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 29–36. [6] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, A. Bielefeldt, Getting the most out of wikidata: Semantic technology usage in wikipedia’s knowledge graph, in: International Semantic Web Conference, Springer, 2018, pp. 376–394. [7] M. Pham, S. Alse, C. A. Knoblock, P. Szekely, Semantic labeling: a domain-independent approach, in: International Semantic Web Conference, Springer, 2016, pp. 446–462. [8] M. Taheriyan, C. A. Knoblock, P. Szekely, J. L. Ambite, Learning the semantics of structured data sources, Journal of Web Semantics 37 (2016) 152–169. [9] S. K. Ramnandan, A. Mittal, C. A.
Knoblock, P. Szekely, Assigning semantic labels to data sources, in: European Semantic Web Conference, Springer, 2015, pp. 403–417. [10] C. A. Knoblock, P. Szekely, J. L. Ambite, A. Goel, S. Gupta, K. Lerman, M. Muslea, M. Taheriyan, P. Mallick, Semi-automatically mapping structured sources into the semantic web, in: Extended Semantic Web Conference, Springer, 2012, pp. 375–390. [11] M. Cremaschi, F. De Paoli, A. Rula, B. Spahiu, A fully automated approach to a complete semantic table interpretation, Future Generation Computer Systems (2020). [12] Z. Zhang, Effective and efficient semantic table interpretation using tableminer+, Semantic Web 8 (2017) 921–957. [13] M. Kachroudi, G. Diallo, S. Ben Yahia, OAEI 2017 results of KEPLER, in: Proceedings of the 12th International Workshop on Ontology Matching co-located with the 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 21, 2017, volume 2032 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 138–145. [14] M. Kachroudi, S. Ben Yahia, Dealing with direct and indirect ontology alignment, J. Data Semant. 7 (2018) 237–252. [15] M. Kachroudi, G. Diallo, S. Ben Yahia, KEPLER at OAEI 2018, in: Proceedings of the 13th International Workshop on Ontology Matching co-located with the 17th International Semantic Web Conference, OM@ISWC 2018, Monterey, CA, USA, October 8, 2018, volume 2288 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 173–178. [16] M. Kachroudi, S. Zghal, S. Ben Yahia, Bridging the multilingualism gap in ontology alignment, International Journal of Metadata, Semantics and Ontologies 9 (2014) 252–262. [17] M. Kachroudi, S. Zghal, S. Ben Yahia, Using linguistic resource for cross-lingual ontology alignment, International Journal of Recent Contributions from Engineering 1 (2013) 21–27. [18] J. Chen, E. Jiménez-Ruiz, I. Horrocks, C. Sutton, Learning semantic annotations for tabular data, arXiv preprint arXiv:1906.00781 (2019). [19] V. Efthymiou, O. Hassanzadeh, M. 
Rodriguez-Muro, V. Christophides, Matching web tables with knowledge base entities: from entity lookups to entity embeddings, in: International Semantic Web Conference, Springer, 2017, pp. 260–277. [20] L. Ehrlinger, W. Wöß, Towards a definition of knowledge graphs, SEMANTiCS (Posters, Demos, SuCCESS) 48 (2016) 1–4. [21] M. Färber, F. Bartscherer, C. Menne, A. Rettinger, Linked data quality of dbpedia, freebase, opencyc, wikidata, and yago, Semantic Web 9 (2018) 77–129. [22] H. Akremi, S. Zghal, Dof: a generic approach of domain ontology fuzzification, Frontiers Comput. Sci. 15 (2021) 153322. [23] H. Akremi, M. G. Ayadi, S. Zghal, To medical ontology fuzzification purpose: Covid-19 study case, Procedia Computer Science 207 (2022) 1027–1036.