JenTab: Do CTA Solutions Affect the Entire Scores?

Nora Abdelmageed 1,2,*, Sirko Schindler 1
1 Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena
2 Michael Stifel Center Jena, Friedrich Schiller University Jena

SemTab@ISWC 2022, October 23-27, 2022, Hangzhou, China (Virtual)
* Corresponding author: nora.abdelmageed@uni-jena.de (N. Abdelmageed); sirko.schindler@uni-jena.de (S. Schindler)
ORCID: 0000-0002-1405-6860 (N. Abdelmageed); 0000-0002-0964-4457 (S. Schindler)

Abstract
Semantic Table Annotation remains a crucial task for exploiting tabular data in knowledge-aware systems. In the process, however, annotation systems have to overcome various issues ranging from mere typos and inconsistent naming conventions to homonymy among values. The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) continues to provide demanding datasets to evaluate annotation systems and drive their continued development. In this paper, we describe JenTab's adaptations to the 2022 edition of SemTab: in particular, we added a preprocessing step to target the excessive misspellings of Tough Tables (2T) and a new pipeline to exploit meaningful header information. In addition, for each round, we execute two different settings of Column Type Annotation (CTA) creation. We report on the impact of these changes on JenTab's results and highlight the effect of the CTA strategy on the overall score per round. Our GitHub repository: https://github.com/fusion-jena/JenTab

Keywords
Entity Linking, Cell Entity Annotation, Column Type Annotation, Column-Column Property Annotation, Semantic Table Annotation, JenTab, SemTab

1. Introduction

Tabular data such as CSV files are a common way to publish data and represent a precious resource. Nevertheless, they are hardly machine-interpretable in their raw form and are thus hidden from many automated processes. The annotation of regular tables with concepts from the Semantic Web faces various challenges, including misspellings, abbreviations, and the general ambiguity of free text. Over time, different approaches have been developed to cope with these issues and provide a semantic layer on top of common tables [1, 2, 3, 4, 9].

The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab, http://www.cs.ox.ac.uk/isg/challenges/sem-tab/) offers a forum for state-of-the-art systems to compare against one another and provides them with various datasets to challenge their capabilities. In its fourth year, it features a series of three rounds. Each round consists of a variety of raw tables that are to be annotated with concepts from either Wikidata [5] or DBpedia [6]. The annotation tasks themselves are collectively called Semantic Table Annotation (STA). Following the SemTab task descriptions, the three tasks are Cell Entity Annotation (CEA), CTA, and Column Property Annotation (CPA).

Figure 1: SemTab tasks summary [7]. The example table lists countries, their areas, and their capitals; panel (a) shows CEA (e.g., wd:Q79 "Egypt", wd:Q183 "Germany"), panel (b) shows CTA (wd:Q6256 "country"), and panel (c) shows CPA (wdt:P36 "capital").

Given a data table and a target Knowledge Graph (KG), CEA links a cell to an entity within the KG (cf. Figure 1a).
CTA is the task of assigning a semantic type (e.g., a class) to a column (cf. Figure 1b). Finally, CPA assigns a suitable semantic relation (predicate) from the KG to individual column pairs (cf. Figure 1c).

Our previous participations in the SemTab challenge showed that CTA is the hardest task to solve. The challenge call asks for the most precise type to annotate a given column; nevertheless, higher-level types can be considered when deciding on that fine-grained solution. We have, therefore, investigated the effect of using multiple CTA strategies on the results of the STA tasks. In this paper, we focus on analyzing JenTab's performance given various strategies for creating and selecting CTA solutions using the provided SemTab 2022 datasets. In addition, we developed a sophisticated cleaning module for the 2T dataset [8], which yielded a significant improvement in scores. Finally, we developed a new pipeline configuration that suits datasets with meaningful headers.

The remainder of this paper is organized as follows: Section 2 outlines the general approach of JenTab, its pipeline configurations, and CTA creation strategies. Section 3 gives an overview of this year's challenge datasets and requirements. Section 4 highlights the newly developed modules of JenTab during SemTab 2022. Section 5 discusses the characteristics of the SemTab 2022 datasets and our scores during the rounds under different settings. We conclude and point out future directions in Section 6.

2. Background

This section provides an overview of the general approach JenTab follows. Last year, we developed various pipelines based on the characteristics of the given datasets, such as pipeline_full, pipeline_no_cpa, or pipeline_numeric. All pipelines follow the Create, Filter, and Select (CFS) pattern developed during SemTab 2020 [4]. The default pipeline, pipeline_full, is outlined in Figure 2. For more details about the CFS pattern and our various pipelines, we refer to our previous publications in 2020 [4] and 2021 [9]. This year, during SemTab 2022, we focus on pipeline_full, which is the most potent pipeline due to its consistent performance on various datasets.

Figure 2: Abstract view of the pipeline_full [9]: initial candidates are created using row and column contexts, passed through a series of filtering modules, and a first attempt at selecting solutions is made; retry strategies recreate missing candidates from row and column contexts before a last-resort selection.

CTA solutions are crucial to solving STA. In 2020, we developed and investigated three strategies to create CTA candidates [7]. We give a brief overview of each strategy in a Wikidata context:

• P31 includes only direct parents using instance of (P31) relations. This strategy does not involve any further traversal of the class hierarchy.
• 2Hops extends "P31" with one additional parent (higher level) via subclass of (P279).
• Multi Hops creates a more general tree of parents following subclass of (P279) relations.

In our previous study, Multi Hops gave the lowest scores due to its consideration of very high-level types. Thus, this year, we focus our experiments on P31 and 2Hops only. Together with these CTA creation strategies, we have developed two CTA selection methods. The first is a "majority vote" technique that can be combined with any creation strategy; it does not rely on the hierarchical relations among the possible CTA candidates.
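To make these strategies concrete, the following is a minimal sketch of the P31 and 2Hops candidate creation and the majority-vote selection, assuming the public Wikidata SPARQL endpoint. JenTab itself resolves candidates through its own lookup services, so the endpoint, function names, and the property-path formulation here are illustrative only.

```python
# Minimal sketch of the "P31" and "2Hops" CTA creation strategies plus the
# majority-vote selection, assuming the public Wikidata SPARQL endpoint.
from collections import Counter
import requests

WDQS = "https://query.wikidata.org/sparql"  # illustrative; JenTab uses its own services

def cta_candidates(qid: str, two_hops: bool = False) -> set[str]:
    """Return CTA type candidates for one CEA entity.
    P31: direct classes via instance of (P31).
    2Hops: additionally one parent level via subclass of (P279)."""
    path = "wdt:P31/wdt:P279?" if two_hops else "wdt:P31"
    query = f"SELECT DISTINCT ?type WHERE {{ wd:{qid} {path} ?type . }}"
    resp = requests.get(WDQS, params={"query": query, "format": "json"}, timeout=60)
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return {b["type"]["value"].rsplit("/", 1)[-1] for b in bindings}

def majority_vote(column_entities: list[str], two_hops: bool = False) -> str:
    """Select the type supported by the largest number of cells in a column."""
    votes = Counter()
    for qid in column_entities:
        votes.update(cta_candidates(qid, two_hops))
    return votes.most_common(1)[0][0]

# e.g., majority_vote(["Q79", "Q183"]) should yield "Q6256" ("country")
# for the first column of the example table in Figure 1.
```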
The second selection method is a "Least Common Subsumer (LCS)" approach that selects the most fine-grained type from the hierarchy of CTA candidates such that this type has the maximum support among the column's cells.

3. SemTab 2022 Datasets & Requirements

In 2022, SemTab consisted of three rounds, with multiple datasets given for each round. Unlike in previous editions, partial ground truth data is available: each dataset was divided into a validation and a test set. The validation set is provided with ground truth data and the validator code. This allows a self-check on a small portion of the data before the actual system run and the final submission per round. Table 1 shows the given datasets, dev/test splits, target KGs, and the associated STA tasks per round. The Wikidata dump recommended by the challenge organizers is a custom N-Triples dump as of May 21st, 2022, hosted on Zenodo [10]. However, using a public API was also recommended since this dump version is very recent. Table 2 illustrates the characteristics of the SemTab 2022 datasets: the number of tables, the average numbers of rows, columns, and cells, and the number of target annotations for the CEA, CTA, and CPA tasks. In this paper, we focus on the test sets since they directly affect the scores. For submission, we were allowed multiple submissions per week, but only the most recent one was evaluated each Friday. This is unlike previous years, when we used to submit our solutions to an AICrowd page.

Table 1: Specifications per round of SemTab 2022.

Round | Dataset    | Dev/Test    | Target               | CEA | CTA | CPA
R1    | HardTables | 200/3,691   | Wikidata             | Yes | Yes | Yes
R2    | HardTables | 457/4,649   | Wikidata             | Yes | Yes | Yes
R2    | 2T         | 18/144      | Wikidata             | Yes | Yes | No
R2    | 2T         | 18/144      | DBpedia              | Yes | Yes | No
R3    | GitTables  | 4,097/4,110 | DBpedia & schema.org | No  | Yes | No
R3    | BiodivTab  | 5/45        | DBpedia              | Yes | Yes | No

Table 2: SemTab 2022 datasets. Targets created by JenTab are marked with a star (*).

Round | Dataset    | Tables | Avg. Rows (± Std Dev.) | Avg. Cols (± Std Dev.) | Avg. Cells (± Std Dev.) | CEA     | CTA   | CPA
R1    | HardTables | 3,691  | 6 ± 2                  | 3 ± 1                  | 14 ± 6                  | 26,189  | 4,511 | 5,745
R2    | HardTables | 4,649  | 6 ± 1                  | 3 ± 1                  | 14 ± 5                  | 22,009  | 4,534 | 3,954
R2    | 2T (WD)    | 144    | 1,181 ± 2,985          | 4 ± 2                  | 4,511 ± 11,602          | 586,118 | 443   | 299*
R2    | 2T (DB)    | 144    | 1,008 ± 2,710          | 4 ± 2                  | 3,787 ± 10,198          | 486,203 | 429   | 285*
R3    | BiodivTab  | 45     | 259 ± 743              | 24 ± 13                | 4,589 ± 10,862          | 33,405  | 569   | NA

4. What's New in JenTab?

In this section, we discuss our newly developed components. First, we explain the cleaning procedure for one of the provided datasets. Then, we present a newly created pipeline that selects CTA solutions based on header values.

Tough Tables Cleanup

We have developed a cleaning module for the 2T dataset. This dataset contains a large number of artificially added misspellings in its tables. Our core idea is thus to locate the correctly spelled cells and then replace all artificially misspelled occurrences with the correct word. The first step finds the correctly spelled words by querying those cells in the target KG, Wikidata; values with exact matches are considered correct. The second step matches the remaining values in the tables to the correctly identified ones. We converted all given cells into an embedding space using fastText [11] to avoid the out-of-vocabulary (OOV) problem. Then, we applied cosine similarity among those vectors and picked the closest correct value if the similarity is ≥ 70%. We ran this step offline, before the actual run of JenTab on the STA tasks.
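As an illustration, the following is a minimal sketch of the second, embedding-based step under simplifying assumptions: a local fastText model file (e.g., cc.en.300.bin) and a list of values already verified against the KG by exact label matching. Function names and the model file are illustrative, not JenTab's actual implementation.

```python
# Minimal sketch of the embedding-based replacement step of the 2T cleanup.
import fasttext
import numpy as np

model = fasttext.load_model("cc.en.300.bin")  # subword embeddings avoid the OOV problem

def embed(text: str) -> np.ndarray:
    """Return a length-normalized fastText vector for a cell value."""
    vec = model.get_sentence_vector(text.lower().replace("\n", " "))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def clean_cell(cell: str, verified: list[str], threshold: float = 0.7) -> str:
    """Replace a (possibly misspelled) cell by its closest verified value
    if the cosine similarity reaches the threshold; otherwise keep it as is."""
    cell_vec = embed(cell)
    best_value, best_sim = cell, 0.0
    for candidate in verified:
        sim = float(np.dot(cell_vec, embed(candidate)))  # cosine, both vectors are normalized
        if sim > best_sim:
            best_value, best_sim = candidate, sim
    return best_value if best_sim >= threshold else cell

# Example (hypothetical): clean_cell("Germani", ["Germany", "Egypt"]) should yield "Germany".
```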
New Pipeline: pipeline_headers

In addition to the previously developed pipelines [9], we added a new one, pipeline_headers, during Round 3 of SemTab 2022. It is based on pipeline_no_cpa, which contains all modules of the default configuration except the CPA create, filter, and select parts. However, the handling of CTA candidates has changed to accommodate datasets with meaningful headers: header values are used as a direct source of CTA candidates. Already in 2021, BiodivTab [12] was included in SemTab as an example of such a dataset. There, JenTab only achieved rather low scores: 60% and 10% on the CEA and CTA tasks, respectively [9]. Contrary to 2021, BiodivTab in 2022 asked for DBpedia annotations, replacing the previous target KG, Wikidata.

5. Experimental Results

Spelling mistakes and artificial noise are common challenges across SemTab's datasets, especially in the 2T dataset. We developed the generic lookup as our primary strategy for tackling this crucial issue. Due to the resources required for comparing cell values against all labels (and aliases) within Wikidata or DBpedia, we extracted the unique values from all dataset tables. Then, we matched those against the labels of the respective KG using an optimized Jaro-Winkler similarity implementation based on [13] and a minimum similarity threshold of 0.9. Table 3 illustrates the results of this approach. For the synthetic HardTables datasets, the matching percentage is high; it reached up to 99% in the first round. This is unlike the 2T dataset, where it only reached around 89% in the second round with DBpedia as the target KG. This lower matching percentage guided us to develop the more sophisticated cleaning step applied before the actual run, as discussed in Section 4.

Table 3: Generic Lookup: unique labels and ratio of resolved labels per round.

Round | Dataset         | Target   | Unique Labels | Unmatched | Matched | Matched (%)
R1    | HardTables      | Wikidata | 19,107        | 179       | 18,928  | 99%
R2    | HardTables & 2T | Wikidata | 74,177        | 6,191     | 67,986  | 91.6%
R2    | 2T              | DBpedia  | 65,223        | 6,988     | 58,235  | 89.3%

Table 4 shows our scores for the three rounds of SemTab 2022, as reported in the results sheet after each round. In each week of a round, we submitted results for a different pipeline setting. For example, during the first week of Round 1, we submitted the results of pipeline_full combined with the "P31" CTA creation strategy and the majority vote as the selection strategy. In week two of the same round, we tested the same pipeline with the "2Hops" CTA creation strategy combined with the LCS selection technique instead. From the results, the P31 strategy combined with the majority vote selection yielded the best scores on the HardTables dataset in both rounds. However, for the 2T dataset, 2Hops improved the CEA scores significantly compared to the P31 strategy while achieving similar results for CTA. The 2Hops strategy seems better equipped to deal with challenging values like those found in 2T, whose values are hard to annotate even for human users [8]. On the other hand, P31 seems a reasonable choice for comparatively more straightforward datasets across all tasks. Omitting higher levels of the hierarchy, P31 is also computationally less expensive and can thus be run faster.

We highlight the impact of the sophisticated cleaning we applied to the 2T dataset. This additional step yielded substantially improved results over past years' attempts: in 2020, our initial pipeline only achieved an F1-score of 10% [4]. In 2021, applying the 2Hops strategy improved this result to an F1-score of 45% [9]. This year, we surpassed the previous scores using the lightweight P31 and the 2Hops strategies, achieving 75.1% and 80.2%, respectively.
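Since the generic lookup described at the beginning of this section also features in the ablation of Table 5 below, we include a simplified sketch of its matching step here. It substitutes the jellyfish library for the optimized, bounded Jaro-Winkler implementation based on [13] that we actually use, and brute-forces the comparison, so it illustrates the idea rather than the deployed module.

```python
# Simplified sketch of the generic lookup: match each unique table value against
# KG labels/aliases with a Jaro-Winkler threshold of 0.9. The jellyfish library
# stands in for the optimized implementation based on [13]; names are illustrative.
import jellyfish

def generic_lookup(unique_values: list[str], kg_labels: dict[str, str],
                   threshold: float = 0.9) -> dict[str, str]:
    """Map each unique cell value to the URI of its best-matching KG label, if any.
    `kg_labels` maps a (lower-cased) label or alias to its entity URI."""
    resolved = {}
    for value in unique_values:
        best_uri, best_sim = None, threshold
        for label, uri in kg_labels.items():
            sim = jellyfish.jaro_winkler_similarity(value.lower(), label)
            if sim >= best_sim:
                best_uri, best_sim = uri, sim
        if best_uri is not None:
            resolved[value] = best_uri
    return resolved

# Example (hypothetical): generic_lookup(["Germani"], {"germany": "wd:Q183"})
# is expected to resolve "Germani" to "wd:Q183".
```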
Table 4: JenTab scores during SemTab 2022. F1 - F1 Score, Pr - Precision, R - Recall, AF1 - Average F1 Score, APr - Average Precision, AR - Average Recall.

Round | Dataset    | Target | Setting | CEA F1 | CEA Pr | CEA R | CTA AF1 | CTA APr | CTA AR | CPA F1 | CPA Pr | CPA R
R1    | HardTables | WD     | P31     | 0.945  | 0.946  | 0.944 | 0.938   | 0.940   | 0.936  | 0.974  | 0.985  | 0.964
R1    | HardTables | WD     | 2Hops   | 0.936  | 0.936  | 0.935 | 0.871   | 0.871   | 0.871  | 0.975  | 0.986  | 0.965
R2    | HardTables | WD     | P31     | 0.751  | 0.758  | 0.743 | 0.836   | 0.881   | 0.795  | 0.872  | 0.921  | 0.828
R2    | HardTables | WD     | 2Hops   | 0.713  | 0.720  | 0.707 | 0.720   | 0.752   | 0.691  | 0.862  | 0.913  | 0.816
R2    | 2T         | WD     | P31     | 0.773  | 0.774  | 0.772 | 0.357   | 0.362   | 0.352  | NA     | NA     | NA
R2    | 2T         | WD     | 2Hops   | 0.802  | 0.807  | 0.796 | 0.346   | 0.357   | 0.337  | NA     | NA     | NA
R3    | BiodivTab  | DBP    | P31     | 0.550  | 0.605  | 0.505 | 0.414   | 0.421   | 0.407  | NA     | NA     | NA
R3    | BiodivTab  | DBP    | 2Hops   | 0.547  | 0.601  | 0.502 | 0.408   | 0.410   | 0.407  | NA     | NA     | NA

Table 5: Effect of the Generic Lookup on JenTab's scores. F1 - F1 Score, Pr - Precision, R - Recall, AF1 - Average F1 Score, APr - Average Precision, AR - Average Recall.

Dataset    | Target | Generic Lookup | CEA F1 | CEA Pr | CEA R | CTA AF1 | CTA APr | CTA AR | CPA F1 | CPA Pr | CPA R
HardTables | WD     | Yes            | 0.945  | 0.946  | 0.944 | 0.938   | 0.940   | 0.936  | 0.974  | 0.985  | 0.964
HardTables | WD     | No             | 0.655  | 0.672  | 0.638 | 0.626   | 0.657   | 0.599  | 0.804  | 0.881  | 0.740
2T         | WD     | Yes            | 0.773  | 0.774  | 0.772 | 0.357   | 0.362   | 0.352  | NA     | NA     | NA
2T         | WD     | No             | 0.726  | 0.803  | 0.663 | 0.323   | 0.362   | 0.291  | NA     | NA     | NA

Moreover, we investigated the impact of the Generic Lookup, shown in Table 5, on both the HardTables and 2T datasets during Round 2. We selected the P31 strategy for this experiment since it had the overall best performance across all STA tasks. Without the Generic Lookup, scores were lower in general, except for the precision of the CEA task on the 2T dataset. This reflects the importance of this module in the JenTab system.

Our solution strategy for BiodivTab in Round 3 differs from our usual approach. Initially, we ran both pipeline_no_cpa and pipeline_headers directly against the DBpedia Proxy. However, the scores were poor, reaching only 20% and 5% for the CEA and CTA tasks, respectively. Thus, we ran the same pipelines against the Wikidata Proxy, which fetches solutions for the dataset from Wikidata. After a complete run over the dataset, we retrieved owl:sameAs mappings that translate the Wikidata annotations into DBpedia resources for both tasks.
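As an illustration of this last translation step, the following minimal sketch resolves a single Wikidata entity to its DBpedia counterpart via owl:sameAs, assuming the public DBpedia SPARQL endpoint; JenTab retrieves these mappings through its own proxy services, so the endpoint and function name are illustrative only.

```python
# Minimal sketch: translate a Wikidata annotation to a DBpedia resource via
# owl:sameAs, assuming the public DBpedia SPARQL endpoint.
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"  # illustrative; JenTab uses its own proxies

def wikidata_to_dbpedia(qid: str) -> str | None:
    """Return the DBpedia resource linked to the given Wikidata entity, if any."""
    query = f"""
        SELECT ?res WHERE {{
          ?res owl:sameAs <http://www.wikidata.org/entity/{qid}> .
          FILTER(STRSTARTS(STR(?res), "http://dbpedia.org/resource/"))
        }} LIMIT 1
    """
    resp = requests.get(DBPEDIA_SPARQL,
                        params={"query": query, "format": "application/sparql-results+json"},
                        timeout=60)
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["res"]["value"] if bindings else None

# Example (hypothetical): wikidata_to_dbpedia("Q79") should yield
# "http://dbpedia.org/resource/Egypt".
```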
6. Conclusions & Future Work

In this paper, we have reported on our participation and JenTab's continuous development as part of the 2022 edition of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). We introduced a cleaning module for the Tough Tables (2T) dataset that significantly impacted our results. In addition, we have developed a new pipeline that leverages information from the table header; we used this pipeline during Round 3 for the BiodivTab dataset. In its third participation, JenTab remains a top participant of SemTab, still without any complex requirements. Our code is publicly available [14]. Moreover, our precomputed generic lookup [15] and solution files [16] for each round of SemTab are also publicly available.

We see various areas for further improvement. First, the binary decision of whether to keep or remove candidates should be replaced by a scoring system that emphasizes well-supported candidates but maintains other options. In addition, the new pipeline that uses the header candidates as direct CTA solutions needs a more intelligent mechanism; for instance, we could apply a weighting technique that controls such decisions. Further, we see the need for a more detailed investigation of the impact of individual modules within the pipelines. This applies both on the content level (are we removing correct solutions by accident?) and on the performance level (can we exclude more candidates earlier in the pipeline?).

Acknowledgment

The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (P5)" within the scope of the program line "Breakthroughs: Exploring Intelligent Systems" for "Digitization - explore the basics, use applications". We thank Birgitta König-Ries for her guidance and continuous feedback.

References

[1] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular data annotation with MTab tool, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 92–101.
[2] V. Huynh, J. Liu, Y. Chabot, F. Deuzé, T. Labbé, P. Monnin, R. Troncy, DAGOBAH: Table and graph contexts for efficient semantic annotation of tabular data, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 19–31.
[3] R. Shigapov, P. Zumstein, J. Kamlah, L. Oberländer, J. Mechnich, I. Schumm, bbw: Matching CSV to Wikidata via meta-lookup, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 5, 2020, volume 2775 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 17–26.
[4] N. Abdelmageed, S. Schindler, JenTab: Matching tabular data to knowledge graphs, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 5, 2020, volume 2775 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 40–49.
[5] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85. doi:10.1145/2629489.
[6] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Springer Berlin Heidelberg, 2007, pp. 722–735. doi:10.1007/978-3-540-76298-0_52.
[7] N. Abdelmageed, S. Schindler, JenTab: A toolkit for semantic table annotations, in: Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with the 18th Extended Semantic Web Conference (ESWC 2021), Online, June 6, 2021, volume 2873 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[8] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully evaluating entity linking for tabular data, 2020. doi:10.5281/zenodo.4246370.
[9] N. Abdelmageed, S. Schindler, JenTab meets SemTab 2021's new challenges, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 42–53.
[10] O. Hassanzadeh, Wikidata truthy dump from May 21, 2022, 2022. URL: https://doi.org/10.5281/zenodo.6643443. doi:10.5281/zenodo.6643443.
[11] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146. doi:10.1162/tacl_a_00051.
[12] N. Abdelmageed, S. Schindler, B. König-Ries, BiodivTab: A table annotation benchmark based on biodiversity research data, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021, volume 3103 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 13–18.
[13] J. M. Keil, Efficient bounded Jaro-Winkler similarity based search, BTW 2019 (2019). doi:10.18420/BTW2019-13.
[14] N. Abdelmageed, fusion-jena/jentab: JenTab code for SemTab 2022, 2022. URL: https://doi.org/10.5281/zenodo.7229238. doi:10.5281/zenodo.7229238.
[15] N. Abdelmageed, fusion-jena/jentab_precomputed_lookup: SemTab 2022, 2022. URL: https://doi.org/10.5281/zenodo.7229246. doi:10.5281/zenodo.7229246.
[16] N. Abdelmageed, fusion-jena/jentab_solution_files: SemTab 2022, 2022. URL: https://doi.org/10.5281/zenodo.7229243. doi:10.5281/zenodo.7229243.