
JenTab: A Toolkit for Semantic Table Annotations

0 Computer Vision Group; 1 Friedrich Schiller University Jena, Germany; 2 Heinz Nixdorf Chair for Distributed Information Systems; 3 Michael Stifel Center Jena

Tables are a ubiquitous source of structured information. However, their use in automated pipelines is severely affected by conflicts in naming and issues like missing entries or spelling mistakes. The Semantic Web has proven itself a valuable tool in dealing with such issues, allowing the fusion of data from heterogeneous sources. Its usage requires the annotation of table elements like cells and columns with entities from existing knowledge graphs. Automating this semantic annotation, especially for noisy tabular data, remains a challenge, though. JenTab is a modular system to map table contents onto large knowledge graphs like Wikidata. It starts by creating an initial pool of candidates for possible annotations. Over multiple iterations, context information is then used to eliminate candidates until, eventually, a single annotation is identified as the best match. Based on the SemTab2020 dataset, this paper presents various experiments to evaluate the performance of JenTab. This includes a detailed analysis of individual components and of the impact of different approaches. Further, we evaluate JenTab against other systems and demonstrate its effectiveness in table annotation tasks.

Keywords: knowledge graph, matching, tabular data, semantic annotation

Although a considerable amount of data is published in tabular form, oftentimes, the information contained is hardly accessible to automated processes. Causes range from issues like misspellings and partial omissions to the ambiguity introduced by using different naming schemes, languages, or abbreviations. The Semantic Web promises to overcome the ambiguities but requires annotation with semantic entities and relations. The process of annotating a tabular dataset to a given Knowledge Graph (KG) is called Semantic Table Annotation (STA). The objective is to map individual table elements to their counterparts from the KG as illustrated in Figure 1 (naming according to [14]): Cell Entity Annotation (CEA) matches cells to individuals, whereas Column Type Annotation (CTA) does the same for columns and classes. Furthermore, Column Property Annotation (CPA) captures the relationship between pairs of columns.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: Example table with the cells Cairo, Berlin, Egypt, Germany, 1,010,408, and 357,386, annotated with wd:Q79 ("Egypt") and wd:Q183 ("Germany").]

JenTab is a toolkit to annotate large corpora of tables. It follows a general pattern of Create, Filter and Select (CFS): First, for each annotation, initial candidates are generated using appropriate lookup techniques (Create). Subsequently, the available context is used in multiple iterations to narrow down these sets of candidates as much as possible (Filter). Finally, if multiple candidates remain, a solution is chosen among them (Select). We provide several modules for each of these steps. Different combinations allow fine-tuning the annotation process by considering both the modules' performance characteristics and their impact on the generated solutions. All experiments are based on the large corpus provided by the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab2020) [11, 13, 14] (footnote 5), which comprises about 130,000 tables whose content is matched to Wikidata [24]. The contributions of our paper are as follows.

- We demonstrate the effectiveness of JenTab relying only on public lookup services.
- We provide a detailed evaluation of the impact individual modules have on the candidate generation.
- We perform three experiments exploring different CTA-strategies that vary the mode of determining cells' types and hence the column annotation.
- We compare JenTab's performance to other top contenders of SemTab2020.

The remainder of this paper is structured as follows. Section 2 gives an overview of the related work. Section 3 describes our pipeline. Section 4 explains the dataset, encountered challenges, and the metrics used in our evaluation. Section 5 discusses our experiments and results. Section 6 concludes the paper and shows future directions.

2 Related Work

We start by briefly reviewing benchmark datasets and motivate the selection of the SemTab2020 dataset for our evaluation. We then summarize existing approaches to match tabular data to KGs. While both semi-automatic and fully automatic approaches have been proposed, we focus our attention on the latter. This is in line with the assumptions in this paper and the conditions posed by the SemTab challenges.

Benchmarks. In the past, various benchmarks have been proposed and used for STA tasks. Manually annotated corpora like T2Dv2 (footnote 7) or the ones used in [6, 18] offer only a minimal number of tables. On the other hand, larger corpora are often automatically created using web tables as a source. The resulting Ground Truth (GT) data is thus rather noisy as seen, e.g., in [12]. The tables in the SemTab2020 datasets [13] are artificially created from Wikidata [24]. Further, Round 4 includes the Tough Tables (2T) dataset [11], a set of manually curated tables of increased difficulty. This inverts older approaches of benchmark creation and provides a large corpus of tables with high-quality GT data. Further, it allows adjusting the difficulty of tasks by varying the noise introduced to the tables.

5 http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

6 We use the prefixes wd: and wdt: for http://www.wikidata.org/entity/ and http://www.wikidata.org/prop/direct/ respectively.

7 http://webdatacommons.org/webtables/goldstandardV2.html

Approaches. ColNet [8] tackles only the CTA task. It uses a Convolutional Neural Network (CNN) trained on classes contained within a KG. The predicted annotations are combined with the results of a traditional KG lookup. The final annotation is selected using a score that selects the lookup solutions with high confidence and otherwise resorts to the CNN predictions. Results have shown that the CNN prediction outperforms the lookup service for a larger knowledge gap. The approach has then been extended by considering other cells in the same row in a property feature vector, Property to Vector (P2Vec), as an additional signal to the neural network, which yields better results [9]. Efthymiou et al. [12] have a slightly different task description. They tackle row to KG entity matching. Their approach combines a lookup model, FactBase, with a word embedding model trained using the KG. Two variations are proposed, each succeeding in different benchmarks. Each variant uses one model as the primary source and only resorts to the other when the first does not return any result.

All these approaches rely on lookup services for their success. However, each of them addresses only a single task from STA. Moreover, they cannot cope with the frequent changes of KGs since they rely on snapshots of the KG to train their respective models.

SemTab2019. In 2019, the SemTab challenge was initiated to bring together the community of automatic systems for STA tasks. A four-round dataset was released with DBpedia [5] as the target KG. Among the participants, the following systems emerged. MTab [19], the challenge winner in 2019, relies on a joint probability distribution that is updated as more information becomes known. Input signals include the results of various lookup services and conditional probabilities based on the row and column context. The authors mention the computational cost from the multitude of signals as a significant drawback. CSV2KG [22], achieving second place, uses an iterative process with the following steps: (i) get an entity matching using lookup services; (ii) infer the column types and relations; (iii) refine cell mappings with the inferred column types and relations; (iv) refine subject cells using the remaining cells of the row; and (v) re-calculate the column type with all the corrected annotations. Tabularisi [23], third place in 2019, also uses lookup services. For each returned candidate, an adapted TF-IDF score (footnote 8) is calculated. A combination of this score, the Levenshtein distance between cell value and candidate label, and a distance measure between cell value and the URL tokens is used to determine the final annotation. DAGOBAH [7] assumes that entities in the same column are close in the embedding space. Candidates are first retrieved using a lookup based on regular expressions and the Levenshtein distance. Afterwards, their vector representations in the embedding are clustered to disambiguate among them. The cluster with the highest row-coverage is selected, and final ambiguities are resolved via a confidence score based on the row context of the candidates.

A key success factor of those systems is the use of Wikidata and Wikipedia as additional data sources. In this paper, we focus on exploiting only the data sources of the target KG. Therefore, we try to maximize the benefit from a given cell value and minimize our reliance on different data sources, which leads to a more straightforward system.

SemTab2020. The second edition of the challenge in 2020 changed the target KG to Wikidata. MTab4Wikidata [20] builds an extensive index that includes all historic revisions. Cell annotation candidates are generated using this index and a one-edit-distance algorithm. Disambiguation is done via pairwise lookups for all pairs of entities within the same row. bbw [21] relies on two core ideas. First, it uses SearX (footnote 9) as a meta-lookup, enabling it to search over more than 80 engines. Second, it applies contextual matching using two features, for example, entity and property labels. The former collects results and ranks them, while the latter picks the best matches using edit distance. SSL [16] generates a Wikidata subgraph over a table. It leverages SPARQL queries for all tasks and does not implement any fuzzy search for entities. However, it applies a crawling process through Google to suggest better words and thus overcomes the problem of spelling mistakes. LinkingPark [10] has a three-module pipeline. For entity generation, it uses the Wikidata lookup API while employing an off-the-shelf spell checker. Further, its Property Linker module uses a fuzzy matching technique for numeric values with a certain margin. JenTab uses a similar methodology to LinkingPark for tackling spelling mistakes but with the aid of word vectors (footnote 10). Moreover, JenTab uses the same concept of fuzzy matching for generating entities and properties.

8 Term Frequency-Inverse Document Frequency.

To our knowledge, none of these systems provided a detailed study on various solutions for STA tasks, backward compatibility across rounds, or a time analysis.

3 Approach

Our system's modules can be classified into one of the following three phases, which together form a Create, Filter and Select (CFS) pattern. During the Create-phase, candidates are retrieved for each requested annotation. In the Filter-phase, the surrounding context is used to reduce the number of candidates. Eventually, in the Select-phase, the final annotations are chosen among the remaining candidates. The individual modules for the same task differ in their treatment of the textual input and the context used. This causes not only differences in the accuracy of their results but also affects their performance characteristics. In the following, we explain the necessary preprocessing steps and describe the developed modules for each phase.

3.1 Preprocessing

Before the actual pipeline, each table is subjected to a preprocessing phase consisting of three steps. The first step aims at normalizing the cells' content. First, we attempt to fix any encoding issues using ftfy (footnote 11). Further, we remove special characters like parentheses or slashes. Finally, we use regular expressions to detect missing spaces like in "1stGlobal Opinion Leader's Summit". In addition to the initial values, the normalized ones are stored as a cell's "clean value". In the second step, we use regular expressions to determine the datatype of each column. While our system distinguishes more datatypes, we aggregate to those having direct equivalents in KGs, i.e., OBJECT, QUANTITY, DATE, and STRING. Cells in OBJECT-columns correspond to entities of the KG, while the others represent literals. In the final step, we apply type-based cleaning. In general, it attempts to extract the relevant parts of a cell value for QUANTITY and DATE columns. For example, it splits the numeric value from a possibly existing unit in QUANTITY cells. Similarly, redundant values like "10/12/2020 (10 Dec 2020)" are reduced to "10/12/2020".
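The three preprocessing steps can be illustrated with a minimal Python sketch. It is not JenTab's implementation: the regular expressions, the function names, and the simplification of treating all remaining text as OBJECT (the STRING distinction is left out) are assumptions made for illustration, and the encoding repair via ftfy is omitted.

```python
import re
from collections import Counter

# Hypothetical patterns; the paper does not list its regular expressions.
DATE_RE = re.compile(r"^\s*\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}")
QUANTITY_RE = re.compile(r"^\s*([-+]?\d[\d,]*(?:\.\d+)?)\s*([^\d\s]*)\s*$")


def clean_value(raw):
    """Step 1: drop special characters and restore likely missing spaces."""
    text = re.sub(r"[()\[\]/\\]", " ", raw)
    # Insert a space where a lowercase letter is directly followed by an uppercase one.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def cell_type(value):
    """Step 2 (per cell): coarse datatype guess; OBJECT is the fallback here."""
    if DATE_RE.match(value):
        return "DATE"
    if QUANTITY_RE.match(value):
        return "QUANTITY"
    return "OBJECT"


def column_type(values):
    """Step 2 (per column): majority vote over the cell-level guesses."""
    return Counter(cell_type(v) for v in values).most_common(1)[0][0]


def split_quantity(value):
    """Step 3 for QUANTITY cells: separate the number from an optional unit."""
    match = QUANTITY_RE.match(value)
    if match is None:
        return None
    # Assumes commas are thousands separators.
    return float(match.group(1).replace(",", "")), match.group(2)


def strip_redundant_date(value):
    """Step 3 for DATE cells: drop a trailing parenthesized repetition."""
    return re.sub(r"\s*\(.*\)\s*$", "", value)
```

Checking dates before quantities matters in this sketch, because a value such as "10/12/2020" also starts with digits.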

9 https://github.com/searx/searx

10 https://www.kaggle.com/cpmpml/spell-checker-using-word2vec

11 https://github.com/LuminosoInsight/python-ftfy

[Figure 2: (a) Cell, (b) Column, (c) Row, (d) Row-Column]

Tabular data offers different dimensions of context that can be used to either generate annotation candidates (Create-phase) or remove highly improbable ones (Filter-phase). Figure 2 illustrates those visually. The Cell Context is the most basic one, outlined in Figure 2a. Here, nothing but an individual cell's content is available. We can then define a Column Context as shown in Figure 2b. It is based on the premise that all cells within a column represent the same characteristic of the corresponding tuples. For the annotation process, this can be exploited insofar that all cells of a column share the same class from the KG. Annotations for cells in OBJECT-columns further have a common class as required by the CTA task. Similarly, the assumption that each row refers to one tuple leads to the Row Context of Figure 2c. Annotation candidates for the subject cell, i.e., a cell holding the identifier for the respective tuple/row, have to be connected to their counterparts in all other cells within the same row. Finally, all contexts can be subsumed in the Row-Column Context as given by Figure 2d. It combines the last two assumptions, representing the most exhaustive context. In the following, we summarize our modules. For a detailed description, kindly refer to [1].

Creating Candidates

All subsequent tasks are based on suitable CEA-candidates for individual cells. The textual representation of such a cell can deviate from its canonical name and other labels given by the KG in many different ways. We devised various modules to cope with the encountered issues using the aforementioned contexts.

- CEA Label Lookup (Cell Context) employs six strategies to cope with spelling mistakes, use of abbreviations, and other lexicographical challenges.
- CEA by column (Column Context) populates the candidate pool for a cell with all available instances of that shared class.
- CEA by subject (Row Context) populates mappings for cells in the same row given the subject cell's annotation, i.e., the cell serving as an identifier for that row.
- CEA by row (Row Context) finds candidates for subject cells given the object annotations in the same row.

With candidates available for individual cells, another set of modules can be used to derive candidates for the CTA and CPA tasks.

- CTA collects the parent classes from all CEA-candidates for a particular column and uses them as CTA-candidates for that column.
- CPA retrieves all properties for CEA-candidates of subject cells and compares those to the values of the row. While object-properties are matched against the candidate lists, literal-properties use a mix of exact and fuzzy matching:
  - DATE-values are matched based on the date part, omitting any additional time information. Different datetime formats are supported.
  - STRING-values are split into sets of tokens. Pairs with an overlap of at least 50% are considered a match.
  - QUANTITY-values are compared using a 10% tolerance, as given in Equation 1.
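The three literal matching rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than JenTab's code: the list of supported date formats, the choice of the larger token set as the denominator for the 50% overlap, and the handling of a zero reference value are all hypothetical. The quantity rule follows the 10% tolerance of Equation 1.

```python
import re
from datetime import datetime

# Assumed subset of formats; the paper only states that several are supported.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return the date part of a value, ignoring any trailing time information."""
    date_part = re.sub(r"[T ]\d{2}:\d{2}.*$", "", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_part, fmt).date()
        except ValueError:
            continue
    return None


def match_date(left, right):
    parsed = parse_date(left)
    return parsed is not None and parsed == parse_date(right)


def match_string(left, right, min_overlap=0.5):
    """Token-set overlap; measured against the larger set (an assumption)."""
    a, b = set(left.lower().split()), set(right.lower().split())
    if not a or not b:
        return False
    return len(a & b) / max(len(a), len(b)) >= min_overlap


def match_quantity(value1, value2, tolerance=0.1):
    """Relative deviation below the tolerance, as in Equation 1."""
    if value2 == 0:
        return value1 == 0  # assumption: avoid division by zero
    return abs(1 - value1 / value2) < tolerance
```

For instance, `match_quantity(357386, 357000)` holds because the relative deviation is far below 10%, whereas `match_quantity(100, 50)` does not.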

\[
\mathit{Match} =
\begin{cases}
\text{true}, & \text{if } \left| 1 - \frac{\mathit{value}_1}{\mathit{value}_2} \right| < 0.1 \\
\text{false}, & \text{otherwise}
\end{cases}
\tag{1}
\]

Filtering Candidates

The previous modules generate lists of candidates for each task. Next, filter-modules remove unlikely candidates based on different contexts.

- CTA support (Column Context) removes CTA-candidates that do not apply to at least a minimum number of cells in that column.
- CEA unmatched properties (Row Context) removes CEA-candidates that are not part of any candidate for a CPA-matching.
- CEA by property support (Row Context) first counts CPA-matches for subject cells' CEA-candidates. All but the ones scoring highest are then removed.
- CEA by string distance (Cell Context) excludes all CEA-candidates whose label is not within a certain range with respect to their Levenshtein distance [17] to the cell value.

Selecting a Final Annotation

At some point, a final annotation from the list of candidates has to be selected. If only a single candidate is remaining, this candidate is chosen as a solution. In all other cases, the following modules will be applied.

- CEA by string similarity selects the CEA-candidate whose label is the closest to the original cell value using the Levenshtein distance.
- CEA by column operates on cells with no CEA-candidates left (footnote 12). It looks for other cells in the same column that are reasonably close with respect to their Levenshtein distance and adopts their solution if available.
- CTA by LCS considers the whole class hierarchy of current CTA-candidates and picks the Least Common Subsumer (LCS) as a solution.
- CTA by Direct Parents applies a majority voting on CTA-candidates and their direct parents in the class hierarchy.
- CTA by Majority applies a majority voting on the remaining CTA-candidates.
- CTA by Popularity breaks any remaining ties by selecting the most popular CTA-candidate, i.e., the one with the most instances in the KG.
- CPA by Majority applies a majority voting on the remaining CPA-candidates.

3.3 Architecture

12 The filter modules applied before might have removed all CEA-candidates.

[Figure 3: System architecture. Components: Manager, Runners, Clean Cells, Type Prediction, Approach, Lookup (Wikidata), Endpoint (Wikidata), Generic Strategy, and Autocorrect. Lookup, Endpoint, and Generic Strategy are each backed by a cache. Runners pass results, audit data, and errors to the Manager.]

[...] results correspond to annotations of tasks, audit data that allows assessing the impact of individual modules, and possibly a list of any errors thrown during the processing.

The Manager's dashboard shows the current state of the overall system, i.e., the number of processed versus pending tables, as well as data about the connected Runners and any errors thrown. It also gives an estimate of the remaining time needed. Finally, once the processing has finished, all gathered annotations can be accessed from this central point. The Runner coordinates the processing of a single table at a time through a series of calls to different services. Tables are first passed through the preprocessing services of Clean Cells and Type Prediction. Afterwards, the core pipeline is executed via the Approach service. Approach depends on the following four services. Lookup and Endpoint are proxies to the KG lookup and SPARQL endpoint services, respectively. Moreover, the computationally expensive Generic Strategy of the CEA lookup, see Subsection 3.2, is wrapped in a separate service. These three services include caching for their results. The final dependency is given by the Autocorrect service, which tries to fix spelling mistakes in cells.
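The interplay of a Runner with cached proxy services can be sketched as follows. This is a simplified, single-process illustration and not the actual service code: the class `CachedService`, the function `process_table`, and their signatures are hypothetical, and the real system communicates between separate services rather than via direct function calls.

```python
from typing import Callable, Dict, List


class CachedService:
    """Proxy around an expensive call (e.g., a KG lookup) that memoizes results."""

    def __init__(self, backend: Callable[[str], List[str]]):
        self._backend = backend
        self._cache: Dict[str, List[str]] = {}
        self.backend_calls = 0  # counts requests that actually reach the backend

    def __call__(self, key: str) -> List[str]:
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(key)
        return self._cache[key]


def process_table(table, clean, predict_types, approach):
    """Runner logic for one table: preprocessing first, then the core pipeline."""
    cleaned = [[clean(cell) for cell in row] for row in table]
    types = predict_types(cleaned)
    return approach(cleaned, types)


# Usage with a stand-in backend instead of a real Wikidata request.
lookup = CachedService(lambda label: [label.upper()])
lookup("cairo")
lookup("cairo")  # served from the cache, the backend is not contacted again
```

Counting backend calls makes the benefit visible: repeated labels across tables or rounds cause only one request to the external service.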

The chosen architecture has several advantages. First, using caches for computationally expensive tasks or external dependencies increases the overall system performance. Furthermore, it reduces the pressure on downstream systems, which is especially important when public third-party services are used. Second, when the target KG is to be substituted, all necessary changes like adjusting SPARQL queries are concentrated within just two locations: the corresponding lookup and endpoint services. Third, the distributed design scales well with respect to the number of tables to be annotated. Any increase in the number of tables can be mitigated by adding new Runners to cope with the workload. Finally, the implementation consists of reusable and self-contained pieces of code. For example, a Runner can deal with any other Approach implementation, and Autocorrect can be used by any other Approach.

4 Evaluation Setup

We base the evaluation of our approach on the corpus provided by the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab2020) [13]. In the following, we will first outline the configuration of annotation modules listed in Section 3, before describing the corpus in more detail. Further, we will explain the metrics used, which follow the evaluation strategy prescribed by the challenge.

[Figure 4: Order of the annotation modules used in the evaluation, arranged into Groups 1 to 9. Modules shown: CEA Label Lookup, CTA, CTA-Support, CPA, CEA by Unmatched Properties, CEA by String Distance, CEA by Column, CEA by String Similarity, CTA by Direct Parents, CTA by Popularity, CEA by Row and Column, CEA by Row, CEA by Subject, CEA by Property Support, CTA by LCS, CPA by Majority.]

The order of modules used in the evaluation is outlined in Figure 4. The modules are arranged into several groups. Some groups are only executed if the preceding group had any effects on the current candidate pool. Similarly, the different approaches for creating CEA-candidates skip cells that already have candidates at the time.
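The two control-flow rules stated above, conditional execution of groups and skipping of cells that already have candidates, can be sketched as follows. The data structures and function names are hypothetical, and which groups are conditional is an input of the sketch rather than something taken from the paper.

```python
from typing import Callable, Dict, List, Set, Tuple

Pool = Dict[str, Set[str]]              # cell id -> CEA-candidates
Module = Callable[[Pool], None]         # modules modify the pool in place
Group = Tuple[str, List[Module], bool]  # (name, modules, only_if_previous_changed)


def snapshot(pool: Pool):
    """Immutable copy of the pool, used to detect changes made by a group."""
    return {cell: frozenset(cands) for cell, cands in pool.items()}


def run_groups(pool: Pool, groups: List[Group]) -> List[str]:
    """Run groups in order; conditional groups need a change by the preceding group."""
    executed = []
    previous_changed = True
    for name, modules, conditional in groups:
        if conditional and not previous_changed:
            previous_changed = False  # a skipped group changes nothing
            continue
        before = snapshot(pool)
        for module in modules:
            module(pool)
        previous_changed = snapshot(pool) != before
        executed.append(name)
    return executed


def create_for_empty_cells(lookup: Callable[[str], Set[str]]) -> Module:
    """Build a Create-module that only fills cells without any candidates."""
    def module(pool: Pool) -> None:
        for cell, candidates in pool.items():
            if not candidates:
                candidates.update(lookup(cell))
    return module
```

Comparing snapshots before and after a group keeps the modules themselves free of bookkeeping; they only need to modify the candidate pool.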

Group 1 represents the most direct approach. As its modules use only a few interdependencies, queries are rather simple and can be executed quickly. Still, it accounts for a substantial share of candidates and thus forms the basis for subsequent groups.

For cells that so far did not receive any CEA-candidates, Group 2 is a first attempt to compensate by expanding the considered scope. Here, CEA by Row and Column precedes CEA by Row. As it uses more context information, i.e., the Column Context, its results are of higher quality than those of CEA by Row. It will fail, though, when the current list of corresponding CTA-candidates does not yet contain the correct solution. In such cases, CEA by Row can fill in the gaps. If any of the two modules resulted in new CEA-candidates, the corresponding modules for CTA and CPA candidate creation will be repeated in Group 3.

Group 4 attempts to select annotations for the first time. A prior filter step again uses the Row Context to retain only the CEA-candidates with the highest support within their respective tuples. Afterwards, annotations are selected from the candidate pool available at this point. It yields solutions for the majority of annotation tasks but may leave some gaps on occasion.
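The filter-then-select idea of this group can be sketched with a support filter followed by a selection via Levenshtein distance. The function names, the lowercasing before comparison, and the toy data are assumptions for illustration; the identifiers X1 and X2 are placeholders, not real Wikidata entities.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def filter_by_property_support(candidates, support):
    """Keep only the candidates with the highest number of CPA-matches."""
    best = max(support.get(c, 0) for c in candidates)
    return [c for c in candidates if support.get(c, 0) == best]


def select_by_string_similarity(cell_value, candidates, labels):
    """Pick the candidate whose label is closest to the cell value."""
    return min(candidates, key=lambda c: levenshtein(cell_value.lower(), labels[c].lower()))


# Toy data: a misspelled cell value with three candidates.
labels = {"Q183": "Germany", "X1": "Germania", "X2": "Georgia"}
support = {"Q183": 3, "X1": 3, "X2": 1}
remaining = filter_by_property_support(list(labels), support)
choice = select_by_string_similarity("Germny", remaining, labels)
```

In this toy case, the filter removes X2 because of its lower support, and the selection then prefers Q183, whose label is one edit away from the cell value.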

The next two groups represent our last efforts to generate new candidates using stronger assumptions. Group 5 assumes that we were already able to determine the correct CTA-annotation for the respective column and then uses all corresponding instances as CEA-candidates. Similarly, Group 7 assumes that the CEA-annotation of the subject cell is already determined and creates candidates from all connected entities. Groups 6 and 8 are used to validate those candidates and possibly select further annotations to fill in the gaps.

[Figure 5: Example table with the column labels Country, Inception (LITERAL), Area (LITERAL), Label (LITERAL), and Capital (IRI).]

Group 9 makes a last-ditch effort for cells that could not be annotated so far. As no other module was able to find a proper solution, this group will reconsider all CEA-candidates that were dropped at some point. Using this pool, it attempts to fill the remaining gaps in annotations.

4.2 Dataset

We use the SemTab2020 dataset [13] as a benchmark for our approach. It contains over 130,000 tables automatically generated from Wikidata [24] that were further altered by introducing artificial noise [14]. The corpus is split into four rounds. In the last round, 180 tables from the 2T dataset [11] are added, increasing the difficulty here. Table 1 summarizes the data characteristics of the four rounds.

Figure 5 illustrates the challenges present in the dataset:
(a) missing or non-descriptive table metadata, like column headers;
(b) spelling mistakes;
(c) ambiguity in cell values; for example, UK corresponds to Ukrainian (Q8798), United Kingdom (Q145), University of Kentucky (Q1360303), and more entities in Wikidata;
(d) missing spaces, causing tokenizers to perform poorly;
(e) inconsistent formats of date and time values;
(f) nested pieces of information in Quantity fields, which interfere with the corresponding CPA tasks;
(g) redundant columns;
(h) encoding issues;
(i) seemingly random noise in the data; Berlin would be expected in the context of the given example;
(j) missing values, including nulls, empty strings, or special characters like (?, -, –) to the same effect;
(k) tables of excessive length.

Besides the datasets, SemTab2020 also provides a framework to evaluate tabular data to knowledge graph matching systems [14]. Our evaluation follows the proposed methodology, which is outlined in the following. At its core, it relies on the standard information retrieval metrics of Precision (P), Recall (R), and F1 Score (F1) as given in Equation 2.

\[
P = \frac{|\text{correct annotations}|}{|\text{annotated cells}|}, \quad
R = \frac{|\text{correct annotations}|}{|\text{target cells}|}, \quad
F1 = \frac{2 \cdot P \cdot R}{P + R}
\tag{2}
\]
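As a worked illustration of Equation 2, the following sketch computes the three metrics from two dictionaries that map targets to annotations. The function name and the decision to ignore submitted annotations for non-target cells are assumptions of this sketch; the official scores in the paper come from the challenge's evaluator.

```python
def precision_recall_f1(submitted, ground_truth):
    """Compute P, R and F1 following Equation 2.

    submitted, ground_truth: dicts mapping a target (e.g., a cell id) to an annotation.
    """
    # Only annotations for actual targets are counted as annotated cells.
    annotated = {t: a for t, a in submitted.items() if t in ground_truth}
    correct = sum(1 for t, a in annotated.items() if ground_truth[t] == a)

    precision = correct / len(annotated) if annotated else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Four targets, three annotated, two of them correct.
gt = {"c1": "Q79", "c2": "Q183", "c3": "Q145", "c4": "Q5"}
sub = {"c1": "Q79", "c2": "Q183", "c3": "Q8798"}
p, r, f1 = precision_recall_f1(sub, gt)
```

Here, P = 2/3, R = 2/4, and F1 = 4/7, which shows how unannotated targets lower the recall but not the precision.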

However, these default metrics fall short for the CTA task. Here, there is not always a clear-cut distinction between right and wrong. Some solutions might be acceptable but do not represent the most precise class to annotate a column. Taking the last column of Figure 5 as an example, the best annotation would likely be capital (Q5119) (assuming "Tübingen" is noise here). Nevertheless, an annotation with city (Q515) is also correct, just not as precise. To account for such correct but imprecise solutions, an adapted metric called cscore is advised, as shown in Equation 3 [15]. Here, d(α) is the shortest distance between the chosen annotation-entity α and the most precise one, i.e., the one given in the GT. Consequently, Precision, Recall, and F1 Score are adapted to the forms in Equation 4.
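The cscore and the adapted metrics can be sketched over a toy class hierarchy. The sketch uses the decay factors 0.8 (ancestor) and 0.7 (descendant) as they appear in Equation 3; the function names are hypothetical, and the hierarchy is hand-written for illustration, not queried from Wikidata (X_settlement is a placeholder class).

```python
from collections import deque


def distances_to_ancestors(cls, parents):
    """Breadth-first search upwards; returns {ancestor: shortest distance}."""
    distance = {cls: 0}
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in distance:
                distance[parent] = distance[current] + 1
                queue.append(parent)
    return distance


def cscore(annotation, gt, parents):
    """Score one CTA annotation against the ground truth class (Equation 3)."""
    if annotation == gt:
        return 1.0
    above_gt = distances_to_ancestors(gt, parents)
    if annotation in above_gt:              # annotation is an ancestor of the GT
        return 0.8 ** above_gt[annotation]
    above_annotation = distances_to_ancestors(annotation, parents)
    if gt in above_annotation:              # annotation is a descendant of the GT
        return 0.7 ** above_annotation[gt]
    return 0.0


def approximate_scores(submitted, ground_truth, parents):
    """AP, AR and AF1 following Equation 4."""
    annotated = {t: a for t, a in submitted.items() if t in ground_truth}
    total = sum(cscore(a, ground_truth[t], parents) for t, a in annotated.items())
    ap = total / len(annotated) if annotated else 0.0
    ar = total / len(ground_truth) if ground_truth else 0.0
    af1 = 2 * ap * ar / (ap + ar) if ap + ar else 0.0
    return ap, ar, af1


# Toy hierarchy following the example in the text: capital (Q5119) below city (Q515).
parents = {"Q5119": ["Q515"], "Q515": ["X_settlement"]}
```

With this hierarchy, annotating a column with city (Q515) instead of capital (Q5119) still earns a score of 0.8 rather than being counted as wrong.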

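\[
cscore(\alpha) =
\begin{cases}
1, & \text{if } \alpha \text{ is in GT} \\
0.8^{d(\alpha)}, & \text{if } \alpha \text{ is an ancestor of the GT} \\
0.7^{d(\alpha)}, & \text{if } \alpha \text{ is a descendant of the GT} \\
0, & \text{otherwise}
\end{cases}
\tag{3}
\]

\[
AP = \frac{\sum cscore(\alpha)}{|\text{annotated cells}|}, \quad
AR = \frac{\sum cscore(\alpha)}{|\text{target cells}|}, \quad
AF1 = \frac{2 \cdot AP \cdot AR}{AP + AR}
\tag{4}
\]

5 Experiments and Results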

In this section, we discuss our findings regarding the overall system. We start with an assessment of the preprocessing, namely the "Type Prediction" step, which is responsible for determining a column's datatype, see Subsection 3.1. Figure 6 shows the confusion matrix of this step, which reaches 99% accuracy. We used the ground truth of the CEA and CPA tasks to query Wikidata for their types; these values represent the actual datatypes, whereas the predicted values are the results of our system.

Spelling mistakes are a crucial problem that we have tackled by using the "Generic Strategy", see Subsection 3.2. Its effectiveness is illustrated in Table 2: Almost 99% of unique labels were covered in the first three rounds. However, this is reduced to 97% in the last round. Our pre-computed lookups are publicly available [3].

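[Figure 6: Confusion matrix of the Type Prediction step (actual versus predicted datatype). Recovered row: OBJECT 0.99, 0.01, 0.00, 0.00.]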

Our modular approach enables us to exchange individual components or provide backup solutions if the existing ones fail in specific situations. By this means, we have established three different experiments to explore the effect of changing how cells' types are retrieved. The three modes are as follows. First, "P31" includes only direct parents using instance of (P31); here, we use a majority vote to select a column type. Second, "2 Hops" extends "P31" with one additional parent via subclass of (P279). Finally, "Multi Hops" creates a more general tree of parents following subclass of (P279) relations.
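The difference between the three modes can be sketched on a small, hand-written hierarchy. The dictionaries below stand in for instance of (P31) and subclass of (P279) statements; all class names, entity names, and function names are illustrative assumptions, and the real system retrieves this information from Wikidata.

```python
from collections import Counter


def cell_types(entity, instance_of, subclass_of, mode):
    """Collect the classes of an entity for the modes "P31", "2 Hops", "Multi Hops"."""
    direct = set(instance_of.get(entity, ()))
    if mode == "P31":
        return direct
    hops = 1 if mode == "2 Hops" else None  # None means: follow P279 exhaustively
    types, frontier = set(direct), set(direct)
    while frontier and (hops is None or hops > 0):
        # Move one level up; already known classes are dropped to guarantee termination.
        frontier = {p for c in frontier for p in subclass_of.get(c, ())} - types
        types |= frontier
        if hops is not None:
            hops -= 1
    return types


def majority_vote(entities, instance_of, subclass_of, mode="P31"):
    """Select the class shared by most cells of a column."""
    counts = Counter(
        t for e in entities for t in cell_types(e, instance_of, subclass_of, mode)
    )
    return counts.most_common(1)[0][0]


# Toy statements (not taken from Wikidata).
instance_of = {"Cairo": ["capital"], "Berlin": ["capital", "federal_state"]}
subclass_of = {"capital": ["city"], "city": ["settlement"]}
```

For "Cairo", the three modes yield {capital}, {capital, city}, and {capital, city, settlement}, respectively, which illustrates why broader modes produce more, but less specific, CTA-candidates.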

We have implemented five strategies for the initial creation of CEA-candidates, see Subsection 3.2. Figure 7a shows how much each strategy is used. This underlines the need for various strategies to capture a wide range of useful information inside each cell. The shown distribution also reflects our chosen order of methods. For example, the "Generic Strategy" has the highest priority and is thus used most of the time. On the other hand, "Autocorrect" has the lowest priority and is used as a last resort. The CEA selection phase involves two methods. Figure 7b demonstrates the use of each of them: our dominant selection approach is "String Similarity", which is used 38% more often than "Column Similarity". Finally, Figure 8a describes the distribution of CTA selection methods in the "P31" setting, while Figure 8b represents the methods used in "2 Hops" mode, where LCS is the dominant selection strategy. Comparing "Majority Vote" with LCS in the two settings, the former successfully finds more solutions than the latter, which yields less reliance on backup strategies or tiebreakers. The same exclusive execution concept of the CEA selection also applies to the CTA selection methods. The dominant method, e.g., LCS in "2 Hops" mode, is invoked more frequently because it comes first in the order. The backup strategies try to solve the remaining columns for which the preceding methods failed to find a solution.

Table 3 reports our results for the four rounds given the three execution modes. In the first three rounds, we achieved a coverage of more than 98.8% for the three tasks. In the fourth round, the coverage of the CEA task is dramatically affected by the selected mode: "P31" achieved the highest coverage with 99.39%, while "Multi Hops" reached only 81.83%. The F1 Score for the CEA, CTA, and CPA tasks is greater than 0.967, 0.945, and 0.963, respectively. These scores were obtained through the publicly available evaluator code (footnote 13) on our solution files [4]. Both "2 Hops" and "Multi Hops" have better coverage but lower recall, unlike "P31", which achieved the best scores in most cases. Our performance is compared with the top systems of SemTab2020 in Table 4. JenTab's results are competitive across all tasks but are severely impacted by the Tough Tables (2T) dataset [11]. Here, the ambiguity of possible annotations increased, so even human annotators have problems in disambiguation. Moreover, it includes real tables created from multiple data sources, which means that some cells lack a match in the target KG. Finally, misspellings are far more extreme and frequent.

13 https://github.com/sem-tab-challenge/aicrowd-evaluator

Table 5 shows the time consumption for all four rounds together with the number of Runners used for each mode setting of the CTA task. Close inspection revealed that the execution time is largely dominated by the responses of Wikidata servers and thus beyond our control. Execution was time-scoped, i.e., an upper limit for the time per table was set. This allowed the system to converge faster compared to the initial implementation [1] with, e.g., Round 4 showing a more than 50% reduction in time. Intermediate results are cached across rounds, saving time and lowering the number of requests to external services. Our modular approach allows scaling the number of Runners based on available resources and hence speeds up the overall process.

The results show that, for most tables, the "P31" mode is the fastest and most efficient approach. However, for the 2T dataset, a more sophisticated approach is needed. Here, the "2 Hops" approach yields better results. The "Multi Hops" strategy cannot surpass any of the other strategies, no matter the setting. It is inferior in terms of both runtime and results and should thus not be used.

A recurring source of issues was the dynamic nature of Wikidata. Users enter new data, delete existing claims, or adjust the information contained. On several occasions, we investigated missing mappings of our approach only to find that the respective entity in Wikidata had changed. The challenge and ground truth were created at one point in time, so using the live system will leave some mappings unrecoverable. Moreover, we are limited by the fair-use policies of the Wikidata Endpoint service. Another limitation affects the "CEA by Column" module. Some classes like human (Q5) have a large number of instances. Here, queries to retrieve those instances oftentimes fail with timeouts, which limits the module to reasonably specific classes.

6 Conclusions and Future Work

In this paper, we presented an extensive evaluation of our toolkit for Semantic Table Annotation, "JenTab". Based purely on the publicly available endpoints of Wikidata, its modular architecture allows exploiting various strategies and easily adjusting the processing pipeline. JenTab is publicly available [2] (see footnote 15). We presented a detailed analysis of the effectiveness of JenTab's strategies using the benchmark dataset provided by SemTab2020. Finally, we compared JenTab to other top contenders from that challenge and demonstrated the competitiveness of our system.

We see multiple areas for further improvement. First, certain components currently require substantial resources, either due to the number of computations necessary, as in the Generic Lookup, or due to the lacking performance of the SPARQL endpoint. While we can address the latter by rewriting queries or re-designing the approach, the former offers plenty of opportunities to accelerate the system.

Acknowledgment

The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (P5)" within the scope of the program line "Breakthroughs: Exploring Intelligent Systems" for "Digitization - explore the basics, use applications". We would further like to thank K. Opasjumruskit, S. Samuel, and F. Zander for the fruitful discussions throughout the challenge. Last but not least, we thank B. König-Ries and J. Denzler for their guidance and feedback.

15 https://github.com/fusion-jena/JenTab

References

1. Abdelmageed, N., Schindler, S.: JenTab: Matching tabular data to knowledge graphs. CEUR Workshop Proceedings, vol. 2775, pp. 40–49 (2020)

2. Abdelmageed, N., Schindler, S.: fusion-jena/JenTab: KGCW 2021 (Apr 2021). https://doi.org/10.5281/zenodo.4730314

3. Abdelmageed, N., Schindler, S.: fusion-jena/JenTab precomputed lookup: KGCW 2021 (Apr 2021). https://doi.org/10.5281/zenodo.4730341

4. Abdelmageed, N., Schindler, S.: fusion-jena/JenTab solution files: KGCW 2021 (Apr 2021). https://doi.org/10.5281/zenodo.4730350

5. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: The Semantic Web, pp. 722–735. Springer (2007). https://doi.org/10.1007/978-3-540-76298-0_52

6. Bhagavatula, C.S., Noraset, T., Downey, D.: TabEL: Entity linking in web tables. In: The Semantic Web - ISWC 2015, pp. 425–441 (2015). https://doi.org/10.1007/978-3-319-25007-6_25

7. Chabot, Y., Labbé, T., Liu, J., Troncy, R.: DAGOBAH: An end-to-end context-free tabular data semantic annotation system. CEUR Workshop Proceedings, vol. 2553, pp. 41–48 (2019)

8. Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: ColNet: Embedding the semantics of web tables for column type prediction. Proceedings of the AAAI Conference on Artificial Intelligence 33, 29–36 (Jul 2019). https://doi.org/10.1609/aaai.v33i01.330129

9. Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: Learning Semantic Annotations for Tabular Data. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2088–2094 (2019). https://doi.org/10.24963/ijcai.2019/289

10. Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Gordon, A., Lin, C.Y.: LinkingPark: An integrated approach for semantic table interpretation. CEUR Workshop Proceedings, vol. 2775, pp. 65–74 (2020)

11. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables: Carefully Evaluating Entity Linking for Tabular Data (Nov 2020). https://doi.org/10.5281/zenodo.4246370

12. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. In: Lecture Notes in Computer Science, pp. 260–277. Springer International Publishing (2017). https://doi.org/10.1007/978-3-319-68288-4_16

13. Hassanzadeh, O., Efthymiou, V., Chen, J., Jiménez-Ruiz, E., Srinivas, K.: SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets (Nov 2020). https://doi.org/10.5281/zenodo.4282879

14. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In: The Semantic Web, pp. 514–530. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-49461-2_30

15. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K., Cutrona, V.: Results of SemTab 2020. CEUR Workshop Proceedings, vol. 2775, pp. 1–8 (2020)

16. Kim, D., Park, H., Lee, J.K., Kim, W.: Generating conceptual subgraph from tabular data for knowledge graph matching. CEUR Workshop Proceedings, vol. 2775, pp. 96–103 (2020)

17. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)

18. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment 3(1–2), 1338–1347 (2010)

19. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab: Matching Tabular Data to Knowledge Graph using Probability Models. CEUR Workshop Proceedings, vol. 2553, pp. 7–14 (2019)

20. Nguyen, P., Yamada, I., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab4Wikidata at SemTab 2020: Tabular Data Annotation with Wikidata. CEUR Workshop Proceedings, vol. 2775, pp. 86–95 (2020)

21. Shigapov, R., Zumstein, P., Kamlah, J., Oberlaender, L., Mechnich, J., Schumm, I.: bbw: Matching csv to wikidata via meta-lookup. CEUR Workshop Proceedings, vol. 2775, pp. 17–26 (2020)

22. Steenwinckel, B., Vandewiele, G., De Turck, F., Ongenae, F.: CSV2KG: Transforming tabular data into semantic knowledge (2019)

23. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.T., Singh, A., Qasemi, E., Szekely, P.A., Pujara, J.: Entity linking to knowledge graphs to infer column types and properties. CEUR Workshop Proceedings, vol. 2553, pp. 25–32 (2019)

24. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (Sep 2014). https://doi.org/10.1145/2629489