<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>JenTab: A Toolkit for Semantic Table Annotations</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Group</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Friedrich Schiller University Jena</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Heinz Nixdorf Chair for Distributed Information Systems</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Michael Stifel Center Jena</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tables are a ubiquitous source of structured information. However, their use in automated pipelines is severely affected by conflicts in naming and issues like missing entries or spelling mistakes. The Semantic Web has proven itself a valuable tool in dealing with such issues, allowing the fusion of data from heterogeneous sources. Its usage requires the annotation of table elements like cells and columns with entities from existing knowledge graphs. Automating this semantic annotation, especially for noisy tabular data, remains a challenge, though. JenTab is a modular system to map table contents onto large knowledge graphs like Wikidata. It starts by creating an initial pool of candidates for possible annotations. Over multiple iterations, context information is then used to eliminate candidates until, eventually, a single annotation is identified as the best match. Based on the SemTab2020 dataset, this paper presents various experiments to evaluate the performance of JenTab. This includes a detailed analysis of individual components and of the impact of different approaches. Further, we evaluate JenTab against other systems and demonstrate its effectiveness in table annotation tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graph</kwd>
        <kwd>matching</kwd>
        <kwd>tabular data</kwd>
        <kwd>semantic annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although a considerable amount of data is published in tabular form, oftentimes,
the information contained is hardly accessible to automated processes. Causes range
from issues like misspellings and partial omissions to the ambiguity introduced by using
different naming schemes, languages, or abbreviations. The Semantic Web promises to
overcome the ambiguities but requires annotation with semantic entities and relations.
The process of annotating a tabular dataset to a given Knowledge Graph (KG) is
called Semantic Table Annotation (STA). The objective is to map individual table
elements to their counterparts from the KG as illustrated in Figure 1 (naming according
to [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]): Cell Entity Annotation (CEA) matches cells to individuals, whereas Column
Type Annotation (CTA) does the same for columns and classes. Furthermore, Column
Property Annotation (CPA) captures the relationship between pairs of columns.
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure 1: example table containing the cells Egypt, Cairo, 1,010,408 and Germany, Berlin, 357,386; the cells "Egypt" and "Germany" are annotated with wd:Q79 ("Egypt") and wd:Q183 ("Germany").]</p>
      <p>
        JenTab is a toolkit to annotate large corpora of tables. It follows a general pattern of
Create, Filter and Select (CFS): First, for each annotation, initial candidates are
generated using appropriate lookup techniques (Create). Subsequently, the available context
is used in multiple iterations to narrow down these sets of candidates as much as
possible (Filter). Finally, if multiple candidates remain, a solution is chosen among them
(Select). We provide several modules for each of these steps. Different combinations allow
fine-tuning the annotation process by considering both the modules' performance
characteristics and their impact on the generated solutions. All experiments are based on
the large corpus provided by the Semantic Web
Challenge on Tabular Data to Knowledge Graph Matching (SemTab2020) [
        <xref ref-type="bibr" rid="ref11 ref13 ref14">11, 13, 14</xref>
        ]<sup>5</sup>
(approx. 130,000 tables), matching the content to Wikidata [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The contributions of our paper are as follows.
      </p>
      <p>- We demonstrate the effectiveness of JenTab relying only on public lookup services.
- We provide a detailed evaluation of the impact individual modules have on the
candidate generation.
- We perform three experiments exploring different CTA-strategies that vary the
mode of determining cells' types and hence the column annotation.
- We compare JenTab's performance to other top contenders of SemTab2020.</p>
      <p>The remainder of this paper is structured as follows. Section 2 gives an overview
of the related work. Section 3 describes our pipeline. Section 4 explains the dataset,
encountered challenges, and the metrics used in our evaluation. Section 5 discusses our
experiments and results. Section 6 concludes the paper and shows future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        We start by briefly reviewing benchmark datasets and motivate the selection of the
SemTab2020 dataset for our evaluation. We then summarize existing approaches to
match tabular data to KGs. While both semi-automatic and fully automatic approaches
have been proposed, we focus our attention on the latter. This is in line with the
assumptions in this paper and the conditions posed by the SemTab challenges.
      </p>
      <p>
        Benchmarks. In the past, various benchmarks have been proposed and used for STA
tasks. Manually annotated corpora like T2Dv2<sup>7</sup> or the ones used in [
        <xref ref-type="bibr" rid="ref18 ref6">6, 18</xref>
        ] offer only a
minimal number of tables. On the other hand, larger corpora are often automatically
created using web tables as a source. The resulting Ground Truth (GT) data is thus
rather noisy as seen, e.g., in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The tables in the SemTab2020 datasets [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are
artificially created from Wikidata [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Further, Round 4 includes the Tough Tables
      </p>
      <sec id="sec-2-1">
        <title>5 http://www.cs.ox.ac.uk/isg/challenges/sem-tab/</title>
        <p>6 We use the prefixes wd: and wdt: for http://www.wikidata.org/entity/ and
http://www.wikidata.org/prop/direct/, respectively.
7 http://webdatacommons.org/webtables/goldstandardV2.html</p>
        <p>
          (2T) dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a set of manually curated tables of increased difficulty. This inverts older
approaches to benchmark creation and provides a large corpus of tables with
high-quality GT data. Further, it allows adjusting the difficulty of tasks by varying the noise
introduced to the tables.
        </p>
        <p>
          Approaches. ColNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] tackles only the CTA task. It uses a Convolutional Neural
Network (CNN) trained on classes contained within a KG. The predicted annotations
are combined with the results of a traditional KG lookup. The final annotation is selected
using a score that prefers lookup solutions with high confidence and otherwise
resorts to the CNN predictions. Results have shown that the CNN prediction outperforms
the lookup service for a larger knowledge gap. The approach has then been extended by
considering other cells in the same row in a property feature vector, Property to Vector
(P2Vec), as an additional signal to the neural network, which yields better results [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
Efthymiou et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] have a slightly different task description. They tackle row-to-KG
entity matching. Their approach combines a lookup model, FactBase, with a word
embedding model trained using the KG. Two variations are proposed, each succeeding
in different benchmarks. Each variant uses one model as the primary source and only
resorts to the other when the first does not return any result.
        </p>
        <p>
          All these approaches rely on lookup services for their success. However, each of them
addresses only a single task from STA. Moreover, they cannot cope with the frequent
changes of KGs since they rely on snapshots of the KG to train their respective models.
        </p>
        <p>
          SemTab2019. In 2019, the SemTab challenge was initiated to bring together the
community of automatic systems for STA tasks. A four-round dataset was released with
DBpedia [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as a target KG. Among the participants, the following systems emerged.
MTab [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], the challenge winner in 2019, relies on a joint probability distribution that
is updated as more information becomes known. Input signals include the results of various
lookup services and conditional probabilities based on the row and column context. The
authors mention the computational cost from the multitude of signals as a significant
drawback. CSV2KG [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], achieving second place, uses an iterative process with the
following steps: (i) get an entity matching using lookup services; (ii) infer the column
types and relations; (iii) refine cell mappings with the inferred column types and
relations; (iv) refine subject cells using the remaining cells of the row; and (v) re-calculate
the column type with all the corrected annotations. Tabularisi [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], third place in 2019,
also uses lookup services. For each returned candidate, an adapted TF-IDF score<sup>8</sup> is
calculated. A combination of this score, the Levenshtein distance between cell value and
candidate label, and a distance measure between cell value and the URL tokens is used
to determine the final annotation. DAGOBAH [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] assumes that entities in the same
column are close in the embedding space. Candidates are first retrieved using a lookup
based on regular expressions and the Levenshtein distance. Afterwards, a clustering of
their vector representations using the embedding is performed to disambiguate among
them. The cluster with the highest row-coverage is selected, and final ambiguities are
resolved via a confidence score based on the row context of the candidates.
        </p>
        <p>
          A key success factor to those systems is the use of Wikidata and Wikipedia as
additional data sources. In this paper, we focus on exploiting only the target KG data
sources. Therefore, we try to maximize the benefit from a given cell value and minimize
our reliance on different data sources, which leads to a more straightforward system.
        </p>
        <p>
          SemTab2020. The second edition of the challenge in 2020 changed the target KG
to Wikidata. MTab4Wikidata [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] builds an extensive index that includes all historic
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>8 Term Frequency-Inverse Document Frequency.</title>
        <p>
          revisions. Cell annotation candidates are generated using this index and a
one-edit-distance algorithm. Disambiguation is done via pairwise lookups for all pairs of entities
within the same row. bbw [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] relies on two core ideas. First, it uses SearX<sup>9</sup> as a meta-lookup,
enabling it to search over more than 80 engines. Second, it performs contextual matching using two
features, for example, entity and property labels. The former collects results and ranks
them, while the latter picks the best matches using edit distance. SSL [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] generates
a Wikidata subgraph over a table. It leverages SPARQL queries for all tasks and does
not implement any fuzzy search for entities. However, it applies a crawling process
through Google to suggest better words and thus, overcomes the problem of spelling
mistakes. LinkingPark [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] has a three-module pipeline. For entity generation, it uses
the Wikidata lookup API while employing an off-the-shelf spell checker. Further, its
Property Linker module uses a fuzzy matching technique for numeric values with a
certain margin. JenTab uses a similar methodology to LinkingPark for tackling spelling
mistakes but with the aid of word vectors<sup>10</sup>. Moreover, JenTab uses the same concept
of fuzzy matching for entity and property generation.
        </p>
        <p>To our knowledge, none of these systems provided a detailed study on various
solutions for STA tasks, backward compatibility across rounds, or a time analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>Our system's modules can be classified into one of the following three phases, which
together form a Create, Filter and Select (CFS) pattern. During the Create-phase,
candidates are retrieved for each requested annotation. In the Filter-phase, the surrounding
context is used to reduce the number of candidates. Eventually, in the Select-phase, the
final annotations are chosen among the remaining candidates. The individual modules
for the same task differ in their treatment of the textual input and the context used.
This causes not only differences in the accuracy of their results but also affects their
performance characteristics. In the following, we explain the necessary preprocessing
steps and describe the developed modules for each phase.</p>
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>Before the actual pipeline, each table is subjected to a preprocessing phase consisting of
three steps. The first step aims at normalizing the cells' content. First, we attempt to fix
any encoding issues using ftfy<sup>11</sup>. Further, we remove special characters like parentheses
or slashes. Finally, we use regular expressions to detect missing spaces like in "1stGlobal
Opinion Leader's Summit". In addition to the initial values, the normalized ones are
stored as a cell's "clean value". In the second step, we use regular expressions to
determine the datatype of each column. While our system distinguishes more datatypes,
we aggregate to those having direct equivalents in KGs, i.e., OBJECT, QUANTITY, DATE,
and STRING. Cells in OBJECT-columns correspond to entities of the KG, while the others
represent literals. In the final step, we apply type-based cleaning. In general, it attempts
to extract the relevant parts of a cell value for QUANTITY and DATE columns. For example,
it splits the numeric value from a possibly existing unit in QUANTITY cells. Similarly,
redundant values like "10/12/2020 (10 Dec 2020)" are reduced to "10/12/2020".</p>
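        <p>As an illustration only, the type-based cleaning step could be sketched in Python as follows. The regular expressions and the function names split_quantity and strip_redundant_date are assumptions made for this example and are not taken from the JenTab code base; the encoding repair done with ftfy is omitted to keep the sketch free of third-party dependencies.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical pattern: a number with optional thousands separators and
# decimals, optionally followed by a unit.
_QUANTITY = re.compile(r"^\s*([-+]?\d[\d,]*(?:\.\d+)?)\s*(.*?)\s*$")
# Hypothetical pattern: a trailing parenthesised repetition of the value.
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def split_quantity(cell):
    """Return (number, unit) for a QUANTITY cell, or None if no number leads the cell."""
    match = _QUANTITY.match(cell)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number, match.group(2)


def strip_redundant_date(cell):
    """Drop a trailing parenthesised part, e.g. a second rendering of the same date."""
    return _TRAILING_PARENS.sub("", cell).strip()
```
]]></preformat>
        <p>With these assumptions, a cell such as "1,010,408 km2" is split into the number 1010408.0 and the unit "km2", and "10/12/2020 (10 Dec 2020)" is reduced to "10/12/2020".</p>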
        <sec id="sec-3-1-1">
          <title>9 https://github.com/searx/searx 10 https://www.kaggle.com/cpmpml/spell-checker-using-word2vec 11 https://github.com/LuminosoInsight/python-ftfy</title>
          <p>[Figure 2, panels: (a) Cell, (b) Column, (c) Row, (d) Row-Column.]</p>
          <p>
            Tabular data offers different dimensions of context that can be used to either generate
annotation candidates (Create-phase) or remove highly improbable ones (Filter-phase).
Figure 2 illustrates those visually. The Cell Context is the most basic one, outlined in
Figure 2a. Here, nothing but an individual cell's content is available. We can then
define a Column Context as shown in Figure 2b. It is based on the premise that all cells
within a column represent the same characteristic of the corresponding tuples. For
the annotation process, this can be exploited insofar as all cells of a column share
the same class from the KG. Annotations for cells in OBJECT-columns further share a
common class as required by the CTA task. Similarly, the assumption that each row
refers to one tuple leads to the Row Context of Figure 2c. Annotation candidates for
the subject cell, i.e., a cell holding the identifier for the respective tuple/row, have
to be connected to their counterparts in all other cells within the same row. Finally,
all contexts can be subsumed in the Row-Column Context as given by Figure 2d. It
combines the last two assumptions, representing the most exhaustive context. In the
following, we summarize our modules. For a detailed description, kindly refer to [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
          <p>
            Creating Candidates. All subsequent tasks are based on suitable CEA-candidates for
individual cells. The textual representation of such a cell can deviate from its canonical
name and other labels given by the KG in many different ways. We devised various
modules to cope with the encountered issues using the aforementioned contexts.
- CEA Label Lookup (Cell Context) employs six strategies to cope with spelling
mistakes, use of abbreviations, and other lexicographical challenges.
- CEA by column (Column Context) populates the candidate pool for a cell with
all available instances of that shared class.
- CEA by subject (Row Context) populates mappings for cells in the same row
given the subject cell's annotation, i.e., the cell serving as an identifier for that row.
- CEA by row (Row Context) finds candidates for subject cells given the object
annotations in the same row.
          </p>
          <p>With candidates available for individual cells, another set of modules can be used
to derive candidates for the CTA and CPA tasks.</p>
          <p>- CTA collects the parent classes from all CEA-candidates for a particular column
and uses them as CTA-candidates for that column.
- CPA retrieves all properties for CEA-candidates of subject cells and compares
those to the values of the row. While object-properties are matched against the
candidate lists, literal-properties use a mix of exact and fuzzy matching.
DATE-values are matched based on the date part, omitting any additional time
information. Different datetime-formats are supported.</p>
          <p>STRING-values are split into sets of tokens. Pairs with an overlap of at least
50% are considered a match.</p>
          <p>QUANTITY-values are compared using a 10% tolerance, as given in Equation 1.</p>
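          <p>The three literal matching rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the list DATE_FORMATS is hypothetical (the supported formats are not listed in the text), the token overlap is assumed to be measured relative to the smaller token set, and the quantity rule follows our reading of Equation 1 as |1 - value1 / value2| being below 0.1.</p>
          <preformat><![CDATA[
```python
from datetime import datetime

# Hypothetical list of accepted formats; illustrative only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%dT%H:%M:%SZ")


def parse_date(text):
    """Return the date part of `text`, trying each known format; None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def match_date(cell, kg_value):
    """DATE: compare only the date part, ignoring any time information."""
    left, right = parse_date(cell), parse_date(kg_value)
    return left is not None and left == right


def match_string(cell, kg_value, threshold=0.5):
    """STRING: match if the token sets overlap by at least `threshold`."""
    left, right = set(cell.lower().split()), set(kg_value.lower().split())
    if not left or not right:
        return False
    # Assumption: overlap is measured relative to the smaller token set.
    return len(left & right) / min(len(left), len(right)) >= threshold


def match_quantity(value1, value2, tolerance=0.1):
    """QUANTITY: match if |1 - value1 / value2| is below the tolerance."""
    if value2 == 0:
        return value1 == 0
    return abs(1 - value1 / value2) < tolerance
```
]]></preformat>
          <p>For instance, match_quantity(1010408, 1002450) holds because the two values differ by less than one percent, whereas match_quantity(80, 100) does not.</p>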
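          <p><disp-formula><tex-math><![CDATA[\mathit{Match} = \begin{cases} \mathit{true}, & \text{if } \left| 1 - \dfrac{\mathit{value}_1}{\mathit{value}_2} \right| < 0.1 \\ \mathit{false}, & \text{otherwise} \end{cases} \qquad (1)]]></tex-math></disp-formula></p>
          <p>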
            Filtering Candidates. The previous modules generate lists of candidates for each task.
Next, filter-modules remove unlikely candidates based on different contexts.
- CTA support (Column Context) removes CTA-candidates that do not apply to
at least a minimum number of cells in that column.
- CEA unmatched properties (Row Context) removes CEA-candidates that are
not part of any candidate for a CPA-matching.
- CEA by property support (Row Context) first counts CPA-matches for
subject-cells' CEA-candidates. All but the ones scoring highest are then removed.
- CEA by string distance (Cell Context) excludes all CEA-candidates whose label
is not within a certain range with respect to their Levenshtein distance [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] to the cell value.
          </p>
          <p>Selecting a Final Annotation. At some point, a final annotation from the list of
candidates has to be selected. If only a single candidate remains, this candidate is
chosen as the solution. In all other cases, the following modules are applied.
- CEA by string similarity selects the CEA-candidate whose label is the closest
to the original cell value using the Levenshtein distance.
- CEA by column operates on cells with no CEA-candidates left<sup>12</sup>. It looks for
other cells in the same column that are reasonably close with respect to their Levenshtein
distance and adopts their solution if available.
- CTA by LCS considers the whole class hierarchy of current CTA-candidates and
picks the Least Common Subsumer (LCS) as a solution.
- CTA by Direct Parents applies a majority voting on CTA-candidates and their
direct parents in the class hierarchy.
- CTA by Majority applies a majority voting on the remaining CTA-candidates.
- CTA by Popularity breaks any remaining ties by selecting the most popular
CTA-candidate, i.e., the one with the most instances in the KG.
- CPA by Majority applies a majority voting on the remaining CPA-candidates.</p>
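          <p>Two of the selection modules lend themselves to a short Python sketch: selection by string similarity, and majority voting with a popularity tie-break. The function names, the candidate dictionaries, and the popularity numbers are assumptions made for this example; the sketch only illustrates the ideas named above and does not reproduce the actual modules.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def select_by_string_similarity(cell_value, candidates):
    """Pick the candidate whose label is closest to the cell value.

    `candidates` maps an entity id to its label (hypothetical structure).
    """
    return min(candidates, key=lambda entity: levenshtein(cell_value, candidates[entity]))


def select_by_majority(votes, popularity):
    """Majority vote over candidates; ties are broken by popularity.

    `popularity` maps a candidate to its instance count (illustrative numbers).
    """
    counts = Counter(votes)
    best = max(counts.values())
    tied = [candidate for candidate, count in counts.items() if count == best]
    return max(tied, key=lambda candidate: popularity.get(candidate, 0))
```
]]></preformat>
          <p>For a misspelled cell such as "Berlim", the first function prefers a candidate labelled "Berlin" over one labelled "Germany", since its label is only one edit away.</p>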
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Architecture</title>
        <p>12 The filter modules applied before might have removed all CEA-candidates.</p>
        <p>[Figure 3: architecture diagram showing the Manager, several Runners (reporting results, audit data, and errors), and the services Clean Cells, Type Prediction, Approach, Lookup (Wikidata), Endpoint (Wikidata), Generic Strategy, and Autocorrect; Lookup, Endpoint, and Generic Strategy each have a cache.]</p>
        <p>[...] results correspond to annotations of tasks, audit data that allows assessing the impact
of individual modules, and possibly a list of any errors thrown during the processing.</p>
        <p>The Manager's dashboard shows the current
state of the overall system, i.e., processed versus pending tables, as well as data about
connected Runners and any errors thrown. It also gives an estimate of the
remaining time needed. Finally, once the processing has finished, all gathered
annotations can also be accessed from this central point. The Runner coordinates the processing of a single
table at a time through a series of calls to different services. Tables are first
passed through the preprocessing services Clean Cells and Type Prediction.
Afterwards, the core pipeline is executed via the Approach service. Approach depends on the
following four services. Lookup and Endpoint are proxies to the KG lookup
and SPARQL endpoint services, respectively. Moreover, the computationally expensive
Generic Strategy of the CEA lookup, see Subsection 3.2, is wrapped in a separate
service. These three services include caching for their results. The final dependency is
given by the Autocorrect service, which tries to fix spelling mistakes in cells.</p>
        <p>The chosen architecture has several advantages. First, using caches for
computationally expensive tasks or external dependencies increases the overall system
performance. Furthermore, it reduces the pressure on downstream systems, which is especially
important when public third-party services are used. Second, when the target KG is to
be substituted, all necessary changes like adjusting SPARQL queries are concentrated
within just two locations: the corresponding lookup and endpoint services. Third, the
distributed design allows scaling well with respect to the number of tables to be
annotated. Any increase in the number of tables can be mitigated by adding new Runners
to cope with the workload. Finally, the implementation yields reusable and
self-contained pieces of code. For example, Runner can deal with any other Approach
implementation, and Autocorrect can be used by any other Approach.</p>
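        <p>The caching idea behind the Lookup, Endpoint, and Generic Strategy services can be illustrated with a small Python sketch. The class CachedService and the stand-in function fake_lookup are hypothetical names introduced here; a real proxy would call a remote lookup or SPARQL service and would likely persist its cache, which this in-memory sketch does not attempt.</p>
        <preformat><![CDATA[
```python
class CachedService:
    """Wrap a lookup function and remember its results per query.

    `backend` stands in for a call to a remote service (hypothetical here).
    """

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}
        self.backend_calls = 0  # counts how often the backend was actually hit

    def query(self, key):
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(key)
        return self._cache[key]


def fake_lookup(label):
    """Stand-in for a KG lookup; returns canned candidates for the demo."""
    canned = {"Egypt": ["wd:Q79"], "Germany": ["wd:Q183"]}
    return canned.get(label, [])
```
]]></preformat>
        <p>Repeated queries for the same label then reach the backend only once, which is the effect that reduces pressure on public third-party services.</p>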
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Setup</title>
      <p>
        We base the evaluation of our approach on the corpus provided by the Semantic Web
Challenge on Tabular Data to Knowledge Graph Matching (SemTab2020) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In the
following, we will first outline the configuration of annotation modules listed in
Section 3, before describing the corpus in more detail. Further, we will explain the metrics
used, which follow the evaluation strategy prescribed by the challenge.
      </p>
      <p>[Figure 4: execution order of the modules, arranged in numbered groups. The modules shown are CEA Label Lookup, CTA, CTA-Support, CPA, CEA by Unmatched Properties, CEA by String Distance, CEA by Column, CEA by String Similarity, CTA by Direct Parents, CTA by Popularity, CEA by Row and Column, CEA by Row, CEA by Subject, CEA by Property Support, CTA by LCS, and CPA by Majority.]</p>
      <p>The order of modules used in the evaluation is outlined in Figure 4. The modules are
arranged into several groups. Some groups are only executed if the preceding group
had any effect on the current candidate pool. Similarly, the different approaches for
creating CEA-candidates skip cells that already have candidates at the time.</p>
      <p>Group 1 represents the most direct approach. As its modules use only a few
interdependencies, queries are rather simple and can be executed quickly. Still, it accounts
for a substantial share of candidates and thus forms the basis for subsequent groups.</p>
      <p>For cells that so far did not receive any CEA-candidates, Group 2 is a first attempt
to compensate by expanding the considered scope. Here, CEA by Row and Column
precedes CEA by Row. Using more context information, i.e., the Column Context, its
results are of higher quality compared to CEA by Row. It will fail, though, when the
current list of corresponding CTA-candidates does not yet contain the correct solution.
In such cases, CEA by Row can fill in the gaps. If any of the two modules resulted in
new CEA-candidates, the corresponding modules for CTA and CPA candidate creation
will be repeated in Group 3.</p>
      <p>Group 4 attempts to select annotations for the first time. A prior filter step again
uses the Row Context to retain only the CEA-candidates with the highest support
within their respective tuples. Afterwards, annotations are selected from the candidate
pool available at this point. It yields solutions for the majority of annotation-tasks but
may leave some gaps on occasion.</p>
      <p>The next two groups represent our last efforts to generate new candidates using
stronger assumptions. Group 5 assumes that we were already able to determine the
correct CTA-annotation for the respective column and then uses all corresponding
instances as CEA-candidates. Similarly, Group 7 assumes that the CEA-annotation of the
subject cell is already determined and creates candidates from all connected entities.
Groups 6 and 8 are used to validate those candidates and possibly select further
annotations to fill in the gaps.</p>
      <p>[Figure: table columns labeled Country, Inception (LITERAL), Area (LITERAL), Label (LITERAL), and Capital (IRI).]</p>
      <p>Group 9 makes a last-ditch effort for cells that could not be annotated so far.
As no other module was able to find a proper solution, this group will reconsider all
CEA-candidates that were dropped at some point. Using this pool, it attempts to fill
the remaining gaps in annotations.</p>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>
          We use the SemTab2020 dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] as a benchmark for our approach. It contains over
130,000 tables automatically generated from Wikidata [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] that were further altered
by introducing artificial noise [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The corpus is split into four rounds. In the last
round, 180 tables are added from the 2T dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], increasing the difficulty. Table 1
summarizes the data characteristics of the four rounds.
        </p>
        <p>
          Figure 5 illustrates the challenges present in the dataset: (a) missing or
nondescriptive table metadata, like column headers; (b) spelling mistakes; (c) ambiguity in cell
values, for example, UK has Ukrainian (Q8798), United Kingdom (Q145), University
of Kentucky (Q1360303), and more as corresponding entities in Wikidata; (d) missing
spaces, causing tokenizers to perform poorly; (e) inconsistent formats of date and time
values; (f) nested pieces of information in Quantity fields, which interfere with the corresponding
CPA tasks; (g) redundant columns; (h) encoding issues; (i) seemingly random noise in
the data, e.g., Berlin would be expected in the context of the given example; (j) missing
values including nulls, empty strings, or special characters like (?, -, –) to the same
effect; (k) tables of excessive length.
        </p>
        <p>
          Besides the datasets, SemTab2020 also provides a framework to evaluate tabular data to
knowledge graph matching systems [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Our evaluation follows the proposed
methodology, which is outlined in the following. At its core, it relies on the standard information
retrieval metrics of Precision (P), Recall (R), and F1 Score (F1) as given in Equation 2.
        </p>
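        <p><disp-formula><tex-math><![CDATA[P = \frac{|\text{correct annotations}|}{|\text{annotated cells}|}, \qquad R = \frac{|\text{correct annotations}|}{|\text{target cells}|}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (2)]]></tex-math></disp-formula></p>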
        <p>
          However, these default metrics fall short for the CTA task. Here, there is not always
a clear-cut distinction between right and wrong. Some solutions might be acceptable but
do not represent the most precise class to annotate a column. Taking the last column
of Figure 5 as an example, the best annotation would likely be
capital (Q5119) (assuming "Tübingen" is noise here). Nevertheless, an annotation city
(Q515) is also correct, but just not as precise. To account for such correct but imprecise
solutions, an adapted metric called cscore is advised as shown in Equation 3 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Here, d(α) is the shortest distance between the chosen annotation-entity α and the
most precise one, i.e., the one given in the GT. Consequently, Precision, Recall, and
F1 Score are adapted to the forms in Equation 4.
        </p>
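        <p><disp-formula><tex-math><![CDATA[cscore(\alpha) = \begin{cases} 1, & \text{if } \alpha \text{ is in GT}, \\ 0.8^{d(\alpha)}, & \text{if } \alpha \text{ is an ancestor of the GT}, \\ 0.7^{d(\alpha)}, & \text{if } \alpha \text{ is a descendant of the GT}, \\ 0, & \text{otherwise} \end{cases} \qquad (3)]]></tex-math></disp-formula></p>
        <p><disp-formula><tex-math><![CDATA[AP = \frac{\sum cscore(\alpha)}{|\text{annotated cells}|}, \qquad AR = \frac{\sum cscore(\alpha)}{|\text{target cells}|}, \qquad AF1 = \frac{2 \cdot AP \cdot AR}{AP + AR} \qquad (4)]]></tex-math></disp-formula></p>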
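        <p>A minimal Python sketch of these metrics is given below. It assumes that the relation between an annotation and the GT (exact, ancestor, descendant, or unrelated) and the hierarchy distance d have already been determined; the function names and the string labels for the relations are choices made for this example only and do not come from the challenge's evaluator.</p>
        <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(correct, annotated, targets):
    """Equation 2, from counts of correct annotations, annotated and target cells."""
    precision = correct / annotated if annotated else 0.0
    recall = correct / targets if targets else 0.0
    return precision, recall, f1_score(precision, recall)


def cscore(relation, distance=0):
    """Equation 3; `relation` is 'exact', 'ancestor', 'descendant', or anything else."""
    if relation == "exact":
        return 1.0
    if relation == "ancestor":
        return 0.8 ** distance
    if relation == "descendant":
        return 0.7 ** distance
    return 0.0


def approximate_scores(scores, targets):
    """Equation 4: AP, AR, and AF1 from the cscores of all annotated targets."""
    total = sum(scores)
    ap = total / len(scores) if scores else 0.0
    ar = total / targets if targets else 0.0
    return ap, ar, f1_score(ap, ar)
```
]]></preformat>
        <p>For example, annotating a column with a class one step above the GT class contributes a cscore of 0.8 instead of 1, so an imprecise but correct answer is penalized rather than counted as wrong.</p>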
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>In this section, we discuss our findings regarding the overall system. We start by
assessing the preprocessing step "Type Prediction", which is responsible for determining
a column's datatype, see Subsection 3.1. Figure 6 shows the confusion matrix of this
step, which reaches 99% accuracy. We used the ground truth of the CEA and CPA tasks to query
Wikidata for their types; these values represent the actual datatypes, while the predicted
values are our system's results.</p>
      <p>
        Spelling mistakes are a crucial problem that we have tackled by using the "Generic
Strategy", see Subsection 3.2. The effectiveness of this is illustrated in Table 2: Almost
99% of unique labels were covered in the first three rounds. However, this is reduced
to 97% in the last round. Our pre-computed lookups are publicly available [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
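      <p>[Figure 6: confusion matrix of the Type Prediction step, actual versus predicted datatypes; the only row recovered is OBJECT: 0.99, 0.01, 0.00, 0.00.]</p>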
      <p>Our modular approach enables us to exchange individual components or provide
backup solutions if the existing ones fail in specific situations. We used this to
set up three different experiments that explore the effect of changing how cells' types
are retrieved. The three modes are as follows. First, "P31" includes only direct parents using
instance of (P31); here, we use a majority vote to select a column type. Second, "2
Hops" extends "P31" with one additional parent via subclass of (P279). Finally, "Multi
Hops" creates a more general tree of parents by following subclass of (P279) relations.</p>
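      <p>The difference between the three modes can be sketched on a toy hierarchy. The following Python example is only an illustration: the dictionaries stand in for Wikidata's instance of (P31) and subclass of (P279) statements, the entries are made-up sample data, and all function names are hypothetical rather than taken from JenTab's code base.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Toy stand-ins for Wikidata statements (hypothetical sample data).
INSTANCE_OF = {  # instance of (P31)
    "Berlin": ["capital"],
    "Jena": ["city"],
    "Paris": ["capital"],
}
SUBCLASS_OF = {  # subclass of (P279)
    "capital": ["city"],
    "city": ["human settlement"],
    "human settlement": ["geographic location"],
}


def types_p31(entity):
    """"P31" mode: direct parents only."""
    return list(INSTANCE_OF.get(entity, []))


def types_2hops(entity):
    """"2 Hops" mode: direct parents plus one subclass-of step."""
    direct = types_p31(entity)
    parents = [p for t in direct for p in SUBCLASS_OF.get(t, [])]
    return direct + parents


def types_multi_hops(entity):
    """"Multi Hops" mode: follow subclass-of until no new class appears."""
    seen = []
    frontier = types_p31(entity)
    while frontier:
        current = frontier.pop(0)
        if current not in seen:
            seen.append(current)
            frontier.extend(SUBCLASS_OF.get(current, []))
    return seen


def majority_vote(column_entities, mode=types_p31):
    """Return the most frequent type(s) of a column; ties yield several."""
    votes = Counter(t for e in column_entities for t in set(mode(e)))
    if not votes:
        return []
    top = max(votes.values())
    return sorted(t for t, count in votes.items() if count == top)


column = ["Berlin", "Jena", "Paris"]
print(types_p31("Berlin"), types_2hops("Berlin"), types_multi_hops("Berlin"))
print(majority_vote(column))
```
]]></preformat>
      <p>Returning all tied winners instead of a single type makes it visible when a vote is inconclusive and a tiebreaker or backup method would be needed.</p>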
      <p>We have implemented five strategies for the initial creation of CEA candidates, see
Subsection 3.2. Figure 7a shows how often each strategy is used. This underlines the
need for various strategies to capture a wide range of useful information inside each
cell. The shown distribution also reflects our chosen order of methods. For example,
the "Generic Strategy" has the highest priority and is thus used most of the time. On the other hand,
"Autocorrect" has the lowest priority and is used as a last resort. The CEA
selection phase involves two methods. Figure 7b shows how often each of them is used:
our dominant selection approach is "String Similarity", which is used 38% more than
"Column Similarity". Finally, Figure 8a describes the distribution of CTA selection
methods in the "P31" setting, while Figure 8b shows the methods used in "2
Hops" mode, where LCS is the dominant selection strategy. Comparing "Majority
Vote" with LCS in the two settings, the former successfully finds more
solutions than the latter, which results in less reliance on backup strategies or tiebreakers.
The same exclusive execution concept used in CEA selection is also applied to the CTA
selection methods. The dominant method, e.g., LCS in "2 Hops" mode, is invoked more
frequently because it comes first in the order. The backup strategies then try to solve the remaining
columns for which the earlier methods failed to find a solution.</p>
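      <p>The exclusive, priority-ordered execution described above can be sketched as follows. This is a simplified illustration under our own assumptions: the two lookup functions are hypothetical stand-ins (an exact match and a fuzzy match), not JenTab's actual strategies, and the entity IDs are placeholders.</p>
      <preformat><![CDATA[
```python
import difflib
from collections import Counter

# Hypothetical lookup table from normalized labels to placeholder entity IDs.
KNOWN_LABELS = {"berlin": "E1", "jena": "E2"}


def exact_lookup(cell):
    """High-priority stand-in: match after simple normalization."""
    hit = KNOWN_LABELS.get(cell.strip().lower())
    return [hit] if hit else []


def fuzzy_lookup(cell):
    """Low-priority stand-in: closest known label above a similarity cutoff."""
    close = difflib.get_close_matches(
        cell.strip().lower(), list(KNOWN_LABELS), n=1, cutoff=0.8
    )
    return [KNOWN_LABELS[label] for label in close]


# Order encodes priority: later entries only run if earlier ones found nothing.
STRATEGIES = [("exact", exact_lookup), ("fuzzy", fuzzy_lookup)]


def create_candidates(cell, strategies=STRATEGIES):
    """Return (strategy name, candidates) of the first strategy that succeeds."""
    for name, strategy in strategies:
        candidates = strategy(cell)
        if candidates:
            return name, candidates
    return None, []


cells = ["Berlin", " jena", "Berlni", "xyz"]
usage = Counter(create_candidates(cell)[0] for cell in cells)
print(usage)
```
]]></preformat>
      <p>Counting which strategy produced the result for each cell yields a usage distribution of the kind reported in Figure 7a: methods early in the order dominate simply because later ones are only consulted for the leftovers.</p>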
      <p>
        Table 3 reports our results for the four rounds under the three execution modes.
In the first three rounds, we achieved a coverage of more than 98.8% for all three
tasks. In the fourth round, the coverage of the CEA task is dramatically affected by the
selected mode: "P31" achieved the highest coverage with 99.39%, while "Multi Hops"
reached only 81.83%. The F1 Score in the CEA, CTA, and CPA tasks is greater than 0.967,
0.945, and 0.963, respectively. These scores were obtained by running the publicly available
evaluator code<sup>13</sup> on our solution files [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Both "2 Hops" and "Multi Hops" have better
coverage but lower recall, unlike "P31", which achieved the best scores in most cases.
      </p>
      <p><sup>13</sup> https://github.com/sem-tab-challenge/aicrowd-evaluator</p>
      <p>
        Our performance is compared with the top systems of SemTab2020 in Table 4.
JenTab's results are competitive across all tasks, but are severely impacted by the
Tough Tables (2T) dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Here, the ambiguity of possible annotations is increased,
so that even human annotators have problems with disambiguation. Moreover, it includes real
tables created from multiple data sources, which means that some cells lack a match
in the target KG. Finally, misspellings are far more extreme and frequent.
      </p>
      <p>
        Table 5 shows the time consumption for all four rounds, together with the number of
runners used, for each mode setting of the CTA task. Close inspection revealed that the
execution time is largely dominated by the response times of the Wikidata servers and is thus beyond
our control. Execution was time-scoped, i.e., an upper limit for the time per table was
set. This allowed the system to converge faster than the initial
implementation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], with, e.g., Round 4 showing a more than 50% reduction in time. Intermediate
results are cached across rounds, saving time and lowering the number of requests to
external services. Our modular approach allows scaling the number of runners based
on available resources and hence speeding up the overall process.
      </p>
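      <p>Time-scoped execution and caching across rounds can be combined as in the following minimal Python sketch. It is an assumption-laden illustration, not JenTab's implementation: the class and function names are hypothetical, and a plain string function stands in for a request to an external service.</p>
      <preformat><![CDATA[
```python
import time


class CachedLookup:
    """Wrap a (possibly slow) fetch function and remember its answers."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.requests = 0  # number of calls that actually reached `fetch`

    def get(self, key):
        if key not in self._cache:
            self.requests += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def annotate_table(cells, lookup, budget_seconds, clock=time.monotonic):
    """Annotate cells until the per-table time budget is used up."""
    deadline = clock() + budget_seconds
    results = {}
    for cell in cells:
        if clock() >= deadline:
            break  # budget exhausted: return the partial result
        results[cell] = lookup.get(cell)
    return results


lookup = CachedLookup(fetch=str.upper)  # stand-in for a remote request
round_1 = annotate_table(["a", "b"], lookup, budget_seconds=60)
round_2 = annotate_table(["a", "b", "c"], lookup, budget_seconds=60)
print(round_2, lookup.requests)
```
]]></preformat>
      <p>Because the cache is shared between the two calls, the second round only issues a request for the cell it has not seen before. Passing the clock as a parameter is a design choice of this sketch that makes the time limit easy to test without waiting.</p>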
      <p>The results show that for most tables the "P31" mode is the most efficient
approach. However, for the 2T dataset a more sophisticated approach is needed. Here,
the "2 Hops" approach yields better results. The "Multi Hops" strategy cannot
surpass any of the other strategies, no matter the setting. In terms of both performance
and result quality, it is inferior and should thus not be used.</p>
      <p>A recurring source of issues was the dynamic nature of Wikidata. Users enter new
data, delete existing claims, or adjust the information contained. On several occasions,
we investigated missing mappings of our approach only to find that the respective
entity in Wikidata had changed. The challenge and ground truth were created at one
point in time, so using the live system will leave some mappings unrecoverable.
Moreover, we are limited by the fair-use policies of the Wikidata Endpoint service. Another
limitation affects the "CEA by Column" module. Some classes like human (Q5) have
a large number of instances. Here, queries to retrieve those instances often fail
with timeouts, which limits the module to reasonably specific classes.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>
        In this paper, we presented an extensive evaluation of our toolkit for Semantic Table
Annotation, "JenTab". Based purely on the publicly available endpoints of Wikidata,
its modular architecture allows exploiting various strategies and easily adjusting the
processing pipeline. "JenTab" is publicly available [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]<sup>15</sup>. We presented a detailed analysis
of the effectiveness of JenTab's strategies using the benchmark dataset provided by
SemTab2020. Finally, we compared JenTab to other top contenders from that challenge
and demonstrated the competitiveness of our system.
      </p>
      <p>We see several areas for further improvement. First, certain components
currently require substantial resources, either due to the number of computations
necessary, as in the Generic Lookup, or due to the limited performance of the SPARQL endpoint.
While we can address the latter by rewriting queries or re-designing the approach, the
former offers plenty of opportunities to accelerate the system.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>The authors thank the Carl Zeiss Foundation for the financial support of the project
"A Virtual Werkstatt for Digitization in the Sciences (P5)" within the scope of the
program line "Breakthroughs: Exploring Intelligent Systems" for "Digitization - explore
the basics, use applications". We would further like to thank K. Opasjumruskit, S.
Samuel, and F. Zander for the fruitful discussions throughout the challenge. Last but
not least, we thank B. König-Ries and J. Denzler for their guidance and feedback.</p>
      <p><sup>15</sup> https://github.com/fusion-jena/JenTab</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdelmageed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : JenTab:
          <article-title>Matching tabular data to knowledge graphs</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>40</volume>
          –
          <issue>49</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abdelmageed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
          </string-name>
          , S.: fusion-jena/JenTab: KGCW 2021 (
          <year>Apr 2021</year>
          ). https://doi.org/10.5281/zenodo.4730314
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Abdelmageed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>: fusion-jena/JenTab precomputed lookup</article-title>
          :
          <source>KGCW 2021 (Apr</source>
          <year>2021</year>
          ). https://doi.org/10.5281/zenodo.4730341
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Abdelmageed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>: fusion-jena/JenTab solution files</article-title>
          :
          <source>KGCW 2021 (Apr</source>
          <year>2021</year>
          ). https://doi.org/10.5281/zenodo.4730350
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          .
          <source>In: The semantic web</source>
          , pp.
          <volume>722</volume>
          –
          <fpage>735</fpage>
          . Springer (
          <year>2007</year>
          ). https://doi.org/10.1007/978-3-540-76298-0_52
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bhagavatula</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noraset</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>TabEL: Entity linking in web tables</article-title>
          .
          <source>In: The Semantic Web - ISWC</source>
          <year>2015</year>
          , pp.
          <volume>425</volume>
          –
          <issue>441</issue>
          (
          <year>2015</year>
          ). https://doi.org/10.1007/978-3-319-25007-6_25
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chabot</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
          </string-name>
          , R.: DAGOBAH:
          <article-title>An end-to-end context-free tabular data semantic annotation system</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2553</volume>
          , pp.
          <volume>41</volume>
          –
          <issue>48</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>ColNet: Embedding the semantics of web tables for column type prediction</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>33</volume>
          ,
          <fpage>29</fpage>
          –36 (Jul
          <year>2019</year>
          ). https://doi.org/10.1609/aaai.v33i01.330129
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning Semantic Annotations for Tabular Data</article-title>
          .
          <source>In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19</source>
          . pp.
          <year>2088</year>
          –
          <year>2094</year>
          (
          <year>2019</year>
          ). https://doi.org/10.24963/ijcai.2019/289
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karaoglu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Negreanu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          :
          <article-title>Linkingpark: An integrated approach for semantic table interpretation</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>65</volume>
          –
          <issue>74</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cutrona</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bianchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Tough Tables:
          <article-title>Carefully Evaluating Entity Linking for Tabular Data (Nov</article-title>
          <year>2020</year>
          ). https://doi.org/10.5281/zenodo.4246370
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez-Muro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christophides</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings</article-title>
          .
          <source>In: Lecture Notes in Computer Science</source>
          , pp.
          <volume>260</volume>
          –
          <fpage>277</fpage>
          . Springer International Publishing (
          <year>2017</year>
          ). https://doi.org/10.1007/978-3-319-68288-4_16
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets (Nov</article-title>
          <year>2020</year>
          ). https://doi.org/10.5281/zenodo.4282879
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>SemTab</source>
          <year>2019</year>
          :
          <article-title>Resources to benchmark tabular data to knowledge graph matching systems</article-title>
          .
          <source>In: The Semantic Web</source>
          , pp.
          <volume>514</volume>
          –
          <fpage>530</fpage>
          . Springer International Publishing (
          <year>2020</year>
          ). https://doi.org/10.1007/978-3-030-49461-2_30
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cutrona</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Results of SemTab 2020</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Generating conceptual subgraph from tabular data for knowledge graph matching</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>96</volume>
          –
          <issue>103</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Binary codes capable of correcting deletions, insertions and reversals</article-title>
          .
          <source>Doklady Akademii Nauk SSSR</source>
          <volume>163</volume>
          (
          <issue>4</issue>
          ),
          <volume>845</volume>
          –
          <fpage>848</fpage>
          (
          <year>1965</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Limaye</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Annotating and searching web tables using entities, types and relationships</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ),
          <volume>1338</volume>
          –
          <fpage>1347</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kertkeidkachorn</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
          </string-name>
          , H.:
          <article-title>MTab: Matching Tabular Data to Knowledge Graph using Probability Models</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2553</volume>
          , pp.
          <volume>7</volume>
          –
          <issue>14</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kertkeidkachorn</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
          </string-name>
          , H.: MTab4Wikidata at SemTab 2020:
          <article-title>Tabular Data Annotation with Wikidata</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>86</volume>
          –
          <issue>95</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Shigapov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zumstein</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamlah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oberlaender</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mechnich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schumm</surname>
          </string-name>
          , I.:
          <article-title>bbw: Matching csv to wikidata via meta-lookup</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2775</volume>
          , pp.
          <volume>17</volume>
          –
          <issue>26</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Steenwinckel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandewiele</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Turck</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ongenae</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>CSV2KG: Transforming tabular data into semantic knowledge (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Thawani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zafar</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divvala</surname>
            ,
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qasemi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pujara</surname>
          </string-name>
          , J.:
          <article-title>Entity linking to knowledge graphs to infer column types and properties</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2553</volume>
          , pp.
          <volume>25</volume>
          –
          <issue>32</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Krotzsch, M.:
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <volume>78</volume>
          –85 (Sep
          <year>2014</year>
          ). https://doi.org/10.1145/2629489
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>