<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MantisTable SE: an E cient Approach for the Semantic Table Interpretation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Cremaschi</string-name>
          <email>marco.cremaschi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Avogadro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Barazzetti</string-name>
          <email>a.barazzetti1g@campus.unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Chieregato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milan - Bicocca</institution>
          ,
          <addr-line>Viale Sarca 336, 20126 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present a novel unsupervised and automatic approach for Semantic Table Interpretation. The technique presented is performed against DBpedia and Wikidata, and it can be easily adapted to any other Knowledge Graph (KG). Moreover, we provide a tool (LamAPI) that allows to e ciently fetch data needed for Semantic Table Interpretation (STI) tasks from the KG dumps.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Table Interpretation</kwd>
        <kwd>DBpedia</kwd>
        <kwd>Wikidata</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The STI is a research eld in continuous evolution with increasing interest over
time. To mention a few numbers describing the phenomenon:</p>
      <p>
        KG because it was the richest data source available with ground truths available,
later we adapted our approach to Wikidata because of it's increasing popularity
and in order to take part in the SemTab2020 challenge[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The requirement to
adapt the approach to Wikidata drove us to the development of a new approach.
In this way, we showed that the approach can be easily adapted to be used with
any KG. We will be referring to our new approach with \MantisTable SE", and
it is the 4th Open Source implementation of MantisTable.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        As seen in Section 1, to obtain the STI of tabular data, it is required to link
elements of the table with the elements in a KG. The elements in KGs (e.g.
DBpedia or Wikidata) are frequently stored in Resource Description
Framework (RDF) format, so to access to these elements, it is necessary to query a
SPARQL endpoint. For instance, the most popular way to access DBpedia dumps
is by using OpenLink Virtuoso, a row-wise transaction-oriented RDBMS with
a SPARQL query engine to access to RDF graph store1. Wikidata instead uses
Blazegraph2 that is a high-performance graph database supporting Blueprints
and RDF/SPARQL APIs. The issue faced with these solutions is the time
required for importing the data. Wikidata2019 dump requires some days to set
1 http://virtuoso.openlinksw.com
2 https://blazegraph.com
up3. Another problem is given by the amount of information present in KG;
for instance, the Wikidata dump is about 1.1TB (uncompressed). The English
version of DBpedia instead is split into multiple les of the size of 26GB, which
leads to high computation times to obtain a complete STI (e.g. TableMiner+
according to the author [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] took 13.35 hours to process the Limaye200 dataset).
However, not all the information present in a KG is necessary to carry out a
STI. Therefore, in order to obtain an e cient approach, it is necessary to
identify other ways for querying KG. To do that, the approach described in this
paper does not use SPARQL queries but queries indexes built on DBpedia and
Wikidata dump les. These indexes are accessible through the use of four di
erent API services that allow us to replace all the SPARQL queries. The tool that
provides these APIs is called LamAPI. LamAPI is an open-source tool, and the
code is available as a Git repository4. LamAPI provides the following services:
1. Label Matching: given a free text (in this case a data inside the table cell),
it retrieves the entities with the greatest similarity, using ElasticSearch's
default sorted scoring algorithm with full-text search enabled. For instance,
considering a table cell containing \Jurassic World" the result returned is
shown in Listing 1.1 or Listing 1.2;
2. Predicate matching: given two entities it retrieves all the predicates
between them; considering the entity \Jurassic World" and the entity
\Colin Trevorrow", the list of predicates is shown in Listing 1.4;
3. Object matching: given an entity it retrieves all the related objects and
predicates; for example, with the entity \Jurassic World" the result is the
one shown in Listing 1.4;
4. Concept matching: given an entity it retrieves all it's concepts as shown
in Listing 1.3.
3
https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limitspart-1/
4 https://bitbucket.org/disco unimib/lamapi
      </p>
      <p>Listing 1.1: Wikidata labels.
}
" count ": 70 ,
" r"esuurilt"s:":" Q[5{5615459 ",</p>
      <p>" label ": " Jurassic World "
} ,{" uri ": " Q3512046 ",</p>
      <p>" label ": " Jurassic World "
} ,{" uri ": " Q18615494 ",</p>
      <p>" label ": " Jurassic World "
} ,{" uri ": " Q2336369 ",</p>
      <p>" label ": " Jurassic World "
} ,{" uri ": " Q37598390 ",</p>
      <p>" label ": " Jurassic World Evol ."
Listing 1.3: Query result - Literals
and Concepts.
7
9
11
13
15</p>
      <p>The loading of the labels (as shown in Listing 1.1 and Listing 1.2) takes place
through the use of pre-computed JSON les. The other three services are made
with Redis5 key-value database. As a search engine for the labels service, we
use an instance of ElasticSearch6 that supports HTTP GZip natively, giving a
performance boost. KG's dump les need some pre-processing in order to be
used for this system, in particular we use namespace abbreviations (e.g. dbo: for
dbpedia.org/ontology ) in order to save space and we avoid data duplication.</p>
      <p>In the following, we will focus on the algorithmic process of our MantisTable
SE. The process is organised into eight phases as follows: i) Data Preparation and
Normalisation, ii) Column Analysis and Subject Detection, iii) Data Retrieval,
iv) Cell Entity Annotation (CEA), v) Column Predicate Annotation (CPA), vi)
Column Type Annotation (CTA), vii) Revision and viii) Export.</p>
      <p>The process was designed to retrieve the candidate for a given cell only once,
especially if the content of that cell is repeated across multiple tables. This allows</p>
      <sec id="sec-2-1">
        <title>5 https://redis.io/ 6 https://www.elastic.co/</title>
        <p>the approach to avoid repeating queries for the same content and saves network
time, but the drawback is more di cult memory management during execution.
Every phase must be completed for all the tables before going to the next one.
This does not preclude running the tool against only one table, but when running
with multiple tables the execution time will be sharply reduced (for the same
text in a cell we have the same candidate entities that will be sorted considering
the content of every table. This will be explained more in detail in the CEA
phase). For examples, we will refer to Table 1.</p>
        <p>SUBJ</p>
        <p>NE</p>
        <p>LIT</p>
        <p>NE</p>
        <p>LIT</p>
        <p>LIT</p>
        <p>i) Data Preparation and Normalisation. During this phase, all the cells
of all the tables are analysed using a tokeniser managing special characters and
removing parenthesis and the text block inside them.</p>
        <p>
          ii) Column Analysis and Subject Detection. During Column Analysis
we identify literal columns (L-column) by using a set of regex [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (i.e. BOOLEAN,
DATE, EMAIL, GEOCORDS, INTEGER, REAL, ISBN, URL) to identity
different datatypes. The other columns are annotated as Named Entity column
(NE-column). The subject column (S-column) can be identi ed between
NEcolumn thanks to content-based scores, but it will not be discussed in this
approach as it would not introduce anything new from [
          <xref ref-type="bibr" rid="ref2 ref6">2,6</xref>
          ]. Besides, target columns
are provided during SemTab2019 and SemTab2020 challenges.
        </p>
        <p>iii) Data Retrieval. For each normalised cell, the candidate entities are
retrieved from the Label Matching API service. Retrieving data implies to seek
for the entities in the KG with similar labels to our Named Entities cells. For
example, if we query \jurassic world" on LamAPI Wikidata we get a list of
candidates:
{ Q3512046 ("Jurassic World" - Movie);
{ Q18615494 ("Jurassic World" - Comic Strip);
{ Q55615459 ("Jurassic World" - Attraction);
{ thousands more results.</p>
        <p>The candidates will be ranked during the next phases. Cells of NE columns
that have the same string value in di erent tables start by considering the same
set of candidate entities; this set will be sorted di erently, considering
relationships between the entire contents of a table.</p>
        <p>iv) Cell Entity Annotation (CEA). For each cell of the S-column and
NE-columns a con dence score is calculated by computing the edit distance
(Levenshtein distance) between the labels (in di erent languages) of candidate
entity ei;j 2 Ei;j and the content of the cell tx(i; j):
1
norm(LevenshteinDistance(tx(i; j)); ei;j )
(1)
All the values are normalised in (0,1) range with Divide by Maximum
normalisation (for every entity). For L-columns, the con dence score is computed as
follows:
{ for L-columns with numeric datatype: The con dence score is calculated as
of the formula in Equation 2. Where a is the cell and b is the corresponding
candidate value in the KG and is a constant (experimentally xed at 100).
e 0:5[ (a b) ]2
(2)
{ for L-columns with string datatype: the con dence score is computed like in
the case of NE-column for short strings. For long strings, a Jaccard distance
is computed, because the number of edits required to change a long string
into another one is not necessarily signi cant while considering n-grams with
Jaccard is generally a better similarity measure.
{ for L-column with date datatype, the dates are considered as sortable
numeric values in the format YYYYMMDDHHmmSS. The con dence score is
computed in the same way as in the numeric case.</p>
        <p>As an example we consider the cell containing \superman returns" which can
refer to both a movie (Table 1) and a videogame (Table 2).</p>
        <p>Listing 1.6: Candidates for the cell \superman returns" of the Table 2.</p>
        <p>Considering literal values for the cell \batman begins" with the content of
the column \length in min" we are almost sure that the value is correct because
after sorting all the values we have the result shown in Listing 1.7.</p>
        <p>Listing 1.7: Literal values for the entity \Batman Begins ( lm)".
" P4632 ":
" label ": " Bechdel Test Movie List ID ",
" value ": 40
" confidence ": 0
" P2047 ":
" label ": " duration ",
" value ": 140
" confidence ": 1.0
" P3110 ":
" label ": " ISzDb film ID ",
" value ": 234
" confidence ": 0
v) Column Predicate Annotation (CPA). Considering that all the
necessary information is gathered in the previous phase, the CPA is a relatively fast
process. For each column, all the predicates are retrieved for every possible
candidate couple with con dence greater than 0 and they are sorted by their relative
frequency to the entire column. The predicate with the greatest frequency will
be ranked rst. This process allows to reduce the number of candidate entities
for every cell. Con dence scores for the predicates of the director column are
shown in Listing 1.8.</p>
        <p>When we consider the video game in Table 2, the CPA phase allowed us
to drop the \ lm" with the same name. This because it does not have any
relationship with the rest of the content of the table (the lm does not have
anything to do with \electronic arts", while the video game has a property with
exact match).</p>
        <p>Listing 1.8: Predicates for \Director" column.</p>
        <p>" P57 ":
2 "" lcaobneflid"e:nc"ed"i:rec1t.o0r ",
4 " P58 ":</p>
        <p>" label ": " screenwriter ",
6 " P162""c:onfidence ": 0.75
8 "" lcaobneflid"e:nc"ep"r:odu0c.e7r5 ",
10 " P161 ":</p>
        <p>" label ": " cast member ",
12 " confidence ": 0.25</p>
        <p>vi) Column Type Annotation (CTA). In this phase every candidate
entity ei;j 2 Ei;j is analysed again considering the CPA reordering. The process
is a bit di erent in DBpedia and Wikidata because the Wikidata KG contains
cycles. For reasons of space, in this paper, only the approach for Wikidata will be
considered. In the Wikidata version of MantisTable SE, we retrieve the concepts
corresponding to every candidate entity from LamAPI. For every column, we
use an in-memory structure storing frequencies of the most common datatypes,
as shown in Listing 1.9.</p>
        <p>Listing 1.9: Example of the structure storing the most frequent concepts.</p>
        <p>The frequencies are stored with by-max normalization in order to be
homogeneous (i.e. values are between 0 and 1). For every column, the winning concept
is chosen as follows:
{ concepts with a frequency lower than 0.95 are discarded in order to reduce
the needed computation on the fully connected Wikidata concepts graph;
{ in every single table, we consider every possible column pair and we count
inbound and outbound edges; the nal concept selected is the one having
the highest number of connections in the Wikidata graph.</p>
        <p>Listing 1.10: Con dence for rst column of the lm table (Table 1).
1 " Q229390 ":</p>
        <p>" label ": "3 D film ",
3 " Q114"2c4o"n:fidence ": 1.0
5 "" lcaobneflid"e:nc"ef"il:m "1.,0
7 " Q114"2l4a"be:l {": " live - action animated film ",
9 " confidence ": 0.33</p>
        <p>As both \ lm" and \3D lm" have the same con dence, we consider the
number of links in the Wikidata Graph. In our algorithm, we initially ranked
rst concepts with the highest number of links in the graph, but after nding
out experimentally that in the SemTab2020 challenge a most speci c concept is
preferred by the scoring system, we decided to consider the relationship between
the nal concepts. In this example, as \3D lm" is a child concept of \ lm", we
assign the most speci c concept, so \3D lm" is the nal winning concept. This
consideration should not be considered ideal for every dataset.</p>
        <p>Listing 1.11: Links in Wikidata Graph for concepts of rst column of Table 1.
1 " Q229390 ":</p>
        <p>" label ": "3 D film ",
3 " Q114"2l4i"nk:s ": 0
5 "" llaibnekls "":: "1 film ",</p>
        <p>vii) Revision. The revision phase analyses all the information gathered in
the previous phases in order to do a nal reordering of the candidates. In
particular, this allows for correcting CEA entities previously selected: every entity
have to be coherent with the rest of the concepts and predicates selected in every
column. Moreover, predicates are also re-ranked.</p>
        <p>viii) Export. The MantisTable SE approach previously described keeps the
candidates coming from each phase; all the results are stored. It is possible to
apply some thresholds during the export phase to reach a balance between
annotation quality and the number of annotations provided. The export threshold
can have a signi cant role in evaluation metrics for every gold standard.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Deployment</title>
      <p>As described above, the MantisTable SE approach has been integrated into a
web application developed with Python and Django (Figure 4a, Figure 4b). A
MongoDB database acts as table and KG repository. The code is freely available
through a Git repository7. In order to achieve the scalability of the application,
and therefore improve e ciency, MantisTable SE has been installed in a Docker
container to achieve parallelisms at the application level and to facilitate the
deployment on servers. The management of resources is performed by using
Task Queues (i.e. Celery Workers8). The GUI is current under development,
when ready it will allow to graphically explore annotations for every phase.
(a)
(b)
This section will report the results obtained by MantisTable SE in Rounds 1, 2,
3 of the SemTab 2020 challenge.</p>
      <sec id="sec-3-1">
        <title>7 bitbucket.org/disco unimib/mantistable-4 8 docs.celeryproject.org/en/latest/userguide/workers.html</title>
        <p>Tasks FR1oundP1 FR1oundP2 FR1oundP3
CEA 0.982 0.989 0.991 0.993 0.974 0.979
CTA 0.745 0.753 0.966 0.973 0.958 0.965</p>
        <p>CPA 0.888 0.942 0.961 0.966 0.941 0.957</p>
        <p>The results obtained in the rst three rounds shows that our method is highly
promising. Unfortunately, during Round4, our results are not representative. Due
to an infrastructural problem, it was not possible to complete the computation
in the given timeframe.
5</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and General comments</title>
      <p>MantisTable SE i) provides a comprehensive solution to support all annotation
steps; ii) provides an unsupervised method to annotate datasets of tables e
ciently; iii) generates context for disambiguation; iv) provides a tool to support
KG querying for STI tasks (LamAPI) and a work ow for STI (MantisTable SE)
which are publicly available.</p>
      <p>Talking about future improvements:
{ CEA does take in consideration only directly linked entities. This is likely
the most important source of missing/wrong annotations and is going to be
solved in the next version;
{ CPA is directly computed from CEA results but some ambiguities still
remain. A slightly more complex algorithm is currently under development;
{ for CTA annotations we will possibly use space-representations and
featurebased similarities (e.g. embeddings) to match missing concepts in the KG.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Zhang, Y.:
          <article-title>Webtables: Exploring the power of tables on the web</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <volume>538</volume>
          {549 (Aug
          <year>2008</year>
          ). https://doi.org/10.14778/1453856.1453916, https://doi.org/10.14778/1453856. 1453916
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cremaschi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Paoli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spahiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A fully automated approach to a complete semantic table interpretation</article-title>
          .
          <source>Future Generation Computer Systems</source>
          <volume>112</volume>
          ,
          <fpage>478</fpage>
          {
          <fpage>500</fpage>
          (
          <year>2020</year>
          ). https://doi.org/https://doi.org/10.1016/j.future.
          <year>2020</year>
          .
          <volume>05</volume>
          .019, http://www.sciencedirect.com/science/article/pii/S0167739X19302663
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fetahu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutraki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Tablenet: An approach for determining ne-grained relations for wikipedia tables</article-title>
          .
          <source>In: The World Wide Web Conference</source>
          . p.
          <volume>2736</volume>
          {
          <fpage>2742</fpage>
          . WWW '
          <volume>19</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2019</year>
          ). https://doi.org/10.1145/3308558.3313629, https://doi.org/10. 1145/3308558.3313629
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . pp.
          <volume>514</volume>
          {
          <fpage>530</fpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lehmberg</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A large public corpus of web tables containing time and context metadata</article-title>
          .
          <source>In: Proceedings of the 25th International Conference Companion on World Wide Web</source>
          . p.
          <volume>75</volume>
          {
          <fpage>76</fpage>
          . WWW '16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (
          <year>2016</year>
          ). https://doi.org/10.1145/2872518.2889386, https://doi.org/10.1145/2872518.2889386
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>E ective and e cient semantic table interpretation using tableminer+</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <issue>6</issue>
          ),
          <volume>921</volume>
          {
          <fpage>957</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>