MantisTable SE: an Efficient Approach for the Semantic Table Interpretation

Marco Cremaschi, Roberto Avogadro, Andrea Barazzetti, and David Chieregato
University of Milan - Bicocca, Viale Sarca 336, 20126 Milan, Italy
marco.cremaschi@unimib.it
{r.avogadro,d.chieregato,a.barazzetti1}@campus.unimib.it

Abstract. In this paper, we present a novel unsupervised and automatic approach for Semantic Table Interpretation (STI). The technique presented is performed against DBpedia and Wikidata, and it can be easily adapted to any other Knowledge Graph (KG). Moreover, we provide a tool (LamAPI) that allows the data needed for STI tasks to be fetched efficiently from the KG dumps.

Keywords: Semantic Table Interpretation · DBpedia · Wikidata · Knowledge Graph

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

STI is a research field in continuous evolution, with increasing interest over time. To mention a few numbers that describe the phenomenon:

– Web Tables: the WebTables system [1] extracts 14.1 billion HTML tables and finds that 154 million of them (1.1%) are high-quality tables;
– Web Tables: Lehmberg et al. [5] extract 233 million content tables (2.25% of all tables) from the Common Crawl 2015;
– Wikipedia Tables: the current snapshot of Wikipedia contains more than 3.23 million tables from 520k articles [3];
– Spreadsheets: the number of worldwide spreadsheet users is estimated to exceed 400 million, and about 50% to 80% of businesses use spreadsheets.

The input of STI is: i) a well-formed and normalised relational table (i.e. a table with headers and simple values, thus excluding nested and figure-like tables), such as the one in Figure 1, and ii) a KG which describes real-world entities in the domain of interest (i.e. a set of concepts, datatypes, predicates, instances, and the relations among them), such as the example in Figure 2. The output returned is a semantically annotated table, as shown in Figure 3.

[Fig. 1: Example of a well-formed relational table, with notes that are used in this paper.]
[Fig. 2: A sample of Knowledge Graph.]
[Fig. 3: Example of an annotated table.]

This work is an improvement and extension of MantisTable [2]. Our approach initially used DBpedia as the matching KG because it was the richest data source with ground truths available; we later adapted the approach to Wikidata because of its increasing popularity and in order to take part in the SemTab 2020 challenge [4]. The requirement to adapt the approach to Wikidata drove the development of a new approach, and in this way we showed that the approach can easily be adapted to any KG. We refer to the new approach as "MantisTable SE"; it is the fourth open-source implementation of MantisTable.

2 Methodology

As seen in Section 1, to obtain the STI of tabular data, it is necessary to link the elements of the table with the elements of a KG. The elements of KGs (e.g. DBpedia or Wikidata) are frequently stored in Resource Description Framework (RDF) format, so accessing them requires querying a SPARQL endpoint. For instance, the most popular way to access DBpedia dumps is by using OpenLink Virtuoso (http://virtuoso.openlinksw.com), a row-wise transaction-oriented RDBMS with a SPARQL query engine over an RDF graph store. Wikidata instead uses Blazegraph (https://blazegraph.com), a high-performance graph database supporting Blueprints and RDF/SPARQL APIs.
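For reference, this conventional access pattern looks as follows: a minimal example against the public DBpedia endpoint using the SPARQLWrapper library. The query and entity are illustrative, and this snippet is not part of MantisTable SE.

    # Minimal example of querying DBpedia via its public SPARQL endpoint
    # (illustrative only; MantisTable SE replaces such queries with LamAPI).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?predicate ?object
        WHERE { dbr:Jurassic_World ?predicate ?object }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["predicate"]["value"], binding["object"]["value"])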
The issue faced with these solutions is the time required for importing the data: the Wikidata 2019 dump takes several days to set up (https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/). Another problem is the amount of information present in a KG; for instance, the Wikidata dump is about 1.1 TB uncompressed, while the English version of DBpedia is split into multiple files amounting to about 26 GB. This leads to high computation times to obtain a complete STI (e.g. TableMiner+, according to its author [6], took 13.35 hours to process the Limaye200 dataset).

However, not all the information present in a KG is necessary to carry out STI. Therefore, in order to obtain an efficient approach, it is necessary to identify other ways of querying the KG. To this end, the approach described in this paper does not use SPARQL queries; instead, it queries indexes built on the DBpedia and Wikidata dump files. These indexes are accessible through four different API services that allow us to replace all the SPARQL queries. The tool that provides these APIs is called LamAPI; it is open source, and the code is available as a Git repository (https://bitbucket.org/disco_unimib/lamapi). LamAPI provides the following services (sample responses are shown in Listings 1.1-1.4 below):

1. Label matching: given a free text (in this case the content of a table cell), it retrieves the entities with the greatest label similarity, using ElasticSearch's default scoring algorithm with full-text search enabled. For instance, for a table cell containing "Jurassic World" the result returned is shown in Listing 1.1 or Listing 1.2;
2. Predicate matching: given two entities, it retrieves all the predicates between them; considering the entity "Jurassic World" and the entity "Colin Trevorrow", the list of predicates is shown in Listing 1.4;
3. Object matching: given an entity, it retrieves all the related objects and predicates; for example, for the entity "Jurassic World" the result is the one shown in Listing 1.4;
4. Concept matching: given an entity, it retrieves all its concepts, as shown in Listing 1.3.
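Before the sample responses, a minimal sketch of how such services can be called over HTTP; the base URL, endpoint path and parameter names are illustrative assumptions, not LamAPI's documented interface.

    # Hypothetical LamAPI client sketch; endpoint and parameter names are
    # assumptions for illustration, not LamAPI's documented interface.
    import requests

    LAMAPI_BASE = "http://localhost:8000"  # assumed local LamAPI deployment

    def label_matching(text, kg="wikidata"):
        """Retrieve candidate entities whose labels are similar to `text`."""
        response = requests.get(f"{LAMAPI_BASE}/labels",
                                params={"query": text, "kg": kg})
        response.raise_for_status()
        return response.json()["results"]

    for candidate in label_matching("Jurassic World"):
        print(candidate["uri"], candidate["label"])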
" : [ " dbo : budget " : [ 4 " dbo : director " 5 " 1.5 E8 " ] ], 6 } 7 " dbo : gross " : [ Object " 1.67 E9 " 8 { 9 ], " Jur assic_W orld " : { " foaf : name " : [ 10 " dbo : director " : [ 11 " Jurassic World " " Co li n _T re vo r ro w " ], 12 ], 13 " dbo : runtime " : [ " dbo : musicComposer " : [ " 7440.0 " 14 " Michael_Giacchino " 15 ] ], } 16 " dbo : cin ematogra phy " : [ 17 } " John_Schwartzman " Concept 18 ], 19 { " dbo : editing " : [ " Ju rassic_W orld " : [ 20 " Kevin_Stitt " 21 " Film " , ], " Work " 22 " dbo : distributor " : [ 23 ] " Universal_Studios " } 24 ] The loading of the labels (as shown in Listing 1.1 and Listing 1.2) takes place through the use of pre-computed JSON files. The other three services are made with Redis5 key-value database. As a search engine for the labels service, we use an instance of ElasticSearch6 that supports HTTP GZip natively, giving a performance boost. KG’s dump files need some pre-processing in order to be used for this system, in particular we use namespace abbreviations (e.g. dbo: for dbpedia.org/ontology) in order to save space and we avoid data duplication. In the following, we will focus on the algorithmic process of our MantisTable SE. The process is organised into eight phases as follows: i) Data Preparation and Normalisation, ii) Column Analysis and Subject Detection, iii) Data Retrieval, iv) Cell Entity Annotation (CEA), v) Column Predicate Annotation (CPA), vi) Column Type Annotation (CTA), vii) Revision and viii) Export. The process was designed to retrieve the candidate for a given cell only once, especially if the content of that cell is repeated across multiple tables. This allows 5 https://redis.io/ 6 https://www.elastic.co/ MantisTable: an Efficient Approach for the STI the approach to avoid repeating queries for the same content and saves network time, but the drawback is more difficult memory management during execution. Every phase must be completed for all the tables before going to the next one. This does not preclude running the tool against only one table, but when running with multiple tables the execution time will be sharply reduced (for the same text in a cell we have the same candidate entities that will be sorted considering the content of every table. This will be explained more in detail in the CEA phase). For examples, we will refer to Table 1. SUBJ NE LIT NE LIT LIT title director release year domestic distributor length in min worldwide gross jurassic world colin trevorrow 2015 universal pictures 124 1670400637 superman returns bryan singer 2006 warner bros 154 391081192 batman begins christopher nolan 2005 warner bros 140 371853783 avatar james cameron 2009 20 century fox 162 2744336793 Table 1: Example of a table containing a list of movies. i) Data Preparation and Normalisation. During this phase, all the cells of all the tables are analysed using a tokeniser managing special characters and removing parenthesis and the text block inside them. ii) Column Analysis and Subject Detection. During Column Analysis we identify literal columns (L-column) by using a set of regex [2] (i.e. BOOLEAN, DATE, EMAIL, GEOCORDS, INTEGER, REAL, ISBN, URL) to identity dif- ferent datatypes. The other columns are annotated as Named Entity column (NE-column). The subject column (S-column) can be identified between NE- column thanks to content-based scores, but it will not be discussed in this ap- proach as it would not introduce anything new from [2,6]. 
iii) Data Retrieval. For each normalised cell, the candidate entities are retrieved from the Label Matching API service. Retrieving data means seeking the entities in the KG whose labels are similar to the content of our Named Entity cells. For example, if we query "jurassic world" on LamAPI for Wikidata we get a list of candidates:

– Q3512046 ("Jurassic World" - movie);
– Q18615494 ("Jurassic World" - comic strip);
– Q55615459 ("Jurassic World" - attraction);
– thousands more results.

The candidates will be ranked during the next phases. Cells of NE-columns that have the same string value in different tables start from the same set of candidate entities; this set will then be sorted differently, considering the relationships with the entire content of each table.

iv) Cell Entity Annotation (CEA). For each cell of the S-column and of the NE-columns, a confidence score is calculated by computing the edit distance (Levenshtein distance) between the labels (in different languages) of each candidate entity e_{i,j} ∈ E_{i,j} and the content of the cell tx(i,j):

    1 − norm(LevenshteinDistance(tx(i,j), e_{i,j}))    (1)

All the values are normalised to the (0,1) range with divide-by-maximum normalisation (for every entity). For L-columns, the confidence score is computed as follows:

– for L-columns with numeric datatype, the confidence score is calculated with the formula in Equation 2, where a is the cell value, b is the corresponding candidate value in the KG, and σ is a constant (experimentally fixed at 100):

    e^{−0.5((a−b)/σ)²}    (2)

– for L-columns with string datatype, the confidence score is computed as in the case of NE-columns for short strings. For long strings, a Jaccard distance over n-grams is computed instead, because the number of edits required to change a long string into another one is not necessarily significant, while Jaccard over n-grams is generally a better similarity measure;
– for L-columns with date datatype, the dates are treated as sortable numeric values in the format YYYYMMDDHHmmSS, and the confidence score is computed as in the numeric case.

As an example, consider the cell containing "superman returns", which can refer both to a movie (Table 1) and to a video game (Table 2).

    videogame         publisher        release date
    superman returns  electronic arts  2006
    pokemon white     nintendo         2010
    call of duty      activision       2003

    Table 2: Example of a table with elements equal to Table 1, but a different domain.

Listing 1.5: Candidates for the cell "superman returns" of the movies table (Table 1).

    "Q328695":
      "label": "Superman Returns",
      "instance_of": ["3D film", "film"],
      "confidence": 0.8
    "Q655031":
      "label": "Superman Returns",
      "instance_of": ["video game"],
      "confidence": 0.2
    "Q3977963":
      "label": "Superman Returns",
      "instance_of": ["album"],
      "confidence": 0.2

Listing 1.6: Candidates for the cell "superman returns" of Table 2.

    "Q655031":
      "label": "Superman Returns",
      "instance_of": "video game",
      "confidence": 1.0
    "Q7643850":
      "label": "Superman Returns: Fortress of Solitude",
      "instance_of": "video game",
      "confidence": 0.41
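A minimal sketch of the two scoring functions in Equations 1 and 2 is given below; the function names are ours, and the python-Levenshtein package is assumed.

    # Sketch of the CEA confidence scores (Eq. 1 and Eq. 2); function names
    # are ours, and the python-Levenshtein package is assumed.
    import math
    from Levenshtein import distance as levenshtein

    def cea_confidences(cell_text, candidate_labels):
        """Eq. 1 with divide-by-maximum normalisation of the edit distances."""
        distances = [levenshtein(cell_text.lower(), label.lower())
                     for label in candidate_labels]
        max_distance = max(distances) or 1
        return [1 - d / max_distance for d in distances]

    def numeric_confidence(a, b, sigma=100):
        """Eq. 2: Gaussian similarity between a cell value and a KG value."""
        return math.exp(-0.5 * ((a - b) / sigma) ** 2)

    print(cea_confidences("superman returns",
                          ["Superman Returns",
                           "Superman Returns: Fortress of Solitude"]))
    print(numeric_confidence(140, 140))  # 1.0 for an exact duration match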
Considering the literal values for the cell "batman begins" together with the content of the column "length in min", we can be almost certain that the match is correct, because after sorting all the values we obtain the result shown in Listing 1.7.

Listing 1.7: Literal values for the entity "Batman Begins (film)".

    "P4632":
      "label": "Bechdel Test Movie List ID",
      "value": 40,
      "confidence": 0
    "P2047":
      "label": "duration",
      "value": 140,
      "confidence": 1.0
    "P3110":
      "label": "ISzDb film ID",
      "value": 234,
      "confidence": 0

v) Column Predicate Annotation (CPA). Since all the necessary information has been gathered in the previous phase, the CPA is a relatively fast process. For each column, all the predicates are retrieved for every possible candidate pair with confidence greater than 0, and they are sorted by their relative frequency over the entire column. The predicate with the greatest frequency is ranked first. This process also allows the number of candidate entities for every cell to be reduced. Confidence scores for the predicates of the director column are shown in Listing 1.8. When we consider the video game in Table 2, the CPA phase allows us to drop the "film" with the same name, because it does not have any relationship with the rest of the content of the table: the film has nothing to do with "electronic arts", while the video game has a property with an exact match.

Listing 1.8: Predicates for the "director" column.

    "P57":
      "label": "director",
      "confidence": 1.0
    "P58":
      "label": "screenwriter",
      "confidence": 0.75
    "P162":
      "label": "producer",
      "confidence": 0.75
    "P161":
      "label": "cast member",
      "confidence": 0.25
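The frequency-based ranking can be sketched as follows; the input layout is an illustrative assumption.

    # Sketch of the CPA ranking: predicates scored by relative frequency
    # over a column's rows. The input layout is an illustrative assumption.
    from collections import Counter

    def cpa_scores(predicates_per_row):
        """predicates_per_row: for each row, the predicates found between the
        subject-cell candidates and this column's candidates (confidence > 0)."""
        counts = Counter(p for row in predicates_per_row for p in set(row))
        n_rows = len(predicates_per_row)
        return {p: count / n_rows for p, count in counts.most_common()}

    rows = [["P57", "P58", "P162"], ["P57", "P58", "P162", "P161"],
            ["P57", "P58"], ["P57", "P162"]]
    print(cpa_scores(rows))
    # {'P57': 1.0, 'P58': 0.75, 'P162': 0.75, 'P161': 0.25}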
vi) Column Type Annotation (CTA). In this phase every candidate entity e_{i,j} ∈ E_{i,j} is analysed again, considering the CPA reordering. The process is slightly different for DBpedia and Wikidata, because the Wikidata KG contains cycles. For reasons of space, only the approach for Wikidata is considered in this paper. In the Wikidata version of MantisTable SE, we retrieve the concepts corresponding to every candidate entity from LamAPI. For every column, we use an in-memory structure storing the frequencies of the most common concepts, as shown in Listing 1.9.

Listing 1.9: Example of the structure storing the most frequent concepts.

    "movie_table":
      "0":
        "Q229390": 1,
        "Q11424": 1,
        "Q25110269": 0.33
      "1":
        "Q5": 1
      "3":
        "Q1762059": 1,
        "Q375336": 0.66,
        "Q1107679": 0.33,
        "Q18127": 0.33,
        "Q10689397": 0.33

The frequencies are stored with by-max normalisation in order to be homogeneous (i.e. values are between 0 and 1). For every column, the winning concept is chosen as follows:

– concepts with a frequency lower than 0.95 are discarded, in order to reduce the computation needed on the densely connected Wikidata concept graph;
– in every single table, we consider every possible column pair and we count inbound and outbound edges; the final concept selected is the one with the highest number of connections in the Wikidata graph.

Listing 1.10: Confidence for the first column of the film table (Table 1).

    "Q229390":
      "label": "3D film",
      "confidence": 1.0
    "Q11424":
      "label": "film",
      "confidence": 1.0
    "Q25110269":
      "label": "live-action animated film",
      "confidence": 0.33

As both "film" and "3D film" have the same confidence, we consider the number of links in the Wikidata graph. In our algorithm, we initially ranked first the concepts with the highest number of links in the graph; however, after finding out experimentally that the SemTab 2020 scoring system prefers the most specific concept, we decided to also consider the relationships between the final candidate concepts. In this example, since "3D film" is a child concept of "film", we assign the most specific concept, so "3D film" is the final winning concept. This choice should not be considered ideal for every dataset.

Listing 1.11: Links in the Wikidata graph for the concepts of the first column of Table 1.

    "Q229390":
      "label": "3D film",
      "links": 0
    "Q11424":
      "label": "film",
      "links": 1
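The overall decision can be sketched as follows; the data structures, threshold handling and tie-breaking order are illustrative assumptions based on the description above.

    # Sketch of the CTA decision for a column: keep concepts whose normalised
    # frequency is at least 0.95, prefer the most specific concept (a child
    # beats its ancestors), and fall back to the KG link count. The data
    # structures here are illustrative assumptions.
    def pick_winning_concept(freq, links, parents):
        candidates = [c for c, f in freq.items() if f >= 0.95]
        # discard any candidate that is an ancestor of another candidate
        specific = [c for c in candidates
                    if not any(c in parents.get(other, set())
                               for other in candidates)]
        return max(specific, key=lambda c: links.get(c, 0))

    freq = {"Q229390": 1.0, "Q11424": 1.0, "Q25110269": 0.33}  # cf. Listing 1.10
    links = {"Q229390": 0, "Q11424": 1}                        # cf. Listing 1.11
    parents = {"Q229390": {"Q11424"}}                          # "3D film" is a "film"
    print(pick_winning_concept(freq, links, parents))          # Q229390 ("3D film")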
vii) Revision. The revision phase analyses all the information gathered in the previous phases in order to perform a final reordering of the candidates. In particular, this allows the CEA entities previously selected to be corrected: every entity has to be coherent with the rest of the concepts and predicates selected for every column. Moreover, the predicates are also re-ranked.

viii) Export. The MantisTable SE approach previously described keeps the candidates coming from each phase; all the results are stored. It is possible to apply thresholds during the export phase to reach a balance between annotation quality and the number of annotations provided. The export threshold can have a significant role in the evaluation metrics for every gold standard.

3 Deployment

As described above, the MantisTable SE approach has been integrated into a web application developed with Python and Django (Figure 4a, Figure 4b). A MongoDB database acts as the table and KG repository. The code is freely available through a Git repository (https://bitbucket.org/disco_unimib/mantistable-4). In order to achieve scalability, and therefore improve efficiency, MantisTable SE has been installed in a Docker container, to achieve parallelism at the application level and to facilitate deployment on servers. The management of resources is performed by using task queues (i.e. Celery workers, https://docs.celeryproject.org/en/latest/userguide/workers.html). The GUI is currently under development; when ready, it will allow annotations to be explored graphically for every phase.

[Fig. 4: MantisTable SE page with CTA (4a) and CPA (4b) results.]

4 Results

This section reports the results obtained by MantisTable SE in Rounds 1, 2 and 3 of the SemTab 2020 challenge.

           Round 1        Round 2        Round 3
    Tasks  F1     P       F1     P       F1     P
    CEA    0.982  0.989   0.991  0.993   0.974  0.979
    CTA    0.745  0.753   0.966  0.973   0.958  0.965
    CPA    0.888  0.942   0.961  0.966   0.941  0.957

The results obtained in the first three rounds show that our method is highly promising. Unfortunately, our Round 4 results are not representative: due to an infrastructural problem, it was not possible to complete the computation in the given timeframe.

5 Conclusions and General Comments

MantisTable SE i) provides a comprehensive solution to support all annotation steps; ii) provides an unsupervised method to annotate datasets of tables efficiently; iii) generates context for disambiguation; and iv) provides a tool to support KG querying for STI tasks (LamAPI) and a workflow for STI (MantisTable SE), both of which are publicly available. As for future improvements:

– CEA takes into consideration only directly linked entities. This is likely the most important source of missing or wrong annotations, and it is going to be solved in the next version;
– CPA is computed directly from the CEA results, but some ambiguities still remain. A slightly more complex algorithm is currently under development;
– for CTA annotations, we will possibly use space representations and feature-based similarities (e.g. embeddings) to match missing concepts in the KG.

References

1. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: Exploring the power of tables on the web. Proc. VLDB Endow. 1(1), 538–549 (Aug 2008). https://doi.org/10.14778/1453856.1453916
2. Cremaschi, M., De Paoli, F., Rula, A., Spahiu, B.: A fully automated approach to a complete semantic table interpretation. Future Generation Computer Systems 112, 478–500 (2020). https://doi.org/10.1016/j.future.2020.05.019
3. Fetahu, B., Anand, A., Koutraki, M.: TableNet: An approach for determining fine-grained relations for Wikipedia tables. In: The World Wide Web Conference. pp. 2736–2742. WWW '19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3308558.3313629
4. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In: European Semantic Web Conference. pp. 514–530. Springer (2020)
5. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 75–76. WWW '16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2016). https://doi.org/10.1145/2872518.2889386
6. Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semantic Web 8(6), 921–957 (2017)