=Paper= {{Paper |id=Vol-3103/paper7 |storemode=property |title=MantisTable V: A novel and efficient approach to Semantic Table Interpretation |pdfUrl=https://ceur-ws.org/Vol-3103/paper7.pdf |volume=Vol-3103 |authors=Roberto Avogadro,Marco Cremaschi |dblpUrl=https://dblp.org/rec/conf/semweb/AvogadroC21 }} ==MantisTable V: A novel and efficient approach to Semantic Table Interpretation== https://ceur-ws.org/Vol-3103/paper7.pdf
MantisTable V: a novel and efficient approach to
        Semantic Table Interpretation

                     Roberto Avogadro[0000−0001−8074−7793] and
                       Marco Cremaschi[0000−0001−7840−6228]

                           University of Milano - Bicocca
                  {roberto.avogadro,marco.cremaschi}@unimib.it



       Abstract. In this paper, we present MantisTable V, a novel unsuper-
       vised and automatic approach to Semantic Table Interpretation.
       The approach is performed against DBpedia and Wikidata, and it can
       be easily adapted to any other Knowledge Graph. Moreover, we provide
       a tool (LamAPI) that efficiently fetches the data needed for Semantic
       Table Interpretation tasks from Knowledge Graph dumps. The ap-
       proach is manageable through a User Interface (tUI), a separate tool
       which allows the visualisation and modification of table data and seman-
       tic annotations.

       Keywords: Semantic Web · Knowledge Graph · Semantic Table Inter-
       pretation · Table Understanding · DBpedia · Wikidata · User Interface


1    Introduction

Semantic Table Interpretation (STI) is a continuously evolving research field
that has attracted increasing interest over time, also due to the wide diffusion
of tabular data on the Web. The input of STI is: i) a well-formed and normalised
relational table (i.e., a table with headers and simple values, thus excluding
nested and figure-like tables), such as the one in Fig. 1, and ii) a Knowledge
Graph (KG) which describes real-world entities in the domain of interest (i.e., a
set of concepts, datatypes, predicates, instances, and the relations among them),
such as the example in Fig. 2. The output is a semantically annotated table, as
shown in Fig. 3.
    The STI process is composed of the following main annotation steps:
i) semantic classification of columns, which takes into account the values
of a column to mark it as a Literal column (L-column) if the values are datatypes
(e.g., strings, numbers, dates, etc., such as 2015, 10/04/1983, etc.), or as a Named-
Entity column (NE-column) if the values are concepts (e.g., Film, Director, etc.,
such as Jurassic World, Colin Trevorrow, etc.); ii) detection of the subject column
(S-column), which identifies the main column (the one all the others are referring
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

to) among the NE-columns identified in the previous step (e.g., the Title column
in Fig. 3); iii) concept and datatype annotation, which associates NE-columns
with a concept in the KG (e.g., the column Title is associated with Film in
Wikidata1 ), and L-columns with a datatype in the KG (e.g., the column Year is
of type date); and iv) predicate annotation, which identifies the relations between
the S-column and the other columns (e.g., the predicate publication date between
Film and Year).
    Each of the above steps is performed by annotating column values with
reference to existing KGs. For example, in Fig. 3, if the majority of entities in
the Title column are associated with Film, these entities are of type Film.
Similarly, publication date can be identified as the predicate connecting
entities in the Title column with datatypes of type date in the Year column.
    Unfortunately, such clear-cut situations are not common; we therefore need
to devise strategies and algorithms to address several issues.




Fig. 1. Example of a well-formed relational table, with labels that are used in this
paper.




                      Fig. 2. A sample of Knowledge Graph.


    This work is an improvement and extension of MantisTable (seMantics Table)
[2] and MantisTable SE [1]. We will refer to our new approach as “MantisTable
V”, the 5th open-source implementation of MantisTable. Compared to Man-
tisTable, MantisTable V is characterised by a complete refactoring due to a sub-
stantial modification of the annotation process, now no longer procedural but
1
    www.wikidata.org/wiki/Q11424




                      Fig. 3. Example of an annotated table.


iterative, to improve the cell disambiguation process. Compared to MantisTable
SE, this new version of the approach can deal with different types of tables (with
respect to the number of columns) and uses new algorithms to identify and
classify L-columns. It also manages ambiguity in annotations more effectively,
as will be analysed in the following sections. Our approach uses DBpedia and
Wikidata as the first matching Knowledge Graphs (KGs) because they are the
richest data sources with ground truths available. MantisTable V can be easily
adapted to any KG through a new tool called LamAPI (Label matching API ).
Thanks to LamAPI and its index systems, it is possible to efficiently search for a
particular entity by its ID or via full-text search. Together with MantisTable V
and LamAPI, we have developed a user interface (tUI) that allows tables and
annotations to be managed and updated. tUI (table User Interface) is a fully
configurable tool which can be used with any STI approach.
     The main contributions of this paper are: (i) MantisTable V, a comprehen-
sive approach which deals with all phases of the STI process; (ii) LamAPI, an
open-source tool to efficiently manage and retrieve KG data; (iii) tUI, a fully
configurable open-source UI to display, manage and update tables and semantic
annotations. All tools have been encapsulated in Docker containers to facilitate
deployment and scalability by replication.
     The remainder of the paper is organised as follows: Section 2 describes the
functionalities of the LamAPI tool, while Section 3 describes MantisTable V.
Details on tUI are presented in Section 4. Section 5 introduces the Gold
Standards and discusses the evaluation results. Finally, conclusions and pointers
to future work are presented in Section 6.


2   Data management for an efficient STI with LamAPI

As seen in Section 1, to obtain the STI of tabular data, it is necessary to link ele-
ments of the table with elements of a KG. The elements of KGs (e.g., DB-
pedia or Wikidata) are frequently stored in Resource Description Framework
(RDF) format, so accessing them requires querying a SPARQL endpoint.
For instance, the most popular way to access DBpedia dumps is by using Open-
Link Virtuoso, a row-wise transaction-oriented RDBMS with a SPARQL query

engine to access the RDF graph store2 . Wikidata instead uses Blazegraph3 , a
high-performance graph database supporting RDF/SPARQL APIs. The issue
with these solutions is the time required for importing the data: the 2019 Wiki-
data dump requires several days to set up4 . Another problem is the sheer amount
of information present in a KG; for instance, the Wikidata dump is
about 1.1TB (uncompressed), while the English version of DBpedia is split
into multiple files, 26GB in size, which leads to high computation times
to obtain a complete STI (e.g., according to the author of [8], TableMiner+ took
13.35 hours to process the Limaye200 dataset). However, not all the information
present in a KG is necessary to carry out STI.
    Therefore, in order to obtain an efficient approach, it is necessary to identify
other ways of querying a KG. In the state of the art, two works can be
identified: FactBase [4] and the Knowledge Graph ToolKit (KGTK) [5]. The
FactBase index5 is a manually built, generic search index over Wikidata entries:
it takes the cells of a table column as input and returns the top-k
candidate entities for each cell. The KGTK framework6 supports the creation
and exploitation of large KGs, such as Wikidata; however, the authors suggest
using only parts of a KG. The approach described in this paper does not use
SPARQL queries but queries indexes built on the entire DBpedia and Wikidata.
These indexes are accessible through five different API services.
    The open-source tool that provides these APIs is called LamAPI (Label
matching API )7 and provides the following services:

 1. Lookup: given a free text (in this case the content of a table cell), it
    retrieves the entities with the greatest similarity, using the IB similarity8
    scoring algorithm of ElasticSearch, which combines different search strate-
    gies (i.e., full-text search based on tokens, n-grams and fuzzy search).
    ElasticSearch hosts an index of the KG entities to improve the per-
    formance. For a table cell containing “Jurassic World”, the results
    returned are shown in Listing 1.1 and Listing 1.2;
 2. Concepts: given an entity, it retrieves all its concepts, as shown in Listing
    1.3. This service can automatically extend the list of concepts associated
    with a given entity through the use of vector similarity measures between
    the different concepts in the KG. Thanks to this functionality, it is possible
    to extend the candidates associated with a cell;
 3. Literals: given an entity, it retrieves all the related literal values and pred-
    icates, as shown in Listing 1.3;
2
  virtuoso.openlinksw.com
3
  blazegraph.com
4
  addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
5
  www.cs.toronto.edu/~oktie/webtables
6
  github.com/usc-isi-i2/kgtk/
7
  bitbucket.org/disco_unimib/lamapi/
8
  www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html

 4. Predicates: given two entities, it retrieves all the predicates between them;
    considering the entity “Jurassic World” and the entity “Colin Trevorrow”,
    the list of predicates is shown in Listing 1.4;
 5. Objects: given an entity, it retrieves all the related objects and predicates;
    for example, for the entity “Jurassic World” the result is the one shown in
    Listing 1.4.

           Listing 1.1. Wikidata lookup.
"id": "Q3512046"
"name": "Jurassic World"
"types": {
      "id": "Q229390"
      "name": "3D film"
      },{
      "id": "Q11424"
      "name": "film"
      }
"id": "Q21877685"
"name": "Jurassic World"
"types": {
      "id": "Q3512046"
      "name": "Jurassic World"
      },{
      "id": "Q3512046"
      "name": "Jurassic World"
      }

           Listing 1.2. DBpedia lookup.
"id": "Jurassic_World"
"name": "Jurassic World"
"types": {
      "id": "Film"
      "name": "Film"
      },{
      "id": "Work"
      "name": "Work"
      }
"id": "Jurassic_Park"
"name": "Jurassic World 2"
"types": {
      "id": "Film"
      "name": "Film"
      },{
      "id": "SportsTeam"
      "name": "SportsTeam"
      }

    The data in DBpedia9 have been preprocessed before being integrated into
LamAPI.
     Listing 1.3. Query result - Concepts and Literals.
Concepts
"Jurassic_World":
    "rdf:type":
        "Film"
        "Work"
Literals
"Jurassic_World":
    "number":
        "dbo:budget":
            "1.5E8"
        "dbo:gross":
            "1.67E9"
        "dbo:runtime":
            "7440.0"
    "string":
        "foaf:name":
            "Jurassic World"

     Listing 1.4. Query results - Predicates and Objects.
Predicate
"Jurassic_World Colin_Trevorrow":
    "dbo:director"

Object
"Jurassic_World":
    "Colin_Trevorrow":
        "dbo:director"
    "Michael_Giacchino":
        "dbo:musicComposer"
    "John_Schwartzman":
        "dbo:cinematography"
    "Kevin_Stitt":
        "dbo:editing"
    "Universal_Studios":
        "dbo:distributor"
    Unlike DBpedia, Wikidata releases a single new dump file of large dimen-
sions every week10 . For Wikidata, we had to make a different design decision
in order to support multiple languages.
    This new way of accessing DBpedia/Wikidata provided by LamAPI over-
comes the limitations of SPARQL endpoints, such as:
 – SPARQL endpoint response times are directly proportional to the size of
   the returned data; sometimes it is not even possible to get a result because
   the endpoint returns a timeout;
 – the volume of requests per second is limited (online endpoint) or computa-
   tionally expensive (a local endpoint requires at least 64GB of RAM and a
   large amount of CPU cycles);
     9
          wiki.dbpedia.org/downloads-2016-10
     10
          dumps.wikimedia.org/wikidatawiki/entities/

 – there are some intrinsic limits to the expressiveness of the SPARQL language
   (e.g., the full-text search capability that is useful for label matching can only
   be obtained with extremely slow “contains” or “regex” queries11 ).

      LamAPI is developed using ElasticSearch, MongoDB and Python12 .




Fig. 4. LamAPI documentation page with Swagger.

Fig. 5. Documentation of the LamAPI Lookup service.




3      The MantisTable V approach

This Section focuses on the algorithmic process of MantisTable V13 . The pro-
cess is organised into seven phases: i) Data Preparation and Normal-
isation, ii) Column Analysis and Subject Detection, iii) Cell Entity Annotation
(CEA), iv) Column Predicate Annotation (CPA), v) Column Type Annotation
(CTA), vi) Revision, and vii) Export.
    The process was designed to retrieve the candidates for a given cell only once,
especially if the content of that cell is repeated across multiple tables. This
allows the approach to avoid repeating queries for the same content and saves
network time. Every phase must be completed for all the tables in a dataset
before moving on to the next one. This does not preclude running the tool against
a single table, but when running with multiple tables the execution time is
sharply reduced (for the exact same text in a cell, we have the same candidate
entities, which are then sorted considering the content of each table; this will be
explained in more detail in phase iii). To describe each phase of the STI approach,
consider Table 1, which lists some movies with additional information, such as
the director and release year.
11
   docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/
12
   lamapi.ml
13
   bitbucket.org/disco_unimib/mantistable-v/

           Table 1. Illustrative movies table that will be used for examples.

title            director          release year domestic distributor length in min worldwide gross
Jurassic World   Colin Trevorrow 2015           Universal Pictures    124          1,670,400,637
Superman Returns Bryan Singer      2006         Warner Bros.          154          391,081,192
Batman Begins    Christopher Nolan 2005         Warner Bros.          140          371,853,783
Avatar           James Cameron     2009         Twentieth Century Fox 162          2,744,336,793




i. Data Preparation and Normalisation. During this phase, all table cells are
analysed using a tokeniser that manages special characters and extra spaces.
For each normalised cell, the candidate entities are retrieved from the Lookup
service of LamAPI. The obtained candidates will be ranked during the next
phases.

ii. Column Analysis and Subject Detection. During Column Analysis we iden-
tify literal columns (L-columns) by using a set of Regextypes [2] (i.e., boolean,
date, email, geocoords, integer, float, ISBN, URL, XPath, CSS) to identify dif-
ferent datatypes. If the number of occurrences of the most frequent Regextype
detected exceeds a given threshold, the column is annotated as an L-column
and the most frequent Regextype is assigned to the column under analy-
sis; otherwise, the column is annotated as an NE-column. The subject column
(S-column) can be identified among the Named-Entity columns (NE-columns)
thanks to content-based scores; this step is not discussed here as it does not
differ from [2]. In the film example (Table 1), the columns release year, length
in min, and worldwide gross are tagged as L-columns, director and domestic
distributor are NE-columns, and title is the S-column.
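The L-column detection step can be sketched as follows; the Regextype subset and the 0.6 threshold are illustrative assumptions, not the exact patterns and threshold used by MantisTable V:

```python
import re
from collections import Counter

# Illustrative subset of Regextypes; the actual approach uses more patterns
# (boolean, email, geocoords, ISBN, URL, XPath, CSS, ...).
REGEXTYPES = {
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}$"),
    "float": re.compile(r"^-?\d+\.\d+$"),
    "integer": re.compile(r"^-?\d{1,3}(,\d{3})*$|^-?\d+$"),
}

def classify_column(values, threshold=0.6):
    """Annotate a column as ('L-column', datatype) when the most frequent
    Regextype exceeds the threshold, otherwise as ('NE-column', None)."""
    hits = Counter()
    for value in values:
        for name, pattern in REGEXTYPES.items():
            if pattern.match(value.strip()):
                hits[name] += 1
                break  # the first matching Regextype wins
    if hits:
        datatype, count = hits.most_common(1)[0]
        if count / len(values) > threshold:
            return "L-column", datatype
    return "NE-column", None
```

On the columns of Table 1, this sketch would tag release year as an L-column of type date and director as an NE-column, since person names match none of the patterns.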

iii. Cell Entity Annotation (CEA). In the first step of this phase, the approach
performs entity linking on NE-columns by querying the Lookup service of
LamAPI with the content of a cell tx(i, j). The content of the cell tx(i, j)
and the candidate entities Ei,j ⊆ E are used to disambiguate the content of the
cell by considering their degree of similarity. For each cell, a confidence score is
calculated by computing the edit distance (Levenshtein distance) between the
labels (in different languages) of each candidate entity ei,j ∈ Ei,j and the content
of the cell tx(i, j):

                    1 − norm(LevenshteinDistance(tx(i, j), ei,j ))                            (1)

   All the values are normalised in the [0,1] range with Divide-by-Maximum
normalisation (for every entity).
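A minimal sketch of this confidence computation; the Levenshtein implementation is the standard dynamic program, while normalising by the longest string length is an illustrative stand-in for the Divide-by-Maximum normalisation applied over all candidates:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cea_confidence(cell, labels):
    """1 - normalised edit distance against the best-matching label;
    candidate labels may come in different languages."""
    best = min(levenshtein(cell.lower(), lab.lower()) for lab in labels)
    longest = max(len(cell), max(len(lab) for lab in labels), 1)
    return 1 - best / longest

exact = cea_confidence("Jurassic World", ["Jurassic World", "Jurassic World 3D"])
```

An exact label match yields a confidence of 1.0, and the score decreases with the number of edits separating the cell text from the closest candidate label.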
   For L-columns, the confidence score is computed as follows:
 – for L-columns with a numeric datatype (float and integer Regextypes): all
   the numeric values (objects of RDF triples) linked to the candidate entities
   are retrieved using the Literals service of LamAPI. The confidence score is
   calculated with the formula in Equation 2, where lit(ei,j ) is the numerical
   value associated with the candidate entity ei,j :

             1 − |tx(i, j) − lit(ei,j )| / max(|tx(i, j)|, |lit(ei,j )|, 1)             (2)

 – for L-columns with a string datatype: the confidence score is computed using
   the Jaccard distance. We change the similarity measure, particularly for
   long strings, because the number of edits required to change a long string
   into another one is not necessarily significant (edit distance); the Jaccard
   measure instead considers n-grams;
 – for L-columns with a date datatype: the dates are treated as sortable nu-
   meric values in the format YYYYMMDDHHmmSS, and the confidence score
   is computed as described for the numeric datatype.
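The numeric and date cases above can be sketched as follows (the function names are ours; the example value 140 is the length in min of Batman Begins from Table 1):

```python
from datetime import datetime

def numeric_confidence(cell_value, literal):
    """Equation 2: 1 - |v - lit| / max(|v|, |lit|, 1)."""
    return 1 - abs(cell_value - literal) / max(abs(cell_value), abs(literal), 1)

def date_key(dt):
    """Dates become sortable integers in the YYYYMMDDHHmmSS format,
    so date confidence reduces to the numeric case."""
    return int(dt.strftime("%Y%m%d%H%M%S"))

# The cell "140" against the candidate's duration literal matches exactly,
# while an unrelated literal (e.g., an external ID) scores much lower.
exact_match = numeric_confidence(140, 140)   # -> 1.0
near_miss   = numeric_confidence(140, 40)
```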

    Cells with the same content in different tables start from the same set of
candidate entities, which is then ranked differently according to the entire con-
tent of each table in the next phases. As an example, consider the cell containing
“superman returns”: in Table 1 it refers to the movie, while in Table 2 it refers
to the video game.


Table 2. Example of a table with a cell identical to the first example but with a
different meaning.
                             videogame        publisher       release date
                             superman returns electronic arts 2006
                             pokemon white    nintendo        2010
                             call of duty     activision      2003




    Considering Table 1 and the cell “Superman Returns”, the candidate en-
tities are associated with the ontology concepts “film” and “video game” (List-
ing 1.5). In the CTA phase, where the approach extracts the types (concepts)
of the entities, all entities associated with the concept “video game” are penalised
because the most frequent concept is “film”.
      Listing 1.5. Candidates for the cell Superman Returns of the movies Table.
"Q328695":
    "label": "Superman Returns"
    "instance_of": ["3D film", "film"]
    "confidence": 2.25
"Q655031":
    "label": "Superman Returns",
    "instance_of": ["video game"]
    "confidence": 0.2
"Q3977963":
    "label": "Superman Returns"
    "instance_of": ["album"]
    "confidence": 0.2

    Instead, when we consider the video games in Table 2, the CPA phase, where
the approach extracts the relationships between entities, allows us to penalise
all entities with the concept “film” and the same name, because they have few
or no relationships with the rest of the content of the row.
                Listing 1.6. Candidates for the same cell of the video game table.

     " Q655031 " :
2         " label " : " Superman Returns " ,
          " instance_of " : [" video game "]
4         " confidence " : 1.0
     " Q7643850 " :
6         " label " : " Superman Returns : Fortress of Solitude "
          " instance_of " : [" video game "]
8         " confidence " : 0.41


    Considering the literal values for the cell “batman begins” together with
the content of the column “length in min”, we are almost certain that the value
is correct because, after sorting all the values, we obtain the result shown in
Listing 1.7.

                 Listing 1.7. Literal values for the entity batman begins (film).
     " P4632 " :
2         " label " : " Bechdel Test Movie List ID "
          " value " : 40
4         " confidence " : 0
     " P2047 " :
6         " label " : " duration "
          " value " : 140
8         " confidence " : 1.0
     " P3110 " :
10        " label " : " ISzDb film ID "
          " value " : 234
12        " confidence " : 0



iv. Column Predicate Annotation (CPA). Since all the necessary information
was gathered in the previous phase using LamAPI, the CPA is a relatively
fast process. All the predicates previously identified for each column are sorted
by their frequency relative to the entire column; the predicate with the greatest
frequency is ranked first. This process allows reducing the number of candidate
entities for each cell. Confidence scores for the predicates of the director column
are shown in Listing 1.8.
    When we consider the video games in Table 2, the CPA phase allows us to
penalise the “film” with the same name. This is because it does not have any
relationship with the rest of the content of the table (the film has nothing to
do with “electronic arts”, while the video game has a property with an exact
match).

                                  Listing 1.8. Example for CPA.
"P57":
    "label": "director"
    "confidence": 1.0
"P58":
    "label": "screenwriter"
    "confidence": 0.75
"P162":
    "label": "producer"
    "confidence": 0.625
"P161":
    "label": "cast member"
    "confidence": 0.25
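The frequency-based predicate ranking can be sketched with a few lines of Python; the per-row predicate sets are hypothetical and do not reproduce the exact scores of Listing 1.8:

```python
from collections import Counter

def rank_predicates(predicates_per_row, n_rows):
    """Sort the predicates found between the S-column entity and a target
    column by their frequency relative to the whole column."""
    counts = Counter(p for row in predicates_per_row for p in set(row))
    return sorted(((pred, count / n_rows) for pred, count in counts.items()),
                  key=lambda item: -item[1])

# Hypothetical per-row predicate sets between the title and director columns;
# only P57 ("director") holds for every row of the table.
rows = [["P57", "P58", "P162"],
        ["P57", "P58"],
        ["P57"],
        ["P57", "P162"]]
ranking = rank_predicates(rows, n_rows=4)
```

The predicate observed in every row is ranked first with confidence 1.0, matching the intent of the director example above.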



v. Column Type Annotation (CTA). To obtain the CTA annotation, we collect
the types/concepts of every ei,j resulting from the CEA. The concept with the
maximum frequency is selected as the CTA annotation.
    For every column, we collect the frequencies of the concepts, as shown in
Listing 1.9 for Wikidata and in Listing 1.10 for DBpedia.

   Listing 1.9. Example of the structure storing the most frequent concepts (Wikidata).
"movie_table":
    "0":
        "Q229390 (3D film)": 1,
        "Q11424 (film)": 1,
        "Q25110269 (live-action/animated film)": 0.33
    "1":
        "Q5 (Human)": 1
    "3":
        "Q1762059 (film production company)": 1,
        "Q375336 (film studio)": 0.5,
        "Q1107679 (animation studio)": 0.5,
        "Q18127 (record label)": 0.5,
        "Q4830453 (business)": 0.5,
        "Q10689397 (television production company)": 0.25

   Listing 1.10. Example of the structure storing the most frequent concepts (DBpedia).
"movie_table":
    "0":
        "Film": 1
        "Work": 1
    "1":
        "Person": 1
    "3":
        "Company": 1
        "Organisation": 1

    If the system returns several equally frequent annotations for one column
(e.g., column 0 in Listings 1.9 and 1.10), the approach randomly selects one of
them as the final annotation.
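A simplified sketch of this concept-frequency collection, mirroring the structure of Listings 1.9 and 1.10 (the column content is hypothetical, and a deterministic tie-break stands in for the random selection):

```python
from collections import Counter

def cta_annotation(concepts_per_cell):
    """Collect concept frequencies over a column (normalised by the number
    of cells) and pick a most frequent one; ties are broken arbitrarily,
    standing in for the random choice made by the approach."""
    counts = Counter(c for concepts in concepts_per_cell for c in set(concepts))
    n = len(concepts_per_cell)
    freqs = {concept: count / n for concept, count in counts.items()}
    best = max(freqs.values())
    winner = next(c for c, f in freqs.items() if f == best)
    return freqs, winner

# Concepts of the winning CEA entity for each cell of a title-like column
column = [["Q11424 (film)", "Q229390 (3D film)"],
          ["Q11424 (film)"],
          ["Q11424 (film)"],
          ["Q11424 (film)"]]
freqs, annotation = cta_annotation(column)
```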

vi. Revision. The revision phase analyses all the information gathered in the
previous phases to perform a final reordering of the candidates. In particular,
this allows correcting CEA entities previously selected: every entity has to be
coherent with the rest of the concepts and predicates selected in every column.
Moreover, predicates are also re-ranked.

vii. Export. The MantisTable V approach described above keeps the candi-
dates coming from each phase. It is possible to apply thresholds during the
export phase to balance annotation quality against the number of annotations
provided. The export threshold can play a significant role in the evaluation
metrics for every gold standard.
    MantisTable V is developed using Python.
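The effect of the export threshold can be illustrated with a small sketch (the threshold value and the candidate lists are hypothetical):

```python
def export_annotations(cea_candidates, threshold=0.8):
    """Keep, for each cell, the top-ranked candidate only if its confidence
    clears the export threshold, trading coverage for precision."""
    return {cell: ranked[0][0]
            for cell, ranked in cea_candidates.items()
            if ranked and ranked[0][1] >= threshold}

# Hypothetical ranked candidates kept after the Revision phase
cea = {"Superman Returns": [("Q328695", 0.95), ("Q655031", 0.2)],
       "Some Obscure Film": [("Q999999", 0.35)]}
exported = export_annotations(cea)
```

Raising the threshold exports fewer but more reliable annotations; lowering it increases coverage at the cost of precision.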


     4      User Interface of tUI
tUI14 is a Web application that aims to provide a visualisation tool for STI ap-
proaches; it can work with any backend that provides API endpoints to retrieve
data. Endpoints are stored in a YAML configuration file: the UI displays data
and functionalities based on which APIs are available. tUI consists of three main
parts: i) the view listing the datasets, ii) the view listing the tables contained in
each dataset (Fig. 6), and iii) the view showing the table data and the semantic
annotations (Fig. 7). Regarding the annotated table view, tUI supports all three
main tasks of STI (CTA, CPA, CEA): annotations can be viewed directly
14
   bitbucket.org/disco_unimib/tui/

inside the table. Concerning the CEA task, if multiple candidates are retrieved,
all of them can be shown in the UI. Endpoints to retrieve the list of datasets,
the list of tables, and the table data (with or without annotations) are manda-
tory to ensure the basic functionality of the tool and display the data correctly.
Other, non-mandatory endpoints, if provided, enable the following features:
i) export of annotations in various formats (Fig. 9; multiple exports are sup-
ported, e.g., SemTab CSV, JSON-LD, RDF/XML, RDF N-Triples, R2RML),
ii) editing and saving annotations (Fig. 8), and iii) global search (in datasets or
tables). tUI is developed using React and TypeScript15 .




Fig. 6. Display page of the tables within a dataset.

Fig. 7. Detail page of a table, with the annotation display. For each cell it is possible
to see the associated entity, or the list of candidates.




Fig. 8. Page for the analysis of candidate entities in case of uncertain annotation.

Fig. 9. Page for downloading annotations in different formats.



15
     tui-tool.ml

5    Evaluation
To evaluate our approach on large datasets and compare it with other state-of-
the-art approaches, we tested it against different datasets and gold standards.
In particular, we consider the T2Dv2 Gold Standard16 and the three most com-
plex rounds of the various editions of the international SemTab challenge17 [6,7]:
Round 3 of SemTab 2020, 2T [3], and Rounds 2 and 3 of SemTab 2021 Hard
Tables. The results shown in Table 3 indicate that, in general, the accuracy of
the proposed algorithm is high. It is also worth noting that the approach ob-
tained the best score in the CTA task for the SemTab 2021 GitTables dataset18 .
In Table 3 it is possible to notice differences between the datasets, which can be
explained by two hypotheses: the first concerns the different complexity of the
various datasets; the second concerns the use of different KGs as annotation
targets. One possible solution is to create an additional layer that unifies the
different KGs and treats them as a single dataset.

           Table 3. Results on the SemTab 2020, 2021 and 2T datasets.

Tasks    SemTab 2020       2T                SemTab2021HTR2    SemTab2021HTR3
         F1      P         F1      P         F1      P         F1      P
 CEA     0.980   0.984     0.932   0.958     0.983   0.988     0.961   0.985
 CTA     0.962   0.963     -       -         0.978   0.980     0.968   0.976
 CPA     0.993   0.994     -       -         0.999   0.999     0.990   0.998




6    Conclusions
MantisTable V represents the fifth version of the STI approach MantisTable. It
results from a complete refactoring aimed at substantially improving the ap-
proach, both in terms of the quality of the annotations and of scalability. The
second objective led to the definition and implementation of LamAPI, a system
for indexing and querying KGs. The current version of MantisTable also allows
managing different types of tables through an improved approach to creating
contexts for the disambiguation of cells. The usability of the approach is guar-
anteed by tUI, a new UI capable of adapting to the services provided by STI
approaches. A limitation of the current version of MantisTable V is that it per-
forms well only on tables whose elements can be directly referenced to entities
in a KG (table-to-KG annotations). A challenge is to develop methods that can
handle elements not present in a KG (out-of-KG annotations). Therefore, future
developments of the described approach envisage techniques for identifying novel
entities through the use of feature-based methods and embeddings.
16
   webdatacommons.org/webtables/goldstandardV2.html
17
   www.cs.ox.ac.uk/isg/challenges/sem-tab/
18
   zenodo.org/record/5706316

References
1. Cremaschi, M., Avogadro, R., Barazzetti, A., Chieregato, D.: MantisTable SE: an
   efficient approach for the semantic table interpretation. In: SemTab@ISWC. pp.
   75–85 (2020)
2. Cremaschi, M., De Paoli, F., Rula, A., Spahiu, B.: A fully automated approach to a
   complete semantic table interpretation. Future Generation Computer Systems 112,
   478 – 500 (2020)
3. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough tables: Carefully
   evaluating entity linking for tabular data. In: Pan, J.Z., Tamma, V., d’Amato, C.,
   Janowicz, K., Fu, B., Polleres, A., Seneviratne, O., Kagal, L. (eds.) The Semantic
   Web – ISWC 2020. pp. 328–343. Springer International Publishing, Cham (2020)
4. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching
   web tables with knowledge base entities: From entity lookups to entity embeddings.
   In: d’Amato, C., Fernández, M., Tamma, V.A.M., Lécué, F., Cudré-Mauroux, P., Se-
   queda, J.F., Lange, C., Heflin, J. (eds.) The Semantic Web - ISWC 2017 - 16th Inter-
   national Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceed-
   ings, Part I. Lecture Notes in Computer Science, vol. 10587, pp. 260–277. Springer
   (2017)
5. Ilievski, F., Garijo, D., Chalupsky, H., Divvala, N.T., Yao, Y., Rogers, C.M., Li, R.,
   Liu, J., Singh, A., Schwabe, D., Szekely, P.A.: KGTK: A toolkit for large knowledge
   graph manipulation and analysis. In: Pan, J.Z., Tamma, V.A.M., d’Amato, C.,
   Janowicz, K., Fu, B., Polleres, A., Seneviratne, O., Kagal, L. (eds.) The Semantic
   Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece,
   November 2-6, 2020, Proceedings, Part II. Lecture Notes in Computer Science, vol.
   12507, pp. 278–293. Springer (2020)
6. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: Semtab
   2019: Resources to benchmark tabular data to knowledge graph matching systems.
   In: Harth, A., Kirrane, S., Ngonga Ngomo, A.C., Paulheim, H., Rula, A., Gen-
   tile, A.L., Haase, P., Cochez, M. (eds.) The Semantic Web. pp. 514–530. Springer
   International Publishing, Cham (2020)
7. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K., Cutrona,
   V.: Results of semtab 2020. In: CEUR Workshop Proceedings. vol. 2775, pp. 1–8
   (2020)
8. Zhang, Z.: Effective and efficient semantic table interpretation using tableminer+.
   Semantic Web 8(6), 921–957 (2017)