SemTab 2021: Tabular Data Annotation with MTab Tool

Phuc Nguyen1, Ikuya Yamada2, Natthawut Kertkeidkachorn3, Ryutaro Ichise1, and Hideaki Takeda1

1 National Institute of Informatics, Japan
2 Studio Ousia, Japan
3 Japan Advanced Institute of Science and Technology, Japan

Abstract. This paper presents MTab, an automatic tool for tabular data annotation with knowledge graphs. The MTab tool provides helpful information about tabular data, such as structural annotations (e.g., table headers, the subject column) and semantic annotations with knowledge graph concepts from Wikidata, DBpedia, and Wikipedia (e.g., cells with entities, columns with types, and column pairs with properties). The tool supports multilingual tables and can process many table formats such as Excel, CSV, TSV, markdown tables, or pasted table content. MTab achieves impressive empirical performance on many datasets: 1st on the HardTable CEA, CTA, and CPA tasks, the BioTable CTA and CPA tasks, and the HardTablesR3 CPA task. Additionally, the system also ranked 1st in the usability track thanks to its advanced features: easy to use, generic solution, and well-designed user interface. MTab's graphical interface, public APIs, and documentation are available at https://github.com/phucty/mtab_tool.

Keywords: tabular data annotation · knowledge graph · semantic annotation · structural annotation · Wikidata · Wikipedia · DBpedia

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The Open Data movement has made many valuable tabular resources available on the Internet and Open Data Portals. However, due to insufficient data descriptions, varied data formats, and terminology issues, the use of tabular data in applications is constrained. Many tables lack a description, or the description does not adequately describe the data. Explicit structure and layout information is also missing from many tabular resources. Furthermore, many tables do not use standard vocabularies and contain multilingual expressions, abbreviations, ambiguous terms, misspellings, and encoding issues. To improve tabular data usability, it is necessary to have a tabular data annotation system capable of providing explicit information about table content.

This paper introduces MTab, an automatic tool that generates structural and semantic annotations for tabular data. As illustrated in Fig. 1, the MTab tool provides helpful information about tabular data, such as structural annotations (e.g., table headers, the subject column) and semantic annotations with knowledge graph concepts from Wikidata, DBpedia, and Wikipedia, i.e., a cell with an entity annotation (CEA task), a column with a type (or class) annotation (CTA task), and a column pair with a property annotation (CPA task).

[Fig. 1: Tabular data annotations with MTab tool. Structural annotations: header, subject column. Semantic annotations: CEA (cell-entity), CTA (column-type), CPA (column pair-property), e.g., Q1490 (Tokyo), Q5119 (capital), P1082 (population) in the knowledge graphs.]

The tool supports multilingual tables and can process many table formats such as Excel, CSV, TSV, markdown tables, or pasted table content. MTab achieves impressive performance on many datasets: 1st on the HardTable CEA, CTA, and CPA tasks, the BioTable CTA and CPA tasks, and the HardTablesR3 CPA task. Additionally, the system also ranked 1st in the usability track thanks to its advanced features: easy to use, generic solution, and well-designed user interface.
Users can access MTab's graphical interface, APIs, and documentation at https://github.com/phucty/mtab_tool.

2 Related Work

Table understanding is an important task for data integration and management. Much of the previous research on table understanding addresses data annotation tasks such as structural annotations, e.g., table header detection and subject column prediction as in [17], [20], [7], or semantic annotations, e.g., cell-entity annotation (CEA), column-type annotation (CTA), and column pair-property annotation (CPA), as addressed by the participant systems of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching: SemTab 2019 [12] and SemTab 2020 [13].

SemTab 2019 is the Semantic Web challenge on matching tabular data to DBpedia. There were three annotation tasks (CEA, CTA, and CPA), and the tabular data was generated from DBpedia. MTab (the winning system) is based on an aggregation of multiple cross-lingual lookup services and probabilistic graphical models [16]. CSV2KG (IDLab) also uses multiple lookup services to improve matching performance [24]. Tabular ISI implements the lookup step with the Wikidata API and Elasticsearch over DBpedia labels and aliases [23]. The ADOG system [19] also uses Elasticsearch to index the knowledge graph. LOD4ALL first checks whether there is an entity whose label is similar to the table cell using an ASK SPARQL query, and otherwise performs a DBpedia entity search [15]. The DAGOBAH system performs entity linking with a lookup on Wikidata and DBpedia; the authors also use Wikidata entity embeddings to estimate entity type candidates [3]. MantisTable provides a Web interface and API for tabular data matching [6].

In SemTab 2020, the target knowledge graph was Wikidata, which introduced a new set of difficulties such as larger-scale data, graph shifting, and the rich and complex data schema of Wikidata. Besides the tabular data generated from Wikidata, there was a new manually curated dataset (Tough Tables [8]). The winning system, MTab4Wikidata, proposed new fuzzy entity and statement search methods to improve entity candidate generation (with 99.89% coverage) [18]. The bbw system [21] is based on contextual matching and meta-lookup with the SearX metasearch engine to deal with spelling mistakes. The LinkingPark [4], DAGOBAH [11], JenTab [1], MantisTable SE [5], SSL [14], and AMALGAM [2] systems proposed new scoring functions to rank the matching results.

However, most of these systems are not publicly available to use, or they require extensive configuration and setup, high computing power, or have high time complexity [25]. We implement the MTab tool and release public APIs and interfaces to address the usability issues of current annotation systems.

3 MTab Tool

This section describes the MTab tool, starting with the system assumptions in Section 3.1; the overall framework is then described in Section 3.2.

3.1 Assumptions

Assumption 1. The MTab tool is built on a closed-world assumption. This means the tool may return incorrect answers if table elements are not available in the knowledge graph.

Assumption 2. We assume that the input tables are of the horizontal relational type. A horizontal relational table encodes knowledge graph triples of the form [subject, predicate, object]. The table has a subject column containing entity names, and the relations between the subject column and the other columns represent the predicate relations between the entities (subjects) and attribute values (objects).
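To make Assumption 2 concrete, the following minimal sketch (illustrative only, not part of MTab) shows how the data rows of a horizontal relational table map to [subject, predicate, object] triples. The example entities and P1082 (population) follow Fig. 1; the column-to-property mapping and P17 (country) are our own illustrative choices.

```python
# A small horizontal relational table: column 0 is the subject column
# (entity names), the other columns hold attribute values.
table = [
    ["City", "Country", "Population"],      # header row
    ["Tokyo", "Japan", "13,960,000"],
    ["Osaka", "Japan", "2,691,000"],
]

# Hypothetical CPA-style annotations: column index -> Wikidata property.
column_properties = {1: "P17 (country)", 2: "P1082 (population)"}

def to_triples(rows, properties):
    """Turn each data row into [subject, predicate, object] triples."""
    triples = []
    for row in rows[1:]:                     # skip the header row
        subject = row[0]                     # subject column (Assumption 2)
        for col, prop in properties.items():
            triples.append((subject, prop, row[col]))
    return triples

for triple in to_triples(table, column_properties):
    print(triple)
# ('Tokyo', 'P17 (country)', 'Japan')
# ('Tokyo', 'P1082 (population)', '13,960,000')
# ...
```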
Assumption 3. We assume that all the cell values of the same column have the same data type, and that the entities related to the cell values are of the same type.

Assumption 4. The MTab tool treats input tables independently.

[Fig. 2: MTab tool framework. Inputs (text string; file such as CSV, TSV, Excel; table object) pass through preprocessing (table loading, cell normalization), structural annotations (data type, header, and subject column prediction), and semantic annotations (target prediction, entity search, value-based matching, postprocessing) to produce the CEA, CTA, and CPA annotations, backed by the integrated WikiGraph.]

3.2 Framework

In this paper, we focus on the usability of the annotation system, so we implement the MTab tool to support multilingual tables and to process various table formats. System efficiency is also an important concern of the implementation; we optimize the annotation run time to about 1.52 seconds per table on average (tested on the SemTab 2020 dataset). Moreover, we provide graphical interfaces to visualize the annotation results, as described in Section 4.

The overall framework of the MTab tool is depicted in Fig. 2. We build WikiGraph, an integrated knowledge graph from Wikidata, DBpedia, and Wikipedia, as described in Section 3.2.1. The annotation procedure starts with data preprocessing as in Section 3.2.2. Then, the system performs data type prediction, header prediction, and subject column prediction, as described in the structural annotations section (Section 3.2.3). Finally, MTab performs semantic annotations as in Section 3.2.4.

3.2.1 Knowledge Graph

We build WikiGraph from the dump data of Wikidata, Wikipedia, and DBpedia as the target knowledge graph for the annotation tasks. From the dump data of 1 January 2021, we extracted 91.2 million entities and 249.3 million multilingual entity labels, including entity labels, aliases, other names, redirect entity labels, and disambiguation entities. We also extracted 3.5 billion triples in WikiGraph. Additionally, WikiGraph will be updated frequently based on future releases of the knowledge graph dumps (Wikidata, Wikipedia, and DBpedia).

3.2.2 Preprocessing

Table Loading: The MTab tool supports three types of input tables: text (table content as a string), file object (a table file such as CSV, TSV, or Excel), and table object (a matrix of rows and columns). The tool automatically predicts the encoding of the input table and loads the table content based on the predicted encoding.

Table Cell Normalization: We remove HTML tags and non-cell values such as -, NaN, none, null, blank, unknown, ?, #. Additionally, we use the ftfy tool [22] to fix noisy cells caused by incorrect encoding during file loading.

3.2.3 Structural Annotations

Data Type Prediction: The system first classifies each table cell's data type as non-cell (empty cell), literal, or named-entity (NE). We use the pre-trained spaCy models [10] (trained on the OntoNotes 5 dataset) to identify named entities (PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE) as well as date-time and numeric entities (DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL). We map the named entities to the NE type, and the date-time and numeric entities to the literal type. If spaCy assigns no entity label to a cell, we still mark the cell as NE because the spaCy model could fail to recognize named entities in table cells.
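The sketch below illustrates this mapping from spaCy entity labels to cell data types. It is a simplified illustration, not MTab's implementation: the label groups follow the description above, `en_core_web_sm` is just one available pre-trained model, and the precedence between mixed label groups is our own simplification.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Label groups described in Section 3.2.3.
NE_LABELS = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT",
             "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE"}
LITERAL_LABELS = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY",
                  "ORDINAL", "CARDINAL"}

def cell_data_type(cell: str) -> str:
    """Classify a cell as 'non-cell', 'literal', or 'named-entity'."""
    if not cell or not cell.strip():
        return "non-cell"
    labels = {ent.label_ for ent in nlp(cell).ents}
    if labels and labels <= LITERAL_LABELS:
        return "literal"
    # Named-entity labels, or no label at all (spaCy may miss entities
    # in short table cells), are both treated as named-entity.
    return "named-entity"

print(cell_data_type("Tokyo"))        # typically: named-entity (GPE)
print(cell_data_type("13,960,000"))   # typically: literal (CARDINAL)
print(cell_data_type(""))             # non-cell
```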
Next, the system classifies each table column's data type as a non-match column (empty column), a literal column, or a named-entity column. The column data type is derived by majority voting over the data types of the cells in that column.

Header Prediction: We use simple heuristics to predict table headers as follows.
– Table headers, if present, are located in the first few rows of a table.
– If the list of data types of a header candidate row differs from the majority data types of the remaining rows, the candidate is the table header. For example, the data types of the header candidate row could be [named-entity, named-entity, named-entity], while the majority data types of the remaining rows are [named-entity, literal, literal].
– We also found that header text is empirically shorter or longer than the values in the remaining data rows. If the value lengths of the header candidate row are below the 0.05 quantile or above the 0.95 quantile of the value lengths of the remaining rows, the candidate is the table header.

Subject Column Prediction: We adopt the heuristics proposed by Ritze et al. [20] and add a simple modified heuristic to predict the subject column of a table as follows.
– A subject column must have a named-entity data type.
– Its average cell value length is between 3.5 and 200. We add the restriction that only non-header cells are considered, since the length of table headers could differ from that of the remaining cells.
– The subject column is determined by a uniqueness score that increases for columns with many unique values and decreases for columns with many missing values. The subject column is the column with the highest uniqueness score; if several columns have the same score, the left-most column is chosen.

3.2.4 Semantic Annotations

Matching Target Prediction: When the input does not specify matching targets, MTab automatically predicts them based on the data types. The CEA matching targets are the table cells whose data type is named-entity. The CTA matching targets are the columns whose data type is named-entity. The CPA matching targets are the relations between the subject column and the remaining table columns.

Entity Search: We perform entity candidate generation for each table cell with the entity search modules. The MTab tool provides three entity search modules, i.e., keyword search, fuzzy search, and aggregation search¹. We implement the keyword search using the BM25 algorithm with the hyper-parameters b = 0.75 and k1 = 1.2. The fuzzy search is implemented using the Damerau–Levenshtein edit distance. We perform candidate filtering and hashing by pre-calculating entity label deletes, as in the Symmetric Delete algorithm [9], to reduce the number of pairwise edit distance calculations; the search handles up to six edits. The aggregation search combines the results of the keyword search and the fuzzy search. In our experiments, we use the aggregation search as the default entity search.

Post-Processing: We calculate context similarities with value-based matching between the statements of entity candidates in the subject column and the table row values. Finally, we generate the annotations for entities, properties, and types based on majority voting over the context similarities [18].

4 Interfaces

4.1 Entity Search

The entity search interface is available at https://mtab.app/mtabes. Fig. 3 depicts an example of entity search with the query "2MASS J10540655-0031018".
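To give a flavor of how the aggregation search described in Section 3.2.4 can rank candidates for such a query, the sketch below blends a BM25 keyword score with a fuzzy string-similarity score. It is an illustration under assumed data, not MTab's implementation: the tiny `entity_labels` list, the `alpha` rank-fusion weight, and the use of `difflib` as a stand-in for Damerau–Levenshtein distance with Symmetric Delete pre-filtering are all our own choices; only the BM25 hyper-parameters follow the paper.

```python
from difflib import SequenceMatcher
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# A tiny, made-up label index; MTab searches the full WikiGraph label index.
entity_labels = [
    "2MASS J10540655-0031018",
    "2MASS J10551709-0045599",
    "Tokyo",
    "Tokyo Tower",
]

def tokenize(text: str):
    return text.lower().replace("-", " ").split()

# Keyword search: BM25 with the hyper-parameters reported in the paper.
bm25 = BM25Okapi([tokenize(label) for label in entity_labels], k1=1.2, b=0.75)

def aggregation_search(query: str, alpha: float = 0.5):
    """Blend normalized BM25 and fuzzy similarity (illustrative weighting)."""
    keyword_scores = bm25.get_scores(tokenize(query))
    max_kw = max(float(max(keyword_scores)), 1e-9)
    results = []
    for label, kw in zip(entity_labels, keyword_scores):
        # Stand-in fuzzy score; MTab uses Damerau-Levenshtein edit distance
        # with Symmetric Delete pre-filtering instead.
        fuzzy = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        results.append((label, alpha * (kw / max_kw) + (1 - alpha) * fuzzy))
    return sorted(results, key=lambda item: item[1], reverse=True)

for label, score in aggregation_search("2MASS J10540655-0031018"):
    print(f"{score:.3f}  {label}")
```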
The MTab tool supports multilingual search, so users can type an entity name expressed in any language.

¹ Entity search documentation: https://mtab.app/mtabes/docs

[Fig. 3: Example of entity search with MTab]

4.2 Table Annotation

The table annotation interface is available at https://mtab.app. Users can submit table files in various formats, expressed in any language, to the MTab API, or copy table content and paste it into the interface. Then, users can click the "Annotate" button to get the annotation results. Fig. 4 illustrates an annotation example for a table from a SemTab dataset. MTab took 0.49 seconds to annotate a table pasted into the text box (left). The picture on the right shows the annotation results. The table header is in the first row, and the subject column is the first column. Entity annotations are in red and located below the table cell values. The type annotations are in green and located in the "Type" column. Finally, the relations between the subject column and the other columns are in blue and located in the property column.

[Fig. 4: Example of tabular data annotation with MTab]

5 SemTab 2021 Results

Table 1 reports the overall results of the MTab tool for the three matching tasks (CEA, CTA, and CPA) of the HardTable, BioTable, BioDivTab, and HardTablesR3 datasets. Overall, these results show that the MTab tool achieves impressive performance on many datasets: 1st on the HardTable CEA, CTA, and CPA tasks, the BioTable CTA and CPA tasks, and the HardTablesR3 CPA task. The MTab tool consistently achieves the best CPA performance across datasets. The detailed results of all SemTab 2021 participants are available on AICrowd².

Table 1: Overall results of the MTab tool on the HardTable, BioTable, BioDivTab, and HardTablesR3 datasets at SemTab 2021

                   CEA             CTA             CPA
  Dataset          F1     Rank     AF1    Rank     F1     Rank
  HardTable        0.985  1        0.977  1        0.998  1
  BioTable         0.964  2        0.956  1        0.947  1
  BioDivTab        0.522  2        0.123  3        -      -
  HardTablesR3     0.968  2        0.984  2        0.993  1

Additionally, we release public APIs and graphical interfaces that enable users to obtain annotations without intensive setup or configuration. Finally, the MTab tool also ranked first in the usability track thanks to its advanced features: easy to use, generic solution, and well-designed user interface.

² SemTab 2021 leaderboards: https://www.aicrowd.com/challenges/semtab-2021/leaderboards

6 Conclusions

This paper presents the MTab tool for table annotation with the Wikidata, DBpedia, and Wikipedia knowledge graphs. The MTab tool achieves promising performance on many datasets of SemTab 2021. Moreover, the system also ranked first in the usability track.

In future work, we will focus on improving the efficiency of the MTab tool by processing only a small part of the table content and continuing to expand it until the annotation results no longer change. Another direction is building downstream applications based on MTab's annotations, such as question answering and data analysis.

Acknowledgements

This research was supported by the Cross-ministerial Strategic Innovation Promotion Program (SIP) Second Phase, "Big-data and AI-enabled Cyberspace Technologies" by the New Energy and Industrial Technology Development Organization (NEDO).

References

1. Abdelmageed, N., Schindler, S.: JenTab: Matching tabular data to knowledge graphs. In: SemTab@ISWC. pp. 40–49 (2020)
2. Azzi, R., Diallo, G.: AMALGAM: A matching approach to fairfy tabular data with knowledge graph model. arXiv preprint arXiv:2101.06637 (2021)
3. Chabot, Y., Labbe, T., Liu, J., Troncy, R.: DAGOBAH: An end-to-end context-free tabular data semantic annotation system. In: SemTab@ISWC. pp. 41–48 (2019)
4. Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Gordon, A., Lin, C.Y.: LinkingPark: An integrated approach for semantic table interpretation. In: SemTab@ISWC. pp. 65–74 (2020)
5. Cremaschi, M., Avogadro, R., Barazzetti, A., Chieregato, D.: MantisTable SE: An efficient approach for the semantic table interpretation. In: SemTab@ISWC. pp. 75–85 (2020)
6. Cremaschi, M., Avogadro, R., Chieregato, D.: MantisTable: An automatic approach for the semantic table interpretation. In: SemTab@ISWC. pp. 15–24 (2019)
7. Cremaschi, M., De Paoli, F., Rula, A., Spahiu, B.: A fully automated approach to a complete semantic table interpretation. Future Generation Computer Systems 112, 478–500 (2020)
8. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables: Carefully evaluating entity linking for tabular data. In: ISWC. pp. 328–343. Springer (2020)
9. Garbe, W.: SymSpell: Symmetric Delete algorithm. https://github.com/wolfgarbe/SymSpell (2012)
10. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017), https://spacy.io/
11. Huynh, V.P., Liu, J., Chabot, Y., Labbé, T., Monnin, P., Troncy, R.: DAGOBAH: Enhanced scoring algorithms for scalable annotations of tabular data. In: SemTab@ISWC. pp. 27–39 (2020)
12. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In: ESWC. vol. 12123, pp. 514–530. Springer (2020)
13. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K., Cutrona, V.: Results of SemTab 2020. In: SemTab@ISWC. vol. 2775, pp. 1–8 (2020)
14. Kim, D., Park, H., Lee, J.K., Kim, W.: Generating conceptual subgraph from tabular data for knowledge graph matching. In: SemTab@ISWC. pp. 96–103 (2020)
15. Morikawa, H.: Semantic table interpretation using LOD4ALL. In: SemTab@ISWC. pp. 49–56 (2019)
16. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab: Matching tabular data to knowledge graph using probability models. In: SemTab@ISWC 2019. vol. 2553, pp. 7–14 (2019)
17. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: TabEAno: Table to knowledge graph entity annotation. CoRR abs/2010.01829 (2020)
18. Nguyen, P., Yamada, I., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab4Wikidata at SemTab 2020: Tabular data annotation with Wikidata. In: SemTab@ISWC. vol. 2775, pp. 86–95 (2020)
19. Oliveira, D., d'Aquin, M.: ADOG - Annotating data with ontologies and graphs. In: SemTab@ISWC. pp. 1–6 (2019)
20. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, WIMS 2015. pp. 10:1–10:6. ACM (2015)
21. Shigapov, R., Zumstein, P., Kamlah, J., Oberländer, L., Mechnich, J., Schumm, I.: bbw: Matching CSV to Wikidata via meta-lookup. In: SemTab@ISWC. vol. 2775, pp. 17–26 (2020)
22. Speer, R.: ftfy. Zenodo (2019), https://github.com/LuminosoInsight/python-ftfy, version 5.5
23. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.T., Singh, A., Qasemi, E., Szekely, P.A., Pujara, J.: Entity linking to knowledge graphs to infer column types and properties. In: SemTab@ISWC. pp. 25–32 (2019)
24. Vandewiele, G., Steenwinckel, B., De Turck, F., Ongenae, F.: CVS2KG: Transforming tabular data into semantic knowledge. In: SemTab@ISWC. pp. 33–40 (2019)
25. Wang, D., Shiralkar, P., Lockard, C., Huang, B., Dong, X.L., Jiang, M.: TCN: Table convolutional network for web table interpretation. In: WWW '21. pp. 4020–4032. ACM / IW3C2 (2021)