<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Demonstration of MTab: Tabular Data Annotation with Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Phuc Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikuya Yamada</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hideaki Takeda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Japan Advanced Institute of Science and Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Studio Ousia</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a demonstration of MTab, a toolkit for annotating tabular data with knowledge graphs: Wikidata, Wikipedia, and DBpedia. MTab was the best-performing system for all semantic annotation tasks at the Semantic Web Challenges on Tabular Data to Knowledge Graph Matching, SemTab 2019 and SemTab 2020. This paper introduces MTab's public APIs, which provide structural and semantic annotations for tabular data. We also provide a graphical interface to visualize the annotation results. The tool supports multilingual tables and can process many table formats such as Excel, CSV, TSV, markdown tables, or pasted table content. MTab's repository is publicly available at https://github.com/phucty/mtab_tool.</p>
      </abstract>
      <kwd-group>
        <kwd>tabular data annotation</kwd>
        <kwd>knowledge graph</kwd>
        <kwd>semantic annotation</kwd>
        <kwd>structural annotation</kwd>
        <kwd>Wikidata</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many valuable tabular resources have been made available on the Internet and
Open Data Portals, thanks to the Open Data movement. However, the use of
tabular data in applications remains very limited because of missing or insufficient data
descriptions, varied data formats, and vocabulary issues. Tabular data usually have
no description, or the description does not cover the data content. Tabular
data also lack a specification of table structure and layout. Moreover, many tables
do not use a standard vocabulary: they may be written in non-English languages, use abbreviations,
be ambiguous, or contain misspellings and encoding problems. It is therefore crucial to have
a tabular data annotation system that provides explicit information about
table content to improve tabular data usability.</p>
      <p>
        Previous studies addressed many tabular data annotation tasks such as
structural annotations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or semantic annotations, as in the systems participating in
the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching,
SemTab 2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and SemTab 2020 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Unfortunately, most of these systems are not publicly available, or they
require extensive configuration and setup, high
computing power, or long processing times [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>This paper introduces MTab, a public service that generates structural and
semantic annotations for tabular data. The structural annotations provide
information about table headers and the table core attribute. The semantic
annotations match table elements to knowledge graph concepts: cells to entities (CEA
task), columns to types (CTA task), and the relation between the core
attribute and another column to a property (CPA task). We also provide a
graphical interface to visualize the annotation results.</p>
      <p>
        The major advantages of MTab compared to other systems are as follows.
        <list list-type="bullet">
          <list-item>
            <p>Effectiveness: MTab was the best-performing system in SemTab 2019 [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] and SemTab 2020 [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. The key to MTab's success is its entity search modules with multilingual support: a keyword search with the BM25 algorithm, a fuzzy search with edit distances, and an aggregation search with a weighted fusion of the keyword search and the fuzzy search. The fuzzy search supports up to six edits (on a low-budget Mac mini M1 2021), while most other systems support only two edits. As a result, MTab can handle a higher level of noise than other systems. The entity search module achieves an average top-1 accuracy of 87.98% (the top-1000 accuracy is 99.7%) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] on the SemTab 2020 [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] and Tough Tables [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] datasets.</p>
          </list-item>
          <list-item>
            <p>Efficiency: MTab's fuzzy search implementation works efficiently thanks to candidate filtering based on entity labels and hashing with pre-calculated entity label deletes, as in the Symmetric Delete algorithm [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. Moreover, the statement search improves efficiency considerably because it eliminates entity candidates that have no matching statements. Additionally, we use a lightweight solution, value matching, to calculate the context similarity between entity candidate statements and table row values. The experiments show that our solution improves efficiency without losing effectiveness [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Overall, MTab takes only 1.52 seconds per table on average to annotate the SemTab 2020 dataset.</p>
          </list-item>
          <list-item>
            <p>Easy to use: We provide public APIs and graphical interfaces so that users do not need to perform intensive setup or configuration. MTab also supports multilingual tables and can process many table formats such as Excel, CSV, TSV, or markdown tables. According to Wang et al., they could generate annotations only with the MTab tool, while the other systems required too much processing time [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].</p>
          </list-item>
          <list-item>
            <p>Privacy Policy: MTab does not store any data from users. All users' tabular data files are completely deleted after the annotation.</p>
          </list-item>
        </list>
      </p>
      <p>MTab's repository, API documentation, and other information can be accessed
at https://github.com/phucty/mtab_tool; the demonstration video is
available at https://youtu.be/0ibTWeObWaA.</p>
      <p>MTab</p>
    </sec>
    <sec id="sec-2">
      <title>Knowledge Graphs</title>
      <p>We build WikiGraph from the dump data of Wikidata, Wikipedia, and
DBpedia as the target knowledge graph for the annotation tasks. Wikidata is the
central knowledge graph because it has the largest number of entities among the
three graphs. From the dumps of 1 January 2021, we extracted 91.2
million entities and 249.3 million multilingual entity labels, including entity
labels, aliases, other names, redirect entity labels, and disambiguation entities. We
also extracted 3.5 billion triples into WikiGraph. WikiGraph will be
updated frequently as new dumps of the knowledge graphs
(Wikidata, Wikipedia, and DBpedia) are released.</p>
    </sec>
    <sec id="sec-3">
      <title>Entity Search Modules</title>
      <p>
        Entity Search on a Cell. We introduce the following search modes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
        <list list-type="bullet">
          <list-item>
            <p>Keyword search with the BM25 algorithm: We use the hyper-parameters b = 0.75 and k1 = 1.2.</p>
          </list-item>
          <list-item>
            <p>Fuzzy search with edit distance: We use the Damerau-Levenshtein distance as the edit distance for fuzzy search. We also perform candidate filtering and hashing with pre-calculated entity label deletes, as in the Symmetric Delete algorithm [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], to reduce the number of pairwise edit distance calculations. Overall, MTab supports fuzzy search with up to six edits.</p>
          </list-item>
          <list-item>
            <p>Aggregation search: This module is a weighted fusion of the keyword search and the fuzzy search results.</p>
          </list-item>
        </list>
      </p>
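      <p>The fuzzy and aggregation search modes above can be illustrated with a small sketch. It is not MTab's implementation: the function names, the toy labels, and the fusion weight are assumptions made for illustration, and the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance is used for brevity. The sketch pre-computes label deletes, looks up candidates by shared deletes, verifies them with the edit distance, and fuses two score dictionaries with a weight.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def deletes(label: str, max_edits: int) -> set[str]:
    """Return label plus every string made by deleting up to max_edits characters."""
    variants = {label}
    for k in range(1, min(max_edits, len(label)) + 1):
        for kept in combinations(range(len(label)), len(label) - k):
            variants.add("".join(label[i] for i in kept))
    return variants


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def build_index(labels, max_edits):
    """Hash every pre-calculated delete variant to the labels that produce it."""
    index = defaultdict(set)
    for label in labels:
        for variant in deletes(label, max_edits):
            index[variant].add(label)
    return index


def fuzzy_search(query, index, max_edits):
    """Collect labels sharing a delete variant with the query, then verify them."""
    candidates = set()
    for variant in deletes(query, max_edits):
        candidates |= index.get(variant, set())
    scored = [(label, osa_distance(query, label)) for label in candidates]
    return sorted((s for s in scored if s[1] <= max_edits), key=lambda s: (s[1], s[0]))


def fuse(keyword_scores, fuzzy_scores, keyword_weight=0.5):
    """Weighted fusion of two score dictionaries (the weight is a hypothetical value)."""
    labels = set(keyword_scores) | set(fuzzy_scores)
    return {
        label: keyword_weight * keyword_scores.get(label, 0.0)
        + (1 - keyword_weight) * fuzzy_scores.get(label, 0.0)
        for label in labels
    }
```
]]></preformat>
      <p>For example, after building an index over the toy labels "Tokyo", "Kyoto", and "Toyota" with at most two edits, the query "Tokio" retrieves only "Tokyo" at distance 1. Enumerating deletes grows quickly with the number of edits, so supporting six edits at knowledge graph scale presumably requires the additional candidate filtering described above.</p>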
      <p>Statement Search on Two Cells. This module is built on the assumption
that there is a logical relation between two cells of a table row, equivalent to a
knowledge graph triple. We keep only the candidates of the two cells that have
equivalent statements in WikiGraph. We implement this statement search
with a sparse matrix of 91 million entities and around 500 million edges.
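        As a minimal sketch of this filtering idea, and not MTab's implementation, the following code stands in a dictionary of sets for the sparse matrix. The entity identifiers, the function names, and the choice to accept a statement in either direction are assumptions made for illustration.
        <preformat><![CDATA[
```python
def build_adjacency(triples):
    """Sparse adjacency: subject -> set of objects, ignoring the property."""
    adjacency = {}
    for subject, _prop, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
    return adjacency


def statement_filter(cands_a, cands_b, adjacency):
    """Keep only candidates of two cells that are linked by some statement.

    A pair (a, b) survives if the graph holds a triple a -> b or b -> a
    (accepting both directions is an assumption of this sketch).
    """
    pairs = [
        (a, b)
        for a in cands_a
        for b in cands_b
        if b in adjacency.get(a, ()) or a in adjacency.get(b, ())
    ]
    keep_a = sorted({a for a, _ in pairs})
    keep_b = sorted({b for _, b in pairs})
    return keep_a, keep_b


# Toy graph with hypothetical identifiers.
toy_graph = build_adjacency([("tokyo_city", "country", "japan")])
filtered = statement_filter(
    ["tokyo_city", "tokyo_band"], ["japan", "japan_band"], toy_graph
)
```
]]></preformat>
        In this toy case only the city and the country remain, because no statement links the other candidates.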
      </p>
      <p>The MTab demonstration is available at https://mtab.app. Users can submit
table files in various formats and in any language to the MTab API, or
copy the table content and paste it into the interface. Users then tap the
"Annotate" button to get the annotation results. MTab then performs the following
steps.</p>
      <p>
        The annotation procedure (documentation: https://mtab.app/mtab/docs) consists of the following steps, which build on the entity search modules (documentation: https://mtab.app/mtabes/docs).
        <list list-type="bullet">
          <list-item>
            <p>Pre-processing: The input tables are pre-processed with encoding prediction, table type prediction, and data type prediction for cells and columns.</p>
          </list-item>
          <list-item>
            <p>Structural Annotations: We then perform header detection based on majority voting of column data types, as in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The core attribute detection is based on the uniqueness of cell values in a column, as in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ][
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].</p>
          </list-item>
          <list-item>
            <p>Semantic Annotations: When the input does not specify matching targets, MTab automatically predicts them based on data types. The CEA matching targets are the table cells whose data types are strings. The CTA matching targets are the columns whose data types are strings. The CPA matching targets are the relations between the core attribute and the remaining table columns. We then generate entity candidates for each table cell with entity search, and for two cells in the same row with statement search. We calculate context similarities by value matching between the statements of entity candidates in the core attribute and the table row values. Finally, we generate the annotations for entities, properties, and types based on majority voting of context similarities [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].</p>
          </list-item>
        </list>
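        The structural annotation step can be illustrated with a small sketch. It is not MTab's implementation: the data type rules, the exact voting rule, the function names, and the toy table values are assumptions; the source states only that header detection uses majority voting of column data types and that core attribute detection uses the uniqueness of cell values.
        <preformat><![CDATA[
```python
from collections import Counter


def cell_type(value: str) -> str:
    """Very rough data type prediction for one cell (assumed rules)."""
    try:
        float(value.replace(",", ""))
        return "number"
    except ValueError:
        return "string" if value.strip() else "empty"


def has_header(rows):
    """Guess that the first row is a header.

    Assumed rule: every cell of the first row is a string, and at least one
    column's majority type, voted over the remaining rows, is not a string.
    """
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    if any(cell_type(v) != "string" for v in first):
        return False
    for c in range(len(first)):
        majority = Counter(cell_type(r[c]) for r in body).most_common(1)[0][0]
        if majority != "string":
            return True
    return False


def core_attribute(body_rows):
    """Pick the leftmost string column with the highest share of unique values."""
    best, best_ratio = None, -1.0
    for c in range(len(body_rows[0])):
        values = [r[c] for r in body_rows]
        majority = Counter(cell_type(v) for v in values).most_common(1)[0][0]
        if majority != "string":
            continue
        ratio = len(set(values)) / len(values)
        if ratio > best_ratio:
            best, best_ratio = c, ratio
    return best


# Toy table; the numeric values are made up for illustration.
table = [
    ["City", "Country", "Population"],
    ["Tokyo", "Japan", "100"],
    ["Osaka", "Japan", "200"],
    ["Hanoi", "Vietnam", "300"],
]
header_found = has_header(table)
core_column = core_attribute(table[1:])
```
]]></preformat>
        On this toy table the first row is detected as a header, and the first column is chosen as the core attribute because all of its values are unique.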
      </p>
      <p>This paper presents a demonstration of the MTab toolkit for table annotation
with the knowledge graphs Wikidata, DBpedia, and Wikipedia. MTab is effective,
efficient, and easy to use.
      </p>
      <p>In future work, we will focus on building downstream applications based
on MTab's annotations, such as question answering and data analysis.</p>
      <p>The research was supported by the Cross-ministerial Strategic Innovation
Promotion Program (SIP) Second Phase, "Big-data and AI-enabled Cyberspace
Technologies" by the New Energy and Industrial Technology Development
Organization (NEDO).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Cutrona</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Bianchi</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Jiménez-Ruiz</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Palmonari</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Tough tables: Carefully evaluating entity linking for tabular data</article-title>.
          In: <source>The Semantic Web - ISWC 2020. Lecture Notes in Computer Science</source>,
          vol. <volume>12507</volume>, pp. <fpage>328</fpage>–<lpage>343</lpage>.
          <publisher-name>Springer</publisher-name> (<year>2020</year>), https://doi.org/10.1007/978-3-030-62466-8_21
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Garbe</surname>, <given-names>W.</given-names></string-name>:
          <article-title>Symspell: Symmetric delete algorithm</article-title>.
          https://github.com/wolfgarbe/SymSpell (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Jiménez-Ruiz</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Hassanzadeh</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Efthymiou</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Chen</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Srinivas</surname>, <given-names>K.</given-names></string-name>:
          <article-title>Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems</article-title>.
          In: <source>The Semantic Web - 17th International Conference, ESWC 2020. Lecture Notes in Computer Science</source>,
          vol. <volume>12123</volume>, pp. <fpage>514</fpage>–<lpage>530</lpage>.
          <publisher-name>Springer</publisher-name> (<year>2020</year>), https://doi.org/10.1007/978-3-030-49461-2_30
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Jiménez-Ruiz</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Hassanzadeh</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Efthymiou</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Chen</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Srinivas</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Cutrona</surname>, <given-names>V.</given-names></string-name>:
          <article-title>Results of semtab 2020</article-title>.
          In: <source>SemTab@ISWC. CEUR Workshop Proceedings</source>,
          vol. <volume>2775</volume>, pp. <fpage>1</fpage>–<lpage>8</lpage>.
          <publisher-name>CEUR-WS.org</publisher-name> (<year>2020</year>), http://ceur-ws.org/Vol-2775/paper0.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Nguyen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Kertkeidkachorn</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Ichise</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Takeda</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Mtab: Matching tabular data to knowledge graph using probability models</article-title>.
          In: <source>SemTab@ISWC 2019. CEUR Workshop Proceedings</source>,
          vol. <volume>2553</volume>, pp. <fpage>7</fpage>–<lpage>14</lpage>.
          <publisher-name>CEUR-WS.org</publisher-name> (<year>2019</year>), http://ceur-ws.org/Vol-2553/paper2.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Nguyen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Kertkeidkachorn</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Ichise</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Takeda</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Tabeano: Table to knowledge graph entity annotation</article-title>.
          <source>CoRR</source> abs/2010.01829 (<year>2020</year>), https://arxiv.org/abs/2010.01829
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Nguyen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Yamada</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Kertkeidkachorn</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Ichise</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Takeda</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Mtab4wikidata at semtab 2020: Tabular data annotation with wikidata</article-title>.
          In: <source>SemTab@ISWC</source>.
          vol. <volume>2775</volume>, pp. <fpage>86</fpage>–<lpage>95</lpage>
          (<year>2020</year>), http://ceur-ws.org/Vol-2775/paper9.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Nguyen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Yamada</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Takeda</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Mtabes: Entity search with keyword search, fuzzy search, and entity popularities</article-title>.
          In: <source>The 35th Annual Conference of the Japanese Society for Artificial Intelligence, JSAI 2021</source>.
          vol. <volume>2021</volume>.
          <publisher-name>The Japanese Society for Artificial Intelligence</publisher-name>, https://www.jstage.jst.go.jp/article/pjsai/JSAI2021/0/JSAI2021_1N4IS1a02/_pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Ritze</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Lehmberg</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Matching html tables to dbpedia</article-title>.
          In: <source>Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, WIMS 2015</source>.
          pp. <fpage>10:1</fpage>–<lpage>10:6</lpage>.
          <publisher-name>ACM</publisher-name> (<year>2015</year>), https://doi.org/10.1145/2797115.2797118
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Wang</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Shiralkar</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Lockard</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Huang</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Dong</surname>, <given-names>X.L.</given-names></string-name>,
          <string-name><surname>Jiang</surname>, <given-names>M.</given-names></string-name>:
          <article-title>TCN: table convolutional network for web table interpretation</article-title>.
          In: <source>WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021</source>.
          pp. <fpage>4020</fpage>–<lpage>4032</lpage>.
          <publisher-name>ACM / IW3C2</publisher-name> (<year>2021</year>), https://doi.org/10.1145/3442381.3450090
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>