<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LexMa: Tabular Data to Knowledge Graph Matching using Lexical Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shalini Tyagi</string-name>
          <email>shaliniktyagi@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Jimenez-Ruiz</string-name>
          <email>ernesto.jimenez-ruiz@city.ac.uk</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Everyday life increasingly depends on internet-based search, which creates an ever-growing demand for fast and meaningful query systems. This demand has given rise to the concept of the Semantic Web. Developing the Semantic Web poses many challenges; one fundamental challenge is to design systems that enable semantic access to the information in tabular data (e.g., Web tables). In this paper, we present LexMa, a system developed for the automatic annotation of tabular data using a knowledge graph. LexMa is based on lexical matching techniques and participated in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020).</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical Matching</kwd>
        <kwd>Web Tables</kwd>
        <kwd>Cosine Similarity</kwd>
        <kwd>Semantic Table Interpretation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Tabular data to knowledge graph (KG) matching is the process of assigning
semantic tags from a KG, such as Wikidata, to the elements of a table [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However,
with real-world data this is hard in practice because of missing, noisy or incomplete
data [
        <xref ref-type="bibr" rid="ref3 ref8">3,8</xref>
        ]. SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge
Graph Matching is a challenge on annotating parts of a table with semantic tags
from the Wikidata KG [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More specifically, table annotation consists of three tasks:
cell to KG entity annotation (CEA), column to KG class annotation (CTA) and pair of
columns to KG property annotation (CPA) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These three tasks are summarized in
Figure 1.
We developed the LexMa system to solve the CEA and CTA tasks using basic but
efficient lexical techniques.
      </p>
      <p>
        Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
In SemTab 2020, the target KG is Wikidata [
        <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
        ]. The CEA task is to annotate each
cell of the table with a specific entity of the Wikidata KG. The schematic of the
overall pipeline used to annotate single cells is shown in Figure 2. Each cell
value in the table was first pre-processed by trimming its text and
converting the resulting string to uppercase. After that, the top-5 candidate entities were fetched
for each cell value from the Wikidata look-up service [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Thereafter, the lexical
match between the cell value and each fetched entity label was scored using the cosine similarity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of their one-hot encoded
token vectors. Labels and cell
values were split into tokens and stop words were removed before creating the
one-hot vectors. A considerable number of cells still received no annotation
because their respective entities could not be found in the Wikidata KG. These
unmatched values were searched in the DBpedia KG via its look-up service, and the resulting
DBpedia entities were mapped to their equivalent (same as) Wikidata entities via the DBpedia SPARQL Endpoint [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        After annotating the cell values, we retrieved the types of the annotated entities
in each column using the Wikidata SPARQL Endpoint [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The goal of the CTA task is to find
the most suitable class to represent the entities in the column. For this task, we
submitted the most frequent (most voted) type for each column.
      </p>
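      <p>The matching and voting steps described above can be illustrated with a minimal sketch. It is not the LexMa implementation: the stop-word list, the function names and the candidate format (pairs of entity identifier and label) are assumptions made for illustration only. For one-hot (binary) token vectors, the cosine similarity reduces to the number of shared tokens divided by the square root of the product of the two token counts.</p>
      <preformat>
```python
import math
from collections import Counter

# Hypothetical stop-word list; the paper does not state which list was used.
STOP_WORDS = {"THE", "OF", "AND", "A", "AN", "IN"}


def tokens(text):
    """Trim, uppercase and split a string into a token set without stop words."""
    return {t for t in text.strip().upper().split() if t not in STOP_WORDS}


def cosine_similarity(a, b):
    """Cosine similarity of the one-hot (binary) token vectors of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    # Dot product of binary vectors = number of shared tokens;
    # norm of a binary vector = square root of its token count.
    return len(ta.intersection(tb)) / math.sqrt(len(ta) * len(tb))


def best_candidate(cell_value, candidates):
    """Return the (entity_id, label) pair whose label best matches the cell."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: cosine_similarity(cell_value, c[1]))


def most_frequent_type(types_per_entity):
    """Majority vote: the type shared by the most annotated entities in a column."""
    votes = Counter(t for types in types_per_entity for t in set(types))
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```
      </preformat>
      <p>For example, given the illustrative candidates ("E1", "Paris Hilton"), ("E2", "Paris") and ("E3", "University of Paris"), the cell value " paris " is matched to "E2", because its label is the only one whose token set equals that of the cell.</p>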
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>Result for CEA</p>
      <p>
        In Round 1 of SemTab, we focused on the CTA and CEA tasks and submitted
results for both to the challenge. We did not participate in the CPA task because our
priority was to improve the CEA and CTA results. In Round 1, the CEA result was
satisfactory, with an accuracy above 90%, and LexMa ranked 8th in the challenge
(see Table 1). Our focus in the following rounds was to improve the performance on the
CEA task. LexMa achieved similar results and relative positions (see Table 1).
2T is the ‘Tough Tables’ dataset [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
        ] which was used in Round 4 together with a
synthetic dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as in previous rounds. Figure 5 summarizes the performance in
terms of F1-score, recall and precision for different types of tables within the 2T
dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The 2T dataset brings additional complexity to the challenge, but on it LexMa,
unlike in the other rounds, outperformed five participating systems (see Table 1).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>Our GitHub repository
(https://github.com/shaliniktyagi/TabularData_to_Knowledge_graph) contains all the
final submitted results. The code used in the challenge is also available in the repository,
together with instructions on how to run it.</p>
      <p>Overall, this study developed a simple approach that nevertheless achieved better
results on 2T than five other systems, which suggests that LexMa provides a flexible system for
automatic table annotation. While a number of methods are available, we took a
rather simple but efficient approach based on existing technologies. Our main
effort went into the pre-processing, lexical matching and parallel computing parts of the
challenge. In pre-processing, several ideas were tried, but the most effective were the
selective removal of special characters, duplicate words, white space
and extra punctuation. This pre-processing improved the KG look-up
efficiency and resulted in good accuracy against the ground truth. We highly
recommend appropriate data conditioning upfront for automated table
annotation.</p>
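      <p>A minimal sketch of such a pre-processing step is given below. The character set treated as "special", the punctuation rule and the function name are assumptions made for illustration; the rules actually used by LexMa may differ.</p>
      <preformat>
```python
import re


def preprocess(cell):
    """Clean a raw cell value before KG look-up (illustrative rules only)."""
    # Replace selected special characters with a space; this character set
    # is an assumption, not the one used by LexMa.
    text = re.sub(r"[#*_|~^\[\]{}]", " ", cell)
    # Collapse runs of the same punctuation mark ("!!", ",,") to a single mark.
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)
    # Drop repeated words (case-insensitive), keeping first occurrences in order.
    seen = set()
    words = []
    for word in text.split():
        key = word.upper()
        if key not in seen:
            seen.add(key)
            words.append(word)
    # Joining on single spaces also removes leading, trailing and extra spaces.
    return " ".join(words)
```
      </preformat>
      <p>With these rules, the raw value "  New   York  york **City** " becomes "New York City", which is a cleaner query string for a look-up service.</p>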
      <p>In lexical matching, using cosine similarity resulted in an incremental improvement in
accuracy against the ground truth. The lexical patterns could be analyzed further, for example
with some pair-based analysis. We also tried a string length-based constraint, but it did
not lead to a significant improvement.</p>
      <p>
        For the SemTab datasets, running the jobs locally was not feasible: not only
the actual look-up of the entities in the KG but also the data
wrangling and text formatting were inefficient on a local
machine. Parallel processing on the Google Colab [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] platform was a very
efficient approach and reduced the turnaround time of the project significantly.
The SemTab challenge offers a unique opportunity to learn and to develop
programming skills. Pre-conditioning the dataset and editing the text format
was a rigorous task that required a multi-platform approach. All in all, the
study and the entire challenge created a wide pool of research work which will be
beneficial to the academic community at large.
      </p>
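      <p>As an illustration of why parallelism helps with look-ups, the sketch below runs a placeholder look-up function concurrently with a thread pool from the Python standard library. This is a swapped-in technique for illustration, not the Colab set-up used in the challenge; the function names, the placeholder look-up and the number of workers are assumptions.</p>
      <preformat>
```python
from concurrent.futures import ThreadPoolExecutor


def lookup(cell_value):
    """Placeholder for a KG look-up; a real version would query a service."""
    return cell_value.strip().upper()


def annotate_in_parallel(cell_values, workers=8):
    """Run the look-ups concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input sequence.
        return list(pool.map(lookup, cell_values))
```
      </preformat>
      <p>Because look-ups against a remote service spend most of their time waiting for the network, threads are usually sufficient for this kind of workload.</p>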
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this study, the aim was to annotate tabular data with the Wikidata knowledge
graph. Two table annotation tasks of the Semantic Web
Challenge on Tabular Data to Knowledge Graph Matching, CEA and CTA, were addressed,
as discussed in detail above. Different techniques were used to improve
the results on both tasks in Round 1, whereas in Rounds 2-4 the prime objective was to
improve the performance on the CEA task by using different methods. In Round 4 (2T
dataset), LexMa produced very promising results in comparison to other systems.
The SemTab challenge provides an engaging platform to systematically evaluate systems
and leads to system improvements. Text processing and lexical matching
with cosine similarity led to a slight improvement, with 91.5% for the CEA task, whereas
the Round 2 dataset had more noise than the Round 1 dataset. Rounds 3 and 4
also brought additional noise and challenges. In conclusion, lexical matching
techniques were able to improve performance on the CEA task of matching a cell to a
KG entity. Including the DBpedia KG did not add significant value in terms of the overall
results; it did, however, improve the look-up step.</p>
      <p>
        In the future, we aim to improve column type annotation and cell entity annotation
by using different techniques such as (pre-trained) word embeddings. These techniques
use a neural network model to learn word correlations within the text. The system
ColNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], based on convolutional neural networks (CNNs), produced state-of-the-art results
for column type annotation. In the near future, we also aim to analyse the use of
CNNs to increase the accuracy of LexMa for the CEA and CTA tasks.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Malyshev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonsior</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bielefeldt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph</article-title>
          .
          <source>Wikimedia Foundation</source>
          , San Francisco, U.S.A (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2007</year>
          )
          <article-title>YAGO: A core of semantic knowledge unifying WordNet and Wikipedia</article-title>
          .
          <source>Proceedings of the 16th International Conference on World Wide Web, WWW 2007</source>
          , Banff, Alberta, Canada, May 8-12, 2007
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>WebTables: Exploring the Power of Tables on the Web</article-title>
          ,
          <source>VLDB '08</source>
          , Auckland, New Zealand.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          <article-title>ColNet: Embedding the Semantics of Web Tables for Column Type Prediction</article-title>
          . In: AAAI. pp.
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          . (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pahi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thapa</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shakya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>A Comparison of Semantic Similarity Methods for Maximum Human Interpretability</article-title>
          , University Pulchowk Campus : Nepal (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Vincenzo</given-names>
            <surname>Cutrona</surname>
          </string-name>
          , Federico Bianchi, Ernesto Jiménez-Ruiz and Matteo Palmonari.
          <article-title>Tough Tables: Carefully Evaluating Entity Linking for Tabular Data</article-title>
          .
          <source>International Semantic Web Conference (ISWC)</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Parallelism In Python, https://colab.research.google.com/drive/1nMDtWcVZCT9q1VWen5rXL8ZHVlxn2KnL, (Accessed on 21/10/2020)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Ernesto</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          , Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen and
          <string-name>
            <given-names>Kavitha</given-names>
            <surname>Srinivas</surname>
          </string-name>
          .
          <article-title>SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems</article-title>
          .
          <source>Extended Semantic Web Conference (ESWC)</source>
          .
          <year>2020</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Oktie</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          , Vasilis Efthymiou, Jiaoyan Chen, Ernesto Jiménez-Ruiz, and Kavitha Srinivas.
          <article-title>SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets (Version 2020) [Data set]</article-title>
          .
          <source>Zenodo</source>
          . https://doi.org/10.5281/zenodo.4282879
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Vincenzo</given-names>
            <surname>Cutrona</surname>
          </string-name>
          , Federico Bianchi, Ernesto Jiménez-Ruiz and
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Palmonari</surname>
          </string-name>
          .
          <article-title>Tough Tables: Carefully Benchmarking Semantic Table Annotators [Data set]</article-title>
          .
          <source>Zenodo</source>
          . https://doi.org/10.5281/zenodo.3840646
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Ernesto</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          .
          <article-title>Tabular Data Semantics for Python</article-title>
          . https://github.com/ernestojimenezruiz/tabular-data-semantics-py
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>