LexMa: Tabular Data to Knowledge Graph Matching using Lexical Techniques

Shalini Tyagi1 and Ernesto Jimenez-Ruiz1,2
1 City, University of London, UK
2 University of Oslo, Norway
shaliniktyagi@gmail.com, ernesto.jimenez-ruiz@city.ac.uk

Abstract. With everyday life increasingly dependent on internet-based search, there is an ever-growing demand for fast and meaningful query answering. This demand gave rise to the Semantic Web. Among the many challenges in realizing the Semantic Web, a fundamental one is to design systems that enable semantic access to the information in tabular data (e.g., Web tables). In this paper, we present a system developed for the automatic annotation of tabular data using a knowledge graph. We call this system LexMa. Our system is based on lexical matching techniques. LexMa has participated in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020).

Keywords: Lexical Matching, Web Tables, Cosine Similarity, Semantic Table Interpretation.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Tabular data to knowledge graph (KG) matching is the procedure of assigning semantic tags from a KG, such as Wikidata, to the elements of a table [2]. On real-world data, however, this is hard in practice because of missing, noisy or incomplete values [3,8]. SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching is a challenge on annotating table elements with the Wikidata KG [1]. More specifically, table annotation consists of three tasks: cell to KG entity annotation (CEA), column to KG class annotation (CTA) and pair of columns to KG property annotation (CPA) [8]. These three tasks are summarized in Figure 1.

Fig. 1. The three annotation tasks of the challenge.

2 Methods

We developed the LexMa system to solve the CEA and CTA tasks using basic but efficient lexical techniques.

2.1 Method used for CEA task

In SemTab 2020, the target KG is Wikidata [9,10]. The CEA task is to annotate the cells of a table with specific entities of the Wikidata KG. The overall pipeline used to annotate individual cells is shown in Figure 2. Each cell value is first pre-processed by trimming the text in the cell and converting the resulting string to uppercase. The top-5 candidate entities for each cell value are then fetched from the Wikidata look-up service [11]. Lexical matching is then performed by computing the cosine similarity [5] between one-hot vectors built from the fetched entity labels and the cell value; labels and cell values are split into tokens, and stop words are removed, before the one-hot vectors are created (see the first sketch at the end of this section).

A considerable number of cells still remained unannotated, as their respective entities could not be found in the Wikidata KG. These missed values were searched in the DBpedia KG via its look-up service, and the retrieved DBpedia entities were then mapped to their (owl:sameAs) Wikidata counterparts via the DBpedia SPARQL endpoint [11] (see the second sketch at the end of this section).

Fig. 2. Pipeline for the CEA task.

2.2 Method used for the CTA task

After annotating the cell values, we retrieve the types of each of the annotated entities in a column using the Wikidata SPARQL endpoint [11]. The focus is to find the most suitable class to represent the entities in the column. For this task, we submitted the most frequent (i.e., most voted) type for each column (see the third sketch at the end of this section).
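Below is a minimal sketch of the matching step described in Section 2.1: fetching the top-5 candidates from the Wikidata look-up service and scoring them against the cell value with cosine similarity over one-hot token vectors. This is a reconstruction under assumptions rather than the exact LexMa code: the stop-word list is illustrative, and the wbsearchentities call is one standard way of querying the Wikidata look-up service.

    import math
    import re

    import requests

    # Illustrative stop-word list; LexMa's actual list may differ.
    STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "for"}
    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def preprocess(text):
        # Trim the cell text and convert it to uppercase (Section 2.1).
        return text.strip().upper()

    def tokens(text):
        # Tokenize and drop stop words before building the one-hot vectors.
        return {t for t in re.findall(r"\w+", text.upper())
                if t.lower() not in STOP_WORDS}

    def cosine_one_hot(a, b):
        # Cosine similarity of two one-hot (binary bag-of-words) vectors
        # reduces to |A intersect B| / sqrt(|A| * |B|).
        if not a or not b:
            return 0.0
        return len(a & b) / math.sqrt(len(a) * len(b))

    def top5_candidates(cell_value):
        # Fetch the top-5 entities for a cell value from the Wikidata look-up.
        params = {"action": "wbsearchentities", "search": cell_value,
                  "language": "en", "format": "json", "limit": 5}
        resp = requests.get(WIKIDATA_API, params=params, timeout=30)
        resp.raise_for_status()
        return [(e["id"], e.get("label", ""))
                for e in resp.json().get("search", [])]

    def annotate_cell(cell_value):
        # Return the Wikidata QID whose label is lexically closest to the cell.
        cell = preprocess(cell_value)
        best_qid, best_score = None, 0.0
        for qid, label in top5_candidates(cell):
            score = cosine_one_hot(tokens(cell), tokens(label))
            if score > best_score:
                best_qid, best_score = qid, score
        return best_qid  # e.g. annotate_cell("Paris") -> "Q90"

Note that for binary vectors the cosine similarity collapses to a set computation, so the one-hot vectors never need to be materialized.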
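For the cells missed by the Wikidata look-up, the DBpedia fallback can be sketched as follows. Again, this is an illustrative reconstruction: the docs/resource response fields follow the current public DBpedia look-up API and are an assumption here, while the SPARQL query resolves a DBpedia resource to its owl:sameAs Wikidata entity via the public endpoint.

    import requests

    DBPEDIA_LOOKUP = "https://lookup.dbpedia.org/api/search"
    DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

    def dbpedia_candidate(cell_value):
        # Query the DBpedia look-up service for the best-matching resource.
        # The 'docs'/'resource' field names are an assumption about the API.
        resp = requests.get(DBPEDIA_LOOKUP,
                            params={"query": cell_value, "format": "JSON"},
                            timeout=30)
        resp.raise_for_status()
        docs = resp.json().get("docs", [])
        return docs[0]["resource"][0] if docs else None

    def to_wikidata(dbpedia_uri):
        # Map a DBpedia resource to its owl:sameAs Wikidata entity.
        query = f"""
            PREFIX owl: <http://www.w3.org/2002/07/owl#>
            SELECT ?wd WHERE {{
              <{dbpedia_uri}> owl:sameAs ?wd .
              FILTER(STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
            }}"""
        resp = requests.get(DBPEDIA_SPARQL,
                            params={"query": query,
                                    "format": "application/sparql-results+json"},
                            timeout=60)
        resp.raise_for_status()
        bindings = resp.json()["results"]["bindings"]
        return bindings[0]["wd"]["value"] if bindings else None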
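Finally, the CTA majority vote of Section 2.2 can be sketched as below, assuming the cells of a column have already been annotated with Wikidata QIDs. Using the 'instance of' property (P31) as the type relation is an assumption made for this sketch.

    from collections import Counter

    import requests

    WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

    def entity_types(qid):
        # Fetch the classes of an entity via 'instance of' (P31).
        query = f"SELECT ?type WHERE {{ wd:{qid} wdt:P31 ?type . }}"
        resp = requests.get(WIKIDATA_SPARQL,
                            params={"query": query, "format": "json"},
                            headers={"User-Agent": "LexMa-sketch/0.1"},
                            timeout=60)
        resp.raise_for_status()
        return [b["type"]["value"]
                for b in resp.json()["results"]["bindings"]]

    def annotate_column(column_qids):
        # Majority vote: the most frequent type across the column wins.
        votes = Counter(t for qid in column_qids if qid
                        for t in entity_types(qid))
        return votes.most_common(1)[0][0] if votes else None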
3 Results

3.1 Result for CEA

In Round 1 of SemTab, we focused on the CEA and CTA tasks and submitted results for both to the challenge. We did not participate in the CPA task, as our motivation was to improve the CEA and CTA results. In Round 1, the CEA result was satisfactory, with an F1-score above 0.90; LexMa held the 8th position in the challenge (see Table 1).

Our focus in the subsequent rounds was to improve the CEA performance. LexMa achieved similar results and relative positions (see Table 1). 2T is the 'Tough Tables' dataset [6,10], which was used in Round 4 together with a synthetic dataset [9] of the same kind as in previous rounds. Figure 5 summarizes the performance in terms of F1-score, recall and precision for the different types of tables within the 2T dataset [6]. The 2T dataset brings additional complexity to the challenge, but on it LexMa, unlike in the other rounds, outperformed five participating systems (see Table 1).

Fig. 5. Performance on the 2T dataset in Round 4.

Table 1. Results for cell entity annotation. Official results: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/results.html

Task Name         F1-Score  Precision  Leaderboard position
CEA Round 1       0.909     0.913      8 out of 10 competing systems
CEA Round 2       0.915     0.927      9 out of 10 competing systems
CEA Round 3       0.863     0.907      9 out of 9 competing systems
CEA Round 4       0.845     0.911      7 out of 9 competing systems
CEA Round 4 (2T)  0.587     0.795      4 out of 9 competing systems

3.2 Result for CTA

For this task, LexMa initially obtained a 40.4% F1-score, but after removing duplicate values and applying a ranking method we improved the CTA result to 63.8%.

Table 2. Results for column type annotation. Official results: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/results.html

Task Name    Approximate F1-Score  Approximate Precision  Leaderboard position
CTA Round 1  0.638                 0.734                  14 out of 15 competing systems

3.3 Result links

Our GitHub repository (https://github.com/shaliniktyagi/TabularData_to_Knowledge_graph) contains all the final submitted results. The code for completing the challenge is also available in the repository, together with instructions on how to run it.

4 Discussion

Overall, this study developed a simple approach that nevertheless outperformed five systems on the 2T dataset, which suggests that LexMa provides a flexible system for automatic table annotation. While a number of methods are available, we took a rather simple but efficient approach based on existing technologies. Our main effort went into the pre-processing, lexical matching and parallel computing parts of the challenge.

In pre-processing, several ideas were tried; the most effective were selective special-character removal, duplicate-word removal, white-space removal and extra-punctuation removal (see the first sketch below). This pre-processing improved the KG look-up efficiency and resulted in good accuracy against the ground truth. We highly recommend appropriate data conditioning upfront for automated table annotation.

In lexical matching, using cosine similarity led to incremental accuracy gains against the ground truth. The lexical patterns could be analyzed further, and some pair-based analysis could be done. We also tried a string length-based constraint, but it did not lead to a significant improvement.

For the SemTab datasets, running the jobs locally was not feasible: not only the actual look-up of entities in the KG, but also the data wrangling and text formatting, were inefficient on a local machine. Parallel processing on the Google Colab platform [7] was a very efficient approach and reduced the turnaround time of the project significantly (see the second sketch below).
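Below is a minimal sketch of the data conditioning described above. The exact character sets removed by LexMa are not reproduced here, so the regular expressions are illustrative assumptions.

    import re

    def condition(value):
        # Selectively remove special characters (illustrative set).
        value = re.sub(r"[#@$%^*_\[\]{}<>|\\]", " ", value)
        # Remove runs of extra punctuation.
        value = re.sub(r"[.,;:!?]{2,}", " ", value)
        # Collapse white space.
        value = re.sub(r"\s+", " ", value).strip()
        # Remove consecutive duplicate words.
        words = []
        for w in value.split(" "):
            if not words or w.lower() != words[-1].lower():
                words.append(w)
        return " ".join(words)

    print(condition("  New  New   York!!! "))  # -> 'New York'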
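Because the KG look-ups are network-bound, they parallelize well. The sketch below shows one simple way to achieve such a speed-up with a thread pool; it reuses the annotate_cell function from the sketch in Section 2, and the worker count is an arbitrary choice.

    from concurrent.futures import ThreadPoolExecutor

    def annotate_cells(cell_values, workers=16):
        # KG look-ups are I/O-bound, so a thread pool yields a large speed-up
        # over a sequential loop without needing multiprocessing.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(annotate_cell, cell_values))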
The SemTab challenge offers a unique opportunity to learn and to grow programming skills. The pre-conditioning of the dataset and the text formatting were rigorous tasks that required a multi-platform approach. All in all, the study and the entire challenge created a wide pool of research work which will be beneficial to the academic community at large.

5 Conclusion and Future Work

In this study, the aim was to annotate tabular data with the Wikidata knowledge graph. Two table annotation tasks, CEA and CTA, were addressed in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, as discussed in detail above. Different techniques were used to improve the results on both tasks in Round 1, while in Rounds 2-4 the prime objective was to improve the performance of the CEA task. In Round 4 (2T dataset), LexMa produced very promising results in comparison to other systems. The SemTab challenge provides an engaging platform to systematically evaluate systems and drive system improvements.

Text processing and lexical matching with cosine similarity helped to reach an F1-score of 0.915 for the CEA task in Round 2, even though the Round 2 dataset contained more noise than that of Round 1; Rounds 3 and 4 brought additional noise and challenges. In conclusion, lexical matching techniques delivered good performance for the CEA task of matching a cell to a KG entity. Including the DBpedia KG did not add significant value to the overall results; it did, however, improve the look-up coverage.

In the future, we aim to improve column type annotation and cell entity annotation by using different techniques such as (pre-trained) word embeddings. These techniques use a neural network model to learn word correlations within text. The ColNet system [4], based on convolutional neural networks, produced state-of-the-art results for column type annotation. In the near future we also aim to analyze the use of CNNs to increase the accuracy of LexMa on the CEA and CTA tasks.

References

1. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph. In: International Semantic Web Conference (ISWC) (2018)
2. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), Banff, Alberta, Canada (2007)
3. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: Exploring the Power of Tables on the Web. In: VLDB 2008, Auckland, New Zealand (2008)
4. Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.A.: ColNet: Embedding the Semantics of Web Tables for Column Type Prediction. In: AAAI, pp. 29-36 (2019)
5. Pahi, K., Thapa, P., Shakya, S.: A Comparison of Semantic Similarity Methods for Maximum Human Interpretability. Pulchowk Campus, Nepal (2019)
6. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. In: International Semantic Web Conference (ISWC) (2020)
7. Parallelism in Python. https://colab.research.google.com/drive/1nMDtWcVZCT9q1VWen5rXL8ZHVlxn2KnL (accessed 21/10/2020)
8. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In: Extended Semantic Web Conference (ESWC) (2020)
9. Hassanzadeh, O., Efthymiou, V., Chen, J., Jiménez-Ruiz, E., Srinivas, K.: SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets (Version 2020) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4282879
10. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables: Carefully Benchmarking Semantic Table Annotators [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3840646
11. Jiménez-Ruiz, E.: Tabular Data Semantics for Python. https://github.com/ernestojimenezruiz/tabular-data-semantics-py