    AMALGAM: making tabular dataset explicit with
                 knowledge graph

     Rabia Azzi [0000−0002−8777−1566] and Gayo Diallo [0000−0002−9799−9484]

          BPH Center - INSERM U1219, Team ERIAS, Univ. Bordeaux,
                          F-33000, Bordeaux, France
                         first.last@u-bordeaux.fr



      Abstract. In this paper we present AMALGAM, a matching approach to
      annotate tabular datasets with entities from a knowledge graph, developed
      in the context of the Semantic Web Challenge on Tabular Data to Knowl-
      edge Graph Matching (SemTab 2020). The ultimate goal is to provide a
      fast and efficient approach to annotate tabular data with entities from
      a background knowledge graph. The approach combines lookup and filtering
      services with text pre-processing techniques. Experiments conducted in
      the context of SemTab 2020 on both the Column Type Annotation and Cell
      Entity Annotation tasks showed promising results.

      Keywords: Tabular Data · Knowledge Graph · AMALGAM · Entity Link-
      ing · FAIR principles.


1   Introduction

Making web data comply with the FAIR principles [1] has become a necessity
in order to facilitate their discovery and reuse. This can be achieved by anno-
tating these data with entities coming from ontologies and structured vocabularies,
semantic repositories [2] and Knowledge Graphs (KG). The Tabular Data to KG [3]
Matching challenge (SemTab 2020, http://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/index.html)
aims at benchmarking systems which deal with the task of annotating tabular data
with entities from a KG, referred to as table annotation [10]. Over recent years,
a number of systems for matching web tables to knowledge bases have been developed [4, 5].
    In order to perform a more fine-grained analysis of the SemTab challenge
tasks, we categorize them into two main tasks: structure annotation and semantic
annotation. Structure annotation deals with various tasks including data type pre-
diction and table header annotation [6]. Semantic annotation involves matching
table elements to a KG, e.g., columns to classes and cells to entities [7, 8].
    SemTab 2020 defines three tasks, organised into several evaluation rounds:
(i) assigning a semantic type (e.g., a KG class) to a column (CTA);
(ii) matching a cell to a KG entity (CEA); and (iii) assigning a KG property to the

relationship between two columns (CPA). The most popular approaches to these
three tasks rely on a supervised learning setting, where candidate entities
are selected by a classification model [9]. However, for real-time applications,
obtaining results as fast as possible is a requirement. The ultimate goal of the
current work is therefore to provide a fast and efficient approach to the tabular
dataset to KG matching task. We have designed and implemented AMALGAM, our
proposed approach to address the CTA and CEA tasks.

2    The AMALGAM approach
To address the above mentioned tasks of the SemTab challenge, AMALGAM is
designed according to the workflow depicted in Fig. 1. It consists of three
major phases: Pre-Processing, Annotation context, and Tabular data to KG
matching. The first two steps of the workflow are identical for the two tasks.




Fig. 1. Workflow of the AMALGAM approach. Pre-Processing aims to prepare the data
inside the table. Annotation context aims to create a list of terms that describe
the same context. Tabular data to KG matching: (1) assigning a semantic type to a
column (CTA), which aims to identify the attribute label of a column; (2) matching
a cell to a KG entity (CEA), which deals with mappings between terms and entities in a KG.


    To describe each phase of the AMALGAM approach we consider Table 1, which
lists towns of the Alberta region together with additional information: country,
elevation above sea level, etc.

Table 1. Table with a list of towns of the Alberta region, extracted from Round 1
of SemTab 2020.

    col0           col1            col2   col3                    col4 col5
    Grande Prairie city in Alberta Canada Sexsmith                650 Alberta
    Sundre         town in Alberta Canada Mountain View County 1093 Alberta
    Peace River    town in Alberta Canada Northern Sunrise County 330 Alberta
    Vegreville     town in Alberta Canada Mundare                 635 Alberta
    Sexsmith       town in Alberta Canada Hythe, Alberta          724 Alberta


   Tables Pre-Processing. As Table 1 shows, the content of each table
can have different types (string, date, float, etc.). The aim of this pre-processing
step is to ensure that the process of loading each table happens without any
error. For example, a textual encoding problem may cause some characters to be
loaded as noisy sequences, and a text field with an unescaped delimiter may cause
the record being processed to have an extra column. Loading incorrectly encoded
content might strongly affect the lookup performance. Therefore, we used the
Pandas library (https://pandas.pydata.org/) to fix all noisy textual data in the tables.
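    As an illustration, the following minimal sketch shows a tolerant loading step
of the kind described above; the function name and the concrete fallback choices
(encoding fallback, skipping malformed lines, pandas >= 1.3 for on_bad_lines) are
our assumptions, not the exact implementation.

    import pandas as pd

    def load_table(path: str) -> pd.DataFrame:
        """Load a table as text, tolerating encoding noise and ragged rows."""
        try:
            # Read every cell as a string so lookups see the raw values.
            return pd.read_csv(path, dtype=str, encoding="utf-8", on_bad_lines="skip")
        except UnicodeDecodeError:
            # Fall back to a permissive single-byte encoding on decode failure.
            return pd.read_csv(path, dtype=str, encoding="latin-1", on_bad_lines="skip")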
    Annotation context. We consider a table as a two-dimensional tabular
structure (see Fig. 2) composed of an ordered set of x rows and y columns.
Each intersection between a row and a column determines a cell cij with the value
vij, where 1 ≤ i ≤ x, 1 ≤ j ≤ y. To identify the attribute label of a column, also
called header detection (CTA task), the approach consists in annotating all the
items of the column using entity linking; the attribute label is then estimated
from the resulting entity links. In this case, the annotation context is
represented by the list of items in the column. For example, the context of the
first column in Fig. 2 is: [Grande Prairie, Sundre, Peace River, Vegreville,
Sexsmith]. Following the same logic, we consider that all cells in the same
row describe the same context. More precisely, the first cell of the row describes
the entity and the following cells its associated properties. For example, the
context of the first row in Fig. 2 is: [Grande Prairie, city in Alberta, Canada,
Sexsmith, 650, Alberta].
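    A minimal sketch of this context construction, assuming the table has been
loaded into a pandas DataFrame as above (the helper names are ours):

    import pandas as pd

    def column_context(df: pd.DataFrame, col: int) -> list:
        # All items of one column form the CTA annotation context.
        return df.iloc[:, col].dropna().astype(str).tolist()

    def row_context(df: pd.DataFrame, row: int) -> list:
        # The first cell names the entity, the following cells its properties (CEA).
        return df.iloc[row, :].dropna().astype(str).tolist()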
    Assigning a semantic type to a column (CTA). The CTA task is performed
by the process described in Fig. 2. The Wikidata API (https://www.wikidata.org/w/api.php)
allows looking up a Wikidata item by the title of its corresponding page on
Wikipedia or another Wikimedia family site. In our case, the main information
needed about an entity is the list of its instance of (P31), subclass of (P279)
and part of (P361) statements. To this end, a parser was developed to retrieve
this information from the response of the Wikidata request. For example, "Grande Prairie"
yields the following results: [list of towns in Alberta:Q15219391, village in Al-
berta:Q6644696, city in Alberta:Q55440238]. To achieve this, our methodology
combines the wbsearchentities and parse actions provided by the API. It could be
observed that, in this task, many items were not annotated because the tables
contain incorrectly spelled terms. Therefore, before implementing the other
tasks, a spell check component is required.
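    As an illustration of this lookup step, the sketch below retrieves a candidate
item and its type statements. It uses the wbsearchentities and wbgetclaims actions
of the Wikidata API; wbgetclaims is our substitute for the parsing step described
above, and the helper names are assumptions.

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def search_entity(label):
        """Return the QID of the first wbsearchentities match, or None."""
        params = {"action": "wbsearchentities", "search": label,
                  "language": "en", "format": "json"}
        hits = requests.get(WIKIDATA_API, params=params).json().get("search", [])
        return hits[0]["id"] if hits else None

    def type_statements(qid):
        """Collect the target QIDs of P31/P279/P361 claims for one item."""
        params = {"action": "wbgetclaims", "entity": qid, "format": "json"}
        claims = requests.get(WIKIDATA_API, params=params).json().get("claims", {})
        types = []
        for prop in ("P31", "P279", "P361"):
            for claim in claims.get(prop, []):
                value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
                if "id" in value:
                    types.append(value["id"])
        return types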
    As per the literature [11], a spell-checker is a crucial language tool of
natural language processing (NLP), used in applications like information extrac-
tion, proofreading, information retrieval, social media and search engines. In our
case, we compared several approaches and libraries: Textblob
(https://textblob.readthedocs.io/en/dev/), Spark NLP (https://nlp.johnsnowlabs.com/),
Gurunudi (https://github.com/guruyuga/gurunudi), Wikipedia api
(https://wikipedia.readthedocs.io/en/latest/code.html), Pyspellchecker
(https://github.com/barrust/pyspellchecker), and Serpapi
(https://serpapi.com/spell-check). A comparison of these approaches can be
found in Table 2.


       Table 2. Comparison of approaches and libraries related to spell-checking.

    Name           Category         Strengths/Limitations
    Textblob       NLP              Spelling correction, easy to use
    Spark NLP      NLP              Pre-trained models, text analysis, multilanguage
    Gurunudi       NLP              Pre-trained models, text analysis, easy to use, multilanguage
    Wikipedia api  Search engines   Search with suggestion, easy to use, unlimited access
    Pyspellchecker Spell checking   Simple spell checking algorithm, no pre-trained models, easy to use
    Serpapi        Search engines   Limited access for free


    Our choice fell on Gurunudi and the Wikidata API, with a post-processing
step consisting in validating the output using fuzzywuzzy
(https://github.com/seatgeek/fuzzywuzzy) so as to keep only the results whose
ratio is greater than a threshold of 90%. For example, for the expression
“St Peter’s Seminarz”, the Wikidata API returns “St Peter’s seminary”, with a
fuzzy string matching ratio of 95%.
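    A sketch of this validation step, assuming a case-insensitive comparison
(consistent with the 95% ratio reported above; the function name is ours):

    from fuzzywuzzy import fuzz

    def validate_correction(original: str, suggestion: str, threshold: int = 90) -> bool:
        # Keep a suggested spelling only if it stays close to the original term.
        return fuzz.ratio(original.lower(), suggestion.lower()) >= threshold

    # "St Peter's Seminarz" vs "St Peter's seminary" scores 95, so it is kept.
    validate_correction("St Peter's Seminarz", "St Peter's seminary")  # True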
    We are now able to perform the CTA task. In the trivial case, an item lookup
returns a single record, and the best matching entity is chosen as the result.
In the other cases, where the lookup returns more than one record, no annotation
is produced for the CTA task. Finally, if the lookup returns no result, another
lookup is performed using the output of the spell check of the item. At the end
of these lookups, the matched results are stored in a nested dictionary
[item: claims]. The most relevant candidate is then selected by counting the
number of occurrences.
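    The selection by occurrence counting could look like the following sketch
(the function name and the dictionary layout are assumptions based on the
description above):

    from collections import Counter

    def select_column_type(column_claims: dict):
        """Pick the type occurring most often across a column's annotated items.

        column_claims maps each cell value to its list of P31/P279/P361 types.
        """
        counts = Counter(t for types in column_claims.values() for t in types)
        return counts.most_common(1)[0][0] if counts else None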
    Matching a cell to a KG entity (CEA). The CEA task is performed by the
process described in Fig. 3. Our approach reuses the process of the CTA task
with the necessary adaptations. The first step is to get all the statements of
the first item of the context list. The process is the same as for CTA; the only
difference concerns the case where the result contains more than one record. In
this case, we create a nested dictionary with all the candidates and disambiguate
the candidate entities using the concept of the column generated by the CTA
task. Next, a lookup is performed for each of the other items of the context list
within the claims of the first item. If the item is found, it is selected as the
target entity; if not, the lookup is performed on the item itself using the
Wikidata API (if the result is empty, no annotation is produced).




               Fig. 2. Assigning a semantic type to a column (CTA).


    With this process, it is possible to reduce the errors associated with the
lookup. Take, for instance, the value “650” in row 0 of the table in Fig. 3. If
we look it up directly in Wikidata, we may get many results. However, if we first
check the statements of the first item of the list, “Grande Prairie”, we are more
likely to identify the correct item.
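    A sketch of this anchor-first lookup, reusing the search_entity helper from
the CTA sketch above (the function name and the exact matching rule are
assumptions):

    def resolve_cell(cell: str, anchor_claims: dict):
        """Check the row anchor's claims before a direct Wikidata lookup.

        anchor_claims maps each property of the row's first entity to the
        list of its values (labels or QIDs).
        """
        for values in anchor_claims.values():
            for value in values:
                if str(value).lower() == cell.lower():
                    return value  # found inside the anchor's statements
        return search_entity(cell)  # fall back to the API lookup (may be None)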




                   Fig. 3. Matching a cell to a KG entity (CEA).




3   Results
This section reports the overall results (the primary score) of AMALGAM for the
two matching tasks in the four rounds of SemTab 2020 [10]. Overall, these results
show that AMALGAM achieves promising performance for CTA and CEA.


       Table 3. Results of Round 1.

         TASK F1 Score Precision
         CTA  0.724    0.727
         CEA  0.913    0.914

       Table 4. Results of Round 2.

         TASK F1 Score Precision
         CTA  0.926    0.928
         CEA  0.921    0.927



    The primary results (F1 score and Precision) of AMALGAM for CTA and CEA
in the four rounds are presented in Tables 3, 4, 5 and 6. In the first round,

       Table 5. Results of Round 3.

         TASK F1 Score Precision
         CTA  0.869    0.873
         CEA  0.877    0.892

       Table 6. Results of Round 4.

         TASK F1 Score Precision
         CTA  0.858    0.861
         CEA  0.892    0.914




AMALGAM achieved the results shown in Table 3. It can be observed that AMALGAM
handles both tasks properly, in particular the CEA task. Regarding the CTA
task, the lower results can be explained by new revisions created in item
revision histories, and by probable spelling errors in the contents of the
tables. For example, "rural district of Lower Saxony" became "district of Lower
Saxony" after 16 April 2020. A possible solution is to retrieve the history of
revisions (by parsing the Wikidata history dumps) and exploit it; this is a
perspective for future work.
    In Round 2 we particularly focused on the spell check of items to improve the
results of the CEA and CTA tasks, relying on two API services, Wikipedia and
Gurunudi, for spelling correction. The achieved results are better than those of
the previous round in terms of both precision and F1 score. However, these
results can still be improved. From the previous rounds, we noted that a single
term can be ambiguous, as it may refer to more than one entity. In Wikipedia,
there is only one article for each concept; however, there can be many equivalent
titles for a concept due to the existence of synonyms, etc. For example, the term
“Paris” may refer to many concepts such as “the capital and largest city of
France”, “son of Priam, king of Troy”, or “county seat of Lamar County, Texas,
United States”. For the next rounds, item disambiguation is therefore required.
    In Rounds 3 and 4, to overcome the disambiguation issue, we updated our
approach by integrating the concept of the column obtained in CTA into the
linking phase. We showed that the two tasks can be performed relatively
successfully with AMALGAM, achieving precision and recall values higher than
0.86. However, the automatic disambiguation of items proved to be a more
challenging task.
    A second evaluation was performed on the Tough Tables (2T) dataset, which is
designed to evaluate table annotation approaches on the CEA and CTA tasks [12].
The results of this second evaluation are given in Table 7. Compared to the four
rounds of the traditional challenge, the performance of AMALGAM declined, with
F1 scores of 0.32 for CEA and 0.60 for CTA. The same behaviour is observed for
all the other systems participating in the SemTab 2020 challenge
(https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2020/results.html), which
suggests that annotating 2T is more challenging.



                                 Table 7. 2T Results.

                               TASK F1 Score Precision
                               CTA 0.606     0.608
                               CEA 0.323     0.553


4    Conclusion and Future Work
In this paper, we presented AMALGAM, a matching approach for making tabular
datasets explicit by annotating them with entities from a knowledge graph, in
the context of the SemTab 2020 challenge.
    Its advantage is that it performs the CTA and CEA tasks in a timely manner,
through the combination of lookup services and spell checking techniques. This
was the first participation of the AMALGAM system, which is still at an early
stage, so there is still room for improvement. Nevertheless, the experimental
results show that our approach is promising.
    Our findings during this first participation suggest that the matching process
is very sensitive to spelling errors. Thus, as future work, improved spell
checking techniques will be investigated. Processing such errors requires
context-based spell-checkers: often the string is very close in spelling, and
context can help reveal which word makes the most sense. Furthermore, we will
improve the approach by finding a trade-off between effectiveness and efficiency.

References
1. Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Princi-
   ples for scientific data management and stewardship. Sci Data 3, 160018 (2016).
   https://doi.org/10.1038/sdata.2016.18
2. Diallo, G.: Efficient Building of Local Repository of Distributed Ontologies. Proceed-
   ings of the 7th International Conference on Signal Image Technology Internet-Based
   Systems (SITIS’2011), 2011. pages 159-166. doi:10.1109/SITIS.2011.45
3. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge
   graphs: Representation, acquisition and applications. CoRR abs/2002.00388 (2020)
4. Subramanian, A., Srinivasa, S.: Semantic Interpretation and Integration of Open
   Data Tables. In: Geospatial Infrastructure, Applications and Technologies: India
   Case Studies, pp. 217–233. Springer Singapore (2018). https://doi.org/10.1007/978-
   981-13-2330-0 17
5. Taheriyan, M., Knoblock, C.-A., Szekely, P., Ambite, J.-L.: Learning the semantics
   of structured data sources. Web Semantics: Science, Services and Agents on the
   World Wide Web 37(38), 152–169 (2016)
6. Zhang, L., Wang, T., Liu, Y., Duan, Q.: A semi-structured information semantic
   annotation method for Web pages. Neural Computing and Applications 32(11),
   6491–6501 (2019)
7. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching
   Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embed-
   dings. In: Lecture Notes in Computer Science, pp. 260–277. Springer International
   Publishing (2017). https://doi.org/10.1007/978-3-319-68288-4 16

8. Eslahi, Y., Bhardwaj, A., Rosso, P., Stockinger, K., Cudre-Mauroux, P.: Anno-
   tating Web Tables through Knowledge Bases: A Context-Based Approach. In:
   2020 7th Swiss Conference on Data Science (SDS), pp. 29–34. IEEE (2020).
   https://doi.org/10.1109/sds49233.2020.00013
9. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas,
   K.: SemTab2019: Semantic Web Challenge on Tabular Data to Knowl-
   edge Graph Matching - 2019 Data Sets (Version 2019) [Data set]. Zenodo.
   https://doi.org/10.5281/zenodo.3518539
10. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas, K.:
   SemTab2020: Semantic Web Challenge on Tabular Data to Knowledge Graph
   Matching - 2020 Data Sets, October 2020.
11. Shashank, S., Shailendra, S.: Systematic review of spell-checkers for highly inflec-
   tional languages. Artificial Intelligence Review 53(6), 4051–4092 (2019)
12. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough Tables:
   Carefully Evaluating Entity Linking for Tabular Data. Zenodo (2020).
   https://doi.org/10.5281/ZENODO.4246370