=Paper=
{{Paper
|id=Vol-2775/paper10
|storemode=property
|title=Generating Conceptual Subgraph from Tabular Data for Knowledge Graph Matching
|pdfUrl=https://ceur-ws.org/Vol-2775/paper10.pdf
|volume=Vol-2775
|authors=Donguk Kim,Heesung Park,Jae Kyu Lee,Wooju Kim
|dblpUrl=https://dblp.org/rec/conf/semweb/KimPLK20
}}
==Generating Conceptual Subgraph from Tabular Data for Knowledge Graph Matching==
Donguk Kim1, Heesung Park1, Jae Kyu Lee1 and Wooju Kim1*

1Department of Industrial Engineering, Yonsei University, Seoul, Republic of Korea

tmdh78@yonsei.ac.kr, 2020311517@yonsei.ac.kr, dlworb1994@yonsei.ac.kr, wkim@yonsei.ac.kr

'''Abstract.''' In this paper, we study the problem of analyzing the relationship between data given in tabular format and a target knowledge graph, e.g., Wikidata. The central task is to find the label in Wikidata that expresses the correct meaning of each cell, since data and values become interpretable only once they are annotated with such labels. Correctly understanding or inferring this meaning is very difficult for a machine, so the data must be tagged accurately. Wikidata attaches a label to each document, and documents are linked to one another, so the connected data can be represented as a graph. In this paper, we propose a method that builds a graph from related elements and infers the relationships of the remaining elements using advanced Wikidata SPARQL queries. Above all, this approach yields clearly interpretable inference relationships and is well suited to environments where data changes frequently, such as an open-access database.

'''Keywords:''' Knowledge Graph, Wikidata, SPARQL Query, Semantic Annotation.

===1 Introduction===

Annotating data is one of the most important tasks when working with tabular data: with accurate annotations, further information can be inferred without much additional input. Attaching an appropriate annotation can therefore be regarded as knowing the semantics of the data. In that sense, identifying the meaning of a table against a knowledge graph is crucial, because one faulty inference in a data-processing pipeline can trigger further faulty inferences and eventually propagate throughout the pipeline. The data we used were based on Wikidata.
Wikidata is composed of facts consisting of a subject (S), a predicate (P), and an object (O). Each element carries a label in Wikidata, which makes it possible to identify its semantics [1].

* Corresponding author: Wooju Kim. This work is financially supported by the Korea Ministry of Land, Infrastructure and Transport (MOLIT) as part of the "Innovative Talent Education Program for Smart City". Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

====1.1 Task Description====

In the SemTab challenge, three tasks were given: CTA, CEA, and CPA [2]. Column Type Annotation (CTA) assigns a semantic type to a column. Cell Entity Annotation (CEA) matches a cell to a KG entity; this annotates each individual subject and object element. Columns Property Annotation (CPA) assigns a KG property to the relationship between two columns; the task is to find the Wikidata property that connects the elements of the two columns, in other words, to attach the annotation matching the predicate (Fig. 1).

Fig. 1. Challenges in Tabular Data (CTA: assigning a semantic type to a column; CEA: matching a cell to a KG entity; CPA: assigning a KG property to the relationship between two columns).

====1.2 Assumptions====

We made the following assumptions to solve the problem.
* Assumption 1. Every target must have a correct answer.
* Assumption 2. The first column must be the key value (S) of its row for building the conceptual graph.
* Assumption 3. Typos occur only in the first column.
* Assumption 4. CTA is linked only to the property "instance of".
* Assumption 5. An item with a lower identifier number represents a wider class.

We established Assumption 1, i.e., we assumed that there is always an answer, because accurate reasoning is impossible without a clear answer.
We approach the problem from a graph perspective and use Assumption 2 to find the key part of the graph. The remaining assumptions were set rather aggressively, based on factors we observed empirically over the course of the rounds.

===2 Conceptual Graph===

====2.1 Target Table Structure====

Given the tabular data for the task, we approached each table from a matrix perspective and redefined it as follows (Table 1). The subject cell is the zeroth column of the target table; it holds the title of a Wikidata document and always carries a document label, never a literal. The object cells are the cells of all columns except the zeroth; they correspond to objects inside the Wikidata document of the subject title and, unlike subject cells, may hold literal values that are not tagged with a label.

Table 1. Target Table Structure
* Target Table (t): m × n matrix.
* Subject Cell: t(i, 0), i = 1, 2, …, m.
* Object Cell: t(i, j), i = 1, 2, …, m; j = 1, 2, …, n.
* Header Row: t(0, j), j = 0, 1, 2, …, n.

====2.2 Generating Subgraph====

Suppose the target information is given as follows: CTA for column 0, CPA for the head/tail column pair (0, 1), and CEA for the cells of columns 0 and 1. We can then find the values in Table 2 and generate a conceptual subgraph as shown in Fig. 2.
Table 2. Target Table

{| class="wikitable"
! col0 !! col1 !! col2 !! col3 !! col4
|-
| Leesmuseum || Amsterdam || Netherlands || 1800-11-17 || reading museum
|-
| The Marlowe || Cambridge || United Kingdom || 1907-05-01 || theatrical troupe
|-
| Club Gorca || Seville || Spain || 1966-01-01 || organization
|-
| Pennsylvania Horticultural Society || Philadelphia || United States of America || 1827-01-01 || organization
|-
| College of Physicians of Philadelphia || Philadelphia || United States of America || 1787-01-01 || organization
|}

Fig. 2. Conceptual Subgraph. The subgraph for the first row: Leesmuseum (Q2472824) is the Wikidata document title t(1, 0); it is connected to society (Q8425) via "instance of" (P31) and to Amsterdam (Q9899) via "located in the administrative territorial entity" (P131).

Tabular data can be expressed as a knowledge graph of SPO relationships. Above all, given a clear label, such as a document title defined as an item label, all relationships within a Wikidata document can be found. The blue line in Fig. 2 marks the title of the Wikidata document; every relationship can be reached from this title label. Starting from the document title, the system can follow the list of targets to the right and perform the matching tasks for the table: to the right, it can confirm that the other objects are connected through the CPA match "located in the administrative territorial entity" (P131); similarly, to the left, it can check the CTA connected through "instance of", by Assumption 4.
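As a concrete illustration, the subgraph of Fig. 2 for the first table row can be written as a small set of (S, P, O) triples. The Q/P identifiers below are the Wikidata IDs shown in the figure; the dictionary layout and the `describe` helper are illustrative choices, not the system's actual data structure:

```python
# Minimal sketch of the conceptual subgraph for the first table row.
# Identifiers (Q2472824, Q9899, Q8425, P31, P131) come from Fig. 2;
# the representation itself is an assumption for illustration only.
row_subgraph = [
    # (subject, predicate, object) triples rooted at the subject cell t(1, 0)
    ("Q2472824", "P31", "Q8425"),   # Leesmuseum -- instance of -> society
    ("Q2472824", "P131", "Q9899"),  # Leesmuseum -- located in ... -> Amsterdam
]

labels = {
    "Q2472824": "Leesmuseum",
    "Q9899": "Amsterdam",
    "Q8425": "society",
    "P31": "instance of",
    "P131": "located in the administrative territorial entity",
}

def describe(triples, id_labels):
    """Render each triple with its human-readable Wikidata labels."""
    return [f"{id_labels[s]} --{id_labels[p]}--> {id_labels[o]}"
            for s, p, o in triples]
```

Following the document title label in this way exposes both the CPA edge (P131) and the CTA edge (P31) of the row.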
It is important to find Leesmuseum through such graphs, but we prioritized finding the label of Leesmuseum (Q2472824): if we know at least one subject, we can deduce the remaining elements or at least narrow the candidates down to those close to the answer.

===3 System Description===

In this section we present the architecture of our system, which consists of four stages, as illustrated in Fig. 3: Stage 1, Candidate Extraction; Stage 2, Node Selection; Stage 3, Subject Crawling; Stage 4, Element Inference. Each stage is described in more detail in a separate subsection.

Fig. 3. System Framework.

* Stage 1. Find SPO candidates in the tabular data using advanced SPARQL queries.
* Stage 2. Select the subject with the highest probability value among the candidates and update the graph.
* Stage 3. Crawl for any subject that could not be found, then repeat Stages 1–2.
* Stage 4. Infer the remaining elements using the updated graph, then repeat Stages 1–2.

====3.1 Candidate Extraction====

Table 3. Target Table of Data Containing Typos

{| class="wikitable"
! col0 !! col1 !! col2 !! col3 !! col4
|-
| Leesmuseum || Amsterdam || Netherlands || 1800-11-17 || reading museum
|-
| The Marlowe || Cambridge || United Kingdom || 1907-05-01 || theatrical troupe
|-
| Cl?b Gorca || Seville || Spain || 1966-01-01 || organization
|-
| Pen$nsylvania Horticultural Society || Philadelphia || United States of America || 1827-01-01 || organization
|-
| College of Physicians of Philadelphia || Philadelphia || United States of America || 1787-01-01 || organization
|}

In Stage 1, the advanced SPARQL query, a rule-based model that chooses an appropriate query for each data type (a constant-pattern value such as a date, a numerical value, or text), is used to find the table's values in Wikidata. Since the table files can be converted to UTF-8, no preprocessing for language type was performed. For each table, a mix of query features was applied according to the data type [3].
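The exact queries used by the system are not listed in the text; for illustration, a text-type candidate query of the kind described, matching a subject label against a row's object labels, might take roughly the following shape (the query template and the `build_text_query` helper are our assumptions, not the system's actual code):

```python
# Hypothetical sketch of a rule-based query builder for text-valued cells.
# It looks for (subject, predicate) candidate pairs where the subject's
# label matches the subject cell and some statement of that subject points
# to an entity whose label matches an object cell.
def build_text_query(subject_label: str, object_label: str) -> str:
    """Return a SPARQL query string for one subject/object cell pair."""
    return f"""
    SELECT ?s ?p WHERE {{
      ?s rdfs:label "{subject_label}"@en .
      ?s ?p ?o .
      ?o rdfs:label "{object_label}"@en .
    }}"""

query = build_text_query("Leesmuseum", "Amsterdam")
```

Date-typed and number-typed cells would use analogous templates with literal comparisons instead of label matching.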
The data type is determined from the first row of the table only: because all the correct answers exist by Assumption 1, every cell of a column must have the same type. Assumption 1 thus provides the important evidence that the first column in Table 3 shares the same attributes even though it contains typos. Determining the data type this way avoids having to check every cell. The data types are divided into text, number, and date. Numbers, however, can also appear as non-labeled literals and are then included under text; the reason for this distinction is that values expressing properties of numbers, such as prime or even numbers, are more appropriately classified as text. Since the database can be updated at any time, values such as population, length, and width may differ slightly, so the numeric queries specify a range of ±1.5% around the cell value. Each query matches one subject and the remaining object cells of a row one-to-one; this search method ensures that the correct subject is always among the candidates returned by the query.

====3.2 Node Selection====

Fig. 4. Updated Graph, marking the answer nodes for CTA, CEA, and CPA.

After selecting all candidates with the queries, we apply a simple preprocessing step. As mentioned above, numbers can be either text or literals, so the query takes both cases into account. Because such a query also extracts content unrelated to the annotation into the predicate candidates, those candidates are removed. After preprocessing, the subject with the highest frequency of appearance among all candidates is selected. If several candidates tie, all of them are processed identically and one is chosen at random for the final step. Once a subject is determined, the objects related to that subject are matched.
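The two numeric rules above, the ±1.5% tolerance window and picking the most frequent subject among all candidates, can be sketched as follows (function names and the tie-breaking detail are illustrative):

```python
from collections import Counter

TOLERANCE = 0.015  # +/-1.5% window around a numeric cell value

def matches_number(cell_value: float, kg_value: float) -> bool:
    """True if the knowledge-graph value lies within +/-1.5% of the cell value."""
    return abs(kg_value - cell_value) <= TOLERANCE * abs(cell_value)

def select_subject(candidates):
    """Pick the candidate subject appearing most often across the row queries.
    Ties fall back to an arbitrary choice, mirroring the random selection
    described in the text."""
    counts = Counter(candidates)
    return counts.most_common(1)[0][0]
```

For example, a stored population of 101 would match a cell value of 100 (within 1.5), while 102 would not.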
However, the search for the predicate must continue until the remaining Wikidata page titles are found. Using the determined subject and objects, the nodes of the graph are filled in (Fig. 4).

====3.3 Subject Crawling====

If the tabular data contained no errors, the whole process would usually be finished in the previous step. In the challenge, however, the data contained many typos and errors: misspellings, incorrect spacing, and omitted special symbols or numbers. To resolve these typos, we approached the problem by crawling through a search engine. The crawl used the Google search engine; during crawling, several cases were classified and prioritized. When using the Google search engine, Google sometimes corrects typos automatically and recommends related search terms. Crawling was performed on the top result page, prioritized in the following order: results related to the Wikidata title, automatically corrected typos, and finally results related to the Wikipedia title.

====3.4 Element Inference====

After correcting the typos, we repeat Stages 1–2. The system can then find a subject that is highly relevant to the subject marked in yellow in the graph (Fig. 5a). As with subjects, the candidate with the highest probability value is selected from the predicate list maintained so far. If several predicates tie, we keep a predicate list containing only the tied candidates. For CTA, only "instance of" (P31) is used, by Assumption 4. Afterwards, inference over the remaining elements of the graph is performed to find the elements marked in orange (Fig. 5b). If a predicate list of tied candidates remains, the final predicates are selected through the inferred elements, which completes the process.
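The predicate-list narrowing of Stage 4 can be sketched as: keep only the candidates tied at the highest score, then filter them against relations confirmed by already-inferred elements (the function names and scoring representation are ours, for illustration):

```python
def narrow_predicates(scores):
    """Keep only predicates tied at the maximum score, i.e. the list of
    'equivalent probability' candidates maintained in Stage 4."""
    best = max(scores.values())
    return {p for p, s in scores.items() if s == best}

def resolve_with_inferred(tied, confirmed_relations):
    """Among tied predicates, prefer those also observed between
    already-inferred graph elements; keep the full tie otherwise."""
    supported = tied & confirmed_relations
    return supported if supported else tied
```

For example, if P131 and P276 tie after Stage 2, an inferred element that is only reachable via P131 resolves the tie.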
Fig. 5. Update Graph Process. (a) The graph after Stage 3, with the associative subject; (b) the inference targets connected via "instance of" (P31).

===4 Results===

We continued to update our system round by round. In Rounds 1 and 2 the results were good, with few problems. In Round 3, however, the CPA results were particularly poor; all rounds are included in Table 4. This came about as the number of rows per table decreased and the number of errors increased: we confirmed that incorrect reasoning rarely occurs when a table has many rows, whereas many errors occur in the opposite case.

Table 4. Challenge Results

{| class="wikitable"
! !! colspan="2" | CTA !! colspan="2" | CEA !! colspan="2" | CPA
|-
! !! AF1-Score !! A-Precision !! F1-Score !! Precision !! F1-Score !! Precision
|-
| Round 1 || 0.861 || 0.860 || 0.936 || 0.936 || 0.943 || 0.943
|-
| Round 2 || 0.966 || 0.966 || 0.961 || 0.961 || 0.973 || 0.973
|-
| Round 3 || 0.913 || 0.913 || 0.906 || 0.906 || 0.815 || 0.815
|-
| Round 4 || 0.655 || 0.655 || 0.617 || 0.819 || 0.924 || 0.924
|}

===5 Conclusion===

This system presents a method for approaching the semantic table annotation tasks by creating SPARQL queries and graphs. Accessing Wikidata through queries is very simple and much lighter than downloading a database dump directly, an advantage that is especially clear for small data sets. In addition, this approach suits the nature of Wikidata, whose data may be modified at any time. Several improvements remain. The method can only be applied within the closed world of Wikidata, and terms with many broad meanings take a long time to process. Although we set up many assumptions to solve the problem, if data problems occur in cells other than the first column, a more advanced system than the crawling approach is needed; performance could be improved further by learning pattern sequences in characters.

===References===
# Bielefeldt, A., Gonsior, J., Krötzsch, M.: Practical Linked Data Access via SPARQL: The Case of Wikidata. In: LDOW@WWW (2018).
# Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: Tabular Data to Knowledge Graph Matching Challenge. In: ESWC (2020). Challenge website: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
# Hernández, D., Hogan, A., Riveros, C., Rojas, C., Zerega, E.: Querying Wikidata: Comparing SPARQL, Relational and Graph Databases. In: International Semantic Web Conference, pp. 88–103. Springer (2016).