<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Entity Linking to Knowledge Graphs to Infer Column Types and Properties</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Avijit Thawani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minda Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erdong Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Husain Zafar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naren Teja Divvala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amandeep Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ehsan Qasemi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Szekely</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Pujara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes our broad goal of linking tabular data to semantic knowledge graphs, as well as our specific attempts at solving the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. Our efforts were split into a Candidate Generation and a Candidate Selection phase. The former involves searching for relevant entities in knowledge bases, while the latter involves picking the top candidate using various techniques such as heuristics (the `TF-IDF' approach) and machine learning (the Neural Network Ranking model). We achieve an F1 score of 0.826 without any training data on the 400,000+ cells to be annotated in the Round 2 CEA challenge. On the CTA and CPA variants, we score 1.099 and 0.790 respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Table Understanding</kwd>
        <kwd>Knowledge Graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The objective of our work is to investigate approaches to address the three
tasks in the ISWC Tabular Data to Knowledge Graph (KG) Matching challenge:
entity linking (CEA) to map table cells to entities in a knowledge graph,
semantic labeling (CTA) to map table columns to an ontology class, and
semantic modeling (CPA) to map column pairs to an ontology property. While
the challenge focuses on DBpedia, we seek general approaches that are effective with
other knowledge graphs, such as Wikidata.</p>
      <p>Figure 1 shows an overview of our approach. For each cell that must be
annotated, a candidate generation module generates a ranked list of candidate
entities in the target KG. In the next step, a feature generation module
generates a vector of features for each candidate, and in the final step, a candidate
selection module scores the candidates using the feature vectors. The figure
illustrates the most important challenges: inability to generate any candidates
(row 4); generated candidates do not include the correct answer (row 5); the
correct answer is not an instance of the semantic label for a column (row 2);
multiple candidates have the correct semantic label (row 3); all candidates with
the correct semantic label are incorrect (row 5).</p>
      <p>[Fig. 1. Overview of the approach: an input table is linked to Wikidata entities, feature vectors are generated for the candidates, and a candidate is selected for each cell.]</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Our approaches for CTA and CPA use the components shown in the
architecture. For CTA we compute the semantic label using the scored candidates,
and for CPA we use the candidates generated in the candidate generation step.
In the following sections we describe multiple approaches for each module and
report on initial experiments to assess effectiveness.</p>
    </sec>
    <sec id="sec-2">
      <title>Techniques</title>
      <sec id="sec-2-1">
        <title>Candidate Generation</title>
        <p>We investigated multiple approaches for candidate generation. We use the
Wikidata API to obtain up to 100 candidates for each cell from its contents and
headers. While the results of the Wikidata API are of high quality (a submission with the
top candidate of each cell achieves a 0.7 F1-score), the maximum achievable
recall, which is the ratio of cells having their ground truth among their candidates, is still
limited. To address this recall limitation of the API, we build a Wikidata
Elasticsearch index containing fields for multilingual labels, aliases and
descriptions, and combine the results of multiple queries focusing on different fields. The
combined Wikidata candidate generation approach achieves a maximum
achievable recall of 85%. In addition, we build a second Elasticsearch index using
DBpedia labels, mapping all DBpedia URIs to their corresponding Wikidata
entity identifier (qnode), achieving a maximum achievable recall of 91%. We
use normalized reciprocal rank as the score to combine and sort candidates from
all queries. Finally, we build a Name Abbreviations Index that records
abbreviations of person names (initials plus surname) for all instances of Human
(Q5) in Wikidata, as this format is common in the round 2 dataset.</p>
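The combination of candidates from multiple queries by normalized reciprocal rank can be sketched as follows (a minimal illustration; the function name and the example qnodes are ours, and the ranked lists would come from the Wikidata and DBpedia queries described above):

```python
def fuse_candidates(ranked_lists):
    """Combine ranked candidate lists from multiple queries, scoring each
    candidate by its reciprocal rank (1/rank) summed over all queries and
    normalized by the number of queries."""
    scores = {}
    for ranked in ranked_lists:
        for rank, qnode in enumerate(ranked, start=1):
            scores[qnode] = scores.get(qnode, 0.0) + 1.0 / rank
    n = len(ranked_lists)
    return sorted(((q, s / n) for q, s in scores.items()),
                  key=lambda qs: qs[1], reverse=True)

# Hypothetical results of a label query and an alias query for one cell:
fused = fuse_candidates([["Q8691", "Q1134706"], ["Q8691"]])
```

A candidate returned near the top of several field-specific queries thus outranks one that appears in only a single query.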
      </sec>
      <sec id="sec-2-2">
        <title>Feature Generation</title>
        <p>In this work we investigated a feature engineering approach. Lexical features
capture the lexical similarity between the contents of a cell and entity labels,
and semantic features capture semantic coherence among the cells in a column.</p>
        <p>Lexical features We investigated two lexical features. The first one is the
normalized score from all candidate generation modules, which captures a token-based
similarity between the tokens in a cell and the tokens in the labels, aliases and
descriptions in the KGs. For performance reasons, we limited candidate
generation to this token-based similarity and designed a second feature to measure
similarity at the character level. We generate features using several string distance metrics,
such as inverse Levenshtein distance and word similarity (the number of words of the label
text found in the DBpedia URI).</p>
        <p>Semantic features A simple approach to capturing semantic coherence is to assume
that all entities in a column belong to the same ontology class. The most specific
class of all or most cells in a column, as computed in the CTA task, is informative,
but not specific enough. For example, in Wikidata, all humans are instances of
the class Human (Q5), and the set of athletes is identified using the occupation
property. We investigated a generalization of this idea that uses all properties used
to describe entities, in addition to the instance of property that identifies the
classes of an entity.</p>
        <p>For example, consider a column containing names of airports. Candidate
entities that are instances of dbo:Airport are likely to be correct. The correct
set can also be identified by candidates that have values for the property
dbo:runwayDesignation.</p>
        <p>We use the candidates for all cells in a column to compile a set of all the
classes and properties used to describe them, and for each candidate, we define
a uniform-length binary sparse feature vector that records the classes and properties
used to describe it. If the features are dbo:Airport and dbo:runwayDesignation,
then the candidate entity dbpedia:Heathrow has the feature vector [1, 1]. In
practice, these vectors may contain thousands of entries.</p>
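This encoding can be sketched as follows (the candidate descriptions are illustrative stand-ins, not values queried from DBpedia):

```python
def build_feature_vectors(candidates):
    """Build uniform-length binary vectors over the union of all classes
    and properties observed across a column's candidates."""
    universe = sorted({f for desc in candidates.values() for f in desc})
    index = {f: i for i, f in enumerate(universe)}
    vectors = {}
    for cand, desc in candidates.items():
        v = [0] * len(universe)  # one slot per class/property
        for f in desc:
            v[index[f]] = 1
        vectors[cand] = v
    return universe, vectors

universe, vectors = build_feature_vectors({
    "dbpedia:Heathrow": {"dbo:Airport", "dbo:runwayDesignation"},
    "dbpedia:Madonna": {"dbo:Person"},
})
```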
        <p>These features are not equally informative. We seek to maximize coverage
and selectivity. Intuitively, a feature has good coverage if for every cell there is a
candidate for which this feature has value 1. In our example, both Organization
and Airport have good coverage, but Person would not as many airports are
not named after people. A feature is selective if fewer candidates possess that
feature. For our example column, Airport is more selective than Organization
as fewer candidates have a 1 for the Airport feature.</p>
        <p>We implement this intuition using an adaptation of TF-IDF. In our context,
we observe that the first result from candidate generation is more often a correct
candidate. Thus, a good measure of `Term Frequency' of a given semantic feature
(e.g. dbo:Person) is the number of cells for which the first candidate has this
feature. In Figure 2, dbo:Person occurs in the first candidate of all 5 cells, hence
its TF = 5. On the other hand, dbp:Album occurs just once in a first candidate
(as a feature of dbr:Madonna), hence its TF = 1.</p>
        <p>Taking the analogy forward, we define the Document Frequency of a semantic
feature as the total number of occurrences of the feature (in all candidates). In
Figure 2, dbo:Person occurs in 8 out of 9 candidates, hence its IDF = log(9/8) =
0.05. Once we compute all feature weights accordingly, we can find the `Score'
for each candidate, which is the dot product of the weight vector (TF-IDF) and
the binary feature vector (v1 or v2). As seen in Figure 2, these weights help us
select Saint Madonna as the correct entity rather than Madonna, the singer.</p>
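The weighting can be sketched as follows (a minimal reading of the scheme above: TF counts cells whose first candidate carries the feature, IDF is the log of total candidates over the feature's occurrences, and a candidate's score is the dot product of the weights with its binary feature vector; the toy column is ours):

```python
from math import log10

def tfidf_weights(cell_candidates):
    """cell_candidates: one list per cell of ranked candidates, each
    candidate represented by its set of semantic features."""
    all_cands = [fs for cell in cell_candidates for fs in cell]
    n_total = len(all_cands)
    weights = {}
    for f in {f for fs in all_cands for f in fs}:
        # TF: cells whose *first* candidate has the feature.
        tf = sum(1 for cell in cell_candidates if f in cell[0])
        # DF: total occurrences of the feature across all candidates.
        df = sum(1 for fs in all_cands if f in fs)
        weights[f] = tf * log10(n_total / df)
    return weights

def score(weights, candidate_features):
    # Dot product of the TF-IDF weights with a binary feature vector.
    return sum(weights[f] for f in candidate_features)

# Two cells, two ranked candidates each (feature sets are illustrative):
cells = [[{"dbo:Person"}, {"dbp:Album"}],
         [{"dbo:Person"}, {"dbo:Person"}]]
w = tfidf_weights(cells)
```

Features that never appear in a first candidate get zero weight, so candidates carrying the column-dominant features score highest.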
      </sec>
      <sec id="sec-2-3">
        <title>Candidate Selection</title>
        <p>We investigated multiple approaches for candidate selection.</p>
        <p>Top-1 candidate The simplest approach is to select the top-scoring candidate
from the candidate generation module, ignoring all other features. Surprisingly,
this approach is a very strong baseline: in round 1 it received an F1 score of
0.809 and a precision of 0.843, and in round 2 an F1 score of 0.701
and a precision of 0.768.</p>
        <p>Heuristic linear combination We combine the lexical and semantic features
using the following simple formula:</p>
        <p>Sc = tfidf + 0.5 · levenshtein similarity + 0.5 · candidate generation score</p>
        <p>This approach in round 2 received a 0.826 F1-score and 0.852 precision. Note
that this submission was the result of concatenating two different submissions,
both using the same linear combination as above but differing in that one of
them had candidates generated using a special Elasticsearch query for
handling abbreviations (see Section 2.6 and Table 1 for details).</p>
        <p>Neural Network Ranking Model In the heuristic algorithm above, the weights among
the three features are fixed and not adaptive to the dataset. To solve this
problem, we propose a model that learns the weights and relationships from labeled
data.</p>
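A sketch of the fixed-weight heuristic (the feature values are illustrative; reading the third term of the formula as the normalized candidate generation score is our assumption):

```python
def heuristic_score(tfidf, levenshtein_similarity, retrieval_score):
    # Fixed weights: TF-IDF counts fully, the two lexical scores at 0.5.
    # retrieval_score stands in for the normalized candidate generation
    # score (assumption; the term is truncated in the source formula).
    return tfidf + 0.5 * levenshtein_similarity + 0.5 * retrieval_score

# Candidate -> (tfidf, levenshtein similarity, candidate generation score)
candidates = {"Q8691": (0.9, 0.8, 1.0), "Q1134706": (0.2, 0.8, 0.5)}
best = max(candidates, key=lambda q: heuristic_score(*candidates[q]))
```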
        <p>Pairwise Contrastive Loss: Given the set S = {(x_i+, x_i−) | x_i+ ∈ truth_c,
x_i− ∈ candidate_c \ truth_c, c ∈ Γ}, where (x_i+, x_i−) is a pair of the
feature vectors described in Heuristic linear combination, taken respectively from the ground
truth truth_c and from the wrong answers in candidate_c, and c is a cell from the
training set Γ. Given a ranking model F : x ∈ R^{N_f} → y ∈ R, where N_f is the feature
length, the main idea of model training is to push the one-dimensional model
output y_i+ of the ground-truth feature vector x_i+ far away from the outputs y_i− of the wrong
candidates x_i−, and to make y_i+ greater than y_i−. For this goal we propose and
optimize a variant of the contrastive loss [1] as follows.</p>
        <p>L_contrastive = Σ_{(x_i+, x_i−) ∈ S} max{0, m + F(x_i−) − F(x_i+)} + ‖w‖_1</p>
        <p>where m &gt; 0 is the margin and ‖w‖_1 is the L1 norm of the model weights.</p>
        <p>[Fig. 3. Search over the DBpedia class tree for CTA with threshold T = 50.8%. From dbo:Thing, the search follows the most common class on each level (dbo:Place, 97.2% → dbo:PopulatedPlace, 98.3% → dbo:Settlement, 68.0%) and stops when no child class (e.g., dbo:City at 25.1% or dbo:Village at 33.3%) exceeds T.]</p>
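The loss can be sketched numerically as follows (model outputs F(x) are supplied directly as numbers here, and the L1 term is passed as the weight list; a minimal sketch, not the training code):

```python
def contrastive_loss(pos_scores, neg_scores, weights, m=1.0):
    """Sum of max(0, m + F(x-) - F(x+)) over (positive, negative) score
    pairs, plus the L1 norm of the model weights."""
    hinge = sum(max(0.0, m + fn - fp)
                for fp, fn in zip(pos_scores, neg_scores))
    l1 = sum(abs(w) for w in weights)
    return hinge + l1
```

When the ground-truth score already beats the wrong candidate by more than the margin m, the pair contributes nothing; otherwise the gap is penalized.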
        <p>Model Structure: In this challenge we use a 2-layer neural network activated
by ReLU [2]. After training on the T2Dv2 dataset [3], a score is obtained from the features
of each candidate with the ranking model, and the candidate with the highest score is
chosen as the answer for each cell. This approach received a high 0.871 F1-score
on the T2Dv2 dataset even though 8% of cells lacked their ground truth among the
candidates, while in round 2 it received a lower F1 score of 0.808, hinting at distribution differences between
the two datasets.</p>
        <p>CTA
Our approach for the CTA challenge uses the results from the CEA challenge, as
illustrated in Fig. 3:</p>
        <p>(1) For each column C, our algorithm starts from dbo:Thing on level 0 of the
semantic class tree and calculates the percentage of cells in C that belong to
each class on a given level of the DBpedia class tree. In Fig. 3, on level 2 the
percentages of dbo:Diploma, dbo:Agent and dbo:Place are 0%, 2.8% and 97.2%.</p>
        <p>(2) The algorithm then picks the most common class, dbo:Place, which has the highest
percentage of 97.2%. Since 97.2% is greater than the threshold percentage T =
50.8%, dbo:Place is regarded as one of the classes that C belongs to. The algorithm
records dbo:Place in its search path and goes back to step (1), selecting the most
common classes from the children of dbo:Place.</p>
        <p>(3) After selecting dbo:PopulatedPlace and dbo:Settlement respectively
on levels 2 and 3, no class percentage is greater than T on level 4, so the
search stops and the algorithm outputs the path below the root: dbo:Place →
dbo:PopulatedPlace → dbo:Settlement.</p>
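The level-by-level search of steps (1)-(3) can be sketched as follows (the dict-based class tree and the percentages mirror the Fig. 3 example; in practice both would come from DBpedia and the CEA candidates):

```python
def cta_search(class_tree, percentages, threshold=0.508):
    """Walk the class tree from the root, at each level following the
    most common class if its cell percentage exceeds the threshold."""
    path, node = [], "dbo:Thing"
    while True:
        children = class_tree.get(node, [])
        if not children:
            break
        best = max(children, key=lambda c: percentages.get(c, 0.0))
        if percentages.get(best, 0.0) <= threshold:
            break  # no child clears the threshold: stop the search
        path.append(best)
        node = best
    return path

tree = {"dbo:Thing": ["dbo:Diploma", "dbo:Agent", "dbo:Place"],
        "dbo:Place": ["dbo:PopulatedPlace"],
        "dbo:PopulatedPlace": ["dbo:Settlement", "dbo:Region"],
        "dbo:Settlement": ["dbo:City", "dbo:Village"]}
pct = {"dbo:Place": 0.972, "dbo:Agent": 0.028, "dbo:PopulatedPlace": 0.983,
       "dbo:Settlement": 0.680, "dbo:Region": 0.114,
       "dbo:City": 0.251, "dbo:Village": 0.333}
path = cta_search(tree, pct)
```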
        <p>CPA In the CPA task, the goal is to find the DBpedia ontology property that best
matches the relation between a primary column and the other, secondary columns.
As input, we have the ranked candidates from CEA and a sample size N, the
number of top candidates we consider, which limits the size of our
query search space. For each row, we query DBpedia for the properties between
all pairs of candidates in the primary and secondary columns. For example, in
Fig. 4 the candidates for "M. Thatcher" are paired with the one candidate
for "Lincolnshire", and we discover dbo:birthplace among these properties.
We output the most frequent property among all the rows, in this case
dbo:birthplace. For secondary columns consisting of literals, we transform the
value to address potential differences between the cell value and the values in
the KG. Currently we address string, date, and numerical literals.</p>
        <p>In our work, our primary goal was to develop effective algorithms for DBpedia
and Wikidata as target KGs. To work with the challenge files we cleaned the
input files to address issues such as files with and without headers, empty lines
before the first row, and a variety of character encoding problems.</p>
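The row-wise property voting can be sketched as follows (query_properties stands in for a SPARQL lookup of properties linking two entities; the names below are illustrative, not actual challenge data):

```python
from collections import Counter

def most_frequent_property(rows, query_properties, n=10):
    """rows: (primary_candidates, secondary_candidates) per table row.
    Queries properties between all top-N candidate pairs and returns the
    property observed most often across rows."""
    votes = Counter()
    for primary, secondary in rows:
        for p in primary[:n]:
            for s in secondary[:n]:
                votes.update(query_properties(p, s))
    return votes.most_common(1)[0][0] if votes else None

def fake_query(p, s):
    # Illustrative stand-in for a DBpedia property lookup.
    links = {("dbr:Margaret_Thatcher", "dbr:Lincolnshire"): ["dbo:birthPlace"]}
    return links.get((p, s), [])

rows = [(["dbr:Margaret_Thatcher", "dbr:M._Thatcher"], ["dbr:Lincolnshire"])]
prop = most_frequent_property(rows, fake_query)
```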
        <p>Our primary adaptation to the challenge is a set of string similarity metrics that
measure the distance between cell values and DBpedia URIs. To handle abbreviations we
utilized the abbreviations Elasticsearch index to get candidates for labels that
match the regex ^.\.\s.+. In the future we plan to use metrics that compare
the cell values to the labels of candidates.</p>
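The abbreviation pattern from the text can be checked directly (labels like "M. Thatcher" match, full first names do not; the routing itself is done in Elasticsearch):

```python
import re

# Route labels of the form "<initial>. <rest>" to the abbreviations index.
ABBREV = re.compile(r"^.\.\s.+")

is_abbrev = bool(ABBREV.match("M. Thatcher"))
```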
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        CEA In the upper section of Table 1, i.e. (1) Ablations with Abbreviations, we
show the gains achieved using a specialized candidate generation module (refer
to Section 2.6) to handle abbreviations. In the lower section, i.e. (2) Experiments
on number of rows, we report results from submitting predicted cell annotations
for subsets of tables (a small part of our best achieved results) with restrictions
on the number of rows, as described. Our heuristic loses precision when run on
smaller tables (fewer than 10 cells/rows in each column) compared to larger
tables (more than 100 cells/rows). Moreover, even for very large tables, our heuristic
only reaches a precision of 0.877.
      </p>
      <p>CTA In the CTA challenge we first did a grid search, submitting results with
different values of T to find the best performing one. According to the results, the CTA algorithm
performs best when T is set to 0.508. However, in Experiment 1 the algorithm
discards 1389 columns mistakenly annotated as dbo:Thing or dbo:Agent. In
Experiment 2 we tune a separate threshold T′ for these columns, increasing the score by 0.008.
The results are shown in Table 3.</p>
      <p>We studied the sensitivity of our algorithm to the number of cells in a column.
We partitioned the collection of columns into multiple datasets and computed
the improvement in round 2 primary scores between adjacent bands, as shown
in Table 2. The results show a significant performance increase for tables
containing more than 10 rows, and better performance with a larger number of cells.</p>
      <p>CPA In CPA, our initial experiment directly used the CEA candidates with
sample size N = 10. The initial experiment performed poorly on primary
columns consisting of human names which could not be found directly by our
candidate generation, so we augmented the candidate list with a human names index
in the second experiment. We found that certain properties such as dbo:area
had class-specific synonyms such as dbo:PopulatedArea/area, so in our third
experiment we mapped these synonymous properties to their generic analog.
These results are shown in Table 4.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We proposed an architecture with three modules: candidate generation, feature
generation and candidate selection, and implemented KG-agnostic approaches
for each module, as we are interested in using Wikidata as a target KG. We
developed an interesting feature generation module that combines classes and
properties in a uniform framework to characterize the semantic coherence of the
values in a column, and explored both heuristic and machine learning approaches
for candidate selection.</p>
      <p>Our round 2 results are limited by insufficient recall in our candidate
generation modules, so in round 3 we plan to focus on candidate generation. In round
2 our best results were obtained using a heuristic candidate selection method, so
in round 3 we plan to continue our work on the machine learning approaches. We
also plan to study the sensitivity of our approaches to characteristics of the datasets,
including the number of cells in a column, the distribution of cell value lengths,
and effectiveness for different data types (people, organizations, places, etc.). We
propose the following modifications to the challenge:
1. Annotation of all cells in a column, with a special no-annotation marker to
identify cells not present in the target KG.
2. Submission of a single class for CTA, changing the scoring function to give
partial credit for super-classes or over-specific classes (e.g., 1/2^n where n is
the number of super-class or sub-class links to the correct answer).
3. Release of a subset of the ground truth for algorithm development.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , LeCun, Y.:
          <article-title>Dimensionality reduction by learning an invariant mapping</article-title>
          .
          <source>In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)</source>
          . vol.
          <volume>2</volume>
          , pp.
          <volume>1735</volume>
          –
          <fpage>1742</fpage>
          .
          IEEE
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          . pp.
          <volume>807</volume>
          –
          <issue>814</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ritze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmberg</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>T2Dv2 gold standard for matching web tables to DBpedia</article-title>
          . http://webdatacommons.org/webtables/goldstandardV2.html (
          <year>2015</year>
          ), accessed
          <fpage>28</fpage>
          -September-2019
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>