Knowledge Base Augmentation using Tabular Data*

Yoones A. Sekhavat (1), Francesco di Paolo (2), Denilson Barbosa (1), Paolo Merialdo (2)
(1) University of Alberta, {yoones.a.s,denilson}@ualberta.ca
(2) Roma Tre University, {fradipi,merialdo}@dia.uniroma3.it

* All data used to build and test the models in our work can be found at http://cs.ualberta.ca/~denilson/data/ldow14_ualberta_data.zip

Copyright is held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.

ABSTRACT
Large linked data repositories have been built by leveraging semi-structured data in Wikipedia (e.g., DBpedia) and through extracting information from natural language text (e.g., YAGO). However, the Web contains many other vast sources of linked data, such as structured HTML tables and spreadsheets. Often, the semantics in such tables is hidden, preventing one from extracting triples from them directly. This paper describes a probabilistic method that augments an existing knowledge base with facts from tabular data by leveraging a Web text corpus and natural language patterns associated with relations in the knowledge base. A preliminary evaluation shows high potential for this technique in augmenting linked data repositories.

Figure 1: Example tabular data.
  Ronaldinho       Brazil     Barcelona FC
  Fabio Cannavaro  Italy      Juventus
  Kaka             Brazil     AC Milan
  Lionel Messi     Argentina  Barcelona FC

1. INTRODUCTION
The Web is bursting with valuable tabular data that could be leveraged in many applications, such as knowledge base population and query answering. For instance, Cafarella et al. report 14.1 billion HTML tables in English documents in Google's main index [4], and over 400,000 Excel spreadsheets in the ClueWeb09 crawl (ClueWeb, 2009, http://lemurproject.org/clueweb09.php) [5]. Tapping into these semi-structured information sources is difficult, however, and often they end up being treated no differently than plain text documents. What is worse, state-of-the-art text-based information extraction systems fail on such tables, as they require full sentences to yield good results.

Tables have inherent semantics which are often implicit or only given in the pages that contain them. Consider the example in Figure 1, which is a snapshot of a table in Wikipedia. It may not be obvious from the table that it is about winners of FIFA's World Player of the Year award. Nevertheless, the fact that someone put those literals together in the same rows indicates that there are relationships between them. In this case, we know that Ronaldinho was born in Brazil and played for Barcelona. Moreover, the same relationships apply to all rows: Cannavaro was born in Italy and played for Juventus. In fact, we can say that the relations was born in and played for are defined over the columns (1st and 2nd, and 1st and 3rd, respectively).

A general approach for understanding such tables with the help of the linked open data web [2] would be to (1) link the values in each cell to known entities in a linked data knowledge base (e.g., YAGO [14] and Freebase [3]), and (2) identify relations between the linked values. The best-case scenario for this approach would be when all entities are linked to the same knowledge base and a relation already exists between them (this is the approach in [8]). However, there are other situations. For example, it could be that the entities exist in different, unlinked, knowledge bases, or that some of the entities are not linked to anything yet. Tackling such cases would provide new triples, thus augmenting the linked open data web with new facts.

This paper focuses on the problem of identifying plausible relations between pairs of entities that appear in the same row of a table. Our main assumption is that if someone went to the trouble of juxtaposing these entities in the same file, then there must be a relation between them. Moreover, we seek to augment an existing repository with new instances of relations already defined. In other words, the list of possible relations is part of the input. We also assume the entities can be resolved and linked to a linked open data repository or knowledge base; however, unlike previous works (e.g., [8]), our method does not require that.

Overview. As an example, assume our input table has only columns 1 and 3 in Figure 1, and assume we have two candidate relations: plays-for and lives-in. We start from a list of textual patterns that are associated with each relation; such patterns are detected by automatic methods for building knowledge bases from natural language. For example, patterns for the relation plays-for could be "scored for" and "signed contract with". In practice, thousands of patterns are associated with a single relation; conversely, the same pattern may be associated with multiple relations. For example, the pattern "played in" may represent the relations plays-for and performed-at (e.g., "Pink Floyd played in Pompeii"). Thus, we build a probabilistic model to estimate the posterior probability of a relation, given a set of observed text patterns. To make a prediction for a given pair of entities, we: (1) collect all sentences containing both entities from a large text corpus; (2) extract the text in between them; (3) match those texts against the list of patterns; and (4) estimate the posterior probability of all candidate relations.

Figure 2: Rank Aggregation approach vs. Global Ranking.

Another challenge, which has been the subject of substantial work recently, is entity linking. For example, "Barcelona" may represent the football club or the city. Evidently, the choice of entity linking approach will have an impact on the quality of the triples produced by a method like ours. To avoid this factor in our study of the relation assignment problem, we use exact matching to "link" the entities. In effect, this is akin to assuming the entities are already linked.
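The four prediction steps of the overview can be sketched as follows. This is a toy illustration rather than the authors' code: the tiny in-memory corpus, the pattern set, and all names are hypothetical, and a real implementation would scan a large Web corpus.

```python
import re

def observed_patterns(e1, e2, sentences, patterns):
    """Steps (1)-(3): collect sentences mentioning both entities,
    extract the text between the two mentions, and match that text
    against the known textual patterns of the candidate relations."""
    observed = set()
    for sent in sentences:
        m = re.search(re.escape(e1) + r"\s+(.*?)\s+" + re.escape(e2), sent)
        if m is None:
            continue
        between = m.group(1).lower()
        # a pattern counts as "observed" if it occurs in the between-text
        observed.update(p for p in patterns if p in between)
    return observed

corpus = ["Ronaldinho signed contract with Barcelona in 2003.",
          "Last night Ronaldinho scored for Barcelona again."]
patterns = {"scored for", "signed contract with", "lives in"}
evidence = observed_patterns("Ronaldinho", "Barcelona", corpus, patterns)
# "evidence" then feeds step (4), the posterior estimate of Section 2
```

The observed set is the only thing the probabilistic model consumes, so any stronger matcher (tokenized spans, an entity linker) can replace the regular expression without changing the rest of the pipeline.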
Paper Outline. Section 2 illustrates our model to extract triples from tabular data. Section 3 discusses our first implementation of the model, and Section 4 reports the results of experiments on this first implementation. Section 5 presents related work, and Section 6 concludes the paper and presents ideas to improve our model.

2. TRIPLE EXTRACTION
In general terms, the problem we address is as follows: given a table T, with n rows and k columns, whose cells contain mentions of entities, and a set R = {r1, ..., rm} of relations of interest, we aim to produce all triples <ti[x], rj, ti[y]>, for all ti in T and rj in R, where ti[x] denotes the value of column x in row i.

Furthermore, we seek to assign relation rj to pairs of columns in T, in the sense that the predicate rj holds for all rows of T. The problem boils down to the case of an input table with just two columns x and y, and, without loss of generality, we address this case here.

Understanding semantic relations between entities using text corpora is a challenging task because a relation can be expressed with different textual (surface) forms. For example, the relation plays-for can be inferred from sentences with the patterns "scored for" and "signed contract with". On the other hand, a pattern may express more than one relation.

2.1 Relation between a Pair of Entities
We start with determining whether a relation r applies to two entities. In the absence of further information, a reasonable approach is to search the (textual) Web for all sentences connecting the entities and determine whether r follows from those sentences. While textual entailment and other reasoning methods could be used for this purpose, we turn this into a probabilistic inference as follows. We start from a known set of textual patterns p1, ..., pk that are commonly associated with relation r (we use the patterns of the PATTY system [12]); this gives us prior probabilities for each pattern expressing the relation. To determine if entities e1 and e2 belong in relation r, we search the corpus for all sentences containing both e1 and e2, extract all words in between them, and match those words against the list of patterns. We can then use Bayesian inference to compute the posterior probability of r given the observed patterns.

In this model, relation r is a categorical class variable whose domain is R, and the patterns p1, ..., pk are binary evidence variables, representing patterns observed between entities in the text corpus. The model is thus:

  Pr(r | p1, ..., pk) = Pr(r) Pr(p1, ..., pk | r) / Pr(p1, ..., pk)    (1)

Relations maximizing Pr(r | p1, ..., pk) are the most probable relations for the entities (given the corpus used). A last component of the model is a threshold to filter out low-confidence predictions.

We assume the evidence variables {p1, ..., pk} are conditionally independent, which is reasonable since the probability that pattern pi represents a relation does not depend on the probability that another pattern pj also represents that relation. Given this assumption and using the chain rule, we have:

  Pr(r | p1, ..., pk) = Pr(r) * PROD_{i=1..k} Pr(pi | r) / Pr(p1, ..., pk)    (2)
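As a sketch of Eq. (2), the posterior can be computed in log space to avoid underflow over many patterns. The priors below are made-up numbers, not values estimated from YAGO/PATTY (Section 3 describes how the real priors are obtained), and the function names are our own.

```python
from math import exp, log

def posterior(observed, prior_r, prior_p_given_r, unseen=1e-9):
    """Eq. (2): Pr(r | p1..pk) is proportional to Pr(r) * prod_i Pr(p_i | r),
    computed in log space and normalized so scores sum to 1 over relations."""
    log_scores = {}
    for r, pr in prior_r.items():
        s = log(pr)
        for p in observed:
            s += log(prior_p_given_r[r].get(p, unseen))  # unseen pattern: tiny prob
        log_scores[r] = s
    z = sum(exp(s) for s in log_scores.values())  # the Pr(p1..pk) normalizer
    return {r: exp(s) / z for r, s in log_scores.items()}

# hypothetical priors for the two candidate relations of the overview
prior_r = {"plays-for": 0.5, "lives-in": 0.5}
prior_p = {"plays-for": {"scored for": 0.4, "signed contract with": 0.2},
           "lives-in": {"moved to": 0.6}}
post = posterior({"scored for"}, prior_r, prior_p)
# "plays-for" ends up far more probable than "lives-in"
```

The low-confidence threshold mentioned above would simply be applied to the maximum of the returned dictionary.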
2.2 Relation between Two Lists of Entities
Consider now the case of a table with n rows (i.e., a list of n entity pairs). There are two main approaches, illustrated in Figure 2: we can find one relation assignment for each row and compute an aggregate from all these assignments, or we can observe all text patterns for all rows at once and obtain a global assignment. We discuss each next.

Rank Aggregation. In this approach, we first obtain, for each pair of entities in the table, a ranking of all relations in R sorted in decreasing likelihood according to the model above. Next, we combine these ranked lists to find a single best relation for all entity pairs.

Rank aggregation [7] is a well-studied problem: given n ranked lists l1, ..., ln and a distance measure d, the problem is to find a list l such that SUM_{i=1..n} d(l, li) is minimized. Among the different distance measures used to aggregate ranked lists, we use Spearman's Footrule (SP) and Kendall's tau (KE), which were shown to outperform other approaches [7]. The Kendall's tau distance between two permutations of a list is the number of pairs from the list which are not in the same order in the two permutations; this distance is also known as the number of exchanges needed by a bubble sort to convert one permutation into the other. The Spearman's Footrule distance is the sum of the distances between the positions of each item in the two permutations. We also employ a simple average-score method to aggregate ranked lists, called Mean Ranking (MR), which works as follows: given the sorted (descending) list of probabilities Pr(r | p1, ..., pk) generated for each pair <xi, yi>, and m possible relations in the knowledge base, we assign a score sci(r) = m - pos(r), where pos(r) is the position of relation r in this sorted list.

Global Ranking. Another approach is to feed all patterns for all entity pairs in the relation as evidence to the probabilistic model. In this Global Ranking (GR) approach, all observed patterns for all entity pairs simultaneously contribute to the selection of the most probable relation.

3. IMPLEMENTATION
This section describes one instantiation of the probabilistic model developed in the previous section, using off-the-shelf, real-world knowledge bases and text corpora.

As mentioned before, we link entities mentioned in the tables to those in the text corpus with exact string matching. That is, given entities e1 and e2 in a table row, we find all triples in the NELL corpus which have e1 as the subject and e2 as the object. Similarly, we match the verbs of the resulting triples against the PATTY patterns (to obtain the observations). In effect, our system uses the "intersection" of PATTY patterns and NELL triples, consisting of 4,357 unique patterns and 108,699,400 triples. The YAGO relation has-academic-advisor was discarded as none of its patterns are found in the NELL corpus. Table 1 shows the relations used and the number of patterns for each of them. Note the wide variation in the number of patterns per relation.

Table 1: Number of PATTY patterns per relation, and the rank of the correct relation obtained by each strategy. A "-" indicates the relation was incorrectly ranked 4th or lower by the method.

  Relation                Patterns  SP  KE  MR  GR
  ismarriedto                 1274   1   1   1   1
  created                     1148   1   1   1   1
  haschild                    1090   -   -   -   3
  influences                   694   1   1   2   -
  actedin                      624   2   2   1   1
  graduatedfrom                472   1   1   1   1
  isknownfor                   452   -   -   -   -
  worksat                      447   -   -   1   1
  holdspoliticalposition       417   -   3   1   1
  directed                     400   2   -   2   2
  playsfor                     354   1   1   1   1
  diedin                       335   3   -   1   2
  wasbornin                    273   -   -   1   1
  islocatedin                  249   -   3   -   -
  livesin                      200   -   -   -   -
  isleaderof                   156   -   -   -   -
  iscitizenof                  121   1   1   -   -
  haswonprize                   81   -   -   3   1
  dealswith                     59   -   -   -   1
  ispoliticianof                49   1   -   -   1
  participatedin                33   1   1   2   2
  happenedin                    12   2   1   -   1
  hascapital                     1   2   1   -   -

Estimating Prior Probabilities. Recall Equation 2. We estimate the prior probability of each of the 24 relations as Pr(r) = |r| / SUM_{ri in R} |ri|, where |r| is the number of instances of relation r and R is the set of all relations in YAGO.
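These estimates, together with the smoothed pattern priors Pr(p | r) described next, can be sketched as follows; the count dictionaries are hypothetical stand-ins for the YAGO instance counts and PATTY pattern counts.

```python
def estimate_priors(instances_per_relation, pattern_counts):
    """Pr(r) = |r| / sum over ri of |ri|, and
    Pr(p | r) = (|p| + 1) / (sum over pi in PT(r) of |pi| + |PT(r)|):
    the relative frequency of pattern p among the patterns PT(r)
    associated with r, with add-one (Laplace) smoothing so that no
    pattern gets a zero probability."""
    total = sum(instances_per_relation.values())
    prior_r = {r: n / total for r, n in instances_per_relation.items()}
    prior_p = {}
    for r, counts in pattern_counts.items():
        denom = sum(counts.values()) + len(counts)  # add one per pattern
        prior_p[r] = {p: (n + 1) / denom for p, n in counts.items()}
    return prior_r, prior_p

# hypothetical counts, two relations and three patterns
prior_r, prior_p = estimate_priors(
    {"playsfor": 8000, "livesin": 2000},
    {"playsfor": {"scored for": 30, "signed contract with": 10},
     "livesin": {"moved to": 5}})
```

The smoothed conditional probabilities still sum to one over the patterns of each relation, which keeps Eq. (2) a proper (unnormalized) product of probabilities.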
As for Pr(p | r), the prior probability that pattern p occurs among instances of relation r, we use the associations between YAGO relations and textual patterns in PATTY: Pr(p | r) = |p| / SUM_{pi in PT(r)} |pi|, where PT(r) is the set of patterns associated with r. To avoid zero probabilities, we use the add-one Laplace smoothing technique.

Data. We use YAGO [14] and PATTY [12] to obtain relations and patterns. YAGO is a popular knowledge base with about 10 million entities and 120 million facts; PATTY, developed by the same research group, associates 22,779 patterns mined from New York Times articles and Wikipedia with 25 relations. Of those, 24 exist in YAGO and were used here. Our text corpus is the NELL Subject-Verb-Object triple corpus [15], with about 604 million triples extracted via dependency parsing on the ClueWeb09 dataset (ClueWeb09 is a Web crawl with about 1 billion web pages in ten different languages).

4. EXPERIMENTAL EVALUATION
Since we do not query YAGO to make predictions, we use some of its facts to build a ground truth to test the accuracy of our model. We extracted facts from YAGO relations where both entities can be matched exactly in the NELL corpus.

Table 2: Results of rankings on 23 YAGO relations
                 Rank Aggregation   Global Ranking
                 SP    KE    MR     GR
  Ranked First    8     9     9     12
  Ranked Second   4     1     3      3
  Ranked Third    1     2     1      1
  Ranked > 3     10    11    10      7

Performance. For efficiency reasons, we index patterns using an in-memory suffix tree. The average execution times in milliseconds for processing a pair of entities (taken over 20 executions) are: 1688 for SP; 1868 for KE; 1729.4 for MR; and 1719 for GR. As one can see, there are no considerable differences among the methods. In fact, our observation is that the majority of the time is spent on matching the entities against the NELL corpus.
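The two distance measures and the MR scorer compared above can be sketched as follows. This is a toy sketch with hypothetical inputs: it computes the distances and the MR winner, but omits the separate optimization of searching for the footrule- or Kendall-optimal aggregate list.

```python
from itertools import combinations

def footrule(l1, l2):
    """Spearman's Footrule: sum of |position in l1 - position in l2|."""
    pos = {x: i for i, x in enumerate(l2)}
    return sum(abs(i - pos[x]) for i, x in enumerate(l1))

def kendall_tau(l1, l2):
    """Kendall's tau distance: number of pairs ordered differently in the
    two permutations (the bubble-sort exchange count)."""
    pos = {x: i for i, x in enumerate(l2)}
    return sum(1 for a, b in combinations(l1, 2) if pos[a] > pos[b])

def mean_ranking(ranked_lists):
    """Mean Ranking (MR): each per-pair list awards sc(r) = m - pos(r)
    (0-based position here); the relation with the highest total wins."""
    m = len(ranked_lists[0])
    scores = {}
    for l in ranked_lists:
        for pos, r in enumerate(l):
            scores[r] = scores.get(r, 0) + (m - pos)
    return max(scores, key=scores.get)

rows = [["plays-for", "lives-in"], ["plays-for", "lives-in"],
        ["lives-in", "plays-for"]]   # hypothetical per-row rankings
best = mean_ranking(rows)            # "plays-for" tops 2 of the 3 rows
```

Both distances are metrics over permutations of the same items, which is why they can be plugged interchangeably into the aggregation objective of Section 2.2.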
Table 3: Results of rankings on filtered relations
                 Rank Aggregation   Global Ranking
                 SP    KE    MR     GR
  Ranked First    9     7     9      8
  Ranked Second   1     3     2      2
  Ranked Third    2     1     2      3
  Ranked > 3      2     3     1      1

The number of facts from YAGO that can be found in NELL triples using exact matching varies across relations. The lower bound was 25 facts, for the relation is-known-for. For the other relations, we picked several random samples of 25 pairs and tested each separately. The difference in accuracy was negligible, so we used 25 facts per relation.

Effectiveness. We performed experiments to evaluate the effectiveness of the three rank aggregation techniques (Spearman's Footrule (SP), Kendall's tau (KE), and Mean Ranking (MR)) as well as the Global Ranking (GR) approach. The rank of the correct relation generated by the system is our measure of success in annotating that relation. The best result is achieved when the correct relation appears at the top of the ranked list for the facts of that relation in the ground truth. Table 2 shows the number of relations ranked in the top 3 positions, as well as anywhere below the 3rd place.

As one can see, GR outperforms the rank aggregation techniques in identifying correct relations. Among the rank aggregation techniques, MR works slightly better than SP and KE in this test; however, an analysis of variance shows no statistically significant differences between the rank aggregation results. Another observation is that the number of PATTY patterns has an effect on accuracy, and this effect is more pronounced for the rank aggregation methods. As shown in Table 1, relations associated with fewer patterns are less likely to be identified correctly by rank aggregation techniques. We argue this is due to a lack of sufficient evidence (patterns) for each pair; it follows that rank aggregation techniques require more evidence in order to infer correct relations. On the other hand, GR performs better for relations associated with fewer patterns. This happens because many relevant patterns are more likely to appear in the union of the patterns of all pairs than for any individual entity pair.

We also filtered out relations with 200 or fewer patterns in our corpus, recomputed the prior probabilities, and re-executed the experiments on the same dataset. Table 3 shows the results of this experiment. Although MR performed slightly better than the other techniques, a statistical test reveals that the differences are not significant. What is important to note is that, as expected, filtering out less popular relations leads to higher overall accuracy. Applications using our technique can exploit this trade-off to set the appropriate threshold.

4.1 Towards Knowledge Augmentation
The ultimate goal of our technique is knowledge augmentation: generating new instances of relations, from tabular data, that are not already in the knowledge base. We performed preliminary experiments to assess whether our system could accomplish this goal.

Our first test was with a spreadsheet including song data available at www.aardvarkdjservices.co.uk (a website specialized in music services). We looked at 48 (singer, song) pairs from 2 albums, with 24 songs from Elvis Presley and 24 songs from Frank Sinatra. Every approach returned created as the best relation between those entities. We manually verified the 48 facts in this case, and found that only 31 were already in YAGO. In another experiment, we used a spreadsheet with data about NBA players extracted from www.espn.go.com, and tested our system with 100 (player, team) pairs. YAGO had 92 of these pairs in the is-affiliated-to relation. Every configuration of our system identified all 100 pairs as instances of the plays-for relation, which, one can argue, is a suitable and more specific relation for these entities.

5. RELATED WORK
A lot of work has been done towards understanding tables within text or online. Some have attempted to exploit column headers to identify relations between two columns (e.g., [6]), which is akin to schema-based data integration. We make no assumption about the existence of this information, as it may not be available for all tables; this makes our method more similar to an instance-based data integration approach.

A probabilistic model is proposed in [16] to associate class labels with columns and to identify relations between entity columns and the remaining columns. Recently, joint inference techniques have been used to simultaneously annotate table cells, table columns, and relations between columns: in [8, 10], graphical models are employed to annotate column headers, table cells, and relations between columns. Our work is similar, to some extent, to [11], in which Wikipedia's tables are used to generate new triples using DBpedia as a knowledge base. Unlike our method, these techniques require linking entities to one or more linked open data repositories.

The problem has also been considered in terms of extracting schema for tabular data. In [5], an extraction system is proposed to convert data stored in spreadsheets into relational tuples. In [1], a set of row classes representing common features of individual rows in a table is identified; Conditional Random Field techniques are then used to generate a sequence of row labels.

Our work is also related to relation extraction techniques over text corpora. In supervised learning (e.g., [18]), manually labelled relations are used to train a model for labeling relations. On the other hand, in unsupervised approaches (e.g., [13]), strings between entities in a text corpus are clustered and then simplified to generate relations. In [9], a classifier is trained using textual features of sentences between known entities in Freebase. That technique generates instances of new relations, while our technique generates instances of relations linked to existing relations in linked open data repositories.

6. CONCLUSION
In this paper, we described a probabilistic approach for augmenting linked open data repositories using tabular data, thus tapping into these under-explored sources of valuable information. Unlike prior methods that focus on natural language understanding to determine whether two named entities are even related, we start from the (reasonable) assumption that all entities in the same row of a table are related by construction. Unlike previous methods that attempt to understand tabular data, we take a more pragmatic and effective stance: we label pairs of columns in the table with relations coming from an established knowledge base. By doing so, all facts we extract can be interpreted in the same way as those in the knowledge base.

We described a first implementation of our model using linked open data resources (YAGO, PATTY, and NELL) and showed experimentally that the approach is effective, despite the limitations in the way we match entities. We also showed that it is rather easy to find new knowledge with our model. Yet, in our opinion, we have only sketched a research direction rich in opportunities to improve knowledge building and linking. There are some limitations that we aim to address in future work. Instead of a limited number of YAGO relations, we aim to use a wide range of relations (e.g., those in Freebase). Moreover, we can increase recall by using proper entity linking techniques such as those in [17].

Another interesting line of future work would be to estimate how many new triples could be extracted from tabular data on the entire Web, and how accurate they would be. To do so, one needs a systematic approach and some machinery to automatically check whether the new facts already exist in the knowledge base, as well as whether these facts are accurate. Different notions of accuracy apply here: for example, the new facts may contradict existing knowledge, or they may be expressed at a different granularity, as was the case in our experiment with NBA players. One could also use both quantitative and qualitative metrics to chart which websites provide the best data.

Acknowledgements. This work was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada and the IBM Alberta Centre for Advanced Studies.

7. REFERENCES
[1] M. D. Adelfio and H. Samet. Schema extraction for tabular data on the web. Proc. VLDB Endow., 6(6):421-432, 2013.
[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Int. J. Sem. Web Inf. Sys., 5(3), 2009.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD'08 Conf. Proc., 2008.
[4] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the power of tables on the web. Proc. VLDB Endow., 1(1), 2008.
[5] Z. Chen and M. Cafarella. Automatic web spreadsheet data extraction. In SSW'13 Conf. Proc., 2013.
[6] L. Ding, D. DiFranzo, A. Graves, J. Michaelis, X. Li, D. L. McGuinness, and J. Hendler. Data-gov wiki: Towards linking government data. In AAAI Spring Symp.: Linked Data Meets AI, 2010.
[7] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In SODA'03 Conf. Proc., 2003.
[8] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338-1347, 2010.
[9] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL'09 Conf. Proc., 2009.
[10] V. Mulwad, T. Finin, and A. Joshi. Semantic message passing for generating linked data from tables. In ISWC'13 Conf. Proc., 2013.
[11] E. Munoz, A. Hogan, and A. Mileo. Triplifying Wikipedia's tables. In LD4IE'13 Workshop Proc., ISWC. CEUR, 2013.
[12] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL'12 Conf. Proc., 2012.
[13] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery. In HLT-NAACL'06 Conf. Proc., 2006.
[14] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In WWW'07 Conf. Proc., 2007.
[15] P. P. Talukdar, D. Wijaya, and T. Mitchell. Acquiring temporal constraints between relations. In CIKM'12 Conf. Proc., 2012.
[16] P. Venetis, A. Halevy, J. Madhavan, M. Paşca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. Proc. VLDB Endow., 4(9):528-538, 2011.
[17] M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol, and G. Weikum. AIDA: An online tool for accurate disambiguation of named entities in text and tables. Proc. VLDB Endow., 4(12):1450-1453, 2011.
[18] G. Zhou, M. Zhang, D. Ji, and Q. Zhu. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In EMNLP-CoNLL'07 Conf. Proc., 2007.