Knowledge Base Augmentation using Tabular Data*

Yoones A. Sekhavat (1), Francesco di Paolo (2), Denilson Barbosa (1), Paolo Merialdo (2)
(1) University of Alberta, {yoones.a.s,denilson}@ualberta.ca
(2) Roma Tre University, {fradipi,merialdo}@dia.uniroma3.it

* All data used to build and test the models in our work can be found at http://cs.ualberta.ca/~denilson/data/ldow14_ualberta_data.zip

Copyright is held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.

ABSTRACT
Large linked data repositories have been built by leveraging semi-structured data in Wikipedia (e.g., DBpedia) and through extracting information from natural language text (e.g., YAGO). However, the Web contains many other vast sources of linked data, such as structured HTML tables and spreadsheets. Often, the semantics in such tables is hidden, preventing one from extracting triples from them directly. This paper describes a probabilistic method that augments an existing knowledge base with facts from tabular data by leveraging a Web text corpus and natural language patterns associated with relations in the knowledge base. A preliminary evaluation shows high potential for this technique in augmenting linked data repositories.

Figure 1: Example tabular data.
  Ronaldinho       Brazil     Barcelona FC
  Fabio Cannavaro  Italy      Juventus
  Kaka             Brazil     AC Milan
  Lionel Messi     Argentina  Barcelona FC

1. INTRODUCTION
The Web is bursting with valuable tabular data that could be leveraged in many applications, such as knowledge base population and query answering. For instance, Cafarella et al. report 14.1 billion HTML tables in English documents in Google's main index [4], and over 400,000 Excel spreadsheets in the ClueWeb09 crawl (ClueWeb, 2009, http://lemurproject.org/clueweb09.php) [5]. Tapping into these semi-structured information sources is difficult, however, and often they end up being treated no differently than plain text documents. What is worse, state-of-the-art text-based information extraction systems fail on such tables, as they require full sentences to yield good results.

Tables have inherent semantics which are often implicit or only given in the pages that contain them. Consider the example in Figure 1, which is a snapshot of a table in Wikipedia. It may not be obvious from the table that it is about winners of FIFA's World Player of the Year award. Nevertheless, the fact that someone put those literals together in the same rows indicates that there are relationships between them. In this case, we know that Ronaldinho was born in Brazil and played for Barcelona. Moreover, the same relationships apply to all rows: Cannavaro was born in Italy and played for Juventus. In fact, we can say that the relations was born in and played for are defined over the columns (1st and 2nd, and 1st and 3rd, respectively).

A general approach for understanding such tables with the help of the linked open data web [2] would be to (1) link the values in each cell to known entities in a linked data knowledge base (e.g., YAGO [14] and Freebase [3]), and (2) identify relations between the linked values. The best-case scenario for this approach would be when all entities are linked to the same knowledge base and a relation already exists between them (this is the approach in [8]). However, there are other situations. For example, it could be that the entities exist in different, unlinked, knowledge bases, or that some of the entities are not linked to anything yet. Tackling such cases would provide new triples, thus augmenting the linked open data web with new facts.

This paper focuses on the problem of identifying plausible relations between pairs of entities that appear in the same row of a table. Our main assumption is that if someone went to the trouble of juxtaposing these entities in the same file, then there must be a relation between them. Moreover, we seek to augment an existing repository with new instances of relations already defined. In other words, the list of possible relations is part of the input. We also assume the entities can be resolved and linked to a linked open data repository or knowledge base; however, unlike previous works (e.g., [8]), our method does not require that.

Overview. As an example, assume our input table has only columns 1 and 3 in Figure 1, and assume we have two candidate relations: plays-for and lives-in. We start from a list of textual patterns that are associated with each relation; such patterns are detected by automatic methods for building knowledge bases from natural language. For example, patterns for the relation plays-for could be "scored for" and "signed contract with". In practice, thousands of patterns are associated with a single relation; conversely, the same pattern may be associated with multiple relations. For example, the pattern "played in" may represent the relations plays-for and performed-at (e.g., "Pink Floyd played in Pompeii"). Thus, we build a probabilistic model to estimate the posterior probability of a relation, given a set of observed text patterns. To make a prediction for a given pair of entities, we: (1) collect all sentences containing both entities from a large text corpus; (2) extract the text in between them; (3) match those texts against the list of patterns; and (4) estimate the posterior probability of all candidate relations.

Figure 2: Rank Aggregation approach vs. Global Ranking.

Another challenge, which has been the subject of substantial work recently, is entity linking. For example, "Barcelona" may represent the football club or the city. Evidently, the choice of entity linking approach will have an impact on the quality of the triples produced by a method like ours. To avoid this factor in our study of the relation assignment problem, we use exact matching to "link" the entities. In effect, this is akin to assuming the entities are already linked.
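The four prediction steps of the overview can be sketched as follows. This is a toy illustration rather than the authors' code: the tiny in-memory corpus, the pattern set, and all names are hypothetical, and a real implementation would scan a large Web corpus.

```python
import re

def observed_patterns(e1, e2, sentences, patterns):
    """Steps (1)-(3): collect sentences mentioning both entities,
    extract the text between the two mentions, and match that text
    against the known textual patterns of the candidate relations."""
    observed = set()
    for sent in sentences:
        m = re.search(re.escape(e1) + r"\s+(.*?)\s+" + re.escape(e2), sent)
        if m is None:
            continue
        between = m.group(1).lower()
        # a pattern counts as "observed" if it occurs in the between-text
        observed.update(p for p in patterns if p in between)
    return observed

corpus = ["Ronaldinho signed contract with Barcelona in 2003.",
          "Last night Ronaldinho scored for Barcelona again."]
patterns = {"scored for", "signed contract with", "lives in"}
evidence = observed_patterns("Ronaldinho", "Barcelona", corpus, patterns)
# "evidence" then feeds step (4), the posterior estimate of Section 2
```

The observed set is the only thing the probabilistic model consumes, so any stronger matcher (tokenized spans, an entity linker) can replace the regular expression without changing the rest of the pipeline.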
Paper Outline. Section 2 illustrates our model to extract triples from tabular data. Section 3 discusses our first implementation of the model, and Section 4 reports the results of experiments on this first implementation. Section 5 presents related work, and Section 6 concludes the paper and presents ideas to improve our model.

2. TRIPLE EXTRACTION
In general terms, the problem we address is as follows: given a table T, with n rows and k columns, whose cells contain mentions of entities, and a set R = {r1, ..., rm} of relations of interest, we aim to produce all triples <ti[x], rj, ti[y]>, for all ti in T and rj in R, where ti[x] denotes the value of column x in row i.

Furthermore, we seek to assign relation rj to pairs of columns in T, in the sense that the predicate rj holds for all rows of T. The problem boils down to the case of an input table with just two columns x and y, and, without loss of generality, we address this case here.

Understanding semantic relations between entities using text corpora is a challenging task because a relation can be expressed with different textual (surface) forms. For example, the relation plays-for can be inferred from sentences with the patterns "scored for" and "signed contract with". On the other hand, a pattern may express more than one relation.

2.1 Relation between a Pair of Entities
We start with determining whether a relation r applies to two entities. In the absence of further information, a reasonable approach is to search the (textual) Web for all sentences connecting the entities and determine whether r follows from those sentences. While textual entailment and other reasoning methods could be used for this purpose, we turn this into a probabilistic inference as follows. We start from a known set of textual patterns p1, ..., pk that are commonly associated with relation r (we use the patterns of the PATTY system [12]); this gives us prior probabilities for each pattern expressing the relation. To determine if entities e1 and e2 belong in relation r, we search the corpus for all sentences containing both e1 and e2, extract all words in between them, and match those words against the list of patterns. We can then use Bayesian inference to compute the posterior probability of r given the observed patterns.

In this model, relation r is a categorical class variable whose domain is R, and the patterns p1, ..., pk are binary evidence variables, representing patterns observed between entities in the text corpus. The model is thus:

  Pr(r | p1, ..., pk) = Pr(r) Pr(p1, ..., pk | r) / Pr(p1, ..., pk)    (1)

Relations maximizing Pr(r | p1, ..., pk) are the most probable relations for the entities (given the corpus used). A last component of the model is a threshold to filter out low-confidence predictions.

We assume the evidence variables {p1, ..., pk} are conditionally independent, which is reasonable since the probability that pattern pi represents a relation does not depend on the probability that another pattern pj also represents that relation. Given this assumption and using the chain rule, we have:

  Pr(r | p1, ..., pk) = Pr(r) * PROD_{i=1..k} Pr(pi | r) / Pr(p1, ..., pk)    (2)
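As a sketch of Eq. (2), the posterior can be computed in log space to avoid underflow over many patterns. The priors below are made-up numbers, not values estimated from YAGO/PATTY (Section 3 describes how the real priors are obtained), and the function names are our own.

```python
from math import exp, log

def posterior(observed, prior_r, prior_p_given_r, unseen=1e-9):
    """Eq. (2): Pr(r | p1..pk) is proportional to Pr(r) * prod_i Pr(p_i | r),
    computed in log space and normalized so scores sum to 1 over relations."""
    log_scores = {}
    for r, pr in prior_r.items():
        s = log(pr)
        for p in observed:
            s += log(prior_p_given_r[r].get(p, unseen))  # unseen pattern: tiny prob
        log_scores[r] = s
    z = sum(exp(s) for s in log_scores.values())  # the Pr(p1..pk) normalizer
    return {r: exp(s) / z for r, s in log_scores.items()}

# hypothetical priors for the two candidate relations of the overview
prior_r = {"plays-for": 0.5, "lives-in": 0.5}
prior_p = {"plays-for": {"scored for": 0.4, "signed contract with": 0.2},
           "lives-in": {"moved to": 0.6}}
post = posterior({"scored for"}, prior_r, prior_p)
# "plays-for" ends up far more probable than "lives-in"
```

The low-confidence threshold mentioned above would simply be applied to the maximum of the returned dictionary.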
2.2 Relation between Two Lists of Entities
Consider now the case of a table with n rows (i.e., a list of n entity pairs). There are two main approaches, illustrated in Figure 2: we can find one relation assignment for each row and compute an aggregate from all these assignments, or we can observe all text patterns for all rows at once and obtain a global assignment. We discuss each next.

Rank Aggregation. In this approach, we first obtain, for each pair of entities in the table, a ranking of all relations in R sorted in decreasing likelihood according to the model above. Next, we combine these ranked lists to find a single best relation for all entity pairs.

Rank aggregation [7] is a well-studied problem: given n ranked lists l1, ..., ln and a distance measure d, the problem is to find a list l such that SUM_{i=1..n} d(l, li) is minimized. Among the different distance measures used to aggregate ranked lists, we use Spearman's Footrule (SP) and Kendall's tau (KE), which were shown to outperform other approaches [7]. The Kendall's tau distance between two permutations of a list is the number of pairs from the list which are not in the same order in the two permutations; this distance is also known as the number of exchanges needed by a bubble sort to convert one permutation into the other. The Spearman's Footrule distance is the sum of the distances between the positions of each item in the two permutations. We also employ a simple average-score method to aggregate ranked lists, called Mean Ranking (MR), which works as follows: given the sorted (descending) list of probabilities Pr(r | p1, ..., pk) generated for each pair <xi, yi>, and m possible relations in the knowledge base, we assign a score sci(r) = m - pos(r), where pos(r) is the position of relation r in this sorted list.

Global Ranking. Another approach is to feed all patterns for all entity pairs in the relation as evidence to the probabilistic model. In this Global Ranking (GR) approach, all observed patterns for all entity pairs simultaneously contribute to the selection of the most probable relation.

3. IMPLEMENTATION
This section describes one instantiation of the probabilistic model developed in the previous section, using off-the-shelf, real-world knowledge bases and text corpora.

As mentioned before, we link entities mentioned in the tables to those in the text corpus with exact string matching. That is, given entities e1 and e2 in a table row, we find all triples in the NELL corpus which have e1 as the subject and e2 as the object. Similarly, we match the verbs of the resulting triples against the PATTY patterns (to obtain the observations). In effect, our system uses the "intersection" of PATTY patterns and NELL triples, consisting of 4,357 unique patterns and 108,699,400 triples. The YAGO relation has-academic-advisor was discarded as none of its patterns are found in the NELL corpus. Table 1 shows the relations used and the number of patterns for each of them. Note the wide variation in the number of patterns per relation.

Table 1: Number of PATTY patterns per relation, and the rank of the correct relation obtained by each strategy. A "-" indicates the relation was incorrectly ranked 4th or lower by the method.

  Relation                Patterns  SP  KE  MR  GR
  ismarriedto                 1274   1   1   1   1
  created                     1148   1   1   1   1
  haschild                    1090   -   -   -   3
  influences                   694   1   1   2   -
  actedin                      624   2   2   1   1
  graduatedfrom                472   1   1   1   1
  isknownfor                   452   -   -   -   -
  worksat                      447   -   -   1   1
  holdspoliticalposition       417   -   3   1   1
  directed                     400   2   -   2   2
  playsfor                     354   1   1   1   1
  diedin                       335   3   -   1   2
  wasbornin                    273   -   -   1   1
  islocatedin                  249   -   3   -   -
  livesin                      200   -   -   -   -
  isleaderof                   156   -   -   -   -
  iscitizenof                  121   1   1   -   -
  haswonprize                   81   -   -   3   1
  dealswith                     59   -   -   -   1
  ispoliticianof                49   1   -   -   1
  participatedin                33   1   1   2   2
  happenedin                    12   2   1   -   1
  hascapital                     1   2   1   -   -

Estimating Prior Probabilities. Recall Equation 2. We estimate the prior probability of each of the 24 relations as Pr(r) = |r| / SUM_{ri in R} |ri|, where |r| is the number of instances of relation r and R is the set of all relations in YAGO.
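These estimates, together with the smoothed pattern priors Pr(p | r) described next, can be sketched as follows; the count dictionaries are hypothetical stand-ins for the YAGO instance counts and PATTY pattern counts.

```python
def estimate_priors(instances_per_relation, pattern_counts):
    """Pr(r) = |r| / sum over ri of |ri|, and
    Pr(p | r) = (|p| + 1) / (sum over pi in PT(r) of |pi| + |PT(r)|):
    the relative frequency of pattern p among the patterns PT(r)
    associated with r, with add-one (Laplace) smoothing so that no
    pattern gets a zero probability."""
    total = sum(instances_per_relation.values())
    prior_r = {r: n / total for r, n in instances_per_relation.items()}
    prior_p = {}
    for r, counts in pattern_counts.items():
        denom = sum(counts.values()) + len(counts)  # add one per pattern
        prior_p[r] = {p: (n + 1) / denom for p, n in counts.items()}
    return prior_r, prior_p

# hypothetical counts, two relations and three patterns
prior_r, prior_p = estimate_priors(
    {"playsfor": 8000, "livesin": 2000},
    {"playsfor": {"scored for": 30, "signed contract with": 10},
     "livesin": {"moved to": 5}})
```

The smoothed conditional probabilities still sum to one over the patterns of each relation, which keeps Eq. (2) a proper (unnormalized) product of probabilities.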
As for Pr(p | r), the prior probability that pattern p occurs among instances of relation r, we use the associations between YAGO relations and textual patterns in PATTY: Pr(p | r) = |p| / SUM_{pi in PT(r)} |pi|, where PT(r) is the set of patterns associated with r. To avoid zero probabilities, we use the add-one Laplace smoothing technique.

Data. We use YAGO [14] and PATTY [12] to obtain relations and patterns. YAGO is a popular knowledge base with about 10 million entities and 120 million facts; PATTY, developed by the same research group, associates 22,779 patterns mined from New York Times articles and Wikipedia with 25 relations. Of those, 24 exist in YAGO and were used here. Our text corpus is the NELL Subject-Verb-Object triple corpus [15], with about 604 million triples extracted via dependency parsing on the ClueWeb09 dataset (ClueWeb09 is a Web crawl with about 1 billion web pages in ten different languages).

4. EXPERIMENTAL EVALUATION
Since we do not query YAGO to make predictions, we use some of its facts to build a ground truth to test the accuracy of our model. We extracted facts from YAGO relations where both entities can be matched exactly in the NELL corpus.

Table 2: Results of rankings on 23 YAGO relations
                 Rank Aggregation   Global Ranking
                 SP    KE    MR     GR
  Ranked First    8     9     9     12
  Ranked Second   4     1     3      3
  Ranked Third    1     2     1      1
  Ranked > 3     10    11    10      7

Performance. For efficiency reasons, we index patterns using an in-memory suffix tree. The average execution times in milliseconds for processing a pair of entities (taken over 20 executions) are: 1688 for SP; 1868 for KE; 1729.4 for MR; and 1719 for GR. As one can see, there are no considerable differences among the methods. In fact, our observation is that the majority of the time is spent on matching the entities against the NELL corpus.
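The two distance measures and the MR scorer compared above can be sketched as follows. This is a toy sketch with hypothetical inputs: it computes the distances and the MR winner, but omits the separate optimization of searching for the footrule- or Kendall-optimal aggregate list.

```python
from itertools import combinations

def footrule(l1, l2):
    """Spearman's Footrule: sum of |position in l1 - position in l2|."""
    pos = {x: i for i, x in enumerate(l2)}
    return sum(abs(i - pos[x]) for i, x in enumerate(l1))

def kendall_tau(l1, l2):
    """Kendall's tau distance: number of pairs ordered differently in the
    two permutations (the bubble-sort exchange count)."""
    pos = {x: i for i, x in enumerate(l2)}
    return sum(1 for a, b in combinations(l1, 2) if pos[a] > pos[b])

def mean_ranking(ranked_lists):
    """Mean Ranking (MR): each per-pair list awards sc(r) = m - pos(r)
    (0-based position here); the relation with the highest total wins."""
    m = len(ranked_lists[0])
    scores = {}
    for l in ranked_lists:
        for pos, r in enumerate(l):
            scores[r] = scores.get(r, 0) + (m - pos)
    return max(scores, key=scores.get)

rows = [["plays-for", "lives-in"], ["plays-for", "lives-in"],
        ["lives-in", "plays-for"]]   # hypothetical per-row rankings
best = mean_ranking(rows)            # "plays-for" tops 2 of the 3 rows
```

Both distances are metrics over permutations of the same items, which is why they can be plugged interchangeably into the aggregation objective of Section 2.2.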
Table 3: Results of rankings on filtered relations
                 Rank Aggregation   Global Ranking
                 SP    KE    MR     GR
  Ranked First    9     7     9      8
  Ranked Second   1     3     2      2
  Ranked Third    2     1     2      3
  Ranked > 3      2     3     1      1

The number of facts from YAGO that can be found in NELL triples using exact matching varies across relations. The lower bound was 25 facts, for the relation is-known-for. For the other relations, we picked several random samples of 25 pairs and tested each separately. The difference in accuracy was negligible, so we used 25 facts per relation.

Effectiveness. We performed experiments to evaluate the effectiveness of the three rank aggregation techniques (Spearman's Footrule (SP), Kendall's tau (KE), and Mean Ranking (MR)) as well as the Global Ranking (GR) approach. The rank of the correct relation generated by the system is our measure of success in annotating that relation. The best result is achieved when the correct relation appears at the top of the ranked list for the facts of that relation in the ground truth. Table 2 shows the number of relations ranked in the top 3 positions, as well as anywhere below the 3rd place.

As one can see, GR outperforms the rank aggregation techniques in identifying correct relations. Among the rank aggregation techniques, MR works slightly better than SP and KE in this test; however, an analysis of variance shows no statistically significant differences between the rank aggregation results. Another observation is that the number of PATTY patterns has an effect on accuracy, and this effect is more pronounced for the rank aggregation methods. As shown in Table 1, relations associated with fewer patterns are less likely to be identified correctly by rank aggregation techniques. We argue this is due to a lack of sufficient evidence (patterns) for each pair; it follows that rank aggregation techniques require more evidence in order to infer correct relations. On the other hand, GR performs better for relations associated with fewer patterns. This happens because many relevant patterns are more likely to appear in the union of the patterns of all pairs than for any individual entity pair.

We also filtered out relations with 200 or fewer patterns in our corpus, recomputed the prior probabilities, and re-executed the experiments on the same dataset. Table 3 shows the results of this experiment. Although MR performed slightly better than the other techniques, a statistical test reveals that the differences are not significant. What is important to note is that, as expected, filtering out less popular relations leads to higher overall accuracy. Applications using our technique can exploit this trade-off to set the appropriate threshold.

4.1 Towards Knowledge Augmentation
The ultimate goal of our technique is knowledge augmentation: generating new instances of relations, from tabular data, that are not already in the knowledge base. We performed preliminary experiments to assess whether our system could accomplish this goal.

Our first test was with a spreadsheet including song data available at www.aardvarkdjservices.co.uk (a website specialized in music services). We looked at 48 (singer, song) pairs from 2 albums, with 24 songs from Elvis Presley and 24 songs from Frank Sinatra. Every approach returned created as the best relation between those entities. We manually verified the 48 facts in this case, and found that only 31 were already in YAGO. In another experiment, we used a spreadsheet with data about NBA players extracted from www.espn.go.com, and tested our system with 100 (player, team) pairs. YAGO had 92 of these pairs in the is-affiliated-to relation. Every configuration of our system identified all 100 pairs as instances of the plays-for relation, which, one can argue, is a suitable and more specific relation for these entities.

5. RELATED WORK
A lot of work has been done towards understanding tables within text or online. Some have attempted to exploit column headers to identify relations between two columns (e.g., [6]), which is akin to schema-based data integration. We make no assumption about the existence of this information, as it may not be available for all tables; this makes our method more similar to an instance-based data integration approach.

A probabilistic model is proposed in [16] to associate class labels with columns and to identify relations between entity columns and the remaining columns. Recently, joint inference techniques have been used to simultaneously annotate table cells, table columns, and relations between columns: in [8, 10], graphical models are employed to annotate column headers, table cells, and relations between columns. Our work is similar, to some extent, to [11], in which Wikipedia's tables are used to generate new triples using DBpedia as a knowledge base. Unlike our method, these techniques require linking entities to one or more linked open data repositories.

The problem has also been considered in terms of extracting schema for tabular data. In [5], an extraction system is proposed to convert data stored in spreadsheets into relational tuples. In [1], a set of row classes representing common features of individual rows in a table is identified; Conditional Random Field techniques are then used to generate a sequence of row labels.

Our work is also related to relation extraction techniques over text corpora. In supervised learning (e.g., [18]), manually labelled relations are used to train a model for labeling relations. On the other hand, in unsupervised approaches (e.g., [13]), strings between entities in a text corpus are clustered and then simplified to generate relations. In [9], a classifier is trained using textual features of sentences between known entities in Freebase. That technique generates instances of new relations, while our technique generates instances of relations linked to existing relations in linked open data repositories.

6. CONCLUSION
In this paper, we described a probabilistic approach for augmenting linked open data repositories using tabular data, thus tapping into these under-explored sources of valuable information. Unlike prior methods that focus on natural language understanding to determine whether two named entities are even related, we start from the (reasonable) assumption that all entities in the same row of a table are related by construction. Unlike previous methods that attempt to understand tabular data, we take a more pragmatic and effective stance: we label pairs of columns in the table with relations coming from an established knowledge base. By doing so, all facts we extract can be interpreted in the same way as those in the knowledge base.

We described a first implementation of our model using linked open data resources (YAGO, PATTY, and NELL) and showed experimentally that the approach is effective, despite the limitations in the way we match entities. We also showed that it is rather easy to find new knowledge with our model. Yet, in our opinion, we have only sketched a research direction rich in opportunities to improve knowledge building and linking. There are some limitations that we aim to address in future work. Instead of a limited number of YAGO relations, we aim to use a wide range of relations (e.g., those in Freebase). Moreover, we can increase recall by using proper entity linking techniques such as those in [17].

Another interesting line of future work would be to estimate how many new triples could be extracted from tabular data on the entire Web, and how accurate they would be. To do so, one needs a systematic approach and some machinery to automatically check whether the new facts already exist in the knowledge base, as well as whether these facts are accurate. Different notions of accuracy apply here: for example, the new facts may contradict existing knowledge, or they may be expressed at a different granularity, as was the case in our experiment with NBA players. One could also use both quantitative and qualitative metrics to chart which websites provide the best data.

Acknowledgements. This work was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada and the IBM Alberta Centre for Advanced Studies.

7. REFERENCES
[1] M. D. Adelfio and H. Samet. Schema extraction for tabular data on the web. Proc. VLDB Endow., 6(6):421-432, 2013.
[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Int. J. Sem. Web Inf. Sys., 5(3), 2009.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD'08 Conf. Proc., 2008.
[4] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the power of tables on the web. Proc. VLDB Endow., 1(1), 2008.
[5] Z. Chen and M. Cafarella. Automatic web spreadsheet data extraction. In SSW'13 Conf. Proc., 2013.
[6] L. Ding, D. DiFranzo, A. Graves, J. Michaelis, X. Li, D. L. McGuinness, and J. Hendler. Data-gov wiki: Towards linking government data. In AAAI Spring Symp.: Linked Data Meets AI, 2010.
[7] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In SODA'03 Conf. Proc., 2003.
[8] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338-1347, 2010.
[9] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL'09 Conf. Proc., 2009.
[10] V. Mulwad, T. Finin, and A. Joshi. Semantic message passing for generating linked data from tables. In ISWC'13 Conf. Proc., 2013.
[11] E. Munoz, A. Hogan, and A. Mileo. Triplifying Wikipedia's tables. In LD4IE'13 Workshop Proc., ISWC. CEUR, 2013.
[12] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL'12 Conf. Proc., 2012.
[13] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery. In HLT-NAACL'06 Conf. Proc., 2006.
[14] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In WWW'07 Conf. Proc., 2007.
[15] P. P. Talukdar, D. Wijaya, and T. Mitchell. Acquiring temporal constraints between relations. In CIKM'12 Conf. Proc., 2012.
[16] P. Venetis, A. Halevy, J. Madhavan, M. Paşca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. Proc. VLDB Endow., 4(9):528-538, 2011.
[17] M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol, and G. Weikum. AIDA: An online tool for accurate disambiguation of named entities in text and tables. Proc. VLDB Endow., 4(12):1450-1453, 2011.
[18] G. Zhou, M. Zhang, D. Ji, and Q. Zhu. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In EMNLP-CoNLL'07 Conf. Proc., 2007.