DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference

Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik
Department of Computer Sciences, University of Wisconsin-Madison, USA
{leonn,czhang,chrisre,shavlik}@cs.wisc.edu

VLDS'12, August 31, 2012, Istanbul, Turkey. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this approach is scalability, i.e., how to deal with terabytes of imperfect data efficiently. We describe how we address the scalability challenges to achieve web-scale KBC and the lessons we have learned from building DeepDive.

1. INTRODUCTION

Knowledge-base construction (KBC) is the process of populating a knowledge base (KB) with facts (or assertions) extracted from text. It has recently received tremendous interest from academia, e.g., CMU's NELL [2] and MPI's YAGO [7,9], and from industry, e.g., IBM's DeepQA [5] and Microsoft's EntityCube [16]. To achieve high quality, these systems leverage a wide variety of data resources and KBC techniques. A crucial challenge that these systems face is coping with imperfect or conflicting information from multiple sources [3,13]. To address this challenge, we present an end-to-end KBC system called DeepDive (http://research.cs.wisc.edu/hazy/deepdive). DeepDive went live in January 2012 after processing the 500M English web pages in the ClueWeb09 corpus (http://lemurproject.org/clueweb09.php/), and since then has been adding several million newly-crawled web pages every day. Figure 1 shows several screenshots of DeepDive.

[Figure 1: Screenshots of DeepDive showing facts about Barack Obama and provenance. (Left) Facts about Obama in DeepDive; (Top Right) Sentences mentioning the fact that Obama went to Harvard Law School; (Bottom Right) Text from a web page annotated with entity mentions.]

Similar to YAGO [7,9] and EntityCube [16], DeepDive is based on the classic Entity-Relationship (ER) model [1] and employs popular techniques such as distant supervision [15] and the Markov logic language [11] to combine a variety of signals. However, DeepDive goes deeper in two ways: (1) unlike prior large-scale KBC systems, DeepDive performs deep natural language processing (NLP) to extract useful linguistic features such as named-entity mentions and dependency paths (http://nlp.stanford.edu/software/) from terabytes of text; and (2) DeepDive performs web-scale statistical learning and inference using classic data-management and optimization techniques.

Figure 2 depicts the current architecture of DeepDive. To populate a knowledge base, DeepDive first converts diverse input data (e.g., raw corpora and ontologies) into relational features using standard NLP tools and custom code. These features are then used to train statistical models representing the correlations between linguistic patterns and target relations. Finally, DeepDive combines the trained statistical models with additional knowledge (e.g., domain knowledge) into a Markov logic program that is then used to transform the relational features (e.g., candidate entity mentions and linguistic patterns) into a knowledge base with entities, relationships, and their provenance.

Given the amount of data and depth of processing, a key challenge is scaling feature extraction (see Figure 2). For example, while deep linguistic features such as dependency paths are useful for relationship extraction, it takes about 100K CPU hours for our NLP pipeline to finish processing ClueWeb09. Although we have access to a 100-node Hadoop cluster, we found that the throughput of Hadoop is often limited by pathological data chunks that take a very long time to process or even crash (due to Hadoop's no-task-left-behind failure model). Fortunately, thanks to the Condor infrastructure (http://research.cs.wisc.edu/condor/), we were able to finish feature extraction on ClueWeb09 within a week by opportunistically assigning jobs to hundreds of workstations and shared cluster machines using a best-effort failure model.

A second scalability challenge is statistical learning and inference. There are two aspects of scalability with statistical learning: scale of training examples and scale of performance. To scale up the amount of training examples for KBC, we employ the distant supervision technique [8,14,15], which automatically generates what are called silver-standard examples by heuristically aligning raw text with an existing knowledge base such as Freebase (http://freebase.com). To scale up the performance of machine learning, we leverage the Bismarck system [4], which executes a wide variety of machine learning techniques inside an RDBMS. For statistical inference, DeepDive employs a popular statistical-inference framework called Markov logic. In Markov logic, one can write first-order logic rules with weights (that intuitively model our confidence in a rule); this allows one to capture rules that are likely, but not certain, to be correct. A Markov logic program (aka Markov logic network, or simply MLN) specifies what (evidence) data are available, what predictions to make, and what constraints and correlations there are [10].
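To make weighted rules concrete, recall the standard Markov logic semantics [11]: an MLN is a set of weighted first-order formulas {(F_i, w_i)}, and together with a finite set of constants it defines a probability distribution over possible worlds x:

    P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )

where n_i(x) is the number of true groundings of F_i in world x and Z is a normalizing constant. A KBC-style rule might look like the following (the predicate names here are our own illustration, not taken from DeepDive's actual program):

    1.5  MentionOf(m1, p) ∧ MentionOf(m2, l) ∧ BornInPattern(m1, m2) ⇒ BirthPlace(p, l)

A higher weight makes worlds that violate the rule less probable without forbidding them outright, which is what lets Markov logic combine noisy, conflicting signals.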
However, when trying to apply Markov logic to KBC, we found that existing MLN systems such as Alchemy (http://alchemy.cs.washington.edu) do not scale to our datasets. To cope, we designed and implemented two novel approaches to MLN inference by leveraging data-management and optimization techniques.

In addition to statistical learning and inference, we have found debugging and tuning based on the output of a KBC system to be an effective method to improve KBC quality. To support systematic debugging and tuning, it is important that the underlying statistical models are well-calibrated, i.e., predictions with probability around p should have an actual accuracy around p as well. Such calibration is an integral part of the development process of DeepDive.

We describe the DeepDive architecture in Section 2 and some implementation details in Section 3. We summarize the lessons we have learned from DeepDive as follows:

Inference and Learning. Statistical inference and learning were bottlenecks when we first started DeepDive, but we can now scale them with data-management and optimization techniques.

Feature Extraction. Good features are a key bottleneck for KBC; with a scalable inference and learning infrastructure in place, we can now focus on gathering and tuning features for DeepDive.

Debugging and Tuning. Developing KBC systems is an iterative process; systematic debugging and tuning requires well-calibrated statistical models.

2. DEEPDIVE ARCHITECTURE

We first describe a simple KBC model that we use in DeepDive, and then briefly discuss the infrastructure that enables web-scale KBC in DeepDive, namely how we scale up web-scale feature extraction, machine learning for KBC, and statistical inference in Markov logic, respectively.

[Figure 2: Architecture of DeepDive. DeepDive takes as input diverse data resources (ClueWeb09: 500M pages; Wikipedia: 3M entities; web-search results: 100M; Freebase: 3M facts), converts them into relational features using NLP tools and custom code (yielding roughly 25B candidate entity mentions and 8B candidate relationship mentions), and then performs machine learning and statistical inference — guided by domain knowledge and developer/user feedback through a Markov logic program — to construct a KB of entities, relationships, and provenance.]

A Conceptual KBC Model. DeepDive adopts the classic Entity-Relationship (ER) model [1]: the schema of the target knowledge base (KB) is specified by an ER graph G = (Ē, R̄), where Ē is one or more sets of entities (e.g., people and organizations) and R̄ is a set of relationships. Define E(G) = ∪_{E∈Ē} E, i.e., the set of known entities. To specify a KBC task to DeepDive, one provides the schema G and a corpus D. Each document d_i ∈ D consists of a set of (possibly overlapping) text spans (e.g., tokens or sentences) T(d_i). Text spans referring to entities or relationships are called mentions (see Figure 4). Define T(D) = ∪_{d_i∈D} T(d_i). Our goal is to accurately populate the following tables:

• Entity-mention table M(E(G), T(D)). (For simplicity, we assume that all entities are known, but DeepDive supports generating novel entities for the KB as well, e.g., by clustering "dangling" mentions.)

• Relationship-mention tables M_{R_i} ⊆ T(D)^{k+1} for each R_i ∈ R̄, where k is R_i's arity and the first k attributes (resp. the last attribute) are entity (resp. relationship) mentions.

• Relationship tables R_i ∈ R̄.

Note that R_i can be derived from M_E and M_{R_i}, as sketched below. By the same token, M_E and M_{R_i} provide provenance that connects the KB back to the documents supporting each fact. The process of populating M_E is called entity linking; the process of populating M_{R_i} is called relation extraction. Intuitively, the goal is to produce an instance J of these tables that is as large as possible (high recall) and as correct as possible (high precision). As shown in Figure 4, DeepDive populates the target KB based on signals from mention-level features (over text spans, e.g., positions, contained words, and matched regular expressions) and entity-level features (over the target KB, e.g., age, gender, and alias).
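As a minimal sketch of this table-population model, consider one binary relation expressed in Python over SQLite; the table and column names are our own illustration, not DeepDive's actual schema:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Entity-mention table M(E(G), T(D)): links text spans to known entities.
        CREATE TABLE entity_mention (entity TEXT, span_id TEXT);

        -- Relationship-mention table for a binary relation (arity k = 2):
        -- the first k attributes are entity mentions, the last is the
        -- relationship mention itself.
        CREATE TABLE birthplace_mention (span1 TEXT, span2 TEXT, rel_span TEXT);
    """)
    db.executemany("INSERT INTO entity_mention VALUES (?, ?)",
                   [("John", "d1:t3"), ("Springfield", "d1:t7")])
    db.execute("INSERT INTO birthplace_mention VALUES ('d1:t3', 'd1:t7', 'd1:s1')")

    # The relation table R_i is derivable from M_E and M_{R_i} by joining
    # mention pairs back to their entities; the mention tables double as
    # provenance for each derived fact.
    rows = db.execute("""
        SELECT e1.entity, e2.entity, rm.rel_span
        FROM birthplace_mention rm
        JOIN entity_mention e1 ON rm.span1 = e1.span_id
        JOIN entity_mention e2 ON rm.span2 = e2.span_id
    """).fetchall()
    print(rows)  # [('John', 'Springfield', 'd1:s1')]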
Scaling Feature Extraction at Web Scale. To scale DeepDive to web-scale KBC tasks, we employ high-throughput parallel computing frameworks such as Hadoop (http://hadoop.apache.org/) and Condor for feature extraction. We use the Hadoop File System for storage, but found a 100-node MapReduce cluster to be insufficient for ClueWeb: (1) Hadoop's all-or-nothing approach to failure handling hinders throughput, and (2) the number of cluster machines for Hadoop is limited. Fortunately, the Condor infrastructure supports a best-effort failure model, i.e., a job may finish successfully even when Condor fails to process a small portion of the input data. Moreover, Condor allows us to simultaneously leverage thousands of machines from across a department, an entire campus, or even the nation-wide Open Science Grid (http://www.opensciencegrid.org).

[Figure 3: Precision, recall, and F1 on TAC-KBP versus corpus size (10^2 to 10^8 documents). Quality improves as we increase the number of ClueWeb documents used for distant supervision [15]. Note that F1 and recall continue to improve as DeepDive processes increasingly large corpora (we do not yet have an explanation for the dip in precision).]

Scaling Machine Learning for KBC. Traditional KBC systems rely on manual annotations or domain-specific rules provided by experts, both of which are scarce resources. To remedy these problems, recent years have seen interest in the distant supervision approach for relation extraction [8,14,15]. The input to distant supervision is a set of seed facts for the target relation together with an (unlabeled) text corpus, and the output is a set of (noisy) annotations that can be used by any machine learning technique to train a statistical relation-extraction model. For example, given the target relation BirthPlace(person, place) and a known fact BirthPlace(John, Springfield), the sentence "John was born in Springfield in 1946" would qualify as a positive training example. Despite the noise in such examples, Zhang et al. [15] show that quality on the TAC-KBP (http://nlp.cs.qc.cuny.edu/kbp/2010/) relation-extraction benchmark improves significantly as we increase the training corpus size (Figure 3). DeepDive learns its relation-extraction models (as logistic regression classifiers) on about 1M examples (see Section 3) using the RDBMS-based Bismarck system [4].
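The alignment heuristic behind distant supervision is simple to sketch. The following toy version, based on the BirthPlace example above, matches raw substrings; a real pipeline aligns entity mentions (not strings) and extracts NLP features from each matched sentence rather than using the sentence itself:

    # A minimal sketch of distant supervision for BirthPlace(person, place).
    seed_facts = {("John", "Springfield")}  # from an existing KB such as Freebase

    corpus = [
        "John was born in Springfield in 1946.",
        "John visited Boston last week.",
    ]

    # Heuristic alignment: any sentence containing both arguments of a seed
    # fact becomes a (noisy) positive "silver-standard" training example.
    silver_examples = [
        (sent, person, place)
        for sent in corpus
        for (person, place) in seed_facts
        if person in sent and place in sent
    ]
    print(silver_examples)
    # [('John was born in Springfield in 1946.', 'John', 'Springfield')]

Note that the heuristic can also generate false positives (e.g., a sentence mentioning both John and Springfield without asserting a birthplace), which is why the resulting examples are noisy.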
Scaling Statistical Inference in Markov Logic. To scale up statistical inference in Markov logic, DeepDive employs the Tuffy [10] and Felix (http://research.cs.wisc.edu/hazy/felix) systems. Tuffy is based on the observation that MLN inference consists of a grounding step, which essentially performs relational operations, and a search (or sampling) step, which often comprises multiple independent subproblems. Thus, Tuffy achieves orders-of-magnitude speed-up in grounding (compared to Alchemy) by translating grounding into SQL statements that are executed by an RDBMS. Moreover, Tuffy performs graph partitioning at the search step to achieve scalability in inference and (serendipitously) improved quality. Felix is based on the observation that an MLN program (especially one for KBC) often contains routine subtasks such as classification and coreference resolution; these subtasks have specialized algorithms with high efficiency and quality. Thus, instead of solving a whole MLN with generic inference algorithms, Felix splits the program into multiple parts and solves the subtasks with the corresponding specialized algorithms. To resolve possible conflicts between predictions from different tasks, Felix employs the classic dual decomposition technique.
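To illustrate Tuffy's key idea that grounding is relational, the sketch below grounds the textbook rule Smokes(x) ∧ Friends(x, y) ⇒ Smokes(y) with a SQL join. This is a simplified "lazy" grounding that materializes only clauses whose body is supported by evidence; it is not Tuffy's actual compilation, and the rule and schema are the classic example from [11], not DeepDive's program:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE friends (x TEXT, y TEXT);
        CREATE TABLE smokes  (x TEXT);
    """)
    db.executemany("INSERT INTO friends VALUES (?, ?)",
                   [("anna", "bob"), ("bob", "carl")])
    db.execute("INSERT INTO smokes VALUES ('anna')")

    # Each satisfying (x, y) pair of the rule body yields one ground clause;
    # the enumeration is exactly a relational join, so it runs in the RDBMS.
    clauses = db.execute("""
        SELECT s.x, f.y
        FROM smokes s JOIN friends f ON s.x = f.x
    """).fetchall()
    for x, y in clauses:
        print(f"ground clause: Smokes({x}) AND Friends({x},{y}) => Smokes({y})")
    # ground clause: Smokes(anna) AND Friends(anna,bob) => Smokes(bob)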
3. IMPLEMENTATION DETAILS

A key tenet of DeepDive is that statistical learning and inference enable one to build high-quality KBC systems (see Figure 1) by combining diverse resources. We briefly describe more technical details of DeepDive to concretely demonstrate the advantage of this approach.

Entity Linking. Recall that entity linking is the task of mapping a textual mention to a real-world entity. We run the state-of-the-art StanfordNER (Named Entity Recognition, http://nlp.stanford.edu/ner/index.shtml) tool to extract textual mentions and corresponding entity types. DeepDive then tries to map each mention to a Wikipedia entry using the following signals: string matching, Wikipedia redirects and inter-page anchor text, Google and Bing search results that link to Wikipedia pages, entity-type compatibility between StanfordNER and Freebase, person-name coreference based on heuristics and proximity, etc. We write a couple dozen MLN rules for these data resources and then train the rule weights using the entity-linking training data from TAC-KBP. Our entity-linking component achieves an F1 score of 0.80 on the TAC-KBP benchmark (human performance is around 0.90). As shown in Figure 1, DeepDive also solicits user feedback on the entity types; we plan to integrate such feedback into the statistical inference process.

[Figure 4: An illustration of the KBC model in DeepDive. A corpus ("Mr. Gates was the CEO of Microsoft."; "Google acquired YouTube in 2006.") yields entity mentions (e.g., "Mr. Gates": PERSON) and relationship mentions; mention-level features (e.g., "Mr." is a male personal title; "Gates" is a common name) and entity-level features (e.g., Bill Gates is a male person, is very frequently mentioned, aka William Gates III) connect them to entities (Bill Clinton, Bill Gates, Steve Jobs, Barack Obama, Google Inc., Microsoft Corp., United States, YouTube, ...) and relationships such as FoundedBy(Microsoft Corp., Bill Gates) and Acquired(Google Inc., YouTube).]

Relation Extraction. The relation-extraction models in DeepDive are trained using the methods described in Zhang et al. [15]. Specifically, during feature extraction, we perform dependency parsing using MaltParser (http://www.maltparser.org/) and Ensemble (http://www.surdeanu.name/mihai/ensemble/). We use Freebase and about 2M high-quality news and blog articles provided by TAC-KBP to perform distant supervision, generating about 1M training examples over 20 target relations (including those shown in Figure 1). We use sparse logistic regression (ℓ1-regularized) classifiers [12] to train statistical relation-extraction models using both lexical (e.g., word sequences) and syntactic (e.g., dependency paths) features. We achieved an F1 score of 0.31 on the TAC-KBP relation-extraction benchmark (lower than only the top participant in TAC-KBP 2010 [6]).
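For intuition, here is a minimal sketch of such a sparse (ℓ1-regularized) logistic-regression relation classifier using scikit-learn; DeepDive itself trains in-RDBMS via Bismarck [4], and its real feature set includes dependency paths rather than the toy bag-of-n-grams used here:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy silver-standard examples for BirthPlace; real training data comes
    # from distant supervision over ~2M documents.
    examples = [
        "John was born in Springfield in 1946",   # positive
        "John visited Springfield last week",     # negative
    ]
    labels = [1, 0]

    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),  # word-sequence (lexical) features
        # l1 penalty drives most feature weights to zero, i.e., a sparse model.
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    )
    clf.fit(examples, labels)
    print(clf.predict(["Mary was born in Boston"]))

The ℓ1 penalty is what makes the model "sparse": most of the millions of lexical and syntactic features receive zero weight, which keeps the learned model compact and interpretable.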
4. CONCLUDING REMARKS

We presented DeepDive, an end-to-end demonstration system that performs knowledge-base construction from the web. DeepDive demonstrates that a promising approach to KBC is to integrate diverse data resources and best-of-breed algorithms via statistical learning and inference. We discussed the key lessons we have learned from building DeepDive – including feature extraction, statistical learning and inference, and systematic debugging – and hope that they are of value to other researchers.

5. ACKNOWLEDGMENTS

We gratefully acknowledge the support of DARPA grant FA8750-09-C-0181. CR is also generously supported by NSF CAREER award IIS-1054009, ONR award N000141210041, and gifts or research awards from Google, Greenplum, Johnson Controls, Inc., LogicBlox, and Oracle. We thank the Center for High Throughput Computing and Miron Livny's Condor research group for their generous support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the above companies, DARPA, or the US government.

6. REFERENCES

[1] A. Calì, G. Gottlob, and A. Pieris. Query answering under expressive entity-relationship schemata. In Conceptual Modeling–ER, 2010.
[2] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[3] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. REX: Explaining relationships between entity pairs. Proc. VLDB Endow., 5(3):241–252, Nov. 2011.
[4] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
[5] D. Ferrucci et al. Building Watson: An overview of the DeepQA project. AI Magazine, 2010.
[6] H. Ji, R. Grishman, H. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Text Analysis Conference, 2010.
[7] G. Kasneci, M. Ramanath, F. Suchanek, and G. Weikum. The YAGO-NAGA approach to knowledge discovery. SIGMOD Record, 2008.
[8] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, pages 1003–1011, 2009.
[9] N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, 2011.
[10] F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. In VLDB, 2011.
[11] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 2006.
[12] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, 1996.
[13] G. Weikum and M. Theobald. From information to knowledge: Harvesting entities and relationships from web sources. In PODS, 2010.
[14] F. Wu and D. Weld. Autonomously semantifying Wikipedia. In CIKM, 2007.
[15] C. Zhang, F. Niu, C. Ré, and J. Shavlik. Big data versus the crowd: Looking for relationships in all the right places. In ACL, 2012.
[16] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J. Wen. StatSnowball: A statistical approach to extracting entity relationships. In WWW, 2009.