DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference

Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik
Department of Computer Sciences, University of Wisconsin-Madison, USA
{leonn,czhang,chrisre,shavlik}@cs.wisc.edu

VLDS'12, August 31, 2012, Istanbul, Turkey. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this approach is scalability, i.e., how to deal with terabytes of imperfect data efficiently. We describe how we address the scalability challenges to achieve web-scale KBC and the lessons we have learned from building DeepDive.

1. INTRODUCTION

Knowledge-base construction (KBC) is the process of populating a knowledge base (KB) with facts (or assertions) extracted from text. It has recently received tremendous interest from academia, e.g., CMU's NELL [2] and MPI's YAGO [7,9], and from industry, e.g., IBM's DeepQA [5] and Microsoft's EntityCube [16]. To achieve high quality, these systems leverage a wide variety of data resources and KBC techniques. A crucial challenge that these systems face is coping with imperfect or conflicting information from multiple sources [3,13]. To address this challenge, we present an end-to-end KBC system called DeepDive (http://research.cs.wisc.edu/hazy/deepdive). DeepDive went live in January 2012 after processing the 500M English web pages in the ClueWeb09 corpus (http://lemurproject.org/clueweb09.php/), and since then has been adding several million newly-crawled web pages every day. Figure 1 shows several screenshots of DeepDive.

[Figure 1: Screenshots of DeepDive showing facts about Barack Obama and provenance. (Left) Facts about Obama in DeepDive; (Top Right) Sentences mentioning the fact that Obama went to Harvard Law School; (Bottom Right) Text from a web page annotated with entity mentions.]

Similar to YAGO [7,9] and EntityCube [16], DeepDive is based on the classic Entity-Relationship (ER) model [1] and employs popular techniques such as distant supervision [15] and the Markov logic language [11] to combine a variety of signals. However, DeepDive goes deeper in two ways: (1) unlike prior large-scale KBC systems, DeepDive performs deep natural language processing (NLP) to extract useful linguistic features such as named-entity mentions and dependency paths (http://nlp.stanford.edu/software/) from terabytes of text; and (2) DeepDive performs web-scale statistical learning and inference using classic data-management and optimization techniques.

Figure 2 depicts the current architecture of DeepDive. To populate a knowledge base, DeepDive first converts diverse input data (e.g., raw corpora and ontologies) into relational features using standard NLP tools and custom code. These features are then used to train statistical models representing the correlations between linguistic patterns and target relations. Finally, DeepDive combines the trained statistical models with additional knowledge (e.g., domain knowledge) into a Markov logic program that is then used to transform the relational features (e.g., candidate entity mentions and linguistic patterns) into a knowledge base with entities, relationships, and their provenance.

Given the amount of data and depth of processing, a key challenge is scaling feature extraction (see Figure 2). For example, while deep linguistic features such as dependency paths are useful for relationship extraction, it takes about 100K CPU hours for our NLP pipeline to finish processing ClueWeb09. Although we have access to a 100-node Hadoop cluster, we found that the throughput of Hadoop is often limited by pathological data chunks that take a very long time to process or even crash (due to Hadoop's no-task-left-behind failure model). Fortunately, thanks to the Condor infrastructure (http://research.cs.wisc.edu/condor/), we were able to finish feature extraction on ClueWeb09 within a week by opportunistically assigning jobs to hundreds of workstations and shared cluster machines using a best-effort failure model.

A second scalability challenge is statistical learning and inference. There are two aspects of scalability with statistical learning: scale of training examples and scale of performance. To scale up the amount of training examples for KBC, we employ the distant supervision technique [8,14,15], which automatically generates what are called silver-standard examples by heuristically aligning raw text with an existing knowledge base such as Freebase (http://freebase.com). To scale up the performance of machine learning, we leverage the Bismarck system [4], which executes a wide variety of machine learning techniques inside an RDBMS. For statistical inference, DeepDive employs a popular statistical-inference framework called Markov logic. In Markov logic, one can write first-order logic rules with weights (that intuitively model our confidence in a rule); this allows one to capture rules that are likely, but not certain, to be correct. A Markov logic program (aka Markov logic network, or simply MLN) specifies what (evidence) data are available, what predictions to make, and what constraints and correlations there are [10].
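To make weighted rules concrete, recall the standard Markov logic semantics [11]: an MLN is a set of weighted first-order formulas {(F_i, w_i)}, and together with a finite set of constants it defines a probability distribution over possible worlds x:

    P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )

where n_i(x) is the number of true groundings of F_i in world x and Z is a normalizing constant. A KBC-style rule might look like the following (the predicate names here are our own illustration, not taken from DeepDive's actual program):

    1.5  MentionOf(m1, p) ∧ MentionOf(m2, l) ∧ BornInPattern(m1, m2) ⇒ BirthPlace(p, l)

A higher weight makes worlds that violate the rule less probable without forbidding them outright, which is what lets Markov logic combine noisy, conflicting signals.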
However, when trying to apply Markov logic to KBC, we found that existing MLN systems such as Alchemy (http://alchemy.cs.washington.edu) do not scale to our datasets. To cope, we designed and implemented two novel approaches to MLN inference by leveraging data-management and optimization techniques.

In addition to statistical learning and inference, we have found debugging and tuning based on the output of a KBC system to be an effective method to improve KBC quality. To support systematic debugging and tuning, it is important that the underlying statistical models are well-calibrated, i.e., predictions with probability around p should have an actual accuracy around p as well. Such calibration is an integral part of the development process of DeepDive.

We describe the DeepDive architecture in Section 2 and some implementation details in Section 3. We summarize the lessons we have learned from DeepDive as follows:

Inference and Learning. Statistical inference and learning were bottlenecks when we first started DeepDive, but we can now scale them with data-management and optimization techniques.

Feature Extraction. Good features are a key bottleneck for KBC; with a scalable inference and learning infrastructure in place, we can now focus on gathering and tuning features for DeepDive.

Debugging and Tuning. Developing KBC systems is an iterative process; systematic debugging and tuning requires well-calibrated statistical models.

2. DEEPDIVE ARCHITECTURE

We first describe a simple KBC model that we use in DeepDive, and then briefly discuss the infrastructure that enables web-scale KBC in DeepDive, namely how we scale up web-scale feature extraction, machine learning for KBC, and statistical inference in Markov logic, respectively.

[Figure 2: Architecture of DeepDive. DeepDive takes as input diverse data resources (ClueWeb09: 500M pages; Wikipedia: 3M entities; web-search results: 100M; Freebase: 3M facts), converts them into relational features using NLP tools and custom code (yielding roughly 25B candidate entity mentions and 8B candidate relationship mentions), and then performs machine learning and statistical inference — guided by domain knowledge and developer/user feedback through a Markov logic program — to construct a KB of entities, relationships, and provenance.]

A Conceptual KBC Model. DeepDive adopts the classic Entity-Relationship (ER) model [1]: the schema of the target knowledge base (KB) is specified by an ER graph G = (Ē, R̄), where Ē is one or more sets of entities (e.g., people and organizations) and R̄ is a set of relationships. Define E(G) = ∪_{E∈Ē} E, i.e., the set of known entities. To specify a KBC task to DeepDive, one provides the schema G and a corpus D. Each document d_i ∈ D consists of a set of (possibly overlapping) text spans (e.g., tokens or sentences) T(d_i). Text spans referring to entities or relationships are called mentions (see Figure 4). Define T(D) = ∪_{d_i∈D} T(d_i). Our goal is to accurately populate the following tables:

• Entity-mention table M(E(G), T(D)). (For simplicity, we assume that all entities are known, but DeepDive supports generating novel entities for the KB as well, e.g., by clustering "dangling" mentions.)

• Relationship-mention tables M_{R_i} ⊆ T(D)^{k+1} for each R_i ∈ R̄, where k is R_i's arity and the first k attributes (resp. the last attribute) are entity (resp. relationship) mentions.

• Relationship tables R_i ∈ R̄.

Note that R_i can be derived from M_E and M_{R_i}, as sketched below. By the same token, M_E and M_{R_i} provide provenance that connects the KB back to the documents supporting each fact. The process of populating M_E is called entity linking; the process of populating M_{R_i} is called relation extraction. Intuitively, the goal is to produce an instance J of these tables that is as large as possible (high recall) and as correct as possible (high precision). As shown in Figure 4, DeepDive populates the target KB based on signals from mention-level features (over text spans, e.g., positions, contained words, and matched regular expressions) and entity-level features (over the target KB, e.g., age, gender, and alias).
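As a minimal sketch of this table-population model, consider one binary relation expressed in Python over SQLite; the table and column names are our own illustration, not DeepDive's actual schema:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Entity-mention table M(E(G), T(D)): links text spans to known entities.
        CREATE TABLE entity_mention (entity TEXT, span_id TEXT);

        -- Relationship-mention table for a binary relation (arity k = 2):
        -- the first k attributes are entity mentions, the last is the
        -- relationship mention itself.
        CREATE TABLE birthplace_mention (span1 TEXT, span2 TEXT, rel_span TEXT);
    """)
    db.executemany("INSERT INTO entity_mention VALUES (?, ?)",
                   [("John", "d1:t3"), ("Springfield", "d1:t7")])
    db.execute("INSERT INTO birthplace_mention VALUES ('d1:t3', 'd1:t7', 'd1:s1')")

    # The relation table R_i is derivable from M_E and M_{R_i} by joining
    # mention pairs back to their entities; the mention tables double as
    # provenance for each derived fact.
    rows = db.execute("""
        SELECT e1.entity, e2.entity, rm.rel_span
        FROM birthplace_mention rm
        JOIN entity_mention e1 ON rm.span1 = e1.span_id
        JOIN entity_mention e2 ON rm.span2 = e2.span_id
    """).fetchall()
    print(rows)  # [('John', 'Springfield', 'd1:s1')]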
Scaling Feature Extraction at Web Scale. To scale DeepDive to web-scale KBC tasks, we employ high-throughput parallel computing frameworks such as Hadoop (http://hadoop.apache.org/) and Condor for feature extraction. We use the Hadoop File System for storage, but found a 100-node MapReduce cluster to be insufficient for ClueWeb: (1) Hadoop's all-or-nothing approach to failure handling hinders throughput, and (2) the number of cluster machines for Hadoop is limited. Fortunately, the Condor infrastructure supports a best-effort failure model, i.e., a job may finish successfully even when Condor fails to process a small portion of the input data. Moreover, Condor allows us to simultaneously leverage thousands of machines from across a department, an entire campus, or even the nation-wide Open Science Grid (http://www.opensciencegrid.org).

[Figure 3: Precision, recall, and F1 on TAC-KBP versus corpus size (10^2 to 10^8 documents). Quality improves as we increase the number of ClueWeb documents used for distant supervision [15]. Note that F1 and recall continue to improve as DeepDive processes increasingly large corpora (we do not yet have an explanation for the dip in precision).]

Scaling Machine Learning for KBC. Traditional KBC systems rely on manual annotations or domain-specific rules provided by experts, both of which are scarce resources. To remedy these problems, recent years have seen interest in the distant supervision approach for relation extraction [8,14,15]. The input to distant supervision is a set of seed facts for the target relation together with an (unlabeled) text corpus, and the output is a set of (noisy) annotations that can be used by any machine learning technique to train a statistical relation-extraction model. For example, given the target relation BirthPlace(person, place) and a known fact BirthPlace(John, Springfield), the sentence "John was born in Springfield in 1946" would qualify as a positive training example. Despite the noise in such examples, Zhang et al. [15] show that quality on the TAC-KBP (http://nlp.cs.qc.cuny.edu/kbp/2010/) relation-extraction benchmark improves significantly as we increase the training corpus size (Figure 3). DeepDive learns its relation-extraction models (as logistic regression classifiers) on about 1M examples (see Section 3) using the RDBMS-based Bismarck system [4].
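The alignment heuristic behind distant supervision is simple to sketch. The following toy version, based on the BirthPlace example above, matches raw substrings; a real pipeline aligns entity mentions (not strings) and extracts NLP features from each matched sentence rather than using the sentence itself:

    # A minimal sketch of distant supervision for BirthPlace(person, place).
    seed_facts = {("John", "Springfield")}  # from an existing KB such as Freebase

    corpus = [
        "John was born in Springfield in 1946.",
        "John visited Boston last week.",
    ]

    # Heuristic alignment: any sentence containing both arguments of a seed
    # fact becomes a (noisy) positive "silver-standard" training example.
    silver_examples = [
        (sent, person, place)
        for sent in corpus
        for (person, place) in seed_facts
        if person in sent and place in sent
    ]
    print(silver_examples)
    # [('John was born in Springfield in 1946.', 'John', 'Springfield')]

Note that the heuristic can also generate false positives (e.g., a sentence mentioning both John and Springfield without asserting a birthplace), which is why the resulting examples are noisy.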
Scaling Statistical Inference in Markov Logic. To scale up statistical inference in Markov logic, DeepDive employs the Tuffy [10] and Felix (http://research.cs.wisc.edu/hazy/felix) systems. Tuffy is based on the observation that MLN inference consists of a grounding step, which essentially performs relational operations, and a search (or sampling) step, which often comprises multiple independent subproblems. Thus, Tuffy achieves orders-of-magnitude speed-up in grounding (compared to Alchemy) by translating grounding into SQL statements that are executed by an RDBMS. Moreover, Tuffy performs graph partitioning at the search step to achieve scalability in inference and (serendipitously) improved quality. Felix is based on the observation that an MLN program (especially one for KBC) often contains routine subtasks such as classification and coreference resolution; these subtasks have specialized algorithms with high efficiency and quality. Thus, instead of solving a whole MLN with generic inference algorithms, Felix splits the program into multiple parts and solves the subtasks with the corresponding specialized algorithms. To resolve possible conflicts between predictions from different tasks, Felix employs the classic dual decomposition technique.
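To illustrate Tuffy's key idea that grounding is relational, the sketch below grounds the textbook rule Smokes(x) ∧ Friends(x, y) ⇒ Smokes(y) with a SQL join. This is a simplified "lazy" grounding that materializes only clauses whose body is supported by evidence; it is not Tuffy's actual compilation, and the rule and schema are the classic example from [11], not DeepDive's program:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE friends (x TEXT, y TEXT);
        CREATE TABLE smokes  (x TEXT);
    """)
    db.executemany("INSERT INTO friends VALUES (?, ?)",
                   [("anna", "bob"), ("bob", "carl")])
    db.execute("INSERT INTO smokes VALUES ('anna')")

    # Each satisfying (x, y) pair of the rule body yields one ground clause;
    # the enumeration is exactly a relational join, so it runs in the RDBMS.
    clauses = db.execute("""
        SELECT s.x, f.y
        FROM smokes s JOIN friends f ON s.x = f.x
    """).fetchall()
    for x, y in clauses:
        print(f"ground clause: Smokes({x}) AND Friends({x},{y}) => Smokes({y})")
    # ground clause: Smokes(anna) AND Friends(anna,bob) => Smokes(bob)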
3. IMPLEMENTATION DETAILS

A key tenet of DeepDive is that statistical learning and inference enable one to build high-quality KBC systems (see Figure 1) by combining diverse resources. We briefly describe more technical details of DeepDive to concretely demonstrate the advantage of this approach.

Entity Linking. Recall that entity linking is the task of mapping a textual mention to a real-world entity. We run the state-of-the-art StanfordNER (Named Entity Recognition, http://nlp.stanford.edu/ner/index.shtml) tool to extract textual mentions and corresponding entity types. DeepDive then tries to map each mention to a Wikipedia entry using the following signals: string matching, Wikipedia redirects and inter-page anchor text, Google and Bing search results that link to Wikipedia pages, entity-type compatibility between StanfordNER and Freebase, person-name coreference based on heuristics and proximity, etc. We write a couple dozen MLN rules for these data resources and then train the rule weights using the entity-linking training data from TAC-KBP. Our entity-linking component achieves an F1 score of 0.80 on the TAC-KBP benchmark (human performance is around 0.90). As shown in Figure 1, DeepDive also solicits user feedback on the entity types; we plan to integrate such feedback into the statistical inference process.

[Figure 4: An illustration of the KBC model in DeepDive. A corpus ("Mr. Gates was the CEO of Microsoft."; "Google acquired YouTube in 2006.") yields entity mentions (e.g., "Mr. Gates": PERSON) and relationship mentions; mention-level features (e.g., "Mr." is a male personal title; "Gates" is a common name) and entity-level features (e.g., Bill Gates is a male person, is very frequently mentioned, aka William Gates III) connect them to entities (Bill Clinton, Bill Gates, Steve Jobs, Barack Obama, Google Inc., Microsoft Corp., United States, YouTube, ...) and relationships such as FoundedBy(Microsoft Corp., Bill Gates) and Acquired(Google Inc., YouTube).]

Relation Extraction. The relation-extraction models in DeepDive are trained using the methods described in Zhang et al. [15]. Specifically, during feature extraction, we perform dependency parsing using MaltParser (http://www.maltparser.org/) and Ensemble (http://www.surdeanu.name/mihai/ensemble/). We use Freebase and about 2M high-quality news and blog articles provided by TAC-KBP to perform distant supervision, generating about 1M training examples over 20 target relations (including those shown in Figure 1). We use sparse logistic regression (ℓ1-regularized) classifiers [12] to train statistical relation-extraction models using both lexical (e.g., word sequences) and syntactic (e.g., dependency paths) features. We achieved an F1 score of 0.31 on the TAC-KBP relation-extraction benchmark (lower than only the top participant in TAC-KBP 2010 [6]).
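For intuition, here is a minimal sketch of such a sparse (ℓ1-regularized) logistic-regression relation classifier using scikit-learn; DeepDive itself trains in-RDBMS via Bismarck [4], and its real feature set includes dependency paths rather than the toy bag-of-n-grams used here:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy silver-standard examples for BirthPlace; real training data comes
    # from distant supervision over ~2M documents.
    examples = [
        "John was born in Springfield in 1946",   # positive
        "John visited Springfield last week",     # negative
    ]
    labels = [1, 0]

    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),  # word-sequence (lexical) features
        # l1 penalty drives most feature weights to zero, i.e., a sparse model.
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    )
    clf.fit(examples, labels)
    print(clf.predict(["Mary was born in Boston"]))

The ℓ1 penalty is what makes the model "sparse": most of the millions of lexical and syntactic features receive zero weight, which keeps the learned model compact and interpretable.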
4. CONCLUDING REMARKS

We presented DeepDive, an end-to-end demonstration system that performs knowledge-base construction from the web. DeepDive demonstrates that a promising approach to KBC is to integrate diverse data resources and best-of-breed algorithms via statistical learning and inference. We discussed the key lessons we have learned from building DeepDive – including feature extraction, statistical learning and inference, and systematic debugging – and hope that they are of value to other researchers.

5. ACKNOWLEDGMENTS

We gratefully acknowledge the support of DARPA grant FA8750-09-C-0181. CR is also generously supported by NSF CAREER award IIS-1054009, ONR award N000141210041, and gifts or research awards from Google, Greenplum, Johnson Controls, Inc., LogicBlox, and Oracle. We thank the Center for High Throughput Computing and Miron Livny's Condor research group for their generous support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the above companies, DARPA, or the US government.

6. REFERENCES

[1] A. Calì, G. Gottlob, and A. Pieris. Query answering under expressive entity-relationship schemata. In Conceptual Modeling–ER, 2010.
[2] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[3] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. REX: Explaining relationships between entity pairs. Proc. VLDB Endow., 5(3):241–252, Nov. 2011.
[4] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
[5] D. Ferrucci et al. Building Watson: An overview of the DeepQA project. AI Magazine, 2010.
[6] H. Ji, R. Grishman, H. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Text Analysis Conference, 2010.
[7] G. Kasneci, M. Ramanath, F. Suchanek, and G. Weikum. The YAGO-NAGA approach to knowledge discovery. SIGMOD Record, 2008.
[8] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL, pages 1003–1011, 2009.
[9] N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, 2011.
[10] F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. In VLDB, 2011.
[11] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 2006.
[12] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, 1996.
[13] G. Weikum and M. Theobald. From information to knowledge: Harvesting entities and relationships from web sources. In PODS, 2010.
[14] F. Wu and D. Weld. Autonomously semantifying Wikipedia. In CIKM, 2007.
[15] C. Zhang, F. Niu, C. Ré, and J. Shavlik. Big data versus the crowd: Looking for relationships in all the right places. In ACL, 2012.
[16] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J. Wen. StatSnowball: A statistical approach to extracting entity relationships. In WWW, 2009.