<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Open World Probabilistic Databases?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>İsmail İlkan Ceylan</string-name>
          <email>ceylan@tcs.inf.tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adnan Darwiche</string-name>
          <email>darwiche@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guy Van den Broeck</string-name>
          <email>guyvdb@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Automated Reasoning Lab, UCLA</institution>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Theoretical Computer Science</institution>
          ,
          <addr-line>TU Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Introduction and Motivation Driven by the need to learn from vast amounts of text data, efforts throughout natural language processing, information extraction, databases. and AI are coming together to build large-scale knowledge bases. Academic systems such as NELL [14], Reverb [7], Yago [11], and DeepDive [16] continuously crawl the web to extract relational information. Industry projects such as Microsoft's Probase [18] or Google's Knowledge Vault [6] similarly learn structured data from text to improve search products. Notably, such knowledge bases are inherently probabilistic and many of them [6, 16] are based on the foundations of tuple-independent probabilistic databases (PDBs) [17]. According to the PDB semantics, each database tuple is an independent Bernoulli random variable, and all other tuples have probability zero, enforcing a closed-world assumption (CWA) [15]. This paper revisits the choice for the CWA in probabilistic knowledge bases. We observe that the CWA is violated in their deployment, which makes it problematic to reason, learn, or mine on top of these databases. First, knowledge bases are part of a larger machine learning loop that continuously updates beliefs about facts based on new textual evidence. From a Bayesian learning perspective [2], this loop can only be principled when learned facts have an a priori non-zero probability. Hence, the CWA does not accurately represent this mode of operation and puts it on weak footing. Second, these issues are not temporary: it will never be possible to complete probabilistic knowledge bases of even the most trivial relations, as the memory requirements quickly become excessive. This already manifests today: statistical classifiers output facts at a high rate, but only the most probable ones make it into the knowledge base, and the rest is truncated, losing much of the statistical information. Third, query answering under the CWA does not take into account the effect the open world can have on the query probability. This makes it impossible to distinguish queries whose probability should intuitively differ. These issues stand in the way of some principled approaches to knowledge base completion and mining. We propose an alternative semantics for probabilistic knowledge bases to address these problems, which results in open-world PDBs (OpenPDBs). We show that OpenPDBs provide more meaningful answers. Finally, we pinpoint limitations of OpenPDBs and discuss ontology based data access (OBDA) as promising approach to further strengthen this framework.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        ? This is an extended abstract of the paper to be presented at KR’16 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
w_smith ali 0.9
w_smith sharktale 0.8
j_smith ali 0.6
arquette scream 0.7
pitt mr_ms_smith 0.5
jolie mr_ms_smith 0.7
jolie sharktale 0.9
      </p>
      <p>
        Couple
arquette cox 0.6
pitt jolie 0.8
thornton jolie 0.6
pitt aniston 0.9
kunis kutcher 0.7
OpenPDBs Inspired by the open-world assumption (OWA) OpenPDBs assume
that the knowledge of a domain may be incomplete. Hence, anything that is not
in the DB remains possible: Tuples that are not present in the DB, called open
tuples, are associated with a default probability interval of [0; ], where is
a fixed threshold. Intuitively, we obtain a lower probability (0) and an upper
probability ( ) for the open tuples as opposed to setting their probabilities to
0. Our proposal for OpenPDBsbuilds on the theory of imprecise probabilities,
and credal sets in particular [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], to allow interval-based probabilities for open
tuples. We briefly show that this framework provides more meaningful answers,
in terms of upper and lower bounds on the query probability.
      </p>
      <p>Query evaluation Consider the PDB given in Figure 1 and the queries
Q1(x; y) = Inmovie(x; z) ^ Inmovie(y; z) ^ Couple(x; y)</p>
      <p>Q2 = Inmovie(x; z) ^ Inmovie(y; z) ^ Couple(x; y):
First, note that P(Q2) = P(Q1(pitt; jolie)) = 0:28 in the PDB of Figure 1.
However, observe that Q1(pitt; jolie) entails Q2 and thus intuitively, we would expect
P(Q2) &gt; P(Q1(pitt; jolie)) in an open-world setting. In fact, in the open-world
setting, our estimate of P(Q2) should be higher, because there exist a large number
of couples that could satisfy this query. This is indeed the case for OpenPDBs
(for upper probabilities) since Q2 is entailed in many worlds that now have
non-zero probability. Of these, a large portion do not entail Q1(pitt; jolie).</p>
      <p>Second, by the CWA we lose the ability to distinguish queries that are
clearly not identical. Observe, for instance, that both P(Q1(thornton; aniston))
and P(Q1(w_smith; j_smith)) evaluate to 0. Taking into account the open world,
however Q1(w_smith; j_smith) is supported by two facts in the PDB, while
Q1(thornton; aniston) is supported by none, which should make it less likely.
Taking these observations to the extreme, the query Inmovie(x; y) ^ :Inmovie(x; y)
is unsatisfiable, yet it evaluates to the same probability as the satisfiable query
Q1(thornton; aniston). Clearly, there is no attribution for being closer to satisfied.
It is clear that in the open world setting the upper probability of a satisfiable
query will be greater than the upper probability of an unsatisfiable query. In
fact, any unsatisfiable query will still have the upper probability 0 in
OpenPDBs. These answers are clearly more in line with the incomplete projection of
the world.</p>
      <p>
        Overview of the Results Our open-world semantics is supported by a query
evaluation algorithm for unions of conjunctive queries (UCQs). This class of
queries, corresponding to monotone DNF, is particularly well-behaved and the
focal point of database research. Perhaps the largest appeal of PDBs comes from
a breakthrough dichotomy result by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], perfectly delineating which UCQs can be
answered efficiently in the size of the PDB. Their algorithm runs in polynomial
time for all efficient queries, called safe queries, and recognizes all others to
be #P-hard. Our OpenPDB algorithm extends the PDB algorithm of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
inherits its elegant properties: all safe queries run in polynomial time. When our
algorithm fails, the query is #P-hard. Moreover, a careful analysis shows that
both algorithms run in linear time in the number of (closed-world) tuples. Even
though OpenPDBs model a polynomially larger set of random variables, these
can be reasoned about as a whole, and there is no computational blow-up for
open-world reasoning. Hence, both OpenPDBs and PDBs admit the same data
complexity dichotomy between linear time and #P.3
      </p>
      <p>
        For queries with negation, only a partial classification of PDB data
complexity is known [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ]. We show that the complexity of open-world reasoning
can go up significantly with negation. We identify a linear-time PDB query that
becomes NP-complete on OpenPDBs. Moreover, there exists a PP-complete(4)
query on PDBs that becomes NPPP-complete on OpenPDBs. Clearly, negation
leads to a much richer data complexity landscape.
      </p>
      <p>
        Ontologies and OpenPDBs OpenPDBs improve on PDBs and provide a
more suitable setting for large probabilistic knowledge bases. However, default
intervals are sometimes too loose and it is not always possible to distinguish
queries that should intuitively differ. One of the ways of restricting the open
world is to use an explicit formalism for restricting the models. More concretely,
the worlds induced by an OpenPDB can be refined and restricted by using a
background knowledge specified as a logical theory. Ontology based data access
(OBDA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been introduced as a means of querying incomplete data sources
with the help of a background knowledge provided by an ontology.
      </p>
      <p>
        Probabilistic OBDA has been investigated; both in the context of Description
Logics and Datalog [
        <xref ref-type="bibr" rid="ref12 ref4 ref9">4, 9, 12</xref>
        ]. Closely related is the work by [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where authors
enrich PDBs with light-weight ontologies. This provides a delta on PDBs: A
fact that is not in the ABox remains possible due to the axioms present in the
TBox. However, this approach is limited to the encoding in the ontology: Any
fact that is not entailed by the ontology still gets the probability 0. This is a
major difference from our approach. Investigating the computational properties
of OBDA in combination with OpenPDBs is left as future work.
3 To the best of our knowledge, and to our surprise, the fact that safe PDB queries
have linear-time data complexity, and that the dichotomy of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is between linear
time (not PTime) and #P, has not previously been observed in the literature.
4 Intuitively, PP is the decision version of the counting class #P.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bienvenu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cate</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Ontology-based data access: A study through disjunctive datalog, csp, and mmsnp</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>39</volume>
          (
          <issue>4</issue>
          ),
          <volume>33</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          :
          <fpage>44</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          :
          <article-title>Pattern recognition and machine learning</article-title>
          . springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ceylan</surname>
          </string-name>
          , İ.İ.,
          <string-name>
            <surname>Darwiche</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Van Den Broeck, G.:
          <article-title>Open-world probabilistic databases</article-title>
          .
          <source>In: Proc. of KR'16</source>
          . AAAI Press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ceylan</surname>
          </string-name>
          , İ.İ.,
          <string-name>
            <surname>Peñaloza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Probabilistic Query Answering in the Bayesian Description Logic BEL</article-title>
          .
          <source>In: Proc. of SUM'15. LNAI</source>
          , vol.
          <volume>9310</volume>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>35</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dalvi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The dichotomy of probabilistic inference for unions of conjunctive queries</article-title>
          .
          <source>J. of the ACM</source>
          <volume>59</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>87</lpage>
          (
          <year>2012</year>
          ), http://dl.acm.org/citation. cfm?doid=
          <volume>2395116</volume>
          .
          <fpage>2395119</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heitz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horn</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lao</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strohmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Zhang, W.:
          <article-title>Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion</article-title>
          .
          <source>In: Proc. of ACM SIGKDD'14</source>
          . pp.
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          . KDD'14,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fader</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Identifying relations for open information extraction</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1535</fpage>
          -
          <lpage>1545</lpage>
          . Ass. for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fink</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olteanu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Dichotomies for Queries with Negation in Probabilistic Databases</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS)</source>
          (
          <year>2015</year>
          ), (to appear)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gottlob</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukasiewicz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simari</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          :
          <article-title>Query answering under Probabilistic Uncertainty in Datalog +/- Ontologies</article-title>
          . Ann. Math. AI
          <volume>69</volume>
          (
          <issue>1</issue>
          ),
          <fpage>37</fpage>
          -
          <lpage>72</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gribkoff</surname>
          </string-name>
          , E., Van den Broeck, G.,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Understanding the Complexity of Lifted Inference and Asymmetric Weighted Model Counting</article-title>
          .
          <source>In: Proc. of UAI'14</source>
          . pp.
          <fpage>280</fpage>
          -
          <lpage>289</lpage>
          . AUAI Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hoffart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berberich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>Yago2: A spatially and temporally enhanced knowledge base from wikipedia</article-title>
          .
          <source>In: In Proc. of IJCAI'2013</source>
          . pp.
          <fpage>3161</fpage>
          -
          <lpage>3165</lpage>
          . AAAI Press (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Ontology-Based Access to Probabilistic Data with OWL QL</article-title>
          .
          <source>In: Proc. of ISWC'12. LNCS</source>
          , vol.
          <volume>7649</volume>
          , pp.
          <fpage>182</fpage>
          -
          <lpage>197</lpage>
          . Springer Verlag (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Levi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The Enterprise of Knowledge</article-title>
          . MIT Press (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Mitchell, T.,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruschka</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talukdar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betteridge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dalvi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Never-Ending Learning</article-title>
          .
          <source>In: Proc. of AAAI'15</source>
          . AAAI Press (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Reiter</surname>
          </string-name>
          , R.:
          <source>On closed world data bases. Logic and Data</source>
          Bases pp.
          <fpage>55</fpage>
          -
          <lpage>76</lpage>
          (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Sa</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ré</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Incremental knowledge base construction using deepdive</article-title>
          .
          <source>In Proc. of VLDB Endowment</source>
          <volume>8</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1310</fpage>
          -
          <lpage>1321</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olteanu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ré</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Probabilistic Databases</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          :
          <article-title>Probase: A probabilistic taxonomy for text understanding</article-title>
          .
          <source>In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>481</fpage>
          -
          <lpage>492</lpage>
          . ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>