<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ndapandula</forename><surname>Nakashole</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Mauro</forename><surname>Sozio</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Fabian</forename><surname>Suchanek</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Theobald</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Max-Planck Institute for Informatics</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Institut Mines-Telecom</orgName>
								<orgName type="institution">Telecom ParisTech</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">CNRS</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Max-Planck Institute for Informatics</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="department">Max-Planck Institute for Informatics</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">28BF54311489646548346ED48E65A24C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T06:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Uncertain RDF</term>
					<term>Soft and Hard Rules</term>
					<term>Deductive Grounding</term>
					<term>MaxSAT</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recent advances in information extraction have paved the way for the automatic construction and growth of large, semantic knowledge bases from Web sources. However, the very nature of these extraction techniques entails that the resulting RDF knowledge bases may contain a significant amount of incorrect, incomplete, or even inconsistent (i.e., uncertain) factual knowledge, which makes efficient query answering over this kind of uncertain RDF data a challenge. Our engine, coined URDF, augments first-order reasoning by a combination of soft rules (Datalog-style implications), which are grounded in a deductive fashion in order to derive new facts from existing ones, and hard rules (mutual-exclusiveness constraints), which enforce additional consistency constraints among both base and derived facts. At the core of our approach is an efficient approximation algorithm for this constrained form of the weighted MaxSAT problem with soft and hard rules, allowing us to dynamically resolve inconsistencies directly at query time. Experiments on real-world and synthetic data confirm the high robustness and the significantly improved runtimes of our framework in comparison to state-of-the-art MCMC techniques.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The recent advent of information extraction techniques has enabled the automatic construction and growth of large, semantic knowledge bases from Web sources. Knowledge bases such as DBpedia <ref type="bibr" target="#b3">[4]</ref>, YAGO <ref type="bibr" target="#b24">[25]</ref>, Freebase.com, and TrueKnowledge.com consist of many millions or even billions of facts, which are usually captured in the form of RDF-style subject-predicate-object (SPO) triples. Moreover, the Linked-Data initiative (LinkedData.org) encompasses these and many other RDF datasets, along with extensive cross-linkage in the form of owl:sameAs properties between entities in different data sources. For high coverage of entities and their properties, it is natural to use automated, often heuristic or probabilistic, methods to populate these knowledge bases, and, perhaps, also to automatically establish owl:sameAs links. Consequently, these data sources may contain a significant fraction of noise and incorrect triples. However, even if we accept such data errors and inconsistencies, no knowledge base can ever be complete, and even the entire Linked-Data cloud can hardly ever cover all interesting properties of relevant entities.</p><p>Research on knowledge base construction has adopted probabilistic models and statistical learning for cleaning the gathered pool of raw fact candidates (typically extracted from Wikipedia and textual or semi-structured Web pages). A powerful instrument to this end is reasoning with consistency constraints (see, e.g., <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b25">26]</ref>). For example, specifying that a person can have only one spouse (at a given snapshot in time) helps distinguish marriages from romantic affairs and remove false hypotheses for isMarriedTo triples. On the other hand, keeping only those triples with the highest confidence (or likelihood) of being correct is an unduly eager and conservative approach. For example, when searching for musician couples where both partners have won a Grammy award, correct answers, such as John Lennon and Yoko Ono, may well be composed out of low-confidence input triples, as the join predicates in the query impose additional restrictions and can implicitly "de-noise" the answers. For example, the Lennon/Ono marriage is not in any of the Wikipedia infoboxes, so the respective extractions from free text would have lower confidence.</p><p>When query answers are empty because critical pieces of knowledge are missing, reasoning over uncertain facts can be helpful to produce at least likely or speculative results. To this end, deduction rules are an instrument to infer answers that go beyond the extensionally represented content of the knowledge base. These rules may themselves be uncertain as well. For example, suppose we want to find the doctoral advisor of Alon Y. Halevy. Often (but not always) the senior author on the first few papers of a researcher is the doctoral advisor. Based on such a rule, we could deduce that Yehoshua Sagiv is Halevy's advisor. 
Moreover, deduction rules cover a wide range of RDF/S and OWL-based reasoning concepts, such as the owl:TransitiveProperty of predicates (e.g., for the rdfs:subClassOf property or over owl:sameAs links), which lie at the expressive intersection of Datalog-style deductive reasoning and OWL-DL. Additional consistency constraints, on the other hand, cover OWL concepts such as the owl:FunctionalProperty or owl:disjointWith properties of predicates and classes.</p><p>In summary, we emphasize the following desiderata for query-time reasoning over uncertain RDF triples: 1) give answers to complex SPARQL queries over triples with highly varying confidence values; 2) overcome the incompleteness problem by exploiting deduction rules, which may themselves be uncertain, to infer answers even if some critical triples are missing; 3) counter amplified noise and keep query-result precision high by taking consistency constraints into account in the specific context of a user query; and 4) achieve all of the above with high efficiency, so that queries can be answered with interactive response times of a few seconds. Our aim with URDF is to address the above desiderata in an integrated manner. Our implementation is based on top-down, on-demand grounding of rules expressed in first-order logic, together with a constrained form of weighted MaxSAT solving. Considering hard constraints jointly with MaxSAT reasoning over propositional clauses poses additional challenges; to our knowledge, these have not been addressed in prior work for interactive, query-time reasoning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">DEFINITIONS AND NOTATIONS</head><p>We are given a set X of Boolean variables, each variable taking either the value true or false. The negation of a variable x ∈ X (denoted as ¬x) has the value true if and only if x is assigned false. We shall refer to a variable x or its negation ¬x as a literal. A Horn clause C is a set of literals containing at most one positive literal. Given a truth assignment to the variables, we shall say that a Horn clause is satisfied if it contains at least one literal whose value is true (clauses are interpreted as disjunctions of their literals). Every clause C is associated with a positive weight w(C) ∈ R. A Horn formula is defined as a conjunction of Horn clauses (hence Horn formulas are in conjunctive normal form, CNF). A Horn formula is satisfiable if there is a truth assignment to all variables such that all clauses are satisfied. As we deal with Horn formulas that might not be satisfiable, we seek a truth assignment that maximizes the total weight of the satisfied clauses. An example of a Horn formula is the following</p><formula xml:id="formula_0">(x1 ∨ ¬x2) ∧ (¬x2 ∨ x3), equivalent to (x1 ← x2) ∧ (x3 ← x2),</formula><p>where ← denotes logical implication.</p><p>Given a set of relation types R and a set E of entities, a fact f is defined as a triplet of the form f = (e1, e2, r), which expresses an instance of a binary relation of type r ∈ R for two entities e1, e2 ∈ E (e.g., (John, Yoko, marriedTo) could denote the fact that "John is married to Yoko", where John and Yoko are both entities). Moreover, facts may be uncertain. Hence, every fact f is also associated with a positive weight w(f) ∈ R, expressing the degree of confidence in the fact being correct. Furthermore, every fact is associated with a Boolean variable xf ∈ X, whose value indicates whether the corresponding fact is true or false. In the following, we simplify the notation and no longer distinguish between variables and facts. Hence, assigning the value true to a fact f corresponds to assigning true to the corresponding variable xf. We define a knowledge base KB as a triplet KB = {F, C, S}, where F is a set of facts, C is a set of Horn clauses, and S is a collection of disjoint sets of facts. We shall refer to C and S as soft deduction rules and hard consistency constraints, respectively (or soft and hard rules for short).</p></div>
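<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make these definitions concrete, the following minimal Python sketch (with a hypothetical data model; it illustrates the definitions above and is not URDF's actual implementation) represents a grounded knowledge base KB = {F, C, S} and evaluates the weighted MaxSAT objective for a given truth assignment:</p><code lang="python">
# Facts are plain identifiers; clauses are (set-of-literals, weight) pairs,
# where a literal is a (fact, polarity) tuple and a Horn clause contains at
# most one positive literal. Hard rules are disjoint sets of mutually
# exclusive facts.
F = ["marriedTo(John,Yoko)", "livesIn(John,NY)", "livesIn(Yoko,NY)"]

# livesIn(Yoko,NY) &lt;- marriedTo(John,Yoko) AND livesIn(John,NY)   [0.38]
C = [({("marriedTo(John,Yoko)", False),
       ("livesIn(John,NY)", False),
       ("livesIn(Yoko,NY)", True)}, 0.38)]

S = [{"bornOn(John,1931)", "bornOn(John,1940)", "bornOn(John,1957)"}]

def satisfied(clause, assign):
    """A clause (a disjunction of literals) is satisfied if any literal is true."""
    return any(assign[f] == pos for f, pos in clause)

def objective(clauses, assign):
    """Weighted MaxSAT objective: total weight of the satisfied clauses."""
    return sum(w for cl, w in clauses if satisfied(cl, assign))

def respects_hard_rules(hard_rules, assign):
    """At most one fact per hard rule may be assigned true."""
    return all(sum(assign.get(f, False) for f in s) &lt;= 1 for s in hard_rules)
</code></div>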
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Soft Deduction Rules</head><p>We consider weighted Horn clauses with exactly one positive head literal as soft rules. A soft rule could state, for example, that "if Yoko and John are married and John lives in NY, then also Yoko lives in NY", with a confidence of 0.38 of being correct, which we write as follows:</p><formula xml:id="formula_1">lives(Yoko, NY ) ← married(Yoko, John) ∧ lives(John, NY ) [0.38]</formula><p>To tackle the inherent incompleteness of F, we lift soft rules into first-order rules with universally quantified variables, which serve as input to our knowledge base. Soft rules then have the shape of first-order Horn clauses and can be written as Datalog-style implications. To express the first-order rule that "married couples usually live in the same place", for example, we use the following compact notation:</p><formula xml:id="formula_2">livesIn(p1 , loc1 ) ← marriedTo(p1 , p2 ) ∧ livesIn(p2 , loc1 ) [0.38]</formula></div>
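<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how such a first-order soft rule is instantiated against base facts, the following sketch grounds the livesIn rule by a naive join over its body atoms (variables start with "?"; this is a simplified stand-in for URDF's SLD-based grounding, not the actual implementation):</p><code lang="python">
rule_head = ("livesIn", "?p1", "?loc1")
rule_body = [("marriedTo", "?p1", "?p2"), ("livesIn", "?p2", "?loc1")]

facts = {("marriedTo", "Yoko", "John"), ("livesIn", "John", "NY")}

def match(atom, fact, binding):
    """Extend `binding` so that `atom` equals `fact`, or return None."""
    b = dict(binding)
    for t, c in zip(atom, fact):
        if t.startswith("?"):
            if b.setdefault(t, c) != c:
                return None
        elif t != c:
            return None
    return b

def ground_rule(head, body, facts):
    """Yield one grounded head atom per consistent binding of the body."""
    bindings = [{}]
    for atom in body:
        bindings = [b2 for b in bindings for f in facts
                    if (b2 := match(atom, f, b)) is not None]
    for b in bindings:
        yield tuple(b.get(t, t) for t in head)

print(list(ground_rule(rule_head, rule_body, facts)))
# -> [('livesIn', 'Yoko', 'NY')], the new fact derived by the grounded rule
</code></div>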
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Hard Consistency Constraints</head><p>The set of facts F may contain inconsistent information. Hence, we introduce hard consistency constraints that take the shape of a collection of disjoint sets of mutually exclusive facts S = {S1, . . . , S|S|}. We enforce the constraint that, for each hard rule Sk, at most one fact f ∈ Sk may be assigned the value true. For example, if we observe two or more different birth dates for a person, clearly something went wrong either during extraction or when reasoning with soft rules. We can formally identify such an inconsistency by formulating a consistency constraint as follows:</p><formula xml:id="formula_3">(date1 = date2 ) ← bornOn(p1, date1) ∧ bornOn(p1, date2)</formula><p>Grounding this hard rule could then, for example, yield the following set of mutually exclusive facts {bornOn(John, 1931), bornOn(John, 1940), bornOn(John, 1957)}, which specifies that John could be born either in 1931, 1940, or 1957, or in none of these years. In contrast to soft rules, hard rules may not be violated by any truth assignment to the corresponding Boolean variables.</p></div>
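<div xmlns="http://www.tei-c.org/ns/1.0"><p>Grounding such functional-dependency constraints can be pictured as grouping conflicting facts by their "key" argument, as in the following sketch (a hypothetical helper, not URDF's grounding code):</p><code lang="python">
from collections import defaultdict

facts = [("bornOn", "John", "1931"),
         ("bornOn", "John", "1940"),
         ("bornOn", "John", "1957"),
         ("bornOn", "Yoko", "1933")]

def exclusion_sets(facts, functional_predicate):
    """All facts of a functional predicate that share the same subject are
    mutually exclusive; each group with a conflict yields one hard rule S_k."""
    groups = defaultdict(set)
    for pred, subj, obj in facts:
        if pred == functional_predicate:
            groups[subj].add((pred, subj, obj))
    return [g for g in groups.values() if len(g) > 1]

S = exclusion_sets(facts, "bornOn")
# One set with John's three conflicting birth dates; Yoko's single bornOn
# fact needs no constraint (URDF also adds unit hard rules per fact, see
# Section 2.3).
</code></div>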
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Problem Definition</head><p>As we deal with Horn formulas that might not be satisfiable, we seek a truth assignment that maximizes the total weight of the satisfied clauses. We call this problem weighted MaxSAT with soft and hard rules, which we formally define as follows. We are given a set of facts F = {f1, f2, . . . , fn}, a collection C = {C1, . . . , Cm} of clauses in Horn form (soft rules), and a collection of sets S = {S1, . . . , St} that partition F (hard rules). Each clause C ∈ C is associated with a positive weight w(C). We wish to find a truth assignment to each variable f ∈ F such that for each set in S at most one fact is assigned the value true, and the sum of the weights of the satisfied clauses is maximized.</p><p>Given a knowledge base KB = {F, C, S} (in grounded form), we define an instance of the MaxSAT problem with soft and hard rules as follows. Every fact fi ∈ F is associated with a Boolean variable. In addition, we introduce a unit clause Ci = {fi}, whose weight is equal to the confidence of the corresponding fact. For convenience of notation, for each fact fi, we also introduce a unit hard rule Si = {fi} into S. As the MaxSAT problem with hard and soft rules is a generalization of the classic MaxSAT problem with Horn clauses, which is NP-hard <ref type="bibr" target="#b10">[11]</ref>, our problem is NP-hard as well. Since computing an optimum solution for the above problem is intractable, we devise an approximation algorithm.</p><p>We remark that, instead of hard rules, one could also enforce consistency constraints by introducing soft rules with high weights. However, in combination with an approximation algorithm, this approach involves non-trivial issues, as illustrated by the following example. Consider the two facts bornOn(John, 1931) and bornOn(John, 1940), whose weights are 0 and w ≫ 0, respectively. In order to enforce the hard rule that John can have only one date of birth, we could introduce the soft rule (¬x ∨ ¬y) with some weight W, where x and y are the Boolean variables associated with the facts bornOn(John, 1931) and bornOn(John, 1940), respectively. However, it is not clear how to determine the weight W of such a soft rule. If W is not large enough, we cannot enforce the hard consistency constraint. On the other hand, if W is too large, then any (1+ε)-approximation algorithm (for ε &gt; 0) might assign true to bornOn(John, 1931), as the ratio between such a solution and the optimum is W/(W+w).</p></div>
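<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ratio argument can be checked numerically. In the following sketch (with arbitrary example weights), the suboptimal solution that picks the weight-0 fact looks nearly optimal to any approximation algorithm, even though it selects the wrong birth date:</p><code lang="python">
w, W = 5.0, 1000.0  # fact weight w and a (hypothetical) penalty-clause weight W

# true(bornOn(John,1940)): penalty clause (W) and fact clause (w) both satisfied
optimum = W + w
# true(bornOn(John,1931)): penalty clause satisfied, but only the weight-0 fact
suboptimal = W

print(suboptimal / optimum)  # 0.995..., i.e., W/(W+w) approaches 1 as W grows
</code></div>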
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">ALGORITHM</head><p>Our algorithm is inspired by Johnson's approximation algorithm for the original weighted MaxSAT problem <ref type="bibr" target="#b11">[12]</ref> (with no additional consistency constraints). It is based on the method of conditional probabilities <ref type="bibr" target="#b0">[1]</ref>, which allows one to convert a randomized approximation algorithm into a deterministic one while preserving the approximation guarantee.</p><p>The first step is to compute a real value pi in [0, 1] for each fact fi, satisfying the following property: the sum of all pi's corresponding to the facts within the same hard rule is at most 1. Each pi is interpreted as the probability that fi is assigned true and is computed in such a way that pi is large when the confidence in fact fi being true is high (i.e., w(fi) is large) and small otherwise. A simple algorithm for computing the pi's proceeds as follows: for each hard rule S, with probability 1/2 assign true to the fact with the largest weight in S, and with probability 1/2 assign false to all facts in S. This gives a 1/2-approximation algorithm; however, there are more sophisticated algorithms that work better in practice while maintaining the approximation guarantee (see our technical report <ref type="bibr" target="#b26">[27]</ref> for more details). Then, at each step t, we consider a hard rule St and determine a truth assignment for all facts in St that maximizes a carefully defined potential function. Our potential function can be interpreted as the expected total weight of satisfied clauses when each unassigned fact fi is assigned true with probability pi (independently of the facts not in St).</p><p>Formally, we denote with Wt the value of our potential function at step t. At the beginning of our algorithm, all facts are unassigned and the value of our potential function (W0) is defined as</p><formula xml:id="formula_4">W0 = Σ_{C ∈ C} w(C) · P(C is satisfied),</formula><p>where the probability that a clause is satisfied is a function of the pi's computed at the beginning of our algorithm.</p><p>At step t, let φt−1 denote the truth assignment fixed so far for the facts f1, . . . , ft−1, and let St be the hard rule considered at step t. We denote with St = false the truth assignment that assigns false to all facts in St. For every f in St, we define</p><formula xml:id="formula_5">W_{t,f=true} = Σ_{C ∈ C} w(C) · P(C is sat. | φt−1, f = true),</formula><p>where P(A|B) denotes a conditional probability. When all facts in St are assigned false, our potential function is denoted as</p><formula xml:id="formula_6">W_{t,St=false} = Σ_{C ∈ C} w(C) · P(C is sat. | φt−1, St = false).</formula><p>Our algorithm determines the truth assignment that maximizes the current potential function by choosing the largest value among all W_{t,f=true}'s and W_{t,St=false}, and then assigns the corresponding truth values to the facts in St. At each iteration, all satisfied clauses are removed from the set of clauses.</p><p>Our algorithm stops when all facts have been assigned a truth value. We remark that our algorithm is completely deterministic (i.e., it always produces the same output given the same input). Algorithm 1 shows pseudocode for our algorithm, while the approximation guarantee of our algorithm is stated in Theorem 1.</p><p>Algorithm 1 Weighted MaxSAT with Soft and Hard Rules 1: For each hard rule, compute a probability distribution over its facts so that for each fact fi, pi is proportional to w(fi), see <ref type="bibr" target="#b26">[27]</ref>. 2: For each hard rule St ∈ S: 3: • Let f be the fact with the largest W_{t,f=true} among all facts in St. • If W_{t,f=true} ≥ W_{t,St=false} then assign true to f, else assign false to all facts in St. 4: • Remove all satisfied clauses.</p><p>Due to space constraints, we defer the proof of Theorem 1 to <ref type="bibr" target="#b26">[27]</ref>. We are not aware of any closed-form formula to express the solutions of the equation p = 1 − p^λ as a function of λ. In the case λ = 2 we obtain p ≈ 0.618, while in the case λ = 3 we obtain a ratio of roughly 0.68. We can also show an approximation guarantee of 0.83 in some cases of interest (see <ref type="bibr" target="#b26">[27]</ref>). Our algorithm reaches its worst-case running time when every fact occurs in every grounded soft rule, i.e., when |C| = |F| for all C ∈ C. In practice, this case is unlikely, and indeed our experiments confirm the efficiency of our algorithm in real-world applications (see Figure <ref type="figure" target="#fig_2">1a</ref>).</p></div>
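<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following Python sketch re-implements the greedy derandomization of Algorithm 1 under the data model of the earlier sketches (clauses as (set-of-literals, weight) pairs, hard rules as a partition of the facts, and p[f] the probabilities from step 1, summing to at most 1 within each hard rule). It is an illustrative re-implementation, not the authors' code; removing satisfied clauses is only a performance optimization and is omitted here, since the potential function already counts such clauses with probability 1:</p><code lang="python">
def sat_prob(clause, assign, p):
    """P(clause is satisfied | partial assignment), where each unassigned
    fact f is independently true with probability p[f]."""
    all_false = 1.0
    for f, pos in clause:
        if f in assign:
            if assign[f] == pos:          # an assigned literal is already true
                return 1.0
        else:
            all_false *= (1.0 - p[f]) if pos else p[f]
    return 1.0 - all_false

def potential(clauses, assign, p):
    """Expected total weight of satisfied clauses (the function W_t)."""
    return sum(w * sat_prob(cl, assign, p) for cl, w in clauses)

def maxsat_soft_hard(clauses, hard_rules, p):
    """Deterministically fix one hard rule per step, choosing between
    'exactly one fact in S_t true' and 'all facts in S_t false'."""
    assign = {}
    for St in hard_rules:
        # W_{t,f=true}: f true and all other facts in St false
        best = max(({**assign, **{g: g == f for g in St}} for f in St),
                   key=lambda a: potential(clauses, a, p))
        # W_{t,St=false}: all facts in St false
        all_false = {**assign, **{g: False for g in St}}
        if potential(clauses, all_false, p) >= potential(clauses, best, p):
            best = all_false
        assign = best
    return assign
</code></div>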
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">REASONING FRAMEWORK</head><p>In the absence of any soft and hard rules, URDF conforms to a standard (conjunctive) SPARQL engine where all facts in F are assumed to be true. Our key observation for query answering in combination with MaxSAT solving is that often only a small subset of the facts in F (frequently several orders of magnitude fewer facts than are contained in the entire knowledge base) is relevant for answering a specific query and for finding the corresponding truth assignments to the facts that are related to the query. For this purpose, we define the dependency graph DQ ⊆ F of a query Q as stated in Definition 1. Definition 1 already yields a recursive algorithm to compute the dependency graph, which is similar to SLD resolution <ref type="bibr" target="#b2">[3]</ref> as used in Datalog and Prolog. In our case, SLD resolution is extended by an additional grounding step for hard rules, i.e., whenever we ground a fact fi, we also iteratively ground all hard rules that are related to it, using fi as a new subquery 1 . The URDF reasoning steps are summarized in Algorithm 2. 1 We employ a form of tabling (i.e., memoization) in order to cache redundant subgoals. This table also serves to break potential cycles in SLD resolution if the same rule is attempted to be grounded repeatedly with the same bindings of variables to constants. Algorithm 2 URDF Reasoning Framework Require: A knowledge base KB with base facts F, first-order soft rules C and first-order hard rules S, and a conjunctive query Q 1: Initialize the dependency graph DQ = ∅. 2: Ground all literals qi ∈ Q via SLD resolution and add their intersection to DQ. 3: Let CQ, SQ denote the sets of soft and hard rules grounded for answering Q. 4: Expand DQ by all facts f in the grounded rules CQ and SQ. 5: Construct a CNF formula over the grounded clauses CQ and the individual facts DQ ⊆ F. 6: Solve the constrained weighted MaxSAT problem over the CNF and the sets SQ (Algorithm 1). 7: return DQ with truth assignments to all facts f ∈ DQ Given a set of first-order rules, this form of deductive grounding has a well-known polynomial runtime for non-recursive Datalog programs as well as for linear recursive programs. However, it has exponential complexity (in the number of facts) already for Datalog programs with a single non-linear recursive rule <ref type="bibr" target="#b8">[9]</ref>. Line 3 captures the rules that were grounded during this resolution phase in order to construct a Boolean formula in conjunctive normal form (CNF). These grounded rules are already available from the previous SLD resolution and can be kept in a simple buffer of the algorithm. The CNF construction in Line 5 itself is linear in the size of the grounded rules CQ and SQ, because all grounded soft rules are already in clause form, while the grounded hard rules can be input into our MaxSAT solver directly as plain sets of mutually exclusive facts. The next step in Line 6 requires the execution of Algorithm 1 for the weighted MaxSAT problem (with both soft and hard rules) described in Section 3.</p><p>We remark that the above form of dependency graph construction guarantees truth assignments to query answers that are consistent with the truth assignments that would be found by the MaxSAT solver for the entire sets of facts F, clauses C, and constraints S. That is, the truth assignments to facts in the dependency graph DQ ⊆ F for any query Q after MaxSAT solving are equivalent to the truth assignments that would be obtained for these facts when running the MaxSAT solver over the entire set of facts F in the knowledge base (modulo ties and possible MaxSAT approximation errors).</p></div>
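<div xmlns="http://www.tei-c.org/ns/1.0"><p>A simplified sketch of this top-down grounding step is shown below: goals are expanded against base facts and rule bodies, with a table that caches the atoms derived per predicate and thereby breaks direct cycles (an in-progress predicate simply returns the atoms derived so far). This illustrates the dependency-graph construction only; it assumes range-restricted rules, handles mutual recursion only approximately, and omits URDF's additional grounding of hard rules and the CNF construction:</p><code lang="python">
def match(atom, fact, binding):
    """Extend `binding` so `atom` (terms; variables start '?') equals `fact`."""
    b = dict(binding)
    for t, c in zip(atom, fact):
        if t.startswith("?"):
            if b.setdefault(t, c) != c:
                return None
        elif t != c:
            return None
    return b

def derive(pred, facts, rules, table):
    """All ground atoms derivable for `pred`, memoized in `table`."""
    if pred in table:                 # cached or in progress: breaks cycles
        return table[pred]
    table[pred] = derived = {f for f in facts if f[0] == pred}
    changed = True
    while changed:                    # fixpoint over the rules deriving `pred`
        changed = False
        for head, body in rules:
            if head[0] != pred:
                continue
            for b in ground_body(list(body), {}, facts, rules, table):
                g = tuple(b.get(t, t) for t in head)
                if g not in derived:
                    derived.add(g)
                    changed = True
    return derived

def ground_body(body, binding, facts, rules, table):
    """Yield all variable bindings that satisfy the body atoms in order."""
    if not body:
        yield binding
        return
    atom, rest = body[0], body[1:]
    for f in list(derive(atom[0], facts, rules, table)):  # snapshot the set
        b = match(atom, f, binding)
        if b is not None:
            yield from ground_body(rest, b, facts, rules, table)

def answers(query_atom, facts, rules):
    """Ground a single query atom, e.g., ('livesIn', '?p', 'NY')."""
    table = {}
    return [f for f in derive(query_atom[0], facts, rules, table)
            if match(query_atom, f, {}) is not None]
</code></div>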
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">EXPERIMENTS</head><p>The following experiments were run on an AMD Opteron Quad-Core 2.6 GHz server with 16 GB RAM, using Oracle 11g as the storage back-end for the underlying knowledge base. Physical I/Os were cached (thus aiming to eliminate variance due to disk operations) by running each query once and then taking the average runtime over 5 immediately repeated runs. Memory consumption was generally not a limiting factor, with up to 501 MB overall memory consumption for our URDF Java implementation (including the high overhead of the Java VM) and less than 10 MB for the Alchemy package (see Subsection 5.2), which is implemented in C++.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">YAGO Knowledge Base, Rules and Queries</head><p>The semantic graph of YAGO <ref type="bibr" target="#b24">[25]</ref> serves as the basis for our experiments. YAGO is a large common-sense knowledge base that has been extracted automatically from Wikipedia articles. YAGO contains more than 2 million entities and 19 million facts. The facts include a class hierarchy of 200,000 classes with about 100 distinct relation types. Moreover, we employ 16 (partially recursive) hand-crafted soft deduction rules for common-sense reasoning about people and their relationships, together with 10 queries of different shapes. We enforce functional properties of the predicates bornIn, bornOnDate, diedIn, diedOnDate and marriedTo as consistency constraints. As weights for base facts, we employ the confidence weights provided by YAGO, while the weight for a soft rule is calculated as the conditional probability for the entire rule to hold (including the head literal), given that the body of the rule holds (when grounded over YAGO; see <ref type="bibr" target="#b26">[27]</ref> for a detailed description of the rules and queries). Queries were chosen such that many query predicates are defined via deduction rules, which led to a recursion depth of up to 7 in our experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Markov Logic: MAP and MC-SAT</head><p>Alchemy 2 provides a series of algorithms for statistical relational learning and probabilistic inference based on Markov Logic Networks (MLNs) <ref type="bibr" target="#b22">[23]</ref>. It implements various MCMC sampling techniques, including MAP inference <ref type="bibr" target="#b23">[24]</ref> (a memory-efficient stochastic MaxSAT solver based on MaxWalkSAT) and MC-SAT <ref type="bibr" target="#b19">[20]</ref>. MAP inference yields truth assignments to facts (which allows for precision comparisons with URDF), whereas MC-SAT computes probability distributions over the underlying Markov network (which URDF does not do). Thus, we refer to MC-SAT merely for runtime comparisons, as a state-of-the-art technique for MCMC sampling. We found that grounding the above rules and queries over the entire YAGO knowledge base in Alchemy under an open-world assumption is not feasible, due to the nearly quadratic inflation of the resulting MLN structure. Hence, we provide the facts and rules grounded by URDF (via SLD resolution) directly as input to Alchemy, which effectively leads to a closed-world grounding of rules in the corresponding MLN structure. MLN running times (for MAP and MC-SAT) thus mostly correspond to the time needed for inference by Alchemy over this much smaller network structure (plus some overhead for parsing the formulas and grounding the network in closed-world mode).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Basic Query Processing over YAGO.</head><p>The first setting reports running times and result precision for the URDF reasoner compared to MAP inference and MC-SAT over the basic YAGO setting. As for approximation quality, we measure the relative precision of the URDF MaxSAT solver compared to the MAP baseline: if the MaxSAT weight computed by URDF (MaxSAT-W) is at least as large as the weight achieved by MAP inference (MAP-W), and none of the hard constraints are violated by either approach, we conclude that we found an equally good or even better solution. Grounding time (SLD) denotes the time needed to ground the query atoms and the soft and hard rules, and to expand the dependency graph via top-down SLD resolution. In the following, #C and #S denote the numbers of grounded soft and hard rules (including unit clauses), while |C| and |S| denote the numbers of occurrences of facts in all grounded soft and hard rules, respectively. The overall query response time (in ms) for URDF is the sum of SLD grounding and MaxSAT solving. Table <ref type="table" target="#tab_0">1</ref> shows that the URDF MaxSAT solver achieves more than two orders of magnitude runtime improvement over MAP and MC-SAT, at 90 percent precision compared to the MAP baseline for queries Q1-Q10. That is, for 9 out of the 10 queries URDF finds the same MaxSAT weight as MAP, while only for query Q7 the weight returned by URDF is marginally worse. We also see that running MC-SAT is generally more expensive than MAP inference. In this basic setting, SLD grounding clearly dominates query response times for URDF.</p><p>However, for all queries we achieve interactive response times, with an overall runtime of at most 3 seconds per query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">Synthetic Rule Expansions.</head><p>To systematically investigate the asymptotic behavior of our algorithm for large dependency graphs and more complex rules, we employ synthetic expansions over the basic YAGO setting. In the first expansion setting (Figure <ref type="figure" target="#fig_1">1</ref>(a)), we expand each grounded soft rule by 10 distinct facts as noise per expansion step, for each of the query results depicted in Table <ref type="table" target="#tab_0">1</ref>. For the noisy facts, we apply uniform weights as well as weights following Gaussian distributions (with σ² = 1) around the original weights (µ) of the YAGO facts. We thus simulate very complex CNF formulas with more than 10^5 occurrences of facts in soft rules. In the following plots, the grounding time also includes the time for expanding the CNF formulas with these noisy facts and thus is not constant over all runs. Figure <ref type="figure" target="#fig_1">1</ref>(a) confirms that the runtime of the MaxSAT algorithm is not affected by the weighting schemes and remains equally efficient.</p><p>In the second expansion setting (Figure <ref type="figure" target="#fig_1">1</ref>(b)), we vary not only the number of facts per soft rule, but also the number of grounded soft and hard rules by replicating rules with different combinations of noisy facts. That is, for each original grounded fact, we introduce 10 mutually exclusive facts as noise, and, for each soft rule, we expand the CNF by introducing 10 randomly expanded soft rules at each expansion step. Overlap among soft rules is achieved by uniform-randomly drawing the noisy facts from a static pool of 1,000 synthetic facts for all expansions. We keep Gaussian weights for the expanded facts. This setting yields very large CNF formulas with |C| + |S| ranging from 2,312 to 2,321,488. More specifically, we create constraints with up to 21,277 soft rules and 2,311,551 occurrences of facts in all soft rules, over an overall amount of only 9,937 distinct facts for the last expansion step. In this step, each fact occurs on average in more than 232 rules, each with an average rule length of 108 facts. URDF solves the CNF formulas for this expansion in 15.7 seconds, while computing MaxSAT weights comparable to those found by MAP. We measure the approximation precision only for the first three expansion steps, yielding MaxSAT weights of 853.74, 975.02, and 1,099.34 for URDF, and 854.58, 937.74, and 1,069.70 for MAP, respectively. MAP does not scale to larger CNFs than in these first three expansions, while MC-SAT could not be run beyond the first expansion step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">Inductively Learned Rules.</head><p>The previous experiments were run over the entire YAGO knowledge base, using 16 hand-crafted rules in a fully recursive fashion. To measure the cost of deduction over such a large knowledge base with more general rules, we conduct runs over 42 recursive rules learned inductively by a variant of FOIL <ref type="bibr" target="#b20">[21]</ref>, using YAGO as background knowledge. Figure <ref type="figure" target="#fig_3">2</ref>(a) depicts the runtimes (in seconds) for grounding queries Q1-Q10 over YAGO using these rules, while breaking off SLD resolution at different deduction levels. Results for the LUBM benchmark are summarized in Figure <ref type="figure" target="#fig_3">2</ref>(b).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">RELATED WORK</head><p>Reasoning with inconsistency and uncertainty has lately been gaining significant attention in the Database, Semantic Web, and Machine Learning communities. In contrast to related work on uncertain and probabilistic databases, we consider a more dynamic way of querying and reasoning with deduction rules for intensional relations. All database approaches for uncertain data we are aware of (for example Trio <ref type="bibr" target="#b28">[29]</ref>, MayBMS <ref type="bibr" target="#b1">[2]</ref>, and MystiQ <ref type="bibr" target="#b4">[5]</ref>) limit queries to materialized data. Some systems support views <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b28">29]</ref>, but only in materialized form, which is equivalent to comprehensive and thus expensive bottom-up grounding in Datalog. Moreover, we also found recent approaches that adapt inference methods for probabilistic graphical models to database systems, such as <ref type="bibr" target="#b17">[18]</ref> or <ref type="bibr" target="#b27">[28]</ref>, to be primarily designed for batch processing and not well suited for interactive querying. A similar observation also holds for Probabilistic Logic Programming (PLP) and Answer Set Programming (ASP). PLP combines logic programming (including Datalog) with probabilistic reasoning. ProbLog <ref type="bibr" target="#b21">[22]</ref> employs SLD resolution <ref type="bibr" target="#b2">[3]</ref> for grounding first-order rules, together with Monte Carlo sampling or binary decision diagrams for probabilistic inference over Boolean formulas obtained from SLD proofs. ASP, on the other hand, is a more generic paradigm for solving combinatorial problems with logic programs. In <ref type="bibr" target="#b18">[19]</ref>, the authors study tractable subclasses of ASP with cardinality constraints and weights. In Probabilistic Description Logic (PDL) <ref type="bibr" target="#b13">[14]</ref> (see <ref type="bibr" target="#b14">[15]</ref> for an overview), probability ranges can be attached to subsumption (or instance-of) statements. PDL generalizes the description logic SHOIN(D) and can thus express functional rules. However, PDL cannot deal with truly inconsistent input statements. Thus, in order to apply PDL to a knowledge base with noisy confidence scores in place of the probabilities, the probability ranges would have to be reconciled upfront, which amounts to solving the inconsistencies before starting the reasoner.</p><p>Statistical Relational Learning (SRL) has been gaining strong momentum in both the Database and Machine Learning communities, with Markov Logic Networks (MLNs) <ref type="bibr" target="#b22">[23]</ref> probably being one of the most generic approaches for combining first-order logic and probabilistic graphical models. In these classes of graphical models, sampling methods based on Markov Chain Monte Carlo (MCMC) provide a family of efficient approximation algorithms for probabilistic inference. Specifically, the algorithms employed in Markov Logic for maximum-a-posteriori (MAP) inference <ref type="bibr" target="#b23">[24]</ref> and MC-SAT <ref type="bibr" target="#b19">[20]</ref> constitute two state-of-the-art extensions of MaxWalkSAT-based stochastic MaxSAT solvers <ref type="bibr" target="#b12">[13]</ref> and Gibbs-style sampling techniques, respectively. Our approach diverges from Markov Logic in two basic aspects: grounding and inference. 
Grounding a first-order Markov network under open-world semantics works by binding all entities (constants) to the variables in the first-order rules that match the type signatures of the respective predicates. For binary predicates, this may result in grounded networks that are often nearly quadratic in the number of entities in the knowledge base. Unlike Markov Logic, we specifically focus on query-time reasoning, with a deductive (closed-world) grounding of soft rules.</p><p>In <ref type="bibr" target="#b25">[26]</ref>, we presented a "self-organizing framework for information extraction", where the main problem was also formulated as a MaxSAT problem. The main difference between this work and <ref type="bibr" target="#b25">[26]</ref> is that here we deal with hard and soft rules in a more principled way, whereas in <ref type="bibr" target="#b25">[26]</ref> hard rules are enforced by introducing soft rules with large weights (see Section 2 for a discussion of why a more principled approach is needed). Moreover, we present an algorithm with an approximation guarantee of 0.83 in some cases of interest. The MaxSAT problem is well studied in the theoretical computer science community (see, for example, <ref type="bibr" target="#b7">[8]</ref>). Here, we focus on the effectiveness of our algorithms in solving real-life problems. In <ref type="bibr" target="#b16">[17]</ref>, we presented an initial demo of our interactive reasoning framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">CONCLUSIONS</head><p>We presented a query-time reasoning approach for uncertain RDF data and SPARQL queries over a combination of soft deduction rules and hard consistency constraints. URDF employs a generalized weighted MaxSAT algorithm that guarantees consistent query answers with regard to the hard constraints. Our experiments confirm that our MaxSAT approximation algorithm yields interactive response times over formulas with many thousands of grounded rules and facts, while empirically performing much better than the proven lower bound of 1/2 for the approximation guarantee. We also saw that, in many cases, the grounding time for the Datalog-style soft deduction rules is the actual limiting factor for query response times. Our future work will investigate how to further scale up inference by combining dynamic grounding over the highly transient parts of the knowledge base with materialized facts for the more static parts, as well as by distributing our grounding and MaxSAT-based inference strategies.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>THEOREM 1. COROLLARY 1.</head><figDesc>Theorem 1: Given a set C of Horn clauses and a set S of hard rules, let λ be the minimum number of negated literals in any Horn clause that has at least two literals. Our algorithm is a p-approximation algorithm for the MaxSAT problem with soft and hard rules, where p is obtained by solving the equation p = 1 − p^λ. The running time of our algorithm is O(m·n) in the worst case, with m = Σ_{c ∈ C} |c| and n = Σ_{s ∈ S} |s|. Corollary 1: Our algorithm has an approximation guarantee of 1/2 for general Horn clauses.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>DEFINITION 1.</head><label>1</label><figDesc>Given a knowledge base KB = {F, C, S} and a conjunctive query Q: • All possible groundings fi ∈ F of the query atoms qj ∈ Q are facts in the dependency graph DQ of Q. • If a grounded fact fn ∈ F is in DQ, then all grounded facts f1, . . . , fn−1 ∈ F of all grounded soft rules C ∈ C, in which fn is the head, are also in DQ. • If a grounded fact fi ∈ F is in DQ, then all grounded facts f1, . . . , fk ∈ F of the grounded hard rule S ∈ S, which are mutually exclusive to fi, are also in DQ.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Asymptotic MaxSAT behavior (a) and MaxSAT vs. MAP and MC-SAT (b).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: SLD deduction levels (a) and LUBM results (b).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>5.2.4</head><figDesc>LUBM Benchmark.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2(b)</head><label>2</label><figDesc>Figure 2(b) shows a comparison of URDF runtimes (in seconds, including both SLD and MaxSAT) versus the Jena Semantic Web toolkit 3 (using PostgreSQL 8.3 as storage backend) over the LUBM benchmark setting [10] at scale factors (SF) 1, 5 and 10. Performance for URDF is similar to that observed on YAGO, where MaxSAT times are again much faster than SLD grounding times. URDF outperforms Jena even in the grounding step by a large margin, while Jena failed to respond to many queries at increasing scale factors (see also <ref type="bibr" target="#b9">[10]</ref> for detailed results on LUBM).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Basic YAGO results.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>URDF/MLN</cell><cell>URDF</cell><cell>MLN</cell><cell>MLN</cell><cell>URDF</cell><cell>MLN</cell></row><row><cell></cell><cell>#C</cell><cell>|C|</cell><cell>#S</cell><cell>|S|</cell><cell>SLD(ms)</cell><cell>MaxSAT(ms)</cell><cell>MAP(ms)</cell><cell>MC-SAT(ms)</cell><cell>MaxSAT-W</cell><cell>MAP-W</cell></row><row><cell>Q1</cell><cell>49</cell><cell>109</cell><cell>22</cell><cell>25</cell><cell>243</cell><cell>1</cell><cell>80</cell><cell>65</cell><cell>19.92</cell><cell>19.92</cell></row><row><cell>Q2</cell><cell>14</cell><cell>20</cell><cell>10</cell><cell>12</cell><cell>53</cell><cell>1</cell><cell>814</cell><cell>17</cell><cell>9.69</cell><cell>9.69</cell></row><row><cell>Q3</cell><cell>32</cell><cell>40</cell><cell>27</cell><cell>28</cell><cell>25</cell><cell>7</cell><cell>814</cell><cell>17</cell><cell>20.56</cell><cell>20.56</cell></row><row><cell>Q4</cell><cell>46</cell><cell>82</cell><cell>23</cell><cell>30</cell><cell>178</cell><cell>1</cell><cell>861</cell><cell>221</cell><cell>20.77</cell><cell>20.77</cell></row><row><cell>Q5</cell><cell>176</cell><cell>203</cell><cell>167</cell><cell>167</cell><cell>3,062</cell><cell>13</cell><cell>1,564</cell><cell>60,970</cell><cell>161.93</cell><cell>161.93</cell></row><row><cell>Q6</cell><cell>318</cell><cell>342</cell><cell>307</cell><cell>310</cell><cell>584</cell><cell>4</cell><cell>111</cell><cell>173</cell><cell>292.40</cell><cell>292.40</cell></row><row><cell>Q7</cell><cell>100</cell><cell>220</cell><cell>41</cell><cell>48</cell><cell>222</cell><cell>4</cell><cell>1,344</cell><cell>1,032</cell><cell>27.06</cell><cell>27.91</cell></row><row><cell>Q8</cell><cell>195</cell><cell>199</cell><cell>192</cell><cell>193</cell><cell>93</cell><cell>7</cell><cell>1,877</cell><cell>36,330</cell><cell>188.21</cell><cell>188.21</cell></row><row><cell>Q9</cell><cell>44</cell><cell>71</cell><cell>27</cell><cell>35</cell><cell>143</cell><cell>7</cell><cell>1,407</cell><cell>142</cell><cell>26.37</cell><cell>26.37</cell></row><row><cell>Q10</cell><cell>89</cell><cell>89</cell><cell>89</cell><cell>89</cell><cell>71</cell><cell>4</cell><cell>283</cell><cell>5,267</cell><cell>86.83</cell><cell>86.83</cell></row><row><cell>Σ</cell><cell>1,063</cell><cell>1,375</cell><cell>905</cell><cell>937</cell><cell>4,674</cell><cell>49</cell><cell>9,155</cell><cell>104,234</cell><cell>853.74</cell><cell>854.58</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://alchemy.cs.washington.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://jena.sourceforge.net/</note>
		</body>
		<back>


			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Probabilistic Method</title>
		<author>
			<persName><forename type="first">N</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Spencer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1992">1992</date>
			<publisher>John Wiley</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">MayBMS: Managing incomplete information with probabilistic world-set decompositions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Antova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Koch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Olteanu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Contributions to the theory of logic programming</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Apt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Van Emden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. ACM</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="841" to="862" />
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">DBpedia: a nucleus for a web of open data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kobilarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">G</forename><surname>Ives</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="722" to="735" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">MYSTIQ: a system for finding more answers by using probabilities</title>
		<author>
			<persName><forename type="first">J</forename><surname>Boulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mandhani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mathur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Suciu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="891" to="893" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Coupled semi-supervised learning for information extraction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Betteridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">R H</forename><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Mitchell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WSDM</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="101" to="110" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Learning and inference with constraints</title>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-A</forename><surname>Ratinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rizzolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI</title>
				<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1513" to="1518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">New 3/4-approximation algorithms for the maximum satisfiability problem</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">X</forename><surname>Goemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Williamson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM J. Discr. Math</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="656" to="666" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On the complexity of single-rule Datalog queries</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Papadimitriou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Comput</title>
		<imprint>
			<biblScope unit="volume">183</biblScope>
			<biblScope unit="page" from="104" to="122" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">LUBM: A benchmark for OWL knowledge base systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="158" to="182" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">On the complexity of the maximum satisfiability problem for Horn formulas</title>
		<author>
			<persName><forename type="first">B</forename><surname>Jaumard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Simeone</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing Letters</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="4" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Approximation algorithms for combinatorial problems</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computer and System Sciences</title>
		<imprint>
			<biblScope unit="page" from="256" to="278" />
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A general stochastic approach to solving problems with hard and soft constraints</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kautz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Selman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Satisfiability Problem: Theory and Applications</title>
				<imprint>
			<publisher>American Mathematical Society</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="573" to="586" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Probabilistic description logic programs</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Approx. Reasoning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="288" to="307" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Managing uncertainty and vagueness in description logics for the semantic web</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Straccia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. of Web Semantics</title>
		<imprint>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="291" to="308" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">FACTORIE: Probabilistic programming via imperatively defined factor graphs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schultz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NPIS</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1249" to="1257" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Interactive reasoning in uncertain RDF knowledge bases</title>
		<author>
			<persName><forename type="first">T</forename><surname>Meiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dylla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Theobald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM</title>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="2557" to="2560" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Tuffy: Scaling up statistical inference in Markov Logic Networks using an RDBMS</title>
		<author>
			<persName><forename type="first">F</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Shavlik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="373" to="384" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Tractable answer-set programming with weight constraints: Bounded treewidth is not enough</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pichler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rümmele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Szeider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Woltran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KR</title>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A general method for reducing the complexity of relational inference and its application to MCMC</title>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sumner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI</title>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1075" to="1080" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Learning logical definitions from relations</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Quinlan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="1990">1990</date>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="239" to="266" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">ProbLog: A probabilistic Prolog and its application in link discovery</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">D</forename><surname>Raedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kimmig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Toivonen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="2462" to="2467" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Markov Logic Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="page" from="107" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Memory-efficient inference in relational domains</title>
		<author>
			<persName><forename type="first">P</forename><surname>Singla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI</title>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Yago: A core of semantic knowledge</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Suchanek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW</title>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">SOFIE: A self-organizing framework for information extraction</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Suchanek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sozio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW</title>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">URDF: An efficient reasoning framework for uncertain knowledge bases with hard and soft rules</title>
		<author>
			<persName><forename type="first">M</forename><surname>Theobald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sozio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Suchanek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nakashole</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
		<respStmt>
			<orgName>Max-Planck-Institute for Informatics</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Scalable probabilistic databases with factor graphs and MCMC</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Wick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Miklau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="794" to="804" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Trio: A system for integrated management of data, accuracy, and lineage</title>
		<author>
			<persName><forename type="first">J</forename><surname>Widom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIDR</title>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="262" to="276" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
