=Paper=
{{Paper
|id=Vol-2646/43-paper
|storemode=property
|title=Probabilistic Answers over Inconsistent Knowledge Bases
|pdfUrl=https://ceur-ws.org/Vol-2646/43-paper.pdf
|volume=Vol-2646
|authors=Marco Calautti,Nicola Fiorentino,Sergio Greco,Cristian Molinaro,Irina Trubitsyna
|dblpUrl=https://dblp.org/rec/conf/sebd/CalauttiFGMT20
}}
==Probabilistic Answers over Inconsistent Knowledge Bases==
Probabilistic Answers over Inconsistent Knowledge Bases (DISCUSSION PAPER) Marco Calautti, Nicola Fiorentino, Sergio Greco, Cristian Molinaro, Irina Trubitsyna DIMES, University of Calabria {calautti, fiorentino, greco, cmolinaro, trubitsyna}@dimes.unical.it Abstract. Consistent query answering is a generally accepted approach for query- ing inconsistent knowledge bases. A consistent answer to a query is a tuple entailed by every repair, where a repair is a consistent database that “minimally” differs from the original (possibly inconsistent) one. This is a somewhat coarse-grained classification of tuples into consistent and non-consistent does not provide much information about the non-consistent tuples (e.g., a tuple entailed by 99 out of 100 repairs might be considered “almost consistent”). To overcome this limitation, we propose a probabilistic approach to querying inconsistent knowledge bases, which provides more informative query answers by associating a degree of consistency with each query answer by associating a probability to each repair, depending on the changes needed to obtain it. 1 Introduction There has been a great deal of work on extracting reliable information from inconsistent data, in both the AI and database communities. To deal with this problem, many seman- tics of query answering have been proposed. Most of them are based on the consistent query answering framework [2]. The idea is that a tuple should be considered a consistent answer to a query posed over an inconsistent knowledge base if the tuple is a query answer in every repair, where a repair is a consistent database that “minimally” differs from the original one. Different variants of the consistent query answering problem have been investigated depending on the minimality criterion—e.g., see [5]—or the employed repair primitive to restore consistency—e.g., see [3]. Regardless of the specific minimality criterion and repair primitive, all the resulting frameworks share the same drawback, which is a dichotomic classification of tuples in either consistent or non-consistent ones. This can provide very little information to users in many cases, as illustrated in the following example. Example 1. Consider the knowledge base (D, Σ), where D contains the following facts: employee bob cs nyc mike cs nyc alice cs paris Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy. and Σ is an ontology consisting of the following equality-generating dependency σ: employee(E1 , D, C1 ), employee(E2 , D, C2 ) → C1 = C2 . As an example, the fact employee(bob, cs, nyc) states that bob is an employee work- ing in the cs department located in nyc. The dependency σ says that every department must be located in a single city. Clearly, the first two facts are conflicting with the last one. There are two repairs for this inconsistent knowledge base, which are as follows (more details on the employed notion of repair will be provided in the following): bob cs nyc bob cs paris R1 = mike cs nyc R2 = mike cs paris alice cs nyc alice cs paris Repair R1 is obtained from the original database by replacing paris with nyc, while R2 is obtained by replacing nyc with paris. The query asking for the city of the cs department has no consistent answers. Thus, a user trying to extract this information from the knowledge base is left with no answer altogether, all she/he gets is the empty set. To overcome the limitation illustrated in the example above, we propose a probabilis- tic approach to querying inconsistent knowledge bases whose aim is to provide more informative query answers than the classical consistent query answering framework. The main idea is to give a “degree of consistency” to every repair, which is then taken into account to assign a degree of consistency to query answers. Example 2. Consider again the scenario of Example 1. The degree of consistency of R1 is 2/3, as nyc is supported by two facts out of three in the original knowledge base, while the degree of consistency of R2 is 1/3, as paris is supported by only one fact out of three. Consider again the query asking for the city of the cs department. Rather than providing no information (like classical consistent query answers do), our query answers are nyc with confidence 2/3 (as this is entailed by R1 ), and paris with confidence 1/3 (as this is entailed by R2 ), which is much more informative than returning nothing. The previous examples illustrate the limitation of providing only certain information and how this might be overcome by assigning a degree of consistency to query answers (notice that consistent query answers are still provided – they have degree 1). In this paper, we consider knowledge bases where dependencies are expressed by means of equality-generating dependencies (EGDs). Equality-generating dependencies are one of the two major types of data dependencies—the other major type consists of tuple-generating dependencies (TGDs)—and can model several kinds of constraints arising in practice, such as functional dependencies and thus key dependencies. In the presence of EGDs, the two main approaches to restore consistency are to perform fact deletions or fact updates. A drawback of the former is that entire facts are deleted to resolve inconsistency, even if they may still contain “reliable” information. Thus, in this paper we adopt a notion of repair based on fact updates (we will provide a precise definition in the following). Nonetheless, the ideas developed in this paper can be used also when a notion of repair based on fact deletions is adopted. As we show, providing more informative query answers comes at a price: it is #P- hard. In light of this result, we discuss further developments providing polynomial time algorithms to compute approximate query answers. Related work. Most inconsistency-tolerant approaches in the literature adopt the classical notion of repair (via fact deletions/insertions) and consistent query answer [16,19,6,18,17], while in this paper we propose a probabilistic generalization where repairs use fact updates, akin to [7,21,11,15], and query answers are probabilistic. Some works on prob- abilistic query answering on inconsistent knowledge bases exist [1,13,10], but [1] and [13] only consider restricted classes of dependencies, while [10] deals with the classical notion of repair. 2 Preliminaries Basics. We assume the existence of the following pairwise disjoint (countably infinite) sets: a set Const of constants, a set Var of variables, and a set Null of labeled nulls. Nulls are denoted by the symbol ⊥ subscripted with natural numbers. A term is a constant, variable, or null. We also assume a set of predicates, disjoint from the aforementioned sets, with each predicate being associated with an arity, which is a non-negative integer. An atom is of the form p(t1 , . . . , tn ), where p is an n-ary predicate and the ti ’s are terms. We write an atom also as p(t), where t is a sequence of terms. An atom without variables is also called a fact. An instance is a finite multiset of facts. A database is a finite set of facts containing constants only. An instance containing only constants is said to be complete. Notice that, as opposed to databases, instances can contain duplicates— as shown in the following, instances are used in intermediate steps during repairs’ computation and it is needed to keep duplicates to count the number of modifications being made. We assume the existence of a function db converting a complete instance to a database by eliminating duplicates. A homomorphism is a mapping h : Const∪Var∪Null → Const∪Var∪Null that is the identity on Const. Homomorphisms are also applied to atoms and (multi) sets of atoms in the natural fashion, that is, h(p(t1 , . . . , tn )) = p(h(t1 ), . . . , h(tn )), and h(S) = {h(A) | A ∈ S} for every (multi) set S of atoms. A valuation is a homomorphism ν whose image is Const, that is, ν(t) ∈ Const for every t ∈ Const ∪ Var ∪ Null. An equality generating dependency (EGD) σ is a first-order formula of the form ∀x ϕ(x) → xi = xj , where ϕ(x) is a conjunction of atoms (without labeled nulls) whose variables are exactly x, and xi and xj are variables from x. We call ϕ(x) the body of σ, and call xi = xj the head of σ. We will omit the universal quantification in front of dependencies and assume that all variables are universally quantified. With a slight abuse of notation, we sometimes treat a conjunction of atoms as the set of its atoms. An instance J satisfies σ, denoted J |= σ, if whenever there exists a homomorphism h s.t. h(ϕ(x)) ⊆ J, then h(xi ) = h(xj ). An instance J satisfies a set Σ of EGDs, denoted J |= Σ, if I |= σ for every σ ∈ Σ. A knowledge base (KB) is a pair (D, Σ), where D is a database and Σ is a finite set of EGDs. We say that the knowledge base is consistent if D |= Σ, otherwise it is inconsistent. Acyclic EGDs. An argument of a set of EGDs Σ is an expression of the form p[i], where p is an n-ary predicate appearing in Σ and 1 ≤ i ≤ n. The argument graph of Σ is a directed graph GΣ = (V, E), where V is the set of all arguments of Σ, and E contains a directed edge from p[i] to q[j] labeled σ iff there is an EGD σ ∈ Σ such that: – the body of σ contains an atom p(t1 , . . . , tn ) such that either ti is a constant or ti is a variable occurring more than once in the body of σ, and – the body of σ contains an atom q(u1 , . . . , um ) such that uj is a variable also appearing in the head of σ. The dependency graph of Σ is a directed graph ΓΣ = (Σ, Ω), where Ω is the set {(σ1 , σ2 ) | GΣ = (V, E) ∧ (p([i], q[j], σ1 ), (q[j], r[k], σ2 ) ∈ E}. We say that Σ is acyclic if its dependency graph is acyclic. In the rest of the paper we consider acyclic sets of EGDs only. Thus, from now on, Σ is understood to be acyclic for every knowledge base (D, Σ). For acyclic sets of EGDs we can define a partial order (Σ, <) where EGDs in Σ are ordered by reachability in ΓΣ . The result of evaluating a query Q over a database D is denoted Q(D). Repairs. The notion of repair used in this paper is based on updating facts and assumes that conflicting facts denote an attribute-level uncertainty in the data [12,7,21,4,11,15]. Alternative approaches based on fact deletions assume that conflicting facts denote a tuple-level uncertainty [2,20,10]. The computation of repairs consists of a chase-like procedure that acts as follows: whenever an EGD ϕ(x) → xi = xj is not satisfied by a database D, i.e. there exists an homomorphism h such that D |= h(ϕ(x)) and h(xi ) 6= h(xj ), we have to enforce it by making the two values equal, that is either h(xi ) replaces h(xj ) or vice versa. A repair is obtained through an exhaustive application of this repair step. Note that, although we follow the partial order (Σ, <) to choose the EGD to enforce, there is a non-deterministic choice to be made when selecting an EGD and when updating values; this may lead to multiple repairs. Example 3. Consider the set of EGDs Σ = {σ1 : employee(X, Y1 , Z1 ) ∧ employee(X, Y2 , Z2 ) → Y1 = Y2 , σ2 : employee(X1 , Y, Z1 ) ∧ employee(X2 , Y, Z2 ) → Z1 = Z2 } and the database D: bob cs rome bob math rome alice math nyc By enforcing σ1 into the first two facts, either cs or math can be chosen as bob’s department. If the latter is chosen, then the instance D0 below is obtained. Suppose now that σ2 is enforced into the last two facts of D0 . Then, either rome or nyc can be chosen as math’s city. If the former is chosen, we obtain the instance D00 : bob math rome bob math rome D0 = bob math rome D00 = bob math rome alice math nyc alice math rome No further dependency enforcement is applicable at this point and thus the database derived from D00 by eliminating duplicates is a repair. The previous example informally illustrated the basic idea of the repair strategy we adopt. In the next section, we formally define it and show how to associate probabilities to repairs on the basis of the modifications that yielded them. Notice that it might be possible to restore consistency in other different ways. For instance, in the first step of the example above, one may modify the employee names. However, we do not consider this option because it is unclear which (different) values should be assigned (any constant in Const is a candidate value). For instance, bob in the first fact might be replaced with rome, but this is somewhat arbitrary and indeed does not make much sense. In contrast, our repair strategy chooses candidate values that are somehow “justified” by the content of the database (e.g., in the example above, bob works for either the cs or the math department). Moreover, when EGDs are key dependencies, the aforementioned way of restoring consistency may lead to the introduction of entities that are not meaningful. Indeed, our choice has been made by different approaches relying on value updates—e.g., [7]. For special classes of EGDs, the repair strategy based on updating values coincides with the repair strategy based on fact deletions. 3 Probabilistic Repairs and Query Answers One problem of the classical notion of consistent query answer is that query answers can provide little information in the presence of conflicting information. The alternative approach proposed in this paper is to compute probabilistic answers by taking into account in to what extent information is updated to restore consistency. In this section, we define probabilistic repairs and probabilistic query answers. In the next section, we show how to compute a compact representation of all probabilistic repairs. In both cases, we will use probabilistic instances, which are used to represent a single probabilistic repair in this section and are used to compactly represent all probabilistic repairs in the next section. Probabilistic instances are instances augmented with a set of “probability assignments”—intuitively, expressions stating conditions on nulls and probabilities on the values they can take. More formally, let C be the set of all expressions, called conditions, that can be built using the standard logical connectives ∧, ∨, ¬, and expressions of the form ti = tj , true, and false, where ti , tj ∈ Const ∪ Null. We will also use ti 6= tj as a shorthand for ¬(ti = tj ). A homomorphism h satisfies a condition ϕ, denoted h |= ϕ, if h(ϕ) is true. A probability assignment is an expression of the form P (ϕ1 | ϕ2 ) = p, where ϕ1 , ϕ2 are conditions and p ∈ [0, 1]. A homomorphism h satisfies a probability assignment P (ϕ1 | ϕ2 ) = p if the following holds true: if h |= ϕ2 then h |= ϕ1 . Also, h satisfies a set Φ of probability assignments, denoted h |= Φ, if h satisfies every probability assignment in Φ. For ease of presentation, a probability assignment of the form P (ϕ1 | true) = p will be simply written as P (ϕ1 ) = p. A probabilistic instance (PI) is a pair K = (J, φ), where J is an instance and φ is a finite set of probability assignments. For any valuation ν, we define P r(ν, K) = Π {p | P (ϕ1 | ϕ2 ) = p ∈ Φ and ν |= ϕ2 }. The semantics of K is given by the set of its probabilistic worlds, that is, the set of pairs (instance, probability) defined as follows: pw (K) = {(ν(J), P r(ν, K)) ν| is a valuation s.t. ν |= Φ ∧ P r(ν, K) > 0}. The probabilistic repairs of an inconsistent knowledge base (D, Σ) are computed starting from the probabilistic instance (D, ∅) and iteratively enforcing EGDs (following a topological sorting of ΓΣ ). During the repair process we keep track of which values have been updated and how many modifications have been applied using labeled nulls and probability assignments. The enforcement of an EGD, which we call probabilistic repair step, is informally shown in the next example. Example 4. Consider the knowledge base (D, Σ), where D and Σ are from Example 3. Starting from the probabilistic instance (D, ∅), by enforcing σ1 into the first two facts, either cs or math can be chosen as bob’s department. Therefore we obtain the following two (alternative) probabilistic instances K1 (left) and K2 (right): bob ⊥1 rome bob ⊥1 rome bob ⊥1 rome bob ⊥1 rome alice math nyc alice math nyc P (⊥1 = cs) = 1/2 P (⊥1 = math) = 1/2 The only probabilistic world of K1 is (D1 , 1/2), while the only probabilistic world of K2 is (D2 , 1/2), where D1 and D2 are as follows (and are obtained by replacing ⊥1 with cs and math, respectively): bob cs rome bob math rome D1 = bob cs rome D2 = bob math rome alice math nyc alice math nyc As D1 |= Σ, (db(D1 ), 1/2) is a probabilistic repair. On the other hand, D2 6|= Σ, and thus we need to keep applying pr-steps over K2 . By enforcing σ2 over the first and third facts of K2 we can get the following two (alternative) probabilistic instances K3 (left) and K4 (right): bob ⊥1 ⊥2 bob ⊥1 ⊥2 bob ⊥1 rome bob ⊥1 rome alice math ⊥2 alice math ⊥2 P (⊥1 = math) = 1/2 P (⊥1 = math) = 1/2 P (⊥2 = rome) = 1/2 P (⊥2 = nyc) = 1/2 The only probabilistic world of K3 is (D3 , 1/4), while the only probabilistic world of K4 is (D4 , 1/4), where D3 and D4 are as follows: bob math rome bob math nyc D3 = bob math rome D4 = bob math rome alice math rome alice math nyc In the rest of the paper, for every probabilistic instances (J, φ), we report the probability assignments in φ under J. Since D3 |= Σ, (db(D3 ), 1/4) is a probabilistic repair. On the other hand, D4 6|= Σ, and thus further pr-steps need to applied to K4 . By enforcing σ2 over the first two facts of K4 we can get the following two (alternative) probabilistic instances K5 (left) and K6 (right): bob ⊥1 ⊥3 bob ⊥1 ⊥3 bob ⊥1 ⊥3 bob ⊥1 ⊥3 alice math ⊥3 alice math ⊥3 P (⊥1 = math) = 1/2 P (⊥1 = math) = 1/2 P (⊥2 = nyc) = 1/2 P (⊥2 = nyc) = 1/2 P (⊥3 = rome) = 1/3 P (⊥3 = ⊥2 ) = 2/3 The only probabilistic world of K5 is (D5 , 1/12), while the only probabilistic world of K6 is (D6 , 1/6), where D5 and D6 are as follows: bob math rome bob math nyc D5 = bob math rome D6 = bob math nyc alice math rome alice math nyc Since D5 |= Σ and D6 |= Σ, both (db(D5 ), 1/12) and (db(D6 ), 1/6) are probabilistic repairs, and no further pr-step need to be applied. Thus, the probabilistic repairs are (db(D1 ), 1/2), (db(D3 ), 1/4), (db(D5 ), 1/12), and (db(D6 ), 1/6). Notice that the sum of the probabilities of the probabilistic repairs is 1. Also, notice that D3 and D5 coincide. They might be replaced by a single probabilistic repair (D0 , 1/4 + 1/12), where D0 = D3 = D5 . However, this does not affect query answering, that is, the probabilistic query answers are the same in both cases. 4 Further developments In light of previous result, in the full version of this paper we develop a polynomial time algorithm to compute approximate query answers where point probability are approximated by probability intervals. Thus, rather than providing query answers of the form (t, p), where t is a tuple and p is its probability, our approximation algorithm gives query answers of the form (t, [`, u]) guaranteeing that p ∈ [`, u]. In order to do that, we define probabilistic universal repairs, i.e., compact represen- tations of all repairs along with their degrees of consistency. This is one of the main tools used by our approximation algorithm, which works as follows: (i) it computes a universal probabilistic repair, (ii) it evaluates the given query over the universal probabilistic repair so as to produce a set of tuples associated with conditions (which keep track of how each tuple is derived), and (iii) it combines tuples’ conditions with probability assign- ments of the universal probabilistic repair to derive the approximate intervals. Further developments should consider more general frameworks with TGDs [10] and logic rules [14,8,9]. References 1. Periklis Andritsos, Ariel Fuxman, and Renée J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, page 30, 2006. 2. Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79, 1999. 3. Leopoldo E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. 4. Leopoldo E. Bertossi, Loreto Bravo, Enrico Franconi, and Andrei Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst., 33(4-5):407–434, 2008. 5. Meghyn Bienvenu, Camille Bourgaux, and Francois Goasdoué. Querying inconsistent de- scription logic knowledge bases under preferred repair semantics. In AAAI, pages 996–1002, 2014. 6. Meghyn Bienvenu and Riccardo Rosati. Tractable approximations of consistent query answe- ring for robust ontology-based data access. In IJCAI, pages 775–781, 2013. 7. Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143–154, 2005. 8. Marco Calautti, Sergio Greco, Cristian Molinaro, and Irina Trubitsyna. Checking termination of logic programs with function symbols through linear constraints. In RuleML, volume 8620 of LNCS, pages 97–111. Springer, 2014. 9. Marco Calautti, Sergio Greco, Cristian Molinaro, and Irina Trubitsyna. Logic program termination analysis using atom sizes. In IJCAI, pages 2833–2839. AAAI Press, 2015. 10. Marco Calautti, Leonid Libkin, and Andreas Pieris. An operational approach to consistent query answering. In PODS, 2018. 11. Sergio Flesca, Filippo Furfaro, and Francesco Parisi. Querying and repairing inconsistent numerical databases. ACM Trans. Database Syst., 35(2):14:1–14:50, 2010. 12. Enrico Franconi, Antonio Laureti Palma, Nicola Leone, Simona Perri, and Francesco Scarcello. Census data repair: a challenging application of disjunctive logic programming. In LPAR, pages 561–578, 2001. 13. Sergio Greco and Cristian Molinaro. Probabilistic query answering over inconsistent databases. Ann. Math. Artif. Intell., 64(2-3):185–207, 2012. 14. Sergio Greco, Cristian Molinaro, and Irina Trubitsyna. Logic programming with function symbols: Checking termination of bottom-up evaluation through program adornments. Theory Pract. Log. Program., 13(4-5):737–752, 2013. 15. Sergio Greco, Cristian Molinaro, and Irina Trubitsyna. Computing approximate query answers over inconsistent knowledge bases. In IJCAI, pages 1838–1846, 2018. 16. Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati, Marco Ruzzi, and Domenico Fabio Savo. Inconsistency-tolerant query answering in ontology-based data access. J. Web Sem., 33:3–29, 2015. 17. Thomas Lukasiewicz, Enrico Malizia, and Cristian Molinaro. Complexity of approximate query answering under inconsistency in datalog+/-. In IJCAI, pages 1921–1927, 2018. 18. Thomas Lukasiewicz, Maria Vanina Martinez, Andreas Pieris, and Gerardo I. Simari. From classical to consistent query answering under existential rules. In AAAI, pages 1546–1552, 2015. 19. Riccardo Rosati. On the complexity of dealing with inconsistency in description logic ontologies. In IJCAI, pages 1057–1062, 2011. 20. Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. 21. Jef Wijsen. Database repairing using updates. ACM Trans. Database Syst., 30(3):722–768, 2005.