Optimized encodings for Consistent Query Answering via ASP from different perspectives? Marco Manna, Francesco Ricca, and Giorgio Terracina Department of Mathematics, University of Calabria, Italy {manna,ricca,terracina}@mat.unical.it Abstract. A data integration system provides transparent access to different data sources by suitably combining their data, and providing the user with a unified view of them, called global schema. However, source data are generally not under the control of the data integration process, thus integrated data may violate global integrity constraints even in presence of locally-consistent data sources. In this scenario, it may be anyway interesting to retrieve as much consistent information as possible. The process of answering user queries under global constraint viola- tion is called consistent query answering (CQA). Several notions of CQA have been proposed, e.g. depending on whether the information in the database is as- sumed to be sound or complete. This paper provides a contribution in this setting; it uniforms solutions coming from different perspectives under a common core and provides some optimizations designed to identify and isolate the inefficient part of the computation of consistent answers. The effectiveness of the approach is evidenced by experimental results reported in the paper. 1 Introduction The enormous amount of information dispersed over many data sources, often stored in different heterogeneous databases, has boosted in recent years the interest for data inte- gration systems [11]. Roughly speaking, a data integration system provides transparent access to different data sources by suitably combining their data, and providing the user with a unified view of them, called global schema. In many cases the application domain imposes some consistency requirements on integrated data. For instance, it may be at least desirable to impose some integrity con- straints (ICs), e.g. primary/foreign keys, on global relations. However, data stored at the sources may violate global ICs when integrated, since in general source data are not under the control of the data integration process. The standard approach to this problem basically consists of explicitly modifying the data in order to eliminate violation of ICs (data cleaning). However, the explicit repair of data is not always convenient or possible. Therefore, when answering a user query, the system should be able to “virtually repair” relevant data in a declarative fashion (in the line of [11, 2, 4, 7]), in order to provide consistent answers (this task is also called consistent query answering, or CQA). ? This work has been partially supported by the Calabrian Region under PIA (Pacchetti Integrati di Agevolazione industria, artigianato e servizi) project DLVSYSTEM approved in BURC n. 20 parte III del 15/05/2009 - DR n. 7373 del 06/05/2009. 2 M. Manna, F. Ricca, G. Terracina The database community has spent many efforts in this area, relevant research re- sults have been obtained to clarify semantics, decidability, and complexity of data- integration under constraints and, specifically, for CQA. In particular, several notions of CQA have been proposed (see [4] for a survey), e.g. depending on whether the in- formation in the database is assumed to be sound or complete. Basically, the comple- teness assumption coincides with the closed world assumption, where facts missing from the database are considered to be false, whereas the soundness assumption coinci- des with the open world assumption, where information in the database can be enriched with missing information needed to satisfy the global constraints. Other notions weaken above requirements allowing some forms of exceptions [6]. However, while efficient systems are already available for simple data integration scenarios, scalable solutions have not been implemented yet for CQA, mainly due to the fact that handling inconsistencies arising from constraints violation is inherently hard. This paper provides a contribution in this setting. Specifically, it starts from state- of-the-art approaches on different semantic perspectives [2, 4, 7, 12] and revisits them in the light of the experience we gained in the INFOMIX [12] project in order to over- come the limitations experienced in real-world scenarios. In particular, we provide a purely declarative, logic-based approach to the problem, featuring several optimizations designed to “localize” and limit the inefficient part of the computation of consistent an- swers to small fragments of the input, yet allowing interesting classes of queries and constraints on the global schema. The main characteristics of the proposed approach are the following: – Well known semantics assumptions on consistent query answering are supported. – Declarative programming is exploited to uniformly express known semantics via Answer Set Programming (ASP) and to define a common “core” of optimizations. – The problem of consistent query answering is reduced to cautious reasoning on (possibly) disjunctive Datalog programs; this allows to effectively compute the query results precisely, by using a state-of-the-art disjunctive Datalog system. – Large amounts of data, typical of real-world integration scenarios, are handled using as internal query evaluator the DLVDB [15, 14] system, which allows for mass-memory database evaluations and distributed data management features. In order to assess the effectiveness of the proposed optimizations, we carried out experimental activities on a real world scenario comparing their behavior on different semantics. The plan of the paper is as follows. Section 2 formally introduces the data integra- tion model and the meaning of consistent query answering under different semantics. Section 3 first introduces a unified (standard) solution to handle CQA via ASP which uniforms available approaches over different semantics, and then presents some opti- mizations. Section 4 describes the benchmark framework we adopted in the tests and presents obtained results. Finally, in Section 5 we draw some conclusions. 2 Data Integration Framework In this paper, we use the following notation. We always denote, by Γ , a domain alphabet of values; by t, a tuple of values from Γ ; by X, a variable; by x̄ a sequence X1 , . . . , Xn Optimized encodings for CQA via ASP from different perspectives 3 of variables, and by |x̄| = n its length. Given x̄ and x̄0 , we denote by x̄, x̄0 a possible sequence of length |x̄| + |x̄0 | obtained as the juxtaposition of all the variables in x̄ and x̄0 . Also, we denote, by σ(x̄) a conjunction of comparison atoms of the form X ¯ X 0 with ¯ ∈ {≤, =, 6=}, and by ª, the symmetric difference operator between two sets. Finally, we adopt the unique name assumption. A (relational) schema is a pair R = hrelSym(R), intCon(R)i where relSym(R) and intCon(R) are the relational symbols and the integrity constraints (ICs) of R, respectively. The arity of a given relation r ∈ relSym(R) is denoted by arity(r). A database for R is any set of the form ∆ = {r(t) : r ∈ relSym(R) ∧ t is a tuple from Γ ∧ |t| = arity(r)} Let r1 , . . . , rm be relation symbols, the set intCon(R) contains ICs of the form: (c1 ) ∀x̄1 , . . . , x̄m ¬(r1 (x̄1 ) ∧ . . . ∧ rm (x̄m ) ∧ σ(x̄1 , . . . , x̄m )) (c2 ) ∀x̄1 ∃x̄2 r1 (x̄1 , x̄3 ) → r2 (x̄1 , x̄2 ) Constraints of type c1 (where arity(ri ) = |x̄i |, for each i in [1..m]) are denial con- straints (DCs) whereas those of type c2 are inclusion dependencies (INDs). Note that, functional dependencies as well as key dependencies are special cases of denial con- straints. A database ∆ for R is said to be consistent w.r.t R iff all ICs are satisfied. A conjunctive query q(x̄) over R is a formula of the form ∃x̄b1 , . . . , x̄bm r1 (x̄b1 , x̄f1 ) ∧ . . . ∧ rm (x̄bm , x̄fm ) ∧ σ(x̄b1 , x̄f1 . . . , x̄bm , x̄fm ) where x̄ = x̄f1 , . . . , x̄fm are its free variables. In particular, a query is ground if it does not contain any variable; it is quantifier free if all of its variables are free; it is simple conjunctive if it does not contain any repeated relation symbol. Given a database ∆ for R, and a query q(x̄), the answer to q is the set of n-tuples of values ans(q, ∆) = {t : ∆ |= q(t)}. 2.1 The Data Integration Model As usual [11], a data integration system is formalized as a triple I = hG, S, Mi where ¦ G is the global schema. A global database for I is any database for G; ¦ S is the source schema. A source database for I is any database consistent w.r.t S; ¦ M is the global-as-view (GAV) mapping, that associates to each element g in relSym(G) a (union of) conjunctive query over S. Let D be a source database for I. The retrieved global database is ret(I, D) = {g(t) : g ∈ relSym(G) ∧ t ∈ ans(q, D) ∧ q ∈ M(g)} for G satisfying the mapping. Notice that, when source data are combined in a unified schema with its own ICs, the retrieved global database might be inconsistent. 4 M. Manna, F. Ricca, G. Terracina Remark 1. In our setting, w.l.o.g. [1], queries are Datalog rules and a database consists of a set of Datalog facts. u t Example 1. Consider a bank association that desires to unify the databases of two bran- ches. The first (source) database models managers by using a table man(code, name) and employees by a table emp(code, name), where code is a primary key for both ta- bles. The second database stores the same data in table employee(code, name, role). Suppose that the data has to be integrated under a global schema with two tables m(code) and e(code, name), where the global ICs are: – ∀x1 , x2 , x3 , x4 ¬(e(x1 , x2 ) ∧ e(x3 , x4 ) ∧ x1 = x3 ∧ x2 6= x4 ) i.e., code is the primary key of e; – ∀x1 ∃x2 m(x1 ) → e(x1 , x2 ) i.e., an IND imposing that each manager code must be an employee code as well. The mapping is defined as follows: e(Xc , Xn ) :− emp(Xc , Xn ). m(Xc ) :− man(Xc , ). e(Xc , Xn ) :− employee(Xc , Xn , ). m(Xc ) :− employee(Xc , , ‘manager0 ). Assume that, emp stores tuples (‘e1’,‘john’), (‘e2’,‘mary’), (‘e3’,‘willy’), man stores (‘e1’,‘john’), and employee stores (‘e1’,‘ann’,‘manager’), (‘e2’,‘mary’,‘manager’), (‘e3’,‘rose’,‘emp’). It is easy to verify that, although the source databases are consistent w.r.t. local constraints, the global database, obtained by evaluating the mapping, violates the key constraint on e as both john and ann have the same code e1 in table e. u t 2.2 Consistent Query Answering under different semantics In case that ret(I, D) violates ICs, one can still be interested in querying the “con- sistent” information originating from D. One possibility is to “repair” ret(I, D) (by inserting or deleting tuples) in such a way that all the ICs are satisfied. But there are se- veral ways to “repair” ret(I, D) e.g., to satisfy an IND of the form r1 → r2 one might either remove violating tuples from r1 or insert new tuples in r2 . Moreover, the repai- ring strategy depends on the particular semantic assumption made on the data integra- tion system. Semantic assumptions may range from (strict) soundness to (strict) com- pleteness. Roughly speaking, completeness complies with the closed world assumption where missing facts are assumed to be false; on the contrary, soundness complies with the open world assumption where ret(I, D) may be incomplete. We consider the follo- wing semantics sound, complete, CM-complete, exact, loosely-sound, loosely-complete, loosely-exact [2, 4, 6, 7]. More formally, let Σ denote a semantics, and I = hG, S, Mi a data integration system. A global database B is said to be a Σ-repair for ret(I, D) if it is consistent w.r.t. G and one of the following conditions holds: 1. Σ = sound and B ⊇ ret(I, D); 2. Σ = complete and B ⊆ ret(I, D); 3. Σ = CM-complete, B ⊆ ret(I, D), and there is no other consistent global database B0 ⊆ ret(I, D) such that B0 ⊃ B; 4. Σ = exact and B = ret(I, D); Optimized encodings for CQA via ASP from different perspectives 5 5. Σ = loosely-sound and there is no other consistent global database B0 such that B0 ∩ ret(I, D) ⊃ B ∩ ret(I, D); 6. Σ = loosely-complete and there is no other consistent global database B 0 such that B0 − ret(I, D) ⊂ B − ret(I, D); 7. Σ = loosely-exact, and there is no other consistent global database B0 such that B0 ª ret(I, D) ⊂ B ª ret(I, D). Now, given a source database D for I and a semantics Σ, the consistent answer to a query q of arity n w.r.t. D and I, is the set: ansΣ (q, I, D) = {t : t ∈ ans(q, B), for each Σ-repair B} Consistent Query Answering (CQA) is the problem of computing ansΣ (q, I, D). Observe that exact semantics is trivial, since here CQA makes sense only if no global constraint is violated by the retrieved database. The complete semantics always allows the empty database as a repair and, thus, any query will return a void answer. CQA under loosely-complete semantics actually coincides with CQA under the com- plete one [5]. The CM-complete [7] semantics allows a minimal number of deletions in each repair to avoid empty repairs, if possible, but does not allow any insertion. The sound semantics, allowing insertions only, fails when some denial constraint is violated, whereas the loosely-sound one overcomes this problem by allowing a minimal amount of deletions. Finally, the loosely-exact semantics combines the loosely-sound and the loosely-complete ones; in fact it allows both insertions and deletions but minimizes the symmetric difference between the retrieved database and the repairs. Example 2. By following Example 1, the retrieved global database admits exactly the following repairs under the CM-complete semantics: B1r = {e(‘e2’,‘mary’), e(‘e1’,‘john’), e(‘e3’,‘willy’), m(‘e1’), m(‘e2’)} B2r = {e(‘e2’,‘mary’), e(‘e1’,‘john’), e(‘e3’,‘rose’), m(‘e1’), m(‘e2’)} B3r = {e(‘e2’,‘mary’), e(‘e1’,‘ann’), e(‘e3’,‘willy’), m(‘e1’), m(‘e2’)} B4r = {e(‘e2’,‘mary’), e(‘e1’,‘ann’), e(‘e3’,‘rose’), m(‘e1’), m(‘e2’)} Finally, the query m(X), asking for the list of manager codes, has both e1 and e2 as consistent answers. u t 2.3 Properties of CQA We next recall some well known definitions as well as some important CQA properties. Let r be a relation symbol with n = arity(r), and key(r) be a set of indexes from I = {1, . . . , n}. A key dependency (KD) for r consists of a set of DCs, exactly one for each index k belonging to I − key(r), of the form ∀x̄1 , x̄2 ¬(r(x̄1 ) ∧ r(x̄2 ) ∧ σ(x̄1 , x̄2 ) ∧ x̄k1 6= x̄k2 ) where |x̄1 | = |x̄2 | = n, and σ(x̄1 , x̄2 ) is a conjunction of comparison atoms of the form x̄i1 = x̄i2 , for each i ∈ key(r). The set key(r) is the primary-key of r. 6 M. Manna, F. Ricca, G. Terracina Let d be an IND of the form ∀x̄1 ∃x̄2 r1 (x̄1 , x̄3 ) → r2 (x̄1 , x̄2 ). We denote by univInd(d, ri ) the set of (projection) indexes from {1, . . . , arity(ri )} induced by the positions of the variables x̄1 in ri , for each i in [1..2]. Moreover, d is said to be non-key- conflicting (NKC) [5] if and only if univInd(d, r2 ) ⊆ key(r2 ). In the following, we recall some known results about computability and complexity of CQA under different semantic assumptions (as usual in both ASP and Data Integra- tion fields, we only refer to data complexity [16]). Proposition 1. [5] In general, CQA is undecidable if Σ ∈ {sound, loosely-sound, loosely-exact}. Proposition 2. [5] CQA with KDs and NKC-INDs, is decidable. In detail it is coNP- complete under the sound and loosely-sound semantics and Π2p -complete under loosely- exact semantics. Proposition 3. [7] CQA is decidable if Σ = CM-complete. In detail, it is Π2p -complete, in general; and coNP-complete in case of acyclic INDs. In light of the above propositions and the considerations outlined in Section 2, we concentrate our analysis on the following semantics with the specified restrictions: (i) CM-complete with acyclic INDs; (ii) loosely-sound with only primary-key-dependencies as DCs, and NKC-INDs. In fact, these are the decidable cases having a complexity at most in coNP. Recall, moreover, that the sound semantics fails when some denial con- straint is violated. 3 Computation of CQA via ASP In this section, we show how to exploit Answer Set Programming (ASP) [9, 10] for efficiently computing consistent answers to user queries under different semantic as- sumptions. ASP is a powerful logic programming paradigm allowing (in its general form) for disjunction in rule heads [13] and nonmonotonic negation in rule bodies. In the following, we assume that the reader is familiar with ASP. The suitability of ASP for implementing CQA has been already recognized in the literature [11, 2, 4, 7]. The general approaches are based on the following idea: produce an ASP program P having an answer set for each repair, so that the problem of compu- ting CQA corresponds to cautious reasoning on P . We next introduce two algorithms (for those cases having a complexity at most in coNP) that take as input a data integration system and a query, and produce an ASP program that can be exploited for computing CQA. After showing a general encoding, we propose an “optimized” method that is able to produce complexity-wise easier pro- grams that are even optimal according to the complexity classification of constraints and queries in case of CM-complete semantics. 3.1 Standard Solution Given a data integration system I = hG, S, Mi and a query q, we next give the general algorithm for building an ASP program, say Πcqa , and a new query, say qcqa , solving the Optimized encodings for CQA via ASP from different perspectives 7 CQA problem via ASP. This program is created by “rewriting” each IC in intCon(G). We report separately the rules created for each kind of IC, and we detail the creation of additional auxiliary rules. Denial Constraints. Let Σ ∈ {CM-complete, loosely-sound}. For each DC of the form ∀x̄1 , . . . , x̄m ¬(g1 (x̄1 ) ∧ . . . ∧ gm (x̄m ) ∧ σ(x̄1 , . . . , x̄m )) in intCon(G), insert the following rule into Πcqa : g1c (x̄1 ) ∨ · · · ∨ gm c (x̄m ) :− g1 (x̄1 ), · · · , gm (x̄m ), σ(x̄1 , . . . , x̄m ). Inclusion dependencies. Let Σ = CM-complete. For each IND in intCon(G) of the form ∀x̄1 ∃x̄2 g1 (x̄1 , x̄3 ) → g2 (x̄1 , x̄2 ), with π2 = univInd(d, g2 ), insert the following rules into Πcqa : g2r-π2 (x̄1 ) :− g2r (x̄1 , ). g c (x̄ , x̄ ) :− g (x̄ , x̄ ), not g r-π2 (x̄ ). 1 1 3 1 1 3 2 1 Repaired Relations. Let Σ ∈ {CM-complete, loosely-sound}. For each relation symbol g ∈ relSym(G), insert the following rule into Πcqa : g r (x̄) :− g(x̄), not g c (x̄). Cleaning Step. Remove useless body literals (i.e., negative literals whose complemen- tary literals do not occur in any head) and possibly duplicated rules. Query rewriting. Build qcqa from q as follows by considering each atom of the form g(x̄) where x̄ = X1 , . . . , Xn : 1. Replace g(x̄) by g r (x̄) only if there is an IC where g occurs. 2. If Σ = loosely-sound, then apply the unfolding technique described in [6]. Theorem 1. Let Σ ∈ {CM-complete, loosely-sound}. Given a retrieved global data- base ret(I, D) for a global schema G and a query q over G, t ∈ ansΣ (q, I, D) iff qcqa (t) is a cautious consequence of the ASP program ret(I, D) ∪ Πcqa . Proof (Sketch). If Σ = CM-complete, then we claim that Πcqa always allows to find only and all the repairs, exactly one per model. Let Br be a repair. In the following, we describe how to obtain a model containing for each relation, say g, exactly only and all the tuples of g that do not appear in B r . We will collect such tuples in the new relation g c , while we collect in g r only and all the tuples of g appearing in B r . First of all, observe that, since we are considering only acyclic INDs, there must necessarily be some relation which is not involved in the left-hand side of any IND. Thus, for each relation, say g, exhibiting such a property: 1. By the disjunctive rules (if any) involving g (and containing in the right-hand side only relations of G), we guess (exploiting the minimality of answer sets semantics) the minimal set of tuples of g, collected in g c , that do not appear in B r . 2. Contrariwise, only and all the tuples of g that have to be in Br are collected in the repaired relation g r and computed by the rule g r (x̄) :− g(x̄), not g c (x̄). 8 M. Manna, F. Ricca, G. Terracina 3. Next, for each IND of the form ∀x̄1 ∃x̄2 g1 (x̄1 , x̄3 ) → g(x̄1 , x̄2 ) (involving g in the right-hand side), we use the rule g1c (x̄1 , x̄3 ) :− g1 (x̄1 , x̄3 ), not g r-π2 (x̄1 ). for deciding which tuples of g1 cannot (necessarily) appear in B r (g r-π2 is a projection of g r ). Note that, these tuples are univocally identified after fixing g r . 4. Finally, we can reiterate from step 1, by considering g1 instead of g (the tuples of g1c may only be augmented by some disjunctive rule involving g1 ). It is clear that, by construction, Πcqa has exactly one answer set per repair. Finally, the query is reorganized to exploit the repaired relations, and cautious reasoning does the rest. Analogous considerations can be done when Σ = loosely-sound. Still disjunctive rules guess a minimal set of tuples to be “canceled” whereas unfolding allows to deal with INDs (cfr. [6]). u t Remark 2. It is worth noting that our general encoding belongs to the head-cycle free class of ASP programs [3] for which cautious reasoning is coNP-complete. In fact, each predicate appearing in the body of any disjunctive rule is only defined by the mapping which, clearly, does not depend on any canceled predicates. u t In our ongoing example, the program (and the new query built from q(X) :− m(X).) obtained by applying the above algorithm, under the CM-complete semantics, is: ec (Xc , Xn ) ∨ ec (Xc0 , Xn0 ) :− e(Xc , Xn ), e(Xc0 , Xn0 ), Xc = Xc0 , Xn 6= Xn0 . er (Xc , Xn ) :− e(Xc , Xn ), not ec (Xc , Xn ). er-{1} (Xc ) :− er (Xc , ). mc (Xc ) :− m(Xc ), not er-{1} (Xc ). mr (Xc ) :− m(Xc ), not mc (Xc ). qcqa (Xc ) :− mr (Xc ). When the obtained program is evaluated on the database we obtain four answer sets. It can be verified that, all the answer sets contain mr (‘e1’) and mr (‘e2’), (i.e., they are cautious consequences of Πcqa ) and, thus, ‘e1’ and ‘e2’ are the consistent answers to the original query. 3.2 Optimized Solution The algorithm reported in the previous section is a general solution for solving the CQA problem, but, in several cases, more efficient ASP programs can be produced. First of all, note that the general algorithm blindly considers all the ICs on the global schema, including those that have no effect on the specific query. Consequently, redundant logic rules might be produced which slow down program evaluation. Note also that, there are a number of cases in which, according to [5, 7], the complexity of CQA stays in P ; but disjunctive programs, for which cautious reasoning is a hard task [8], are generated in presence of denial constraints. This means that, the evaluation of the produced logic programs might be much more expensive than required in those “easy” cases. In more detail, depending on the types of both schema constraints and queries, CQA is tractable in the following cases: Optimized encodings for CQA via ASP from different perspectives 9 – Quantifier-free queries and either: • denial constraints only, or • at most one key per relation if Σ = CM-complete; – Simple Conjunctive queries and either: • at most one functional dependency per relation if Σ = CM-complete, or • at most one key per relation if Σ = CM-complete; – Conjunctive queries and: • inclusion dependencies only; In the following, we provide an optimized version of the standard algorithm that is capable of identifying tractable (sub-)cases for a generic input query and that produces ASP programs for CQA which are complexity-wise optimal. Given a global schema G and a query q, the optimized algorithm analyzes both intCon(G) and q, and: (i) singles out only the ICs affecting query results, (ii) employs positive non-disjunctive rules for dealing with DCs in known tractable cases, and (iii) exploits a strategy that replaces unfolding in case of INDs if Σ = loosely-sound. Spe- cifically, consider the directed labeled graph Gc = hrelSym(G), Ei, called constraint graph, built in such a way that arc (g, g 0 , c) ∈ E if and only if one of the following holds: – c is a DC in intCon(G) involving both g and g 0 ; or – c is an IND in intCon(G) of the form g → g 0 if Σ = CM-complete. – c is an IND in intCon(G) of the form g 0 → g if Σ = loosely-sound. After analyzing and classifying the query (to recognize whether it is either quantifier- free, or simple conjunctive, or conjunctive), the constraint graph Gc is visited several times starting from each relation in the query. The visited nodes of Gc correspond to the relations involved in the query process, whereas the arcs traversed during the visits correspond to the constraints that might influence the query results. Thus, the corre- sponding relations and constraints are marked to be considered for further processing; unmarked constraints will be discarded. At the same time, the algorithm tags each mar- ked constraint to be either easy or hard, depending on whether the above-reported con- ditions on the complexity of CQA are satisfied or not. In particular, the tag associated to a given constraint is set (or updated) during each visit depending on query kind, num- ber and type of encountered constraints. The tag of each constraint c corresponding to a traversed arc e is set to “easy” if both (i) c was not previously tagged as “hard”, and (ii) at least one of the following conditions holds (otherwise c is tagged as “hard”): 1. if the query is quantifier-free, and either a. all the arcs belonging to the connected component of Gc containing e are labe- led by denial constraints, or b. Σ = CM-complete and all the nodes belonging to the connected component of Gc containing e have at most one outgoing arc labeled by a key constraint 2. if the query is simple-conjunctive, Σ = CM-complete, and either a. all the nodes belonging to the connected component of Gc containing e have at most one outgoing arc labeled by a functional dependency constraint, or b. all the nodes belonging to the connected component of Gc containing e have at most one outgoing arc labeled by a key constraint 10 M. Manna, F. Ricca, G. Terracina 3. if the query is conjunctive, and a. all the arcs belonging to the connected component of Gc containing e are labe- led by inclusion dependencies At this point, the ASP program Πcqa (and qcqa , accordingly) can be modified as follows: Step 1. For each DC in intCon(G) marked “easy”, insert the following m rules (instead of the ones containing disjunction in the standard solution) into Πcqa : gic (x̄i ) :− g1 (x̄1 ), · · · , gm (x̄m ), σ(x̄1 , . . . , x̄m ). ∀i ∈ [1..m] Step 2. If Σ = loosely-sound, for each marked IND of the form ∀x̄1 ∃x̄2 g1 (x̄1 , x̄3 ) → g2 (x̄1 , x̄2 ), with π2 = univInd(d, g2 ), insert the following rules into Πcqa . – g2π2 (x̄1 ) :− g2r (x̄1 , x̄2 ). – g2π2 (x̄1 ) :− g π (x̄1 , x̄0 ). ∀d0 of the form g → g2 with π2 ⊆ univInd(d0 , g2 ) and π = univInd(d0 , g) Step 3. For the other constraints, add the corresponding rules of the standard solution only if they are marked. Step 4. Build qcqa from q as follows: – If Σ = CM-complete, then replace each g(x̄) by g r (x̄) whenever g occurs in some marked constraint. – If Σ = loosely-sound, and there is an IND having g in its right hand side, and π is the set of indexes such that i ∈ π iff Xi is a free-variable or a variable involved in some join w.r.t. q, and there is a g π in Πcqa , then replace g(x̄) by g π (x̄0 ) where x̄0 is the subsequence of x̄ induced by π. First of all, note that the new algorithm produces only non-redundant rules (i.e. the rules encoding constraints that influence the query answering process). Moreover, it is worth noticing that the rules produced by step 1, corresponding to “easy” constraints are non-disjunctive.1 This is a pay-as-you-go technique where the usage of complex evaluation algorithms is limited to either intractable cases or to cases in which tractabi- lity results are not known. Rules introduced by steps 2 and 4, when Σ=loosely-sound, substitute the unfolding approach of [6] for handling INDs, and reduce, in general, the arity of involved predicates. Moreover, note that the same query may involve both easy and hard constraints, but disjunctive rules are used only for the hard ones. For example, suppose that we add to the global schema of our ongoing example a new binary relation c(code, name) representing the list of customers, and that code is a key for c. Moreover, suppose that we ask for the query q(Xc , Xn ) :− c(Xc , Xn ), e(Xc , Xn ). retrieving the customers that are also employees of the bank. In this case, the query is quantifier free, and only denial constraints are marked (under the CM-complete seman- tics) visiting the constraint graph. Indeed, it is easy to see that there is no way to reach 1 In the “easy” cases the original database can be repaired by simply removing all the conflicting tuples. This can be done because each repair can be obtained from the original database by removing a single tuple among the ones that violate the same constraint. When rules of this kind are employed the answer sets do not correspond to repairs, but CQA still corresponds to cautious reasoning. Optimized encodings for CQA via ASP from different perspectives 11 m in the constraint graph starting from the query atoms since the arc generated for the IND ∀x1 ∃x2 m(x1 ) → e(x1 , x2 ) goes from m to e. This means that condition 1.a is verified, all marked constraints are “easy”, and the produced program is: ec (Xc , Xn ) :− e(Xc , Xn ), e(Xc , Xn0 ), Xn 6= Xn0 . ec (Xc , Xn0 ) :− e(Xc , Xn ), e(Xc , Xn0 ), Xn 6= Xn0 . cc (Xc , Xn ) :− c(Xc , Xn ), c(Xc , Xn0 ), Xn 6= Xn0 . cc (Xc , Xn0 ) :− c(Xc , Xn ), c(Xc , Xn0 ), Xn 6= Xn0 . er (Xc , Xn ) :− e(Xc , Xn ), not ec (Xc , Xn ). cr (Xc , Xn ) :− c(Xc , Xn ), not cc (Xc , Xn ). qcqa (Xc , Xn ) :− cr (Xc , Xn ), er (Xc , Xn ). where we have also simplified joins of the form X = X 0 . Note that the obtained pro- gram is non-disjunctive and stratified and it can be evaluated in polynomial time. In this case, the only answer set of the program contains the consistent answers to the original query. 4 Experiments In this section we present some of the experiments we carried out to assess the effec- tiveness of our approach to consistent query answering. The datalog evaluation engine used in the experiment is the DLVDB system [15, 14] coupled, via ODBC, with a MS SQLServer DBMS where input data were stored. 4.1 Data Set We exploited the real-world data integration framework developed in the INFOMIX project (IST-2001-33570) [12] which integrates data from a real university context. In particular, considered data sources were available at the University of Rome “La Sa- pienza”. These comprise information on students, professors, curricula and exams in various faculties of the university. This data is dispersed over several databases in va- rious (autonomous) administration offices. There are about 35 data sources in the application scenario, which are mapped into 14 global schema relations with about 20 GAV mappings and 29 integrity constraints. We call this data set Infomix in the following. Besides the original source database instance (which takes about 16Mb on DBMS), we obtained bigger instances artificially. Specifically, we generated a number of copies of the original database; each copy is disjoint from the other ones but maintains the same data correlations between instances as the original database. This has been carried out by mapping each original attribute value to a new value having a copy-specific prefix. Then, we considered two further datasets, namely Infomix-x-10 and Infomix-x-50 storing 10 copies (for a total amount of 160Mb of data) and 50 copies (800Mb) of the original database, respectively; clearly, in both cases one of the copies is the original database itself. 12 M. Manna, F. Ricca, G. Terracina Q1 Q2 Q3 Q4 Optimized Query co-NP co-NP P P Evaluation Query Class SC UC QF SC Involved Constraints K+I K+I I D+I Query arity 3 2 3 0 N. of source tuples infomix 17266 3749 17725 37831 infomix-x-10 172660 37490 177250 378310 infomix-x-50 863300 1873950 886250 1891550 Table 1. Summary of tested queries. 4.2 Tested Queries As previously pointed out, standard rewriting for CQA makes the time complexity of query evaluation to be in coNP in most cases; however, our optimizations allow in many relevant cases to simplify the rewriting in such a way that the complexity of the evaluation of the corresponding program can be in P. In order to carry out a comprehensive performance analysis, we designed a set of queries spanning over the following perspectives: – As for the computational complexity perspective we designed queries whose: • evaluation complexity with standard rewriting stays in coNP and evaluation complexity with optimized rewriting stays in P; • evaluation complexity with standard rewriting stays in coNP and evaluation complexity with optimized rewriting remains in coNP; – As for the constraints perspective we designed queries involving: • Arbitrary Denial constraints only (D in the following) • Key constraints only (K in the following) • Inclusion dependencies only (I in the following) • Arbitrary Denial and Inclusion dependencies (D+I in the following) • Key constraints and Inclusion dependencies (K+I in the following) – As for the query class perspective we designed: • Unrestricted Conjunctive queries (UC in the following) • Quantifier-free queries (QF in the following) • Simple Conjunctive queries (SC in the following) We designed and ran several queries. Due to space constraints we concentrate here on some of them. Table 1 summarizes their characteristics. Here Number of source tuples indicates the number of tuples of all source relations involved by the query. 4.3 Compared Methods In order to assess the characteristics of the proposed optimizations, we measured the execution time of each query with both the standard and the optimized rewriting. Spe- cifically, we tested: (i) complete semantics with standard rewriting; (ii) complete seman- tics with optimized rewriting; (iii) loosely sound semantics with standard rewriting; (iv) Optimized encodings for CQA via ASP from different perspectives 13 loosely sound semantics with optimized rewriting. In Figure 1, the first group of four graphs compares (i) and (ii); the second group of four graphs compares (iii) and (iv), whereas the last group puts together (ii), and (iv). 4.4 Results and discussion All tests have been carried out on an Intel Core 2 Duo T7300, 2.0 GHz, with 2 Gb Ram, running Windows 7 Operating System. We set a time limit of 30 minutes after which query execution has been killed. Results obtained for tested queries and methods (showing times in seconds) are illustrated in Figure 1. The bar for a method is absent in the graphs if query answering time was higher than the limit. From the analysis of the figures and the characteristics of the queries reported in Table 1, we may draw the following observations: The optimized rewriting provides important performance improvements in both Q1 , Q2 , and Q4 . The only exception is for query Q3 under the complete semantics, for which no optimization was possible and, consequently, standard and optimized rewritings coincide. Performance improvements of the optimized rewriting w.r.t. the standard one have been registered up to 80%2 for query Q4 , 76% for Q1 , 60% for Q3 , and 47% for Q2 . The proposed optimizations always reduce execution times and do not introduce computational overhead. As for the comparison among the semantics, we can observe that generally the com- plete semantics allows for faster response times. The only exception is on Query Q3 which comprises inclusion dependencies only. In this case, the rewriting for the com- plete semantics must choose the tuples to be deleted, whereas the loosely sound seman- tics can just work on the original data set. Finally, it is worth noticing that the scaling of the optimized algorithms over the three data sets is generally better than the standard ones. Observe that it can be interesting to evaluate execution times of all rewritings with the addition of magic sets applied in cascade on them. Our intuition is that magic sets optimization and our optimizations are complementary and, consequently, their benefits may be summed up if the query involves some constant. Clearly, it would be important to evaluate the impact of the overhead introduced by this approach on the overall re- sponse time. 5 Conclusion and Ongoing Work In this paper we presented an approach that allows to efficiently handle consistent query answering under different semantics and integrity constraints. The effectiveness of the approach is based on optimized algorithms capable of identifying both tractable queries and portions of the queries that may be treated efficiently. The approach is part of a complete system for data integration based on ASP. Its query evaluator engine allows to carry out queries directly on the databases where data reside, even in an ASP context. Results of our experimental activity demonstrate the effectiveness of the approach. As far as ongoing work, we are investigating for more optimizations that can be included in the algorithms to further improve query answering performances. 2 This information has been computed over the unhalted queries only. 14 M. Manna, F. Ricca, G. Terracina Fig. 1. Query Evaluation Execution Times. Optimized encodings for CQA via ASP from different perspectives 15 References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases: The Logical Level. Addison- Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995. 2. M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query answering in inconsistent databases. Theory and Practice of Logic Programming, 3(4):393–424, 2003. 3. R. Ben-Eliyahu and R. Dechter. Propositional Semantics for Disjunctive Logic Programs. Annals of Mathematics and Artificial Intelligence, 12(1-2):53–87, March 1994. 4. L. E. Bertossi, A. Hunter, and T. Schaub, editors. Inconsistency Tolerance, volume 3300 of Lecture Notes in Computer Science, Berlin / Heidelberg, January 2005. Springer. 5. A. Calı̀, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In PODS’03 – Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 09 - 11, 2003, San Diego, California, pages 260–271, New York, NY, USA, 2003. ACM. 6. A. Calı̀, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI’03 – Proceedings of the 18th international joint conference on Artificial intelligence, August 09 - 15, 2003, Acapulco, Mexico, pages 16–21, San Francisco, CA, USA, August 2003. Morgan Kaufmann Publishers Inc. 7. J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple dele- tions. Information and Computation, 197(1-2):90–121, February 2005. 8. T. Eiter, G. Gottlob, and H. Mannila. Disjunctive datalog. ACM Transactions on Database Systems (TODS), 22(3):364–418, September 1997. 9. M. Gelfond and V. Lifschitz. The Stable Model Semantics for Logic Programming. In R. Kowalski and K. Bowen, editors, ICLP/SLP’88 – Proceedings of the 5th International Conference and Symposium on Logic Programming, pages 1070–1080. MIT Press, 1988. 10. M. Gelfond and V. Lifschitz. Classical Negation in Logic Programs and Disjunctive Databa- ses. New Generation Computing, 9(3-4):365–385, August 1991. 11. M. Lenzerini. Data integration: a theoretical perspective. In PODS’02 – Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 03 - 05, 2002, Madison, Wisconsin, pages 233–246, New York, NY, USA, 2002. ACM. 12. N. Leone, G. Greco, G. Ianni, V. Lio, G. Terracina, T. Eiter, W. Faber, M. Fink, G. Gottlob, R. Rosati, D. Lembo, M. Lenzerini, M. Ruzzi, E. Kalka, B. Nowicki, and W. Staniszkis. The INFOMIX system for advanced integration of incomplete and inconsistent data. In SIGMOD’05 – Proceedings of the 2005 ACM SIGMOD international conference on Mana- gement of data, June 14 - 16, 2005, Baltimore, Maryland, pages 915–917, New York, NY, USA, 2005. ACM. 13. J. Minker. On Indefinite Data Bases and the Closed World Assumption. In D. W. Loveland, editor, CADE’82 – Proceedings 6th Conference on Automated Deduction, June 79, 1982, New York, USA, volume 138 of Lecture Notes in Computer Science, pages 292–308, Berlin / Heidelberg, June 1982. Springer. 14. G. Terracina, E. De Francesco, C. Panetta, and N. Leone. Enhancing a DLP System for Ad- vanced Database Applications. In D. Calvanese and G. Lausen, editors, RR 2008 – Procee- dings of the second International Conference on Web Reasoning and Rule Systems , October 31 - November 1, 2008, Karlsruhe, Germany, volume 5341 of Lecture Notes in Computer Science, pages 119–134, Berlin / Heidelberg, October 2008. Springer. 15. G. Terracina, N. Leone, V. Lio, and C. Panetta. Experimenting with recursive queries in database and logic programming systems. Theory and Practice of Logic Programming, 8(2):129–165, March 2008. 16. M. Y. Vardi. The Complexity of Relational Query Languages (Ext. Abstract). In STOC’82 – Proceedings of the 14th annual ACM symposium on Theory of computing, May 05 - 07, 1982, San Francisco, California, USA, pages 137–146, New York, NY, USA, 1982. ACM.