A note on computing certain answers to queries over incomplete databases Marcelo Arenas1 , Elena Botoeva2 , Egor V. Kostylev3 , and Vladislav Ryzhikov2 1 2 3 Pontificia Universidad Free University of University of Oxford, Católica de Chile, Chile Bozen-Bolzano, Italy UK 1 Introduction The ability to handle incomplete information is fundamental in many areas, including data integration, data exchange, inconsistent databases, and the Semantic Web. In such areas, it is often impossible to describe the domain of interest in a full, comprehensive way, so the aim of an incomplete database is to concisely represent a (potentially infi- nite) number of completions. Incomplete information is usually represented by allowing placeholders for unknown values, which are called “nulls”. An incomplete database I with such nulls then represents a set of databases with complete information, called representations of I, each of which is obtained by replacing nulls by actual values. One of the most important problems associated to databases is query answering. Since an incomplete database I can have several representations, the aim in this context is to find the “certain answers” to a query Q, that is, an incomplete database I 0 that precisely represents the answers to Q over all the representations of I. Unfortunately, this is possible only in restricted settings, which allow for the so called strong repre- sentation systems [5]. To overcome this limitation, it was recently proposed in [7] the idea of computing “certain answers as objects”. It is argued in [7] that we should aim for finding an object (i.e., an incomplete database) representing the answers over the representations of an incomplete database in the most informative way. More precisely, informativeness in this context is formalised as the following preorder on incomplete databases: I1  I2 if and only if each representation of I2 is a representation of I1 , that is, the more representations an incomplete database has, the less informative it is. Then given a query Q over an incomplete database I, the certain answers as objects to Q over I are defined as a greatest lower bounds under  of the set of answers to Q over all the representations of I. In this paper, we make initial steps in the study of the complexity of the problem of computing certain answers as objects. In particular, we concentrate on the widely stud- ied setting of union of conjunctive queries with inequalities and incomplete databases under the open-world assumption, and we provide positive and negative results. On the positive side, we show that in this case the certain answers to a query can be com- puted. On the negative side, we show that this computation can be costly, as the certain answers can be of exponential size even if we restrict to conjunctive queries with in- equalities over binary relations. 2 Preliminaries We assume countably infinite disjoint sets of constants, denoted by Const, and of nulls, denoted by Null. A relational schema (or just schema) Σ is a set of relation names with associated arities. An incomplete database I over Σ is an assignment of a k-ary relation RI over (Const∪Null) to each k-ary name R in Σ. Sets of constants and nulls that occur in I are denoted by Const(I) and Null(I), respectively, and their union, called an active domain, is denoted by adom(I). A complete database is an incomplete database without nulls. In what follows, we use I, I1 , I2 , I 0 , . . . to denote incomplete databases, and D, D1 , D2 , D0 , . . . to denote complete databases. The semantics of an incomplete database I is defined in terms of its representations, which are complete databases that are considered as possible interpretations of I [1, 5]. To define this semantics, we need the following notion: a valuation on an incomplete database I is a map v : adom(I) → Const that is the identity on Const(I). Such a mapping naturally extends to databases, so we write v(I) as well. Then, under the open-world assumption, the set of representations of I, denoted by [[I]], is defined as [[I]] = {D | D is a complete database and v(I) ⊆ D for some valuation v of I}. Note that [[I]] 6= ∅ for every incomplete database I. An incomplete database I2 is at least as informative as an incomplete database I1 [7], denoted by I1  I2 , if [[I2 ]] ⊆ [[I1 ]]. Notice that  is a preorder, that is, it is reflexive and transitive. Moreover, given an incomplete database I and a set I of incomplete databases, I is a lower bound for I if I  I 0 for every I 0 ∈ I, and I is a greatest lower bound for I if I is a lower bound for I and for every lower bound I 0 for I, it holds that I 0  I. The set of all greatest lower bounds of I is denoted by glb(I). A homomorphism from an incomplete database I1 to an incomplete database I2 is a function h : adom(I1 ) → adom(I2 ) that is the identity on Const(I1 ) and such that h(I1 ) ⊆ I2 (we define h(I) analogously to v(I) before). We write I1 7→ I2 if there is a homomorphism from I1 to I2 , and I1 7→ I, for a set I of incomplete databases, if I1 7→ I2 for each I2 ∈ I. Then a characterization of  is given by the following statement [7]: I1  I2 if and only if I1 7→ I2 . (1) As customary, a query Q over a schema Σ is a function that assigns to each complete database D of Σ a complete database Q(D). Moreover, given an incomplete database I, the certain answers as object (or certain answers, for short) to Q over I, denoted by cert(Q, I), is defined [7] as an incomplete database satisfying cert(Q, I) ∈ glb(ans(Q, I)), where ans(Q, I) = {Q(D) | D ∈ [[I]]}. Notice that two greatest lower bounds I1 , I2 of ans(Q, I) are equivalent in terms of the information ordering: I1  I2 and I2  I1 ; thus, we can choose any greatest lower bound of ans(Q, I) as the certain answer to Q over I. 3 Computing Certain Answers In this section, we focus on the problem of computing cert(Q, I) for a union of conjunc- tive queries Q with inequalities (UCQ6= ) and an incomplete database I. It is assumed that a reader is familiar with the syntax of such queries as well as their semantics (i.e., the definition of Q(D)). Complete proofs of the results below can be found in [3]. The first issue we have to deal with when computing cert(Q, I) is the fact that ans(Q, I) may be infinite. The following proposition overcomes this limitation by showing that it is not necessary to consider the entire set ans(Q, I) when computing cert(Q, I), but instead one can consider just a finite subset of it. Proposition 1. Let Q be a UCQ6= over a schema Σ and I an incomplete database over Σ. Then there exists D ⊆ ans(Q, I) such that D is finite and cert(Q, I) is a greatest lower bound of D. In fact, from the proof of Proposition 1, it is possible to obtain a simple exponential- time algorithm that, given a UCQ6= Q and an incomplete database I, returns a set D ⊆ ans(Q, I) such that cert(Q, I) ∈ glb(D), where each D ∈ D is of linear size in the size of I. We denote such D by anscan (Q, I). Now, given a finite set D, it is known (see, e.g., [8], [6, Proposition 5], or [4]) that a greatest lower bound of D with respect to the relation 7→ and, by (1), to the relation  can be computed via the (direct) product of the databases in D, which is defined as follows. Let I1 and I2 be incomplete databases over a schema Σ. The product I1 × I2 is a database I such that, for each k-ry relation R in Σ,  RI = { a1 × b1 , . . . , ak × bk | (a1 , . . . , ak ) ∈ RI1 , (b1 , . . . , bk ) ∈ RI2 }; here a × a = a for all a ∈ Const ∪ Null and a × b is a fresh null na,b for all different a, b ∈ Const ∪ Null. It immediately follows that Const(I) ⊆ Const(I1 ) ∩ Const(I2 ). Note that the product is associative Q and commutative (up to renaming of nulls). For D = {D1 , . . . , Dn }, we denote by D the product D1 × · · · × Dn . The following picture illustrates this notion for a schema consisting of a single binary relation: a b a nb,a na,b a b c b nc,b nc,a I1 I2 I1 × I2 By combining Proposition 1 with the algorithm for computing anscan (Q, I), we obtain an algorithm for computing a certain answer. Proposition 2. Let Q be a UCQ6= over a schema Q Σ and I and incomplete database over Σ. Then cert(Q, I) can be computed as anscan (Q, I). Finally, in the following theorem, which is the main result of this note, we prove that the certain answers as objects can be of exponential size, even if we restrict to the case of conjunctive queries over binary relations. Theorem 1. There exists a family of incomplete databases In , for n ∈ N, and a con- junctive query Q with inequality such that the size of the smallest cert(Q, In ) grows exponentially in the size of In . Proof. (Sketch) For a natural number m, consider the complete database Zm = {Z(d1 , d2 ), . . . , Z(dm−1 , dm ), Z(dm , d1 )}, where d1 , d2 , . . . , dm are (distinct) constants, encoding a directed cycle of length m. Let n be a natural number, and let p1 , . . . , pn be the first n prime numbers. We define In to be the incomplete database containing – disjoint cycles Zp1 , . . . , Zpn , – for a constant ai , i = 1, .., n, the pairs R(ci , ai ) for each constant ci in Zpi , – for constants b1 and bn , the pairs P (a1 , b1 ) and P (an , bn ), and – for nulls l1 , . . . , ln−1 , the pairs P (ai , li ) and P (ai+1 , li ) for 1 ≤ i ≤ n − 1. Below we depict I5 : Z2 Z3 Z5 Z7 Z11 R R R R R a1 a2 a3 a4 a5 P P P P P P P P P P b1 l1 l2 l3 l4 b5 Consider the query Q(x, y) = ∃z(Z(x, y) ∧ R(x, z) ∧ R(y, z) ∧ Q6= (z)), where Q6= (z) = ∃u∃v(P (z, u) ∧ P (z, v) ∧ (u 6= v)). We can show the following property: (minimal) each binary relation Cpi = {(c, d) | Z(c, d) ∈ Zpi }, for 1 ≤ i ≤ n, is in ans(Q, In ), and any other relation in ans(Q, In ) subsumes one of them. Let Zn = {Cp1 , . . . , Cpn }. We define a directed cycle of nulls of size p1 × · · · × pn Z ∗ = {Z(n1 , n2 ), . . . , Z(np1 ×···×pn −1 , np1 ×···×pn ), Z(np1 ×···×pn , n1 )}. We can prove two claims. Claim. Any of certain answers cert(Q, In ) is in glb(Zn ). Claim. Set Z ∗ is in glb(Zn ). The proof of the claims relies on (minimal), while the construction of Z ∗ and the proof of the second claim rely on the fact that Cpi are cycles of prime length. Indeed, Z ∗ is (isomorphic to) the product of all Cpi , as the product of a pair of directed cycles of sizes k1 and k1 , when k1 and k2 are co-prime numbers, is a directed cycle of the size k1 × k2 . From the above claims, it follows that Z ∗ and cert(Q, In ) are homomorphically equivalent. It is well-known from graph theory that every graph that is homomorphically equivalent to a directed cycle has to contain that cycle. Hence, the minimal certain answer cert(Q, In ) is of exponential size in the size of In . (We observe that the size of In is polynomially bounded in n, as pn < cn2 , for a constant c [2].) t u References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. 2. T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, New York, 1976. Undergraduate Texts in Mathematics. 3. M. Arenas, E. Botoeva, E. V. Kostylev, and V. Ryzhikov. A note on computing certain answers to queries over incomplete databases (extended version). Technical report, 2017. Available at http://marenas.sitios.ing.uc.cl/publications/amw17-ext.pdf. 4. C. Chang and H. Keisler. Model Theory. North-Holland, Amsterdam, 1990. 5. T. Imielinski and W. Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984. 6. L. Libkin. Incomplete information and certain answers in general data models. In Proc. of the 30th ACM Symp. on Principles of Database Systems (PODS), pages 59–70, 2011. 7. L. Libkin. Certain answers as objects and knowledge. Artif. Intell., 232:1–19, 2016. 8. B. ten Cate and V. Dalmau. The product homomorphism problem and applications. In Proc. of the 18th International Conference on Database Theory (ICDT 2015), pages 161–176, 2015.