A note on computing certain answers to queries over
                  incomplete databases

    Marcelo Arenas1 , Elena Botoeva2 , Egor V. Kostylev3 , and Vladislav Ryzhikov2
        1                            2                       3
         Pontificia Universidad      Free University of          University of Oxford,
        Católica de Chile, Chile   Bozen-Bolzano, Italy                 UK


1    Introduction

The ability to handle incomplete information is fundamental in many areas, including
data integration, data exchange, inconsistent databases, and the Semantic Web. In such
areas, it is often impossible to describe the domain of interest in a full, comprehensive
way, so the aim of an incomplete database is to concisely represent a (potentially infi-
nite) number of completions. Incomplete information is usually represented by allowing
placeholders for unknown values, which are called “nulls”. An incomplete database I
with such nulls then represents a set of databases with complete information, called
representations of I, each of which is obtained by replacing nulls by actual values.
     One of the most important problems associated to databases is query answering.
Since an incomplete database I can have several representations, the aim in this context
is to find the “certain answers” to a query Q, that is, an incomplete database I 0 that
precisely represents the answers to Q over all the representations of I. Unfortunately,
this is possible only in restricted settings, which allow for the so called strong repre-
sentation systems [5]. To overcome this limitation, it was recently proposed in [7] the
idea of computing “certain answers as objects”. It is argued in [7] that we should aim
for finding an object (i.e., an incomplete database) representing the answers over the
representations of an incomplete database in the most informative way. More precisely,
informativeness in this context is formalised as the following preorder on incomplete
databases: I1  I2 if and only if each representation of I2 is a representation of I1 , that
is, the more representations an incomplete database has, the less informative it is. Then
given a query Q over an incomplete database I, the certain answers as objects to Q over
I are defined as a greatest lower bounds under  of the set of answers to Q over all the
representations of I.
     In this paper, we make initial steps in the study of the complexity of the problem of
computing certain answers as objects. In particular, we concentrate on the widely stud-
ied setting of union of conjunctive queries with inequalities and incomplete databases
under the open-world assumption, and we provide positive and negative results. On the
positive side, we show that in this case the certain answers to a query can be com-
puted. On the negative side, we show that this computation can be costly, as the certain
answers can be of exponential size even if we restrict to conjunctive queries with in-
equalities over binary relations.
2   Preliminaries
We assume countably infinite disjoint sets of constants, denoted by Const, and of nulls,
denoted by Null. A relational schema (or just schema) Σ is a set of relation names with
associated arities. An incomplete database I over Σ is an assignment of a k-ary relation
RI over (Const∪Null) to each k-ary name R in Σ. Sets of constants and nulls that occur
in I are denoted by Const(I) and Null(I), respectively, and their union, called an active
domain, is denoted by adom(I). A complete database is an incomplete database without
nulls. In what follows, we use I, I1 , I2 , I 0 , . . . to denote incomplete databases, and D,
D1 , D2 , D0 , . . . to denote complete databases.
      The semantics of an incomplete database I is defined in terms of its representations,
which are complete databases that are considered as possible interpretations of I [1, 5].
To define this semantics, we need the following notion: a valuation on an incomplete
database I is a map v : adom(I) → Const that is the identity on Const(I). Such
a mapping naturally extends to databases, so we write v(I) as well. Then, under the
open-world assumption, the set of representations of I, denoted by [[I]], is defined as
[[I]] = {D | D is a complete database and v(I) ⊆ D for some valuation v of I}. Note
that [[I]] 6= ∅ for every incomplete database I. An incomplete database I2 is at least
as informative as an incomplete database I1 [7], denoted by I1  I2 , if [[I2 ]] ⊆ [[I1 ]].
Notice that  is a preorder, that is, it is reflexive and transitive. Moreover, given an
incomplete database I and a set I of incomplete databases, I is a lower bound for I if
I  I 0 for every I 0 ∈ I, and I is a greatest lower bound for I if I is a lower bound for
I and for every lower bound I 0 for I, it holds that I 0  I. The set of all greatest lower
bounds of I is denoted by glb(I).
      A homomorphism from an incomplete database I1 to an incomplete database I2 is
a function h : adom(I1 ) → adom(I2 ) that is the identity on Const(I1 ) and such that
h(I1 ) ⊆ I2 (we define h(I) analogously to v(I) before). We write I1 7→ I2 if there
is a homomorphism from I1 to I2 , and I1 7→ I, for a set I of incomplete databases,
if I1 7→ I2 for each I2 ∈ I. Then a characterization of  is given by the following
statement [7]:
                                I1  I2 if and only if I1 7→ I2 .                           (1)
 As customary, a query Q over a schema Σ is a function that assigns to each complete
database D of Σ a complete database Q(D). Moreover, given an incomplete database I,
the certain answers as object (or certain answers, for short) to Q over I, denoted by
cert(Q, I), is defined [7] as an incomplete database satisfying
       cert(Q, I) ∈ glb(ans(Q, I)),       where ans(Q, I) = {Q(D) | D ∈ [[I]]}.
Notice that two greatest lower bounds I1 , I2 of ans(Q, I) are equivalent in terms of the
information ordering: I1  I2 and I2  I1 ; thus, we can choose any greatest lower
bound of ans(Q, I) as the certain answer to Q over I.


3   Computing Certain Answers
In this section, we focus on the problem of computing cert(Q, I) for a union of conjunc-
tive queries Q with inequalities (UCQ6= ) and an incomplete database I. It is assumed
that a reader is familiar with the syntax of such queries as well as their semantics (i.e.,
the definition of Q(D)). Complete proofs of the results below can be found in [3].
    The first issue we have to deal with when computing cert(Q, I) is the fact that
ans(Q, I) may be infinite. The following proposition overcomes this limitation by
showing that it is not necessary to consider the entire set ans(Q, I) when computing
cert(Q, I), but instead one can consider just a finite subset of it.

Proposition 1. Let Q be a UCQ6= over a schema Σ and I an incomplete database over
Σ. Then there exists D ⊆ ans(Q, I) such that D is finite and cert(Q, I) is a greatest
lower bound of D.

     In fact, from the proof of Proposition 1, it is possible to obtain a simple exponential-
time algorithm that, given a UCQ6= Q and an incomplete database I, returns a set D ⊆
ans(Q, I) such that cert(Q, I) ∈ glb(D), where each D ∈ D is of linear size in the size
of I. We denote such D by anscan (Q, I).
     Now, given a finite set D, it is known (see, e.g., [8], [6, Proposition 5], or [4]) that
a greatest lower bound of D with respect to the relation 7→ and, by (1), to the relation
 can be computed via the (direct) product of the databases in D, which is defined as
follows. Let I1 and I2 be incomplete databases over a schema Σ. The product I1 × I2
is a database I such that, for each k-ry relation R in Σ,
                                      
      RI = { a1 × b1 , . . . , ak × bk | (a1 , . . . , ak ) ∈ RI1 , (b1 , . . . , bk ) ∈ RI2 };
here a × a = a for all a ∈ Const ∪ Null and a × b is a fresh null na,b for all different
a, b ∈ Const ∪ Null. It immediately follows that Const(I) ⊆ Const(I1 ) ∩ Const(I2 ).
Note that the product is associative Q   and commutative (up to renaming of nulls). For
D = {D1 , . . . , Dn }, we denote by D the product D1 × · · · × Dn . The following
picture illustrates this notion for a schema consisting of a single binary relation:
            a        b         a                   nb,a                na,b
                                                      a                b
                c              b                     nc,b             nc,a
                I1             I2                           I1 × I2
By combining Proposition 1 with the algorithm for computing anscan (Q, I), we obtain
an algorithm for computing a certain answer.

Proposition 2. Let Q be a UCQ6= over a schema
                                          Q   Σ and I and incomplete database
over Σ. Then cert(Q, I) can be computed as anscan (Q, I).

    Finally, in the following theorem, which is the main result of this note, we prove
that the certain answers as objects can be of exponential size, even if we restrict to the
case of conjunctive queries over binary relations.

Theorem 1. There exists a family of incomplete databases In , for n ∈ N, and a con-
junctive query Q with inequality such that the size of the smallest cert(Q, In ) grows
exponentially in the size of In .
Proof. (Sketch) For a natural number m, consider the complete database
                       Zm = {Z(d1 , d2 ), . . . , Z(dm−1 , dm ), Z(dm , d1 )},
where d1 , d2 , . . . , dm are (distinct) constants, encoding a directed cycle of length m.
    Let n be a natural number, and let p1 , . . . , pn be the first n prime numbers. We
define In to be the incomplete database containing
 – disjoint cycles Zp1 , . . . , Zpn ,
 – for a constant ai , i = 1, .., n, the pairs R(ci , ai ) for each constant ci in Zpi ,
 – for constants b1 and bn , the pairs P (a1 , b1 ) and P (an , bn ), and
 – for nulls l1 , . . . , ln−1 , the pairs P (ai , li ) and P (ai+1 , li ) for 1 ≤ i ≤ n − 1.
Below we depict I5 :
           Z2                       Z3                    Z5                    Z7                    Z11


                   R                     R                     R                     R                     R

           a1                       a2                    a3                    a4                    a5

          P            P        P            P        P            P        P            P        P     P

              b1           l1                    l2                    l3                    l4       b5

    Consider the query

  Q(x, y) = ∃z(Z(x, y) ∧ R(x, z) ∧ R(y, z) ∧ Q6= (z)),
                                         where        Q6= (z) = ∃u∃v(P (z, u) ∧ P (z, v) ∧ (u 6= v)).
We can show the following property:
(minimal) each binary relation Cpi = {(c, d) | Z(c, d) ∈ Zpi }, for 1 ≤ i ≤ n, is in
   ans(Q, In ), and any other relation in ans(Q, In ) subsumes one of them.
Let Zn = {Cp1 , . . . , Cpn }. We define a directed cycle of nulls of size p1 × · · · × pn
      Z ∗ = {Z(n1 , n2 ), . . . , Z(np1 ×···×pn −1 , np1 ×···×pn ), Z(np1 ×···×pn , n1 )}.
We can prove two claims.
Claim. Any of certain answers cert(Q, In ) is in glb(Zn ).
Claim. Set Z ∗ is in glb(Zn ).
    The proof of the claims relies on (minimal), while the construction of Z ∗ and the
proof of the second claim rely on the fact that Cpi are cycles of prime length. Indeed,
Z ∗ is (isomorphic to) the product of all Cpi , as the product of a pair of directed cycles
of sizes k1 and k1 , when k1 and k2 are co-prime numbers, is a directed cycle of the size
k1 × k2 .
    From the above claims, it follows that Z ∗ and cert(Q, In ) are homomorphically
equivalent. It is well-known from graph theory that every graph that is homomorphically
equivalent to a directed cycle has to contain that cycle. Hence, the minimal certain
answer cert(Q, In ) is of exponential size in the size of In . (We observe that the size of
In is polynomially bounded in n, as pn < cn2 , for a constant c [2].)                    t
                                                                                         u
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
2. T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, New York, 1976.
   Undergraduate Texts in Mathematics.
3. M. Arenas, E. Botoeva, E. V. Kostylev, and V. Ryzhikov. A note on computing certain answers
   to queries over incomplete databases (extended version). Technical report, 2017. Available at
   http://marenas.sitios.ing.uc.cl/publications/amw17-ext.pdf.
4. C. Chang and H. Keisler. Model Theory. North-Holland, Amsterdam, 1990.
5. T. Imielinski and W. Lipski Jr. Incomplete information in relational databases. J. ACM,
   31(4):761–791, 1984.
6. L. Libkin. Incomplete information and certain answers in general data models. In Proc. of the
   30th ACM Symp. on Principles of Database Systems (PODS), pages 59–70, 2011.
7. L. Libkin. Certain answers as objects and knowledge. Artif. Intell., 232:1–19, 2016.
8. B. ten Cate and V. Dalmau. The product homomorphism problem and applications. In Proc.
   of the 18th International Conference on Database Theory (ICDT 2015), pages 161–176, 2015.