Query Rewriting under Non-Guarded Rules

              Andrea Calı̀2,1,3 , Georg Gottlob1,2 and Andreas Pieris1
                   1
                 Computing Laboratory, University of Oxford, UK
     2
       Oxford-Man Institute of Quantitative Finance, University of Oxford, UK
    3
      Department of Information Systems and Computing, Brunel University, UK
                        firstname.lastname @comlab.ox.ac.uk


         Abstract. We address the problem of answering conjunctive queries
         over knowledge bases, specified by sets of first-order sentences called
         tuple-generating dependencies (TGDs). This problem is highly relevant
         to query optimization, information integration, and ontological reason-
         ing. We present a rewriting algorithm, inspired by resolution in Logic
         Programming, which is capable of dealing with an expressive class of
         TGDs, called sticky TGDs. Given a set of sticky TGDs and a conjunc-
         tive query, the algorithm produces a first-order query that can be then
         evaluated over the data, providing the correct answers. In this way, we
         establish that conjunctive query answering under sticky TGDs is in the
         highly tractable class ac0 in the data complexity.


1   Introduction

Answering queries over knowledge bases is a fundamental problem in knowledge
representation. It has been employed in schema integration, information integra-
tion, and service discovery (see, e.g., [11]). In the database field, the problem of
answering queries on incomplete data under constraints (a.k.a. dependencies) is
tightly related to the one of query containment [12]; in fact, the two problems
are mutually reducible (see, e.g., [3]). A milestone paper in the field is [17]; it
addresses conjunctive query containment under functional and inclusion depen-
dencies, a class of constraints that is highly relevant in practice. The results
of [17] were later extended in [7].
    Other works consider different classes of constraints. For instance, [2, 9] con-
sider constraints tailored to express Entity-Relationship schemas, and [23] deals
with expressive constraints based on Answer Set Programming. [6] introduces
and studies first-order constraints derived from a “light” version of F-logic [18],
called F-logic Lite. Another relevant formalisms for knowledge bases, especially
in the Semantic Web, is the DL-lite family; in [10, 22] tractable query answering
techniques under DL-lite knowledge bases are presented.
    A more general approach is adopted in the recent papers [3, 4]; in such
works, instead of focusing on a specific logic formalism, a general family of
languages is considered, called Datalog± , that captures and extends a large
class of knowledge representation formalisms in the literature, in particular
the DL-lite family. The constraints of [3, 4] are inspired by the guarded frag-
ment of first-order logic [16], and among them are the languages of linear
Datalog± , guarded Datalog± , and weakly-guarded Datalog± . It is important to
notice that Datalog± -rules are relevant tuple-generating dependencies (TGDs)
and equality-generating dependencies (EGDs), which are generalizations of in-
clusion dependencies and functional dependencies, respectively; we shall not
deal with EGDs in this paper. TGDs are first-order constraints that express
an implication from a conjunction of atoms to another. For instance, the TGD
∀E∀G director (E), leads(E, G) → ∃D manages(E, D) states that each director
that leads some group necessarily manages at least one department.
    A large part of the aforementioned works on query answering and con-
tainment are based on the well-known notion of chase, widely used for de-
pendency implication and query containment under dependencies [19, 17]. The
chase is a procedure that enforces the satisfaction of TGDs by the addition of
suitable tuples. For example, consider an instance D consisting of the atoms
{director (d), leads(d, g)}; the above TGD is not satisfied by D, therefore the
chase adds a suitable atom manages(d, z), where z is a so-called labeled null, i.e.,
a placeholder for an unknown value. Several works have recently studied query
evaluation problems for settings where the chase terminates and thus generates
a finite solution. This is the typical setting, for example, of data exchange [15],
where the materialization (and therefore the finiteness) of the chase is a require-
ment. To this aim, syntactic restrictions on TGDs, e.g., weak acyclicity [13],
have been proposed. Little was known about the cases where the chase is (in
general) infinite, with the notable exception of the classical work of Johnson and
Klug [17].
     In this work, along the lines of [3, 4], we also adopt a general approach. We
consider an expressive class of TGDs, called sticky TGDs (STGDs), first intro-
duced in [5] as an addition to the Datalog± family. The chase under STGDs is
not guaranteed to terminate. Stickiness is formally defined in Section 3 by an ef-
ficiently testable condition involving variable-marking. An equivalent definition,
which gives a better understanding of such class of constraints, is as follows. For
every instance D, assume that in the construction of the chase of D under a
set Σ of TGDs, we apply a TGD σ ∈ Σ that has a variable X appearing more
than once in its body; assume also that X maps (via homomorphism) on the
symbol z, and that by virtue of this application the atom a is introduced. Then,
z appears in a and in all atoms resulting from some chase derivation involving
a, “sticking” to them (hence the name “sticky TGDs”) [5].
    Our main contribution in this paper is an algorithm, similar to resolution and
partial evaluation in Logic Programming, for query answering under STGDs.
Our approach is based on query rewriting, that is: given a query, such query is
rewritten into another query that encodes the information about the constraints
and that, evaluated on the data, returns the correct answers. We consider the
class of conjunctive queries (CQs), also called select-project-join queries in the
database literature. The algorithm, inspired by the ones in [2, 8], takes as input
a set of STGDs and a conjunctive query, and it computes a (finite) rewriting
expressed in union of conjunctive queries (UCQs).
    Since our rewriting algorithm is sound and complete (established by making
use of the notion of chase), terminates under STGDs, and the rewritten query
is a UCQs, answering conjunctive queries in this case is in the highly tractable
class ac0 w.r.t. the size of the data1 , as already stated in [5], where no details
about a rewriting technique are given. We recall that ac0 is the complexity class
of recognizing words in languages defined by constant-depth Boolean circuits
with an (unlimited fan-in) and and or gates (see, e.g., [20]). Notice that a
UCQs can be immediately translated into an SQL query, therefore the rewriting
algorithm is especially suitable for real-world applications based on relational
DBMSs. Notice that the size of the rewritten query is worst-case exponential
w.r.t. the size of the given query and the size of the given set of constraints.
While query answering remains highly tractable in data complexity, this leaves
space for further optimizations. Generating “smaller” rewritings is one of the
directions currently pursued to improve efficiency of query processing in this
context [21].
    In [5] it is shown, similarly to what is done in [4], that STGDs enriched with
negative constraints of the form ∀X ϕ(X) → ⊥, where ϕ(X) is a conjunction of
atoms and ⊥ is the constant false, and with EGDs (key dependencies, in this
case) that do not interact with the STGDs, are capable of expressing languages
in the DL-lite family. Therefore, all tractability results on conjunctive query
answering in DL-lite are special cases of our results.


2     Preliminaries
In this section we recall some basics on databases, queries, TGDs and the TGD
chase procedure.
    General. We define the following pairwise disjoint (infinite) sets of symbols:
(i) a set Γ of constants (constitute the “normal” domain of a database), (ii)
a set Γf of labeled nulls (used as placeholders for unknown values, and thus
can be also seen as variables), and (iii) a set Γv of variables (used in queries
and dependencies). Different constants represent different values (unique name
assumption), while different nulls may represent the same value. A lexicographic
order is defined on Γ ∪ Γf , such that every value in Γf follows all those in Γ .
    A relational schema R (or simply schema) is a set of relational symbols (or
predicates), each with its associated arity. We write r/n to denote that the pred-
icate r has arity n. A position r[i] (in a schema R) is identified by a predicate
r ∈ R and its i-th argument (or attribute). A term t is a constant, null, or
variable. An atomic formula (or simply atom) has the form r(t1 , . . . , tn ), where
r/n is a relation, and t1 , . . . , tn are terms. For an atom a, we denote as dom(a),
var (a) and pred (a) the set of its terms, the set of its variables and its predi-
cate, respectively. These notations naturally extends to sets and conjunctions of
1
    The complexity w.r.t. the data size is known as data complexity, and it is relevant
    since in practice the size of the data is much larger than the size of the schema.
atoms. An atom is called ground if all of its terms are constants of Γ . Conjunc-
tions of atoms are often identified with the sets of their atoms.
     A substitution from one set of symbols S1 to another set of symbols S2
is a function h : S1 → S2 defined as follows: (i) ∅ is a substitution (empty
substitution), (ii) if h is a substitution, then h ∪ {X → Y } is a substitution,
where X ∈ S1 and Y ∈ S2 , and h does not already contain some X → Z with
Y 6= Z. If X → Y ∈ h we write h(X) = Y . A homomorphism from a set of
atoms A1 to a set of atoms A2 , both over the same schema R, is a substitution
h : dom(A1 ) → dom(A2 ) such that: (i) if t ∈ Γ , then h(t) = t, and (ii) if
r(t1 , . . . , tn ) is in A1 , then h(r(t1 , . . . , tn )) = r(h(t1 ), . . . , h(tn )) is in A2 . The
notion of homomorphism naturally extends to conjunctions of atoms.
     Databases and Queries. A database (instance) D for a schema R is a
(possibly infinite) set of atoms of the form r(t) (a.k.a. facts), where r/n ∈ R
and t ∈ (Γ ∪ Γf )n . We denote as r(D) the set {t | r(t) ∈ D}.
     A conjunctive query (CQ) q of arity n over a schema R, written as q/n, has
the form q(X) = ∃Yϕ(X, Y), where ϕ(X, Y) is a conjunction of atoms over R,
X and Y are sequences of variables or constants in Γ , and the length of X is n.
ϕ(X, Y) is called the body of q, denoted as body(q). A Boolean CQ (BCQ) is a
CQ of zero arity. The answer to a CQ q/n of the form q(X) = ∃Yϕ(X, Y) over
a database D, denoted as q(D), is the set of all n-tuples t ∈ Γ n for which there
exists a homomorphism h : X ∪ Y → Γ ∪ Γf such that h(ϕ(X, Y)) ⊆ D and
h(X) = t. A BCQ has only the empty tuple hi as possible answer, in which case
it is said that has positive answer. Formally, a BCQ has positive answer over D,
denoted as D |= q, iff hi ∈ q(D), or, equivalently, q(D) 6= ∅.
     Dependencies. Given a schema R, a tuple-generating dependency (TGD)
σ over R is a first-order formula ∀X∀Y ϕ(X, Y) → ∃Z ψ(X, Z), where ϕ(X, Y)
and ψ(X, Z) are conjunctions of atoms over R, called the body and the head of
σ, denoted as body(σ) and head (σ), respectively. Henceforth, to avoid notational
clutter, we will omit the universal quantifiers in TGDs. Such σ is satisfied by
a database D for R iff, whenever there exists a homomorphism h such that
h(ϕ(X, Y)) ⊆ D, there exists an extension h′ of h (i.e., h′ ⊇ h) such that
h′ (ψ(X, Z)) ⊆ D.
     We now define the notion of query answering under TGDs. Given a database
D for R, and a set Σ of TGDs over R, the models of D w.r.t. Σ, denoted
as mods(D, Σ), is the set of all databases B such that B |= D ∪ Σ. The an-
swer to a CQ q w.r.t. D and Σ, denoted as ans(q, D, Σ), is the set {t | t ∈
q(B) for each B ∈ mods(D, Σ)}. The answer to a BCQ q w.r.t. D and Σ is
positive, denoted as D ∪ Σ |= q, iff ans(q, D, Σ) 6= ∅. Note that query answering
under general TGDs is undecidable [1], even when the schema and the set of
TGDs are fixed [3].
     We recall that the two problems of CQ and BCQ evaluation under TGDs are
logspace-equivalent [3]. Moreover, it is easy to see that the query output tuple
problem (as a decision version of CQ evaluation) and BCQ evaluation are ac0 -
reducible to each other. Henceforth, we thus focus only on the BCQ evaluation
problem. All complexity results carry over to the other problems.
    The TGD Chase. The chase procedure (or simply chase) is a fundamental
algorithmic tool introduced for checking implication of dependencies [19], and
later for checking query containment [17]. Informally, the chase is a process of
repairing a database w.r.t. a set of dependencies so that the resulted database
satisfies the dependencies. We shall use the term chase interchangeably for both
the procedure and its result. The chase works on an instance through the so-
called TGD chase rule. The TGD chase rule comes in two different equivalent
fashions: oblivious and restricted [3], where the restricted one repairs TGDs only
when they are not satisfied. In the sequel, we focus on the oblivious one for better
technical clarity. The TGD chase rule is the building block of the chase.
    TGD Chase Rule. Consider a database D for a schema R, and a TGD
σ = ϕ(X, Y) → ∃Z ψ(X, Z) over R. If σ is applicable to D, i.e., there exists
a homomorphism h such that h(ϕ(X, Y)) ⊆ D, then: (i) define h′ ⊇ h such
that h′ (Zi ) = zi for each Zi ∈ Z, where zi ∈ Γf is a “fresh” labeled null not
introduced before, and following lexicographically all those introduced so far,
and (ii) add to D the set of atoms in h′ (ψ(X, Z)) if not already in D.
    Given a database D and set of TGDs Σ, the chase algorithm for D and Σ
consists of an exhaustive application of the TGD chase rule in a breadth-first
fashion, which leads as result to a (possibly infinite) chase for D and Σ, denoted
as chase(D, Σ). The formal definition of the chase algorithm is omitted due to
space reasons and can be found, for instance, in [3]. We denote as chase [k] (D, Σ)
the initial segment of the chase for D and Σ obtained by applying k > 0 times
the TGD chase rule.

Example 1 ([5]). Let R = {r, s}. Consider the set Σ of TGDs over R constituted
by the TGDs σ1 = r(X, Y ), s(Y ) → ∃Z r(Z, X) and σ2 = r(X, Y ) → s(X). Let
D be the database for R consisting of the two atoms r(a, b) and s(b). During
the construction of chase(D, Σ) we first apply σ1 , and we add the atom r(z1 , a),
where z1 is a “fresh” null. Moreover, σ2 is applicable and we add the atom s(a).
Now, σ1 is applicable and the atom r(z2 , z1 ) is obtained, where z2 is a “fresh”
null. Also, σ2 is applicable and the atom s(z1 ) is generated. It is clear that there
is no finite chase. Satisfying both σ1 , σ2 would require to construct the infinite
instance D ∪ {r(z1 , a), s(a), r(z2 , z1 ), s(z1 ), r(z3 , z2 ), s(z2 ), . . .}.

   It is well-known that the (possibly infinite) chase for D and Σ is a universal
model of D w.r.t. Σ, i.e., for each database B ∈ mods(D, Σ), there exists a
homomorphism from chase(D, Σ) to B [15, 14]. Using this fact it can be shown
that for a BCQ q, D ∪ Σ |= q iff chase(D, Σ) |= q.


3   Sticky TGDs
In this section we recall the class of sticky TGDs introduced in [5]. As we shall
see, query answering under sticky TGDs is highly tractable in data complexity.

Definition 1 ([5]). Consider a set Σ of TGDs over a schema R. We mark the
variables that occur in the body of the TGDs of Σ according to the following
marking procedure. First, for each TGD σ ∈ Σ and for each variable V in
body(σ), if there exists an atom a in head (σ) such that V does not appear in
a, then we mark each occurrence of V in body(σ). Now, we apply exhaustively
(i.e., until a fixpoint is reached) the following step: for each TGD σ ∈ Σ, if a
marked variable in body(σ) appears at position π, then for every TGD σ ′ ∈ Σ
(including the case σ ′ = σ), we mark each occurrence of the variables in body(σ ′ )
that appear in head (σ ′ ) at the same position π. We say that Σ is a set of sticky
TGDs (STGDs) if there is no TGD σ ∈ Σ such that a marked variable occurs
in body(σ) more than once.
Example 2 ([5]). Consider the following relational schema R:
                      dept(Dept Id, Mgr Id),
                      emp(Emp Id, Dept Id, Area, Project Id),
                      runs(Dept Id, Project Id),
                      in area(Project Id, Area),
                      external (Ext Id, Area, Project Id).
The fact that each department has as a manager an employee can be represented
by the TGD
                   dept(V, W ) → ∃X∃Y ∃Z emp(W, X, Y, Z).
The fact that each employee works on some project that falls into his/her area
of specialization ran by his/her department can be represented by the TGD
         emp(V, W, X, Y ) → ∃Z dept(W, Z), runs(W, Y ), in area(Y, X).
Finally, the fact that for each project ran by some department there exists an
external cooperator, specialized on the area of the project, that works on it can
be represented by the TGD
               runs(W, X), in area(X, Y ) → ∃Z external (Z, Y, X).
Let Σ be the set constituted by the above three TGDs. According to the variable-
marking procedure in Definition 1, we mark the variables as follows (we mark
variables with a cap, e.g., X̂):
        dept(V̂ , Ŵ ) → ∃X∃Y ∃Zemp(Ŵ , X̂, Ŷ , Ẑ)
        emp(V̂ , Ŵ , X̂, Ŷ ) → ∃Z dept(Ŵ , Ẑ), runs(Ŵ , Y ), in area(Y, X)
        runs(Ŵ , X), in area(X, Y ) → ∃Z external (Z, Y, X)
Clearly, for each TGD σ ∈ Σ, there is no marked variable that occurs in body(σ)
more than once. Therefore, Σ is a set of STGDs. It is important to observe that
the set Σ is neither weakly-acyclic [15], nor guarded [3], nor weakly-guarded [3].
In fact, the class of STGDs is incomparable to the above three known classes of
TGDs.
    We recall that query answering under (general) TGDs is equivalent to query
answering under TGDs with single-atom heads [3]. This is established by provid-
ing a logspace transformation from (general) TGDs to TGDs with single-atom
heads. Since this transformation preserves the syntactic condition of STGDs,
henceforth we assume w.l.o.g. that every TGD has just one atom in its head.
4    Query Answering by Rewriting

We say that a class C of TGDs is first-order rewritable, henceforth abbreviated as
FO-rewritable, iff for every set Σ of TGDs in C, and for every BCQ q, there exists
a first-order query qΣ such that, for every database D, D ∪ Σ |= q iff D |= qΣ [4].
Since answering first-order queries is in the class ac0 in data complexity [24], it
immediately follows that for FO-rewritable TGDs, query answering is in ac0 in
data complexity.
    In this section we establish that STGDs are FO-rewritable by providing a
query rewriting algorithm, inspired by resolution in Logic Programming, that
allow us to reformulate the given BCQ q into a union of BCQs Qr , that encodes
the information about the given STGDs, and then evaluate Qr over the given
database to obtain the correct answer. Formally, a union of BCQs over a schema
R is a set Q of BCQs over R, where each q ∈ Q uses the same predicate symbol
in the head. Q is said to have positive answer over a database D, denoted as
D |= Q, iff there exists q ∈ Q such that D |= q. Note that a union of BCQs is
trivially a first-order query. Before we proceed further we give some preliminary
definitions.
    Given a BCQ q we say that a certain variable is bound in q if it occurs
more than once in body(q), otherwise is called unbound. A bound term in q is
either a bound variable or a constant of Γ . Given two atoms a and b we say
that unify if there exists a substitution γ, called unifier for a and b, such that
γ(a) = γ(b). A most general unifier (MGU) is a unifier for a and b, denoted as
γa,b , such that for each other unifier γ for a and b there exists a substitution γ ′
such that γ = γ ′ ◦ γa,b . Note that if two atoms unify, then there exists a MGU.
Furthermore, the MGU for two atoms is unique (modulo variable renaming).
    We now define the important notion of applicability of a TGD to an atom
that occurs in the body of a BCQ.

Definition 2. Let R be a relational schema. Consider a TGD σ over R, a BCQ
q over R, and an atom a ∈ body(q). We say that σ is applicable to a if the
following conditions are satisfied: (i) a and head (σ) unify (recall that we assume
single-atom head TGDs); we denote as γa,σ the MGU for a and head (σ), (ii) if
the term at position π in a is either a constant of Γ , or a bound variable in q
that occurs in some atom of body(q) other than a, then the variable at position
π in head (σ) occurs also in body(σ), and (iii) if a bound variable in q occurs
only in a at positions π1 , . . . , πm , for m > 2, then either the variable at position
πi in head (σ), for each i ∈ {1, . . . , m}, occurs also in body(σ), or at positions
π1 , . . . , πm in head (σ) we have the same existentially quantified variable.

   The Algorithm rewrite. We are now ready to present the algorithm rewrite,
shown in Figure 1. The rewriting of a BCQ is computed by exhaustively applying
two steps: minimization and rewriting, corresponding to steps (a) and (b) of the
algorithm. In what follows we give an informal description of these two steps.
   Minimization Step. If there exists a BCQ q ∈ Qr such that body(q) contains
two atoms a and b that unify, then the algorithm computes the BCQ q ′ by
Algorithm rewrite
Input: Schema R, set Σ of STGDs over R, BCQ q
over R
Output: Rewritten query Qr over R

   Qr := {q}; Qcan  r := ∅; i := 0; Vars := var (q);
   repeat
      Q′ := Qr ; Q′′ := Qcan  r ;
      for each q ∈ Q′ do
      (a) for each a, b ∈ body(q) do
          if a and b unify then
              q ′ := γa,b (q)
              Qr := Qr ∪ {q ′ }
                                    ′
              Qcan
                 r := Qr ∪ τ (q );
                           can

      (b) for each a ∈ body(q) do
              for each σ ∈ Σ do
                   if σ is applicable to a then
                       i := i + 1 `                ´
                       ρ := rename    ` a,σi , Vars
                                      γ
                       q ′ := ρ γa,σi q[a/body(σ i )]
                                `                    ´´

                       Qr := Qr ∪ {q ′ }
                                             ′
                       Qcan
                          r := Qr ∪ τ (q );
                                  can

   until Q′′ = Qcanr ;
   return Qr ;


                                           Fig. 1. The Algorithm rewrite.


applying the particular MGU γa,b for a and b on q, where γa,b is obtained as
follows. Let a = r(V1 , . . . , Vn ) and b = r(W1 , . . . , Wn ). Construct the atoms
a′ and b′ from a and b as follows: for an arbitrary k ∈ {1, . . . , n}, (i) if both
Vk , Wk ∈  / Vars, then replace Wk with Vk , and (ii) if Vk ∈ Vars and Wk ∈    / Vars
(resp., Vk ∈  / Vars and Wk ∈ Vars), then replace Wk with Vk (resp., Vk with
Wk ). Recall that Vars is the set of variables of the given query. We define γa,b =
γa′ ,b′ . γa,b guarantees that any new variables, introduced during the rewriting
process, that appear in q ′ are unbound; its existence follows from the fact that
the TGDs are sticky. The canonical form of q ′ is computed by applying the
transformation τ , and then added to Qcan    r  which represents the canonical form
of Qr . In particular, τ replaces the unbound variables with the special term “⋆”.
     Rewriting Step. During the i-th application of the rewriting step, if there
exists a TGD σ and a BCQ q ∈ Qr containing an atom a such that σ is applicable
to a, then the algorithm computes the BCQ q ′ = ρ γa,σi q[a/body(σ i )] , that
                                                                              

is, the BCQ obtained from q by replacing a with body(σ i ), where σ i is obtained
by replacing each variable V that occurs in σ with V i , then applying the MGU
γa,σi for a and head (σ i ), and finally applying the renaming substitution ρ, where
ρ is constructed from γa,σi according to the procedure rename (see Figure 2).
By considering σ i (instead of σ) actually we rename, using the integer i, the
variables of the TGD σ which is applicable to a. This is needed to ensure that
the set of variables that appear in the rewritten query, and the set of variables
that occur in TGDs are disjoint, and also to avoid undesirable clutters between
new variables introduced during different applications of the rewriting step. The
application of the renaming substitution ρ is needed to guarantee that any new
variables in q ′ are unbound. Finally, the canonical form of q ′ is added to Qcan r .
Procedure rename
Input: Substitution γ, set Vars of variables
Output: Renaming substitution ρ

   µ := ∅;
   for each assertion V → U ∈ γ do
      if V ∈ Vars and U ∈   / Vars then
          if there is no assertion V ′ → U in µ then
               µ := µ ∪ {V → U };
   ρ := µ−1 ;
   return ρ;


                                         Fig. 2. The Procedure rename.


Example 3. Let R = {p, r, s}. Consider the set Σ of STGDs over R consti-
tuted by the TGDs σ1 = r(X, Y ) → ∃Z s(X, Z, Z) and σ2 = s(X, Y, Z) →
p(X, Z, Z). Consider also the BCQ q0 = ∃A∃B∃C p(A, B, C), s(A, B, B) over
R. Clearly, σ2 is applicable to a = p(A, B, C) ∈ body(q0 ). During the first ap-
plication of the rewriting step we get the BCQ q1 = ρ(γa,σ21 (q0 [a/body(σ21 )])),
where γa,σ21 = {X 1 → A, B → Z 1 , C → Z 1 }, and ρ = {Z 1 → B}. Hence,
q1 = ∃A∃B∃Y 1 s(A, Y 1 , B), s(A, B, B); note that Y 1 is a new variable. The
canonical form of q1 is obtained by replacing the unbound variable Y 1 with the
special term “⋆”. Observe now that the atoms s(A, Y 1 , B) and s(A, B, B) in
the body of q1 unify. Hence, the minimization step is applied and we get the
BCQ q2 = ∃A∃B s(A, B, B)}; note that τ (q2 ) = ∃B s(⋆, B, B). Finally, observe
that σ1 is applicable to the atom a = s(A, B, B) ∈ body(q2 ). During the second
application of the rewriting step we get the BCQ q3 = ρ′ (γa,σ12 (q2 [a/body(σ12 )])),
where γa,σ12 = {X 2 → A, Z 2 → B}, and ρ′ = ∅. Hence, q3 = ∃A∃Y 2 r(A, Y 2 )};
clearly, τ (q3 ) = r(⋆, ⋆).
   Soundness and Completeness. In what follows we show that the al-
gorithm rewrite produces a so-called perfect rewriting, i.e., a rewritten query
that produces the correct answers under STGDs when evaluated over the given
database. To this aim we need the notion of rewrite graph.
Definition 3. Consider a set Σ of STGDs expressed over a schema R, and a
BCQ q over R. The rewrite graph of rewrite(R, Σ, q), denoted as RG(R, Σ, q),
is a triple hV, E, λi, where V is the node set, E is the edge set, and λ is a
labeling function V → rewrite(R, Σ, q). For the BCQ q add a node v in V , and
let λ(v) = q. The rewrite graph is constructed inductively as follows. Suppose
that the minimization step is applied because the atoms a, b ∈ body(λ(v)) unify,
where v ∈ V . Add in V a node u and let λ(u) = q ′ , where q ′ is the constructed
BCQ, and if there exists a parent node of v, denoted as par (v), then add in
E an arc par (v)y u (if u is not already in the graph). Now, suppose that the
rewriting step is applied because the STGD σ is applicable to a, where σ ∈ Σ and
a ∈ body(λ(v)), for some node v ∈ V . Add in V a node u and let λ(u) = q ′ , where
q ′ is the constructed BCQ, and also add in E an arc v y u (if u is not already
in the graph). The depth of a node v of RG(R, Σ, q), written as depth(v), is the
length of the (unique) path from a node with zero in-degree, i.e., a node with no
incoming edges, to v; a node with zero in-degree has depth zero.
    Observe that the rewrite graph RG(R, Σ, q), as defined above, is a forest
comprised by at most |body(q)| disjoint rooted trees, i.e., trees in which one node
has been designated as the root node. Each rooted tree has a root node v (which
may be the only node in the tree) such that λ(v) is a BCQ obtained by applying
the minimization step on some query λ(u), where u is a node with zero depth.
Intuitively, the existence of a vertex v with depth d > 0 implies that the BCQ
λ(v) was obtained during the rewriting process starting from q and applying
either the rewriting step or the minimization step at least d times.
    We continue to establish two useful technical results.
Lemma 1. Let R be a relational schema. Consider a database D for R, a set
Σ of STGDs over R, and a BCQ q over R. Let RG(R, Σ, q) = hV, E, λi. If
chase [i] (D, Σ) |= λ(v), for i > 0 and v ∈ V , then there exists j > i such that
chase [j] (D, Σ) |= q.
Proof (sketch). By induction on the depth d > 0 of node v.

Lemma 2. Let R be a relational schema. Consider a database D for R, a set
Σ of STGDs over R, and a BCQ q over R. Let RG(R, Σ, q) = hV, E, λi. If
chase [i] (D, Σ) |= λ(v), for i > 0 and v ∈ V , then there exists v ′ ∈ V with
depth(v ′ ) > depth(v) such that D |= λ(v ′ ).
Proof (sketch). By induction on the number of applications of the chase rule.

   We are now ready, by using the above two technical lemmas, to establish that
the algorithm rewrite produces a perfect rewriting, i.e., is sound and complete.
Theorem 1. Let R be a relational schema. Consider a database D for R, a
set Σ of STGDs over R, and a BCQ q over R. Then, D ∪ Σ |= q iff D |=
rewrite(R, Σ, q).
Proof. For notational convenience, let Qr = rewrite(R, Σ, q). The claim is equiv-
alent to D |= Qr iff chase(D, Σ) |= q. By hypothesis, there exists a BCQ p ∈ Qr
such that D |= p, or, equivalently, chase [0] (D, Σ) |= p. Let RG(R, Σ, q) =
hV, E, λi be the rewrite graph of Qr , and suppose that p = λ(v), for some v ∈ V .
By Lemma 1 we get that there exists i > 0 such that chase [i] (D, Σ) |= q, which
implies that chase(D, Σ) |= q, as needed. Conversely, there exists a homomor-
phism h such that h(body(q)) ⊆ chase(D, Σ). Since h(body(q)) is finite it follows
that there exists i > 0 such that h(body(q)) ⊆ chase [i] (D, Σ), or, equivalently,
chase [i] (D, Σ) |= q. Suppose that q = λ(v), where v ∈ V . By definition of the
rewrite graph, depth(v) = 0. By Lemma 2 we get that there exists u ∈ V with
depth(u) > 0 such that D |= λ(u). Since λ(u) ∈ Qr we conclude that D |= Qr .

   Termination. We finally establish termination of the algorithm rewrite un-
der STGDs. The following lemma shows that new variables, introduced during
the rewriting process, occur at most once in the BCQs constructed during the
rewriting process.
Lemma 3. Let R be a relational schema. Consider a set Σ of STGDs over R, a
BCQ q over R, and a BCQ q ′ ∈ rewrite(R, Σ, q). Each variable in var (q ′ )\var (q)
is unbound in q ′ .

Proof (sketch). For notational convenience, let Qr = rewrite(R, Σ, q). We denote
by Qir the initial segment of Qr obtained starting from q and applying i > 0 times
either the minimization step or the rewriting step of the algorithm rewrite. The
proof is by induction on i. Note that the use of the particular MGU (instead of
applying an arbitrary one) during the minimization step of the algorithm rewrite,
and also the application of the renaming substitution (see Figure 2) during the
rewriting step are crucial.

   We are now ready, by exploiting Lemma 3, to establish termination of the
algorithm rewrite under STGDs.

Theorem 2. Let R be a relational schema. Consider a set Σ of STGDs over R,
and a BCQ q over R. The algorithm rewrite with input R, Σ and q terminates.

Proof. It suffices to show that the maximum number of BCQs that can appear
in the canonical form of the rewritten query rewrite(R, Σ, q), denoted as Qc , is
finite. By Lemma 3, if a new variable, introduced during the rewriting process,
occurs in a BCQ p ∈ rewrite(R, Σ, q), then is unbound in p. Thus, by definition
of τ , none of these new variables can appear in Qc since they are replaced with
the special term “⋆”. Hence, the set of terms used to construct Qc corresponds to
the set of variables and constants occurring in the BCQ q plus the special term
“⋆”; thus, is finite. Moreover, only predicates of R can appear in Qc which is
also finite. The claim follows since the number of atoms that can be constructed
using a finite set of terms and a finite set of predicates is finite.

   By combining Theorems 1 and 2 we immediately get the main complexity
results on STGDs (first stated in [5], where many more results are shown).

Corollary 1 ([5]). The class of STGDs is FO-rewritable.

   Since answering first-order queries is in the class ac0 in data complexity [24],
we immediately get the following result.

Corollary 2 ([5]). BCQ answering under STGDs is in ac0 in data complexity.

Acknowledgments. The research leading to these results has received funding
from the European Research Council under the European Community’s Sev-
enth Framework Programme (FP7/2007-2013)/ERC grant agreement no. 246858
– DIADEM. The authors also acknowledge support by the EPSRC project
“Schema Mappings and Automated Services for Data Integration and Exchange”
(EP/E010865/1). Georg Gottlob’s work was also supported by a Royal Society
Wolfson Research Merit Award.
References
 1. C. Beeri and M. Y. Vardi. The implication problem for data dependencies. In
    Proc. of ICALP, pages 73–85, 1981.
 2. A. Calı̀, D. Calvanese, G. D. Giacomo, and M. Lenzerini. Accessing data integration
    systems through conceptual schemas. In Proc. of ER, pages 270–284, 2001.
 3. A. Calı̀, G. Gottlob, and M. Kifer. Taming the infinite chase: Query answering
    under expressive relational constraints. In Proc. of KR, pages 70–80, 2008.
 4. A. Calı̀, G. Gottlob, and T. Lukasiewicz. A general datalog-based framework for
    tractable query answering over ontologies. In Proc. of PODS, pages 77–86, 2009.
 5. A. Calı̀, G. Gottlob, and A. Pieris. Advanced processing for ontological query
    answering. Submitted for publication, 2010.
 6. A. Calı̀ and M. Kifer. Containment of conjunctive object meta-queries. In Proc.
    of VLDB, pages 942–952, 2006.
 7. A. Calı̀, D. Lembo, and R. Rosati. Decidability and complexity of query answering
    over incosistent and incomplete databases. In Proc. of PODS, pages 260–271, 2003.
 8. A. Calı̀, D. Lembo, and R. Rosati. Query rewriting and answering under constraints
    in data integration systems. In Proc. of IJCAI, pages 16–21, 2003.
 9. A. Calı̀ and D. Martinenghi. Querying incomplete data over extended er schemata.
    TPLP, 2010. to appear.
10. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable
    reasoning and efficient query answering in description logics: The DL-lite family.
    J. Autom. Reasoning, 39(3):385–429, 2007.
11. D. Calvanese, G. D. Giacomo, and M. Lenzerini. Conjunctive query containment
    and answering under description logic constraints. ACM Trans. Comput. Log.,
    9(3), 2008.
12. A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries
    in relational data bases. In Proc. of STOCS, pages 77–90, 1977.
13. A. Deutsch. XML query reformulation over mixed and redundant storage. PhD
    thesis, Dept. of Computer and Information Sciences, Univ. of Pennsylvania, 2002.
14. A. Deutsch, A. Nash, and J. B. Remmel. The chase revisisted. In Proc. of PODS,
    pages 149–158, 2008.
15. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and
    query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
16. M. E. Goncalves and E. Grädel. Decidability issues for action guarded logics. In
    Description Logics, pages 123–132, 2000.
17. D. S. Johnson and A. C. Klug. Testing containment of conjunctive queries under
    functional and inclusion dependencies. J. Comput. Syst. Sci., 28(1):167–189, 1984.
18. M. Kifer, G. Lausen, and J. Wu. Logical foundations of object-oriented and frame-
    based languages. J. ACM, 42(4):741–843, 1995.
19. D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependen-
    cies. ACM Trans. Database Syst., 4(4):455–469, 1979.
20. C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
21. H. Pérez-Urbina, I. Horrocks, and B. Motik. Practical aspects of query rewriting
    for owl 2. In Proc. of OWLED, 2009.
22. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati.
    Linking data to ontologies. J. Data Semantics, 10:133–173, 2008.
23. M. Simkus and T. Eiter. DNC: Decidable non-monotonic disjunctive logic pro-
    grams with function symbols. In Proc. of LPAR, pages 514–530, 2007.
24. M. Y. Vardi. On the complexity of bounded-variable queries. In Proc. of PODS,
    pages 266–276, 1995.