=Paper= {{Paper |id=None |storemode=property |title=Complexity of Answering Counting Aggregate Queries over DL-Lite |pdfUrl=https://ceur-ws.org/Vol-1014/paper_56.pdf |volume=Vol-1014 |dblpUrl=https://dblp.org/rec/conf/dlog/KostylevR13 }} ==Complexity of Answering Counting Aggregate Queries over DL-Lite== https://ceur-ws.org/Vol-1014/paper_56.pdf
    Complexity of Answering Counting Aggregate
               Queries over DL-Lite

                     Egor V. Kostylev1 and Juan L. Reutter2
                 1
                  University of Edinburgh, ekostyle@inf.ed.ac.uk
         2
             PUC Chile and University of Edinburgh, jreutter@ing.puc.cl



      Abstract. One of the main applications of description logics is the
      ontology-based data access model, which requires algorithms for query
      answering over ontologies. In fact, some description logics, like those in
      the DL-Lite family, are designed so that simple queries, such as conjunc-
      tive queries, are efficiently computable. In this paper we study count-
      ing aggregate queries over ontologies, i.e. queries which use aggregate
      functions COUNT and COUNT DISTINCT. We propose an intuitive se-
      mantics for certain answers for these queries, which conforms to the open
      world assumption. We compare our semantics with other approaches that
      have been proposed in different contexts. We establish data and com-
      bined computational complexity for the problems of answering counting
      aggregate queries over ontologies for several variants of DL-Lite.


1    Introduction
The growing popularity of ontologies as a paradigm for representing knowledge
in the Semantic Web is based on the ability to describe incomplete information
in the domain of interest.
    Several variations of the Web Ontology Language (OWL) have been formal-
ized to manage ontologies. Most of these languages correspond to various de-
cidable fragments of first order logic, which are called description logics (DLs).
However, applications like ontology-based data access (OBDA) require algorithms
not only to decide standard reasoning problems, such as satisfiability and model
checking, but also to answer database-style queries [1, 2]. This motivates the use
of description logics of the DL-Lite family in, e.g. OWL 2 QL, which have been
designed specifically to maximize expressive power while maintaining good query
answering properties [3]. In particular, the computational complexity of answer-
ing simple queries such as conjunctive queries (CQ s) and unions of conjunctive
queries (UCQs) over these DLs is the same as for relational databases [4, 5].
    Some attention has recently been paid to the problem of answering various
extensions of CQs and UCQs over ontologies. For example [6] study path queries
over ontologies, while [7], [8] and [?] consider adding some form of negation
to these simple queries. The general conclusion from these papers is that the
complexity of evaluation of such queries is usually higher than for CQs and UCQs
and even higher than for similar problems in relational databases. In some cases
this difference in complexity is surprisingly high: e.g. while answering UCQs with
inequalities is known to be efficiently computable for relational databases, the
problem is undecidable when such a query is posed over DL-Lite ontologies.
    Yet there is another extension of CQs that has received little attention in
the context of OBDA – aggregate queries. These queries answer questions such
as ”How many children does Ann have?” or ”What is the average salary over
each department in the Pandidakterion?” Usually, they combine various aggre-
gate functions, such as MIN, MAX, SUM, AVERAGE, COUNT and COUNT DISTINCT [9],
together with a grouping functionality, as in the usual GROUP BY clause of SQL.
    Aggregate queries are an important and heavily used part of almost every
relational database query language, including SQL. In the context of the Seman-
tic Web we expect a particular need for aggregates in the OBDA settings, with
applications such as SPARQL under entailment regimes [10]. But despite their
importance, the study of aggregate queries over ontologies has been lacking, save
for a few exceptions [11].
    The main reason for the lack of research in this direction is the difficulty of
defining a semantics for aggregate queries over ontologies. The complication is
that, unlike relational databases, in ontologies one assumes that every knowledge
base instance is incomplete and describes a part of the infinite number of models
of the knowledge base (i.e. the open world assumption is assumed), and a query
may have a different answer on each of these models. For standard queries like
CQs and UCQs this problem is usually overcome by computing the certain an-
swers of queries, i.e. the tuples that are answers in all possible models [4]. This
approach, however, is not suitable for aggregate queries, as the following shows.
    Consider a knowledge base where Ann is a parent and the ontology asserts
that every parent has at least one child. If nothing else is assumed then for
every positive integer n there exists a model where Ann has n children. Thus,
the answer to a simple query ”How many children does Ann have?” in different
models of the knowledge base can be any number greater than or equal to 1. The
syntactic intersection of these answers (i.e. applying standard certain answers
semantics) trivially gives us the empty set, which is clearly not satisfactory. As
a different approach, [11] introduced epistemic semantics for aggregate queries.
In a nutshell, the idea is to apply the aggregation function only to known values.
For example, the epistemic answer to the query above is 0, because we do not
definitely know anybody who is a child of Ann. But this is clearly not the desired
answer: since Ann is a parent we know that she has at least one child. Hence the
epistemic semantics does not always give a correct answer to COUNT queries.
    As the first contribution of this paper, in Sec. 3 we embark on the task
of defining a suitable semantics for answering what we call counting aggregate
queries, which are queries that use COUNT or COUNT DISTINCT functions. Mo-
tivated by the original idea of certain answers, we seek to find the maximal
information that is common in the answers to such a query for all the models
of a knowledge base. This gives rise to the notion of aggregate certain answers,
which can be explained as follows: a number is an aggregate certain answer to a
counting query over a knowledge base if it does not exceed the result of the query
over any model of this knowledge base. For instance, in the above example, even
if we do not know precisely how many children Ann has, we know that she has
at least one, and thus 1 is an aggregate certain answer to the query.
    Of course this semantics is not well suited for aggregation primitives such
as SUM or AVERAGE. But, as we show in this paper, it is a natural and useful
semantics for aggregate queries that count.
    Having established our semantics, we turn to the study of the algorithmic
properties of aggregate certain answers computation for counting queries. We
concentrate on ontologies of the DL-Lite family, in particular DL-Lite core and
DL-LiteR [4]. The choice of these DLs is twofold: first, as mentioned above,
these formalisms are important in the OBDA settings; second, they are among
the simplest DLs and hence good candidates to begin with.
    As usual in the theory of DLs, in Sec. 4 we study these problems assuming
that the query and the terminology (i.e. the TBox ) are fixed, and the only
input is the assertions (ABox ). This corresponds to the data complexity of the
problem in Vardi’s taxonomy [12]. Somewhat surprisingly, our results show that
the complexity of aggregate certain answers problem is resilient to the choice
of both DL and counting function and is coNP-complete in all cases. In order
to get a further understanding of the computational properties of the problems,
in Sec. 5 we study their combined complexity, i.e. assume that the query, ABox
and TBox are the input. Here we do find differences: both count distinct and
count aggregate query answering are coNExpTime-complete for DL-LiteR ; yet the
former problem is Π2p -complete and the latter is in coNExpTime for DL-Lite core .
Hereby, the small increase of expressivity from DL-Lite core to DL-LiteR makes
at least the count distinct problem exponentially more difficult. As far as we
are aware, these are the first tight complexity bounds for answering aggregate
queries in the presence of ontologies.
Related Work Although mostly unexplored in the context of ontologies, seman-
tics for aggregate queries have been already defined for other database settings
that feature incomplete information. For example, an inconsistent database in-
stance (w.r.t. a set of constraints) describes a set of repairs, each of which satisfies
the constraints and can be obtained from the instance by a minimal number of
transformations. Aggregate queries over inconsistent databases were explored in
[13], where the range semantics was defined. Intuitively, this semantics corre-
sponds to the interval between the minimal and the maximal possible answers
to the query, amongst all the repairs of a given instance. The same semantics
was adopted by [14, 15] in the context of data exchange.
    However, the techniques from these papers cannot be immediately applied
to ontologies, because of several specific properties. In particular, these papers
consider variations of the closed world assumption, whereas in ontologies the
open world assumption is assumed. Furthermore, data exchange settings are
based on source-to-target dependencies and weakly acyclic target dependencies.
This rules out all types of recursion in ontological knowledge, thus simplifying
the study to a great extent.
    In the context of ontologies, in [11] the range semantics itself was claimed to
be trivially meaningless for aggregate queries over ontologies. For example, for
almost any knowledge base we can construct a model such that the aggregate
value of an AVERAGE query evaluates to any number. Similar examples can be
given for all other standard aggregate functions, except for COUNT and COUNT
DISTINCT, which are precisely the aggregates that are the focus of this paper.
As we will show the computation of the upper bound of the range is almost
trivial in these cases as well. But the lower bound of the range, i.e. the minimal
possible value described above, is completely natural, and by no means trivial to
compute. In fact, the lower bound of the range semantics is strongly related to
our notion of aggregate certain answers as follows: a number is in the aggregate
certain answers if and only if it is less than or equal to the lower bound of the
range. Thus, this work on aggregate certain answers can be seen as an adaptation
of the range semantics of [13] to ontologies.
    This paper is an extended version of [16]. We sketch or even omit the proofs
of lemmas in the paper, which will be included in the full version.


2    Preliminaries
Syntax of DL-Lite Let A0 , A1 , . . . be atomic concepts and P0 , P1 , . . . be atomic
roles. Concepts C and roles E of DL-Lite languages are formed by the grammar

     B ::= Ai | ∃R,      R ::= Pi | Pi− ,    C ::= B | ¬B,        E ::= R | ¬R.

A TBox is a finite set of assertions. In the language of DL-Lite core the assertions
are of the form B ⊑ C. In DL-LiteR the form R ⊑ E is also allowed. An ABox
is a set of assertions of the forms Ai (a) and Pi (a, b) where constants a, b are
from an active domain D. A knowledge base (or KB ) K = hT , Ai of a DL-Lite
language contains a TBox T of the language and an ABox A.
Semantics of DL-Lite An interpretation I = (DI , ·I ) contains a (possibly
infinite) domain of elements DI such that D ⊆ DI , and maps each concept C to
a subset C I of DI and each role R to a binary relation RI over DI such that
             (Pi− )I = {(a, b) | (b, a) ∈ PiI }, (¬B)I = DI \B I ,
             (∃R)I = {a | ∃b : (a, b) ∈ RI }, (¬R)I = DI × DI \RI .

An interpretation I is a model of a KB K = hT , Ai (written I |= K) if for any
assertion B ⊑ C in T it holds that B I ⊆ C I , for any R ⊑ E it holds that
RI ⊆ E I , for any Ai (a) in A it holds that a ∈ AIi , and for any Pi (a, b) it holds
that (a, b) ∈ PiI .
    The definitions above say that D ⊆ DI in every interpretation I, which
essentially means that for each constant a from the active domain D we have
aI = a. By this we adopt the unique name assumption (UNA) on constants,
which is conventional for DL-Lite. However, dropping this assumption does not
affect any result of this paper, and we discuss explicitly how to adopt proofs
wherever it is not straightforward.
Conjunctive queries A conjunctive query (or CQ ) is an expression of the form

                                 q(x) :- ∃y φ(x, y),                              (1)
where x is a tuple of free variables, y is a tuple of existential variables, and the
body φ(x, y) is a conjunction of atoms of the form Ai (u) or Pi (u1 , u2 ), where
u, u1 , u2 are variables from x ∪ y.
    A CQ (1), holds for an interpretation I and a tuple t of elements from DI
(written I |= q(t)) iff there exists an evaluation from q to DI for t, i.e. a mapping
h : x∪y → DI , such that h(x) = t and h(z) ∈ S I , for every atom S(z) in φ(x, y).
A tuple t is in the certain answer to a CQ (1) over a KB K if I |= q(t) holds
for every model I of K.


3     Counting Queries over Ontologies

The ability to evaluate aggregate queries is a default in every DBMS and is in
the standard of SQL. However, as mentioned in the introduction, little attention
to this type of queries has been paid in the context of ontologies. Starting to
fill this gap, in this section we formally define counting aggregate queries over
ontologies of DL-Lite family and compare this definition with existing notions
in related areas.


3.1   Syntax and Semantics of Counting Queries

Following e.g. [9], an aggregate conjunctive query (or ACQ ) is an expression

                            q(x, f (z)) :- ∃y φ(x, y, z),                        (2)

where x is a tuple of free variables, y is a tuple of existential variables and z is
a tuple of aggregation variables; the body φ(x, y, z) is a conjunction of atoms of
the form Ai (u) or Pi (u1 , u2 ), where u, u1 , u2 are variables from x ∪ y ∪ z; and
f (z) is an aggregation function. In this paper we consider two such functions:
the unary count distinct function Cntd(z) and nullary count function Count().
We call such queries counting ACQs.

Example 1. Let K = hT , Ai be a knowledge base where T consists of the as-
sertion Parent ⊑ ∃HasChild, and A consists of the assertion Parent(Ann). The
query
                q1 (x, Count()) :- ∃y Parent(x) ∧ HasChild(x, y)
is an ACQ using the count function. Intuitively, it counts the children of each
parent. The query

                  q2 (Cntd(y)) :- ∃x Parent(x) ∧ HasChild(x, y)

is a count distinct ACQ. This query counts all different children having a parent.

    To define the semantics of counting queries over a particular model we again
follow [9]. We say that the core of an ACQ of the form (2) is the CQ q̄(x ∪
z) :- ∃y φ(x, y, z). Also, let N∞ be the set of natural numbers with 0 and +∞.
    A count ACQ q(x, Count()) holds for an interpretation I, a tuple t of ele-
ments from DI and a number n ∈ N∞ (written I |= q(t, n)) iff n is the number
of distinct evaluations from the core q̄ to DI for t.
    A count distinct ACQ q(x, Cntd(z)) holds for an interpretation I, a tuple t
of elements from DI and a number n ∈ N∞ (written I |= q(t, n)) iff n is the
number of distinct elements a ∈ DI such that I |= q̄(t, a) for the core q̄ of q.
Example 2. Coming back to Ex. 1, consider the interpretation I where ParentI =
{Ann} and HasChildI = {(Ann, Joe)}, which is clearly a model for K. Then it is
not difficult to see that I |= q1 (Ann, 1) and I |= q2 (1). For the model J such that
ParentJ = {Ann, Peter} and HasChildJ = {(Ann,Joe),(Ann,Rose),(Peter,Joe)},
it holds that J |= q1 (Ann, 2), J |= q1 (Peter, 1) and J |= q2 (2).

3.2   Certain Answers of Counting Queries over Ontologies
A knowledge base normally describes not a single model, but an infinite number
of them. This is why one is typically interested in computing the certain answers
of queries over a KB, which are usually defined as the intersection of the answers
of the query over all possible models of KB [4, 7].
    Unfortunately, a definition based on such a syntactical intersection is of little
use for ACQs, since it would almost always be empty. For instance, for the
query q1 from Ex. 1 and 2 we have that I |= q1 (Ann, 1), and I 6|= q1 (Ann, 2),
yet J 6|= q1 (Ann, 1) and J |= q1 (Ann, 2). This suggests avoiding using such a
syntactic intersection when defining the semantics of ACQs over ontologies.
    In the context of OBDA this problem has been identified before by [11].
Their solution was to concentrate only on aggregating over epistemic knowl-
edge, i.e. over values which are explicitly mentioned in the ABox of a KB. Such
epistemic aggregate queries usually have a non-empty certain answer, based on
the intersection, for all standard aggregate queries, including M ax and Average.
However, for counting queries this answer may be somehow non-satisfactory. For
example, the epistemic answer to the ACQ q1 over K from Ex. 1 is (Ann, 0), be-
cause we do not know anybody who is definitely a child of Ann.
    That is why we suggest the following definition of certain answers of counting
ACQs over DLs, which is essentially the minimum over possible values of the
counting function over all the models of a KB. In particular, our certain answer
to the query q1 over K from Ex. 1 contains (Ann, 1), which reflects the fact that
we definitely know that Ann has at least one child in any model. We deem this
definition to be in line with the open world assumption, adopted in ontologies.
Definition 1. A non-negative number n ∈ N∞ is in the aggregate certain an-
swers Cert(q, t, K) for a counting ACQ q, tuple of elements t, and a KB K iff
n ≤ minI|=K {k | I |= q(t, k)}.
    Note that a definition like above is non-trivial only for counting standard
aggregate queries. Indeed, it relies on their simple property that the minimum
above can potentially be any number greater than or equal to 0. For other
aggregation functions it is not the case: e.g. such a minimum for Average is
trivially almost always −∞.
3.3     Range Semantics of Aggregate Queries

The range semantics for aggregate queries was first proposed in [13] to study
aggregate queries over inconsistent databases, and it was later adopted in data
exchange [14, 15]. In the context of counting ACQs over ontologies it can be
defined as follows.
    The range of answers for a counting ACQ q, a tuple t, and a KB K is the
interval [m(q, t, K), M (q, t, K)], where

      m(q, t, K) = min {k | I |= q(t, k)},   M (q, t, K) = max{k | I |= q(t, k)}.
                   I|=K                                   I|=K


   It is easy to see that the lower bound of the range interval coincides with the
maximal certain answer from Def. 1. Considering the upper bound, let’s come
back to Ex. 1. We can find a model I of K such that I |= q1 (Ann, n) for any
number n ≥ 1, i.e. in this case the upper bound is +∞. The following proposition
says that this is not unusual.

Proposition 1. Given a counting ACQ q, a tuple of elements t, and a DL-Lite
KB K the upper endpoint M (q, t, K) of the range of answers belongs to the set
{0, 1, +∞}, and can be computed in polynomial time (in the size of q and K).

Proof. Indeed, M (q, t, K) = 0 iff hT , A ∪ Aq i has no model, where Aq is an ABox
over the variables of q as constants, containing the atoms of q as assertions. Oth-
erwise, we have that M (q, t, K) = 1 only if q uses count() and has no existentially
quantified variables. In all the remaining cases we have that M (q, t, K) = +∞,
since nothing prevents a model with an infinite number of witnesses.              ⊓
                                                                                  ⊔

    We may thus say that the aggregate certain answers semantics from Def. 1
is in fact an adaptation of the range semantics of [13] to ontologies.


4      Data Complexity of Counting Queries

It has been argued many times that in usual database settings the size of the
query and the TBox is much smaller than the size of the ABox (see e.g. [12] as
a more general statement and [4] in the context of DL’s). This is why in query
answering over ontologies one usually explores data complexity of problems,
i.e. only database knowledge from ABox is considered as part of the input. In
this section we do the same for aggregate certain answers. Formally, let X ∈
{core,R}, T be a TBox over DL-LiteX and q(x, f (z)) be a counting ACQ. We
are interested in the following family of problems.

               DL-LiteX f -Aggregate Certain Answers (T , q)
                Input:     ABox A, tuple t, and number n ∈ N∞ .
                Question: Is n ∈ Cert(q, t, hT , Ai)?
4.1   Count Queries

We start with the lower bound for count ACQs.

Lemma 1. There exist a DL-Litecore TBox T and a count ACQ q without free
variables such that checking whether n ∈ Cert(q, t∅ , K), where K = hT , Ai, for
an ABox A, a number n, and the empty tuple t∅ is coNP-hard.

Proof (sketch). Let A, B and E, P be atomic concepts and roles. Let T =
{A ⊑ ∃P, ∃P − ⊑ B} and q(Count()) :- ∃y1 . . . y4 B(y1 ) ∧ E(y2 , y3 ) ∧ P (y2 , y4 ) ∧
P (y3 , y4 ).
    The proof is by a reduction from the complement of the NP-complete 3-
colouring problem with an undirected graph G(V, E) as input.
    Let D = V ∪ {r, g, b, a}. Let A contain E(u, v) and E(v, u) for each (u, v) ∈ E,
A(v) for each v ∈ V, B(c) for each c ∈ {r, g, b}, and E(a, a), P (a, r).
    It holds that 4 ∈ Cert(q, t∅ , K) iff G has no 3-colouring.                       ⊓
                                                                                      ⊔

    This lemma continues to hold if one drops the UNA. To adopt the proof, it
suffices to state that r, g and b belong to pairwise disjoint concepts, and that a
belongs to a concept that is disjoint with a concept containing all v from V.
    The proof above make use of the non-connectivity of the query. It is an
interesting open problem whether the result holds for connected queries.
    Thus, the data complexity of count queries rises from P in the standard
database case at least to coNP for DL-Lite knowledge bases. The following lemma
establishes a matching upper bound for the problem.

Lemma 2. Let T be a fixed DL-LiteR TBox and q(x, Count()) be a fixed count
ACQ. Checking whether n ∈ Cert(q, t, K), where K = hT , Ai, for an ABox A,
a tuple t, and a number n can be done in coNP.

Proof (sketch). Given an interpretation J and a number k, it is well known
that checking whether J |= K and J |= q(t, k) is in polynomial time (since q is
fixed). Hence, it is enough to prove that if there exists a model I of K such that
I |= q(t, n0 ) for a number n0 then there exists a model Ī of K of polynomial
size in the size of A such that Ī |= q(t, n̄) for some number n̄ ≤ n0 .
    Note that K always has a model with a domain no bigger than |D| + |T |, so
we may assume that n0 ≤ (|D| + |T |)|q| (which is polynomial since q is fixed).
    Fix I as above. There exists a homomorphism f : Can(K) → I, where
Can(K) is the canonical model of K (see the definition in e.g. [4]). W.l.o.g. we
assume that it is surjective, i.e. f (Can(K)) = I; otherwise we can drop elements
and assertions of I which are not in the image of f , without increasing n0 .
    Let D∗ be all elements of DI which are either constants from D or images
of variables by evaluations from the core of q to DI . We can construct an inter-
pretation Î with the domain DÎ = ∪d∈DI \D∗ f −1 (d) ∪ D∗ and with a surjective
homomorphism from Can(K) so that Î |= K and Î |= q(t, n̄) for some n̄ ≤ n0 .
    For every element d ∈ DÎ \D∗ define Nq (d) as a sub-interpretation of DÎ
induced by all elements reachable from d by an (undirected) path though roles
of length no more than |q| and without intermediate nodes from D∗ . Define
equivalence Nq (d) ∼ Nq (d′ ) if there exists an isomorphism between Nq (d) and
Nq (d′ ) preserving D∗ .
     Note that every element of the canonical model which is not in D, has at
most |T | + 1 immediate neighbours. Hence each d ∈ DÎ \D∗ also has at most
|T | + 1 immediate neighbours in Î. Moreover, it holds that |D∗ | ≤ n0 |q| + |D|.
So, each Nq (d) is of polynomial size and there is only a polynomial number of
equivalence classes induced by ∼. Consider the model Ī obtained from Î by
merging all d1 , d2 such that Nq (d1 ) ∼ Nq (d2 ) and the distance from D to d1 and
d2 in the canonical model modulo |q| + 1 is the same. The model Ī is as required,
since such merging does not create new homomorphisms of the body of q.            ⊓
                                                                                  ⊔
   Note that the lower bound was shown for DL-Lite core , while the upper bound
holds for any DL-LiteR KB. Since DL-LiteR is more expressive than DL-Lite core ,
the lemmas above give us the following complexity result.
Theorem 1. The problem DL-LiteX Count-Aggregate Certain Answers
(T , q) is coNP-complete in data complexity for any X ∈ {core,R}.

4.2    Count Distinct Queries
The coNP bounds also apply for count distinct ACQs. The lower bound is again
established by reduction from the complement of the 3-colouring problem.
Lemma 3. There exist a DL-Litecore TBox T and a count distinct ACQ q
without free variables such that checking whether n ∈ Cert(q, t∅ , K), where
K = hT , Ai, for an ABox A and a number n is coNP-hard.
Proof (sketch). Consider T = {∃E ⊑ ∃P } and q(Cntd(z)) :- ∃y1 . . . y4 P (y1 , z) ∧
R(y1 , y2 ) ∧ P (y2 , y3 ) ∧ P (y4 , y3 ) ∧ E(y4 , y2 ), where E, P , R are atomic roles.
    Let G(V, E) be an input graph as in the proof of Lem. 1. Let D contain the set
of elements {v, v1 , v2 , v3 , v4 , v5 } for each v ∈ V, and elements a, a1 , a2 , a3 , r, g, b.
Let A contain the assertions E(u, v) and E(v, u) for each (u, v) ∈ E; the asser-
tions R(v, v1 ), P (v1 , v2 ), P (v3 , v2 ), E(v3 , v1 ), R(v4 , v), P (v4 , v5 ) for each v ∈ V;
and the assertions R(a, a1 ), P (a1 , a2 ), P (a3 , a2 ), E(a3 , a1 ), P (a, r), P (a, g) and
P (a, b). It holds that 4 ∈ Cert(q, t∅ , K) iff G has no 3-colouring.                          ⊓
                                                                                               ⊔
    This lemma again holds for the case when UNA is dropped, and the proof
can be adopted in the same way as the proof of Lem. 1. The matching algorithm
is also similar to the count case.
Lemma 4. Let T be a fixed DL-LiteR TBox and q(x, Cntd(z)) be a fixed count
distinct ACQ. Checking whether n ∈ Cert(q, t, hT , Ai) for an ABox A, a tuple
t, and a number n can be done in coNP.
   The proof goes the same lines as the proof of Lem. 2 except that we bound
n0 by |D| + |T |, and include into D∗ the active domain D and all homomorphic
images of the aggregation variable z to I. The following summarises the lemmas.
Theorem 2. The problem DL-LiteX Cntd-Aggregate Certain Answers
(T , q) is coNP-complete in data complexity for any X ∈ {core,R}.
5     Combined Complexity of Counting Queries

As pointed out in Sec. 4 data complexity is the most used measure of algorithms
in any database settings. However, combined complexity has its own value for
understanding fundamental properties of problems. In this section we study the
combined complexity of computing aggregate certain answers. Formally, let X ∈
{core, R} and f be a counting aggregate function. Now we are interested in the
following family of problems.

    DL-LiteX f -Aggregate Certain Answers
    Input:    KB K over DL-LiteX , f query q, tuple t, and number n ∈ N∞ .
    Question: Is n ∈ Cert(q, t, K)?


5.1    Count Queries

We start again with count queries. Recall the algorithm to compute the certain
answers for count queries explained in the proof of Lem. 2. Note that, if one
takes into consideration the size of the query and the TBox, then this algorithm
naturally gives a coNExpTime upper bound; the only difference is that in this
case the number of neighbourhoods is of exponential size (w.r.t. q and T ), and
thus the instance we need to guess is of exponential size. Next we show that this
bound is tight for DL-LiteR .

Lemma 5. The decision problem DL-LiteR Count-Aggregate Certain An-
swers is coNExpTime-hard.

    The proof is by a reduction from the complement of the satisfiability problem
for first-order logic (FO) formulas in the Bernays-Schöfinkel class [17]. This class
contains all FO formulae of form ∃x ∀y ψ(x, y), with ψ a quantifier-free formula
not using function symbols or equalities. The reduction is inspired by the tech-
niques used in [18] to show coNExpTime-hardness of query answering problems
in data exchange context.
    Unfortunately, the reduction above uses role inclusions in the TBox, i.e. it
is applicable only to DL-LiteR . We leave open the exact complexity of the DL-
Lite core Count-ACQ answering problem, although it is not difficult to adapt
the results of the following section to obtain a Π2p lower bound. We have the
summarizing theorem.

Theorem 3. (1) The problem DL-Lite core Count-Aggregate Certain An-
swers is in coNExpTime. (2) The problem DL-LiteR Count-Aggregate Cer-
tain Answers is coNExpTime-complete.

5.2    Count Distinct Queries

Just as we did for count queries, we can easily obtain a coNExpTime upper bound
for count distinct ones from the proof of Lem. 4. However, for DL-Lite core we
                          Data complexity Combined complexity
                 DL-Lite
                          Count Cntd      Count    Cntd
                     core coNP-c coNP-c in coNExp Π2p -c
                       R coNP-c coNP-c coNExp-c coNExp-c
Table 1. A summary of the complexity results. Here “-c” stands for “-complete” and
coNExp – for coNExpTime.



can improve the complexity by almost one exponential. The idea is to redefine
the sub-interpretations used in the proof of Lem. 4 to have them of polynomial
size, while keeping the possibility of merging them.

Lemma 6. There exists a Π2p -algorithm which solves the problem DL-Lite core
Cntd-Aggregate Certain Answers.

    In this case we have the matching lower bound.

Lemma 7. The problem DL-Lite core Cntd-Aggregate Certain Answers is
Π2p -hard.

    The proof is by reduction from the Π2p -complete ∀∃ 3-SAT problem [19].
    The remaining question is whether the algorithm for computing aggregate
certain answers over DL-LiteR knowledge bases is optimal. We settle this with
our last lemma, shown by a reduction similar to the one in the proof of Lem. 5.

Lemma 8. The decision problem DL-LiteR Cntd-Aggregate Certain An-
swers is coNExpTime-hard.

    Summing up, we have our last theorem.

Theorem 4. (1) The problem DL-Lite core Cntd-Aggregate Certain An-
swers is Π2p -complete. (2) The problem DL-LiteR Cntd-Aggregate Certain
Answers is coNExpTime-complete.


6    Conclusion

In this paper we have defined an intuitive semantics for counting aggregate
queries over ontologies and explored the computational complexity of the cor-
responding problems. The results, summarized in Table 1, show that the prob-
lems are decidable, but intractable. Hence, heuristics and approximations for
answering ACQs are on high demand from the practical point of view, with
applications, for instance, in the definition of general aggregation in SPARQL
under entailment regimes. We consider the epistemic semantics as one of such
approximations, since it has lower data complexity but does not always pro-
vide the desired answer. Our work settles the theoretical foundations for further
discussion.
References

 1. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-
    Muro, M., Rosati, R., Ruzzi, M., Savo, D.F.: The MASTRO system for ontology-
    based data access. Semantic Web 2(1) (2011) 43–53
 2. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com-
    bined approach to ontology-based data access. In: IJCAI. (2011) 2656–2661
 3. Cuenca Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler,
    U.: Owl 2: The next step for OWL. Web Semant. 6(4) (November 2008) 309–322
 4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. of Automated Reasoning 39(3) (2007) 385–429
 5. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family
    and relations. J. Artif. Intell. Res. (JAIR) 36 (2009) 1–69
 6. Bienvenu, M., Ortiz, M., Simkus, M.: Answering expressive path queries over
    lightweight DL knowledge bases. In Kazakov, Y., Lembo, D., Wolter, F., eds.:
    Description Logics. Volume 846 of CEUR Workshop Proceedings., CEUR-WS.org
    (2012)
 7. Rosati, R.: The limits of querying ontologies. In Schwentick, T., Suciu, D., eds.:
    ICDT. Volume 4353 of Lecture Notes in Computer Science., Springer (2007) 164–
    178
 8. Gutiérrez-Basulto, V., Ibáñez-Garcı́a, Y.A., Kontchakov, R.: An update on query
    answering with restricted forms of negation. In: Proceedings of the 6th interna-
    tional conference on Web Reasoning and Rule Systems. RR’12, Berlin, Heidelberg,
    Springer-Verlag (2012) 75–89
 9. Cohen, S., Nutt, W., Sagiv, Y.: Deciding equivalences among conjunctive aggregate
    queries. Journal of the ACM 54(2) (2007)
10. Glimm, B., Ogbuji, C., Hawke, S., Herman, I., Parsia, B., Polleres, A., Seaborne,
    A.: SPARQL 1.1 entailment regimes (2013) W3C Recommendation 21 March 2013,
    http://www.w3.org/TR/2013/REC-sparql11- entailment-20130321/.
11. Calvanese, D., Kharlamov, E., Nutt, W., Thorne, C.: Aggregate queries over on-
    tologies. In Elmasri, R., Doerr, M., Brochhausen, M., Han, H., eds.: ONISW, ACM
    (2008) 97–104
12. Vardi, M.Y.: The complexity of relational query languages (extended abstract).
    In: STOC. (1982) 137–146
13. Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar
    aggregation in inconsistent databases. Theor. Comput. Sci. 296(3) (March 2003)
    405–434
14. Libkin, L.: Data exchange and incomplete information. In Vansummeren, S., ed.:
    PODS, ACM (2006) 60–69
15. Afrati, F., Kolaitis, P.G.: Answering aggregate queries in data exchange. In:
    Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium
    on Principles of database systems. PODS ’08, New York, NY, USA, ACM (2008)
    129–138
16. Kostylev, E.V., Reutter, J.L.: Answering counting aggregate queries over ontologies
    of the DL-Lite family. In: Proc. of the 27th AAAI Conf. on Artificial Intelligence
    (AAAI). (2013)
17. Börger, E., Grädel, E., Gurevich, Y.: The Classical Decision Problem. Springer,
    Berlin (2001)
18. Arenas, M., Barceló, P., Reutter, J.L.: Query languages for data exchange: Beyond
    unions of conjunctive queries. Theory Comput. Syst. 49(2) (2011) 489–564
19. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
    of NP-Completeness. W. H. Freeman (1979)