=Paper=
{{Paper
|id=None
|storemode=property
|title=Complexity of Answering Counting Aggregate Queries over DL-Lite
|pdfUrl=https://ceur-ws.org/Vol-1014/paper_56.pdf
|volume=Vol-1014
|dblpUrl=https://dblp.org/rec/conf/dlog/KostylevR13
}}
==Complexity of Answering Counting Aggregate Queries over DL-Lite==
Complexity of Answering Counting Aggregate Queries over DL-Lite Egor V. Kostylev1 and Juan L. Reutter2 1 University of Edinburgh, ekostyle@inf.ed.ac.uk 2 PUC Chile and University of Edinburgh, jreutter@ing.puc.cl Abstract. One of the main applications of description logics is the ontology-based data access model, which requires algorithms for query answering over ontologies. In fact, some description logics, like those in the DL-Lite family, are designed so that simple queries, such as conjunc- tive queries, are efficiently computable. In this paper we study count- ing aggregate queries over ontologies, i.e. queries which use aggregate functions COUNT and COUNT DISTINCT. We propose an intuitive se- mantics for certain answers for these queries, which conforms to the open world assumption. We compare our semantics with other approaches that have been proposed in different contexts. We establish data and com- bined computational complexity for the problems of answering counting aggregate queries over ontologies for several variants of DL-Lite. 1 Introduction The growing popularity of ontologies as a paradigm for representing knowledge in the Semantic Web is based on the ability to describe incomplete information in the domain of interest. Several variations of the Web Ontology Language (OWL) have been formal- ized to manage ontologies. Most of these languages correspond to various de- cidable fragments of first order logic, which are called description logics (DLs). However, applications like ontology-based data access (OBDA) require algorithms not only to decide standard reasoning problems, such as satisfiability and model checking, but also to answer database-style queries [1, 2]. This motivates the use of description logics of the DL-Lite family in, e.g. OWL 2 QL, which have been designed specifically to maximize expressive power while maintaining good query answering properties [3]. In particular, the computational complexity of answer- ing simple queries such as conjunctive queries (CQ s) and unions of conjunctive queries (UCQs) over these DLs is the same as for relational databases [4, 5]. Some attention has recently been paid to the problem of answering various extensions of CQs and UCQs over ontologies. For example [6] study path queries over ontologies, while [7], [8] and [?] consider adding some form of negation to these simple queries. The general conclusion from these papers is that the complexity of evaluation of such queries is usually higher than for CQs and UCQs and even higher than for similar problems in relational databases. In some cases this difference in complexity is surprisingly high: e.g. while answering UCQs with inequalities is known to be efficiently computable for relational databases, the problem is undecidable when such a query is posed over DL-Lite ontologies. Yet there is another extension of CQs that has received little attention in the context of OBDA – aggregate queries. These queries answer questions such as ”How many children does Ann have?” or ”What is the average salary over each department in the Pandidakterion?” Usually, they combine various aggre- gate functions, such as MIN, MAX, SUM, AVERAGE, COUNT and COUNT DISTINCT [9], together with a grouping functionality, as in the usual GROUP BY clause of SQL. Aggregate queries are an important and heavily used part of almost every relational database query language, including SQL. In the context of the Seman- tic Web we expect a particular need for aggregates in the OBDA settings, with applications such as SPARQL under entailment regimes [10]. But despite their importance, the study of aggregate queries over ontologies has been lacking, save for a few exceptions [11]. The main reason for the lack of research in this direction is the difficulty of defining a semantics for aggregate queries over ontologies. The complication is that, unlike relational databases, in ontologies one assumes that every knowledge base instance is incomplete and describes a part of the infinite number of models of the knowledge base (i.e. the open world assumption is assumed), and a query may have a different answer on each of these models. For standard queries like CQs and UCQs this problem is usually overcome by computing the certain an- swers of queries, i.e. the tuples that are answers in all possible models [4]. This approach, however, is not suitable for aggregate queries, as the following shows. Consider a knowledge base where Ann is a parent and the ontology asserts that every parent has at least one child. If nothing else is assumed then for every positive integer n there exists a model where Ann has n children. Thus, the answer to a simple query ”How many children does Ann have?” in different models of the knowledge base can be any number greater than or equal to 1. The syntactic intersection of these answers (i.e. applying standard certain answers semantics) trivially gives us the empty set, which is clearly not satisfactory. As a different approach, [11] introduced epistemic semantics for aggregate queries. In a nutshell, the idea is to apply the aggregation function only to known values. For example, the epistemic answer to the query above is 0, because we do not definitely know anybody who is a child of Ann. But this is clearly not the desired answer: since Ann is a parent we know that she has at least one child. Hence the epistemic semantics does not always give a correct answer to COUNT queries. As the first contribution of this paper, in Sec. 3 we embark on the task of defining a suitable semantics for answering what we call counting aggregate queries, which are queries that use COUNT or COUNT DISTINCT functions. Mo- tivated by the original idea of certain answers, we seek to find the maximal information that is common in the answers to such a query for all the models of a knowledge base. This gives rise to the notion of aggregate certain answers, which can be explained as follows: a number is an aggregate certain answer to a counting query over a knowledge base if it does not exceed the result of the query over any model of this knowledge base. For instance, in the above example, even if we do not know precisely how many children Ann has, we know that she has at least one, and thus 1 is an aggregate certain answer to the query. Of course this semantics is not well suited for aggregation primitives such as SUM or AVERAGE. But, as we show in this paper, it is a natural and useful semantics for aggregate queries that count. Having established our semantics, we turn to the study of the algorithmic properties of aggregate certain answers computation for counting queries. We concentrate on ontologies of the DL-Lite family, in particular DL-Lite core and DL-LiteR [4]. The choice of these DLs is twofold: first, as mentioned above, these formalisms are important in the OBDA settings; second, they are among the simplest DLs and hence good candidates to begin with. As usual in the theory of DLs, in Sec. 4 we study these problems assuming that the query and the terminology (i.e. the TBox ) are fixed, and the only input is the assertions (ABox ). This corresponds to the data complexity of the problem in Vardi’s taxonomy [12]. Somewhat surprisingly, our results show that the complexity of aggregate certain answers problem is resilient to the choice of both DL and counting function and is coNP-complete in all cases. In order to get a further understanding of the computational properties of the problems, in Sec. 5 we study their combined complexity, i.e. assume that the query, ABox and TBox are the input. Here we do find differences: both count distinct and count aggregate query answering are coNExpTime-complete for DL-LiteR ; yet the former problem is Π2p -complete and the latter is in coNExpTime for DL-Lite core . Hereby, the small increase of expressivity from DL-Lite core to DL-LiteR makes at least the count distinct problem exponentially more difficult. As far as we are aware, these are the first tight complexity bounds for answering aggregate queries in the presence of ontologies. Related Work Although mostly unexplored in the context of ontologies, seman- tics for aggregate queries have been already defined for other database settings that feature incomplete information. For example, an inconsistent database in- stance (w.r.t. a set of constraints) describes a set of repairs, each of which satisfies the constraints and can be obtained from the instance by a minimal number of transformations. Aggregate queries over inconsistent databases were explored in [13], where the range semantics was defined. Intuitively, this semantics corre- sponds to the interval between the minimal and the maximal possible answers to the query, amongst all the repairs of a given instance. The same semantics was adopted by [14, 15] in the context of data exchange. However, the techniques from these papers cannot be immediately applied to ontologies, because of several specific properties. In particular, these papers consider variations of the closed world assumption, whereas in ontologies the open world assumption is assumed. Furthermore, data exchange settings are based on source-to-target dependencies and weakly acyclic target dependencies. This rules out all types of recursion in ontological knowledge, thus simplifying the study to a great extent. In the context of ontologies, in [11] the range semantics itself was claimed to be trivially meaningless for aggregate queries over ontologies. For example, for almost any knowledge base we can construct a model such that the aggregate value of an AVERAGE query evaluates to any number. Similar examples can be given for all other standard aggregate functions, except for COUNT and COUNT DISTINCT, which are precisely the aggregates that are the focus of this paper. As we will show the computation of the upper bound of the range is almost trivial in these cases as well. But the lower bound of the range, i.e. the minimal possible value described above, is completely natural, and by no means trivial to compute. In fact, the lower bound of the range semantics is strongly related to our notion of aggregate certain answers as follows: a number is in the aggregate certain answers if and only if it is less than or equal to the lower bound of the range. Thus, this work on aggregate certain answers can be seen as an adaptation of the range semantics of [13] to ontologies. This paper is an extended version of [16]. We sketch or even omit the proofs of lemmas in the paper, which will be included in the full version. 2 Preliminaries Syntax of DL-Lite Let A0 , A1 , . . . be atomic concepts and P0 , P1 , . . . be atomic roles. Concepts C and roles E of DL-Lite languages are formed by the grammar B ::= Ai | ∃R, R ::= Pi | Pi− , C ::= B | ¬B, E ::= R | ¬R. A TBox is a finite set of assertions. In the language of DL-Lite core the assertions are of the form B ⊑ C. In DL-LiteR the form R ⊑ E is also allowed. An ABox is a set of assertions of the forms Ai (a) and Pi (a, b) where constants a, b are from an active domain D. A knowledge base (or KB ) K = hT , Ai of a DL-Lite language contains a TBox T of the language and an ABox A. Semantics of DL-Lite An interpretation I = (DI , ·I ) contains a (possibly infinite) domain of elements DI such that D ⊆ DI , and maps each concept C to a subset C I of DI and each role R to a binary relation RI over DI such that (Pi− )I = {(a, b) | (b, a) ∈ PiI }, (¬B)I = DI \B I , (∃R)I = {a | ∃b : (a, b) ∈ RI }, (¬R)I = DI × DI \RI . An interpretation I is a model of a KB K = hT , Ai (written I |= K) if for any assertion B ⊑ C in T it holds that B I ⊆ C I , for any R ⊑ E it holds that RI ⊆ E I , for any Ai (a) in A it holds that a ∈ AIi , and for any Pi (a, b) it holds that (a, b) ∈ PiI . The definitions above say that D ⊆ DI in every interpretation I, which essentially means that for each constant a from the active domain D we have aI = a. By this we adopt the unique name assumption (UNA) on constants, which is conventional for DL-Lite. However, dropping this assumption does not affect any result of this paper, and we discuss explicitly how to adopt proofs wherever it is not straightforward. Conjunctive queries A conjunctive query (or CQ ) is an expression of the form q(x) :- ∃y φ(x, y), (1) where x is a tuple of free variables, y is a tuple of existential variables, and the body φ(x, y) is a conjunction of atoms of the form Ai (u) or Pi (u1 , u2 ), where u, u1 , u2 are variables from x ∪ y. A CQ (1), holds for an interpretation I and a tuple t of elements from DI (written I |= q(t)) iff there exists an evaluation from q to DI for t, i.e. a mapping h : x∪y → DI , such that h(x) = t and h(z) ∈ S I , for every atom S(z) in φ(x, y). A tuple t is in the certain answer to a CQ (1) over a KB K if I |= q(t) holds for every model I of K. 3 Counting Queries over Ontologies The ability to evaluate aggregate queries is a default in every DBMS and is in the standard of SQL. However, as mentioned in the introduction, little attention to this type of queries has been paid in the context of ontologies. Starting to fill this gap, in this section we formally define counting aggregate queries over ontologies of DL-Lite family and compare this definition with existing notions in related areas. 3.1 Syntax and Semantics of Counting Queries Following e.g. [9], an aggregate conjunctive query (or ACQ ) is an expression q(x, f (z)) :- ∃y φ(x, y, z), (2) where x is a tuple of free variables, y is a tuple of existential variables and z is a tuple of aggregation variables; the body φ(x, y, z) is a conjunction of atoms of the form Ai (u) or Pi (u1 , u2 ), where u, u1 , u2 are variables from x ∪ y ∪ z; and f (z) is an aggregation function. In this paper we consider two such functions: the unary count distinct function Cntd(z) and nullary count function Count(). We call such queries counting ACQs. Example 1. Let K = hT , Ai be a knowledge base where T consists of the as- sertion Parent ⊑ ∃HasChild, and A consists of the assertion Parent(Ann). The query q1 (x, Count()) :- ∃y Parent(x) ∧ HasChild(x, y) is an ACQ using the count function. Intuitively, it counts the children of each parent. The query q2 (Cntd(y)) :- ∃x Parent(x) ∧ HasChild(x, y) is a count distinct ACQ. This query counts all different children having a parent. To define the semantics of counting queries over a particular model we again follow [9]. We say that the core of an ACQ of the form (2) is the CQ q̄(x ∪ z) :- ∃y φ(x, y, z). Also, let N∞ be the set of natural numbers with 0 and +∞. A count ACQ q(x, Count()) holds for an interpretation I, a tuple t of ele- ments from DI and a number n ∈ N∞ (written I |= q(t, n)) iff n is the number of distinct evaluations from the core q̄ to DI for t. A count distinct ACQ q(x, Cntd(z)) holds for an interpretation I, a tuple t of elements from DI and a number n ∈ N∞ (written I |= q(t, n)) iff n is the number of distinct elements a ∈ DI such that I |= q̄(t, a) for the core q̄ of q. Example 2. Coming back to Ex. 1, consider the interpretation I where ParentI = {Ann} and HasChildI = {(Ann, Joe)}, which is clearly a model for K. Then it is not difficult to see that I |= q1 (Ann, 1) and I |= q2 (1). For the model J such that ParentJ = {Ann, Peter} and HasChildJ = {(Ann,Joe),(Ann,Rose),(Peter,Joe)}, it holds that J |= q1 (Ann, 2), J |= q1 (Peter, 1) and J |= q2 (2). 3.2 Certain Answers of Counting Queries over Ontologies A knowledge base normally describes not a single model, but an infinite number of them. This is why one is typically interested in computing the certain answers of queries over a KB, which are usually defined as the intersection of the answers of the query over all possible models of KB [4, 7]. Unfortunately, a definition based on such a syntactical intersection is of little use for ACQs, since it would almost always be empty. For instance, for the query q1 from Ex. 1 and 2 we have that I |= q1 (Ann, 1), and I 6|= q1 (Ann, 2), yet J 6|= q1 (Ann, 1) and J |= q1 (Ann, 2). This suggests avoiding using such a syntactic intersection when defining the semantics of ACQs over ontologies. In the context of OBDA this problem has been identified before by [11]. Their solution was to concentrate only on aggregating over epistemic knowl- edge, i.e. over values which are explicitly mentioned in the ABox of a KB. Such epistemic aggregate queries usually have a non-empty certain answer, based on the intersection, for all standard aggregate queries, including M ax and Average. However, for counting queries this answer may be somehow non-satisfactory. For example, the epistemic answer to the ACQ q1 over K from Ex. 1 is (Ann, 0), be- cause we do not know anybody who is definitely a child of Ann. That is why we suggest the following definition of certain answers of counting ACQs over DLs, which is essentially the minimum over possible values of the counting function over all the models of a KB. In particular, our certain answer to the query q1 over K from Ex. 1 contains (Ann, 1), which reflects the fact that we definitely know that Ann has at least one child in any model. We deem this definition to be in line with the open world assumption, adopted in ontologies. Definition 1. A non-negative number n ∈ N∞ is in the aggregate certain an- swers Cert(q, t, K) for a counting ACQ q, tuple of elements t, and a KB K iff n ≤ minI|=K {k | I |= q(t, k)}. Note that a definition like above is non-trivial only for counting standard aggregate queries. Indeed, it relies on their simple property that the minimum above can potentially be any number greater than or equal to 0. For other aggregation functions it is not the case: e.g. such a minimum for Average is trivially almost always −∞. 3.3 Range Semantics of Aggregate Queries The range semantics for aggregate queries was first proposed in [13] to study aggregate queries over inconsistent databases, and it was later adopted in data exchange [14, 15]. In the context of counting ACQs over ontologies it can be defined as follows. The range of answers for a counting ACQ q, a tuple t, and a KB K is the interval [m(q, t, K), M (q, t, K)], where m(q, t, K) = min {k | I |= q(t, k)}, M (q, t, K) = max{k | I |= q(t, k)}. I|=K I|=K It is easy to see that the lower bound of the range interval coincides with the maximal certain answer from Def. 1. Considering the upper bound, let’s come back to Ex. 1. We can find a model I of K such that I |= q1 (Ann, n) for any number n ≥ 1, i.e. in this case the upper bound is +∞. The following proposition says that this is not unusual. Proposition 1. Given a counting ACQ q, a tuple of elements t, and a DL-Lite KB K the upper endpoint M (q, t, K) of the range of answers belongs to the set {0, 1, +∞}, and can be computed in polynomial time (in the size of q and K). Proof. Indeed, M (q, t, K) = 0 iff hT , A ∪ Aq i has no model, where Aq is an ABox over the variables of q as constants, containing the atoms of q as assertions. Oth- erwise, we have that M (q, t, K) = 1 only if q uses count() and has no existentially quantified variables. In all the remaining cases we have that M (q, t, K) = +∞, since nothing prevents a model with an infinite number of witnesses. ⊓ ⊔ We may thus say that the aggregate certain answers semantics from Def. 1 is in fact an adaptation of the range semantics of [13] to ontologies. 4 Data Complexity of Counting Queries It has been argued many times that in usual database settings the size of the query and the TBox is much smaller than the size of the ABox (see e.g. [12] as a more general statement and [4] in the context of DL’s). This is why in query answering over ontologies one usually explores data complexity of problems, i.e. only database knowledge from ABox is considered as part of the input. In this section we do the same for aggregate certain answers. Formally, let X ∈ {core,R}, T be a TBox over DL-LiteX and q(x, f (z)) be a counting ACQ. We are interested in the following family of problems. DL-LiteX f -Aggregate Certain Answers (T , q) Input: ABox A, tuple t, and number n ∈ N∞ . Question: Is n ∈ Cert(q, t, hT , Ai)? 4.1 Count Queries We start with the lower bound for count ACQs. Lemma 1. There exist a DL-Litecore TBox T and a count ACQ q without free variables such that checking whether n ∈ Cert(q, t∅ , K), where K = hT , Ai, for an ABox A, a number n, and the empty tuple t∅ is coNP-hard. Proof (sketch). Let A, B and E, P be atomic concepts and roles. Let T = {A ⊑ ∃P, ∃P − ⊑ B} and q(Count()) :- ∃y1 . . . y4 B(y1 ) ∧ E(y2 , y3 ) ∧ P (y2 , y4 ) ∧ P (y3 , y4 ). The proof is by a reduction from the complement of the NP-complete 3- colouring problem with an undirected graph G(V, E) as input. Let D = V ∪ {r, g, b, a}. Let A contain E(u, v) and E(v, u) for each (u, v) ∈ E, A(v) for each v ∈ V, B(c) for each c ∈ {r, g, b}, and E(a, a), P (a, r). It holds that 4 ∈ Cert(q, t∅ , K) iff G has no 3-colouring. ⊓ ⊔ This lemma continues to hold if one drops the UNA. To adopt the proof, it suffices to state that r, g and b belong to pairwise disjoint concepts, and that a belongs to a concept that is disjoint with a concept containing all v from V. The proof above make use of the non-connectivity of the query. It is an interesting open problem whether the result holds for connected queries. Thus, the data complexity of count queries rises from P in the standard database case at least to coNP for DL-Lite knowledge bases. The following lemma establishes a matching upper bound for the problem. Lemma 2. Let T be a fixed DL-LiteR TBox and q(x, Count()) be a fixed count ACQ. Checking whether n ∈ Cert(q, t, K), where K = hT , Ai, for an ABox A, a tuple t, and a number n can be done in coNP. Proof (sketch). Given an interpretation J and a number k, it is well known that checking whether J |= K and J |= q(t, k) is in polynomial time (since q is fixed). Hence, it is enough to prove that if there exists a model I of K such that I |= q(t, n0 ) for a number n0 then there exists a model Ī of K of polynomial size in the size of A such that Ī |= q(t, n̄) for some number n̄ ≤ n0 . Note that K always has a model with a domain no bigger than |D| + |T |, so we may assume that n0 ≤ (|D| + |T |)|q| (which is polynomial since q is fixed). Fix I as above. There exists a homomorphism f : Can(K) → I, where Can(K) is the canonical model of K (see the definition in e.g. [4]). W.l.o.g. we assume that it is surjective, i.e. f (Can(K)) = I; otherwise we can drop elements and assertions of I which are not in the image of f , without increasing n0 . Let D∗ be all elements of DI which are either constants from D or images of variables by evaluations from the core of q to DI . We can construct an inter- pretation Î with the domain DÎ = ∪d∈DI \D∗ f −1 (d) ∪ D∗ and with a surjective homomorphism from Can(K) so that Î |= K and Î |= q(t, n̄) for some n̄ ≤ n0 . For every element d ∈ DÎ \D∗ define Nq (d) as a sub-interpretation of DÎ induced by all elements reachable from d by an (undirected) path though roles of length no more than |q| and without intermediate nodes from D∗ . Define equivalence Nq (d) ∼ Nq (d′ ) if there exists an isomorphism between Nq (d) and Nq (d′ ) preserving D∗ . Note that every element of the canonical model which is not in D, has at most |T | + 1 immediate neighbours. Hence each d ∈ DÎ \D∗ also has at most |T | + 1 immediate neighbours in Î. Moreover, it holds that |D∗ | ≤ n0 |q| + |D|. So, each Nq (d) is of polynomial size and there is only a polynomial number of equivalence classes induced by ∼. Consider the model Ī obtained from Î by merging all d1 , d2 such that Nq (d1 ) ∼ Nq (d2 ) and the distance from D to d1 and d2 in the canonical model modulo |q| + 1 is the same. The model Ī is as required, since such merging does not create new homomorphisms of the body of q. ⊓ ⊔ Note that the lower bound was shown for DL-Lite core , while the upper bound holds for any DL-LiteR KB. Since DL-LiteR is more expressive than DL-Lite core , the lemmas above give us the following complexity result. Theorem 1. The problem DL-LiteX Count-Aggregate Certain Answers (T , q) is coNP-complete in data complexity for any X ∈ {core,R}. 4.2 Count Distinct Queries The coNP bounds also apply for count distinct ACQs. The lower bound is again established by reduction from the complement of the 3-colouring problem. Lemma 3. There exist a DL-Litecore TBox T and a count distinct ACQ q without free variables such that checking whether n ∈ Cert(q, t∅ , K), where K = hT , Ai, for an ABox A and a number n is coNP-hard. Proof (sketch). Consider T = {∃E ⊑ ∃P } and q(Cntd(z)) :- ∃y1 . . . y4 P (y1 , z) ∧ R(y1 , y2 ) ∧ P (y2 , y3 ) ∧ P (y4 , y3 ) ∧ E(y4 , y2 ), where E, P , R are atomic roles. Let G(V, E) be an input graph as in the proof of Lem. 1. Let D contain the set of elements {v, v1 , v2 , v3 , v4 , v5 } for each v ∈ V, and elements a, a1 , a2 , a3 , r, g, b. Let A contain the assertions E(u, v) and E(v, u) for each (u, v) ∈ E; the asser- tions R(v, v1 ), P (v1 , v2 ), P (v3 , v2 ), E(v3 , v1 ), R(v4 , v), P (v4 , v5 ) for each v ∈ V; and the assertions R(a, a1 ), P (a1 , a2 ), P (a3 , a2 ), E(a3 , a1 ), P (a, r), P (a, g) and P (a, b). It holds that 4 ∈ Cert(q, t∅ , K) iff G has no 3-colouring. ⊓ ⊔ This lemma again holds for the case when UNA is dropped, and the proof can be adopted in the same way as the proof of Lem. 1. The matching algorithm is also similar to the count case. Lemma 4. Let T be a fixed DL-LiteR TBox and q(x, Cntd(z)) be a fixed count distinct ACQ. Checking whether n ∈ Cert(q, t, hT , Ai) for an ABox A, a tuple t, and a number n can be done in coNP. The proof goes the same lines as the proof of Lem. 2 except that we bound n0 by |D| + |T |, and include into D∗ the active domain D and all homomorphic images of the aggregation variable z to I. The following summarises the lemmas. Theorem 2. The problem DL-LiteX Cntd-Aggregate Certain Answers (T , q) is coNP-complete in data complexity for any X ∈ {core,R}. 5 Combined Complexity of Counting Queries As pointed out in Sec. 4 data complexity is the most used measure of algorithms in any database settings. However, combined complexity has its own value for understanding fundamental properties of problems. In this section we study the combined complexity of computing aggregate certain answers. Formally, let X ∈ {core, R} and f be a counting aggregate function. Now we are interested in the following family of problems. DL-LiteX f -Aggregate Certain Answers Input: KB K over DL-LiteX , f query q, tuple t, and number n ∈ N∞ . Question: Is n ∈ Cert(q, t, K)? 5.1 Count Queries We start again with count queries. Recall the algorithm to compute the certain answers for count queries explained in the proof of Lem. 2. Note that, if one takes into consideration the size of the query and the TBox, then this algorithm naturally gives a coNExpTime upper bound; the only difference is that in this case the number of neighbourhoods is of exponential size (w.r.t. q and T ), and thus the instance we need to guess is of exponential size. Next we show that this bound is tight for DL-LiteR . Lemma 5. The decision problem DL-LiteR Count-Aggregate Certain An- swers is coNExpTime-hard. The proof is by a reduction from the complement of the satisfiability problem for first-order logic (FO) formulas in the Bernays-Schöfinkel class [17]. This class contains all FO formulae of form ∃x ∀y ψ(x, y), with ψ a quantifier-free formula not using function symbols or equalities. The reduction is inspired by the tech- niques used in [18] to show coNExpTime-hardness of query answering problems in data exchange context. Unfortunately, the reduction above uses role inclusions in the TBox, i.e. it is applicable only to DL-LiteR . We leave open the exact complexity of the DL- Lite core Count-ACQ answering problem, although it is not difficult to adapt the results of the following section to obtain a Π2p lower bound. We have the summarizing theorem. Theorem 3. (1) The problem DL-Lite core Count-Aggregate Certain An- swers is in coNExpTime. (2) The problem DL-LiteR Count-Aggregate Cer- tain Answers is coNExpTime-complete. 5.2 Count Distinct Queries Just as we did for count queries, we can easily obtain a coNExpTime upper bound for count distinct ones from the proof of Lem. 4. However, for DL-Lite core we Data complexity Combined complexity DL-Lite Count Cntd Count Cntd core coNP-c coNP-c in coNExp Π2p -c R coNP-c coNP-c coNExp-c coNExp-c Table 1. A summary of the complexity results. Here “-c” stands for “-complete” and coNExp – for coNExpTime. can improve the complexity by almost one exponential. The idea is to redefine the sub-interpretations used in the proof of Lem. 4 to have them of polynomial size, while keeping the possibility of merging them. Lemma 6. There exists a Π2p -algorithm which solves the problem DL-Lite core Cntd-Aggregate Certain Answers. In this case we have the matching lower bound. Lemma 7. The problem DL-Lite core Cntd-Aggregate Certain Answers is Π2p -hard. The proof is by reduction from the Π2p -complete ∀∃ 3-SAT problem [19]. The remaining question is whether the algorithm for computing aggregate certain answers over DL-LiteR knowledge bases is optimal. We settle this with our last lemma, shown by a reduction similar to the one in the proof of Lem. 5. Lemma 8. The decision problem DL-LiteR Cntd-Aggregate Certain An- swers is coNExpTime-hard. Summing up, we have our last theorem. Theorem 4. (1) The problem DL-Lite core Cntd-Aggregate Certain An- swers is Π2p -complete. (2) The problem DL-LiteR Cntd-Aggregate Certain Answers is coNExpTime-complete. 6 Conclusion In this paper we have defined an intuitive semantics for counting aggregate queries over ontologies and explored the computational complexity of the cor- responding problems. The results, summarized in Table 1, show that the prob- lems are decidable, but intractable. Hence, heuristics and approximations for answering ACQs are on high demand from the practical point of view, with applications, for instance, in the definition of general aggregation in SPARQL under entailment regimes. We consider the epistemic semantics as one of such approximations, since it has lower data complexity but does not always pro- vide the desired answer. Our work settles the theoretical foundations for further discussion. References 1. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez- Muro, M., Rosati, R., Ruzzi, M., Savo, D.F.: The MASTRO system for ontology- based data access. Semantic Web 2(1) (2011) 43–53 2. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com- bined approach to ontology-based data access. In: IJCAI. (2011) 2656–2661 3. Cuenca Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler, U.: Owl 2: The next step for OWL. Web Semant. 6(4) (November 2008) 309–322 4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning 39(3) (2007) 385–429 5. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and relations. J. Artif. Intell. Res. (JAIR) 36 (2009) 1–69 6. Bienvenu, M., Ortiz, M., Simkus, M.: Answering expressive path queries over lightweight DL knowledge bases. In Kazakov, Y., Lembo, D., Wolter, F., eds.: Description Logics. Volume 846 of CEUR Workshop Proceedings., CEUR-WS.org (2012) 7. Rosati, R.: The limits of querying ontologies. In Schwentick, T., Suciu, D., eds.: ICDT. Volume 4353 of Lecture Notes in Computer Science., Springer (2007) 164– 178 8. Gutiérrez-Basulto, V., Ibáñez-Garcı́a, Y.A., Kontchakov, R.: An update on query answering with restricted forms of negation. In: Proceedings of the 6th interna- tional conference on Web Reasoning and Rule Systems. RR’12, Berlin, Heidelberg, Springer-Verlag (2012) 75–89 9. Cohen, S., Nutt, W., Sagiv, Y.: Deciding equivalences among conjunctive aggregate queries. Journal of the ACM 54(2) (2007) 10. Glimm, B., Ogbuji, C., Hawke, S., Herman, I., Parsia, B., Polleres, A., Seaborne, A.: SPARQL 1.1 entailment regimes (2013) W3C Recommendation 21 March 2013, http://www.w3.org/TR/2013/REC-sparql11- entailment-20130321/. 11. Calvanese, D., Kharlamov, E., Nutt, W., Thorne, C.: Aggregate queries over on- tologies. In Elmasri, R., Doerr, M., Brochhausen, M., Han, H., eds.: ONISW, ACM (2008) 97–104 12. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC. (1982) 137–146 13. Arenas, M., Bertossi, L., Chomicki, J., He, X., Raghavan, V., Spinrad, J.: Scalar aggregation in inconsistent databases. Theor. Comput. Sci. 296(3) (March 2003) 405–434 14. Libkin, L.: Data exchange and incomplete information. In Vansummeren, S., ed.: PODS, ACM (2006) 60–69 15. Afrati, F., Kolaitis, P.G.: Answering aggregate queries in data exchange. In: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS ’08, New York, NY, USA, ACM (2008) 129–138 16. Kostylev, E.V., Reutter, J.L.: Answering counting aggregate queries over ontologies of the DL-Lite family. In: Proc. of the 27th AAAI Conf. on Artificial Intelligence (AAAI). (2013) 17. Börger, E., Grädel, E., Gurevich, Y.: The Classical Decision Problem. Springer, Berlin (2001) 18. Arenas, M., Barceló, P., Reutter, J.L.: Query languages for data exchange: Beyond unions of conjunctive queries. Theory Comput. Syst. 49(2) (2011) 489–564 19. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman (1979)