=Paper=
{{Paper
|id=None
|storemode=property
|title=Count Aggregation in Semantic Queries
|pdfUrl=https://ceur-ws.org/Vol-1046/SSWS2013_paper1.pdf
|volume=Vol-1046
|dblpUrl=https://dblp.org/rec/conf/semweb/KostovK13
}}
==Count Aggregation in Semantic Queries==
Count Aggregation in Semantic Queries Bogdan Kostov, Petr Křemen Department of Cybernetics, Czech Technical University, Prague, Czech Republic {bogdan.kostov,petr.kremen}@fel.cvut.cz Abstract. In this paper we study the distinct count aggregation func- tion used in queries into expressive ontologies. The main differences in this settings opposed to aggregation in relational database systems are the Open World Assumption and incomplete knowledge. We propose different interpretations useful in different practical use-cases of the dis- tinct count function, i.e. basic count, semantic count, epistemic count and semantic tuple count some of which use knowledge derived from the ontology in order to obtain results in accordance with the Open World Assumption. We use interval semantics to model the uncertainty of the distinct count function’s result induced by incomplete knowledge in the ontology. We show that interval semantics are particularly useful in ag- gregate queries with filtering clause and when we need the retrieval of the boundaries of the uncertainty of the distinct count function’s result. We study and present relationships among the different interpretations and decidability of the semantic tuple count. We also propose a theoret- ical approximation of the semantic tuple count. Our results are applica- ble to a wide range of description logic formalisms allowing to express equality/inequality between individuals, concepts and relations, e.g. Web Ontology Language. Keywords: semantic counting, count function, OWL, semantic query 1 Introduction Aggregation functions are an important feature of modern query languages. Ag- gregation is well understood in query languages of database systems with Unique Name Assumption (UNA) principle, e.g. SQL for relational databases. Although this simple well-known interpretation of aggregation can in some cases be used correctly in queries over semantic knowledge bases, in general this interpreta- tion will return incorrect results as it will not use knowledge inferred from the knowledge base with the Open World Assumption (OWA). As opposed to that, semantic aggregation relies on the knowledge inferred from the knowledge base in order to provide correct result of aggregation functions. Currently the research filed of semantic aggregation is in its infant state. However the need of semantic aggregation is becoming apparent as more and more applications of semantic technologies emerge. In this paper we propose the need of different interpretations of the distinct count function in the settings of OWA and ontology representing incomplete knowledge, i.e. basic count (BC), semantic tuple count (ST C), epistemic count (EC) and semantic count (SC). We study the relationships among the different interpretations. We also investigate the decidability of the ST C interpretation. 2 Related Work In recent years, expressive query languages, like SPARQL-DL [1], OWL-SAIQL [2], SQWRL [3] for OWL 2 [4] or SeRQL [5], SPARQL [6] for RDF, have been introduced and implemented in the field of semantic web. There have been deep studies [7–11] evaluating conjunctive queries in RDF and OWL, but few efforts have been spent on an algebra, as well as aggregation functions. Recently the RDF query language SPARQL [6] has been extended towards the new SPARQL 1.1 1 [12] including many new constructs, e.g. aggregation functions or negation as failure. There are already publicly available implemen- tations, e.g. ARQ [13], KGRAM [14], RDF::Query [15] or Sesame [16]. Inde- pendently on SPARQL 1.1 authors of [17] discuss the topic of aggregation over data structured as RDF graphs rather than on the relational data returned by the query (the result set table). The authors stress the need of different modes of aggregation. A few years ago the SQWRL query language for OWL [3] was proposed. All these efforts implement or interpret aggregation using the basic relational semantics, i.e. the result of aggregation functions is computed over the results of the non-aggregate variant of the query and neglect the impact of the interplay between aggregation functions and the inferred domain elements. More closely related to our study is the work [18]. It shows that exact se- mantics, of aggregate queries over a DL-LiteA ontology returns results only if the Abox is not empty and if axioms in the Tbox resolve as constraints over car- dinalities of the groups in the aggregate query. The authors propose epistemic semantics of aggregate queries in the settings of data integration use-cases re- turning the aggregate function’s results known from (inferred by) the ontology. The authors propose an evaluation algorithm for a sub set of aggregate queries, i.e. restricted epistemic aggregate queries, defined using functional dependency of query variables w.r.t. the Tbox of the queried ontology. Although not exactly in the same framework of query answering like in this paper, the authors in [19] propose an approach using ontology design pattern for incorporating a quantification over types without actually using explicit nu- merical information. In this paper we examine the distinct count aggregation function, its possible interpretations and their use-cases. We focus our research on the distinct count aggregation function alone as it is non-trivial but not too overly complicated. We believe that better understanding of the distinct count function and its possible interpretations will cover most of the peculiarities of aggregation in the con- text of expressive knowledge bases assuming the OWA principle and incomplete knowledge, thus contributing to OWL and RDF. 1 SPARQL 1.1 version recently became a W3C recommendation. 2 3 Motivation In this section we present a simple ontology which describes a simple taxonomy of teachers categories and contains assertions about teachers and courses. The ontology is presented using the well known description logic syntax, see [20]. Example 1. A simple example ontology O1 about teachers and the courses they teach. Tbox BusyTeacher v Teacher, Professor v Teacher, ∃teaches·Course v Teacher, BusyTeacher v ≥ 3 teaches, Professor v ≤ 3 teaches Abox BusyTeacher(Sara), Professor(Steve), Professor(John), BusyTeacher(John), Course(math), Course(physics), Course(history), teaches(Dave, math), teaches(Dave, physics), teaches(Dave, history), teaches(Sara, history), . 6 history math = We will use the following two aggregate queries throughout the paper to demonstrate the differences in the results of the individual interpretations of the distinct count function. Q1 - Find all teachers and the number of distinct courses they teach. Q2 - Find all teachers that teach more than one distinct courses. Next we will discuss the need of different interpretations of the distinct count function. The most natural interpretation of the distinct count function is the semantics count interpretation. This interpretation enables users to query for the distinct count function value or its constraints as entailed by the queried ontology. For example using this interpretation in query Q1 we obtain that John teaches exactly three courses or using it in query Q2 we will obtain that Dave, Sara and John teach more than one course. We call this the semantics count interpretation and we consider it suitable for knowledge retrieval oriented use- cases. This interpretation is a natural extension of the certain answer semantics for the distinct count function and it is monotonic. Although that is the most natural and in fact correct extension of the seman- tics of the distinct count function, there are some use-cases that need different CWA interpretations. As argued in [18] for ontology based data access (OBDA) use-cases, the certain answer semantics for aggregate queries is not practical as they return trivial results, e.g. empty or very restricted result sets. The authors propose that in this context a practical interpretation will be the one that re- turns the least known number of courses they teach. This is called the epistemic count interpretation. As opposed to the semantic count, the result set of query Q1 with the epistemic count interpretation will contain for example that Dave and Sara teach respectively two and three courses. Note that the epistemic count interpretation may count both named and unnamed entities. This makes the use of this interpretation inappropriate in purely data-centric ontology applications. In [21] the authors use OWL to model 3 integrity constraints (IC) and propose IC CWA semantics to enable instance data validation. We propose an extension to this semantics for the distinct count function. We call this the semantic tuple count interpretation. As opposed to the previous interpretations the result of Q2 with the semantic tuple count in- terpretation will contain only Dave as he is the one for which the data in the ontology O1 satisfies the condition in the query. The semantic tuple count can be used to depict IC for n-ary relations. The most common interpretation of the distinct count function is the basic count interpretation. This interpretation is scalable and it is safe to be used in data oriented use-cases in CWA and UNA systems. The usage of this interpreta- tion in applications assuming OWA may return incorrect results, however, it can still be used as an approximation. For example the result of query Q1 contains the answer Dave teaches three courses, which may or may not be true according to the ontology. We continue with the formal definition of the different interpretations of the distinct count function and the discussion of the results of queries Q1 and Q2 in these interpretations. 4 Preliminaries In this section we will define basic terms and notions used in the rest of the paper. We will start with the definition of ontology followed by the definition conjunctive queries and aggregate queries with distinct count function. 4.1 Ontology Definition 1. An ontology O is a pair hS, Ai, where S is a signature and A is a set of axioms. The semantics of ontologies use a first order interpretation I = (∆I , ·I ), where ∆I is an interpretation domain and ·I : S → ∆I is an inter- pretation function mapping elements from the ontology signature S to elements from the interpretation domain ∆I . An ontology O is satisfied by an interpreta- tion I, denoted by I |= O, if all of its axioms are satisfied by the interpretation I, such interpretation I is called a model of O. We say that a set of axioms A is entailed by the ontology O, denoted by O |= A, if every model I of the ontology O, is also a model of A, I |= A. Next we define the notion of the monotonic extension O0 of the ontology O. Definition 2. We say that O0 = hS 0 , A0 i is an extension of O = hS, Ai if A ⊂ A0 and S ⊂ S 0 . O0 is monotonic if the original ontology O is entailed by O0 , 0 0 O0 |= O. A model I 0 = (∆I , ·I ) of the ontology O is an extension of the model 0 0 0 I = (∆I , ·I ) |= O if ∆I ⊆ ∆I and ·I ⊆·I . For a full description of the syntax and semantics of different description logic formalisms see [20]. Note that our discussion is also applicable for OWL 2 ontologies [22, 23] since OWL 2 is backed by the SROIQ(D) description logic. 4 We consider description logics which allow only monotonic extensions. In section 5.2 where we prove decidability of the semantic tuple count we further restrict to a subfamily of description logics that enable expressing whether two entity objects are different, the same or it is not known, which we will refer to as equality/inequality relation. We require the OWA assumption over the equal- ity/inequality relation because it provides means to model incomplete knowl- edge, and it is the fundamental source of uncertainty of the semantic tuple count function’s value. 4.2 Conjunctive Queries The discussion in this paper about distinct count function and its proposed interpretations is set in the context of conjunctive queries. Definition 3. We will denote conjunctive queries using the following rule like notation Q(x̄) ← φ(x̄, z̄). (1) The head of the query Q(x̄) denotes the name of the query and the result variables Rvar (Q) = x̄. The body of the query φ(x̄, z̄) is a comma separated list of query atoms interpreted as a conjunctive query, as defined in the SPARQL-DL 2 query language. The query atom list φ may contain non result variables z̄. Note that the result variables must be distinguished. By Vvar (Q) we denote the list of all variables in the query. A binding µ : Vvar (Q) → S is a mapping of the variables of the query to elements in the ontology’s signature and Q|µ is the substitution of the variables in Q by the binding µ. The binding substitution of a tuple of variables v̄ = (v1 , . . . , vk ) is denoted by µ(v̄) = (µ(v1 ), . . . , µ(vk )). We call Q a ground query if there are no variables in the query. A solution to the query Q w.r.t. the ontology O is a binding µ for which the substitution Q|µ is a ground query, the body of which is entailed by the ontology (denoted by O |= Q|µ ). The set of all possible solutions of Q w.r.t. O or the model I of O is denoted by SatO I Q = {µ|O |= Q|µ } or SatQ = {µ|I |= Q|µ } respectively . The result set of query Q w.r.t. O denoted by QO , is a set of bindings of the result variables Rvar (Q), formally QO = {ā|ā = µ(Rvar (Q)) ∧ µ ∈ SatO Q }. 4.3 Aggregate Queries Here we define the types of aggregate queries along with their simple syntax used for the purpose of representing aggregate queries in this paper. We also define some additional terminology and symbols used later in the paper. Definition 4. We will denote distinct count retrieval and distinct count filtering queries respectively as follows: Qa (x̄, countSdist (ȳ)) ← φ(x̄, ȳ, z̄) (2) 2 See [1] for list of atoms and their interpretations. 5 Qac (x̄) ← (·op ncountSdist (ȳ)), φ(x̄, ȳ, z̄) (3) In the queries of the type Qa , the head Qa (x̄, countSdist (ȳ)) specifies the disjoint sets of grouping Gvar (Qa ) = x̄ and aggregation Avar (Qa ) = ȳ variables. The distinct count aggregation function which returns the number of distinct tuples according to the semantics mode specified by the superscript S w.r.t. the ontology O is denoted by countSdist . The body of Qa contains a query atom list. We also consider distinct count filtering by comparison queries of the form Qac show in (3). The head of queries of these type contain only the grouping variables. The body of the query contains cardinality restriction atom3 where ·op ∈ {>, <, ≤, ≥, =, 6=}. By Q∗ we denote a non aggregate variant of Q, obtained from Q by removing its aggregate function from the head or the comparison predicate in the body. The result variables of Q∗ are the union of group and aggregate variables of Q, Rvar (Q∗ ) = Gvar (Q) ∪ Avar (Q). Before we define the general semantics of the result of aggregate queries we clarify and define the auxiliary terms and notations in the following three definitions. ∞ Definition 5. Let N0,∞ be the extension of the set of natural numbers with ∞ 0,∞ ∞ zero and infinity N = N ∪ {0, ∞ }. Let L be a subset of N0,∞ and let Int(L) denote the smallest interval containing L,Int(L) = hinf(L), sup(L)i. We extend ∞ the intuitive comparison between elements in N0,∞ with comparison between sets ∞ 0,∞ ∞ 0,∞ L ⊆ N and elements n ∈ N . L ≤ n (L < n) is be true if and only if sup(L) ≤ n ∨ sup(L) = n = ∞ (sup(L) < n). L ≥ n (L > n) is true if and only if inf(L) ≥ n ∨ inf(L) = n = ∞ (inf(L) > n). Note that the symbol ∞ is an ∞ element of N0,∞ . Next we define the interpretation a tuple and a set of tuples. Definition 6. Let O = hS, Ai be an ontology, I = (∆I , ·I ) a model of O and T be a set of tuples composed of elements in S. The interpretation of the of the tuple t̄ = (t1 , t2 , . . . , tk ), t̄ ∈ T is t̄I = (tI1 , tI2 , . . . , tIk ). The interpretation of T is T I = {t̄I |∀t̄ ∈ T }. Next we define aggregate groups or simply groups as the set of tuples with common grouping variable binding. Definition 7. Let Q be an aggregate query of type Qa or Qac , O be an ontology, I a model of O and let k̄ = µ(Gvar (Q∗ )), where µ is an arbitrary binding. Then the aggregate group, denoted by Γ (O, Q, k̄), with key k̄ is equal to the set of tuples {µ(Avar (Q))|µ(Gvar (Q)) = k̄ ∧ µ ∈ SatO Q∗ }. The aggregate group with key k̄ in a tuple set T is Γ (T, k̄) = {ā|∀t̄ ∈ T, (k̄, ā) = t̄}. 3 Note that we reuse the well known notation for property cardinality restrictions, see [20]. 6 Note that the definition of the aggregate group Γ (O, Q, k̄) w.r.t. to the on- tology O can be used also to obtain an aggregate group w.r.t. to the model I of O, i.e. Γ (I, Q, k̄). Next we define results of aggregate queries in terms of the counting func- S tion f[O,Q] and the set K S (O, Q). Informally the counting function f[O,Q] S is an uncertainty aware generalization of the distinct count function, it takes as an input the key of the group to be counted and returns a set of possible values w.r.t. O and Q. The set K S (O, Q) is the set of keys of the groups to be counted. The counting function and the key of sets will be defined in each of the concrete distinct count interpretations. The superscript is used to distinguish among the different interpretations. ∞ Definition 8. Let D be the set of all tuples of arbitrary length and P(N0,∞ ) be ∞ S ∞ the power set of N . Let S be an aggregate semantic, f[O,Q] : D → P(N0,∞ 0,∞ ) S be the counting function and K (O, Q) ⊂ D be the key set in S. The result of an aggregate query Q of type (2) w.r.t. the ontology O is QO = {(k̄ ∈ K S (O, Q), inf (L))|L := f[O,Q] S (k̄) ∧ inf(L) = sup(L)} and the result set of an aggregate query Q with comparison filtering, i.e. queries of the form Qac , is QO = {k̄ ∈ K S (O, Q)|L := f[O,Q] S (k̄) ∧ L ·op n}. The interpretation of the last condition L ·op n is defined in Definition 5. 5 Distinct count function In this section we formally define the semantics of each of the various interpreta- S tions of the counting function f[O,Q] and the set of keys K S (O, Q). We present this definitions in order to be able to show formally the relations between the interpretations and in the case of SC and ST C to be able evaluate comparison filtering queries in OWA. We show and discuss the results of the example queries Q1 and Q2 from section 3 in each of the proposed semantics. We also study the relationship between individual interpretations of the distinct count function. We prove decidability of the semantic tuple count ST C interpretation. Finally we show how to enable distinct count queries in the context of the description logic SROIQ. In (4) we show the queries Q1 and Q2 from example 1 in the notation defined in Definition 4. Q1 (?t, countdist ?c) ← PropertyValue(teaches, ?t, ?c) (4) Q2 (?t) ← (>1countdist (?c)), PropertyValue(teaches, ?t, ?c) The tables in this section showing the results of the queries have the following notation. Only cells highlighted with grey background color 4 are part of the result set of the query. The other cells have white background. Note that through out this section O = hS, Ai is an arbitrary ontology with a signature S and an axiom set A, Q is an arbitrary aggregate query, O1 refers to the ontology in example 1 and Q1 and Q2 refer to the queries shown in (4). 4 The use of colors in this paper is intended to be readable in gray scale copies. 7 5.1 Known interpretations of the Distinct Count In this section we discuss three known interpretations of the distinct count func- tion in aggregate queries. Basic Count Interpretation The basic count interpretation (BC) is used originally in the SQL query language, but it is also used in semantic query languages, e.g. SPARQL and SQWRL. This interpretation is not adequate for ontological knowledge as it does not infer the distinct count function’s value from the ontology. The basic count interpretation can be implemented in n log n time, where n is the size of the tuple set to be counted. Definition 9. The set of keys K BC (O, Q) is the set of all syntactically distinct result bindings of the grouping variables Gvar (Q) of the query Q over the ontology O, i.e. K BC (O, Q) = {k̄|k̄ = µ(Gvar (Q)) ∧ µ ∈ SatO Q∗ }. For k̄ ∈ K BC (O, Q) the BC counting function is defined as f[O,Q] = {|Γ (O, Q, k̄)|}. The result of the distinct count query Q1 is shown in table 1(a). Table 1. The result set of the BC interpretation of the aggregate queries Q1 and Q2 . (a) result of Q1 (b) result of Q2 ?t countBC dist (?c) ?t (> 1countBC dist (?c)) Dave 3 Dave true Sara 1 Sara false There are two group keys Dave and Sara. The number of courses Dave teaches is three because there are three courses that Dave teaches asserted in the on- tology, whose names are syntactically different. For Sara, who teaches only one course according to the ontology, the result of the count function is one. We can see that this interpretation ignores the implicit knowledge derived from the fact that Sara is a BusyTeacher that teaches at least three courses. Semantic Count Interpretation Informally the semantic count (SC) inter- pretation counts the number of possible tuples entailed by the queried ontology. This interpretation is similar to the exact semantics presented in [18]. However our definition of SC presented here does not return a single value of the distinct count function but a set of possible values. While this definition has no effect on the results of queries of type Qa it ensures monotonic results in the queries of type Qac . The decidability of the SC interpretation is an open problem. Never- theless we believe that the minimum of the SC interpretation is more likely to be decidable. 8 Definition 10. The semantic count interpretation is denoted by SC. Let l = |Gvar (Q)| be the number of grouping variables in Q and S = S l is the set of all tuples of length l composed of elements from the signature S. The SC counting SC function is defined as follows f[O,Q] (k̄) = {|T I | |T := Γ (I, Q, k̄), ∀I |= O}. The SC key set is defined as follows K BC (O, Q) = {k̄ ∈ S| sup(f[O,Q] SC (k̄)) > 0}. Next we discuss the results of the example queries Q1 and Q2 , with formal representation shown in (4), over the ontology O1 from example 1. The last columns in the tables 2(a) and 2(b) show the boundaries of the intervals found for each of the group keys from the first column. Table 2. The result of the aggregate queries Q1 Q2 in 4 with SC interpretation. (a) result of Q1 (b) result of Q2 ?t countSC SC dist (?c) Int(f[O1 ,Q1 ] ) ?t (> 1countSC SC dist (?c)) Int(f[O1 ,Q2 ] ) Dave - h2, ∞ i Dave true h2, ∞ i Sara - h3, ∞i Sara true h3, ∞i Steve - h0, 3i Steve false h0, 3i John 3 h3, 3i John true h3, 3i math - h0, ∞ i math false h0, ∞ i In table 2(a) we show the results of query Q1 . Next we discuss the intuition of the calculation of the values of the SC counting function for each of groups in table 2(a). In order to obtain the final result of the query we need to apply the comparison semantics from definition 8. The last row in both tables show an unexpected group which the algorithm should process. In fact we have two more groups that we omitted from the table which are with keys history and physics. This is an effect of the OWA assumption. This anomaly is caused by the fact that the ontology O1 does not explicitly state that math, history and physics are not teachers and that only teachers can teach and thus allowing the existence of models in which subjects teach something. Moreover the calculated interval is zero to infinity because there are no axioms which constraint it. The SC can be used to locate such unwanted behavior. The minimum in the first row with group key Dave is obtained from the two semantically distinct courses math and history that Dave teaches. The maximum of the first row is infinity because the ontology does not constrain the number of courses Dave can teach. The minimum of the second row with group key Sara is determined from the constraint that a BusyTeacher teaches at least three courses and the fact that Sara is a BusyTeacher and who teaches history. In this case there are two restrictions, ’at least three’ and ’at least one’ which constraint the minimum of courses Sara teaches. In such cases we should select the most specific one. Therefore the minimum of the second row’s interval is three since the first re- striction is more specific. The evaluation of the maximum in the second row’s interval is analogous to the one in the first row. The third row’s minimum is 9 zero because Steve is not constrained to teach a minimum number of courses as Dave and Sara were in the first and second rows. The maximum in the third row is derived from the fact that Steve is a professor and the constraint of the Professor class which limits its instances to teach at most three courses. Now we apply interval semantics to the first three rows that we discussed. The three rows are not included in the result of query Q1 with SC semantics because ac- cording to the definition of the results of queries of type Qa in definition 8 which states that set returned by the counting function must be singleton. The answer contains only the last row because the return value of the counting function is a singleton containing only the number three. This is true because the John is both a Professor and a BusyTeacher the constraints of which were already discussed. The results of query Q2 as well as all the relevant group keys of Q2 are shown in table 2(b). We have discussed the interval boundaries of the intervals for groups of Q1 and since the Q2 has identical groups, note that K SC (O1 , Q1 ) = K SC (O1 , Q2 ), we skip this explanation for query Q2 . The result of Q2 contains all rows except for the third one with group key Steve that does not satisfies the comparison filter which limits the retrieval of groups with at least two distinct tuples in all models of the ontology O1 . We point out the importance of interval semantics in the settings of incom- plete knowledge. Note that in query Q2 we obtained results that were not present in the result of Q1 . This proves that the results of query Q2 obtained by filtering the results of query Q1 won’t contain the full result entailed by the ontology. Epistemic Count Interpretation The epistemic count (EC) interpretation is introduced in the work [18]. Informally this interpretation of the distinct count function returns a single value which represents the known number of distinct tuples in an aggregate group. For the number k of known tuples holds that in any model I of the ontology O there is at least k distinct tuples for the counted group. Because countECdist interpretation is equivalent to the infimum of the SC counting function here we define the countEC dist interpretation in terms of SC counting function and key set. The decidability of the EC interpretation is an open problem an it is equivalent to the decidability of the minimum of the SC interpretation problem. Definition 11. The EC key set is defined as K EC (O, Q) = K SC (O, Q). The EC EC SC counting function is defined as f[O,Q] (k̄) = {inf(f[O,Q] (k̄))}. In tables 3(a) and 3(a) we can see the results of the aggregate queries Q1 and Q2 respectively with distinct count function interpreted with the EC semantics. The results are identical with those of the minimum of the interval in tables 2(a) and 2(b) and are not discussed further. 5.2 Semantic Tuple Count Interpretation The semantic tuple count (ST C) interpretation is defined using interval seman- tics. Informally it counts the same tuples as the BC interpretation but it uses 10 Table 3. The result of the aggregate queries Q1 and Q2 in (4) with the EC interpre- tation. (a) result of Q1 (b) result of Q2 ?t countEC dist (?c) ?t (> 1countEC dist (?c)) Dave 2 Dave true Sara 3 Sara true John 3 John true math 0 math false knowledge in the ontology to derive the equivalence/inequivalence relation be- tween the counted tuples which is needed to remove semantically duplicate tuples and also to deal with uncertainty of the count’s value. Definition 12. The ST C key set K ST C is defined as the key set of the BC interpretation, i.e. K ST C (O, Q) = K BC (O, Q). We define the ST C counting ST C function as follows f[O,Q] (k̄) = {|Γ (O, Q, k̄)I ||∀I |= O}. Table 4. The result of the aggregate queries Q1 and Q2 in (4) with the ST C interpre- tation. (a) result of query Q1 (b) result of query Q2 ?t countST C ST C dist (?c) Int(f[O1 ,Q1 ] ) ?t (> 1countST C dist (?(c))) Dave - h2, 3i Dave true Sara 1 h1, 1i Sara false Next we discuss the results of queries Q1 and Q2 from example 1 with the ST C interpretation. Here we also show all the group keys that should be consid- ered during evaluation of the ST C interpretation and since the ST C interpreta- tion is interval based we also show the evaluated intervals in the last columns of the tables 4(a) and 4(b). In table 4(a) we have the results of Q1 . Now we discuss the values of the interval boundaries in rows one and two. The minimum in the first row is based on the two distinct courses math and history. The maximum is three because there are three entailed tuples in the group of the first row and because there is no other axioms that constraints the maximum to be smaller. The minimum in the second row is one because we have only one tuple for the Sara group. This is also the only possible value for the maximum of row two because there are no other entailed tuples in that group. Apparently from the interval semantics only the second row is returned. The results for the query Q2 are shown in table 4(b). We already discussed the interval boundaries in the discussion of query Q1 . Applying the interval semantics we filter the second row because Sara teaches only one distinct course and therefore it is not contained in the result. The first is contained in the result 11 because the minimum boundary, which is 2, is bigger than the less than operand in the filtering atom, which is one. Decidability of the ST C Interpretation In this section we prove decidability of the ST C interpretation. We present only the statements of the most relevant lemmas and the proof of the theorem at the end of the section. The omitted proofs and lemmas can be found in the technical report [24]. First we define the family of formalisms for which we prove that the ST C interpretation is decidable. Definition 13. We assume that the supported formalisms F (i) are capable of expressing the equality/inequality relation among elements of the ontology (ii) consistency check of ontologies and query answering is decidable in F and (iii) that F is monotonic. Implementing the evaluation of the ST C counting function based on defini- tion 5.2 is not feasible because the ontology O might have an infinite number of models. We show that there is a finite number of models sufficient for the calculation of the boundaries of the value of the ST C counting function. The next proposition states that the interval of the distinct count function with SC and ST C interpretation w.r.t. the ontology O will be more specific if we extend the queried ontology. The interval calculated w.r.t. the original ontology will include the one calculated from the extended ontology. Note that in this section we will use identical equality and inequality axioms, denoted by =a and 6=a respectively, for all type of entities, i.e. individuals,classes and properties. In the following section we show an equivalent representation of this axioms in the concrete logic SROIQ. Definition 14. Let O = hS, Ai be an ontology, T be a tuple set composed of elements in the set D ⊆ S. Let I = (∆I , ·I ) be an interpretation such that ·I is defined on D then the complete set of equality/inequality axioms satisfied by I is denoted as A# (I, D) = {a =a b|∀a, b ∈ D, aI = bI } ∪ {a 6=a b|∀a, b ∈ D, aI 6= bI }. We denote the set all complete sets of equality/inequality axioms between elements in D w.r.t. the ontology O as A# (O, D) = {A# (I, D)|∀I |= O}. The cannonic model of A# = A# (I, D), denoted by IA# = (∆IA# , ·IA# ) is a model with an interpretation function ·IA# the domain of which is D. Note that adding equality/inequality axioms to the complete set A# (I, D) wont change the original set or if it does the resulting set is unsatisfiable. Note also that the cannonic interpretation function ·IA# can be constructed in poly- nomial time. Lemma 1. Let O = hS, Ai be an ontology, D ⊆ S, then A# (O, D) is finite. Lemma 2. Let T be a tuple set composed of elements from set D and two inter- pretations ·I1 and ·I2 which agree on the equivalence and inequivalence between elements in D. Then the number of elements in the sets T I1 and T I2 is the same, |T I1 | = |T I2 |. 12 The next lemma states that for extensions of the ontology O with an axiom set from A# (O, D) the ST C counting function returns a singleton set and that the only value in the set can be calculated in polynomial time. Lemma 3. Let O = hS, Ai be an ontology, Q be an aggregate query, k̄ ∈ K ST C , T = Γ (Q∗O , k̄) and D be the set of elements composing tuples in T , I = (∆I , ·I ) be a model of O, A# = A# (I, D), O0 = hS, A ∪ A# i and ·IA# be the cannonic ST C I IA# interpretation function of A# . Then f[O 0 ,Q] (k̄) = {c}, where c = |T | = |T |. Instead of looking for the boundaries of the ST C interval in the set of all models as the definition 12 suggests, we can search for the boundaries in the finite set of representations of the equality interpretation. We first describe an algorithm which terminates in a final number of steps and then we prove its correctness. Algorithm 1 : STC - Semantic Tuple Count procedure PROCEDURE STC INPUT : O // the queried ontology, T // set of tuples to be counted OUTPUT : // the calculated interval a := inf; b := 0; D := {elements used in tuples in T}; FORALL A# IN A#(O,D) DO A’ := union(A, A#); IF A’ is consistent construct cannonic interpretation I#(A#); c := |T interpreted by I#(A#)|; IF a > c THEN a := c; IF b < c THEN b := c; END-IF END-FORALL RETURN ; END We will prove decidability by proving correctness and termination of algo- rithm 1. Theorem 1. The algorithm 1 terminates and evaluates correctly the interval ST C Int(f[O,Q] (k̄)) for an ontology O and aggregate query Q and group key k̄. Proof. Algorithm 1 terminates since there is a finite number of extensions to check. From lemma 3 we have that the ST C counting function is a subset of the calculated interval. Also from lemma 3 we have that the interval is the smallest because the boundaries correspond to some model of O. Corollary 1. Calculation of the ST C interval is decidable in the family of for- malisms defined in Definition 13. 13 ST C Interpretation approximation The proposed algorithm 1 is searching trough all the possible extensions for the set of elements D. As we show in corol- 2 lary 1 there are n = 2|D| possible extensions for each of which we need to make a consistency check. In this section we propose an approximation of the ST C interpretation which can be used for example as optimization of an evaluation algorithm. The approach presented here utilizes the knowledge inferred from the ontology. The prove of the correctness of the approximation is presented in the technical report [24]. Definition 15. Let G = (V, E) be an undirected graph with nodes V and edges E. A connected component K in G is a subgraph in G maximal with the property, for each pair of nodes in K there is a path in K. A (maximal) clique C is a graph maximal with the property C is a subgraph of G and is complete. The biggest clique C in G is called maximum. In order to define the approximation formally we need first to define the auxiliary term difference graph, and the notion of maximum clique. Definition 16. Let O = hS, Ai be an ontology, let T be a set of tuples of length l and composed of elements from the set D ⊆ S. Let s̄, t̄ ∈ T , equality s̄ =a t̄ and inequality of tuples w.r.t. the ontology O is defined respectively as O |= {s1 =a t1 , . . . , sl =a tl } and O |= si 6=a ti for some 1 ≤ i ≤ l. Let G= (O, T ) = (D, E= (O, T )) be the graph with edges E= (O, T ) = {(a, b)|∀(a, b) ∈ T 2 , O |= a =a b}. The difference graph is denoted by ∆G(O, T ) = (V, E) w.r.t. O and T . The set nodes is the set of connected components of G= (O, T ), there is an edge between the nodes u and v in V , {u, v} ∈ E if and only if the ontology entails 6=a axiom between some pair of the set u × v. Theorem 2. Let O be an ontology, Q an aggregate query, k̄ ∈ K ST C (O, Q) and let T = Γ (O, Q, k̄) be the group to be counted. Let the ∆G(O, T ) = (V, E) be the difference graph w.r.t. O and T and Cmax = (VC , EC ) be the maximum clique ST C graph of ∆G(O, T ). Int(f[O,Q] (Vvar (k))) ⊆ h|VC |, |V |i. ST C interpretation in SROIQ In this section we discuss the countST dist C function in a concrete description logic SROIQ. The ST C interpretation is decidable in this description logic since SROIQ satisfies all the requirements shown in Definition 13. We consider any query language which supports con- junctive queries with mixed Abox, Tbox, Rbox terms, e.g. SPARQL-DLN OT [25]. In order to apply the ST C interpretation in this scenario we need to show representations of equality and inequality between elements of the signature of the ontology which will be used to generate the complete set of axioms in A# . 5.2 shows the representations for each of the three term types. Individuals ik , k ∈ 1, 2, 3..., are unique and not contained in the ontology. Note that classes and properties have two possible ways of representing the inequality relation. If one of the representations fails we must also test the other when checking whether two terms of type class or property need to be compared for inequality. 14 Table 5. Semantics of comparison of different ontology elements type =a 6=a . . 6 i2 individuals i1 = i2 i1 = classes C1 ≡ C2 {C1 (i1 ), ¬C2 (i1 )} or {C2 (i2 ), ¬C1 (i2 )} properties p1 ≡ p2 {p1 (i1 , i2 ), ¬∃p2 .{i2 }(i1 )}) or {p2 (i3 , i4 ), ¬∃p1 .{i4 }(i3 )} 6 Conclusion We investigated the distinct count function in the context of different semantics, i.e. BC, SC and EC. We introduced a new interpretation ST C and compared its behavior with the other interpretations. We found the following relation- SC ships between SC and EC, inf(f[O,Q] ) = EC and also between SC and ST C, ST C SC ST C SC inf(f[O,Q] (k̄)) ≤ inf(f[O,Q] (k̄)) and sup(f[O,Q] (k̄)) ≤ sup(f[O,Q] (k̄)). We proved decidability of the ST C interpretation in the context of the selected family of formalisms, provided an approximation using basic graph problems and showed how to apply the ST C in the SROIQ description logic. In future work we would like to focus on the implementation of the evaluation of the ST C interpretation into the SPARQL-DLN OT [25] query language. In order to provide practical implementation research in optimizing the evaluation 2 is needed. In the worst case algorithm 1 will issue 2n consistency checks where n is the number of counted tuples. We are also interested in the investigation of the decidability of SC and EC. Acknowledgments This work has been supported by the grant by the grant of the Czech Technical University in Prague No. SGS13/204/OHK3/3T/13 Effective solving of engineering problems using semantic technologies. Authors also want to express thanks to the anonymous reviewers for providing useful comments during manuscript preparation. References 1. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL Query for OWL-DL. In: 3rd OWL Experiences and Directions Workshop (OWLED-2007). (2007) 2. Kubias, E., Schenk, S., Staab, S., Pan, J.Z.: OWL SAIQL - an OWL DL Query Language for Ontology Extraction. In: In Proc. of OWLED-07. (2007) 3. O’Connor, M.J., Das, A.K.: SQWRL: A query language for OWL. In: OWLED. (2009) 4. Group, W.O.W.: OWL 2 Web Ontology Language Document Overview. W3C Recommendation, W3C (October 2009) http://www.w3.org/TR/2009/REC-owl2- overview-20091027, cit. 04.12.2012. 5. Broekstra, J., Kampman, A.: An rdf query and transformation language. In Staab, S., Stuckenschmidt, H., eds.: Semantic Web and Peer-to-Peer. Springer Berlin Heidelberg (2006) 23–39 6. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Recommendation, W3C (January 2008) http://www.w3.org/TR/2008/REC-rdf- sparql-query-20080115, cit. 3.2013. 15 7. Horrocks, I., Kutz, O., Sattler, U.: The Even More Irresistible SROIQ. In Doherty, P., Mylopoulos, J., Welty, C.A., eds.: KR, AAAI Press (2006) 57–67 8. Sirin, E., Parsia, B.: Optimizations for Answering Conjunctive ABox Queries. In: Description Logics. Volume 189 of CEUR. (2006) 9. Křemen, P., Kouba, Z.: Conjunctive Query Optimization in OWL2-DL. In: Pro- ceedings of the 22th International Conference on Database and Expert System Applications (DEXA 2011). Volume 6861 of LNCS., Springer Verlag (2011) 10. Glimm, B., Horrocks, I., Lutz, C., Sattler, U.: Conjunctive Query Answering in the Description Logic SHIQ. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007). (2007) 11. Kollia, I., Glimm, B., Horrocks, I.: Query Answering over SROIQ Knowledge Bases with SPARQL. In: Proceedings of the 2011 International Workshop on Description Logic (DL 2011). (2011) 12. Seaborne, A., Harris, S.: SPARQL 1.1 Query. W3C Working Draft, W3C (October 2009) http://www.w3.org/TR/2009/WD-sparql11-query-20091022, cit. 3.2013. 13. Apache: ARQ - A SPARQL Processor for Jena, web site (April 2011) http://jena.apache.org/documentation/query/index.html, cit. 3.2013. 14. Corby, O.: Kgram: a knowledge graph abstract machine, web site http://wimmics.inria.fr/corese, cit. 3.2013. 15. Williams, G.T.: RDF Query 2.909 - RDF::Query - A complete SPARQL 1.1 Query and Update implementation for use with RDF::Trine, web site (November 2012) http://search.cpan.org/dist/RDF-Query/, cit. 3.2013. 16. Arjohn Kampman, Christiaan Fluit, J.B.: Sesame, web site (January 2013) http://sourceforge.net/projects/sesame, cit. 3.2013. 17. Seid, D.Y., Mehrotra, S.: Grouping and aggregate queries over semantic web databases. In: ICSC. (2007) 775–782 18. Calvanese, D., Kharlamov, E., Nutt, W., Thorne, C.: Aggregate queries over on- tologies. In: ONISW. (2008) 97–104 19. Martı́nez, D.C., Janowicz, K., Hitzler, P.: A logical geo-ontology design pattern for quantifying over types. In: SIGSPATIAL/GIS. (2012) 239–248 20. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F., eds.: The Description Logic Handbook: Theory, Implementation, and Applications. In Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F., eds.: Description Logic Handbook, Cambridge University Press (2003) 21. Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Integrity constraints in owl. In: AAAI. (2010) 22. Motik, B., Parsia, B., Patel-Schneider, P.F.: OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax. W3C recommendation, W3C (October 2009) http://www.w3.org/TR/2009/REC-owl2-syntax-20091027, cit. 12.12.2012. 23. Patel-Schneider, P.F., Motik, B., Grau, B.C.: OWL 2 Web Ontol- ogy Language Direct Semantics. W3C Recommendation, W3C (Oc- tober 2009) http://www.w3.org/TR/2009/REC-owl2-direct-semantics-20091027, cit. 12.12.2012. 24. Kostov, B., Kremen, P.: Count aggregation in semantic queries - technical re- port. Technical report, Czech Technical University in Prague, Dept. of Cybernetics (2013) 25. Kremen, P., Kostov, B.: Expressive OWL Queries: Design, Evaluation, Visualiza- tion. International Journal On Semantic Web and Information Systems (2012) IGI Publishing. To appear in 2013. 16