=Paper=
{{Paper
|id=None
|storemode=property
|title=Count Aggregation in Semantic Queries
|pdfUrl=https://ceur-ws.org/Vol-1046/SSWS2013_paper1.pdf
|volume=Vol-1046
|dblpUrl=https://dblp.org/rec/conf/semweb/KostovK13
}}
==Count Aggregation in Semantic Queries==
<pdf width="1500px">https://ceur-ws.org/Vol-1046/SSWS2013_paper1.pdf</pdf>
<pre>
       Count Aggregation in Semantic Queries

                           Bogdan Kostov, Petr Křemen

              Department of Cybernetics, Czech Technical University,
                            Prague, Czech Republic
                  {bogdan.kostov,petr.kremen}@fel.cvut.cz


      Abstract. In this paper we study the distinct count aggregation func-
      tion used in queries into expressive ontologies. The main differences in
      this settings opposed to aggregation in relational database systems are
      the Open World Assumption and incomplete knowledge. We propose
      different interpretations useful in different practical use-cases of the dis-
      tinct count function, i.e. basic count, semantic count, epistemic count
      and semantic tuple count some of which use knowledge derived from the
      ontology in order to obtain results in accordance with the Open World
      Assumption. We use interval semantics to model the uncertainty of the
      distinct count function’s result induced by incomplete knowledge in the
      ontology. We show that interval semantics are particularly useful in ag-
      gregate queries with filtering clause and when we need the retrieval of
      the boundaries of the uncertainty of the distinct count function’s result.
      We study and present relationships among the different interpretations
      and decidability of the semantic tuple count. We also propose a theoret-
      ical approximation of the semantic tuple count. Our results are applica-
      ble to a wide range of description logic formalisms allowing to express
      equality/inequality between individuals, concepts and relations, e.g. Web
      Ontology Language.

      Keywords: semantic counting, count function, OWL, semantic query


1   Introduction
Aggregation functions are an important feature of modern query languages. Ag-
gregation is well understood in query languages of database systems with Unique
Name Assumption (UNA) principle, e.g. SQL for relational databases. Although
this simple well-known interpretation of aggregation can in some cases be used
correctly in queries over semantic knowledge bases, in general this interpreta-
tion will return incorrect results as it will not use knowledge inferred from the
knowledge base with the Open World Assumption (OWA). As opposed to that,
semantic aggregation relies on the knowledge inferred from the knowledge base
in order to provide correct result of aggregation functions.
    Currently the research filed of semantic aggregation is in its infant state.
However the need of semantic aggregation is becoming apparent as more and
more applications of semantic technologies emerge. In this paper we propose the
need of different interpretations of the distinct count function in the settings
of OWA and ontology representing incomplete knowledge, i.e. basic count (BC),
semantic tuple count (ST C), epistemic count (EC) and semantic count (SC). We
study the relationships among the different interpretations. We also investigate
the decidability of the ST C interpretation.


2     Related Work
In recent years, expressive query languages, like SPARQL-DL [1], OWL-SAIQL
[2], SQWRL [3] for OWL 2 [4] or SeRQL [5], SPARQL [6] for RDF, have been
introduced and implemented in the field of semantic web. There have been deep
studies [7–11] evaluating conjunctive queries in RDF and OWL, but few efforts
have been spent on an algebra, as well as aggregation functions.
     Recently the RDF query language SPARQL [6] has been extended towards
the new SPARQL 1.1 1 [12] including many new constructs, e.g. aggregation
functions or negation as failure. There are already publicly available implemen-
tations, e.g. ARQ [13], KGRAM [14], RDF::Query [15] or Sesame [16]. Inde-
pendently on SPARQL 1.1 authors of [17] discuss the topic of aggregation over
data structured as RDF graphs rather than on the relational data returned by
the query (the result set table). The authors stress the need of different modes
of aggregation. A few years ago the SQWRL query language for OWL [3] was
proposed. All these efforts implement or interpret aggregation using the basic
relational semantics, i.e. the result of aggregation functions is computed over the
results of the non-aggregate variant of the query and neglect the impact of the
interplay between aggregation functions and the inferred domain elements.
     More closely related to our study is the work [18]. It shows that exact se-
mantics, of aggregate queries over a DL-LiteA ontology returns results only if
the Abox is not empty and if axioms in the Tbox resolve as constraints over car-
dinalities of the groups in the aggregate query. The authors propose epistemic
semantics of aggregate queries in the settings of data integration use-cases re-
turning the aggregate function’s results known from (inferred by) the ontology.
The authors propose an evaluation algorithm for a sub set of aggregate queries,
i.e. restricted epistemic aggregate queries, defined using functional dependency
of query variables w.r.t. the Tbox of the queried ontology.
     Although not exactly in the same framework of query answering like in this
paper, the authors in [19] propose an approach using ontology design pattern
for incorporating a quantification over types without actually using explicit nu-
merical information.
     In this paper we examine the distinct count aggregation function, its possible
interpretations and their use-cases. We focus our research on the distinct count
aggregation function alone as it is non-trivial but not too overly complicated. We
believe that better understanding of the distinct count function and its possible
interpretations will cover most of the peculiarities of aggregation in the con-
text of expressive knowledge bases assuming the OWA principle and incomplete
knowledge, thus contributing to OWL and RDF.
1
    SPARQL 1.1 version recently became a W3C recommendation.


                                        2
3   Motivation
In this section we present a simple ontology which describes a simple taxonomy
of teachers categories and contains assertions about teachers and courses. The
ontology is presented using the well known description logic syntax, see [20].

Example 1. A simple example ontology O1 about teachers and the courses they
teach.
Tbox
BusyTeacher v Teacher, Professor v Teacher, ∃teaches·Course v Teacher,
BusyTeacher v ≥ 3 teaches, Professor v ≤ 3 teaches
Abox
BusyTeacher(Sara), Professor(Steve), Professor(John), BusyTeacher(John),
Course(math), Course(physics), Course(history), teaches(Dave, math),
teaches(Dave, physics), teaches(Dave, history), teaches(Sara, history),
       .
     6 history
math =

    We will use the following two aggregate queries throughout the paper to
demonstrate the differences in the results of the individual interpretations of the
distinct count function.

Q1 - Find all teachers and the number of distinct courses they teach.
Q2 - Find all teachers that teach more than one distinct courses.

    Next we will discuss the need of different interpretations of the distinct count
function. The most natural interpretation of the distinct count function is the
semantics count interpretation. This interpretation enables users to query for
the distinct count function value or its constraints as entailed by the queried
ontology. For example using this interpretation in query Q1 we obtain that John
teaches exactly three courses or using it in query Q2 we will obtain that Dave,
Sara and John teach more than one course. We call this the semantics count
interpretation and we consider it suitable for knowledge retrieval oriented use-
cases. This interpretation is a natural extension of the certain answer semantics
for the distinct count function and it is monotonic.
    Although that is the most natural and in fact correct extension of the seman-
tics of the distinct count function, there are some use-cases that need different
CWA interpretations. As argued in [18] for ontology based data access (OBDA)
use-cases, the certain answer semantics for aggregate queries is not practical as
they return trivial results, e.g. empty or very restricted result sets. The authors
propose that in this context a practical interpretation will be the one that re-
turns the least known number of courses they teach. This is called the epistemic
count interpretation. As opposed to the semantic count, the result set of query
Q1 with the epistemic count interpretation will contain for example that Dave
and Sara teach respectively two and three courses.
    Note that the epistemic count interpretation may count both named and
unnamed entities. This makes the use of this interpretation inappropriate in
purely data-centric ontology applications. In [21] the authors use OWL to model

                                         3
integrity constraints (IC) and propose IC CWA semantics to enable instance
data validation. We propose an extension to this semantics for the distinct count
function. We call this the semantic tuple count interpretation. As opposed to
the previous interpretations the result of Q2 with the semantic tuple count in-
terpretation will contain only Dave as he is the one for which the data in the
ontology O1 satisfies the condition in the query. The semantic tuple count can
be used to depict IC for n-ary relations.
     The most common interpretation of the distinct count function is the basic
count interpretation. This interpretation is scalable and it is safe to be used in
data oriented use-cases in CWA and UNA systems. The usage of this interpreta-
tion in applications assuming OWA may return incorrect results, however, it can
still be used as an approximation. For example the result of query Q1 contains
the answer Dave teaches three courses, which may or may not be true according
to the ontology.
     We continue with the formal definition of the different interpretations of the
distinct count function and the discussion of the results of queries Q1 and Q2 in
these interpretations.


4      Preliminaries
In this section we will define basic terms and notions used in the rest of the
paper. We will start with the definition of ontology followed by the definition
conjunctive queries and aggregate queries with distinct count function.

4.1     Ontology
Definition 1. An ontology O is a pair hS, Ai, where S is a signature and A
is a set of axioms. The semantics of ontologies use a first order interpretation
I = (∆I , ·I ), where ∆I is an interpretation domain and ·I : S → ∆I is an inter-
pretation function mapping elements from the ontology signature S to elements
from the interpretation domain ∆I . An ontology O is satisfied by an interpreta-
tion I, denoted by I |= O, if all of its axioms are satisfied by the interpretation
I, such interpretation I is called a model of O. We say that a set of axioms A is
entailed by the ontology O, denoted by O |= A, if every model I of the ontology
O, is also a model of A, I |= A.

      Next we define the notion of the monotonic extension O0 of the ontology O.

Definition 2. We say that O0 = hS 0 , A0 i is an extension of O = hS, Ai if A ⊂
A0 and S ⊂ S 0 . O0 is monotonic if the original ontology O is entailed by O0 ,
                           0   0
O0 |= O. A model I 0 = (∆I , ·I ) of the ontology O is an extension of the model
                         0     0       0
I = (∆I , ·I ) |= O if ∆I ⊆ ∆I and ·I ⊆·I .

    For a full description of the syntax and semantics of different description
logic formalisms see [20]. Note that our discussion is also applicable for OWL 2
ontologies [22, 23] since OWL 2 is backed by the SROIQ(D) description logic.


                                        4
    We consider description logics which allow only monotonic extensions. In
section 5.2 where we prove decidability of the semantic tuple count we further
restrict to a subfamily of description logics that enable expressing whether two
entity objects are different, the same or it is not known, which we will refer to
as equality/inequality relation. We require the OWA assumption over the equal-
ity/inequality relation because it provides means to model incomplete knowl-
edge, and it is the fundamental source of uncertainty of the semantic tuple count
function’s value.

4.2     Conjunctive Queries
The discussion in this paper about distinct count function and its proposed
interpretations is set in the context of conjunctive queries.

Definition 3. We will denote conjunctive queries using the following rule like
notation
                             Q(x̄) ← φ(x̄, z̄).                            (1)
The head of the query Q(x̄) denotes the name of the query and the result variables
Rvar (Q) = x̄. The body of the query φ(x̄, z̄) is a comma separated list of query
atoms interpreted as a conjunctive query, as defined in the SPARQL-DL 2 query
language. The query atom list φ may contain non result variables z̄. Note that
the result variables must be distinguished. By Vvar (Q) we denote the list of all
variables in the query. A binding µ : Vvar (Q) → S is a mapping of the variables
of the query to elements in the ontology’s signature and Q|µ is the substitution
of the variables in Q by the binding µ. The binding substitution of a tuple of
variables v̄ = (v1 , . . . , vk ) is denoted by µ(v̄) = (µ(v1 ), . . . , µ(vk )). We call Q a
ground query if there are no variables in the query. A solution to the query Q
w.r.t. the ontology O is a binding µ for which the substitution Q|µ is a ground
query, the body of which is entailed by the ontology (denoted by O |= Q|µ ). The
set of all possible solutions of Q w.r.t. O or the model I of O is denoted by
SatO                                   I
    Q = {µ|O |= Q|µ } or SatQ = {µ|I |= Q|µ } respectively . The result set
of query Q w.r.t. O denoted by QO , is a set of bindings of the result variables
Rvar (Q), formally QO = {ā|ā = µ(Rvar (Q)) ∧ µ ∈ SatO         Q }.


4.3     Aggregate Queries
Here we define the types of aggregate queries along with their simple syntax used
for the purpose of representing aggregate queries in this paper. We also define
some additional terminology and symbols used later in the paper.

Definition 4. We will denote distinct count retrieval and distinct count filtering
queries respectively as follows:

                             Qa (x̄, countSdist (ȳ)) ← φ(x̄, ȳ, z̄)                    (2)
2
    See [1] for list of atoms and their interpretations.


                                                5
                            Qac (x̄) ← (·op ncountSdist (ȳ)), φ(x̄, ȳ, z̄)                          (3)


In the queries of the type Qa , the head Qa (x̄, countSdist (ȳ)) specifies the disjoint
sets of grouping Gvar (Qa ) = x̄ and aggregation Avar (Qa ) = ȳ variables. The
distinct count aggregation function which returns the number of distinct tuples
according to the semantics mode specified by the superscript S w.r.t. the ontology
O is denoted by countSdist . The body of Qa contains a query atom list.
    We also consider distinct count filtering by comparison queries of the form
Qac show in (3). The head of queries of these type contain only the grouping
variables. The body of the query contains cardinality restriction atom3 where
·op ∈ {>, <, ≤, ≥, =, 6=}. By Q∗ we denote a non aggregate variant of Q, obtained
from Q by removing its aggregate function from the head or the comparison
predicate in the body. The result variables of Q∗ are the union of group and
aggregate variables of Q, Rvar (Q∗ ) = Gvar (Q) ∪ Avar (Q).

    Before we define the general semantics of the result of aggregate queries
we clarify and define the auxiliary terms and notations in the following three
definitions.
                        ∞
Definition 5. Let N0,∞    be the extension of the set of natural numbers with
                      ∞
                    0,∞                                        ∞
zero and infinity N     = N ∪ {0, ∞ }. Let L be a subset of N0,∞ and let Int(L)
denote the smallest interval containing L,Int(L) = hinf(L), sup(L)i. We extend
                                                ∞
the intuitive comparison between elements in N0,∞  with comparison between sets
          ∞
        0,∞                       ∞
                                0,∞
L ⊆ N        and elements n ∈ N . L ≤ n (L < n) is be true if and only if
sup(L) ≤ n ∨ sup(L) = n = ∞ (sup(L) < n). L ≥ n (L > n) is true if and only
if inf(L) ≥ n ∨ inf(L) = n = ∞ (inf(L) > n). Note that the symbol ∞ is an
                ∞
element of N0,∞   .

     Next we define the interpretation a tuple and a set of tuples.

Definition 6. Let O = hS, Ai be an ontology, I = (∆I , ·I ) a model of O and
T be a set of tuples composed of elements in S. The interpretation of the of the
tuple t̄ = (t1 , t2 , . . . , tk ), t̄ ∈ T is t̄I = (tI1 , tI2 , . . . , tIk ). The interpretation of T is
T I = {t̄I |∀t̄ ∈ T }.

   Next we define aggregate groups or simply groups as the set of tuples with
common grouping variable binding.

Definition 7. Let Q be an aggregate query of type Qa or Qac , O be an ontology,
I a model of O and let k̄ = µ(Gvar (Q∗ )), where µ is an arbitrary binding. Then
the aggregate group, denoted by Γ (O, Q, k̄), with key k̄ is equal to the set of tuples
{µ(Avar (Q))|µ(Gvar (Q)) = k̄ ∧ µ ∈ SatO      Q∗ }. The aggregate group with key k̄ in a
tuple set T is Γ (T, k̄) = {ā|∀t̄ ∈ T, (k̄, ā) = t̄}.
3
    Note that we reuse the well known notation for property cardinality restrictions, see
    [20].


                                                    6
    Note that the definition of the aggregate group Γ (O, Q, k̄) w.r.t. to the on-
tology O can be used also to obtain an aggregate group w.r.t. to the model I of
O, i.e. Γ (I, Q, k̄).
    Next we define results of aggregate queries in terms of the counting func-
       S
tion f[O,Q] and the set K S (O, Q). Informally the counting function f[O,Q]
                                                                        S
                                                                             is an
uncertainty aware generalization of the distinct count function, it takes as an
input the key of the group to be counted and returns a set of possible values
w.r.t. O and Q. The set K S (O, Q) is the set of keys of the groups to be counted.
The counting function and the key of sets will be defined in each of the concrete
distinct count interpretations. The superscript is used to distinguish among the
different interpretations.
                                                                               ∞
Definition 8. Let D be the set of all tuples of arbitrary length and P(N0,∞      ) be
                      ∞                                        S                  ∞
the power set of N . Let S be an aggregate semantic, f[O,Q] : D → P(N0,∞
                    0,∞
                                                                                    )
                                   S
be the counting function and K (O, Q) ⊂ D be the key set in S. The result
of an aggregate query Q of type (2) w.r.t. the ontology O is QO = {(k̄ ∈
K S (O, Q), inf (L))|L := f[O,Q]
                            S
                                 (k̄) ∧ inf(L) = sup(L)} and the result set of an
aggregate query Q with comparison filtering, i.e. queries of the form Qac , is
QO = {k̄ ∈ K S (O, Q)|L := f[O,Q]
                                S
                                      (k̄) ∧ L ·op n}. The interpretation of the last
condition L ·op n is defined in Definition 5.


5     Distinct count function
In this section we formally define the semantics of each of the various interpreta-
                                 S
tions of the counting function f[O,Q]  and the set of keys K S (O, Q). We present
this definitions in order to be able to show formally the relations between the
interpretations and in the case of SC and ST C to be able evaluate comparison
filtering queries in OWA. We show and discuss the results of the example queries
Q1 and Q2 from section 3 in each of the proposed semantics. We also study the
relationship between individual interpretations of the distinct count function.
We prove decidability of the semantic tuple count ST C interpretation. Finally
we show how to enable distinct count queries in the context of the description
logic SROIQ.
     In (4) we show the queries Q1 and Q2 from example 1 in the notation defined
in Definition 4.

            Q1 (?t, countdist ?c) ← PropertyValue(teaches, ?t, ?c)
                                                                                       (4)
            Q2 (?t) ← (>1countdist (?c)), PropertyValue(teaches, ?t, ?c)
   The tables in this section showing the results of the queries have the following
notation. Only cells highlighted with grey background color 4 are part of the
result set of the query. The other cells have white background. Note that through
out this section O = hS, Ai is an arbitrary ontology with a signature S and an
axiom set A, Q is an arbitrary aggregate query, O1 refers to the ontology in
example 1 and Q1 and Q2 refer to the queries shown in (4).
4
    The use of colors in this paper is intended to be readable in gray scale copies.


                                            7
5.1     Known interpretations of the Distinct Count

In this section we discuss three known interpretations of the distinct count func-
tion in aggregate queries.


Basic Count Interpretation The basic count interpretation (BC) is used
originally in the SQL query language, but it is also used in semantic query
languages, e.g. SPARQL and SQWRL. This interpretation is not adequate for
ontological knowledge as it does not infer the distinct count function’s value
from the ontology. The basic count interpretation can be implemented in n log n
time, where n is the size of the tuple set to be counted.

Definition 9. The set of keys K BC (O, Q) is the set of all syntactically distinct
result bindings of the grouping variables Gvar (Q) of the query Q over the ontology
O, i.e. K BC (O, Q) = {k̄|k̄ = µ(Gvar (Q)) ∧ µ ∈ SatO Q∗ }. For k̄ ∈ K
                                                                       BC
                                                                          (O, Q) the
                                  BC
counting function is defined as f[O,Q] = {|Γ (O, Q, k̄)|}.

      The result of the distinct count query Q1 is shown in table 1(a).


Table 1. The result set of the BC interpretation of the aggregate queries Q1 and Q2 .

                 (a) result of Q1                      (b) result of Q2
                 ?t    countBC
                            dist (?c)             ?t     (> 1countBC
                                                                  dist (?c))
                Dave         3                   Dave true
                Sara         1                   Sara false


    There are two group keys Dave and Sara. The number of courses Dave teaches
is three because there are three courses that Dave teaches asserted in the on-
tology, whose names are syntactically different. For Sara, who teaches only one
course according to the ontology, the result of the count function is one. We can
see that this interpretation ignores the implicit knowledge derived from the fact
that Sara is a BusyTeacher that teaches at least three courses.


Semantic Count Interpretation Informally the semantic count (SC) inter-
pretation counts the number of possible tuples entailed by the queried ontology.
This interpretation is similar to the exact semantics presented in [18]. However
our definition of SC presented here does not return a single value of the distinct
count function but a set of possible values. While this definition has no effect on
the results of queries of type Qa it ensures monotonic results in the queries of
type Qac . The decidability of the SC interpretation is an open problem. Never-
theless we believe that the minimum of the SC interpretation is more likely to
be decidable.


                                         8
Definition 10. The semantic count interpretation is denoted by SC. Let l =
|Gvar (Q)| be the number of grouping variables in Q and S = S l is the set of all
tuples of length l composed of elements from the signature S. The SC counting
                                SC
function is defined as follows f[O,Q] (k̄) = {|T I | |T := Γ (I, Q, k̄), ∀I |= O}. The
SC key set is defined as follows K BC (O, Q) = {k̄ ∈ S| sup(f[O,Q]
                                                                 SC
                                                                       (k̄)) > 0}.

    Next we discuss the results of the example queries Q1 and Q2 , with formal
representation shown in (4), over the ontology O1 from example 1. The last
columns in the tables 2(a) and 2(b) show the boundaries of the intervals found
for each of the group keys from the first column.


  Table 2. The result of the aggregate queries Q1 Q2 in 4 with SC interpretation.

                 (a) result of Q1                            (b) result of Q2
        ?t    countSC             SC
                   dist (?c) Int(f[O1 ,Q1 ] )       ?t   (> 1countSC              SC
                                                                  dist (?c)) Int(f[O1 ,Q2 ] )

       Dave -                    h2, ∞ i        Dave            true             h2, ∞ i
       Sara -                    h3, ∞i         Sara            true             h3, ∞i
       Steve -                    h0, 3i        Steve           false             h0, 3i
       John 3                     h3, 3i        John            true              h3, 3i
       math -                    h0, ∞ i        math            false            h0, ∞ i


    In table 2(a) we show the results of query Q1 . Next we discuss the intuition
of the calculation of the values of the SC counting function for each of groups
in table 2(a). In order to obtain the final result of the query we need to apply
the comparison semantics from definition 8. The last row in both tables show
an unexpected group which the algorithm should process. In fact we have two
more groups that we omitted from the table which are with keys history and
physics. This is an effect of the OWA assumption. This anomaly is caused by
the fact that the ontology O1 does not explicitly state that math, history and
physics are not teachers and that only teachers can teach and thus allowing the
existence of models in which subjects teach something. Moreover the calculated
interval is zero to infinity because there are no axioms which constraint it. The
SC can be used to locate such unwanted behavior. The minimum in the first
row with group key Dave is obtained from the two semantically distinct courses
math and history that Dave teaches. The maximum of the first row is infinity
because the ontology does not constrain the number of courses Dave can teach.
The minimum of the second row with group key Sara is determined from the
constraint that a BusyTeacher teaches at least three courses and the fact that
Sara is a BusyTeacher and who teaches history. In this case there are two
restrictions, ’at least three’ and ’at least one’ which constraint the minimum
of courses Sara teaches. In such cases we should select the most specific one.
Therefore the minimum of the second row’s interval is three since the first re-
striction is more specific. The evaluation of the maximum in the second row’s
interval is analogous to the one in the first row. The third row’s minimum is


                                                9
zero because Steve is not constrained to teach a minimum number of courses
as Dave and Sara were in the first and second rows. The maximum in the third
row is derived from the fact that Steve is a professor and the constraint of the
Professor class which limits its instances to teach at most three courses. Now
we apply interval semantics to the first three rows that we discussed. The three
rows are not included in the result of query Q1 with SC semantics because ac-
cording to the definition of the results of queries of type Qa in definition 8 which
states that set returned by the counting function must be singleton. The answer
contains only the last row because the return value of the counting function is
a singleton containing only the number three. This is true because the John is
both a Professor and a BusyTeacher the constraints of which were already
discussed.
    The results of query Q2 as well as all the relevant group keys of Q2 are
shown in table 2(b). We have discussed the interval boundaries of the intervals
for groups of Q1 and since the Q2 has identical groups, note that K SC (O1 , Q1 ) =
K SC (O1 , Q2 ), we skip this explanation for query Q2 . The result of Q2 contains
all rows except for the third one with group key Steve that does not satisfies the
comparison filter which limits the retrieval of groups with at least two distinct
tuples in all models of the ontology O1 .
    We point out the importance of interval semantics in the settings of incom-
plete knowledge. Note that in query Q2 we obtained results that were not present
in the result of Q1 . This proves that the results of query Q2 obtained by filtering
the results of query Q1 won’t contain the full result entailed by the ontology.

Epistemic Count Interpretation The epistemic count (EC) interpretation is
introduced in the work [18]. Informally this interpretation of the distinct count
function returns a single value which represents the known number of distinct
tuples in an aggregate group. For the number k of known tuples holds that
in any model I of the ontology O there is at least k distinct tuples for the
counted group. Because countECdist interpretation is equivalent to the infimum of
the SC counting function here we define the countEC  dist interpretation in terms of
SC counting function and key set. The decidability of the EC interpretation is
an open problem an it is equivalent to the decidability of the minimum of the
SC interpretation problem.
Definition 11. The EC key set is defined as K EC (O, Q) = K SC (O, Q). The EC
                                 EC                 SC
counting function is defined as f[O,Q] (k̄) = {inf(f[O,Q] (k̄))}.
   In tables 3(a) and 3(a) we can see the results of the aggregate queries Q1 and
Q2 respectively with distinct count function interpreted with the EC semantics.
The results are identical with those of the minimum of the interval in tables 2(a)
and 2(b) and are not discussed further.

5.2   Semantic Tuple Count Interpretation
The semantic tuple count (ST C) interpretation is defined using interval seman-
tics. Informally it counts the same tuples as the BC interpretation but it uses


                                        10
Table 3. The result of the aggregate queries Q1 and Q2 in (4) with the EC interpre-
tation.
                 (a) result of Q1                         (b) result of Q2
                ?t     countEC
                            dist (?c)                ?t     (> 1countEC
                                                                     dist (?c))
                Dave         2                      Dave           true
                Sara         3                      Sara           true
                John         3                      John           true
                math         0                      math           false


knowledge in the ontology to derive the equivalence/inequivalence relation be-
tween the counted tuples which is needed to remove semantically duplicate tuples
and also to deal with uncertainty of the count’s value.

Definition 12. The ST C key set K ST C is defined as the key set of the BC
interpretation, i.e. K ST C (O, Q) = K BC (O, Q). We define the ST C counting
                      ST C
function as follows f[O,Q] (k̄) = {|Γ (O, Q, k̄)I ||∀I |= O}.


Table 4. The result of the aggregate queries Q1 and Q2 in (4) with the ST C interpre-
tation.
             (a) result of query Q1                  (b) result of query Q2
        ?t   countST  C          ST C
                  dist (?c) Int(f[O1 ,Q1 ] )        ?t     (> 1countST  C
                                                                    dist (?(c)))

       Dave -                h2, 3i                 Dave           true
       Sara 1                h1, 1i                 Sara           false


    Next we discuss the results of queries Q1 and Q2 from example 1 with the
ST C interpretation. Here we also show all the group keys that should be consid-
ered during evaluation of the ST C interpretation and since the ST C interpreta-
tion is interval based we also show the evaluated intervals in the last columns of
the tables 4(a) and 4(b). In table 4(a) we have the results of Q1 . Now we discuss
the values of the interval boundaries in rows one and two. The minimum in the
first row is based on the two distinct courses math and history. The maximum
is three because there are three entailed tuples in the group of the first row and
because there is no other axioms that constraints the maximum to be smaller.
The minimum in the second row is one because we have only one tuple for the
Sara group. This is also the only possible value for the maximum of row two
because there are no other entailed tuples in that group. Apparently from the
interval semantics only the second row is returned.
    The results for the query Q2 are shown in table 4(b). We already discussed
the interval boundaries in the discussion of query Q1 . Applying the interval
semantics we filter the second row because Sara teaches only one distinct course
and therefore it is not contained in the result. The first is contained in the result


                                               11
because the minimum boundary, which is 2, is bigger than the less than operand
in the filtering atom, which is one.

Decidability of the ST C Interpretation In this section we prove decidability
of the ST C interpretation. We present only the statements of the most relevant
lemmas and the proof of the theorem at the end of the section. The omitted
proofs and lemmas can be found in the technical report [24]. First we define the
family of formalisms for which we prove that the ST C interpretation is decidable.

Definition 13. We assume that the supported formalisms F (i) are capable
of expressing the equality/inequality relation among elements of the ontology
(ii) consistency check of ontologies and query answering is decidable in F and
(iii) that F is monotonic.

    Implementing the evaluation of the ST C counting function based on defini-
tion 5.2 is not feasible because the ontology O might have an infinite number
of models. We show that there is a finite number of models sufficient for the
calculation of the boundaries of the value of the ST C counting function.
    The next proposition states that the interval of the distinct count function
with SC and ST C interpretation w.r.t. the ontology O will be more specific if we
extend the queried ontology. The interval calculated w.r.t. the original ontology
will include the one calculated from the extended ontology.
    Note that in this section we will use identical equality and inequality axioms,
denoted by =a and 6=a respectively, for all type of entities, i.e. individuals,classes
and properties. In the following section we show an equivalent representation of
this axioms in the concrete logic SROIQ.

Definition 14. Let O = hS, Ai be an ontology, T be a tuple set composed of
elements in the set D ⊆ S. Let I = (∆I , ·I ) be an interpretation such that ·I is
defined on D then the complete set of equality/inequality axioms satisfied by I
is denoted as A# (I, D) = {a =a b|∀a, b ∈ D, aI = bI } ∪ {a 6=a b|∀a, b ∈ D, aI 6=
bI }. We denote the set all complete sets of equality/inequality axioms between
elements in D w.r.t. the ontology O as A# (O, D) = {A# (I, D)|∀I |= O}. The
cannonic model of A# = A# (I, D), denoted by IA# = (∆IA# , ·IA# ) is a model
with an interpretation function ·IA# the domain of which is D.

    Note that adding equality/inequality axioms to the complete set A# (I, D)
wont change the original set or if it does the resulting set is unsatisfiable. Note
also that the cannonic interpretation function ·IA# can be constructed in poly-
nomial time.

Lemma 1. Let O = hS, Ai be an ontology, D ⊆ S, then A# (O, D) is finite. 

Lemma 2. Let T be a tuple set composed of elements from set D and two inter-
pretations ·I1 and ·I2 which agree on the equivalence and inequivalence between
elements in D. Then the number of elements in the sets T I1 and T I2 is the
same, |T I1 | = |T I2 |. 


                                         12
    The next lemma states that for extensions of the ontology O with an axiom
set from A# (O, D) the ST C counting function returns a singleton set and that
the only value in the set can be calculated in polynomial time.
Lemma 3. Let O = hS, Ai be an ontology, Q be an aggregate query, k̄ ∈ K ST C ,
T = Γ (Q∗O , k̄) and D be the set of elements composing tuples in T , I = (∆I , ·I )
be a model of O, A# = A# (I, D), O0 = hS, A ∪ A# i and ·IA# be the cannonic
                                         ST C                            I       IA#
interpretation function of A# . Then f[O   0 ,Q] (k̄) = {c}, where c = |T | = |T     |.

    Instead of looking for the boundaries of the ST C interval in the set of all
models as the definition 12 suggests, we can search for the boundaries in the
finite set of representations of the equality interpretation. We first describe an
algorithm which terminates in a final number of steps and then we prove its
correctness.
Algorithm 1 : STC - Semantic Tuple Count procedure
PROCEDURE STC
  INPUT : O // the queried ontology,
          T // set of tuples to be counted
  OUTPUT : <a,b> // the calculated interval
  a := inf; b := 0;
  D := {elements used in tuples in T};
  FORALL A# IN A#(O,D) DO
    A’ := union(A, A#);
    IF A’ is consistent
      construct cannonic interpretation I#(A#);
      c := |T interpreted by I#(A#)|;
      IF a > c THEN a := c;
      IF b < c THEN b := c;
    END-IF
  END-FORALL
  RETURN <a,b>;
END
    We will prove decidability by proving correctness and termination of algo-
rithm 1.
Theorem 1. The algorithm 1 terminates and evaluates correctly the interval
     ST C
Int(f[O,Q] (k̄)) for an ontology O and aggregate query Q and group key k̄.
Proof. Algorithm 1 terminates since there is a finite number of extensions to
check. From lemma 3 we have that the ST C counting function is a subset of the
calculated interval. Also from lemma 3 we have that the interval is the smallest
because the boundaries correspond to some model of O.
Corollary 1. Calculation of the ST C interval is decidable in the family of for-
malisms defined in Definition 13. 


                                          13
ST C Interpretation approximation The proposed algorithm 1 is searching
trough all the possible extensions for the set of elements D. As we show in corol-
                         2
lary 1 there are n = 2|D| possible extensions for each of which we need to make
a consistency check. In this section we propose an approximation of the ST C
interpretation which can be used for example as optimization of an evaluation
algorithm. The approach presented here utilizes the knowledge inferred from the
ontology. The prove of the correctness of the approximation is presented in the
technical report [24].

Definition 15. Let G = (V, E) be an undirected graph with nodes V and edges
E. A connected component K in G is a subgraph in G maximal with the property,
for each pair of nodes in K there is a path in K. A (maximal) clique C is a graph
maximal with the property C is a subgraph of G and is complete. The biggest
clique C in G is called maximum.

   In order to define the approximation formally we need first to define the
auxiliary term difference graph, and the notion of maximum clique.

Definition 16. Let O = hS, Ai be an ontology, let T be a set of tuples of length
l and composed of elements from the set D ⊆ S. Let s̄, t̄ ∈ T , equality s̄ =a t̄
and inequality of tuples w.r.t. the ontology O is defined respectively as O |=
{s1 =a t1 , . . . , sl =a tl } and O |= si 6=a ti for some 1 ≤ i ≤ l. Let G= (O, T ) =
(D, E= (O, T )) be the graph with edges E= (O, T ) = {(a, b)|∀(a, b) ∈ T 2 , O |=
a =a b}. The difference graph is denoted by ∆G(O, T ) = (V, E) w.r.t. O and T .
The set nodes is the set of connected components of G= (O, T ), there is an edge
between the nodes u and v in V , {u, v} ∈ E if and only if the ontology entails
6=a axiom between some pair of the set u × v.

Theorem 2. Let O be an ontology, Q an aggregate query, k̄ ∈ K ST C (O, Q) and
let T = Γ (O, Q, k̄) be the group to be counted. Let the ∆G(O, T ) = (V, E) be the
difference graph w.r.t. O and T and Cmax = (VC , EC ) be the maximum clique
                            ST C
graph of ∆G(O, T ). Int(f[O,Q]   (Vvar (k))) ⊆ h|VC |, |V |i. 


ST C interpretation in SROIQ In this section we discuss the countST             dist
                                                                                     C

function in a concrete description logic SROIQ. The ST C interpretation is
decidable in this description logic since SROIQ satisfies all the requirements
shown in Definition 13. We consider any query language which supports con-
junctive queries with mixed Abox, Tbox, Rbox terms, e.g. SPARQL-DLN OT
[25]. In order to apply the ST C interpretation in this scenario we need to show
representations of equality and inequality between elements of the signature of
the ontology which will be used to generate the complete set of axioms in A# .
     5.2 shows the representations for each of the three term types. Individuals
ik , k ∈ 1, 2, 3..., are unique and not contained in the ontology. Note that classes
and properties have two possible ways of representing the inequality relation.
If one of the representations fails we must also test the other when checking
whether two terms of type class or property need to be compared for inequality.

                                         14
            Table 5. Semantics of comparison of different ontology elements

     type         =a     6=a
                    .        .
                            6 i2
     individuals i1 = i2 i1 =
     classes     C1 ≡ C2 {C1 (i1 ), ¬C2 (i1 )} or {C2 (i2 ), ¬C1 (i2 )}
     properties p1 ≡ p2 {p1 (i1 , i2 ), ¬∃p2 .{i2 }(i1 )}) or {p2 (i3 , i4 ), ¬∃p1 .{i4 }(i3 )}


6    Conclusion
We investigated the distinct count function in the context of different semantics,
i.e. BC, SC and EC. We introduced a new interpretation ST C and compared
its behavior with the other interpretations. We found the following relation-
                                      SC
ships between SC and EC, inf(f[O,Q]       ) = EC and also between SC and ST C,
       ST C             SC                   ST C               SC
inf(f[O,Q] (k̄)) ≤ inf(f[O,Q] (k̄)) and sup(f[O,Q] (k̄)) ≤ sup(f[O,Q] (k̄)). We proved
decidability of the ST C interpretation in the context of the selected family of
formalisms, provided an approximation using basic graph problems and showed
how to apply the ST C in the SROIQ description logic.
     In future work we would like to focus on the implementation of the evaluation
of the ST C interpretation into the SPARQL-DLN OT [25] query language. In
order to provide practical implementation research in optimizing the evaluation
                                                           2
is needed. In the worst case algorithm 1 will issue 2n consistency checks where
n is the number of counted tuples. We are also interested in the investigation of
the decidability of SC and EC.

Acknowledgments This work has been supported by the grant by the grant of the
Czech Technical University in Prague No. SGS13/204/OHK3/3T/13 Effective
solving of engineering problems using semantic technologies. Authors also want
to express thanks to the anonymous reviewers for providing useful comments
during manuscript preparation.

References
 1. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL Query for OWL-DL. In: 3rd OWL
    Experiences and Directions Workshop (OWLED-2007). (2007)
 2. Kubias, E., Schenk, S., Staab, S., Pan, J.Z.: OWL SAIQL - an OWL DL Query
    Language for Ontology Extraction. In: In Proc. of OWLED-07. (2007)
 3. O’Connor, M.J., Das, A.K.: SQWRL: A query language for OWL. In: OWLED.
    (2009)
 4. Group, W.O.W.: OWL 2 Web Ontology Language Document Overview. W3C
    Recommendation, W3C (October 2009) http://www.w3.org/TR/2009/REC-owl2-
    overview-20091027, cit. 04.12.2012.
 5. Broekstra, J., Kampman, A.: An rdf query and transformation language. In
    Staab, S., Stuckenschmidt, H., eds.: Semantic Web and Peer-to-Peer. Springer
    Berlin Heidelberg (2006) 23–39
 6. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C
    Recommendation, W3C (January 2008) http://www.w3.org/TR/2008/REC-rdf-
    sparql-query-20080115, cit. 3.2013.


                                                15
 7. Horrocks, I., Kutz, O., Sattler, U.: The Even More Irresistible SROIQ. In Doherty,
    P., Mylopoulos, J., Welty, C.A., eds.: KR, AAAI Press (2006) 57–67
 8. Sirin, E., Parsia, B.: Optimizations for Answering Conjunctive ABox Queries. In:
    Description Logics. Volume 189 of CEUR. (2006)
 9. Křemen, P., Kouba, Z.: Conjunctive Query Optimization in OWL2-DL. In: Pro-
    ceedings of the 22th International Conference on Database and Expert System
    Applications (DEXA 2011). Volume 6861 of LNCS., Springer Verlag (2011)
10. Glimm, B., Horrocks, I., Lutz, C., Sattler, U.: Conjunctive Query Answering in
    the Description Logic SHIQ. In: Proceedings of the 20th International Joint
    Conference on Artificial Intelligence (IJCAI 2007). (2007)
11. Kollia, I., Glimm, B., Horrocks, I.: Query Answering over SROIQ Knowledge Bases
    with SPARQL. In: Proceedings of the 2011 International Workshop on Description
    Logic (DL 2011). (2011)
12. Seaborne, A., Harris, S.: SPARQL 1.1 Query. W3C Working Draft, W3C (October
    2009) http://www.w3.org/TR/2009/WD-sparql11-query-20091022, cit. 3.2013.
13. Apache:      ARQ - A SPARQL Processor for Jena, web site (April 2011)
    http://jena.apache.org/documentation/query/index.html, cit. 3.2013.
14. Corby, O.:         Kgram: a knowledge graph abstract machine, web site
    http://wimmics.inria.fr/corese, cit. 3.2013.
15. Williams, G.T.: RDF Query 2.909 - RDF::Query - A complete SPARQL 1.1 Query
    and Update implementation for use with RDF::Trine, web site (November 2012)
    http://search.cpan.org/dist/RDF-Query/, cit. 3.2013.
16. Arjohn Kampman, Christiaan Fluit, J.B.: Sesame, web site (January 2013)
    http://sourceforge.net/projects/sesame, cit. 3.2013.
17. Seid, D.Y., Mehrotra, S.: Grouping and aggregate queries over semantic web
    databases. In: ICSC. (2007) 775–782
18. Calvanese, D., Kharlamov, E., Nutt, W., Thorne, C.: Aggregate queries over on-
    tologies. In: ONISW. (2008) 97–104
19. Martı́nez, D.C., Janowicz, K., Hitzler, P.: A logical geo-ontology design pattern
    for quantifying over types. In: SIGSPATIAL/GIS. (2012) 239–248
20. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.,
    eds.: The Description Logic Handbook: Theory, Implementation, and Applications.
    In Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.,
    eds.: Description Logic Handbook, Cambridge University Press (2003)
21. Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Integrity constraints in owl. In:
    AAAI. (2010)
22. Motik, B., Parsia, B., Patel-Schneider, P.F.: OWL 2 Web Ontology Language
    Structural Specification and Functional-Style Syntax. W3C recommendation,
    W3C (October 2009) http://www.w3.org/TR/2009/REC-owl2-syntax-20091027,
    cit. 12.12.2012.
23. Patel-Schneider, P.F., Motik, B., Grau, B.C.:              OWL 2 Web Ontol-
    ogy Language Direct Semantics.              W3C Recommendation, W3C (Oc-
    tober 2009) http://www.w3.org/TR/2009/REC-owl2-direct-semantics-20091027,
    cit. 12.12.2012.
24. Kostov, B., Kremen, P.: Count aggregation in semantic queries - technical re-
    port. Technical report, Czech Technical University in Prague, Dept. of Cybernetics
    (2013)
25. Kremen, P., Kostov, B.: Expressive OWL Queries: Design, Evaluation, Visualiza-
    tion. International Journal On Semantic Web and Information Systems (2012) IGI
    Publishing. To appear in 2013.


                                         16

</pre>