Using the TBox to Optimise SPARQL Queries

         Birte Glimm1 , Yevgeny Kazakov1 , Ilianna Kollia2 , and Giorgos Stamou2
                1
                  University of Ulm, Germany, <firstname.surname>@uni-ulm.de
 2
     National Technical University of Athens, Greece, ilianna2@mail.ntua.gr, gstam@cs.ntua.gr


         Abstract. We present an approach for using schema knowledge from the TBox
         to optimise the evaluation of SPARQL queries. The queries are evaluated over
         an OWL ontology using the OWL Direct Semantics entailment regime. For con-
         junctive instance queries, we proceed by transforming the query into an ABox.
         We then show how the TBox and this (small) query ABox can be used to build an
         equivalent query where the additional query atoms can be used for reducing the
         set of possible mappings for query variables. We also consider arbitrary SPARQL
         queries and show how the concept and role hierarchies can be used to prune the
         search space of possible answers based on the polarity of variable occurrences in
         the query.


1      Introduction

In this paper, we consider the SPARQL 1.1 query language [7], which was recently
standardised by the World Wide Web Consortium (W3C). SPARQL 1.1 includes sev-
eral entailment regimes [4] in order to use entailment when evaluating a query. In this
paper, we consider SPARQL queries evaluated using OWL’s Direct Semantics Entail-
ment [14]. In this setting, the WHERE clause or query pattern of a query can be seen
as a set of extended OWL or Description Logic (DL) axioms, which can have variables
in place of concept, role or individual names. Our goal in this paper is to optimise the
evaluation of such query patterns.
    Over the last decade, much effort has been spent on optimising standard reasoning
tasks such as entailment checking, classification, or realisation (i.e., the computation
of instances of all concepts and roles) [3, 18, 22]. The optimisation of query answer-
ing algorithms has, however, mostly been addressed for conjunctive queries in OWL
profiles, most notably the OWL 2 QL profile [2, 13, 16]. An exception to this are the
works on nRQL [6], SPARQL-DL [19], and our previous work [12].3 The query lan-
guage nRQL is supported by Racer Pro [5] and allows for queries that go beyond ABox
queries, e.g., one can retrieve sub- or super-concepts of a given concept. SPARQL-DL is
a fragment of SPARQL implemented in the Pellet reasoner [20]. The kinds of SPARQL
queries that are supported in SPARQL-DL are those that can directly be mapped to
reasoner tasks. Furthermore, KAON2 [9] supports SPARQL queries, but restricted to
ABox queries/conjunctive instance queries. In our previous work [12], we propose al-
gorithms for arbitrary SPARQL queries and optimisation techniques mainly based on
cost-based query planning. The system is implemented as a SPARQL Wrapper that can
 3
     An implementation is available at http://code.google.com/p/owl-bgp/
be used with any reasoner that implements the OWLReasoner interface of the OWL
API [8]. Finally, TrOWL4 supports SPARQL queries based on our SPARQL wrapper,
but the reasoning in TrOWL is approximate, i.e., an OWL DL ontology is rewritten into
an ontology that uses a less expressive language before reasoning is applied [21]. A
promising, but still very preliminary work is also the Semantic Web Entailment Regime
Translation and Inference Architecture (Swertia);5 a generic Semantic Web reasoning
framework that is based on first-order logic (FOL) reasoning.
    The contribution of this paper is two-fold: first, we present an optimisation that is
applicable to conjunctive instance queries. We show that one can compute an equivalent
query q̂ for a given query q by replacing the variables in q with fresh individual names.
We then perform realisation, i.e., we materialise entailed concept and role assertions,
for the queried TBox and (small) query ABox. Replacing the individual names again
with the corresponding variable names then yields q̂. The additional query atoms in q̂
can then be used for reducing the set of possible mappings for query variables. Second,
we propose an optimisation for also reducing the possible mappings for concept and
role variables by exploiting the polarity of variable occurrences in the query and the
(precomputed) concept and role hierarchies. We provide a prototypical implementation
and evaluation of the polarity based optimisation, which shows that it can lead to an
improvement of up to two orders of magnitude in the execution times of some queries.
The polarity based optimisation can be found in more detail in our earlier work [11].


2      Preliminaries

Due to lack of space, we do not introduce the Description Logic SHIQ, which we use
throughout the paper. For further details, we refer interested readers to the DL handbook
[1]. We assume that a knowledge base K is a pair (T , A) with T a TBox that possibly
includes role inclusion axioms and A an ABox.
    In this paper, we consider conjunctive instance queries and complex queries, which
can also contain axioms with variables in place of concept or role names. Such queries
are important in the context of SPARQL since SPARQL’s OWL Direct Semantics en-
tailment regime allows for such queries. Conjunctive instance queries are a subset of all
such SPARQL queries, but they differ from database-style conjunctive queries in that
they do not allow for existentially quantified or non-distinguished variables.

Definition 1 (Conjunctive Instance and Complex Queries). Let S = (NC , NR , NI ) be
a signature, K a knowledge base over S, and V = VC ]VR ]VI a countably infinite set of
variable names (concept variables in VC , role variables in VR , and individual variables
in VI ) disjoint from NC , NR , and NI . Let A ∈ NC , r ∈ NR , and x, y ∈ VI . A concept atom is
an expression A(x) and a role atom is an expression r(x, y). A conjunctive instance query
q is a non-empty set of (concept or role) atoms. The set of SHIQ concept templates
(or concept templates for short) over S and V is built as the set of SHIQ concepts,
where a concept variable can be used in place of a concept name, and a role variable
in place of a role name. A role axiom template has the form r v s where r, s ∈ NR ∪ VR .
 4
     http://trowl.eu
 5
     http://swertia.org
A concept axiom template has the form c v d with c, d concept templates. We again
abbreviate c v d and d v c as c ≡ d. A finite set of role axiom templates, concept axiom
templates, and (concept or role) atoms is called a complex query. We use Var(q) for
the set of variables in q. and |Var(q)| is called the arity of q.
    Let q = {at1 , . . . , atn } be a (conjunctive instance or complex) query. A total function
µ : Var(q) → NC ∪NR ∪NI is a mapping for q over K if µ(v) ∈ NC if v ∈ VC , µ(v) ∈ NR if
v ∈ VR , and µ(v) ∈ NI if v ∈ VI . Let X = {x1 , . . . , xn } ⊆ Var(q) a subset of the variables
of q, and M = {µ1 , . . . , µm } a set of mappings for q. The projection of X over M is the
set M|X = {{x 7→ a} | µ ∈ M, x ∈ X and µ(x) = a}. A mapping µ is a certain answer
for q over K if K |= µ(at) for each at ∈ q, written K |= µ(q), where µ(at) (µ(q)) is the
result of replacing each v ∈ Var(at) (Var (q)) with µ(v). We denote the set of all certain
answers for q over K with ans(K, q).


3   Query Answering via Approximate Instance Retrieval

In this section, we use the DL SHIQ and we present a technique that uses approxi-
mate reasoning algorithms in order to optimise the evaluation of conjunctive instance
queries. Approximate reasoning algorithms can either be sound and incomplete [17],
i.e., they underapproximate the set of certain or known answers or they can be complete
but unsound [15], i.e., they overapproximate the set of certain answers. Typical exam-
ples of such algorithms rewrite a knowledge base into a simpler logic in such a way
that computing the results over the simplified knowledge base yields the desired under-
or overapproximation. Another possibility is to use a pre-model or complete and clash
free tableau generated by a DL reasoner [12]. One can then read-off certain instances of
concepts or roles by analysing which concept and role facts have been added determin-
istically to the pre-model, i.e., one can obtain an underapproximation for concept and
role instances. Similarly, one can analyse the non-deterministically added and absent
concept and role facts to compute an overapproximation. More formally, we define an
approximate instance retrieval algorithm as follows:

Definition 2 (Approximate Instance Retrieval Algorithm). Let K = hT , Ai be a
knowledge base and at a concept or role atom such that the concept or role is from the
signature of K. An approximate instance retrieval algorithm inst(K, at) returns a pair
of sets of (total) functions from Var(at) to individual names from K, hK[at], P[at]i, such
that:

 1. if µ ∈ K[at], then K |= µ(at), and
 2. for each total function µ from Var(at) to individual names from K such that K |=
    µ(at), µ ∈ K[at] ∪ P[at].

Similarly, we define approximate query answering algorithms:

Definition 3 (Approximate Query Answering Algorithm). Let K = hT , Ai be a
knowledge base, and q a conjunctive instance query. An approximate query answer-
ing algorithm apprQA(K, q) returns a pair of sets of (total) functions from Var(q) to
individual names from K hK[q], P[q]i such that:
 1. if µ ∈ K[q], then µ ∈ ans(K, q), and
 2. for each total function µ from Var(q) to individual names from K such that µ ∈
    ans(K, q), µ ∈ K[q] ∪ P[q].
     Without loss of generality, in the rest of the paper we assume that for every approx-
imate instance retrieval or query answering algorithm K[·] ∩ P[·] = ∅ holds. It is not
difficult to see that an obvious approach to develop an approximate query answering
algorithm apprQA is to use an approximate instance retrieval algorithm inst. A naive
method to do this is to execute inst and take K[·] ∪ P[·] for each query atom and then
compute the join of these sets, where we straightforwardly interpret the mappings as
relations. The following example illustrates that some possible answers can be easily
rejected.
     For ease of presentation we represent mappings as tuples in the examples that fol-
low, i.e., abusing notation, K[at] is a set of individuals for concept atoms and a set of
pairs of individuals for role atoms and K[q] is a set of n-tuples, n being the arity of q.

Example 1. Let K = hT , Ai be a knowledge base and q = {C(x), r(x, y), D(y)} a query
with variables hx, yi. Suppose that (possibly as a result of inst(K, C(x)), inst(K, r(x, y))
and inst(K, D(y))) we have K[C(x)] = {a}, P[C(x)] = {b}, K[r(x, y)] = {ha, ci)},
P[r(x, y)] = {hb, di, hb, ei}, K[D(y)] = {c} and P[D(y)] = {d}. Even if we do not know K,
we can conclude that ha, ci is a certain answer to q, since a ∈ K[C(x)], ha, ci ∈ K[r(x, y)]
and c ∈ K[D(y)]. However, only hb, di is a possible answer for q since b ∈ P[C(x)],
hb, di ∈ P[r(x, y)] and d ∈ P[D(y)]; hb, ei cannot be an answer for q since although
hb, ei ∈ P[r(x, y)] and b ∈ P[C(x)], e < K[D(y)] ∪ P[D(y)].

Algorithm intersecQans (see Algorithm 1) formalises this idea, which we also illustrate
in Example 2:
Example 2. If we take q from Example 1 in the order given, we first initialise K[q] to
{a} and P[q] to {b}. During the next iterations, the (preliminary) set K[q] is extended
by performing a natural join with the known answers for the current atom at. The set
P[q] is extended by performing a natural join of P[q] with both K[at] and P[at] and
of K[q] with P[at]. For Example 1, we next process r(x, y) and obtain K[q] = {ha, ci}
and P[q] = {hb, di, hb, ei}. We finally process D(y) and keep K[q] = {ha, ci} and P[q] =
{hb, di}.

Lemma 1. Algorithm intersecQans is an approximate query answering algorithm.

    Concerning the optimisation of algorithm intersecQans, in practice, one would use
well-known ordering strategies from the area of databases for the atoms in q in order to
reduce the number of intermediate results, e.g., one would prefer joins over connected
atoms and join a small relation with a bigger one where possible. The question now is:
given a knowledge base K and a query q, how we can further reduce the cardinality of
P[q] computed by intersecQans with the aid of inst.
Example 3. Suppose that we have K and q = {C(x), r(x, y), D(y)} as in Example 1 and,
in addition, we are given that K |= ∃r.> u C v B and inst(K, B(x)) = h{a}, ∅i. From
Example 2, we have K[q] = {ha, ci} and P[q] = {hb, di}. In this case, hb, di is no longer
Algorithm 1 intersecQans(K, q)
Require: K = hT , Ai: a SHIQ knowledge base
           q: a conjunctive query
Ensure: hK[q], P[q]i: K[q], P[q] sets of known and possible answers for q
 1: for at ∈ q do
 2:    hK[at], P[at]i := inst(K, at)
 3:    if K[q] and P[q] not initialised then
 4:       hK[q], P[q]i := hK[at], P[at]i
 5:    else
 6:       K[q] := K[q] ./ K[at]
 7:       P[q] := (P[q] ./ P[at]) ∪ (K[q] ./ P[at]) ∪ (P[q] ./ K[at])
 8:    end if
 9: end for
10: return hK[q], P[q]i


a possible answer of q. If b would be a possible mapping for x, it would have an r-
successor (since r(x, y) ∈ q) and it would be an instance of C (since C(x) ∈ q) and,
hence, b should be in K[B(x)] ∪ P[B(x)] to satisfy the entailed axiom.
We now define the notion of restricting atoms such as B(x).
Definition 4 (Restricting Atoms). Let K = hT , Ai be a knowledge base, q a conjunc-
tive instance query, at a query atom with Var(at) ⊆ Var(q), and inst an approximate
instance retrieval algorithm. Then we say that at restricts q if

                        P[q]|Var(at) ∩ (K[at] ∪ P[at]) ⊂ P[q]|Var(at)

where hK[q], P[q]i = intersecQans(K, q) and hK[at], P[at]i = inst(K, at).
Example 4. Going back to Example 3, we find that B(x) is indeed a restricting atom
for q according to Definition 4 since we have P[q]|{x} = {b} and P[q]|{x} ∩ (K[B(x)] ∪
P[B(x)]) = ∅, which clearly is a subset of P[q]|{x} .
    If we want to preserve the certain answers of q, we should only use restricting atoms
that do not change the answers of q. Let q and q0 be conjunctive instance queries such
that q0 = q ∪ {at}. If q and q0 are equivalent queries, i.e., q and q0 yield the same
answers over a fixed TBox and any ABox, and at restricts q, then we can safely prune
the set of possible answers for q with the help of at. Obviously, we could also use more
than one restricting atom to even further restrict q. Since such atoms are not given as
input, we address the problem of (efficiently) computing such restricting atoms within
an approximate query answering algorithm after showing that using restricting atoms
for queries indeed preserves the certain answers.
Lemma 2. Let K = hT , Ai be a knowledge base, q and q0 two queries such that q0 =
q ∪ {at}, ans(K, q) = ans(K, q0 ), hK[q], P[q]i = intersecQans(K, q), hK[at], P[at]i =
inst(K, at), and at restricts P[q].
 1. An algorithm that returns hK[q], {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}i given K
    and q as input is an approximate query answering algorithm and
 2. |{µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}| < |P[q]|.
    In order to define an improved approximate query answering algorithm based on
Lemma 2, we need to find a way of computing such restricting atoms. In the following,
we will see how we can use the TBox for this aim. The proposed query extension
technique is analogous to the use of the chase technique in relational database theory to
reason about conjunctive query containment [10].
Definition 5 (Query Extension). Let T be a TBox, q a query, and f a total and bijec-
tive function from Var(q) to a set of individual names from NI . The canonical ABox for
q w.r.t. f is defined as follows:
                  Aqf = {B( f (x)) | B(x) ∈ q} ∪ {r( f (x), f (y)) | r(x, y) ∈ q}

The extended canonical ABox Âqf for q w.r.t. f is defined as:

             Âqf = {B(a) | B ∈ NC , a ∈ Ind(Aqf ) and hT , Aqf i |= B(a)} ∪
                     {r(a, b) | r ∈ NR , a, b ∈ Ind(Aqf ) and hT , Aqf i |= r(a, b)}.

The extended query q̂ w.r.t. f and Âqf is:

              q̂ = {B( f − (a)) | B(a) ∈ Âqf } ∪ {r( f − (a), f − (b)) | r(a, b) ∈ Âqf }.
Example 5. Suppose we have K such that K |= ∃r.>uC v B and q = {C(x), r(x, y), D(y)}
from Example 3. In this case, the canonical ABox of q is Aqf = {C(a x ), r(a x , ay ), D(ay )}
where f maps a variable v ∈ {x, y, z} to the individual name av . The extended canonical
ABox for the query q w.r.t. f is Âqf = {C(a x ), r(a x , ay ), D(ay ), B(a x )} and the extended
query w.r.t. Âqf is q̂ = {C(x), r(x, y), D(y), B(x)}. Please note that q̂ has the same answers
as q (w.r.t. any ABox), because K |= ∃r.> u C v B.
Lemma 3. The extended query q̂ created as in Definition 5 is unique.
Intuitively (see Example 5), the extended query q̂ w.r.t. Âqf adds atoms to q that do not
change the set of answers of q, which we formalise by the following Theorem:
Theorem 1. Let T be a TBox, q a query, and q̂ the extended query as in Definition 5.
Then, T |= q ≡ q̂, i.e., for any ABox A0 , ans(hT , A0 i, q) = ans(hT , A0 i, q̂)
    Given a conjunctive instance query q and a knowledge base K = hT , Ai, we can
now compute the known and possible answers with Algorithm 1. We then compute the
extended canonical ABox Âqf and the extended query q̂ for some suitable bijection f .
Note that we do not have to consider the (often large) ABox A from K for computing q̂.
For each atom at ∈ q̂ \ q, we can then use our approximate instance retrieval algorithm
to check whether at restricts q and reduce the set of possible answers P[q] accordingly.
    Although K[·] and P[·] are often fast to compute (due to the use of simpler approxi-
mate reasoning algorithms or since the sets are simply extracted from a pre-model) and
often cached, it is not very efficient to always retrieve all required such sets from the
reasoner in order to perform the required joins. Hence, in our future work, we plan to
just use the cardinalities of the sets K[·] and P[·]. The idea is to use the cardinalities of
restricting atoms from q̂ to more precisely estimate the cardinalities of atoms in q. This
will allow for a better ordering of the query atoms in cost-based query planning.
4     Beyond Conjunctive Instance Queries
Since conjunctive instance queries are only a subset of the queries that are important
in the context of SPARQL, we present, in this section, an optimisation that aims at
improving the performance for evaluating arbitrary complex queries. In general for a
complex query q = {at1 , . . . , atn }, we partition q into two sets q s and qc such that q s is the
maximal conjunctive instance sub-query of q and qc contains the axiom templates. We
first evaluate q s , e.g., by employing optimisation techniques as described in the previous
section. The intermediate results can then already be used to reduce the possible answers
for qc . In order to optimise the evaluation of qc , we assume that the concepts and roles
are classified (this is often done before a system accepts any queries), which means that
axiom templates of the form x v A with x ∈ VC , A ∈ NC require only cache lookups.
We use the hierarchies to prune the search space of solutions in the evaluation of certain
axiom templates as illustrated with the following example:
Example 6. Let q = {Infection v ∃hasCausalLinkTo.x} with x ∈ VC a concept variable
and K some knowledge base. If, mapping x to the concept name A, does not yield
an entailed axiom and K |= B v A holds, then B is also not a certain answer. Thus,
when searching for certain answers, we choose the next binding to test by traversing the
concept hierarchy top-down. When we find a non-answer A, the subtree rooted in A of
the concept hierarchy can safely be pruned. Queries over ontologies with a large number
of concepts and a deep concept hierarchy can, therefore, gain the maximum advantage
from this optimisation. We employ similar optimisations using the role hierarchy.
    In the example above, we can prune the subconcepts of A because x has positive
polarity in the axiom, i.e., x occurs positively and not under a negation. In case a variable
occurs negatively, e.g., directly or indirectly under a negation or positively on the left-
hand side of an axiom, one can, instead, prune the superclasses. We next specify more
precisely the polarity of a concept variable in a concept axiom template.

Definition 6. Let x ∈ VC be a concept variable and C, C1 , C2 , D concept axiom tem-
plates, r a role, and n ∈ IN0 . We define the polarity of x in D as follows: x occurs
positively in x. Furthermore, x occurs positively (negatively) in
    – ¬C if x occurs negatively (positively) in C,
    – C1 u C2 or C1 t C2 if x occurs positively (negatively) in C1 or C2 ,
    – ∃r.C, ∀r.C, or > n r.C if x occurs positively (negatively) in C,
    – 6 n r.C if x occurs negatively (positively) in C
    – = n r.C if x occurs in C.
We further say that x occurs positively (negatively) in C1 v C2 if x occurs negatively
(positively) in C1 or positively (negatively) in C2 . We further define a partial function
polc that maps a concept variable x and a concept template C (axiom template of the
form C1 v C2 ) to pos if x occurs only positively in C (C1 v C2 ) and to neg if x occurs
only negatively in C (C1 v C2 ).

Note that x can also occur both positively and negatively. Note also that no matter
whether x occurs positively or negatively in a concept C, in any concept of the form
(= n r.C), x occurs positively as well as negatively. This is due to the fact that (= n r.C)
is equivalent to the concept template 6 n r.C u > n r.C in which x occurs positively as
well as negatively in C.
    Since the function polc is not defined for variables that appear both positively and
negatively, the concept hierarchy cannot be exploited in this case. For example, consider
the concept axiom template ¬x t ∃r.x with x ∈ VC (i.e., x v ∃r.x), where x appears
negatively in ¬x and positively in ∃r.x. Now, let δ ∈ ∆I be an arbitrary element from a
model I = (∆I , ·I ) of the queried knowledge base. It is obvious that if δ is an instance
of ¬A t ∃r.A and either A v B or B v A holds, we cannot deduce that δ is an instance
of ¬B t ∃r.B.
    The following theorem holds for every axiom template of the form C1 v C2 . We
use Cµ(x)=A with A ∈ NC to denote the concept obtained by applying the extension of µ
that also maps x to A.
Theorem 2. Let K be a knowledge base, A, B concept names such that K |= A v B,
C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VC a concept variable occurring in C,
and µ a mapping that covers all variables of C apart from x.
 1. If polc (x, C) = pos and K 6|= (C1 v C2 )µ(x)=B , then K 6|= (C1 v C2 )µ(x)=A .
 2. If polc (x, C) = neg and K 6|= (C1 v C2 )µ(x)=A , then K 6|= (C1 v C2 )µ(x)=B .
We now extend this optimisation to the case of role variables and we first define the
polarity of a role variable in a concept axiom template.
Definition 7. Let x ∈ VR be a role variable, C, C1 , C2 , D concept templates, r a role,
and n ∈ IN0 . We define the polarity of x in D as follows: x occurs positively in ∃x.C,
∃x− .C, > n x.C, > n x− .C, = n x.C, and = n x− .C; x occurs negatively in ∀x.C,
∀x− .C, 6 n x.C, 6 n x− .C, = n x.C, and = n x− .C. Furthermore, x occurs positively
(negatively) in
 – ¬C if x occurs negatively (positively) in C,
 – C1 u C2 or C1 t C2 if x occurs positively (negatively) in C1 or C2 ,
 – ∃r.C, ∃x.C, ∃x− .C, > n r.C, > n x.C, > n x− .C, ∀r.C, ∀x.C, or ∀x− .C if x occurs
   positively (negatively) in C,
 – 6 n r.C, 6 n x.C, or 6 n x− .C if x occurs negatively (positively) in C,
 – = n r.C if x occurs in C.
We further say that x occurs positively (negatively) in C1 v C2 if x occurs negatively
(positively) in C1 or positively (negatively) in C2 . We define a partial function polr that
maps a role variable x and a concept template C (axiom template of the form C1 v C2 )
to pos if x occurs only positively in C (C1 v C2 ) and to neg if x occurs only negatively
in C (C1 v C2 ).
We now show, that the hierarchy optimisation is also applicable to role variables, pro-
vided they occur only positively or only negatively.
Theorem 3. Let K be a knowledge base, r, s role names such that K |= r v s, C1 , C2
concept templates, C = ¬C1 t C2 , x ∈ VR a role variable occurring in C, and µ a
mapping that covers all variables of C apart from x.
Algorithm 2 getPossibleConceptVarMappings(x, µ, at, K)
Require: x: a concept variable,                          µ: a partial mapping with x ∈ dom(µ)
            at: an axiom template in which x occurs, K: a SHIQ knowledge base
Ensure: a set of possible answers for x
  if polc (x, at) == pos then
     return {µ0 | µ0 (x) =C, C is direct subconcept of µ(x) in K, µ0 (y) = µ(y) for y ∈ dom(µ)\{x}}
  else
     return {µ0 | µ0 (x) =C, C is direct superconcept of µ(x) in K, µ0 (y) = µ(y) for y ∈ dom(µ)\{x}}
  end if


 1. If polr (x, C) = pos and K 6|= (C1 v C2 )µ(x)=s , then K 6|= (C1 v C2 )µ(x)=r .
 2. If polr (x, C) = neg and K 6|= (C1 v C2 )µ(x)=r , then K 6|= (C1 v C2 )µ(x)=s .
    Algorithm 2 shows how we use the above theorems to create possible concept map-
pings for a concept variable x that appears only positively or only negatively in an axiom
template C1 v C2 . We can define a similar algorithm for role variables. Hence, an al-
gorithm for evaluating complex queries can call these algorithms to prune the search
space once a solution for such a concept or role variable has been found, e.g., by check-
ing entailment for an instantiated axiom template.

4.1   Complex Axiom Template Evaluation
In the absence of suitable standard benchmarks for complex queries or SPARQL queries
over OWL ontologies, we created a custom set of queries as shown in Tables 1 and 2
for the GALEN and the FBbt XP ontology, respectively. Systems that fully support
the SPARQL Direct Semantics entailment regime are still under development, which
makes it hard to compare our results for these kinds of queries with other systems. All
experiments were performed on a Mac OS X Lion machine with a 2.53 GHz Intel Core
i7 processor and Java 1.6 allowing 1GB of Java heap space. We measure the time for
one-off tasks such as classification separately since such tasks are usually performed
before the system accepts queries. The ontologies and all code required to perform
the experiments are available online.6 For the evaluation we have used the HermiT7
hypertableau reasoner.
    GALEN is a biomedical knowledge base. Its expressivity is (Horn-)SHIF and
it consists of 2,748 concepts and 413 abstract roles. FBbt XP is an ontology taken
from the Open Biological Ontologies (OBO) Foundry.8 Its expressivity is SHI and the
knowledge base consists of 7,221 concepts and 21 roles. We only consider the TBox
part of FBbt XP since the ABox is not relevant for showing the effects of the optimi-
sation for concept axiom templates. GALEN took 3.7 s to load and 11.1 s to classify
(concepts and roles), while FBbt XP took 1.5 s to load and 7.4 s to classify.
    For each query, we tested the execution once with (bold results) and once without
the proposed optimisation. The tables also show the number of consistency checks that
were performed for the evaluation of each query.
 6
   http://code.google.com/p/query-ordering/
 7
   http://www.hermit-reasoner.com/
 8
   http://www.obofoundry.org/
Table 1. Queries (with x(i) a concept and y(i) a role variable) and evaluation times in seconds for
the GALEN knowledge base with (bold) and without the hierarchy exploitation

                                                       Consistency Time
                      Query                                               Answers
                                                           Checks in sec.
                                                             2,750  1.68
       Infection v ∃hasCausalLinkTo.x                                          10
                                                                50  0.18
                                                        1,141,250 578.98
       Infection v ∃y.x                                                       214
                                                             1,291 10.00
                                                           19,250 102.37
              x1 v Infection u ∃hasCausalAgent.x2                           2,816
                                                             3,073  2.69
 NAMEDLigament v NAMEDInternalBodyPart u x1                16,135   7.68
                                                                               51
              x1 v ∃hasShapeAnalagousTo.x2 u ∃y.linear         197  1.12
              x1 v NonNormalCondition
              y1 v ModifierAttribute                         1,883  0.67
      Bacterium v ∃y1 .x2
                                                                            4,392
              y2 v StatusAttribute
              x2 v AbstractStatus                            1,883  0.80
              x1 v ∃y2 .Status


     GALEN Queries: As expected, an increase in the number of variables within an
axiom template leads to a significant increase in the query execution time because the
number of mappings that have to be checked grows exponentially in the number of vari-
ables. This can, in particular, be observed from the difference in execution time between
the first two queries, where it is also evident that the use of the hierarchy exploitation
optimisation leads to a decrease in execution time of up to two orders of magnitude. In
the last query, the hierarchy exploitation optimisation does not improve the execution
time since the variables on which the hierarchy optimisation can be applied, are already
bound when it comes to the evaluation of the complex templates. Hence, the running
times with and without the hierarchy exploitation are similar. The number of consis-
tency checks for this query is significantly lower than the number of answers because
the overall results are computed by taking the cartesian products of the results for the
two connected components of the query. Although our optimisations can significantly
improve the query execution time, the required time can still be quite high. In practice,
it is, therefore, advisable to add as many restrictive axiom templates (axiom templates
which require only cache lookups) for query variables as possible. For example, the
addition of y v Shape to the forth query reduces the runtime from 1.12 s to 0.65 s.
     FBbt XP Queries: We observe that in some queries the effect of the hierarchy
exploitation is more profound than in others. More precisely, the smaller the ratio of
the result size to the number of consistency checks without the hierarchy optimisation,
the more pronounced is the effect when enabling this optimisation. In other words,
when more tested mappings are indeed solutions, one can prune fewer parts of the
hierarchy since pruning can only be performed when we find a non-solution. For the
second query, we even observe a slight increase in running time when the hierarchy
optimisation is used. This is because the optimisation can only prune few candidate
mappings, which does not outweigh the overhead caused by maintaining information
Table 2. Queries (with x(i) a concept and y a role variable) and evaluation times in seconds for
the FBbt XP knowledge base with (bold) and without the hierarchy exploitation

                          Query                     Consistency      Time     Answers
                                                        Checks
                      x1 v ∀part of.x2                 151,683       44.13
                                                                                 7,243
                      x1 v FBbt 00005789                11,262        5.64
                       y v part of                      14,446       37.38
                                                                                 7,224
                       x v ∀y.FBbt 00001606             12,637       39.20
                       x v ∃y.FBbt 00025990             72,230      357.59
                                                                                   188
                       y v overlaps                     54,186      252.41
                                                       166,129      486.81
         FBbt 00001606 v ∃y.x                                                       68
                                                          1,335      17.03
         FBbt 00001606 v ∃y.x                           21,669       19.68
                                                                                     0
                     y v develops from                        3       0.01
                    x1 v FBbt 00001884                  43,338      183.66
                     y v part of                        32,490      152.38     43,338
                    x2 v ∃y.x1 u x3


about which hierarchy parts have already been tested. For the last query, the number of
consistency checks with the hierarchy optimisation is less than the result size (32, 490
versus 43, 338) since only the third axiom template of the query requires consistency
checks, whereas bindings for y and x1 can be looked up from the cached concept and
role hierarchy.


5   Conclusion
In the current paper we presented an approach for using the TBox of a knowledge
base to optimise SPARQL queries evaluated using the OWL Direct Semantics entail-
ment regime. For conjunctive instance queries we showed how we can build equivalent
queries with additional atoms which can be exploited to reduce the set of possible map-
pings for query variables. Conditions were defined for this purpose which identify the
cases in which such a reduction is achieved. For queries that go beyond conjunctive
instance queries we provide a highly effective and frequently applicable optimisation
which prunes the number of candidate solutions that have to be checked by exploit-
ing the concept and role hierarchy. One can, usually, assume that these hierarchies are
computed before a system accepts queries. Our empirical evaluation shows that this
optimisation can reduce the query evaluation times up to two orders of magnitude.
Acknowledgements This work was supported by IKY in collaboration with DAAD in
the program IKYDA: Automatic data generation for description logic knowledge bases.
References

 1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The De-
    scription Logic Handbook: Theory, Implementation, and Applications. Cambridge Univer-
    sity Press, second edn. (2007)
 2. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
    and efficient query answering in description logics: The DL-Lite family. Journal of Auto-
    mated Reasoning 39(3), 385–429 (2007)
 3. Glimm, B., Horrocks, I., Motik, B., Shearer, R., Stoilos, G.: A novel approach to ontology
    classification. Journal of Web Semantics: Science, Services and Agents on the World Wide
    Web 14, 84–101 (2012), special Issue on Dealing with the Messiness of the Web of Data
 4. Glimm, B., Ogbuji, C. (eds.): SPARQL 1.1 Entailment Regimes. W3C Recommendation (21
    March 2013), available at http://www.w3.org/TR/sparql11-entailment/
 5. Haarslev, V., Möller, R.: Racer system description. In: Gor, R., Leitsch, A., Nipkow, T. (eds.)
    Proceedings of the 1st International Joint Conference on Automated Reasoning (IJCAR’01).
    LNCS, vol. 2083, pp. 701–705. Springer (2001)
 6. Haarslev, V., Möller, R., Wessel, M.: Querying the semantic web with Racer + nRQL. In:
    Proceedings of the KI-2004 International Workshop on Applications of Description Logics
    (2004)
 7. Harris, S., Seaborne, A. (eds.): SPARQL 1.1 Query Language. W3C Recommendation (21
    March 2013), available at http://www.w3.org/TR/sparql11-query/
 8. Horridge, M., Bechhofer, S.: The OWL API: A Java API for working with OWL 2 ontologies.
    In: Patel-Schneider, P.F., Hoekstra, R. (eds.) Proceedings of the OWLED 2009 Workshop on
    OWL: Experiences and Directions. CEUR Workshop Proceedings, vol. 529. CEUR-WS.org
    (2009)
 9. Hustadt, U., Motik, B., Sattler, U.: Reducing SHIQ− description logic to disjunctive datalog
    programs. In: Proceedings of the 9th International Conference on Principles of Knowledge
    Representation and Reasoning (KR’04). pp. 152–162. AAAI Press (2004)
10. Johnson, D.S., Klug, A.C.: Testing containment of conjunctive queries under functional and
    inclusion dependencies. J. Comput. Syst. Sci. 28(1), 167–189 (1984)
11. Kollia, I., Glimm, B.: Optimizing SPARQL Query Answering over OWL Ontologies. Ac-
    cepted at Journal of Artificial Intelligence Research (2013)
12. Kollia, I., Glimm, B.: Cost based query ordering over OWL ontologies. In: Proceedings of
    the 11th International Semantic Web Conference (ISWC 2012). Lecture Notes in Computer
    Science, vol. 7649, pp. 231–246. Springer-Verlag (2012)
13. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The combined approach
    to query answering in DL-Lite. In: Lin, F., Sattler, U. (eds.) Proceedings of the 12th Inter-
    national Conference on Principles of Knowledge Representation and Reasoning (KR’10).
    AAAI Press (2010)
14. Motik, B., Patel-Schneider, P.F., Cuenca Grau, B. (eds.): OWL 2 Web Ontology
    Language: Direct Semantics. W3C Recommendation (27 October 2009), available at
    http://www.w3.org/TR/owl2-direct-semantics/
15. Pan, J.Z., Thomas, E., Zhao, Y.: Completeness guaranteed approximation for owl dl query
    answering. In: Proceedings of the 22nd International Workshop on Description Logics
    (DL2009) (2009)
16. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under
    description logic constraints. Journal of Applied Logic 8(2), 186–209 (2010)
17. Ren, Y., Pan, J.Z., Zhao, Y.: Soundness preserving approximation for tbox reasoning. In:
    Proceedings of the 25th AAAI Conference (AAAI2010) (2010)
18. Sirin, E., Cuenca Grau, B., Parsia, B.: From wine to water: Optimizing description logic
    reasoning for nominals. In: Doherty, P., Mylopoulos, J., Welty, C.A. (eds.) Proceedings of
    the 10th International Conference on Principles of Knowledge Representation and Reasoning
    (KR’06). pp. 90–99. AAAI Press (2006)
19. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. In: Golbreich, C., Kalyan-
    pur, A., Parsia, B. (eds.) Proceedings of the OWLED 2007 Workshop on OWL: Experiences
    and Directions. CEUR Workshop Proceedings, vol. 258. CEUR-WS.org (2007)
20. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL rea-
    soner. Journal of Web Semantics 5(2), 51–53 (2007)
21. Thomas, E., Pan, J.Z., Ren, Y.: TrOWL: Tractable OWL 2 reasoning infrastructure. In: Pro-
    ceedings of the Extended Semantic Web Conference (ESWC’10) (2010)
22. Tsarkov, D., Horrocks, I., Patel-Schneider, P.F.: Optimizing terminological reasoning for ex-
    pressive description logics. Journal of Automated Reasoning 39(3), 277–316 (2007)
A    Proofs

Lemma (1). Algorithm intersecQans is an approximate query answering algorithm.

Proof (Proof of Lemma 1). We prove the lemma by induction on the length of the
query q given as input to the algorithm intersecQans(K, q). For q = {at} the two condi-
tions of the definition of approximate query answering algorithms are satisfied. Indeed,

 1. if µ ∈ K[q], i.e., µ ∈ K[at] then K |= µ(at) (since inst(K, at) is an approximate
    instance retrieval algorithm), which means that K |= µ(q).
 2. for each total function µ from Var(q) to individual names from K such that µ ∈
    ans(K, q), i.e., µ ∈ ans(K, at), i.e., K |= µ(at), it holds µ ∈ K[at] ∪ P[at] (since
    inst(K, at) is an approximate instance retrieval algorithm), i.e., µ ∈ K[q] ∪ P[q].

Let q0 = q ∪ {at} and let us assume that intersecQans(K, q) is an approximate query
answering algorithm. We prove that inst(K, q0 ) is also an approximate query answering
algorithm by showing that the two conditions of Definition 3 hold for q0 .

 1. If µ ∈ K[q0 ], this means that µ|Var(q) ∈ K[q] and µ|Var(at) ∈ K[at] (it can easily be
    seen from the construction K[q0 ] in intersecQans(K, q0 )). By induction hypothe-
    sis µ|Var(q) ∈ ans(K, q) and, since inst(K, at) is an approximate instance retrieval
    algorithm, K |= µ|Var(at) (at). Since µ = µ|Var(q) ./ µ|Var(at) , it holds µ ∈ ans(K, q0 ).
 2. For each total function µ from Var(q) to individual names from K such that µ ∈
    ans(K, q0 ), it holds µ|Var(q) ∈ ans(K, q) and K |= µ|Var(at) (at). By the induction
    hypothesis, µ|Var(q) ∈ K[q]∪ P[q] and µ|Var(at) ∈ K[at]∪ P[at]. It holds µ = µ|Var(q) ./
    µ|Var(at) . We distinguish between the following cases:
      – if µ|Var(q) ∈ K[q] and µ|Var(at) ∈ K[at], then µ ∈ K[q0 ]
      – if µ|Var(q) ∈ K[q] and µ|Var(at) ∈ P[at], then µ ∈ P[q0 ]
      – if µ|Var(q) ∈ P[q] and µ|Var(at) ∈ K[at], then µ ∈ P[q0 ]
      – if µ|Var(q) ∈ P[q] and µ|Var(at) ∈ P[at], then µ ∈ P[q0 ]
    We see from the above that in any case µ ∈ K[q0 ] ∪ P[q0 ].
                                                                                                t
                                                                                                u

Lemma (2). Let K = hT , Ai be a knowledge base, q and q0 two queries such that q0 =
q ∪ {at}, ans(K, q) = ans(K, q0 ), hK[q], P[q]i = intersecQans(K, q), hK[at], P[at]i =
inst(K, at), and at restricts P[q].

 1. An algorithm that returns hK[q], {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}i given K
    and q as input is an approximate query answering algorithm and
 2. |{µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}| < |P[q]|.

Proof (Proof of Lemma 2). For the first claim, according to Lemma 1, intersecQans(K, q)
is an approximate query answering algorithm. Hence, the first condition on approx-
imate query answering algorithms is satisfied. For the second condition and in con-
trary to what is to be shown, assume there is some µ such that µ ∈ ans(K, q), but
µ < K[q] ∪ {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}, i.e., µ|Var(at) < K[at] ∪ P[at]. By Defi-
nition 4 and since at restricts P[q], we have Var(at) ⊆ Var(q) and all variables from at
are in the domain of µ. Since inst(K, at) is an approximate instance retrieval algorithm,
this means K 6|= µ|Var(at) (at), but this means that µ < ans(K, q0 ), which contradicts
ans(K, q) = ans(K, q0 ).
    The second claim follows since {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])} is a strict
subset of {P[q]} by Definition 4 and since at restricts q.                           t
                                                                                     u

Lemma (3). The extended query q̂ created as in Definition 5 is unique.

Proof (Proof of Lemma 3). It is obvious that the extended query q̂ does not depend on
the function f that is used for its construction since the canonical ABoxes and conse-
quently the extended canonical ABoxes produced w.r.t. different functions f are iso-
morphic to each other, i.e., identical modulo renaming of individuals.              t
                                                                                    u

Theorem (1). Let T be a TBox, q a query, and q̂ the extended query as in Definition 5.
Then, T |= q ≡ q̂, i.e., for any ABox A0 , ans(hT , A0 i, q) = ans(hT , A0 i, q̂).

Proof (Proof of Theorem 1).
     We want to prove that for any ABox A, ans(hT , Ai, q) = ans(hT , Ai, q̂), i.e.,
ans(hT , Ai, q) ⊆ ans(hT , Ai, q̂) and ans(hT , Ai, q̂) ⊆ ans(hT , Ai, q). From Defini-
tion 5 it is easily seen that Var(q) = Var(q̂) and since q has more atoms (is more
restricting) than q, it holds ans(hT , Ai, q̂) ⊆ ans(hT , Ai, q). We will now show that
ans(hT , Ai, q) ⊆ ans(hT , Ai, q̂). Let µ be a mapping such that µ ∈ ans(hT , Ai, q),
i.e., hT , Ai |= µ(q). From Definition 5 it is obvious that for any total function f from
Var(q) to a set of individual names from NI , hT , Aqf i |= Âqf . Hence, hT , Aµq i |= Âµq
or hT , µ(q)i |= µ(q̂) which means hT , A ∪ µ(q)i |= µ(q̂) due to the monotonicity prop-
erty of description logics, which means that hT , Ai |= µ(q̂) (since hT , Ai |= µ(q)), i.e.,
µ ∈ ans(hT , Ai, q̂).
                                                                                          t
                                                                                          u

Theorem (2). Let K be a knowledge base, A, B concept names such that K |= A v B,
C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VC a concept variable occurring in C,
and µ a mapping that covers all variables of C apart from x.

 1. If polc (x, C) = pos and K 6|= (C1 v C2 )µ(x)=B , then K 6|= (C1 v C2 )µ(x)=A .
 2. If polc (x, C) = neg and K 6|= (C1 v C2 )µ(x)=A , then K 6|= (C1 v C2 )µ(x)=B .

Proof (Proof Sketch of Theorem 2). The claim can be shown by induction on the struc-
ture of the concept template C. We only consider the case of polc (x, C) = pos; the case
of polc (x, C) = neg is analogous. It suffices to show (in contrapositive form) for some
model I = (∆I , ·I ) of K and some element δ ∈ ∆I , if δ ∈ (Cµ(x)=A )I , then δ ∈ (Cµ(x)=B )I .
    For the base case, C = x, x occurs positively in C. Now, if δ ∈ (xµ(x)=A )I , then
δ ∈ AI and, hence, δ ∈ BI since K |= A v B by assumption. Hence, δ ∈ (xµ(x)=B )I .
    We show the induction step for C = ∃r.D. The other cases are analogous. We assume
δ ∈ ((∃r.D)µ(x)=A )I , then δ has at least one r-successor, say δ0 , that is an instance of
Dµ(x)=A . Since K |= A v B and by induction hypothesis, δ0 ∈ Dµ(x)=B . Hence, δ ∈
(∃r.(Dµ(x)=B ))I = ((∃r.D)µ(x)=B )I . Note that the case C = (= n r.D) cannot occur since
the polarity of x in C is always positive and negative and polc (x, C) is undefined.        t
                                                                                            u
Theorem (3). Let K be a knowledge base, r, s role names such that K |= r v s, C1 , C2
concept templates, C = ¬C1 t C2 , x ∈ VR a role variable occurring in C, and µ a
mapping that covers all variables of C apart from x.
 1. If polr (x, C) = pos and K 6|= (C1 v C2 )µ(x)=s , then K 6|= (C1 v C2 )µ(x)=r .
 2. If polr (x, C) = neg and K 6|= (C1 v C2 )µ(x)=r , then K 6|= (C1 v C2 )µ(x)=s .

Proof (Proof Sketch of Theorem 3). The claim can be shown by induction on the struc-
ture of the concept template C. We only consider the case of polr (x, C) = pos; the case
of polr (x, C) = neg is analogous. It suffices to show (in contrapositive form) for some
model I = (∆I , ·I ) of K and some element δ ∈ ∆I , if δ ∈ (Cµ(x)=r )I , then δ ∈ (Cµ(x)=s )I .
    We have several base cases here, but only present the case for concept templates of
the form C = ∃x.D with x not occurring in D. The cases for C = ∀x.D, C = > n x.D,
C = 6 n x.D with x not occurring in D are similar. Assume, δ ∈ ((∃x.D)µ(x)=r )I , that is,
δ ∈ (∃r.µ(D))I . Then there is some δ0 ∈ ∆I such that hδ, δ0 i ∈ rI and δ0 ∈ µ(D)I . Since
K |= r v s, we also have hδ, δ0 i ∈ sI and, therefore, δ ∈ (∃s.µ(D))I = ((∃x.D)µ(x)=s )I .
    For the induction step, let C = ∃x.D with x occurring in D, we also have polr (x, D) =
pos. Now, if δ ∈ ((∃x.D)µ(x)=r )I , then δ has at least one r-successor which is an instance
of Dµ(x)=r . Since K |= r v s and by induction hypothesis, δ has at least one s-successor
that is an instance of Dµ(x)=s . Hence, δ ∈ ((∃x.D)µ(x)=s )I . The case of C = ∃p.D with p
a role and x occurring in D is analogous and so are the further induction steps.            t
                                                                                            u