Using the TBox to Optimise SPARQL Queries Birte Glimm1 , Yevgeny Kazakov1 , Ilianna Kollia2 , and Giorgos Stamou2 1 University of Ulm, Germany, @uni-ulm.de 2 National Technical University of Athens, Greece, ilianna2@mail.ntua.gr, gstam@cs.ntua.gr Abstract. We present an approach for using schema knowledge from the TBox to optimise the evaluation of SPARQL queries. The queries are evaluated over an OWL ontology using the OWL Direct Semantics entailment regime. For con- junctive instance queries, we proceed by transforming the query into an ABox. We then show how the TBox and this (small) query ABox can be used to build an equivalent query where the additional query atoms can be used for reducing the set of possible mappings for query variables. We also consider arbitrary SPARQL queries and show how the concept and role hierarchies can be used to prune the search space of possible answers based on the polarity of variable occurrences in the query. 1 Introduction In this paper, we consider the SPARQL 1.1 query language [7], which was recently standardised by the World Wide Web Consortium (W3C). SPARQL 1.1 includes sev- eral entailment regimes [4] in order to use entailment when evaluating a query. In this paper, we consider SPARQL queries evaluated using OWL’s Direct Semantics Entail- ment [14]. In this setting, the WHERE clause or query pattern of a query can be seen as a set of extended OWL or Description Logic (DL) axioms, which can have variables in place of concept, role or individual names. Our goal in this paper is to optimise the evaluation of such query patterns. Over the last decade, much effort has been spent on optimising standard reasoning tasks such as entailment checking, classification, or realisation (i.e., the computation of instances of all concepts and roles) [3, 18, 22]. The optimisation of query answer- ing algorithms has, however, mostly been addressed for conjunctive queries in OWL profiles, most notably the OWL 2 QL profile [2, 13, 16]. An exception to this are the works on nRQL [6], SPARQL-DL [19], and our previous work [12].3 The query lan- guage nRQL is supported by Racer Pro [5] and allows for queries that go beyond ABox queries, e.g., one can retrieve sub- or super-concepts of a given concept. SPARQL-DL is a fragment of SPARQL implemented in the Pellet reasoner [20]. The kinds of SPARQL queries that are supported in SPARQL-DL are those that can directly be mapped to reasoner tasks. Furthermore, KAON2 [9] supports SPARQL queries, but restricted to ABox queries/conjunctive instance queries. In our previous work [12], we propose al- gorithms for arbitrary SPARQL queries and optimisation techniques mainly based on cost-based query planning. The system is implemented as a SPARQL Wrapper that can 3 An implementation is available at http://code.google.com/p/owl-bgp/ be used with any reasoner that implements the OWLReasoner interface of the OWL API [8]. Finally, TrOWL4 supports SPARQL queries based on our SPARQL wrapper, but the reasoning in TrOWL is approximate, i.e., an OWL DL ontology is rewritten into an ontology that uses a less expressive language before reasoning is applied [21]. A promising, but still very preliminary work is also the Semantic Web Entailment Regime Translation and Inference Architecture (Swertia);5 a generic Semantic Web reasoning framework that is based on first-order logic (FOL) reasoning. The contribution of this paper is two-fold: first, we present an optimisation that is applicable to conjunctive instance queries. We show that one can compute an equivalent query q̂ for a given query q by replacing the variables in q with fresh individual names. We then perform realisation, i.e., we materialise entailed concept and role assertions, for the queried TBox and (small) query ABox. Replacing the individual names again with the corresponding variable names then yields q̂. The additional query atoms in q̂ can then be used for reducing the set of possible mappings for query variables. Second, we propose an optimisation for also reducing the possible mappings for concept and role variables by exploiting the polarity of variable occurrences in the query and the (precomputed) concept and role hierarchies. We provide a prototypical implementation and evaluation of the polarity based optimisation, which shows that it can lead to an improvement of up to two orders of magnitude in the execution times of some queries. The polarity based optimisation can be found in more detail in our earlier work [11]. 2 Preliminaries Due to lack of space, we do not introduce the Description Logic SHIQ, which we use throughout the paper. For further details, we refer interested readers to the DL handbook [1]. We assume that a knowledge base K is a pair (T , A) with T a TBox that possibly includes role inclusion axioms and A an ABox. In this paper, we consider conjunctive instance queries and complex queries, which can also contain axioms with variables in place of concept or role names. Such queries are important in the context of SPARQL since SPARQL’s OWL Direct Semantics en- tailment regime allows for such queries. Conjunctive instance queries are a subset of all such SPARQL queries, but they differ from database-style conjunctive queries in that they do not allow for existentially quantified or non-distinguished variables. Definition 1 (Conjunctive Instance and Complex Queries). Let S = (NC , NR , NI ) be a signature, K a knowledge base over S, and V = VC ]VR ]VI a countably infinite set of variable names (concept variables in VC , role variables in VR , and individual variables in VI ) disjoint from NC , NR , and NI . Let A ∈ NC , r ∈ NR , and x, y ∈ VI . A concept atom is an expression A(x) and a role atom is an expression r(x, y). A conjunctive instance query q is a non-empty set of (concept or role) atoms. The set of SHIQ concept templates (or concept templates for short) over S and V is built as the set of SHIQ concepts, where a concept variable can be used in place of a concept name, and a role variable in place of a role name. A role axiom template has the form r v s where r, s ∈ NR ∪ VR . 4 http://trowl.eu 5 http://swertia.org A concept axiom template has the form c v d with c, d concept templates. We again abbreviate c v d and d v c as c ≡ d. A finite set of role axiom templates, concept axiom templates, and (concept or role) atoms is called a complex query. We use Var(q) for the set of variables in q. and |Var(q)| is called the arity of q. Let q = {at1 , . . . , atn } be a (conjunctive instance or complex) query. A total function µ : Var(q) → NC ∪NR ∪NI is a mapping for q over K if µ(v) ∈ NC if v ∈ VC , µ(v) ∈ NR if v ∈ VR , and µ(v) ∈ NI if v ∈ VI . Let X = {x1 , . . . , xn } ⊆ Var(q) a subset of the variables of q, and M = {µ1 , . . . , µm } a set of mappings for q. The projection of X over M is the set M|X = {{x 7→ a} | µ ∈ M, x ∈ X and µ(x) = a}. A mapping µ is a certain answer for q over K if K |= µ(at) for each at ∈ q, written K |= µ(q), where µ(at) (µ(q)) is the result of replacing each v ∈ Var(at) (Var (q)) with µ(v). We denote the set of all certain answers for q over K with ans(K, q). 3 Query Answering via Approximate Instance Retrieval In this section, we use the DL SHIQ and we present a technique that uses approxi- mate reasoning algorithms in order to optimise the evaluation of conjunctive instance queries. Approximate reasoning algorithms can either be sound and incomplete [17], i.e., they underapproximate the set of certain or known answers or they can be complete but unsound [15], i.e., they overapproximate the set of certain answers. Typical exam- ples of such algorithms rewrite a knowledge base into a simpler logic in such a way that computing the results over the simplified knowledge base yields the desired under- or overapproximation. Another possibility is to use a pre-model or complete and clash free tableau generated by a DL reasoner [12]. One can then read-off certain instances of concepts or roles by analysing which concept and role facts have been added determin- istically to the pre-model, i.e., one can obtain an underapproximation for concept and role instances. Similarly, one can analyse the non-deterministically added and absent concept and role facts to compute an overapproximation. More formally, we define an approximate instance retrieval algorithm as follows: Definition 2 (Approximate Instance Retrieval Algorithm). Let K = hT , Ai be a knowledge base and at a concept or role atom such that the concept or role is from the signature of K. An approximate instance retrieval algorithm inst(K, at) returns a pair of sets of (total) functions from Var(at) to individual names from K, hK[at], P[at]i, such that: 1. if µ ∈ K[at], then K |= µ(at), and 2. for each total function µ from Var(at) to individual names from K such that K |= µ(at), µ ∈ K[at] ∪ P[at]. Similarly, we define approximate query answering algorithms: Definition 3 (Approximate Query Answering Algorithm). Let K = hT , Ai be a knowledge base, and q a conjunctive instance query. An approximate query answer- ing algorithm apprQA(K, q) returns a pair of sets of (total) functions from Var(q) to individual names from K hK[q], P[q]i such that: 1. if µ ∈ K[q], then µ ∈ ans(K, q), and 2. for each total function µ from Var(q) to individual names from K such that µ ∈ ans(K, q), µ ∈ K[q] ∪ P[q]. Without loss of generality, in the rest of the paper we assume that for every approx- imate instance retrieval or query answering algorithm K[·] ∩ P[·] = ∅ holds. It is not difficult to see that an obvious approach to develop an approximate query answering algorithm apprQA is to use an approximate instance retrieval algorithm inst. A naive method to do this is to execute inst and take K[·] ∪ P[·] for each query atom and then compute the join of these sets, where we straightforwardly interpret the mappings as relations. The following example illustrates that some possible answers can be easily rejected. For ease of presentation we represent mappings as tuples in the examples that fol- low, i.e., abusing notation, K[at] is a set of individuals for concept atoms and a set of pairs of individuals for role atoms and K[q] is a set of n-tuples, n being the arity of q. Example 1. Let K = hT , Ai be a knowledge base and q = {C(x), r(x, y), D(y)} a query with variables hx, yi. Suppose that (possibly as a result of inst(K, C(x)), inst(K, r(x, y)) and inst(K, D(y))) we have K[C(x)] = {a}, P[C(x)] = {b}, K[r(x, y)] = {ha, ci)}, P[r(x, y)] = {hb, di, hb, ei}, K[D(y)] = {c} and P[D(y)] = {d}. Even if we do not know K, we can conclude that ha, ci is a certain answer to q, since a ∈ K[C(x)], ha, ci ∈ K[r(x, y)] and c ∈ K[D(y)]. However, only hb, di is a possible answer for q since b ∈ P[C(x)], hb, di ∈ P[r(x, y)] and d ∈ P[D(y)]; hb, ei cannot be an answer for q since although hb, ei ∈ P[r(x, y)] and b ∈ P[C(x)], e < K[D(y)] ∪ P[D(y)]. Algorithm intersecQans (see Algorithm 1) formalises this idea, which we also illustrate in Example 2: Example 2. If we take q from Example 1 in the order given, we first initialise K[q] to {a} and P[q] to {b}. During the next iterations, the (preliminary) set K[q] is extended by performing a natural join with the known answers for the current atom at. The set P[q] is extended by performing a natural join of P[q] with both K[at] and P[at] and of K[q] with P[at]. For Example 1, we next process r(x, y) and obtain K[q] = {ha, ci} and P[q] = {hb, di, hb, ei}. We finally process D(y) and keep K[q] = {ha, ci} and P[q] = {hb, di}. Lemma 1. Algorithm intersecQans is an approximate query answering algorithm. Concerning the optimisation of algorithm intersecQans, in practice, one would use well-known ordering strategies from the area of databases for the atoms in q in order to reduce the number of intermediate results, e.g., one would prefer joins over connected atoms and join a small relation with a bigger one where possible. The question now is: given a knowledge base K and a query q, how we can further reduce the cardinality of P[q] computed by intersecQans with the aid of inst. Example 3. Suppose that we have K and q = {C(x), r(x, y), D(y)} as in Example 1 and, in addition, we are given that K |= ∃r.> u C v B and inst(K, B(x)) = h{a}, ∅i. From Example 2, we have K[q] = {ha, ci} and P[q] = {hb, di}. In this case, hb, di is no longer Algorithm 1 intersecQans(K, q) Require: K = hT , Ai: a SHIQ knowledge base q: a conjunctive query Ensure: hK[q], P[q]i: K[q], P[q] sets of known and possible answers for q 1: for at ∈ q do 2: hK[at], P[at]i := inst(K, at) 3: if K[q] and P[q] not initialised then 4: hK[q], P[q]i := hK[at], P[at]i 5: else 6: K[q] := K[q] ./ K[at] 7: P[q] := (P[q] ./ P[at]) ∪ (K[q] ./ P[at]) ∪ (P[q] ./ K[at]) 8: end if 9: end for 10: return hK[q], P[q]i a possible answer of q. If b would be a possible mapping for x, it would have an r- successor (since r(x, y) ∈ q) and it would be an instance of C (since C(x) ∈ q) and, hence, b should be in K[B(x)] ∪ P[B(x)] to satisfy the entailed axiom. We now define the notion of restricting atoms such as B(x). Definition 4 (Restricting Atoms). Let K = hT , Ai be a knowledge base, q a conjunc- tive instance query, at a query atom with Var(at) ⊆ Var(q), and inst an approximate instance retrieval algorithm. Then we say that at restricts q if P[q]|Var(at) ∩ (K[at] ∪ P[at]) ⊂ P[q]|Var(at) where hK[q], P[q]i = intersecQans(K, q) and hK[at], P[at]i = inst(K, at). Example 4. Going back to Example 3, we find that B(x) is indeed a restricting atom for q according to Definition 4 since we have P[q]|{x} = {b} and P[q]|{x} ∩ (K[B(x)] ∪ P[B(x)]) = ∅, which clearly is a subset of P[q]|{x} . If we want to preserve the certain answers of q, we should only use restricting atoms that do not change the answers of q. Let q and q0 be conjunctive instance queries such that q0 = q ∪ {at}. If q and q0 are equivalent queries, i.e., q and q0 yield the same answers over a fixed TBox and any ABox, and at restricts q, then we can safely prune the set of possible answers for q with the help of at. Obviously, we could also use more than one restricting atom to even further restrict q. Since such atoms are not given as input, we address the problem of (efficiently) computing such restricting atoms within an approximate query answering algorithm after showing that using restricting atoms for queries indeed preserves the certain answers. Lemma 2. Let K = hT , Ai be a knowledge base, q and q0 two queries such that q0 = q ∪ {at}, ans(K, q) = ans(K, q0 ), hK[q], P[q]i = intersecQans(K, q), hK[at], P[at]i = inst(K, at), and at restricts P[q]. 1. An algorithm that returns hK[q], {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}i given K and q as input is an approximate query answering algorithm and 2. |{µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}| < |P[q]|. In order to define an improved approximate query answering algorithm based on Lemma 2, we need to find a way of computing such restricting atoms. In the following, we will see how we can use the TBox for this aim. The proposed query extension technique is analogous to the use of the chase technique in relational database theory to reason about conjunctive query containment [10]. Definition 5 (Query Extension). Let T be a TBox, q a query, and f a total and bijec- tive function from Var(q) to a set of individual names from NI . The canonical ABox for q w.r.t. f is defined as follows: Aqf = {B( f (x)) | B(x) ∈ q} ∪ {r( f (x), f (y)) | r(x, y) ∈ q} The extended canonical ABox Âqf for q w.r.t. f is defined as: Âqf = {B(a) | B ∈ NC , a ∈ Ind(Aqf ) and hT , Aqf i |= B(a)} ∪ {r(a, b) | r ∈ NR , a, b ∈ Ind(Aqf ) and hT , Aqf i |= r(a, b)}. The extended query q̂ w.r.t. f and Âqf is: q̂ = {B( f − (a)) | B(a) ∈ Âqf } ∪ {r( f − (a), f − (b)) | r(a, b) ∈ Âqf }. Example 5. Suppose we have K such that K |= ∃r.>uC v B and q = {C(x), r(x, y), D(y)} from Example 3. In this case, the canonical ABox of q is Aqf = {C(a x ), r(a x , ay ), D(ay )} where f maps a variable v ∈ {x, y, z} to the individual name av . The extended canonical ABox for the query q w.r.t. f is Âqf = {C(a x ), r(a x , ay ), D(ay ), B(a x )} and the extended query w.r.t. Âqf is q̂ = {C(x), r(x, y), D(y), B(x)}. Please note that q̂ has the same answers as q (w.r.t. any ABox), because K |= ∃r.> u C v B. Lemma 3. The extended query q̂ created as in Definition 5 is unique. Intuitively (see Example 5), the extended query q̂ w.r.t. Âqf adds atoms to q that do not change the set of answers of q, which we formalise by the following Theorem: Theorem 1. Let T be a TBox, q a query, and q̂ the extended query as in Definition 5. Then, T |= q ≡ q̂, i.e., for any ABox A0 , ans(hT , A0 i, q) = ans(hT , A0 i, q̂) Given a conjunctive instance query q and a knowledge base K = hT , Ai, we can now compute the known and possible answers with Algorithm 1. We then compute the extended canonical ABox Âqf and the extended query q̂ for some suitable bijection f . Note that we do not have to consider the (often large) ABox A from K for computing q̂. For each atom at ∈ q̂ \ q, we can then use our approximate instance retrieval algorithm to check whether at restricts q and reduce the set of possible answers P[q] accordingly. Although K[·] and P[·] are often fast to compute (due to the use of simpler approxi- mate reasoning algorithms or since the sets are simply extracted from a pre-model) and often cached, it is not very efficient to always retrieve all required such sets from the reasoner in order to perform the required joins. Hence, in our future work, we plan to just use the cardinalities of the sets K[·] and P[·]. The idea is to use the cardinalities of restricting atoms from q̂ to more precisely estimate the cardinalities of atoms in q. This will allow for a better ordering of the query atoms in cost-based query planning. 4 Beyond Conjunctive Instance Queries Since conjunctive instance queries are only a subset of the queries that are important in the context of SPARQL, we present, in this section, an optimisation that aims at improving the performance for evaluating arbitrary complex queries. In general for a complex query q = {at1 , . . . , atn }, we partition q into two sets q s and qc such that q s is the maximal conjunctive instance sub-query of q and qc contains the axiom templates. We first evaluate q s , e.g., by employing optimisation techniques as described in the previous section. The intermediate results can then already be used to reduce the possible answers for qc . In order to optimise the evaluation of qc , we assume that the concepts and roles are classified (this is often done before a system accepts any queries), which means that axiom templates of the form x v A with x ∈ VC , A ∈ NC require only cache lookups. We use the hierarchies to prune the search space of solutions in the evaluation of certain axiom templates as illustrated with the following example: Example 6. Let q = {Infection v ∃hasCausalLinkTo.x} with x ∈ VC a concept variable and K some knowledge base. If, mapping x to the concept name A, does not yield an entailed axiom and K |= B v A holds, then B is also not a certain answer. Thus, when searching for certain answers, we choose the next binding to test by traversing the concept hierarchy top-down. When we find a non-answer A, the subtree rooted in A of the concept hierarchy can safely be pruned. Queries over ontologies with a large number of concepts and a deep concept hierarchy can, therefore, gain the maximum advantage from this optimisation. We employ similar optimisations using the role hierarchy. In the example above, we can prune the subconcepts of A because x has positive polarity in the axiom, i.e., x occurs positively and not under a negation. In case a variable occurs negatively, e.g., directly or indirectly under a negation or positively on the left- hand side of an axiom, one can, instead, prune the superclasses. We next specify more precisely the polarity of a concept variable in a concept axiom template. Definition 6. Let x ∈ VC be a concept variable and C, C1 , C2 , D concept axiom tem- plates, r a role, and n ∈ IN0 . We define the polarity of x in D as follows: x occurs positively in x. Furthermore, x occurs positively (negatively) in – ¬C if x occurs negatively (positively) in C, – C1 u C2 or C1 t C2 if x occurs positively (negatively) in C1 or C2 , – ∃r.C, ∀r.C, or > n r.C if x occurs positively (negatively) in C, – 6 n r.C if x occurs negatively (positively) in C – = n r.C if x occurs in C. We further say that x occurs positively (negatively) in C1 v C2 if x occurs negatively (positively) in C1 or positively (negatively) in C2 . We further define a partial function polc that maps a concept variable x and a concept template C (axiom template of the form C1 v C2 ) to pos if x occurs only positively in C (C1 v C2 ) and to neg if x occurs only negatively in C (C1 v C2 ). Note that x can also occur both positively and negatively. Note also that no matter whether x occurs positively or negatively in a concept C, in any concept of the form (= n r.C), x occurs positively as well as negatively. This is due to the fact that (= n r.C) is equivalent to the concept template 6 n r.C u > n r.C in which x occurs positively as well as negatively in C. Since the function polc is not defined for variables that appear both positively and negatively, the concept hierarchy cannot be exploited in this case. For example, consider the concept axiom template ¬x t ∃r.x with x ∈ VC (i.e., x v ∃r.x), where x appears negatively in ¬x and positively in ∃r.x. Now, let δ ∈ ∆I be an arbitrary element from a model I = (∆I , ·I ) of the queried knowledge base. It is obvious that if δ is an instance of ¬A t ∃r.A and either A v B or B v A holds, we cannot deduce that δ is an instance of ¬B t ∃r.B. The following theorem holds for every axiom template of the form C1 v C2 . We use Cµ(x)=A with A ∈ NC to denote the concept obtained by applying the extension of µ that also maps x to A. Theorem 2. Let K be a knowledge base, A, B concept names such that K |= A v B, C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VC a concept variable occurring in C, and µ a mapping that covers all variables of C apart from x. 1. If polc (x, C) = pos and K 6|= (C1 v C2 )µ(x)=B , then K 6|= (C1 v C2 )µ(x)=A . 2. If polc (x, C) = neg and K 6|= (C1 v C2 )µ(x)=A , then K 6|= (C1 v C2 )µ(x)=B . We now extend this optimisation to the case of role variables and we first define the polarity of a role variable in a concept axiom template. Definition 7. Let x ∈ VR be a role variable, C, C1 , C2 , D concept templates, r a role, and n ∈ IN0 . We define the polarity of x in D as follows: x occurs positively in ∃x.C, ∃x− .C, > n x.C, > n x− .C, = n x.C, and = n x− .C; x occurs negatively in ∀x.C, ∀x− .C, 6 n x.C, 6 n x− .C, = n x.C, and = n x− .C. Furthermore, x occurs positively (negatively) in – ¬C if x occurs negatively (positively) in C, – C1 u C2 or C1 t C2 if x occurs positively (negatively) in C1 or C2 , – ∃r.C, ∃x.C, ∃x− .C, > n r.C, > n x.C, > n x− .C, ∀r.C, ∀x.C, or ∀x− .C if x occurs positively (negatively) in C, – 6 n r.C, 6 n x.C, or 6 n x− .C if x occurs negatively (positively) in C, – = n r.C if x occurs in C. We further say that x occurs positively (negatively) in C1 v C2 if x occurs negatively (positively) in C1 or positively (negatively) in C2 . We define a partial function polr that maps a role variable x and a concept template C (axiom template of the form C1 v C2 ) to pos if x occurs only positively in C (C1 v C2 ) and to neg if x occurs only negatively in C (C1 v C2 ). We now show, that the hierarchy optimisation is also applicable to role variables, pro- vided they occur only positively or only negatively. Theorem 3. Let K be a knowledge base, r, s role names such that K |= r v s, C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VR a role variable occurring in C, and µ a mapping that covers all variables of C apart from x. Algorithm 2 getPossibleConceptVarMappings(x, µ, at, K) Require: x: a concept variable, µ: a partial mapping with x ∈ dom(µ) at: an axiom template in which x occurs, K: a SHIQ knowledge base Ensure: a set of possible answers for x if polc (x, at) == pos then return {µ0 | µ0 (x) =C, C is direct subconcept of µ(x) in K, µ0 (y) = µ(y) for y ∈ dom(µ)\{x}} else return {µ0 | µ0 (x) =C, C is direct superconcept of µ(x) in K, µ0 (y) = µ(y) for y ∈ dom(µ)\{x}} end if 1. If polr (x, C) = pos and K 6|= (C1 v C2 )µ(x)=s , then K 6|= (C1 v C2 )µ(x)=r . 2. If polr (x, C) = neg and K 6|= (C1 v C2 )µ(x)=r , then K 6|= (C1 v C2 )µ(x)=s . Algorithm 2 shows how we use the above theorems to create possible concept map- pings for a concept variable x that appears only positively or only negatively in an axiom template C1 v C2 . We can define a similar algorithm for role variables. Hence, an al- gorithm for evaluating complex queries can call these algorithms to prune the search space once a solution for such a concept or role variable has been found, e.g., by check- ing entailment for an instantiated axiom template. 4.1 Complex Axiom Template Evaluation In the absence of suitable standard benchmarks for complex queries or SPARQL queries over OWL ontologies, we created a custom set of queries as shown in Tables 1 and 2 for the GALEN and the FBbt XP ontology, respectively. Systems that fully support the SPARQL Direct Semantics entailment regime are still under development, which makes it hard to compare our results for these kinds of queries with other systems. All experiments were performed on a Mac OS X Lion machine with a 2.53 GHz Intel Core i7 processor and Java 1.6 allowing 1GB of Java heap space. We measure the time for one-off tasks such as classification separately since such tasks are usually performed before the system accepts queries. The ontologies and all code required to perform the experiments are available online.6 For the evaluation we have used the HermiT7 hypertableau reasoner. GALEN is a biomedical knowledge base. Its expressivity is (Horn-)SHIF and it consists of 2,748 concepts and 413 abstract roles. FBbt XP is an ontology taken from the Open Biological Ontologies (OBO) Foundry.8 Its expressivity is SHI and the knowledge base consists of 7,221 concepts and 21 roles. We only consider the TBox part of FBbt XP since the ABox is not relevant for showing the effects of the optimi- sation for concept axiom templates. GALEN took 3.7 s to load and 11.1 s to classify (concepts and roles), while FBbt XP took 1.5 s to load and 7.4 s to classify. For each query, we tested the execution once with (bold results) and once without the proposed optimisation. The tables also show the number of consistency checks that were performed for the evaluation of each query. 6 http://code.google.com/p/query-ordering/ 7 http://www.hermit-reasoner.com/ 8 http://www.obofoundry.org/ Table 1. Queries (with x(i) a concept and y(i) a role variable) and evaluation times in seconds for the GALEN knowledge base with (bold) and without the hierarchy exploitation Consistency Time Query Answers Checks in sec. 2,750 1.68 Infection v ∃hasCausalLinkTo.x 10 50 0.18 1,141,250 578.98 Infection v ∃y.x 214 1,291 10.00 19,250 102.37 x1 v Infection u ∃hasCausalAgent.x2 2,816 3,073 2.69 NAMEDLigament v NAMEDInternalBodyPart u x1 16,135 7.68 51 x1 v ∃hasShapeAnalagousTo.x2 u ∃y.linear 197 1.12 x1 v NonNormalCondition y1 v ModifierAttribute 1,883 0.67 Bacterium v ∃y1 .x2 4,392 y2 v StatusAttribute x2 v AbstractStatus 1,883 0.80 x1 v ∃y2 .Status GALEN Queries: As expected, an increase in the number of variables within an axiom template leads to a significant increase in the query execution time because the number of mappings that have to be checked grows exponentially in the number of vari- ables. This can, in particular, be observed from the difference in execution time between the first two queries, where it is also evident that the use of the hierarchy exploitation optimisation leads to a decrease in execution time of up to two orders of magnitude. In the last query, the hierarchy exploitation optimisation does not improve the execution time since the variables on which the hierarchy optimisation can be applied, are already bound when it comes to the evaluation of the complex templates. Hence, the running times with and without the hierarchy exploitation are similar. The number of consis- tency checks for this query is significantly lower than the number of answers because the overall results are computed by taking the cartesian products of the results for the two connected components of the query. Although our optimisations can significantly improve the query execution time, the required time can still be quite high. In practice, it is, therefore, advisable to add as many restrictive axiom templates (axiom templates which require only cache lookups) for query variables as possible. For example, the addition of y v Shape to the forth query reduces the runtime from 1.12 s to 0.65 s. FBbt XP Queries: We observe that in some queries the effect of the hierarchy exploitation is more profound than in others. More precisely, the smaller the ratio of the result size to the number of consistency checks without the hierarchy optimisation, the more pronounced is the effect when enabling this optimisation. In other words, when more tested mappings are indeed solutions, one can prune fewer parts of the hierarchy since pruning can only be performed when we find a non-solution. For the second query, we even observe a slight increase in running time when the hierarchy optimisation is used. This is because the optimisation can only prune few candidate mappings, which does not outweigh the overhead caused by maintaining information Table 2. Queries (with x(i) a concept and y a role variable) and evaluation times in seconds for the FBbt XP knowledge base with (bold) and without the hierarchy exploitation Query Consistency Time Answers Checks x1 v ∀part of.x2 151,683 44.13 7,243 x1 v FBbt 00005789 11,262 5.64 y v part of 14,446 37.38 7,224 x v ∀y.FBbt 00001606 12,637 39.20 x v ∃y.FBbt 00025990 72,230 357.59 188 y v overlaps 54,186 252.41 166,129 486.81 FBbt 00001606 v ∃y.x 68 1,335 17.03 FBbt 00001606 v ∃y.x 21,669 19.68 0 y v develops from 3 0.01 x1 v FBbt 00001884 43,338 183.66 y v part of 32,490 152.38 43,338 x2 v ∃y.x1 u x3 about which hierarchy parts have already been tested. For the last query, the number of consistency checks with the hierarchy optimisation is less than the result size (32, 490 versus 43, 338) since only the third axiom template of the query requires consistency checks, whereas bindings for y and x1 can be looked up from the cached concept and role hierarchy. 5 Conclusion In the current paper we presented an approach for using the TBox of a knowledge base to optimise SPARQL queries evaluated using the OWL Direct Semantics entail- ment regime. For conjunctive instance queries we showed how we can build equivalent queries with additional atoms which can be exploited to reduce the set of possible map- pings for query variables. Conditions were defined for this purpose which identify the cases in which such a reduction is achieved. For queries that go beyond conjunctive instance queries we provide a highly effective and frequently applicable optimisation which prunes the number of candidate solutions that have to be checked by exploit- ing the concept and role hierarchy. One can, usually, assume that these hierarchies are computed before a system accepts queries. Our empirical evaluation shows that this optimisation can reduce the query evaluation times up to two orders of magnitude. Acknowledgements This work was supported by IKY in collaboration with DAAD in the program IKYDA: Automatic data generation for description logic knowledge bases. References 1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The De- scription Logic Handbook: Theory, Implementation, and Applications. Cambridge Univer- sity Press, second edn. (2007) 2. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Auto- mated Reasoning 39(3), 385–429 (2007) 3. Glimm, B., Horrocks, I., Motik, B., Shearer, R., Stoilos, G.: A novel approach to ontology classification. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 14, 84–101 (2012), special Issue on Dealing with the Messiness of the Web of Data 4. Glimm, B., Ogbuji, C. (eds.): SPARQL 1.1 Entailment Regimes. W3C Recommendation (21 March 2013), available at http://www.w3.org/TR/sparql11-entailment/ 5. Haarslev, V., Möller, R.: Racer system description. In: Gor, R., Leitsch, A., Nipkow, T. (eds.) Proceedings of the 1st International Joint Conference on Automated Reasoning (IJCAR’01). LNCS, vol. 2083, pp. 701–705. Springer (2001) 6. Haarslev, V., Möller, R., Wessel, M.: Querying the semantic web with Racer + nRQL. In: Proceedings of the KI-2004 International Workshop on Applications of Description Logics (2004) 7. Harris, S., Seaborne, A. (eds.): SPARQL 1.1 Query Language. W3C Recommendation (21 March 2013), available at http://www.w3.org/TR/sparql11-query/ 8. Horridge, M., Bechhofer, S.: The OWL API: A Java API for working with OWL 2 ontologies. In: Patel-Schneider, P.F., Hoekstra, R. (eds.) Proceedings of the OWLED 2009 Workshop on OWL: Experiences and Directions. CEUR Workshop Proceedings, vol. 529. CEUR-WS.org (2009) 9. Hustadt, U., Motik, B., Sattler, U.: Reducing SHIQ− description logic to disjunctive datalog programs. In: Proceedings of the 9th International Conference on Principles of Knowledge Representation and Reasoning (KR’04). pp. 152–162. AAAI Press (2004) 10. Johnson, D.S., Klug, A.C.: Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci. 28(1), 167–189 (1984) 11. Kollia, I., Glimm, B.: Optimizing SPARQL Query Answering over OWL Ontologies. Ac- cepted at Journal of Artificial Intelligence Research (2013) 12. Kollia, I., Glimm, B.: Cost based query ordering over OWL ontologies. In: Proceedings of the 11th International Semantic Web Conference (ISWC 2012). Lecture Notes in Computer Science, vol. 7649, pp. 231–246. Springer-Verlag (2012) 13. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The combined approach to query answering in DL-Lite. In: Lin, F., Sattler, U. (eds.) Proceedings of the 12th Inter- national Conference on Principles of Knowledge Representation and Reasoning (KR’10). AAAI Press (2010) 14. Motik, B., Patel-Schneider, P.F., Cuenca Grau, B. (eds.): OWL 2 Web Ontology Language: Direct Semantics. W3C Recommendation (27 October 2009), available at http://www.w3.org/TR/owl2-direct-semantics/ 15. Pan, J.Z., Thomas, E., Zhao, Y.: Completeness guaranteed approximation for owl dl query answering. In: Proceedings of the 22nd International Workshop on Description Logics (DL2009) (2009) 16. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under description logic constraints. Journal of Applied Logic 8(2), 186–209 (2010) 17. Ren, Y., Pan, J.Z., Zhao, Y.: Soundness preserving approximation for tbox reasoning. In: Proceedings of the 25th AAAI Conference (AAAI2010) (2010) 18. Sirin, E., Cuenca Grau, B., Parsia, B.: From wine to water: Optimizing description logic reasoning for nominals. In: Doherty, P., Mylopoulos, J., Welty, C.A. (eds.) Proceedings of the 10th International Conference on Principles of Knowledge Representation and Reasoning (KR’06). pp. 90–99. AAAI Press (2006) 19. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. In: Golbreich, C., Kalyan- pur, A., Parsia, B. (eds.) Proceedings of the OWLED 2007 Workshop on OWL: Experiences and Directions. CEUR Workshop Proceedings, vol. 258. CEUR-WS.org (2007) 20. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL rea- soner. Journal of Web Semantics 5(2), 51–53 (2007) 21. Thomas, E., Pan, J.Z., Ren, Y.: TrOWL: Tractable OWL 2 reasoning infrastructure. In: Pro- ceedings of the Extended Semantic Web Conference (ESWC’10) (2010) 22. Tsarkov, D., Horrocks, I., Patel-Schneider, P.F.: Optimizing terminological reasoning for ex- pressive description logics. Journal of Automated Reasoning 39(3), 277–316 (2007) A Proofs Lemma (1). Algorithm intersecQans is an approximate query answering algorithm. Proof (Proof of Lemma 1). We prove the lemma by induction on the length of the query q given as input to the algorithm intersecQans(K, q). For q = {at} the two condi- tions of the definition of approximate query answering algorithms are satisfied. Indeed, 1. if µ ∈ K[q], i.e., µ ∈ K[at] then K |= µ(at) (since inst(K, at) is an approximate instance retrieval algorithm), which means that K |= µ(q). 2. for each total function µ from Var(q) to individual names from K such that µ ∈ ans(K, q), i.e., µ ∈ ans(K, at), i.e., K |= µ(at), it holds µ ∈ K[at] ∪ P[at] (since inst(K, at) is an approximate instance retrieval algorithm), i.e., µ ∈ K[q] ∪ P[q]. Let q0 = q ∪ {at} and let us assume that intersecQans(K, q) is an approximate query answering algorithm. We prove that inst(K, q0 ) is also an approximate query answering algorithm by showing that the two conditions of Definition 3 hold for q0 . 1. If µ ∈ K[q0 ], this means that µ|Var(q) ∈ K[q] and µ|Var(at) ∈ K[at] (it can easily be seen from the construction K[q0 ] in intersecQans(K, q0 )). By induction hypothe- sis µ|Var(q) ∈ ans(K, q) and, since inst(K, at) is an approximate instance retrieval algorithm, K |= µ|Var(at) (at). Since µ = µ|Var(q) ./ µ|Var(at) , it holds µ ∈ ans(K, q0 ). 2. For each total function µ from Var(q) to individual names from K such that µ ∈ ans(K, q0 ), it holds µ|Var(q) ∈ ans(K, q) and K |= µ|Var(at) (at). By the induction hypothesis, µ|Var(q) ∈ K[q]∪ P[q] and µ|Var(at) ∈ K[at]∪ P[at]. It holds µ = µ|Var(q) ./ µ|Var(at) . We distinguish between the following cases: – if µ|Var(q) ∈ K[q] and µ|Var(at) ∈ K[at], then µ ∈ K[q0 ] – if µ|Var(q) ∈ K[q] and µ|Var(at) ∈ P[at], then µ ∈ P[q0 ] – if µ|Var(q) ∈ P[q] and µ|Var(at) ∈ K[at], then µ ∈ P[q0 ] – if µ|Var(q) ∈ P[q] and µ|Var(at) ∈ P[at], then µ ∈ P[q0 ] We see from the above that in any case µ ∈ K[q0 ] ∪ P[q0 ]. t u Lemma (2). Let K = hT , Ai be a knowledge base, q and q0 two queries such that q0 = q ∪ {at}, ans(K, q) = ans(K, q0 ), hK[q], P[q]i = intersecQans(K, q), hK[at], P[at]i = inst(K, at), and at restricts P[q]. 1. An algorithm that returns hK[q], {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}i given K and q as input is an approximate query answering algorithm and 2. |{µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}| < |P[q]|. Proof (Proof of Lemma 2). For the first claim, according to Lemma 1, intersecQans(K, q) is an approximate query answering algorithm. Hence, the first condition on approx- imate query answering algorithms is satisfied. For the second condition and in con- trary to what is to be shown, assume there is some µ such that µ ∈ ans(K, q), but µ < K[q] ∪ {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])}, i.e., µ|Var(at) < K[at] ∪ P[at]. By Defi- nition 4 and since at restricts P[q], we have Var(at) ⊆ Var(q) and all variables from at are in the domain of µ. Since inst(K, at) is an approximate instance retrieval algorithm, this means K 6|= µ|Var(at) (at), but this means that µ < ans(K, q0 ), which contradicts ans(K, q) = ans(K, q0 ). The second claim follows since {µ ∈ P[q] | µ|Var(at) ∈ (K[at] ∪ P[at])} is a strict subset of {P[q]} by Definition 4 and since at restricts q. t u Lemma (3). The extended query q̂ created as in Definition 5 is unique. Proof (Proof of Lemma 3). It is obvious that the extended query q̂ does not depend on the function f that is used for its construction since the canonical ABoxes and conse- quently the extended canonical ABoxes produced w.r.t. different functions f are iso- morphic to each other, i.e., identical modulo renaming of individuals. t u Theorem (1). Let T be a TBox, q a query, and q̂ the extended query as in Definition 5. Then, T |= q ≡ q̂, i.e., for any ABox A0 , ans(hT , A0 i, q) = ans(hT , A0 i, q̂). Proof (Proof of Theorem 1). We want to prove that for any ABox A, ans(hT , Ai, q) = ans(hT , Ai, q̂), i.e., ans(hT , Ai, q) ⊆ ans(hT , Ai, q̂) and ans(hT , Ai, q̂) ⊆ ans(hT , Ai, q). From Defini- tion 5 it is easily seen that Var(q) = Var(q̂) and since q has more atoms (is more restricting) than q, it holds ans(hT , Ai, q̂) ⊆ ans(hT , Ai, q). We will now show that ans(hT , Ai, q) ⊆ ans(hT , Ai, q̂). Let µ be a mapping such that µ ∈ ans(hT , Ai, q), i.e., hT , Ai |= µ(q). From Definition 5 it is obvious that for any total function f from Var(q) to a set of individual names from NI , hT , Aqf i |= Âqf . Hence, hT , Aµq i |= µq or hT , µ(q)i |= µ(q̂) which means hT , A ∪ µ(q)i |= µ(q̂) due to the monotonicity prop- erty of description logics, which means that hT , Ai |= µ(q̂) (since hT , Ai |= µ(q)), i.e., µ ∈ ans(hT , Ai, q̂). t u Theorem (2). Let K be a knowledge base, A, B concept names such that K |= A v B, C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VC a concept variable occurring in C, and µ a mapping that covers all variables of C apart from x. 1. If polc (x, C) = pos and K 6|= (C1 v C2 )µ(x)=B , then K 6|= (C1 v C2 )µ(x)=A . 2. If polc (x, C) = neg and K 6|= (C1 v C2 )µ(x)=A , then K 6|= (C1 v C2 )µ(x)=B . Proof (Proof Sketch of Theorem 2). The claim can be shown by induction on the struc- ture of the concept template C. We only consider the case of polc (x, C) = pos; the case of polc (x, C) = neg is analogous. It suffices to show (in contrapositive form) for some model I = (∆I , ·I ) of K and some element δ ∈ ∆I , if δ ∈ (Cµ(x)=A )I , then δ ∈ (Cµ(x)=B )I . For the base case, C = x, x occurs positively in C. Now, if δ ∈ (xµ(x)=A )I , then δ ∈ AI and, hence, δ ∈ BI since K |= A v B by assumption. Hence, δ ∈ (xµ(x)=B )I . We show the induction step for C = ∃r.D. The other cases are analogous. We assume δ ∈ ((∃r.D)µ(x)=A )I , then δ has at least one r-successor, say δ0 , that is an instance of Dµ(x)=A . Since K |= A v B and by induction hypothesis, δ0 ∈ Dµ(x)=B . Hence, δ ∈ (∃r.(Dµ(x)=B ))I = ((∃r.D)µ(x)=B )I . Note that the case C = (= n r.D) cannot occur since the polarity of x in C is always positive and negative and polc (x, C) is undefined. t u Theorem (3). Let K be a knowledge base, r, s role names such that K |= r v s, C1 , C2 concept templates, C = ¬C1 t C2 , x ∈ VR a role variable occurring in C, and µ a mapping that covers all variables of C apart from x. 1. If polr (x, C) = pos and K 6|= (C1 v C2 )µ(x)=s , then K 6|= (C1 v C2 )µ(x)=r . 2. If polr (x, C) = neg and K 6|= (C1 v C2 )µ(x)=r , then K 6|= (C1 v C2 )µ(x)=s . Proof (Proof Sketch of Theorem 3). The claim can be shown by induction on the struc- ture of the concept template C. We only consider the case of polr (x, C) = pos; the case of polr (x, C) = neg is analogous. It suffices to show (in contrapositive form) for some model I = (∆I , ·I ) of K and some element δ ∈ ∆I , if δ ∈ (Cµ(x)=r )I , then δ ∈ (Cµ(x)=s )I . We have several base cases here, but only present the case for concept templates of the form C = ∃x.D with x not occurring in D. The cases for C = ∀x.D, C = > n x.D, C = 6 n x.D with x not occurring in D are similar. Assume, δ ∈ ((∃x.D)µ(x)=r )I , that is, δ ∈ (∃r.µ(D))I . Then there is some δ0 ∈ ∆I such that hδ, δ0 i ∈ rI and δ0 ∈ µ(D)I . Since K |= r v s, we also have hδ, δ0 i ∈ sI and, therefore, δ ∈ (∃s.µ(D))I = ((∃x.D)µ(x)=s )I . For the induction step, let C = ∃x.D with x occurring in D, we also have polr (x, D) = pos. Now, if δ ∈ ((∃x.D)µ(x)=r )I , then δ has at least one r-successor which is an instance of Dµ(x)=r . Since K |= r v s and by induction hypothesis, δ has at least one s-successor that is an instance of Dµ(x)=s . Hence, δ ∈ ((∃x.D)µ(x)=s )I . The case of C = ∃p.D with p a role and x occurring in D is analogous and so are the further induction steps. t u