Introduction

Cost Based Query Ordering over OWL Ontologies

Ilianna Kollia

ilianna.kolliag@uni-ulm.de 0 1

Birte Glimm

1 0 National Technical University of Athens , Greece 1 Ulm University , Germany

Introduction

Query answering—the computation of answers to users’ queries w.r.t. ontologies and data—is an important task provided by many description logic (DL) reasoners. Although much e ort has been spent on optimizing the ‘reasoning’ part of query answering, i.e., the extraction of the individuals that satisfy a concept or role atom, less attention has been given to optimizing the actual query answering part when ontologies in expressive languages are used.

In the context of databases or triple stores, cost-based ordering techniques for finding an optimal or near optimal join ordering have been widely applied [9, 10]. These techniques involve the maintenance of a set of statistics about relations and indexes, e.g., number of pages in a relation, number of pages in an index, number of distinct values in a column, together with formulas for the estimation of the selectivity of predicates and the estimation of the CPU and I/O costs of query execution that depends amongst others, on the number of pages that have to be read from or written to secondary memory. The formulas for the estimation of selectivities of predicates (result output size of query atoms) estimate the data distributions using histograms, parametric or sampling methods or combinations of them.

In the context of ontologies, the formulas should be extended to take into account another important cost component, i.e., the cost of executing specific reasoner tasks such as entailment checks or instance retrievals. It is much more di cult to estimate this cost precisely before query evaluation as this cost varies and takes values from a wide range. For example, SROIQ has a worst case complexity of 2-NExpTime [4] and typical implementations are not worst case optimal. The hypertableau satisfiability checking algorithm for SROIQ that we use in this paper has a worst-case complexity of 3-NExpTime in the size of the ontology [7, 4].3 Instead, a subset of possible mappings for the variables of a query can be sampled and, based on the statistics extracted from these samples, the most e cient join ordering can be estimated. This, however, requires that the samples are selected according to an ontology based criterion and not randomly. Preliminary e orts for finding good join ordering in knowledge bases have already been made [8].

In this paper we address the issue of query atom ordering, which constitutes an optimization task, for conjunctions of instance queries issued over ontologies with expressivity up to SROIQ. The optimization goal is to find the execution plan (an order for query atoms) which leads to the most e cient execution of the query. This involves 3 The 2-NExpTime result for SHOIQ+ increases to 3-NExpTime when adding role chains [4] the minimization of the number of needed reasoning tasks and the size of intermediate results. The execution plan which satisfies the above property is determined by means of a cost function that assigns costs to query atoms within an execution plan. This cost function is based on heuristics and summaries for statistics about the data which are extracted from a DL reasoner model. We explore static and dynamic algorithms that greedily explore the execution plan search space to determine an optimal or near optimal execution plan. Static ordering refers to the finding of a join order before query evaluation starts whereas dynamic ordering determines the ordering of query atoms during query evaluation taking advantage of already computed query atom results. 2

Preliminaries

In this section we briefly present an overview of the model building tableau and hypertableau calculi and give an introduction to conjunctive instance queries.

It is known that checking whether an individual s0 (pair of individuals hs0; s1i) is an instance (are instances) of a concept C (role R) w.r.t. an ontology O is equivalent to checking whether O [ f:C(c0)g (O [ f(8R::fs1g)(s0)g) is inconsistent w.r.t. O. In order to perform this action, most DL reasoners use a model construction calculus such as tableau or hypertableau. In the remainder, we focus on the hypertableau calculus [7], but a tableau calculus could equally be used and we state how our results can be transferred to tableau calculi. The hypertableau calculus starts from an initial set of assertions and by applying derivation rules it tries to construct (an abstraction of) a model of O. Derivation rules usually add new concept and role assertion axioms, they may introduce new individuals, they can be nondeterministic, leading to the need to choose between several alternative assertions to add or they can lead to a clash when a contradiction is detected. To show that an ontology O is (in)consistent, the hypertableau calculus constructs a derivation, i.e., a sequence of sets of assertions A0; : : : ; An, such that A0 contains all assertions in O, Ai+1 is the result of applying a derivation rule to Ai and An is the final set of assertions where no more rules are applicable. If a derivation exists such that An does not contain a clash, then O is consistent and An is called a pre-model of O. Otherwise O is inconsisent. Each assertion in a set of assertions Ai is derived either deterministically or nondeterministically. An assertion is derived deterministically if it is derived by the application of a deterministic derivation rule from assertions that were all derived deterministically. Any other derived assertion is derived nondeterministically. It is easy to know whether an assertion was derived deterministically or not because of the dependency directed backtracking that most (hyper)tableau reasoners employ. In the pre-model, each individual s0 is assigned a label L(s0) representing the concepts it is (non)deterministically an instance of and each pair of individuals hs0; s1i is assigned a label L(hs0; s1i) representing the roles through which individual s0 is (non)deterministically related to individual s1.

Definition 1. Let S = (NC; NR; NI ) be a signature of an ontology O, NV a countably infinite set of variables disjoint from NC; NR and NI and S O = (CO; RO; IO) the restriction of S to terms that occur in O. A term t is an element from NV [ NI . Let C 2 CO be a concept, r 2 RO a role, and t; t0 2 IO [ NV terms. An atom is an expression C(t) or r(t; t0) and we refer to these types of atoms as concept and role atoms, respectively. A query q is a non-empty set of atoms. We use Vars(q) to denote the set of variables occurring in q, Inds(q) to denote the set of individual names occurring in q, and Terms(q) = Vars(q) [ Inds(q) for the set of terms in q. We use jqj to denote the number of atoms in q.

Let q = fat1; : : : ; atng be a query. A mapping for q over O is a total function : Terms(q) ! IO such that (a) = a for each a 2 Inds(q). The set q of all possible mappings for q is defined as q := f j is a mapping for qg. A solution mapping for q over O is a mapping such that O j= C( (t)) for each concept atom C(t) 2 q and O j= r( (t); (t0)) for each role atom r(t; t0) 2 q.

According to the above definition, we deal with conjunctive queries with only distinguished variables. Without loss of generality, we assume that queries are connected. In case they are not, the connected components of a query can be evaluated independently and the results of these evaluations can be combined in the end [1]. 3

Extracting Individual Information from Reasoner Models

The first step in the ordering of query atoms is the extraction of statistics, exploiting information generated by reasoners.

As has been mentioned in Section 2, if an ontology is consistent and a pre-model is constructed, individuals are assingned labels in this pre-model. These labels can provide us with information about the concepts the individuals belong to or the roles in which they participate. We exploit this information similarly as was suggested for determining known (non-)subsumers for classes during classification [2]. In the hypertableau calculus, the following two properties hold for each ontology O and each constructed pre-model An for O: (P1) for each atomic concept C (role R), each individual s0 (each pair of individuals hs1; s2i) in An, if C 2 LAn (s0) (R 2 LAn (hs1; s2i)) and the assertion C(s0) (R(s1 s2)) was derived deterministically, then it holds O j= C(s0) (O j= R(s1; s2)). (P2) for an arbitrary individual s0 in An (an arbitrary pair of individuals hs1; s2i in An) and an arbitrary atomic concept C (simple role R), if C < LAn (s0) (R < LAn (hs1; s2i)), then O 6j= C(s0) (O 6j= R(s1; s2)).

We use these properties to extract information from the pre-model of a satisfiable ontology O as outlined in Algorithm 1. In our implementation we use a more complicated procedure to only store the direct types of each individual. The information we extract involves the maintenance of the sets of known and possible instances for all atomic concepts of O. The known instances of a concept C (K[C]) are the individuals that can be safely considered instances of the concept according to the pre-model, i.e., the individuals of the assertions referring to concept C that have been derived deterministically. The possible instances of a concept C (P[C]) are the individuals of the assertions referring to C that have been derived nondeterministically. These individuals require costly consistency checks in order to decide whether they are real instances of the respective concept.

Algorithm 1 initializeKnownAndPossibleConceptInstances Require: a consistent SROIQ ontology O to be queried Ensure: sets K[C] (P[C]) of known (possible) instances for each concept C of O are computed 1: An := buildModelFor(O) 2: for all ind 2 IO do 3: for all C 2 LAn (ind) do 4: if C was derived deterministically then 5: K[C] := K[C] [ findg 6: else 7: P[C] := P[C] [ findg 8: end if 9: end for 10: end for

The procedure to find the known and possible instances of simple and complex (transitive roles or roles having a transitive subrole) roles or, given an individual, the known and possible role successors or predecessors, can be defined similarly. In the case of complex roles, however, before the buildModelFor procedure is applied, O is expanded with additional axioms that capture the semantics of the transitive relations since (hyper)tableau reasoners typically do not deal with transitivity directly [7]. In particular, for each individual ind and each complex role p, the new concepts Cipnd and Cind are created and the axioms Cind(ind) and Cind v 8p:Cipnd are added to O. Intuitively, the consequent application of the transitivity encoding [7] produces axioms Cipnd(s) that propagate to each individual s that is reachable from ind via a p-chain. The known and possible p-successors for ind can then be determined from concept assertions Cipnd(s) in the pre-model.

The technique presented in this paper can be used with any (hyper)tableau calculus for which properties (P1) and (P2) hold. All (hyper)tableau calculi used in practice that we are aware of satisfy property (P1). Pre-models produced by tableau algorithms as presented in the literature also satisfy property (P2); however, commonly used optimizations, such as lazy unfolding, can compromise property (P2), which we illustrate with the following example. Let us assume we have an ontology O containing the axioms A v 9R:(C u D), B 9R:C and A(a). It is obvious that in this ontology A is a subconcept of B (hence O j= B(a)) since every individual that is R-related to an individual that is an instance of the intersection of C and D is also R-related to an individual that is an instance of the concept C. However, even though the assertion A(a) occurs in the ABox, the assertion B(a) is not added in the pre-model when we use lazy unfolding. With lazy unfolding, instead of treating B 9R:C as two disjunctions :B t 9R:C and B t 8R:(:C) as is typically done for general concept inclusion axioms (GCIs), B is only lazily unfolded into its definition 9R:C once B occurs in the label of an individual. Thus, although (9R:(C u D))(a) would be derived, this does not lead to the addition of B(a).

Nevertheless, most (if not all) implemented calculi produce pre-models that satisfy at least the following weaker property: (P3) for an arbitrary individual s0 in An and an arbitrary concept C where C is primitive in O,4 if C < LAn (s0), then O 6j= C(s0).

Hence, properties (P2) and (P3) can be used to extract (non-)instance information from pre-models. For tableau calculi that only satisfy (P3), Algorithm 1 can be modified accordingly. In particular, for each non-primitive concept C in O we need to add to P[C] the individuals in O that do not include the concept C in their label.

Even though the proposed technique for determining known and possible instances of concepts and roles can be used in the same way with both tableau and hypertableau reasoners, the e ect that it will have when tableau algorithms are used is less intense. This happens because tableau algorithms often introduce more nondeterminism than hypertableau. In particular, in tableau algorithms a disjunction is added to each individual for each GCI in O and, when optimizations such as lazy unfolding are used, these compromise property (P2) and we have to use the weaker property (P3) or even consider all concepts that do not occur in the label of an individual as possible types, which results in less accurate statistics. 4

Query Answering and Query Atom Ordering

In this section we describe two di erent algorithms (a static and a dynamic one) for ordering the atoms of a query based on some costs and then we deal with the formulation of these costs. We first introduce the abstract graph representation of a query q by means of a labeled graph Gq on which we define the computed statistical costs. Definition 2. A query join graph Gq for a query q is a tuple (V; E; EL), where – V = q is a set of vertices (one for each query atom); – E V V is a set of edges such that hat1; at2i 2 E i Vars(at1) \ Vars(at2) , ; and at1 , at2; – EL is a function that assigns a set of variables to each hat1; at2i 2 E such that EL(at1; at2) = Vars(at1) \ Vars(at2).

In the remainder we use q for a query fat1; : : : ; atng, Gq for the according query join graph and q for the solution mappings of q. Our goal is to find a query execution plan, which determines the evaluation order for atoms in q. Since the number of possible execution plans is of order jqj!, the ordering task quickly becomes impractical. In the following, we focus on greedy algorithms for determining an execution order, which prune the search space considerably. Roughly speaking, we proceed as follows: We define a cost function, which consists of two components: an estimate for the reasoning costs and an estimate for the intermediate result size. Both components are combined to induce an order among query atoms. In this paper, we simply build the sum of the two cost components, but di erent combinations such as a weighted sum of the two values could also be used. For the query plan construction we distinguish static from dynamic planning. For the former, we start constructing the plan by adding a minimal 4 A concept C is considered primitive in O if O is unfoldable and it contains no axiom of the form C E atom according to the order. Variables from this atom are then considered bound, which changes the cost function and might induce a di erent order among the remaining query atoms. Considering the updated order, we again select the minimal query atom that is not yet in the plan and update the costs. This process continues until the plan contains all atoms. Once a complete plan has been determined the atoms are evaluated. The dynamic case di ers in that after selecting an atom for the plan, we immediately determine the solutions for the chosen atom, which are then used to update the cost function. Dynamic ordering is a costly but accurate procedure. Sampling techniques can be used so that not all mappings are used for the update of the cost function but only a subset of them. In Section 5 we show that random sampling is not adequate and a more sophisticated sampling criterion is needed. However, the definition of such criterion is out of the scope of the current paper. We now make the process of query plan construction more precise, but we leave the exact details of defining the cost function and the ordering it induces to later.

Definition 3. A static (dynamic) cost function w.r.t. q is a function s : q 2Vars(q) ! R R (d : q 2 q ! R R). The two costs are combined to yield a static ordering s (a dynamic ordering d), which is a total order over the atoms of q.

An execution plan for q is a duplicate-free sequence of query atoms from q. The initial execution plan is the empty sequence and a complete execution plan is a sequence containing all atoms of q. For Pi = (at1; : : : ; ati) with i < n an execution plan for q with query join graph Gq = (V; E; EL), we define the potential next atoms qi for Pi w.r.t. Gq as qi = q for Pi the initial execution plan and qi = fat j hat0; ati 2 E; at0 2 fat1; : : : ; atig; at 2 q n fat1; : : : ; atigg otherwise. The static (dynamic) ordering induces an execution plan Pi+1 = (at1; : : : ; ati; ati+1) with ati+1 2 qi and ati+1 s at (ati+1 d at) for each at 2 qi such that at , ati+1.

For i > 0, the set of potential next atoms only contains atoms that are connected to an atom that is already in the plan since unconnected atoms will cause an unnecessary blowup of the number of intermediate results. Let Pi = (at1; : : : ; ati) with i n be an execution plan for q. The procedure we follow to find the solution mappings i for Pi is recursively defined as follows: Initially, our solution set contains only the identity mapping 0 = f 0g, which does not map any variable to any value. Assuming that we have evaluated the sequence Pi 1 = (at1; : : : ; ati 1) and we have found the set of solution mappings i 1, in order to find the solution mappings i of Pi, we use instance retrieval tasks of reasoners to extend the mappings in i 1 to cover the new variables of ati or the entailment check service of reasoners if ati does not contain new variables. A detailed description of the method we are using for the evaluation of an execution plan together with optimizations can be found in our previous work [5].

We now define the cost functions s and d more precisely, which estimate the cost of the required reasoner operations and the estimated result output size of evaluating a query atom. The intuition behind the estimated value of the reasoner operation costs (the functions’ first component) is that the evaluation of possible instances is much more costly than the evaluation of known instances since possible instances require expensive consistency checks whereas known instances require cheap cache lookups. The estimated result size (the functions’ second component) takes into account the number of known and possible instances and the probability that possible instances are actual instances. The static cost function has more cases since for atoms we might only know that their variables are bound, without knowing to which individuals they are bound to. The functions depend on several factors: – K[C] (K[R]) and P[C] (P[R]) for known and possible instances of a concept C (a role R) from Section 3 – sucK[R] := fi j 9 j:hi; ji 2 K[R]g (preK[R] := fi j 9 j:h j; ii 2 K[R]g) for the individuals with known successors (predecessors) for a role R. We define analogous functions sucP[R] and preP[R] for the individuals with possible successors and predecessors for a role R – sucK[R; a] := fi j ha; ii 2 K[R]g (preK[R; a] := fi j hi; ai 2 K[R]g) for known R-successors (R-predecessors) of an individual a. We define analogous functions sucP[R; a] and preP[R; a] for possible R-successors and R-predecessors of an individual a – CL for the cost of a cache lookup in the reasoner’s internal structures – CE for the cost of an entailment check – PIS for the possible instance success, i.e, an estimate for percentage of possible instances that are actual instances

The values CL, CE and PIS are determined experimentally. The time needed for a cache lookup is much less than the time needed for an entailment check with the di erence between the two depending on the ontology and even within an ontology on the queried concept (role). The two costs (CL and CE) were determined by taking the average time of the previous performed checks (lookup or entailment). In the case of CE, we multiply this number with the depth of the concept (role) hierarchy. The depth of the concept (role) hierarchy should be taken into account for the estimation of CE since we only store the direct types of each individual (roles in which each individual participates). In order to find the instances of a concept (role), we may need to check all its subconcepts (subroles) that contain possible instances. The possible instance success, PIS , was determined by testing several ontologies and checking how many of the initial possible instances were real ones, which was around 50% in nearly all ontologies.

In the following, we use a; b for individual names and x; y for variables. We first define the static cost function s, which takes a pair hat(~t); VarsBi as input, where at(~t) is a query atom and VarsB is the set of (bound) variables from Vars(at(~t)), and returns a pair of real numbers as follows: – for hat(~t); VarsBi 2 fhC(x); ;i; hR(x; y); ;ig hjK[at]j CL + jP[at]j CE; jK[at]j + PIS jP[at]ji ( 1 ) ( 2 ) – for hat(~t); VarsBi 2 fhR(a; x); ;ig

hjsucK[at; a]j CL + jsucP[at; a]j CE; jsucK[at; a]j + PIS jsucP[at; a]ji – for hat(~t); VarsBi 2 fhR(x; a); ;ig we use preK (preP) instead of sucK (sucP) in ( 2 ) – for hat(~t); VarsBi 2 fhC(a); ;i; hR(a; b); ;ig

hCL; 1i if ~t 2 K[at] hCE ; PISi if ~t 2 P[at]

hCL; 0i otherwise – for hat(~t); VarsBi 2 fhC(x); fxgi; hR(x; y); fx; ygi; hR(a; x); fxgi; hR(x; a); fxgig * jK[at]j CL + jP[at]j CE ; jK[at]j + PIS jP[at]j +

jIOj jIOj jIOj jIOj – for hat(~t); VarsBi = hR(x; y); fxgi * jK[at]j jsucK[at]j

CL +

jP[at]j jsucP[at]j

CE ;

jK[at]j jsucK[at]j +

jP[at]j jsucP[at]j

PIS + – for hat(~t); VarsBi = hR(x; y); fygi we use preK (preP) instead of sucK (sucP) in ( 5 ) The dynamic cost function d is based on the static function s, but only applies to cases ( 1 ), ( 2 ) and ( 3 ). The function takes a pair hat(~t); i as input, where at(~t) is a query atom and is the set of solution mappings for the atoms that have already been evaluated, and returns a pair of real numbers using matrix addition as follows: d(at(~t); ) =

X s( (at(~t)); ;) 2

A motivating example showing the di erence between static and dynamic ordering and justifying why dynamic ordering can be beneficial in our setting is shown below. Let us assume that a query q consists of the three query atoms: C(x), R(x; y), D(y). Table 1 gives information about the known and possible instances of these atoms within a sequence. In particular, the first column enumerates possible execution sequences S i = (at1; : : : ; ati) for the atoms of q. Column 2 ( 3 ) gives the number of mappings to known (possible) instances of ati (i.e., the number of known (possible) instances of ati) that satisfy at the same time the atoms (at1; : : : ; ati 1). Column 4 gives the number of possible instances of ati from Column 3 that are real instances (that belong to i). Let ( 3 ) ( 4 ) ( 5 ) us assume that we have 10,000 individuals in our ontology O. We will now explain what the formulas described above are doing. We assume that CL CE which is always the case since a cache lookup is less expensive than a consistency check. In both techniques (static and dynamic) the atom R(x; y) will be chosen first since it has the least number of possible instances (200) while it has the same (or smaller) number of known instances (200) with the other atoms (s(R(x; y); ;) = d(R(x; y); f 0g) = h200 CL + 200 CE; 200 + PIS 200i, s(C(x); ;) = d(C(x); f 0g) = h200 CL + 350 CE; 200 + PIS 350i, s(D(y); ;) = d(D(y); f 0g) = h700 CL + 600 CE; 700 + PIS 600i). Then, in the case of static ordering, after R(x; y), the atom C(x) is chosen since C has less possible (and known) instances than D (350 versus 600). Indeed, s(C(x); fxg) = h 102;00000 CL + 103;50000 CE; 200+130580 PIS i, s(D(y); fyg) = h 10;000 CL + 106;00000 CE; 700+600 PIS i. Hence, the order of evaluation in this 700

108 case will be P = (R(x; y); C(x); D(y)) leading to 200(row 2) + 150(row 4) + 40(row 7) entailment checks. In the dynamic case, after the evaluation of R(x; y), which gives a set of solutions 1, the atom D(y) has fewer known and possible instances (50 known and 50 possible) than the atom C(x) (100 known and 150 possible) and, hence, a lower cost. Indeed, d(D(y); 1) = h50 CL + 150 CL + 50 CE; 50 + 0 + 50 PISi, d(C(x); 1) = h100 CL + 0 CL + 150 CE; 100 + 0 + 150 PISi. Therefore, atom D(y) will be chosen next leading to the execution of the query atoms in the order P = (R(x; y); D(y); C(x)) and the execution of 200(row 2) + 50(row 5) + 35(row 6) entailment checks. 5

Evaluation

We tested our ordering techniques with the Lehigh University Benchmark (LUBM) [3] as a case where no disjunctive information is present and with the more expressive University Ontology Benchmark (UOBM) [6] using the HermiT5 hypertableau reasoner. All experiments were performed on a Windows 7 machine with a double core 2.53 GHz Intel x86 64 bit processor and Java 1.6 allowing 1GB of Java heap space. We measure the time for one-o tasks such as classification separately since such tasks are usually performed before the system accepts queries. The ontologies and all code required to perform the experiments are available online.6

We first used the 14 conjunctive ABox queries provided in LUBM. From these, queries 2, 7, 8, 9 are the most interesting ones in our setting since they contain many atoms and ordering them can have an e ect in running time. We tested the queries on LUBM( 1,0 ) and LUBM( 2,0 ) which contain data for one or two universities respectively, starting from index 0. LUBM( 1,0 ) contains 16,283 individuals and LUBM( 2,0 ) contains 38,334 individuals. LUBM( 1,0 ) took 3.8 s to load and 22.7 s for classification and initialization of known and possible instances of concepts and roles. LUBM( 2,0 ) took 15.8 s to load and 146 s for classification and initialization of known and possible instances. Table 2 shows the execution time for each of the four queries for LUBM( 1,0 ) and LUBM( 2,0 ). The queries marked with (*) are the queries where the static and dynamic algorithms result in the same ordering. In these queries we observe an increase in running time when the dynamic technique is used (in comparison to the static) which is 5 http://www.hermit-reasoner.com/ 6 http://code.google.com/p/query-ordering/ especially evident on LUBM( 2,0 ) Query 8, where the number of individuals in the ontology and the intermediate result sizes are larger. Dynamic ordering also behaves worse than static in queries 2 and 9. This happens because, although the dynamic algorithm chooses a better ordering than the static algorithm, the intermediate results (that need to be checked in each iteration to determine the next query atom to be executed) are quite large and hence the cost of iterating over all possible mappings in the dynamic case far outweighs the better ordering that is obtained. We also observe that a random sampling of individuals for collecting the ordering statistics in the dynamic case (checking only 50% of individuals in i 1 randomly for detecting the next query atom to be executed) leads to much worse results in most queries than plain static or dynamic ordering.

From the nondeterministic UOBM ontology we removed the nominals and only used the first three departments containing 6,409 individuals. The ontology took 6.4 s to load and 30.3 s to classify and initiliaze the known and possible instances. We ran our static and dynamic algorithms on queries 4, 9, 11, 12 and 14 provided in UOBM, which are the most interesting ones because they consist of many atoms. Most of these queries contain one atom with possible instances. Static and dynamic ordering show similar performance because the intermediate result set sizes are small and the available statistics in this case are quite accurate and result in the same ordering for both methods. For both ordering methods, atoms with possible instances for these queries are executed last. In order to illustrate when dynamic ordering performs better than static, we also created the two custom queries: q1 = fisAdvisedBy(x,y), GraduateStudent(x), Woman(y) g q2 = f SportsFan(x), GraduateStudent(x), Woman(x) g In both queries, P[GraduateStudent], P[Woman] and P[isAdvisedBy] are non-empty. The running times for dynamic ordering are smaller since the more accurate statistics result in a smaller number of possible instances that have to be checked during query execution. In particular, for the static ordering, 151 and 41 possible instances have to be checked in query q1 and q2, respectively, compared to only 77 and 23 for the dynamic ordering. Moreover, the intermediate results are generally smaller in dynamic ordering than in static leading to a significant reduction in the running time of the queries. All of the presented queries could not be answered in the time limit of 30 minutes in case no ordering algorithm was used.

Conclusions

In the current paper, we have dealt with the definition of cost formulas that are based on information extracted from reasoners’ models for ordering the atoms of a conjunctive instance query that is issued over an OWL ontology. We have devised two algorithms, a static and a dynamic one, for finding a good order and show (through an experimental study) that static techniques are quite adequate for deterministic ontologies, however, when disjunctive knowledge is present, dynamic techniques often perform better. The proposed query ordering costs can be used with either tableau or hypertableau reasoners, however, in the case of tableau reasoners they can be less accurate. Future work will include the definition of additional cost measures and the use of appropriate criteria for sampling the individuals and use only these samples for extracting the costs for dynamic ordering.

Acknowledgements This work was done within the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG).

1. Glimm , B. , Horrocks , I. , Lutz , C. , Sattler , U. : Conjunctive query answering for the description logic SHIQ . Journal of Artificial Intelligence Research 31 , 151 - 198 ( 2008 )

2. Glimm , B. , Horrocks , I. , Motik , B. , Shearer , R. , Stoilos , G.: A novel approach to ontology classification . Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Accepted ( 2012 )

3. Guo , Y. , Pan , Z. , Heflin , J.: LUBM: A benchmark for OWL knowledge base systems . J. Web Semantics 3 ( 2-3 ), 158 - 182 ( 2005 )

4. Kazakov , Y. : RIQ and SROIQ are harder than SH OIQ . In: Brewka, G. , Lang , J . (eds.) Proc. 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR'08) . pp. 274 - 284 . AAAI Press ( 2008 )

5. Kollia , I. , Glimm , B. , Horrocks , I. : SPARQL query answering over OWL ontologies . In: Proceedings of the 8th Extended Semantic Web Conference (ESWC 2011 ). pp. 382 - 396 . Lecture Notes in Computer Science, Springer-Verlag ( 2011 )

6. Ma , L. , Yang , Y. , Qiu , Z. , Xie , G. , Pan , Y. , Liu , S. : Towards a complete OWL ontology benchmark . In: The Semantic Web: Research and Applications, chap. 12 , pp. 125 - 139 . Lecture Notes in Computer Science, Springer ( 2006 )

7. Motik , B. , Shearer , R. , Horrocks , I. : Hypertableau reasoning for description logics . Journal of Artificial Intelligence Research 36 , 165 - 228 ( 2009 )

8. Sirin , E. , Parsia , B. : Optimizations for answering conjunctive ABox queries: First results . In: Proc. of the Int. Description Logics Workshop DL ( 2006 )

9. Steinbrunn , M. , Moerkotte , G. , Kemper , A. : Heuristic and randomized optimization for the join ordering problem . VLDB Journal 6 , 191 - 208 ( 1997 )

10. Stocker , M. , Seaborne , A. , Bernstein , A. , Kiefer , C. , Reynolds , D. : SPARQL basic graph pattern optimization using selectivity estimation . In: Proceedings of the 17th international conference on World Wide Web . pp. 595 - 604 . WWW '08, ACM , New York, NY, USA ( 2008 )