<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cost Based Query Ordering over OWL Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilianna Kollia</string-name>
          <email>ilianna.kolliag@uni-ulm.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Birte Glimm</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ulm University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Query answering—the computation of answers to users’ queries w.r.t. ontologies and
data—is an important task provided by many description logic (DL) reasoners.
Although much e ort has been spent on optimizing the ‘reasoning’ part of query
answering, i.e., the extraction of the individuals that satisfy a concept or role atom, less
attention has been given to optimizing the actual query answering part when ontologies
in expressive languages are used.</p>
      <p>In the context of databases or triple stores, cost-based ordering techniques for
finding an optimal or near optimal join ordering have been widely applied [9, 10]. These
techniques involve the maintenance of a set of statistics about relations and indexes,
e.g., number of pages in a relation, number of pages in an index, number of distinct
values in a column, together with formulas for the estimation of the selectivity of
predicates and the estimation of the CPU and I/O costs of query execution that depends
amongst others, on the number of pages that have to be read from or written to
secondary memory. The formulas for the estimation of selectivities of predicates (result
output size of query atoms) estimate the data distributions using histograms, parametric
or sampling methods or combinations of them.</p>
      <p>In the context of ontologies, the formulas should be extended to take into account
another important cost component, i.e., the cost of executing specific reasoner tasks
such as entailment checks or instance retrievals. It is much more di cult to estimate
this cost precisely before query evaluation as this cost varies and takes values from
a wide range. For example, SROIQ has a worst case complexity of 2-NExpTime [4]
and typical implementations are not worst case optimal. The hypertableau satisfiability
checking algorithm for SROIQ that we use in this paper has a worst-case complexity
of 3-NExpTime in the size of the ontology [7, 4].3 Instead, a subset of possible mappings
for the variables of a query can be sampled and, based on the statistics extracted from
these samples, the most e cient join ordering can be estimated. This, however, requires
that the samples are selected according to an ontology based criterion and not randomly.
Preliminary e orts for finding good join ordering in knowledge bases have already been
made [8].</p>
      <p>In this paper we address the issue of query atom ordering, which constitutes an
optimization task, for conjunctions of instance queries issued over ontologies with
expressivity up to SROIQ. The optimization goal is to find the execution plan (an order
for query atoms) which leads to the most e cient execution of the query. This involves
3 The 2-NExpTime result for SHOIQ+ increases to 3-NExpTime when adding role chains [4]
the minimization of the number of needed reasoning tasks and the size of intermediate
results. The execution plan which satisfies the above property is determined by means
of a cost function that assigns costs to query atoms within an execution plan. This cost
function is based on heuristics and summaries for statistics about the data which are
extracted from a DL reasoner model. We explore static and dynamic algorithms that
greedily explore the execution plan search space to determine an optimal or near
optimal execution plan. Static ordering refers to the finding of a join order before query
evaluation starts whereas dynamic ordering determines the ordering of query atoms
during query evaluation taking advantage of already computed query atom results.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In this section we briefly present an overview of the model building tableau and
hypertableau calculi and give an introduction to conjunctive instance queries.</p>
      <p>It is known that checking whether an individual s0 (pair of individuals hs0; s1i) is
an instance (are instances) of a concept C (role R) w.r.t. an ontology O is equivalent to
checking whether O [ f:C(c0)g (O [ f(8R::fs1g)(s0)g) is inconsistent w.r.t. O. In order
to perform this action, most DL reasoners use a model construction calculus such as
tableau or hypertableau. In the remainder, we focus on the hypertableau calculus [7],
but a tableau calculus could equally be used and we state how our results can be
transferred to tableau calculi. The hypertableau calculus starts from an initial set of assertions
and by applying derivation rules it tries to construct (an abstraction of) a model of O.
Derivation rules usually add new concept and role assertion axioms, they may introduce
new individuals, they can be nondeterministic, leading to the need to choose between
several alternative assertions to add or they can lead to a clash when a contradiction
is detected. To show that an ontology O is (in)consistent, the hypertableau calculus
constructs a derivation, i.e., a sequence of sets of assertions A0; : : : ; An, such that A0
contains all assertions in O, Ai+1 is the result of applying a derivation rule to Ai and An
is the final set of assertions where no more rules are applicable. If a derivation exists
such that An does not contain a clash, then O is consistent and An is called a pre-model
of O. Otherwise O is inconsisent. Each assertion in a set of assertions Ai is derived
either deterministically or nondeterministically. An assertion is derived deterministically
if it is derived by the application of a deterministic derivation rule from assertions that
were all derived deterministically. Any other derived assertion is derived
nondeterministically. It is easy to know whether an assertion was derived deterministically or not
because of the dependency directed backtracking that most (hyper)tableau reasoners
employ. In the pre-model, each individual s0 is assigned a label L(s0) representing the
concepts it is (non)deterministically an instance of and each pair of individuals hs0; s1i
is assigned a label L(hs0; s1i) representing the roles through which individual s0 is
(non)deterministically related to individual s1.</p>
      <p>Definition 1. Let S = (NC; NR; NI ) be a signature of an ontology O, NV a countably
infinite set of variables disjoint from NC; NR and NI and S O = (CO; RO; IO) the
restriction of S to terms that occur in O. A term t is an element from NV [ NI . Let C 2 CO
be a concept, r 2 RO a role, and t; t0 2 IO [ NV terms. An atom is an expression
C(t) or r(t; t0) and we refer to these types of atoms as concept and role atoms,
respectively. A query q is a non-empty set of atoms. We use Vars(q) to denote the set of
variables occurring in q, Inds(q) to denote the set of individual names occurring in q, and
Terms(q) = Vars(q) [ Inds(q) for the set of terms in q. We use jqj to denote the number
of atoms in q.</p>
      <p>Let q = fat1; : : : ; atng be a query. A mapping for q over O is a total function
: Terms(q) ! IO such that (a) = a for each a 2 Inds(q). The set q of all possible
mappings for q is defined as q := f j is a mapping for qg. A solution mapping
for q over O is a mapping such that O j= C( (t)) for each concept atom C(t) 2 q and
O j= r( (t); (t0)) for each role atom r(t; t0) 2 q.</p>
      <p>According to the above definition, we deal with conjunctive queries with only
distinguished variables. Without loss of generality, we assume that queries are connected.
In case they are not, the connected components of a query can be evaluated
independently and the results of these evaluations can be combined in the end [1].
3</p>
    </sec>
    <sec id="sec-3">
      <title>Extracting Individual Information from Reasoner Models</title>
      <p>The first step in the ordering of query atoms is the extraction of statistics, exploiting
information generated by reasoners.</p>
      <p>As has been mentioned in Section 2, if an ontology is consistent and a pre-model is
constructed, individuals are assingned labels in this pre-model. These labels can provide
us with information about the concepts the individuals belong to or the roles in which
they participate. We exploit this information similarly as was suggested for
determining known (non-)subsumers for classes during classification [2]. In the hypertableau
calculus, the following two properties hold for each ontology O and each constructed
pre-model An for O:
(P1) for each atomic concept C (role R), each individual s0 (each pair of individuals
hs1; s2i) in An, if C 2 LAn (s0) (R 2 LAn (hs1; s2i)) and the assertion C(s0) (R(s1 s2))
was derived deterministically, then it holds O j= C(s0) (O j= R(s1; s2)).
(P2) for an arbitrary individual s0 in An (an arbitrary pair of individuals hs1; s2i in
An) and an arbitrary atomic concept C (simple role R), if C &lt; LAn (s0) (R &lt;
LAn (hs1; s2i)), then O 6j= C(s0) (O 6j= R(s1; s2)).</p>
      <p>We use these properties to extract information from the pre-model of a satisfiable
ontology O as outlined in Algorithm 1. In our implementation we use a more
complicated procedure to only store the direct types of each individual. The information we
extract involves the maintenance of the sets of known and possible instances for all
atomic concepts of O. The known instances of a concept C (K[C]) are the individuals
that can be safely considered instances of the concept according to the pre-model, i.e.,
the individuals of the assertions referring to concept C that have been derived
deterministically. The possible instances of a concept C (P[C]) are the individuals of the
assertions referring to C that have been derived nondeterministically. These individuals
require costly consistency checks in order to decide whether they are real instances of
the respective concept.</p>
      <p>Algorithm 1 initializeKnownAndPossibleConceptInstances
Require: a consistent SROIQ ontology O to be queried
Ensure: sets K[C] (P[C]) of known (possible) instances for each concept C of O are computed
1: An := buildModelFor(O)
2: for all ind 2 IO do
3: for all C 2 LAn (ind) do
4: if C was derived deterministically then
5: K[C] := K[C] [ findg
6: else
7: P[C] := P[C] [ findg
8: end if
9: end for
10: end for</p>
      <p>The procedure to find the known and possible instances of simple and complex
(transitive roles or roles having a transitive subrole) roles or, given an individual, the
known and possible role successors or predecessors, can be defined similarly. In the
case of complex roles, however, before the buildModelFor procedure is applied, O is
expanded with additional axioms that capture the semantics of the transitive relations
since (hyper)tableau reasoners typically do not deal with transitivity directly [7]. In
particular, for each individual ind and each complex role p, the new concepts Cipnd and
Cind are created and the axioms Cind(ind) and Cind v 8p:Cipnd are added to O. Intuitively,
the consequent application of the transitivity encoding [7] produces axioms Cipnd(s) that
propagate to each individual s that is reachable from ind via a p-chain. The known and
possible p-successors for ind can then be determined from concept assertions Cipnd(s) in
the pre-model.</p>
      <p>The technique presented in this paper can be used with any (hyper)tableau
calculus for which properties (P1) and (P2) hold. All (hyper)tableau calculi used in practice
that we are aware of satisfy property (P1). Pre-models produced by tableau algorithms
as presented in the literature also satisfy property (P2); however, commonly used
optimizations, such as lazy unfolding, can compromise property (P2), which we illustrate
with the following example. Let us assume we have an ontology O containing the
axioms A v 9R:(C u D), B 9R:C and A(a). It is obvious that in this ontology A is a
subconcept of B (hence O j= B(a)) since every individual that is R-related to an
individual that is an instance of the intersection of C and D is also R-related to an individual
that is an instance of the concept C. However, even though the assertion A(a) occurs in
the ABox, the assertion B(a) is not added in the pre-model when we use lazy unfolding.
With lazy unfolding, instead of treating B 9R:C as two disjunctions :B t 9R:C and
B t 8R:(:C) as is typically done for general concept inclusion axioms (GCIs), B is
only lazily unfolded into its definition 9R:C once B occurs in the label of an individual.
Thus, although (9R:(C u D))(a) would be derived, this does not lead to the addition of
B(a).</p>
      <p>Nevertheless, most (if not all) implemented calculi produce pre-models that satisfy
at least the following weaker property:
(P3) for an arbitrary individual s0 in An and an arbitrary concept C where C is primitive
in O,4 if C &lt; LAn (s0), then O 6j= C(s0).</p>
      <p>Hence, properties (P2) and (P3) can be used to extract (non-)instance information from
pre-models. For tableau calculi that only satisfy (P3), Algorithm 1 can be modified
accordingly. In particular, for each non-primitive concept C in O we need to add to
P[C] the individuals in O that do not include the concept C in their label.</p>
      <p>Even though the proposed technique for determining known and possible instances
of concepts and roles can be used in the same way with both tableau and hypertableau
reasoners, the e ect that it will have when tableau algorithms are used is less intense.
This happens because tableau algorithms often introduce more nondeterminism than
hypertableau. In particular, in tableau algorithms a disjunction is added to each
individual for each GCI in O and, when optimizations such as lazy unfolding are used, these
compromise property (P2) and we have to use the weaker property (P3) or even
consider all concepts that do not occur in the label of an individual as possible types, which
results in less accurate statistics.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Query Answering and Query Atom Ordering</title>
      <p>In this section we describe two di erent algorithms (a static and a dynamic one) for
ordering the atoms of a query based on some costs and then we deal with the formulation
of these costs. We first introduce the abstract graph representation of a query q by means
of a labeled graph Gq on which we define the computed statistical costs.
Definition 2. A query join graph Gq for a query q is a tuple (V; E; EL), where
– V = q is a set of vertices (one for each query atom);
– E V V is a set of edges such that hat1; at2i 2 E i Vars(at1) \ Vars(at2) , ;
and at1 , at2;
– EL is a function that assigns a set of variables to each hat1; at2i 2 E such that
EL(at1; at2) = Vars(at1) \ Vars(at2).</p>
      <p>In the remainder we use q for a query fat1; : : : ; atng, Gq for the according query join
graph and q for the solution mappings of q. Our goal is to find a query execution plan,
which determines the evaluation order for atoms in q. Since the number of possible
execution plans is of order jqj!, the ordering task quickly becomes impractical. In the
following, we focus on greedy algorithms for determining an execution order, which
prune the search space considerably. Roughly speaking, we proceed as follows: We
define a cost function, which consists of two components: an estimate for the reasoning
costs and an estimate for the intermediate result size. Both components are combined
to induce an order among query atoms. In this paper, we simply build the sum of the
two cost components, but di erent combinations such as a weighted sum of the two
values could also be used. For the query plan construction we distinguish static from
dynamic planning. For the former, we start constructing the plan by adding a minimal
4 A concept C is considered primitive in O if O is unfoldable and it contains no axiom of the
form C E
atom according to the order. Variables from this atom are then considered bound, which
changes the cost function and might induce a di erent order among the remaining query
atoms. Considering the updated order, we again select the minimal query atom that is
not yet in the plan and update the costs. This process continues until the plan contains all
atoms. Once a complete plan has been determined the atoms are evaluated. The dynamic
case di ers in that after selecting an atom for the plan, we immediately determine the
solutions for the chosen atom, which are then used to update the cost function. Dynamic
ordering is a costly but accurate procedure. Sampling techniques can be used so that not
all mappings are used for the update of the cost function but only a subset of them. In
Section 5 we show that random sampling is not adequate and a more sophisticated
sampling criterion is needed. However, the definition of such criterion is out of the
scope of the current paper. We now make the process of query plan construction more
precise, but we leave the exact details of defining the cost function and the ordering it
induces to later.</p>
      <p>Definition 3. A static (dynamic) cost function w.r.t. q is a function s : q 2Vars(q) !
R R (d : q 2 q ! R R). The two costs are combined to yield a static ordering s
(a dynamic ordering d), which is a total order over the atoms of q.</p>
      <p>An execution plan for q is a duplicate-free sequence of query atoms from q. The
initial execution plan is the empty sequence and a complete execution plan is a sequence
containing all atoms of q. For Pi = (at1; : : : ; ati) with i &lt; n an execution plan for q
with query join graph Gq = (V; E; EL), we define the potential next atoms qi for Pi
w.r.t. Gq as qi = q for Pi the initial execution plan and qi = fat j hat0; ati 2 E; at0 2
fat1; : : : ; atig; at 2 q n fat1; : : : ; atigg otherwise. The static (dynamic) ordering induces an
execution plan Pi+1 = (at1; : : : ; ati; ati+1) with ati+1 2 qi and ati+1 s at (ati+1 d at) for
each at 2 qi such that at , ati+1.</p>
      <p>For i &gt; 0, the set of potential next atoms only contains atoms that are connected to
an atom that is already in the plan since unconnected atoms will cause an unnecessary
blowup of the number of intermediate results. Let Pi = (at1; : : : ; ati) with i n be an
execution plan for q. The procedure we follow to find the solution mappings i for Pi
is recursively defined as follows: Initially, our solution set contains only the identity
mapping 0 = f 0g, which does not map any variable to any value. Assuming that we
have evaluated the sequence Pi 1 = (at1; : : : ; ati 1) and we have found the set of solution
mappings i 1, in order to find the solution mappings i of Pi, we use instance retrieval
tasks of reasoners to extend the mappings in i 1 to cover the new variables of ati or the
entailment check service of reasoners if ati does not contain new variables. A detailed
description of the method we are using for the evaluation of an execution plan together
with optimizations can be found in our previous work [5].</p>
      <p>We now define the cost functions s and d more precisely, which estimate the cost
of the required reasoner operations and the estimated result output size of evaluating
a query atom. The intuition behind the estimated value of the reasoner operation costs
(the functions’ first component) is that the evaluation of possible instances is much
more costly than the evaluation of known instances since possible instances require
expensive consistency checks whereas known instances require cheap cache lookups. The
estimated result size (the functions’ second component) takes into account the number
of known and possible instances and the probability that possible instances are actual
instances. The static cost function has more cases since for atoms we might only know
that their variables are bound, without knowing to which individuals they are bound to.
The functions depend on several factors:
– K[C] (K[R]) and P[C] (P[R]) for known and possible instances of a concept C (a
role R) from Section 3
– sucK[R] := fi j 9 j:hi; ji 2 K[R]g (preK[R] := fi j 9 j:h j; ii 2 K[R]g) for the
individuals with known successors (predecessors) for a role R. We define analogous
functions sucP[R] and preP[R] for the individuals with possible successors and
predecessors for a role R
– sucK[R; a] := fi j ha; ii 2 K[R]g (preK[R; a] := fi j hi; ai 2 K[R]g) for known
R-successors (R-predecessors) of an individual a. We define analogous functions
sucP[R; a] and preP[R; a] for possible R-successors and R-predecessors of an
individual a
– CL for the cost of a cache lookup in the reasoner’s internal structures
– CE for the cost of an entailment check
– PIS for the possible instance success, i.e, an estimate for percentage of possible
instances that are actual instances</p>
      <p>The values CL, CE and PIS are determined experimentally. The time needed for a
cache lookup is much less than the time needed for an entailment check with the di
erence between the two depending on the ontology and even within an ontology on the
queried concept (role). The two costs (CL and CE) were determined by taking the
average time of the previous performed checks (lookup or entailment). In the case of CE,
we multiply this number with the depth of the concept (role) hierarchy. The depth of
the concept (role) hierarchy should be taken into account for the estimation of CE since
we only store the direct types of each individual (roles in which each individual
participates). In order to find the instances of a concept (role), we may need to check all its
subconcepts (subroles) that contain possible instances. The possible instance success,
PIS , was determined by testing several ontologies and checking how many of the initial
possible instances were real ones, which was around 50% in nearly all ontologies.</p>
      <p>
        In the following, we use a; b for individual names and x; y for variables. We first
define the static cost function s, which takes a pair hat(~t); VarsBi as input, where at(~t)
is a query atom and VarsB is the set of (bound) variables from Vars(at(~t)), and returns a
pair of real numbers as follows:
– for hat(~t); VarsBi 2 fhC(x); ;i; hR(x; y); ;ig
hjK[at]j CL + jP[at]j CE; jK[at]j + PIS jP[at]ji
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
– for hat(~t); VarsBi 2 fhR(a; x); ;ig
      </p>
      <p>
        hjsucK[at; a]j CL + jsucP[at; a]j CE; jsucK[at; a]j + PIS jsucP[at; a]ji
– for hat(~t); VarsBi 2 fhR(x; a); ;ig we use preK (preP) instead of sucK (sucP) in (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
– for hat(~t); VarsBi 2 fhC(a); ;i; hR(a; b); ;ig
      </p>
      <p>hCL; 1i if ~t 2 K[at]
hCE ; PISi if ~t 2 P[at]</p>
      <p>hCL; 0i otherwise
– for hat(~t); VarsBi 2 fhC(x); fxgi; hR(x; y); fx; ygi; hR(a; x); fxgi; hR(x; a); fxgig
* jK[at]j CL + jP[at]j CE ; jK[at]j + PIS jP[at]j +</p>
      <p>jIOj jIOj jIOj jIOj
– for hat(~t); VarsBi = hR(x; y); fxgi
* jK[at]j
jsucK[at]j</p>
      <p>CL +</p>
      <p>jP[at]j
jsucP[at]j</p>
      <p>CE ;</p>
      <p>jK[at]j
jsucK[at]j
+</p>
      <p>jP[at]j
jsucP[at]j</p>
      <p>
        PIS
+
– for hat(~t); VarsBi = hR(x; y); fygi we use preK (preP) instead of sucK (sucP) in (
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
The dynamic cost function d is based on the static function s, but only applies to
cases (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ). The function takes a pair hat(~t); i as input, where at(~t) is a
query atom and is the set of solution mappings for the atoms that have already been
evaluated, and returns a pair of real numbers using matrix addition as follows:
d(at(~t); ) =
      </p>
      <p>X s( (at(~t)); ;)
2</p>
      <p>
        A motivating example showing the di erence between static and dynamic ordering
and justifying why dynamic ordering can be beneficial in our setting is shown below.
Let us assume that a query q consists of the three query atoms: C(x), R(x; y), D(y).
Table 1 gives information about the known and possible instances of these atoms within
a sequence. In particular, the first column enumerates possible execution sequences
S i = (at1; : : : ; ati) for the atoms of q. Column 2 (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) gives the number of mappings to
known (possible) instances of ati (i.e., the number of known (possible) instances of ati)
that satisfy at the same time the atoms (at1; : : : ; ati 1). Column 4 gives the number of
possible instances of ati from Column 3 that are real instances (that belong to i). Let
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
us assume that we have 10,000 individuals in our ontology O. We will now explain what
the formulas described above are doing. We assume that CL CE which is always the
case since a cache lookup is less expensive than a consistency check. In both techniques
(static and dynamic) the atom R(x; y) will be chosen first since it has the least number of
possible instances (200) while it has the same (or smaller) number of known instances
(200) with the other atoms (s(R(x; y); ;) = d(R(x; y); f 0g) = h200 CL + 200 CE; 200 +
PIS 200i, s(C(x); ;) = d(C(x); f 0g) = h200 CL + 350 CE; 200 + PIS 350i, s(D(y); ;) =
d(D(y); f 0g) = h700 CL + 600 CE; 700 + PIS 600i). Then, in the case of static ordering,
after R(x; y), the atom C(x) is chosen since C has less possible (and known) instances
than D (350 versus 600). Indeed, s(C(x); fxg) = h 102;00000 CL + 103;50000 CE; 200+130580 PIS i,
s(D(y); fyg) = h 10;000 CL + 106;00000 CE; 700+600 PIS i. Hence, the order of evaluation in this
700
      </p>
      <p>108
case will be P = (R(x; y); C(x); D(y)) leading to 200(row 2) + 150(row 4) + 40(row 7)
entailment checks. In the dynamic case, after the evaluation of R(x; y), which gives a
set of solutions 1, the atom D(y) has fewer known and possible instances (50 known
and 50 possible) than the atom C(x) (100 known and 150 possible) and, hence, a lower
cost. Indeed, d(D(y); 1) = h50 CL + 150 CL + 50 CE; 50 + 0 + 50 PISi, d(C(x); 1) =
h100 CL + 0 CL + 150 CE; 100 + 0 + 150 PISi. Therefore, atom D(y) will be chosen
next leading to the execution of the query atoms in the order P = (R(x; y); D(y); C(x))
and the execution of 200(row 2) + 50(row 5) + 35(row 6) entailment checks.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>We tested our ordering techniques with the Lehigh University Benchmark (LUBM) [3]
as a case where no disjunctive information is present and with the more expressive
University Ontology Benchmark (UOBM) [6] using the HermiT5 hypertableau reasoner.
All experiments were performed on a Windows 7 machine with a double core 2.53 GHz
Intel x86 64 bit processor and Java 1.6 allowing 1GB of Java heap space. We measure
the time for one-o tasks such as classification separately since such tasks are usually
performed before the system accepts queries. The ontologies and all code required to
perform the experiments are available online.6</p>
      <p>
        We first used the 14 conjunctive ABox queries provided in LUBM. From these,
queries 2, 7, 8, 9 are the most interesting ones in our setting since they contain many
atoms and ordering them can have an e ect in running time. We tested the queries on
LUBM(
        <xref ref-type="bibr" rid="ref1">1,0</xref>
        ) and LUBM(
        <xref ref-type="bibr" rid="ref2">2,0</xref>
        ) which contain data for one or two universities
respectively, starting from index 0. LUBM(
        <xref ref-type="bibr" rid="ref1">1,0</xref>
        ) contains 16,283 individuals and LUBM(
        <xref ref-type="bibr" rid="ref2">2,0</xref>
        )
contains 38,334 individuals. LUBM(
        <xref ref-type="bibr" rid="ref1">1,0</xref>
        ) took 3.8 s to load and 22.7 s for classification
and initialization of known and possible instances of concepts and roles. LUBM(
        <xref ref-type="bibr" rid="ref2">2,0</xref>
        )
took 15.8 s to load and 146 s for classification and initialization of known and possible
instances. Table 2 shows the execution time for each of the four queries for LUBM(
        <xref ref-type="bibr" rid="ref1">1,0</xref>
        )
and LUBM(
        <xref ref-type="bibr" rid="ref2">2,0</xref>
        ). The queries marked with (*) are the queries where the static and
dynamic algorithms result in the same ordering. In these queries we observe an increase in
running time when the dynamic technique is used (in comparison to the static) which is
5 http://www.hermit-reasoner.com/
6 http://code.google.com/p/query-ordering/
especially evident on LUBM(
        <xref ref-type="bibr" rid="ref2">2,0</xref>
        ) Query 8, where the number of individuals in the
ontology and the intermediate result sizes are larger. Dynamic ordering also behaves worse
than static in queries 2 and 9. This happens because, although the dynamic algorithm
chooses a better ordering than the static algorithm, the intermediate results (that need to
be checked in each iteration to determine the next query atom to be executed) are quite
large and hence the cost of iterating over all possible mappings in the dynamic case far
outweighs the better ordering that is obtained. We also observe that a random sampling
of individuals for collecting the ordering statistics in the dynamic case (checking only
50% of individuals in i 1 randomly for detecting the next query atom to be executed)
leads to much worse results in most queries than plain static or dynamic ordering.
      </p>
      <p>From the nondeterministic UOBM ontology we removed the nominals and only
used the first three departments containing 6,409 individuals. The ontology took 6.4 s
to load and 30.3 s to classify and initiliaze the known and possible instances. We ran
our static and dynamic algorithms on queries 4, 9, 11, 12 and 14 provided in UOBM,
which are the most interesting ones because they consist of many atoms. Most of these
queries contain one atom with possible instances. Static and dynamic ordering show
similar performance because the intermediate result set sizes are small and the available
statistics in this case are quite accurate and result in the same ordering for both methods.
For both ordering methods, atoms with possible instances for these queries are executed
last. In order to illustrate when dynamic ordering performs better than static, we also
created the two custom queries:
q1 = fisAdvisedBy(x,y), GraduateStudent(x), Woman(y) g
q2 = f SportsFan(x), GraduateStudent(x), Woman(x) g
In both queries, P[GraduateStudent], P[Woman] and P[isAdvisedBy] are non-empty.
The running times for dynamic ordering are smaller since the more accurate statistics
result in a smaller number of possible instances that have to be checked during query
execution. In particular, for the static ordering, 151 and 41 possible instances have to be
checked in query q1 and q2, respectively, compared to only 77 and 23 for the dynamic
ordering. Moreover, the intermediate results are generally smaller in dynamic ordering
than in static leading to a significant reduction in the running time of the queries. All of
the presented queries could not be answered in the time limit of 30 minutes in case no
ordering algorithm was used.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In the current paper, we have dealt with the definition of cost formulas that are based on
information extracted from reasoners’ models for ordering the atoms of a conjunctive
instance query that is issued over an OWL ontology. We have devised two algorithms,
a static and a dynamic one, for finding a good order and show (through an experimental
study) that static techniques are quite adequate for deterministic ontologies, however,
when disjunctive knowledge is present, dynamic techniques often perform better. The
proposed query ordering costs can be used with either tableau or hypertableau
reasoners, however, in the case of tableau reasoners they can be less accurate. Future work
will include the definition of additional cost measures and the use of appropriate
criteria for sampling the individuals and use only these samples for extracting the costs for
dynamic ordering.</p>
      <p>Acknowledgements This work was done within the Transregional Collaborative
Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems”
funded by the German Research Foundation (DFG).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Glimm</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Conjunctive query answering for the description logic SHIQ</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>31</volume>
          ,
          <fpage>151</fpage>
          -
          <lpage>198</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Glimm</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shearer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoilos</surname>
          </string-name>
          , G.:
          <article-title>A novel approach to ontology classification</article-title>
          .
          <source>Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Accepted</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>J. Web Semantics</source>
          <volume>3</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>158</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kazakov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>RIQ and SROIQ are harder than SH OIQ</article-title>
          . In: Brewka,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . (eds.)
          <source>Proc. 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR'08)</source>
          . pp.
          <fpage>274</fpage>
          -
          <lpage>284</lpage>
          . AAAI Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kollia</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glimm</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>SPARQL query answering over OWL ontologies</article-title>
          .
          <source>In: Proceedings of the 8th Extended Semantic Web Conference (ESWC</source>
          <year>2011</year>
          ). pp.
          <fpage>382</fpage>
          -
          <lpage>396</lpage>
          . Lecture Notes in Computer Science, Springer-Verlag (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Towards a complete OWL ontology benchmark</article-title>
          .
          <source>In: The Semantic Web: Research and Applications, chap. 12</source>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>139</lpage>
          . Lecture Notes in Computer Science, Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shearer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Hypertableau reasoning for description logics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>36</volume>
          ,
          <fpage>165</fpage>
          -
          <lpage>228</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sirin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Optimizations for answering conjunctive ABox queries: First results</article-title>
          .
          <source>In: Proc. of the Int. Description Logics Workshop DL</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Steinbrunn</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moerkotte</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kemper</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Heuristic and randomized optimization for the join ordering problem</article-title>
          .
          <source>VLDB Journal 6</source>
          ,
          <fpage>191</fpage>
          -
          <lpage>208</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stocker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiefer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SPARQL basic graph pattern optimization using selectivity estimation</article-title>
          .
          <source>In: Proceedings of the 17th international conference on World Wide Web</source>
          . pp.
          <fpage>595</fpage>
          -
          <lpage>604</lpage>
          . WWW '08,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>