Inside the Query Space of DL Knowledge Bases

      Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

                 National Technical University of Athens, Greece
       achort@cs.ntua.gr, mikegiatzi@image.ntua.gr, gstam@cs.ntua.gr


       Abstract. We structure the query space of DL knowledge bases (KBs)
       by finding equivalent and partial ordering relations between answer sets
       of conjunctive queries, and defining a formal graph structure that allows
       qualitative and quantitative exploration of the query space. We also dis-
       cuss how to compute such a graph for restricted DL-LiteR KBs.


1    Introduction
Purely assertional or ontology-enriched KBs are increasingly being used by ap-
plications that need access to data and knowledge, and are typically accessed
through queries or facet-based interfaces. When interacting with such a KB, the
application usually has no idea about the quality or completeness of its cover-
age, e.g. which queries have an ‘adequate’, representative number of sufficiently
diverse answers indicating that the KB is a good source for answering such
queries, or which may produce degenerate answers. General-purpose KBs may
in fact have significantly unbalanced contents with respect to topic coverage.
    To explore this aspect of KBs, in this paper we propose a theoretical frame-
work for structuring the space of conjunctive queries that can be posed on a DL
KB in a semi-lattice structure, called the query space graph (QSG) using query
answers. The QSG captures significant information about the coverage of a par-
ticular KB, and reveals, apart from any partial and equivalent ordering relations
following from the theory (e.g. query containment relations [4]), any additional,
potentially informing, epistemic such relations. Provided that the QSG satisfies
certain properties, it fully represents the underlying query space and encodes
the answers to all queries. Our work draws ideas from Formal Concept Anal-
ysis (FCA) [7], but the structure we propose can represent arbitrary queries;
in the case of queries consisting only of concept atoms our graph reduces to a
concept lattice. We also link the graph construction with the query generation
and answering phases. Because the size of the query space explodes combinato-
rially, computation of all queries is practically infeasible unless their answers are
considered, so as to restrict them to those having at least one answer. In this
context, we propose an algorithm for computing the query space of restricted
DL-LiteR KBs. Our work shares also ideas with faceted search [12], since a path
in the QSG captures conjunctive faceted query navigation [2].
    In the rest of the paper, after some preliminaries in Section 2, Section 3
builds the theoretical framework underlying QSGs. Section 4 outlines the general
approach to compute a QSG, and Section 5 specializes it for DL-Lite−     R . Section 6
presents some results for DBPedia, and Section 7 concludes the paper.
2       Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

2    Preliminaries
Let CN, RN, IN be mutually disjoint, finite sets of concept, role and individual
names, respectively. The triple hCN, RN, INi is a vocabulary V, and a TBox T
(ABox A) over V and a DL dialect L is a set of axioms (assertions) that use
elements of V and constructors of L. The pair hV, Li is a language, and K =
hT , Ai is a knowledge base (KB) over some language. In part of the paper we
use DL-Lite−  R as dialect L, a subset of DL-LiteR that excludes axioms of the
form S v R− . In particular, it allows only axioms of the form C v D or R v S,
where C, D are concepts and R, S atomic roles. D can be either atomic or of
the form ∃R(−) .>, and C can also be of the form ∃R(−) .A, where A is atomic.
The semantics of KBs are defined in the standard way using interpretations [3].
An interpretation I = (∆I , ·I ) assigns a set C I ⊆ ∆I to concept C, and a set
rI ⊆ ∆I × ∆I to role r. I is a model of an ABox A iff aI ∈ C I for all C(a) ∈ A,
and (aI , bI ) ∈ rI for all r(a, b) ∈ A. I is a model of a TBox T if it satisfies all
axioms in T . An axiom (or assertion) τ is entailed by a TBox T (a KB K) if it
is satisfied in every model of T (K), and we write T |= τ (K |= τ ).
    Given a vocabulary V and a set of variable names VN, a conjunctive query
(simply, a query) Q is an expression { hx1 , . . . xk i | ∃y1 . . . ∃yl .(a1 ∧ . . . ∧ an ) },
where k, l ≥ 0, n ≥ 1, xi , yi ∈ VN, each ai is an atom C(u) or R(u, v), where
C ∈ CN, R ∈ RN, u, v ∈ IN ∪ {xi }ki=1 ∪ {yi }li=1 and all xi , yi appear in at least
one atom. The vector x = hx1 , . . . xk i is the head of Q (head(Q)), its elements
are the answer variables (avar(Q)), and {a1 , . . . , an } is the body of Q (body(Q)).
var(Q) is the set of all variables appearing in Q. Those that appear once only in
the body are unbound ; all others are bound. If avar(Q) = ∅, the query is boolean.
In this paper we will talk about non-boolean queries containing only variables
(u, v ∈/ IN). We will often write Q as x {a1 , ..., an }, or x Q, to make explicit the
answer variables. Given a KB K, a query Q with head hx1 , . . . xk i and a model I
for K, a vector of individuals c = hc1 , . . . ck i, ci ∈ IN, is an answer to Q in I iff
K, I |= Q(a), where Q(a) = Q{xi /ci }ki=1 . c is a certain answer to Q for K, iff it
is an answer in all models I of K. The set of certain answers to Q is cert(Q, K).
    A query Q1 subsumes a query Q2 (Q1 ⊆S Q2 ) iff there is a substitution θ
s.t. head(Q1 θ) = head(Q2 ) and body(Q1 θ) ⊆ body(Q2 ). If Q1 , Q2 are mutually
subsumed they are syntactically equivalent (Q1 ≡S Q2 ). We write Q ∈ Q iff set Q
contains a query Q0 such that Q0 ≡S Q. Two queries Q1 , Q2 are similar if there
are renamings σ1 , σ2 s.t. head(Q1 σ1 ) = head(Q2 ) and head(Q2 σ2 ) = head(Q1 ).
(A substitution θ = {xi /yi }ni=1 is a renaming for a query Q if xi ∈ var(Q), and
{yi }ni=1 are distinct variables s.t. each yi equals some xj or yi ∈       / var(Q) [10].)
Given Q, let Q0 be a query s.t. head(Q0 ) = head(Q) and body(Q0 ) ⊆ body(Q). If
body(Q0 ) is a minimal subset of body(Q) s.t. Q ⊆S Q0 , Q0 is a condensation of
Q (cond(Q)). If the minimal body(Q0 ) equals body(Q), Q is condensed [8].
    We define the following operations: i) The intersection of two similar queries
                                .
Q1 , Q2 is the query Q1 uQ2 = cond( head(Q1 ) {body(Q1 )∪body(Q2 σ)} ) where σ is
a renaming s.t. head(Q1 ) = head(Q2 σ) and var(Q2 σ) ∩ var(Q1 ) = avar(Q1 ), that
renames all non-answer variables of Q2 to variables not appearing in Q1 . Clearly,
Q1 u Q2 = Q2 u Q1 . We will write x Q1 u x Q2 , assuming that var(x Q1 ) ∩ var(x Q2 )
                              Inside the Query Space of DL Knowledge Bases            3

equals the variables in x. ii) The τ -concatenation of query Q1 with query Q2 ,
where τ is a renaming of a subset of var(Q2 ) to a subset of var(Q1 ), is the query
           .
Q1 ◦τ Q2 = cond( head(Q1 ) {body(Q1 ) ∪ body(Q2 στ )} ) where σ is a renaming that
renames all variables of Q2 that are not renamed by τ to variables not appearing
in Q1 . Dropping τ , we will write x Q1 ◦ x0 Q2 , assuming that var(x Q1 ) contains all
variables in x0 , and var(x Q1 ) ∩ var(x0 Q2 )) equals the variables in x0 . If x = x0 ,
x Q1 ◦ x Q2 = x Q1 u x Q2 . iii) The projection of a query x Q on y, where the vari-
                                                                         .
ables in y are a subset of the variables in x, is the query x Q|y = y {body(x Q)}.
                                       ¯
     Given a query Q, let gra(Q) (gra(Q))        be the (directed) graph with nodes
var(Q) having the edge {z1 , z2 } (resp. (z1 , z2 )) iff R(z1 , z2 ) ∈ body(Q). Q is
connected iff gra(Q) is connected, and (strictly) tree-shaped iff head(Q) = hxi
                                                                  hk,n :n i
              ¯
and gra(Q) (gra(Q))   is a (directed) tree with root x. By QV 1 2 we denote the
syntactically equivalent-free set of all condensed, connected, non-boolean queries
with k ≥ 1 answer variables, head of length k, n variables where k ≤ n1 ≤ n ≤
                                          hk,ni       hk,k:ni               S+∞ hk,ji
n2 , constructable over V. We write QV          for QV        , and QkV for j=1 QV .

3    Structuring the Query Space
In general, to structure a set of similar queries we need a partial ordering rela-
tion . Using query subsumption we can write Q1 S Q2 iff Q2 ⊆S Q1 , whereas
using query containment we can write Q1 T Q2 , iff for given a TBox T we have
cert(Q1 , hT , Ai) ⊆ cert(Q2 , hT , Ai) for any ABox A. These partial ordering rela-
tions capture general properties of the queries that follow from their structure or
the axioms of the underlying TBox. In concrete KBs however, comprising partic-
ular ABoxes, arise more specific, incidental ‘strict containment’ or ‘equivalence’
relations. To capture such relations we need a strict partial ordering relation
that takes into account the actual ABox contents.
Definition 1. Let K = hT , Ai be a KB over some hV, Li, and Q1 , Q2 two (sim-
ilar) queries in QkV , for some k. Q1 and Q2 are answer equivalent w.r.t. K
(Q1 ≡K Q2 ), if cert(Q1 , K) = cert(Q2 , K). Also, Q1 is answer smaller than Q2
w.r.t. T (Q1 ≺K Q2 ), if cert(Q1 , K) ⊂ cert(Q2 , K).
    The smaller-or-equivalent relation K can then be defined as usual. If Q1 S
Q2 then Q1 T Q2 for any T ; given T , if Q1 T Q2 then Q1 K Q2 regardless
of the particular A. ≡K and ≺K can be considered as ‘epistemic-like’ equivalence
and strict partial ordering relations, induced by a particular ABox.
    Based on the above, we can define an equivalence and strict partial order-
ing relation for any Q ⊆ QkV . An equivalence relation ≡ partitions Q to a set
of equivalence classes EC≡ (Q). We call the equivalence class that contains all
queries with zero answers null, and its members null queries. We will say that a
pair (≡, ≺) on a set Q is a structure if for any Q1 , Q2 ∈ Q s.t. Q1 ≡ Q2 we have
Q1 6≺ Q2 and Q2 6≺ Q1 , and inversely if Q1 ≺ Q2 or Q2 ≺ Q1 , then Q1 6≡ Q2 .
Clearly, (≡K , ≺K ) is a structure.
Definition 2. A set Q ⊆ QkV is complete iff for any Q1 , Q2 ∈ Q, we have
Q1 u Q2 ∈ Q.
4       Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

   The completion of a finite set Q is constructed by iteratively extending Q
with the intersections of its elements until reaching a fixpoint. This terminates
because intersections are condensed queries. Since each equivalence class Q̃ ∈
EC≡ (Q) of a complete set Q contains all intersections of its elements, there is a
unique Q ∈ Q̃, called the representative of Q̃, s.t. Q0 ⊆S Q for all Q0 ∈ Q̃. The
representative is the syntactically most ‘specific’ query of the equivalent class.

Definition 3. Let Q ⊆ QkV and (≡, ≺) be a structure. The representative order-
ing of Q
       S is the partially ordered set (Q̂≡ , ≺), where Q̂ is the completion of Q, and
Q̂≡ = Q̃∈EC≡ (Q̂) {Q ∈ Q̃ | Q0 ⊆S Q for all Q0 ∈ Q̃} (the set of representatives).

Example 1. Consider that Q = {x {A(x)}, x {A(x), B(x)}, x {C(x), D(x)}} and
x {C(x), D(x)} ≺ x {A(x)} and x {A(x)} ≡ x {A(x), B(x)}. The completion of Q
is Q̂ = {x {A(x)}, x {A(x), B(x)}, x {C(x), D(x)}, x {A(x), C(x), D(x)}, x {A(x),
B(x), C(x), D(x)}}. The equivalence classes are {x {A(x)}, x {A(x), B(x)}} and
{x {C(x), D(x)}, x {A(x), C(x), D(x)}, x {A(x), B(x), C(x), D(x)}}, thus Q̂≡ =
{x {A(x), B(x)}, x {A(x), B(x), C(x), D(x)}} with x {A(x), B(x), C(x), D(x)} ≺
x {A(x), B(x)}.

Lemma 1. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . For
any Q1 , Q2 ∈ Q̂≡ we have Q1 ≺ Q2 iff Q2 ⊂S Q1 .
Proof. Assume that Q1 ≺ Q2 and Q2 6⊂S Q1 . Since Q1 , Q2 ∈ Q̂≡ we have
Q1 , Q2 ∈ Q̂, where Q̂ is the completion of Q, and so Q1 u Q2 ∈ Q̂. Because
Q1 ≺ Q2 , Q1 u Q2 must belong to the same equivalence class as Q1 . But Q2 6⊂S
Q1 , and so Q1 ⊂S Q1 u Q2 6= Q1 . Thus, Q1 ∈     / Q̂≡ which is a contradiction.
Inversely, if Q2 ⊂S Q1 , given that Q2 6≡ Q1 , by definition we get Q1 ≺ Q2 .

    We say that a query Q ∈ Q̂≡ is minimal (maximal ) w.r.t. to some property if
it satisfies the property and there is no Q0 ∈ Q̂≡ s.t. Q0 ≺ Q (Q ≺ Q0 ) satisfying
the same property.

Definition 4. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV , and
let Q ∈ QkV . The top (bottom) cover of Q in (Q̂≡ , ≺) is the set of all minimal
(maximal) Q0 ∈ Q̂≡ s.t Q0 ⊆S Q (Q ⊆S Q0 ).

    The covers may be empty, e.g. if Q̂≡ = {{x {A(x)}} and Q = x {B(x)}.

Lemma 2. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV , and
let Q ∈ QkV . If the top and bottom covers, Q> and Q⊥ , respectively, of Q are
non-empty, then there is a Q̃ ∈ Q̂≡ s.t. Q0  Q  Q̃ for all Q0 ∈ Q⊥ .
Proof. Let Q̄ be the intersection of the queries in Q> . By definition, Q̄ ⊆S Q,
and hence Q  Q̄. By completeness there is a unique Q̃ ∈ Q̄≡ s.t. Q̃ ≡ Q̄, and so
Q  Q̃. Finally, by definition, for all Q0 ∈ Q⊥ we have Q ⊆S Q0 , hence Q0  Q.

   This allows us to use (Q̂≡ , ≺) to bound the answers to any query Q ∈ QkV
          / Q, provided that Q> , Q⊥ exist. If Q ∈ Q, we have a stronger result.
even if Q ∈
                            Inside the Query Space of DL Knowledge Bases         5

Lemma 3. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . The
bottom cover Q⊥ of any Q ∈ Q is a singleton.
Proof. Since Q ∈ Q, there is a unique Q̃ ∈ EC≡ (Q̂), s.t. Q ∈ Q̃. The represen-
tative Q0 is s.t. Q0 ≡ Q and Q ⊆S Q0 . This is clearly maximal and is in Q⊥ . For
any other Q00 ∈ Q̂≡ s.t. Q ⊆S Q00 , we have Q00 ≺ Q0 , hence it cannot be in Q⊥ .

    This unique element of Q⊥ is the anchor of Q in (Q̂≡ , ≺), and the representa-
tive ordering of Q fully describes Q: any query Q ∈ Q is represented in (Q̂≡ , ≺)
by its unique anchor, and the answers to Q are the answers to its anchor.

Example 2. Let x {A(x)} ≺ x {S(x, y)} ≺ x {S(x, y), R(y, z), B(z)} and Q consist
of these queries. Then Q̂≡ = {{x {A(x)}}, {x {A(x), S(x, y)}}, {x {A(x), S(x, y),
R(y, z), B(z)}} with the obvious ordering. Let Q be x {S(x, y), R(y, z)}. We have
that Q ∈/ Q, and the bottom cover of Q is {x {A(x), S(x, y), R(y, z), B(z)}. Note
that x {S(x, y), R(y, z), B(z)} ∈ Q has the same bottom cover.

    This shows that ‘anchors’ can exist also for queries that do not belong in Q.
Thus, only if we know that a query belongs to Q we can be sure that the answers
of the anchor are also exactly its answers.
    If for the representative ordering (Q̂≡ , ≺) we define the meet operation ∧ so
that Q1 ∧ Q2 is the maximal query Q ∈ Q̂≡ s.t. Q1 u Q2 ⊆S Q, then (Q̂≡ , ≺)
becomes a meet semi-lattice. Given a finite Q, Q̂≡ is finite, thus the semi-lattice
has a bottom. To capture the minimum needed partial ordering relations for a
representative ordering we use a directed acyclic graph.

Definition 5. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . The
query space graph (QSG) of Q is the transitive reduction of the directed acyclic
graph induced by (Q̂≡ , ≺).

   Because (Q̂≡ , ≺) is a finite semi-lattice, the QSG is finite with a bottom, but
may have more than one roots. The QSG is shortcut-free, i.e. if there is an edge
(Q1 , Q2 ) there is no path from Q1 to Q2 in G other than (Q1 , Q2 ).


4   Query Space Graph Computation
Now, we will describe how, given a language hV, Li and a KB K, we can compute
                                                     h1,ki∗
the QSG for the (≡K , ≺K ) structure and the set QV         of tree-shaped queries
with one answer variable and up to k variables. To do this, we need to perform
two tasks: compute the set of queries for which to build the QSG and then
arrange them into an actual QSG. We will start from the latter.
    A given set of queries Q can be arranged into a graph, by first computing the
equivalence classes (the nodes of the graph) and then adding the edges induced
by the partial ordering. Such a graph will not be a QSG unless Q is complete. If
not, all intersections of the queries in Q have to be iteratively computed. This
can be done by exploiting the already produced graph, and let it guide the query
intersection process in a way similar to FCA algorithms [9].
6        Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

    To compute the actual queries, we can compute first all possible tree shapes,
the so-called rooted trees (e.g. there are 1, 1, 2, 4, 9, 20, 48, 115, 286, 719 such
trees with 1 . . . 10 nodes, [1]). We represent a rooted tree with n > 0 nodes as
a set of edges (i, j) (i is the parent of j), where i, j are integers representing
nodes. We assume that if j is a descendent of i, then j > i, so that the root is
the smallest of the integers being used (usually 0). The tree T = ∅ consists only
of a root. To compute all rooted trees with up to t nodes we can start e.g. from
tree {(0, 1)}, and iteratively connect a new node to each existing node, taking
care to discard equivalent representations of the same rooted trees (reflections).
    Let D (C) be the set of all non syntactically equivalent queries of the form
          k
x {Ci (x)}i=1 , Ci ∈ CN, k ≥ 1 (k = 1), i.e. the queries consisting of concept atoms.
                   h1,1i∗                 P|CN|
                          and |D| = i=1 |CN|                = 2|CN| − 1. For a query Q known
                                                        
Clearly, D = QV                                       i
to be in D (C), we use the adornment notation D                           C
                                                                  x Q (x Q). Similarly, let S (R)
be the set of all non syntactically equivalent queries of the form x,y {Ri (xi )}ki=1 ,
Ri ∈ RN, k ≥ 1 (k = 1), and xi = (x, y) or (y, x), i.e. the queries consisting of
                         P|RN|
role atoms. |QS| = i=1 2i |RN|               = 3|RN| − 1. For strictly tree-shaped queries,
                                           
                                       i
where all xi s must be the same, we denote the respective sets by S∗ and R∗. For
a query Q known to be in S (R), we use the notation Sx,y Q (R                    x,y Q), and similarly
for the starred versions. D (S) can be obtained as the node set of the QSG of
the completion of C (R). These sets are the material to put on the edges and
nodes of a rooted tree to get tree-shaped queries. The tree T = ∅ gives rise to
the queries D x0 Q; any tree with nodes {0, . . . , k}, k ≥ 1, gives rise to the queries
                                                                                        
          [D                 S              D                     S              D
           x0 Q ◦ ] (. . . ((x̂1 ,x1 Q1 [ ◦ x1 Q̃1 ]) ◦ . . .) ◦ (x̂k ,xk Qk [ ◦ xk Q̃k ])   |x0   (1)

where x̂i = xparent(i) . The optional parts in square brackets specify concepts in
the tree nodes. Due to the recurrence of Sx̂i ,xi Q [ ◦ D
                                                        xi Q̃], we define the frame link
                            S
                 L = S ∪ x,y Q1 ◦ D          S              D
                                      y Q2 | x,y Q1 ∈ S, y Q2 ∈ D                     (2)
i.e the set of queries with two variables, some roles connecting them and possibly
some concepts on the second variable. By Lx,y Q we will denote a query in L. We
define similarly (i.e. by replacing S with S∗ in Eq. 2) L∗ and L∗  x,y Q .
     Given the above, a simple approach to compute the queries induced by a
rooted tree T is to compute L (or L∗), create a copy of it for each edge, appro-
priately renaming variables, and take all possible combinations between frame
link elements. An alternative, backwards iterative approach is based on the ob-
servation that, if the rooted trees of sizes up to k − 1 are available, a rooted tree
of size k can be directly constructed from one or more such trees. In particular,
a query with shape a rooted tree of size k can be constructed either by a one
step extension of an existing query with shape a rooted tree S of size k − 1 (by
renaming variables, adding a new root and connecting the new to the old root)
or by merging two or more existing queries with shape trees Si (by renaming
variables and joining them on their roots). In the first case, if y Q is the set of
all queries with shape S, its extension is obtained by the unary operator
                 x 3 y Q = (x,y Q ◦ y Q)|x | x,y Q ∈ L and y Q ∈ y Q
                           L                L
                              Inside the Query Space of DL Knowledge Bases           7

In the second case, let x Qi be the set of all queries with shape Si . Merging is an
n-ary operator ⊗, for n ≥ 1:

        x Q1 ⊗ x Q2 ⊗ . . . ⊗ x Qn = {x Q1 u x Q2 u . . . u x Qn | x Qi ∈ x Qi }

    Given the above, the set of all queries represented by Formula 1 are

                             QT = [{ D
                                     x0 Q ∈ D } ⊗ ] x0 Q̃                          (3)

i.e. the optional merging of the queries including only concept atoms on the root
variable, with the set x0 Q̃
                           Nof all queries with the shape of T . These are obtained
for i = 0 from xi Q̃ =         j∈children(i) xi Q̂ij which expresses the merging of the
queries represented by the several branches of the tree assuming the root is node
i. Each branch is either a set of queries in L, if the branch has length 1, or the
extension of the set of queries represented by the first child in the branch:
                       (n                               o
                           L
                           xi ,xj Q| xi  | L
                                           xi ,xj Q ∈ L   if children(j) = ∅
             xi Q̂ij =                                                              (4)
                        xi 3 xj Q̃                        otherwise
                                      S
     Since there are infinite T s, T QT is infinite. To guarantee a finite set of
queries, we must set additional criteria, such as compute only queries with up to
k variables. Such a set will not be complete. A QSG on a non-complete set will
fully describe only the queries in the set (they will have unique anchors), but this
approximation can in practice be useful. Finally, to arrange the queries in a QSG,
we need their answers. If we can answer arbitrary queries, we could compute first
the queries and then evaluate them. However, the number of queries explodes
combinatorially, yet most of them will have zero answers. Thus, assuming that
we are not interested in null queries, we can try to combine query generation
with answering to avoid generating and evaluating many of the null queries.


5    Building a QSG for a DL-Lite−
                                 R Knowledge Base

We will now discuss how we can construct and answer tree-shaped queries for
normalized DL-Lite−  R KBs. We are interested only in non-null queries. Our ap-
proach will be based on the rewriting approach to query answering, which is
particularly applicable to the DL-Lite family [5]: given a query Q and a DL-Lite
KB hT , Ai, there is a setS of queries rewr(Q), called the rewritings of Q, such
that cert(Q, hT , Ai) = Q0 ∈rewr(Q,T ) cert(Q0 , h∅, Ai) i.e. the consequences of T
are incorporated in the several Q0 s, which can then be answered ignoring T (for
brevity, we will omit the KB arguments from cert(·) and rewr(·)). Rewritings are
‘transitive’: A rewriting of a rewriting of Q is a rewriting of Q. It follows that if
Q0 ∈ rewr(Q) and Q00 ∈ rewr(Q0 ) then cert(Q00 ) ⊆ cert(Q0 ) ⊆ cert(Q).
    We focus on DL-Lite−    R KBs, because the absence of axioms R v S
                                                                            −
                                                                               guar-
antees that the edges of a tree-shaped query do not change direction during
rewriting. In DL-Lite− R , the rewriting set of an atomic query is a set (union) of
atomic queries: rewr(x {C(x)}) = { x {A(x)} | T |= A v C} ∪ x {R(x, y)} | T |=
8        Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

∃R v C} ∪ x {R(y, x)} | T |= ∃R− v C} }, rewr(x {R(x, y)}) = { x {B(x)} | T |=
B v ∃R} ∪ x {S(x, y)} | T |= ∃S v ∃R} ∪ x {S(y, x)} | T |= ∃S − v ∃R} }, and
rewr(x,y {R(x, y)}) = { x,y {S(x, y)} | T |= S v R}.
    An unfolding of query Q is a rewriting obtained by substituting each atom
in body(Q) by one of its rewritings, according to the above, taking care to ap-
propriately name newly introduced variables. An unfolding of Q (before con-
densation) retains all bound variables of Q. The set of all unfoldings of Q
is unfold(Q). Apart from unfoldings, Q may have also shrinking rewritings.
Shrinkings arise due to axioms of the form A v ∃R.C. Given such an ax-
iom, we have x {A(x)} ∈ rewr({R(x, y), C(y)}x ) (in fact {A(x)}x is also in
rewr({S(x, y), D(y)}x ) if T |= R v S and T |= C v D). Query x {A(x)} is
a shrinking of x {R(x, y), C(y)} and is an element of shrinky (Q) because it con-
tains all variables of Q apart from bound variable y, which has been eliminated
by reasoning using an axiom of the form A v ∃R.C. Let
                                                                 
                                      [          [
               shrink(Q) = {Q} ∪                      shrink(Q0 )           (5)
                                          v∈var(Q)           Q0 ∈shrinkv (Q)

It can be proved [6, 11] that rewr(Q) = Q0 ∈shrink(Q) unfold(Q0 ). If var(Q) =
                                            S

avar(Q), then rewr(Q) = unfold(Q0 ), since answer variables cannot be eliminated.
    Applying iteratively shrinking on a query using Eq. 5, we may end up with
a query containing only concept atoms on the answer variable. If all of them
have a common unfolding, by unfolding the query we will end up with a query
consisting of a single concept atom.
Definition 6. The kernel of a tree-shaped query x Q is the set
                   0                          C 00    C 0      C 0        C 00
kernel(x Q) = { C
                x Q ∈ rewr(x Q) | there is no x Q 6≡S x Q s.t. x Q ∈ rewr(x Q ) }

Theorem 1. If x Q1 , . . . , x Qn are tree-shaped queries, then
                                                             n
                                                             \
                     cert(x Q1 u . . . u x Qn ) =                  cert(x Qi )      and
                                                             i=1
                                        [                             [
kernel(x Q1 u . . . u x Qn ) =                         ...                         kernel(x Q01 u . . . ux Q0n )
                                    0                           0
                                 x Q1 ∈kernel(x Q1 )         x Qn ∈kernel(x Qn )


Proof. Because all x Qi s share only the answer variable x, their intersection (and
the intersections of their rewritings) has no shrinking on x. Thus, the rewritings
of x Q1 u . . . u xS
                   Qn are all combinations   of the rewritings of each x Qi : rewr(x Q1 u
. . . u x Qn ) = xQ0 ∈rewr(x Q1 ) . . . xQ0 ∈rewr(x Qn ) x Q01 u . . . u x Q0n . Because these
                                       S
                      1                   n
rewritings share only variable x we get the desired results.
    The above allows us to compute cert(x Q1 u . . . u x Qn ) and kernel(x Q1 u . . . u
x Qn ) given the certain answers and kernels of the merged queries. Because a
merging may be used to construct further queries, it should be kept either if it
has a non-empty kernel (which can cause queries constructed as extensions of it
be non-null), or it is itself non-null.
                                 Inside the Query Space of DL Knowledge Bases                9

Theorem 2. Let x Q = (L∗  x,y Q1 ◦ y Q2 )|x ), where x,y Q1 ∈ L, and y Q2 is a strictly
                                                     L∗

tree-shaped query with no concept atoms on y. Then
                                                             [
       cert(x Q) = cert(L∗
                        x,y Q1 ) ./ 2=1 cert(y Q2 ) |1 ∪              cert(C
                                                                           x Q̃)
                                                              C Q̃∈kernel( Q)
                                                              x           x

                       S
                                                 x,y Q1 ◦ y Q̃2 )|x ), ./i=j denotes join-
where kernel(x Q) = C Q̃2 ∈kernel(y Q2 ) kernel((L∗       C
                         y
ing of the left- and right-hand side tuples on the i-th and j-th element respectively,
and |i denotes projection on the i-th element of the tuples.
Proof. If a rewriting x Q̃ of x Q contains y, no shrinking has been applied on y, so
x Q̃ = (x,y Q̃1 ◦ y Q̃2 )|x , x,y Q̃1 ∈ rewr(x,y Q1 ) and y Q̃2 ∈ rewr(y Q2 ). Such rewrit-
                                             L∗

                                     L∗
ings contribute answers cert(x,y Q1 )./ 2=1 cert(y Q2 ))|1 . If x Q̃ doesn’t contain y but
contains a variable z, it was obtained by shrinking directly x Q on y, or by shrink-
ing on y some other shrinking of x Q (on other variables), or it is a rewriting of
                                                                                      −
these. Let L∗x,y Q1 = x,y Q1A [ ◦y Q1B ]. Because there are no axioms S v R , y oc-
                      S∗           D

curs at the same place in all role atoms. Thus, S∗      x,y Q1A contains only atoms of the
form S(x, y) (or S(y, x)), and y Q2 only atoms of the form S(zi , y) (or S(y, zi )).
This is impossible (x Q is strictly tree-shaped), so no shrinking on y is applicable.
    If x Q̃ results by shrinking on y a shrinking x Q0 of x Q on other variables, x Q0
cannot contain variables other than x and y, otherwise we could shrink directly
on y. Thus, x Q0 = (S∗  x,y Q1A [ ◦y Q1B ] ◦ y Q3 )|x , where y Q3 is a result of shrinking
                                    D           D               D

on the other variables. For shrinking on y to be applicable on x Q0 it must be
the case that for all E(y) ∈ body(D        y Q1B ) we have C v E, and for all S(x, y) ∈
body(x,y Q1A ) we have T |= R v S. There can be no S 0 (y, x) ∈ body(S∗
       S∗
                                                                                     x,y Q1A ).
These are necessary for all such E(y), S(x, y) to give their place to A(x) in the
shrinking due to A v ∃R.C. Thus, we must have x {A(x)} ∈ kernel(L∗               x,y Q1 ). But
for all E(y) ∈ body(D   y Q 3 ), i.e.  for  all y {E(y)} ∈   kernel(y Q2 ) we must  also have
C v E, i.e. x {A(x)} ∈ kernel((L∗        Q
                                      x,y 1 y ◦   {E(y)})|x ). Taking  the union for  all such
E and A we get the second part of the desired result.
    If x Q̃ is an unfolding of the above rewritings, its answers are included in the
certain answers of the query from which it has been obtained and we can ignore it.

    We call (cert(L∗x,y Q1 ) ./ 2=1 cert(y Q2 ))|1 the direct answers of x Q, because
they are obtained directly by joining the answers of L∗        x,y Q1 and y Q2 . So, the
theorem tells us that for the extension we have to stick to a strictly tree-shaped
query x Q, compute its direct answers, its kernel (L∗  x,y Q1 ◦ y Q2 )|x (needing for that
only the kernel of y Q2 ), find the answers to the kernel queries, and add them to
the direct answers.
    To implement Formula 3 we still need to compute queries L∗                     L∗
                                                                         x,y Q and x,y Q|x ,
used in the second and first branch of Eq. 4, respectively. For this, we need
first all queries L∗
                  x,y Q that may give rise to non-null queries. According to Th. 2
these are those that have at least one direct answer, and those whose projection
on x has a non-empty kernel. To compute the set of queries with at least one
direct answer, say Z, we can implement Eq. 2, by computing all intersections of
the nodes of the QSG for S∗ with the nodes of the QSG for D. Then, we must
10       Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

update Z with the queries having non-empty kernels. The queries of the form
x,y Q1 ◦ y Q2 that have x {A(x)} in their kernel due to A v ∃R.C, are in the set
R∗       C


    EAv∃R.C
     x,y    = {{S(x, y), D(y)}x,y | A v ∃R.C, T |= R v S and T |= C v D} ∪
              {{S(y, x), D(y)}x,y | A v ∃R− .C, T |= R v S and T |= C v D}

As in the case of D and C, we can obtain the set FAv∃R.Cx,y     of the equivalence
classes of the queries in x,y Q that have the same answers and have x {A(x)} in
                           L

their kernel due to axiom A v ∃R.C, as the node set of the QSG for EAv∃R.C x,y    .
Note that since we are interested in the kernels, we must compute the QSG
including the null equivalence class. Because the kernel does not contribute any
answers (since the head is hx, yi), those of these queries that have at least one
answer should already be represented by an equivalence class in Z. However, that
equivalence class may represent a more general query, with additional conjuncts,
which has therefore an empty or different kernel. Thus, we need to enrich Z with
all the queries in the several sets FAv∃R.C
                                     x,y      . Now, to use Eq. 4, we need also to
know all queries L∗
                  x,y Q|x that may  give rise to non-null queries. Again, these are
those that have at least one direct answer, and those that have a non-empty
kernel. To compute them, we have to take the projections on x of all queries
in Z, and as before, extend this set with the projections of the queries in each
FAv∃R.C
  x,y     on x, by including however also any answers to the kernel.


6     Evaluation
We implemented the QSG computation for DL-Lite−       R KBs in Java, and we ap-
plied it on the DBPedia KB (2016-10 version), which contains 785 concepts,
1,111 properties (in the dbo namespace), 5,150,432 concept and 24,431,982 role
assertions. To test scalability we created several subsets of the KB by keeping
only the concepts and roles (and the relevant assertions) that, viewed as atomic
queries, had certain minimum numbers of answers (ranging from 1 to 106 ), and
tried to compute the QSG for Qh1,ki∗ , k = 1, 2, 3. The number of nodes of the
computed QSGs and computation time, as a function of the number of kept
concepts and roles are shown in Figure 1. Using a machine with 64GB RAM we
were able to compute Qh1,1i∗ and Qh1,2i∗ for the entire DBPedia, but for Qh1,3i∗
memory was enough only for small subsets. We note that the size of the QSG,
after a short superlinear growth at the beginning, grows linearly or sublinearly in
the size of the ontology. Thus, in practice, the combinatorial explosion seems to
occur only when we increase the size of the query but not the size of the ontology.
(In computing rewr(x {C(x)}) we ignored axioms of the form ∃R− v C. Although
theoretically incorrect, it turned out to be more reasonable because in DBPedia
such axioms are used as domain and range constraints; if they participate in the
reasoning process they propagate a large number of inconsistencies.)
    An example path in the graph for Qh1,3i∗ is Q0 = x0 {producer(x0 , x1 )}
(157,591 answers), Q1 = Q0 ◦ x1 {Agent(x1 )} (135,354), Q2 = Q1 ◦ x1 {Person(x1 )}
(119,578), Q3 = Q2 ◦ x0 {Work(x0 )} (119,577), Q4 = Q3 ◦ x1 {occupation(x1 , x2 )}
                                     Inside the Query Space of DL Knowledge Bases             11

                                            ·104                        ·105
                                        4                           6
|G|       400                           3                           4
          200                           2
                                                                    2
                                        1
           0                            0                           0
                0     500    1,000          0       500     1,000       0         500     1,000
                                                                         ·104
                                                                    6
          30                          600
minutes


                                                                    4
          20                          400
          10                          200                           2
           0                            0                           0
                0     500    1,000          0        500    1,000       0         500     1,000
                    |CN| + |RN|                    |CN| + |RN|                  |CN| + |RN|

Fig. 1: QSG size and computation time for Qh1,1i∗ , Qh1,2i∗ , Qh1,3i∗ , respectively.


(81,320), Q5 = Q4 ◦ x1 {Athlete(x1 )} (11), Q6 = Q5 ◦ x2 {PersonFunction(x2 )} (10),
Q7 = Q6 ◦ x0 {Album(x0 ), MusicalWork(x0 )} (1). Each edge refines the query, and
we discover that from the 157,591 producer assertions there is only one producer
that is an athlete and has produced a music album. The graph revealed also un-
balances on the topic coverage. E.g. from the about 1,243,400 persons, 360,941
were athletes, 318,392 athletic team members, and 18,622 soccer managers.

7         Conclusion
We defined a theoretical framework for modeling epistemic equivalence and or-
dering relations between queries of a DL KB, arranging them in a graph, and
discussed how to compute such a graph for DL-Lite−      R KBs. The experimental
results showed that the computation of restricted versions of the QSG might be
possible even for relatively big datasets. Given the complexity and memory needs
of the computation, future work will target improving the prototype implemen-
tation or approximative computation. Efficient computation for more expressive
DL dialects, starting from the full DL-LiteR , should also be investigated. We see
several applications of our framework, from construction of balanced datasets
for machine learning, to semantic clustering and autocomplete suggestions.

8         Acknowledgements
We acknowledge support of this work by the project ‘Apollonis: Greek Infrastruc-
ture for Digital Arts, Humanities and Language Research and Innovation’ (MIS
5002738) which is implemented under the Action ‘Reinforcement of the Research
and Innovation Infrastructure’, funded by the Operational Programme ‘Compet-
itiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed
by Greece and the European Union (European Regional Development Fund).
12      Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou

References
 1. The on-line encyclopedia of integer sequences. Number of unlabeled rooted trees
    with n nodes (or connected functions with a fixed point)., http://oeis.org/
    A000081
 2. Arenas, M., Grau, B.C., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Faceted
    search over RDF-based knowledge graphs. J. Web Semant. 37-38, 55–74 (2016)
 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
    (eds.): The Description Logic Handbook: Theory, Implementation, and Applica-
    tions. Cambridge University Press (2003)
 4. Bienvenu, M., Lutz, C., Wolter, F.: Query containment in description logics recon-
    sidered. In: KR. AAAI Press (2012)
 5. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. Autom. Reasoning 39(3), 385–429 (2007)
 6. Chortaras, A., Trivela, D., Stamou, G.B.: Optimized query rewriting for OWL
    2 QL. In: CADE. Lecture Notes in Computer Science, vol. 6803, pp. 192–206.
    Springer (2011)
 7. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations.
    Springer (1999)
 8. Gottlob, G., Fermüller, C.G.: Removing redundancy from a clause. Artif. Intell.
    61(2), 263–289 (1993)
 9. Kuznetsov, S.O., Obiedkov, S.A.: Comparing performance of algorithms for gen-
    erating concept lattices. J. Exp. Theor. Artif. Intell. 14(2-3), 189–216 (2002)
10. Nienhuys-Cheng, S., de Wolf, R. (eds.): Foundations of Inductive Logic Program-
    ming, Lecture Notes in Computer Science, vol. 1228. Springer (1997)
11. Trivela, D., Stoilos, G., Chortaras, A., Stamou, G.B.: Optimising resolution-based
    rewriting algorithms for OWL ontologies. J. Web Semant. 33, 30–49 (2015)
12. Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Re-
    trieval, and Services, Morgan & Claypool Publishers (2009)