<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inside the Query Space of DL Knowledge Bases</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We structure the query space of DL knowledge bases (KBs) by nding equivalent and partial ordering relations between answer sets of conjunctive queries, and de ning a formal graph structure that allows qualitative and quantitative exploration of the query space. We also discuss how to compute such a graph for restricted DL-LiteR KBs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Purely assertional or ontology-enriched KBs are increasingly being used by
applications that need access to data and knowledge, and are typically accessed
through queries or facet-based interfaces. When interacting with such a KB, the
application usually has no idea about the quality or completeness of its
coverage, e.g. which queries have an `adequate', representative number of su ciently
diverse answers indicating that the KB is a good source for answering such
queries, or which may produce degenerate answers. General-purpose KBs may
in fact have signi cantly unbalanced contents with respect to topic coverage.</p>
      <p>
        To explore this aspect of KBs, in this paper we propose a theoretical
framework for structuring the space of conjunctive queries that can be posed on a DL
KB in a semi-lattice structure, called the query space graph (QSG) using query
answers. The QSG captures signi cant information about the coverage of a
particular KB, and reveals, apart from any partial and equivalent ordering relations
following from the theory (e.g. query containment relations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), any additional,
potentially informing, epistemic such relations. Provided that the QSG satis es
certain properties, it fully represents the underlying query space and encodes
the answers to all queries. Our work draws ideas from Formal Concept
Analysis (FCA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], but the structure we propose can represent arbitrary queries;
in the case of queries consisting only of concept atoms our graph reduces to a
concept lattice. We also link the graph construction with the query generation
and answering phases. Because the size of the query space explodes
combinatorially, computation of all queries is practically infeasible unless their answers are
considered, so as to restrict them to those having at least one answer. In this
context, we propose an algorithm for computing the query space of restricted
DL-LiteR KBs. Our work shares also ideas with faceted search [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], since a path
in the QSG captures conjunctive faceted query navigation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In the rest of the paper, after some preliminaries in Section 2, Section 3
builds the theoretical framework underlying QSGs. Section 4 outlines the general
approach to compute a QSG, and Section 5 specializes it for DL-LiteR. Section 6
presents some results for DBPedia, and Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        Let CN, RN, IN be mutually disjoint, nite sets of concept, role and individual
names, respectively. The triple hCN; RN; INi is a vocabulary V, and a TBox T
(ABox A) over V and a DL dialect L is a set of axioms (assertions) that use
elements of V and constructors of L. The pair hV; Li is a language, and K =
hT ; Ai is a knowledge base (KB) over some language. In part of the paper we
use DL-LiteR as dialect L, a subset of DL-LiteR that excludes axioms of the
form S v R . In particular, it allows only axioms of the form C v D or R v S,
where C; D are concepts and R; S atomic roles. D can be either atomic or of
the form 9R( ):&gt;, and C can also be of the form 9R( ):A, where A is atomic.
The semantics of KBs are de ned in the standard way using interpretations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
An interpretation I = ( I ; I ) assigns a set CI I to concept C, and a set
rI I I to role r. I is a model of an ABox A i aI 2 CI for all C(a) 2 A,
and (aI ; bI ) 2 rI for all r(a; b) 2 A. I is a model of a TBox T if it satis es all
axioms in T . An axiom (or assertion) is entailed by a TBox T (a KB K) if it
is satis ed in every model of T (K), and we write T j= (K j= ).
      </p>
      <p>Given a vocabulary V and a set of variable names VN, a conjunctive query
(simply, a query) Q is an expression f hx1; : : : xki j 9y1 : : : 9yl:(a1 ^ : : : ^ an) g,
where k; l 0, n 1, xi; yi 2 VN, each ai is an atom C(u) or R(u; v), where
k l
C 2 CN, R 2 RN, u; v 2 IN [ fxigi=1 [ fyigi=1 and all xi; yi appear in at least
one atom. The vector x = hx1; : : : xki is the head of Q (head(Q)), its elements
are the answer variables (avar(Q)), and fa1; : : : ; ang is the body of Q (body(Q)).
var(Q) is the set of all variables appearing in Q. Those that appear once only in
the body are unbound ; all others are bound. If avar(Q) = ?, the query is boolean.
In this paper we will talk about non-boolean queries containing only variables
(u; v 2= IN). We will often write Q as xfa1; :::; ang, or xQ, to make explicit the
answer variables. Given a KB K, a query Q with head hx1; : : : xki and a model I
for K, a vector of individuals c = hc1; : : : cki, ci 2 IN, is an answer to Q in I i
k
K; I j= Q(a), where Q(a) = Qfxi=cigi=1. c is a certain answer to Q for K, i it
is an answer in all models I of K. The set of certain answers to Q is cert(Q; K).</p>
      <p>
        A query Q1 subsumes a query Q2 (Q1 S Q2) i there is a substitution
s.t. head(Q1 ) = head(Q2) and body(Q1 ) body(Q2). If Q1, Q2 are mutually
subsumed they are syntactically equivalent (Q1 S Q2). We write Q 2 Q i set Q
contains a query Q0 such that Q0 S Q. Two queries Q1, Q2 are similar if there
are renamings 1, 2 s.t. head(Q1 1) = head(Q2) and head(Q2 2) = head(Q1).
(A substitution = fxi=yigin=1 is a renaming for a query Q if xi 2 var(Q), and
n
fyigi=1 are distinct variables s.t. each yi equals some xj or yi 2= var(Q) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].)
Given Q, let Q0 be a query s.t. head(Q0) = head(Q) and body(Q0) body(Q). If
body(Q0) is a minimal subset of body(Q) s.t. Q S Q0, Q0 is a condensation of
Q (cond(Q)). If the minimal body(Q0) equals body(Q), Q is condensed [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We de ne the following operations: i) The intersection of two similar queries
Q1, Q2 is the query Q1 uQ2 =: cond( head(Q1)fbody(Q1)[body(Q2 )g ) where is
a renaming s.t. head(Q1) = head(Q2 ) and var(Q2 ) \ var(Q1) = avar(Q1), that
renames all non-answer variables of Q2 to variables not appearing in Q1. Clearly,
Q1 u Q2 = Q2 u Q1. We will write xQ1 u xQ2, assuming that var(xQ1) \ var(xQ2)
equals the variables in x. ii) The -concatenation of query Q1 with query Q2,
where is a renaming of a subset of var(Q2) to a subset of var(Q1), is the query
Q1 Q2 =: cond( head(Q1)fbody(Q1) [ body(Q2 )g ) where is a renaming that
renames all variables of Q2 that are not renamed by to variables not appearing
in Q1. Dropping , we will write xQ1 x0 Q2, assuming that var(xQ1) contains all
variables in x0, and var(xQ1) \ var(x0 Q2)) equals the variables in x0. If x = x0,
xQ1 xQ2 = xQ1 u xQ2. iii) The projection of a query xQ on y,:where the
variables in y are a subset of the variables in x, is the query xQjy = yfbody(xQ)g.</p>
      <p>Given a query Q, let gra(Q) (gra(Q)) be the (directed) graph with nodes
var(Q) having the edge fz1; z2g (resp. (z1; z2)) i R(z1; z2) 2 body(Q). Q is
connected i gra(Q) is connected, and (strictly) tree-shaped i head(Q) = hxi
hk;n1:n2i we denote the
and gra(Q) (gra(Q)) is a (directed) tree with root x. By QV
syntactically equivalent-free set of all condensed, connected, non-boolean queries
with k 1 answer variables, head of length k, n variables where k n1 n
n2, constructable over V. We write QhVk;ni for QhVk;k:ni, and QkV for Sj+=11 QhVk;ji.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Structuring the Query Space</title>
      <p>In general, to structure a set of similar queries we need a partial ordering
relation . Using query subsumption we can write Q1 S Q2 i Q2 S Q1, whereas
using query containment we can write Q1 T Q2, i for given a TBox T we have
cert(Q1; hT ; Ai) cert(Q2; hT ; Ai) for any ABox A. These partial ordering
relations capture general properties of the queries that follow from their structure or
the axioms of the underlying TBox. In concrete KBs however, comprising
particular ABoxes, arise more speci c, incidental `strict containment' or `equivalence'
relations. To capture such relations we need a strict partial ordering relation
that takes into account the actual ABox contents.</p>
      <p>De nition 1. Let K = hT ; Ai be a KB over some hV; Li, and Q1; Q2 two
(simk , for some k. Q1 and Q2 are answer equivalent w.r.t. K
ilar) queries in QV
(Q1 K Q2), if cert(Q1; K) = cert(Q2; K). Also, Q1 is answer smaller than Q2
w.r.t. T (Q1 K Q2), if cert(Q1; K) cert(Q2; K).</p>
      <p>The smaller-or-equivalent relation K can then be de ned as usual. If Q1 S
Q2 then Q1 T Q2 for any T ; given T , if Q1 T Q2 then Q1 K Q2 regardless
of the particular A. K and K can be considered as `epistemic-like' equivalence
and strict partial ordering relations, induced by a particular ABox.</p>
      <p>Based on the above, we can de ne an equivalence and strict partial
ordering relation for any Q QkV . An equivalence relation partitions Q to a set
of equivalence classes EC (Q). We call the equivalence class that contains all
queries with zero answers null, and its members null queries. We will say that a
pair ( ; ) on a set Q is a structure if for any Q1; Q2 2 Q s.t. Q1 Q2 we have
Q1 6 Q2 and Q2 6 Q1, and inversely if Q1 Q2 or Q2 Q1, then Q1 6 Q2.
Clearly, ( K; K) is a structure.</p>
      <p>De nition 2. A set Q
Q1 u Q2 2 Q.</p>
      <p>QkV is complete i for any Q1; Q2 2 Q, we have</p>
      <p>The completion of a nite set Q is constructed by iteratively extending Q
with the intersections of its elements until reaching a xpoint. This terminates
because intersections are condensed queries. Since each equivalence class Q~ 2
EC (Q) of a complete set Q contains all intersections of its elements, there is a
unique Q 2 Q~, called the representative of Q~, s.t. Q0 S Q for all Q0 2 Q~. The
representative is the syntactically most `speci c' query of the equivalent class.
De nition 3. Let Q QkV and ( ; ) be a structure. The representative
ordering of Q is the partially ordered set (Q^ ; ), where Q^ is the completion of Q, and
Q^ = SQ~2EC (Q^)fQ 2 Q~ j Q0 S Q for all Q0 2 Q~g (the set of representatives).
Example 1. Consider that Q = fxfA(x)g; xfA(x); B(x)g; xfC(x); D(x)gg and
xfC(x); D(x)g xfA(x)g and xfA(x)g xfA(x); B(x)g. The completion of Q
is Q^ = fxfA(x)g; xfA(x); B(x)g; xfC(x); D(x)g; xfA(x); C(x); D(x)g, xfA(x);
B(x); C(x); D(x)gg. The equivalence classes are fxfA(x)g; xfA(x); B(x)gg and
fxfC(x); D(x)g; xfA(x); C(x); D(x)g; xfA(x); B(x); C(x); D(x)gg, thus Q^ =
fxfA(x); B(x)g; xfA(x); B(x); C(x); D(x)gg with xfA(x); B(x); C(x); D(x)g
xfA(x); B(x)g.</p>
      <p>Lemma 1. Let (Q^ ; ) be the representative ordering of some Q
any Q1; Q2 2 Q^ we have Q1 Q2 i Q2 S Q1.</p>
      <p>Proof. Assume that Q1 Q2 and Q2 6 S Q1. Since Q1; Q2 2 Q^ we have
Q1; Q2 2 Q^, where Q^ is the completion of Q, and so Q1 u Q2 2 Q^. Because
Q1 Q2, Q1 u Q2 must belong to the same equivalence class as Q1. But Q2 6 S
Q1, and so Q1 S Q1 u Q2 6= Q1. Thus, Q1 2= Q^ which is a contradiction.
Inversely, if Q2 S Q1, given that Q2 6 Q1, by de nition we get Q1 Q2.</p>
      <p>We say that a query Q 2 Q^ is minimal (maximal ) w.r.t. to some property if
it satis es the property and there is no Q0 2 Q^ s.t. Q0 Q (Q Q0) satisfying
the same property.</p>
      <p>QkV . For
De nition 4. Let (Q^ ; ) be the representative ordering of some Q QkV , and
let Q 2 QkV . The top (bottom) cover of Q in (Q^ ; ) is the set of all minimal
(maximal) Q0 2 Q^ s.t Q0 S Q (Q S Q0).</p>
      <p>The covers may be empty, e.g. if Q^
= ffxfA(x)gg and Q = xfB(x)g.</p>
      <p>Lemma 2. Let (Q^ ; ) be the representative ordering of some Q QkV , and
let Q 2 QkV . If the top and bottom covers, Q&gt; and Q?, respectively, of Q are
non-empty, then there is a Q~ 2 Q^ s.t. Q0 Q Q~ for all Q0 2 Q?.
Proof. Let Q be the intersection of the queries in Q&gt;. By de nition, Q S Q,
and hence Q Q. By completeness there is a unique Q~ 2 Q s.t. Q~ Q, and so
Q Q~. Finally, by de nition, for all Q0 2 Q? we have Q S Q0, hence Q0 Q.</p>
      <p>This allows us to use (Q^ ; ) to bound the answers to any query Q 2 QV
k
even if Q 2= Q, provided that Q&gt;, Q? exist. If Q 2 Q, we have a stronger result.
Lemma 3. Let (Q^ ; ) be the representative ordering of some Q
bottom cover Q? of any Q 2 Q is a singleton.</p>
      <p>Proof. Since Q 2 Q, there is a unique Q~ 2 EC (Q^), s.t. Q 2 Q~. The
representative Q0 is s.t. Q0 Q and Q S Q0. This is clearly maximal and is in Q?. For
any other Q00 2 Q^ s.t. Q S Q00, we have Q00 Q0, hence it cannot be in Q?.</p>
      <p>This unique element of Q? is the anchor of Q in (Q^ ; ), and the
representative ordering of Q fully describes Q: any query Q 2 Q is represented in (Q^ ; )
by its unique anchor, and the answers to Q are the answers to its anchor.
QkV . The
Example 2. Let xfA(x)g xfS(x; y)g xfS(x; y); R(y; z); B(z)g and Q consist
of these queries. Then Q^ = ffxfA(x)gg; fxfA(x); S(x; y)gg; fxfA(x); S(x; y);
R(y; z); B(z)gg with the obvious ordering. Let Q be xfS(x; y); R(y; z)g. We have
that Q 2= Q, and the bottom cover of Q is fxfA(x); S(x; y); R(y; z); B(z)g. Note
that xfS(x; y); R(y; z); B(z)g 2 Q has the same bottom cover.</p>
      <p>This shows that `anchors' can exist also for queries that do not belong in Q.
Thus, only if we know that a query belongs to Q we can be sure that the answers
of the anchor are also exactly its answers.</p>
      <p>If for the representative ordering (Q^ ; ) we de ne the meet operation ^ so
that Q1 ^ Q2 is the maximal query Q 2 Q^ s.t. Q1 u Q2 S Q, then (Q^ ; )
becomes a meet semi-lattice. Given a nite Q, Q^ is nite, thus the semi-lattice
has a bottom. To capture the minimum needed partial ordering relations for a
representative ordering we use a directed acyclic graph.</p>
      <p>De nition 5. Let (Q^ ; ) be the representative ordering of some Q QkV . The
query space graph (QSG) of Q is the transitive reduction of the directed acyclic
graph induced by (Q^ ; ).</p>
      <p>Because (Q^ ; ) is a nite semi-lattice, the QSG is nite with a bottom, but
may have more than one roots. The QSG is shortcut-free, i.e. if there is an edge
(Q1; Q2) there is no path from Q1 to Q2 in G other than (Q1; Q2).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Query Space Graph Computation</title>
      <p>Now, we will describe how, given a language hV; Li and a KB K, we can compute
the QSG for the ( K; K) structure and the set QhV1;ki of tree-shaped queries
with one answer variable and up to k variables. To do this, we need to perform
two tasks: compute the set of queries for which to build the QSG and then
arrange them into an actual QSG. We will start from the latter.</p>
      <p>
        A given set of queries Q can be arranged into a graph, by rst computing the
equivalence classes (the nodes of the graph) and then adding the edges induced
by the partial ordering. Such a graph will not be a QSG unless Q is complete. If
not, all intersections of the queries in Q have to be iteratively computed. This
can be done by exploiting the already produced graph, and let it guide the query
intersection process in a way similar to FCA algorithms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
(1)
(2)
      </p>
      <p>
        To compute the actual queries, we can compute rst all possible tree shapes,
the so-called rooted trees (e.g. there are 1, 1, 2, 4, 9, 20, 48, 115, 286, 719 such
trees with 1 : : : 10 nodes, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). We represent a rooted tree with n &gt; 0 nodes as
a set of edges (i; j) (i is the parent of j), where i; j are integers representing
nodes. We assume that if j is a descendent of i, then j &gt; i, so that the root is
the smallest of the integers being used (usually 0). The tree T = ? consists only
of a root. To compute all rooted trees with up to t nodes we can start e.g. from
tree f(0; 1)g, and iteratively connect a new node to each existing node, taking
care to discard equivalent representations of the same rooted trees (re ections).
      </p>
      <p>Let D (C) be the set of all non syntactically equivalent queries of the form
xfCi(x)gik=1, Ci 2 CN, k 1 (k = 1), i.e. the queries consisting of concept atoms.
Clearly, D = QhV1;1i and jDj = PjiC=N1j jCiNj = 2jCNj 1. For a query Q known
to be in D (C), we use the adornment notation xDQ (xCQ). Similarly, let S (R)
be the set of all non syntactically equivalent queries of the form x;yfRi(xi)gik=1,
Ri 2 RN, k 1 (k = 1), and xi = (x; y) or (y; x), i.e. the queries consisting of
role atoms. jQSj = PjiR=N1j 2i jRiNj = 3jRNj 1. For strictly tree-shaped queries,
where all xis must be the same, we denote the respective sets by S and R . For
a query Q known to be in S (R), we use the notation Sx;yQ (xR;yQ), and similarly
for the starred versions. D (S) can be obtained as the node set of the QSG of
the completion of C (R). These sets are the material to put on the edges and
nodes of a rooted tree to get tree-shaped queries. The tree T = ? gives rise to
the queries xD0 Q; any tree with nodes f0; : : : ; kg, k 1, gives rise to the queries
[ xD0 Q ] (: : : ((Sx^1;x1 Q1[ xD1 Q~1])
: : :)
(x^k;xk Qk[ xDk Q~k])
S
jx0
where x^i = xparent(i). The optional parts in square brackets specify concepts in
the tree nodes. Due to the recurrence of Sx^i;xi Q [ xDi Q~], we de ne the frame link
L = S [</p>
      <p>Sx;yQ1</p>
      <p>yDQ2 j Sx;yQ1 2 S; yDQ2 2 D
i.e the set of queries with two variables, some roles connecting them and possibly
some concepts on the second variable. By xL;yQ we will denote a query in L. We
de ne similarly (i.e. by replacing S with S in Eq. 2) L and xL;yQ .</p>
      <p>Given the above, a simple approach to compute the queries induced by a
rooted tree T is to compute L (or L ), create a copy of it for each edge,
appropriately renaming variables, and take all possible combinations between frame
link elements. An alternative, backwards iterative approach is based on the
observation that, if the rooted trees of sizes up to k 1 are available, a rooted tree
of size k can be directly constructed from one or more such trees. In particular,
a query with shape a rooted tree of size k can be constructed either by a one
step extension of an existing query with shape a rooted tree S of size k 1 (by
renaming variables, adding a new root and connecting the new to the old root)
or by merging two or more existing queries with shape trees Si (by renaming
variables and joining them on their roots). In the rst case, if yQ is the set of
all queries with shape S, its extension is obtained by the unary operator
x3 yQ = (xL;yQ</p>
      <p>yQ)jx j xL;yQ 2 L and yQ 2 yQ
In the second case, let xQi be the set of all queries with shape Si. Merging is an
n-ary operator , for n 1:
xQ1
xQ2
: : :</p>
      <p>xQn = fxQ1 u xQ2 u : : : u xQn j xQi 2 xQig
Given the above, the set of all queries represented by Formula 1 are
(3)
(4)
QT = [f xD0 Q 2 D g
i.e. the optional merging of the queries including only concept atoms on the root
variable, with the set x0 Q~ of all queries with the shape of T . These are obtained
for i = 0 from xi Q~ = Nj2children(i) xi Q^ij which expresses the merging of the
queries represented by the several branches of the tree assuming the root is node
i. Each branch is either a set of queries in L, if the branch has length 1, or the
extension of the set of queries represented by the rst child in the branch:
xi Q^ij =</p>
      <p>xi 3 xj Q~
( n xLi;xj Qjxi j xLi;xj Q 2 L o if children(j) = ?
otherwise</p>
      <p>Since there are in nite T s, ST QT is in nite. To guarantee a nite set of
queries, we must set additional criteria, such as compute only queries with up to
k variables. Such a set will not be complete. A QSG on a non-complete set will
fully describe only the queries in the set (they will have unique anchors), but this
approximation can in practice be useful. Finally, to arrange the queries in a QSG,
we need their answers. If we can answer arbitrary queries, we could compute rst
the queries and then evaluate them. However, the number of queries explodes
combinatorially, yet most of them will have zero answers. Thus, assuming that
we are not interested in null queries, we can try to combine query generation
with answering to avoid generating and evaluating many of the null queries.
5</p>
      <p>
        Building a QSG for a DL-LiteR Knowledge Base
We will now discuss how we can construct and answer tree-shaped queries for
normalized DL-LiteR KBs. We are interested only in non-null queries. Our
approach will be based on the rewriting approach to query answering, which is
particularly applicable to the DL-Lite family [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: given a query Q and a DL-Lite
KB hT ; Ai, there is a set of queries rewr(Q), called the rewritings of Q, such
that cert(Q; hT ; Ai) = SQ02rewr(Q;T ) cert(Q0; h?; Ai) i.e. the consequences of T
are incorporated in the several Q0s, which can then be answered ignoring T (for
brevity, we will omit the KB arguments from cert( ) and rewr( )). Rewritings are
`transitive': A rewriting of a rewriting of Q is a rewriting of Q. It follows that if
Q0 2 rewr(Q) and Q00 2 rewr(Q0) then cert(Q00) cert(Q0) cert(Q).
      </p>
      <p>We focus on DL-LiteR KBs, because the absence of axioms R v S
guarantees that the edges of a tree-shaped query do not change direction during
rewriting. In DL-LiteR, the rewriting set of an atomic query is a set (union) of
atomic queries: rewr(xfC(x)g) = f xfA(x)g j T j= A v Cg [ xfR(x; y)g j T j=
9R v Cg [ xfR(y; x)g j T j= 9R v Cg g, rewr(xfR(x; y)g) = f xfB(x)g j T j=
B v 9Rg [ xfS(x; y)g j T j= 9S v 9Rg [ xfS(y; x)g j T j= 9S v 9Rg g, and
rewr(x;yfR(x; y)g) = f x;yfS(x; y)g j T j= S v Rg.</p>
      <p>
        An unfolding of query Q is a rewriting obtained by substituting each atom
in body(Q) by one of its rewritings, according to the above, taking care to
appropriately name newly introduced variables. An unfolding of Q (before
condensation) retains all bound variables of Q. The set of all unfoldings of Q
is unfold(Q). Apart from unfoldings, Q may have also shrinking rewritings.
Shrinkings arise due to axioms of the form A v 9R:C. Given such an
axiom, we have xfA(x)g 2 rewr(fR(x; y); C(y)gx) (in fact fA(x)gx is also in
rewr(fS(x; y); D(y)gx) if T j= R v S and T j= C v D). Query xfA(x)g is
a shrinking of xfR(x; y); C(y)g and is an element of shrinky(Q) because it
contains all variables of Q apart from bound variable y, which has been eliminated
by reasoning using an axiom of the form A v 9R:C. Let
shrink(Q) = fQg [
0
It can be proved [
        <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
        ] that rewr(Q) = SQ02shrink(Q) unfold(Q0). If var(Q) =
avar(Q), then rewr(Q) = unfold(Q0), since answer variables cannot be eliminated.
      </p>
      <p>Applying iteratively shrinking on a query using Eq. 5, we may end up with
a query containing only concept atoms on the answer variable. If all of them
have a common unfolding, by unfolding the query we will end up with a query
consisting of a single concept atom.</p>
      <p>De nition 6. The kernel of a tree-shaped query xQ is the set
kernel(xQ) = f xCQ0 2 rewr(xQ) j there is no xCQ00 6 S xCQ0 s.t. xCQ0 2 rewr(xCQ00) g
Theorem 1. If xQ1; : : : ; xQn are tree-shaped queries, then
n
\ cert(xQi)
i=1</p>
      <p>[
cert(xQ1 u : : : u xQn) =
and
kernel(xQ1 u : : : u xQn) =</p>
      <p>: : :
xQ012kernel(xQ1)
xQ0n2kernel(xQn)
Proof. Because all xQis share only the answer variable x, their intersection (and
the intersections of their rewritings) has no shrinking on x. Thus, the rewritings
of xQ1 u : : : u xQn are all combinations of the rewritings of each xQi: rewr(xQ1 u
: : : u xQn) = SxQ012rewr(xQ1) : : : SxQ0n2rewr(xQn) xQ01 u : : : u xQ0n. Because these
rewritings share only variable x we get the desired results.</p>
      <p>The above allows us to compute cert(xQ1 u : : : u xQn) and kernel(xQ1 u : : : u
xQn) given the certain answers and kernels of the merged queries. Because a
merging may be used to construct further queries, it should be kept either if it
has a non-empty kernel (which can cause queries constructed as extensions of it
be non-null), or it is itself non-null.</p>
      <p>kernel(xQ01 u : : : ux Q0n)
Theorem 2. Let xQ = (xL;yQ1 yQ2)jx), where xL;yQ1 2 L, and yQ2 is a strictly
tree-shaped query with no concept atoms on y. Then
cert(xQ) = cert(xL;yQ1) ./ 2=1 cert(yQ2) j1 [
[
CxQ~2kernel(xQ)
cert(xCQ~)
where kernel(xQ) = SCyQ~22kernel(yQ2) kernel((xL;yQ1 yCQ~2)jx), ./i=j denotes
joining of the left- and right-hand side tuples on the i-th and j-th element respectively,
and ji denotes projection on the i-th element of the tuples.</p>
      <p>Proof. If a rewriting xQ~ of xQ contains y, no shrinking has been applied on y, so
xQ~ = (x;yQ~1 yQ~2)jx, x;yQ~1 2 rewr(xL;yQ1) and yQ~2 2 rewr(yQ2). Such
rewritings contribute answers cert(xL;yQ1)./ 2=1cert(yQ2))j1. If xQ~ doesn't contain y but
contains a variable z, it was obtained by shrinking directly xQ on y, or by
shrinking on y some other shrinking of xQ (on other variables), or it is a rewriting of
these. Let xL;yQ1 = Sx;yQ1A [ yDQ1B]. Because there are no axioms S v R , y
occurs at the same place in all role atoms. Thus, Sx;yQ1A contains only atoms of the
form S(x; y) (or S(y; x)), and yQ2 only atoms of the form S(zi; y) (or S(y; zi)).
This is impossible (xQ is strictly tree-shaped), so no shrinking on y is applicable.</p>
      <p>If xQ~ results by shrinking on y a shrinking xQ0 of xQ on other variables, xQ0
cannot contain variables other than x and y, otherwise we could shrink directly
on y. Thus, xQ0 = (x;yQ1A [ yDQ1B] yDQ3)jx, where yDQ3 is a result of shrinking</p>
      <p>S
on the other variables. For shrinking on y to be applicable on xQ0 it must be
the case that for all E(y) 2 body(yDQ1B) we have C v E, and for all S(x; y) 2
body(Sx;yQ1A) we have T j= R v S. There can be no S0(y; x) 2 body(Sx;yQ1A).
These are necessary for all such E(y); S(x; y) to give their place to A(x) in the
shrinking due to A v 9R:C. Thus, we must have xfA(x)g 2 kernel(xL;yQ1). But
for all E(y) 2 body(yDQ3), i.e. for all yfE(y)g 2 kernel(yQ2) we must also have
C v E, i.e. xfA(x)g 2 kernel((xL;yQ1 yfE(y)g)jx). Taking the union for all such
E and A we get the second part of the desired result.</p>
      <p>If xQ~ is an unfolding of the above rewritings, its answers are included in the
certain answers of the query from which it has been obtained and we can ignore it.</p>
      <p>We call (cert(xL;yQ1) ./ 2=1 cert(yQ2))j1 the direct answers of xQ, because
they are obtained directly by joining the answers of xL;yQ1 and yQ2. So, the
theorem tells us that for the extension we have to stick to a strictly tree-shaped
query xQ, compute its direct answers, its kernel (xL;yQ1 yQ2)jx (needing for that
only the kernel of yQ2), nd the answers to the kernel queries, and add them to
the direct answers.</p>
      <p>To implement Formula 3 we still need to compute queries xL;yQ and xL;yQjx,
used in the second and rst branch of Eq. 4, respectively. For this, we need
rst all queries xL;yQ that may give rise to non-null queries. According to Th. 2
these are those that have at least one direct answer, and those whose projection
on x has a non-empty kernel. To compute the set of queries with at least one
direct answer, say Z, we can implement Eq. 2, by computing all intersections of
the nodes of the QSG for S with the nodes of the QSG for D. Then, we must
update Z with the queries having non-empty kernels. The queries of the form
x;yQ1 yCQ2 that have xfA(x)g in their kernel due to A v 9R:C, are in the set
R</p>
      <p>ExA;vy9R:C = ffS(x; y); D(y)gx;y j A v 9R:C, T j= R v S and T j= C v Dg [
ffS(y; x); D(y)gx;y j A v 9R :C, T j= R v S and T j= C v Dg
As in the case of D and C, we can obtain the set FxA;vy9R:C of the equivalence
classes of the queries in xL;yQ that have the same answers and have xfA(x)g in
their kernel due to axiom A v 9R:C, as the node set of the QSG for ExA;vy9R:C .
Note that since we are interested in the kernels, we must compute the QSG
including the null equivalence class. Because the kernel does not contribute any
answers (since the head is hx; yi), those of these queries that have at least one
answer should already be represented by an equivalence class in Z. However, that
equivalence class may represent a more general query, with additional conjuncts,
which has therefore an empty or di erent kernel. Thus, we need to enrich Z with
all the queries in the several sets FxA;vy9R:C . Now, to use Eq. 4, we need also to
know all queries xL;yQjx that may give rise to non-null queries. Again, these are
those that have at least one direct answer, and those that have a non-empty
kernel. To compute them, we have to take the projections on x of all queries
in Z, and as before, extend this set with the projections of the queries in each
FxA;vy9R:C on x, by including however also any answers to the kernel.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>We implemented the QSG computation for DL-LiteR KBs in Java, and we
applied it on the DBPedia KB (2016-10 version), which contains 785 concepts,
1,111 properties (in the dbo namespace), 5,150,432 concept and 24,431,982 role
assertions. To test scalability we created several subsets of the KB by keeping
only the concepts and roles (and the relevant assertions) that, viewed as atomic
queries, had certain minimum numbers of answers (ranging from 1 to 106), and
tried to compute the QSG for Qh1;ki , k = 1; 2; 3. The number of nodes of the
computed QSGs and computation time, as a function of the number of kept
concepts and roles are shown in Figure 1. Using a machine with 64GB RAM we
were able to compute Qh1;1i and Qh1;2i for the entire DBPedia, but for Qh1;3i
memory was enough only for small subsets. We note that the size of the QSG,
after a short superlinear growth at the beginning, grows linearly or sublinearly in
the size of the ontology. Thus, in practice, the combinatorial explosion seems to
occur only when we increase the size of the query but not the size of the ontology.
(In computing rewr(xfC(x)g) we ignored axioms of the form 9R v C. Although
theoretically incorrect, it turned out to be more reasonable because in DBPedia
such axioms are used as domain and range constraints; if they participate in the
reasoning process they propagate a large number of inconsistencies.)</p>
      <p>An example path in the graph for Qh1;3i is Q0 = x0 fproducer(x0; x1)g
(157,591 answers), Q1 = Q0 x1 fAgent(x1)g (135,354), Q2 = Q1 x1 fPerson(x1)g
(119,578), Q3 = Q2 x0 fWork(x0)g (119,577), Q4 = Q3 x1 foccupation(x1; x2)g
500
1;000
500</p>
      <p>1;000
104
105
104
6
4
2
0
6
4
2
0
0
0
500</p>
      <p>1;000
500</p>
      <p>1;000
jCNj + jRNj
400
(81,320), Q5 = Q4 x1 fAthlete(x1)g (11), Q6 = Q5 x2 fPersonFunction(x2)g (10),
Q7 = Q6 x0 fAlbum(x0); MusicalWork(x0)g (1). Each edge re nes the query, and
we discover that from the 157,591 producer assertions there is only one producer
that is an athlete and has produced a music album. The graph revealed also
unbalances on the topic coverage. E.g. from the about 1,243,400 persons, 360,941
were athletes, 318,392 athletic team members, and 18,622 soccer managers.
7</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We de ned a theoretical framework for modeling epistemic equivalence and
ordering relations between queries of a DL KB, arranging them in a graph, and
discussed how to compute such a graph for DL-LiteR KBs. The experimental
results showed that the computation of restricted versions of the QSG might be
possible even for relatively big datasets. Given the complexity and memory needs
of the computation, future work will target improving the prototype
implementation or approximative computation. E cient computation for more expressive
DL dialects, starting from the full DL-LiteR, should also be investigated. We see
several applications of our framework, from construction of balanced datasets
for machine learning, to semantic clustering and autocomplete suggestions.
8</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We acknowledge support of this work by the project `Apollonis: Greek
Infrastructure for Digital Arts, Humanities and Language Research and Innovation' (MIS
5002738) which is implemented under the Action `Reinforcement of the Research
and Innovation Infrastructure', funded by the Operational Programme
`Competitiveness, Entrepreneurship and Innovation' (NSRF 2014-2020) and co- nanced
by Greece and the European Union (European Regional Development Fund).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>1. The on-line encyclopedia of integer sequences. Number of unlabeled rooted trees with n nodes (or connected functions with a xed point)</article-title>
          ., http://oeis.org/ A000081
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kharlamov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marciuska</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheleznyakov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Faceted search over RDF-based knowledge graphs</article-title>
          .
          <source>J. Web Semant</source>
          .
          <fpage>37</fpage>
          -
          <issue>38</issue>
          ,
          <issue>55</issue>
          {
          <fpage>74</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F</given-names>
          </string-name>
          . (eds.):
          <article-title>The Description Logic Handbook: Theory, Implementation, and Applications</article-title>
          . Cambridge University Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bienvenu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Query containment in description logics reconsidered</article-title>
          .
          <source>In: KR</source>
          . AAAI Press (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tractable reasoning and e cient query answering in description logics: The DL-Lite family</article-title>
          .
          <source>J. Autom. Reasoning</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <volume>385</volume>
          {
          <fpage>429</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chortaras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trivela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamou</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          :
          <article-title>Optimized query rewriting for OWL 2 QL</article-title>
          . In
          <source>: CADE. Lecture Notes in Computer Science</source>
          , vol.
          <volume>6803</volume>
          , pp.
          <volume>192</volume>
          {
          <fpage>206</fpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal concept analysis - mathematical foundations</source>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gottlob</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Fermuller, C.G.:
          <article-title>Removing redundancy from a clause</article-title>
          .
          <source>Artif. Intell</source>
          .
          <volume>61</volume>
          (
          <issue>2</issue>
          ),
          <volume>263</volume>
          {
          <fpage>289</fpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Obiedkov</surname>
            ,
            <given-names>S.A.:</given-names>
          </string-name>
          <article-title>Comparing performance of algorithms for generating concept lattices</article-title>
          .
          <source>J. Exp. Theor. Artif. Intell</source>
          .
          <volume>14</volume>
          (
          <issue>2-3</issue>
          ),
          <volume>189</volume>
          {
          <fpage>216</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Nienhuys-Cheng, S., de Wolf, R. (eds.):
          <source>Foundations of Inductive Logic Programming, Lecture Notes in Computer Science</source>
          , vol.
          <volume>1228</volume>
          . Springer (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Trivela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoilos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chortaras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamou</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          :
          <article-title>Optimising resolution-based rewriting algorithms for OWL ontologies</article-title>
          .
          <source>J. Web Semant</source>
          .
          <volume>33</volume>
          ,
          <issue>30</issue>
          {
          <fpage>49</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tunkelang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : Faceted Search.
          <source>Synthesis Lectures on Information Concepts</source>
          , Retrieval, and Services, Morgan &amp; Claypool Publishers (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>