Introduction

Inside the Query Space of DL Knowledge Bases

0 National Technical University of Athens , Greece

We structure the query space of DL knowledge bases (KBs) by nding equivalent and partial ordering relations between answer sets of conjunctive queries, and de ning a formal graph structure that allows qualitative and quantitative exploration of the query space. We also discuss how to compute such a graph for restricted DL-LiteR KBs.

Introduction

Purely assertional or ontology-enriched KBs are increasingly being used by applications that need access to data and knowledge, and are typically accessed through queries or facet-based interfaces. When interacting with such a KB, the application usually has no idea about the quality or completeness of its coverage, e.g. which queries have an `adequate', representative number of su ciently diverse answers indicating that the KB is a good source for answering such queries, or which may produce degenerate answers. General-purpose KBs may in fact have signi cantly unbalanced contents with respect to topic coverage.

To explore this aspect of KBs, in this paper we propose a theoretical framework for structuring the space of conjunctive queries that can be posed on a DL KB in a semi-lattice structure, called the query space graph (QSG) using query answers. The QSG captures signi cant information about the coverage of a particular KB, and reveals, apart from any partial and equivalent ordering relations following from the theory (e.g. query containment relations [ 4 ]), any additional, potentially informing, epistemic such relations. Provided that the QSG satis es certain properties, it fully represents the underlying query space and encodes the answers to all queries. Our work draws ideas from Formal Concept Analysis (FCA) [ 7 ], but the structure we propose can represent arbitrary queries; in the case of queries consisting only of concept atoms our graph reduces to a concept lattice. We also link the graph construction with the query generation and answering phases. Because the size of the query space explodes combinatorially, computation of all queries is practically infeasible unless their answers are considered, so as to restrict them to those having at least one answer. In this context, we propose an algorithm for computing the query space of restricted DL-LiteR KBs. Our work shares also ideas with faceted search [ 12 ], since a path in the QSG captures conjunctive faceted query navigation [ 2 ].

In the rest of the paper, after some preliminaries in Section 2, Section 3 builds the theoretical framework underlying QSGs. Section 4 outlines the general approach to compute a QSG, and Section 5 specializes it for DL-LiteR. Section 6 presents some results for DBPedia, and Section 7 concludes the paper.

Preliminaries

Let CN, RN, IN be mutually disjoint, nite sets of concept, role and individual names, respectively. The triple hCN; RN; INi is a vocabulary V, and a TBox T (ABox A) over V and a DL dialect L is a set of axioms (assertions) that use elements of V and constructors of L. The pair hV; Li is a language, and K = hT ; Ai is a knowledge base (KB) over some language. In part of the paper we use DL-LiteR as dialect L, a subset of DL-LiteR that excludes axioms of the form S v R . In particular, it allows only axioms of the form C v D or R v S, where C; D are concepts and R; S atomic roles. D can be either atomic or of the form 9R( ):>, and C can also be of the form 9R( ):A, where A is atomic. The semantics of KBs are de ned in the standard way using interpretations [ 3 ]. An interpretation I = ( I ; I ) assigns a set CI I to concept C, and a set rI I I to role r. I is a model of an ABox A i aI 2 CI for all C(a) 2 A, and (aI ; bI ) 2 rI for all r(a; b) 2 A. I is a model of a TBox T if it satis es all axioms in T . An axiom (or assertion) is entailed by a TBox T (a KB K) if it is satis ed in every model of T (K), and we write T j= (K j= ).

Given a vocabulary V and a set of variable names VN, a conjunctive query (simply, a query) Q is an expression f hx1; : : : xki j 9y1 : : : 9yl:(a1 ^ : : : ^ an) g, where k; l 0, n 1, xi; yi 2 VN, each ai is an atom C(u) or R(u; v), where k l C 2 CN, R 2 RN, u; v 2 IN [ fxigi=1 [ fyigi=1 and all xi; yi appear in at least one atom. The vector x = hx1; : : : xki is the head of Q (head(Q)), its elements are the answer variables (avar(Q)), and fa1; : : : ; ang is the body of Q (body(Q)). var(Q) is the set of all variables appearing in Q. Those that appear once only in the body are unbound ; all others are bound. If avar(Q) = ?, the query is boolean. In this paper we will talk about non-boolean queries containing only variables (u; v 2= IN). We will often write Q as xfa1; :::; ang, or xQ, to make explicit the answer variables. Given a KB K, a query Q with head hx1; : : : xki and a model I for K, a vector of individuals c = hc1; : : : cki, ci 2 IN, is an answer to Q in I i k K; I j= Q(a), where Q(a) = Qfxi=cigi=1. c is a certain answer to Q for K, i it is an answer in all models I of K. The set of certain answers to Q is cert(Q; K).

A query Q1 subsumes a query Q2 (Q1 S Q2) i there is a substitution s.t. head(Q1 ) = head(Q2) and body(Q1 ) body(Q2). If Q1, Q2 are mutually subsumed they are syntactically equivalent (Q1 S Q2). We write Q 2 Q i set Q contains a query Q0 such that Q0 S Q. Two queries Q1, Q2 are similar if there are renamings 1, 2 s.t. head(Q1 1) = head(Q2) and head(Q2 2) = head(Q1). (A substitution = fxi=yigin=1 is a renaming for a query Q if xi 2 var(Q), and n fyigi=1 are distinct variables s.t. each yi equals some xj or yi 2= var(Q) [ 10 ].) Given Q, let Q0 be a query s.t. head(Q0) = head(Q) and body(Q0) body(Q). If body(Q0) is a minimal subset of body(Q) s.t. Q S Q0, Q0 is a condensation of Q (cond(Q)). If the minimal body(Q0) equals body(Q), Q is condensed [ 8 ].

We de ne the following operations: i) The intersection of two similar queries Q1, Q2 is the query Q1 uQ2 =: cond( head(Q1)fbody(Q1)[body(Q2 )g ) where is a renaming s.t. head(Q1) = head(Q2 ) and var(Q2 ) \ var(Q1) = avar(Q1), that renames all non-answer variables of Q2 to variables not appearing in Q1. Clearly, Q1 u Q2 = Q2 u Q1. We will write xQ1 u xQ2, assuming that var(xQ1) \ var(xQ2) equals the variables in x. ii) The -concatenation of query Q1 with query Q2, where is a renaming of a subset of var(Q2) to a subset of var(Q1), is the query Q1 Q2 =: cond( head(Q1)fbody(Q1) [ body(Q2 )g ) where is a renaming that renames all variables of Q2 that are not renamed by to variables not appearing in Q1. Dropping , we will write xQ1 x0 Q2, assuming that var(xQ1) contains all variables in x0, and var(xQ1) \ var(x0 Q2)) equals the variables in x0. If x = x0, xQ1 xQ2 = xQ1 u xQ2. iii) The projection of a query xQ on y,:where the variables in y are a subset of the variables in x, is the query xQjy = yfbody(xQ)g.

Given a query Q, let gra(Q) (gra(Q)) be the (directed) graph with nodes var(Q) having the edge fz1; z2g (resp. (z1; z2)) i R(z1; z2) 2 body(Q). Q is connected i gra(Q) is connected, and (strictly) tree-shaped i head(Q) = hxi hk;n1:n2i we denote the and gra(Q) (gra(Q)) is a (directed) tree with root x. By QV syntactically equivalent-free set of all condensed, connected, non-boolean queries with k 1 answer variables, head of length k, n variables where k n1 n n2, constructable over V. We write QhVk;ni for QhVk;k:ni, and QkV for Sj+=11 QhVk;ji. 3

Structuring the Query Space

In general, to structure a set of similar queries we need a partial ordering relation . Using query subsumption we can write Q1 S Q2 i Q2 S Q1, whereas using query containment we can write Q1 T Q2, i for given a TBox T we have cert(Q1; hT ; Ai) cert(Q2; hT ; Ai) for any ABox A. These partial ordering relations capture general properties of the queries that follow from their structure or the axioms of the underlying TBox. In concrete KBs however, comprising particular ABoxes, arise more speci c, incidental `strict containment' or `equivalence' relations. To capture such relations we need a strict partial ordering relation that takes into account the actual ABox contents.

De nition 1. Let K = hT ; Ai be a KB over some hV; Li, and Q1; Q2 two (simk , for some k. Q1 and Q2 are answer equivalent w.r.t. K ilar) queries in QV (Q1 K Q2), if cert(Q1; K) = cert(Q2; K). Also, Q1 is answer smaller than Q2 w.r.t. T (Q1 K Q2), if cert(Q1; K) cert(Q2; K).

The smaller-or-equivalent relation K can then be de ned as usual. If Q1 S Q2 then Q1 T Q2 for any T ; given T , if Q1 T Q2 then Q1 K Q2 regardless of the particular A. K and K can be considered as `epistemic-like' equivalence and strict partial ordering relations, induced by a particular ABox.

Based on the above, we can de ne an equivalence and strict partial ordering relation for any Q QkV . An equivalence relation partitions Q to a set of equivalence classes EC (Q). We call the equivalence class that contains all queries with zero answers null, and its members null queries. We will say that a pair ( ; ) on a set Q is a structure if for any Q1; Q2 2 Q s.t. Q1 Q2 we have Q1 6 Q2 and Q2 6 Q1, and inversely if Q1 Q2 or Q2 Q1, then Q1 6 Q2. Clearly, ( K; K) is a structure.

De nition 2. A set Q Q1 u Q2 2 Q.

QkV is complete i for any Q1; Q2 2 Q, we have

The completion of a nite set Q is constructed by iteratively extending Q with the intersections of its elements until reaching a xpoint. This terminates because intersections are condensed queries. Since each equivalence class Q~ 2 EC (Q) of a complete set Q contains all intersections of its elements, there is a unique Q 2 Q~, called the representative of Q~, s.t. Q0 S Q for all Q0 2 Q~. The representative is the syntactically most `speci c' query of the equivalent class. De nition 3. Let Q QkV and ( ; ) be a structure. The representative ordering of Q is the partially ordered set (Q^ ; ), where Q^ is the completion of Q, and Q^ = SQ~2EC (Q^)fQ 2 Q~ j Q0 S Q for all Q0 2 Q~g (the set of representatives). Example 1. Consider that Q = fxfA(x)g; xfA(x); B(x)g; xfC(x); D(x)gg and xfC(x); D(x)g xfA(x)g and xfA(x)g xfA(x); B(x)g. The completion of Q is Q^ = fxfA(x)g; xfA(x); B(x)g; xfC(x); D(x)g; xfA(x); C(x); D(x)g, xfA(x); B(x); C(x); D(x)gg. The equivalence classes are fxfA(x)g; xfA(x); B(x)gg and fxfC(x); D(x)g; xfA(x); C(x); D(x)g; xfA(x); B(x); C(x); D(x)gg, thus Q^ = fxfA(x); B(x)g; xfA(x); B(x); C(x); D(x)gg with xfA(x); B(x); C(x); D(x)g xfA(x); B(x)g.

Lemma 1. Let (Q^ ; ) be the representative ordering of some Q any Q1; Q2 2 Q^ we have Q1 Q2 i Q2 S Q1.

Proof. Assume that Q1 Q2 and Q2 6 S Q1. Since Q1; Q2 2 Q^ we have Q1; Q2 2 Q^, where Q^ is the completion of Q, and so Q1 u Q2 2 Q^. Because Q1 Q2, Q1 u Q2 must belong to the same equivalence class as Q1. But Q2 6 S Q1, and so Q1 S Q1 u Q2 6= Q1. Thus, Q1 2= Q^ which is a contradiction. Inversely, if Q2 S Q1, given that Q2 6 Q1, by de nition we get Q1 Q2.

We say that a query Q 2 Q^ is minimal (maximal ) w.r.t. to some property if it satis es the property and there is no Q0 2 Q^ s.t. Q0 Q (Q Q0) satisfying the same property.

QkV . For De nition 4. Let (Q^ ; ) be the representative ordering of some Q QkV , and let Q 2 QkV . The top (bottom) cover of Q in (Q^ ; ) is the set of all minimal (maximal) Q0 2 Q^ s.t Q0 S Q (Q S Q0).

The covers may be empty, e.g. if Q^ = ffxfA(x)gg and Q = xfB(x)g.

Lemma 2. Let (Q^ ; ) be the representative ordering of some Q QkV , and let Q 2 QkV . If the top and bottom covers, Q> and Q?, respectively, of Q are non-empty, then there is a Q~ 2 Q^ s.t. Q0 Q Q~ for all Q0 2 Q?. Proof. Let Q be the intersection of the queries in Q>. By de nition, Q S Q, and hence Q Q. By completeness there is a unique Q~ 2 Q s.t. Q~ Q, and so Q Q~. Finally, by de nition, for all Q0 2 Q? we have Q S Q0, hence Q0 Q.

This allows us to use (Q^ ; ) to bound the answers to any query Q 2 QV k even if Q 2= Q, provided that Q>, Q? exist. If Q 2 Q, we have a stronger result. Lemma 3. Let (Q^ ; ) be the representative ordering of some Q bottom cover Q? of any Q 2 Q is a singleton.

Proof. Since Q 2 Q, there is a unique Q~ 2 EC (Q^), s.t. Q 2 Q~. The representative Q0 is s.t. Q0 Q and Q S Q0. This is clearly maximal and is in Q?. For any other Q00 2 Q^ s.t. Q S Q00, we have Q00 Q0, hence it cannot be in Q?.

This unique element of Q? is the anchor of Q in (Q^ ; ), and the representative ordering of Q fully describes Q: any query Q 2 Q is represented in (Q^ ; ) by its unique anchor, and the answers to Q are the answers to its anchor. QkV . The Example 2. Let xfA(x)g xfS(x; y)g xfS(x; y); R(y; z); B(z)g and Q consist of these queries. Then Q^ = ffxfA(x)gg; fxfA(x); S(x; y)gg; fxfA(x); S(x; y); R(y; z); B(z)gg with the obvious ordering. Let Q be xfS(x; y); R(y; z)g. We have that Q 2= Q, and the bottom cover of Q is fxfA(x); S(x; y); R(y; z); B(z)g. Note that xfS(x; y); R(y; z); B(z)g 2 Q has the same bottom cover.

This shows that `anchors' can exist also for queries that do not belong in Q. Thus, only if we know that a query belongs to Q we can be sure that the answers of the anchor are also exactly its answers.

If for the representative ordering (Q^ ; ) we de ne the meet operation ^ so that Q1 ^ Q2 is the maximal query Q 2 Q^ s.t. Q1 u Q2 S Q, then (Q^ ; ) becomes a meet semi-lattice. Given a nite Q, Q^ is nite, thus the semi-lattice has a bottom. To capture the minimum needed partial ordering relations for a representative ordering we use a directed acyclic graph.

De nition 5. Let (Q^ ; ) be the representative ordering of some Q QkV . The query space graph (QSG) of Q is the transitive reduction of the directed acyclic graph induced by (Q^ ; ).

Because (Q^ ; ) is a nite semi-lattice, the QSG is nite with a bottom, but may have more than one roots. The QSG is shortcut-free, i.e. if there is an edge (Q1; Q2) there is no path from Q1 to Q2 in G other than (Q1; Q2). 4

Query Space Graph Computation

Now, we will describe how, given a language hV; Li and a KB K, we can compute the QSG for the ( K; K) structure and the set QhV1;ki of tree-shaped queries with one answer variable and up to k variables. To do this, we need to perform two tasks: compute the set of queries for which to build the QSG and then arrange them into an actual QSG. We will start from the latter.

A given set of queries Q can be arranged into a graph, by rst computing the equivalence classes (the nodes of the graph) and then adding the edges induced by the partial ordering. Such a graph will not be a QSG unless Q is complete. If not, all intersections of the queries in Q have to be iteratively computed. This can be done by exploiting the already produced graph, and let it guide the query intersection process in a way similar to FCA algorithms [ 9 ]. (1) (2)

To compute the actual queries, we can compute rst all possible tree shapes, the so-called rooted trees (e.g. there are 1, 1, 2, 4, 9, 20, 48, 115, 286, 719 such trees with 1 : : : 10 nodes, [ 1 ]). We represent a rooted tree with n > 0 nodes as a set of edges (i; j) (i is the parent of j), where i; j are integers representing nodes. We assume that if j is a descendent of i, then j > i, so that the root is the smallest of the integers being used (usually 0). The tree T = ? consists only of a root. To compute all rooted trees with up to t nodes we can start e.g. from tree f(0; 1)g, and iteratively connect a new node to each existing node, taking care to discard equivalent representations of the same rooted trees (re ections).

Let D (C) be the set of all non syntactically equivalent queries of the form xfCi(x)gik=1, Ci 2 CN, k 1 (k = 1), i.e. the queries consisting of concept atoms. Clearly, D = QhV1;1i and jDj = PjiC=N1j jCiNj = 2jCNj 1. For a query Q known to be in D (C), we use the adornment notation xDQ (xCQ). Similarly, let S (R) be the set of all non syntactically equivalent queries of the form x;yfRi(xi)gik=1, Ri 2 RN, k 1 (k = 1), and xi = (x; y) or (y; x), i.e. the queries consisting of role atoms. jQSj = PjiR=N1j 2i jRiNj = 3jRNj 1. For strictly tree-shaped queries, where all xis must be the same, we denote the respective sets by S and R . For a query Q known to be in S (R), we use the notation Sx;yQ (xR;yQ), and similarly for the starred versions. D (S) can be obtained as the node set of the QSG of the completion of C (R). These sets are the material to put on the edges and nodes of a rooted tree to get tree-shaped queries. The tree T = ? gives rise to the queries xD0 Q; any tree with nodes f0; : : : ; kg, k 1, gives rise to the queries [ xD0 Q ] (: : : ((Sx^1;x1 Q1[ xD1 Q~1]) : : :) (x^k;xk Qk[ xDk Q~k]) S jx0 where x^i = xparent(i). The optional parts in square brackets specify concepts in the tree nodes. Due to the recurrence of Sx^i;xi Q [ xDi Q~], we de ne the frame link L = S [

Sx;yQ1

yDQ2 j Sx;yQ1 2 S; yDQ2 2 D i.e the set of queries with two variables, some roles connecting them and possibly some concepts on the second variable. By xL;yQ we will denote a query in L. We de ne similarly (i.e. by replacing S with S in Eq. 2) L and xL;yQ .

Given the above, a simple approach to compute the queries induced by a rooted tree T is to compute L (or L ), create a copy of it for each edge, appropriately renaming variables, and take all possible combinations between frame link elements. An alternative, backwards iterative approach is based on the observation that, if the rooted trees of sizes up to k 1 are available, a rooted tree of size k can be directly constructed from one or more such trees. In particular, a query with shape a rooted tree of size k can be constructed either by a one step extension of an existing query with shape a rooted tree S of size k 1 (by renaming variables, adding a new root and connecting the new to the old root) or by merging two or more existing queries with shape trees Si (by renaming variables and joining them on their roots). In the rst case, if yQ is the set of all queries with shape S, its extension is obtained by the unary operator x3 yQ = (xL;yQ

yQ)jx j xL;yQ 2 L and yQ 2 yQ In the second case, let xQi be the set of all queries with shape Si. Merging is an n-ary operator , for n 1: xQ1 xQ2 : : :

xQn = fxQ1 u xQ2 u : : : u xQn j xQi 2 xQig Given the above, the set of all queries represented by Formula 1 are (3) (4) QT = [f xD0 Q 2 D g i.e. the optional merging of the queries including only concept atoms on the root variable, with the set x0 Q~ of all queries with the shape of T . These are obtained for i = 0 from xi Q~ = Nj2children(i) xi Q^ij which expresses the merging of the queries represented by the several branches of the tree assuming the root is node i. Each branch is either a set of queries in L, if the branch has length 1, or the extension of the set of queries represented by the rst child in the branch: xi Q^ij =

xi 3 xj Q~ ( n xLi;xj Qjxi j xLi;xj Q 2 L o if children(j) = ? otherwise

Since there are in nite T s, ST QT is in nite. To guarantee a nite set of queries, we must set additional criteria, such as compute only queries with up to k variables. Such a set will not be complete. A QSG on a non-complete set will fully describe only the queries in the set (they will have unique anchors), but this approximation can in practice be useful. Finally, to arrange the queries in a QSG, we need their answers. If we can answer arbitrary queries, we could compute rst the queries and then evaluate them. However, the number of queries explodes combinatorially, yet most of them will have zero answers. Thus, assuming that we are not interested in null queries, we can try to combine query generation with answering to avoid generating and evaluating many of the null queries. 5

Building a QSG for a DL-LiteR Knowledge Base We will now discuss how we can construct and answer tree-shaped queries for normalized DL-LiteR KBs. We are interested only in non-null queries. Our approach will be based on the rewriting approach to query answering, which is particularly applicable to the DL-Lite family [ 5 ]: given a query Q and a DL-Lite KB hT ; Ai, there is a set of queries rewr(Q), called the rewritings of Q, such that cert(Q; hT ; Ai) = SQ02rewr(Q;T ) cert(Q0; h?; Ai) i.e. the consequences of T are incorporated in the several Q0s, which can then be answered ignoring T (for brevity, we will omit the KB arguments from cert( ) and rewr( )). Rewritings are `transitive': A rewriting of a rewriting of Q is a rewriting of Q. It follows that if Q0 2 rewr(Q) and Q00 2 rewr(Q0) then cert(Q00) cert(Q0) cert(Q).

We focus on DL-LiteR KBs, because the absence of axioms R v S guarantees that the edges of a tree-shaped query do not change direction during rewriting. In DL-LiteR, the rewriting set of an atomic query is a set (union) of atomic queries: rewr(xfC(x)g) = f xfA(x)g j T j= A v Cg [ xfR(x; y)g j T j= 9R v Cg [ xfR(y; x)g j T j= 9R v Cg g, rewr(xfR(x; y)g) = f xfB(x)g j T j= B v 9Rg [ xfS(x; y)g j T j= 9S v 9Rg [ xfS(y; x)g j T j= 9S v 9Rg g, and rewr(x;yfR(x; y)g) = f x;yfS(x; y)g j T j= S v Rg.

An unfolding of query Q is a rewriting obtained by substituting each atom in body(Q) by one of its rewritings, according to the above, taking care to appropriately name newly introduced variables. An unfolding of Q (before condensation) retains all bound variables of Q. The set of all unfoldings of Q is unfold(Q). Apart from unfoldings, Q may have also shrinking rewritings. Shrinkings arise due to axioms of the form A v 9R:C. Given such an axiom, we have xfA(x)g 2 rewr(fR(x; y); C(y)gx) (in fact fA(x)gx is also in rewr(fS(x; y); D(y)gx) if T j= R v S and T j= C v D). Query xfA(x)g is a shrinking of xfR(x; y); C(y)g and is an element of shrinky(Q) because it contains all variables of Q apart from bound variable y, which has been eliminated by reasoning using an axiom of the form A v 9R:C. Let shrink(Q) = fQg [ 0 It can be proved [ 6, 11 ] that rewr(Q) = SQ02shrink(Q) unfold(Q0). If var(Q) = avar(Q), then rewr(Q) = unfold(Q0), since answer variables cannot be eliminated.

Applying iteratively shrinking on a query using Eq. 5, we may end up with a query containing only concept atoms on the answer variable. If all of them have a common unfolding, by unfolding the query we will end up with a query consisting of a single concept atom.

De nition 6. The kernel of a tree-shaped query xQ is the set kernel(xQ) = f xCQ0 2 rewr(xQ) j there is no xCQ00 6 S xCQ0 s.t. xCQ0 2 rewr(xCQ00) g Theorem 1. If xQ1; : : : ; xQn are tree-shaped queries, then n \ cert(xQi) i=1

[ cert(xQ1 u : : : u xQn) = and kernel(xQ1 u : : : u xQn) =

: : : xQ012kernel(xQ1) xQ0n2kernel(xQn) Proof. Because all xQis share only the answer variable x, their intersection (and the intersections of their rewritings) has no shrinking on x. Thus, the rewritings of xQ1 u : : : u xQn are all combinations of the rewritings of each xQi: rewr(xQ1 u : : : u xQn) = SxQ012rewr(xQ1) : : : SxQ0n2rewr(xQn) xQ01 u : : : u xQ0n. Because these rewritings share only variable x we get the desired results.

The above allows us to compute cert(xQ1 u : : : u xQn) and kernel(xQ1 u : : : u xQn) given the certain answers and kernels of the merged queries. Because a merging may be used to construct further queries, it should be kept either if it has a non-empty kernel (which can cause queries constructed as extensions of it be non-null), or it is itself non-null.

kernel(xQ01 u : : : ux Q0n) Theorem 2. Let xQ = (xL;yQ1 yQ2)jx), where xL;yQ1 2 L, and yQ2 is a strictly tree-shaped query with no concept atoms on y. Then cert(xQ) = cert(xL;yQ1) ./ 2=1 cert(yQ2) j1 [ [ CxQ~2kernel(xQ) cert(xCQ~) where kernel(xQ) = SCyQ~22kernel(yQ2) kernel((xL;yQ1 yCQ~2)jx), ./i=j denotes joining of the left- and right-hand side tuples on the i-th and j-th element respectively, and ji denotes projection on the i-th element of the tuples.

Proof. If a rewriting xQ~ of xQ contains y, no shrinking has been applied on y, so xQ~ = (x;yQ~1 yQ~2)jx, x;yQ~1 2 rewr(xL;yQ1) and yQ~2 2 rewr(yQ2). Such rewritings contribute answers cert(xL;yQ1)./ 2=1cert(yQ2))j1. If xQ~ doesn't contain y but contains a variable z, it was obtained by shrinking directly xQ on y, or by shrinking on y some other shrinking of xQ (on other variables), or it is a rewriting of these. Let xL;yQ1 = Sx;yQ1A [ yDQ1B]. Because there are no axioms S v R , y occurs at the same place in all role atoms. Thus, Sx;yQ1A contains only atoms of the form S(x; y) (or S(y; x)), and yQ2 only atoms of the form S(zi; y) (or S(y; zi)). This is impossible (xQ is strictly tree-shaped), so no shrinking on y is applicable.

If xQ~ results by shrinking on y a shrinking xQ0 of xQ on other variables, xQ0 cannot contain variables other than x and y, otherwise we could shrink directly on y. Thus, xQ0 = (x;yQ1A [ yDQ1B] yDQ3)jx, where yDQ3 is a result of shrinking

S on the other variables. For shrinking on y to be applicable on xQ0 it must be the case that for all E(y) 2 body(yDQ1B) we have C v E, and for all S(x; y) 2 body(Sx;yQ1A) we have T j= R v S. There can be no S0(y; x) 2 body(Sx;yQ1A). These are necessary for all such E(y); S(x; y) to give their place to A(x) in the shrinking due to A v 9R:C. Thus, we must have xfA(x)g 2 kernel(xL;yQ1). But for all E(y) 2 body(yDQ3), i.e. for all yfE(y)g 2 kernel(yQ2) we must also have C v E, i.e. xfA(x)g 2 kernel((xL;yQ1 yfE(y)g)jx). Taking the union for all such E and A we get the second part of the desired result.

If xQ~ is an unfolding of the above rewritings, its answers are included in the certain answers of the query from which it has been obtained and we can ignore it.

We call (cert(xL;yQ1) ./ 2=1 cert(yQ2))j1 the direct answers of xQ, because they are obtained directly by joining the answers of xL;yQ1 and yQ2. So, the theorem tells us that for the extension we have to stick to a strictly tree-shaped query xQ, compute its direct answers, its kernel (xL;yQ1 yQ2)jx (needing for that only the kernel of yQ2), nd the answers to the kernel queries, and add them to the direct answers.

To implement Formula 3 we still need to compute queries xL;yQ and xL;yQjx, used in the second and rst branch of Eq. 4, respectively. For this, we need rst all queries xL;yQ that may give rise to non-null queries. According to Th. 2 these are those that have at least one direct answer, and those whose projection on x has a non-empty kernel. To compute the set of queries with at least one direct answer, say Z, we can implement Eq. 2, by computing all intersections of the nodes of the QSG for S with the nodes of the QSG for D. Then, we must update Z with the queries having non-empty kernels. The queries of the form x;yQ1 yCQ2 that have xfA(x)g in their kernel due to A v 9R:C, are in the set R

ExA;vy9R:C = ffS(x; y); D(y)gx;y j A v 9R:C, T j= R v S and T j= C v Dg [ ffS(y; x); D(y)gx;y j A v 9R :C, T j= R v S and T j= C v Dg As in the case of D and C, we can obtain the set FxA;vy9R:C of the equivalence classes of the queries in xL;yQ that have the same answers and have xfA(x)g in their kernel due to axiom A v 9R:C, as the node set of the QSG for ExA;vy9R:C . Note that since we are interested in the kernels, we must compute the QSG including the null equivalence class. Because the kernel does not contribute any answers (since the head is hx; yi), those of these queries that have at least one answer should already be represented by an equivalence class in Z. However, that equivalence class may represent a more general query, with additional conjuncts, which has therefore an empty or di erent kernel. Thus, we need to enrich Z with all the queries in the several sets FxA;vy9R:C . Now, to use Eq. 4, we need also to know all queries xL;yQjx that may give rise to non-null queries. Again, these are those that have at least one direct answer, and those that have a non-empty kernel. To compute them, we have to take the projections on x of all queries in Z, and as before, extend this set with the projections of the queries in each FxA;vy9R:C on x, by including however also any answers to the kernel. 6

Evaluation

We implemented the QSG computation for DL-LiteR KBs in Java, and we applied it on the DBPedia KB (2016-10 version), which contains 785 concepts, 1,111 properties (in the dbo namespace), 5,150,432 concept and 24,431,982 role assertions. To test scalability we created several subsets of the KB by keeping only the concepts and roles (and the relevant assertions) that, viewed as atomic queries, had certain minimum numbers of answers (ranging from 1 to 106), and tried to compute the QSG for Qh1;ki , k = 1; 2; 3. The number of nodes of the computed QSGs and computation time, as a function of the number of kept concepts and roles are shown in Figure 1. Using a machine with 64GB RAM we were able to compute Qh1;1i and Qh1;2i for the entire DBPedia, but for Qh1;3i memory was enough only for small subsets. We note that the size of the QSG, after a short superlinear growth at the beginning, grows linearly or sublinearly in the size of the ontology. Thus, in practice, the combinatorial explosion seems to occur only when we increase the size of the query but not the size of the ontology. (In computing rewr(xfC(x)g) we ignored axioms of the form 9R v C. Although theoretically incorrect, it turned out to be more reasonable because in DBPedia such axioms are used as domain and range constraints; if they participate in the reasoning process they propagate a large number of inconsistencies.)

An example path in the graph for Qh1;3i is Q0 = x0 fproducer(x0; x1)g (157,591 answers), Q1 = Q0 x1 fAgent(x1)g (135,354), Q2 = Q1 x1 fPerson(x1)g (119,578), Q3 = Q2 x0 fWork(x0)g (119,577), Q4 = Q3 x1 foccupation(x1; x2)g 500 1;000 500

1;000 104 105 104 6 4 2 0 6 4 2 0 0 0 500

1;000 500

1;000 jCNj + jRNj 400 (81,320), Q5 = Q4 x1 fAthlete(x1)g (11), Q6 = Q5 x2 fPersonFunction(x2)g (10), Q7 = Q6 x0 fAlbum(x0); MusicalWork(x0)g (1). Each edge re nes the query, and we discover that from the 157,591 producer assertions there is only one producer that is an athlete and has produced a music album. The graph revealed also unbalances on the topic coverage. E.g. from the about 1,243,400 persons, 360,941 were athletes, 318,392 athletic team members, and 18,622 soccer managers. 7

Conclusion

We de ned a theoretical framework for modeling epistemic equivalence and ordering relations between queries of a DL KB, arranging them in a graph, and discussed how to compute such a graph for DL-LiteR KBs. The experimental results showed that the computation of restricted versions of the QSG might be possible even for relatively big datasets. Given the complexity and memory needs of the computation, future work will target improving the prototype implementation or approximative computation. E cient computation for more expressive DL dialects, starting from the full DL-LiteR, should also be investigated. We see several applications of our framework, from construction of balanced datasets for machine learning, to semantic clustering and autocomplete suggestions. 8

Acknowledgements

We acknowledge support of this work by the project `Apollonis: Greek Infrastructure for Digital Arts, Humanities and Language Research and Innovation' (MIS 5002738) which is implemented under the Action `Reinforcement of the Research and Innovation Infrastructure', funded by the Operational Programme `Competitiveness, Entrepreneurship and Innovation' (NSRF 2014-2020) and co- nanced by Greece and the European Union (European Regional Development Fund).

1. The on-line encyclopedia of integer sequences. Number of unlabeled rooted trees with n nodes (or connected functions with a xed point) ., http://oeis.org/ A000081

2. Arenas , M. , Grau , B.C. , Kharlamov , E. , Marciuska , S. , Zheleznyakov , D. : Faceted search over RDF-based knowledge graphs . J. Web Semant . 37 - 38 , 55 { 74 ( 2016 )

3. Baader , F. , Calvanese , D. , McGuinness , D.L. , Nardi , D. , Patel-Schneider , P.F . (eds.): The Description Logic Handbook: Theory, Implementation, and Applications . Cambridge University Press ( 2003 )

4. Bienvenu , M. , Lutz , C. , Wolter , F. : Query containment in description logics reconsidered . In: KR . AAAI Press ( 2012 )

5. Calvanese , D. , De Giacomo , G. , Lembo , D. , Lenzerini , M. , Rosati , R. : Tractable reasoning and e cient query answering in description logics: The DL-Lite family . J. Autom. Reasoning 39 ( 3 ), 385 { 429 ( 2007 )

6. Chortaras , A. , Trivela , D. , Stamou , G.B. : Optimized query rewriting for OWL 2 QL . In : CADE. Lecture Notes in Computer Science , vol. 6803 , pp. 192 { 206 . Springer ( 2011 )

7. Ganter , B. , Wille , R.: Formal concept analysis - mathematical foundations . Springer ( 1999 )

8. Gottlob , G. , Fermuller, C.G.: Removing redundancy from a clause . Artif. Intell . 61 ( 2 ), 263 { 289 ( 1993 )

9. Kuznetsov , S.O. , Obiedkov , S.A.: Comparing performance of algorithms for generating concept lattices . J. Exp. Theor. Artif. Intell . 14 ( 2-3 ), 189 { 216 ( 2002 )

10. Nienhuys-Cheng, S., de Wolf, R. (eds.): Foundations of Inductive Logic Programming, Lecture Notes in Computer Science , vol. 1228 . Springer ( 1997 )

11. Trivela , D. , Stoilos , G. , Chortaras , A. , Stamou , G.B. : Optimising resolution-based rewriting algorithms for OWL ontologies . J. Web Semant . 33 , 30 { 49 ( 2015 )

12. Tunkelang , D. : Faceted Search. Synthesis Lectures on Information Concepts , Retrieval, and Services, Morgan & Claypool Publishers ( 2009 )