Inside the Query Space of DL Knowledge Bases Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou National Technical University of Athens, Greece achort@cs.ntua.gr, mikegiatzi@image.ntua.gr, gstam@cs.ntua.gr Abstract. We structure the query space of DL knowledge bases (KBs) by finding equivalent and partial ordering relations between answer sets of conjunctive queries, and defining a formal graph structure that allows qualitative and quantitative exploration of the query space. We also dis- cuss how to compute such a graph for restricted DL-LiteR KBs. 1 Introduction Purely assertional or ontology-enriched KBs are increasingly being used by ap- plications that need access to data and knowledge, and are typically accessed through queries or facet-based interfaces. When interacting with such a KB, the application usually has no idea about the quality or completeness of its cover- age, e.g. which queries have an ‘adequate’, representative number of sufficiently diverse answers indicating that the KB is a good source for answering such queries, or which may produce degenerate answers. General-purpose KBs may in fact have significantly unbalanced contents with respect to topic coverage. To explore this aspect of KBs, in this paper we propose a theoretical frame- work for structuring the space of conjunctive queries that can be posed on a DL KB in a semi-lattice structure, called the query space graph (QSG) using query answers. The QSG captures significant information about the coverage of a par- ticular KB, and reveals, apart from any partial and equivalent ordering relations following from the theory (e.g. query containment relations [4]), any additional, potentially informing, epistemic such relations. Provided that the QSG satisfies certain properties, it fully represents the underlying query space and encodes the answers to all queries. Our work draws ideas from Formal Concept Anal- ysis (FCA) [7], but the structure we propose can represent arbitrary queries; in the case of queries consisting only of concept atoms our graph reduces to a concept lattice. We also link the graph construction with the query generation and answering phases. Because the size of the query space explodes combinato- rially, computation of all queries is practically infeasible unless their answers are considered, so as to restrict them to those having at least one answer. In this context, we propose an algorithm for computing the query space of restricted DL-LiteR KBs. Our work shares also ideas with faceted search [12], since a path in the QSG captures conjunctive faceted query navigation [2]. In the rest of the paper, after some preliminaries in Section 2, Section 3 builds the theoretical framework underlying QSGs. Section 4 outlines the general approach to compute a QSG, and Section 5 specializes it for DL-Lite− R . Section 6 presents some results for DBPedia, and Section 7 concludes the paper. 2 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou 2 Preliminaries Let CN, RN, IN be mutually disjoint, finite sets of concept, role and individual names, respectively. The triple hCN, RN, INi is a vocabulary V, and a TBox T (ABox A) over V and a DL dialect L is a set of axioms (assertions) that use elements of V and constructors of L. The pair hV, Li is a language, and K = hT , Ai is a knowledge base (KB) over some language. In part of the paper we use DL-Lite− R as dialect L, a subset of DL-LiteR that excludes axioms of the form S v R− . In particular, it allows only axioms of the form C v D or R v S, where C, D are concepts and R, S atomic roles. D can be either atomic or of the form ∃R(−) .>, and C can also be of the form ∃R(−) .A, where A is atomic. The semantics of KBs are defined in the standard way using interpretations [3]. An interpretation I = (∆I , ·I ) assigns a set C I ⊆ ∆I to concept C, and a set rI ⊆ ∆I × ∆I to role r. I is a model of an ABox A iff aI ∈ C I for all C(a) ∈ A, and (aI , bI ) ∈ rI for all r(a, b) ∈ A. I is a model of a TBox T if it satisfies all axioms in T . An axiom (or assertion) τ is entailed by a TBox T (a KB K) if it is satisfied in every model of T (K), and we write T |= τ (K |= τ ). Given a vocabulary V and a set of variable names VN, a conjunctive query (simply, a query) Q is an expression { hx1 , . . . xk i | ∃y1 . . . ∃yl .(a1 ∧ . . . ∧ an ) }, where k, l ≥ 0, n ≥ 1, xi , yi ∈ VN, each ai is an atom C(u) or R(u, v), where C ∈ CN, R ∈ RN, u, v ∈ IN ∪ {xi }ki=1 ∪ {yi }li=1 and all xi , yi appear in at least one atom. The vector x = hx1 , . . . xk i is the head of Q (head(Q)), its elements are the answer variables (avar(Q)), and {a1 , . . . , an } is the body of Q (body(Q)). var(Q) is the set of all variables appearing in Q. Those that appear once only in the body are unbound ; all others are bound. If avar(Q) = ∅, the query is boolean. In this paper we will talk about non-boolean queries containing only variables (u, v ∈/ IN). We will often write Q as x {a1 , ..., an }, or x Q, to make explicit the answer variables. Given a KB K, a query Q with head hx1 , . . . xk i and a model I for K, a vector of individuals c = hc1 , . . . ck i, ci ∈ IN, is an answer to Q in I iff K, I |= Q(a), where Q(a) = Q{xi /ci }ki=1 . c is a certain answer to Q for K, iff it is an answer in all models I of K. The set of certain answers to Q is cert(Q, K). A query Q1 subsumes a query Q2 (Q1 ⊆S Q2 ) iff there is a substitution θ s.t. head(Q1 θ) = head(Q2 ) and body(Q1 θ) ⊆ body(Q2 ). If Q1 , Q2 are mutually subsumed they are syntactically equivalent (Q1 ≡S Q2 ). We write Q ∈ Q iff set Q contains a query Q0 such that Q0 ≡S Q. Two queries Q1 , Q2 are similar if there are renamings σ1 , σ2 s.t. head(Q1 σ1 ) = head(Q2 ) and head(Q2 σ2 ) = head(Q1 ). (A substitution θ = {xi /yi }ni=1 is a renaming for a query Q if xi ∈ var(Q), and {yi }ni=1 are distinct variables s.t. each yi equals some xj or yi ∈ / var(Q) [10].) Given Q, let Q0 be a query s.t. head(Q0 ) = head(Q) and body(Q0 ) ⊆ body(Q). If body(Q0 ) is a minimal subset of body(Q) s.t. Q ⊆S Q0 , Q0 is a condensation of Q (cond(Q)). If the minimal body(Q0 ) equals body(Q), Q is condensed [8]. We define the following operations: i) The intersection of two similar queries . Q1 , Q2 is the query Q1 uQ2 = cond( head(Q1 ) {body(Q1 )∪body(Q2 σ)} ) where σ is a renaming s.t. head(Q1 ) = head(Q2 σ) and var(Q2 σ) ∩ var(Q1 ) = avar(Q1 ), that renames all non-answer variables of Q2 to variables not appearing in Q1 . Clearly, Q1 u Q2 = Q2 u Q1 . We will write x Q1 u x Q2 , assuming that var(x Q1 ) ∩ var(x Q2 ) Inside the Query Space of DL Knowledge Bases 3 equals the variables in x. ii) The τ -concatenation of query Q1 with query Q2 , where τ is a renaming of a subset of var(Q2 ) to a subset of var(Q1 ), is the query . Q1 ◦τ Q2 = cond( head(Q1 ) {body(Q1 ) ∪ body(Q2 στ )} ) where σ is a renaming that renames all variables of Q2 that are not renamed by τ to variables not appearing in Q1 . Dropping τ , we will write x Q1 ◦ x0 Q2 , assuming that var(x Q1 ) contains all variables in x0 , and var(x Q1 ) ∩ var(x0 Q2 )) equals the variables in x0 . If x = x0 , x Q1 ◦ x Q2 = x Q1 u x Q2 . iii) The projection of a query x Q on y, where the vari- . ables in y are a subset of the variables in x, is the query x Q|y = y {body(x Q)}. ¯ Given a query Q, let gra(Q) (gra(Q)) be the (directed) graph with nodes var(Q) having the edge {z1 , z2 } (resp. (z1 , z2 )) iff R(z1 , z2 ) ∈ body(Q). Q is connected iff gra(Q) is connected, and (strictly) tree-shaped iff head(Q) = hxi hk,n :n i ¯ and gra(Q) (gra(Q)) is a (directed) tree with root x. By QV 1 2 we denote the syntactically equivalent-free set of all condensed, connected, non-boolean queries with k ≥ 1 answer variables, head of length k, n variables where k ≤ n1 ≤ n ≤ hk,ni hk,k:ni S+∞ hk,ji n2 , constructable over V. We write QV for QV , and QkV for j=1 QV . 3 Structuring the Query Space In general, to structure a set of similar queries we need a partial ordering rela- tion . Using query subsumption we can write Q1 S Q2 iff Q2 ⊆S Q1 , whereas using query containment we can write Q1 T Q2 , iff for given a TBox T we have cert(Q1 , hT , Ai) ⊆ cert(Q2 , hT , Ai) for any ABox A. These partial ordering rela- tions capture general properties of the queries that follow from their structure or the axioms of the underlying TBox. In concrete KBs however, comprising partic- ular ABoxes, arise more specific, incidental ‘strict containment’ or ‘equivalence’ relations. To capture such relations we need a strict partial ordering relation that takes into account the actual ABox contents. Definition 1. Let K = hT , Ai be a KB over some hV, Li, and Q1 , Q2 two (sim- ilar) queries in QkV , for some k. Q1 and Q2 are answer equivalent w.r.t. K (Q1 ≡K Q2 ), if cert(Q1 , K) = cert(Q2 , K). Also, Q1 is answer smaller than Q2 w.r.t. T (Q1 ≺K Q2 ), if cert(Q1 , K) ⊂ cert(Q2 , K). The smaller-or-equivalent relation K can then be defined as usual. If Q1 S Q2 then Q1 T Q2 for any T ; given T , if Q1 T Q2 then Q1 K Q2 regardless of the particular A. ≡K and ≺K can be considered as ‘epistemic-like’ equivalence and strict partial ordering relations, induced by a particular ABox. Based on the above, we can define an equivalence and strict partial order- ing relation for any Q ⊆ QkV . An equivalence relation ≡ partitions Q to a set of equivalence classes EC≡ (Q). We call the equivalence class that contains all queries with zero answers null, and its members null queries. We will say that a pair (≡, ≺) on a set Q is a structure if for any Q1 , Q2 ∈ Q s.t. Q1 ≡ Q2 we have Q1 6≺ Q2 and Q2 6≺ Q1 , and inversely if Q1 ≺ Q2 or Q2 ≺ Q1 , then Q1 6≡ Q2 . Clearly, (≡K , ≺K ) is a structure. Definition 2. A set Q ⊆ QkV is complete iff for any Q1 , Q2 ∈ Q, we have Q1 u Q2 ∈ Q. 4 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou The completion of a finite set Q is constructed by iteratively extending Q with the intersections of its elements until reaching a fixpoint. This terminates because intersections are condensed queries. Since each equivalence class Q̃ ∈ EC≡ (Q) of a complete set Q contains all intersections of its elements, there is a unique Q ∈ Q̃, called the representative of Q̃, s.t. Q0 ⊆S Q for all Q0 ∈ Q̃. The representative is the syntactically most ‘specific’ query of the equivalent class. Definition 3. Let Q ⊆ QkV and (≡, ≺) be a structure. The representative order- ing of Q S is the partially ordered set (Q̂≡ , ≺), where Q̂ is the completion of Q, and Q̂≡ = Q̃∈EC≡ (Q̂) {Q ∈ Q̃ | Q0 ⊆S Q for all Q0 ∈ Q̃} (the set of representatives). Example 1. Consider that Q = {x {A(x)}, x {A(x), B(x)}, x {C(x), D(x)}} and x {C(x), D(x)} ≺ x {A(x)} and x {A(x)} ≡ x {A(x), B(x)}. The completion of Q is Q̂ = {x {A(x)}, x {A(x), B(x)}, x {C(x), D(x)}, x {A(x), C(x), D(x)}, x {A(x), B(x), C(x), D(x)}}. The equivalence classes are {x {A(x)}, x {A(x), B(x)}} and {x {C(x), D(x)}, x {A(x), C(x), D(x)}, x {A(x), B(x), C(x), D(x)}}, thus Q̂≡ = {x {A(x), B(x)}, x {A(x), B(x), C(x), D(x)}} with x {A(x), B(x), C(x), D(x)} ≺ x {A(x), B(x)}. Lemma 1. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . For any Q1 , Q2 ∈ Q̂≡ we have Q1 ≺ Q2 iff Q2 ⊂S Q1 . Proof. Assume that Q1 ≺ Q2 and Q2 6⊂S Q1 . Since Q1 , Q2 ∈ Q̂≡ we have Q1 , Q2 ∈ Q̂, where Q̂ is the completion of Q, and so Q1 u Q2 ∈ Q̂. Because Q1 ≺ Q2 , Q1 u Q2 must belong to the same equivalence class as Q1 . But Q2 6⊂S Q1 , and so Q1 ⊂S Q1 u Q2 6= Q1 . Thus, Q1 ∈ / Q̂≡ which is a contradiction. Inversely, if Q2 ⊂S Q1 , given that Q2 6≡ Q1 , by definition we get Q1 ≺ Q2 . We say that a query Q ∈ Q̂≡ is minimal (maximal ) w.r.t. to some property if it satisfies the property and there is no Q0 ∈ Q̂≡ s.t. Q0 ≺ Q (Q ≺ Q0 ) satisfying the same property. Definition 4. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV , and let Q ∈ QkV . The top (bottom) cover of Q in (Q̂≡ , ≺) is the set of all minimal (maximal) Q0 ∈ Q̂≡ s.t Q0 ⊆S Q (Q ⊆S Q0 ). The covers may be empty, e.g. if Q̂≡ = {{x {A(x)}} and Q = x {B(x)}. Lemma 2. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV , and let Q ∈ QkV . If the top and bottom covers, Q> and Q⊥ , respectively, of Q are non-empty, then there is a Q̃ ∈ Q̂≡ s.t. Q0  Q  Q̃ for all Q0 ∈ Q⊥ . Proof. Let Q̄ be the intersection of the queries in Q> . By definition, Q̄ ⊆S Q, and hence Q  Q̄. By completeness there is a unique Q̃ ∈ Q̄≡ s.t. Q̃ ≡ Q̄, and so Q  Q̃. Finally, by definition, for all Q0 ∈ Q⊥ we have Q ⊆S Q0 , hence Q0  Q. This allows us to use (Q̂≡ , ≺) to bound the answers to any query Q ∈ QkV / Q, provided that Q> , Q⊥ exist. If Q ∈ Q, we have a stronger result. even if Q ∈ Inside the Query Space of DL Knowledge Bases 5 Lemma 3. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . The bottom cover Q⊥ of any Q ∈ Q is a singleton. Proof. Since Q ∈ Q, there is a unique Q̃ ∈ EC≡ (Q̂), s.t. Q ∈ Q̃. The represen- tative Q0 is s.t. Q0 ≡ Q and Q ⊆S Q0 . This is clearly maximal and is in Q⊥ . For any other Q00 ∈ Q̂≡ s.t. Q ⊆S Q00 , we have Q00 ≺ Q0 , hence it cannot be in Q⊥ . This unique element of Q⊥ is the anchor of Q in (Q̂≡ , ≺), and the representa- tive ordering of Q fully describes Q: any query Q ∈ Q is represented in (Q̂≡ , ≺) by its unique anchor, and the answers to Q are the answers to its anchor. Example 2. Let x {A(x)} ≺ x {S(x, y)} ≺ x {S(x, y), R(y, z), B(z)} and Q consist of these queries. Then Q̂≡ = {{x {A(x)}}, {x {A(x), S(x, y)}}, {x {A(x), S(x, y), R(y, z), B(z)}} with the obvious ordering. Let Q be x {S(x, y), R(y, z)}. We have that Q ∈/ Q, and the bottom cover of Q is {x {A(x), S(x, y), R(y, z), B(z)}. Note that x {S(x, y), R(y, z), B(z)} ∈ Q has the same bottom cover. This shows that ‘anchors’ can exist also for queries that do not belong in Q. Thus, only if we know that a query belongs to Q we can be sure that the answers of the anchor are also exactly its answers. If for the representative ordering (Q̂≡ , ≺) we define the meet operation ∧ so that Q1 ∧ Q2 is the maximal query Q ∈ Q̂≡ s.t. Q1 u Q2 ⊆S Q, then (Q̂≡ , ≺) becomes a meet semi-lattice. Given a finite Q, Q̂≡ is finite, thus the semi-lattice has a bottom. To capture the minimum needed partial ordering relations for a representative ordering we use a directed acyclic graph. Definition 5. Let (Q̂≡ , ≺) be the representative ordering of some Q ⊆ QkV . The query space graph (QSG) of Q is the transitive reduction of the directed acyclic graph induced by (Q̂≡ , ≺). Because (Q̂≡ , ≺) is a finite semi-lattice, the QSG is finite with a bottom, but may have more than one roots. The QSG is shortcut-free, i.e. if there is an edge (Q1 , Q2 ) there is no path from Q1 to Q2 in G other than (Q1 , Q2 ). 4 Query Space Graph Computation Now, we will describe how, given a language hV, Li and a KB K, we can compute h1,ki∗ the QSG for the (≡K , ≺K ) structure and the set QV of tree-shaped queries with one answer variable and up to k variables. To do this, we need to perform two tasks: compute the set of queries for which to build the QSG and then arrange them into an actual QSG. We will start from the latter. A given set of queries Q can be arranged into a graph, by first computing the equivalence classes (the nodes of the graph) and then adding the edges induced by the partial ordering. Such a graph will not be a QSG unless Q is complete. If not, all intersections of the queries in Q have to be iteratively computed. This can be done by exploiting the already produced graph, and let it guide the query intersection process in a way similar to FCA algorithms [9]. 6 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou To compute the actual queries, we can compute first all possible tree shapes, the so-called rooted trees (e.g. there are 1, 1, 2, 4, 9, 20, 48, 115, 286, 719 such trees with 1 . . . 10 nodes, [1]). We represent a rooted tree with n > 0 nodes as a set of edges (i, j) (i is the parent of j), where i, j are integers representing nodes. We assume that if j is a descendent of i, then j > i, so that the root is the smallest of the integers being used (usually 0). The tree T = ∅ consists only of a root. To compute all rooted trees with up to t nodes we can start e.g. from tree {(0, 1)}, and iteratively connect a new node to each existing node, taking care to discard equivalent representations of the same rooted trees (reflections). Let D (C) be the set of all non syntactically equivalent queries of the form k x {Ci (x)}i=1 , Ci ∈ CN, k ≥ 1 (k = 1), i.e. the queries consisting of concept atoms. h1,1i∗ P|CN| and |D| = i=1 |CN| = 2|CN| − 1. For a query Q known  Clearly, D = QV i to be in D (C), we use the adornment notation D C x Q (x Q). Similarly, let S (R) be the set of all non syntactically equivalent queries of the form x,y {Ri (xi )}ki=1 , Ri ∈ RN, k ≥ 1 (k = 1), and xi = (x, y) or (y, x), i.e. the queries consisting of P|RN| role atoms. |QS| = i=1 2i |RN| = 3|RN| − 1. For strictly tree-shaped queries,  i where all xi s must be the same, we denote the respective sets by S∗ and R∗. For a query Q known to be in S (R), we use the notation Sx,y Q (R x,y Q), and similarly for the starred versions. D (S) can be obtained as the node set of the QSG of the completion of C (R). These sets are the material to put on the edges and nodes of a rooted tree to get tree-shaped queries. The tree T = ∅ gives rise to the queries D x0 Q; any tree with nodes {0, . . . , k}, k ≥ 1, gives rise to the queries    [D S D S D x0 Q ◦ ] (. . . ((x̂1 ,x1 Q1 [ ◦ x1 Q̃1 ]) ◦ . . .) ◦ (x̂k ,xk Qk [ ◦ xk Q̃k ]) |x0 (1) where x̂i = xparent(i) . The optional parts in square brackets specify concepts in the tree nodes. Due to the recurrence of Sx̂i ,xi Q [ ◦ D xi Q̃], we define the frame link S L = S ∪ x,y Q1 ◦ D S D y Q2 | x,y Q1 ∈ S, y Q2 ∈ D (2) i.e the set of queries with two variables, some roles connecting them and possibly some concepts on the second variable. By Lx,y Q we will denote a query in L. We define similarly (i.e. by replacing S with S∗ in Eq. 2) L∗ and L∗ x,y Q . Given the above, a simple approach to compute the queries induced by a rooted tree T is to compute L (or L∗), create a copy of it for each edge, appro- priately renaming variables, and take all possible combinations between frame link elements. An alternative, backwards iterative approach is based on the ob- servation that, if the rooted trees of sizes up to k − 1 are available, a rooted tree of size k can be directly constructed from one or more such trees. In particular, a query with shape a rooted tree of size k can be constructed either by a one step extension of an existing query with shape a rooted tree S of size k − 1 (by renaming variables, adding a new root and connecting the new to the old root) or by merging two or more existing queries with shape trees Si (by renaming variables and joining them on their roots). In the first case, if y Q is the set of all queries with shape S, its extension is obtained by the unary operator x 3 y Q = (x,y Q ◦ y Q)|x | x,y Q ∈ L and y Q ∈ y Q L L Inside the Query Space of DL Knowledge Bases 7 In the second case, let x Qi be the set of all queries with shape Si . Merging is an n-ary operator ⊗, for n ≥ 1: x Q1 ⊗ x Q2 ⊗ . . . ⊗ x Qn = {x Q1 u x Q2 u . . . u x Qn | x Qi ∈ x Qi } Given the above, the set of all queries represented by Formula 1 are QT = [{ D x0 Q ∈ D } ⊗ ] x0 Q̃ (3) i.e. the optional merging of the queries including only concept atoms on the root variable, with the set x0 Q̃ Nof all queries with the shape of T . These are obtained for i = 0 from xi Q̃ = j∈children(i) xi Q̂ij which expresses the merging of the queries represented by the several branches of the tree assuming the root is node i. Each branch is either a set of queries in L, if the branch has length 1, or the extension of the set of queries represented by the first child in the branch: (n o L xi ,xj Q| xi | L xi ,xj Q ∈ L if children(j) = ∅ xi Q̂ij = (4) xi 3 xj Q̃ otherwise S Since there are infinite T s, T QT is infinite. To guarantee a finite set of queries, we must set additional criteria, such as compute only queries with up to k variables. Such a set will not be complete. A QSG on a non-complete set will fully describe only the queries in the set (they will have unique anchors), but this approximation can in practice be useful. Finally, to arrange the queries in a QSG, we need their answers. If we can answer arbitrary queries, we could compute first the queries and then evaluate them. However, the number of queries explodes combinatorially, yet most of them will have zero answers. Thus, assuming that we are not interested in null queries, we can try to combine query generation with answering to avoid generating and evaluating many of the null queries. 5 Building a QSG for a DL-Lite− R Knowledge Base We will now discuss how we can construct and answer tree-shaped queries for normalized DL-Lite− R KBs. We are interested only in non-null queries. Our ap- proach will be based on the rewriting approach to query answering, which is particularly applicable to the DL-Lite family [5]: given a query Q and a DL-Lite KB hT , Ai, there is a setS of queries rewr(Q), called the rewritings of Q, such that cert(Q, hT , Ai) = Q0 ∈rewr(Q,T ) cert(Q0 , h∅, Ai) i.e. the consequences of T are incorporated in the several Q0 s, which can then be answered ignoring T (for brevity, we will omit the KB arguments from cert(·) and rewr(·)). Rewritings are ‘transitive’: A rewriting of a rewriting of Q is a rewriting of Q. It follows that if Q0 ∈ rewr(Q) and Q00 ∈ rewr(Q0 ) then cert(Q00 ) ⊆ cert(Q0 ) ⊆ cert(Q). We focus on DL-Lite− R KBs, because the absence of axioms R v S − guar- antees that the edges of a tree-shaped query do not change direction during rewriting. In DL-Lite− R , the rewriting set of an atomic query is a set (union) of atomic queries: rewr(x {C(x)}) = { x {A(x)} | T |= A v C} ∪ x {R(x, y)} | T |= 8 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou ∃R v C} ∪ x {R(y, x)} | T |= ∃R− v C} }, rewr(x {R(x, y)}) = { x {B(x)} | T |= B v ∃R} ∪ x {S(x, y)} | T |= ∃S v ∃R} ∪ x {S(y, x)} | T |= ∃S − v ∃R} }, and rewr(x,y {R(x, y)}) = { x,y {S(x, y)} | T |= S v R}. An unfolding of query Q is a rewriting obtained by substituting each atom in body(Q) by one of its rewritings, according to the above, taking care to ap- propriately name newly introduced variables. An unfolding of Q (before con- densation) retains all bound variables of Q. The set of all unfoldings of Q is unfold(Q). Apart from unfoldings, Q may have also shrinking rewritings. Shrinkings arise due to axioms of the form A v ∃R.C. Given such an ax- iom, we have x {A(x)} ∈ rewr({R(x, y), C(y)}x ) (in fact {A(x)}x is also in rewr({S(x, y), D(y)}x ) if T |= R v S and T |= C v D). Query x {A(x)} is a shrinking of x {R(x, y), C(y)} and is an element of shrinky (Q) because it con- tains all variables of Q apart from bound variable y, which has been eliminated by reasoning using an axiom of the form A v ∃R.C. Let   [ [ shrink(Q) = {Q} ∪  shrink(Q0 ) (5) v∈var(Q) Q0 ∈shrinkv (Q) It can be proved [6, 11] that rewr(Q) = Q0 ∈shrink(Q) unfold(Q0 ). If var(Q) = S avar(Q), then rewr(Q) = unfold(Q0 ), since answer variables cannot be eliminated. Applying iteratively shrinking on a query using Eq. 5, we may end up with a query containing only concept atoms on the answer variable. If all of them have a common unfolding, by unfolding the query we will end up with a query consisting of a single concept atom. Definition 6. The kernel of a tree-shaped query x Q is the set 0 C 00 C 0 C 0 C 00 kernel(x Q) = { C x Q ∈ rewr(x Q) | there is no x Q 6≡S x Q s.t. x Q ∈ rewr(x Q ) } Theorem 1. If x Q1 , . . . , x Qn are tree-shaped queries, then n \ cert(x Q1 u . . . u x Qn ) = cert(x Qi ) and i=1 [ [ kernel(x Q1 u . . . u x Qn ) = ... kernel(x Q01 u . . . ux Q0n ) 0 0 x Q1 ∈kernel(x Q1 ) x Qn ∈kernel(x Qn ) Proof. Because all x Qi s share only the answer variable x, their intersection (and the intersections of their rewritings) has no shrinking on x. Thus, the rewritings of x Q1 u . . . u xS Qn are all combinations of the rewritings of each x Qi : rewr(x Q1 u . . . u x Qn ) = xQ0 ∈rewr(x Q1 ) . . . xQ0 ∈rewr(x Qn ) x Q01 u . . . u x Q0n . Because these S 1 n rewritings share only variable x we get the desired results. The above allows us to compute cert(x Q1 u . . . u x Qn ) and kernel(x Q1 u . . . u x Qn ) given the certain answers and kernels of the merged queries. Because a merging may be used to construct further queries, it should be kept either if it has a non-empty kernel (which can cause queries constructed as extensions of it be non-null), or it is itself non-null. Inside the Query Space of DL Knowledge Bases 9 Theorem 2. Let x Q = (L∗ x,y Q1 ◦ y Q2 )|x ), where x,y Q1 ∈ L, and y Q2 is a strictly L∗ tree-shaped query with no concept atoms on y. Then  [ cert(x Q) = cert(L∗ x,y Q1 ) ./ 2=1 cert(y Q2 ) |1 ∪ cert(C x Q̃) C Q̃∈kernel( Q) x x S x,y Q1 ◦ y Q̃2 )|x ), ./i=j denotes join- where kernel(x Q) = C Q̃2 ∈kernel(y Q2 ) kernel((L∗ C y ing of the left- and right-hand side tuples on the i-th and j-th element respectively, and |i denotes projection on the i-th element of the tuples. Proof. If a rewriting x Q̃ of x Q contains y, no shrinking has been applied on y, so x Q̃ = (x,y Q̃1 ◦ y Q̃2 )|x , x,y Q̃1 ∈ rewr(x,y Q1 ) and y Q̃2 ∈ rewr(y Q2 ). Such rewrit- L∗ L∗ ings contribute answers cert(x,y Q1 )./ 2=1 cert(y Q2 ))|1 . If x Q̃ doesn’t contain y but contains a variable z, it was obtained by shrinking directly x Q on y, or by shrink- ing on y some other shrinking of x Q (on other variables), or it is a rewriting of − these. Let L∗x,y Q1 = x,y Q1A [ ◦y Q1B ]. Because there are no axioms S v R , y oc- S∗ D curs at the same place in all role atoms. Thus, S∗ x,y Q1A contains only atoms of the form S(x, y) (or S(y, x)), and y Q2 only atoms of the form S(zi , y) (or S(y, zi )). This is impossible (x Q is strictly tree-shaped), so no shrinking on y is applicable. If x Q̃ results by shrinking on y a shrinking x Q0 of x Q on other variables, x Q0 cannot contain variables other than x and y, otherwise we could shrink directly on y. Thus, x Q0 = (S∗ x,y Q1A [ ◦y Q1B ] ◦ y Q3 )|x , where y Q3 is a result of shrinking D D D on the other variables. For shrinking on y to be applicable on x Q0 it must be the case that for all E(y) ∈ body(D y Q1B ) we have C v E, and for all S(x, y) ∈ body(x,y Q1A ) we have T |= R v S. There can be no S 0 (y, x) ∈ body(S∗ S∗ x,y Q1A ). These are necessary for all such E(y), S(x, y) to give their place to A(x) in the shrinking due to A v ∃R.C. Thus, we must have x {A(x)} ∈ kernel(L∗ x,y Q1 ). But for all E(y) ∈ body(D y Q 3 ), i.e. for all y {E(y)} ∈ kernel(y Q2 ) we must also have C v E, i.e. x {A(x)} ∈ kernel((L∗ Q x,y 1 y ◦ {E(y)})|x ). Taking the union for all such E and A we get the second part of the desired result. If x Q̃ is an unfolding of the above rewritings, its answers are included in the certain answers of the query from which it has been obtained and we can ignore it. We call (cert(L∗x,y Q1 ) ./ 2=1 cert(y Q2 ))|1 the direct answers of x Q, because they are obtained directly by joining the answers of L∗ x,y Q1 and y Q2 . So, the theorem tells us that for the extension we have to stick to a strictly tree-shaped query x Q, compute its direct answers, its kernel (L∗ x,y Q1 ◦ y Q2 )|x (needing for that only the kernel of y Q2 ), find the answers to the kernel queries, and add them to the direct answers. To implement Formula 3 we still need to compute queries L∗ L∗ x,y Q and x,y Q|x , used in the second and first branch of Eq. 4, respectively. For this, we need first all queries L∗ x,y Q that may give rise to non-null queries. According to Th. 2 these are those that have at least one direct answer, and those whose projection on x has a non-empty kernel. To compute the set of queries with at least one direct answer, say Z, we can implement Eq. 2, by computing all intersections of the nodes of the QSG for S∗ with the nodes of the QSG for D. Then, we must 10 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou update Z with the queries having non-empty kernels. The queries of the form x,y Q1 ◦ y Q2 that have x {A(x)} in their kernel due to A v ∃R.C, are in the set R∗ C EAv∃R.C x,y = {{S(x, y), D(y)}x,y | A v ∃R.C, T |= R v S and T |= C v D} ∪ {{S(y, x), D(y)}x,y | A v ∃R− .C, T |= R v S and T |= C v D} As in the case of D and C, we can obtain the set FAv∃R.Cx,y of the equivalence classes of the queries in x,y Q that have the same answers and have x {A(x)} in L their kernel due to axiom A v ∃R.C, as the node set of the QSG for EAv∃R.C x,y . Note that since we are interested in the kernels, we must compute the QSG including the null equivalence class. Because the kernel does not contribute any answers (since the head is hx, yi), those of these queries that have at least one answer should already be represented by an equivalence class in Z. However, that equivalence class may represent a more general query, with additional conjuncts, which has therefore an empty or different kernel. Thus, we need to enrich Z with all the queries in the several sets FAv∃R.C x,y . Now, to use Eq. 4, we need also to know all queries L∗ x,y Q|x that may give rise to non-null queries. Again, these are those that have at least one direct answer, and those that have a non-empty kernel. To compute them, we have to take the projections on x of all queries in Z, and as before, extend this set with the projections of the queries in each FAv∃R.C x,y on x, by including however also any answers to the kernel. 6 Evaluation We implemented the QSG computation for DL-Lite− R KBs in Java, and we ap- plied it on the DBPedia KB (2016-10 version), which contains 785 concepts, 1,111 properties (in the dbo namespace), 5,150,432 concept and 24,431,982 role assertions. To test scalability we created several subsets of the KB by keeping only the concepts and roles (and the relevant assertions) that, viewed as atomic queries, had certain minimum numbers of answers (ranging from 1 to 106 ), and tried to compute the QSG for Qh1,ki∗ , k = 1, 2, 3. The number of nodes of the computed QSGs and computation time, as a function of the number of kept concepts and roles are shown in Figure 1. Using a machine with 64GB RAM we were able to compute Qh1,1i∗ and Qh1,2i∗ for the entire DBPedia, but for Qh1,3i∗ memory was enough only for small subsets. We note that the size of the QSG, after a short superlinear growth at the beginning, grows linearly or sublinearly in the size of the ontology. Thus, in practice, the combinatorial explosion seems to occur only when we increase the size of the query but not the size of the ontology. (In computing rewr(x {C(x)}) we ignored axioms of the form ∃R− v C. Although theoretically incorrect, it turned out to be more reasonable because in DBPedia such axioms are used as domain and range constraints; if they participate in the reasoning process they propagate a large number of inconsistencies.) An example path in the graph for Qh1,3i∗ is Q0 = x0 {producer(x0 , x1 )} (157,591 answers), Q1 = Q0 ◦ x1 {Agent(x1 )} (135,354), Q2 = Q1 ◦ x1 {Person(x1 )} (119,578), Q3 = Q2 ◦ x0 {Work(x0 )} (119,577), Q4 = Q3 ◦ x1 {occupation(x1 , x2 )} Inside the Query Space of DL Knowledge Bases 11 ·104 ·105 4 6 |G| 400 3 4 200 2 2 1 0 0 0 0 500 1,000 0 500 1,000 0 500 1,000 ·104 6 30 600 minutes 4 20 400 10 200 2 0 0 0 0 500 1,000 0 500 1,000 0 500 1,000 |CN| + |RN| |CN| + |RN| |CN| + |RN| Fig. 1: QSG size and computation time for Qh1,1i∗ , Qh1,2i∗ , Qh1,3i∗ , respectively. (81,320), Q5 = Q4 ◦ x1 {Athlete(x1 )} (11), Q6 = Q5 ◦ x2 {PersonFunction(x2 )} (10), Q7 = Q6 ◦ x0 {Album(x0 ), MusicalWork(x0 )} (1). Each edge refines the query, and we discover that from the 157,591 producer assertions there is only one producer that is an athlete and has produced a music album. The graph revealed also un- balances on the topic coverage. E.g. from the about 1,243,400 persons, 360,941 were athletes, 318,392 athletic team members, and 18,622 soccer managers. 7 Conclusion We defined a theoretical framework for modeling epistemic equivalence and or- dering relations between queries of a DL KB, arranging them in a graph, and discussed how to compute such a graph for DL-Lite− R KBs. The experimental results showed that the computation of restricted versions of the QSG might be possible even for relatively big datasets. Given the complexity and memory needs of the computation, future work will target improving the prototype implemen- tation or approximative computation. Efficient computation for more expressive DL dialects, starting from the full DL-LiteR , should also be investigated. We see several applications of our framework, from construction of balanced datasets for machine learning, to semantic clustering and autocomplete suggestions. 8 Acknowledgements We acknowledge support of this work by the project ‘Apollonis: Greek Infrastruc- ture for Digital Arts, Humanities and Language Research and Innovation’ (MIS 5002738) which is implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Programme ‘Compet- itiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). 12 Alexandros Chortaras, Michalis Giazitzoglou, and Giorgos Stamou References 1. The on-line encyclopedia of integer sequences. Number of unlabeled rooted trees with n nodes (or connected functions with a fixed point)., http://oeis.org/ A000081 2. Arenas, M., Grau, B.C., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Faceted search over RDF-based knowledge graphs. J. Web Semant. 37-38, 55–74 (2016) 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applica- tions. Cambridge University Press (2003) 4. Bienvenu, M., Lutz, C., Wolter, F.: Query containment in description logics recon- sidered. In: KR. AAAI Press (2012) 5. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning 39(3), 385–429 (2007) 6. Chortaras, A., Trivela, D., Stamou, G.B.: Optimized query rewriting for OWL 2 QL. In: CADE. Lecture Notes in Computer Science, vol. 6803, pp. 192–206. Springer (2011) 7. Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations. Springer (1999) 8. Gottlob, G., Fermüller, C.G.: Removing redundancy from a clause. Artif. Intell. 61(2), 263–289 (1993) 9. Kuznetsov, S.O., Obiedkov, S.A.: Comparing performance of algorithms for gen- erating concept lattices. J. Exp. Theor. Artif. Intell. 14(2-3), 189–216 (2002) 10. Nienhuys-Cheng, S., de Wolf, R. (eds.): Foundations of Inductive Logic Program- ming, Lecture Notes in Computer Science, vol. 1228. Springer (1997) 11. Trivela, D., Stoilos, G., Chortaras, A., Stamou, G.B.: Optimising resolution-based rewriting algorithms for OWL ontologies. J. Web Semant. 33, 30–49 (2015) 12. Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Re- trieval, and Services, Morgan & Claypool Publishers (2009)