Data complexity of finite query entailment in description logics with transitive roles Jakub Kuklis jk.kuklis@gmail.com University of Warsaw, 02-097 Warszawa, Poland Abstract. We study the problem of finite ontology-mediated query an- swering (FOMQA), the variant of OMQA where the represented world is assumed to be finite, and thus only finite models of the ontology are considered. We adapt the setting of unions of conjunctive queries and ontologies expressed in description logics with transitive role declara- tions. FOMQA is already undecidable for SHOIF, but was proved to be 2Exp-complete for SOI, SOF and SIF. The proof relies on the fact that any finite counter-model can be transformed to a counter-model of certain regularity. By adjusting this transformation, we are able to pro- vide an upper bound on the size of the smallest finite counter-model for SOI and SOF ; the bound is linear in the size of the ABox. This lets us show that for these logics FOMQA is in coNP with respect to data complexity; as coNP-hardness follows from previous results on weaker logics, we deduce coNP-completeness. 1 Keywords: FOMQA · Data complexity · Counter-models · SOI· SOF 1 Introduction Evaluating queries in the presence of background knowledge has been extensively studied in several communities. A particularly prominent take on this problem is ontology mediated query answering (OMQA) where background knowledge rep- resented by an ontology is leveraged to infer more complete answers to queries [3]. A widely accepted family of ontology languages with varying expressive power is offered by Description Logics (DLs) [2], while the most commonly studied query language is that of (unions of) conjunctive queries. Identifying ontology languages, for which query evaluation scales to large amounts of instance data, is one of the key research goals, as data tends to dominate in size both the ontology used and the evaluated queries. An early reference on data complexity in DLs, that is, complexity in the settings with fixed TBox and query, is showing coNP-hardness of instance queries in ALE [13]. In the much more expressive DL SHIQ, a coNP upper bound was obtained for instance queries [8] and conjunctive queries without transitive roles [10], and later 1 Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 J. Kuklis generalized to any CQs [5]. The classical approach to avoiding intractability is to replace ALC and SHIQ with less expressive Horn DLs. These often admit query evaluation in PTime, examples include a variety of logics from the EL [1] and DL-Lite families [4], as well as Horn-SHIQ [8]. An alternative approach is to study complexity with respect to fixed TBoxes only [9]. Often, the intended models of the ontology are finite and this additional assumption allows to infer more properties: finite ontology mediated query an- swering (FOMQA) is the variant of OMQA restricted to finite models. For some logics, like ALC, the finite variant and the unrestricted variant of the problem coincide; we then say that OMQA is finitely controllable. Studying FOMQA is interesting in settings lacking finite controllability. This is the case not only for DLs lacking the finite model property (e.g., DLs allowing both inverse roles and number restrictions), but also for logics allowing transitive role declarations, like logics from the S family. FOMQA is undecidable for SHOIF ontologies [11], but recently its decidability has been shown for three proper SOIF fragments: SOI, SOF, and SIF [6]. For all three fragments, the problem turned out to be 2Exp-complete in terms of combined complexity, but its data complexity was not studied. The present paper aims to fill this gap, thus initiating the study of the data complexity of FOMQA. In [6], FOMQA is reduced to query entailment over a certain class of tree-like models recognisable by tree automata. Here, we adjust and adapt this approach to obtain the linear-size counter-model property for SOI and SOF (with respect to fixed TBox and query). This lets us derive an algorithm for FOMQA, which is coNP with respect to data complexity. The matching lower bound can be ob- tained using finite controllability from the results on answering instance queries, which is coNP-hard even for the ALE fragment of ALC [13] (see also [2]). 2 Preliminaries The DL SOIF extends the classical DL ALC with transitivity declarations on roles (S), nominals (O), inverses (I), and role functionality declarations (F) [2]. We assume a signature of countably infinite disjoint sets of concept names NC = {A1 , A2 , . . . }, role names NR = {r1 , r2 , . . . } and individual names NI = {a1 , a2 , . . . }. SOIF-concepts C, D are defined by the grammar: C, D ::= > | A | ¬C | C u D | {a} | ∃r.C , where r ∈ NR ∪ {s− | s ∈ NR } is a role. Roles of the form r− are called inverse roles. A SOIF TBox T is a finite set of concept inclusions (CIs) C v D, transitivity declarations Tr(r), functionality declarations Fn(r), where C, D are SOIF-concepts and r is a role. We assume that if the TBox contains Tr(r), then it contains neither Fn(r) nor Fn(r− ). With an appropriate extension of the signature, each SOIF TBox can be transformed into an equivalent TBox with CIs in the following normal forms: u Ai v t B j , A ≡ {a} , A v ∀r.B , A v ∃r.B , Data complexity of FOMQA in DLs with transitive roles 3 where A, Ai , B, Bj are concept names, empty conjunction is equivalent to > and empty disjunction to ⊥. We assume that all knowledge bases we work with are normalized; normalization yields at most linearly larger knowledge base, so the size estimations still hold. We also assume that for each concept name A used in T there is a complementary concept name Ā axiomatised using CIs > v A t Ā and A u Ā v ⊥. SOI and SOF TBoxes are restrictions of SOIF TBoxes. SOI TBoxes do not contain functionality declarations, whereas concept inclusions in SOF do not contain inverse roles. As the inverse of a transitive role is transitive anyway, for SOI we shall assume that if Tr(r) is present in the TBox, then so is Tr(r− ). An ABox is a finite set of concept and role assertions of the form A(a) and r(a, b), where A ∈ NC , r ∈ NR and {a, b} ⊆ NI . A knowledge base (KB) is a pair K = (T , A). We write |K| for |A| + |T |. We use CN(K), Rol(K), TRol(K), Nom(K), and Ind(K) to denote, respectively, the set of all concept names, roles, transitive roles, nominals, and individuals occurring in K. Each nominal is an individual, so Nom(K) ⊆ Ind(K). A unary type is a subset of CN(K) that contains exactly one of the concept names A, Ā for each A ∈ CN(K). We let Tp(K) denote the set of all unary types. The semantics is defined via interpretations I = (∆I , ·I ) with a non-empty domain ∆I and an interpretation function ·I assigning to each A ∈ CN(K) a set AI ⊆ ∆I and to each role name r with r ∈ Rol(K), a binary relation rI ⊆ ∆I × ∆I . The interpretation of complex concepts and roles is defined as usual [2]. We only consider interpretations complying with the standard name assumption and set aI = a for each individual name a used in ABox or TBox. We write d ∈ I as a shorthand for d ∈ ∆I . Interpretation I is a subinterpretation of interpretation J , written as I ⊆ J , if ∆I ⊆ ∆J , AI ⊆ AJ , and rI ⊆ rJ for all A ∈ CN(K), r ∈ Rol(K). An interpretation I is a subinterpretation of J induced by ∆0 ⊆ ∆J , written as I = J  ∆0 , if ∆I = ∆0 , AI = AJ ∩ ∆0 , and rI = rJ ∩ ∆0 × ∆0 for all A ∈ CN(K), r ∈ Rol(K). We write J \ X for the subinterpretation of J induced by ∆J \ X. An interpretation I satisfies α ∈ T ∪ A, written as I |= α, if the following holds: if α is a CI C v D then C I ⊆ DI , if α is a transitivity declaration Tr(r) then rI is transitive, if α is a functionality declaration Fn(r) then rI is a partial function, if α is an assertion A(a) then a ∈ AI , and if α is an assertion r(a, b) then (a, b) ∈ rI . Finally, I is a model of: a TBox T , denoted I |= T , if I |= α for all α ∈ T ; an ABox A, denoted I |= A, if I |= α for all α ∈ A; and a KB K if I |= T and I |= A. Let I and J be interpretations of K. A homomorphism from I to J , written as h : I → J is a function h : ∆I → ∆J that preserves roles, concepts, and individual names; that is, (h(d), h(d0 )) ∈ rJ whenever (d, d0 ) ∈ rI , r ∈ Rol(K), h(d) ∈ AJ whenever d ∈ AI , A ∈ CN(K), and h(a) = a for all a ∈ Ind(K). Note that I ⊆ J iff the identity mapping id is a homomorphism id : I → J . Let NV be a countably infinite set of variables. An atom is an expression of the form A(x) or r(x, y) with A ∈ NC , r ∈ NR , and x, y ∈ NV , referred to as 4 J. Kuklis concept atoms and role atoms, respectively. A conjunctive query (CQ) Q is an existentially quantified conjunction q of atoms, ∃x1 · · · ∃xn q . For simplicity we restrict q to Boolean; that is, q uses only variables x1 , · · · , xn . This is without loss of generality since the case of non-Boolean CQs can be reduced to the case of Boolean queries; see e.g. [12]. Let var denote the function from queries to sets of variables, such that var(Q) is the set of all the variables used in Q. A match for Q in I is a total function π : var(Q) → ∆I such that I, π |= q under the standard semantics of first-order logic. An interpretation I satisfies Q, written as I |= Q, if there exists a match for Q in I. A partial match for Q in I is a partial function π : var(Q) → ∆I such that I, π |= q 0 , where q 0 is obtained from q by dropping atoms which bind some variable not in the domain of π. Note that we do not consider queries with constants (i.e., individual names): such queries can be viewed as non-boolean queries with a fixed valuation of free variables, and thus are covered by the reduction to the Boolean case. We do consider unions of conjunctive queries (UCQs), which are disjunctions of CQs. An interpretation I satisfies a UCQ Q if it satisfies one of its disjuncts. It follows immediately that UCQs are preserved under homomorphisms; that is, if I |= Q and there is a homomorphism from I to J , then also J |= Q. A query Q is entailed by a KB K, denoted as K |= Q, if every model of K satisfies Q. A model of K that does not satisfy Q is called a counter-model. The query entailment problem asks whether a KB K entails a (U)CQ Q. Moreover, this problem is equivalent to that of finding a counter-model. It is well known that the query answering problem can be reduced to query entailment. In this paper, we address the problem of finite query entailment, which is a variant of query entailment where only finite interpretations are considered: an interpretation I is finite if ∆I is finite, and a query Q is finitely entailed by K, denoted as K |=fin Q, if every finite model of K satisfies Q. 3 Counter-model surgery Later in this paper, we transform a tree-like counter-model into a finite counter- model of bounded size, using a variant of the blocking principle: a systematic policy of reusing elements. Before we dive into this construction, we show that we can add certain edges to a counter-model without satisfying the query. These edges must connect a pair of points, for which a different pair of similar points is already connected via an edge over the same role. Throughout this section, we assume no transitivity declarations are present; we will handle these later. Sharing the same unary type is a natural similarity criterion. Without tran- sitivity and functional declarations, concept inclusions rely only on the unary types of directly linked elements, so adding an r-edge between a pair of elements yields a correct model of the knowledge base if a pair of elements of the same unary types is already connected with an r-edge. However, that criterion is not sufficient, since this way we might accidentally satisfy the query, for example by creating a cycle of given length when the query asks for one. This happens to Data complexity of FOMQA in DLs with transitive roles 5 be a key difficulty to overcome: we cannot introduce cycles shorter than the size of the query. The first step to enhance the criterion is to look at sufficiently large neigh- bourhoods, rather than just unary types. We adapt the definition of neighbour- hoods from [6] and parameterize it with a set of constants Const ⊆ ∆I : elements which are considered neighbours of all the elements in the interpretation, regard- less of the actual connections between them. Throughout the paper, we assume Const = Nom(K) unless specified otherwise. By the distance between two ele- ments we understand the length of the shortest undirected path over the union of all roles between these two elements. Definition 1. For d in I \ Const and an integer n, the n-neighbourhood NnI (d) of d with respect to Const is the subinterpretation of I induced by Const and all elements e in I \ Const within distance n from d in I \ Const, enriched with a concept D interpreted as {d}, and a concept interpreted as {a} for each a ∈ Const. For a ∈ Const, NnI (a) is the subinterpretation induced by Const, enriched similarly. Replacing unary types with large neighbourhoods is still not enough, because nearby elements can have arbitrary large isomorphic neighbourhoods: in the integers with the successor relation, all n-neighbourhoods are isomorphic. The next step is to enrich the initial counter-model in such a way that overlapping neighbourhoods are not isomorphic, following an idea from [7]. Definition 2. A colouring with k colours of an interpretation I is an exten- sion J of I with ∆J = ∆I , such that J coincides with I in every element in the signature of I, and interprets k fresh concept names B1 , . . . , Bk such that B1J , . . . , BkJ is a partition of ∆J . We say that d ∈ BiJ has colour Bi . A colour- ing J of I is n-proper if for each d ∈ ∆J all elements of NnI (d) have different colours. Lemma 1. If I \ Const has bounded degree, then for all n ≥ 0 there exists an n-proper colouring of I with at most M + |Const| colours, where M is the size of the largest (2n)-neighbourhood. We write In for an arbitrarily chosen n-proper colouring of I. Definition 3. Let i ≤ n and let d, e be elements of In . We say that (d, e) is an i-link along role r if either d has an r-successor e0 in In such that NiIn (e0 ) ' NiIn (e), or e has an r-predecessor d0 in In such that NiIn (d0 ) ' NiIn (d). Theorem 1 (Lifting Theorem). For a conjunctive query with at most k bi- nary atoms and n ≥ k and an interpretation In in which the query is not satis- fied, adding n-links to In does not make the query satisfied. This theorem was established in [6] for n ≥ k 2 . We prove the stronger variant in the appendix. The strengthening is not necessary for our approach, but it tightens the resulting size estimations. 6 J. Kuklis 4 Clique-forest counter-models In this section, we revisit prior work showing that for SOI and SOF, the exis- tence of a finite counter-model for Q is equivalent to the existence of a so called counter-example, a possibly infinite counter-model of a special form, which gen- eralises tree-shaped models [6]. We impose tighter size constraints on counter- examples and construct automata recognizing them. The special form is based on the notion of clique-forests. Definition 4 ([6]). Let us fix a KB K. A clique-forest for an interpretation I is a forest whose each node v is labelled with a subinterpretation Iv of I \ Nom(K) such that: – the sets ∆Iv are a partition of ∆I\Nom(K) ; – each Iv is either a single element with all roles empty (element node) or a clique over some transitive role with all other roles empty and no repetitions of unary types (clique node); – apart from edges within cliques and between individuals, in I \ Nom(K) there is exactly one edge between ∆Iu and ∆Iv for every two adjacent nodes u and v: assuming u is the parent of v, it is an r-edge from an element of ∆Iu to an element of ∆Iv for some r ∈ Rol(K). We call v the origin of Iv . Definition 5. Let K∗ denote the KB obtained from K by dropping transitivity declarations and let I ∗ denote the interpretation obtained from I by closing transitively the interpretation of each transitive role. Note that each existential restriction satisfied in I is also satisfied in I ∗ . The same holds for quantifier-free CI, and for universal restrictions involving non- transitive roles. For universal restrictions involving transitive roles, we ensure this property by preprocessing the interpretation: for each pair of B ∈ CN(K) and r ∈ TRol(K), we add a fresh concept name Pr,B , axiomatised as Pr,B v ∀r.(B u Pr,B ), and replace each CI of the form A v ∀r.B with A v ∀r.Pr,B . An interpretation is safe if it does not contain an infinite simple r-path for any transitive role r. Definition 6. A counter-example for Q is a safe interpretation I such that: I |= K∗ , I ∗ |6 = Q, and I admits a clique-forest with at most |Ind(K)| + |Nom(K)| · |CI(K)| trees, nodes containing at most |CN(K)| elements, at most |CI(K)| chil- dren nodes connected to a single element, and each element of Ind(K) \ Nom(K) occurring in some root. If I is a counter-example for Q, then I ∗ is a counter-model for Q, thanks to the initial preprocessing. Theorem 2. Q has a finite counter-model iff Q has a counter-example. The theorem was originally stated in [6], where counter-examples by defini- tion admitted clique-forests with at most |K|2 trees and branching at most |K|2 , Data complexity of FOMQA in DLs with transitive roles 7 but the proof there is still valid for the present stronger statement. The bound on the branching of clique-forests is replaced with bounds on the size of clique- tree nodes and number of children nodes connected to a single element within any node; their precise values follow from the construction of a single clique-tree in the proof. The bound on the number of trees can be tightened, as the con- structed clique-forests are built of trees rooted in non-nominal individuals and children nodes of nominals. To obtain the final counter-model of bounded size, we are going to apply the blocking principle to a counter-example. Thanks to Theorem 2, we already know that a counter-example exists whenever Q has a finite counter-model. In the next section, we require that the clique-forests of counter-examples we work with are of certain regularity. We will show that in regular interpretations with a so-called completeness property, if two elements are connected with a path over a transitive role, then there is at least one short path over that role which connects them. To be able to focus on counter-examples with regular clique-forests, we shall construct tree automata recognizing the latter. The following definitions refer to edges and paths in a fixed interpretation I, either known from the context or explicitly specified. Definition 7. Fix an ordered pair of elements a, b. We say that (a, b) is an r- edge if there is an edge over r from a to b. We say that (a, b) is an r∗ -edge, if either r is transitive and there is a path over r from a to b, or (a, b) is an r-edge. Definition 8. An interpretation J is individual-complete, if in J each r∗ -edge over two individuals used in K is an r-edge. An interpretation J is nominal- complete, if in J each r∗ -edge with a nominal as one of the endpoints is an r-edge. An interpretation is complete, if it is individual- and nominal-complete. Note the difference between the individual- and nominal-completeness def- initions: in the former, both endpoints of the edges considered have to be in- dividuals, whereas in the latter, only one endpoint has to be a nominal. Each counter-model can be extended to a complete one in a natural way so that the transitive closure stays the same. We can thus restate Theorem 2: Q has a finite counter-model iff Q has a complete counter-example. Theorem 3. For a union Q of CQs, with each CQ of size at most m, and a complete interpretation M with ∆M = Ind(K), there exists an automaton with at 4m 3m most 22·3 |Q||T | states recognizing clique-forests of complete counter-examples for Q, for which the subinterpretation induced by Ind(K) is equal to M. See Appendix for the automaton construction. Theorem 3 is inspired by a similar theorem in [6]. We add the completeness requirement and restrict the size of the automaton state-space, rather than its overall size. This way, we are able to get a bound which is independent of the ABox; it depends only on the TBox and query sizes. The automaton size still depends on the ABox, since its description has to contain all the initial lists of states, each state corresponding to one tree, and we consider clique-forest with at least one tree per individual occurring in K. 8 J. Kuklis These tweaks result in significant differences between our construction of the automaton and the construction presented in [6]. Most importantly, we cannot transfer the information about the edges between individuals to the TBox, as that would introduce a dependency of the automaton state-space on the ABox. This influences especially the automaton component responsible for verifying that the query cannot be satisfied. 5 Bounded counter-models We say that DL has the linear-size counter-model property if whenever there exists a counter-model for a given query Q and a knowledge base K = (A, T ) expressed in that DL, there exists a counter-model for Q of size bounded by |A| · B(Q, T ) for some function B. The goal of the larger part of this section is to prove the following theorem. Theorem 4. SOI and SOF have the linear-size counter-model property. Clique-forests of complete counter-examples can be recognized by a family of automata by Theorem 3. Suppose K and Q admit a finite counter-model. As discussed in the last section, they admit then a complete counter-example, which in turn can be recognized by an automaton. Any automaton accepting some clique-forest must also accept some regular forest, which has only finitely many non-isomorphic subtrees. The number of these subtrees can be bounded by the number of states of the automaton. Following the bound in Theorem 3, we pick a complete counter-example I admitting a regular clique-forest with at 4m 3m most p = 22·3 |Q||T | non-isomorphic subtrees. We shall turn I into a finite counter-model for Q, using the blocking prin- ciple. The main obstacle is that Q uses transitive roles, which are not fully represented. To solve that, we replace Q with a different query. In [6], this was done by exploiting a bound on the length of simple r-paths for transitive roles r, guaranteed by the regularity of clique-forests of the considered counter-examples. Here, we require that the counter-example we process is complete and consider the distance between elements that are connected by r-paths, which will let us derive better bounds on the size of the obtained finite counter-models. Definition 9. A pair of elements a, b is at r-distance k if the shortest r-path from a to b is of length k. An interpretation is `-ranged if for each r∗ -edge, its endpoints are at r-distance at most `. For transitive role atoms in queries, the path that leads from one variable to another does not play any role in the query evaluation, only the unary types of the connected variables matter (and the fact that they are connected by some path). We use this intuition to deduce that in `-ranged interpretations, any query with transitive binary atoms can be rewritten so that closing the roles by transitivity does not influence the query evaluation. Let Q∗ beW obtained from Q by replacing each transitive atom s(x, y) by the disjunction i≤` si (x, y), where si (x, y) is the conjunctive query expressing the Data complexity of FOMQA in DLs with transitive roles 9 i-fold composition of s. If each disjunct of Q contains at most k binary atoms, Q∗ can be rewritten as a union of conjunctive queries, each using at most k · ` binary atoms. Lemma 2. For all `-ranged J , J ∗ |= Q iff J |= Q∗ . Lemma 3. I \ Ind(K) is (4p)-ranged. Proof. Let r be a transitive role in K. Each r-path going down the clique-forest of I contains at most p nodes. Indeed, if there were a longer r-path, then a subtree would occur twice on that path, which immediately leads to an infinite simple r-path in I \ Ind(K), contradicting the safety of I. For each clique node, at most one additional step is needed to traverse it with an r-edge. Additionally, if K is a SOI KB, an r-path can be composed of two parts, one going down the tree and the other consisting of r−1 edges going upwards. If follows that, for SOI and SOF, any r-reachable pair in I \ Ind(K) is at r-distance at most 4p. t u The following lemma, regarding a bound on the range of an interpretation, is a stronger version of the analogous lemma presented in [6], regarding a bound on the length of simple transitive paths. The bound on the range has to be multiplied by a constant to get the bound on the range in the full interpretation, whereas in [6], the size of the set of nominals also comes into play. We draw upon the individual-completeness of the interpretations we consider to achieve this improvement. Lemma 4. For each individual-complete interpretation J , if J \ Ind(K) is `- ranged, then J is (2` + 3)-ranged. Proof. Take two elements a, b, such that (a, b) is an r∗ -edge in J , but not an r∗ -edge in J \ Ind(K); it follows that any r-path from a to b visits an individual. If r is non-transitive, then (a, b) is an r-edge and a, b are at distance 1, otherwise consider such r-path π. It starts in one of the trees in J \ Ind(K) (or in an individual, in which case the tree degenerates to an empty one), and reaches a nominal connected to this tree or the tree root. Eventually, π enters the tree containing its other endpoint, again via a nominal connected to that tree or its root. Let c and d be the first and last individuals on π. As J is individual- complete, (c, d) is an r-edge, if c 6= d. We can reduce π to a path segment starting in a in the first tree, a single edge to c, possibly a single r-edge (c, d), a single edge from d, and a path segment ending in b in the last visited tree. As the first and last path segments are in J \ Ind(K), each pair of elements in them is at r-distance at most `. This gives a (2` + 3) bound on r-distance in J . t u We know that I is `∗ -ranged for `∗ = (8p + 3) by Lemmas 3 and 4. We conclude that I |6 = Q∗ by Lemma 2. The degree of elements in I\Ind(K) is bounded by |CN(K)|+|CI(K)| (elements are connected within clique-nodes of size at most |CN(K)| and with at most |CI(K)| + 1 elements from other nodes). Non-nominal individuals used in K can be connected with any number of other individuals, so even though their degree 10 J. Kuklis is bounded, the bound is relatively large. Nominals can be connected with an arbitrary number of elements. We aim to use the blocking principle. We consider an n-proper colouring In0 of I 0 = I \ Ind(K), for some fixed n, instead of a proper colouring of I; we do so to avoid the influence of the individuals on the size of the neighbourhoods (and therefore the required number of colours). I 0 is a clique-forest too: the children nodes of non-nominal individuals become new roots. On each branch π in In0 , let Dπ be the first node for which some earlier node Eπ satisfies NnIn (dπ ) ' NnIn (eπ ), where dπ ∈ Dπ and eπ ∈ Eπ are the endpoints of the edges connecting Dπ and Eπ to their parent nodes. The new interpretation Fn is obtained by including the branch π up to the predecessor of node Dπ , with the edge originally leading to dπ redirected to eπ . Individuals are copied from I, together with their adjacent edges. Note that for SOF, all functionality declarations are satisfied, as each redirected edge is a forward edge. As each node in In0 has finitely many children and we cut all the infinite paths off, Fn is finite by Kőnig’s lemma. As we start with I |= K∗ and the unary types of edge endpoints are preserved in Fn , we get Fn |= K∗ . By the initial preprocessing, (Fn )∗ |= K. The conclusions drawn so far are valid for any n. Lemma 5. Fn \ Ind(K) is (4p)-ranged for n ≥ 4p. Proof. During the construction of Fn from I, we add only one link for each infinite downwards path in I starting in a tree root. Each of these links goes up the tree. By link origins we shall understand the elements that had an edge added during the construction of Fn . We can visualize Fn ∪ I as the core Fn with additional hanging subtrees, each attached to Fn with a single edge to a link origin. There is no way of accessing these subtrees from Fn \ Ind(K) other than via the corresponding link origin. Let ` = 4p and fix some n ≥ `. We prove the following claim by induction on m: if there is a path of length ` + 1 from a in Fn \ Ind(K) to b in (Fn ∪ I) \ Ind(K) over a transitive role r using at most m links, then there is an r-path in (Fn ∪ I) \ Ind(K) from a to b over r of length at most `. If we prove the claim for all m, we can deduce that Fn \ Ind(K) is `-ranged. Indeed, take a transitive role s and an s-path π in Fn \ Ind(K) of length ` + 1, with endpoints (a, b) and m links (possibly none). Applying the claim, we get an s-path ρ in (Fn ∪ I) \ Ind(K) of length at most `, connecting a and b. To show that ρ has to be in Fn \ Ind(K), we take advantage of the tree structure: if a path goes down a given branch, it has to use a link to come back to an element higher up in the tree. There are no links from elements in I \ Fn , so ρ must only visit nodes in Fn \ Ind(K); otherwise ρ would get stuck in I \ Fn and have its endpoint there. Let us approach the induction proof. The induction base for m = 0 is almost immediate. A path without links in (Fn ∪ I) \ Ind(K) is also a path in I \ Ind(K), which is `-ranged, as shown in Lemma 3. Assume the inductive hypothesis holds for m. Suppose that for a transitive role s there is an s-path π in (Fn ∪ I) \ Ind(K) of length l + 1 from a to b, such Data complexity of FOMQA in DLs with transitive roles 11 that it uses at most m + 1 links. If π uses fewer than m + 1 links, we simply use the inductive hypothesis; thus we can assume π uses exactly m + 1 links. As m ≥ 0, there is at least one link in π. Let (d, e) be the last link in π, with the witnessing element e0 in I \ Ind(K). Note that the node containing e0 is not an ancestor of the node containing a in the clique-tree, as otherwise a would lie in a subtree rooted in e0 and would be removed while constructing Fn . Let πa,d and πe,b denote the parts of π from a to d and from e to b. The path πe,b is of length at most `. We have chosen n ≥ `, so the n-neighbourhood of e0 in I \ Ind(K) contains a path πe0 0 ,b0 isomorphic to πe,b . The path consisting of πa,d , the edge from d to e0 and πe0 ,b0 is of length ` + 1, but uses only m links, so there is an s-path ρ in (Fn ∪ I) \ Ind(K) from a to b0 of length at most `. Due to the construction of Fn , each path to e0 starting in Fn \ Ind(K) must visit d. Let ρa,d and ρe0 ,b0 denote the parts of ρ from a to d and from e0 to b0 . Note that ρe0 ,b0 lies in I \ Ind(K), so it does not use any links. At least one of the following inequalities holds: |ρa,d | < |πa,d | or |ρe0 ,b0 | < |πe,b |. If the first one holds, we can take the path consisting of ρa,d , the edge from d to e and of πe,b as an s-path from a to b of length at most `. If the second inequality holds, we use the fact that ρe0 ,b0 lies in the n-neighbourhood of e0 in I \ Ind(K) and find an isomorphic path ρ0e,b from e to b. The path consisting of πa,d , an edge from d to e and of ρ0e,b is an s-path from a to b of length at most `. This proves the inductive hypothesis for m + 1. t u 4m 3m Recall that p = 22·3 |Q||T | and l∗ = 8p + 3. For n ≥ 4p, Fn is `∗ -ranged by Lemmas 4 and 5. The size of the expanded query Q∗ is bounded by k · `∗ . Furthermore, Fn |6 = Q∗ for n ≥ k · `∗ by Theorem 1. We conclude that Fn∗ |6 = Q for n ≥ k · `∗ by Lemma 2. We can now approach the estimation of the size of Fn . The interpretation I is a counter-example, so in its clique-forest, each node is of size at most |CN(K)| and each element is connected to at most |CI(K)| children nodes. Children nodes of individuals in I form the roots of clique-trees in I 0 , so the number of trees in In0 is bounded by |Ind(K)|·|CI(K)|. As described before, the degree of elements in I 0 is bounded by D = |CN(K)|+|CI(K)|, so the neighbourhoods in I 0 of radius m consist of at most O(Dm ) elements. It is enough to use C = O(D2n ) + |Nom(K)| colours to obtain a proper n-colouring of I 0 by Lemma 1. There are 2|CN(K)| unary types, so there are L = 2|CN(K)| · C ways of labelling a single element with its unary type and colour. The neighbourhoods of radius n are of size n O(Dn ), which means there are N = LO(D ) different neighbourhoods in In0 . On a downward path in Fn , we cannot come across the same neighbourhood centred in an element node or a clique origin, as otherwise a link upwards would be created higher up on this path; thus N bounds the depth of clique-trees in Fn . The degree of the clique-forest of In0 is bounded by D0 = |CN(K)| · |CI(K)|, so the number of nodes in each tree is bounded by (D0 )N . The overall size of Fn is bounded by M ·S(Q, T ), where M = |Ind(K)|·|CI(K)| bounds the number of clique-trees in I 0 and S(Q, T ) is a bound on the size of a single tree. The size estimation of a single tree is independent of the ABox size; it is only governed by the TBox and query. This finalizes the proof of Theorem 4. 12 J. Kuklis Corollary 1. Finite entailment of UCQs by SOI and SOF KBs is coNP-complete with respect to data complexity. Proof. If we fix the sizes of the TBox and query, then S(Q, T ) is fixed and the size of Fn is linear with respect to the ABox size. We consider the following non-deterministic approach to disproving query entailment: we simply guess a transitively closed counter-model of size at most M · S(Q, T ) and verify all the required properties. First, we need to check that it is indeed transitively closed. This can be done in polynomial time with respect to the interpretation size. Next, we need to verify ABox and TBox constraints; as the interpretation is transitively closed, for each element each constraint can be verified by looking at that element’s immediate neighbours. Finally, we need to check that the query is not satisfied. This comes down to verifying a first-order logic formula, which can be done in time polynomial of |Ind(K)|, where the degree of the polynomial depends only on the sizes of the TBox and query. Altogether, this constitutes a coNP algorithm for deciding query entailment for fixed-size TBox and query. We have shown that finite query entailment is in coNP with respect to data complexity. It is known that (unrestricted) query entailment is coNP-hard (with respect to data complexity) in ALE [13]. By finite controllability, so is finite query entailment. Since ALE is a fragment of both SOI and SOF, NP-hardness carries over. t u 6 Conclusions By enhancing the approach to finite query entailment presented in [6], we were able to show the linear-size counter-model property for SOI and SOF (for fixed TBox and query). This let us show that finite query entailment is in coNP with respect to data complexity for these DLs. Basing on previous work, we deduced that it is a coNP-complete problem. To improve the final size estimations, we also established a stronger version of the Lifting Theorem. The techniques used to construct the final counter-models were not efficient enough to provide asymptotically tight size estimations for arbitrary (not fixed) TBoxes and queries. One limiting factor is the doubly-exponential bound on the range of the considered counter-models, stemming from the automata state- space size. It would be interesting to see if this could be improved. Another natural question is whether our techniques are applicable to SIF. Acknowledgements. The author was supported by Poland’s National Science Centre grant 2018/30/E/ST6/00042. The whole work has been prepared with huge help from Filip Murlak, the supervisor of the author’s Master thesis. We thank the anonymous reviewers of DL 2020 for their helpful comments. References 1. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proc. Int. J. C. on Artif. Int. pp. 364–369 (2005) Data complexity of FOMQA in DLs with transitive roles 13 2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation and Applications. Cam- bridge University Press, New York, NY, USA, 2nd edn. (2010) 3. Bienvenu, M., Ortiz, M.: Ontology-mediated query answering with data-tractable description logics. In: Reasoning Web. LNCS, vol. 9203, pp. 218–307. Springer (2015) 4. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., , Riccardo: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Autom. Reasoning 39(3), 385–429 (2007) 5. Glimm, B., Lutz, C., Horrocks, I., Sattler, U.: Conjunctive query answering for the description logic SHIQ. J. Artif. Intell. Res. 31, 157–204 (2008) 6. Gogacz, T., Ibáñez-Garcı́a, Y.A., Murlak, F.: Finite query answering in expressive description logics with transitive roles. In: Proc. KR 2018. pp. 369–378 (2018) 7. Gogacz, T., Marcinkowski, J.: On the BDD/FC conjecture. In: PODS. pp. 127–138 (2013). https://doi.org/10.1145/2463664.2463668 8. Hustadt, U., Motik, B., Sattler, U.: Reasoning in description logics by a reduction to disjunctive datalog. J. Autom. Reasoning 39(3), 351–384 (2007) 9. Lutz, C., Wolter, F.: The data complexity of description logic ontologies. Log. Met. in Comp. Sci. 13(4:7), 1–46 (2017) 10. Ortiz, M., Calvanese, D., Eiter, T.: Characterizing data complexity for conjunctive query answering in expressive description logics. In: Proc. Nat. C. on Artif. Int. pp. 275–280. AAAI Press (2006) 11. Rudolph, S.: Undecidability results for database-inspired reasoning problems in very expressive description logics. In: KR. pp. 247–257. AAAI Press (2016) 12. Rudolph, S., Glimm, B.: Nominals, inverses, counting, and conjunctive queries or: Why infinity is your friend! J. Artif. Intell. Res. 39, 429–481 (2010) 13. Schaerf, A.: On the complexity of the instance checking problem in concept lan- guages with existential quantification. J. of Intel. Inf. Systems 2, 265–278 (1993) 14 J. Kuklis A Proof of Lemma 1 The following proof is taken from [6], where the lemma was formulated with no mention of the number of required colours. Let n ≥ 0. Because I \ Nom(K) has bounded degree, 2n-neighbourhoods in I have size bounded by some m. We colour the elements of I one by one, with m colours. Pick an uncoloured element I d. At most m − 1 colours are already used in N2n (d). Assign to d any colour that I is not yet used in N2n (d). This procedure gives an n-proper colouring. Indeed, consider different e, e0 from NnI (d) for some d ∈ I. Without loss of generality we can assume that e was coloured before e0 . But e belongs to N2n I (e0 ), so the 0 colours of e and e are different by construction. B Lifting Theorem B.1 Formulation Definition 10. For i ≤ n, let Ini be the interpretation obtained from In by including into the interpretation of each role r all i-links along r; that is, for i every role r and every i-link (d, e) along r, (d, e) ∈ rIn . Notice that for i < j, each j-link is also an i-link. Note also that (d, e) is an i-link along role r if and only if (e, d) is an i-link along r− . We have In ⊆ Inn ⊆ Inn−1 ⊆ · · · ⊆ In1 ⊆ In0 , but the domains of all these interpretations coincide. We keep referring to the edges present in Ini but not in In as i-links, even though they are ordinary edges now. Definition 11. We say that functions f, f 0 : X → ∆I are d-indistinguishable if for all x ∈ X: NdI (f (x)) ' NdI (f 0 (x)). Throughout the paper, we consider neighbourhoods with respect to Const = Nom(K), however, all the definitions reliant on the notion of neigh- bourhoods are correct for every choice of the set of constants Const. Let us note that for every pair of d-indistinguishable functions f, f 0 , if for some x ∈ X and c ∈ Const either f (x) = c or f 0 (x) = c, then f (x) = f 0 (x). This is because whenever we consider any neighbourhoods, we add a nominal-like concept for each constant. Theorem 5. Let P be a CQ with at most k binary atoms and let n ≥ k. For each homomorphism h : P → Inn there exists a homomorphism h0 : P → In which is (n − k)-indistinguishable from h. Data complexity of FOMQA in DLs with transitive roles 15 Theorem 5 holds for any interpretation I of any SOIF KB. It was originally stated in [6] for n ≥ k 2 and regarded a pair of (n − k 2 )-indistinguishable homo- morphisms. Furthermore, it was proved only for Const = Nom(K), and here we consider an arbitrary choice of the set of constants Const. We prove the enhanced version of the theorem in the next subsection; let us now see that it implies the Lifting Theorem (Theorem 1). Proof (Lifting Theorem). Interpretation Jn we obtain in the process satisfies Jn ⊆ Inn . Consequently, if there was a homomorphism h : P → Jn ⊆ Inn for some CQ P constituting Q, Theorem 5 would yield a homomorphism h0 : P → In , contradicting I |6 = Q. t u B.2 Proof of Theorem 5 We start with the following observation. Lemma 6. All links involving elements of Const are ordinary edges. Proof. Assume (d, e) is an i-link for d in Const (the symmetrical case is analo- gous). Assume that d has a successor f such that NiIn (f ) ' NiIn (e). Then, the edge (d, e) must also exist in NiIn (e), which shows that (d, e) is an ordinary edge. Assume now that e has a predecessor f such that NiIn (f ) ' NiIn (d). Then, f must be equal to d, due to the concepts used to mark constants in the definition of a neighbourhood, so (d, e) is an ordinary edge. t u Let h(P ) denote the subinterpretation of Inn obtained by restricting the do- main to h(var(P )), and only keeping in each role r edges (h(x), h(y)) such that r(x, y) is an atom from P . We say that h uses an r-edge of Inn if this r-edge is present in h(P ). We say that h uses a link over r of Inn if this link is an edge of h(P ), but not an edge of In . Definition 12. By J 0 ∩ J 00 we mean the interpretation J such that ∆J = 0 00 0 00 0 00 ∆J ∩ ∆J , AJ = AJ ∩ AJ for all concept names A, and rJ = rJ ∩ rJ for all role names r. Let us analyze the structure of h(P ). Let In0 = In \Const and let C denote the set of connected components in h(P )∩In0 , understood as graphs. Each component in C contains at most k edges, as P contains at most k binary atoms. We shall now prove that h is far-linked ; that is, h does not use links that join two elements within the same connected component in C. This is a general property of such homomorphisms, which stems from the fact that each link connects elements that are far apart, as expressed in the following lemma. Lemma 7. Elements forming a link are at distance at least 2n in In0 . Proof. Let (a, b) be a link along r. Suppose that the distance between a and b in In0 is smaller than 2n. Take the shortest path π in In0 between a and b; π is of length at most 2n − 1. As (a, b) is a link, either a has an r-successor f in 16 J. Kuklis In0 with the same colour as b, or b has an r-predecessor f in In0 with the same colour as a. In the first case, there is an element on π, which is at distance at most n from both b and f , in the second case, there is an element on π, which is at distance at most n from both a and f . Since a, b, f share the same colour, are pairwise unequal and are in an n-neighbourhood of another element, we obtain a contradiction. Thus, elements forming a link must be at distance at least 2n in In0 . t u Lemma 8. Each homomorphism h00 : P → Ini is far-linked. Proof. Let r(x, y) be an atom of the query P such that h00 (x) and h00 (y) are in the same connected component C ∈ h00 (P ) ∩ In0 . As P has at most k binary atoms, elements in C are connected with at most k edges. Since h00 (x) and h00 (y) are both in C, they at distance at most k < 2n in In0 (unless k = 0; we verify this case separately). We use Lemma 7 to deduce that (h00 (x), h00 (y)) is an r-edge in In0 . t u We see that all the links used by h join elements from different connected components in C. We will construct a series of homomorphisms (h0 , h1 , ..., hl ), n with l ≤ k, hj : P → In j , nj < nj−1 for each j > 0, starting with h0 = h and n0 = n, satisfying the following constraints. First, we require that hj+1 uses strictly fewer links than hj . Second, we require that hj+1 and hj are (nj+1 )- indistinguishable. Third, we require that hl uses no links and that nl ≥ n − k. If we manage to satisfy these constraints, hl treated as h0 : P → In is the homomorphism postulated in the theorem statement. Let Cj denote the connected components of hj (P )∩In0 . We keep the invariant that (Cj ∩ C) is non-empty for j < l. We construct the components graph of Cj : the graph with elements of Cj as vertices and edges between vertices representing connected components from Cj that are joined by a link between their elements. The second invariant is that for j > 0, there is exactly one connected component in Cj that is not present in Cj−1 ; let Dj0 denote that component. If j > 0 and there is a component Dj ∈ (Cj ∩ C) joined by a link to Dj0 , set Cj = Dj and Cj0 = Dj0 . Otherwise, either there are Cj , Cj0 ∈ (Cj ∩ C) joined by a link, or hj uses no links and we finish the construction. In case Cj , Cj0 exist, homomorphism hj+1 is obtained from hj by pulling Cj closer to Cj0 by the link that joins these components; as a result, Cj , Cj0 are combined into one component, namely Dj+1 0 . The details of that construction are presented in the next subsection. We will later show that hj+1 uses links for a subset of these binary atoms of P , for which hj uses links and which do not join Cj with Cj0 ; the links joining Cj and Cj0 in hj are replaced with regular edges of In in hj+1 . We compensate for eliminating these links by paying a certain cost: homomorphism hj+1 is less strict than hj when it comes to preserving the neighbourhoods of the elements joined with a link, which is why we have to set nj+1 = nj − |Cj |. We will now prove that nj ≥ n − k for all j. This will entail that h0 is indeed (n−k)-indistinguishable from h. We use the following invariant: altogether there are at least n − nj edges in connected components from (Cj \ C). This can be proved by induction on j. The base step for j = 0 reduces to the statement that Data complexity of FOMQA in DLs with transitive roles 17 an empty graph has no edges. In the inductive step, we pull the component Cj towards Cj0 , merging the two within a larger connected component. We have that 0 Cj ∈ Cj is in C and the new connected component Dj+1 ∈ Cj+1 is not in C. This means that the number of edges in connected components from Cj+1 \ C must be larger than the number of edges in connected components from Cj \ C. The total increase is equal to at least |Cj |; this stems from incorporating |Cj | − 1 edges within Cj , and from reducing at least one link. Using the inductive hypothesis, we get that there are at least n − nj edges in connected components from Cj \ C, and at least n − nj + |Cj | edges in connected components from Cj+1 \ C. As nj+1 = nj − |Cj |, we get n − nj + |Cj | = n − nj+1 , which proves the inductive step. If there was j such that nj < n − k, the number of edges in connected components from (Cj \ C) would be at least n − nj ; as n − nj > n − (n − k) = k, this would contradict the fact that hj (P ) has at most k edges. B.3 Definition of hj+1 In this subsection, we mark the names of interpretation elements with bold colour fonts. Whenever we mention distance between elements, we refer to distance in In0 . We are going to define hj+1 based on hj and Cj , Cj0 ∈ Cj . Let (d, e) be a link along s used by hj and joining Cj with Cj0 . We only consider the case when d ∈ Cj0 and e ∈ Cj ; the proof for the other case is symmetrical. Either d has an s-successor e0 such that NnIjn (e) ' NnIjn (e0 ), or e has an s-predecessor d0 such that NnIjn (d) ' NnIjn (d0 ); once again we take advantage of almost full symmetry and only present the proof for the first case. Let g : NnIjn (e) → NnIjn (e0 ) be the witnessing isomorphism for the (d, e) link. n We define hj+1 : P → In j+1 for nj+1 = nj − |Cj | as follows: for each z ∈ var(P ), let hj+1 (z) = g(hj (z)) if hj (z) ∈ Cj , and hj+1 (z) = hj (z) otherwise. To ensure that the definition of hj+1 is correct, we need to show that nj+1 ≥ 0. Altogether, components in C have at most n edges (more precisely, at most k ≤ n edges), and we have proven an invariant that there are at least n−nj edges in connected components from (Cj \ C). As Cj is in (Cj ∩ C), and (d, e) is a link, Cj can have at most n − 1 − (n − nj ) = nj − 1 edges. Thus, nj+1 = nj − |Cj | ≥ 0. By definition, hj and hj+1 are (nj+1 )-indistinguishable: NnIjn (e) ' NnIjn (e0 ) and nj+1 = nj − |Cj |, so g preserves (nj+1 )-neighbourhoods of elements within distance |Cj | from e. We need to verify that hj+1 is indeed a homomorphism. Let r(x, y) be an atom of the query P and let p = hj (x), q = hj (y), p0 = hj+1 (x), q0 = hj+1 (y). We have to consider the following cases: 1. p, q ∈ / Cj , 2. p, q ∈ Cj , 3. exactly one of p, q is in Cj ; we take advantage of the symmetry and only consider the case p ∈ / Cj , q ∈ Cj , with the following subcases: (a) p ∈ Cj0 , (b) p ∈ Cj00 for Cj00 ∈ Cj such that Cj00 6= Cj0 . (c) p ∈ Const, 18 J. Kuklis Case 1. By definition, (p0 , q0 ) = (p, q). As hj is a homomorphism into nj n n In ⊆ In j+1 , we have that (p0 , q0 ) is an r-edge in In j+1 . Clearly, hj+1 uses no new links for such atoms. Case 2. By definition, (p0 , q0 ) = (g(p), g(q)). We assumed that p and q are both in Cj , and connected components in Cj contain at most k − 1 edges (we subtract one, because (d, e) is a link). This means p, q are at distance at most k − 1 < 2n, and thus (p, q) is an r-edge in In by Lemma 7. As g(p) and g(q) are 0 both in Dj+1 ∈ Cj+1 , they are at distance at most k. This means that (p0 , q0 ) is an r-edge in In . Case 3. By definition, (p0 , q0 ) = (p, g(q)). As g preserves (nj+1 )-neighbourhoods of elements within distance |Cj | from e and the distance between q and e is at most |Cj |, we see that NnIj+1 n (q) ' NnIj+1n (g(q)). Let us analyze the three sub- cases. Subcase 3 (a). As p ∈ Cj0 and q ∈ Cj , and Cj0 and Cj are different connected components in Cj , we have that (p, q) is a link along r. Since Cj0 and Cj are 0 0 merged together into Dj+1 , both p and g(q) are in Dj+1 . This means they are at distance at most k from each other, which allows us to use Lemma 7 to deduce that (p0 , q0 ) is an r-edge in In . Subcase 3 (b). Like above, p and q are in different connected components from Cj , so it must be that (p, q) is a link along r. We need a different line of reasoning than in the previous case, since we cannot assume Cj and Cj00 are merged together. Since (p, q) is an nj -link, it is also an (nj+1 )-link. We have two subcases to consider. Suppose that p has an r-successor f in In0 such that NnIj+1n (f) ' NnIj+1 n (q) ' In Nnj+1 (g(q)). We see that then (p, g(q)) is an (nj+1 )-link. If f = g(q)), then (p0 , q0 ) is an r-edge in In . Furthermore, the connected component Cj00 of p merges with Cj0 ; all three components Cj , Cj0 and Cj00 form subgraphs of Dj+1 0 . As ele- 0 ments in Dj+1 are at distance at most k from each other, there can be no links between them by Lemma 7. This means that all the links between Cj00 and Cj used by hj become regular edges in hj+1 . We also see that hj cannot use any links between Cj00 and Cj0 , as elements of these components are at distance at most k from each other in In0 . The other possibility is that f 6= g(q)), in which case (p0 , q0 ) is a link along r. By analogous reasoning, this also entails that hj+1 uses links for all the links used by hj which join Cj and the connected component of f. Suppose now that q has an r-predecessor f in In0 such that NnIj+1 n (f) ' In 0 0 Nnj+1 (p). We prove in the same way as above that either (p , q ) is an r-edge in In0 and the connected component of p and Cj0 merge, or (p0 , q0 ) is a link along r. Data complexity of FOMQA in DLs with transitive roles 19 Subcase 3 (c). (p, q) is an r-edge in In by Lemma 6. As NnIj+1 n (q) ' Nnj+1 (g(q)), (p, g(q)) must be an r-edge in Nnj+1 (g(q)). Thus, (p , q0 ) is an In In 0 r-edge in In . This proves that hj+1 is a homomorphism, which uses links for a subset of these binary atoms of P , for which hj uses links and which do not join Cj with Cj0 . C Proof of Theorem 3 C.1 Construction C.1.1 Outline The authors of [6] present a product of tree automata that independently verify the following acceptance conditions: – consistency of the forest encoding, – satisfying the constraints from the knowledge base K∗ = (T , A), – not satisfying the query Q, – path safety. We enhance their construction by adapting these components and adding ver- ification of individual- and nominal-completeness. Our automaton is parametrized by a fixed subinterpretation M induced by individuals (note that it is not the case in [6]). We describe the construction with SOI in mind, but it adapts easily to SOF. The only differences are that for SOF, we do not validate parts referring to inverse roles, and we add another component verifying functionality declarations, which does not affect our size estimations. This component is decribed in the last subsection. C.1.2 Preprocessing The interpretations we work with are preprocessed to cater for transitivity: for an interpretation I, if I |= K, then also I ∗ |= K. This is achieved by supplementing the interpretation with auxiliary concepts expressing universal restrictions on transitive roles, as described under Definition 5: for each pair of B ∈ CN(K) and a transitive role r, we add a fresh concept name Pr,B , axiomatised as Pr,B v ∀r.(B u Pr,B ), and replace each CI of the form A v ∀r.B with A v Pr,B . The information about the edges to and from nominals is also stored in auxiliary concepts: for each a ∈ Nom(K) and r ∈ Rol(K) fresh concept names Nr,a and Nr− ,a are added to the knowledge base, axiomatised as Nr,a ≡ ∃r.{a} and {a} ≡ ∀r.Nr− ,a . These concepts do not substantially increase the size of the automaton. We aim to accept only these counter-examples which are complete. First, we tackle nominal-completeness. For each a ∈ Nom(K) and r ∈ TRol(K) we add the following concept inclusions to TBox: N̄r,a v ∀r.N̄r,a and Nr− ,a v ∀r.Nr− ,a . 20 J. Kuklis This ensures that whenever for a transitive role r there is no edge r from some element to a nominal (or vice versa), then there is also no r-path from that element to the nominal. Recall that we assume the completeness of M (the subinterpretation induced by individuals). Consequently, any path along a transitive role between two individuals not connected with a single edge must enter some clique-tree and lead through a nominal. Additional CIs described in the last paragraph ensure that whenever there is a path from an individual to a nominal in the clique-tree associated with that individual, these elements are also connected with a single edge. If there was a path breaking individual-completeness of the whole forest, it would entail that also nominal-completeness does not hold. Thus, individual- completeness follows from nominal-completeness. C.1.3 Automaton To make clique-forests accessible to automata, we encode them as finitely labelled forests. Let TRol(K) be the set of transitive roles from Rol(K), and let [X]≤k be the family of subsets of X of size at most k. In the encoding, nodes are labelled with elements of the alphabet   Σ = Tp(K) ∪ TRol(K) × [Tp(K)]≤|CN(K)| and edges are labelled with elements of the alphabet Γ = Tp(K) × Rol(K) × Tp(K) . We build an automaton parametrized by the subinterpretation M. Let I denote the interpretation whose clique-forest is encoded in the input forest. The automaton runs on the encoding of the clique-forest of I \ Ind(K) (not of I, like in [6]; this change is motivated by the way we modify the query component). We say that an individual a is the super-root of a clique-tree t in I \ Ind(K), if the root of t is a successor of a in I; we also say that t is super-rooted in a. For a complete counter-example I, we obtain the encoding of the clique-forest for I \ Ind(K) as follows. We fix some order on individuals and order the trees in I \ Ind(K) so that they are sorted with respect to the order on their super- roots. We add a separating tree stub (represented with a fresh symbol added to Σ) between contiguous blocks of trees super-rooted in the same individual; it is only used as a marker for the automaton so that it can recognize which trees are associated with a given individual. All the components of the automaton always accept these tree stubs. Next, we label each element node with the single unary type it realises and each clique node with its single nonempty role and the set of unary types it realises. We do not represent nominals explicitly in the encoding, but thanks to the initial preprocessing, all relevant information about them is contained in the unary types of the remaining elements. Finally, if in I \ Ind(K) there is an r-edge from an element of type τ in some parent node to an element of type σ in some child node, then in the encoding the edge from the parent node to the child node Data complexity of FOMQA in DLs with transitive roles 21 is labelled with (τ, r, σ). Because unary types do not repeat within cliques, this uniquely determines the endpoints. There are at most N = |Ind(K)| · |CI(K)| trees in the input forests, with branching bounded by D = |CN(K)| · |CI(K)|. Transition relation has the form δ ⊆ Q × Σ × (Γ × Q)≤D , where Q is the set of states. The automaton processes the forests top down. Whenever a node is reached after an edge labelled (τ, r, σ), we refer to the unique element of type τ in that node as the current element, and to the unique element of type σ in the parent node as the parent element. The initial states are specified for each tree separately: the automaton has a set I ⊆ Q≤N of sequences of initial states. A run is a labelling of the input forest with states in such a way that the sequence of states in the roots belongs to I, and if a node has state q, label α, and its children are connected via edges with labels β1 , β2 , . . . , βn and have states q1 , q2 , . . . , qn , then    q, α, (β1 , q1 ), . . . , (βn , qn ) ∈ δ . We use Büchi acceptance condition: we specify a set F ⊆ Q of marked states that need to be revisited, and consider a run accepting if on each branch marked states occur infinitely often. A forest is accepted by the automaton if there exists an accepting run over it. An automaton has trivial acceptance condition if F = Q. Then, each run is accepting but the automaton may still reject some forests, because there may be no run for them: a branch of the computation can get stuck if no transition is consistent with the current state, label and edge labels. An automaton is weak if on each branch of each run, once a marked state is visited, all subsequent states are marked. Notice that all automata with trivial acceptance condition are weak. Given a weak automaton and an arbitrary Büchi automaton it is particularly easy to construct an automaton recognising trees accepted by both input au- tomata: it suffices to take the standard (synchronous) product automaton and mark all states that contain a marked states on both coordinates. C.1.4 Automata components The final automaton is obtained as a product of automata verifying independently various parts of the condition. The first thing to check is the consistency of the encoding: if an edge has label (τ, r, σ), then τ must occur in the label of the parent node, and σ must occur in the label of the child node. To check this, it suffices to examine for each node the labels of all edges incident to it plus the label of the node itself. When a transition is made, all these are available except the label on the edge to the parent: it must be stored in the state. The automaton has |Γ | states and trivial acceptance condition. The second thing to check is that the clique-forest is a model of K∗ . If M does not satisfy the restrictions from the ABox A of K∗ , the final automaton rejects everything. 22 J. Kuklis To verify that the TBox is satisfied we need to check each CI. For CIs of the form u Ai v t B j i j we have a two-state automaton with trivial acceptance condition that simply tests that each type used in the encoding satisfies this CI. If the type of some individual violates this CI, the final automaton rejects everything. To verify CIs of the form A v ∀r.B for elements in I\Ind(K), it suffices to check that in the input interpretation there is no r-edge from an element whose unary type contains A to an element whose unary type contains B̄. This amounts to verifying that none of the following are used in the encoding: – node labels (r, T ) such that A ∈ τ ∈ T and B̄ ∈ σ ∈ T for some τ , σ; – edge labels (τ, r, σ) with A ∈ τ and B̄ ∈ σ; – edge labels (σ, r− , τ ) with A ∈ τ and B̄ ∈ σ; – unary types containing both A and Nr,b for some b such that b ∈ B̄ M ; – unary type containing both B̄ and Nr− ,b for some b such that b ∈ AM . These conditions simply disallow certain labels; they can be checked by a two- state automaton with a trivial acceptance condition. Let us take a CI of the form A v ∃r.B . For elements in I \ Ind(K), this condition can be tested in a similar way as above, except that one needs access to the label of the current node and all edges incident to it. Like for the initial consistency check, it suffices to store in the state the label of the edge to the parent. To ensure the existential restrictions are met for nominals, for each nominal a such that a ∈ AM , if that CI is not met in M, we construct a two-state weak automaton looking for a label that uses a type τ such that B ∈ τ and Nr− ,a ∈ τ . Note that this automaton has a non- trivial acceptance condition, but it is weak: as soon as it finds an appropriate label, it loops in a marked state. The initial lists of states for that weak consist of a single unmarked state and all the remaining initial states already marked. We also have to verify that all existential and universal restrictions are met for non-nominal individuals. The former are ensured by constraining initial lists of states so that all the existential restrictions for an individual not satisfied in M are satisfied by roots of clique-trees super-rooted in that individual. If the universal restrictions are not met in M, the final automaton rejects everything; otherwise, we ensure that the restrictions hold for non-individual elements con- nected to individuals by constraining initial lists of states so that the roots of clique-trees super-rooted in a given individual do not break universal restrictions imposed on that individual. Summing up, the total size of the state-space of the TBox component is not larger than 2|CI(K)| ·2|CI(K)| ·|Γ ||CI(K)| ·2|Nom(K)|·|CI(K)| . The first three factors stem Data complexity of FOMQA in DLs with transitive roles 23 from the sizes of automata used to verify on I \ Ind(K) all the CIs of the forms ui Ai v tj Bj , A v ∀r.B, and A v ∃r.B, respectively. The last factor stems from the verification of existential restrictions for nominals. Next, we verify that the query cannot be satisfied. We need an automaton recognizing counter-examples such that the number of states in the automaton does not depend on the number of individuals in K. This is achieved by mak- ing sure each tree in the forest is processed without explicitly referring to the knowledge about individuals. In [6], all the information about edges between individuals is transferred to the TBox. In particular, this means that the query parts are self-contained in single trees, as individuals are not connected (and the edges to nominals are hidden from the automata perspective). It lets the query component build partial query matchings independently for each tree, but at the cost of enlarging the TBox. Here, we cannot use the same approach: the number of states depends on the number of unary types in particular, which would be tied to the ABox size if we were to store the information about edges between individuals in concepts. Matchings of connected parts of the query are no longer contained in single trees: they might span over multiple trees. We have to accommodate for that by enhancing the query component. We start by replacing query Q with a query Q0 such that for each model I of K: I ∗ |= Q iff (I \ Nom(K))∗ |= Q0 . The query Q0 is obtained in two steps. In the first step, we account for the possibility that binary atoms of the form r(x, y) for transitive r can be satisfied with paths that cross different trees. Take such path π, and let a and b denote respectively the first and the last individual on π (possibly a = b). Thanks to the individual-completeness property, π can be reduced to a path consisting of: a path segment ending in a and contained in a single tree, a single edge from a to b (if a 6= b), and another path segment contained in a single tree. To detect such paths, for each CQ P constituting Q, we add to Q each CQ that can be obtained from P by subdividing some transitive atoms; that is, by replacing some atoms of the form r(x, y) either with atoms r(x, z) and r(z, y) for a fresh variable z, or with atoms r(x, z1 ), r(z1 , z2 ), r(z2 , y) for fresh variables z1 , z2 . In the second step, we account for the influence of nominals on the query satisfaction. For each CQ P of the modified Q, we add to Q each CQ that can be obtained from P by performing the following operation any number of times. Let tp(x) be the set of all A such that P contains A(x). Choose x ∈ var(P ) and a ∈ Nom(K) such that a ∈ AM whenever A ∈ tp(x). Drop all atoms of the form A(x) from P . Replace in P each atom of the form r(y, x) by Nr,a (y) and each atom of the form r(x, y) by Nr− ,a (y). The resulting query Q0 satisfies the condition I ∗ |= Q iff (I \Nom(K))∗ |= Q0 . After the first step, the number of CQs grows by the factor 3m and their size is at most 3m. After the second step, the number of CQs grows by the factor |Nom(K)|3m and their size is still at most 3m. Thus, the size of the resulting query is at most |Q| · 3m · |Nom(K)|3m , and its CQs have size at most 3m. 24 J. Kuklis For each CQ P of the preprocessed union of queries Q0 , we construct an automaton that ensures that I ∗ |6 = P . Its states consist of the parent edge label β and a set F representing all partial matchings of P to elements represented in the subtree rooted in the currently processed node. The set F consists of partial functions f : var(P ) → L, with L = {succ, other}, representing different ways variables can be assigned in relation to the parent element. Each such function represents a partial matching, that is a partial as- signment of variables such that the restricted query induced by these variables is satisfied (not all the collected partial matchings can potentially be extended to a total match; discussion on that part is presented in the next subsection). The identifier succ is used for each variable mapped to an element in the cur- rent subtree that is an r-successor of the parent element in I \ Ind(K). If r is non-transitive, this simply means the current element (the clique origin, if we are processing a clique node). If r is transitive, it means any element r-reachable from the current element. The identifier other is used for variables assigned to any other elements in the current subtree. The rationale behind the automaton is that it constructs all the possible partial matchings across the forest. The trick here is that the states reached before traversing a given node already contain the information about all the partial matchings that will be constructed in the subtree rooted at that node. The automaton will have at most one accepting run on each forest—the states in the initial list of states in that run will contain all the possible partial matchings across the forest. We define the transition relation in the next subsection. The definition will ensure the intended semantics of the states; when processing a given node, sets F in the automaton state will represent all partial matchings of P to elements represented in the subtree rooted in that node. All states are accepting. Apart from the partial matchings collected in trees rooted in children nodes of individuals, we also compute all partial matchings in M \ Nom(K) (the possibility that some variables are assigned to nominals has already been accounted for in the second step of the query preprocessing). As lists of initial states, we take all lists of states such that a total match over the whole forest cannot be constructed by fusing together a partial matching in M \ Nom(K) and one partial matching per initial state. We know how to stitch these partial matchings together, because we know the edges from super-roots to new clique-tree roots and the way we collect partial matchings in a clique- tree accounts for the label on that edge. This sums up the query automaton construction. This extra information about the edges from super-roots motivates the approach of traversing clique forests built of trees rooted in children nodes of individuals, rather than trees rooted in individuals. The last component of the automaton checks that the input interpretation is safe. Observe that it is unsafe if in the input clique-forest there is a branch with consecutive node and edge labels α1 β1 α2 β2 . . . such that for some transitive r and all i large enough, βi = (τi , r, σi ) and either σi = τi+1 (consecutive edges are incident in the clique-forest) or αi+i = (r, Ti ) (consecutive edges are incident Data complexity of FOMQA in DLs with transitive roles 25 with an r-clique). An automaton can easily check that there is no such branch. Each time it sees a transitive role it moves to an unmarked state, storing the role. It moves to a marked state as soon as the condition above is broken. The automaton has |TRol(K)| states. The automaton recognising safe counter-examples can be obtained from these components by the simple product construction described above, because only the last component is not weak. C.2 Transition relation of the query component Below we describe the set of transitions of the query satisfaction component of the product automaton. A transition ((β, F ), τ, ((βi , (βi , Fi ))ni=1 ) over an element node of unary type τ , with β = (σ, r, τ ) and βi = (τ, ri , τi ), is a transition of the automaton if F is the set of all functions f : var(P ) → L which are both upward- and downward-compatibile with some witnessing functions f1 ∈ F1 , f2 ∈ F2 , ..., fn ∈ Fn ; we define these conditions below. Fix a function f : var(P ) → L and a sequence of domain-disjoint functions (f1 , f2 , ..., fn ), with fi : var(P ) → L. f is upward-compatibile with functions fi if for each variable x ∈ var(P ): – if ∃i : fi (x) = other, then f (x) = other, – if ∃i : fi (x) = succ, then f (x) = succ if r is transitive and ri = r, otherwise f (x) = other, – otherwise either x ∈ / dom(f ), or f (x) = succ and tp(x) ⊆ τ , which we interpret as S matching x to the current element. In the latter case x is in (dom(f ) \ dom(fi )). f is downward-compatibile with functions fi if for each atom s(x, y) of P : S – if x ∈ (dom(f ) \S dom(fi )) and y ∈ dom(fi ), then ri = s and fi (y) = succ, – if y ∈ (dom(f ) \ dom(fi )) and x ∈ dom(fi ), then ri = s− and fi (x) = succ, – if x ∈ dom(fi ) and y ∈ dom(fj ) for i 6= j, then s is transitive, ri = rj− = s, and fi (x) = fj (y) = succ. Note that we only seem to notice binary atoms for which both variables are already present in the subtree. When a variable gets the identifier other and there are unsatisfied binary atoms containing that variable, we know that such a matching cannot be extended to a total matching. We still gather such matchings in the automaton, which might be unintuitive. For clarity, we can forget such unextendable matchings in the automaton runs by adding the following condition to the downward-compatibility definition: – if fi (x) = other or fi (y) = other, then x, y ∈ dom(fi ). 26 J. Kuklis The transitions over clique-nodes are described using analogous compatibility concepts, adapted to the clique-node setting. A transition ((β, F ), (r0 , T ), ((βi , (βi , Fi ))ni=1 ) over a clique node encoded by a role r0 and a set of unary types T , with β = (σ, r, τ ) and βi = (σi , ri , τi ), is a transition of the automaton if F is the set of all functions f : var(P ) → L which are both upward- and downward-compatibile with some witnessing functions f1 ∈ F1 , f2 ∈ F2 , ..., fn ∈ Fn ; we define these conditions below. Fix a function f : var(P ) → L and a sequence of domain-disjoint functions (f1 , f2 , ..., fn ), with fi : var(P ) → L. f is upward-compatibile with functions fi if for each variable x ∈ var(P ): – if ∃i : fi (x) = other, then f (x) = other, – if ∃i : fi (x) = succ, then f (x) = succ if r is transitive, ri = r, and either σi = τ or r0 = r, otherwise f (x) = other, – otherwise one of the following holds: • x∈ / dom(f ), • f (x) = succ and tp(x) ⊆ τ , which we interpret as matching x to the current element, • f (x) = other and tp(x) ⊆ τ 0 ∈ T \ {τ }, which we interpret as matching x to an element in the current clique-node. f is downward-compatibile with functions fi if for each atom s(x, y) of P : S – if x ∈ (dom(f ) \ dom(fi )) and y ∈ dom(fi ), then ri = s, fi (y) = succ, and either tp(x) ⊆ σi ,Sor s is transitive and r0 = s, – if y ∈ (dom(f ) \ dom(fi )) and x ∈ dom(fi ), then ri = s− , fi (x) = succ, and either tp(y) ⊆ σi , or s is transitive and r0 = s, – if x ∈ dom(fi ) and y ∈ dom(fj ) for i 6= j, then s is transitive, ri = rj− = s, fi (x) = fj (y) = succ, and either σi = σj or r0 = s. Once again, we can enhance the downward-compatibility definition to get rid of unextendable partial matchings; it is enough to add the same condition as before. To see that this transition relation ensures the intended semantics of the states one needs to argue that each partial matching is accurately represented. This can be done by induction on the size of the image. For size one, the match- ing will be accounted for based solely on the labels. For larger images, use the inductive hypothesis for restrictions of the match to variables mapped to the trees rooted at the children of the current node. C.3 Size estimation Let us approach the size estimation of the automaton state space. We start by analyzing the factors that contribute to the size of automata components. Note that k and m denote, respectively, the maximum number of binary atoms and the maximum size of a single conjunctive query in the union of CQs Q. Data complexity of FOMQA in DLs with transitive roles 27 Measured entity Notated as Size estimation |CN(K)|+|Nom(K)|·|TRol(K)| number of unary types |Tp(K)| 2 size of node alphabet |Σ| O(|TRol(K)| · |Tp(K)||CN(K)| ) size of edge alphabet |Γ | |Tp(K)|2 · |TRol(K)| number of trees in a forest N |Ind(K)| · |CI(K)| clique-trees branching D |CI(K)| · |CN(K)| size of preprocessed query |Q0 | |Q| · 3m · |Nom(K)|3m The bound of the size of the unary types set is not just 2|CN(K)| due to the encoding of neighbouring nominals, which requires O(|Nom(K)| · |Rol(K)|) new concepts and concept inclusions. The size estimation of the automata components is presented below. Component States count upper bound consistency of encoding |Γ | TBox constraints: - CIs of form: uAi v tBj 2|CI(K)| - CIs of form: A v ∀r.B 2|CI(K)| |CI(K)| - CIs of form: A v ∃r.B |Γ | · 2|Nom(K)|·|CI(K)| path safeness |TRol(K)| 3m 0 query unsatisfaction 23 ·|Q | By taking a product of the sizes of automata components, we bound the 4m 3m number of states of the product automaton by 22·3 ·|Q|·|T | , where |T | = |CI(K)| + |CN(K)| + |Nom(K)| · |TRol(K)|. The factor 2 in the exponent is in- troduced to eliminate summands of lower order. The query component makes the size of the state-space of the product automaton double exponential. This finalizes the proof of Theorem 3. As a side note, an automaton with k states has total size O(k ·|Σ|·(k ·|Γ |)N + O(m) k ), which in our case can be expanded to 2|Q|·|K| N . C.4 Functionality component for SOF We include an additional component for each functionality declaration Fn(r). To check functionality of r for ordinary nodes it suffices to examine the label of the node and the labels on all incident edges, which only requires storing in the state the label of the edge to the parent. Additionally, for all a ∈ Nom(K): if a ∈ (Nr,b )M and a ∈ (Nr,b0 )M for some b 6= b0 , the final automaton rejects everything; if a ∈ (Nr,b )M , the automaton checks that no type used in the input / (Nr,b )M for each b ∈ Nom(K), the automaton checks forest contains Nr− ,a ; if a ∈ that a type with Nr− ,a occurs at most once in the input forest. The total number of states in the described component is |Γ ||CI(K)| · 2|Nom(K)|·|CI(K)| , so including it does not affect the overall upper bound. 28 J. Kuklis D Proof of Lemma 2 For each transitive role s, all the s-atoms in Q are replaced with disjunctions of s-chains of lengths from 1 to ` in Q∗ . If such an atom can be satisfied in J ∗ for elements a and b, then there is an s-path from a to b of length at most `. This shows that if J |= Q∗ , then J ∗ |= Q. On the other hand, if J ∗ |= Q, then each transitive role atom must be satisfied by two endpoints of some s-path from J , of length at most `, which satisfies the subquery replacing this atom in Q∗ . This proves the converse implication.