Data complexity of finite query entailment in description logics with transitive roles

Data complexity of finite query entailment in description logics with transitive roles JakubKuklis jk.kuklis@gmail.com University of Warsaw

02-097 Warszawa Poland

Data complexity of finite query entailment in description logics with transitive roles A7987A6F77BD0F64D4536DDA86F441C5 GROBID - A machine learning software for extracting information from scholarly documents FOMQA Data complexity Counter-models SOI SOF

We study the problem of finite ontology-mediated query answering (FOMQA), the variant of OMQA where the represented world is assumed to be finite, and thus only finite models of the ontology are considered. We adapt the setting of unions of conjunctive queries and ontologies expressed in description logics with transitive role declarations. FOMQA is already undecidable for SHOIF, but was proved to be 2Exp-complete for SOI, SOF and SIF. The proof relies on the fact that any finite counter-model can be transformed to a counter-model of certain regularity. By adjusting this transformation, we are able to provide an upper bound on the size of the smallest finite counter-model for SOI and SOF ; the bound is linear in the size of the ABox. This lets us show that for these logics FOMQA is in coNP with respect to data complexity; as coNP-hardness follows from previous results on weaker logics, we deduce coNP-completeness. 1

Introduction

Evaluating queries in the presence of background knowledge has been extensively studied in several communities. A particularly prominent take on this problem is ontology mediated query answering (OMQA) where background knowledge represented by an ontology is leveraged to infer more complete answers to queries [3]. A widely accepted family of ontology languages with varying expressive power is offered by Description Logics (DLs) [2], while the most commonly studied query language is that of (unions of) conjunctive queries.

Identifying ontology languages, for which query evaluation scales to large amounts of instance data, is one of the key research goals, as data tends to dominate in size both the ontology used and the evaluated queries. An early reference on data complexity in DLs, that is, complexity in the settings with fixed TBox and query, is showing coNP-hardness of instance queries in ALE [13]. In the much more expressive DL SHIQ, a coNP upper bound was obtained for instance queries [8] and conjunctive queries without transitive roles [10], and later generalized to any CQs [5]. The classical approach to avoiding intractability is to replace ALC and SHIQ with less expressive Horn DLs. These often admit query evaluation in PTime, examples include a variety of logics from the EL [1] and DL-Lite families [4], as well as Horn-SHIQ [8]. An alternative approach is to study complexity with respect to fixed TBoxes only [9].

Often, the intended models of the ontology are finite and this additional assumption allows to infer more properties: finite ontology mediated query answering (FOMQA) is the variant of OMQA restricted to finite models. For some logics, like ALC, the finite variant and the unrestricted variant of the problem coincide; we then say that OMQA is finitely controllable. Studying FOMQA is interesting in settings lacking finite controllability. This is the case not only for DLs lacking the finite model property (e.g., DLs allowing both inverse roles and number restrictions), but also for logics allowing transitive role declarations, like logics from the S family. FOMQA is undecidable for SHOIF ontologies [11], but recently its decidability has been shown for three proper SOIF fragments: SOI, SOF, and SIF [6]. For all three fragments, the problem turned out to be 2Exp-complete in terms of combined complexity, but its data complexity was not studied. The present paper aims to fill this gap, thus initiating the study of the data complexity of FOMQA.

In [6], FOMQA is reduced to query entailment over a certain class of tree-like models recognisable by tree automata. Here, we adjust and adapt this approach to obtain the linear-size counter-model property for SOI and SOF (with respect to fixed TBox and query). This lets us derive an algorithm for FOMQA, which is coNP with respect to data complexity. The matching lower bound can be obtained using finite controllability from the results on answering instance queries, which is coNP-hard even for the ALE fragment of ALC [13] (see also [2]).

Preliminaries

The DL SOIF extends the classical DL ALC with transitivity declarations on roles (S), nominals (O), inverses (I), and role functionality declarations (F) [2]. We assume a signature of countably infinite disjoint sets of concept names N C = {A 1 , A 2 , . . . }, role names N R = {r 1 , r 2 , . . . } and individual names N I = {a 1 , a 2 , . . . }. SOIF-concepts C, D are defined by the grammar:

C, D ::= | A | ¬C | C D | {a} | ∃r.C , where r ∈ N R ∪ {s − | s ∈ N R } is a role.

Roles of the form r − are called inverse roles. A SOIF TBox T is a finite set of concept inclusions (CIs) C D, transitivity declarations Tr(r), functionality declarations Fn(r), where C, D are SOIF-concepts and r is a role. We assume that if the TBox contains Tr(r), then it contains neither Fn(r) nor Fn(r − ). With an appropriate extension of the signature, each SOIF TBox can be transformed into an equivalent TBox with CIs in the following normal forms:

A i B j , A ≡ {a} , A ∀r.B , A ∃r.B ,

where A, A i , B, B j are concept names, empty conjunction is equivalent to and empty disjunction to ⊥. We assume that all knowledge bases we work with are normalized; normalization yields at most linearly larger knowledge base, so the size estimations still hold. We also assume that for each concept name A used in T there is a complementary concept name Ā axiomatised using CIs A Ā and A Ā ⊥.

SOI and SOF TBoxes are restrictions of SOIF TBoxes. SOI TBoxes do not contain functionality declarations, whereas concept inclusions in SOF do not contain inverse roles. As the inverse of a transitive role is transitive anyway, for SOI we shall assume that if Tr(r) is present in the TBox, then so is Tr(r − ).

An ABox is a finite set of concept and role assertions of the form A(a) and r(a, b), where A ∈ N C , r ∈ N R and {a, b} ⊆ N I . A knowledge base (KB) is a pair K = (T , A). We write |K| for |A| + |T |. We use CN(K), Rol(K), TRol(K), Nom(K), and Ind(K) to denote, respectively, the set of all concept names, roles, transitive roles, nominals, and individuals occurring in K. Each nominal is an individual, so Nom(K) ⊆ Ind(K).

A unary type is a subset of CN(K) that contains exactly one of the concept names A, Ā for each A ∈ CN(K). We let Tp(K) denote the set of all unary types.

The semantics is defined via interpretations I = (∆ I , • I ) with a non-empty domain ∆ I and an interpretation function • I assigning to each A ∈ CN(K) a set A I ⊆ ∆ I and to each role name r with r ∈ Rol(K), a binary relation r I ⊆ ∆ I × ∆ I . The interpretation of complex concepts and roles is defined as usual [2]. We only consider interpretations complying with the standard name assumption and set a I = a for each individual name a used in ABox or TBox. We write d ∈ I as a shorthand for d ∈ ∆ I .

Interpretation I is a subinterpretation of interpretation J , written as I ⊆ J , if ∆ I ⊆ ∆ J , A I ⊆ A J , and r I ⊆ r J for all A ∈ CN(K), r ∈ Rol(K). An interpretation I is a subinterpretation of J induced by ∆ 0 ⊆ ∆ J , written as I = J ∆ 0 , if ∆ I = ∆ 0 , A I = A J ∩ ∆ 0 , and r I = r J ∩ ∆ 0 × ∆ 0 for all A ∈ CN(K), r ∈ Rol(K). We write J \ X for the subinterpretation of J induced by ∆ J \ X.

An interpretation I satisfies α ∈ T ∪ A, written as Let I and J be interpretations of K. A homomorphism from I to J , written as h : I → J is a function h : ∆ I → ∆ J that preserves roles, concepts, and individual names; that is, (h

I |= α, if the following holds: if α is a CI C D then C I ⊆ D I , if α is a transitivity declaration Tr(r) then r I is transitive, if α is a functionality declaration Fn(r) then r I is a partial function, if α is an assertion A(a) then a ∈ A I ,(d), h(d )) ∈ r J whenever (d, d ) ∈ r I , r ∈ Rol(K), h(d) ∈ A J whenever d ∈ A I , A ∈ CN(K)

, and h(a) = a for all a ∈ Ind(K). Note that I ⊆ J iff the identity mapping id is a homomorphism id : I → J .

Let N V be a countably infinite set of variables. An atom is an expression of the form A(x) or r(x, y) with A ∈ N C , r ∈ N R , and x, y ∈ N V , referred to as concept atoms and role atoms, respectively. A conjunctive query (CQ) Q is an existentially quantified conjunction q of atoms, ∃x 1 • • • ∃x n q . For simplicity we restrict q to Boolean; that is, q uses only variables x 1 , • • • , x n . This is without loss of generality since the case of non-Boolean CQs can be reduced to the case of Boolean queries; see e.g. [12].

Let var denote the function from queries to sets of variables, such that var(Q) is the set of all the variables used in Q. A match for Q in I is a total function π : var(Q) → ∆ I such that I, π |= q under the standard semantics of first-order logic. An interpretation I satisfies Q, written as I |= Q, if there exists a match for Q in I. A partial match for Q in I is a partial function π : var(Q) → ∆ I such that I, π |= q , where q is obtained from q by dropping atoms which bind some variable not in the domain of π.

Note that we do not consider queries with constants (i.e., individual names): such queries can be viewed as non-boolean queries with a fixed valuation of free variables, and thus are covered by the reduction to the Boolean case. We do consider unions of conjunctive queries (UCQs), which are disjunctions of CQs. An interpretation I satisfies a UCQ Q if it satisfies one of its disjuncts. It follows immediately that UCQs are preserved under homomorphisms; that is, if I |= Q and there is a homomorphism from I to J , then also J |= Q.

A query Q is entailed by a KB K, denoted as K |= Q, if every model of K satisfies Q. A model of K that does not satisfy Q is called a counter-model.

The query entailment problem asks whether a KB K entails a (U)CQ Q. Moreover, this problem is equivalent to that of finding a counter-model. It is well known that the query answering problem can be reduced to query entailment.

In this paper, we address the problem of finite query entailment, which is a variant of query entailment where only finite interpretations are considered: an interpretation I is finite if ∆ I is finite, and a query Q is finitely entailed by K, denoted as K |= fin Q, if every finite model of K satisfies Q.

Counter-model surgery

Later in this paper, we transform a tree-like counter-model into a finite countermodel of bounded size, using a variant of the blocking principle: a systematic policy of reusing elements. Before we dive into this construction, we show that we can add certain edges to a counter-model without satisfying the query. These edges must connect a pair of points, for which a different pair of similar points is already connected via an edge over the same role. Throughout this section, we assume no transitivity declarations are present; we will handle these later.

Sharing the same unary type is a natural similarity criterion. Without transitivity and functional declarations, concept inclusions rely only on the unary types of directly linked elements, so adding an r-edge between a pair of elements yields a correct model of the knowledge base if a pair of elements of the same unary types is already connected with an r-edge. However, that criterion is not sufficient, since this way we might accidentally satisfy the query, for example by creating a cycle of given length when the query asks for one. This happens to be a key difficulty to overcome: we cannot introduce cycles shorter than the size of the query.

The first step to enhance the criterion is to look at sufficiently large neighbourhoods, rather than just unary types. We adapt the definition of neighbourhoods from [6] and parameterize it with a set of constants Const ⊆ ∆ I : elements which are considered neighbours of all the elements in the interpretation, regardless of the actual connections between them. Throughout the paper, we assume Const = Nom(K) unless specified otherwise. By the distance between two elements we understand the length of the shortest undirected path over the union of all roles between these two elements. Replacing unary types with large neighbourhoods is still not enough, because nearby elements can have arbitrary large isomorphic neighbourhoods: in the integers with the successor relation, all n-neighbourhoods are isomorphic. The next step is to enrich the initial counter-model in such a way that overlapping neighbourhoods are not isomorphic, following an idea from [7]. Definition 2. A colouring with k colours of an interpretation I is an extension J of I with ∆ J = ∆ I , such that J coincides with I in every element in the signature of I, and interprets k fresh concept names B 1 , . . . , B k such that B J 1 , . . . , B J k is a partition of ∆ J . We say that d ∈ B J i has colour B i . A colouring J of I is n-proper if for each d ∈ ∆ J all elements of N I n (d) have different colours.

Lemma 1. If I \ Const has bounded degree, then for all n ≥ 0 there exists an n-proper colouring of I with at most M + |Const| colours, where M is the size of the largest (2n)-neighbourhood.

We write I n for an arbitrarily chosen n-proper colouring of I. Theorem 1 (Lifting Theorem). For a conjunctive query with at most k binary atoms and n ≥ k and an interpretation I n in which the query is not satisfied, adding n-links to I n does not make the query satisfied.

This theorem was established in [6] for n ≥ k 2 . We prove the stronger variant in the appendix. The strengthening is not necessary for our approach, but it tightens the resulting size estimations.

Clique-forest counter-models

In this section, we revisit prior work showing that for SOI and SOF, the existence of a finite counter-model for Q is equivalent to the existence of a so called counter-example, a possibly infinite counter-model of a special form, which generalises tree-shaped models [6]. We impose tighter size constraints on counterexamples and construct automata recognizing them.

The special form is based on the notion of clique-forests.

Definition 4 ([6]

). Let us fix a KB K. A clique-forest for an interpretation I is a forest whose each node v is labelled with a subinterpretation I v of I \ Nom(K) such that:

the sets ∆ Iv are a partition of ∆ I\Nom(K) ; each I v is either a single element with all roles empty (element node) or a clique over some transitive role with all other roles empty and no repetitions of unary types (clique node); apart from edges within cliques and between individuals, in I \ Nom(K) there is exactly one edge between ∆ Iu and ∆ Iv for every two adjacent nodes u and v: assuming u is the parent of v, it is an r-edge from an element of ∆ Iu to an element of ∆ Iv for some r ∈ Rol(K). We call v the origin of I v .

Definition 5. Let K * denote the KB obtained from K by dropping transitivity declarations and let I * denote the interpretation obtained from I by closing transitively the interpretation of each transitive role.

Note that each existential restriction satisfied in I is also satisfied in I * . The same holds for quantifier-free CI, and for universal restrictions involving nontransitive roles. For universal restrictions involving transitive roles, we ensure this property by preprocessing the interpretation: for each pair of B ∈ CN(K) and r ∈ TRol(K), we add a fresh concept name P r,B , axiomatised as P r,B ∀r.(B P r,B ), and replace each CI of the form A ∀r.B with A ∀r.P r,B .

An interpretation is safe if it does not contain an infinite simple r-path for any transitive role r. If I is a counter-example for Q, then I * is a counter-model for Q, thanks to the initial preprocessing.

Theorem 2. Q has a finite counter-model iff Q has a counter-example.

The theorem was originally stated in [6], where counter-examples by definition admitted clique-forests with at most |K| 2 trees and branching at most |K| 2 , but the proof there is still valid for the present stronger statement. The bound on the branching of clique-forests is replaced with bounds on the size of cliquetree nodes and number of children nodes connected to a single element within any node; their precise values follow from the construction of a single clique-tree in the proof. The bound on the number of trees can be tightened, as the constructed clique-forests are built of trees rooted in non-nominal individuals and children nodes of nominals.

To obtain the final counter-model of bounded size, we are going to apply the blocking principle to a counter-example. Thanks to Theorem 2, we already know that a counter-example exists whenever Q has a finite counter-model. In the next section, we require that the clique-forests of counter-examples we work with are of certain regularity. We will show that in regular interpretations with a so-called completeness property, if two elements are connected with a path over a transitive role, then there is at least one short path over that role which connects them. To be able to focus on counter-examples with regular clique-forests, we shall construct tree automata recognizing the latter.

The following definitions refer to edges and paths in a fixed interpretation I, either known from the context or explicitly specified. Definition 7. Fix an ordered pair of elements a, b. We say that (a, b) is an redge if there is an edge over r from a to b. We say that (a, b) is an r * -edge, if either r is transitive and there is a path over r from a to b, or (a, b) is an r-edge. Definition 8. An interpretation J is individual-complete, if in J each r * -edge over two individuals used in K is an r-edge. An interpretation J is nominalcomplete, if in J each r * -edge with a nominal as one of the endpoints is an r-edge. An interpretation is complete, if it is individual-and nominal-complete.

Note the difference between the individual-and nominal-completeness definitions: in the former, both endpoints of the edges considered have to be individuals, whereas in the latter, only one endpoint has to be a nominal. Each counter-model can be extended to a complete one in a natural way so that the transitive closure stays the same. We can thus restate Theorem 2: Q has a finite counter-model iff Q has a complete counter-example.

Theorem 3. For a union Q of CQs, with each CQ of size at most m, and a complete interpretation M with ∆ M = Ind(K), there exists an automaton with at most 2 2•3 4m |Q||T | 3m states recognizing clique-forests of complete counter-examples for Q, for which the subinterpretation induced by Ind(K) is equal to M.

See Appendix for the automaton construction. Theorem 3 is inspired by a similar theorem in [6]. We add the completeness requirement and restrict the size of the automaton state-space, rather than its overall size. This way, we are able to get a bound which is independent of the ABox; it depends only on the TBox and query sizes. The automaton size still depends on the ABox, since its description has to contain all the initial lists of states, each state corresponding to one tree, and we consider clique-forest with at least one tree per individual occurring in K.

These tweaks result in significant differences between our construction of the automaton and the construction presented in [6]. Most importantly, we cannot transfer the information about the edges between individuals to the TBox, as that would introduce a dependency of the automaton state-space on the ABox. This influences especially the automaton component responsible for verifying that the query cannot be satisfied.

Bounded counter-models

We say that DL has the linear-size counter-model property if whenever there exists a counter-model for a given query Q and a knowledge base K = (A, T ) expressed in that DL, there exists a counter-model for Q of size bounded by |A| • B(Q, T ) for some function B. The goal of the larger part of this section is to prove the following theorem. Clique-forests of complete counter-examples can be recognized by a family of automata by Theorem 3. Suppose K and Q admit a finite counter-model. As discussed in the last section, they admit then a complete counter-example, which in turn can be recognized by an automaton. Any automaton accepting some clique-forest must also accept some regular forest, which has only finitely many non-isomorphic subtrees. The number of these subtrees can be bounded by the number of states of the automaton. Following the bound in Theorem 3, we pick a complete counter-example I admitting a regular clique-forest with at most p = 2 2•3 4m |Q||T | 3m non-isomorphic subtrees.

We shall turn I into a finite counter-model for Q, using the blocking principle. The main obstacle is that Q uses transitive roles, which are not fully represented. To solve that, we replace Q with a different query. In [6], this was done by exploiting a bound on the length of simple r-paths for transitive roles r, guaranteed by the regularity of clique-forests of the considered counter-examples. Here, we require that the counter-example we process is complete and consider the distance between elements that are connected by r-paths, which will let us derive better bounds on the size of the obtained finite counter-models. Definition 9. A pair of elements a, b is at r-distance k if the shortest r-path from a to b is of length k. An interpretation is -ranged if for each r * -edge, its endpoints are at r-distance at most .

For transitive role atoms in queries, the path that leads from one variable to another does not play any role in the query evaluation, only the unary types of the connected variables matter (and the fact that they are connected by some path). We use this intuition to deduce that in -ranged interpretations, any query with transitive binary atoms can be rewritten so that closing the roles by transitivity does not influence the query evaluation.

Let Q * be obtained from Q by replacing each transitive atom s(x, y) by the disjunction i≤ s i (x, y), where s i (x, y) is the conjunctive query expressing the i-fold composition of s. If each disjunct of Q contains at most k binary atoms, Q * can be rewritten as a union of conjunctive queries, each using at most k • binary atoms.

Lemma 2. For all -ranged J , J * |= Q iff J |= Q * . Lemma 3. I \ Ind(K) is (4p)-ranged.

Proof. Let r be a transitive role in K. Each r-path going down the clique-forest of I contains at most p nodes. Indeed, if there were a longer r-path, then a subtree would occur twice on that path, which immediately leads to an infinite simple r-path in I \ Ind(K), contradicting the safety of I. For each clique node, at most one additional step is needed to traverse it with an r-edge. Additionally, if K is a SOI KB, an r-path can be composed of two parts, one going down the tree and the other consisting of r −1 edges going upwards. If follows that, for SOI and SOF, any r-reachable pair in I \ Ind(K) is at r-distance at most 4p.

The following lemma, regarding a bound on the range of an interpretation, is a stronger version of the analogous lemma presented in [6], regarding a bound on the length of simple transitive paths. The bound on the range has to be multiplied by a constant to get the bound on the range in the full interpretation, whereas in [6], the size of the set of nominals also comes into play. We draw upon the individual-completeness of the interpretations we consider to achieve this improvement.

Lemma 4. For each individual-complete interpretation J , if J \ Ind(K) is - ranged, then J is (2 + 3)-ranged.

Proof. Take two elements a, b, such that (a, b) is an r * -edge in J , but not an r * -edge in J \ Ind(K); it follows that any r-path from a to b visits an individual. If r is non-transitive, then (a, b) is an r-edge and a, b are at distance 1, otherwise consider such r-path π. It starts in one of the trees in J \ Ind(K) (or in an individual, in which case the tree degenerates to an empty one), and reaches a nominal connected to this tree or the tree root. Eventually, π enters the tree containing its other endpoint, again via a nominal connected to that tree or its root. Let c and d be the first and last individuals on π. As J is individualcomplete, (c, d) is an r-edge, if c = d. We can reduce π to a path segment starting in a in the first tree, a single edge to c, possibly a single r-edge (c, d), a single edge from d, and a path segment ending in b in the last visited tree. As the first and last path segments are in J \ Ind(K), each pair of elements in them is at r-distance at most . This gives a (2 + 3) bound on r-distance in J .

We know that I is * -ranged for * = (8p + 3) by Lemmas 3 and 4. We conclude that I |= Q * by Lemma 2.

The degree of elements in I\Ind(K) is bounded by |CN(K)|+|CI(K)| (elements are connected within clique-nodes of size at most |CN(K)| and with at most |CI(K)| + 1 elements from other nodes). Non-nominal individuals used in K can be connected with any number of other individuals, so even though their degree is bounded, the bound is relatively large. Nominals can be connected with an arbitrary number of elements.

We aim to use the blocking principle. We consider an n-proper colouring I n of I = I \ Ind(K), for some fixed n, instead of a proper colouring of I; we do so to avoid the influence of the individuals on the size of the neighbourhoods (and therefore the required number of colours). I is a clique-forest too: the children nodes of non-nominal individuals become new roots. On each branch π in I n , let D π be the first node for which some earlier node

E π satisfies N In n (d π ) N In n (e π )

, where d π ∈ D π and e π ∈ E π are the endpoints of the edges connecting D π and E π to their parent nodes. The new interpretation F n is obtained by including the branch π up to the predecessor of node D π , with the edge originally leading to d π redirected to e π . Individuals are copied from I, together with their adjacent edges. Note that for SOF, all functionality declarations are satisfied, as each redirected edge is a forward edge.

As each node in I n has finitely many children and we cut all the infinite paths off, F n is finite by Kőnig's lemma. As we start with I |= K * and the unary types of edge endpoints are preserved in F n , we get F n |= K * . By the initial preprocessing, (F n ) * |= K. The conclusions drawn so far are valid for any n.

Lemma 5. F n \ Ind(K) is (4p)-ranged for n ≥ 4p.

Proof. During the construction of F n from I, we add only one link for each infinite downwards path in I starting in a tree root. Each of these links goes up the tree. By link origins we shall understand the elements that had an edge added during the construction of F n . We can visualize F n ∪ I as the core F n with additional hanging subtrees, each attached to F n with a single edge to a link origin. There is no way of accessing these subtrees from F n \ Ind(K) other than via the corresponding link origin.

Let = 4p and fix some n ≥ . We prove the following claim by induction on m: if there is a path of length + 1 from a in F n \ Ind(K) to b in (F n ∪ I) \ Ind(K) over a transitive role r using at most m links, then there is an r-path in (F n ∪ I) \ Ind(K) from a to b over r of length at most .

If we prove the claim for all m, we can deduce that F n \ Ind(K) is -ranged. Indeed, take a transitive role s and an s-path π in F n \ Ind(K) of length + 1, with endpoints (a, b) and m links (possibly none). Applying the claim, we get an s-path ρ in (F n ∪ I) \ Ind(K) of length at most , connecting a and b. To show that ρ has to be in F n \ Ind(K), we take advantage of the tree structure: if a path goes down a given branch, it has to use a link to come back to an element higher up in the tree. There are no links from elements in I \ F n , so ρ must only visit nodes in F n \ Ind(K); otherwise ρ would get stuck in I \ F n and have its endpoint there.

Let us approach the induction proof. The induction base for m = 0 is almost immediate. A path without links in (F n ∪ I) \ Ind(K) is also a path in I \ Ind(K), which is -ranged, as shown in Lemma 3.

Assume the inductive hypothesis holds for m. Suppose that for a transitive role s there is an s-path π in (F n ∪ I) \ Ind(K) of length l + 1 from a to b, such that it uses at most m + 1 links. If π uses fewer than m + 1 links, we simply use the inductive hypothesis; thus we can assume π uses exactly m + 1 links.

As m ≥ 0, there is at least one link in π. Let (d, e) be the last link in π, with the witnessing element e in I \ Ind(K). Note that the node containing e is not an ancestor of the node containing a in the clique-tree, as otherwise a would lie in a subtree rooted in e and would be removed while constructing Proof. If we fix the sizes of the TBox and query, then S(Q, T ) is fixed and the size of F n is linear with respect to the ABox size. We consider the following non-deterministic approach to disproving query entailment: we simply guess a transitively closed counter-model of size at most M • S(Q, T ) and verify all the required properties. First, we need to check that it is indeed transitively closed. This can be done in polynomial time with respect to the interpretation size. Next, we need to verify ABox and TBox constraints; as the interpretation is transitively closed, for each element each constraint can be verified by looking at that element's immediate neighbours. Finally, we need to check that the query is not satisfied. This comes down to verifying a first-order logic formula, which can be done in time polynomial of |Ind(K)|, where the degree of the polynomial depends only on the sizes of the TBox and query. Altogether, this constitutes a coNP algorithm for deciding query entailment for fixed-size TBox and query.

We have shown that finite query entailment is in coNP with respect to data complexity. It is known that (unrestricted) query entailment is coNP-hard (with respect to data complexity) in ALE [13]. By finite controllability, so is finite query entailment. Since ALE is a fragment of both SOI and SOF, NP-hardness carries over.

Conclusions

By enhancing the approach to finite query entailment presented in [6], we were able to show the linear-size counter-model property for SOI and SOF (for fixed TBox and query). This let us show that finite query entailment is in coNP with respect to data complexity for these DLs. Basing on previous work, we deduced that it is a coNP-complete problem. To improve the final size estimations, we also established a stronger version of the Lifting Theorem.

The techniques used to construct the final counter-models were not efficient enough to provide asymptotically tight size estimations for arbitrary (not fixed) TBoxes and queries. One limiting factor is the doubly-exponential bound on the range of the considered counter-models, stemming from the automata statespace size. It would be interesting to see if this could be improved. Another natural question is whether our techniques are applicable to SIF.

A Proof of Lemma 1

The following proof is taken from [6], where the lemma was formulated with no mention of the number of required colours. Notice that for i < j, each j-link is also an i-link. Note also that (d, e) is an i-link along role r if and only if (e, d) is an i-link along r − . We have

I n ⊆ I n n ⊆ I n−1 n ⊆ • • • ⊆ I 1 n ⊆ I 0 n ,

but the domains of all these interpretations coincide. We keep referring to the edges present in I i n but not in I n as i-links, even though they are ordinary edges now.

Definition 11. We say that functions f, f :

X → ∆ I are d-indistinguishable if for all x ∈ X: N I d (f (x)) N I d (f (x)).

Throughout the paper, we consider neighbourhoods with respect to Const = Nom(K), however, all the definitions reliant on the notion of neighbourhoods are correct for every choice of the set of constants Const. Let us note that for every pair of d-indistinguishable functions f, f , if for some x ∈ X and c ∈ Const either f (x) = c or f (x) = c, then f (x) = f (x). This is because whenever we consider any neighbourhoods, we add a nominal-like concept for each constant.

Theorem 5. Let P be a CQ with at most k binary atoms and let n ≥ k. For each homomorphism h : P → I n n there exists a homomorphism h : P → I n which is (n − k)-indistinguishable from h.

Theorem 5 holds for any interpretation I of any SOIF KB. It was originally stated in [6] for n ≥ k 2 and regarded a pair of (n − k 2 )-indistinguishable homomorphisms. Furthermore, it was proved only for Const = Nom(K), and here we consider an arbitrary choice of the set of constants Const. We prove the enhanced version of the theorem in the next subsection; let us now see that it implies the Lifting Theorem (Theorem 1).

Proof (Lifting Theorem). Interpretation J n we obtain in the process satisfies J n ⊆ I n n . Consequently, if there was a homomorphism h : P → J n ⊆ I n n for some CQ P constituting Q, Theorem 5 would yield a homomorphism h : P → I n , contradicting I |= Q.

B.2 Proof of Theorem 5

We start with the following observation. Lemma 6. All links involving elements of Const are ordinary edges.

Proof. Assume (d, e) is an i-link for d in Const (the symmetrical case is analogous). Assume that d has a successor f such that N In i (f ) N In i (e). Then, the edge (d, e) must also exist in N In i (e), which shows that (d, e) is an ordinary edge. Assume now that e has a predecessor f such that

N In i (f ) N In i (d).

Then, f must be equal to d, due to the concepts used to mark constants in the definition of a neighbourhood, so (d, e) is an ordinary edge.

Let h(P ) denote the subinterpretation of I n n obtained by restricting the domain to h(var(P )), and only keeping in each role r edges (h(x), h(y)) such that r(x, y) is an atom from P . We say that h uses an r-edge of I n n if this r-edge is present in h(P ). We say that h uses a link over r of I n n if this link is an edge of h(P ), but not an edge of I n . Definition 12. By J ∩ J we mean the interpretation J such that ∆ J = ∆ J ∩ ∆ J , A J = A J ∩ A J for all concept names A, and r J = r J ∩ r J for all role names r.

Let us analyze the structure of h(P ). Let I n = I n \Const and let C denote the set of connected components in h(P )∩I n , understood as graphs. Each component in C contains at most k edges, as P contains at most k binary atoms. We shall now prove that h is far-linked ; that is, h does not use links that join two elements within the same connected component in C. This is a general property of such homomorphisms, which stems from the fact that each link connects elements that are far apart, as expressed in the following lemma. Proof. Let (a, b) be a link along r. Suppose that the distance between a and b in I n is smaller than 2n. Take the shortest path π in I n between a and b; π is of length at most 2n − 1. As (a, b) is a link, either a has an r-successor f in I n with the same colour as b, or b has an r-predecessor f in I n with the same colour as a. In the first case, there is an element on π, which is at distance at most n from both b and f , in the second case, there is an element on π, which is at distance at most n from both a and f . Since a, b, f share the same colour, are pairwise unequal and are in an n-neighbourhood of another element, we obtain a contradiction. Thus, elements forming a link must be at distance at least 2n in I n . Lemma 8. Each homomorphism h : P → I i n is far-linked.

Proof. Let r(x, y) be an atom of the query P such that h (x) and h (y) are in the same connected component C ∈ h (P ) ∩ I n . As P has at most k binary atoms, elements in C are connected with at most k edges. Since h (x) and h (y) are both in C, they at distance at most k < 2n in I n (unless k = 0; we verify this case separately). We use Lemma 7 to deduce that (h (x), h (y)) is an r-edge in I n .

We see that all the links used by h join elements from different connected components in C. We will construct a series of homomorphisms (h 0 , h 1 , ..., h l ), with l ≤ k, h j : P → I nj n , n j < n j−1 for each j > 0, starting with h 0 = h and n 0 = n, satisfying the following constraints. First, we require that h j+1 uses strictly fewer links than h j . Second, we require that h j+1 and h j are (n j+1 )indistinguishable. Third, we require that h l uses no links and that n l ≥ n − k. If we manage to satisfy these constraints, h l treated as h : P → I n is the homomorphism postulated in the theorem statement.

Let C j denote the connected components of h j (P )∩I n . We keep the invariant that (C j ∩ C) is non-empty for j < l. We construct the components graph of C j : the graph with elements of C j as vertices and edges between vertices representing connected components from C j that are joined by a link between their elements. The second invariant is that for j > 0, there is exactly one connected component in C j that is not present in C j−1 ; let D j denote that component. If j > 0 and there is a component D j ∈ (C j ∩ C) joined by a link to D j , set C j = D j and C j = D j . Otherwise, either there are C j , C j ∈ (C j ∩ C) joined by a link, or h j uses no links and we finish the construction. In case C j , C j exist, homomorphism h j+1 is obtained from h j by pulling C j closer to C j by the link that joins these components; as a result, C j , C j are combined into one component, namely D j+1 . The details of that construction are presented in the next subsection.

We will later show that h j+1 uses links for a subset of these binary atoms of P , for which h j uses links and which do not join C j with C j ; the links joining C j and C j in h j are replaced with regular edges of I n in h j+1 . We compensate for eliminating these links by paying a certain cost: homomorphism h j+1 is less strict than h j when it comes to preserving the neighbourhoods of the elements joined with a link, which is why we have to set n j+1 = n j − |C j |.

We will now prove that n j ≥ n − k for all j. This will entail that h is indeed (n−k)-indistinguishable from h. We use the following invariant: altogether there are at least n − n j edges in connected components from (C j \ C). This can be proved by induction on j. The base step for j = 0 reduces to the statement that an empty graph has no edges. In the inductive step, we pull the component C j towards C j , merging the two within a larger connected component. We have that C j ∈ C j is in C and the new connected component D j+1 ∈ C j+1 is not in C. This means that the number of edges in connected components from C j+1 \ C must be larger than the number of edges in connected components from C j \ C. The total increase is equal to at least |C j |; this stems from incorporating |C j | − 1 edges within C j , and from reducing at least one link. Using the inductive hypothesis, we get that there are at least n − n j edges in connected components from C j \ C, and at least n

− n j + |C j | edges in connected components from C j+1 \ C. As n j+1 = n j − |C j |, we get n − n j + |C j | = n − n j+1

, which proves the inductive step. If there was j such that n j < n − k, the number of edges in connected components from (C j \ C) would be at least n − n j ; as n − n j > n − (n − k) = k, this would contradict the fact that h j (P ) has at most k edges.

B.3 Definition of h j+1

In this subsection, we mark the names of interpretation elements with bold colour fonts. Whenever we mention distance between elements, we refer to distance in I n .

We are going to define h j+1 based on h j and C j , C j ∈ C j . Let (d, e) be a link along s used by h j and joining C j with C j . We only consider the case when d ∈ C j and e ∈ C j ; the proof for the other case is for n j+1 = n j − |C j | as follows: for each z ∈ var(P ), let h j+1 (z) = g(h j (z)) if h j (z) ∈ C j , and h j+1 (z) = h j (z) otherwise. To ensure that the definition of h j+1 is correct, we need to show that n j+1 ≥ 0. Altogether, components in C have at most n edges (more precisely, at most k ≤ n edges), and we have proven an invariant that there are at least n−n j edges in connected components from (C j \ C). As C j is in (C j ∩ C), and (d, e) is a link,

C j can have at most n − 1 − (n − n j ) = n j − 1 edges. Thus, n j+1 = n j − |C j | ≥ 0.

By definition, h j and h j+1 are (n j+1 )-indistinguishable: N In nj (e) N In nj (e ) and n j+1 = n j − |C j |, so g preserves (n j+1 )-neighbourhoods of elements within distance |C j | from e. We need to verify that h j+1 is indeed a homomorphism.

Let r(x, y) be an atom of the query P and let p = h j (x), q = h j (y), p = h j+1 (x), q = h j+1 (y). We have to consider the following cases:

1. p, q / ∈ C j , 2. p, q ∈ C j , 3. exactly one of p, q is in C j ; we take advantage of the symmetry and only consider the case p / ∈ C j , q ∈ C j , with the following subcases:

(a) p ∈ C j , (b) p ∈ C j for C j ∈ C j such that C j = C j . (c) p ∈ Const,

Case 1. By definition, (p , q ) = (p, q). As h j is a homomorphism into

I nj n ⊆ I nj+1 n

, we have that (p , q ) is an r-edge in I nj+1 n

. Clearly, h j+1 uses no new links for such atoms.

Case 2. By definition, (p , q ) = (g(p), g(q)). We assumed that p and q are both in C j , and connected components in C j contain at most k − 1 edges (we subtract one, because (d, e) is a link). This means p, q are at distance at most k − 1 < 2n, and thus (p, q) is an r-edge in I n by Lemma 7. As g(p) and g(q) are both in D j+1 ∈ C j+1 , they are at distance at most k. This means that (p , q ) is an r-edge in I n .

Case 3. By definition, (p , q ) = (p, g(q)). As g preserves (n j+1 )-neighbourhoods of elements within distance |C j | from e and the distance between q and e is at most |C j |, we see that N In nj+1 (q) N In nj+1 (g(q)). Let us analyze the three subcases.

Subcase 3 (a). As p ∈ C j and q ∈ C j , and C j and C j are different connected components in C j , we have that (p, q) is a link along r. Since C j and C j are merged together into D j+1 , both p and g(q) are in D j+1 . This means they are at distance at most k from each other, which allows us to use Lemma 7 to deduce that (p , q ) is an r-edge in I n . Subcase 3 (b). Like above, p and q are in different connected components from C j , so it must be that (p, q) is a link along r. We need a different line of reasoning than in the previous case, since we cannot assume C j and C j are merged together. Since (p, q) is an n j -link, it is also an (n j+1 )-link. We have two subcases to consider.

Suppose that p has an r-successor f in I n such that N In nj+1 (f) N In nj+1 (q) N In nj+1 (g(q)). We see that then (p, g(q)) is an (n j+1 )-link. If f = g(q)), then (p , q ) is an r-edge in I n . Furthermore, the connected component C j of p merges with C j ; all three components C j , C j and C j form subgraphs of D j+1 . As elements in D j+1 are at distance at most k from each other, there can be no links between them by Lemma 7. This means that all the links between C j and C j used by h j become regular edges in h j+1 . We also see that h j cannot use any links between C j and C j , as elements of these components are at distance at most k from each other in I n . The other possibility is that f = g(q)), in which case (p , q ) is a link along r. By analogous reasoning, this also entails that h j+1 uses links for all the links used by h j which join C j and the connected component of f. Suppose now that q has an r-predecessor f in I n such that N In nj+1 (f) N In nj+1 (p). We prove in the same way as above that either (p , q ) is an r-edge in I n and the connected component of p and C j merge, or (p , q ) is a link along r.

Subcase 3 (c). (p, q) is an r-edge in I n by Lemma 6. As N In nj+1 (q) N In nj+1 (g(q)), (p, g(q)) must be an r-edge in N In nj+1 (g(q)). Thus, (p , q ) is an r-edge in I n . This proves that h j+1 is a homomorphism, which uses links for a subset of these binary atoms of P , for which h j uses links and which do not join C j with C j .

C Proof of Theorem 3

C.1 Construction C.1.1 Outline The authors of [6] present a product of tree automata that independently verify the following acceptance conditions:

consistency of the forest encoding, satisfying the constraints from the knowledge base K * = (T , A), not satisfying the query Q, path safety.

We enhance their construction by adapting these components and adding verification of individual-and nominal-completeness. Our automaton is parametrized by a fixed subinterpretation M induced by individuals (note that it is not the case in [6]).

We describe the construction with SOI in mind, but it adapts easily to SOF. The only differences are that for SOF, we do not validate parts referring to inverse roles, and we add another component verifying functionality declarations, which does not affect our size estimations. This component is decribed in the last subsection.

C.1.2 Preprocessing

The interpretations we work with are preprocessed to cater for transitivity: for an interpretation I, if I |= K, then also I * |= K. This is achieved by supplementing the interpretation with auxiliary concepts expressing universal restrictions on transitive roles, as described under Definition 5: for each pair of B ∈ CN(K) and a transitive role r, we add a fresh concept name P r,B , axiomatised as P r,B ∀r.(B P r,B ), and replace each CI of the form A ∀r.B with A P r,B .

The information about the edges to and from nominals is also stored in auxiliary concepts: for each a ∈ Nom(K) and r ∈ Rol(K) fresh concept names N r,a and N r − ,a are added to the knowledge base, axiomatised as N r,a ≡ ∃r.{a} and {a} ≡ ∀r.N r − ,a . These concepts do not substantially increase the size of the automaton.

We aim to accept only these counter-examples which are complete. First, we tackle nominal-completeness. For each a ∈ Nom(K) and r ∈ TRol(K) we add the following concept inclusions to TBox: Nr,a ∀r. Nr,a and N r − ,a ∀r.N r − ,a . This ensures that whenever for a transitive role r there is no edge r from some element to a nominal (or vice versa), then there is also no r-path from that element to the nominal.

Recall that we assume the completeness of M (the subinterpretation induced by individuals). Consequently, any path along a transitive role between two individuals not connected with a single edge must enter some clique-tree and lead through a nominal. Additional CIs described in the last paragraph ensure that whenever there is a path from an individual to a nominal in the clique-tree associated with that individual, these elements are also connected with a single edge. If there was a path breaking individual-completeness of the whole forest, it would entail that also nominal-completeness does not hold. Thus, individualcompleteness follows from nominal-completeness.

C.1.3 Automaton

To make clique-forests accessible to automata, we encode them as finitely labelled forests. Let TRol(K) be the set of transitive roles from Rol(K), and let [X] ≤k be the family of subsets of X of size at most k. In the encoding, nodes are labelled with elements of the alphabet

Σ = Tp(K) ∪ TRol(K) × [Tp(K)] ≤|CN(K)|

and edges are labelled with elements of the alphabet

Γ = Tp(K) × Rol(K) × Tp(K) .

We build an automaton parametrized by the subinterpretation M. Let I denote the interpretation whose clique-forest is encoded in the input forest. The automaton runs on the encoding of the clique-forest of I \ Ind(K) (not of I, like in [6]; this change is motivated by the way we modify the query component). We say that an individual a is the super-root of a clique-tree t in I \ Ind(K), if the root of t is a successor of a in I; we also say that t is super-rooted in a.

For a complete counter-example I, we obtain the encoding of the clique-forest for I \ Ind(K) as follows. We fix some order on individuals and order the trees in I \ Ind(K) so that they are sorted with respect to the order on their superroots. We add a separating tree stub (represented with a fresh symbol added to Σ) between contiguous blocks of trees super-rooted in the same individual; it is only used as a marker for the automaton so that it can recognize which trees are associated with a given individual. All the components of the automaton always accept these tree stubs.

Next, we label each element node with the single unary type it realises and each clique node with its single nonempty role and the set of unary types it realises. We do not represent nominals explicitly in the encoding, but thanks to the initial preprocessing, all relevant information about them is contained in the unary types of the remaining elements. Finally, if in I \ Ind(K) there is an r-edge from an element of type τ in some parent node to an element of type σ in some child node, then in the encoding the edge from the parent node to the child node is labelled with (τ, r, σ). Because unary types do not repeat within cliques, this uniquely determines the endpoints.

There

δ ⊆ Q × Σ × (Γ × Q) ≤D ,

where Q is the set of states. The automaton processes the forests top down. Whenever a node is reached after an edge labelled (τ, r, σ), we refer to the unique element of type τ in that node as the current element, and to the unique element of type σ in the parent node as the parent element. The initial states are specified for each tree separately: the automaton has a set I ⊆ Q ≤N of sequences of initial states. A run is a labelling of the input forest with states in such a way that the sequence of states in the roots belongs to I, and if a node has state q, label α, and its children are connected via edges with labels β 1 , β 2 , . . . , β n and have states q 1 , q 2 , . . . , q n , then q, α, (β 1 , q 1 ), . . . , (β n , q n ) ∈ δ .

We use Büchi acceptance condition: we specify a set F ⊆ Q of marked states that need to be revisited, and consider a run accepting if on each branch marked states occur infinitely often. A forest is accepted by the automaton if there exists an accepting run over it.

An automaton has trivial acceptance condition if F = Q. Then, each run is accepting but the automaton may still reject some forests, because there may be no run for them: a branch of the computation can get stuck if no transition is consistent with the current state, label and edge labels. An automaton is weak if on each branch of each run, once a marked state is visited, all subsequent states are marked. Notice that all automata with trivial acceptance condition are weak. Given a weak automaton and an arbitrary Büchi automaton it is particularly easy to construct an automaton recognising trees accepted by both input automata: it suffices to take the standard (synchronous) product automaton and mark all states that contain a marked states on both coordinates.

C.1.4 Automata components

The final automaton is obtained as a product of automata verifying independently various parts of the condition.

The first thing to check is the consistency of the encoding: if an edge has label (τ, r, σ), then τ must occur in the label of the parent node, and σ must occur in the label of the child node. To check this, it suffices to examine for each node the labels of all edges incident to it plus the label of the node itself. When a transition is made, all these are available except the label on the edge to the parent: it must be stored in the state. The automaton has |Γ | states and trivial acceptance condition.

The second thing to check is that the clique-forest is a model of K * . If M does not satisfy the restrictions from the ABox A of K * , the final automaton rejects everything.

To verify that the TBox is satisfied we need to check each CI. For CIs of the form

i A i j B j

we have a two-state automaton with trivial acceptance condition that simply tests that each type used in the encoding satisfies this CI. If the type of some individual violates this CI, the final automaton rejects everything. To verify CIs of the form A ∀r.B

for elements in I\Ind(K), it suffices to check that in the input interpretation there is no r-edge from an element whose unary type contains A to an element whose unary type contains B. This amounts to verifying that none of the following are used in the encoding:

node labels (r, T ) such that A ∈ τ ∈ T and B ∈ σ ∈ T for some τ , σ; For elements in I \ Ind(K), this condition can be tested in a similar way as above, except that one needs access to the label of the current node and all edges incident to it. Like for the initial consistency check, it suffices to store in the state the label of the edge to the parent. To ensure the existential restrictions are met for nominals, for each nominal a such that a ∈ A M , if that CI is not met in M, we construct a two-state weak automaton looking for a label that uses a type τ such that B ∈ τ and N r − ,a ∈ τ . Note that this automaton has a nontrivial acceptance condition, but it is weak: as soon as it finds an appropriate label, it loops in a marked state. The initial lists of states for that weak consist of a single unmarked state and all the remaining initial states already marked. We also have to verify that all existential and universal restrictions are met for non-nominal individuals. The former are ensured by constraining initial lists of states so that all the existential restrictions for an individual not satisfied in M are satisfied by roots of clique-trees super-rooted in that individual. If the universal restrictions are not met in M, the final automaton rejects everything; otherwise, we ensure that the restrictions hold for non-individual elements connected to individuals by constraining initial lists of states so that the roots of clique-trees super-rooted in a given individual do not break universal restrictions imposed on that individual.

Summing up, the total size of the state-space of the TBox component is not larger than 2 Next, we verify that the query cannot be satisfied. We need an automaton recognizing counter-examples such that the number of states in the automaton does not depend on the number of individuals in K. This is achieved by making sure each tree in the forest is processed without explicitly referring to the knowledge about individuals.

|CI(K)| •2 |CI(K)| •|Γ | |CI(K)| •2 |Nom(K)|•|CI(K)|

In [6], all the information about edges between individuals is transferred to the TBox. In particular, this means that the query parts are self-contained in single trees, as individuals are not connected (and the edges to nominals are hidden from the automata perspective). It lets the query component build partial query matchings independently for each tree, but at the cost of enlarging the TBox.

Here, we cannot use the same approach: the number of states depends on the number of unary types in particular, which would be tied to the ABox size if we were to store the information about edges between individuals in concepts. Matchings of connected parts of the query are no longer contained in single trees: they might span over multiple trees. We have to accommodate for that by enhancing the query component.

We start by replacing query Q with a query Q such that for each model I of K:

I * |= Q iff (I \ Nom(K)) * |= Q . The query Q is obtained in two steps.

In the first step, we account for the possibility that binary atoms of the form r(x, y) for transitive r can be satisfied with paths that cross different trees. Take such path π, and let a and b denote respectively the first and the last individual on π (possibly a = b). Thanks to the individual-completeness property, π can be reduced to a path consisting of: a path segment ending in a and contained in a single tree, a single edge from a to b (if a = b), and another path segment contained in a single tree. To detect such paths, for each CQ P constituting Q, we add to Q each CQ that can be obtained from P by subdividing some transitive atoms; that is, by replacing some atoms of the form r(x, y) either with atoms r(x, z) and r(z, y) for a fresh variable z, or with atoms r(x, z 1 ), r(z 1 , z 2 ), r(z 2 , y) for fresh variables z 1 , z 2 .

In the second step, we account for the influence of nominals on the query satisfaction. For each CQ P of the modified Q, we add to Q each CQ that can be obtained from P by performing the following operation any number of times. Let tp(x) be the set of all A such that P contains A(x). Choose x ∈ var(P ) and a ∈ Nom(K) such that a ∈ A M whenever A ∈ tp(x). Drop all atoms of the form A(x) from P . Replace in P each atom of the form r(y, x) by N r,a (y) and each atom of the form r(x, y) by N r − ,a (y).

The resulting query Q satisfies the condition I * |= Q iff (I \Nom(K)) * |= Q . After the first step, the number of CQs grows by the factor 3 m and their size is at most 3m. After the second step, the number of CQs grows by the factor |Nom(K)| 3m and their size is still at most 3m. Thus, the size of the resulting query is at most |Q| • 3 m • |Nom(K)| 3m , and its CQs have size at most 3m.

For each CQ P of the preprocessed union of queries Q , we construct an automaton that ensures that I * |= P . Its states consist of the parent edge label β and a set F representing all partial matchings of P to elements represented in the subtree rooted in the currently processed node.

The set F consists of partial functions f : var(P ) → L, with L = {succ, other}, representing different ways variables can be assigned in relation to the parent element. Each such function represents a partial matching, that is a partial assignment of variables such that the restricted query induced by these variables is satisfied (not all the collected partial matchings can potentially be extended to a total match; discussion on that part is presented in the next subsection). The identifier succ is used for each variable mapped to an element in the current subtree that is an r-successor of the parent element in I \ Ind(K). If r is non-transitive, this simply means the current element (the clique origin, if we are processing a clique node). If r is transitive, it means any element r-reachable from the current element. The identifier other is used for variables assigned to any other elements in the current subtree.

The rationale behind the automaton is that it constructs all the possible partial matchings across the forest. The trick here is that the states reached before traversing a given node already contain the information about all the partial matchings that will be constructed in the subtree rooted at that node. The automaton will have at most one accepting run on each forest-the states in the initial list of states in that run will contain all the possible partial matchings across the forest.

We define the transition relation in the next subsection. The definition will ensure the intended semantics of the states; when processing a given node, sets F in the automaton state will represent all partial matchings of P to elements represented in the subtree rooted in that node.

All states are accepting. Apart from the partial matchings collected in trees rooted in children nodes of individuals, we also compute all partial matchings in M \ Nom(K) (the possibility that some variables are assigned to nominals has already been accounted for in the second step of the query preprocessing). As lists of initial states, we take all lists of states such that a total match over the whole forest cannot be constructed by fusing together a partial matching in M \ Nom(K) and one partial matching per initial state. We know how to stitch these partial matchings together, because we know the edges from super-roots to new clique-tree roots and the way we collect partial matchings in a cliquetree accounts for the label on that edge. This sums up the query automaton construction. This extra information about the edges from super-roots motivates the approach of traversing clique forests built of trees rooted in children nodes of individuals, rather than trees rooted in individuals.

The last component of the automaton checks that the input interpretation is safe. Observe that it is unsafe if in the input clique-forest there is a branch with consecutive node and edge labels α 1 β 1 α 2 β 2 . . . such that for some transitive r and all i large enough, β i = (τ i , r, σ i ) and either σ i = τ i+1 (consecutive edges are incident in the clique-forest) or α i+i = (r, T i ) (consecutive edges are incident with an r-clique). An automaton can easily check that there is no such branch. Each time it sees a transitive role it moves to an unmarked state, storing the role. It moves to a marked state as soon as the condition above is broken. The automaton has |TRol(K)| states.

The automaton recognising safe counter-examples can be obtained from these components by the simple product construction described above, because only the last component is not weak.

C.2 Transition relation of the query component

Below we describe the set of transitions of the query satisfaction component of the product automaton.

A transition ((β, F ), τ, ((β i , (β i , F i )) n i=1 ) over an element node of unary type τ , with β = (σ, r, τ ) and β i = (τ, r i , τ i ), is a transition of the automaton if F is the set of all functions f : var(P ) → L which are both upward-and downward-compatibile with some witnessing functions f 1 ∈ F 1 , f 2 ∈ F 2 , ..., f n ∈ F n ; we define these conditions below.

Fix a function f : var(P ) → L and a sequence of domain-disjoint functions (f 1 , f 2 , ..., f n ), with f i : var(P ) → L.

f is upward-compatibile with functions f i if for each variable x ∈ var(P ):

if ∃i : f i (x) = other, then f (x) = other, if ∃i : f i (x) = succ, then f (x) = succ if r is transitive and r i = r, otherwise f (x) = other, otherwise either x / ∈ dom(f ), or f (x) = succ and tp(x) ⊆ τ , which we interpret as matching x to the current element. In the latter case x is in (dom(f ) \ dom(f i )).

f is downward-compatibile with functions f i if for each atom s(x, y) of P :

if x ∈ (dom(f ) \ dom(f i )) and y ∈ dom(f i ), then r i = s and f i (y) = succ, if y ∈ (dom(f ) \ dom(f i )) and x ∈ dom(f i ), then r i = s − and f i (x) = succ, if x ∈ dom(f i ) and y ∈ dom(f j ) for i = j, then s is transitive, r i = r − j = s, and f i (x) = f j (y) = succ.

Note that we only seem to notice binary atoms for which both variables are already present in the subtree. When a variable gets the identifier other and there are unsatisfied binary atoms containing that variable, we know that such a matching cannot be extended to a total matching. We still gather such matchings in the automaton, which might be unintuitive. For clarity, we can forget such unextendable matchings in the automaton runs by adding the following condition to the downward-compatibility definition:

if f i (x) = other or f i (y) = other, then x, y ∈ dom(f i ).

The transitions over clique-nodes are described using analogous compatibility concepts, adapted to the clique-node setting. A transition ((β, F ), (r , T ), ((β i , (β i , F i )) n i=1 ) over a clique node encoded by a role r and a set of unary types T , with β = (σ, r, τ ) and β i = (σ i , r i , τ i ), is a transition of the automaton if F is the set of all functions f : var(P ) → L which are both upward-and downward-compatibile with some witnessing functions f 1 ∈ F 1 , f 2 ∈ F 2 , ..., f n ∈ F n ; we define these conditions below.

Fix a function f : var(P ) → L and a sequence of domain-disjoint functions (f 1 , f 2 , ..., f n ), with f i : var(P ) → L.

f is upward-compatibile with functions f i if for each variable x ∈ var(P ):

if ∃i : f i (x) = other, then f (x) = other, if ∃i : f i (x) = succ, then f (x) = succ if r is transitive, r i = r, and either σ i = τ or r = r, otherwise f (x) = other, otherwise one of the following holds:

• x / ∈ dom(f ), • f (x) = succ and tp(x) ⊆ τ , which we interpret as matching x to the current element, • f (x) = other and tp(x) ⊆ τ ∈ T \ {τ }, which we interpret as matching

x to an element in the current clique-node.

f is downward-compatibile with functions f i if for each atom s(x, y) of P :

if x ∈ (dom(f ) \ dom(f i )) and y ∈ dom(f i ), then r i = s, f i (y) = succ, and either tp(x) ⊆ σ i , or s is transitive and r = s, if y ∈ (dom(f ) \ dom(f i )) and x ∈ dom(f i ), then r i = s − , f i (x) = succ, and either tp(y) ⊆ σ i , or s is transitive and r = s, if x ∈ dom(f i ) and y ∈ dom(f j ) for i = j, then s is transitive, r i = r − j = s, f i (x) = f j (y) = succ, and either σ i = σ j or r = s.

Once again, we can enhance the downward-compatibility definition to get rid of unextendable partial matchings; it is enough to add the same condition as before.

To see that this transition relation ensures the intended semantics of the states one needs to argue that each partial matching is accurately represented. This can be done by induction on the size of the image. For size one, the matching will be accounted for based solely on the labels. For larger images, use the inductive hypothesis for restrictions of the match to variables mapped to the trees rooted at the children of the current node.

C.3 Size estimation

Let us approach the size estimation of the automaton state space. We start by analyzing the factors that contribute to the size of automata components. Note that k and m denote, respectively, the maximum number of binary atoms and the maximum size of a single conjunctive query in the union of CQs Q.

D Proof of Lemma 2

For each transitive role s, all the s-atoms in Q are replaced with disjunctions of s-chains of lengths from 1 to in Q * . If such an atom can be satisfied in J * for elements a and b, then there is an s-path from a to b of length at most . This shows that if J |= Q * , then J * |= Q. On the other hand, if J * |= Q, then each transitive role atom must be satisfied by two endpoints of some s-path from J , of length at most , which satisfies the subquery replacing this atom in Q * . This proves the converse implication.

Definition 1 .1For d in I \ Const and an integer n, the n-neighbourhood N I n (d) of d with respect to Const is the subinterpretation of I induced by Const and all elements e in I \ Const within distance n from d in I \ Const, enriched with a concept D interpreted as {d}, and a concept interpreted as {a} for each a ∈ Const. For a ∈ Const, N I n (a) is the subinterpretation induced by Const, enriched similarly.

Definition 3 .3Let i ≤ n and let d, e be elements of I n . We say that (d, e) is an i-link along role r if either d has an r-successor e in I n such that N In i (e ) N In i (e), or e has an r-predecessor d in I n such that N In i (d ) N In i (d).

Definition 6 .6A counter-example for Q is a safe interpretation I such that: I |= K * , I * |= Q, and I admits a clique-forest with at most |Ind(K)| + |Nom(K)| • |CI(K)| trees, nodes containing at most |CN(K)| elements, at most |CI(K)| children nodes connected to a single element, and each element of Ind(K) \ Nom(K) occurring in some root.

Theorem 4 .4SOI and SOF have the linear-size counter-model property.

Corollary 1 .1F n . Let π a,d and π e,b denote the parts of π from a to d and from e to b. The path π e,b is of length at most . We have chosen n ≥ , so the n-neighbourhood of e in I \ Ind(K) contains a path π e ,b isomorphic to π e,b . The path consisting of π a,d , the edge from d to e and π e ,b is of length + 1, but uses only m links, so there is an s-path ρ in (F n ∪ I) \ Ind(K) from a to b of length at most . Due to the construction of F n , each path to e starting in F n \ Ind(K) must visit d. Let ρ a,d and ρ e ,b denote the parts of ρ from a to d and from e to b . Note that ρ e ,b lies in I \ Ind(K), so it does not use any links. At least one of the following inequalities holds: |ρ a,d | < |π a,d | or |ρ e ,b | < |π e,b |. If the first one holds, we can take the path consisting of ρ a,d , the edge from d to e and of π e,b as an s-path from a to b of length at most . If the second inequality holds, we use the fact that ρ e ,b lies in the n-neighbourhood of e in I \ Ind(K) and find an isomorphic path ρ e,b from e to b. The path consisting of π a,d , an edge from d to e and of ρ e,b is an s-path from a to b of length at most . This proves the inductive hypothesis for m + 1. Recall that p = 2 2•3 4m |Q||T | 3m and l * = 8p + 3. For n ≥ 4p, F n is * -ranged by Lemmas 4 and 5. The size of the expanded query Q * is bounded by k • * . Furthermore, F n |= Q * for n ≥ k • * by Theorem 1. We conclude that F * n |= Q for n ≥ k • * by Lemma 2.We can now approach the estimation of the size of F n . The interpretation I is a counter-example, so in its clique-forest, each node is of size at most |CN(K)| and each element is connected to at most |CI(K)| children nodes. Children nodes of individuals in I form the roots of clique-trees in I , so the number of trees in I n is bounded by |Ind(K)|•|CI(K)|. As described before, the degree of elements in I is bounded by D = |CN(K)|+|CI(K)|, so the neighbourhoods in I of radius m consist of at most O(D m ) elements. It is enough to use C = O(D 2n ) + |Nom(K)| colours to obtain a proper n-colouring of I by Lemma 1. There are 2 |CN(K)| unary types, so there are L = 2 |CN(K)| • C ways of labelling a single element with its unary type and colour. The neighbourhoods of radius n are of size O(D n ), which means there are N = L O(D n ) different neighbourhoods in I n . On a downward path in F n , we cannot come across the same neighbourhood centred in an element node or a clique origin, as otherwise a link upwards would be created higher up on this path; thus N bounds the depth of clique-trees in F n . The degree of the clique-forest of I n is bounded by D = |CN(K)| • |CI(K)|, so the number of nodes in each tree is bounded by (D ) N . The overall size of F n is bounded by M •S(Q, T ), where M = |Ind(K)|•|CI(K)| bounds the number of clique-trees in I and S(Q, T ) is a bound on the size of a single tree. The size estimation of a single tree is independent of the ABox size; it is only governed by the TBox and query. This finalizes the proof of Theorem 4. Finite entailment of UCQs by SOI and SOF KBs is coNP-complete with respect to data complexity.

Let n ≥ 0. Because I \ Nom(K) has bounded degree, 2n-neighbourhoods in I have size bounded by some m. We colour the elements of I one by one, with m colours. Pick an uncoloured element d. At most m − 1 colours are already used in N I 2n (d). Assign to d any colour that is not yet used in N I 2n (d). This procedure gives an n-proper colouring. Indeed, consider different e, e from N I n (d) for some d ∈ I. Without loss of generality we can assume that e was coloured before e . But e belongs to N I 2n (e ), so the colours of e and e are different by construction.B Lifting TheoremB.1 Formulation Definition 10. For i ≤ n, let I i n be the interpretation obtained from I n by including into the interpretation of each role r all i-links along r; that is, for every role r and every i-link (d, e) along r, (d, e) ∈ r I i n .

Lemma 7 .7Elements forming a link are at distance at least 2n in I n .

symmetrical. Either d has an s-successor e such that N In nj (e) N In nj (e ), or e has an s-predecessor d such that N In nj (d) N In nj (d ); once again we take advantage of almost full symmetry and only present the proof for the first case. Let g : N In nj (e) → N In nj (e ) be the witnessing isomorphism for the (d, e) link. We define h j+1 : P → I nj+1 n

are at most N = |Ind(K)| • |CI(K)| trees in the input forests, with branching bounded by D = |CN(K)| • |CI(K)|. Transition relation has the form

edge labels (τ, r, σ) with A ∈ τ and B ∈ σ; edge labels (σ, r − , τ ) with A ∈ τ and B ∈ σ; unary types containing both A and N r,b for some b such that b ∈ BM ; unary type containing both B and N r − ,b for some b such that b ∈ A M .These conditions simply disallow certain labels; they can be checked by a twostate automaton with a trivial acceptance condition.Let us take a CI of the form A ∃r.B .

. The first three factors stem from the sizes of automata used to verify on I \ Ind(K) all the CIs of the forms i A i j B j , A ∀r.B, and A ∃r.B, respectively. The last factor stems from the verification of existential restrictions for nominals.

Acknowledgements. The author was supported by Poland's National Science Centre grant 2018/30/E/ST6/00042. The whole work has been prepared with huge help from Filip Murlak, the supervisor of the author's Master thesis. We thank the anonymous reviewers of DL 2020 for their helpful comments.

Measured entity

Notated as Size estimation number of unary types

The bound of the size of the unary types set is not just 2

By taking a product of the sizes of automata components, we bound the number of states of the product automaton by 2

C.4 Functionality component for SOF

We include an additional component for each functionality declaration Fn(r).

To check functionality of r for ordinary nodes it suffices to examine the label of the node and the labels on all incident edges, which only requires storing in the state the label of the edge to the parent. Additionally, for all a ∈ Nom(K): if a ∈ (N r,b ) M and a ∈ (N r,b ) M for some b = b , the final automaton rejects everything; if a ∈ (N r,b ) M , the automaton checks that no type used in the input forest contains N r − ,a ; if a / ∈ (N r,b ) M for each b ∈ Nom(K), the automaton checks that a type with N r − ,a occurs at most once in the input forest. The total number of states in the described component is |Γ | |CI(K)| • 2 |Nom(K)|•|CI(K)| , so including it does not affect the overall upper bound.

Pushing the EL envelope FBaader SBrandt CLutz Proc. Int. J. C. on Artif. Int Int. J. C. on Artif. Int 2005 The Description Logic Handbook: Theory, Implementation and Applications FBaader DCalvanese DLMcguinness DNardi PFPatel-Schneider 2010 Cambridge University Press New York, NY, USA 2nd edn. Ontology-mediated query answering with data-tractable description logics MBienvenu MOrtiz Reasoning Web. LNCS Springer 2015 9203 Tractable reasoning and efficient query answering in description logics: The DL-Lite family DCalvanese GDGiacomo DLembo MLenzerini Riccardo J. of Autom. Reasoning 39 3 2007 Conjunctive query answering for the description logic SHIQ BGlimm CLutz IHorrocks USattler J. Artif. Intell. Res 31 2008 Finite query answering in expressive description logics with transitive roles TGogacz YAIbáñez-García FMurlak Proc. KR KR 2018. 2018 On the BDD/FC conjecture TGogacz JMarcinkowski 10.1145/2463664.2463668 PODS 2013 Reasoning in description logics by a reduction to disjunctive datalog UHustadt BMotik USattler J. Autom. Reasoning 39 3 2007 The data complexity of description logic ontologies CLutz FWolter Log. Met. in Comp. Sci 13 4 2017 Characterizing data complexity for conjunctive query answering in expressive description logics MOrtiz DCalvanese TEiter Proc. Nat. C. on Artif. Int Nat. C. on Artif. Int AAAI Press 2006 Undecidability results for database-inspired reasoning problems in very expressive description logics SRudolph KR. 2016 AAAI Press Nominals, inverses, counting, and conjunctive queries or: Why infinity is your friend! SRudolph BGlimm J. Artif. Intell. Res 39 2010 On the complexity of the instance checking problem in concept languages with existential quantification ASchaerf J. of Intel. Inf. Systems 2 1993