Introduction

Vertically acyclic conjunctive queries over trees ?

Filip Murlak

Grzegorz Zielinski

0 0 University of Warsaw

2013

Seeking a manageable subclass of conjunctive queries over trees that would reach beyond tree patterns, we nd that vertical acyclicity of queries is su cient to guarantee the same complexity bounds for static analysis problems, as those enjoyed by tree patterns. Conjunctive queries (CQs) are a staple of relational databases, o ering a good balance between expressive power and the computational cost of typical static analysis problems, like query containment [3]. However, over tree-structured data, like XML documents, their place is taken by tree patterns [7], which correspond to downward, acyclic CQs. The arguments in their favour are threefold. First, allowing arbitrary CQs induces exponential increase in the complexity of query containment in the presence of schema information: for unions of tree patterns (using horizontal and vertical axes) it is ExpTime-complete in general and ExpSpace-complete under non-recursive schemas (where the depth of trees is bounded by the size of the schema) [1, 6, 9]; for unions of CQs it is 2ExpTimecomplete in general (even without horizontal axes) [2] and ExpSpace-complete under non-recursive schemas (NExpTime-complete without horizontal axes) [8]. Second, each CQ (over trees) can be rewritten as a union of exponentially many polynomial-size tree patterns [5]. Hence, unions of tree patterns and unions of CQs are equiexpressive languages, but the latter is more succinct. By choosing tree patterns as the basic formalism, we put the burden of dealing with the exponential blow-up on the user's shoulders. An advantage|from the user's point of view|is that the computational cost is more directly re ected in the size of queries, making it easier to control when writing them. Finally, over tree-structured data it is more natural to navigate in the tree, rather than declaratively specifying properties of nodes in the fashion of rstorder logic. For acyclic queries, this navigational approach can be easily supported with appropriate syntax, as illustrated palpably by the popularity of the XPath query language. Certain attempts of going beyond acyclic queries have been made with the introduction of path intersection operator to XPath 2.0, but there seems to be no natural way of dealing with full CQs. Still, following [2], one can ask how much of the succinctness of full conjunctive queries can be allowed without compromising the complexity bounds. A partial answer was given in [8]: without in uencing the complexity bounds,

Introduction

a1 a1 a2 a2 a1 a2 one can extend tree patterns with arbitrary horizontal CQs over siblings (see Fig. 1); arbitrary joins with horizontal CQs cannot be allowed, as they may induce additional vertical relations, breaking the acyclicity condition. The argument combines the well-known automata construction for tree patterns with a new construction of an exponential-size deterministic automaton over words, equivalent to a given horizontal CQ.

It is well known, however, that an exponential-size automaton can be constructed for any acyclic CQ (see e.g. [ 4 ]). Since acyclic CQs can navigate up and down the tree, the construction is more involved than for tree patterns (which only go down). Can it be used to extend the result of [ 8 ] from tree patterns with horizontal CQs to vertically acyclic CQs (see Fig. 1)? We show that the answer is yes, but rather unexpectedly it requires substantial additional e ort. The construction for the horizontal patterns provided by [ 8 ] is too weak: we need an automaton that nds all matches of the horizontal CQ in the input word, not just some match. In other words, we need a way to run multiple copies of the original automaton without a ecting drastically the size of the state-space.

The reminder of the paper is organized as follows: in Section 2 we recall the basic notions, in Section 3 we construct the enhanced automaton for horizontal CQs, and in Section 4 we combine it with the construction for acyclic CQs. 2

Preliminaries

XML documents and trees. We model XML documents as unranked labelled trees. Formally, a tree over a nite labelling alphabet is a relational structure T = hT; #; #+; !; !+; (aT )a2 i, where { the set T is a nite unranked tree domain, i.e., a nite pre x-closed subset of N such that v i 2 T implies v j 2 T for all j < i; { the binary relations # and ! are the child relation (v # v i) and the nextsibling relation (v i ! v (i + 1)); { #+ and !+ are transitive closures of # and !; { (aT )a2 is a partition of the domain T into possibly empty sets. We write jT j to denote the number of nodes of tree T . The partition (aT )a2 de nes a labelling of the nodes of T with elements of , denoted by `T . Tree automata. We abstract schemas as tree automata. We use a variant in which the state in a node v depends on the states in the previous sibling and the last child of v. Formally, an automaton is a tuple A = ( ; Q; q0; F; ), where is the labelling alphabet (the set of element types in our case), Q is the state space with the initial state q0 and nal states F , and Q Q Q is the transition relation. A run of A over a tree T is a labelling of the nodes of T with the states of A such that for each node v with children v 0; v 1; : : : ; v k and previous sibling w, (w); (v k); `T (v); (v) 2 . If v has no previous sibling, (w) in the condition above is replaced with q0. Similarly, if v has no children, (v k) is replaced with q0. The language of trees recognized by A, denoted by L(A), consists of all trees admitting an accepting run of A, i.e. a run that assigns one of the nal states to the root.

A schema is nonrecursive, if the depth of trees it accepts is bounded by a constant dependent on the schema. For schemas de ned with automata, this constant is always at most the number of states.

CQs and patterns. A conjunctive query (CQ) over alphabet is a formula of rst order logic using only conjunction and existential quanti cation, over unary predicates a(x) for a 2 and binary predicates #; #+; !; !+ (referred to as child, descendant, next sibling, and following sibling, respectively). Since we work only with Boolean queries, to avoid unnecessary clutter we often skip the quanti ers, assuming that all variables are by default quanti ed existentially.

An alternative way of looking at CQs is via patterns. A pattern over can be presented as = hV; Ec; Ed; En; Ef ; ` i where ` is a partial function from V to , and hV; Ec [ Ed [ En [ Ef i is a nite graph whose edges are split into child edges Ec, descendant edges Ed, next-sibling edges En, and following-sibling edges Ef . By k k we mean the size of the underlying graph.

We say that a tree T = hT; #; #+; !; !+; (aT )a2 i satis es a pattern = hV; Ec; Ed; En; Ef ; ` i, denoted T j= , if there exists a homomorphism h : ! T , i.e., a function h : V ! T such that + { h : hV; Ec; Ed; En; Ef i ! hT; #; #+; !; !i is a homomorphism of relational structures; and { `T (h(x)) = ` (x) for all x in the domain of ` .

Each pattern can be seen as a CQ, and vice versa. In what follows we use the terms \pattern" and \CQ" interchangeably. Tree patterns are patterns whose underlying graph is a directed tree with edges (of the four kinds) pointing from parents to children. 3

Finding all matches of a horizontal pattern

We write V w for the set of positions of word w, f1; 2; : : : ; jwjg, and use and +1 for the standard order and successor on the positions of words. A horizontal pattern uses only ! and !+ edges. Homomorphisms into words are de ned just like for trees. Whenever we write h : ! w, we implicitly assume that h is a homomorphism. By a pattern rooted at vertex x 2 V we mean a pair ( ; x); for i 2 V w we write w; i j= ( ; x) if there is h : ! w such that h(x) = i.

The aim of this section is to prove the following theorem, which provides an economic construction of an automaton nding all matches of a rooted horizontal pattern in the input word, used crucially in the proof of the main result. Theorem 1. For each rooted horizontal pattern ( ; x) one can construct in polynomial working space an exponential-size deterministic automaton Match ;x recognizing the language of words (a1; 1)(a2; 2) : : : (an; n) 2 ( f>; ?g) such that for w = a1a2 : : : an and all i 2 V w, i = > if and only if w; i j= ( ; x).

We shall think of Match ;x as a decision procedure reading letter by letter a word w 2 and an additional labelling 2 f>; ?gjwj, storing some information in the working memory of size polynomial in k k and independent from jwj. Earliest matchings. Like in [ 8 ], Match ;x will look for the earliest (leftmost) matchings witnessing w; i j= ( ; x). A word with in nity w1 is a word w extended with an additional position 1, greater then all original positions, that is its own successor, and bears all the labels in . Over words with in nity we use < to denote the composition of and the successor relation, which can be equivalently de ned as: x < y i x < y or x = y = 1. The notion of homomorphism extends naturally to words with in nity. Note that since 1 is not successor of any ordinary position, there can be no ! edge between a vertex mapped to an ordinary position and a vertex mapped to 1, but there can be a ! edge between two vertices mapped to 1. The source and target of each !+ edge must be mapped to positions i < j. Note that each homomorphism into w1 induces by restriction a partial homomorphism into w (we shall blur the distinction between them), but not every partial homomorphism into w extends to a homomorphism into w1.

De nition 1. Let g; h :

! w1 be two homomorphisms. { We write g h if g(x) h(x) for each vertex x of . { We de ne min(g; h) : ! w1 as min(g; h)(x) = min(g(x); h(x)). { We say that h respects relation V V w, if h [ V f1g.

Note that h constantly equal to 1 respects each . We will be mostly interested in two special cases of : extending a xed partial homomorphism, and mapping the pattern's root to a xed position (or one of several xed positions), but for conciseness we treat them in this abstract fashion.

Lemma 1. 1. For all g; h : ! w1, min(g; h) is a homomorphism. 2. There exists hmin : ! w1 such that hmin h for all h : ! w1. 3. For each relation V V w1, there exists hmin : ! w1 respecting such that hmin h0 for each h0 : ! w1 respecting . tu We call the unique hmin from Lemma 1 the earliest matching of in w1, and hmin the earliest matching respecting . Note that hmin = hmin for = V V w. Of course, pattern can be matched in word w if and only if the earliest matching hmin : ! w1 maps no vertex to 1, and similarly for matchings respecting . Algorithm. The procedure Match ;x works with components of , called rm subpatterns, described in De nition 3.

De nition 2. A !-component of is a maximal connected subgraph of the !-graph of . In the graph of !-components of , denoted G , there is an edge from a !-component 1 to a !-component 2 if there is a !+ edge in from a vertex of 1 to a vertex of 2.

De nition 3. A pattern is rm if G is strongly connected. In general, each strongly connected component X of G de nes a rm subpattern of : the subgraph of induced by the vertices of !-components contained in X. The DAG of rm subpatterns of , denoted F , is the standard DAG of strongly connected components of G .

All four horizontal subpatterns in Fig. 1 are rm, but have two !-components.

Match ;x(w) does two independent checks, the positive and the negative; it accepts when both checks accept. The negative check should reject if it nds a matching of that maps x to a position with ?, and accept otherwise. We read the input word w from left to right letter by letter, trying to build the earliest matching of that maps x to a ? position. To process position i, { compute the earliest matching in w1::i1 that extends the constructed partial homomorphism into w1::i 1 and maps x to a ? position.

The positive check is dual: it should accept if it can nd matchings for all positions with >. To keep the used space bounded, instead of running a separate check for each position with >, we run a single test, but we partially reset the constructed partial homomorphism each time we see >. We keep a look-ahead of size k k; that is, when position i is being processed, we already have read the input word up to position i + k k. To process position i, { if (i) = >, reset the partial homomorphism constructed so far: restrict it to components not reachable from the rm component containing x; { compute the earliest matching in w1::i+k k1 that extends the currently stored partial homomorphism and maps x to the most recent position j i with (j) = >; { if (i) = >, but the root component cannot be matched within w1::i+k k, reject immediately.

Accept if after processing the whole w a full homomorphism

! w is found.

Invariants. Correctness of the algorithm follows from the following invariants: Invariant 1 The partial homomorphism computed by the negative check after reading the i-th letter is the earliest matching of in w1::i1 mapping x to a ? position.

Invariant 2 The partial homomorphism h computed by the positive check after processing the i-th letter is the earliest matching of in w1::i+k k1 mapping x to the most recent > position in w1::i (not mapping it at all, i.e., respecting Sy6=xfyg f1; 2; : : : ; i + k kg, if there has been no > position yet); for each earlier > position j, the earliest matching hj : ! w1::i+k k1 mapping x to j, satis es hj h.

Invariant 1 follows immediately from Lemma 2, by induction on i.

Lemma 2. The earliest matchings h : ! u1 and h0 :

V V u always agree over vertices mapped to u by h. ! uu01 respecting tu

For Invariant 2, we use the following properties of the earliest matchings. Lemma 3. Let h0 : ! u1 be the earliest matching respecting Sy6=xfyg V u, and let hi : ! u1 be the earliest matching that maps the root x of to i 2 V u. 1. For every node y, whose component cannot be reached from the component of the root x, hi(y) = hj (y) for all i; j. 2. If hi(x) = i and 1 i j, then hi hj . tu

Slightly abusing notation, we shall apply the same symbol hj from Lemma 3 for di erent words; that is, hj : ! w1::i 1+k k1 is not the same as hj : ! w1::i+k k1 (although, by Lemma 2, the latter extends the former).

Before reading the rst letter, Invariant 2 is satis ed: the empty partial homomorphism is the earliest matching in the empty word (with in nity), and the second part trivialises. Assume that Invariant 2 holds for i 1 and let us see that it holds for i. Let gi 1 be the partial homomorphism constructed so far, and let gi be the one to be constructed. By Invariant 2 for i 1, gi 1 is the earliest matching of in w1::i 1+k k1 that maps root x to the most recent > position, or using notation from Lemma 3, gi 1 = hj : ! w1::i 1+k k1 for some 0 j i 1. If (i) = ?, the algorithm computes gi as the earliest matching extending the partial homomorphism gi 1 and mapping x to the most recent > position in w1::i+k k|which is also the most recent > position in w1::i 1+k k|and we conclude by Lemma 2. Assume that (i) = >. Then, the algorithm computes gi0 1, by restricting gi 1 to components that are not reachable from the root component. By Lemma 2, gi0 1 hj : ! w1::i+k k1. By Lemma 3 (1), gi0 1 hi : ! w1::i+k k1. Hence, the earliest matching in w1::i+k k1 that extends gi0 1 and maps x to i is equal to hi : ! w1::i+k k1. Since this is exactly how gi is computed, we have gi = hi : ! w1::i+k k1. This concludes the proof of the rst part of Invariant 2. To prove the second part, assume that i is not the rst ? position; that is, 0 < j < i. Since we have not rejected yet, gi 1 = hj : ! w1::i 1+k k1 matches the root component of and gi 1(x) = j. By Lemma 2, also hj : ! w1::i+k k1 maps x to j (rather than to 1). Hence, the second part of Invariant 2 for i follows from Lemma 3 (2) and Invariant 2 for i 1.

Correctness. Let us begin with a simple observation. It is routine to check that each homomorphic image of a rm pattern 0 is a subword of length at most k 0k. Consequently, if a node y of the pattern is matched at position i, then all rm components from which the component of y is reachable, must be matched before position i + k k: they can reach at most k k positions forward.

If the algorithm accepts, Invariant 2 guarantees that all suitable homomorphisms exist: indeed, if the earliest matching for the last > position does not use 1, neither do the ones for the earlier > positions.

Assume that the algorithm rejects. That means that hi : ! w1::i+k k1 does not match the root component (that is, matches it to 1). Assume that hi : ! w1 does match the root component. Then it maps x to i, so by the observation above it maps all the components from which the component of y is reachable, within w1::i+k k. Restricting hi : ! w1 to these components and extending with in nity to other vertices of would give a matching earlier then hi : ! w1::i+k k1 | a contradiction. Thus, hi : ! w1 is not a total homomorphism into w, so w; i 6j= ( ; x) and the algorithm rejects correctly. Memory bound. Now that we have seen that Match ;x(w) is correct, we need to bound the memory it uses. We claim that while processing position i the algorithm only needs to have access to positions between i k k and i + k k. That means that the algorithm only needs to store last 2 j j + 1 symbols read, plus the matching constructed so far, restricted to this su x.

We shall use a dualized version of the observation made previously: if a node y of the pattern is matched at position i, then all rm components reachable from the component of y must be matched after position i k k: they can reach at most k k positions back.

For the negative check, the claim is almost obvious. After reading a new symbol we match the subpatterns greedily, until no more can be matched. Among matched patterns each has an ancestor (possibly itself) that touches the last symbol (otherwise, it would have been matched before). By the dualized observation none of these patterns can reach back further than k k, so it su ces to store last k k symbols read.

For the positive check the situation is similar, except that before we start matching we rst check if k k positions back we had > or ?. If it was ? we proceed essentially like in the negative check. But if it was >, we rematch the rm components reachable from the component of the root x, in such a way that x is matched k k positions back. Again, by the dualized observation, the rematched components can reach at most k k more positions back, so they also t within the intended window. 4

The main result

Using the matching procedure we prove our main result: we show that extending acyclic CQs using vertical axes with arbitrary horizontal CQs over siblings does not increase the complexity of the containment problem. In fact, we shall work with a more general problem of satis ability of Boolean combinations, or BC-SAT: given a Boolean combination of patterns ' and an automaton A, decide if there is a tree T 2 L(A) such that T j= '. Containment for unions of CQs corresponds to non-satis ability for Boolean combinations of the form ^ : 1 ^ : 2 ^ ^ : k. Since we work with boolean combinations anyway, each disconnected pattern can be replaced with a conjunction of connected patterns. Thus, we shall only consider connected patterns from now on.

Assuming that vertical edges propagate through horizontal edges (e.g., next sibling of a child is also explicitly connected with a child edge), one can dene vertically acyclic patterns as those whose #; #+-subgraph is acyclic in the undirected sense; since patterns are connected, it is then an undirected tree. Without this assumption we de ne this notion as follows. A horizontal component of pattern is a connected component of the !; !+-subgraph of . Let H = hV ; #; #+i be a graph over horizontal components of , where edge X # Y is present if x # y for some x 2 X and y 2 Y , and X #+ Y is present if x #+ y for some x 2 X and y 2 Y , but there is no edge X # Y . We say that pattern is vertically acyclic, if this graph is acyclic in the undirected sense. Again, as is connected, the graph is an undirected tree.

Given a vertically acyclic pattern , by simple preprocessing we ensure that there is at most one edge in between each two horizontal components: if there are more, we can merge their starting points in the parent/ancestor horizontal component, and drop all edges except one (a child edge, if there is one). Theorem 2. For vertically acyclic patterns, BC-SAT is ExpTime-complete, and PSpace-complete under non-recursive schemas.

Proof. We shall see that for a vertically acyclic pattern one can construct an equivalent automaton A in polynomial working space (states have polynomialsize representations and all ingredients of A can be enumerated in polynomial working space). Despite its nondeterminism, we will be able to complement A by simply changing accepting states. Hence, one can easily reduce BC-SAT to nonemptiness of tree automata, by constructing in polynomial working space an automaton A' equivalent to a given boolean combination '. The product B of A' with a nonrecursive automaton A can be tested for nonemptiness by a nondeterministic algorithm using space O(kAk log kAk poly (k'k)), which guesses a tree of depth at most kAk in the depth- rst order and nondeterministically evaluates B over it, processing it in the post-order fashion. By Savitch's theorem, this gives a PSpace algorithm for satis ability under nonrecursive schemas. As A' is singly exponential in k'k, the standard polynomial-time emptiness test gives an ExpTime algorithm for satis ability under arbitrary schemas.

The idea is that the automaton A guesses in each node which subpatterns are matched there, and then checks that these guesses are consistent: a subpattern is matched in some node if and only if its root can be matched there and all its sub-subpatterns can be matched in appropriate nodes. To ensure that these dependencies are well founded, we root the undirected tree forming the graph H , by arbitrarily xing a root component and treating all the edges as pointing outwards. This builds the structure of a directed tree over the graph H ; the orientation of edges in this directed tree is entirely unrelated to the character of the corresponding edges in the pattern: child and descendent edges can point both up and down the tree. We use this structure to speak precisely of subpatterns: for each horizontal component X of we de ne subpattern :X obtained by restricting to X and all descendants of X in the directed tree. If component X has successors Y1; Y2; : : : ; Yn in the directed tree, then the subpattern :X is composed of the nodes in the horizontal component X connected to subpatterns :Y1; :Y2; : : : ; :Yn with child or descendant edges, pointing to or form X. We call patterns :Y1; :Y2; : : : ; :Yn the immediate subpatterns of pattern :X. Unless X is the root component of the whole directed tree, it is connected to some other horizontal component Z, with a child or descendent edge, pointing to or from X. If the edge points to X, we call :X a down-subpattern. Otherwise, we call it an up-subpattern. In either case, we call the vertex of X that is the end of this edge, the root vertex of :X; if X is the root component of the whole tree, we choose an arbitrary vertex in X for the root vertex. We say that :X is matched at node v if its root vertex is mapped to v.

The automaton A , before reading the sequence of children of a node v, non-deterministically selects the set Expected (v) of subpatterns that can be matched at v. Similarly, A non-deterministically selects a set ExpectedAbove(v) of subpatterns that can be matched at some ancestor of v. (Note that for downsubpatterns the exact place where the root vertex is matched among siblings is irrelevant; we make the distinction only for the purpose of uniform treatment by the procedure match.) After selecting Expected (v) and ExpectedAbove(v), the automaton proceeds to read the sequence of children of v. While doing so it computes the set Matched (v) of subpatterns of matched in the children of v and the set MatchedBelow (v) of subpatterns that were matched in the children of some descendant of v, and performs a number of consistency checks using a slightly modi ed version of the procedure Match, described below.

With each subpattern :X we associate a horizontal pattern X obtained by restricting to X and including in the label of each vertex x the set of immediate subpatterns :Y of :X connected to x. In the modi ed procedure Match for X , a vertex labelled with ( ; ) can be matched in a position labelled with ( ; ) only if = and . It is straightforward to check that this does not in uence the correctness of Match. Observe that the extended alphabet is exponential, but each symbol can be stored in polynomial memory. Hence, Match still works in memory polynomial in the size of the pattern.

Reading the sequence of children of v, the automaton performs simultaneously (using the product construction) the following tasks: 1. Check that Expected (v) and ExpectedAbove(v) do not contradict the choices for the children of v. The subpatterns expected to be matched in an ancestor of a child w must either be matched in the parent of w or an ancestor of the parent of w; that is, ExpectedAbove(w) = Expected (v) [ ExpectedAbove(v) must hold for each child w. If for some w it fails to hold, the computation stops and no state is assigned to v, because the guesses are inconsistent.

The automaton treats the root as any other node; that is, it performs the tasks above for the single-node sequence of siblings consisting only from the root, treating it as a child of a dummy node #. The automaton accepts if: Expected (#) = ExpectedAbove(#) = ; and 2 Matched (#)[MatchedBelow (#).

By the structural induction over subpatterns one easily proves that in any complete run (accepting or rejecting) of the described automaton A , the sets Expected (v), ExpectedAbove(v), Matched (v), and MatchedBelow (v) are guessed or computed correctly for each node of the input tree. Hence, the construction is correct. Moreover, for each node v of the input tree, there exists exactly one correct set Expected (v) and one correct set ExpectedAbove(v). Hence, for each input tree there is exactly one complete run, modulo the last guess of Expected (#) and ExpectedAbove(#). Consequently, the automaton A can be complemented simply by changing the acceptance condition to: Expected (#) = ExpectedAbove(#) = ; and 2= Matched (#) [ MatchedBelow (#). tu

Benedikt ,

Fan ,

Geerts . XPath satis ability in the presence of DTDs . J. ACM 55 ( 2 ), 2008 .

2. H. Bjorklund, W. Martens,

Schwentick . Optimizing conjunctive queries over trees using schema information . Proc. MFCS 2008 : 132 { 143 .

3. A. K. Chandra , P. M. Merlin . Optimal implementation of conjunctive queries in relational data bases . Proc. STOC 1977 : 77 { 90 .

Francis ,

David ,

Libkin . A Direct Translation from XPath to Nondeterministic Automata . Proc. AMW 2011 .

Gottlob ,

Koch ,

Schulz . Conjunctive queries over trees . J. ACM 53 ( 2006 ), 238 { 272 .

Marx . XPath with conditional axis relations . Proc. EDBT 2004 : 477 { 494 .

Miklau ,

Suciu . Containment and equivalence for a fragment of XPath . J. ACM 51 ( 1 ): 2{ 45 , 2004 .

Murlak ,

Oginski ,

Przybylko . Between Tree Patterns and Conjunctive Queries: Is There Tractability beyond Acyclicity? Proc. MFCS 2012 : 705 - 717 .

Neven ,

Schwentick . On the complexity of XPath containment in the presence of disjunction, DTDs, and variables . Logical Meth. in Comp. Sci. 2 ( 3 ): 1{ 30 , 2006 .