Introduction

Expressiveness and Static Analysis of Extended Conjunctive Regular Path Queries

Institut fur Informatik

Goethe-Universitat

Frankfurt am Main

Germany

freydenberger@em.uni-frankfurt.de

schweika@informatik.uni-frankfurt.de

We study the expressiveness and the complexity of static analysis of extended conjunctive regular path queries (ECRPQs), introduced by Barcelo et al. (PODS '10). ECRPQs are an extension of conjunctive regular path queries (CRPQs), a well-studied language for querying graph structured databases. Our rst main result shows that query containment and equivalence of a CRPQ in an ECRPQ is undecidable. This settles one of the main open problems posed by Barcelo et al. As a second main result, we prove a non-recursive succinctness gap between CRPQs and the CRPQ-expressible fragment of ECRPQs. Apart from this, we develop a tool for proving inexpressibility results for CRPQs and ECRPQs. In particular, this enables us to show that there exist queries de nable by regular expressions with backreferencing, but not expressible by ECRPQs.

Introduction

Many application areas (e. g., concerning the Semantic Web or biological applications) consider graph structured data, where the data consists of a nite set of nodes connected by labeled edges. For querying such data, one usually needs to specify types of paths along which nodes are connected. A widely studied class of queries for graph structured databases are the conjunctive regular path queries (CRPQs) (cf., e. g., [ 5, 7, 8 ]), where types of paths can be described by regular expressions specifying labels along the paths. For modern applications, however, also more expressive query languages are desirable, allowing not only to specify regular properties of path labels, but also to compare paths based on, e. g., their lengths, labels, or similarity.

To start a formal investigation of such concepts, Barcelo et al. [ 4 ] introduced the class of extended conjunctive regular path queries (ECRPQs), allowing to use not only regular languages to express properties of individual paths, but also regular relations among several paths, capable of expressing certain associations between paths. The authors of [ 4 ] investigated the complexity of query evaluation and static analysis of ECRPQs. While query containment is known to be decidable and Expspace-complete for CRPQs [ 8, 5 ], it was shown to be undecidable for ECRPQs [ 4 ]. However, checking containment of an ECRPQ in a CRPQ still is decidable and Expspace-complete [ 4 ]. (Un)Decidability of checking containment (or, equivalence) of a CRPQ in an ECRPQ was posed as an open question in [ 4 ].

In the present paper, we answer this question by showing that containment of a CRPQ in an ECRPQ is undecidable | even if the ECRPQ is, in fact, a CRPQ extended only by relations for checking equality of path labels (or, similarly, equal lengths of paths). Our proof proceeds by (a) simulating Turing machine runs by so-called H-systems, a concept from formal language theory generalizing pattern languages, and (b) using CRPQs and ECRPQs to represent languages described by H-systems. Our proof generalizes to (i) the case where one of the two queries is xed, (ii) the case where all queries are Boolean and acyclic, and (iii) the problem of deciding equivalence rather than containment of CRPQs and ECRPQs.

Apart from the static analysis of queries, the present paper also investigates the expressiveness and succinctness of ECRPQs. Using the machinery developed for proving our undecidability results concerning static analysis, we show that CRPQ-de nability of a given ECRPQ is undecidable, and that there is no recursive function f such that every CRPQ-de nable ECRPQ of length n is equivalent to a CRPQ of length f (n).

Concerning the expressivity of (E)CRPQs, to the best of our knowledge, tools for showing inexpressibility results have not been presented in the literature yet. We develop such tools, enabling us to show, for example, that no ECRPQ-query can return exactly those tuples of nodes between which there is a path whose length is a composite number (i. e., a number of the from nm for n; m 2). Since these paths can be easily described by a regular expression with backreferencing (cf. [ 1 ]) of the form (a a+)%x x+, this refutes a claim of [ 4 ] stating that all regular expressions with backreferencing can be expressed by ECRPQs.

Structure of the paper. We start with the necessary notations and de nitions in Section 2 where, in particular, the syntax and semantics of ECRPQs (and restrictions thereof) are de ned. Section 3 is devoted to the static analysis of ECRPQs and CRPQs, showing that containment and equivalence of CRPQs in ECRPQs are undecidable. Section 4 investigates the relative succinctness between CRPQs and CRPQ-expressible ECRPQs and provides tools for proving limitations to the expressive power of CRPQs and ECRPQs. Due to space limitations, many technical details of the proofs had to be deferred to the full version. 2

Preliminaries

Let N denote the set of non-negative integers. We denote the empty word by ". Let A; B be alphabets. A morphism (between A and B ) is a function h : A ! B with h(uv) = h(u)h(v) for all u; v 2 A . For every word w 2 A , jwj stands for the length of w, and for every letter a 2 A, jwja denotes the number of occurrences of a in w.

DB-Graphs and Queries. A -labeled db-graph is a directed graph G = (V; E), where V is a nite set of nodes, and E V V is a nite set of directed edges with labels from . A path between two nodes v0 and vn in G with n 0 is a sequence v0a1v1 vn 1anvn with v0; : : : ; vn 2 V , a1; : : : ; an 2 , and (vi; ai+1; vi+1) 2 E for 0 i < n. We de ne the label ( ) of the path by ( ) := a1 an. Furthermore, for every v 2 V , we de ne the empty path v"v, with (v"v) = ".

A central concept considered in the present paper are regular relations (cf. [ 4 ] and the references therein). Let be a nite alphabet, let ? be a new symbol with ?2= , and let ? := [ f?g. Let w = (w1; : : : ; wk) 2 ( )k, where wi = ai;1 ai;jwij (and all ai;j 2 ). We de ne the string [w] 2 ( ?)k by [w] := b1 bn, where n is the maximum of all jwij, and bj := (bj;1; : : : ; bj;k), with bj;i = ai;j if j jwij, and bj;i =? if j > jwij. In other words, [w] is obtained by aligning all wi to the left, and padding the un lled space with ? symbols. A k-ary relation R ( )k is called regular if the language f[r] j r 2 Rg is regular.

Obviously, every regular language is a (unary) regular relation. In addition to this, the present paper focuses on the following k-ary regular relations (k 2): 1. the equality relation eq: = f(w1; : : : ; wk) j w1 = : : : = wkg, 2. the length equality relation el := f(w1; : : : ; wk) j jw1j = : : : = jwkjg. Note that each of these relations needs to be de ned w. r. t. a nite alphabet , which we usually omit for the sake of brevity.

We now de ne ECRPQs and CPRQs, following the de nitions from [ 4 ]. Fix a countable set of node variables and a countable set of path variables. Let be a nite alphabet. An extended conjunctive regular path query (ECRPQ) Q over is an expression of the form

Ans(z; ) ^ (xi; i; yi); ^

Rj (!j ); 1 i m 1 j l (1) such that m 1, l

0, and 1. each Rj is a regular expression that de nes a regular relation over , 2. x = (x1; : : : ; xm) and y = (y1; : : : ; ym) are tuples of (not necessarily distinct) node variables, 3. = ( 1; : : : ; m) is a tuple of distinct path variables, 4. !1; : : : ; !l are tuples of path variables, such that each !j is a tuple of variables from , of the same arity as Rj , 5. z is a tuple of node variables among x, y, and 6. is a tuple of path variables among those in .

The expression Ans(z; ) is the head, and the expression to the right of is the body of Q. If z and are the empty tuple (i. e., the head is of the form Ans()), Q is a Boolean query. The relational part of an ECRPQ Q is V1 i m(xi; i; yi), and the labeling part is V1 j l Rj (!j ). We denote the set of node variables in Q by nvar(Q).

Intuitively, all variables are quanti ed existentially, and the words formed by the labels along the paths have to satisfy the respective relations. Formally, for every -labeled db-graph G, every ECRPQ Q (of the form described in (1)) over , every mapping from the node variables of Q to nodes in G, and every mapping from the path variables of Q to paths in G, we write (G; ; ) j= Q if 1. ( i) is a path from (xi) to (yi) for every 1 i m, 2. for each !j = ( j1 ; : : : ; jk ), 1 j l, the tuple ( ( ( j1 )); : : : ; ( ( jk ))) belongs to the relation Rj .

Finally, we de ne the output of Q (of the form described in (1)) on G by Q(G) := f (z); ( ) j ; such that (G; ; ) j= Q g: As usual, if Q is Boolean, we model the Boolean constants true and false by the empty tuple () and the empty set ;, respectively. In other words, Q(G) = true i there exist assignments and with (G; ; ) j= Q.

Two queries Q and Q0 are called equivalent (Q Q0, for short) if Q(G) = Q0(G) for all db-graphs G. A query Q is said to be contained in a query Q0 (Q Q0, for short) if Q(G) Q0(G) for all db-graphs G.

With an ECRPQ Q we associate an edge-labeled directed graph HQlab whose vertex set is the set of node variables occurring in Q, and where there is an edge from x to y labeled i (x; ; y) occurs in the relational part of Q. As in [ 4 ], we write HQ to denote the (unlabeled) directed graph obtained from HQlab by deleting the edge-labels (and removing duplicate edges). A query Q is called acyclic if HQ is acyclic.

In accordance with [4], a conjunctive regular path query (CRPQ) Q over

is an ECRPQ over of the form described in (1), where all relations Rj are unary relations, and (hence), all tuples !j are singletons.

Thus, CRPQs can only refer to the languages that are allowed to occur along the paths, while ECRPQs can also describe relations between di erent paths.

The present paper devotes special attention to two classes of queries with an expressive power that lies strictly between CRPQs and ECRPQs: A CRPQ with equality relations is an ECRPQ where every relation in the labeling part is either of arity 1 (i. e., a regular language), or a k-ary eq-relation for some k 2. Analogously, a CRPQ with equal length relations is an ECRPQ where every relation in the labeling part is either of arity 1, or a k-ary el-relation.

It is easy to see that ECRPQs and CRPQs can be transformed into queries in the following normal forms (note, though, that these transformations might increase the size of the queries): Lemma 1. For every ECRPQ Q = Ans(z; ) ^ (xi; i; yi); ^ Rj (!j ); 1 i m 1 j l there exists a regular relation R of arity m such that Q is equivalent to the ECRPQ Q0 := Ans(z; ) ^ (xi; i; yi); R( 1; : : : ; m):

1 i m Lemma 2. For every CRPQ Q = Ans(z; ) ^ (xi; i; yi); ^ 1 i m (where ij 2 f1; : : : ; mg), there exist regular languages L01; : : : ; L0m Q is equivalent to the CRPQ Q0 := Ans(z; )

Lj ( ij ) 1 j l

such that ^ L0i( i).

^ (xi; i; yi); 1 i m 1 i m Hence, for ECRPQs it su ces to consider just one regular relation of arity m; and for CRPQs, it su ces to consider just one regular language per path variable. Turing Machines and H-Systems. Let M be a (deterministic) Turing machine with state set Q, initial state q0 2 Q, halting state qH 2 Q, tape alphabet (including the blank symbol), such that Q \ = ;, and an input alphabet I that does not include the blank symbol. We adopt the conventions that M accepts by halting, and does not halt in the rst step (i. e., q0 6= qH ).

A con guration of M is a word w1qw2, with w1; w2 2 and q 2 Q. We interpret w1qw2 as M being in state q, while the tape contains w1 on the left side, and w2 on the right side. The head is on the position of the rst (leftmost) letter of w2 (if w2 = ", M reads the blank symbol). We denote the successor relation on con gurations of M by `M. An accepting run of M is a sequence C0; : : : ; Cn of con gurations of M (with n 1), such that C0 2 q0 I (C0 is an initial con guration), Cn 2 qH (Cn is an accepting con guration), and Ci `M Ci+1 holds for all 0 i < n. Let := [ Q [ f#g, where # is a new letter that does not occur in or Q. We de ne the set of valid computations of M by VALC(M) := f#C0# #Cn# j C0; : : : ; Cn is an accepting run of Mg, and denote its complement by INVALC(M) := n VALC(M). Finally, we de ne dom(M) to be the set of all w 2 I such that M halts after a nite number of steps when started in the con guration q0w.

By de nition, INVALC(M) = holds if and only if dom(M) = ;; and note that (given M), the question if dom(M) = ; is undecidable.

As a technical tool for our proofs, we use the notion of H-systems to describe the sets INVALC(M) for Turing machines M. Our notion of H-systems can be viewed as a generalization of pattern languages (cf. Salomaa [ 15 ]), or as a restricted version of the H-systems introduced by Albert and Wegner [ 3 ]. L(H) = Sik=1 L(Hi).

De nition 3. An H-system (over the alphabet ) is a 4-tuple H := ( ; X; L; ), where (i) X and are nite, disjoint alphabets, (ii) L is a function that maps every x 2 X to a regular language L(x) with " 2 L(x), and (iii) 2 (X [ )+.

A morphism h : ( [X) ! is H-compatible if h(a) = a for every a 2 , and h(x) 2 L(x) for every x 2 X. We then de ne the language L(H) that is generated by H = ( ; X; L; ) as L(H) := fh( ) j h is an H-compatible morphismg:

For every nite, nonempty set of H-systems H = fH1; : : : ; Hkg, we de ne In other words, the letters from are constants, the letters from X are variables, and L(H) is obtained from by uniformly replacing every variable x with a word from L(x). We assume w.l.o.g. that X is chosen minimally; i. e., every x 2 X occurs in . It is easy to see that H-systems are able to generate non-regular languages; e. g., the system H = ( ; fxg; L; xx) with L(x) = generates the language of all ww, w 2 .We use unions of H-system languages to describe the sets INVALC(M): Lemma 4. Given a Turing machine M, one can e ectively construct a set H = fH1; : : : ; Hkg of H-systems (for some k 1) such that INVALC(M) = L(H). Proof (sketch). Let M be a Turing machine with state set Q and tape alphabet , and de ne := Q [ [ f#g. We approach the process of de ning H from the following angle: Every word w 2 INVALC(M) contains at least one error that prevents w from being an element of VALC(M). Most of these error conditions can be described using regular expressions (similar to the construction used in the proof of Lemma 10.2 in Aho et al. [ 2 ]).

As INVALC(M) might be non-regular (if dom(M) is in nite), regular languages alone are not su cient to describe all possible errors in a run of M. More speci cally, we cannot handle arbitrary errors in the preservation of the tape contents from one con guration to the other. As an example, assume M reads some a 2 while in state q 2 Q and is supposed to write some b 2 , move the head to the right, and enter some state p 2 Q. In all these cases, a con guration C = w1qaw2 with w1; w2 2 is followed by the con guration C0 = w1bpw2.

Our goal is to construct H-expressions that capture all cases where a word encodes a sequence of con gurations C0; : : : ; Cn that contains con gurations Ci = w1qaw2, Ci+1 = w3bpw4 where w1 6= w3, or w2 6= w4 holds (with w1; : : : ; w4 2 ). Note that, for all words w; w0 2 , w 6= w0 holds if and only if there exist words u; v; v0 2 and letters c; d 2 with c 6= d, w = ucv, and w0 = udv0, or exactly one of w; w0 is the empty word.

As errors described in the latter case (i. e., that exactly one of w1, w3 or of w2; w4 is empty) can be expressed using regular languages, we focus on the former case. In order to express these errors, for every c 2 , we de ne languages Lc;1 := Lc;2 := [ As we shall see in the next section, it is possible to reduce decision problems on unions of H-systems (and, hence, on the domains of Turing machines) to decision problems on CRPQs and ECRPQs. 3

Query Containment and Equivalence Query Containment. The query containment problem is the problem to de

cide for two input queries Q and Q0 whether Q Q0.

The containment of CRPQs in CPRQs and of ECRPQs in CRPQs is known to be decidable and Expspace-complete (cf. [ 8, 5 ] and [ 4 ], resp.). In [ 4 ], the authors proved the undecidability of the containment problem for ECRPQs, and mentioned the decidability of containment of CRPQs in ECRPQs as an important open problem. Our rst main result states that this problem is undecidable, even if the ECRPQs are of a comparatively restricted form:

Theorem 5. For every alphabet with j j CRPQs in CRPQs with equality relations over 2, the containment problem of is undecidable.

The proof is a consequence of Lemma 4, the undecidability of the emptiness of dom(M) for Turing machines M, and the following lemma: Lemma 6. Let be an alphabet. For every set H = fH1; : : : ; Hkg of H-systems over , one can e ectively construct an alphabet 0, a CRPQ Q1 over 0, and a CRPQ with equality relations Q2 over 0 such that Q1 Q2 if and only if L(H) = .

Proof. Let = fa1; : : : ; asg for some s 1. Let H be a set of k H-systems H = fH1; : : : ; Hkg over (with k 1). We de ne 0 := [ fF; $g, where F and $ are distinct letters that do not occur in . Next, we de ne Q1 :=

Ans() (x; ; y); L( ); where L := $Fa1 asF$F F$, and x and y are distinct variables. Thus, Q1(G) = true if and only if G contains a path with ( ) 2 L.

The de nition of Q2 is more involved. Informally explained, Q2 uses the structure provided by Q1 to implement the union of the languages L(Hi). We de ne Q2 such that, for every db-graph G with Q1(G) = true, Q2(G) = true holds if and only if there is a path in G with ( ) = $Fa1 asF$FwF$, where w 2 L(H) (i. e., w 2 L(Hi) for some Hi 2 H).

Note that the paths described by Q1 contain exactly three occurrences of the $ symbol, which can be understood to divide into two parts, where the left part is labeled Fa1 asF. Likewise, the query Q2 can be understood as consisting of two parts, which are to be de ned in the subqueries V1 i k isel and V1 i k icod, respectively. Our goal is to construct Q2 in such a way that, when matching Q2 to , the isel are used to select which H-system Hi is simulated in Q2, while the actual encoding of that H-system is achieved by icod (hence, the superscripts sel and cod). We de ne Q2 as

Q2 :=

Ans() (x0; c1$; x1); (xk+1; c2$; x^1); (x^k+1; c3$; x^k+2); L$(c1$); L$(c2$); L$(c3$); ^ isel; ^ cod i 1 i k 1 i k where L$ = f$g, and the sel and cod consist of relational and labeling atoms i i that shall be de ned further down. As explained above, the subqueries isel are used to select which H-system is active when matching Q2 to a graph. These queries are de ned by sel := (xi; ciF;1; yi;1); (yi;1; cia1 ; yi;2); : : : ; (yi;s; cias ; yi;s+1); (yi;s+1; ciF;2; xi+1); i

LF(ciF;1); La1 (cia1 ); : : : ; Las (cias ); LF(ciF;2); eq(ciF;1; ciF;2) where La := f"; ag for each a 2 fF; a1; : : : ; asg.

In order to de ne each icod, we need to consider the respective H-system Hi: Let Hi = ( ; Xi; Li; i), where i = i;1 i;mi for some mi 1 and i;1; : : : ; i;mi 2 (X [ ). We de ne the relational part of icod to be (x^i; ciF;3; zi;1); (zi;1; di;1; zi;2); : : : ; (zi;mi ; di;mi ; zi;mi+1); (zi;mi+1; ciF;4; x^i+1); where ciF;3, ciF;4, and all di;j are (pairwise distinct) new path variables. We start the construction of the labeling part of icod with the labeling atoms LF(ciF;3), LF(ciF;4), eq(ciF;1; ciF;3), and eq(ciF;1; ciF;4). Furthermore, we de ne a regular language Li;j for every 1 j mi by Li;j := Li( i;j ) if i;j 2 X, and Li;j := f"; i;j g if i;j 2 . In addition to this, we add a label atom eq(ci i;j ; di;j ) for every j with i;j 2 . Finally, for every j with i;j 2 X such that i;j occurs more than once in i, we add a relation eq(di;j ; di;l) for every l 6= j with i;l = i;j .

Note that the relation graph HQ2 consists only of a path from x0 to x^k+1, where each node (except x^k+1, the last node) has exactly one successor. Thus, the relation graph is acyclic and has no branches.

We claim that L(H) = holds if and only if Q1 Q2, which completes the proof of Lemma 6. Due to space limitations, the proof of this claim has been omitted from this version. tu

By using standard encoding techniques for representing arbitrary nite alphabets by an alphabet of size 2, the proof of Theorem 5 now easily follows from Lemma 4, the undecidability of the emptiness of dom(M) for Turing machines M, and Lemma 6. By using universal Turing machines instead of arbitrary Turing machines, we also obtain the following strengthening of Theorem 5:

Theorem 7. For every alphabet with j j 2, there are a xed CRPQ Q1

over and a xed CRPQ with equality relations Q2 over such that (i) the containment problem of Q1 in CRPQs with equality relations, and (ii) the containment problem of CRPQs in Q2 are both undecidable. This holds even if all queries are Boolean and acyclic.

Applying slight modi cations to the proof of Lemma 6, we observe the same situation for ECRPQs that use length equality instead of equality relations:

Theorem 8. For every alphabet with j j 2, there are a xed CRPQ Q1

over and a xed CRPQ with length equality relations Q2 over such that (i) the containment problem of Q1 in CRPQs with length equality relations, and (ii) the containment problem of CRPQs in Q2, are both undecidable. This holds even if all queries are Boolean and acyclic.

Query Equivalence. The query equivalence problem is the problem to decide

for two input queries Q and Q0 whether Q Q0.

Another question speci cally posed in [ 4 ] is whether the equivalence problem for CRPQs and ECRPQs is decidable. Using a variant of the proof of Theorem 7, we can answer this question negatively:

Theorem 9. For every alphabet with j j 2, there are a xed CRPQ Q1

over and a xed ECRPQ Q2 over such that (i) the equivalence problem of

Q1 and ECRPQs, and (ii) the equivalence problem of CRPQs and Q2, are both

undecidable. This holds even if all queries are Boolean and acyclic.

Expressiveness and Relative Succinctness

(E)CRPQ Expressibility. We say that a query function F is CRPQ-expressible (or ECRPQ-expressible) if there is a CRPQ (or ECRPQ, resp.) Q such that Q(G) = F (G) for every -labeled db-graph G.

For every language L , we de ne a query function FL by

FL(G) := f(x; y) j G contains a path from x to y with ( ) 2 Lg for every -labeled db-graph G. Analogously, we de ne a Boolean query function FLB by FLB(G) := true if and only if FL(G) 6= ;.

The proofs presented in this section will use speci c db-graphs Gw representing strings w 2 as follows: If w = b1 bjwj (with all bi 2 ), we de ne the db-graph Gw := (Vw; Ew) by Vw := fv0; : : : ; vjwjg (where all vi are distinct nodes), and Ew = f(vi; bi+1; vi+1) j 0 i < jwjg. Thus, Gw consists of a path from v0 to vjwj that is labeled with w.

Clearly, if L such that FL is expressible by an ECRPQ QL, then for all words w 2 we have w 2 L i (v0; vjwj) 2 QL(Gw).

Lemma 10. Let be an alphabet, let L

if and only if L is regular. . Then FL is CRPQ-expressible Proof (sketch). The \if-direction" is trivial. For the \only-if-direction", let L and let QL be a CRPQ such that QL(G) = FL(G) for every -labeled db-graph G. For showing that L is regular, we proceed along the following steps: (1) Rewrite QL into a CRPQ Q0L such that (i) for all w 2 we have (v0; vjwj) 2 Q0L(Gw) i (v0; vjwj) 2 QL(Gw) (i. e., in some sense, Q0L is equivalent to QL on db-graphs representing strings), and (ii) the graph HQ0L associated with the relational part of Q0L (cf., Section 2) is a connected directed acyclic graph with x as its single source node and y as its single sink node, where Ans(x; y) is the head of Q0L. (2) Use Q0L and a variant of the product construction to construct a nondeterministic nite automaton that accepts exactly those words w 2 for which (v0; vjwj) 2 Q0L(Gw).

The proof details can be found in the full version of the paper. tu The situation is not strictly the same for Boolean queries (e. g., if L contains every single letter of , FLB(G) = true holds for all non-empty db-graphs G); but a similar result can be observed:

Lemma 11. Let be an alphabet with j j 2, let a 2 , and let L

fag) . Then FaBLa is CRPQ-expressible if and only if L is regular. ( n For alphabets of size 2, ECRPQs can express queries FL for non-regular L which, according to Lemma 10, are not CRPQ-expressible. For example, for L := fanbn j n 2 Ng, FL is not CRPQ-expressible, but is expressed by the ECRPQ Ans(x; y) (x; 1; z); (z; 2; y); L1( 1); L2( 2); el( 1; 2), where L1 := a and L2 := b . For unary alphabets (i. e., alphabets of size 1), however, we can show the following:

Lemma 12. Let be a unary alphabet, let L

expressible if and only if it is CRPQ-expressible. . Then FL is ECRPQ

Before giving a proof sketch of this lemma, let us note that, in spite of Lemma 12, there exist ECRPQ-queries over unary alphabets that are not CRPQexpressible. For example, consider the ECRPQ

Q :=

Ans(x; y)

(x; 1; z); (y; 2; z); el( 1; 2); selecting all pairs of nodes (u; v) in a db-graph G, for which there exists a node w such that there are paths from u to w and from v to w of the same length. It should be not too di cult to see that this query is not CRPQ-expressible.

Proof (Sketch of the proof of Lemma 12).

The \if-direction" is trivial. For the \only-if-direction" let := fag, let L fag , and QL be an ECRPQ expressing FL. By Lemma 10 it su ces to show that L is regular. Due to Lemma 1, w.l.o.g. QL is of the form

Ans(x; y)

^ (xi; i; yi) ^ R( 1; : : : ; k) 1 i k for a k-ary regular relation R over fag . With R we associate a relation Rlen Nk as follows: Rlen := f (jw1j; : : : ; jwkj) j (w1; : : : ; wk) 2 R g: The proof of the lemma proceeds along the following steps: (1) Note that Rlen is semi-linear (since R is a regular relation over a unary alphabet). The notion of semi-linear relations is de ned as follows (cf., e. g., [ 10 ]): For every k 1 and every vector a 2 Nk, de ne aN := fai j i 2 Ng. For all sets A; B Nk, let A + B := fa + b j a 2 A; b 2 Bg. A set A Nk is linear if there exist a0; : : : ; an 2 Nk for some n 0 such that A = a0 + a1N + : : : + anN. A set is semi-linear if it is a nite union of linear sets. (2) Consider all non-empty acyclic directed paths in the labeled query graph HQlaLb, and let P = fp1; : : : ; plg be the set of all these paths. For each such path pj let j be the sequence of path variables labeling the edges of pj in HQlaLb. Furthermore, let start(pj ) and end(pj ) denote the start node and the end node of pj . By de nition, for each pj , each path variable i occurs at most once in j . (3) With each pj 2 P we assume a function p^j : Nk ! N such that p^j (r1; : : : ; rk) is the sum of all ri for which i occurs in j ; and let p^ : Nk ! Nl be de ned as p^(r1; : : : ; rk) := p^1(r1; : : : ; rk); : : : ; p^l(r1; : : : ; rk) for all (r1; : : : ; rk) 2 Nk. Note that p^j (r1; : : : ; rk) is the length of the path in Gw corresponding to the path pj in HQlaLb. (4) Since Rlen is semi-linear, p^(Rlen) := fp^(r) j r 2 Rleng is also semi-linear. (5) Assume w.l.o.g. that for the path p1 we have start(p1) = x and end(p1) = y.

Let T := p^(Rlen) \ B \ T1 j l Sj ; for B := f(s1; : : : ; sl) 2 Nl j sj s1 for all 1 j lg and Sj := f(s1; : : : ; sl) 2 Nl j sj0 = sj for all 1 j0 l with start(pj0 ) = start(pj ) and end(pj0 ) = end(pj )g.

Note that T is semi-linear (since each of the sets p^(Rlen), B, Sj is semi-linear, and the class of semi-linear sets is closed under intersection, see [ 10 ]). (6) Show that for all w 2 fag we have: jwj 2 proj1(T ) i (v0; vjwj) 2 QL(Gw), where proj1 projects each element of T to its rst component.

As a consequence, L = fw 2 fag j jwj 2 proj1(T )g is regular, since proj1(T ) is semi-linear (cf. Harrison [ 11 ]). This completes the proof sketch of Lemma 12. A detailed proof can be found in the full version. tu In Section 3.1 of [ 4 ], Barcelo et al. mention that ECRPQs are able to express queries corresponding to regular expressions with backreferencing (or extended regular expressions ) (cf. Aho [ 1 ], Freydenberger [ 9 ]). These expressions extend the regular expressions with variable binding and repetition operators; e. g., for every expression , the extended expression ( )%x xx generates the language of all www with w 2 L( ) ( generates some w 2 L( ), %x assigns that w to x, and the subsequent uses of x repeat this w { hence, xx generates ww).

Let L := fan j n 4; n is a composite numberg: According to Lemma 12, FL is not ECRPQ-expressible (as L is not regular). On the other hand, L is generated by the extended regular expression (a a+)%x x+ (cf. C^ampeanu et al. [ 6 ]). This demonstrates that ECRPQs are not able to express all queries that correspond to extended regular expressions.

Relative Succinctness. We can adapt Lemma 6 to observe the following result on the decidability of expressibility:

Theorem 13. CRPQ-expressibility for ECRPQs is not co-semi-decidable.

Proof. This follows from the proof of Theorem 9, a variation of Lemma 11, and the observation that INVALC(M) is regular i dom(M) is nite. Regarding the latter, note that if dom(M) is nite, INVALC(M) is co- nite; if dom(M) is in nite, non-regularity of INVALC(M) can be established using standard tools. This allows us to e ectively construct an ECPRQ Q from a Turing machine M such that Q is CRPQ-expressible if and only if dom(M) is nite.

Finiteness of dom(M) is a 20-complete problem in the arithmetical hierarchy (cf. Kozen [ 13 ]); hence, CRPQ-expressibility is 20-hard, which means that this problem is neither semi-decidable, nor co-semi-decidable. tu Using Theorem 13 in conjunction with a technique that is due to Hartmanis [ 12 ] and has been widely used in Formal Language Theory (cf. Kutrib [ 14 ]), we obtain a result on the relative succinctness of ECRPQs and CRPQs. One of the bene ts of that technique is that it applies to a wide range of di erent reasonable de nitions of the size of an ECRPQ.

In order to be as general as possible, we de ne a complexity measure for ECRPQs as a computable function c from the set of all ECRPQs to N, such that for every nite alphabet , the set of all ECRPQs Q over (i) can be e ectively enumerated in order of increasing c(Q), and (ii) does not contain in nitely many ECRPQs with the same value c(Q). As the following theorem demonstrates, no matter which complexity measure we choose, the size tradeo between ECRPQs and CRPQs is not bounded by any recursive function:

Theorem 14. Let be a nite alphabet with j j 2. For every recursive

function f : N ! N and every complexity measure c, there exists an ECRPQ Q over such that Q is CRPQ-expressible, but for every CRPQ Q0 with Q0 Q, c(Q0) > f (c(Q)).

Aho . Algorithms for nding patterns in strings . In J. van Leeuwen, editor, Handbook of Theoretical Computer Science , volume

Elsevier , 1990 .

Aho ,

Hopcroft , and

Ullman . The Design and Analysis of Computer Algorithms . Addison-Wesley, Reading, MA, 1974 .

Albert and

Wegner . Languages with homomorphic replacements . Theoretical Computer Science , 16 : 291 { 305 , 1981 .

Barcelo ,

Hurtado ,

Libkin , and

Wood . Expressive languages for path queries over graph-structured data . In Proc. PODS'10 , pages 3 { 14 , 2010 .

Calvanese ,

G. D.

Giacomo ,

Lenzerini , and

M. Y.

Vardi . Containment of conjunctive regular path queries with inverse . In KR'00 , pages 176 { 185 , 2000 .

6. C. Ca^mpeanu, K. Salomaa, and

Yu . A formal study of practical regular expressions . International Journal of Foundations of Computer Science , 14 : 1007 { 1018 , 2003 .

Deutsch and

Tannen . Optimization properties for classes of conjunctive regular path queries . In Proc. DBPL'01 , volume 2397 of LNCS , pages 21 { 39 . Springer, 2001 .

Florescu ,

A. Y.

Levy , and

Suciu . Query containment for conjunctive queries with regular expressions . In Proc. PODS'98 , pages 139 { 148 , 1998 .

Freydenberger . Extended regular expressions: Succinctness and decidability . In Proc. STACS 2011 , pages 507 { 518 , 2011 .

10.

Ginsburg and

Spanier . Bounded ALGOL-like languages . Transactions of the American Mathematical Society , 113 ( 2 ): 333 { 368 , 1964 .

11.

Harrison . Introduction to Formal Language Theory . Addison Wesley Publishing Company, 1978 .

12.

Hartmanis . On Godel speed-up and succinctness of language representations . Theoretical Computer Science , 26 ( 3 ): 335 { 342 , 1983 .

13.

Kozen . Theory of Computation . Springer-Verlag, London, 2006 .

14.

Kutrib . The phenomenon of non-recursive trade-o s . International Journal of Foundations of Computer Science , 16 ( 5 ): 957 { 973 , 2005 .

15.

Salomaa . Patterns. In Formal Languages and Applications, number 148 in Studies in Fuzziness and Soft Computing , pages 367 { 379 . Springer, 2004 .