<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Expressiveness and Static Analysis of Extended Conjunctive Regular Path Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Institut fur Informatik</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goethe-Universitat</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frankfurt am Main</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>freydenberger@em.uni-frankfurt.de</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>schweika@informatik.uni-frankfurt.de</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We study the expressiveness and the complexity of static analysis of extended conjunctive regular path queries (ECRPQs), introduced by Barcelo et al. (PODS '10). ECRPQs are an extension of conjunctive regular path queries (CRPQs), a well-studied language for querying graph structured databases. Our rst main result shows that query containment and equivalence of a CRPQ in an ECRPQ is undecidable. This settles one of the main open problems posed by Barcelo et al. As a second main result, we prove a non-recursive succinctness gap between CRPQs and the CRPQ-expressible fragment of ECRPQs. Apart from this, we develop a tool for proving inexpressibility results for CRPQs and ECRPQs. In particular, this enables us to show that there exist queries de nable by regular expressions with backreferencing, but not expressible by ECRPQs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Many application areas (e. g., concerning the Semantic Web or biological
applications) consider graph structured data, where the data consists of a nite set of
nodes connected by labeled edges. For querying such data, one usually needs to
specify types of paths along which nodes are connected. A widely studied class
of queries for graph structured databases are the conjunctive regular path queries
(CRPQs) (cf., e. g., [
        <xref ref-type="bibr" rid="ref5 ref7 ref8">5, 7, 8</xref>
        ]), where types of paths can be described by regular
expressions specifying labels along the paths. For modern applications, however,
also more expressive query languages are desirable, allowing not only to specify
regular properties of path labels, but also to compare paths based on, e. g., their
lengths, labels, or similarity.
      </p>
      <p>
        To start a formal investigation of such concepts, Barcelo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced
the class of extended conjunctive regular path queries (ECRPQs), allowing to
use not only regular languages to express properties of individual paths, but
also regular relations among several paths, capable of expressing certain
associations between paths. The authors of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigated the complexity of query
evaluation and static analysis of ECRPQs. While query containment is known
to be decidable and Expspace-complete for CRPQs [
        <xref ref-type="bibr" rid="ref5 ref8">8, 5</xref>
        ], it was shown to be
undecidable for ECRPQs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, checking containment of an ECRPQ in a
CRPQ still is decidable and Expspace-complete [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. (Un)Decidability of
checking containment (or, equivalence) of a CRPQ in an ECRPQ was posed as an
open question in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In the present paper, we answer this question by showing that containment
of a CRPQ in an ECRPQ is undecidable | even if the ECRPQ is, in fact,
a CRPQ extended only by relations for checking equality of path labels (or,
similarly, equal lengths of paths). Our proof proceeds by (a) simulating Turing
machine runs by so-called H-systems, a concept from formal language theory
generalizing pattern languages, and (b) using CRPQs and ECRPQs to represent
languages described by H-systems. Our proof generalizes to (i) the case where
one of the two queries is xed, (ii) the case where all queries are Boolean and
acyclic, and (iii) the problem of deciding equivalence rather than containment
of CRPQs and ECRPQs.</p>
      <p>Apart from the static analysis of queries, the present paper also investigates
the expressiveness and succinctness of ECRPQs. Using the machinery developed
for proving our undecidability results concerning static analysis, we show that
CRPQ-de nability of a given ECRPQ is undecidable, and that there is no
recursive function f such that every CRPQ-de nable ECRPQ of length n is equivalent
to a CRPQ of length f (n).</p>
      <p>
        Concerning the expressivity of (E)CRPQs, to the best of our knowledge, tools
for showing inexpressibility results have not been presented in the literature yet.
We develop such tools, enabling us to show, for example, that no ECRPQ-query
can return exactly those tuples of nodes between which there is a path whose
length is a composite number (i. e., a number of the from nm for n; m 2). Since
these paths can be easily described by a regular expression with backreferencing
(cf. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) of the form (a a+)%x x+, this refutes a claim of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] stating that all regular
expressions with backreferencing can be expressed by ECRPQs.
      </p>
      <p>Structure of the paper. We start with the necessary notations and de
nitions in Section 2 where, in particular, the syntax and semantics of ECRPQs
(and restrictions thereof) are de ned. Section 3 is devoted to the static analysis
of ECRPQs and CRPQs, showing that containment and equivalence of CRPQs
in ECRPQs are undecidable. Section 4 investigates the relative succinctness
between CRPQs and CRPQ-expressible ECRPQs and provides tools for proving
limitations to the expressive power of CRPQs and ECRPQs. Due to space
limitations, many technical details of the proofs had to be deferred to the full version.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>Let N denote the set of non-negative integers. We denote the empty word by
". Let A; B be alphabets. A morphism (between A and B ) is a function h :
A ! B with h(uv) = h(u)h(v) for all u; v 2 A . For every word w 2 A , jwj
stands for the length of w, and for every letter a 2 A, jwja denotes the number
of occurrences of a in w.</p>
      <p>DB-Graphs and Queries. A -labeled db-graph is a directed graph G =
(V; E), where V is a nite set of nodes, and E V V is a nite set of
directed edges with labels from . A path between two nodes v0 and vn in G
with n 0 is a sequence v0a1v1 vn 1anvn with v0; : : : ; vn 2 V , a1; : : : ; an 2 ,
and (vi; ai+1; vi+1) 2 E for 0 i &lt; n. We de ne the label ( ) of the path by
( ) := a1 an. Furthermore, for every v 2 V , we de ne the empty path v"v,
with (v"v) = ".</p>
      <p>
        A central concept considered in the present paper are regular relations (cf. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and the references therein). Let be a nite alphabet, let ? be a new symbol
with ?2= , and let ? := [ f?g. Let w = (w1; : : : ; wk) 2 ( )k, where
wi = ai;1 ai;jwij (and all ai;j 2 ). We de ne the string [w] 2 ( ?)k by
[w] := b1 bn, where n is the maximum of all jwij, and bj := (bj;1; : : : ; bj;k),
with bj;i = ai;j if j jwij, and bj;i =? if j &gt; jwij. In other words, [w] is obtained
by aligning all wi to the left, and padding the un lled space with ? symbols. A
k-ary relation R ( )k is called regular if the language f[r] j r 2 Rg is regular.
      </p>
      <p>Obviously, every regular language is a (unary) regular relation. In addition to
this, the present paper focuses on the following k-ary regular relations (k 2):
1. the equality relation eq: = f(w1; : : : ; wk) j w1 = : : : = wkg,
2. the length equality relation el := f(w1; : : : ; wk) j jw1j = : : : = jwkjg.
Note that each of these relations needs to be de ned w. r. t. a nite alphabet ,
which we usually omit for the sake of brevity.</p>
      <p>
        We now de ne ECRPQs and CPRQs, following the de nitions from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Fix
a countable set of node variables and a countable set of path variables. Let be
a nite alphabet. An extended conjunctive regular path query (ECRPQ) Q over
is an expression of the form
      </p>
      <p>Ans(z; )
^ (xi; i; yi); ^</p>
      <p>Rj (!j );
1 i m
1 j l
(1)
such that m
1, l</p>
      <p>0, and
1. each Rj is a regular expression that de nes a regular relation over ,
2. x = (x1; : : : ; xm) and y = (y1; : : : ; ym) are tuples of (not necessarily distinct)
node variables,
3. = ( 1; : : : ; m) is a tuple of distinct path variables,
4. !1; : : : ; !l are tuples of path variables, such that each !j is a tuple of
variables from , of the same arity as Rj ,
5. z is a tuple of node variables among x, y, and
6. is a tuple of path variables among those in .</p>
      <p>The expression Ans(z; ) is the head, and the expression to the right of is the
body of Q. If z and are the empty tuple (i. e., the head is of the form Ans()),
Q is a Boolean query. The relational part of an ECRPQ Q is V1 i m(xi; i; yi),
and the labeling part is V1 j l Rj (!j ). We denote the set of node variables in
Q by nvar(Q).</p>
      <p>Intuitively, all variables are quanti ed existentially, and the words formed
by the labels along the paths have to satisfy the respective relations. Formally,
for every -labeled db-graph G, every ECRPQ Q (of the form described in (1))
over , every mapping from the node variables of Q to nodes in G, and every
mapping from the path variables of Q to paths in G, we write (G; ; ) j= Q if
1. ( i) is a path from (xi) to (yi) for every 1 i m,
2. for each !j = ( j1 ; : : : ; jk ), 1 j l, the tuple ( ( ( j1 )); : : : ; ( ( jk )))
belongs to the relation Rj .</p>
      <p>Finally, we de ne the output of Q (of the form described in (1)) on G by
Q(G) :=
f
(z); ( )
j
; such that (G; ; ) j= Q g:
As usual, if Q is Boolean, we model the Boolean constants true and false by the
empty tuple () and the empty set ;, respectively. In other words, Q(G) = true
i there exist assignments and with (G; ; ) j= Q.</p>
      <p>Two queries Q and Q0 are called equivalent (Q Q0, for short) if Q(G) =
Q0(G) for all db-graphs G. A query Q is said to be contained in a query Q0
(Q Q0, for short) if Q(G) Q0(G) for all db-graphs G.</p>
      <p>
        With an ECRPQ Q we associate an edge-labeled directed graph HQlab whose
vertex set is the set of node variables occurring in Q, and where there is an
edge from x to y labeled i (x; ; y) occurs in the relational part of Q. As in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we write HQ to denote the (unlabeled) directed graph obtained from HQlab
by deleting the edge-labels (and removing duplicate edges). A query Q is called
acyclic if HQ is acyclic.
      </p>
      <sec id="sec-2-1">
        <title>In accordance with [4], a conjunctive regular path query (CRPQ) Q over</title>
        <p>is an ECRPQ over of the form described in (1), where all relations Rj are
unary relations, and (hence), all tuples !j are singletons.</p>
        <p>Thus, CRPQs can only refer to the languages that are allowed to occur along
the paths, while ECRPQs can also describe relations between di erent paths.</p>
        <p>The present paper devotes special attention to two classes of queries with
an expressive power that lies strictly between CRPQs and ECRPQs: A CRPQ
with equality relations is an ECRPQ where every relation in the labeling part
is either of arity 1 (i. e., a regular language), or a k-ary eq-relation for some
k 2. Analogously, a CRPQ with equal length relations is an ECRPQ where
every relation in the labeling part is either of arity 1, or a k-ary el-relation.</p>
        <p>It is easy to see that ECRPQs and CRPQs can be transformed into queries
in the following normal forms (note, though, that these transformations might
increase the size of the queries):
Lemma 1. For every ECRPQ Q = Ans(z; ) ^ (xi; i; yi); ^ Rj (!j );
1 i m 1 j l
there exists a regular relation R of arity m such that Q is equivalent to the
ECRPQ Q0 := Ans(z; ) ^ (xi; i; yi); R( 1; : : : ; m):</p>
        <p>1 i m
Lemma 2. For every CRPQ Q = Ans(z; ) ^ (xi; i; yi); ^
1 i m
(where ij 2 f1; : : : ; mg), there exist regular languages L01; : : : ; L0m
Q is equivalent to the CRPQ Q0 := Ans(z; )</p>
        <p>Lj ( ij )
1 j l</p>
        <p>such that
^ L0i( i).</p>
        <p>^ (xi; i; yi);
1 i m
1 i m
Hence, for ECRPQs it su ces to consider just one regular relation of arity m; and
for CRPQs, it su ces to consider just one regular language per path variable.
Turing Machines and H-Systems. Let M be a (deterministic) Turing
machine with state set Q, initial state q0 2 Q, halting state qH 2 Q, tape alphabet
(including the blank symbol), such that Q \ = ;, and an input alphabet
I that does not include the blank symbol. We adopt the conventions that
M accepts by halting, and does not halt in the rst step (i. e., q0 6= qH ).</p>
        <p>A con guration of M is a word w1qw2, with w1; w2 2 and q 2 Q. We
interpret w1qw2 as M being in state q, while the tape contains w1 on the left
side, and w2 on the right side. The head is on the position of the rst (leftmost)
letter of w2 (if w2 = ", M reads the blank symbol). We denote the successor
relation on con gurations of M by `M. An accepting run of M is a sequence
C0; : : : ; Cn of con gurations of M (with n 1), such that C0 2 q0 I (C0 is
an initial con guration), Cn 2 qH (Cn is an accepting con guration), and
Ci `M Ci+1 holds for all 0 i &lt; n. Let := [ Q [ f#g, where # is a new
letter that does not occur in or Q. We de ne the set of valid computations of
M by VALC(M) := f#C0# #Cn# j C0; : : : ; Cn is an accepting run of Mg,
and denote its complement by INVALC(M) := n VALC(M). Finally, we
de ne dom(M) to be the set of all w 2 I such that M halts after a nite
number of steps when started in the con guration q0w.</p>
        <p>By de nition, INVALC(M) = holds if and only if dom(M) = ;; and note
that (given M), the question if dom(M) = ; is undecidable.</p>
        <p>
          As a technical tool for our proofs, we use the notion of H-systems to describe
the sets INVALC(M) for Turing machines M. Our notion of H-systems can
be viewed as a generalization of pattern languages (cf. Salomaa [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]), or as a
restricted version of the H-systems introduced by Albert and Wegner [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
L(H) = Sik=1 L(Hi).
        </p>
        <p>De nition 3. An H-system (over the alphabet ) is a 4-tuple H := ( ; X; L; ),
where (i) X and are nite, disjoint alphabets, (ii) L is a function that maps
every x 2 X to a regular language L(x) with " 2 L(x), and (iii) 2
(X [ )+.</p>
        <p>A morphism h : ( [X) ! is H-compatible if h(a) = a for every a 2 ,
and h(x) 2 L(x) for every x 2 X. We then de ne the language L(H) that is
generated by H = ( ; X; L; ) as L(H) := fh( ) j h is an H-compatible morphismg:</p>
        <p>
          For every nite, nonempty set of H-systems H = fH1; : : : ; Hkg, we de ne
In other words, the letters from are constants, the letters from X are variables,
and L(H) is obtained from by uniformly replacing every variable x with a word
from L(x). We assume w.l.o.g. that X is chosen minimally; i. e., every x 2 X
occurs in . It is easy to see that H-systems are able to generate non-regular
languages; e. g., the system H = ( ; fxg; L; xx) with L(x) = generates the
language of all ww, w 2 .We use unions of H-system languages to describe
the sets INVALC(M):
Lemma 4. Given a Turing machine M, one can e ectively construct a set H =
fH1; : : : ; Hkg of H-systems (for some k 1) such that INVALC(M) = L(H).
Proof (sketch). Let M be a Turing machine with state set Q and tape alphabet
, and de ne := Q [ [ f#g. We approach the process of de ning H from the
following angle: Every word w 2 INVALC(M) contains at least one error that
prevents w from being an element of VALC(M). Most of these error conditions
can be described using regular expressions (similar to the construction used in
the proof of Lemma 10.2 in Aho et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
        <p>As INVALC(M) might be non-regular (if dom(M) is in nite), regular
languages alone are not su cient to describe all possible errors in a run of M.
More speci cally, we cannot handle arbitrary errors in the preservation of the
tape contents from one con guration to the other. As an example, assume M
reads some a 2 while in state q 2 Q and is supposed to write some b 2 ,
move the head to the right, and enter some state p 2 Q. In all these cases, a
con guration C = w1qaw2 with w1; w2 2 is followed by the con guration
C0 = w1bpw2.</p>
        <p>Our goal is to construct H-expressions that capture all cases where a word
encodes a sequence of con gurations C0; : : : ; Cn that contains con gurations
Ci = w1qaw2, Ci+1 = w3bpw4 where w1 6= w3, or w2 6= w4 holds (with
w1; : : : ; w4 2 ). Note that, for all words w; w0 2 , w 6= w0 holds if and
only if there exist words u; v; v0 2 and letters c; d 2 with c 6= d, w = ucv,
and w0 = udv0, or exactly one of w; w0 is the empty word.</p>
        <p>As errors described in the latter case (i. e., that exactly one of w1, w3 or
of w2; w4 is empty) can be expressed using regular languages, we focus on the
former case. In order to express these errors, for every c 2 , we de ne languages
Lc;1 :=
Lc;2 :=
[
As we shall see in the next section, it is possible to reduce decision problems on
unions of H-systems (and, hence, on the domains of Turing machines) to decision
problems on CRPQs and ECRPQs.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Query Containment and Equivalence</title>
      <sec id="sec-3-1">
        <title>Query Containment. The query containment problem is the problem to de</title>
        <p>cide for two input queries Q and Q0 whether Q Q0.</p>
        <p>
          The containment of CRPQs in CPRQs and of ECRPQs in CRPQs is known
to be decidable and Expspace-complete (cf. [
          <xref ref-type="bibr" rid="ref5 ref8">8, 5</xref>
          ] and [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], resp.). In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the
authors proved the undecidability of the containment problem for ECRPQs, and
mentioned the decidability of containment of CRPQs in ECRPQs as an
important open problem. Our rst main result states that this problem is undecidable,
even if the ECRPQs are of a comparatively restricted form:
        </p>
        <sec id="sec-3-1-1">
          <title>Theorem 5. For every alphabet with j j</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>CRPQs in CRPQs with equality relations over</title>
      </sec>
      <sec id="sec-3-3">
        <title>2, the containment problem of is undecidable.</title>
        <p>The proof is a consequence of Lemma 4, the undecidability of the emptiness of
dom(M) for Turing machines M, and the following lemma:
Lemma 6. Let be an alphabet. For every set H = fH1; : : : ; Hkg of H-systems
over , one can e ectively construct an alphabet 0, a CRPQ Q1 over 0, and
a CRPQ with equality relations Q2 over 0 such that Q1 Q2 if and only if
L(H) = .</p>
        <p>Proof. Let = fa1; : : : ; asg for some s 1. Let H be a set of k H-systems
H = fH1; : : : ; Hkg over (with k 1). We de ne 0 := [ fF; $g, where F
and $ are distinct letters that do not occur in . Next, we de ne
Q1 :=</p>
        <p>Ans()
(x; ; y); L( );
where L := $Fa1 asF$F F$, and x and y are distinct variables. Thus,
Q1(G) = true if and only if G contains a path with ( ) 2 L.</p>
        <p>The de nition of Q2 is more involved. Informally explained, Q2 uses the
structure provided by Q1 to implement the union of the languages L(Hi). We
de ne Q2 such that, for every db-graph G with Q1(G) = true, Q2(G) = true
holds if and only if there is a path in G with ( ) = $Fa1 asF$FwF$,
where w 2 L(H) (i. e., w 2 L(Hi) for some Hi 2 H).</p>
        <p>Note that the paths described by Q1 contain exactly three occurrences of
the $ symbol, which can be understood to divide into two parts, where the
left part is labeled Fa1 asF. Likewise, the query Q2 can be understood as
consisting of two parts, which are to be de ned in the subqueries V1 i k isel and
V1 i k icod, respectively. Our goal is to construct Q2 in such a way that, when
matching Q2 to , the isel are used to select which H-system Hi is simulated in
Q2, while the actual encoding of that H-system is achieved by icod (hence, the
superscripts sel and cod). We de ne Q2 as</p>
        <p>Q2 :=</p>
        <p>Ans()
(x0; c1$; x1); (xk+1; c2$; x^1); (x^k+1; c3$; x^k+2);
L$(c1$); L$(c2$); L$(c3$); ^ isel; ^
cod
i
1 i k
1 i k
where L$ = f$g, and the sel and cod consist of relational and labeling atoms
i i
that shall be de ned further down. As explained above, the subqueries isel are
used to select which H-system is active when matching Q2 to a graph. These
queries are de ned by
sel := (xi; ciF;1; yi;1); (yi;1; cia1 ; yi;2); : : : ; (yi;s; cias ; yi;s+1); (yi;s+1; ciF;2; xi+1);
i</p>
        <p>LF(ciF;1); La1 (cia1 ); : : : ; Las (cias ); LF(ciF;2); eq(ciF;1; ciF;2)
where La := f"; ag for each a 2 fF; a1; : : : ; asg.</p>
        <p>In order to de ne each icod, we need to consider the respective H-system
Hi: Let Hi = ( ; Xi; Li; i), where i = i;1 i;mi for some mi 1 and
i;1; : : : ; i;mi 2 (X [ ). We de ne the relational part of icod to be
(x^i; ciF;3; zi;1); (zi;1; di;1; zi;2); : : : ; (zi;mi ; di;mi ; zi;mi+1); (zi;mi+1; ciF;4; x^i+1);
where ciF;3, ciF;4, and all di;j are (pairwise distinct) new path variables. We
start the construction of the labeling part of icod with the labeling atoms
LF(ciF;3), LF(ciF;4), eq(ciF;1; ciF;3), and eq(ciF;1; ciF;4). Furthermore, we de ne a
regular language Li;j for every 1 j mi by Li;j := Li( i;j ) if i;j 2 X, and
Li;j := f"; i;j g if i;j 2 . In addition to this, we add a label atom eq(ci i;j ; di;j )
for every j with i;j 2 . Finally, for every j with i;j 2 X such that i;j
occurs more than once in i, we add a relation eq(di;j ; di;l) for every l 6= j with
i;l = i;j .</p>
        <p>Note that the relation graph HQ2 consists only of a path from x0 to x^k+1,
where each node (except x^k+1, the last node) has exactly one successor. Thus,
the relation graph is acyclic and has no branches.</p>
        <p>We claim that L(H) = holds if and only if Q1 Q2, which completes
the proof of Lemma 6. Due to space limitations, the proof of this claim has been
omitted from this version. tu</p>
        <p>By using standard encoding techniques for representing arbitrary nite
alphabets by an alphabet of size 2, the proof of Theorem 5 now easily follows from
Lemma 4, the undecidability of the emptiness of dom(M) for Turing machines
M, and Lemma 6. By using universal Turing machines instead of arbitrary
Turing machines, we also obtain the following strengthening of Theorem 5:</p>
        <sec id="sec-3-3-1">
          <title>Theorem 7. For every alphabet with j j 2, there are a xed CRPQ Q1</title>
          <p>over and a xed CRPQ with equality relations Q2 over such that (i) the
containment problem of Q1 in CRPQs with equality relations, and (ii) the
containment problem of CRPQs in Q2 are both undecidable. This holds even if all
queries are Boolean and acyclic.</p>
          <p>Applying slight modi cations to the proof of Lemma 6, we observe the same
situation for ECRPQs that use length equality instead of equality relations:</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Theorem 8. For every alphabet with j j 2, there are a xed CRPQ Q1</title>
          <p>over and a xed CRPQ with length equality relations Q2 over such that
(i) the containment problem of Q1 in CRPQs with length equality relations, and
(ii) the containment problem of CRPQs in Q2, are both undecidable. This holds
even if all queries are Boolean and acyclic.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Query Equivalence. The query equivalence problem is the problem to decide</title>
        <p>for two input queries Q and Q0 whether Q Q0.</p>
        <p>
          Another question speci cally posed in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is whether the equivalence problem
for CRPQs and ECRPQs is decidable. Using a variant of the proof of Theorem 7,
we can answer this question negatively:
        </p>
        <sec id="sec-3-4-1">
          <title>Theorem 9. For every alphabet with j j 2, there are a xed CRPQ Q1</title>
          <p>over and a xed ECRPQ Q2 over such that (i) the equivalence problem of</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>Q1 and ECRPQs, and (ii) the equivalence problem of CRPQs and Q2, are both</title>
        <p>undecidable. This holds even if all queries are Boolean and acyclic.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Expressiveness and Relative Succinctness</title>
      <p>(E)CRPQ Expressibility. We say that a query function F is CRPQ-expressible
(or ECRPQ-expressible) if there is a CRPQ (or ECRPQ, resp.) Q such that
Q(G) = F (G) for every -labeled db-graph G.</p>
      <p>For every language L , we de ne a query function FL by</p>
      <p>FL(G) := f(x; y) j G contains a path
from x to y with ( ) 2 Lg
for every -labeled db-graph G. Analogously, we de ne a Boolean query function
FLB by FLB(G) := true if and only if FL(G) 6= ;.</p>
      <p>The proofs presented in this section will use speci c db-graphs Gw
representing strings w 2 as follows: If w = b1 bjwj (with all bi 2 ), we de ne
the db-graph Gw := (Vw; Ew) by Vw := fv0; : : : ; vjwjg (where all vi are distinct
nodes), and Ew = f(vi; bi+1; vi+1) j 0 i &lt; jwjg. Thus, Gw consists of a path
from v0 to vjwj that is labeled with w.</p>
      <p>Clearly, if L such that FL is expressible by an ECRPQ QL, then for
all words w 2 we have w 2 L i (v0; vjwj) 2 QL(Gw).</p>
      <sec id="sec-4-1">
        <title>Lemma 10. Let be an alphabet, let L</title>
        <p>if and only if L is regular.
. Then FL is CRPQ-expressible
Proof (sketch). The \if-direction" is trivial. For the \only-if-direction", let L
and let QL be a CRPQ such that QL(G) = FL(G) for every -labeled
db-graph G. For showing that L is regular, we proceed along the following steps:
(1) Rewrite QL into a CRPQ Q0L such that (i) for all w 2 we have (v0; vjwj) 2
Q0L(Gw) i (v0; vjwj) 2 QL(Gw) (i. e., in some sense, Q0L is equivalent to QL
on db-graphs representing strings), and (ii) the graph HQ0L associated with
the relational part of Q0L (cf., Section 2) is a connected directed acyclic graph
with x as its single source node and y as its single sink node, where Ans(x; y)
is the head of Q0L.
(2) Use Q0L and a variant of the product construction to construct a
nondeterministic nite automaton that accepts exactly those words w 2 for which
(v0; vjwj) 2 Q0L(Gw).</p>
        <p>The proof details can be found in the full version of the paper.
tu
The situation is not strictly the same for Boolean queries (e. g., if L contains
every single letter of , FLB(G) = true holds for all non-empty db-graphs G);
but a similar result can be observed:</p>
        <sec id="sec-4-1-1">
          <title>Lemma 11. Let be an alphabet with j j 2, let a 2 , and let L</title>
          <p>fag) . Then FaBLa is CRPQ-expressible if and only if L is regular.
( n
For alphabets of size 2, ECRPQs can express queries FL for non-regular
L which, according to Lemma 10, are not CRPQ-expressible. For example,
for L := fanbn j n 2 Ng, FL is not CRPQ-expressible, but is expressed by
the ECRPQ Ans(x; y) (x; 1; z); (z; 2; y); L1( 1); L2( 2); el( 1; 2), where
L1 := a and L2 := b . For unary alphabets (i. e., alphabets of size 1), however,
we can show the following:</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Lemma 12. Let be a unary alphabet, let L</title>
        <p>expressible if and only if it is CRPQ-expressible.
. Then FL is
ECRPQ</p>
        <p>Before giving a proof sketch of this lemma, let us note that, in spite of
Lemma 12, there exist ECRPQ-queries over unary alphabets that are not
CRPQexpressible. For example, consider the ECRPQ</p>
        <p>Q :=</p>
        <p>Ans(x; y)</p>
        <p>(x; 1; z); (y; 2; z); el( 1; 2);
selecting all pairs of nodes (u; v) in a db-graph G, for which there exists a node
w such that there are paths from u to w and from v to w of the same length. It
should be not too di cult to see that this query is not CRPQ-expressible.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Proof (Sketch of the proof of Lemma 12).</title>
        <p>The \if-direction" is trivial. For the \only-if-direction" let := fag, let L
fag , and QL be an ECRPQ expressing FL. By Lemma 10 it su ces to show
that L is regular. Due to Lemma 1, w.l.o.g. QL is of the form</p>
        <p>Ans(x; y)</p>
        <p>
          ^ (xi; i; yi) ^ R( 1; : : : ; k)
1 i k
for a k-ary regular relation R over fag . With R we associate a relation Rlen Nk
as follows: Rlen := f (jw1j; : : : ; jwkj) j (w1; : : : ; wk) 2 R g: The proof of the
lemma proceeds along the following steps:
(1) Note that Rlen is semi-linear (since R is a regular relation over a unary
alphabet). The notion of semi-linear relations is de ned as follows (cf., e. g.,
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]): For every k 1 and every vector a 2 Nk, de ne aN := fai j i 2
Ng. For all sets A; B Nk, let A + B := fa + b j a 2 A; b 2 Bg. A set
A Nk is linear if there exist a0; : : : ; an 2 Nk for some n 0 such that
A = a0 + a1N + : : : + anN. A set is semi-linear if it is a nite union of linear
sets.
(2) Consider all non-empty acyclic directed paths in the labeled query graph
HQlaLb, and let P = fp1; : : : ; plg be the set of all these paths. For each such
path pj let j be the sequence of path variables labeling the edges of pj in
HQlaLb. Furthermore, let start(pj ) and end(pj ) denote the start node and the
end node of pj . By de nition, for each pj , each path variable i occurs at
most once in j .
(3) With each pj 2 P we assume a function p^j : Nk ! N such that p^j (r1; : : : ; rk)
is the sum of all ri for which i occurs in j ; and let p^ : Nk ! Nl be de ned as
p^(r1; : : : ; rk) := p^1(r1; : : : ; rk); : : : ; p^l(r1; : : : ; rk) for all (r1; : : : ; rk) 2 Nk.
Note that p^j (r1; : : : ; rk) is the length of the path in Gw corresponding to the
path pj in HQlaLb.
(4) Since Rlen is semi-linear, p^(Rlen) := fp^(r) j r 2 Rleng is also semi-linear.
(5) Assume w.l.o.g. that for the path p1 we have start(p1) = x and end(p1) = y.
        </p>
        <p>Let T := p^(Rlen) \ B \ T1 j l Sj ;
for B := f(s1; : : : ; sl) 2 Nl j sj s1 for all 1 j lg
and Sj := f(s1; : : : ; sl) 2 Nl j sj0 = sj for all 1 j0 l with start(pj0 ) =
start(pj ) and end(pj0 ) = end(pj )g.</p>
        <p>
          Note that T is semi-linear (since each of the sets p^(Rlen), B, Sj is semi-linear,
and the class of semi-linear sets is closed under intersection, see [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]).
(6) Show that for all w 2 fag we have: jwj 2 proj1(T ) i (v0; vjwj) 2 QL(Gw),
where proj1 projects each element of T to its rst component.
        </p>
        <p>
          As a consequence, L = fw 2 fag j jwj 2 proj1(T )g is regular, since proj1(T ) is
semi-linear (cf. Harrison [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). This completes the proof sketch of Lemma 12. A
detailed proof can be found in the full version.
tu
In Section 3.1 of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Barcelo et al. mention that ECRPQs are able to express
queries corresponding to regular expressions with backreferencing (or extended
regular expressions ) (cf. Aho [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], Freydenberger [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). These expressions extend
the regular expressions with variable binding and repetition operators; e. g., for
every expression , the extended expression ( )%x xx generates the language of
all www with w 2 L( ) ( generates some w 2 L( ), %x assigns that w to x,
and the subsequent uses of x repeat this w { hence, xx generates ww).
        </p>
        <p>
          Let L := fan j n 4; n is a composite numberg: According to Lemma 12,
FL is not ECRPQ-expressible (as L is not regular). On the other hand, L is
generated by the extended regular expression (a a+)%x x+ (cf. C^ampeanu et
al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). This demonstrates that ECRPQs are not able to express all queries that
correspond to extended regular expressions.
        </p>
        <p>Relative Succinctness. We can adapt Lemma 6 to observe the following
result on the decidability of expressibility:</p>
      </sec>
      <sec id="sec-4-4">
        <title>Theorem 13. CRPQ-expressibility for ECRPQs is not co-semi-decidable.</title>
        <p>Proof. This follows from the proof of Theorem 9, a variation of Lemma 11, and
the observation that INVALC(M) is regular i dom(M) is nite. Regarding the
latter, note that if dom(M) is nite, INVALC(M) is co- nite; if dom(M) is
in nite, non-regularity of INVALC(M) can be established using standard tools.
This allows us to e ectively construct an ECPRQ Q from a Turing machine M
such that Q is CRPQ-expressible if and only if dom(M) is nite.</p>
        <p>
          Finiteness of dom(M) is a 20-complete problem in the arithmetical hierarchy
(cf. Kozen [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]); hence, CRPQ-expressibility is 20-hard, which means that this
problem is neither semi-decidable, nor co-semi-decidable.
tu
Using Theorem 13 in conjunction with a technique that is due to Hartmanis [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
and has been widely used in Formal Language Theory (cf. Kutrib [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]), we
obtain a result on the relative succinctness of ECRPQs and CRPQs. One of the
bene ts of that technique is that it applies to a wide range of di erent reasonable
de nitions of the size of an ECRPQ.
        </p>
        <p>In order to be as general as possible, we de ne a complexity measure for
ECRPQs as a computable function c from the set of all ECRPQs to N, such
that for every nite alphabet , the set of all ECRPQs Q over (i) can be
e ectively enumerated in order of increasing c(Q), and (ii) does not contain
in nitely many ECRPQs with the same value c(Q). As the following theorem
demonstrates, no matter which complexity measure we choose, the size tradeo
between ECRPQs and CRPQs is not bounded by any recursive function:</p>
        <sec id="sec-4-4-1">
          <title>Theorem 14. Let be a nite alphabet with j j 2. For every recursive</title>
          <p>function f : N ! N and every complexity measure c, there exists an ECRPQ Q
over such that Q is CRPQ-expressible, but for every CRPQ Q0 with Q0 Q,
c(Q0) &gt; f (c(Q)).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Aho</surname>
          </string-name>
          .
          <article-title>Algorithms for nding patterns in strings</article-title>
          . In J. van Leeuwen, editor,
          <source>Handbook of Theoretical Computer Science</source>
          , volume
          <string-name>
            <given-names>A.</given-names>
            <surname>Elsevier</surname>
          </string-name>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <article-title>The Design and Analysis of Computer Algorithms</article-title>
          . Addison-Wesley, Reading, MA,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Albert</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wegner</surname>
          </string-name>
          .
          <article-title>Languages with homomorphic replacements</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>16</volume>
          :
          <fpage>291</fpage>
          {
          <fpage>305</fpage>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Barcelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hurtado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Wood</surname>
          </string-name>
          .
          <article-title>Expressive languages for path queries over graph-structured data</article-title>
          .
          <source>In Proc. PODS'10</source>
          , pages
          <fpage>3</fpage>
          {
          <fpage>14</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          .
          <article-title>Containment of conjunctive regular path queries with inverse</article-title>
          .
          <source>In KR'00</source>
          , pages
          <fpage>176</fpage>
          {
          <fpage>185</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. C. Ca^mpeanu, K. Salomaa, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>A formal study of practical regular expressions</article-title>
          .
          <source>International Journal of Foundations of Computer Science</source>
          ,
          <volume>14</volume>
          :
          <fpage>1007</fpage>
          {
          <fpage>1018</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          .
          <article-title>Optimization properties for classes of conjunctive regular path queries</article-title>
          .
          <source>In Proc. DBPL'01</source>
          , volume
          <volume>2397</volume>
          <source>of LNCS</source>
          , pages
          <volume>21</volume>
          {
          <fpage>39</fpage>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Query containment for conjunctive queries with regular expressions</article-title>
          .
          <source>In Proc. PODS'98</source>
          , pages
          <fpage>139</fpage>
          {
          <fpage>148</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Freydenberger</surname>
          </string-name>
          .
          <article-title>Extended regular expressions: Succinctness and decidability</article-title>
          .
          <source>In Proc. STACS</source>
          <year>2011</year>
          , pages
          <fpage>507</fpage>
          {
          <fpage>518</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Spanier</surname>
          </string-name>
          .
          <article-title>Bounded ALGOL-like languages</article-title>
          .
          <source>Transactions of the American Mathematical Society</source>
          ,
          <volume>113</volume>
          (
          <issue>2</issue>
          ):
          <volume>333</volume>
          {
          <fpage>368</fpage>
          ,
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          .
          <article-title>Introduction to Formal Language Theory</article-title>
          . Addison Wesley Publishing Company,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hartmanis</surname>
          </string-name>
          .
          <article-title>On Godel speed-up and succinctness of language representations</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ):
          <volume>335</volume>
          {
          <fpage>342</fpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kozen</surname>
          </string-name>
          .
          <source>Theory of Computation</source>
          . Springer-Verlag, London,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutrib</surname>
          </string-name>
          .
          <article-title>The phenomenon of non-recursive trade-o s</article-title>
          .
          <source>International Journal of Foundations of Computer Science</source>
          ,
          <volume>16</volume>
          (
          <issue>5</issue>
          ):
          <volume>957</volume>
          {
          <fpage>973</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>K.</given-names>
            <surname>Salomaa</surname>
          </string-name>
          . Patterns.
          <source>In Formal Languages and Applications, number 148 in Studies in Fuzziness and Soft Computing</source>
          , pages
          <volume>367</volume>
          {
          <fpage>379</fpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>