=Paper= {{Paper |id=None |storemode=property |title=Relative Expressiveness of Nested Regular Expressions |pdfUrl=https://ceur-ws.org/Vol-866/paper13.pdf |volume=Vol-866 |dblpUrl=https://dblp.org/rec/conf/amw/BarceloPR12 }} ==Relative Expressiveness of Nested Regular Expressions== https://ceur-ws.org/Vol-866/paper13.pdf
                   Relative Expressiveness of
                   Nested Regular Expressions

              Pablo Barceló1 , Jorge Pérez1 , and Juan L. Reutter2
             1
                 Department of Computer Science, Universidad de Chile
                   2
                     School of Informatics, University of Edinburgh



      Abstract. Nested regular expressions (NREs) have been proposed as a
      powerful formalism for querying RDFS graphs, but not too much investi-
      gation on NREs has been pursued in a more general graph database con-
      text. In this paper we study the relative expressiveness of NREs by com-
      paring it with the language of conjunctive two-way regular path queries
      (C2RPQs), which is one of the most widely studied query languages for
      graph databases. Among other results, we show that NREs and C2RPQs
      are incomparable in terms of expressive power, but NREs properly ex-
      tend the language of unions of acyclic C2RPQs. Even more, there is
      a natural fragment of NREs that coincide in expressive power with the
      class of unions of acyclic C2RPQs. Our results, plus previous results that
      show that NREs can be evaluated in linear time in combined complexity,
      put forward NREs as a query language for graph-structured data that
      deserves further attention.


1   Introduction
Graph-structured data has become ubiquitous in current data-centric applica-
tions. Social networks, bioinformatics, astronomic databases, digital libraries,
Semantic Web, and linked government data, are only a few examples of applica-
tions in which structuring data as graphs is essential. This has naturally raised
the need for data management tools that can cope with the challenges posed by
today’s graph data applications.
    Traditional relational query languages do not appropriately cope with the
querying problematics raised by graph-structured data. The reason for this is
twofold. First, in the context of graph databases one is typically interested in
navigational queries, i.e. queries that traverse the edges of the graph while check-
ing for the existence of paths satisfying certain conditions. However, most re-
lational query languages, such as SQL, are not designed to deal with this kind
of recursive queries [1]. Second, current graph database applications tend to be
massive in size (think, for instance, of social networks or astronomic databases,
that may store terabytes of information). Thus, one can immediately dismiss any
query language that cannot be evaluated in polynomial time (or even in linear
time!). On the other hand, even the core of the usual relational query languages
– conjunctive queries (CQs) – does not satisfy this property. In fact, parameter-
ized complexity analysis tells us that – under widely-held complexity theoretical




                                        180
analysis – CQs over graph databases cannot be evaluated in time 0(|G|c · f (|φ|)),
where c ≥ 1 is a constant and f : N → N is a computable function [17].
     One of the languages for graph-structured data that has received much at-
tention in the literature is the class of conjunctive regular path queries (CRPQs)
[9, 13, 3, 2], together with their extension with inverses (that is known as the
class of C2RPQs). This language can express complex graph patterns by means
of navigational features and existential quantification over node ids. But as it is
to be expected, the good expressiveness properties of CRPQs are not for free.
In fact, CRPQs properly contain the class of CQs, and hence its applicability is
limited, at least for querying very large graph databases.
     To overcome this problem, while at the same time retain reasonable expres-
siveness, one can follow at least two strategies. The first one is to consider struc-
tural restrictions on the joins of C2RPQs that yield efficient query evaluation. A
typical example is the class of acyclic C2RPQs [2], that is, C2RPQs whose un-
derlying undirected graph is acyclic. A query Q from this class can be evaluated
over a graph database G in time O(|G| · |Q|) (see [19]), that is, linearly in both
the size of the graph database and the query, showing that acyclic C2RPQs are
well-suited for practical applications. The same complexity bound applies when
Q is a union of acyclic C2RPQs.
     Another strategy is enhancing the navigational features of the language, while
disallowing conjunctions altogether. A language that has been designed follow-
ing this premise is the language of nested regular expressions (NREs) [18], that
extends classical regular expressions with inverses and an existential branching
operator à la XPath [16]. This class of expressions was proposed in [18] for query-
ing Semantic Web data, and was shown to be useful for querying RDF graphs
(the data model for the Semantic Web) in the presence of the RDFS vocab-
ulary [15]. The language of NREs share the good query evaluation properties
of the class of acyclic C2RPQs. In particular, as shown in [18], NREs can be
evaluated in linear time both in the size of the data and the expression.
     Even though the expressiveness of NREs has been studied in the RDF/S
context [18], not much is known about the relative expressive power of NREs
with respect to C2RPQs. This paper intends to fill this gap, providing a complete
picture of the relative expressiveness of NREs and C2RPQs. We start by showing
that NREs and C2RPQs are incomparable in terms of their expressive power.
The intuition behind this is simple. While C2RPQs, and even CRPQs, can verify
the existence of cycles satisfying certain properties over the data, NREs are by
definition acyclic. On the other hand, NREs allow defining the transitive closure
of an expression that makes use of the existential test, a feature that cannot be
simulated within the class of C2RPQs.
     Afterwards, we prove that the class of NREs properly extend the class of
unions of acyclic C2RPQs that satisfy a mild syntactic condition (namely, that
the underlying undirected graph of each disjunct is connected). Finally, we prove
that there is a natural fragment of NREs that coincides in expressiveness with
the class of unions of acyclic C2RPQs.




                                         2




                                       181
    Our results can be read in the following way. The class of acyclic C2RPQs
is well-behaved in terms of query evaluation. Both NREs and C2RPQs extend
this class in incomparable ways. But only one of these extensions, namely NREs,
preserves the good evaluation properties of acyclic C2RPQs. As such, the results
of this paper are also a way of putting forward NREs, as they show that NREs
constitute a class of graph queries that exhibit a good balance between efficiency
and expressiveness. Therefore, we believe that NREs deserve a place as a general
query language for graphs, beyond the RDFs context for which it was originally
proposed.
    In terms of related work, the relative expressiveness of a myriad of naviga-
tional languages for graph databases has been recently studied in [12]. But none
of our results can be directly obtained from the results in such paper. This is
because in [12] the study was restricted to navigational query algebras that do
not allow expressing complex joins, such as CRPQs. Moreover, no syntactical
restriction that resembles acyclicity is considered in such work.
    Organization of the paper. Section 2 introduces the notions that will
be used along the paper. In Section 3 we prove that C2RPQs and NREs are
incomparable in terms of expressive power. Then in Section 4 we show that,
under a mild restriction on the class of queries allowed, the class of unions of
acyclic C2RPQs coincides with a syntactic fragment of the class of NREs. Finally,
in Section 5, we provide conclusions for the paper. Due to space limitations some
proofs have been moved to the appendix.


2     Preliminaries

2.1    Graph databases

Let V be a countably infinite set of node ids, and Σ a finite alphabet. A graph
database G over Σ is a pair (V, E), where V is a finite set of node ids (that
is, V is a finite subset of V) and E ⊆ V × Σ × V . Therefore, G is essentially
an edge-labeled directed graph, where the fact that (u, a, v) belongs to E means
that there is an edge from node u into node v labeled a. For a graph database
G = (V, E), we write (u, a, v) ∈ G whenever (u, a, v) ∈ E.


2.2    Nested regular expressions

Let Σ be a finite alphabet. Nested regular expressions (NREs) over Σ extend
regular expressions with an existential nesting test operator [(·)] (or just nesting
operator, for short), and an inverse operator a− , over each a ∈ Σ [18]. The
syntax of NREs is given by the following grammar.

    nexp := ε | a (a ∈ Σ) | a− (a ∈ Σ) | nexp · nexp
                                         nexp ∗ | nexp + nexp | [nexp]          (1)

As it is customary, we use n + as shortcut for n · n ∗ .


                                         3




                                       182
                                              sw:journal                     dc:creator
                              journal:jacm                  inJacm:HopcroftT74dc:                   :Robert Endre Tarjan
                                                                                  cre
                                                                                      ato
                                                                                          r
                sw:series                     dct:partOf                     dc:creator
    conf:focs                 inFocs:FOCS8               inFocs:HopcroftU67a dc:                    :John E. Hopcroft
                                                                                  cre
                                                                                      ato
                                                                                          r

                                                                                                r   :Jeffrey D. Ullman
                                                                                          ato
                                                                                      cre
                                                                                  dc:
                 sw:series                    dct:partOf                         dc:creator
    conf:pods   sw:            inPods:83                    inPods:FaginUV83      dc:                 :Ronald Fagin
                    ser                                                               cre
                        ies                                                               ato
                                                                                               r
                                              dct:partOf                         dc:creator
                               inPods:95                      inPods:Vardi95                  o r    :Moshe Y. Vardi
                                                                                          a t
                                                                                      cre
                                                                                  dc:
                                              sw:journal                         dc:creator
                              journal:iandc                 inIandc:VardiW94                          :Pierre Wolper


Fig. 1. A fragment of the RDF Linked Data representation of DBLP [10] available at
http://dblp.l3s.de/d2r/.


    Intuitively, NREs specify pairs of node ids in a graph database, subject to the
existence of a path satisfying a certain regular condition among them. That is,
each NRE nexp over Σ defines a binary relation JnexpKG when evaluated over a
graph database G over Σ. This binary relation is defined inductively as follows,
where we assume that a is a symbol in Σ, and n, n1 and n2 are arbitrary NREs:
                          JεKG = {(u, u) | u is a node id in G}
                          JaKG = {(u, v) | (u, a, v) ∈ G}
                      Ja− KG = {(u, v) | (v, a, u) ∈ G}
                  Jn1 · n2 KG = Jn1 KG ◦ Jn2 KG
                Jn1 + n2 KG = Jn1 KG ∪ Jn2 KG
                     Jn ∗ KG = JεKG ∪ JnKG ∪ Jn · nKG ∪ Jn · n · nKG ∪ · · ·
                      J [n] KG = {(u, u) | there exists v s.t. (u, v) ∈ JnKG }.
Here, the symbol ◦ denotes the usual composition of binary relations, that is,
Jn1 KG ◦ Jn2 KG = {(u, v) | there exists w s.t. (u, w) ∈ Jn1 KG and (w, v) ∈ Jn2 KG }.
Example 1. Let G be the graph database in Figure 1. This graph contains a
fragment of the RDF Linked Data representation of DBLP [10].3 The following
is a simple NRE that matches all pairs (x, y) such that x is an author that
published a paper in conference y (for simplicity, we omit prefixes dc:, dct:, and
sw: in edge labels):

                                  n1 = creator− · partOf · series
For example, (:Jeffrey D. Ullman, conf:focs) and (:Ronald Fagin, conf:pods) are
in Jn1 KG .
    Consider now the following expression that matches pairs (x, y) such that x
and y are connected by a coautorship sequence:
                                     n2 = (creator− · creator)+
3
    For brevity we have not depicted all the RDF-URI prefixes in the figure.


                                                        4




                                                     183
For example, the pair (:John E. Hopkroft, :Pierre Wolper) is in Jn2 KG .
   Finally the following expression matches all pairs (x, y) such that x and y
are connected by a coautorship sequence that only considers conference papers:
                   n3 = (creator− · [ partOf · series] · creator)+
Let us give the intuition of the evaluation of this expression. Assume that
we start at node u. The (inverse) edge creator− makes us to navigate from
u to a paper v created by u. Then the existential test [ partOf · series] is
used to check that from v we can navigate to a conference (and thus, v is
a conference paper). Finally, we follow edge creator from v to an author w
of v. The (·)+ over the expression allows us to repeat this sequence several
times. For instance, (:John E. Hopkroft, :Moshe Y. Vardi) is in Jn3 KG , but the
pair (:John E. Hopkroft, :Pierre Wolper) is not in Jn3 KG .                   
    The following result, proved in [18], shows a remarkable property of NREs.
It states that the query evaluation problem for NREs is not only polynomial in
combined complexity (i.e. when both the database and the query are given as
input), but also that it can be solved linearly in both the size of the database
and the expression. Given a graph database G and an NRE nexp, we use |G| to
denote the size of G (in terms of the number of edges (u, a, v) ∈ G), and |nexp|
to denote the size of nexp.
Proposition 1 (from [18]). Checking, given a graph database G, a pair of
nodes (u, v), and an NRE nexp, whether (u, v) ∈ JnexpKG , can be done in time
O(|G| · |nexp|).
   Proposition 1 can be proved by adapting previously existing linear time al-
gorithms for propositional dynamic logic (PDL) [11].

2.3    Conjunctive regular path queries
In Section 3 we compare NREs with one of the the most classical languages
used for querying graph data: conjunctive regular-path queries (CRPQs) [5, 6].
Since NREs consider the inverse operator, we actually compare NREs with the
language obtained by extending CRPQs with inverse. This extension is known as
C2RPQs [6, 7]. We also consider the language of unions of C2RPQs (UC2RPQs).
    The definition of C2RPQs is based on the notion of regular expression with
inverse. This is an NRE that does not use the nesting operator [ · ]. For instance,
both expressions n1 and n2 in Example 1 are regular expressions with inverse.
The evaluation of a regular expression with inverse r over a graph database G
is inherited from the evaluation of NREs, i.e. the evaluation of r over a graph
database G is the set of pairs JrKG .
    Let x̄ = (x1 , . . . , xn ) and ȳ = (y1 , . . . , ym ) be (possibly empty) tuples of
distinct variables. We denote by x̄ ∪ ȳ the set {x1 , . . . , xn , y1 , . . . , ym }. A C2RPQ
over alphabet Σ with free variables x̄, is a formula ϕ(x̄) of the form
                                                                                  
                 ∃ȳ (z1 , r1 , z1′ ) ∧ (z2 , r2 , z2′ ) ∧ · · · ∧ (zk , rk , zk′ ) ,       (2)


                                              5




                                            184
where (1) ri is a regular expression with inverse over Σ, for every 1 ≤ i ≤
k, (2) z1 , z1′ , . . . zk , zk′ are not necessarily distinct variables, and (3) the set
{z1 , z1′ , . . . , zk , zk′ } is precisely x̄ ∪ ȳ. The arity of a C2RPQ is the number of free
variables in the formula (n in this case).
    Given a tuple c̄ = (c1 , . . . , cn ) and a graph database G over Σ, we say that G
satisfies ϕ(c̄), denoted by G |= ϕ(c̄), if there exists a mapping h from x̄ ∪ ȳ to the
node ids in G such that h(xj ) = cj for every 1 ≤ j ≤ n, and (h(zi ), h(zi′ )) ∈ Jri KG
for every 1 ≤ i ≤ k. The evaluation of ϕ(x̄) over G, denoted by Jϕ(x)KG , is the
set of tuples {c̄ | G |= ϕ(c̄)}. CRPQs are obtained from C2RPQs by forbidding
the use of the inverse operator. Finally, a UC2RPQ (resp. UCRPQ) is a formula
ψ(x̄) of the form ϕ1 (x̄) ∨ · · · ∨ ϕk (x̄) where ϕi (x̄) is a C2RPQ (resp. CRPQ) for
each 1 ≤ i ≤ k, and we have that c̄ ∈ Jψ(x̄)KG if and only if c̄ ∈ Jϕi (x̄)KG for
some 1 ≤ i ≤ k.
Example 2. Let us consider again the expression n1 used in Example 1. It is easy
to see that n1 can be expressed as the following C2RPQ:
                                                                           
     Qn1 (x, y) := ∃u∃v (x, creator− , u) ∧ (u, partOf, v) ∧ (u, series, y) .

That is, for each graph database G it is the case that Jn1 KG = JQn1 KG . In
the same way, n2 can be expressed as the C2RPQ Qn2 (x, y) := (x, (creator− ·
creator)+ , y), that does not use existential quantification.
    On the other hand, it can be proved that n3 cannot be expressed as a
UC2RPQ. The intuitive reason is that n3 makes use of recursion (Kleene-∗)
over an expression that uses the nesting operator [ · ], which is a feature that
cannot be codified by UC2RPQs in general. We will see a simpler example of
this kind of behavior later when we prove Theorem 1.                          
   As we already mentioned in Section 1, C2RPQs do not share the good query
evaluation properties of NREs. In fact, the combined complexity of CRPQs
is NP-complete [8, 2]. Furthermore, under widely-held complexity theoretical
assumptions, the parameterized complexity of CRPQs is also intractable [17],
which means that they cannot be evaluated in time O(|G|c ·f (|Q|)), for a constant
c ≥ 0 and a computable function f : N → N.


3    Expressiveness of NREs in terms of C2RPQs
In what follows we compare the expressive power of NREs with respect to
C2RPQs. In particular, we say that an NRE nexp over Σ can be expressed
as a (U)C2RPQ, if there exists a (U)C2RPQ ϕ(x, y) over Σ such that for every
graph database G over Σ it holds that JnexpKG = Jϕ(x, y)KG . Symmetrically, we
say that a (U)C2RPQ ϕ(x, y) over Σ can be expressed as an NRE, if there exists
an NRE nexp over Σ such that for every graph database G over Σ it holds that
Jϕ(x, y)KG = JnexpKG . Notice that it is only meaningful to compare NREs with
binary (U)C2RPQs (that is, with UC2RPQs with exactly two free variables).
    The main result of this section is that NREs and C2RPQs are incomparable
in terms of expressive power. This is a direct consequence of Proposition 2 and


                                              6




                                            185
Theorem 1 below. In fact, we prove something stronger: there is a CRPQ that
cannot be expressed as an NRE, and there is an NRE that cannot be expressed
as a UC2RPQ.
Proposition 2. There exists a binary CRPQ over the unary alphabet Σ = {a}
that cannot be expressed as an NRE.
Proof. Consider the CRPQ ϕ(x, y) given by the expression (x, a, y)∧(y, a, x). We
next show that ϕ(x, y) cannot be expressed as an NRE. Consider the following
graph G1 over alphabet Σ:
                                                  a
                                         c0              c1


                                           a             a

                                                 c2


Notice that Jϕ(x, y)KG1 = ∅, that is, there are no values u, v such that G1 |=
ϕ(u, v). We show that every NRE nexp satisfies JnexpKG1 6= ∅, which proves that
ϕ cannot be expressed as an NRE. In particular, we prove something stronger:
for every NRE nexp there exists values x0 , x1 , x2 , y0 , y1 , y2 ∈ {c0 , c1 , c2 } such
that (c0 , x0 ), (c1 , x1 ), (c2 , x2 ), (y0 , c0 ), (y1 , c1 ), (y2 , c2 ) ∈ JnexpKG1 . We prove this
by induction on the construction of the expression.
    We have three base cases:
 1. if nexp = ε then JnexpKG1 = {(c0 , c0 ), (c1 , c1 ), (c2 , c2 )}, and the property
    holds.
 2. if nexp = a then JnexpKG1 = {(c0 , c1 ), (c1 , c2 ), (c2 , c0 )}, and the property
    holds.
 3. if nexp = a− then JnexpKG1 = {(c0 , c2 ), (c1 , c0 ), (c2 , c1 )}, and the property
    holds.
Assume now that nexp 1 and nexp 2 satisfy the property. We have several cases.
 4. if nexp = nexp 1 + nexp 2 then JnexpKG1 = Jnexp 1 KG1 ∪ Jnexp 2 KG2 and then
    the property clearly holds.
 5. assume that nexp = nexp 1 ·nexp 2 . Then we know that that there exist values
    xi ’s such that (ci , xi ) ∈ Jnexp 1 KG1 for i ∈ {0, 1, 2}. Notice that every xi ∈
    {c0 , c1 , c2 } and thus, we know that there exists values u0 , u1 , u2 such that
    (xi , ui ) ∈ Jnexp 2 KG1 for i ∈ {0, 1, 2}. This implies that (ci , ui ) ∈ JnexpKG1
    for i ∈ {0, 1, 2}. By a symmetric argument we can show that there exists
    values v0 , v1 , v2 such that (vi , ci ) ∈ JnexpKG1 for i ∈ {0, 1, 2}.
 6. if nexp = nexp ∗1 the property holds by using 1), 4), 5) and the definition of
    Jnexp ∗1 KG1 .
 7. if nexp = [nexp 1 ] then since nexp 1 satisfies the property we have JnexpKG1 =
    {(c0 , c0 ), (c1 , c1 ), (c2 , c2 )} and then nexp also satisfies the property.
This completes the proof.                                                                           ⊓
                                                                                                    ⊔
Theorem 1. There exists an NRE over binary alphabet Σ = {a, b} that cannot
be expressed as a UC2RPQ.


                                                  7




                                               186
Proof. Consider the NRE nexp = (a · [b])+ . We next show that nexp cannot be
expressed as a UC2RPQ ψ(x, y). Assume for the sake of contradiction that this
is not the case and let N be the maximum number of conjuncts in a disjunct of
ψ(x, y). Now consider the following graph G

                            m1               m2                                      mK

                           b                 b                                       b
                    a                a                 a                      a
           n0               n1               n2                  ···                 nK


where K > N + 1. Now, since we are assuming that ψ(x, y) is equivalent to nexp
and given that (n0 , nK ) ∈ JnexpKG , we have that there exists a disjunct ϕ(x, y)
of ψ(x, y), such that (n0 , nK ) ∈ Jϕ(x, y)KG . Assume without loss of generality
that the conjunctive part of ϕ(x, y) is of the form (z1 , r1 , z1′ ) ∧ · · · ∧ (zN , rN , zN    ′
                                                                                                  ).
Then by the semantics of C2RPQs, we know that there exists a mapping h from
{z1 , z1′ , . . . , zN , zN
                          ′
                            } to {n0 , n1 , m1 , . . . , nK , mK } such that h(x) = n0 , h(y) = nK
and for every i ∈ {1, . . . , N } there is a path ρhi between h(zi ) and h(zi′ ) such
that the sequence of edge labels λhi associated to ρhi , satisfies the expression ri .
We use the mapping h and the fact that K is sufficiently large compared with
N to obtain a contradiction.
     From h and ϕ(x, y) we construct a graph Gh(ϕ(x,y)) as follows. Initially con-
sider the values h(z1 ), h(z1′ ), . . . , h(zN ), h(zN        ′
                                                                ) as nodes in Gh(ϕ(x,y)) . Now, for
                                  ′                       h
every conjunct (zi , ri , zi ) of ϕ(x, y) let λi = a0 a1 . . . ak be the sequence of edges
(and inverse edges) in the path between h(zi ) and h(zi′ ) that satisfies the regular
expression with inverse ri when evaluated over G. Then, we include k fresh nodes
s1 , s2 , . . . , sk to Gh(ϕ(x,y)) and the following edges. Assuming that s0 = h(zi ) and
that sk+1 = h(zi′ ), then for every j ∈ {1, . . . , k + 1}:
 – if aj = a or aj = b then add edge (sj−1 , aj , sj ) to Gh(ϕ(x,y)) , and
 – if aj = a− or aj = b− then add edge (sj , aj , sj−1 ) to Gh(ϕ(x,y)) .
    By the construction of Gh(ϕ(x,y)) it is easy to conclude that (h(x), h(y)) =
(n0 , nK ) ∈ Jϕ(x, y)KGh(ϕ(x,y)) . We prove below that (n0 , nK ) ∈
                                                                  / JnexpKGh(ϕ(x,y)) .
    First notice that if n0 and nK are in different connected components in
Gh(ϕ(x,y)) , then clearly (n0 , nK ) ∈/ JnexpKGh(ϕ(x,y)) . Thus, assume that n0 and
nK are in the same connected component of Gh(ϕ(x,y)) . Notice that since n0 and
nK are at distance K in G, then any semi-path (that is a path using forward
and backward edges [6]) connecting n0 and nK in Gh(ϕ(x,y)) should have at
least K edges (each edge either a forward or backward edge). Now, given that
ϕ(x, y) has N conjuncts and K > N + 1 we know that everyone of the paths
connecting n0 and nK in Gh(ϕ(x,y)) should contain a portion that was constructed
by converting a sequence of edge labels of G into a linear semi-path in Gh(ϕ(x,y)) .
All this implies that every possible (semi) path from n0 to nK in Gh(ϕ(x,y)) should
contain some intermediate nodes such that the sum of the out and in-degrees
of the node is at most 2 (they are part of a line). On the other hand, given
the semantics of NREs, every path that satisfies the expression (a · [b])+ should
be such that all intermediate nodes (that is without considering the first and


                                                  8




                                                 187
last node), should have in-degree at least 1 (one edge going into the node with
label a) and out-degree at least 2 (one edge going out with label a, and another
going out with label b), which implies that no path from n0 to nK in Gh(ϕ(x,y))
satisfies expression (a · [b])+ . Thus we have that (n0 , nK ) ∈ Jϕ(x, y)KGh(ϕ(x,y))
and then given that ϕ(x, y) is a disjunct in ψ(x, y), we also have that (n0 , nK ) ∈
Jψ(x, y)KGh(ϕ(x,y)) . But (n0 , nK ) ∈
                                     / JnexpKGh(ϕ(x,y)) , which contradicts the fact
that ψ(x, y) expresses nexp. This completes the proof of the theorem.              ⊓
                                                                                   ⊔

4    Expressiveness of NREs with respect to acyclic
     C2RPQs
The reason why NREs do not capture C2RPQs is because they are memory-
less, and that is why they cannot express CRPQs that form complex joins, in
particular, cyclic joins. On the other hand, acyclic C2RPQs syntactically re-
strict C2RPQs by forbidding the existence of those cycles. This suggests that
the expressive power of NREs may be better understood in terms of this class
of queries. In fact, we prove in this section that NREs properly extend the class
of acyclic UC2RPQs, under a mild syntactic restriction that asks for each dis-
junct of a UC2RPQ to be connected. However, in order to prove this we actually
show something more interesting: the class of acyclic and connected UC2RPQs
coincides with a natural syntactic restriction of the class of NREs.
     In order to formally define acyclic C2RPQs, we need to recall the standard
notion of the underlying undirected graph of a query. Consider a C2RPQ ϕ(x̄)
of the form ∃ȳ((z1 , r1 , z1′ ) ∧ (z2 , r2 , z2′ ) ∧ · · · ∧ (zk , rk , zk′ )). The underlying graph
of ϕ(x̄) is Gϕ(x̄) = (N, E), where N = x̄ ∪ ȳ and E = {{zi , zi′ } | 1 ≤ i ≤ k}. That
is, the nodes in Gϕ(x̄) are the variables in ϕ(x̄), and there is an edge between two
(not necessarily distinct) variables if they occur in the same conjunct in ϕ(x̄).
Then we say that ϕ(x̄) is acyclic if Gϕ(x̄) has no cycles. Similarly, we say that
ϕ(x̄) is connected if Gϕ(x̄) is a connected graph. We extend these definitions to
UC2RPQs, and say that a UC2RPQ is acyclic and connected if every one of its
disjuncts is acyclic and connected.
     We shall prove that every acyclic and connected UC2RPQ can be expressed
as an NRE. But, more interestingly, we can actually prove that this class of
queries precisely coincides with a fragment of NREs, specifically, those NREs
that do not allow the nesting operator to be used inside an expression of the
form s∗ . Formally, an NRE is nesting-star alternation free, if for every subex-
pression of nexp of the form s∗ , it holds that s does not use the nesting operator.
For instance, expressions n1 and n2 in Example 1 are nesting-star alternation
free, while expression n3 in the same example is not. We can prove that acyclic
and connected UC2RPQs and nesting-star alternation-free NREs have the same
expressive power.
Theorem 2.
(a) Let ϕ(x, y) be an acyclic and connected binary UC2RPQ over Σ. Then
    ϕ(x, y) can be expressed as a nesting-star alternation free NRE of linear
    size in |ϕ(x, y)|.


                                                 9




                                               188
(b) Let nexp be a nesting-star alternation-free NRE over Σ. Then nexp can be
    expressed as an acyclic and connected binary UC2RPQ of exponential size
    in |nexp|.

Proof. In order to prove (a) we use a technique which is similar to the roll-up of
an acyclic conjunctive query, which has been used to provide decidability results
for query containment in the context of description logics (see [14, 4]). In this
case we roll-up the acyclic conjunctive query into an NRE.
     Before going into the details of the proof of (1), we introduce the notion of
an inverse of an NRE. The inverse of an NRE exp over Σ, denoted by exp −1 ,
is defined inductively as follows: (1) ε−1 = ε, (2) a−1 = a− , for each a ∈ Σ,
(2) (a− )−1 = a, for each a ∈ Σ, (3) (exp 1 · exp 2 )−1 = exp −1                                  −1
                                                                                         2 · exp 1 , (4)
      ∗ −1          −1 ∗                              −1        −1            −1                      −1
(exp ) = (exp ) , (5) (exp 1 + exp 2 ) = exp 1 + exp 2 , and (6) [exp] =
[exp]. It is straightforward that for each graph G over Σ, it holds that (u, v) ∈
JexpKG iff (v, u) ∈ Jexp −1 KG , and moreover, exp −1 can be constructed in poly-
nomial time from exp.
     In order to prove (a), let ϕ(x, y) be an acyclic and connected C2RPQ, and
consider the graph Gϕ(x,y) constructed by having variables of ϕ(x, y) as nodes.
Moreover, for every conjunct of the form (zi , ri , zi′ ) we add an undirected edge
{zi , zi′ } with two labels ℓ(zi ,zi′ ) ({zi , zi′ }) = ri and ℓ(zi′ ,zi ) ({zi , zi′ }) = (ri )−1 , where
(ri )−1 is the inverse of ri . (Notice that if ri is a regular expression with inverse,
then (ri )−1 is also a regular expression with inverse of size linear in the size of
r.) Intuitively, label ℓ(zi ,zi′ ) ({zi , zi′ }) describes the regular expression with inverse
that allows to go from zi to zi′ , and ℓ(zi′ ,zi ) ({zi , zi′ }) the regular expression with
inverse that allows to go from zi′ to zi , when evaluating the query ϕ(x, y). Notice
that Gϕ(x,y) is a tree (since it is acyclic and connected).
     Now, for every node u in Gϕ(x,y) we define an NRE ν(u) as follows. First
consider the graph Gϕ(x,y)
                         ′
                                   obtained from Gϕ(x,y) by deleting all the edges in the
unique path between x and y. Notice that Gϕ(x,y)          ′
                                                                is no longer a tree, but a forest,
and every node of Gϕ(x,y) belongs to a unique tree in Gϕ(x,y)              ′
                                                                                   . Let Tu be the tree
in Gϕ(x,y) to which node u belongs, and assume that Tu is a rooted tree, with
      ′

root u. Then for every node v in Tu construct an expression τu (v) recursively as
follows:

 – if v is a leaf in Tu then τu (v) = ε
 – else, if v has v1 , . . . , vk as children in Tu then

             τu (v) = [ℓ(v,v1 ) ({v, v1 }) · τu (v1 )] · · · · · [ℓ(v,vk ) ({v, vk }) · τu (vk )]

Finally we say that ν(u) = τu (u). Notice that the size of ν(u) is linear in the size
of the tree Tu . Also notice that the fact that Gϕ(x,y) is actually a tree is crucial
for defining ν(u) for every node in Gϕ(x,y) .
    Now, with ν(u) for every possible variable of ϕ(x, y) we are ready to describe
the NRE that defines ϕ(x, y). Let x, u1 , u2 , . . . , uk , y be the unique path in Gϕ(x,y)


                                                   10




                                                 189
from x to y. Then we consider the expression

  nexp = ν(x) · ℓ(x,u1 ) ({x, u1 }) · ν(u1 ) · ℓ(u1 ,u2 ) ({u1 , u2 }) · ν(u2 ) · · · ·
                                                  · · · · ν(uk ) · ℓ(uk ,y) ({uk , y}) · ν(y).

It is not difficult to see that nexp is of size linear in the size of ϕ(x, y). Moreover,
it is not difficult to prove by induction that for every graph G it holds that
JnexpKG = Jϕ(x, y)KG . Essentially, expression nexp is testing for the existence of
the unique path (in the query ϕ(x, y)) between x and y, and also existentially
testing for all the branches that start from the variables in this path as described
in ϕ(x, y). In particular, let ψu (u) be the formula obtained by considering the
conjunction of all formulas (zi , ri , zi′ ) obtained from the edges of graph Tu with
all variables except for u existentially quantified. It is not difficult to prove that
for every graph G, it holds that (a) ∈ Jψu (u)KG if and only if (a, a) ∈ Jν(u)KG .
Moreover, it is easy to see that if (x, r0 , u1 ), (u1 , r1 , u2 ), . . . , (uk , rk , y) are the
conjuncts defining the single path from x to y in ϕ(x, y), then ϕ(x, y) is equivalent
to the formula
                  
  ∃u1 · · · ∃uk       ψx (x) ∧ (x, r, u1 ) ∧ ψu1 (u1 ) ∧ (u1 , r1 , u2 ) ∧ ψu2 (u2 ) · · ·
                                                                                                     
                                                        · · · ∧ ψuk (uk ) ∧ (uk , rk , y) ∧ ψy (y)


From this we obtain that our construction of nexp is correct and then nexp is
equivalent to ϕ(x, y). Notice that the expression nexp constructed above, uses
several levels of nesting (that essentially depend on the level of branching in
ϕ(x, y)) but does not use the Kleene-star operator over the nesting operator,
and thus, it is nesting-star alternation free.
   Now if we start with a UC2RPQ of the form ϕ1 (x, y) ∨ · · · ∨ ϕk (x, y), then
we can apply the above described process to every disjunct to construct NREs
nexp 1 , . . . , nexp k and then consider the expression nexp 1 + · · · + nexp k .

    We now prove (b). Given an NRE nexp, we construct a UC2RPQ ϕnexp (x, y)
by an inductive process on the construction of NREs. We consider a construction
in two steps. First for an NRE nexp that do not use the nesting operator we
just have that ϕnexp (x, y) = (x, nexp, y). Now for the NRE nexp = [nexp ′ ] we
construct the formula
                                                 
                   ϕnexp (x, y) = ∃u ϕnexp (x, u) ∧ (x, ε, y)
                                             ′




Notice that in this case we have used the conjunct (x, ε, y) to simulate an equality
just to ensure that the constructed formula is binary (one can also eliminate this
conjunct and consider unary formulas but this only complicates the inductive
argument). We only have two remaining cases:


                                                   11




                                                 190
 – if nexp = nexp 1 · nexp 2 with nexp 1 or nexp 2 NREs that use the nesting
   operator we have that
                                                                 
                 ϕnexp (x, y) = ∃u ϕnexp 1 (x, u) ∧ ϕnexp 2 (u, y) .

 – if nexp = nexp 1 + nexp 2 with nexp 1 or nexp 2 NREs that use the nesting
   operator we have that

                       ϕnexp (x, y) = ϕnexp 1 (x, y) ∨ ϕnexp 2 (x, y).

Notice that we do not need to consider the case nexp = (nexp ′ )∗ since we are
assuming that the expression is nesting-star alternation free. Also notice that the
formula constructed is not exactly a UC2RPQ (since it does not necessarily have
the disjunction always as a top-level logical operator) but can be transformed into
a UC2RPQ by distribution. This implies that given a NRE nexp the UC2RPQ
constructed by this process is of size exponential in the size of nexp.
    We show next that JnexpKG = Jϕnexp (x, y)KG for all G by using an inductive
argument. If nexp does not mention the nesting operator, then nexp is a regular
expression with inverse, and it is clear that JnexpKG = J(x, nexp, y)KG for every
G. Now assume that Jnexp ′ KG = Jϕnexp ′ (x, y)KG for every G and let nexp =
[nexp ′ ]. Then by definition of NREs, we have that J[nexp ′ ]KG = {(a, a) | there
exists b such that (a, b) ∈ Jnexp ′ KG }. Thus, by applying the induction hypothesis
we have that

     J[nexp ′ ]KG = {(a, a) | there exists b such that (a, b) ∈ Jϕnexp ′ (x, y)KG }.

which implies that

  J[nexp ′ ]KG = {(a, a) | a ∈ J∃u(ϕnexp ′ (x, u))KG } =
             {(a, a) | (a, a) ∈ J∃u(ϕnexp ′ (x, u)) ∧ (x, ε, y)KG } = Jϕnexp (x, y)KG .

and then JnexpKG = Jϕnexp (x, y)KG . The cases nexp = nexp 1 · nexp 2 and nexp =
nexp 1 + nexp 2 are straightforward.                                           ⊓
                                                                               ⊔

    We left open whether the exponential blow-up when translating from nesting-
star alternation free NREs into acyclic and connected UC2RPQs is necessary.
We end up this section by showing that both unions and the connectedness
condition are essential in Theorem 2; that is, that nesting-star alternation free
NREs are neither captured by the class of acyclic (but not necessarily connected)
UC2RPQs nor by the class of acyclic and connected C2RPQs. In view of Theorem
2, it is enough to prove the following simple proposition. The proof can be found
in the appendix.

Proposition 3. The following holds:
(a) There is a binary and acyclic CRPQ that cannot be expressed as a connected
    UC2RPQ.


                                           12




                                         191
(b) There is a binary, acyclic and connected UCRPQ that cannot be expressed
    as a C2RPQ.

    Notice that while part (a) is very simple to prove, part (b) is not as straight-
forward as it may seem at first sight, since C2RPQs allow for some form of
disjunction by means of the union (+) operator of regular expressions.


5   Concluding remarks

While both C2RPQs and NREs strictly subsume the class of acyclic C2RPQs,
only NREs share the good query evaluation properties of acyclic C2RPQs. In
fact, it is known that the query evaluation problem for both NREs and acyclic
C2RPQs can be solved linearly in both the size of the graph database and
the query, but it is intractable for CRPQs. Hence, our results suggest NREs
to be a good language for graph databases, exhibiting a good balance between
expressiveness and efficiency.
Acknowledgements Pablo Barceló was supported by Fondecyt grant 1110171,
Jorge Pérez by Fondecyt grant 11110404 and by VID grant U-Inicia 11/04 Uni-
versidad de Chile, and Juan Reutter by EPSRC grant G049165 and FET-Open
project FoX.


References
 1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases.
    Addison-Wesley, 1995.
 2. Pablo Barceló, Carlos A. Hurtado, Leonid Libkin, and Peter T. Wood. Expressive
    languages for path queries over graph-structured data. In Proceedings of the 29th
    ACM Symposium on Principles of Database Systems (PODS), pages 3–14, 2010.
 3. D. Calvanese, G. De Giacomo, M. Lenzerini, and M.Y. Vardi. Containment of con-
    junctive regular path queries with inverse. In Proceedings of the 7th International
    Conference on Principles of Knowledge Representation and Reasoning (KR), pages
    176–185, 2000.
 4. Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Conjunctive
    query containment and answering under description logic constraints. ACM Trans-
    actions on Computational Logic, 9(3), 2008.
 5. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi.
    Rewriting of regular expressions and regular path queries. In Proceedings of the
    18th ACM Symposium on Principles of Database Systems (PODS), pages 194–204,
    1999.
 6. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi.
    View-based query processing for regular path queries with inverse. In Proceedings
    of the 19th ACM Symposium on Principles of Database Systems (PODS), pages
    58–66, 2000.
 7. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi.
    Simplifying schema mappings. In Proceedings of the 14th International Conference
    on Database Theory (ICDT), pages 114–125, 2011.


                                          13




                                        192
 8. Ashok K. Chandra and Philip M. Merlin. Optimal implementation of conjunctive
    queries in relational data bases. In Proceedings of the 9th Annual ACM Symposium
    on Theory of Computing (STOC), pages 77–90, 1977.
 9. M. Consens and A.O. Mendelzon. GraphLog: A visual formalism for real life
    recursion. In Proceedings of the 9th Symposium on Principles of Database Systems
    (PODS), pages 404–416, 1990.
10. D2R DBLP bibliography database hosted at L3S research center.
    http://dblp.l3s.de/d2r/.
11. Michael J. Fischer and Richard E. Ladner. Propositional modal logic of programs
    (extended abstract). In Proceedings of the 9th Annual ACM Symposium on Theory
    of Computing (STOC), pages 286–294, 1977.
12. George H. L. Fletcher, Marc Gyssens, Dirk Leinders, Jan Van den Bussche,
    Dirk Van Gucht, Stijn Vansummeren, and Yuqing Wu. Relative expressive power
    of navigational querying on graphs. In Proceedings of the 14th International Con-
    ference on Database Theory (ICDT), pages 197–207, 2011.
13. D Florescu, A. Levy, and D. Suciu. Query containment for conjunctive queries with
    regular expressions. In Proceedings of the 17th ACM Symposium on Principles of
    Database Systems (PODS), pages 139–148, 1998.
14. Birte Glimm, Ian Horrocks, and Ulrike Sattler. Conjunctive query answering for
    description logics with transitive roles. In Proceedings of the 2006 International
    Workshop on Description Logics (DL), 2006.
15. P. Hayes. RDF Semantics, W3C Recommendation.
    http://www.w3.org/TR/rdf-mt, February 2004.
16. S. DeRose. J. Clark. Xml path language (xpath). W3C Recommendation, Novem-
    ber 1999, http://www.w3.org/TR/xpath.
17. Christos Papadimitriou and Mihalis Yannakakis. On the complexity of database
    queries. Journal of Computer and System Sciences, 58(3):407–427, 1999.
18. Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. nSPARQL: A navigational
    language for RDF. Journal of Web Semantics, 8(4):255–270, 2010.
19. Mihalis Yannakakis. Algorithms for acyclic database schemes. In Proceedings of
    the 7th International Conference on Very Large Databases (VLDB), pages 82–94,
    1981.




                                         14




                                        193
A     Appendix

A.1     Proof of Proposition 3

The proof of part (a) is simple: we only have to consider alphabet Σ = {a, b}
and the query Q(x, y) := ∃z∃u((x, a, z) ∧ (u, b, y)) over Σ. Clearly, Q is a binary
and acyclic CRPQ. Notice, on the other hand, that Q is not connected. We claim
that Q cannot be expressed as a connected UC2RPQ Q′ (x, y). Assume otherwise.
Consider the graph database G over Σ whose set of nodes is {1, 2, 3, 4}, there is
an a-labeled edge in G from 1 to 2, and there is an b-labeled edge in G from 3
to 4. Clearly, (1, 4) ∈ JQKG but (1, 4) 6∈ JQ′ KG . This is because each disjunct of
Q′ is connected, while 1 and 4 belong to different connected components of G.
The latter contradicts the fact that Q and Q′ are equivalent.
    Now we prove part (b). Consider alphabet Σ = {a, b, c, d, e, f } and the query

Q(x, y) = ∃u∃v((x, a, u)∧(u, b, v)∧(u, c, y)) ∨ ∃u∃v((x, d, u)∧(u, e, v)∧(u, f, y)).

Clearly, Q is a binary, acyclic and connected UCRPQ. We claim that Q cannot
be expressed as a C2RPQ Q′ (x, y). Assume for the sake of contradiction that
this is the case. We construct below a graph G over Σ such that JQKG 6= JQ′ KG ,
which is a contradiction.
    Consider graph databases G1 and G2 over Σ, such that the set of edges of
G1 is {(1, a, 2), (2, b, 3), (2, c, 4)} and that of G2 is {(1, d, 2), (2, e, 3), (2, f, 4)}.
Clearly, (1, 4) ∈ JQKG1 and (1, 4) ∈ JQKG2 . Hence (1, 4) ∈ JQ′ KG1 and (1, 4) ∈
JQ′ KG2 . Assume without loss of generality that the conjunctive part of Q′ is
of the form (z1 , r1 , z1′ ) ∧ · · · ∧ (zN , rN , zN
                                                   ′
                                                     ). Then by the semantics of C2RPQs,
we know that there exists a mapping h1 (resp. h2 ) from {z1 , z1′ , . . . , zN , zN       ′
                                                                                            } to
{1, 2, 3, 4} such that h1 (x) = 1, h1 (y) = 4 (resp. h2 (x) = 1, h2 (y) = 4) and for
every i ∈ {1, . . . , N } it holds that there is a path ρhi 1 (resp. ρhi 2 ) between h1 (zi )
and h1 (zi′ ) in G1 (resp. between h2 (zi ) and h2 (zi′ ) in G2 ) such that the sequence
of edge labels λhi 1 (resp. λhi 2 ) associated to ρhi 1 (resp. ρhi 2 ), satisfies ri .
    From h1 and Q′ we construct a graph database Gh1 (Q′ ) over alphabet Σ1 =
{a, b, c, ε} as follows. Initially consider the variables z1 , z1′ , . . . , zN , zN
                                                                                   ′
                                                                                       as nodes
in Gh1 (Q′ ) . Now, for every conjunct (zi , ri , zi ) of Q (x, y) assume that λhi 1 =
                                                         ′      ′

a0 a1 . . . ak is the sequence of edge labels (and inverse edge labels) in the path
between h1 (zi ) and h1 (zi′ ) that satisfies the regular expression with inverse ri
when evaluated over G1 . We consider two cases:

 1. Assume first that λhi 1 is the empty string. Then we connect zi to zi′ in G1
    by an edge labeled ε.
 2. Assume otherwise. Then, we include k fresh nodes s1 , s2 , . . . , sk to Gh1 (Q′ )
    and the following edges. Assuming that s0 = zi and that sk+1 = zi′ , then for
    every j ∈ {1, . . . , k + 1}:
     – if aj = a, aj = b or aj = c then add edge (sj−1 , aj , sj ) to Gh1 (Q′ ) , and
     – if aj = a− , aj = b− or aj = c− then add edge (sj , aj , sj−1 ) to Gh1 (Q′ ) .


                                              15




                                             194
We denote by νih1 the path zi = s0 , a0 , s1 , a1 , s2 , . . . , sk , ak , sk+1 = zi′ (or zi , ε, zi′ ,
if λhi 1 is the empty string). In an analogous way, we construct a graph database
Gh2 (Q′ ) over Σ2 = {d, e, f, ε} from h2 and Q′ .
     By the construction of Gh1 (Q′ ) it is easy to conclude that (x, y) ∈ JQ′ KGh1 (Q′ )) ,
if we assume that ε-labeled edges contribute to labels of paths as the empty
string. In fact, the identity mapping from the variables of Q′ into the nodes of
G1 can be used to witness the values for the variables of Q′ when evaluated over
G1 . For the same reason, (x, y) ∈ JQ′ KGh2 (Q′ )) .
     Consider G to be the graph database over Σ that is obtained from Gh1 (Q′ )
by iteratively applying the following process:

 – For each 1 ≤ i ≤ N such that zi = x or zi′ = x and the path νih1 belongs to
   G, replace such path with νih2 .
 – If there is an ε-labeled edge connecting x to a node y, delete such an edge
   and declare y to be equal to x.

Notice that the symbols of Σ2 \ {ε} appear in G only in those paths that connect
x with a variable of Q′ without going through an intermediate variable of Q′ .
Further, every edge that leaves or starts in x in G is labeled in Σ2 \ {ε}.
    It is easy to see that for every 1 ≤ i ≤ N it is the case that node zi is
connected to zi′ in G by a path that satisfies ri . Hence (x, y) ∈ JQ′ KG . We prove
next that (x, y) 6∈ JQKG . Assume otherwise. Since every edge that leaves x in
G is labeled in Σ2 \ {ε}, it must be the case that (G, (x, y)) |= ∃u∃v((x, d, u) ∧
(u, e, v) ∧ (u, f, y)). We consider three possibilities:

 1. There is a path xdsf y in G, where s is a fresh node (i.e. s has been introduced
    as an internal node in a path of the form νih2 ), and s has an outgoing edge
    labeled e. But this is not possible since fresh nodes have exactly one outgoing
    edge, and in this case s must have one outgoing edge labeled e and another
    one labeled f .
 2. There is a path xdzf y in G, where z is a variable of Q′ , and z has an outgoing
    edge labeled e. But then either x = z or y = x, since the only edges of G
    labeled in Σ2 \ {ε} that connect two nodes of G named after a variable of Q′
    are the ones that have at least one endpoint in x. This implies in the first
    that there is a path labeled d from 1 into itself in G2 , and in the second case
    that there is a path labeled df from 1 into itself in G2 . Clearly none of these
    cases can happen.

This is our desired contradiction.                                                                   ⊓
                                                                                                     ⊔




                                                  16




                                                195