<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Practical Query Language for Graph DBs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Renzo Angles</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Barcelo´</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gonzalo R´ıos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Universidad de Chile</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Universidad de Talca</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, VU University Amsterdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Query languages for current graph DB systems lack clear syntax and semantics, which difficults the understanding of its expressiveness and complexity. In particular, many of them suffer from poor performance due to the inherently high complexity of the queries they can express. We propose propositional dynamic logic (PDL) as a yardstick query language for graph database engines, based on the fact that it can express many relevant properties with very low computational cost. We present an implementation of the language that shows its potential applicability for querying massive graph databases by building on existing graph database support.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Some of the current systems for managing graph data implement APIs with special
functions for querying graph properties (e.g., Dex and Titan). Others include
graphoriented query languages; e.g., Neo4j provides Cypher4, based on expressions of the
form start-match-where-return; OrientDB5 includes a SQL-style language extended for
querying graphs; InfiniteGraph6 allows navigation through the implementation of Java
classes; and RDF stores like AllegroGraph, Virtuoso and BigData support SPARQL7,
the standard query language for RDF.</p>
      <p>But with the exception of SPARQL, none of the above languages provides a formal
syntax and semantics, which difficults the accurate evaluation of their expressive power
and complexity. Moreover, after some empirical experiments (not included here due to
lack of space), we found that many of them suffer from poor performance. This is not
due to the implementation, but to the inherently high computational complexity of the
queries they allow to express.</p>
      <p>We propose a navigational language – originally designed for program verification
– as a yardstick for graph database engines. The language is propositional dynamic
logic [5], PDL, that extends several important graph database languages [3]. The reason
is that the language combines good properties of evaluation and expressiveness: It can
be evaluated in polynomial time, and even in linear time for an important fragment of
graph queries. In addition, it allows to express relevant properties of graph databases,
as we will see soon.
4 http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html
5 http://www.orientdb.org
6 http://http://objectivity.com
7 http://www.w3.org/TR/rdf-sparql-query/</p>
      <p>We also present an implementation of the language that works reasonably well, but
decreases its performance for large graph databases in comparison with other engines
(more specifically, DEX) that have much better support. The conclusion we draw is that
implementation of PDL for querying massive graph databases seems promising, but it
can only be achieved by building on existing graph database engines.
2</p>
    </sec>
    <sec id="sec-2">
      <title>The Query Language</title>
      <p>We work with a simple graph data model that lies at the core of most graph data models
studied in the literature [2]. In essence, our graph databases are edge-labeled directed
graphs, in which each element (node) is attached a single attribute with its
corresponding value. We formalize this below.</p>
      <p>
        Let V be a countably infinite set of node ids and Str the set of strings over some
alphabet. Let Σ be a finite alphabet. A graph database G over Σ is a tuple (V, E, @)
such that: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) V is a finite set of node ids (i.e., elements in V ), called the nodes of G, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
E ⊆ V ×Σ×V is the set of labeled edges of G, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) @Att : V → Str is the
attributevalue assignment of G. The intuitive interpretation behind an edge (u, c, v) ∈ E, for
u, v ∈ V and c ∈ Σ, is that there exists a c-labeled edge from u to v in G. Also,
@Att(v) = k, for a node v ∈ V and a string k ∈ Str, means that the single attribute
@Att of node v in G is assigned value k.
      </p>
      <p>The query language for graph databases we propose is the extension of
propositional dynamic logic [5], PDL, with the inverse operator [8]. This extension allows to
increase the expressive power of the language without computational cost. The syntax
of the language is as follows. Recall that V is a countably infinite set of node ids over
which nodes of graph databases are taken. Let Σ be a finite alphabet. The language
PDL with converse (PDL−) over Σ is defined by the following grammar, in which α
denotes programs and φ denotes formulas:
α := ǫ | c (c ∈ Σ) | c− (c ∈ Σ) | α ∪ α | α · α | α∗ | φ?
φ := ⊤ | [↓ v] (v ∈ V ) | [@Att = k] (k ∈ Str) | φ ∧ φ | ¬φ | hαiφ.</p>
      <p>
        We now formalize the semantics. Let G = (V, E) be a graph database over Σ
(that is, V ⊆ V ). Each program α defines a binary relation JαKG on V . Analogously,
each formula φ defines over G a subset JφKG of V . The definitions of JαKG and JφKG
are mutually inductive. We start with the case of programs. We assume that c belongs
to Σ, that α, α1 and α2 are programs, and that φ is a formula: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Basis cases: (a)
JǫKG = {(v, v) | v ∈ V }, (b) JcKG = {(u, v) | (u, c, v) ∈ E}, and (c) Jc−KG =
{(u, v) | (v, c, u) ∈ E}. (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Inductive cases: (a) Jα1 ∪ α2KG = Jα1KG ∪ Jα2KG, (b)
Jα1 · α2KG = Jα1KG ◦ Jα2KG, (c) Jα∗KG = JǫKG ∪ JαKG ∪ (JαKG ◦ JαKG) ∪ · · · , and
(d) Jφ?KG = {(u, u) | u ∈ JφKG}. Here, ◦ denotes the usual composition of binary
relations. That is, Jα1KG ◦ Jα2KG is the set of pairs (u, v) such that (u, w) ∈ Jα1KG and
(w, v) ∈ Jα2KG, for some w ∈ V .
      </p>
      <p>Let us provide some intuition for the semantics of programs: ǫ defines the identity
on V × V , the pairs of nodes linked by a c-labeled edge are defined by the expression
c, and c− defines the inverse of c. Definable binary relations are closed under union,
composition and transitive-reflexive closure, which are represented by operators ∪, ·
and ( )∗, respectively. Finally, φ? defines the set of pairs (u, u) such that u satisfies the
formula φ.</p>
      <p>
        The semantics of formulas is defined as follows. We assume that v ∈ V , k ∈ Str,
α is a program, and φ, φ1 and φ2 are formulas: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Basis cases: (a) J⊤KG = V , (b)
J[↓ v]KG = {v}, if v ∈ V , and J[↓ v]KG = ∅, otherwise, and (c) J[@Att = k]KG =
{v ∈ V | @Att(v) = k}. (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Inductive cases: (a) Jφ1 ∧ φ2KG = Jφ1KG ∩ Jφ2KG, (b)
J¬φKG = V \ JφKG, and (c) JhαiφKG = {u | (u, v) ∈ JαKG, for some v ∈ JφKG}.
      </p>
      <p>The intuition behind the semantics of formulas is as follows: ⊤ defines the whole
set of vertices, [↓ v] is true only at the node id v, and [@Att = k] defines the set of
nodes whose attribute value is k. Definable sets are closed under Boolean operations,
represented by operators ∧ and ¬. Finally, hαiφ defines the set of nodes u from which
a node v that satisfies φ can be “reached” using program α.</p>
      <p>Example 1. Let us consider a toy example of a social network over alphabet Σ =
{friend}, where nodes are persons and attributes denote their names. The query that
retrieves all friends of person p is definable in our language by the following expression:
hfriendi[↓ p]. Intuitively, this expression defines the set of persons p′ in the social
network that are adjacent via a friend-labeled edge to another person p′′ (or, formally,
the pair (p′, p′′) satisfies the program hfriendi), and the id of p′′ is p (or, formally, p′′
satisfies the formula [↓ p]).</p>
      <p>Closure of formulas under Boolean combinations allows us to express important
properties. For instance, the expression hfriendi[↓ p] ∧ hfriendi[↓ p′] defines the
common friends of p and p′.</p>
      <p>The use of regular expressions helps expressing interesting navigational properties
of graph databases. For instance, the expression hfriend∗i[↓ p] defines the people that
is connected by a friendship sequence to person p, i.e., the people who knows someone
who knows someone ... who knows p.</p>
      <p>The language also allows to talk about the inverse of a relation, which is useful when
relationships – unlike friendship in a social network – are not bidirectional. Assume, for
instance, that Σ is now extended with a new symbol parent, that defines the set of pairs
(p, p′) such that p is a parent of p′. Then the expression hparent− ·parenti[↓ p] defines
the set of siblings of p.</p>
      <p>Finally, the combination of features of the language and the use of attributes
allows us to express some sophisticated queries. For instance, the expression h friend ·
∗
(parent[@ = John])? i[↓ p] defines the set of persons that are linked by a friendship
sequence to p, in such a way that each person in the sequence has a son named John.
Expressiveness and complexity In the above sections we presented examples of
relevant properties of graph databases that can be expressed in PDL−. The language also
subsumes some important navigational query languages for graph databases that have
been studied in the literature, e.g., nested regular expressions [3], that were originally
proposed for querying Semantic Web data [7], and a tailored version of the XML query
language XPath for querying graph data [6].</p>
      <p>Expressions in PDL− are acyclic, i.e., they cannot express interesting properties
about cycles in the underlying graph database. For instance, consider again the case of
social networks over alphabet Σ = {friend, parent}. There is no PDL− formula
φ such that for each graph database G over Σ it is the case that JφKG coincides with
the set of persons that have two friends, one of which is a parent of the other [3]. This
shortcoming of the language is at the service of efficiency: Allowing cycles in queries
easily leads to intractability of evaluation [1], while we will see next that evaluation of
expressions in PDL− is tractable.</p>
      <p>The language PDL− has good properties in terms of evaluation complexity, that is,
the theoretical cost of computing JαKG and JφKG, for a PDL− program α and a PDL−
formula φ, respectively, over a graph database G. This is confirmed by the following
folklore result that can be proved using standard model checking techniques [4]. Here,
|G|, |φ| and |α| denote the size of a reasonable encoding of a graph database G, a PDL−
formula φ, and a PDL− program α, respectively:
Theorem 1. 1. The cost of computing the set JφKG, for G a graph database and φ a</p>
      <p>PDL− formula over the same alphabet Σ, is O(|G| · |φ|).
2. The cost of computing the set JαKG, for G a graph database and α a PDL−
program over the same alphabet Σ, is O(|G|2 · |φ|).</p>
      <p>In practice, specifications (formulas and programs) are much smaller than the data
where they are evaluated (the graph database). Under such view, the previous result
essentially tells us that PDL− formulas can be evaluated in linear time in the size of
the data, and that programs can be evaluated in quadratic time in the size of the data.
The quadratic running time for evaluating programs is optimal, since in the worst case
a program can define the whole set of pairs of nodes of a graph database.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Implementation and Evaluation</title>
      <p>Implementation: We wanted our implementation to work on reasonably large graph
databases, and, thus, we decided to concentrate on the evaluation of PDL− formulas,
as they have linear time complexity. In fact, a quadratic running time for program
evaluation may be rendered as unfeasible unless deep and novel optimization techniques
are used in the implementation. Our main assumption, based on the size of the graph
databases we want to query, is that main memory structures used in the implementation
must be of size at most O(|V |) (i.e., only boolean operations on nodes may be handled
there), while external memory structures are of size O(|G|) (that is, the graph database
is kept on disk).</p>
      <p>
        In order to minimize the access to external memory, we identified a minimal set of
operations that have to be handled there. These are: Given a set V0 of nodes, compute
the set V1 of nodes that have a c-labeled edge to a node in V0, and the set V2 of nodes
that can be reached by a c-labeled edge from a node in V0. We used the following data
structure to implement these operations: For each label c we have an array that contains
an element for each node v ∈ V . Each such element is a linked list containing the
c-neighbors of the node. This structure can not be explicitly implemented in external
disk, but it can be emulated as follows (nodes are represented by long integers): (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) For
each label label we create two files: label.gdb y label.aux. These files consist of lines of
fixed size. (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) The k-th line of label.gdb stores data related to k-th node. (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) Each line
of label.gdb will have n consecutive longs, where the first and last long will be used as
a node identifier and a pointer to the auxiliary file, respectively. The n − 2 remaining
longs will store the neighboring nodes of the respective node. (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) Once these n − 2
longs are occupied, we create a new empty line with m longs at the end of label.aux,
and we assign the pointer of label.gdb to this new line. (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) The first m − 1 longs will
store the neighbors of the respective node, and the last one will be used as a pointer
in the same way than in the file label.gdb. (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) For the inverse label−, we analogously
create the files label .gdb y label .aux.
      </p>
      <p>
        If pointers in the files were arranged so that the disk access was sequential, then
the algorithm would be optimal. Unfortunately, the way the data is ordered depends on
the order of insertion, which can not be known a priori. But notice that in label.aux
pointers are always higher than the line they point to, that is, if in the k-th line pointer
is p, then p &gt; k. The idea of the algorithm is, thus, to keep in main memory a data
structure, called LazyP, that stores the “pending accesses” to the lines in label.aux. The
pseudo-algorithm is the following: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Read sequentially all lines in label.gdb, where
the k-th line corresponds to the k-th node. If any of the n − 2 nodes belongs to V0, add
the node k to the output, else if the pointer p is not 0, add the pair (p, k) to LazyP. (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
Iterate LazyP in ascending order in p, and read the p-th line in label.aux. If any of the
m − 1 nodes belongs to V0, add the k-th node to the output, else if the new pointer p¯ is
not 0, add the pair (p¯, k) to LazyP.
      </p>
      <p>In the first part of the algorithm the file label.gdb is read sequentially, and in the
second part the file label.aux is read sequentially, which is optimal.</p>
      <p>Evaluation results : We present an experimental evaluation of the implementation of
our query language. The objective is to show the performance of PDL for loading and
querying several sizes of data, and a referential comparison with two well-know graph
databases, Dex and Neo4j.</p>
      <p>All the experiments were conducted on a PC with 7 processors Intel Core i7-2600
of 3.4GHz, 15.6 GB RAM, running a Fedora Linux 64-bits. The execution of the java
programs were done by using the java -jar command with parameter -Xmx10000m in
order to set the maximum heap memory size used by Java.</p>
      <p>We use a social network data use-case consisting of people and Webpages. A person
has attributes id and name, and a Web page has attribute id. Two people can be related
by an undirected edge friend, and a person can be connected with a Web page via a
directed edge like. The data follows a power-law distribution for both relations (e.g,
88 87
1 2
332 196
there are a small number of people having a lot of friends, and most people have a
reduced number of friends). The datasets were created using the generator available at
http://dcc.utalca.cl/∼rangles/research/gdg/.</p>
      <p>The evaluation considered loading and querying tests for graphs of three sizes.
Table 1 shows the number of nodes and edges for each size, the loading time and the
space on disk occupied for each system after data loading. Notice that PDL presents the
highest loading time, and it spends a lot of disk space for storing its data structures in
comparison with Dex and Neo4j that have better support for this task.</p>
      <p>For the query processing test, we used the following query set:
– (Q1) Get people having name N: [@name = N].
– (Q2) Get people that likes a given Web page W: hlikei[@id = W].
– (Q3) Get the Web pages that person P likes: hlikes−i[@id = P].
– (Q4) Check if N is the name of person P: [@id = P] ∧ [@name = N].
– (Q5) Get the friends of the friends of person P: hfriendi hfriendi[@id = P] .
– (Q6) Get the Web pages liked by the friends of a given person P:
– (Q7) Get people that likes a Web page which a person P likes:
– (Q8) Is there a “friend” connection (path) between person P1 and P2?
– (Q9) Get the common friends between people P1 and P2:</p>
      <p>This query mix is oriented to evaluate the support of essential graph queries, that is:
attribute searching (Q1 and Q4), node/edge adjacency (Q2 and Q3), fixed-length paths
(Q5, Q6 and Q7), reachability (Q8) and graph pattern matching (Q9 and Q10).</p>
      <p>Table 2 shows the results of evaluating the test queries for the graphs G1, G2 and
G3. The table shows the time of evaluating each query 100 times, and the total time of
the query mix. According to this table we can see that PDL− works better than Dex
and Neo4j just one time (Q1 for G1), rarely is better than Dex (Q1 for G1, G2 and
G3), and several times is better than Neo4j. Note that Dex is the system with the best
performance in the test.</p>
      <p>These results show that our current implementation of PDL works well for small
graphs, but its performance decreases when the data size grows. This suggests that a
potential application of PDL− for querying large graph databases seems promising,
but this has to be accompanied by the support of existing graph database engines.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>In this paper we have proposed a yardstick language for querying graph databases that
supports relevant graph queries and can be evaluated in linear time. In the future we plan
to consider the definition of a high-level syntax for the language, in order to facilitate the
construction and understanding of complex queries by the general user. Although the
simplest approach would be considering an extension of the well-known SQL syntax,
we hope to explore and design a more interesting syntax based on graph structures but
enforcing the restrictions of PDL formulas.</p>
      <p>Acknowledgements: Angles is funded by Fondecyt grant 11100364 and Barcelo´ by Fondecyt
grant 1130104.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Serge</surname>
            <given-names>Abiteboul</given-names>
          </string-name>
          , Richard Hull, and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <source>Foundations of Databases. AddisonWesley</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Renzo</given-names>
            <surname>Angles</surname>
          </string-name>
          and
          <article-title>Claudio Gutie´rrez. Survey of graph database models</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>40</volume>
          (
          <issue>1</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Pablo Barcelo´, Jorge Pe´rez, and Juan Reutter.
          <article-title>Relative expresiveness of nested regular expressions</article-title>
          .
          <source>In Alberto Mendelzon Workshop</source>
          , AMW, pages
          <fpage>180</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Edmund</given-names>
            <surname>Clarke</surname>
          </string-name>
          , Orna Grumberg, and
          <string-name>
            <given-names>Doron</given-names>
            <surname>Peled</surname>
          </string-name>
          . Model Checking. MIT Press,
          <source>1st edition</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Michael</surname>
            <given-names>J</given-names>
          </string-name>
          . Fischer and
          <string-name>
            <given-names>Richard E.</given-names>
            <surname>Ladner</surname>
          </string-name>
          .
          <article-title>Propositional dynamic logic of regular programs</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <fpage>194</fpage>
          -
          <lpage>211</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          , Wim Martens, and
          <string-name>
            <given-names>Domagoj</given-names>
            <surname>Vrgoc</surname>
          </string-name>
          .
          <article-title>Querying graph databases with xpath</article-title>
          .
          <source>In International Conference on Database Theory, ICDT</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Pe</surname>
          </string-name>
          <article-title>´rez, Marcelo Arenas, and Claudio Gutierrez. nSPARQL: A navigational language for RDF</article-title>
          . J. Web Sem.,
          <volume>8</volume>
          (
          <issue>4</issue>
          ):
          <fpage>255</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Moshe</surname>
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Vardi</surname>
          </string-name>
          .
          <article-title>The taming of converse: Reasoning about two-way computations</article-title>
          .
          <source>In Logic of Programs</source>
          , pages
          <fpage>413</fpage>
          -
          <lpage>423</lpage>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>