Implementing Graph Query Languages over
      Compressed Data Structures: A Progress Report

                          Nicolás Lehmann1,2 and Jorge Pérez1,2
                  1
                      Department of Computer Science, Universidad de Chile
                         2
                           Chilean Center for Semantic Web Research
                           [nlehmann,jperez]@dcc.uchile.cl


       Abstract. In this short paper we present our preliminary results on implementing
       two-way regular-path queries (2RPQs) over a compressed representation of graph
       data. We report on several experiments comparing our approach with state-of-
       the-art graph database engines. Our results are encouraging; although we use a
       naive implementation for 2RPQs, our system exhibits a competitive performance
       compared with other engines.


1   Introduction
Graph databases have recently gained a lot of attention in theory and practice. This can
be explained by the current need of handling Web-related data, such as on-line social
networks, and RDF and Semantic Web data. One of the most important challenges
in this context, is the need of handling Web-scale amounts of data, while providing a
reasonable expressiveness for users that want to explore this data.
    In this short paper we present our preliminary results on implementing two-way
regular-path queries (2RPQs) over a compressed representation of graph data. We use
2RPQs as they are an expressive language capable of navigating graphs by using paths
defined by regular expressions, using forward and backward edges while navigating the
graph. 2RPQs are in the core of recently proposed standards for handling RDF data [6].
To compress the graph data we use the k 2 -tree data structure [4]. Given a node v, a k 2 -
tree representation allows us to access the neighbors of v, as well as the nodes pointing
to v, in a very efficient way. This feature plus the compression ratio of k 2 -trees which
permits to maintain the structure of huge graphs in main memory, implied a critical
performance gain when implementing 2RPQs.
    We implement a naive algorithm for 2RPQs based on the typical automata-theoretic
approach [8]. Even with this naive implementation, our system exhibits a competitive
performance compared with state-of-the-art commercial engines. We perform several
experiments with data generated by the GDBench tool [2], comparing our implemen-
tation with Sparksee [7] (formerly known as DEX) and Neo4j [9]. Our system and
Sparksee show a similar performance and Neo4j is considerably surpassed by both al-
ternatives. Our solution also exhibits a considerable advantage in a cold scenario where
the structures have just been loaded and the system is running for the first time.
2   Background and implementation details
Graph databases and query languages We consider a simple model of a graph
database as just a graph G = (V, E) in which every element in V is a node ID (or
just node for short), and each edge is a triple (v1 , e, v2 ) where v1 , v2 ∈ V and e is an
edge label from an alphabet Σ. We say that (v1 , e, v2 ) is a forward e-edge from v1 to
v2 . Symmetrically, (v1 , e, v2 ) is a backward e-edge from v2 to v1 . As a query language,
we consider two-way regular-path queries (2RPQs) which are essentially regular ex-
pressions over Σ ∪ Σ − , where Σ − = {e− | e ∈ Σ} is the alphabet of backward
edges. Given a 2RPQ r, a pair of nodes (v1 , v2 ) is in the evaluation of r over G, if there
exists a path in G from v1 to v2 following forward and backward edges, such that the
sequence of labels of the path belongs to the regular expression defined by r consider-
ing each backward e-edge traversed as the symbol e− . For example, consider the 2RPQ
r = a/(b− )∗ /c and a graph G with edges (v1 , a, v2 ), (v3 , b, v2 ), (v4 , b, v3 ), (v4 , c, v5 ).
Then we have that (v1 , v5 ) is in the evaluation of r over G.


K 2 -trees A k 2 -tree [4] is a tree-shaped structure for representing graphs that exploits
sparseness and clustering features of the adjacency matrix associated to the graph.
Given an adjacency matrix, a k 2 -tree divides it into k 2 submatrices of the same size.
Each submatrix is represented in the tree as a child of the root. For the submatrices
containing only 0’s the decomposition ends there, using a single 0-node to represent
the whole submatrix. The submatrices with at least one 1 are recursively decomposed
using the same strategy until an actual cell in the matrix is reached, which is stored as
a 0- or 1-node in the last level. The tree is then implemented in a highly compacted
way using bitstrings; every level of the tree is represented as a bitstring and the whole
tree as the concatenation of them. Given a node v, searching for the neighbors of v as
well as for the nodes pointing to v, can be achieved by just traversing the k 2 -tree [4].
The traversal of the tree can be simulated using rank queries over the bitstrings, which
can be implemented very efficiently [5]. Thus, the whole k 2 -tree can be represented
in a succinct manner while maintaining its traversal properties. Further optimizations
are possible, for example, using different values of k for different levels of the tree or
stopping the decomposition when the matrices reach size kL × kL and use DACs to
compress them [3].


Design and implementation details Let G = (V, E) be a graph database over al-
phabet Σ. To simplify the correspondence between node IDs and rows and columns in
an adjacency matrix representation, we first map every node ID in V and label in Σ
to an integer via a dictionary encoding3 . After the encoding, our design continues by
vertically partitioning the data, reorganizing it into |Σ| independent graphs, each graph
containing only edges with a particular edge label. Then the whole graph is represented
as an array of k 2 -trees, each tree representing the graph induced by a particular edge
label. Given a node v and an edge label e, we compute the direct or inverse e-neighbors
of v by traversing the k 2 -tree corresponding to e. Following the configuration of similar
work [1], the k 2 -trees we use for evaluation follow a hybrid policy using k = 4 for the
first 5 levels and k = 2 for the rest. The decomposition stop when the submatrices reach
size 8 × 8 and are encoded using DAC’s.
 3
     The implementation of the dictionary is orthogonal to our proposal and thus it is not considered
     in our evaluation in Section 3.
    The evaluation of the 2RPQs follows a simple algorithm using the typical automata-
theoretic approach [8]. Given a 2RPQ r, we first build the Non-deterministic Finite Au-
tomaton (NFA) associated to r, considering labels in Σ and inverse labels. Then, the
graph is also considered as an NFA and the algorithm performs a breadth first search
over the product automaton. In practice the product automaton cannot be constructed,
but we perform the traversal implicitly. Thus the algorithm only needs to know neigh-
bors of a node by a single label (or an inverse label) at a time, which can be efficiently
computed with the k 2 -tree representation as explained above.
    The code is implemented in C++ and available via Github.4


3      Experimental results

We compare our implementation with Sparksee [7] (version 5, February 2014) and
Neo4j [9] (version 2.1, July 2014), using a machine with the following configuration:
3.40 GHz Intel Core i7-2600k (4 cores), 8 GB RAM, Archlinux OS kernel version
3.18.4. We compare the running time for several 2RPQs considering two evaluation
scenarios: the warm and the cold scenarios. The warm scenario simulates the condi-
tions of an already running server: we first perform a warm-up run, and then report the
results for the second run (of the same query). The cold scenario reports the running
time of the first run. The idea is to analyze how caching influences the performance.
For every query tested, we run 10 000 experiments and report the average time.
     In our experiments, we use the data generator provided by the graph database bench-
mark GDBench [2]. Graphs generated by GDBench have a simple social network struc-
ture with nodes representing persons and webpages, friend-edges between persons, and
like-edges from persons to webpages. We considered graphs of different size ranging
from 10 million to 40 million nodes.
     Figs. 1-3 present a comparison for queries like, friend/friend, and like/like-
in the warm scenario. Our implementation is labelled as k2tdb in the figures. For these
queries, k2tdb and Sparksee show a similar performance (running times are in the same
order of magnitude), while Neo4j is considerably slower. Notice that for queries in-
volving only like-edges, k2tdb has a performance twice as good as Sparksee in a warm
scenario (Fig. 1 and 3). This is consistent with the characteristics of k 2 -trees which are
specially suited for sparse graphs, and the subgraph of like-edges enjoys this feature.
     Our next experiment considers navigational path queries which are one of the most
important features of 2RPQs. Informally, we consider queries that goes from one person
to their set of friends, and the friends of its friends, and so on, for several steps. More for-
mally, we consider the queries friend, friend/friend, friend/friend/friend,. . .
until five copies of the friend-edge. These queries are denoted by f1, f2, f3, f4, f5, re-
spectively. We also test the query friend*, denoted by f*, which allows to navigate an
arbitrary number of friend-edges. We report on the results for a graph with 20 million
nodes (Fig. 4 and 5). In the warm scenario Sparksee slightly outperforms our imple-
mentation (Fig. 4), but both stays within the same order of magnitude. For the cold
scenario our implementation has a better performance (Fig. 5), and the difference is
 4
     https://github.com/nilehmann/libk2tree, https://github.com/nilehmann/k2tdb
            106                                                                  106                                                                   106
                      k2tdb                       neo4j                                    k2tdb                       neo4j                                     k2tdb             neo4j
            105                                                                                                                                        105    sparksee
                   sparksee                                                      105    sparksee
            104


                                                                                                                                           Time (µs)
Time (µs)


                                                                     Time (µs)
                                                                                                                                                       104
            103                                                                  104
                                                                                                                                                       103
            102                                                                  103
            101                                                                                                                                        102
                                                                                 102
            100                                                                                                                                        101
                       10M             20M                30M                               10M             20M                30M                            10M           20M            30M
                         Size of the graph (nodes)                                            Size of the graph (nodes)                                           Size of the graph (nodes)
            Fig. 1: Running time for query like                      Fig. 2: Running time for friend/friend                                     Fig. 3: Running time for like/like-
            106                                                                  107                                                                    600
                                            k2tdb                                                                k2tdb                                                            k2tdb
            105                                                                  106                                                                    500
                                         sparksee                                                             sparksee                                                         sparksee
            104                                                                  105
Time (µs)


                                                                     Time (µs)


                                                                                                                                           Time (µs)
                                                                                                                                                        400
            103                                                                  104                                                                    300
            102                                                                  103                                                                    200
            101                                                                  102                                                                    100
            100                                                                  101                                                                      0
                  f1         f2   f3         f4      f5         f*                     f1         f2   f3         f4      f5         f*                       10M        20M      30M         40M
                                   Query                                                                Query                                                     Size of the graph (nodes)
        Fig. 4: Path queries in warm scenario                                    Fig. 5: Path queries in cold scenario                    Fig. 6: Scalability test for friend/friend


                                  quite substantial for the simpler queries. This behavior can be explained by the caching
                                  techniques used in Sparksee. As the portion of the graph being traversed gets larger,
                                  the probability of using the cache increases. Our solution best suits a scenario where
                                  the explored portion of the graph has not yet been visited. This presents an interesting
                                  opportunity for improving our implementation by using similar ideas for caching, for
                                  example, by decompressing and caching some portions of the graph as we traverse it.
                                       Our last experiment is a scalability test for query friend/friend over graphs of
                                  increasing size (Fig. 6). The Sparksee license that we use, allows graphs with at most 1
                                  billion objects (nodes plus edges), which disallows the loading of the 40M-node graph.
                                  Thus, we show the time up to 30M nodes for Sparksee. The growth in running time for
                                  k2tdb is more pronounced compared with Sparksee, but it still shows a linear behavior.
                                  Further experimentation with larger graphs is needed to obtain specific conclusions.


                                  4          Conclusions and future work

                                  Our naive implementation of 2RPQs over compressed graph structures shows a compet-
                                  itive performance compared with highly-optimized graph database engines. This shows
                                  the benefits of considering compressed data structures when querying graphs with ex-
                                  pressive query languages. Our implementation shows a particularly good performance
                                  in the cold scenario where no caching is permitted. This presents an interesting oppor-
                                  tunity for optimizing our implementation with caching techniques. Our ongoing work
                                  includes the implementation of 2RPQs in a less naive way, taking a more specific ad-
                                  vantage of the way the graph is actually compressed.
References
1. Álvarez-Garcı́a, S., Brisaboa, N., Fernández, J., Martı́nez-Prieto, M., Navarro, G.: Com-
   pressed vertical partitioning for efficient RDF management. Knowledge and Information Sys-
   tems (2014), to appear
2. Angles, R., Prat-Pérez, A., Dominguez-Sal, D., Larriba-Pey, J.L.: Benchmarking database
   systems for social network applications. In: GRADES. p. 15 (2013)
3. Brisaboa, N., Ladra, S., Navarro, G.: DACs: Bringing direct access to variable-length codes.
   Information Processing and Management (IPM) 49(1), 392–404 (2013)
4. Brisaboa, N.R., Ladra, S., Navarro, G.: Compact representation of web graphs with extended
   functionality. Inf. Syst. 39, 152–174 (2014)
5. González, R., Grabowski, S., Mäkinen, V., Navarro, G.: Practical implementation of rank and
   select queries. In: Wea 2005. pp. 27–38
6. Harris, S., Seaborne, A.: Sparql 1.1 query language. W3C Recommendation (2013)
7. Martı́nez-Bazan, N., Gómez-Villamor, S., Escale-Claveras, F.: DEX: A high-performance
   graph database management system. In: ICDE Workshops 2011. pp. 124–127
8. Mendelzon, A.O., Wood, P.T.: Finding regular simple paths in graph databases. In: VLDB
   1989. pp. 185–193 (1989)
9. Webber, J.: A programmatic introduction to neo4j. In: SPLASH 2012. pp. 217–218