=Paper= {{Paper |id=Vol-1810/GraphQ_paper_01 |storemode=property |title=On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database |pdfUrl=https://ceur-ws.org/Vol-1810/GraphQ_paper_01.pdf |volume=Vol-1810 |authors=Jürgen Hölsch,Tobias Schmidt,Michael Grossniklaus |dblpUrl=https://dblp.org/rec/conf/edbt/HolschSG17 }} ==On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database== https://ceur-ws.org/Vol-1810/GraphQ_paper_01.pdf
     On the Performance of Analytical and Pattern Matching
       Graph Queries in Neo4j and a Relational Database

                        Jürgen Hölsch                     Tobias Schmidt                Michael Grossniklaus
                          Department of Computer and Information Science, University of Konstanz
                                         P.O. Box 188, 78457 Konstanz, Germany
                         {juergen.hoelsch, tobias.2.schmidt, michael.grossniklaus}@uni-konstanz.de

ABSTRACT                                                                       In this paper, we bridge this gap by defining a set of
Graph databases with a custom non-relational backend pro-                   analytical queries (Section 2). In addition, compared to re-
mote themselves to outperform relational databases in an-                   lated work, we introduce a more fine-grained categorization
swering queries on large graphs. Recent empirical studies                   of pattern matching queries including (i) paths that filter
show that this claim is not always true. However, these                     on specifc node labels without restricting edge types, (ii)
studies focus only on pattern matching queries and neglect                  paths that filter on specific edge types without restricting
analytical queries used in practice such as shortest path, di-              node labels, (iii) paths that filter on node labels and edge
ameter, degree centrality or closeness centrality. In addition,             types, and (iv) paths containing cycles. This categoriza-
there is no distinction between different types of pattern                  tion provides the basis to systematically breaking down the
matching queries. In this paper, we introduce a set of ana-                 reasons why one system is outperforming another system.
lytical and pattern matching queries, and evaluate them in                  Since Neo4j is the most widely used native graph database,
Neo4j and a market-leading commercial relational database                   we compare the execution times of Cypher queries in Neo4j
system. We show that the relational database system out-                    to the execution times of corresponding SQL queries in a
performs Neo4j for our analytical queries and that Neo4j is                 market-leading relational database system (Section 3). We
faster for queries that do not filter on specific edge types.               show that the relational database system outperforms Neo4j
                                                                            for our analytical queries. We argue that analytical queries,
                                                                            which access most or all nodes of the graph, benefit from the
1.    INTRODUCTION                                                          more advanced disk and buffer management of the relational
   Application domains such as social media, biology, and                   database system and that there is room for improvement
transportation planning produce large amounts of graph data.                in this respect in Neo4j. In contrast, we demonstrate that
Therefore, the management and processing of graph data                      Neo4j is more efficient for path queries that do not filter on
is gaining importance. As a result, there are several graph                 specific edge types. Section 4 summarizes related work and
databases such as Neo4j, DEX/Sparksee, and OrientDB. In                     Section 5 gives an outlook on future work.
general, graph databases can be categorized into systems
using an existing relational database as a backend or systems               2.    QUERIES
implementing a custom non-relational backend, which are                        We begin by introducing the set of queries that we use to
often called “native graph databases”. Although a relational                compare Neo4j to a relational database system. We catego-
backend offers numerous advantages such as transactions,                    rize our queries in analytical and pattern matching queries.
reliable storage and access control, native graph databases at-             Analytical queries process large parts of the graph at a time,
tract customers with the promise of outperforming traditional               whereas pattern matching queries access only small parts
relational databases. Recent empirical studies [4, 9], however,             of the graph in most cases. In order to formally define our
demonstrate that this promise is not always kept. Gubichev                  queries, we first have to introduce the property graph data
and Then [4] show that several queries of their benchmark                   model on which Neo4j and its query language Cypher is
timed out in Neo4j and DEX/Sparksee. While these studies                    based [6].
are a good starting point for comparing graph databases
with a relational and a custom non-relational backend, they                    Definition 1. Assume a domain V of nodes, a domain E of
only focus on pattern matching queries and neglect analytical               directed edges, and a domain A of attributes. Additionally,
queries such as shortest path and centrality measures.                      suppose a domain DA of atomic attribute values, a domain
                                                                            DL of node labels, and a domain DT of edge types. A property
                                                                            graph is a finite structure G = (V, E, A, λ, α, β, γ), where

                                                                                 • V ⊆ V is a finite set of nodes,

                                                                                 • E ⊆ E is a finite set of edges,
c 2017, Copyright is with the authors. Published in the Workshop Proceed-
ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice,
Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is
                                                                                 • A ⊆ A is a finite set of attributes,
permitted under the terms of the Creative Commons license CC-by-nc-nd
4.0                                                                              • λ : E → V × V is a function assigning nodes to edges,
   • α : (V ∪ E) × A → DA is a partial function assigning                  SQL
     values to attributes of nodes and edges,
                                                                           SELECT n.id, CAST(COUNT(*) AS FLOAT)/
   • β : V → P(DL ) is a partial function assigning labels                  (SELECT COUNT(*) FROM nodes) AS degree centrality
     to nodes, and                                                         FROM nodes n LEFT JOIN rel all bidirectional r ON n.id = r.sID
                                                                           GROUP BY n.id
   • γ : E → DT is a partial function assigning a type to
     edges.                                                                2.1.3    QA3 : Connectedness
                                                                             Query QA3 returns all pairs of nodes u, v ∈ V which are
  Besides the property graph data model, we also define the                connected by a path u →∗ v.
notion of a path in a property graph.
                                                                           Cypher
  Definition 2. Suppose the nodes v1 , vn ∈ V . A path from
                                                                           MATCH (n), (m)
v1 to vn , denoted as v1 →∗ vn , is a sequence of nodes and
                                                                           RETURN n, m, EXISTS((n)-[*]->(m)) AS is connected
edges hv1 , e1 , v2 , . . . , vn−1 , en−1 , vn i, where ∀1 ≤ i ≤ n − 1 :
λ(ei ) = (vi , vi+1 ).
                                                                           SQL
2.1     Analytical Queries                                                 SELECT a.id1, a.id2,
   In this subsection, we formally define the set of analytical             CASE WHEN tc.sID IS NULL THEN 0 ELSE 1 END AS connected
queries that have been proposed in the graph benchmark of                  FROM (SELECT n.id AS id1, m.id AS id2 FROM nodes n
Grossniklaus et al. [3]. Since the focus of this paper is to                CROSS JOIN nodes m) AS a
compare Neo4j with a relational database system, we only                    LEFT JOIN transitive closure tc ON a.id1 = tc.sID
use queries that can be expressed in Cypher. Therefore, we                  AND a.id2 = tc.dID
cannot cover all analytical graph queries that are used in
real world applications. For each query of this subsection, we             2.1.4    QA4 : Shortest Paths
also derive the corresponding Cypher and SQL statements                       Suppose P (u, v) is a set containing all paths u →∗ v, where
that are based on the dataset of the LUBM [5] benchmark                    u, v ∈ V , and length(p) is a function returning the number of
(cf. Section 3). The tables and views that are used in the                 edges of a path p. A shortest path between two nodes u and
SQL statements are described in the Appendix. Some of the                  v is a path p where ∀p0 ∈ P (u, v) : length(p) ≤ length(p0 ).
following queries are limited in the number of result tuples               The length of a shortest path between u and v is denoted as
because Neo4j is not able to execute these queries without                 δ(u, v). Query QA4 returns for all pairs of nodes u, v ∈ V a
restricting the result size.                                               shortest path between u and v.

2.1.1     QA1 : Node Degree                                                Cypher
   Query QA1 returns for each node v ∈ V its degree, which                 MATCH p = shortestPath((n)-[*]->(m))
is the number of ingoing and outgoing edges and formally                   RETURN n, m, p
defined as follows:
                                                                           SQL
  deg(v) := {e|e ∈ E ∧ λ(e) = (x, y) ∧ (v = x ∨ v = y)} .
                                                                           SELECT *
Cypher                                                                     FROM shortest path
MATCH (n)--(m)
RETURN n, COUNT(m)                                                         2.1.5    QA5 : Closeness Centrality
                                                                             Query QA5 returns for each node v ∈ V its closeness
SQL                                                                        centrality which is defined as follows:
SELECT n.id, COUNT(*) AS degree
FROM nodes n LEFT JOIN rel all bidirectional r ON n.id = r.sID                                                1
                                                                                             C C (v) := P               .
GROUP BY n.id                                                                                             u∈V δ(v, u)


2.1.2     QA2 : Degree Centrality                                          Cypher
  Query QA2 returns for each node v ∈ V its degree cen-                    MATCH p = shortestPath((n)-[*]-(m))
trality, which is the number of ingoing and outgoing edges                 WHERE n <> m WITH n, p LIMIT 60000000
normalized by the total number of nodes in the graph. For-                 RETURN n, 1.0/SUM(LENGTH(p)) AS centrality
mally, the degree centrality of a node v is defined as follows:
                                      deg(v)                               SQL
                         C D (v) :=          .
                                       |V |                                SELECT TOP 60000000 sID AS node,
                                                                            1.0 / SUM(length) AS centrality
Cypher
                                                                           FROM (SELECT sID, dID, MIN(length) AS length
MATCH (a)                                                                   FROM shortest connection GROUP BY sID, dID
WITH TOFLOAT(COUNT(*)) AS total                                             UNION ALL SELECT dID, sID, MIN(length) AS length
MATCH (n)-[e]-()                                                            FROM shortest connection GROUP BY sID, dID) AS t
RETURN n, COUNT(e)/total AS centrality                                     GROUP BY sID
2.1.6    QA6 : Diameter                                             Note that we do not restrict the edge direction. Both
   Query QA6 returns the diameter of the graph which is the       outgoing and ingoing edges are matched by the pattern
length of the longest shortest path between any two nodes         above.
u, v ∈ V :
                                                                  Cypher
                        d := max δ(u, v).
                            u,v∈V                                 MATCH (a:University)--(b:FullProfessor)
                                                                  RETURN a, b
Cypher
MATCH p = shortestPath((n)-[*]-(m))                               SQL
WITH p LIMIT 40000000
                                                                  SELECT n1.id, n2.id
RETURN p, LENGTH(p)
                                                                  FROM rel all bidirectional r
ORDER BY LENGTH(p) DESC
                                                                   INNER JOIN node label n1 ON n1.id = r.sID
LIMIT 1
                                                                   INNER JOIN node label n2 ON n2.id = r.dID
SQL                                                               WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’
SELECT MAX(length)                                                2.2.2    QP2 : Cycles
FROM (SELECT TOP 40000000 *
                                                                    Queries of type QP2 extend queries of type QP1 with
 FROM shortest connection) AS t
                                                                  cycles.
2.1.7    QA7 : Grouping
  Query QA7 returns for each existing value x of an attribute                         l1         ...         ln
a ∈ A the number of nodes v ∈ V where α(v, a) = x:

      γ(a) := {(x, i) | v ∈ V ∧ a ∈ A ∧ x = α(v, a)
                      ∧ i = |{u | u ∈ V ∧ α(u, a) = x}|}.         Cypher

Cypher                                                            MATCH (a:UndergraduateStudent)--(b:FullProfessor)
                                                                   --(c:Department)--(a)
MATCH (a:Course)                                                  RETURN a,b,c
RETURN a.name, COUNT(a)
SQL                                                               SQL

SELECT name.value, COUNT(*) AS count                              SELECT n1.id, n2.id, n3.id
FROM node label node LEFT JOIN name ON name.id = node.id          FROM rel all bidirectional r1
WHERE node.value = ’Course’                                        INNER JOIN node label n1 ON n1.id = r1.sID
GROUP BY name.value                                                INNER JOIN node label n2 ON n2.id = r1.dID
ORDER BY count DESC                                                INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id
                                                                   INNER JOIN node label n3 ON n3.id = r2.dID
                                                                   INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID
2.2     Pattern Matching Queries                                  WHERE n1.value = ’UndergraduateStudent’
  In the following, we define the types of pattern matching        AND n2.value = ’FullProfessor’
queries used in our evaluation. We categorize our queries in       AND n3.value = ’Department’
paths that filter on specific node labels without restricting      AND r3.dID = n1.id
edge types (QP1 ), paths that filter on specific edge types
without restricting node labels (QP3 ), paths that filter on      2.2.3    QP3 : Paths filtering on edge types
node labels and edge types (QP4 ), and paths containing             Queries of type QP3 return paths with specific edge types
cycles (QP2 ). This categorization provides a way to deter-       as it is shown in the figure below, where t1 , . . . , tn ∈ DT are
mine how efficient Neo4j and the relational database system       edge types.
can filter on specific node labels or edge types. For each
pattern matching query type, we derive Cypher and SQL                                       t2         t1
statements. However, in this subsection we only present the                                      ...
actual code of queries with patterns of length 1 (denoted as
QPx,1 in Section 3, where x is the query type). The queries
with longer patterns used in the evaluation are found in the      Cypher
appendix.                                                         MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c)
                                                                  RETURN a, b, c
2.2.1    QP1 : Paths filtering on node labels
   The first type of pattern matching queries in our evaluation   SQL
are paths filtering on specific node labels. This query type is
illustrated below, where l1 , . . . , ln ∈ DL are labels.         SELECT n1.id AS teacher, n2.id AS course, n3.id AS student
                                                                  FROM nodes n1, rel teacherOf rto, nodes n2,
                                                                   rel takesCourse rtc, nodes n3
                   l1          ...          ln                    WHERE n1.id = rto.sID AND rto.dID = n2.id
                                                                   AND n2.id = rtc.dID AND n3.id = rtc.sID
2.2.4    QP4 : Paths filtering on node labels and edge
         types
 Queries of type QP4 are a combination of queries of type                                                         Neo4j                  RDBMS A                          RDBMS B




                                                                                                                             19




                                                                                                                                                               05




                                                                                                                                                                                7
QP1 and of type QP3 .




                                                                                                                                             1, 2
                                                                                                                                                 5




                                                                                                                                                                               9
                                                                                                                           ,3




                                                                                                                                               31
                                                                                                                                               01



                                                                                                                                                             ,7




                                                                                                                                                                             ,4
                                                                                                                         13




                                                                                                                                                           21




                                                                                                                                                                           14
                                                                                                                                             2,
                         t2         t1
                   l2         ...        l1




                                                                  Runtime in sec
                                                                                       600


Cypher                                                                                 400




                                                                                                                                                      0
                                                                                                                                                    25
MATCH (a:FullProfessor)-[:teacherOf]->(b:Course)




                                                                                                                              6
                                                                                       200




                                                                                                                                     2
                                                                                                                            13

                                                                                                                                   12




                                                                                                                                                                56
                                                                                                72

                                                                                                19

                                                                                                06

                                                                                                38

                                                                                                08

                                                                                                06
 <-[:takesCourse]-(c:UndergraduateStudent)




                                                                                                                                                                                                  01
                                                                                                                                                                                                   03
                                                                                                                                                                                                   01
                                                                                                                                                                                29
                                                                                                                                                                     14
                                                                                             0.

                                                                                             0.

                                                                                             0.

                                                                                             0.

                                                                                             0.

                                                                                             0.




                                                                                                                                                                                                0.
                                                                                                                                                                                                0.
                                                                                                                                                                                                0.
                                                                                                                                                                                          8
                                                                                        0
RETURN a, b, c                                                                                QA1           QA2            QA3                 QA4           QA5             QA6                    QA7

SQL                                                                                                                 (a) Analytical queries




                                                                                                                                                                                                1

                                                                                                                                                                                                         7
                                                                                                                                                                                              39

                                                                                                                                                                                                      66
SELECT a.id, b.id, c.id
FROM node label a, rel teacherOf r1, node label b,                                     60
 rel takesCourse r2, node label c




                                                                      Runtime in sec
WHERE a.value = ’FullProfessor’ AND b.value = ’Course’




                                                                                                                                                                                     .9
                                                                                       40




                                                                                                                                                                                   34
 AND c.value = ’UndergraduateStudent’ AND a.id = r1.sID
 AND r1.dID = b.id AND b.id = r2.sID AND r2.dID = c.id                                 20




                                                                                                                                                              .4
                                                                                                                                                            11
                                                                                                                                                       8




                                                                                                                                                                      9
                                                                                                                                                     9.




                                                                                                                                                                    8.
                                                                                                                          0. 8
                                                                                                   03
                                                                                                   09

                                                                                                   02




                                                                                                                             03

                                                                                                                             41

                                                                                                                             57
3.    EVALUATION




                                                                                                0.
                                                                                                0.

                                                                                                0.




                                                                                                                          0.



                                                                                                                          0.
                                                                                        0
  Having presented the queries used in our performance                                          QP1,1                      QP1,2                       QP1,3                         QP1,4

evaluation, we now present the experimental setup and the                                             (b) Paths filtering on node labels
obtained results.




                                                                                                                                                                  00

                                                                                                                                                                  28




                                                                                                                                                                                                    0

                                                                                                                                                                                                             3
                                                                                                                                                                                                  0

                                                                                                                                                                                                          1
3.1     Experimental Setup




                                                                                                                                                                ,5

                                                                                                                                                               ,4




                                                                                                                                                                                                ,9

                                                                                                                                                                                                        ,8
                                                                                                                                    0




                                                                                                                                                             10

                                                                                                                                                             16




                                                                                                                                                                                              38

                                                                                                                                                                                                      73
                                                                                                                                  98
   All experiments presented in this paper were performed                              800




                                                                                                                                           5
                                                                                                                                        71
on a Mac Pro with a 3.5 GHz 6-Core Intel Xeon E5 pro-
                                                                  Runtime in ms




                                                                                       600
cessor with 64 GB main memory. In our evaluation, we
compared Neo4j Version 3.0.3 with a market-leading commer-                             400
                                                                                                        0



cial database system. Due to license terms, the commercial
                                                                                                      24




                                                                                                                                                       0
                                                                                                                           0




                                                                                                                                                     16
                                                                                       200
system is referred to as “RDBMS”. Neo4j and RDBMS are                                                                    14




                                                                                                                                                                                    70
                                                                                                            57
                                                                                               30




installed on a 64-bit Windows 8.1 virtual machine with 32                               0
GB of main memory. Both database systems run in their                                            QP2,1                      QP2,2                          QP2,3                      QP2,4
standard configuration. The data sets used in our evalu-                                                                            (c) Cycles
ation were generated by the data generator of the LUBM
benchmark [5]. We generated a data set with one university                             600
(LUBM-1) and a data set with two universities (LUBM-2).                                        487
                                                                                                                                                     463
                                                                  Runtime in ms




                                                                                                                                                                                    404
To store the graph in RDBMS, we used the decomposed stor-                              400                               379

age model (DSM) described by Sakr et al. [10]. All runtime
measurements were repeated five times in random order. The                             200
reported averages discard the smallest and the largest value.                                         50                          50                         66                               69

   Figure 1 shows the runtimes of our queries on the LUBM-                               0
                                                                                                            8.7                         8.2                         8.5                               8.6


1 data set, whereas Figure 2 shows the runtimes of our                                           QP3,1                      QP3,2                          QP3,3                      QP3,4
queries on the LUBM-2 data set. In our first experiments,                                                  (d) Paths filtering on edge types
we have noticed that the performance of specific query types
in RDBMS heavily depends on how efficiently edges can                                  100
                                                                                                 86
be retrieved. Therefore, we created two different schemata
                                                                  Runtime in ms




                                                                                        80
in RDBMS that are both based on the decomposed storage
                                                                                        60
model. Schema “A” has tables for each edge type. In addition,                                               41
a view is computed containing edges of all types. In contrast,                          40
                                                                                                                                               26                                    29
                                                                                                                                                                           23
Schema “B” has a table containing edges of all types. In                                20                         8.3
                                                                                                                                        17
                                                                                                                                                       8.6                                         8.3
addition, there are views for each edge type. In the following,                          0
RDBMS with Schema A is denoted as “RDBMS A” and                                                       QP4,1                                  QP4,2                              QP4,3
RDBMS with Schema B is denoted as “RDBMS B”. We
                                                                                             (e) Paths filtering node labels and edge types
use the term “RDBMS” to refer to both RDBMS A and
RDBMS B.
3.2     Results                                                                    Figure 1: Runtimes of queries on LUBM-1.
  We first present the results obtained on data set LUBM-1.
Figure 1(a) shows that for most of the analytical queries
                                                                                                                                                  RDBMS is several orders of magnitude faster than Neo4j.
                                                        Neo4j                RDBMS A                            RDBMS B
                                                                                                                                                  The largest difference can be seen for query QA5 which




                                                                                       2
                                                                                     39
                                                                                                                                                  computes the closeness centrality where RDBMS B is over




                                                                                   2,
                                                                                                                                                  1500× faster than Neo4j. The computation of the shortest
                       1,000
                                                                                                                                                  path is almost 10× faster in RDBMS B than in Neo4j. At a
Runtime in sec




                        800                                                                                                                       first glance, this is surprising since computing the shortest




                                                                        4
                                                                     67




                                                                                          1
                                                                          5




                                                                                       63
                                                                        60
                        600                                                                                                                       path is a core functionality of Neo4j and is also provided
                        400                                                                                                                       as a construct in the Cypher query language. Figure 1(b)
                                                                                                                                                  gives the runtimes of our pattern matching queries that




                                                                                                              6
                        200




                                                                                                            11




                                                                                                                            .6
                                                                                                           .7
                                                                                                                                                  filter specific node labels without restricting edge types. The




                                                                                                                          .7
                                                                                                                         57
                                 58

                                 12

                                 12

                                  77

                                  09

                                  12




                                                                                                                                             02
                                                                                                                                             02
                                                                                                        35




                                                                                                                        17
                               0.

                               0.

                               0.

                               0.

                               0.

                               0.




                                                                                                                                           0.
                                                                                                                                           0.
                          0                                                                                                                       largest difference in the runtime of Neo4j and RDBMS can be
                                QA1               QA2            QA3              QA4               QA5                QA6              QA7
                                                                                                                                                  observed for query QP1,4 which contains the largest number
                                                             (a) Analytical queries                                                               of edges of all queries in (b). We can also see that the
                                                                                                                                                  queries QP1,1 and QP1,3 are faster in RDBMS B than in




                                                                                                                                    8

                                                                                                                                            0
                                                                                                                                  07

                                                                                                                                          91
                                                                                                                                3,

                                                                                                                                        1,
                                                                                                                                                  Neo4j. The runtimes of cyclic pattern matching queries that
                                                                                                                                                  filter specific node labels without restricting edge types are

                                                                                                                           .1
                                                                                                                        94
                       100
                                                                                                                                                  shown in Figure 1(c). It can be seen that Neo4j outperforms
      Runtime in sec




                                                                                                                                                  RDBMS. For all queries of Figure 1(c), Neo4j is almost
                        50
                                                                                                                                                  an order of magnitude faster than RDBMS. The runtimes
                                                                                                                                                  of pattern matching queries on filtering specific edge types
                                                                                                           .6
                                                                                              .4


                                                                                                        24
                                                                                           23




                                                                                                                                                  without restricting node labels are given in Figure 1(d).
                                                                                                   13
                                                                   42
                                      1

                                      5

                                      2




                                                                    8



                                                                  91
                                   02

                                    06

                                    03




                                                                 06

                                                                4.

                                                               1.




                                                                                                                                                  In contrast to the previous pattern matching query types,
                                 0.

                                 0.

                                 0.




                                                              0.




                          0
                                   QP1,1                        QP1,2                       QP1,3                          QP1,4                  RDBMS answers queries of this type more efficiently than
                                          (b) Paths filtering on node labels                                                                      Neo4j. This is due to the fact that each relationship type has
                                                                                                                                                  its own table (or view) in RDBMS. Finally, Figure 1(e) plots
                                                                                                                                                  the runtimes of pattern matching queries that filter on node
                                                                                                         0




                                                                                                                                       0

                                                                                                                                       0
                                                                          90




                                                                                                       13

                                                                                                       50




                                                                                                                                     70

                                                                                                                                     91
                                             0




                                                                          0
                                           96




                                                                        ,9

                                                                       56




                                                                                                     7,

                                                                                                     ,0




                                                                                                                                  5,

                                                                                                                                   3,




                                                                                                                                                  labels and edge types. As before, RDBMS B outperforms
                                                                     16




                                                                                                   19

                                                                                                   86




                                                                                                                                81

                                                                                                                                37
                                        7,




                                                                     3,




                       400
                                                                                                                                                  Neo4j and RDBMS A.
                                                                                                                                                     Moving on to data set LUBM-2, Figure 2(a) shows the run-
                                                                 0
      Runtime in ms




                                                              29




                       300                                                                                                                        time of our analytical queries. In Neo4j, only query QA1 and
                                                                                                                                                  QA2 could be executed in reasonable time. The other queries
                       200
                                                                                                                                                  were aborted after 24 hours or ran out of memory (denoted
                                                    7
                                                 10




                       100                                                                                                                        by in Figure 2(a)). The runtimes of our pattern matching
                                 70




                                                                                                                        60
                                                                                           41




                                                                                                                                                  queries that filter specific node labels without restricting edge
                          0
                                   QP2,1                        QP2,2                       QP2,3                          QP2,4
                                                                                                                                                  types are given in Figure 2(b). For query QP1,3 RDBMS A
                                                                                                                                                  is faster than Neo4j, otherwise Neo4j outperforms RDBMS.
                                                                       (c) Cycles                                                                 Figure 2(c) plots the runtimes of cyclic pattern matching
                                1,071                        932                       1,203                           1,732
                                                                                                                                                  queries that filter specific node labels without restricting
                                                                                                                                                  edge types. As for the results obtained on the LUBM-1 data
                       600                                                                                                            549         set (Figure 1(c)), Neo4j is consistently faster than RDBMS
                                                                                                                                                  for this type of pattern matching query. The runtimes of
      Runtime in ms




                       400                                                                              349                                       pattern matching queries filtering specific edge types without
                                                                                                                                                  restricting node labels that are plotted in Figure 2(d) show
                       200                                                                       138
                                                                                                                                169               that RDBMS consistently outperforms Neo4j for this type of
                                                                     106 105
                                        82       76                                                                                               query. Finally, the runtimes of pattern matching queries that
                          0                                                                                                                       filter on node labels and edge types are shown in Figure 2(e).
                                   QP3,1                        QP3,2                       QP3,3                          QP3,4                  For these queries, RDBMS A consistently outperforms Neo4j.
                                             (d) Paths filtering on edge types
                                  202
                                                                                                                                                  3.3    Discussion and Limitations
                                                                                                                                                     To conclude this section, we summarize our results and
                       100
                                                                                                                                      83
                                                                                                                                                  discuss limitations of our evaluation. We have shown that
                                                                                                                                                  RDBMS outperforms Neo4j for our analytical queries. For
       Runtime in ms




                         80

                         60                                                                 50
                                                                                                                                                  most of the analytical queries, RDBMS is several orders of
                                                                                                                  47      45
                         40                  33
                                                        43
                                                                            39    38                                                              magnitude faster than Neo4j. A possible reason for this result
                                                                                                                                                  is the fact that the analytical queries have to access most or all
                         20
                                                                                                                                                  nodes of the graph. Therefore, they profit from the advanced
                          0                                                                                                                       disk and memory management of relational database systems.
                                         QP4,1                                   QP4,2                                 QP4,3
                                                                                                                                                  We also showed that RDBMS A and RDBMS B both perform
                               (e) Paths filtering node labels and edge types                                                                     badly for queries that need to join the whole edge table (or
                                                                                                                                                  view) multiple times for longer patterns. For queries with
                                                                                                                                                  cycles, the performance is even worse. However, if we only
                       Figure 2: Runtimes of queries on LUBM-2.                                                                                   query specific edge types, RDBMS outperforms Neo4j as
                                                                                                                                                  can be seen in Figure 1(d) and Figure 2(d). Comparing
RDBMS A with RDBMS B on LUBM-1, we can observe that                Acknowledgement
for queries that filter on specific edge types RDBMS B is more     This work is supported by Grant No. GR 4497/2 of the
efficient than RDBMS A. However, for queries with longer           Deutsche Forschungsgemeinschaft (DFG).
patterns that do not filter on specific edge types RDBMS A
outperforms RDBMS B. As for LUBM-1, there is again no
clear winner on LUBM-2.                                            6.   REFERENCES
   Since the work presented in this paper is only the first         [1] R. Angles, A. Prat-Pérez, D. Dominguez-Sal, and J.-L.
step towards a more extensive empirical study, it has some              Larriba-Pey. Benchmarking Database Systems for
shortcomings and open issues. First, we only use the de-                Social Network Applications. In Proc. Workshop on
composed storage model [10] to map a property graph to                  Graph Data Management Experiences and Systems
relations. Since the choice of the graph-to-relation mapping            (GRADES), pages 1–7, 2013.
strongly impacts the performance of queries, we need to
                                                                    [2] M. Ciglan, A. Averbuch, and L. Hluchy. Benchmarking
evaluate alternative mappings such as the hybrid schema                 Traversal Operations over Graph Databases. In Proc.
described by Sun et al. [11]. Second, the performance of a              ICDE Workshops (ICDEW), pages 186–189, 2012.
query also depends on the characteristics of the graph, e.g.,
                                                                    [3] M. Grossniklaus, S. Leone, and T. Zäschke. Towards a
the number of distinct labels or average node degree. Hence,
                                                                        Benchmark for Graph Data Management and
further data sets with different graph characteristics need to
                                                                        Processing. Technical report, University of Konstanz,
be evaluated. Finally, pattern matching queries that filter
                                                                        2013.
on specific attribute values should be included as our current
                                                                        http://nbn-resolving.de/urn:nbn:de:bsz:352-242536.
queries filter on node labels and edge types only.
                                                                    [4] A. Gubichev and M. Then. Graph Pattern Matching:
                                                                        Do We Have to Reinvent the Wheel? In Proc.
4.   RELATED WORK                                                       Workshop on Graph Data Management Experiences and
   There are several empirical studies on the query perfor-             Systems (GRADES), pages 1–7, 2014.
mance of graph database systems. To the best of our knowl-          [5] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark
edge, none of these studies cover analytical queries formulated         for OWL Knowledge Base Systems. J. Web Sem.,
in a declarative graph query language. The work of Vick-                3(2-3):158–182, 2005.
nair et al. [12] benchmarks Neo4j and MySQL. The authors            [6] J. Hölsch and M. Grossniklaus. An Algebra and
define three simple traversal queries and queries counting the          Equivalences to Transform Graph Patterns in Neo4j. In
number of nodes with a specific value. Ciglan et al. [2] im-            Proc. Intl. Workshop on Querying Graph Structured
plement graph traversal operations in Neo4j, DEX/Sparksee,              Data (GraphQ), 2016.
OrientDB, NativeSail, and SGDB, and compare their per-              [7] S. Jouili and V. Vansteenberghe. An Empirical
formance. Grossniklaus et al. [3] benchmark Neo4j and a                 Comparison of Graph Databases. In Proc. Intl. Conf.
relational database system by implementing a set of analyti-            on Social Computing (SOCIALCOM), pages 708–715,
cal queries using the Neo4j API and SQL. Welc et al. [13]               2013.
evaluate the performance of computing the shortest path in          [8] R. McColl, D. Ediger, J. Poovey, D. Campbell, and
Neo4j, Oracle and the Green-Marl language infrastructure.               D. A. Bader. A Performance Evaluation of Open
Angles et al. [1] benchmark graph databases in the context              Source Graph Databases. In Proc. Workshop on
of social networks. They define a set of pattern matching               Parallel Programming for Analytics Applications
and reachability queries that are often used to analyse social          (PPAA), pages 11–18, 2014.
networks. Jouili and Vansteenberghe [7] use traversal and           [9] N. Pobiedina, S. Rümmele, S. Skritek, and H. Werthner.
shortest path queries to benchmark Neo4j, DEX/Sparksee,                 Benchmarking Database Systems for Graph Pattern
OrientDB and Titan. McColl et al. [8] evaluate open-source              Matching. In Proc. Intl. Conf. on Database and Expert
graph databases using implementations of the graph algo-                Systems Applications (DEXA), pages 226–241, 2014.
rithms shortest path, connected components and PageRank.
                                                                   [10] S. Sakr, S. Elnikety, and Y. He. G-SPARQL: A Hybrid
Pobiedina et al. [9] evaluate pattern matching queries in
                                                                        Engine for Querying Large Attributed Graphs. In Proc.
Neo4j (Cypher), PostgreSQL (SQL), Jena TDB (SPARQL)
                                                                        Intl. Conf. on Information and Knowledge Management
and Clingo (ASP).
                                                                        (CIKM), pages 335–344, 2012.
                                                                   [11] W. Sun, A. Fokoue, K. Srinivas, A. Kementsietsidis,
5.   CONCLUSION AND FUTURE WORK                                         G. Hu, and G. Xie. SQLGraph: An Efficient
   In this paper, we defined a set of analytical and pattern            Relational-Based Property Graph Store. In Proc. Intl.
matching queries in Cypher and SQL, and evaluated them in               Conf. on Management of Data (SIGMOD), pages
Neo4j and a market-leading commercial relational database               1887–1901, 2015.
system. We showed that the relational database system              [12] C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, and
outperforms Neo4j for our analytical queries and that Neo4j             D. Wilkins. A Comparison of a Graph Database and a
is faster for queries that do not filter on specific edge types.        Relational Database: A Data Provenance Perspective.
   As future work we plan to build on this initial study                In Proc. ACM Southeast Regional Conference, page 42,
in order to better understand the implications of different             2010.
backends on graph query performance. First, we will study          [13] A. Welc, R. Raman, Z. Wu, S. Hong, H. Chafi, and
the influence of different graph-to-relation mappings on the            J. Banerjee. Graph Analysis: Do We Have to Reinvent
performance of relational backends. Additionally, we will               the Wheel? In Proc. Workshop on Graph Data
include further data sets with different graph characteristics          Management Experiences and Systems (GRADES),
and pattern matching queries that filter on attribute values.           pages 1–6, 2013.
APPENDIX                                                             t.step from AS step to
                                                                    FROM ShortestPathCTE t
A.    TABLES AND VIEWS                                              INNER JOIN shortest connection s
                                                                     ON s.dID = t.step from AND s.sID = t.sID
  Below, we introduce the tables and views that were created     ) SELECT sID, dID,
for Schema A and Schema B in order to store a property              coalesce(step from, sID) AS step from, step to
                                                                   FROM ShortestPathCTE
graph in a relational database system. A node has a unique
id and is stored in the node_label relation with its corre-
sponding labels. Therefore, for each label of a node, there is   B. PATTERN MATCHING QUERIES
an entry in the node_label table. For each attribute, there is
a relation containing entries of the node id and the attribute   B.1 Paths restricted on node labels
value. The node_label table and each attribute table have a      QP1,2 Cypher
clustered index on the node id. In Schema A, there is a table    MATCH (a:University)--(b:FullProfessor)--(c:Department)
for each edge type containing the edge id and the ids of the     RETURN a, b, c
source and target nodes. The edge tables have a clustered
index on the id of the source and target node. In Schema B,      QP1,2 SQL
there is a table containing edges of all types. This table has   SELECT n1.id, n2.id, n3.id FROM rel all bidirectional r1
a clustered index on the id of the source and target node as       INNER JOIN node label n1 ON n1.id = r1.sID
                                                                   INNER JOIN node label n2 ON n2.id = r1.dID
well as an unclustered index on the column indicating the          INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id
edge type. In addition, we create the following views, where       INNER JOIN node label n3 ON n3.id = r2.dID
the prefix “rel ” denotes an edge table (or view). However,      WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’
                                                                   AND n3.value = ’Department’
in Schema B we have a table rel_all instead of a view.

CREATE VIEW nodes (id) AS                                        QP1,3 Cypher
SELECT DISTINCT id FROM node label                               MATCH (a:University)--(b:FullProfessor)--(c:Department)
                                                                  --(d:UndergraduateStudent)
CREATE VIEW rel all (sID, dID) AS                                RETURN a, b, c, d
SELECT sID, dID FROM rel advisor
 UNION ALL
SELECT sID, dID FROM rel doctoralDegreeFrom                      QP1,3 SQL
 UNION AL                                                        SELECT n1.id, n2.id, n3.id, n4.id FROM rel all bidirectional r1
SELECT sID, dID FROM rel headOf                                    INNER JOIN node label n1 ON n1.id = r1.sID
 UNION ALL                                                         INNER JOIN node label n2 ON n2.id = r1.dID
SELECT sID, dID FROM rel mastersDegreeFrom                         INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id
 UNION ALL                                                         INNER JOIN node label n3 ON n3.id = r2.dID
SELECT sID, dID FROM rel memberOf                                  INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID
 UNION ALL                                                         INNER JOIN node label n4 ON n4.id = r3.dID
SELECT sID, dID FROM rel publicationAuthor                       WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’
 UNION ALL                                                         AND n3.value = ’Department’
SELECT sID, dID FROM rel subOrganizationOf                         AND n4.value = ’UndergraduateStudent’
 UNION ALL
SELECT sID, dID FROM rel takesCourse
 UNION ALL                                                       QP1,4 Cypher
SELECT sID, dID FROM rel teacherOf                               MATCH (a:University)--(b:FullProfessor)--(c:Department)
 UNION ALL                                                        --(d:UndergraduateStudent)--(e:Course)
SELECT sID, dID FROM rel teachingAssistantOf                     RETURN a, b, c, d, e
 UNION ALL
SELECT sID, dID FROM rel undergraduateDegreeFrom
 UNION ALL                                                       QP1,4 SQL
SELECT sID, dID FROM rel worksFor
                                                                 SELECT n1.id, n2.id, n3.id, n4.id, n5.id
                                                                 FROM rel all bidirectional r1
CREATE VIEW rel all bidirectional (sID, dID) AS
                                                                   INNER JOIN node label n1 ON n1.id = r1.sID
SELECT sID, dID FROM rel all
                                                                   INNER JOIN node label n2 ON n2.id = r1.dID
 UNION ALL
                                                                   INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id
SELECT dID AS sID, sID AS dID FROM rel all
                                                                   INNER JOIN node label n3 ON n3.id = r2.dID
                                                                   INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID
CREATE VIEW transitive closure AS
                                                                   INNER JOIN node label n4 ON n4.id = r3.dID
WITH BacktraceCTE(sID, parent, dID, length) AS
                                                                   INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID
  (SELECT sID, NULL, dID, 1
                                                                   INNER JOIN node label n5 ON n5.id = r4.dID
   FROM dbo.rel all
                                                                 WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’
    UNION ALL
                                                                   AND n3.value = ’Department’
   SELECT b.sID, b.dID, r.dID, b.length + 1
                                                                   AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’
   FROM BacktraceCTE AS b
   INNER JOIN dbo.rel all AS r ON b.dID = r.sID)
SELECT sID, parent, dID, length FROM BacktraceCTE                B.2     Cycles
CREATE VIEW shortest connection AS                               QP2,2 Cypher
SELECT DISTINCT t.* FROM dbo.transitive closure t                MATCH (a:UndergraduateStudent)--(b:AssociateProfessor)
INNER JOIN (SELECT sID, dID, MIN(length) as minLength             --(c:GraduateStudent)--(d:Department)--(a)
FROM dbo.transitive closure GROUP BY sID, dID) m                 RETURN a,b,c,d
ON m.sID = t.sID AND m.dID = t.dID AND m.minLength = t.length

CREATE VIEW shortest path AS WITH                                QP2,2 SQL
 ShortestPathCTE (sID, dID, step from, step to) AS (
                                                                 SELECT n1.id, n2.id, n3.id, n4.id FROM rel all bidirectional r1
  SELECT sID, dID, parent, dID FROM dbo.shortest connection
                                                                  INNER JOIN node label n1 ON n1.id = r1.sID
  UNION ALL
                                                                  INNER JOIN node label n2 ON n2.id = r1.dID
  SELECT t.sID, t.dID, coalesce(s.parent, t.sID) AS step from,
                                                                  INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id
  INNER JOIN node label n3 ON n3.id = r2.dID                      QP3,3 Cypher
  INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID
                                                                  MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c)
  INNER JOIN node label n4 ON n4.id = r3.dID
                                                                   -[:advisor]->(d)-[:mastersDegreeFrom]->(e)
  INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID
                                                                  RETURN a, b, c, d, e
WHERE n1.value = ’UndergraduateStudent’
  AND n2.value = ’AssociateProfessor’
  AND n3.value = ’GraduateStudent’                                QP3,3 SQL
  AND n4.value = ’Department’ AND r4.dID = n1.id
                                                                  SELECT n1.id AS teacher, n2.id AS course, n3.id AS student,
                                                                    n4.id AS advisor
QP2,3 Cypher                                                      FROM nodes n1, rel teacherOf rto, nodes n2, rel takesCourse rtc,
                                                                    nodes n3, rel advisor ra, nodes n4,
MATCH (a:UndergraduateStudent)--(b:Course)                          rel mastersDegreeFrom rmdf, nodes n5
 --(c:AssociateProfessor)--(d:UndergraduateStudent)               WHERE n1.id = rto.sID AND rto.dID = n2.id
 --(e:Course)--(a) WHERE b <> e AND a<>d                            AND n2.id = rtc.dID AND n3.id = rtc.sID
RETURN a,b,c,d,e                                                    AND n3.id = ra.sID AND n4.id = ra.dID
                                                                    AND n4.id = rmdf.sID AND n5.id = rmdf.dID

QP2,3 SQL                                                         QP3,4 Cypher
SELECT n1.id, n2.id, n3.id, n4.id, n5.id
FROM rel all bidirectional r1                                     MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c)
  INNER JOIN node label n1 ON n1.id = r1.sID                      -[:advisor]->(d)-[:mastersDegreeFrom]->(e)
  INNER JOIN node label n2 ON n2.id = r1.dID                      <-[:doctoralDegreeFrom]-(f)
  INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id           RETURN a, b, c, d, e, f
  INNER JOIN node label n3 ON n3.id = r2.dID
  INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID           QP3,4 SQL
  INNER JOIN node label n4 ON n4.id = r3.dID
  INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID           SELECT n1.id AS teacher, n2.id AS course, n3.id AS student,
  INNER JOIN node label n5 ON n5.id = r4.dID                        n4.id AS advisor
  INNER JOIN rel all bidirectional r5 ON n5.id = r5.sID           FROM nodes n1, rel teacherOf rto, nodes n2, rel takesCourse rtc,
WHERE n1.value = ’UndergraduateStudent’ AND n2.value = ’Course’     nodes n3, rel advisor ra, nodes n4, rel mastersDegreeFrom rmdf,
  AND n3.value = ’AssociateProfessor’                               nodes n5, rel doctoralDegreeFrom rddf, nodes n6
  AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’   WHERE n1.id = rto.sID AND rto.dID = n2.id
  AND r5.dID = n1.id AND n1.id != n4.id AND n2.id != n5.id          AND n2.id = rtc.dID AND n3.id = rtc.sID
                                                                    AND n3.id = ra.sID AND n4.id = ra.dID
                                                                    AND n4.id = rmdf.sID AND n5.id = rmdf.dID
QP2,4 Cypher                                                        AND n5.id = rddf.dID AND n6.id = rddf.sID
MATCH (a:UndergraduateStudent)--(b:Course)
  --(c:AssociateProfessor)--(d:UndergraduateStudent)              B.4     Paths restricted on node labels and edge
  --(e:Course) --(f:AssistantProfessor)--(a)                              types
WHERE b <> e AND a<>d
RETURN a,b,c,d,e,f                                                QP4,2 Cypher
                                                                  MATCH (a:FullProfessor)-[:teacherOf]->(b:Course)
QP2,4 SQL                                                          <-[:takesCourse]-(c:UndergraduateStudent)
                                                                   -[:advisor]->(d:AssociateProfessor)
SELECT n1.id, n2.id, n3.id, n4.id, n5.id, n6.id                   RETURN a, b, c, d
FROM rel all bidirectional r1
  INNER JOIN node label n1 ON n1.id = r1.sID
  INNER JOIN node label n2 ON n2.id = r1.dID                      QP4,2 SQL
  INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id           SELECT a.id, b.id, c.id, d.id
  INNER JOIN node label n3 ON n3.id = r2.dID                      FROM node label a, rel teacherOf r1, node label b,
  INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID             rel takesCourse r2, node label c, rel advisor r3, node label d
  INNER JOIN node label n4 ON n4.id = r3.dID                      WHERE a.value = ’FullProfessor’ AND b.value = ’Course’
  INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID             AND c.value = ’UndergraduateStudent’
  INNER JOIN node label n5 ON n5.id = r4.dID                        AND d.value = ’AssociateProfessor’
  INNER JOIN rel all bidirectional r5 ON n5.id = r5.sID             AND a.id = r1.sID AND r1.dID = b.id AND b.id = r2.dID
  INNER JOIN node label n6 ON n6.id = r5.dID                        AND r2.sID = c.id AND c.id = r3.sID AND r3.dID = d.id
  INNER JOIN rel all bidirectional r6 ON n6.id = r6.sID
WHERE n1.value = ’UndergraduateStudent’
  AND n2.value = ’Course’ AND n3.value = ’AssociateProfessor’
                                                                  QP4,3 Cypher
  AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’   MATCH (a:FullProfessor)-[:teacherOf]->(b:Course)
  AND n6.value = ’AssistantProfessor’ AND r6.dID = n1.id           <-[:takesCourse]-(c:UndergraduateStudent)
  AND n1.id != n4.id AND n2.id != n5.id                            -[:advisor]->(d:AssociateProfessor)
                                                                   -[:mastersDegreeFrom]->(e:University)
                                                                  RETURN a, b, c, d, e
B.3     Paths restricted on edge types
QP3,2 Cypher                                                      QP4,3 SQL
MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c)                   SELECT a.id, b.id, c.id, d.id, e.id
 -[:advisor]->(d)                                                 FROM node label a, rel teacherOf r1, node label b,
RETURN a, b, c, d                                                   rel takesCourse r2, node label c, rel advisor r3, node label d,
                                                                    rel mastersDegreeFrom r4, node label e
                                                                  WHERE a.value = ’FullProfessor’ AND b.value = ’Course’
QP3,2 SQL                                                           AND c.value = ’UndergraduateStudent’
                                                                    AND d.value = ’AssociateProfessor’ AND e.value = ’University’
SELECT n1.id AS teacher, n2.id AS course, n3.id AS student,         AND a.id = r1.sID AND r1.dID = b.id AND b.id = r2.dID
  n4.id AS advisor                                                  AND r2.sID = c.id AND c.id = r3.sID AND r3.dID = d.id
FROM nodes n1, rel teacherOf rto, nodes n2,                         AND d.id = r4.sID AND r4.dID = e.id
  rel takesCourse rtc, nodes n3, rel advisor ra, nodes n4
WHERE n1.id = rto.sID AND rto.dID = n2.id
  AND n2.id = rtc.dID AND n3.id = rtc.sID
  AND n3.id = ra.sID AND n4.id = ra.dID