GooDBye: a Good Graph Database Benchmark - an Industry Experience

Piotr Matyjaszczyk, Poznan University of Technology, Poland, piotrmk1@gmail.com
Przemyslaw Rosowski, Poznan University of Technology, Poland, przemyslaw.rosowski@student.put.poznan.pl
Robert Wrembel, Poznan University of Technology, Poland, robert.wrembel@cs.put.poznan.pl

© Copyright 2020 for this paper held by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This paper reports a use-case developed for an international IT company, one of whose multiple branches is located in Poland. In order to deploy a graph database in its IT architecture, the company needed an assessment of some of the most popular graph database management systems to select the one that fits its needs. Although multiple graph database benchmarks have been proposed so far, they do not cover all use-cases required by industry, and the company faced exactly this problem. The specific structure of the graphs used by the company and its specific queries prompted the development of a new graph benchmark, tailored to its needs. In this respect, the benchmark that we developed complements the existing benchmarks with 5 real use-cases. Based on the benchmark, 5 open-source graph database management systems were evaluated experimentally. In this paper we present the benchmark and the experimental results.

1 INTRODUCTION
Among multiple database technologies [26], graph databases (GDBs) have for a few years been gaining popularity for storing and processing interconnected Big Data. At the time of writing this paper, there existed 29 recognized graph database management systems (GDBMSs), cf. [9], offering different functionality, query languages, and performance.

When it comes to selecting a GDB for efficient storage and processing of given graphs, a company professional has to either implement multiple proofs of concept or rely on existing evaluations of various databases. Typically, important assessment metrics include: (1) performance, (2) scalability w.r.t. a graph size, and (3) scalability w.r.t. the number of nodes in a cluster.

In practice, assessing the performance of IT architectures and particular software products is done by means of a benchmark. There exist multiple dedicated benchmarks for given domains of application. In the area of information systems and databases, the industry-accepted benchmarks are developed by the Transaction Processing Council. There also exist dedicated benchmarks for non-relational databases and clouds, cf. Section 2.

Although there exist multiple benchmarks designed for graph databases, the motivation for our work came as a real need from industry, i.e., a large international IT company (whose name cannot be revealed), having one of its multiple divisions located in Poland. The company stores large data volumes on various configurations of its software and network infrastructures. These data are inherently interconnected and naturally form large graphs. Currently, these graphs are stored in flat files, but in the future they will be imported into a proprietary GDB and analyzed there. For this reason, a fundamental issue was to choose a GDBMS that would be the most suitable for the particular 'graph shapes' and queries needed by the company. The assessment criteria included: (1) performance characteristics w.r.t. a variable number of nodes in a cluster, as well as (2) functionality and user experience.

The specific structure of the graphs produced by the company and its specific queries did not match what was offered by the existing GDB benchmarks. These facts motivated the development of the GooDBye benchmark, presented in this paper. The design of GooDBye was inspired by [19], and it complements the existing graph database benchmarks by contributing real business use-cases.

The paper is structured as follows. Section 2 overviews benchmarks developed by the research and industrial communities. Section 3 presents the benchmark that we developed. Section 4 outlines our test environment. Section 5 discusses the experimental evaluation of the GDBs and its results. Finally, Section 6 summarizes the paper.

2 RELATED WORK
The performance of a database management system is typically assessed by means of benchmarks. Each domain of database application calls for its own benchmark. A benchmark is characterized by a given schema (structure of data), different workload characteristics (queries and data manipulation), and often by performance measures. Database benchmarking has over the years received substantial attention from the industry and research communities.
Nowadays, the standard industry-approved set of benchmarks for testing relational databases is offered by the Transaction Processing Council (TPC) [31]. It supports two main classes of benchmarks, namely: (1) TPC-C and TPC-E, for testing the performance of databases applied to on-line transaction processing, and (2) TPC-H, for testing the performance of databases applied to decision support systems. Special benchmarks were proposed for testing the performance of data warehouses (e.g., [8, 15, 20, 27]).

[17] overviews the existing cloud benchmarks with a focus on cloud database performance testing, and argues for adapting TPC benchmarks to a cloud architecture. [14] proposes a DBaaS benchmark with typical OLTP, DSS, and mixed workloads. [7] compares a traditional open-source RDBMS with HBase, a distributed cloud database. [25] and [4] show the performance results of relational database systems running on top of virtual machines. [30] presents a high-level overview of TPC-V, a benchmark designed for database workloads running in virtualized environments.

Benchmarking of other types of databases, like XML (e.g., [23, 29]), RDF-based (e.g., [16]), NoSQL, and graph databases, received less interest from the research and technology communities in the past. However, with the widespread adoption of Big Data technologies, testing the performance of various NoSQL data storage systems became a very important research and technological issue. In this context, [6] proposed the Yahoo! Cloud Serving Benchmark (YCSB) to compare different key-value and cloud storage systems, and [28] proposed a set of BigTable-oriented extensions known as YCSB++.

In the area of GDBs, several benchmarks have been proposed so far. [3] advocated using a large parameterized weighted, directed multigraph and irregular memory access patterns.
In [10] the authors discussed characteristics of graphs to be included in a benchmark, characteristics of queries that are important in graph analysis applications, and an evaluation workbench. In the same spirit, problems of benchmarking GDBs were discussed in [5]. The authors explained how graph databases are constructed, where and how they can be used, as well as how benchmarks should be constructed. Their most important conclusions were that: (1) in most graph databases, an increase in the size of a graph leads only to a linear increase of the execution time of highly centralized queries, (2) the same cannot be said for distributed queries, and (3) an important factor controlling the throughput of highly distributed queries is the size of the memory cache, and whether the entire graph structure can fit in memory.

[1] described the so-called SynthBenchmark, which is included in the Spark GraphX library; it also offers a small graph generator. [2, 13] outlined a Java-based benchmark for testing social networks, whose data were stored in MySQL. The benchmark allowed generating a graph of 1 billion nodes with statistical properties similar to those of Facebook.

[12, 22] proposed the Social Network Benchmark, focusing on graph generation and 3 different workloads, i.e., interactive, Business Intelligence, and graph algorithms.

[19] suggested and implemented a benchmark for a GDBMS working in a distributed environment. The authors attempted to create a holistic benchmark and, using the Tinkerpop stack, ran it on a series of the most popular graph databases at that time, including Neo4j, OrientDB, TitanDB, and DEX. [11] evaluated the performance of four GDBs, i.e., Neo4j, Jena, HypergraphDB, and DEX with respect to a graph size, using typical graph operations.
[24] focused on benchmarking a wide range of graph databases and graph processing tools, i.e., Neo4j, OrientDB, InfoGrid, TitanDB, FlockDB, ArangoDB, InfiniteGraph, AllegroGraph, DEX, GraphBase, HyperGraphDB, Bagel, Hama, Giraph, PEGASUS, Faunus, NetworkX, Gephi, MTGL, Boost, uRiKA, and STINGER. This work tests the performance of the majority of the GDBs, but only in a centralized environment.

[18, 21] described a benchmark developed in co-operation between 4 IT corporations and 4 universities. The benchmark consists of six algorithms: Breadth-First Search, PageRank, Weakly Connected Components, Community Detection using Label Propagation, Local Clustering Coefficient, and Single-Source Shortest Paths. The data part includes real and synthetic datasets.

3 OUR APPROACH: GOODBYE - A GOOD GRAPH DATABASE BENCHMARK
The GooDBye benchmark includes: (1) a parameterized graph data generator, (2) a graph database, and (3) queries that are to be run on it. In order to use the benchmark, a user needs to:
(1) run the data generator,
(2) decide which GDBMS is to be tested, and install it on a cluster,
(3) transform the data generated by the benchmark into a form readable by the selected GDBMS,
(4) load the data into the GDB, using its proprietary tool,
(5) turn off the database's caching mechanisms, as the same subset of queries will need to be repeated multiple times,
(6) run queries on the GDB.

3.1 Graph data
A graph used in the benchmark is directed and cyclic, with a maximum cycle length of 2. The graph reflects typical software and hardware configurations in a large company. A node represents one of the three following data entities:
• a package - it is composed of objects; a package can be transformed into another package; all packages have the same structure (fields);
• an object - it is composed of fields; an object can be transformed into another object, similarly to a package;
• a field - a field can be transformed into another field, similarly to an object; all fields have the same simple elementary datatype.

An arc represents:
• a data transformation - packages can be transformed into other packages, objects into other objects, and fields into other fields; each transformation (identified by its ID) is represented at all three levels of data entities;
• a data composition - each package contains one or more objects, and each object contains one or more fields.

The data generator is parameterized and can produce graphs described by different statistics. For the benchmark application presented in this paper, the graph had the following statistics (a sampling sketch follows the list):
• the number of vertices: 911034, which represented 500 packages;
• the number of arcs: 3229158;
• the average number of objects in a package: 100 (binomial distribution, n=8000, p=0.0125);
• the number of object categories (types): 2; 30% of objects belong to category A and 70% to category B;
• the average number of fields of objects in category A: 30 (binomial distribution, n=1500, p=0.02);
• the average number of fields of objects in category B: 8 (binomial distribution, n=400, p=0.02);
• the average number of incoming field transformation arcs: 2.5 (binomial distribution, n=80, p=0.03125);
• 4% of arcs form single-arc cycles;
• 2% of arcs form two-arc cycles.
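To make the generator's parameterization concrete, the following minimal Python sketch samples node counts with the statistics listed above. It assumes numpy; the function name, node encoding, and seed are illustrative only and do not reflect the benchmark's actual implementation.

import numpy as np

rng = np.random.default_rng(seed=42)

def generate_nodes(n_packages=500):
    """Sample package/object/field nodes with the Section 3.1 statistics."""
    nodes = []
    for pkg in range(n_packages):
        nodes.append(("package", pkg))
        # objects per package: Binomial(n=8000, p=0.0125), mean 100
        n_objects = rng.binomial(8000, 0.0125)
        for obj in range(n_objects):
            # 30% of objects fall into category A, 70% into category B
            category = "A" if rng.random() < 0.3 else "B"
            nodes.append(("object", (pkg, obj), category))
            # fields per object: Binomial(1500, 0.02) for A (mean 30),
            # Binomial(400, 0.02) for B (mean 8)
            if category == "A":
                n_fields = rng.binomial(1500, 0.02)
            else:
                n_fields = rng.binomial(400, 0.02)
            for f in range(n_fields):
                nodes.append(("field", (pkg, obj, f)))
    return nodes

Arc generation (transformation and composition arcs, with the prescribed cycle percentages) would follow the same pattern of parameterized sampling.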
3.2 Queries
Eight queries were defined and implemented in the benchmark. Queries Q1-Q5 (described below) were demanded by the company: Q1-Q3 check how long it takes a GDB to find neighbor vertices and to navigate via incoming and outgoing arcs, whereas Q4 and Q5 check how fast a GDB finds nodes having given in- and out-going arcs, in order to calculate the impact of changes on transformations. Q6-Q8 are typical queries that are defined in other benchmarks. A toy illustration of the query semantics follows the list.
• Q1 - finds and returns all vertices that transform to a node of type A, i.e., nodes that have an outgoing arc of type Transformation that is an incoming arc to A.
• Q2 - finds and returns all nodes that have an incoming arc of type Transformation whose source is A.
• Q3 - counts all the vertices that are connected to a node by a Transformation arc leading to the node of type A, and computes the percentage of these nodes over the number of all vertices in the graph.
• Q4 - counts all direct neighbors of nodes connected by a given transformation type and returns the percentage of the entire graph they comprise.
• Q5 - counts all direct nodes connected by a given transformation type, including nodes adjacent to nodes of type A.
• Q6 - counts the number of incoming and outgoing arcs of every single node in the graph, and returns a total count for each of them. This models a degree calculation for the entire graph.
• Q7 - returns all nodes in the database whose attribute is equal to a given number.
• Q8 - computes the shortest path between two nodes.
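To make the query semantics concrete, the following toy Python sketch evaluates Q1 and Q6 over a hand-made in-memory edge list. The graph, node names, and arc types below are hypothetical; in the experiments, each query was expressed in the query language of the respective GDBMS.

from collections import Counter

# (source, target, arc_type)
arcs = [
    ("n1", "a1", "Transformation"),
    ("n2", "a1", "Transformation"),
    ("a1", "n3", "Transformation"),
    ("p1", "n1", "Composition"),
]
type_a_nodes = {"a1"}

# Q1: all vertices with an outgoing Transformation arc into a type-A node
q1 = {src for (src, dst, t) in arcs
      if t == "Transformation" and dst in type_a_nodes}

# Q6: total (in + out) degree of every node in the graph
q6 = Counter()
for src, dst, _ in arcs:
    q6[src] += 1
    q6[dst] += 1

print(q1)         # {'n1', 'n2'}
print(dict(q6))   # per-node degree counts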
4 TEST ENVIRONMENT
4.1 GDBMSs under test
The company imposed the following requirements on a GDBMS:
• to be available under either an open-source or an academic licence,
• to be used in practice by industry (listed on the DB-Engines website [9]),
• to support at least 3 of the ACID properties,
• to be capable of running in a cluster of workstations.

Based on the aforementioned criteria, out of the 29 available GDBMSs, the following were selected for evaluation: ArangoDB, TitanDB, JanusGraph, OrientDB, and Spark GraphX. One system we seriously considered was Neo4j: according to the DB-Engines ranking, at the time of writing this paper it was the most popular GDBMS. Unfortunately, we were unable to obtain an academic license of its full 'enterprise' edition, which provides, among others, distributed storage and processing. We therefore decided not to test it, rather than unfairly assess its toy version.

4.2 Benchmark setup
The GDBMSs were installed in a micro-cluster composed of 9 physical machines. Each node ran Ubuntu and had the following parameters: (1) 8GB RAM, (2) 457GB HDD, (3) Intel Core2 Quad CPU Q9650 3.00GHz, (4) graphics card: GT215. The machines were fully physically interconnected, enabling direct communication whenever required. The logical connections depended on the database system used.

Depending on the experiment, 1, 3, 5, or 9 nodes were used at any given time. Such cluster sizes correspond to 1, 2, 4, and 8 worker nodes, plus 1 access/coordinator node. Data were partitioned as equally as possible between the nodes, using the data distribution mechanisms provided by each GDBMS.

5 PERFORMANCE EVALUATION OF SELECTED GRAPH DATABASES
The goal of the experiments was to evaluate the response time of the 8 queries outlined in Section 3.2 for the 5 GDBMSs under test. Each query was run twelve times on the same dataset, on 1, 3, 5, and 9 nodes. The highest and lowest measurements were discarded, and the average value and standard error were calculated for the remaining measurements. Due to the huge differences in performance between the tested GDBMSs, a logarithmic scale is used in the charts. A sketch of this measurement protocol is given below.
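The following minimal sketch illustrates the measurement protocol described above. It assumes a callable run_query() (hypothetical) that issues a single query to the system under test; each GDBMS needs its own client code.

import time
import statistics

def measure(run_query, repetitions=12):
    """Run a query repeatedly, drop min/max, return mean and std. error in ms."""
    times_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    trimmed = sorted(times_ms)[1:-1]  # discard highest and lowest measurements
    mean = statistics.mean(trimmed)
    stderr = statistics.stdev(trimmed) / len(trimmed) ** 0.5
    return mean, stderr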
5.1 Results
The response time (elapsed time) for Q1-Q8 was measured in milliseconds. Below we present the results and discuss the obtained performance characteristics.

Q1 - transformation sources for a given node. The response times for Q1 are shown in Figure 1. As we can observe, ArangoDB clearly outperforms all the other GDBMSs. GraphX is the only system for which the query response time decreases with the increasing number of nodes. The performance of TitanDB and JanusGraph degrades with the increasing number of nodes: queries run on the 9-machine cluster take about twenty times longer than when running on a single node.

Figure 1: Execution time of Q1 (chart: response time [ms], log scale, vs. cluster size 1n/3n/5n/9n)

Q2 - listing transformation destinations for a given node. The response times of this query are shown in Figure 2. In this test, ArangoDB once again outperforms all the other GDBMSs, although the degradation of its performance when increasing the size of the cluster is more noticeable. GraphX execution times decrease by a factor of three when the cluster size increases to 9, resulting in better execution times than TitanDB or JanusGraph, but still worse than OrientDB.

Figure 2: Execution time of Q2 (chart: response time [ms], log scale, vs. cluster size)

Q3 - measuring the impact of changes in a node. The response times of this query are shown in Figure 3. From the chart we can notice that ArangoDB has a clear lead over its competitors yet again, with its executions taking thousands of times less than those of the other GDBMSs. GraphX slowly approaches ArangoDB execution times as the cluster size increases. OrientDB achieves better results than TitanDB and JanusGraph on clusters of greater size.

Figure 3: Execution time of Q3 (chart: response time [ms], log scale, vs. cluster size)

Q4 - measuring the impact of changes in a transformation. As Q4 is very similar to Q3, the execution times shown in Figure 4 have characteristics similar to those shown in Figure 3, i.e., ArangoDB achieves the best results, with GraphX slowly decreasing ArangoDB's lead as the cluster grows.

Figure 4: Execution time of Q4 (chart: response time [ms], log scale, vs. cluster size)

Q5 - measuring the impact of changes in topology. The execution times from this experiment are shown in Figure 5. Once again, ArangoDB is in the lead. GraphX execution times decrease as the cluster size increases. OrientDB performs worse than TitanDB and JanusGraph for a cluster size up to 3, and performs better when the cluster grows.

Figure 5: Execution time of Q5 (chart: response time [ms], log scale, vs. cluster size)

Q6 - computing the degree of each node. The execution times from this experiment are shown in Figure 6. For this query, GraphX offers the best performance, regardless of the cluster size. In 3-, 5-, and 9-node clusters ArangoDB performs the worst. The performance of OrientDB remains unchanged regardless of the cluster size.

Figure 6: Execution time of Q6 (chart: response time [ms], log scale, vs. cluster size)

Q7 - filtering query. The results of this evaluation are shown in Figure 7. On a single node, the average execution times of the same query on ArangoDB and GraphX differ only by forty-five milliseconds, but as the cluster size increases, GraphX gains a noticeable lead over all the other GDBMSs.

Figure 7: Execution time of Q7 (chart: response time [ms], log scale, vs. cluster size)

Q8 - finding the shortest path between nodes. This experiment was run on ArangoDB, OrientDB, TitanDB, and JanusGraph. GraphX was excluded because of its implementation of the shortest path algorithm: rather than simply finding the shortest path between two nodes, it finds the shortest paths from all nodes to the target one, only then allowing users to select specific paths from the generated RDD. This heavily influences the execution times of such queries. The first stage (computation of the shortest paths) takes minutes rather than milliseconds, while the second (retrieval of specific paths) takes a few milliseconds, making the results incomparable with those of the other GDBMSs. Figure 8 reveals that ArangoDB handles this query in the least amount of time. On a single node, OrientDB performs worse than TitanDB and JanusGraph, but performs better on 3, 5, and 9 nodes.

Figure 8: Execution time of Q8 (chart: response time [ms], log scale, vs. cluster size)

In Figure 9 we present the total execution times of a workload composed of queries Q1-Q7, for ArangoDB, OrientDB, TitanDB, JanusGraph, and GraphX in a cluster composed of 1, 3, 5, and 9 machines. As we can observe, ArangoDB, TitanDB, and JanusGraph do not scale out, as the total execution time grows with the increasing number of machines. On the contrary, OrientDB and GraphX offer a rather constant execution time w.r.t. the number of machines.

Figure 9: Total execution time of a workload composed of queries Q1 - Q7 (chart: response time [sec] vs. cluster size)

5.2 Significance tests
From the presented charts we can observe that, on average, ArangoDB and GraphX offer the best performance. ArangoDB offers the best performance for all queries but Q6. GraphX achieves varied results: it is a clear winner for Q6 but performs worse than ArangoDB for all other queries. Thus, we need to check whether the following observations are statistically significant:
• ArangoDB achieves better results than GraphX for Q1-Q5 and Q7-Q8,
• ArangoDB achieves worse performance than GraphX for Q6,
• GraphX achieves better performance than OrientDB for Q6, since OrientDB is more efficient than ArangoDB in executing Q6.

To this end, we applied Student's t-tests with p=0.01. The results of the significance tests are included in Table 1. The p-values for the significance of the results between GraphX and ArangoDB are represented by the rows for queries Q1-Q5 and Q7, whereas the p-values for the significance between GraphX and OrientDB are represented by the row for Q6. Each row includes p-values for the experiments on 1, 3, 5, and 9 nodes.

Table 1: p-values for testing the statistical significance of execution times between (1) GraphX and ArangoDB (Q1-Q5 and Q7) as well as between (2) GraphX and OrientDB (Q6)

Query   1 node         3 nodes        5 nodes        9 nodes
Q1      0.0000000000   0.0000000001   0.0000000003   0.0000000048
Q2      0.0000000054   0.0000000014   0.0000000000   0.0000000000
Q3      0.0000000423   0.0000000006   0.0000000000   0.0000000000
Q4      0.0000000740   0.0000000011   0.0000000000   0.0000000000
Q5      0.0000003403   0.0000000108   0.0000000000   0.0000000000
Q6      0.0000000000   0.0000000000   0.0000000000   0.0000000000
Q7      0.1674163506   0.0000000000   0.0000000000   0.0000000000

As we can observe, the p-values are much lower than the assumed significance level of 0.01 in all cases except Q7 on 1 node. This means that the differences in execution times between ArangoDB, OrientDB, and GraphX are statistically significant, and thus our conclusions are valid, for all queries except Q7 on 1 node.
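As an illustration of this test, the following sketch computes a two-sample Student's t-test p-value with scipy. The two response-time arrays stand for the ten retained per-run measurements of one query on two systems; the values below are made-up placeholders, not measured results.

from scipy import stats

arangodb_ms = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 12.0, 11.7, 12.1]
graphx_ms   = [95.0, 97.2, 94.8, 96.1, 95.5, 96.8, 95.9, 96.3, 94.9, 95.7]

# Two-sample Student's t-test; reject H0 (equal means) when p < 0.01
t_stat, p_value = stats.ttest_ind(arangodb_ms, graphx_ms, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3e}, significant: {p_value < 0.01}")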
5.3 Functionality assessment
In this section we present our assessment of selected features of the GDBMSs related to user experience, grading each of them on a scale from 1 to 5 (1 being the lowest and 5 the highest). The following features were assessed: (1) ease of installing and setting up the GDBMS, (2) ease of using the GDBMS (how complicated its query language is, whether it provides access to graph data from other languages), (3) support for multiple operating systems, and (4) visualization capabilities. The assessment results are shown in Table 2.

Table 2: Assessing the functionality of GDBMSs

                ArangoDB   OrientDB   TitanDB   JanusGraph   GraphX
Ease of setup   5          3          4         4            3
Ease of use     5          3          5         5            2
Portability     5          5          5         5            5
Interface       4          4          2         2            1
Total           19         15         16        16           11

As can be seen, ArangoDB wins in this regard as well. Its installation is straightforward: setting up a cluster requires nothing but running a few simple scripts. Its query language is robust and intuitive, with a focus on sub-querying. It runs on the most common operating systems, and its visual interface is decent.

TitanDB and JanusGraph are next in our ranking. Their installation and cluster setup require a bit of fiddling, although not much skill. Their query languages are easy to learn and use. Neither of these GDBMSs has any problems running on any of the popular operating systems, but they require quite a lot of work to set up any kind of visual interface.

OrientDB scores third. Its installation is not difficult, although running a few instances in a cluster is problematic. Its language lacks a few built-ins. It supports numerous operating systems. The visual representations of graphs it generates are decent and legible.

GraphX scores last. Ease of use was never the focus of Spark-based tools. Installation and cluster setup are rather easy, but connecting it to a resilient data storage is more difficult. Tutorials for GraphX are almost non-existent, and the documentation occasionally leaves a bit to be desired. Since it is Java-based, it has no problems running virtually anywhere. A graphical interface (other than the Spark management tool) is nonexistent.

6 SUMMARY AND CONCLUSIONS
In this paper we presented a graph database benchmark developed to meet the specific requirements of an international IT company. Even though over 10 graph benchmarks have been proposed in the research literature, none of them reflects the particular structure of the graph or the particular queries needed by the IT company. Therefore, the benchmark that we developed can be considered complementary to those mentioned in Section 2: it contributes another graph structure used by industry and five queries used by industry.

The benchmark was implemented and used in practice to assess the performance of 5 open-source GDBMSs in a micro-cluster composed of a variable number of physical nodes (up to 9 nodes were used). The experiments that we ran showed that:
• distributing graph data over multiple nodes does not provide scaling out; we observed that query execution times either increased when the size of the cluster increased (the case of ArangoDB, TitanDB, and JanusGraph) or remained approximately constant (the case of OrientDB and GraphX);
• even simple queries can take much longer to execute in a cluster when a GDB needs to cross-check every node for arcs leading to another shard;
• ArangoDB offers the best performance in the majority of tests; it also offers the best functionality from a user perspective;
• GraphX offers the best performance when it comes to massive localized data processing (cf. Figure 6), i.e., it is a good match for algorithms, such as PageRank, that rely heavily on node degrees.

The performance evaluation can be further extended to test the scalability of the GDBMSs w.r.t. graph size, and on clusters of more than 9 nodes. To this end, the proposed GooDBye benchmark needs to be extended as well, to generate graphs of parameterized size and with multiple statistical properties.

REFERENCES
[1] Apache. [n.d.]. SynthBenchmark. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/graphx/SynthBenchmark.scala.
[2] Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: A Database Benchmark Based on the Facebook Social Graph. In SIGMOD Int. Conf. on Management of Data.
[3] D. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koetser, E. Loh, K. Madduri, B. Mann, T. Meuse, and E. Robinson. 2009. HPC Scalable Graph Analysis Benchmark. HPC Graph Analysis, http://www.graphanalysis.org/benchmark/.
[4] Sharada Bose, Priti Mishra, Priya Sethuraman, and H. Reza Taheri. 2009. Benchmarking Database Performance in a Virtual Environment. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC). 167-182.
[5] M. Ciglan, A. Averbuch, and L. Hluchy. 2012. Benchmarking Traversal Operations over Graph Databases. In Int. Conf. on Data Engineering Workshops.
[6] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In ACM Symposium on Cloud Computing. 143-154.
[7] Jean-Daniel Cryans, Alain April, and Alain Abran. 2008. Criteria to Compare Cloud Computing with Current Database Technology. In Int. Conf. on Software Process and Product Measurement. 114-126.
[8] Jerome Darmont, Fadila Bentayeb, and Omar Boussaid. 2007. Benchmarking Data Warehouses. Int. Journal of Business Intelligence and Data Mining 2, 1 (2007).
[9] DB-ENGINES. [n.d.]. DB-Engines Ranking of Graph DBMS. https://db-engines.com/en/ranking/graph+dbms.
[10] David Dominguez-Sal, Norbert Martinez-Bazan, Victor Muntes-Mulero, Pere Baleta, and Josep Lluis Larriba-Pay. 2011. A Discussion on the Design of Graph Database Benchmarks. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC).
[11] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vañó, S. Gómez-Villamor, N. Martínez-Bazán, and J. L. Larriba-Pey. 2010. Survey of Graph Database Performance on the HPC Scalable Graph Analysis Benchmark. In Int. Conf. on Web-age Information Management (WAIM).
[12] Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. 2015. The LDBC Social Network Benchmark: Interactive Workload. In SIGMOD Int. Conf. on Management of Data.
[13] Facebook. [n.d.]. LinkBench. GitHub, https://github.com/facebookarchive/linkbench.
[14] Avrilia Floratou, Jignesh M. Patel, Willis Lang, and Alan Halverson. 2011. When Free Is Not Really Free: What Does It Cost to Run a Database Workload in the Cloud?. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC). 163-179.
[15] Florian Funke, Alfons Kemper, Stefan Krompass, Harumi Kuno, Raghunath Nambiar, Thomas Neumann, Anisoara Nica, Meikel Poess, and Michael Seibold. 2012. Metrics for Measuring the Performance of the Mixed Workload CH-benCHmark. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC).
[16] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A Benchmark for OWL Knowledge Base Systems. Web Semantics 3, 2-3 (2005).
[17] Karl Huppler. 2011. Benchmarking with Your Head in the Cloud. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC). 97-110.
[18] Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardt, Hassan Chafi, Mihai Capotă, Narayanan Sundaram, Michael Anderson, Ilie Gabriel Tănase, Yinglong Xia, Lifeng Nai, and Peter Boncz. 2016. LDBC Graphalytics: A Benchmark for Large-scale Graph Analysis on Parallel and Distributed Platforms. Proc. VLDB Endowment 9, 13 (2016).
[19] S. Jouili and V. Vansteenberghe. 2013. An Empirical Comparison of Graph Databases. In Int. Conf. on Social Computing.
[20] Martin L. Kersten, Alfons Kemper, Volker Markl, Anisoara Nica, Meikel Poess, and Kai-Uwe Sattler. 2011. Tractor Pulling on Data Warehouses. In Int. Workshop on Testing Database Systems.
[21] LDBCouncil. [n.d.]. LDBC Graphalytics. GitHub, https://github.com/ldbc/ldbc_graphalytics.
[22] LDBCouncil. [n.d.]. Social Network Benchmark. http://ldbcouncil.org/developer/snb.
[23] Hadj Mahboubi and Jérôme Darmont. 2011. XWeB: the XML Warehouse Benchmark. CoRR (2011).
[24] Robert McColl, David Ediger, Jason Poovey, Dan Campbell, and David A. Bader. 2014. A performance evaluation of open source graph databases. In Workshop on Parallel Programming for Analytics Applications.
[25] Umar Farooq Minhas, Jitendra Yadav, Ashraf Aboulnaga, and Kenneth Salem. 2008. Database systems on virtual machines: How much do you lose?. In Int. Conf. on Data Engineering Workshops (ICDE). 35-41.
[26] ODBMS. [n.d.]. Operational Database Management Systems - ODBMS. http://www.odbms.org/.
[27] Patrick O'Neil, Betty O'Neil, and Xuedong Chen. 2009. Star Schema Benchmark. https://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[28] Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio López, Garth Gibson, Adam Fuchs, and Billie Rinaldi. 2011. YCSB++: benchmarking and performance debugging advanced features in scalable table stores. In ACM Symposium on Cloud Computing.
[29] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse. 2002. XMark: A Benchmark for XML Data Management. In Int. Conf. on Very Large Data Bases.
[30] Priya Sethuraman and H. Reza Taheri. 2011. TPC-V: A Benchmark for Evaluating the Performance of Database Applications in Virtual Environments. In TPC Technology Conference on Performance Evaluation, Measurement and Characterization of Complex Systems (TPCTC). 121-135.
[31] TPC. [n.d.]. Transaction Processing Council Benchmarks. http://www.tpc.org/.