<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Piecing together large puzzles, efficiently: Towards scalable loading into graph database systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Campero Durand</string-name>
          <email>campero@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcus Pinnecke</string-name>
          <email>pinnecke@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingy Ma</string-name>
          <email>jma@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Saake</string-name>
          <email>saake@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Magdeburg</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
<p>Many applications rely on network analysis to extract business intelligence from large datasets, requiring specialized graph tools such as processing frameworks (e.g. Apache Giraph, Gradoop), database systems (e.g. Neo4j, JanusGraph) or applications/libraries (e.g. NetworkX, nvGraph). A recent survey reports scalability, particularly for loading, as the foremost practical challenge faced by users. In this paper we consider the design space of tools for efficient and scalable graph bulk loading. For this we implement a prototypical loader for a property graph DBMS, using a distributed message bus. With our implementation we evaluate the impact and limits of basic optimizations. Our results confirm the expectation that bulk loading can be best supported as a server-side process. We also find, for our specific case, gains from batching writes (up to 64x speedups in our evaluation), uniform behavior across partitioning strategies, and the need for careful tuning to find the optimal configuration of batching and partitioning. In future work we aim to study loading into alternative physical storages with GeckoDB, an HTAP database system developed in our group.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
<p>H.2.m [Information Systems]: Miscellaneous|Graph-based database models; H.2.m [Information Systems]: Miscellaneous|Extraction, transformation and loading</p>
      <p>General Terms: Measurement</p>
      <p>Keywords: Graph database systems, Bulk loading, Streaming graph partitioning</p>
<p>[Figure 1: The graph bulk loading process: (1) parsing for different input characteristics (CSV, JSON, adjacency lists, implicit graphs, ...), (2) file interpretation, (3) pre-processing, ..., (5) transaction checks, id assignation and physical storage.]</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
<p>Network analysis is one of many methods used by scientists to study large data collections. For this, data has to be represented with models based on graph theory, namely as a graph structure composed of nodes/vertexes and relationships/edges. Additional details such as properties, labels, semantics or the direction of edges allow defining more specific models such as RDF, hypergraphs or property graphs.</p>
<p>Building on such models, network analysis is commonly assisted by three kinds of tools specialized for graph management:</p>
<p>Standalone applications, toolkits and libraries, such as Gephi, NetworkX and nvGraph, provide cost-effective processing for small to medium networks in a single-user environment. Accordingly, they target single-machine shared-memory configurations. Standalone applications like Pajek and Gephi also offer visualization. Graph database systems emphasize persistent storage and transactional guarantees in the presence of updates from a multi-user environment. Physical/logical data independence is accomplished through the adoption of high-level graph models (e.g., the property graph model), backed by different physical storage and indexing alternatives. According to their storage, graph databases can be distinguished as native (i.e., with graph-specific storage) or non-native (i.e., with storage developed for other models, like documents). Example systems include Neo4j, OrientDB, ArangoDB, Gaffer, SAP Hana Graph, and JanusGraph.</p>
      <p>Large scale graph processing frameworks are
characterized by their goal of supporting scale-out analytical
workloads over large graphs, with failure-tolerance
properties. As a result they adopt parallel and distributed
strategies for batch or stream processing. Distributed
job scheduling for skew handling and communication
avoidance, next to I/O reduction are some of the key
design characteristics of these tools. Apache Giraph,
Pregel, Flink's Gelly and Gradoop are some
representatives of such frameworks.</p>
<p>
        A recent survey among researchers and practitioners employing these specialized graph tools reveals several characteristics of their usage [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The first finding is what the authors identify as the ubiquity of large graphs: about a third of the surveyed users report having graphs with more than 100 million edges, and, more specifically, graphs in the range of billions of edges are not exclusive to large organizations but are used in organizations of all sizes. Second, the survey reports that graph database systems currently occupy a prominent place within the community, constituting the most popular category of tools in use. Third, the survey finds that scalability (i.e., the need for processing larger graphs efficiently), followed by visualization and requirements for expressive query languages, are the most pressing challenges faced by users of these tools. Moreover, regarding scalability, users report that the precise challenges are inefficiencies in loading, updating and performing computations over large graphs. In this paper we share early considerations about a specific challenge from this list: bulk loading large networks into a specialized graph tool. This is a process that can become a bottleneck and delay the time to analysis. In this sense, optimizing this process can be especially impactful, enabling analysis on data with more currency.
      </p>
      <p>We organize our study as follows:</p>
      <p>We introduce the bulk loading process into graph tools,
outlining performance-impacting aspects and usability
requirements gathered by surveying the SNAP
repository for network datasets (Sec. 2).</p>
<p>We develop an early prototype for bulk loading into a graph database, evaluating client vs. server-side loading, request batching and different partitioning strategies (Sec. 3, 4).</p>
      <p>We summarize related work providing context for our
research (Sec. 5).</p>
<p>We conclude by suggesting takeaways from our study and future work (Sec. 6).</p>
    </sec>
    <sec id="sec-3">
      <title>2. THE GRAPH BULK LOADING PROCESS</title>
<p>Bulk loading is a process common to most graph tools (Fig. 1). We propose that the general process consists of:
1. Reading input data sources, often in tabular formats.
2. Interpreting data as nodes and edges according to rules. Intermediate data structures can be utilized.
3. Cleaning, deduplication and preprocessing (optional).
4. Process organization, where data partitioning and batch sizes can be defined.
5. Transferring this data into the physical storage model of the tool, while complying with integrity constraints.</p>
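      <p>As an illustration of these five steps, the following sketch runs the whole pipeline in memory. It is a minimal rendering under assumed conditions (a simple CSV edge list of srcId,tgtId pairs; invented helper and variable names), not the implementation of any particular system.</p>

```python
import csv
import io

# A minimal sketch of the five-step process, for a CSV edge list.
# All names here are illustrative, not taken from any specific system.

def bulk_load(csv_text, batch_size=100):
    # 1. Read the input source.
    rows = list(csv.reader(io.StringIO(csv_text)))
    # 2. Interpret rows as edges (srcId, tgtId) and collect vertexes.
    edges = [(src, tgt) for src, tgt in rows]
    vertexes = {v for e in edges for v in e}
    # 3. Optional cleaning: deduplicate edges, preserving order.
    edges = list(dict.fromkeys(edges))
    # 4. Process organization: split edges into batches.
    batches = [edges[i:i + batch_size] for i in range(0, len(edges), batch_size)]
    # 5. Transfer to storage, checking that edge endpoints exist first.
    storage = {"vertexes": set(), "edges": []}
    storage["vertexes"].update(vertexes)
    for batch in batches:
        for src, tgt in batch:
            assert src in storage["vertexes"] and tgt in storage["vertexes"]
        storage["edges"].extend(batch)
    return storage
```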
<p>Intrinsic data dependencies (e.g., the fact that edges require their connected vertexes to exist) can affect how this process is organized, usually requiring several passes over the input data.</p>
<p>Regarding constraints, to store an edge the existence of the connected vertexes needs to be checked. For storing both vertexes and edges, integrity constraints specific to the data model might also require validation. When sources contain duplicate entities, each write request might entail first determining whether the entity needs to be created. For large graphs, both the checks for constraints and those for duplicate entities might impact the loading time.</p>
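      <p>A minimal sketch of such a get-or-create check follows; the dictionary index is only a stand-in for whatever lookup structure a real system maintains, and the function name is our own.</p>

```python
# Sketch: get-or-create semantics for possibly-duplicated vertexes.
# The index dict stands in for the DBMS's lookup structure.

def get_or_create_vertex(index, external_id, next_internal_id):
    """Return (internal_id, next_internal_id, created)."""
    if external_id in index:
        # Duplicate: reuse the already-assigned internal identifier.
        return index[external_id], next_internal_id, False
    index[external_id] = next_internal_id
    return next_internal_id, next_internal_id + 1, True
```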
<p>In terms of distributing the process, this can take place either by distributing the data sources (e.g., chunking and sharding files) or the interpreted data.</p>
<p>Perhaps the most pressing aspects that need to be considered by tools for the general graph data loading process we describe are efficiency, which for the purposes of our discussion encompasses scalability, and usability, which refers to fulfilling requirements from diverse input characteristics. In what follows we briefly present these aspects before advancing to the specific contributions of our study.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Performance-impacting factors</title>
<p>
        One of the main factors determining how the loading will take place, its efficiency and the possible optimizations, is the physical storage model. Paradies and Voigt [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] survey some of the more prevalent alternatives for this. The authors distinguish between choices for storing only the topology and choices for storing richer logical graph models, including labels and properties associated with nodes and edges. Among the first group they count adjacency matrixes, adjacency lists and compressed sparse rows (consisting of an ordering of edges stored in a sequential immutable array, with an index of offsets to improve the access). Among the second group they list triple tables (storing in a single table, with dictionary compression, the subject-object-predicate data that form RDF triples), universal tables (wherein a single table is assigned to edges and another to vertexes), emerging schemas (for which tables are still employed, but with schemas tuned to the data), schema hashing (where item ids and properties are used as hashes to store the corresponding values in separate tables) and separate property storage (a strategy that simply separates the storage of properties from that of the topology). Specialized compressed structures, adaptive strategies and structural storage are also discussed by the authors as alternative storage approaches. Finally, graph summaries like sparsifiers and spanners could be considered storage alternatives too, though they are specifically attuned to expected uses.
      </p>
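      <p>The compressed-sparse-rows layout described above can be sketched in a few lines. This is a generic CSR construction under the description given (sequential edge array plus an offsets index), not any particular system's implementation.</p>

```python
# Sketch of a compressed sparse row (CSR) topology: edge targets in one
# sequential array, plus an offsets index giving each vertex's run.

def build_csr(num_vertexes, edges):
    # Count outgoing edges per source vertex.
    degree = [0] * num_vertexes
    for src, _ in edges:
        degree[src] += 1
    # Prefix sums give the offset of each vertex's neighbor run.
    offsets = [0] * (num_vertexes + 1)
    for v in range(num_vertexes):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edges)
    cursor = offsets[:num_vertexes]
    for src, tgt in edges:
        targets[cursor[src]] = tgt
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, v):
    # All outgoing neighbors of v, via the offsets index.
    return targets[offsets[v]:offsets[v + 1]]
```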
<p>In addition to the specific storage model selected, which should reasonably determine the operations involved in bulk loading, we consider that other performance-impacting aspects pertaining to the design of an efficient bulk loading tool are: input file parsing, memory allocation, access patterns, I/O paths for persistent storage, write batching, the amount of parallelism employed and load balance, concurrency control, consensus for distributed writes, transactional management, types of cuts in data distribution, efficiency of integrity-constraint checking, and identifier assignation.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Usability requirements</title>
      <p>Data loading tools should be able to assist the precise
loading process of their users. This is a challenging expectation
due to the diversity of data sources and formats.</p>
<p>
        To better describe the characteristics of data sources, we consider the popular Stanford Large Network Dataset Collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the SNAP repository (https://snap.stanford.edu). As of the date of our evaluation, in February 2018, it consisted of 90 publicly available datasets representing social, citation, communication, collaboration and road networks, among others. The largest dataset in this collection is the Amazon Reviews dataset, consisting of approximately 36 million edges (for reviews) and around 12 GB of compressed data. Most datasets (84) are either in CSV or TXT formats, with tab or comma separation. The remaining datasets (6, e.g. Bitcoin) also use the TXT format, but follow an arrangement similar to JSON. A majority of datasets (54) present the data as simple edge lists (with srcId and tgtId), which facilitates the loading. A number of datasets (9, e.g. the Ego networks) present edge data organizations that follow the idea of an adjacency list, with a single line of a file containing one source id and then several target ids. Of these, a small number (3, e.g. Ego-Facebook) additionally encode properties with a dictionary file and 0s and 1s to indicate whether a vertex presents a given property. Another organization, which we could call implicit, is given for one dataset (the Amazon Reviews). This is an especially challenging representation, as each line represents multiple edge relations.
      </p>
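      <p>The two most common layouts described above, edge lists and adjacency lists, can be interpreted as edges with a short sketch. The helper names are our own; the '#'-comment convention is an assumption, though many SNAP files do use '#' for header lines.</p>

```python
# Sketch: interpreting the two most common SNAP layouts as edges.

def parse_edge_list(lines, sep=None):
    """One 'srcId tgtId' pair per line (sep=None splits on whitespace)."""
    edges = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        src, tgt = line.split(sep)[:2]
        edges.append((src, tgt))
    return edges

def parse_adjacency_list(lines, sep=None):
    """One source id followed by several target ids per line."""
    edges = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        src, *tgts = line.split(sep)
        edges.extend((src, tgt) for tgt in tgts)
    return edges
```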
<p>The support for diverse input characteristics is also related to efficiency: when a tool supports a specific input source, the tool can offer optimizations related to the overall process. When a tool does not support a given input source, users can either preprocess their data to match the expected format, or they can develop their own load process by employing operations offered by the tool. In both cases, and especially in the second, possible optimizations that the tool could perform over the complete process might be lost.</p>
    </sec>
    <sec id="sec-6">
      <title>3. AN EARLY PROTOTYPE FOR LOADING INTO A GRAPH DATABASE</title>
<p>In order to understand the data loading process and the optimization possibilities in a general graph tool, we develop a prototype over JanusGraph, a property graph database with non-native physical storage following the schema hashing approach. This system supports Apache Cassandra, Apache HBase and Oracle Berkeley DB as storage backends. JanusGraph can be executed as a server or an application-embedded client. In both configurations JanusGraph offers a graph and a management API, in addition to maintaining socket-based read/write access to the backends, specific client-level caches and statistics.</p>
<p>Concretely, we propose to employ the prototype for studying the impact of server vs. client-side loading, the effect of batching when loading graphs of different topologies, and distributing the edge loading process (after interpretation) through a publisher/subscriber framework to accomplish scalable loading.</p>
    </sec>
    <sec id="sec-8">
      <title>4. EVALUATION</title>
<p>We selected JanusGraph version 0.1.1 (May 11, 2017) and Apache Cassandra 2.1.1 for our tests. Our experiments were executed on a commodity multi-core machine composed of 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) with 251 GB of memory.</p>
<p>[Figure 2: Average loading duration in minutes, client-side vs. server-side.]</p>
<p>We tackle our evaluation questions by running the data loading process on real-world datasets. We selected two datasets from different areas, with different sizes, in order to make our tests more diverse. Both adopt the edge list organization. The first dataset is Wiki-RfA. It contains 11,402 users (voters and votees) corresponding to Requests for Adminship and votes, forming 189,004 distinct voter/candidate pairs; it is thus a small directed, signed network. There is also a rich textual component in RfAs, since each vote includes a short comment.</p>
<p>Wiki-RfA is an example of a real-world temporal signed network, since edges represent either positive, negative or neutral votes, and the network presents a time dimension that specifies the order in which votes are cast. In terms of topology, Wiki-RfA can be classified as a social media network. This is a kind of network similar to a social network (i.e., it can also be considered to be based on a social network), with the same scale-free properties and short paths, but that can be shaped by the affordances of the interaction platform. We choose as a second dataset the Google-Web graph, a representative of information networks and of larger datasets (800k nodes and 5M edges). In this graph, nodes represent web pages and directed edges represent hyperlinks between them.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Client vs. Server-side loading</title>
<p>In this section we ask what the right place for loading graph data is, considering first whether there are fundamental performance differences between carrying out the load process from the client vs. the server side.</p>
<p>As discussed previously, the loading process involves several steps according to the source files. The main steps we proposed were the loading of vertexes and the loading of edges; each of these involves parsing the files, creating possible in-memory mappings for ids, ordering the input items, determining the load granule (i.e., transaction size or batch size) and distributing/parallelizing the process itself. Considering that database operations can be performed as client- or server-side code (with the first being passed to the system as a series of HTTP, websocket, language client or CLI requests, and the latter being passed as a single script, in the case of JanusGraph a Groovy script, to be executed on the server side), the first question in designing a loading tool for a graph database is to determine which of these options is the best for launching the process.</p>
<p>Fig. 2 presents the average time performance over 10 runs of loading data from the client/server side. We used the Wiki-RfA dataset. The average loading time from the client side is 339,283.2 ms (5.65 minutes); the average loading time from the server side is 245,320.4 ms (4.08 minutes); the average speedup is thus 1.38x. From this evaluation we observe that, even for a relatively small dataset and without adopting any optimization, there is an evident distinction between loading on the client side vs. the server side, leading at least to moderate speedups.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Batch Loading</title>
<p>Bulk/batch loading enables us to add large amounts of data in individual transactions. In our experiments we only considered batching alternatives for loading edges, evaluating the response time of different batch sizes.</p>
<p>Fig. 3 and Fig. 4 show the time taken to load all edges with different batch sizes for the different datasets. It can be seen in the two charts that batching reduces the loading time significantly: the bigger the batch size, the faster the loading process, following a close-to-linear relationship. However, when batch sizes are increased exponentially, the loading time does not decrease at the same scale; there are diminishing returns from increases in batch size. If the batch size is very big, it might even increase the overall time of the loading task. In our test results, the batch size where the best performance is achieved is 100.</p>
<p>We speculate that a possible explanation for the decreasing gains from batching could be that more data per transaction deteriorates the use of transaction caches, breaking the temporal and spatial locality that appear in small transactions. A further aspect that should be considered is that large transactions could also lead to more costly distributed transactions. This was not studied here, since we did not employ multi-node backends.</p>
<p>One interesting thing to note is that in the "Web-Google" dataset the speedup is reduced when the batch size equals 1k, while the same is not evident in Wiki-RfA. From this we can speculate that the batch size is not the only factor that affects loading performance, and that topology characteristics, affecting in turn transaction cache usage, might also have an impact. Specifically, Wiki-RfA represents a more connected network than Web-Google, so there might be more chances of reusing data already in the transaction cache, reducing loading costs. Further studies would be needed to verify these possible causes.</p>
    </sec>
    <sec id="sec-11">
      <title>4.3 Partitioning</title>
<p>In our studies so far we have considered batching, which consisted of fitting more data inside a single transaction in order to reduce the number of transactions employed in the loading process. In this section we consider how to organize the process with parallel transactions, by partitioning the data into parallel chunks and running the loading for each chunk in separate requests to the backend. In contrast to the previous experiments, with this approach we do not seek to reduce the number of transactions but to schedule them in such a way that some of them can be performed simultaneously, thus possibly reducing the overall runtime. To achieve this it is necessary to determine a strategy to partition the loading task. One straightforward possibility is to partition the edge set into groups of items that can be inserted separately; this is a form of partitioning over the interpreted data.</p>
<p>
        For the task of loading there is a significant difference with respect to traditional partitioning approaches: the complete graph is not available in a way that would enable computing a global algorithm over it. Instead the loading process must partition the graph with incomplete information, deciding the location of a vertex or an edge, or a group of them, as it processes them. In spite of the limited information there is still the goal of finding a balanced partition that can also reduce communication costs during the loading process. Hence this can be defined as a streaming graph partitioning problem [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
<p>
        Authors have proposed [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the use of different heuristics for streaming graph partitioning, such as balanced (assigning a vertex to a partition with minimal size), chunking (assuming some order in the stream, dividing the stream into chunks and distributing them in a round-robin fashion), hashing items, deterministic greedy (assigning an entity to the partition where it has more items, e.g. a vertex to where it has more edges; this can be further parametrized to include penalties for large partitions), next to buffer-based ones. The authors find that these simple heuristics can bring important benefits over random cases and also reduce the edge-cuts, improving distributed graph processing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
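      <p>Three of these heuristics can be sketched as follows. This is a simplified rendering of the descriptions above (partitions as sets of vertexes, a multiplicative size penalty for the greedy variant), not the reference implementations.</p>

```python
# Sketch of three streaming vertex-placement heuristics.
# parts is a list of sets, one per partition.

def balanced(parts, vertex):
    # Assign to the currently smallest partition.
    return min(range(len(parts)), key=lambda p: len(parts[p]))

def hashed(num_parts, vertex):
    # Assign by hashing the vertex id.
    return hash(vertex) % num_parts

def deterministic_greedy(parts, neighbors_of_v, capacity):
    # Prefer the partition holding most of v's neighbors,
    # penalizing partitions that approach their capacity.
    def score(p):
        overlap = len(parts[p].intersection(neighbors_of_v))
        penalty = 1.0 - len(parts[p]) / capacity
        return overlap * penalty
    return max(range(len(parts)), key=score)
```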
<p>
        We have picked 4 different partitioning strategies for our experiments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These were selected due to their suitability for specifically distributing the edges, since it is not clear to us whether reducing edge cuts will have an impact on the runtimes for our setting.
      </p>
<p>E/E Strategy: This strategy distributes edges in a round-robin (RR) manner. It allocates many or all outgoing edges of one vertex to multiple partitions.</p>
<p>V/V Strategy: The V/V strategy distributes vertexes with RR, and all outgoing edges of a vertex are assigned to a single partition.</p>
<p>BE Strategy: This strategy partitions the graph by vertexes and meanwhile balances the amount of edges per partition. It requires sorting the vertexes according to their number of outgoing edges in descending order. It then iterates over this sorted list and allocates all outgoing edges of one vertex to the currently smallest partition. This balances the edges across partitions; thereby all outgoing edges of a vertex belong to the same partition.</p>
      <p>[Figures 5 and 6: legend labels Baseline, VV, EE, BE, DS; Partition 2, 4 and 8.]</p>
<p>DS Strategy: This strategy extends the BE strategy. It is an approximation for handling skewed data. To ease the pressure of highly connected vertexes, the DS strategy allocates the edges equally across partitions: for the vertexes that have significantly more edges, this strategy separates the edges and distributes them across different partitions.</p>
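      <p>Two of these strategies, V/V and BE, can be sketched as follows. This is a simplified in-memory rendering under our own naming (adj maps each source vertex to its outgoing edges; partition indices stand in for workers), not the prototype's code.</p>

```python
# Sketch of two of the edge-distribution strategies described above.

def vv_partition(adj, num_parts):
    # V/V: round-robin over vertexes; all outgoing edges of a vertex
    # land in the same partition.
    parts = [[] for _ in range(num_parts)]
    for i, src in enumerate(adj):
        parts[i % num_parts].extend(adj[src])
    return parts

def be_partition(adj, num_parts):
    # BE: sort vertexes by out-degree descending, then always give the
    # next vertex's edges to the currently smallest partition.
    parts = [[] for _ in range(num_parts)]
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    for src in order:
        smallest = min(range(num_parts), key=lambda p: len(parts[p]))
        parts[smallest].extend(adj[src])
    return parts
```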
<p>We implement the support for these partitioning strategies by using a message passing system, Apache Kafka. When executing several JanusGraph servers (all sharing the same clustered backend), Kafka helps to organize a distributed task. For the case of loading, the request is received by a single JanusGraph server. This worker is in charge of managing the load of vertexes and applying the partitioning strategies over a compact representation of the edges. Next, it sends the computed partitions to connected workers using Kafka. These in turn receive and load their partitions, following all configuration parameters given with the request (e.g., batch sizes), and reply back to the original requester via Kafka messages. Finally, the original requester returns when all the partitions have been inserted.</p>
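      <p>This coordinator/worker exchange can be sketched as follows, with a thread-safe queue.Queue standing in for the Kafka topics used in the prototype; the message fields and function names are illustrative assumptions.</p>

```python
import json
import queue

# Sketch of the coordinator/worker exchange: the coordinator publishes
# one partition task per worker, then blocks until all acknowledge.

def coordinator(edges, num_parts, task_bus, reply_bus):
    # Partition the interpreted edges (round-robin here for brevity).
    parts = [edges[p::num_parts] for p in range(num_parts)]
    for p, part in enumerate(parts):
        task_bus.put(json.dumps({"partition": p, "edges": part, "batch_size": 100}))
    # Collect one acknowledgment per partition before returning.
    done = set()
    while len(done) != num_parts:
        done.add(json.loads(reply_bus.get())["partition"])
    return sorted(done)

def worker(task_bus, reply_bus, storage):
    # Receive one task, load its edges, and acknowledge via the bus.
    task = json.loads(task_bus.get())
    storage.extend(tuple(e) for e in task["edges"])
    reply_bus.put(json.dumps({"partition": task["partition"], "ok": True}))
```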
<p>Fig. 5 and Fig. 6 summarize the results of our evaluations with partitioning strategies. Parallel processing consistently led to improvements over the baseline. Although parallel processing improves the performance, the overheads added due to threading and communication (i.e., more Kafka clients) limit the speedups for relatively short loading tasks. For this reason, when the loading time is relatively short (as is the case for Wiki-RfA), speedup gains decline. Thus, a careful balance is required for determining the best number of partitions according to the size of the loading task.</p>
<p>We observe that loading processes using VV spent more time than those using other strategies. VV is not balanced, and in some situations it can lead to lower performance. In spite of the small difference for VV, we found that, overall, these basic partitioning strategies have little impact on loading time for our prototype. However, we speculate that for different topologies of graph datasets, and for scaled-out architectures (where the loading is distributed but coordinated), the influence of these strategies might be different.</p>
<p>EE, BE and DS lead in our tests to the exact same partitions in all cases. VV leads to some small imbalances, with absolute differences of 5517 and 467 for 2 partitions on Wiki-RfA and Web-Google, respectively.</p>
<p>The combination of batching, partitioning and parallelization, as shown in Fig. 7, can actually lead to degraded performance when the loading time is relatively short (as happens with batch sizes larger than 10). The combination was only better than the baseline for a batch size of 10, with maximum performance gains of 2x and 1.5x over all strategies for both datasets at 4 and 8 partitions, respectively. Thus, the gains are sublinear. For other batch sizes, there were no improvements over the baseline. We believe that the core factor leading to this situation is that the overheads for message passing dominate the performance when the batch sizes are larger (i.e., when the tasks to perform are few). This argument is also supported by the observation that, when not comparing against the baseline, more partitions consistently improve the performance for Google-Web, no matter the batch size, as opposed to Wiki-RfA (where the task is shorter). Regarding the differences between strategies we report one interesting case: VV for Google-Web with 2 partitions, which outperforms all cases. From our studies we know that this gain does not come from a better load balance; instead we speculate that it might be due to a good reduction in transaction commit overheads for distributed transactions, produced by the fact that the strategy assigns to a partition with a given vertex all the edges that connect to it. However, further studies are needed to better understand whether this is the case. For all other cases we observe mixed results regarding the strategies, and there is no clear sense of one being better than the others.</p>
    </sec>
    <sec id="sec-12">
      <title>5. RELATED WORK</title>
      <p>To our knowledge there is limited related work devoted
exclusively to choices for bulk loading of graph data and to
improving the process.</p>
<p>
        Then et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] study optimizations at the level of DBMS design for loading a graph into an in-memory database. They propose to decompose the process into: 1) parsing (in which the vertex data and identifiers are loaded into memory); 2) dense vertex identification (in which, to improve memory use, vertex identifiers are sorted based on their density); 3) relabeling, wherein in-memory dictionary encoding is adopted such that densely connected vertexes are given smaller identifiers than less connected ones; and 4) finally, writing to different in-memory data structures that represent the graph (the authors consider compressed sparse rows and a map of neighbor lists).
      </p>
<p>Mainstream DBMSs like Neo4j offer useful features to improve the bulk loading process, such as loading from files bypassing the transactional layer, functions for batching requests, and functions for combining writes with consistency checks via Cypher's MERGE operator.</p>
<p>
        Benchmarks like HPC-SGAB [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Bluebench [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and GDB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have tests for the loading process. The authors of GDB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] assess the impact of batching, reporting performance gains similar to our evaluation. The authors of Bluebench [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] consider, in addition, the effect of indexing. The LDBC benchmarks study trickle updates in mixed workloads; evaluations of bulk-loading choices for graph DBMSs are, at the moment, not part of the core workloads [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>6. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we share early results towards designing a
tool for scalable bulk loading into a graph storage. We
establish the goals of our research and provide a practical
evaluation using an open source database.</p>
<p>Stemming from our test results, we can make the argument that bulk loading is better supported as a single server-side process rather than as a process with intermediate operations all managed on the application/client side. Temporary structures, such as the mapping between unique identifiers and internal DBMS identifiers, can be used more efficiently when managed on the server side than on the client side. Also, reducing the number of requests can bring performance gains by lessening the communication and interpretation costs of individual requests.</p>
<p>From our results we also observe batching to be a useful optimization. In our study we report a case where, by moving from a batch size of 1 to 100, the loading process drops from 100 minutes to close to 1.5 minutes. Furthermore, we suggest that batching should be a choice considered before others, due to its simplicity. However, there are limits to this approach (batches that are too big might introduce large overheads on transaction failures/restarts), and performance gains do not grow in proportion to batch sizes.</p>
      <p>Based on our results we can also conclude that parallelism
is a consistently good choice, depending on the number of
parallel processors available. In our study we observed that
parallelization, when not combined with batching, can lead
to speedups of 5.96 with 8 partitions. For partitioning we
observe little to no distinction between the strategies, thus
we susggest that EE could be a default strategy.</p>
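For illustration, a simple deterministic hash partitioner in Python that splits an edge list into one loadable chunk per worker; this is a generic stand-in of our own, not the EE strategy or any of the strategies compared in the paper.

```python
import hashlib

def partition_for(edge, num_partitions):
    """Assign an edge to a partition by hashing its source vertex.

    SHA-1 of the stringified source id gives a stable, uniform-ish
    assignment; any deterministic hash would do for this sketch.
    """
    source, _target = edge
    digest = hashlib.sha1(str(source).encode()).hexdigest()
    return int(digest, 16) % num_partitions

def partition_edges(edges, num_partitions):
    """Split the edge list into one chunk per parallel loader."""
    parts = [[] for _ in range(num_partitions)]
    for edge in edges:
        parts[partition_for(edge, num_partitions)].append(edge)
    return parts
```

Each resulting chunk can then be loaded by its own worker, optionally combining this with the batching shown earlier.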
      <p>The combination of optimization alternatives: batching,
partitioning, parallelization should be chosen properly, after
loading tests on sample data. We have observed that using
more optimizations does not necessarily translate into
performance gains. In our tests with 1k batches more use of
partitioning and parallelization strategies can only reduce
the loading e ciency.
Taken together, batching proves to be a straightforward
optimization choice. It's easy to use for local settings, but for
distributed scenarios parallelization becomes necessary and
its combination with batching requires careful consideration
and, possibly, automatically adaptive solutions.</p>
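The tuning-by-sampling suggestion can be sketched as a small search over candidate configurations; `time_sample_load` is a hypothetical callable (our own, not from any tool) that loads a small sample with the given settings and returns the elapsed time in seconds.

```python
def pick_configuration(time_sample_load, batch_sizes, partition_counts):
    """Pick (batch_size, partitions) by timing a sample load per combination.

    Exhaustive over the candidate grid; cheap when the sample is small
    relative to the full dataset.
    """
    best = None
    best_time = float("inf")
    for b in batch_sizes:
        for p in partition_counts:
            elapsed = time_sample_load(b, p)
            if elapsed < best_time:
                best, best_time = (b, p), elapsed
    return best
```

This mirrors the recommendation above: rather than assuming more batching and more partitions always help, measure a few combinations on a sample and keep the fastest.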
      <p>As future work we intend to study bulk loading in di erent
tools and evaluate the role of physical storage alternatives.
7.</p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially funded by the DFG (grant no.:
SA 465/50-1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dominguez-Sal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Urbon-Bayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gimenez-Vano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gomez-Villamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Martínez-Bazán</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Larriba-Pey</surname>
          </string-name>
          .
          <article-title>Survey of graph database performance on the HPC scalable graph analysis benchmark</article-title>
          .
          <source>In International Conference on Web-Age Information Management</source>
          , pages
          <fpage>37</fpage>
          –
          <lpage>48</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Averbuch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larriba-Pey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chafi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gubichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-D.</given-names>
            <surname>Pham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>The ldbc social network benchmark: Interactive workload</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>619</fpage>
          –
          <lpage>630</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jouili</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vansteenberghe</surname>
          </string-name>
          .
          <article-title>An empirical comparison of graph databases</article-title>
          .
          <source>In Social Computing (SocialCom)</source>
          , 2013 International Conference on, pages
          <fpage>708</fpage>
          –
          <lpage>715</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kolomicenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Svoboda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Mlynkova</surname>
          </string-name>
          .
          <article-title>Experimental comparison of graph databases</article-title>
          .
          <source>In Proceedings of International Conference on Information Integration and Web-based Applications &amp; Services, page 115. ACM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kissinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Voigt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <article-title>Partitioning strategy selection for in-memory graph pattern matching on multiprocessor systems</article-title>
          .
          <source>In 23rd International Conference on Parallel and Distributed Computing</source>
          , Santiago de Compostela, Spain,
          <year>August 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Krevl</surname>
          </string-name>
          . SNAP Datasets:
          <article-title>Stanford large network dataset collection</article-title>
          . http://snap.stanford.edu/data,
          <year>June 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paradies</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Voigt</surname>
          </string-name>
          .
          <article-title>Big graph data analytics on single machines – an overview</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>101</fpage>
          –
          <lpage>112</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mhedhbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Salihoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Özsu</surname>
          </string-name>
          .
          <article-title>The ubiquity of large graphs and surprising challenges of graph processing: A user survey</article-title>
          .
          <source>arXiv preprint arXiv:1709.03188</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Stanton</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kliot</surname>
          </string-name>
          .
          <article-title>Streaming graph partitioning for large distributed graphs</article-title>
          .
          <source>In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>1222</fpage>
          –
          <lpage>1230</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Then</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>Evaluation of parallel graph loading techniques</article-title>
          .
          <source>In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems</source>
          , GRADES '
          <volume>16</volume>
          , pages
          <fpage>4:1</fpage>
          –
          <lpage>4:6</lpage>
          , New York, NY, USA,
          <year>2016</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>