=Paper=
{{Paper
|id=None
|storemode=property
|title=Data Locality in Graph Databases through N-Body Simulation
|pdfUrl=https://ceur-ws.org/Vol-733/paper_pacher.pdf
|volume=Vol-733
|dblpUrl=https://dblp.org/rec/conf/gvd/PacherBS11
}}
==Data Locality in Graph Databases through N-Body Simulation==
Data Locality in Graph Databases
through N-Body Simulation
Dominic Pacher Robert Binna Günther Specht
Institute of Computer Science Institute of Computer Science Institute of Computer Science
Technikerstrasse 21a Technikerstrasse 21a Technikerstrasse 21a
Innsbruck Austria Innsbruck Austria Innsbruck Austria
dominic.pacher@uibk.ac.at robert.binna@uibk.ac.at guenther.specht@uibk.ac.at
ABSTRACT
1 2 3 4 5 6 7
Data locality poses a major performance requirement in
graph databases, since it forms a basis for efficient caching
and distribution. This vision paper presents a new approach
to satisfy this requirement through n-body simulation. We
describe our solution in detail and provide a theoretically Figure 1: A two dimensional structure gets mapped
complexity estimation of our method. To prove our con- to one-dimensional space. Since no locality is pre-
cept, we conducted an evaluation using the DBpedia dataset served, large jumps (6 nodes) appear in the data
data. The results are promising and show that n-body simu- (red).
lation is capable to improve data locality in graph databases
significantly.
this requirement influences all of the different sub compo-
Categories and Subject Descriptors nents of a graph database and can be fulfilled through im-
H.2.4 [Database Systems]: Graph databases provements on many different levels. However, there is no
other property, which has as much influence on the perfor-
mance and scalability of the overall system as data locality.
General Terms In terms of graphs this means that any node stored has to be
Locality, N-body Simulation, Graph Data, Experimentation also physical near to its linked nodes in the memory. This
seems to be a straightforward requirement, but it’s hard to
Keywords fulfill practically. In theory, a graph describes a multidi-
mensional data structure, which has to be managed by the
Database, Graph, Simulation, Graph Database, Triple Store computer. Unfortunately, since memory systems work on a
fixed one-dimensional memory layout, this cannot be done
1. INTRODUCTION directly. The common solution to this problem is to define
Recently the demand to manage high amounts of linked a mapping from multidimensional data to less (one) dimen-
data increased substantially. This development has its origin sional space. Although it’s not a problem to find any kind
in data, generated by social as well as linked knowledge net- of mapping, it’s hard to preserve data locality at the same
works like Wikipedia [1]. In addition, all of today’s imper- time. Therefore data locality isn’t assured directly (Figure
ative programming languages work on graph oriented (ob- 1) and databases try to speed up operations using additional
ject) memory systems, because they are easy to understand indexes or in the case of main memory systems, by providing
and can be efficiently processed in main memory. More- cheap jumps through random access memory.
over, graph oriented memory systems provide means to eas- Despite the fact that this solutions work out quite well for
ily formulate complex recursive behavior and data struc- the problem, they are always tied to additional costs and re-
tures. Usually these data structures need to be stored per- maining limitations and don‘t solve the actual problem. For
sistently in some kind of external database. example, additional indices need space and have to be up-
Beside the exact internal concept, this (graph) database has dated on every change. Main memory systems work well on
to support query, update and remove operations of single one core and one computer. But since, frequent jumps be-
nodes or complete sub graphs as fast as possible. Clearly tween the cores memory or even worse, between computers,
are orders of magnitudes costlier than jumps within main
memory of one single thread, it’s hard to distribute them
properly.
To come up with a new approach to improve this situation,
this paper suggests building a graph database whose nodes
are aligned in memory by a n-body simulation system. In-
spired by real world physics laws, links will be simulated as
23rd GI-Workshop on Foundations of Databases (Grundlagen von Daten-
banken), 31.05.2011 - 03.06.2011, Obergurgl, Austria. springs causing nodes to arrange themselves automatically.
Copyright is held by the author/owner(s). As a result, when the state of lowest energy is reached, a
85
adapted Jena TDB [18] (in contrast to SDB), Virtuoso [10],
YARS [11] and RDF-3X [13]. Where the last two approaches
1 7 2 6 3 5 4 make excessive use of indices to speed up the query execu-
tion process. Though RDF-3X achieved new query speed
records, this approach is heavily optimized on read opera-
tions. As a consequence on any data update all indices have
to be changed accordingly. In contrast, BitMap [3] uses
Figure 2: Better solution of Figure 1. Locality is a completely different design using compressed bit-matrix
preserved and the maximum jump length is reduced structure.
to two nodes (green) Finally Sesame [8] provides storage engines of all tree groups.
The performance of these systems have been evaluated in
[13] [6] and through the Berlin Benchmark [7].
maximum of data locality is provided at the same time (Fig- Consequently there is no system yet using n-body simula-
ure 2). In addition, n-body simulation systems are known tion to improve data locality and it’s interesting if such an
to be highly distributable and computational feasible [16]. approach is able to improve the overall performance of graph
Consequently the aim of this paper is to show through databases.
experimentation, that such a simulation will optimally place
graph nodes in memory achieving improved data locality on
global scale.
2. THE METHOD
The remainder of this paper is structured as follows. Sec- In contrast to existing methods to store graph data we
tion 2 describes related papers in more detail. In section suggest an algorithm, which achieves a high degree of data
3 we present our new method by a short introduction to n- locality. This algorithm is based on the idea, that link length
body simulation idea, as well as some adjustments we had to don’t come for free, making longer links to more distant data
make. To prove that our concept is feasible, we performed locations more expensive than shorter links. With this ad-
some preliminary evaluations which results are discussed in ditional costing factor c, an optimal solution for the locality
section 4. Section 5 sums up with an conclusion and future problem in databases can be defined as achieving the global
works. minimum of the sum of this costs overall nodes n:
1.1 Related Work n
X
Although there is, at best of our knowledge, no related ap- Call = min ci
proach solving the data locality problem of graph database i=0
using n-body simulation, papers exists which make use of This optimization process becomes quickly unsolvable us-
this method for related problems. ing analytically methods, therefore a common n-body simu-
The idea of n-body simulation to support graph alignment lation approach is applied. Every edge is seen as a physical
has been already proposed in the 80s [14] and constantly spring between two data nodes. Springs will add distance
improved [19]. However, these algorithms try to find an op- depended forces Fl to the connected links causing them to
timal layout for graph nodes, which is a far more complicated approach each other:
problem than preserving locality as it includes additional re-
quirements like finding aesthetic pleasing solutions. Fortu-
Fl = Fc ∗ D(l)
nately, this is clearly not an affordance for graph databases.
Plenty of systems were developed in the RDF research area Where Fc is the force constant and D(l) a distance function
trying to optimize storing and querying graphs. These graph of linked node l. This distance function can be for example
stores can be separated into three groups of stores, which a linear function returning the distance to the linked node l
or an exponential function causing forces to increase expo-
• reside completely in memory (In-Memory Store) nentially with the distance.
• are based on a relational database (Relational Triple Since a node is influenced by all its linked nodes, all forces
Store) Fl have to summed up to achieve the final overall force Fn :
• use their own implementation (Native Triple Store) n
X
Fn = Fi
To the group of In-Memory Stores belong GRIN [17] and i=0
Brahms [12], which mainly try to solve special purpose queries
Now we can calculate the acceleration of the current node
through dedicated indices. Also SpiderStore [6] operates in
nusing its mass mn :
memory completely. However it makes no special assump-
tions about queries.
Jena SDB [18] and Virtuoso RDF Views [10] are part of an = Fi /mn
the second group using a traditional row oriented relational
In our prototype we set mn to 1 but for later implementa-
model. Mapping graph data to the relational model tend
tions this parameter may represent a ideal way to reduce the
to result in one big table with three columns: source node,
movement of big nodes using the number of links as mass.
edge, destination node (or in RDF terms subject, predicate
This would cause big nodes to be moved less often. Finally
object) and billions of rows. As the mapping of this table
we can use an to calculate the change of velocity
to a common row oriented store is inefficient [2] and [15] ap-
plied a column oriented relational model.
Part of the third group, the native implementations, are the ∆vn = an ∗ s
86
2 3
1
1 2 3 5 6
5 6
Figure 3: Nodes 2 / 5 and 3 / 6 have to be stored
to the same position in one-dimensional space. 1 5 2 6 3
where s describes the used step size. For the sake of sim-
plicity our prototype used a step size of 1. The simulation
can now be formulated in three steps:
1. Calculate vn for all n.
Figure 4: Mapping of Figure 3 to one dimension.
2. Change vn according to ∆vn and calculate new posi-
The upper Figure shows a non optimal solution with
tion.
jump size of three. The lower Figure shows a better
3. Check if there is any movement . If yes then goto 1. solution with a jump size of two.
4. Simulation finished.
nodes. Indeed for a complete graph, the n-body simulation
2.1 Adjustments cannot improve any locality, because energy equilibrium is
N-body simulation methods have been used widely and already reached and the simulation would terminate after
very successfully in many fields of physics over the past the first step. Assuming that n >> m the next important
decades. However some adjustments are necessary to make factor is snum , which depends on the dataset as well as the
the approach useful for locality calculations. used step size s. Consequently an increased step size would
As memory of all modern computers is accessed through dis- lead to less simulation steps. However too large steps sizes
crete addresses, the simulation has to take this into account also increase the computational error between steps and will
and have to operate on integers entirely. This approach has lead to an unstable simulation eventually. Fortunately, a
two advantages. In the first place it avoids the introduction variety of algorithms exist to minimize this approximation
of additional repulsion forces to keep nodes at a minimum error, using one step or multistep methods as well as meth-
distance to each other and secondly, the calculations can be ods with variable step size at need [9].
done with faster integer calculations. Finally, if simulated once, we assume that the majority of
As mentioned previously, graph data is naturally multi di- data locations will remain stable and won’t have to be re-
mensional, which stands in direct contrast to the one-dimensional calculated on every data update on global scale. This esti-
memory space. Because of that, nodes may have found a fi- mation and in addition, that existing implementations deal
nal position, which is already be claimed by another node. with about 10 billion elements [16], let us believe that a large
Therefore, a priority function has to be defined to solve this scale simulation of graph data is feasible.
problem, preserving that the node wins which leads to less
energy in the overall system. This can be accomplished by
using the nodes overall force as priority value. 3. PRELIMINARY EVALUATION
An example of this problem can be found in Figures 3 and 4 To prove the suggested concept we implemented a proto-
where nodes 2/5 and 3/6 claiming the same position. Figure type and made some preliminary evaluations. To get realis-
4 also shows, that preserving locality comes at cost of the tic results we chose a subset (first 200 000 triples = 110,205
link length of other nodes. nodes) of the DBpedia dataset. All tests were conducted
on a single machine (Mac Pro Intel Xeon 2,26 GHz) using
2.2 Complexity Estimation a simple single threaded process with 200 MiB of dedicated
Generally the complexity of n-body simulations can be es- ram.
timated as O(n ∗ m ∗ snum ) where n describes the number To get an visual impression how effective our method in-
of nodes, m the number of links per node and s the number creases data locality, we visualize every data element as a
of simulation steps needed until energy equilibrium. Conse- pixel in an image. As we are working on a one-dimensional
quently for a complete graph, where n = m the complexity space, all values are simply wrapped at the end of image
raises to O(n2 ∗ snum ). This wouldn’t be feasible for high width to create a two dimensional image. Every pixel posi-
amounts of data. Fortunately [4] showed for gravity simula- tion corresponds to the actual position of a data node in the
tions, which can be reduced to a fully connected graph, that data space.
complexity can be reduced to O(n ∗ log(n) ∗ snum ), using a In Figure 5 the color of this pixel represents the maximum
supporting tree structure and aggregation for distant nodes. distance of a node to its linked neighbor nodes in the data
Although this is the worst-case estimation, it is very unlikely space. Using a maximum value of n/2 (value red) this Figure
to happen for real data where the number of links per node shows the development over time until energy equilibrium is
should always be significantly smaller than the number of reached. At t = 0 the data is scattered randomly in the
87
Figure 5: Complete data space of 110205 nodes. Each pixel represents one node. The color key on the right
describes the maximum distance of a node’s link. From left to right the image shows the data space at at
t=0, t = 0.5 and t = 1.
Figure 6: Series of access heat maps showing high access zones which should not be separated. Red positions
are accessed by 55,000, dark blue nodes by less than 1,000 nodes. From left to right the image shows the
data space at at t=0, t = 0.5 and t = 1.
data space. There are plenty of nodes, which have to jump nificant reduction of distances to about one quarter of the
through the whole data space to access their linked nodes. originally data space. In addition, the histogram points out
At t = 0.5 the data distribution has improved already, but that global data locality comes also at cost of local data lo-
can be further enhanced until a state of minimum energy cality, which is showed by the reduction of the red peak at
(no more movement of nodes) is reached (t = 1). The final t = 0 mentioned before to the blue one at t = 1.0. The
result shows that all nodes have now arrived at a position, cause of this reduction can be seen again in the mapping of
where they can access their most distant linked node with a multi dimensional data to less (in our case one) dimensional
minimum of locality change. space, where different data nodes claim the same position
To get a better impression of the exact numbers, we created (Figures 3 and 4).
a histogram (Figure 7) for the same sample data. For better Furthermore there are some small peaks (two nodes) remain-
understanding, be aware of the logarithmic scaling on the ing near distance 50, 000. These nodes couldn’t be aligned
vertical axis. Furthermore we moved the mid-term frame very well. Although we always expected problems with
t = 0.5 to quarter time t = 0.25 to get a better impression nodes that are highly linked to different other nodes and
of the progress over time. Similar to Figure 5, at t = 0 can’t be further optimized by means of location, this ap-
one can observe an almost equal distribution of nodes along pears not to be the case here, as the number of links of the
the complete range of possible distances. As an important worst-case node where only about 300. Unfortunately, this
matter of fact, there is already a peak of values having very remains a rather unsatisfied situation and has to be further
close linked nodes on the very left side (distance < 1, 000) investigated in future. However, as shown in our sample
of the diagram. Since it’s often the case that new nodes data set, these nodes can be considered as very rare (overall
are introduced followed by their direct neighbors, the input 8 nodes out of 110, 025 until distance of 26, 000).
file itself can be seen as origin of this peak. Of course this These problems apart, it’s most important that the global
issue can only have a positive impact on local data locality data locality improved substantially. As previously seen in
and not on global scale, as we want to achieve. During the Figure 5, the histogram shows that most nodes are far less
simulation over t = 0.25 to t = 1.0 one can observe a sig- distant to their linked nodes than at the start of simulation.
88
Figure 7: Histogram of the occurrence of nodes and their maximum distances to linked nodes. For better
understanding the time frames were changed to t=0, t = 0.25 and t = 1.
In particular, the link length was reduced to 1/4 in the worst 5. REFERENCES
and to 1/10 in the average case.
Based on this data it should be possible to make efficient de- [1] Wikipedia Free Encyclopedia. http://wikipedia.com,
cisions how graph data can be separated in certain chunks, apr 2011.
to be distributed on different cores as well as on different [2] D. J. Abadi, A. Marcus, S. R. Madden, and
computers. To gather more insight into this, we used an K. Hollenbach. Scalable semantic web data
access heat map. To create this map all data positions lying management using vertical partitioning. VLDB
between a node and all it’s linked nodes are incremented by Endowment, sep 2007.
one. Of course in main memory we are able to randomly [3] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler.
access every position at same speed, but in a distributed en- Matrix Bit loaded: a scalable lightweight join query
vironment this model fits very well. When this is done for processor for RDF data. ACM, apr 2010.
all nodes and their respective linked nodes, every position [4] J. Barnes and P. Hut. A hierarchical O(N*log(N))
marks the number of accesses needed to visit every neigh- force-calculation algorithm. nature, 324(4):446–449,
bor node within data space (Figure 6). This image gives 1986.
an impression, where the data space can be separated best, [5] C. Becker. RDF Store Benchmarks with DBpedia.
choosing less dense (blue in the Figure) zones. www4.wiwiss.fu-berlin.de, 2011.
[6] R. Binna, W. Gassler, E. Zangerle, and D. Pacher.
SpiderStore: Exploiting Main Memory for Efficient
4. CONCLUSION AND FUTURE WORK RDF Graph Representation and Fast Querying. 2010.
The aim of this paper was to show that a n-body simula- [7] C. Bizer. The berlin sparql benchmark. Int J Semantic
tion can improve graph data locality significantly. After a Web Inf Syst, 2009.
introduction to our suggested method, we evaluated a exper- [8] J. Broekstra, A. Kampman, and F. van Harmelen.
imental prototype using partial data of the DBpedia dataset Sesame: A generic architecture for storing and
[5]. As a result, we were able to restrict jumps to about 1/4 querying RDF and RDF schema. In Semantic Web -
of the whole data space in the worst-case and to 1/10 for Iswc 2002, pages 54–68, Aidministrator Nederland
the average case. Although these are very promising re- BV, Amersfoort, Netherlands, 2002. Aidministrator
sults, there is plenty of work remaining. Nederland BV, Amersfoort, Netherlands.
We theoretically showed that our n-body approach should [9] J. Butcher. Numerical Methods for Ordinary
scale well into millions of graph nodes. However, our proto- Differential Equations . Wiley, 2 edition, jun 2008.
type is currently not optimized for very large data sets like [10] O. Erling and I. Mikhailov. Rdf support in the
the complete DBpedia dataset, consisting of about 100 mil- virtuoso dbms. In T. Pellegrini, S. Auer,
lion triples. Hence our goal for future works will be to opti- K. Tochtermann, and S. Schaffert, editors, Networked
mize the simulation by improving the algorithm and finding Knowledge - Networked Media, volume 221 of Studies
a way to distribute the simulation on many cores and com- in Computational Intelligence, pages 7–24. Springer
puters. As a result of this development, we hope to provide Berlin / Heidelberg, 2009.
practically evidence that our method is working on large real [11] A. Harth, J. Umbrich, and A. Hogan. YARS2: A
world graphs preserving computational feasibility. federated repository for querying graph structured
89
data from the web. The Semantic Web, 2007.
[12] M. Janik and K. Kochut. BRAHMS: A WorkBench
RDF Store and High Performance Memory System for
Semantic Association Discovery. In Fourth
International Semantic Web Conference, pages
431–445. Springer, 2005.
[13] T. Neumann and G. Weikum. The RDF-3X engine for
scalable management of RDF data. The VLDB
Journal — The International Journal on Very Large
Data Bases, 19(1):91–113, feb 2010.
[14] E. Peter. A Heuristic for Graph Drawing. Congressus
Numerantium, 42:149–160, nov 1984.
[15] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and
S. Manegold. Column-store support for rdf data
management: not all swans are white. Proc. VLDB
Endow., 1:1553–1563, August 2008.
[16] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk,
N. Yoshida, L. Gao, J. Navarro, R. Thacker,
D. Croton, J. Helly, J. A. Peacock, S. Cole,
P. Thomas, H. Couchman, A. Evrard, J. o. r. Colberg,
and F. Pearce. Simulations of the formation, evolution
and clustering of galaxies and quasars. nature,
435(7042):629–636, jun 2005.
[17] O. Udrea, A. Pugliese, and V. S. Subrahmanian. Grin:
a graph based rdf index. In Proceedings of the 22nd
national conference on Artificial intelligence - Volume
2, pages 1465–1470. AAAI Press, 2007.
[18] K. Wilkinson, C. Sayers, and H. Kuno. Efficient RDF
storage and retrieval in Jena2. In Proceedings of
SWDB, 2003.
[19] V. Zabinako and P. Rusakovs. Development and
Implementation of Partial Hybrid Algorithm for
Graphs Visualization. Scientific Proceedings of Riga
Technical University, 5(34):192–203, jul 2008.
90