<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TripleRush: A Fast and Scalable Triple Store</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philip Stutz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihaela Verman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenz Fischer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <email>bernsteing@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DDIS, Department of Informatics, University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <fpage>50</fpage>
      <lpage>65</lpage>
      <abstract>
<p>TripleRush is a parallel in-memory triple store designed to address the need for efficient graph stores that quickly answer queries over large-scale graph data. To that end it leverages a novel, graph-based architecture. Specifically, TripleRush is built on our parallel and distributed graph processing framework Signal/Collect. The index structure is represented as a graph where each index vertex corresponds to a triple pattern. Partially matched copies of a query are routed in parallel along different paths of this index structure. We show experimentally that TripleRush takes less than a third of the time to answer queries compared to the fastest of three state-of-the-art triple stores, when measuring time as the geometric mean of all queries for two benchmarks. On individual queries, TripleRush is up to three orders of magnitude faster than other triple stores.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Many applications such as social network analysis, monitoring of financial
transactions or analysis of web pages and their links require large-scale graph
computation. To address this need, many have researched the development of efficient
triple stores [
        <xref ref-type="bibr" rid="ref1 ref11 ref14">1, 14, 11</xref>
]. These systems borrow from the database literature to
investigate efficient means for storing large graphs and retrieving subgraphs, which
are usually defined via a pattern matching language such as SPARQL. This
avenue has had great success both in research [
        <xref ref-type="bibr" rid="ref1 ref11 ref14">1, 14, 11</xref>
] and practice.1 However,
many of these systems are built like a centralized database, raising the question
of scalability and parallelism within query execution. One way to increase the
efficiency of parallel pipelined joins in such centralized databases is the use of
sideways information passing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Other approaches focus on adapting triple stores to a distributed setting:
MapReduce [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] has been used to aggregate results from multiple single-node
RDF stores in order to support distributed query processing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Others have mapped SPARQL query execution pipelines to chains of
MapReduce jobs (e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Whilst this provides scalability, the authors point out that
the rigid structure of MapReduce and the high latencies when starting new jobs
constrain the possibilities for optimizations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Distributed graph processing
frameworks such as Pregel [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], GraphLab/Powergraph [
        <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
        ], Trinity [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and
1 http://virtuoso.openlinksw.com, http://www.ontotext.com/owlim
our own Signal/Collect [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] can offer more flexibility for scalable querying
of graphs.
      </p>
      <p>
        Trinity.RDF [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is, to our knowledge, the only existing triple store to be
implemented on top of such an abstraction. It is a graph engine for
SPARQL queries that was built on the Trinity distributed graph processing
system. To answer queries, Trinity.RDF represents the graph with adjacency
lists and combines traditional query processing with graph exploration.
      </p>
      <p>
        In this paper we introduce TripleRush, a triple store which is based on an
index graph, where a basic graph pattern SPARQL query is answered by
routing partially matched query copies through this index graph. Whilst traditional
stores pipe data through query processing operators, TripleRush routes query
descriptions to data. For this reason, TripleRush does not use any joins in the
traditional sense but searches the index graph in parallel. We implemented TripleRush
on top of Signal/Collect, a scalable, distributed, parallel and vertex-centric
graph processing framework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The contributions of this paper are the following: First, we present the
TripleRush architecture, with an index graph consisting of many active
processing elements. Each of these elements contains a part of the processing logic
as well as a part of the index. The result is a highly parallel triple store based on
graph-exploration. Second, as a proof of concept, we implemented the TripleRush
architecture within our graph processing framework Signal/Collect, benefiting
from transparent parallel scheduling, efficient messaging between the active
elements, and the capability to modify a graph during processing. Third, we
evaluated our implementation and compared it with three other state-of-the-art
in-memory triple stores using two benchmarks based on the LUBM and
DBPSB datasets. We showed experimentally that TripleRush outperforms the
other triple stores by a factor ranging from 3.7 to 103 times in the geometric
mean of all queries. Fourth, we evaluated and showed data scalability for the
LUBM benchmark. Fifth, we measured memory usage for TripleRush, which is
comparable to that of traditional approaches. Last, we open sourced our
implementation.2</p>
      <p>In the next section we discuss the infrastructural underpinnings of TripleRush.
This is followed by a description of the TripleRush architecture, as well as the
functionality and interactions of its building blocks. We continue with a
description of the optimizations that reduce memory usage and increase performance.
We evaluate our implementation, discuss some of this paper's findings as well as
limitations, and finish with some closing remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>Signal/Collect</title>
      <p>In this section we describe the scalable graph processing system
Signal/Collect and some of the features that make it a suitable foundation for TripleRush.
2 Apache 2.0 licensed, https://github.com/uzh/triplerush</p>
      <p>
        Signal/Collect [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]3 is a parallel and distributed large-scale graph
processing system written in Scala. Akin to Pregel [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], it allows one to specify graph
computations in terms of vertex-centric methods that describe the aggregation of
received messages (collecting) and the propagation of new messages along edges
(signalling). The model is suitable for expressing data-flow algorithms, with vertices
as processing stages and edges that determine message propagation. In contrast
to Pregel and other systems, Signal/Collect supports different vertex types
for different processing tasks. Another key feature, also present in Pregel, is that
the graph structure can be changed during the computation. The framework
transparently parallelizes and distributes the processing of data-flow algorithms.
Signal/Collect also supports features such as bulk-messaging and Pregel-like
message combiners to increase the message-passing efficiency.
      </p>
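<p>To make the model concrete, the following is a minimal sketch of the vertex-centric signal/collect cycle described above. It is a hypothetical Python rendering with invented names; the actual Signal/Collect API is in Scala and considerably richer (typed vertices, asynchronous scheduling, distribution):</p>

```python
# Minimal sketch of a vertex-centric signal/collect cycle (hypothetical
# Python rendering; names and the synchronous stepping are assumptions).

class Vertex:
    def __init__(self, vid, state=0):
        self.id = vid
        self.state = state
        self.edges = []      # target vertex ids
        self.inbox = []      # signals received since the last collect

    def collect(self):
        # Aggregate the received signals into the vertex state.
        self.state += sum(self.inbox)
        self.inbox.clear()

    def signal(self, graph):
        # Propagate a new signal along every outgoing edge.
        for target in self.edges:
            graph[target].inbox.append(self.state)

def step(graph):
    # One synchronous superstep; Signal/Collect can also schedule
    # signalling and collecting asynchronously, per vertex.
    for v in graph.values():
        v.signal(graph)
    for v in graph.values():
        v.collect()
```

<p>Here every vertex simply sums incoming signals; Signal/Collect generalizes this to arbitrary vertex types with their own collect and signal logic, which is what TripleRush exploits.</p>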
      <p>Most graph processing systems work according to the bulk-synchronous
parallel model. In such a system, all components act in lock-step and the slowest part
determines the overall progress rate. In a query processing use-case, this means
that one expensive partial computation would slow down all the other ones that
are executed in the same step, which leads to increased latency.
Signal/Collect supports asynchronous scheduling, where different partial computations
progress at their own pace, without a central bottleneck. The system is based
on message-passing, which means that no expensive resource locking is required.
These two features are essential for low-latency query processing.</p>
      <p>With regard to the representation of edges, the framework is flexible. A vertex
can send messages to any other vertex: Edges can either be represented explicitly
or messages may contain vertex identifiers from which virtual edges are created.
TripleRush uses this feature to route query messages.</p>
    </sec>
    <sec id="sec-3">
      <title>TripleRush</title>
      <p>The core idea of TripleRush is to build a triple store with three types of
Signal/Collect vertices: Each index vertex corresponds to a triple pattern, each
triple vertex corresponds to an RDF triple, and query vertices coordinate query
execution. Partially matched copies of queries are routed in parallel along
different paths of this structure. The index graph is, therefore, optimized for efficient
routing of query descriptions to data and its vertices are addressable by an ID,
which is a unique [ subject predicate object ] tuple.</p>
      <p>We first describe how the graph is built and then explain the details of how
this structure enables efficient parallel graph exploration.</p>
      <sec id="sec-3-1">
        <title>Building the Graph</title>
        <p>The TripleRush architecture is based on three different types of vertices.
Index and triple vertices form the index graph. In addition, the TripleRush graph
contains a query vertex for every query that is currently being executed. Fig. 1
shows the index graph that is created for the triple [ Elvis inspired Dylan ]:
[Fig. 1: The index graph created for the triple [ Elvis inspired Dylan ], with index vertices on levels 1 to 3 and the triple vertex on level 4.]
3 http://uzh.github.io/signal-collect/
Triple vertices are illustrated on level 4 of Fig. 1 and represent triples in the
database. Each contains subject, predicate, and object information.
Index vertices, illustrated in levels 1 to 3 in Fig. 1, represent triple patterns
and are responsible for routing partially matched copies of queries (referred
to as query particles) towards triple vertices that match the pattern of the
index vertex. Index vertices also contain subject, predicate, and object
information, but one or several of them are wildcards. For example, in Fig. 1
the index vertex [ * inspired * ] (in the middle of the figure on level 2) routes
to the index vertex [ * inspired Dylan ], which in turn routes to the triple
vertex [ Elvis inspired Dylan ].</p>
        <p>Query vertices, depicted in the example in Fig. 2, are query dependent. For
each query that is being executed, a query vertex is added to the graph. The
query vertex emits the first query particle that traverses the index graph. All
query particles, successfully matched or not, eventually get routed back to
their respective query vertex. Once all query particles have succeeded or
failed, the query vertex reports the results and removes itself from the graph.</p>
        <p>The index graph is built by adding a triple vertex for each RDF triple that
is added to TripleRush. These vertices are added to Signal/Collect, which
turns them into parallel processing units. Upon initialization, a triple vertex will
add its three parent index vertices (on level 3) to the graph and add an edge
from these index vertices to itself. Should any parent index vertex already exist,
then only the edge is added from this existing vertex.</p>
        <p>When an index vertex is initialized, it adds its parent index vertex, as well as
an edge from this parent index vertex to itself. Note that the parent index vertex
always has one more wildcard than its child. The construction process continues
recursively until the parent vertex has already been added or the index vertex
has no parent. In order to ensure that there is exactly one path from an index
vertex to all triple vertices below it, an index vertex adds an edge from at most
one parent index vertex, always according to the structure illustrated in Fig. 1.</p>
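<p>The recursive derivation of index-vertex IDs from a triple can be sketched as follows. This is an illustrative Python sketch: None stands for the wildcard "*", and the choice of which single parent each index vertex actually links from (fixed by the structure of Fig. 1) is not reproduced here.</p>

```python
# Sketch: deriving the index-vertex IDs reachable from one triple.
# A pattern is a (subject, predicate, object) tuple; None is a wildcard.

def level3_parents(triple):
    """The three index patterns obtained by wildcarding one position."""
    s, p, o = triple
    return [(None, p, o), (s, None, o), (s, p, None)]

def one_more_wildcard(pattern):
    """All patterns derivable by wildcarding one additional position."""
    return [pattern[:i] + (None,) + pattern[i+1:]
            for i in range(3) if pattern[i] is not None]

triple = ("Elvis", "inspired", "Dylan")
index_ids = {triple}
frontier = level3_parents(triple)
while frontier:
    pattern = frontier.pop()
    if pattern not in index_ids:
        index_ids.add(pattern)
        frontier.extend(one_more_wildcard(pattern))

# index_ids now holds the triple plus all seven index patterns of Fig. 1,
# from [ Elvis inspired * ] up to the root [ * * * ].
```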
        <p>Next we describe how the index graph allows parallel graph exploration in
order to match SPARQL queries.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Query Processing</title>
        <p>The index graph we just described is different from traditional index structures,
because it is designed for the efficient parallel routing of messages to triples that
correspond to a given triple pattern. All vertices that form the index structure
are active parallel processing elements that only interact via message passing.</p>
        <p>A query is defined by a list of SPARQL triple patterns. Each query execution
starts by adding a query vertex to the TripleRush graph. Upon initialization, a
query vertex emits a single query particle. A query particle consists of the list of
unmatched triple patterns, the ID of its query vertex, a list of variable bindings,
a number of tickets, and a flag that indicates if the query execution was able
to explore all matching patterns in the index graph. Next, we describe how the
parts of the query particle are modified and used during query execution.</p>
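<p>The contents of a query particle can be pictured as follows (a hypothetical Python rendering; the field names are invented, and the default ticket endowment follows footnote 5):</p>

```python
# Hypothetical sketch of a query particle's contents (names invented).
from dataclasses import dataclass, field

@dataclass
class QueryParticle:
    query_vertex_id: int    # where results are reported back to
    unmatched: list         # remaining triple patterns, in order
    bindings: dict = field(default_factory=dict)  # variable name to value
    tickets: int = 2**63 - 1   # Long.MaxValue, per footnote 5
    complete: bool = True      # False if some matching paths went unexplored
```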
        <p>The emitted particle is routed (by Signal/Collect) to the index vertex
that matches its first unmatched triple pattern. If that pattern is, for example,
[ Elvis inspired ?person ], where ?person is a variable, then it will be sent to the
index vertex with ID [ Elvis inspired * ]. This index vertex then sends copies of
the query particle to all its child vertices.</p>
        <p>Once a query particle reaches a triple vertex, the vertex attempts to match
the next unmatched query pattern to its triple. If this succeeds, then a variable
binding is created and the remaining triple patterns are updated with the new
binding. If all triple patterns are matched or a match failed,4 then the query
particle gets routed back to its query vertex. Otherwise, the query particle gets
sent to the index or triple vertex that matches its next unmatched triple pattern.</p>
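<p>The matching step performed at a vertex can be sketched as follows, assuming variables are marked with a leading question mark (a minimal Python sketch, not the actual dictionary-encoded implementation):</p>

```python
# Sketch of binding a triple pattern against a concrete triple. Returns the
# updated bindings, or None when the match fails, e.g. on a conflicting
# binding as in footnote 4: [ ?X inspired ?X ] cannot bind ?X to both
# Elvis and Dylan for the triple [ Elvis inspired Dylan ].

def try_match(pattern, triple, bindings):
    new = dict(bindings)
    for pat, val in zip(pattern, triple):
        if isinstance(pat, str) and pat.startswith("?"):  # a variable
            if new.get(pat, val) != val:
                return None    # conflicting binding for this variable
            new[pat] = val
        elif pat != val:
            return None        # constant position does not match
    return new
```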
        <p>If no index vertex with a matching ID is found, then a handler for
undeliverable messages routes the failed query particle back to its query vertex. So
no matter whether a query particle found a successful binding for all variables or
failed, it ends up being sent back to its query vertex.</p>
        <p>In order to keep track of query execution and determine when a query has
finished processing, each query particle is endowed with a number of tickets. The
first query particle starts out with a very large number of tickets.5</p>
        <p>When a query particle arrives at an index vertex, a copy of the particle is
sent along each edge. The original particle evenly splits up its tickets among
its copies. If there is a remainder, then some particles get an extra ticket. If a
particle does not have at least one ticket per copy, then copies only get sent
along edges for which at least one ticket was assigned, and those particles get
flagged to inform the query vertex that not all matching paths in the graph
were explored. Query execution finishes when the sum of tickets of all failed and
successful query particles received by the query vertex equals the initial ticket
endowment of the first particle that was sent out.
4 A match fails if it creates conflicting bindings: Pattern [ ?X inspired ?X ] fails to bind
to the triple [ Elvis inspired Dylan ], because the variable ?X cannot be bound to both
Elvis and Dylan.
5 We use Long.MaxValue, which has been enough for a complete exploration of all
queries on all datasets that we have experimented with so far.</p>
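<p>The ticket-splitting rule can be sketched as follows (a Python sketch; pairing each ticket share with a completeness flag is our rendering of the mechanism):</p>

```python
# Sketch of splitting a particle's tickets among n copies: every copy gets
# the integer share, and the remainder is handed out one extra ticket at a
# time. With fewer tickets than copies, only copies with a positive share
# are sent, and they carry an incomplete-exploration flag (False).

def split_tickets(tickets, n_copies):
    share, remainder = divmod(tickets, n_copies)
    shares = [share + (remainder > i) for i in range(n_copies)]
    complete = tickets >= n_copies
    return [(t, complete) for t in shares if t > 0]
```

<p>The shares always sum to the original ticket count, which is what lets the query vertex detect termination by summing the tickets it receives back.</p>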
        <p>Once query execution has finished, the query vertex reports the result that
consists of the variable bindings of the successful query particles, and then
removes itself from the graph.</p>
        <p>[Fig. 2: The subgraph for the triples [ Elvis inspired Dylan ] and [ Dylan inspired Jobs ], with the numbered steps 1 to 6 of the example query execution for ?X inspired ?Y, ?Y inspired ?Z; step 5 fails because no vertex with ID [ Jobs inspired * ] exists.]</p>
        <p>As an example illustrating a full query execution, consider the relevant
subgraph created for the triples [ Elvis inspired Dylan ] and [ Dylan inspired Jobs ],
shown in Fig. 2 along with the query processing for the query: (unmatched =
[ ?X inspired ?Y ], [ ?Y inspired ?Z ]; bindings = {}). Execution begins in the
query vertex.
1 Once the query vertex has been added to the graph, it emits a query particle,
which is illustrated in blue. Given its first triple pattern, the query particle
is routed to the index vertex with ID [ * inspired * ].
2 Inside the index vertex, the query particle is split into two particles, one
colored green and the other one orange (for illustration). The tickets of the
blue particle are evenly split among the green and the orange particle. Both
particles are sent down to their respective index vertex, the green one to [ *
inspired Dylan ] and the orange one to [ * inspired Jobs ]. These index vertices,
in turn, send the particles further down to their corresponding triple vertices.
3 The first pattern of the green particle gets matched in the triple vertex [ Elvis
inspired Dylan ]. The triple vertex sends the updated particle (unmatched = [
Dylan inspired ?Z ]; bindings = { ?X=Elvis, ?Y=Dylan }) to the index vertex
with ID [ Dylan inspired * ], which in turn routes the particle towards all
triples that match the next unmatched pattern, [ Dylan inspired ?Z ].
4 From the index vertex, the green particle is routed down to the triple
vertex [ Dylan inspired Jobs ], which binds ?Z to Jobs. As there are no more
unmatched triple patterns, the triple vertex routes the particle containing
successful bindings for all variables back to its query vertex.</p>
        <p>5 The first pattern of the orange particle gets matched in the triple vertex
[ Dylan inspired Jobs ]. The triple vertex sends the updated particle
(unmatched = [ Jobs inspired ?Z ]; bindings = { ?X=Dylan, ?Y=Jobs }) to the
index vertex with ID [ Jobs inspired * ]. The message cannot be delivered,
because no index vertex with that ID is found. The handler for undeliverable
messages reroutes the failed query particle to its query vertex.</p>
        <p>6 The query vertex receives both the successfully bound green and the failed
orange particle. Query execution has finished, because all tickets that were
sent out with the initial blue particle have been received again. The query
vertex reports the successful bindings { ?X=Elvis, ?Y=Dylan, ?Z=Jobs } and
then removes itself from the graph.</p>
        <p>The architecture and query execution scheme we described capture the
essence of how TripleRush works. Next, we explain how we improved them with
regard to performance and memory usage.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Optimizations</title>
        <p>The system, as previously shown, already supports parallel graph exploration.
Here we describe how we implemented and optimized different components.
Dictionary Encoding While the examples we gave so far represented triple
patterns in terms of strings, the actual system implementation operates on a
dictionary encoded representation, where RDF resource identifiers and literals
are encoded by numeric IDs. Both wildcards and variables are also represented
as numeric IDs, but variable IDs are only unique in the context of a specific
query.</p>
        <p>Index Graph Structure We remove triple vertices, and instead store the
triple information inside each of the three third level index vertices that have a
compatible triple pattern. We hence refer to these index vertices as binding index
vertices, because they can bind query particles to triples, which was previously
done by the triple vertices. This change saves memory and reduces the number
of messages sent during query execution.</p>
        <p>One question that arises from this change is: If the subject, predicate and
object of the next unmatched pattern of a particle are already bound, where should
that particle be routed to? With no single triple vertex responsible anymore,
TripleRush load balances such routings over the three binding index vertices
into which the triple information was folded.</p>
        <p>Index Vertex Representation In Fig. 1, one notices that the ID of an index
vertex varies only in one position (the subject, the predicate, or the object)
from the IDs of its children. To reduce the size of the edge representations, we do
not store the entire ID of child vertices, but only the specification of this position,
consisting of one dictionary encoded number per child. We refer to these numbers
as child-ID-refinements. The same reasoning applies to binding index vertices,
where the triples they store only vary in one position from the ID of the binding
index vertex. Analogously, we refer to these differences as triple-ID-refinements.</p>
        <p>Binding index vertices need to be able to check quickly if a triple exists. This
requires fast search over these triple-ID-refinements. We support this by storing
them in a sorted array on which we use the binary search algorithm.</p>
        <p>Index vertices of levels 1 and 2 do not need to check for the existence of
a specific child-ID, as these levels always route to all children. Routing only
requires a traversal of all child-ID-refinements. To support traversal in a
memory-efficient way, we sort the child-ID-refinements, perform delta encoding on them,
and store the encoded deltas in a byte array, using variable length encoding.</p>
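<p>The sorted-delta-varint representation can be sketched as follows. This is an illustrative Python sketch of the scheme described above; the exact variable length encoding used (here: 7 data bits per byte with a continuation bit) is an assumption.</p>

```python
# Sketch of the compact child-ID-refinement representation: sort the IDs,
# delta-encode them, and store each delta as a variable-length quantity.

def encode_refinements(ids):
    out, prev = bytearray(), 0
    for cur in sorted(ids):
        delta, prev = cur - prev, cur
        while True:
            low, delta = delta & 0x7F, delta >> 7
            out.append(low | (0x80 if delta else 0))  # high bit: more bytes
            if not delta:
                break
    return bytes(out)

def decode_refinements(data):
    ids, cur, chunk = [], 0, []
    for byte in data:
        chunk.append(byte & 0x7F)
        if not byte & 0x80:  # high bit clear: last byte of this delta
            cur += sum(b * 128 ** i for i, b in enumerate(chunk))
            ids.append(cur)
            chunk = []
    return ids
```

<p>Small deltas between sorted IDs thus cost a single byte each, which is what makes the traversal-only levels 1 and 2 cheap to store.</p>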
        <p>Note that these array representations are inefficient for modifications, which
is why we initially store the child/triple-ID-refinements in tree sets and switch
the representation to arrays once the loading is done. This design supports fast
inserts at the cost of increased memory usage during loading.</p>
        <p>Query Optimization The number of particle splits performed depends on the
order in which the triple patterns of a query are explored. One important piece of
information to determine the best exploration order is the number of triples that
match a given triple pattern, which we refer to as the cardinality of the pattern.
Because only relative cardinality differences matter for ordering, we can assume
a very large number for the root vertex. The binding index vertices already store
all the triples that match their pattern and thus have access to their cardinalities.
So we only need to determine the cardinality of index vertices on level 2, below
the root vertex and above the binding index vertices. A level 2 index vertex
requests cardinality counts from its binding index vertex children and sums up
the replies. We do this once after graph loading and before executing queries,
but it can be done at any time and could also be done incrementally.</p>
        <p>Query optimization happens only once inside the query vertex before the
query particle is emitted. To support it, the query vertex first sends out
cardinality requests to the vertices in the index graph that are responsible for the
respective triple patterns in the query. These requests get answered in parallel
and, once all cardinalities have been received, we greedily select the pattern with
the lowest cardinality to be matched first. If this match will bind a variable, we
assume that the cardinality of all other patterns that contain this variable is
reduced, because only a subset of the original triples that matched the pattern
would be explored at that point. To this end, we divide the cardinality estimate
for each triple pattern containing bound variables by a constant per bound
variable. In our experiments we set the constant to 100, based on exploration.6 If
all variables in a pattern are bound (meaning that all its variables appear in
6 We tried different factors and this one performed well on the LUBM benchmark. It
also performed well on the DBPSB benchmark, which suggests that it generalizes at
least to some degree.
patterns that will get matched before it), then we assume a cardinality of 1,
designating that at most one triple could match this pattern.</p>
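<p>The greedy ordering, including the division by the constant 100 per bound variable and the cardinality-1 rule for fully bound patterns, can be sketched as follows (a Python sketch with invented names; the real optimizer works on dictionary-encoded patterns):</p>

```python
# Sketch of the greedy, cardinality-based pattern ordering. Variables are
# strings starting with "?"; cardinality maps each pattern to its count.

REDUCTION = 100  # the constant from the paper (per bound variable)

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def order_patterns(patterns, cardinality):
    remaining, bound, ordered = list(patterns), set(), []
    while remaining:
        def estimate(p):
            variables = [t for t in p if is_var(t)]
            n_bound = sum(1 for v in variables if v in bound)
            if variables and n_bound == len(variables):
                return 1  # fully bound: at most one triple can match
            return max(1, cardinality[p] // REDUCTION ** n_bound)
        best = min(remaining, key=estimate)
        remaining.remove(best)
        ordered.append(best)
        bound.update(t for t in best if is_var(t))
    return ordered
```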
        <p>Repeating the procedure, we choose the next pattern with the lowest
cardinality estimate, until all patterns have been ordered. Next, the initial query particle
and all its copies explore the patterns in the order specified by the optimization.
Optimizations to Reduce Messaging Each TripleRush vertex is
transparently assigned to a Signal/Collect worker. A worker is comparable to a
thread that is responsible for messaging and for scheduling the execution of
its assigned vertices.</p>
        <p>Sending many individual messages between different Signal/Collect
workers is inefficient, because it creates contention on the worker message queues. In
order to reduce the number of messages sent, we use a bulk message bus that
bundles multiple messages sent from one worker to another. In order to reduce
message sizes and processing time in the query vertex, we do not send the
actual failed particle back to the query vertex, but only its tickets.7 We also use a
Pregel-like combiner that sums up the tickets in the bulk message bus, to again
reduce the number of messages sent.</p>
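<p>The ticket combiner can be sketched as follows: only the ticket counts of failed particles travel back, and counts addressed to the same query vertex are summed before dispatch (a minimal sketch; the actual combiner operates inside the bulk message bus):</p>

```python
# Sketch of a Pregel-style combiner for failed particles: ticket counts
# addressed to the same query vertex are summed into one message.

def combine_failed(messages):
    """messages: iterable of (query_vertex_id, tickets) pairs."""
    combined = {}
    for qid, tickets in messages:
        combined[qid] = combined.get(qid, 0) + tickets
    return combined
```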
        <p>Because the query vertex is a bottleneck, we further reduce the number of
messages it receives and the amount of processing it does by combining multiple
successful particles into one result buffer before sending them. The query vertex
can concatenate these result buffers in constant time.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In the last section we described the TripleRush architecture and parallel query
execution. In this section we evaluate its performance compared to other
state-of-the-art triple stores.</p>
      <sec id="sec-4-1">
        <title>Performance</title>
        <p>
          TripleRush was built and optimized for query execution performance. In order
to evaluate TripleRush, we wanted to compare it with the fastest related
approaches. Trinity.RDF [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is also based on a parallel in-memory graph store,
and it is, to our knowledge, the best performing related approach. Thus, our
evaluation is most interesting in a scenario where it is comparable to that of
Trinity.RDF. As Trinity.RDF is not available for evaluation, we decided to make
our results comparable by closely following the setup of their published parallel
evaluations. The Trinity.RDF paper also includes results for other in-memory
and on-disk systems that were evaluated with the same setup, which allows us
to compare TripleRush with these other systems in terms of performance.
7 The flag is also necessary and in practice we encode it in the sign.
Datasets and Queries Consistent with the parallel Trinity.RDF [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
evaluation, we benchmarked the performance of TripleRush by executing the same
seven queries on the LUBM-160 dataset (~21 million triples) and the same eight
queries on the DBPSB-10 dataset (~14 million triples). The LUBM (Lehigh
University Benchmark) dataset is a synthetic one, generated with UBA1.7,8 while
the DBPSB (DBpedia SPARQL Benchmark) dataset is generated based on
real-world DBpedia data [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].9 The queries cover a range of different pattern
cardinalities, result set sizes and number of joins. The queries L1-L7 and D1-D8
are listed in the Appendix. These queries only match basic graph patterns and
do not use features unsupported by TripleRush, such as aggregations or filters.
More information about the datasets and the queries is found in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Evaluation Setup In the Trinity.RDF paper, all triple stores are evaluated in
an in-memory setting, while RDF-3X and BitMat are additionally evaluated in
a cold cache setting [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>For evaluating TripleRush, we executed all queries on the same JVM running
on a machine with two twelve-core AMD Opteron 6174 processors and 66 GB
RAM, which is comparable to the setup used for the evaluation of Trinity.RDF.10
The whole set of queries was run 100 times before the measurements in order
to warm up the JIT compiler, and garbage collections were triggered before the
actual query executions. All query executions were complete, no query particle
ever ran out of tickets. We repeated this evaluation 10 times.11</p>
        <p>The execution time covers everything from the point where a query is
dispatched to TripleRush until the results are returned. Consistent with the
Trinity.RDF setup12, the execution times do include the time used by the query
optimizer, but do not include the mappings from literals/resources to IDs in the
query, nor the reverse mappings for the results.</p>
        <p>
          Result Discussion The top entries in Tables 1 and 2 show the minimum
execution times over 10 runs. According to our inquiry with the authors of the
Trinity.RDF paper [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], this is consistent with their measurements. Additionally, we
also report the average execution times for TripleRush. TripleRush performs
fastest on six of the seven LUBM queries, and on all DBPSB queries. For the
query where TripleRush is not the fastest system, it is the second fastest system.
        </p>
        <p>On all queries, TripleRush is consistently faster than Trinity.RDF. In the
geometric mean of both benchmarks, TripleRush is more than three times faster
than Trinity.RDF, between seven and eleven times faster than RDF-3X (in
memory) and between 31 and 103 times faster than BitMat (in memory). For
individual queries the results are even more pronounced: On query L7 TripleRush
is about ten times faster than Trinity.RDF, on L1 it is more than two orders of
magnitude faster than RDF-3X (in memory) and on L4 TripleRush is more than
three orders of magnitude faster than BitMat (in memory).</p>
        <p>These results indicate that the performance of TripleRush is competitive
with, or even superior to, other state-of-the-art triple stores.
[Table 1 (excerpt), average over 10 runs for TripleRush: 89.3, 60.1, 84.1, 1.7, 1.3, 2.3, 69.4; geometric mean 14.8.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>Data Scalability</title>
        <p>Performance is a very important characteristic for a triple store, but it is also
important that the query execution time scales reasonably when queries are
executed over more triples.</p>
<p>We evaluate the data scalability of TripleRush by executing the LUBM
queries L1-L7 with the same setup as in subsection 4.1, but varying the dataset
size from 10 to 320 universities and measuring the average time over 10 runs.
The execution time for queries L1-L3 and L7 should increase with the size of the
dataset, which is proportional to the number of universities. Queries L4-L6 are
tied to a specific university and, given a good query plan, should take
approximately the same time to execute, regardless of the dataset size. The left chart in
Fig. 3 shows the execution times on a linear scale, while the right chart shows
both the number of universities and the execution times for queries L1-L3 and
L7 on a logarithmic scale. We see that queries L2 and L7 scale slightly
sublinearly. Queries L1 and L3 scale almost linearly until LUBM-160, and then by a
factor of more than three on the step to LUBM-320. We are uncertain about
what causes this difference.
As expected, the results in the left chart in Fig. 3 show that for queries L4-L6
the query execution time does not increase with the dataset size.</p>
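<p>The scaling behavior described above can be quantified from runtimes measured at doubling dataset sizes. A minimal sketch with hypothetical runtimes (not the measured data):</p>

```python
def growth_factors(runtimes):
    """Per-step growth factors for runtimes measured at doubling dataset
    sizes: a factor near 2.0 means linear scaling, below 2.0 sublinear,
    and near 1.0 constant time."""
    return [b / a for a, b in zip(runtimes, runtimes[1:])]

# Hypothetical runtimes (ms) at 10, 20, 40, and 80 universities.
linear_like   = growth_factors([10.0, 19.0, 37.0, 75.0])  # grows with data
constant_like = growth_factors([1.7, 1.7, 1.8, 1.7])      # size-independent
```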
        <p>Overall, this evaluation suggests that TripleRush query execution times scale
as expected with increased dataset sizes, but leaves an open question related to
the scaling of queries L1 and L3 on LUBM-320.</p>
        <p>!)s3.5  
(m 3  
iem2.5  
tn 2  
o
i
tu1.5  
xce 1  
.e0.5  
g
vA 0  
10   20   40   80   160   320  </p>
        <p>LUBM Universities!
10   20   40   80   160   320  </p>
        <p>
Another important aspect of a triple store is its memory usage and how it
increases with dataset size. In order to evaluate this aspect, we measured the
memory usage of TripleRush after loading LUBM dataset sizes ranging from 10
to 320 universities and optimizing their index representations (smallest memory
footprint of the entire program over 10 runs). Figure 4 shows that the memory usage
increases slightly sublinearly for this dataset. The memory footprint of TripleRush
is 5.8 GB when the 21 347 998 triples in the LUBM-160 dataset are loaded. This
is equivalent to 272 bytes per triple for this dataset size. TripleRush requires
3.8 GB for the DBPSB-10 dataset with 14 009 771 triples, which is equivalent
to 271 bytes per triple. This is a factor of 2 to 3.6 larger than the
index sizes measured for these datasets in Trinity.RDF [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], but far from the
index size of 19 GB measured for DBPSB-10 in BitMat [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>Currently, graph loading and index optimization for LUBM-160 takes as
little as 106 seconds (without dictionary encoding, average over 10 runs). This
is because the tree set data structure we use during graph loading supports
fast insertions. The flip side is the high memory usage, which causes the graph
loading of the LUBM-320 dataset to take excessively long. Most of that time is
spent on garbage collection, most likely because the entire assigned 31 GB heap
is used up during loading. After loading is finished, the index representation
optimizations reduce the memory usage to a bit more than 11 GB.</p>
<p>Overall, the index size of TripleRush is rather large, but in many cases this is
a reasonable tradeoff, given the performance.</p>
<p>Our current investigation and design have a number of limitations that should
be addressed in future work.</p>
        <p>First, we need to evaluate TripleRush with additional benchmarks.</p>
        <p>Second, and more importantly, we need to investigate the performance of
TripleRush on larger graphs, in a distributed setting. Whilst we are optimistic
that some of its optimizations will help even in the distributed setting, it is
unclear what the overall performance impact of increased parallelism and increased
messaging cost will be.</p>
        <p>Third, TripleRush was not built with SPARQL feature-completeness in mind.
Many SPARQL constructs, such as filters and aggregates, were not implemented.</p>
        <p>Fourth, the current query optimization is very simple and could be improved.</p>
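<p>One common improvement is a greedy, cardinality-driven ordering of triple patterns. The sketch below is illustrative only and assumes externally supplied cardinality estimates; it is not the TripleRush optimizer:</p>

```python
def order_patterns(patterns, cardinality):
    """Greedy join ordering: start with the most selective pattern
    (lowest estimated cardinality), then repeatedly pick the cheapest
    pattern that shares a variable with the ones already chosen
    (falling back to any remaining pattern)."""
    def variables(pattern):
        return {term for term in pattern if term.startswith("?")}

    remaining = sorted(patterns, key=cardinality)
    ordered = [remaining.pop(0)]
    bound = variables(ordered[0])
    while remaining:
        # Prefer patterns that join with already-bound variables.
        joinable = [p for p in remaining if variables(p) & bound] or remaining
        nxt = min(joinable, key=cardinality)
        remaining.remove(nxt)
        ordered.append(nxt)
        bound |= variables(nxt)
    return ordered

# Hypothetical cardinality estimates for the patterns of a small query.
est = {
    ("?x", "rdf:type", "ub:Course"): 1_000,
    ("?x", "ub:name", "?y"): 100_000,
}
plan = order_patterns(list(est), est.__getitem__)
```

<p>Here the selective rdf:type pattern is evaluated first, so the unselective name pattern only has to extend already-restricted bindings.</p>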
        <p>Fifth, the root vertex is a communication bottleneck. Potential avenues for
addressing this are to disallow a completely unbound query, which would retrieve
the whole database, or to partition this vertex.</p>
        <p>Sixth, the memory usage during graph loading should be reduced without
overly slowing down the loading times.</p>
        <p>Seventh, although the hardware we ran the benchmarks on had a lower
performance score, it is desirable to do a comparison with Trinity.RDF on exactly
the same hardware.</p>
<p>Even in light of these limitations, TripleRush is a highly competitive
system in terms of query execution performance. To our knowledge, it is the first
triple store that decomposes both storage and query execution into interconnected
processing elements, thereby achieving a high degree of parallelism that holds
the promise of transparent distributed scalability.</p>
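<p>The decomposition idea can be illustrated with a toy sequential matcher in which a partially matched query forks into one copy per triple that satisfies its next pattern, and fully matched copies become results. This is a conceptual sketch only, not the Signal/Collect-based implementation:</p>

```python
def match(triples, patterns, bindings=None):
    """Match a list of triple patterns against in-memory triples.
    Each partial match 'forks' into one copy per matching triple,
    loosely mirroring how TripleRush routes partially matched query
    copies through its index graph (here: plain sequential recursion)."""
    bindings = bindings or {}
    if not patterns:
        return [bindings]                      # fully matched copy
    first, rest = patterns[0], patterns[1:]
    results = []
    for triple in triples:
        candidate = dict(bindings)             # fork a copy of the match
        for term, value in zip(first, triple):
            if term.startswith("?"):           # variable: bind or verify
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:                # constant: must match
                break
        else:
            results.extend(match(triples, rest, candidate))
    return results

# Toy data shaped like LUBM query L5.
triples = [
    ("group1", "ub:subOrganizationOf", "dept0"),
    ("group1", "rdf:type", "ub:ResearchGroup"),
    ("group2", "ub:subOrganizationOf", "dept0"),
    ("group2", "rdf:type", "ub:Department"),
]
query = [
    ("?X", "ub:subOrganizationOf", "dept0"),
    ("?X", "rdf:type", "ub:ResearchGroup"),
]
answers = match(triples, query)
```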
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
<p>The need for efficient querying of large graphs lies at the heart of most Semantic
Web applications. The last decade of research in this area has shown tremendous
progress based on a database-inspired paradigm. Parallelizing these centralized
architectures is a complex task. The advent of multi-core computers, however,
calls for approaches that exploit parallelization.</p>
      <p>In this paper we presented TripleRush, an in-memory triple store that
inherently divides the query execution among a large number of active processing
elements that work towards a solution in parallel. We showed that this approach
is both fast and scalable.</p>
<p>Whilst TripleRush has its limitations, it is a step towards providing
high-performance triple stores that inherently embrace parallelism.</p>
      <p>Acknowledgments We would like to thank the Hasler Foundation for the
generous support of the Signal/Collect Project under grant number 11072
and Alex Averbuch as well as Cosmin Basca for their feedback on our ideas.</p>
<p>
        LUBM evaluation queries, originally used in the BitMat evaluation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and selected by its authors from the OpenRDF LUBM queries.
      </p>
      <p>L1: SELECT ?X ?Y ?Z WHERE {
?Z ub:subOrganizationOf ?Y.
?Y rdf:type ub:University.
?Z rdf:type ub:Department.
?X ub:memberOf ?Z.
?X rdf:type ub:GraduateStudent.
?X ub:undergraduateDegreeFrom ?Y.
}</p>
      <p>L2: SELECT ?X ?Y WHERE {
?X rdf:type ub:Course.
?X ub:name ?Y.
}</p>
      <p>L3: SELECT ?X ?Y ?Z WHERE {
?X rdf:type ub:UndergraduateStudent.
?Y rdf:type ub:University.
?Z rdf:type ub:Department.
?X ub:memberOf ?Z.
?Z ub:subOrganizationOf ?Y.
?X ub:undergraduateDegreeFrom ?Y.
}</p>
<p>L4: SELECT ?X ?Y1 ?Y2 ?Y3 WHERE {
?X ub:worksFor &lt;http://www.Department0.University0.edu&gt;.
?X rdf:type ub:FullProfessor.
?X ub:name ?Y1.
?X ub:emailAddress ?Y2.
?X ub:telephone ?Y3.
}</p>
      <p>L5: SELECT ?X WHERE {
?X ub:subOrganizationOf &lt;http://www.Department0.University0.edu&gt;.
?X rdf:type ub:ResearchGroup.
}</p>
      <p>L6: SELECT ?X ?Y WHERE {
?Y ub:subOrganizationOf &lt;http://www.University0.edu&gt;.
?Y rdf:type ub:Department.
?X ub:worksFor ?Y.
?X rdf:type ub:FullProfessor.
}</p>
      <p>L7: SELECT ?X ?Y ?Z WHERE {
?Y ub:teacherOf ?Z.
?Y rdf:type ub:FullProfessor.
?Z rdf:type ub:Course.
?X ub:advisor ?Y.
?X rdf:type ub:UndergraduateStudent.
?X ub:takesCourse ?Z.
}</p>
      <p>DBPSB evaluation queries, received courtesy of Kai Zeng.</p>
      <p>PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX dbpprop: &lt;http://dbpedia.org/property/&gt;
PREFIX dbpres: &lt;http://dbpedia.org/resource/&gt;
PREFIX rdfcore: &lt;http://www.w3.org/2004/02/skos/core#&gt;</p>
      <p>D1: SELECT ?X WHERE {
?Y rdfcore:subject dbpres:Category:First-person_shooters.
?Y foaf:name ?X.
}</p>
      <p>D2: SELECT ?X WHERE {
?Z foaf:homepage ?Y.
?Z rdf:type ?X.
}</p>
      <p>D3: SELECT ?X ?Y ?Z WHERE {
?Z rdfcore:subject dbpres:Category:German_musicians.
?Z foaf:name ?X.
?Z rdfs:comment ?Y.
}</p>
      <p>D4: SELECT ?W ?X ?Y ?Z WHERE {
?Z dbo:birthPlace dbpres:Berlin.
?Z dbo:birthDate ?X.
?Z foaf:name ?W.
?Z dbo:deathDate ?Y.
}</p>
      <p>D5: SELECT ?X ?Y ?Z WHERE {
?Z rdfcore:subject dbpres:Category:Luxury_vehicles.
?Z foaf:name ?Y.
?Z dbo:manufacturer ?W.
?W foaf:name ?X.
}</p>
      <p>D6: SELECT ?Z1 ?Z2 ?Z3 ?Z4 WHERE {
?X rdf:type ?Y.
?X dbpprop:name ?Z1.
?X dbpprop:pages ?Z2.
?X dbpprop:isbn ?Z3.
?X dbpprop:author ?Z4.
}</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hollenbach</surname>
          </string-name>
          .
          <article-title>Scalable Semantic Web Data Management Using Vertical Partitioning</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Very Large Data Bases</source>
          , pages
          <volume>411</volume>
–
          <fpage>422</fpage>
          . VLDB Endowment,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Atre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaoji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          , and
<string-name>
            <given-names>J. A.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Matrix "bit" loaded: a scalable lightweight join query processor for rdf data</article-title>
          .
          <source>In Proceedings of the 19th international conference on World wide web, WWW '10</source>
          , pages
          <fpage>41</fpage>
–
          <fpage>50</fpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
<source>MapReduce: simplified data processing on large clusters. Communications of the ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>1</issue>
          ):
          <volume>107</volume>
–
          <fpage>113</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bickson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>Powergraph: distributed graph-parallel computation on natural graphs</article-title>
          .
          <source>In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation</source>
          , OSDI'
          <volume>12</volume>
          , pages
          <fpage>17</fpage>
–
          <fpage>30</fpage>
          , Berkeley, CA, USA,
          <year>2012</year>
          . USENIX Association.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          .
          <article-title>Scalable sparql querying of large rdf graphs</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>11</issue>
          ):
          <volume>1123</volume>
–
          <fpage>1134</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          .
          <article-title>Robust runtime optimization and skew-resistant execution of analytical sparql queries on pig</article-title>
. In P. Cudré-Mauroux, J. Heflin, E. Sirin,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tudorache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauswirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Parreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and E. Blomqvist, editors,
          <source>International Semantic Web Conference (1)</source>
          , volume
          <volume>7649</volume>
          of Lecture Notes in Computer Science, pages
          <volume>247</volume>
–
          <fpage>262</fpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kyrola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          .
          <article-title>Graphlab: A new parallel framework for machine learning</article-title>
          .
<source>In Conference on Uncertainty in Artificial Intelligence (UAI)</source>
          , Catalina Island, California,
          <year>July 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>G.</given-names>
            <surname>Malewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Austern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J. C.</given-names>
            <surname>Bik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Dehnert</surname>
          </string-name>
          , I. Horn,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Czajkowski.</surname>
          </string-name>
          <article-title>Pregel: a system for large-scale graph processing</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <volume>135</volume>
–
          <fpage>146</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>Dbpedia sparql benchmark: performance assessment with real queries on real data</article-title>
          .
          <source>In Proceedings of the 10th international conference on The semantic web - Volume Part I, ISWC'11</source>
          , pages
          <fpage>454</fpage>
–
          <fpage>469</fpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Scalable join processing on very large rdf graphs</article-title>
          .
          <source>In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD '09</source>
          , pages
          <fpage>627</fpage>
–
          <fpage>640</fpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
<string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>The RDF-3X engine for scalable management of RDF data</article-title>
          .
          <source>The VLDB Journal.The International Journal on Very Large Data Bases</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <volume>91</volume>
–
          <fpage>113</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>B.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>The trinity graph engine</article-title>
          .
<source>Technical Report 161291, Microsoft Research</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>P.</given-names>
            <surname>Stutz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          . Signal/Collect:
          <article-title>Graph Algorithms for the (Semantic) Web</article-title>
. In P. F. Patel-Schneider et al., editors,
          <source>International Semantic Web Conference (ISWC)</source>
          <year>2010</year>
          , volume LNCS
          <volume>6496</volume>
, pages
          <volume>764</volume>
          –
          <fpage>780</fpage>
          . Springer, Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
<given-names>C.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Hexastore: sextuple indexing for semantic web data management</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>1008</volume>
–
          <fpage>1019</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>K.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A distributed graph engine for web scale rdf data</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>