
Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store

Vaibhav Khadilkar

Murat Kantarcioglu

Bhavani Thuraisingham

Paolo Castagna

Talis Systems Ltd; The University of Texas at Dallas, USA

Lack of scalability is one of the most significant problems faced by single-machine RDF data stores. The advent of Cloud Computing has paved the way for a distributed ecosystem of RDF triple stores that can potentially allow planet-scale storage along with distributed query processing capabilities. Towards this end, we present Jena-HBase, an HBase-backed triple store that can be used with the Jena framework. Jena-HBase provides end-users with a scalable storage and querying solution that supports all features from the RDF specification.

1 Introduction

triples. In contrast, existing systems that use a MapReduce-based query engine for processing RDF data are optimized for query performance; however, they are currently unable to support all features from the RDF specification. Our motivation with Jena-HBase is to provide end-users with a cloud-based RDF storage and querying API that supports all features from the RDF specification.

Our contributions: Jena-HBase provides the following: (a) A variety of custom-built RDF data storage layouts for HBase that offer different tradeoffs between query performance and storage. (b) Support for reification, inference and SPARQL processing through the implementation of appropriate Jena interfaces.
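As a brief illustration of what reification support involves, the sketch below expands one triple into the four statements of the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The function name `reify`, the `ex:` identifiers and the tuple representation are assumptions made for this example; how Jena-HBase stores reified statements internally is not described here.

```python
# Standard RDF namespace; the vocabulary terms below come from the RDF specification.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(triple, statement_id):
    """Return the four triples that describe `triple` as an rdf:Statement.

    `statement_id` is a hypothetical identifier for the reified statement.
    """
    s, p, o = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    ]


# Example: attach a (made-up) source annotation to a statement.
original = ("ex:article1", "dc:creator", "ex:alice")
reified = reify(original, "ex:stmt1")
annotation = ("ex:stmt1", "ex:assertedBy", "ex:bob")
```

Once a statement has its own identifier, further triples such as `annotation` can be made about it, which is why a store has to handle these extra statements to claim full support for the feature.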

2 Jena-HBase Architecture

We have performed benchmark experiments using SP2Bench (non-inference queries) [4] and LUBM (inference queries) [5] to determine the best layout currently available in Jena-HBase, as well as to compare the performance of the best layout with Jena TDB. We have compared Jena-HBase only with Jena TDB and not with other Hadoop-based systems for the following reasons: (i) Jena TDB gives the best query performance of all available Jena storage subsystems. (ii) The available Hadoop-based systems do not implement all features from the RDF specification. In this section, we show results only for Q1 and Q9 of SP2Bench and Q1 and Q10 of LUBM; however, these results are indicative of the overall trend [3]. Additionally, the figures show only query times and do not include loading times. Finally, as a part of the procedure to determine the best layout, we ran both benchmarks over several graph sizes, but we show results only for a graph of 250137 triples for SP2Bench and for a graph of 5 universities (≈ 560K triples) for LUBM (Fig. 2). Although we used a small graph size, it is still sufficient for determining the best Jena-HBase layout. Since LUBM contains inference queries, we used the Pellet reasoner (v2.3.0) to perform inference.

Layout Type
Simple: 3 tables each indexed by subjects, predicates and objects
Vertically Partitioned (VP): For every unique predicate, two tables, each indexed by subjects and objects
Indexed: Six tables representing the six possible combinations of a triple, namely SPO, SOP, PSO, POS, OSP and OPS
VP and Indexed: VP layout with additional tables for SPO, OSP and OS
Hybrid: Simple + VP layouts
Hash: Hybrid layout with hash values for nodes and a separate table containing hash-to-node mappings
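As a rough illustration of the Simple, VP and Hybrid layouts described in the table above, the following sketch models each HBase table as an in-memory Python dictionary. The function names, the tuple encoding of rows, and the rule that a Hybrid lookup uses the VP tables whenever the predicate is bound are all assumptions made for this example, not the actual Jena-HBase schema or code.

```python
from collections import defaultdict


def build_simple(triples):
    """Simple layout: three tables, indexed by subject, predicate and object."""
    tables = {"S": defaultdict(list), "P": defaultdict(list), "O": defaultdict(list)}
    for s, p, o in triples:
        tables["S"][s].append((p, o))
        tables["P"][p].append((s, o))
        tables["O"][o].append((s, p))
    return tables


def build_vp(triples):
    """VP layout: for every unique predicate, one table by subject and one by object."""
    tables = defaultdict(lambda: {"S": defaultdict(list), "O": defaultdict(list)})
    for s, p, o in triples:
        tables[p]["S"][s].append(o)
        tables[p]["O"][o].append(s)
    return tables


def find(simple, vp, s=None, p=None, o=None):
    """Hybrid-style lookup (assumed rule): VP tables if the predicate is bound,
    Simple tables otherwise. None means an unbound position."""
    if p is not None:
        if p not in vp:
            return []
        pred = vp[p]
        if s is not None:
            return [(s, p, x) for x in pred["S"].get(s, []) if o is None or x == o]
        if o is not None:
            return [(x, p, o) for x in pred["O"].get(o, [])]
        return [(x, p, y) for x, objs in pred["S"].items() for y in objs]
    if s is not None:
        return [(s, q, x) for q, x in simple["S"].get(s, []) if o is None or x == o]
    if o is not None:
        return [(x, q, o) for x, q in simple["O"].get(o, [])]
    return [(x, q, y) for x, rows in simple["S"].items() for q, y in rows]


# Made-up example data.
data = [
    ("ex:a", "rdf:type", "ex:Article"),
    ("ex:a", "dc:creator", "ex:bob"),
    ("ex:b", "dc:creator", "ex:bob"),
]
simple_tables, vp_tables = build_simple(data), build_vp(data)
by_creator = find(simple_tables, vp_tables, p="dc:creator", o="ex:bob")
```

In this model a pattern with a bound predicate touches only the small per-predicate tables, while a pattern with only a subject or object bound falls back to the Simple tables, which is one plausible reading of "Simple + VP layouts".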

[Fig. 2: four panels, "Graph Querying - Q1" and "Graph Querying - Q9" (SP2Bench), "Graph Querying - Q1" and "Graph Querying - Q10" (LUBM); y-axis: Querying Time (sec); series: Simple, Vertically-Partitioned, Indexed, VP-Indexed, Hybrid, Hash]

Fig. 2 shows a comparison of all layouts for Q1 and Q9 of SP2Bench and Q1 and Q10 of LUBM. We see that the Hybrid layout gives the best results since it combines the advantages of the Simple (Q9 of SP2Bench and Q10 of LUBM) and VP (Q1 of SP2Bench) layouts. The Indexed, VP-Indexed and Hash layouts require longer querying times, since they require multiple row lookups in the SPO, SOP, PSO, POS, OSP and OPS tables (Indexed case) or the SPO, OSP and OS tables (VP-Indexed case) or the mapping table (Hash case).
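The extra work behind the Indexed and Hash layouts can be sketched as follows. In this simplified model, an Indexed store keeps one row per triple in each of the six permutation tables, and a Hash store keeps hashed node values plus a hash-to-node mapping table that must be consulted to turn results back into nodes. The row-key format (`|`-joined strings), the SHA-1 prefix used as the hash, and all function names are assumptions for illustration only.

```python
import hashlib
from itertools import permutations

# The six orderings of a triple: SPO, SOP, PSO, POS, OSP, OPS.
ORDERS = ["".join(order) for order in permutations("SPO")]


def index_rows(triple):
    """Return one row key per index table for a triple (key format is assumed)."""
    parts = dict(zip("SPO", triple))
    return {order: "|".join(parts[c] for c in order) for order in ORDERS}


def node_hash(node):
    """Short, stable hash of a node (the choice of hash function is an assumption)."""
    return hashlib.sha1(node.encode("utf-8")).hexdigest()[:8]


def hash_encode(triples):
    """Hash layout sketch: hashed triples plus a hash-to-node mapping table."""
    mapping = {}
    encoded = []
    for triple in triples:
        hashed = tuple(node_hash(node) for node in triple)
        mapping.update(zip(hashed, triple))
        encoded.append(hashed)
    return encoded, mapping


def hash_decode(encoded, mapping):
    """Translate hashed triples back to nodes: one mapping lookup per node."""
    return [tuple(mapping[h] for h in triple) for triple in encoded]


# Made-up example triple.
rows = index_rows(("ex:a", "dc:creator", "ex:bob"))
stored, node_table = hash_encode([("ex:a", "dc:creator", "ex:bob")])
```

Here `rows` has six entries for a single triple, and `hash_decode` needs three mapping-table lookups per result triple, which mirrors the additional row lookups mentioned above.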

Fig. 3 shows a comparison of the Hybrid layout with Jena TDB for increasing graph sizes. Note that Jena-HBase values have been scaled down for Q1 (by 1000) and Q9 (by 100) of SP2Bench and for Q1 (by 10) of LUBM for a clear comparison. We see that TDB outperforms the Hybrid layout in the 1M to 25M range for SP2Bench and in the N = 50 to 500 range (N is the number of universities) for Q1 of LUBM. This is expected since for these ranges TDB is able to create and maintain the necessary B+ graph indices in memory, thus resulting in a shorter query execution time. Jena-HBase requires multiple graph pattern matches on increasing graph sizes over a distributed cluster, thus making it slower than TDB. We also observe that TDB fails to execute Q10 of LUBM for the N = 50 to 500 range, since the test program runs out of memory during the process of inference for this range. The Hybrid layout successfully executes Q10 for this range, since the reasoner is able to construct the necessary inference-related data structures. Finally, we observe that Jena-HBase is more scalable than TDB, which fails to construct graphs with 100M triples for SP2Bench and N = 1000 universities for LUBM, thereby preventing the execution of any query on these graphs.

[Fig. 3: three panels comparing TDB and Hybrid, "Graph Querying - Q1" and "Graph Querying - Q9" (SP2Bench), "Graph Querying - Q1" (LUBM)]

4 Conclusion

In this paper, we show that creating a distributed RDF storage framework with existing cloud computing tools results in a scalable data processing solution. Additionally, our solution maintains a reasonable query execution time overhead when compared with a single-machine RDF storage framework (viz. Jena TDB).

Acknowledgements

This work was partially supported by The Air Force Office of Scientific Research MURI-Grant FA-9550-08-1-0265 and Grant FA-9550-08-1-0260, National Institutes of Health Grant 1R01LM009989, National Science Foundation (NSF) Grant Career-CNS-0845803, and NSF Grants CNS-0964350, CNS-1016343 and CNS-1111529. We thank Dr. Robert Herklotz for his support.

References

[1] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411-422, 2007.

[2] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. Technical report, HP Laboratories, 2003.

[3] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf

[4] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL Performance Benchmark. In ICDE, pages 222-233, 2009.

[5] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158-182, 2005.