How to feed Apache HBase with petabytes of RDF data: An extremely scalable RDF store based on the Eclipse RDF4J framework and the Apache HBase database

Adam Sotona and Stefan Negru, MSD IT Prague, Czech Republic
{firstname.lastname}@merck.com

Abstract. The majority of systems designed to handle big RDF data rely on a single high-end computer dedicated to a particular RDF dataset and do not easily scale out. At the same time, the clustered solutions we tested were unsatisfying in both features and benchmark results. In this paper we describe a system designed to tackle these issues by connecting RDF4J¹ and Apache HBase² into an extremely scalable RDF store.

1 Introduction

"The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries"³. One of the key aspects of the Semantic Web is RDF⁴ (Resource Description Framework), a framework that enables the (re)use, discoverability, accessibility and integration of data from multiple diverse sources. Considering the exponential growth of Linked Open Data⁵ sources, as well as of internal information across organizations, there is an increasing need for scalable storage solutions for RDF data.

Following an evaluation of existing technologies with high potential for handling big semantic data, we derived the following set of requirements for such a system:

1. Handle petabytes of heterogeneous RDF data and scale linearly as the data grows
2. Share common hardware (ideally a cheap "commodity hardware" cluster)
3. Support SPARQL⁶ version 1.1 and provide basic rule-based reasoning [3]
4. Provide fast bulk and batch data operations and enable exploration and integration of the datasets
5. Export graphs and selected tuples to down-stream services and applications
6. Support multiple teams and individuals working on various tasks
7.
Be open for down-stream services and applications

¹ Eclipse RDF4J: http://rdf4j.org/
² Apache HBase: http://hbase.apache.org/
³ W3C Semantic Web: https://www.w3.org/2001/sw/
⁴ RDF: https://www.w3.org/2001/sw/wiki/RDF
⁵ Linked Open Data Cloud: http://lod-cloud.net/
⁶ SPARQL 1.1: http://www.w3.org/TR/sparql11-query/

2 Key Components

Two key aspects correlate with a scalable Triple/RDF store: a scalable database and a dependable framework for working with RDF data. Our proposed architecture for such a system is shown in Figure 1.

Eclipse RDF4J [6] is the framework, selected mostly for its features and the fact that it facilitates meeting the identified requirements (including handling petabytes of data and supporting SPARQL 1.1). Further reasons are its extensibility through the Storage And Inference Layer (SAIL) to support various RDF stores and inference engines, its test coverage and its extensive documentation.

Apache HBase is the Hadoop⁷ database, a distributed and scalable big data store. It is designed to scale from single servers up to thousands of machines, thus providing a flexible and reliable means of handling big RDF data.

Fig. 1. HBase RDF Store Diagram

Bulk Data Operations

Bulk data operations are critical to the overall system efficiency. A system capable of holding and querying hundreds of billions of triples must also be able to load them, create snapshots, and clone and export datasets in bulk operations.

Our current bulk load is implemented as a simple MapReduce application consuming RDF data in any standard format, with each reducer task producing HBase table region files directly. Bulk load performance scales linearly with the Hadoop cluster; measurements of load and indexing performance show on average about 40,000 quads per second per cluster node.
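The bulk load described above boils down to emitting, for each parsed statement, one HBase row key per index ordering, which the reducers then write out in sorted order as region files. The following is a minimal sketch of that key construction, assuming a hash-based fixed-length encoding of RDF terms and one-byte index prefixes; the actual encoding used by the system may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch: map one RDF quad to row keys for six index
 * orderings (SPO, POS, OSP, CSPO, CPOS, COSP). The term encoding and
 * prefix scheme here are assumptions for demonstration only.
 */
public class QuadKeySketch {

    // Hypothetical fixed-length digest of an RDF term (assumption).
    static byte[] hashTerm(String term) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(term.getBytes(StandardCharsets.UTF_8));
            byte[] shortened = new byte[8];
            System.arraycopy(d, 0, shortened, 0, 8);
            return shortened;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // One row key: a one-byte index prefix followed by the hashed terms.
    static byte[] key(int prefix, byte[]... parts) {
        ByteBuffer buf = ByteBuffer.allocate(1 + parts.length * 8);
        buf.put((byte) prefix);
        for (byte[] p : parts) buf.put(p);
        return buf.array();
    }

    /** Row keys for a quad: three triple indexes plus three context indexes. */
    public static List<byte[]> indexKeys(String s, String p, String o, String c) {
        byte[] sh = hashTerm(s), ph = hashTerm(p), oh = hashTerm(o), ch = hashTerm(c);
        List<byte[]> keys = new ArrayList<>();
        keys.add(key(0, sh, ph, oh));     // SPO
        keys.add(key(1, ph, oh, sh));     // POS
        keys.add(key(2, oh, sh, ph));     // OSP
        keys.add(key(3, ch, sh, ph, oh)); // CSPO
        keys.add(key(4, ch, ph, oh, sh)); // CPOS
        keys.add(key(5, ch, oh, sh, ph)); // COSP
        return keys;
    }
}
```

Because all the information lives in the row key and values stay empty, a reducer only needs to sort these keys to produce region files directly, which is what lets the load scale linearly with the number of nodes.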
Bulk export is also a MapReduce application, exporting the whole dataset in parallel; measured performance shows on average about 400,000 quads per second per cluster node. Bulk merge is a multi-step operation composed of snapshotting the source datasets and directly loading all HBase region files into one merged HBase table.

⁷ Apache Hadoop: https://hadoop.apache.org/

Achieving such performance requires a proper mapping of RDF to Apache HBase, as the mapping determines the performance and scalability of the whole system. After several iterations over various mapping approaches, we found an optimal RDF mapping with satisfying performance; it is shown in Table 1.

Table 1. RDF mapping to Apache HBase. Statements are stored in a single HBase table with a single column family. All information is encoded in the row keys; column qualifiers and values are not used (empty), and regions are ordered by row keys.

  Index (row-key prefix)   RDF triple   RDF quad
  SPO  (0)                 yes          yes
  POS  (1)                 yes          yes
  OSP  (2)                 yes          yes
  CSPO (3)                 -            yes
  CPOS (4)                 -            yes
  COSP (5)                 -            yes

RDF4J SAIL Implementation

The Storage and Inference Layer (SAIL) is the key interface through which the RDF4J framework communicates with various storage and/or inference sub-systems. A minimalistic HBase implementation of the RDF4J SAIL works perfectly even though it was not originally designed for big data operations. We further optimized performance by stripping transaction support and streamlining the dataflow through the RDF4J SAIL and the RDF4J framework. A brief performance measurement was performed using the Berlin SPARQL Benchmark [4] and its Explore use case query mix. For a 10 billion triples dataset, the average is 7.8 QMPH (query mixes per hour) per single cluster node. With query #5 excluded from the query mix, it shows a more promising 588 QMPH per single cluster node involved in the benchmark.

With regard to reasoning, our target dataset sizes limit us to basic inferences over RDFS+ [1] and similar rule sets.
Given HBase's batch-processing mode and high storage limits, we decided to use batch processes to physically materialize inferred triples. More precisely, we developed a tool that performs SPARQL-based reasoning in batch mode.

Parallel SPARQL Evaluation Strategies

The RDF4J framework comes with a standard SPARQL 1.1 compliant evaluation strategy. This strategy uses a single-threaded pull model, where data are requested from the underlying RDF store as needed during evaluation. This model shows excellent results on low-latency RDF stores, but it does not perform well enough on distributed storage systems. Our improved SPARQL evaluation strategy uses a parallel push model: each SPARQL query tree is expanded, requests are processed asynchronously, and data are pushed back for evaluation when ready. Proper prioritization of the asynchronous processes is key to letting the data flow through the system with minimal latency and without memory overflows.

Another experimental parallelization strategy targets parallel SPARQL-based export. It uses multiple distinct jobs working on the same SPARQL query, together with a custom SPARQL filtering function, to avoid duplicate work and duplicate data. The parallel architecture also allows us to spawn multiple SPARQL endpoints and balance the load between them. As a result, the above-mentioned export and query performance figures can be multiplied by the number of involved cluster nodes, up to the cluster size.

3 Related work

A similar direction of research was taken by Jianling Sun et al. [5] in storing RDF in HBase with a mapping related to the one presented in Table 1, and Apache Rya [2] has been developed in the same spirit. Nevertheless, Apache Rya [3] is still an Apache Incubator project, and its current requirements diverge from ours, yet there is a vision that our efforts might merge in the future.
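The parallel SPARQL-based export described above depends on a filtering function that assigns each query solution to exactly one of the cooperating jobs, so that their outputs are disjoint but together cover the full result. A minimal sketch of that partitioning idea, assuming a simple hash-modulo scheme (the paper does not specify the actual filter function, and the class and method names here are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Sketch of the partitioning idea behind parallel SPARQL-based export:
 * several jobs evaluate the same query, and a filter keeps a disjoint
 * slice of the solutions in each job. The hash-modulo scheme is an
 * assumption, not the system's actual filtering function.
 */
public class ParallelExportSketch {

    /** True if this solution belongs to the given job out of totalJobs. */
    public static boolean belongsTo(Object solution, int job, int totalJobs) {
        int h = solution.hashCode() & Integer.MAX_VALUE; // force non-negative
        return h % totalJobs == job;
    }

    /** The slice of the solutions that one job would export. */
    public static List<String> slice(List<String> solutions, int job, int totalJobs) {
        return solutions.stream()
                .filter(sol -> belongsTo(sol, job, totalJobs))
                .collect(Collectors.toList());
    }
}
```

Since each solution hashes to exactly one job, the jobs never duplicate work or output, which is what allows export throughput to scale with the number of jobs involved.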
4 Conclusion and Future Work

In this paper we described the process of developing a scalable RDF system, focusing on satisfying what we deemed the minimal set of requirements for such a system. Some worth mentioning are: storage capacity of tens of terabytes of RDF data, potentially up to petabytes; support for thousands of distinct datasets, each stored in its own HBase table; and support for down-stream services and applications through SPARQL, HBase and Java APIs. All things considered, we seek to answer further questions through the implementation of such a system and to integrate other improvements that need further development and testing.

References

1. Allemang, D., & Hendler, J. A. (2011). Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Waltham, MA: Morgan Kaufmann/Elsevier.
2. Punnoose, R., Crainiceanu, A., & Rapp, D. (2012). Rya: A Scalable RDF Triple Store for the Clouds. Proceedings of the 1st International Workshop on Cloud Intelligence (Cloud-I '12).
3. Apache Rya (incubating). (n.d.). http://rya.incubator.apache.org/
4. Berlin SPARQL Benchmark (BSBM). (n.d.). http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
5. Sun, J., & Jin, Q. (2010). Scalable RDF Store Based on HBase and MapReduce. Proceedings of the International Conference on Advanced Computer Theory and Engineering (ICACTE), Chengdu, pp. V1-633–V1-636.