RDFChain: Chain Centric Storage for Scalable Join Processing of RDF Graphs using MapReduce and HBase

Pilsik Choi1,2,*, Jooik Jung1 and Kyong-Ho Lee1
1 Department of Computer Science, Yonsei University, Seoul, Republic of Korea
pschoi@icl.yonsei.ac.kr, jijung@icl.yonsei.ac.kr, khlee@cs.yonsei.ac.kr
2 Mobile Communication Division, Samsung Electronics
pilsik.choi@samsung.com

Abstract. As massive amounts of linked open data become available in RDF, the scalable storage and efficient retrieval of RDF data using MapReduce have been studied actively. Most previous studies focus on reducing the number of MapReduce jobs required to process the join operations in SPARQL queries. However, the cost of the shuffle phase remains because these approaches rely on reduce-side joins. In this paper, we propose RDFChain, which supports the scalable storage and efficient retrieval of a large volume of RDF data using a combination of MapReduce and HBase, a NoSQL storage system. Since the proposed storage schema of RDFChain reflects all the possible join patterns of queries, it reduces the number of storage accesses depending on the join pattern of a query. In addition, the proposed cost-based map-side join of RDFChain reduces the number of map jobs since it processes as many joins as possible in a single map job using statistics.

Keywords: Map-side join, chain centric storage, HBase, NoSQL, RDF, SPARQL, MapReduce, Hadoop

1 Introduction

As an enormous amount of Linked Data becomes available, processing SPARQL queries over massive RDF datasets becomes a challenging task when scalability and performance are taken into consideration. Much progress in SPARQL query processing has been made with the use of MapReduce, a distributed parallel processing framework; Hadoop (http://hadoop.apache.org) is its most popular open source implementation. Conventional MapReduce-based join processing methods are divided into two approaches: reduce-side join and map-side join [1]. Map-side join outperforms reduce-side join since it does not require the shuffle and reduce phases. However, map-side join requires the datasets to be equally partitioned by the join keys, a condition that is difficult to meet in the case of multi-way joins. Przyjaciel-Zablocki et al. [2] have proposed the Map-Side Index Nested Loop Join (MAPSIN) using HBase (http://hbase.apache.org), a distributed, scalable and column-oriented NoSQL store. HBase is well suited to random access due to its sparse, multidimensional sorted map, and is also appropriate for data models that require processing row keys, such as table scans and lookups. MAPSIN provides an optimized algorithm that reduces the number of storage accesses for star pattern joins, but not for chain pattern joins. Therefore, we propose RDFChain with the following contributions:

- RDFChain reflects every possible join pattern in its storage schema. Specifically, the proposed chain centric storage, which reflects the relations among the subjects and objects of RDF triples, reduces the number of storage accesses.
- RDFChain reduces the number of map jobs in multi-way joins. RDFChain estimates the cost of join processing using statistics in order to split a query. The resulting subqueries include as many triple patterns (TPs) as possible that can be processed in a map job. Thus, RDFChain processes as many joins as possible in a single map job.

2 Proposed Architecture

RDFChain consists of two components: a data loading component and a query processing component.

RDF triples are converted into the N-Triples format, which is natively supported by Hadoop, and then loaded using map jobs. At this stage, we employ the bulk load facility of HBase instead of directly putting every triple into a table. We also create all the statistics [3] required by our join execution.
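As an illustration of the loading step, the following is a minimal, hypothetical sketch of an N-Triples bulk-load job wired with Hadoop and HBase. The class names, the simplified line parsing, and the single target table are assumptions made for brevity; the actual loader distributes triples over the tables described in Section 2.1.

// Hypothetical sketch of an N-Triples bulk-load job into HBase.
// Table and class names are illustrative; RDFChain's actual loader
// distributes triples over several tables (see Section 2.1).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TripleBulkLoad {

  /** Parses one N-Triples line and emits an HBase Put keyed by the subject. */
  public static class TripleMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Very simplified N-Triples parsing: "<s> <p> <o> ."
      String[] parts = line.toString().trim().split("\\s+", 3);
      if (parts.length < 3) return;                  // skip malformed lines
      String s = parts[0], p = parts[1];
      String o = parts[2].replaceAll("\\s*\\.\\s*$", "");

      byte[] row = Bytes.toBytes(s);                 // row key = subject term
      Put put = new Put(row);
      // column family "spo", qualifier = predicate, value = object
      put.addColumn(Bytes.toBytes("spo"), Bytes.toBytes(p), Bytes.toBytes(o));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "rdfchain-bulk-load");
    job.setJarByClass(TripleBulkLoad.class);
    job.setMapperClass(TripleMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // N-Triples files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HFile output dir

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("Tspo"))) {
      // Configures partitioning and the sort reducer so the job writes HFiles
      // that can be handed to HBase's bulk-load tool afterwards.
      HFileOutputFormat2.configureIncrementalLoad(
          job, table, conn.getRegionLocator(table.getName()));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}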
2.1 Storage Design

The proposed method stores RDF triples as follows:

- RDF triples whose terms co-exist in both the subject and object positions of triples are located in the Tcom table. An RDF triple with such a term as its subject is considered a Subject-Predicate-Object (SPO) triple; an RDF triple with the term as its object corresponds to an Object-Predicate-Subject (OPS) triple.
- SPO triples which are not located in Tcom are stored in Tspo.
- OPS triples which are not located in Tcom are stored in Tops.

If a term is used not only as a subject but also as an object in triples, it becomes a row key in Tcom. Tcom has two column families to represent the OPS and SPO schemas. A predicate becomes a column, and the subject and object terms become the values of the corresponding columns for OPS and SPO, respectively. A subject may have several predicates, and a predicate may also have several objects. This means that triples are stored as a triple group with the subject as the row. Although Tcom may be sparse and have many empty fields, empty fields do not occupy storage in HBase. Looking into the rows of Tcom, some OPS and SPO triple groups share the same row key. These triple groups indicate a chain pattern relationship; in other words, the target triples of a chain pattern join must exist in Tcom. RDFChain does not index the predicates of RDF triples: a table with predicates as row keys would have serious scalability problems because the number of predicates in an ontology is usually fixed and relatively small in RDF datasets [3].

2.2 Query Planning

A join pattern is determined by the relationship between the join variables in a query. Subject-Subject (SS) and Object-Object (OO) relationships lead to star pattern joins, while Object-Subject (OS or SO) relationships lead to chain pattern joins. A query graph is a directed graph derived from a query, in which each triple pattern is a node and each join pattern is an edge. A logical plan is derived from the query graph; it includes a set of triple pattern groups (TPGs) and their join order. RDFChain determines the logical plan with the proposed selection rules, which focus on reducing the number of bindings. Chain TPGs are given priority when grouping triple patterns into TPGs.

2.3 Query Execution

When a triple pattern has rdf:type as its predicate, we analyze the class hierarchy of the corresponding ontology by utilizing Tcom instead of storing the inferred triples, and then extend the logical plan. The logical plan is transformed into a physical plan for actual join processing on MapReduce and HBase. Each TPG in the logical plan is transformed into a map job in the physical plan. RDFChain creates table mappings for each TPG using an HBase index and accesses the tables with an HBase filter based on the structure of the TPG. The first map job uses the HBase tables as the query input. The intermediate result of each map job is stored in a distributed file system and then taken as the input of the next map job in the iteration.

For a star pattern join, RDFChain efficiently retrieves a row through a single storage access in a map job since its storage keeps a triple group, keyed by subject or object, as a row.
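To make the single-access retrieval concrete, the following is a minimal sketch assuming a star pattern of two triple patterns whose common subject is already bound (for example, by a previous map job). The example URIs and predicate names are hypothetical, and the handling of multi-valued predicates is simplified; one Get on the row of the bound subject returns all columns needed to bind the pattern.

// Hypothetical sketch: resolving a star pattern { ?x <p1> ?o1 . ?x <p2> ?o2 }
// with ?x bound. All candidate bindings live in a single row, so one Get suffices.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StarPatternLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table tspo = conn.getTable(TableName.valueOf("Tspo"))) {

      byte[] subject = Bytes.toBytes("<http://example.org/Student0>"); // bound ?x
      Get get = new Get(subject);
      // Restrict the single access to the predicates of the star pattern.
      get.addColumn(Bytes.toBytes("spo"), Bytes.toBytes("<p1>"));
      get.addColumn(Bytes.toBytes("spo"), Bytes.toBytes("<p2>"));

      Result row = tspo.get(get);                    // one storage access
      byte[] o1 = row.getValue(Bytes.toBytes("spo"), Bytes.toBytes("<p1>"));
      byte[] o2 = row.getValue(Bytes.toBytes("spo"), Bytes.toBytes("<p2>"));
      if (o1 != null && o2 != null) {
        System.out.println(Bytes.toString(o1) + " , " + Bytes.toString(o2));
      }
    }
  }
}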
For a chain pattern join, RDFChain only scans Tcom since the triple groups satisfying a chain pattern are located in Tcom. Tcom is therefore more efficient for a chain pattern join due to the reduced number of storage accesses. In multi-way joins, only the rows that can possibly satisfy a chain pattern join are passed on as the inputs of map job iterations, thereby reducing the processing time.

Two chain TPGs with disjoint join variables are compatible [4]. RDFChain estimates the cost of processing compatible sets in a map job. The cost of a map-side join is the sum of the costs of all the tasks of a map job. Since we do not need to consider the shuffle and reduce phases of a reduce-side join, we only take the time to process a join into account. In a conventional reduce-side join, a map task simply writes data according to a join key for the actual join in the reduce phase. The map task of RDFChain, however, is relatively heavy because it iterates over both binding and pattern matching, and it may therefore exceed the execution time assigned to the task. This problem could be resolved statically by increasing the timeout or the computing power. Instead, assuming a fixed time budget per map task, we split TPGs based on a threshold to solve the problem dynamically. We limit the number of TPGs that can be processed in a map job using statistics such as the number of objects of frequently used subject-predicate pairs and the number of predicate-object pairs for every subject, which allow us to estimate the size of intermediate results. The number of bindings for the join variables dominates the number of map jobs, and all map jobs run sequentially. If the estimated cost does not exceed the time limit, a single map job is generated.
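The following is a minimal sketch of how such a cost-based split might look, assuming estimated bindings as a stand-in for the per-task cost and a greedy packing strategy. The class and method names are illustrative; the real estimator relies on the statistics described above.

// Hypothetical sketch of the cost-based split of a logical plan into map jobs.
// Estimated bindings are used here as a proxy for per-task execution time.
import java.util.ArrayList;
import java.util.List;

public class CostBasedSplitter {

  /** A group of triple patterns that could be evaluated together in one map job. */
  public static class TripleGroup {
    final List<String> triplePatterns = new ArrayList<>();
    long estimatedBindings;   // from subject-predicate / predicate-object statistics
  }

  /**
   * Greedily packs compatible TPGs into map jobs while the accumulated
   * estimated cost stays below the per-task threshold.
   */
  public static List<List<TripleGroup>> split(List<TripleGroup> compatibleTpgs,
                                              long costThreshold) {
    List<List<TripleGroup>> mapJobs = new ArrayList<>();
    List<TripleGroup> current = new ArrayList<>();
    long currentCost = 0;

    for (TripleGroup tpg : compatibleTpgs) {
      if (!current.isEmpty() && currentCost + tpg.estimatedBindings > costThreshold) {
        mapJobs.add(current);          // close the current map job
        current = new ArrayList<>();
        currentCost = 0;
      }
      current.add(tpg);
      currentCost += tpg.estimatedBindings;
    }
    if (!current.isEmpty()) {
      mapJobs.add(current);            // the resulting map jobs run sequentially
    }
    return mapJobs;                    // one inner list per map job
  }
}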
3 Experimental Results

We used the Jena SPARQL parser and Amazon's Elastic MapReduce service. Ten clusters of large instances were run on Hadoop 0.20.2 and HBase 0.92.0. We experimented with the LUBM 10K, LUBM 20K and LUBM 30K datasets, which consist of 1.3, 2.7 and 4.1 billion triples respectively, and a subset of the benchmark queries: Q1, Q2, Q3, Q4, Q7 and Q9 [5]. The queries include simple and complex structures and cover both star and chain pattern joins; Q3, Q4, Q7 and Q9 require type inference. We compared RDFChain with HadoopRDF [6], which uses reduce-side joins, and with MAPSIN [2], which uses map-side joins.

RDFChain showed the best performance on large non-selective queries (Q2 and Q9). In particular, Q2 and Q9 have a complex structure, a low selectivity due to unbound objects, and a chain pattern join relationship. RDFChain greatly reduced the size of the intermediate results by limiting the RDF triples to the candidate rows that can actually satisfy a chain pattern join. RDFChain also showed a smaller number of storage accesses than MAPSIN: since Tcom is a common subset of Tspo and Tops, it scales down the scan space. RDFChain splits TPGs with compatible mappings and processes the divided TPGs in a map task, which in turn decreases the number of map jobs.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2013R1A2A2A01016327).

References

1. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A Comparison of Join Algorithms for Log Processing in MapReduce. In: Proc. International Conference on Management of Data (2010)
2. Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., Dorner, C., Lausen, G.: Cascading Map-Side Joins over HBase for Scalable Join Processing. CoRR abs/1206.6293 (2012)
3. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation. In: Proc. WWW, pp. 595–604. ACM (2008)
4. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 30–43. Springer, Heidelberg (2006)
5. Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3, 158–182 (2005)
6. Husain, M., McGlothlin, J., Masud, M., Khan, L., Thuraisingham, B.: Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE Transactions on Knowledge and Data Engineering 23(9), 1312–1327 (2011)