        Cascading Map-Side Joins over HBase
             for Scalable Join Processing

               Alexander Schätzle, Martin Przyjaciel-Zablocki,
          Christopher Dorner, Thomas Hornung, and Georg Lausen

        Department of Computer Science, University of Freiburg, Germany
           {schaetzle,zablocki,dornerc,hornungt,lausen}@informatik.uni-freiburg.de



      Abstract. One of the major challenges in large-scale data processing
      with MapReduce is the smart computation of joins. Since Semantic Web
      datasets published in RDF have increased rapidly over the last few years,
      scalable join techniques become an important issue for SPARQL query
      processing as well. In this paper, we introduce the Map-Side Index Nested
      Loop Join (MAPSIN join) which combines scalable indexing capabilities
      of NoSQL data stores like HBase, that suffer from an insufficient dis-
      tributed processing layer, with MapReduce, which in turn does not pro-
      vide appropriate storage structures for efficient large-scale join process-
      ing. While retaining the flexibility of commonly used reduce-side joins,
      we leverage the effectiveness of map-side joins without any changes to
      the underlying framework. We demonstrate the significant benefits of
      MAPSIN joins for the processing of SPARQL basic graph patterns on
      large RDF datasets by an evaluation with the LUBM and SP²Bench
      benchmarks. For selective queries, MAPSIN join based query execution
      outperforms reduce-side join based execution by an order of magnitude.


1   Introduction
Most of the information in the classical "Web of Documents" is designed for
human readers, whereas the idea behind the Semantic Web is to build a "Web
of Data" that enables computers to understand and use the information in the
web. The advent of this Web of Data gives rise to new challenges with regard to
query evaluation on the Semantic Web. The core technologies of the Semantic
Web are RDF (Resource Description Framework) [1] for representing data in
a machine-readable format and SPARQL [2] for querying RDF data. However,
querying RDF datasets at web-scale is challenging, especially because the com-
putation of SPARQL queries usually requires several joins between subsets of the
data. On the other hand, classical single-machine approaches have reached
a point where they cannot scale with respect to the ever increasing amount
of available RDF data (cf. [16]). Renowned for its excellent scaling properties,
the MapReduce paradigm [8] is an attractive candidate for distributed SPARQL
processing. The Apache Hadoop platform is the most prominent and widely used
open-source MapReduce implementation. In the last few years many companies




have built up their own Hadoop infrastructure, but there are also ready-to-use
cloud services like Amazon’s Elastic Compute Cloud (EC2) offering the Hadoop
platform as a service (PaaS). Thus, in contrast to specialized distributed RDF
systems like YARS2 [15] or 4store [14], the use of existing Hadoop MapReduce
infrastructures enables scalable, distributed and fault-tolerant SPARQL process-
ing out-of-the-box without any additional installation or management overhead.
Following this avenue, we introduced the PigSPARQL project in [26] that offers
full support for SPARQL 1.0 and is implemented on top of Hadoop. However,
while the performance and scaling properties of PigSPARQL for complex analyt-
ical queries are competitive, the performance for selective queries is not satisfactory
due to the lack of built-in index structures and unnecessary data shuffling as join
computation is done in the reduce phase.
    In this paper we present a new MapReduce join technique, the Map-Side
Index Nested Loop Join (MAPSIN join), that uses the indexing capabilities of a
distributed NoSQL data store to improve query performance of selective queries.
MAPSIN joins are completely processed in the map phase to avoid costly data
shuffling by using HBase as underlying storage layer. Our evaluation shows an
improvement of up to one order of magnitude over the common reduce-side join
for selective queries. Overall, the major contributions of this paper are as follows:

 – We describe a space-efficient storage schema for large RDF graphs in HBase
   while retaining favourable access characteristics. By using HBase instead of
   HDFS, we can avoid shuffling join partitions across the network and instead
   only access the relevant join partners in each iteration.
 – We present the MAPSIN join algorithm, which can be evaluated cascadingly
   in subsequent MapReduce iterations. In contrast to other approaches, we do
   not require an additional shuffle and reduce phase in order to preprocess the
   data for consecutive joins. Moreover, we do not require any changes to the
   underlying frameworks.
 – We demonstrate an optimization of the basic MAPSIN join algorithm for the
   efficient processing of multiway joins. This way, we can save n MapReduce
   iterations for star join queries with n + 2 triple patterns.

The paper is structured as follows: Section 2 provides a brief introduction to the
technical foundations for this paper. Section 3 describes our RDF storage schema
for HBase, while Section 4 presents the MAPSIN join algorithm. We continue
with a presentation of the evaluation of our approach in Section 5, followed by
a discussion of related work in Section 6. We conclude in Section 7 and give an
outlook on future work.


2     Background

2.1   RDF & SPARQL

RDF [1] is the W3C recommended standard model for representing knowledge
about arbitrary resources, e.g. articles and authors. An RDF dataset consists of a




set of RDF triples in the form (subject, predicate, object) that can be interpreted
as "subject has property predicate with value object". For clarity of presentation,
we use a simplified RDF notation in the following. It is possible to visualize an
RDF dataset as a directed, labeled graph where every triple corresponds to an
edge (predicate) from subject to object. Figure 1 shows an RDF graph with
information about articles and corresponding authors.


      "PigSPARQL"                                                   SPARQL BGP query
           title
                   Article1   author    Alex                        SELECT *
           year                                                     WHERE {
                    author     cite     author
                                                  "RDFPath"           ?article title ?title .
      "2011"
                                                  title               ?article author ?author .
                   Martin     author   Article2                       ?article year   ?year
                                                  year
                                                           "2011"   }


                              Fig. 1. RDF graph and SPARQL query


    SPARQL is the W3C recommended declarative query language for RDF. A
SPARQL query defines a graph pattern P that is matched against an RDF graph
G. This is done by replacing the variables in P with elements of G such that the
resulting graph is contained in G (pattern matching). The most basic constructs
in a SPARQL query are triple patterns, i.e. RDF triples where subject, predicate
and object can be variables, e.g. (?s, p, ?o). A set of triple patterns concatenated
by AND (.) is called a basic graph pattern (BGP) as illustrated in Figure 1. The
query asks for all articles with known title, author and year of publication. The
result of a BGP is computed by joining the variable mappings of all triple pat-
terns on their shared variables, in this case ?article. For a detailed definition of
the SPARQL syntax we refer the interested reader to the official W3C Recom-
mendation [2]. A formal definition of the SPARQL semantics can also be found
in [23]. In this paper we focus on efficient join processing with MapReduce and
NoSQL (i.e. HBase) and therefore only consider SPARQL BGPs.


2.2    MapReduce

The MapReduce programming model [8] enables scalable, fault tolerant and mas-
sively parallel computations using a cluster of machines. The basis of Google’s
MapReduce is the distributed file system GFS [12] where large files are split into
equal sized blocks, spread across the cluster and fault tolerance is achieved by
replication. The workflow of a MapReduce program is a sequence of MapReduce
iterations each consisting of a Map and a Reduce phase separated by a so-called
Shuffle & Sort phase. A user has to implement map and reduce functions which
are automatically executed in parallel on a portion of the data. The map function
gets invoked for every input record represented as a key-value pair. It outputs a
list of new intermediate key-value pairs which are then sorted and grouped by
their key. The reduce function gets invoked for every distinct intermediate key




together with the list of all according values and outputs a list of values which
can be used as input for the next MapReduce iteration.
    We use Apache Hadoop as it is the most popular open-source implementation
of Google’s GFS and MapReduce framework that is used by many companies
like Yahoo!, IBM or Facebook.
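
For illustration, the following minimal word-count sketch (not taken from the paper,
but the canonical MapReduce example) shows how the map and reduce functions
described above look when implemented against Hadoop's org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: invoked once per input record (here: one line of text),
  // emits intermediate (word, 1) pairs
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) ctx.write(new Text(token), ONE);
      }
    }
  }
  // reduce: invoked once per distinct intermediate key with the list of all its values
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}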

Map-Side vs. Reduce-Side Join. Processing joins with MapReduce is a chal-
lenging task as datasets are typically very large [5,20]. If we want to join two
datasets with MapReduce, L ⋈ R, we have to ensure that the subsets of L and R
with the same join key values can be processed on the same machine. For joining
arbitrary datasets on arbitrary keys we generally have to shuffle data over the
network or choose appropriate pre-partitioning and replication strategies.
    The most prominent and flexible join technique in MapReduce is called
Reduce-Side Join [5,20]. Some literature also refers to it as Repartition Join [5],
as the idea is based on reading both datasets (map phase) and repartitioning them
according to the join key (shuffle phase). The actual join computation is done
in the reduce phase. The main drawback of this approach is that both datasets
are completely transferred over the network regardless of the join output. This
is especially inefficient for selective joins and consumes a lot of network band-
width. Another group of joins is based on getting rid of the shuffle and reduce
phase to avoid transferring both datasets over the network. This kind of join
technique is called Map-Side Join since the actual join processing is done in
the map phase. The most common one is the Map-Side Merge Join [20]. How-
ever, this join cannot be applied on arbitrary datasets. A preprocessing step is
necessary to fulfill several requirements: datasets have to be sorted and equally
partitioned according to the join key. If the preconditions are fulfilled, the map
phase can process an efficient parallel merge join between pre-sorted partitions
and data shuffling is not necessary. However, if we want to compute a sequence
of such joins, the shuffle and reduce phases are needed to guarantee that the
preconditions for the next join iteration are fulfilled. Therefore, map-side joins
are generally hard to cascade and the advantage of avoiding a shuffle and reduce
phase is lost. Our MAPSIN join approach is designed to overcome this drawback
by using the distributed index of a NoSQL system like HBase.

2.3   HBase
HBase is a distributed, scalable and strictly consistent column-oriented NoSQL
data store, inspired by Google’s Bigtable [7] and well integrated into Hadoop.
Hadoop’s distributed file system, HDFS, is designed for sequential reads and
writes of very large files in a batch processing manner but lacks the ability to
access data randomly in close to real-time. HBase can be seen as an additional
storage layer on top of HDFS that supports efficient random access. The data
model of HBase corresponds to a sparse multi-dimensional sorted map with the
following access pattern:
           (Table, RowKey, Family, Column, Timestamp) → Value




The rows of a table are sorted and indexed according to their row key and every
row can have an arbitrary number of columns. Columns are grouped into column
families and column values (denoted as cell) are timestamped and thus support
multiple versions. HBase tables are dynamically split into regions of contiguous
row ranges with a configured maximum size. When a region becomes too large,
it is automatically split into two regions at the middle key (auto-sharding).
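
As an illustration of this access pattern, the following sketch (the table name "demo"
and column family "f" are arbitrary examples, not part of the paper's schema) writes
and reads a single cell with the HBase client API:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatternDemo {
  public static void main(String[] args) throws Exception {
    // the table "demo" with column family "f" is assumed to exist
    HTable table = new HTable(HBaseConfiguration.create(), "demo");
    // write: (row key, family, column) -> value; HBase assigns the timestamp
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
    table.put(put);
    // random read by row key, narrowed to a single cell
    Get get = new Get(Bytes.toBytes("row1"));
    get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("col1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("col1"))));
    table.close();
  }
}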
     However, HBase has neither a declarative query language nor built-in sup-
port for native join processing, leaving higher-level data transformations to the
overlying application layer. In our approach we propose a map-side join strategy
that leverages the implicit index capabilities of HBase to overcome the usual
restrictions of map-side joins as outlined in Section 2.2.


3   RDF Storage Schema for HBase

In contrast to relational databases, NoSQL data stores have neither a common
data model nor a common query language like SQL. Hence, the implementation
of our join approach strongly relies on the actual NoSQL store used as backend.
In our initial experiments we considered HBase and Cassandra, two popular
NoSQL stores with support for MapReduce. We decided to use HBase for our
implementation as it proved to be more stable and also easier to handle in our
cluster since HBase was developed to work with Hadoop from the beginning.
    In [28] the authors adopted the idea of Hexastore [30] to index all possible
orderings of an RDF triple for storing RDF data in HBase. This results in six
tables in HBase, allowing the results for any possible SPARQL triple pattern
to be retrieved with a single lookup on one of the tables (except for a triple
pattern with three variables). However, as HDFS has a default replication factor
of three and data in HBase is stored in files on HDFS, an RDF dataset is
actually stored 18 times using this schema. Moreover, it is not only a matter of
storage space: loading a web-scale RDF dataset into HBase also becomes very
costly and consumes many resources. Our storage schema for RDF data in HBase
is inspired by [10] and uses only two tables, Ts_po and To_ps. We extend the
schema with a triple pattern mapping that leverages the power of predicate
push-down filters in HBase to overcome possible performance shortcomings of a
two table schema. Furthermore, we improve the scalability of the schema by
introducing a modified row key design for class assignments in RDF which would
otherwise lead to overloaded regions constraining both scalability and performance.
    In the Ts_po table an RDF triple is stored using the subject as row key, the
predicate as column name and the object as column value. If a subject has
more than one object for a given predicate (e.g. an article having more than
one author), these objects are stored as different versions in the same column.
The notation Ts_po indicates that the table is indexed by subject. Table To_ps
follows the same design. In both tables there is only one single column family
that contains all columns. Table 1 illustrates the corresponding Ts_po table for
the RDF graph in Section 2.1.




                 Table 1. Ts_po table for the RDF graph in Section 2.1

               rowkey         family:column→value
               Article1       p:title→{"PigSPARQL"}, p:year→{"2011"},
                              p:author→{Alex, Martin}
               Article2       p:title→{"RDFPath"}, p:year→{"2011"},
                              p:author→{Martin, Alex}, p:cite→{Article1}
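
A minimal sketch of how triples could be loaded into this schema is given below;
the table name "Ts_po", the column family name "p" (as used in Table 1) and the
use of explicit timestamps to keep several objects per (subject, predicate) pair are
illustrative assumptions, and the column family would have to be created with a
sufficiently large number of allowed versions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleLoader {
  private static final byte[] FAMILY = Bytes.toBytes("p"); // single column family

  // subject -> row key, predicate -> column name, object -> cell value
  static Put toPut(String subject, String predicate, String object, long version) {
    Put put = new Put(Bytes.toBytes(subject));
    // distinct timestamps keep several objects for the same (subject, predicate)
    put.add(FAMILY, Bytes.toBytes(predicate), version, Bytes.toBytes(object));
    return put;
  }

  public static void main(String[] args) throws Exception {
    HTable tSpo = new HTable(HBaseConfiguration.create(), "Ts_po");
    tSpo.put(toPut("Article1", "title", "PigSPARQL", 1));
    tSpo.put(toPut("Article1", "year", "2011", 1));
    tSpo.put(toPut("Article1", "author", "Alex", 1));
    tSpo.put(toPut("Article1", "author", "Martin", 2)); // second author as a new version
    tSpo.close();
  }
}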




    At first glance, this storage schema seems to have performance drawbacks
when compared to the six table schema in [28] since there are only indexes for
subjects and objects. However, we can use the HBase Filter API to specify addi-
tional column filters for table index lookups. These filters are applied directly on
server side such that no unnecessary data must be transferred over the network
(predicate push-down). As already mentioned in [10], a table with predicates as
row keys causes scalability problems since the number of predicates in an ontol-
ogy is usually fixed and relatively small which results in a table with just a few
very fat rows. Considering that all data in a row is stored on the same machine,
the resources of a single machine in the cluster become a bottleneck. Indeed, if
only the predicate in a triple pattern is given, we can use the HBase Filter API
to answer this request with a table scan on Ts_po or To_ps using the predicate as
column filter. Table 2 shows the mapping of every possible triple pattern to the
corresponding HBase table. Overall, experiments on our cluster showed that the
two table schema with server side filters has similar performance characteristics
compared to the six table schema but uses only one third of storage space.


 Table 2. SPARQL triple pattern mapping using HBase predicate push-down filters

               pattern           table                           filter
               (s, p, o)         Ts_po or To_ps                  column & value
               (?s, p, o)        To_ps                           column
               (s, ?p, o)        Ts_po or To_ps                  value
               (s, p, ?o)        Ts_po                           column
               (?s, ?p, o)       To_ps
               (?s, p, ?o)       Ts_po or To_ps (table scan)     column
               (s, ?p, ?o)       Ts_po
               (?s, ?p, ?o)      Ts_po or To_ps (table scan)
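
The following sketch indicates how some of these cases could be translated into
HBase requests with server-side filters; the column family "p" and the method
names are assumptions for illustration, not the paper's actual implementation:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TriplePatternLookup {
  private static final byte[] F = Bytes.toBytes("p");

  // (s, p, ?o): index lookup on Ts_po, restricted to the column of predicate p
  static Result lookupObjects(HTable tSpo, String s, String p) throws Exception {
    Get get = new Get(Bytes.toBytes(s));
    get.addColumn(F, Bytes.toBytes(p));
    get.setMaxVersions(); // all objects are stored as versions of the same cell
    return tSpo.get(get);
  }

  // (s, ?p, o): index lookup on Ts_po with a server-side value filter for o
  static Result lookupPredicates(HTable tSpo, String s, String o) throws Exception {
    Get get = new Get(Bytes.toBytes(s));
    get.setFilter(new ValueFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(o))));
    return tSpo.get(get);
  }

  // (?s, p, ?o): table scan on Ts_po, filtered by the column qualifier p on the region servers
  static ResultScanner scanByPredicate(HTable tSpo, String p) throws Exception {
    Scan scan = new Scan();
    scan.setFilter(new QualifierFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(p))));
    return tSpo.getScanner(scan);
  }
}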




    Our experiments also revealed some fundamental scaling limitations of the
storage schema caused by the To_ps table. In general, an RDF dataset uses a
relatively small number of classes but contains many triples that link resources
to classes, e.g. (Alex, rdf:type, foaf:Person). Thus, using the object of a triple
as row key means that all resources of the same class will be stored in the
same row. With increasing dataset size these rows become very large and exceed
the configured maximum region size resulting in overloaded regions that contain
only a single row. Since HBase cannot split these regions the resources of a single
machine become a bottleneck for scalability. To circumvent this problem we use
a modified To_ps row key design for triples with predicate rdf:type. Instead of
using the object as row key we use a compound row key of object and subject,




e.g. (foaf:Person|Alex). As a result, we cannot access all resources of a class
with a single table lookup, but as the corresponding rows are consecutive in
To_ps we can use an efficient range scan starting at the first entry of the class.
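
A range scan over these compound row keys could look as follows; the separator
character "|" and the class and method names are illustrative assumptions:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ClassMembers {
  // retrieve all subjects of a class with a single range scan on To_ps
  static void printMembers(HTable tOps, String rdfClass) throws Exception {
    byte[] startRow = Bytes.toBytes(rdfClass + "|");
    byte[] stopRow  = Bytes.toBytes(rdfClass + "}"); // '}' is the byte after '|', so the scan
    Scan scan = new Scan(startRow, stopRow);         // covers exactly the keys "<class>|..."
    ResultScanner scanner = tOps.getScanner(scan);
    try {
      for (Result row : scanner) {
        String key = Bytes.toString(row.getRow());
        System.out.println(key.substring(key.indexOf('|') + 1)); // subject is the key suffix
      }
    } finally {
      scanner.close();
    }
  }
}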


4     MAPSIN Join

The major task in SPARQL query evaluation is the computation of joins be-
tween triple patterns, i.e. basic graph patterns. However, join processing on large
RDF datasets, especially if it involves more than two triple patterns, is challeng-
ing [20]. Our approach combines the scalable storage capabilities of NoSQL data
stores (i.e. HBase), which lack a suitable distributed processing layer, with
MapReduce, a highly scalable and distributed computation framework, which in
turn does not support appropriate storage structures for large-scale join processing.
This allows us to match the flexibility of reduce-side joins while utilizing the
effectiveness of map-side joins without any changes to the underlying frameworks.
    First, we introduce the needed SPARQL terminology analogous to [23]: Let
V be the infinite set of query variables and T be the set of valid RDF terms.

Definition 1. A (solution) mapping µ is a partial function µ : V → T . We
call µ(?v) the variable binding of µ for ?v. Abusing notation, for a triple pattern
p we call µ(p) the triple pattern that is obtained by substituting the variables
in p according to µ. The domain of µ, dom(µ), is the subset of V where µ is
defined and the domain of p, dom(p), is the subset of V used in p. The result of
a SPARQL query is a multiset of solution mappings Ω.

Definition 2. Two mappings µ1 , µ2 are compatible if, for every variable ?v ∈
dom(µ1 ) ∩ dom(µ2 ), it holds that µ1 (?v) = µ2 (?v). It follows that mappings with
disjoint domains are always compatible and the set-union (merge) of µ1 and µ2 ,
µ1 ∪ µ2 , is also a mapping.
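
For illustration, consider the RDF graph of Figure 1: µ1 = {?article → Article1,
?title → "PigSPARQL"} and µ2 = {?article → Article1, ?author → Alex} are
compatible since they agree on the shared variable ?article, and their merge
µ1 ∪ µ2 = {?article → Article1, ?title → "PigSPARQL", ?author → Alex} is again
a mapping. In contrast, µ3 = {?article → Article2, ?author → Martin} is not
compatible with µ1 because µ1(?article) ≠ µ3(?article).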


4.1   Base Case

To compute the join between two triple patterns, p1 ⋈ p2, we have to merge
the compatible mappings for p1 and p2 . Therefore, it is necessary that subsets
of both multisets of mappings are brought together such that all compatible
mappings can be processed on the same machine.
    Our MAPSIN join technique computes the join between p1 and p2 in a sin-
gle map phase. At the beginning, the map phase is initialized with a parallel
distributed HBase table scan for the first triple pattern p1 where each machine
retrieves only those mappings that are locally available. This is achieved by
utilizing a mechanism for allocating local records to map functions, which is
supported by the MapReduce input format for HBase. The map function is in-
voked for each retrieved mapping µ1 for p1 . To compute the partial join between
p1 and p2 for the given mapping µ1 , the map function needs to retrieve those




mappings for p2 that are compatible to µ1 based on the shared variables be-
tween p1 and p2 . At this point, the map function utilizes the input mapping µ1
to substitute the shared variables in p2, i.e. the join variables. The substituted
triple pattern p2^sub is then used to retrieve the compatible mappings with a table
lookup in HBase following the triple pattern mapping outlined in Table 2. Since
there is no guarantee that the corresponding HBase entries reside on the same
machine, the results of the request have to be transferred over the network in
general. However, in contrast to a reduce-side join approach where a lot of data
is transferred over the network, we only transfer the data that is really needed.
Finally, the computed multiset of mappings is stored in HDFS.


   [Figure 2 (illustration): (1) SCAN for local mappings: ?article title ?title;
   (2) map inputs, e.g. ?article=article1, ?title="PigSPARQL" and ?article=article2,
   ?title="RDFPath"; (3) GET bindings: article1 author ?author resp. article2 author
   ?author from the NoSQL storage system (Nodes 1-3); (4) map outputs, e.g.
   ?article=article1, ?title="PigSPARQL", ?author=Alex, written to HDFS.]

 Fig. 2. MAPSIN join base case for the first two triple patterns of query in Figure 1


    Figure 2 illustrates the base case of our MAPSIN join, showing the join
between the first two triple patterns of the SPARQL query in Figure 1.
While the mappings for the first triple pattern (?article, title, ?title) are retrieved
locally using a distributed table scan (step 1+2), the compatible mappings for
(?article, author, ?author) are requested within the map function (step 3) and
the resulting set of mappings is stored in HDFS (step 4).
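
The following sketch indicates how the map phase of this base case could be
initialized on top of the storage schema of Section 3; the class names, the
configuration key used to pass the second triple pattern, and the output handling
are illustrative assumptions (the mapper itself is sketched after Algorithm 1 below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapsinBaseCaseJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("mapsin.pattern.next", "?article author ?author"); // p2, read by the mapper
    Job job = new Job(conf, "MAPSIN base case");
    job.setJarByClass(MapsinBaseCaseJob.class);

    // distributed table scan for p1 = (?article, title, ?title):
    // scan Ts_po and keep only the "title" column (predicate push-down)
    Scan scan = new Scan();
    scan.setFilter(new QualifierFilter(CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes("title"))));
    scan.setCaching(500); // fewer RPC round trips per map task

    TableMapReduceUtil.initTableMapperJob("Ts_po", scan,
        MapsinMapper.class, NullWritable.class, Text.class, job);
    job.setNumReduceTasks(0); // map-only job, results go straight to HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}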


4.2   Cascading Joins

Chains of concatenated triple patterns require some slight modifications to the
previously described base case. To compute a query of at least three triple pat-
terns we have to process several joins successively, e.g. p1 ⋈ p2 ⋈ p3. The pro-
cessing of the first two patterns p1 ⋈ p2 corresponds to the base case and the
results are stored in HDFS. The additional triple pattern p3 is then joined with
the mappings for p1 ⋈ p2. To this end, an additional map phase (without any
intermediate shuffle or reduce phase) is initialized with the previously computed
mappings as input. Since these mappings reside in HDFS, they are retrieved




locally in parallel such that the map function gets invoked for each mapping
µ2 for p1 ⋈ p2. The compatible mappings for p3 are retrieved using the same
strategy as for the base case, i.e. µ2 is used to substitute the shared variables in
p3 and compatible mappings are retrieved following the triple pattern mapping
outlined in Table 2. Algorithm 1 outlines one iteration of the MAPSIN join. The
input for the map function contains either a mapping for the first triple pattern
(via distributed table scan) or a mapping for previously joined triple patterns
(loaded from HDFS).


 Algorithm 1: MAPSIN join: map(inKey, inValue)
      input : inKey, inValue: value contains input mapping, key can be ignored
      output: multiset of mappings
      pn+1 ← Config.getNextPattern()
      µn ← inValue.getInputMapping()
      Ωn+1 ← ∅
      if dom(µn) ∩ dom(pn+1) ≠ ∅ then
          // substitute shared vars in pn+1
          pn+1^sub ← µn(pn+1)
          results ← HBase.GET(pn+1^sub)  // table index lookup using substituted pattern
      else
          results ← HBase.GET(pn+1)  // table index lookup using unsubstituted pattern
      end
      if results ≠ ∅ then
          // merge µn with compatible mappings for pn+1
          foreach mapping µ in results do
              µn+1 ← µn ∪ µ
              Ωn+1 ← Ωn+1 ∪ µn+1
          end
          emit(null, Ωn+1)  // key is not used since there is no reduce phase
      end
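
A strongly simplified Java counterpart of Algorithm 1, written as an HBase
TableMapper, is sketched below; the hard-coded variable names, the textual
serialization of mappings and the configuration handling are illustrative
assumptions and not the paper's actual implementation:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class MapsinMapper extends TableMapper<NullWritable, Text> {
  private HTable tSpo;          // used for the index lookups (GET) on the next pattern
  private String[] nextPattern; // pn+1 as {subject, predicate, object}, may contain variables

  @Override
  protected void setup(Context ctx) throws IOException {
    tSpo = new HTable(ctx.getConfiguration(), "Ts_po");
    nextPattern = ctx.getConfiguration().get("mapsin.pattern.next").split(" ");
  }

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
      throws IOException, InterruptedException {
    // input mapping µn from the distributed scan for p1 (simplified)
    Map<String, String> mu = new HashMap<String, String>();
    mu.put("?article", Bytes.toString(rowKey.get()));
    mu.put("?title", Bytes.toString(row.value()));

    // substitute the shared variable in pn+1 and perform the index lookup
    String subject = mu.containsKey(nextPattern[0]) ? mu.get(nextPattern[0]) : nextPattern[0];
    Get get = new Get(Bytes.toBytes(subject));
    get.addColumn(Bytes.toBytes("p"), Bytes.toBytes(nextPattern[1]));
    get.setMaxVersions();
    Result compatible = tSpo.get(get);
    if (compatible.isEmpty()) return; // no compatible mappings, empty join result for µn

    // merge µn with every compatible mapping for pn+1 and emit the result
    for (KeyValue kv : compatible.list()) {
      Map<String, String> merged = new HashMap<String, String>(mu);
      merged.put(nextPattern[2], Bytes.toString(kv.getValue()));
      ctx.write(NullWritable.get(), new Text(merged.toString()));
    }
  }
}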




4.3      Multiway Join Optimization
Instead of processing concatenated triple patterns successively as a sequence
of two-way joins, some basic graph patterns allow applying a multiway join
approach to process joins between several concatenated triple patterns at once
in a single map phase. This is typically the case for star pattern queries where
triple patterns share the same join variable. The SPARQL query introduced in
Section 2.1 is an example for such a query as all triple patterns share the same
join variable ?article. This query can be processed by a three-way join in a single
map-phase instead of two consecutive two-way joins.
    We extended our approach to support this multiway join optimization. Again,
the first triple pattern p1 is processed using a distributed table scan as input for
the map phase. But instead of using a sequence of n map phases to compute
p1 ⋈ p2 ⋈ ... ⋈ pn+1 we use a single map phase, thus saving n−1 MapReduce
iterations. Hence, the map function needs to retrieve all mappings for p2, p3, ...,
pn+1 that are compatible to the input mapping µ1 for p1. Therefore, the join
variable ?vs in p2, p3, ..., pn+1 (e.g. ?article) is substituted with the corresponding
variable binding µ1(?vs). The substituted triple patterns p2^sub, p3^sub, ..., pn+1^sub
are then used to retrieve the compatible mappings using HBase table lookups.
This general case of the MAPSIN multiway join is outlined in Algorithm 2.


     Algorithm 2: MAPSIN multiway join: map(inKey, inValue)
      input : inKey, inValue: value contains input mapping, key can be ignored
      output: multiset of mappings
      #p ← Config.getNumberOfMultiwayPatterns()
      µn ← inValue.getInputMapping()
      Ωn ← {µn}
      // iterate over all subsequent multiway patterns
      for i ← 1 to #p do
          Ωn+i ← ∅
          pn+i ← Config.getNextPattern()
          // substitute shared vars in pn+i
          pn+i^sub ← µn(pn+i)
          results ← HBase.GET(pn+i^sub)  // table index lookup using substituted pattern
          if results ≠ ∅ then
              // merge previous mappings with compatible mappings for pn+i
              foreach mapping µ in results do
                  foreach mapping µ′ in Ωn+i−1 do
                      Ωn+i ← Ωn+i ∪ (µ ∪ µ′)
                  end
              end
          else
              // no compatible mappings for pn+i hence join result for µn is empty
              return
          end
      end
      emit(null, Ωn+#p)  // key is not used since there is no reduce phase




    The performance of MAPSIN joins strongly correlates with the number of
index lookups in HBase. Hence, minimizing the number of lookups is a crucial
point for optimization. In many situations, it is possible to reduce the number of
requests by leveraging the RDF schema design for HBase outlined in Section 3.
If the join variable for all triple patterns is always on subject or always on object
position, then all mappings for p2, p3, ..., pn+1 that are compatible to the input
mapping µ1 for p1 are stored in the same HBase table row of Ts_po or To_ps,
respectively, making it possible to use a single table lookup instead of n subsequent
ones. Hence, all compatible mappings can be retrieved at once, thus saving
n − 1 lookups for each invocation of the map function. Due to space limitations
the corresponding algorithm for this optimized case can be found in the technical
report version of this paper [24].
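
The following sketch illustrates the idea for the star query of Figure 1, where all
patterns share the subject: a single GET on Ts_po fetches the columns of all
predicates at once (again, the names are illustrative assumptions):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StarPatternLookup {
  // one index lookup instead of n subsequent GETs, e.g.
  // lookupStar(tSpo, "Article1", "title", "author", "year")
  static Result lookupStar(HTable tSpo, String subject, String... predicates) throws Exception {
    Get get = new Get(Bytes.toBytes(subject));
    for (String p : predicates) {
      get.addColumn(Bytes.toBytes("p"), Bytes.toBytes(p)); // one column per triple pattern
    }
    get.setMaxVersions(); // multiple objects per predicate are stored as cell versions
    return tSpo.get(get);
  }
}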


5     Evaluation
The evaluation was performed on a cluster of 10 Dell PowerEdge R200 servers
equipped with a Dual Core 3.16 GHz CPU, 8 GB RAM, 3 TB disk space and
connected via gigabit network. The software installation includes Hadoop 0.20.2,
HBase 0.90.4 and Java 1.6.0 update 26.




 Table 3. SP²Bench & LUBM loading times for tables Ts_po and To_ps (hh:mm:ss)

   SP²Bench          200M          400M          600M          800M          1000M
   # RDF triples   ∼ 200 million ∼ 400 million ∼ 600 million ∼ 800 million ∼ 1000 million
   Ts_po             00:28:39      00:45:33      01:01:19      01:16:09       01:33:47
   To_ps             00:27:24      01:04:30      01:28:23      01:43:36       02:19:05
   total             00:56:03      01:50:03      02:29:42      02:59:45       03:52:52
   LUBM               1000          1500          2000          2500          3000
   # RDF triples   ∼ 210 million ∼ 315 million ∼ 420 million ∼ 525 million ∼ 630 million
   Ts_po             00:28:50      00:42:10      00:52:03      00:56:00      01:05:25
   To_ps             00:48:57      01:14:59      01:21:53      01:38:52      01:34:22
   total             01:17:47      01:57:09      02:13:56      02:34:52      02:39:47




    We used the well-known Lehigh University Benchmark (LUBM) [13] as the
queries can easily be formulated as SPARQL basic graph patterns. Furthermore,
we also considered the SPARQL-specific SP²Bench Performance Benchmark [27].
However, because most of the SP²Bench queries are rather complex queries that
use all different kinds of SPARQL 1.0 operators, we only evaluated some of the
queries as the focus of our work is the efficient computation of joins, i.e. basic
graph patterns. Both benchmarks offer synthetic data generators that can be
used to generate arbitrarily large datasets. For SP²Bench we generated datasets
from 200 million up to 1000 million triples. For LUBM we generated datasets
from 1000 up to 3000 universities and used the WebPIE inference engine for
Hadoop [29] to pre-compute the transitive closure. The loading times for both
tables Ts_po and To_ps as well as all datasets are listed in Table 3.
    The goal of our approach was to optimize MapReduce based join computa-
tion for selective queries. Therefore, we compared our MAPSIN join approach
with the reduce-side join based query execution in PigSPARQL [26], a SPARQL
1.0 engine built on top of Pig. Pig is an Apache top-level project developed
by Yahoo! Research that offers a high-level language for the analysis of very
large datasets with Hadoop MapReduce. The crucial point for this choice was
the sophisticated and efficient reduce-side join implementation of Pig [11] that
incorporates sampling and hash join techniques which makes it a challenging can-
didate for comparison. We illustrate the performance comparison of PigSPARQL
and MAPSIN for some selected LUBM queries that represent the different query
types in Figure 3. Our proof-of-concept implementation is currently limited to
a maximum number of two join variables as the goal was to demonstrate the
feasibility of the approach for selective queries rather than supporting all pos-
sible BGP constellations. For detailed comparison, the runtimes of all executed
queries are listed in Table 4.
    LUBM queries Q1, Q3, Q5, Q11, Q13 as well as SP²Bench query Q3a demonstrate
the base case with a single join between two triple patterns (cf. Figure 3a).
For the LUBM queries, MAPSIN joins performed 8 to 13 times faster compared
to the reduce-side joins of PigSPARQL. Even for the less selective SP²Bench
query, our MAPSIN join required only one third of the PigSPARQL execution
time. Furthermore, the performance gain increases with the size of the dataset
for both LUBM and SP²Bench.




   [Figure 3 (charts): runtimes in seconds of PigSPARQL vs. MAPSIN over the
   LUBM dataset sizes (# universities), including the multiway join optimization.]

      Fig. 3. Performance comparison for LUBM Q1 (a), Q8 (b), Q4 (c)


LUBM queries Q4 (5 triple patterns), Q7 (4 triple patterns), Q8 (5 triple
patterns) and SP²Bench queries Q1 (3 triple patterns), Q2 (9 triple patterns)
demonstrate the more general case with a sequence of cascaded joins (cf. Fig-




                                                               100                                                                                                                             1000  1500   2000  2500   3000
    48
  107                                   ure 3b). In these cases,    1            MAPSIN joins perform                  3000        even up to 28 times faster than                  Q1                     LUBM                  Q4
                                                                           1000 1500 2000 2500 3000                            1000    1500     2000      2500     3000                                                          1000
    6014                                PigSPARQL for LUBM      10                queries        and      up to 12
                                                                                                  PigSPARQL            200 times MAPSINfaster for SP2 Bench queries.                100
                                                                                                                                                                                    1000                                         10000
                                                                                   LUBM(#universities)                                LUBM(#universities)
    69                                                        1000                                                      500
                                                                                                                       100                                                                                                        100
      48                                    Of particular 1interest are queries Q4 of4000 LUBM and Q1, Q2 of SP2 Bench                                                               10                                           1000
    8460                                                                                            PigSPARQL                     MAPSIN                                             100                                            10
   10469
                                        since these queries    100 support
                                                               1000 1000 1500 the         2000 multiway
                                                                                                  2500 3000          join
                                                                                                                        300
                                                                                                                          5001000optimization
                                                                                                                                     1500      2000     2500outlined
                                                                                                                                                                  3000      in Section                                              100
                                                                                 LUBM(#universities)                                LUBM(#universities)
      84                                4.3 as all triple patterns
                                                                 10
                                                                                    share the same join200                  variable. This kind of optimization10
                                                                                                                          400                                                         1                                               1
                                                                                                                     timeinseconds




                                                                                                                                                                                            1000 1500 2000 2500 3000                  10 10
    104                                 is also supported100by PigSPARQL such that100                                     300both        approaches
                                                                                                                             (Multiway JoinOptimization)       can      compute   the
                                                                                                                                                                                   Q6                    LUBM                  Q7
MJ          HBase        HBaseMJ       query results with110PigSPARQL
                                                                     a single          multiway
                                                                          1000 1500 2000 2500 3000
                                                                                                  MAPSIN  join (cf. 200     0
                                                                                                                            Figure
                                                                                                                                PigSPARQL
                                                                                                                              1000    1500
                                                                                                                                           3c).2000The2500MAPSIN
                                                                                                                                                             MAPSIN
                                                                                                                                                                  3000
                                                                                                                                                                             multiway   1
                                                                                                                                                                                     10000
                                                                                                                                                                                                                                       1
    47                                                       10000                                                        100                                                                1000   1500   2000  2500   3000                1
  310               58             42 join optimization           improves              the basic MAPSIN joinLUBM(#universities)
                                                                                  LUBM(#universities)                                   execution time by a factor Q6              of1000     200
                                                                                                                                                                                                          LUBM                   Q7 MJ
                                                                                          timeinsecondstimeinseconds




    62
  600              118             87                          1000  1                                                        0                                                                 400 MJ                MJ
    68
                                        1.4 (SP2 Bench Q1)       100
                                                                         to10003.31500 (LUBM2000 2500  Q4),
                                                                                                   PigSPARQL3000 independently 1000
                                                                                                                                 MAPSIN 1500      of the
                                                                                                                                                 2000      2500data 3000 size. For the10000
                                                                                                                                                                                                                         MJ                MJ
  896 47           153           118                                                LUBM(#universities)                               LUBM(#universities)                           100 600
11879362           177           154
                                        LUBM queries, 10000  the10 MAPSIN multiway join 3000                            optimization performs 19 to 28 times1000 800                                      MJ
                                                                                                                                                                                                                       MJ               MJ
                                                                                                                                                                                         10            MJ
   114
1476  68           214           174 faster than the reduce-side
                                                              10001                     based PigSPARQL
                                                                                                    multiway 2400      join implementation
                                                                                                                                  MAPSIN                           of PigSPARQL. 100          1000                        MJ                M
                                                                               200M 2            400M             600M                800M             1000M                                               MJ
   12393                                For the more complex  10000
                                                                100
                                                               3000          SP       Bench       queries,        the  1800
                                                                                                                           performance
                                                                                                                        3000                         improvements             degrade      1
                                                                                                                       1200
                                                                                                                        2400                                                               10      200M           400M           600M
    114                                 to a factor of approximately                       8.5.
                                                                                   timeinseconds




                                                                1000
                                                                                 timeinseconds




                                                               2000
                                                                 10                                                                                                                Q1                                          SP²Bench
                                                                                                                        600
                                                                                                                        1800
    123                                                                                                                                   2                                                 1
                                            The remaining        100queries (LUBM Q6, Q14 and SP Bench Q10) consist of only
                                                               10001                                                    12000                                                                       200M           400M            600M
MJ          HBase        HBaseMJ
116850            8982           241
                                        one  single  triple       pattern.
                                                                   10 1000 1500
                                                                    0                  Consequently
                                                                                           2000 2500 3000 they 1000
                                                                                  LUBM(#universities)
                                                                                          LUBM
                                                                                                                          600 do not contain a join processing
                                                                                                                                      1500      2000     2500
                                                                                                                                       LUBM(#universities)
                                                                                                                                               LUBM
                                                                                                                                                                   3000             Q1                                          SP²Bench
                                                                     1200           300       400        500       600        0 700         800        900       1000                                                            10000
234164                           444                                                                   SP²Bench(triplesinmillion)
                                                                            1000 1500 2000 2500             3000                1000    1500      2000     2500     3000
35147750                         671                                                               PigSPARQL
                                                                                    LUBM(#universities)
                                                                                            LUBM
                                                                                                                                  MAPSIN LUBM(#universities)
                                                                                                                                                 LUBM                                                                             1000
    93                                                       10000                                                     3500                                                                                                       10000
4745  64                         834                                                           (Multiway JoinOptimization)                                                                                                         100
   108
6005  77        43597            999                                                             PigSPARQL             2800 MAPSIN
                                                              1000                                  PigSPARQL                      MAPSIN                                                                                           1000
                                                                                                  timeinseconds




   12193                                                    10000
                                                              10000                                                    2100
                                                                                                                         3500                                                                                                         10
                                                                100                                                   6000                                                                                                           100
    108                                                      1000
                                                                1000                                          70       1400
                                                                                                                         2800
                                                                                        timeinseconds
                                                                                                                                                     timeinseconds




                                                                  10                                                                                                                                                                   1
                                                                                                                      4500
                                                                                                                         700
    121                                                                                                                  2100                                                                                                          10
                                                               100100
                                                                    1                                                 3000
                                                                                                                         14000
          HBaseMJ                                                                                                                                                                                                                       1
                                                                 1010 1000 1500 2000 2500 3000                        1500    1000     1500     2000      2500     3000
    21
  187               70                                  (b)                        LUBM(#universities)
                                                                                                                          700
                                                                                                                                       LUBM(#universities)
    33
  279              139                                            11                                                        00
                                                                           200
                                                                            1000 400 1500 6002000 8002500 1000
                                                                                                             3000             200
                                                                                                                                1000 400 1500 600 2000 800 2500 10003000
    46
  417 21           178                                   (b)                                       PigSPARQL
                                                                              SP²Bench(triplesinmillion)
                                                                                    LUBM(#universities)
                                                                                                                                  MAPSIN
                                                                                                                                   SP²Bench(triplesinmillion)
                                                                                                                                         LUBM(#universities)
    53                                                        1000                                                     1000
  53633            235
 Table 4. Query execution times for PigSPARQL (P) and MAPSIN (M) in seconds

LUBM             1000           1500           2000            2500           3000
              P        M     P        M      P      M       P        M      P       M
Q1           324       34   475       51    634     53     790       70    944      84
Q3           324       33   480       42    642     49     805       59    961      72
Q4          1202      121   1758     167    2368    182   2919      235   3496     279
Q4 MJ        861       37   1297      53    1728    62    2173       81   2613      92
Q5           329       33   484       44    640     53     800       66    955      80
Q6           149       48   214       60    284     69     355       84    424     104
Q7          1013       62   1480      68    1985    93    2472      114   2928     123
Q8          1172       64   1731      77    2318    33    2870      108   3431     121
Q11          319       33   469       46    620     53     780       69    931      79
Q13          325       44   482       72    645     84     800      108    957     128
Q14          149       43   214       70    288     79     364       89    434     107
SP2 Bench       200M           400M            600M           800M           1000M
              P      M       P      M        P     M        P      M        P      M
Q1           545     58     1026   118      1527   153    2018    177     2519    214
Q1 MJ        310     42     600     87      896    118    1187    154     1476    174
Q2 MJ       1168    241     2341   444      3514   671    4745    834     6005    999
Q3a          227     70     435    139      641    178     845    235     1050    274
Q10          99      40     174     84      254    111     340    151      414    167




    The remaining queries (LUBM Q6, Q14 and SP2 Bench Q10) consist of only
a single triple pattern. Consequently, they do not contain a join processing step
and primarily illustrate the advantages of the distributed HBase table scan com-
pared to the HDFS storage access of PigSPARQL. Improvements are still present
but less significant, resulting in an up to 5 times faster query execution.
    An open issue of the evaluation remains the actual data flow between HBase
and MapReduce, as HBase acts as a black box in which data distribution and
partitioning are handled automatically by the system. Since data locality is an
important aspect of distributed systems, it is crucial to gather additional mea-
surements as a basis for future optimizations.
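Such measurements could, for example, start from the region-to-server mapping
that the HBase client exposes and correlate it with the placement of map tasks.
The following is a minimal sketch and not part of the original implementation:
it uses the current HBase client API (ConnectionFactory/RegionLocator, which
postdates the evaluation described here) and a hypothetical triple table name
rdf_spo.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.RegionLocator;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RegionLocalityReport {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(conf);
           RegionLocator locator =
               connection.getRegionLocator(TableName.valueOf("rdf_spo"))) {
        // Print, for every region of the (hypothetical) triple table, the start
        // key and the region server that currently hosts it; comparing these
        // hosts with the nodes on which the map tasks run indicates how much
        // of the MAPSIN input is read locally.
        for (HRegionLocation loc : locator.getAllRegionLocations()) {
          System.out.printf("region %s -> host %s%n",
              Bytes.toStringBinary(loc.getRegionInfo().getStartKey()),
              loc.getHostname());
        }
      }
    }
  }

Correlating such a report with the Hadoop task placement would show to what
extent the regions probed by a MAPSIN iteration are read from local servers.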
    Overall, the MAPSIN join approach clearly outperforms the reduce-side join
based query execution for selective queries. Both approaches exhibit a linear scal-
ing behavior with the input size, but the slope of the MAPSIN join is much
smaller. Especially for the LUBM queries, MAPSIN joins outperform reduce-side
joins by an order of magnitude as these queries are generally rather selective.
Moreover, the application of the multiway join optimization yields a further
significant improvement of the total query execution times.


6   Related Work

Single machine RDF systems like Sesame [6] and Jena [31] are widely used
since they are user-friendly and perform well for small and medium-sized RDF
datasets. RDF-3X [21] is considered one of the fastest single machine RDF
systems in terms of query performance and vastly outperforms previous single
machine systems, but its performance degrades for queries with unbound objects
and a low selectivity factor [17]. Furthermore, as the amount of RDF data
continues to grow, it will become more and more difficult to store entire datasets
on a single machine due to the limited scaling capabilities [16]. One possible
approach is to use specialized clustered RDF systems like OWLIM [19], YARS2
[15] or 4store [14].




However, these systems require a dedicated infrastructure and incur additional
installation and management overhead. In contrast, our approach builds upon
the idea of using existing infrastructures that are well-known and widely used. As
we do not require any changes to Hadoop and HBase at all, it is possible to use
any existing Hadoop cluster or cloud service (e.g. Amazon EC2) out of the box.
    There is a large body of work dealing with join processing in MapReduce
considering various aspects and application fields [4,5,18,20,22,25,32]. In Sec-
tion 2.2 we briefly outlined the advantages and drawbacks of the general-purpose
reduce-side and map-side (merge) join approaches in MapReduce. In addition to
these general-purpose approaches there are several proposals focusing on certain
join types or optimizations of existing join techniques for particular application
fields. In [22] the authors discuss how to process arbitrary joins (theta joins)
using MapReduce, whereas [4] focuses on optimizing multiway joins. However,
in contrast to our MAPSIN join, both approaches process the join in the reduce
phase, including a costly data shuffle phase. Map-Reduce-Merge [32] describes a
modified MapReduce workflow by adding a merge phase after the reduce phase,
whereas Map-Join-Reduce [18] proposes a join phase in between the map and re-
duce phase. Both techniques attempt to improve the support for joins in MapRe-
duce but require profound modifications to the MapReduce framework. In [9]
the authors present non-invasive index and join techniques for SQL processing
in MapReduce that also reduce the amount of shuffled data at the cost of an
additional co-partitioning and indexing phase at load time. However, the schema
and workload are assumed to be known in advance, which is typically feasible
for relational data but does not hold for RDF in general.
    HadoopDB [3] is a hybrid of MapReduce and DBMS where MapReduce is
the communication layer above multiple single-node DBMS instances. The au-
thors in [16] adopt this hybrid approach for the Semantic Web using RDF-3X.
However, the initial graph partitioning is done on a single machine and has to
be repeated if the dataset is updated or the number of machines in the cluster
changes. As we use HBase as the underlying storage layer, additional machines
can be plugged in seamlessly and updates are possible without having to reload
the entire dataset.
    HadoopRDF [17] is a MapReduce based RDF system that stores data directly
in HDFS and likewise does not require any changes to the Hadoop framework. It
is able to rebalance automatically when the cluster size changes, but join pro-
cessing is still done in the reduce phase. Our MAPSIN join does not use any
shuffle or reduce phase at all, even in consecutive iterations.


7   Conclusion

In this paper we introduced the Map-Side Index Nested Loop join (MAPSIN
join), which combines the advantages of NoSQL data stores like HBase with the
well-known and proven distributed processing facilities of MapReduce. In gen-
eral, map-side joins are more efficient than reduce-side joins in MapReduce as
there is no expensive data shuffle phase involved. However, current map-side
join approaches suffer from strict preconditions, which makes them hard to
apply in general, especially in a sequence of joins. The combination of HBase and
MapReduce allows us to cascade a sequence of MAPSIN joins without having to
sort and repartition the intermediate output for the next iteration. Furthermore,
with the multiway join optimization we can reduce the number of MapReduce
iterations and HBase requests. Using an index to selectively request only the
data that is actually needed also saves network bandwidth, making parallel query
execution more efficient. The evaluation with the LUBM and SP2 Bench bench-
marks demonstrates the advantages of our approach compared to the commonly
used reduce-side join approach in MapReduce. For selective queries, MAPSIN
join based SPARQL query execution outperforms reduce-side join based execu-
tion by an order of magnitude while scaling very smoothly with the input size.
Lastly, our approach does not require any changes to Hadoop and HBase at all.
Consequently, MAPSIN joins can be run on any existing Hadoop infrastructure
and also on an instance of Amazon’s Elastic Compute Cloud (EC2) without
additional installation or management overhead.
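    To make the join procedure concrete, the following mapper is a minimal
sketch of one MAPSIN iteration. It is not the original implementation and it
deliberately simplifies the storage schema: it assumes a triple table rdf_spo with
the subject as row key, a single column family p with the predicate as column
qualifier, a tab-separated serialization of partial bindings, and the LUBM prop-
erty ub:advisor as the joined predicate; it also uses the current HBase client API
rather than the API available at the time of writing.

  // Sketch of one MAPSIN join iteration (illustrative, not the original code).
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class MapsinJoinMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final byte[] FAMILY = Bytes.toBytes("p");
    // Predicate of the triple pattern joined in this iteration (assumed).
    private static final byte[] PREDICATE = Bytes.toBytes("ub:advisor");

    private Connection connection;
    private Table triples;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = HBaseConfiguration.create(context.getConfiguration());
      connection = ConnectionFactory.createConnection(conf);
      triples = connection.getTable(TableName.valueOf("rdf_spo"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // One partial result of the previous iteration per input line, assumed to
      // be serialized as "<value of the join variable>\t<rest of the binding>".
      String joinValue = value.toString().split("\t", 2)[0];

      // Probe the HBase index with a get: no data is shuffled, no reduce phase.
      Get get = new Get(Bytes.toBytes(joinValue));
      get.addColumn(FAMILY, PREDICATE);
      Result result = triples.get(get);
      if (!result.isEmpty()) {
        String object = Bytes.toString(result.getValue(FAMILY, PREDICATE));
        // The extended binding becomes the input of the next MAPSIN iteration.
        context.write(new Text(value.toString() + "\t" + object),
            NullWritable.get());
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      triples.close();
      connection.close();
    }
  }

In the multiway case, all triple patterns sharing the join variable could be an-
swered by the same probe, for instance by adding further addColumn calls to the
Get or by batching several probes via Table.get(List<Get>), which is what keeps
the number of MapReduce iterations and HBase round trips low.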
    In our future work, we will investigate alternatives and improvements of the
RDF storage schema for HBase and incorporate MAPSIN joins into PigSPARQL
in a hybrid fashion such that the actual join method is dynamically selected based
on pattern selectivity and statistics gathered at data loading time.


References
 1. RDF Primer. W3C Recom. (2004), http://www.w3.org/TR/rdf-primer/
 2. SPARQL Query Language for RDF. W3C Recom. (2008), http://www.w3.org/
    TR/rdf-sparql-query/
 3. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Rasin, A., Silberschatz, A.:
    HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
    Analytical Workloads. PVLDB 2(1), 922–933 (2009)
 4. Afrati, F.N., Ullman, J.D.: Optimizing Multiway Joins in a Map-Reduce Environ-
    ment. IEEE Trans. Knowl. Data Eng. 23(9), 1282–1298 (2011)
 5. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A Compar-
    ison of Join Algorithms for Log Processing in MapReduce. In: SIGMOD (2010)
 6. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A Generic Architecture
    for Storing and Querying RDF and RDF Schema. In: ISWC (2002)
 7. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra,
    T., Fikes, A., Gruber, R.: Bigtable: A Distributed Storage System for Structured
    Data. ACM Transactions on Computer Systems (TOCS) 26(2), 4 (2008)
 8. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clus-
    ters. Communications of the ACM 51(1), 107–113 (2008)
 9. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.:
    Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even
    Noticing). PVLDB 3(1), 518–529 (2010)
10. Franke, C., Morin, S., Chebotko, A., Abraham, J., Brazier, P.: Distributed Seman-
    tic Web Data Management in HBase and MySQL Cluster. In: IEEE International
    Conference on Cloud Computing (CLOUD). pp. 105 –112 (2011)
11. Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Ol-
    ston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a High-Level Dataflow
    System on top of Map-Reduce: The Pig Experience. PVLDB 2(2) (2009)




12. Ghemawat, S., Gobioff, H., Leung, S.: The Google File System. In: ACM SIGOPS
    Operating Systems Review. vol. 37, pp. 29–43. ACM (2003)
13. Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base
    Systems. Web Semantics 3(2) (2005)
14. Harris, S., Lamb, N., Shadbolt, N.: 4store: The Design and Implementation of a
    Clustered RDF Store. In: SSWS. pp. 94–109 (2009)
15. Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: A Federated Repository
    for Querying Graph Structured Data from the Web. The Semantic Web (2007)
16. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF
    Graphs. PVLDB 4(11), 1123–1134 (2011)
17. Husain, M.F., McGlothlin, J.P., Masud, M.M., Khan, L.R., Thuraisingham, B.M.:
    Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Comput-
    ing. IEEE TKDE 23(9) (2011)
18. Jiang, D., Tung, A.K.H., Chen, G.: Map-Join-Reduce: Toward Scalable and Effi-
    cient Data Analysis on Large Clusters. IEEE TKDE 23(9), 1299–1311 (2011)
19. Kiryakov, A., Ognyanov, D., Manov, D.: OWLIM - A Pragmatic Semantic Repos-
    itory for OWL. In: WISE Workshops. pp. 182–192 (2005)
20. Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel Data Processing
    with MapReduce: A Survey. SIGMOD Record 40(4), 11–20 (2011)
21. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1),
    647–659 (2008)
22. Okcan, A., Riedewald, M.: Processing Theta-Joins using MapReduce. In: SIGMOD
    Conference. pp. 949–960 (2011)
23. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. ACM
    Transactions on Database Systems (TODS) 34(3), 16 (2009)
24. Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., Dorner, C., Lausen, G.: Cas-
    cading Map-Side Joins over HBase for Scalable Join Processing. Technical Report.
    CoRR abs/1206.6293 (2012)
25. Przyjaciel-Zablocki, M., Schätzle, A., Hornung, T., Lausen, G.: RDFPath: Path
    Query Processing on Large RDF Graphs with MapReduce. In: ESWC Workshops.
    pp. 50–64 (2011)
26. Schätzle, A., Przyjaciel-Zablocki, M., Lausen, G.: PigSPARQL: Mapping SPARQL
    to Pig Latin. In: Proceedings of the International Workshop on Semantic Web
    Information Management (SWIM). pp. 4:1–4:8 (2011)
27. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Perfor-
    mance Benchmark. In: ICDE. pp. 222–233 (2009)
28. Sun, J., Jin, Q.: Scalable RDF Store Based on HBase and MapReduce. In:
    ICACTE. vol. 1, pp. 633–636 (2010)
29. Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: OWL Reasoning
    with WebPIE: Calculating the Closure of 100 Billion Triples. In: ESWC. pp. 213–
    227 (2010)
30. Weiss, C., Karras, P., Bernstein, A.: Hexastore: Sextuple Indexing for Semantic
    Web Data Management. PVLDB 1(1), 1008–1019 (2008)
31. Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D.: Efficient RDF Storage and
    Retrieval in Jena2. In: SWDB. pp. 131–150 (2003)
32. Yang, H.C., Dasdan, A., Hsiao, R.L., Jr., D.S.P.: Map-Reduce-Merge: Simplified
    Relational Data Processing on Large Clusters. In: SIGMOD (2007)



