<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to feed Apache HBase with petabytes of RDF data:</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adam Sotona</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Negru</string-name>
        </contrib>
        <aff>MSD IT Prague, Czech Republic ({firstname.lastname}@merck.com)</aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <abstract>
        <p>The majority of systems designed to handle big RDF data rely on a single high-end computer dedicated to a certain RDF dataset and do not easily scale out; at the same time, the several clustered solutions we tested proved unsatisfying in both features and benchmark results. In this paper we describe a system designed to tackle these issues: a system that connects RDF4J and Apache HBase to obtain an extremely scalable RDF store.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Key Components</title>
      <p>Two key aspects of a scalable triple/RDF store are a scalable database and a
dependable framework for working with RDF data. Our proposed architecture for
such a system is represented in Figure 1.</p>
      <p>Eclipse RDF4J [6] represents the framework, selected mostly for its features
and the fact that it meets the identified requirements (including handling
petabytes of data and supporting SPARQL 1.1). Other reasons include its
extensibility through the Storage And Inference Layer (SAIL) to support various RDF stores
and inference engines, its test coverage, and its extensive documentation.</p>
      <p>Apache HBase is the database of Apache Hadoop (https://hadoop.apache.org/): a
distributed and scalable big data store. It is designed to scale up from single servers
to thousands of machines, thus providing a flexible and reliable means of handling
big RDF data.</p>
      <sec id="sec-2-1">
        <title>Bulk Data Operations</title>
        <p>Bulk data operations are critical to the overall system efficiency. A system capable
of holding and querying hundreds of billions of triples must also be able to load them,
create snapshots, clone, and export in bulk operations.</p>
        <p>Our current bulk load is implemented as a simple MapReduce application
that consumes RDF data in any standard format, with each reducer task producing HBase
table region files directly. Performance of the bulk load scales linearly with the Hadoop
cluster: measurements of load and indexing performance show on average about
40,000 quads per second per cluster node. Bulk export is also a MapReduce
application that exports the whole dataset in parallel; measured performance shows
on average about 400,000 quads per second per cluster node. Bulk merge is a
multi-step operation composed of snapshotting the source datasets and directly
loading all HBase region files into one merged HBase table.</p>
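        <p>As a rough back-of-the-envelope illustration of the reported per-node rates (the 10-node cluster and billion-quad dataset below are hypothetical examples, not measured configurations), the linear scaling claim translates into wall-clock estimates as follows:</p>

```python
# Back-of-the-envelope estimate from the reported per-node throughput.
# Cluster size and dataset size below are hypothetical, for illustration only.
LOAD_RATE_PER_NODE = 40_000     # quads/second/node (bulk load, reported)
EXPORT_RATE_PER_NODE = 400_000  # quads/second/node (bulk export, reported)

def bulk_hours(quads: int, nodes: int, rate_per_node: int) -> float:
    """Estimated wall-clock hours, assuming linear scaling with node count."""
    return quads / (nodes * rate_per_node) / 3600

quads = 1_000_000_000  # hypothetical dataset: one billion quads
nodes = 10             # hypothetical cluster size
load_h = bulk_hours(quads, nodes, LOAD_RATE_PER_NODE)      # ~0.69 hours
export_h = bulk_hours(quads, nodes, EXPORT_RATE_PER_NODE)  # ~0.07 hours
```

        <p>Under these assumptions, export is exactly ten times faster than load, reflecting the 10x difference in the per-node rates.</p>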
        <p>To achieve such performance, a proper mapping of RDF to Apache HBase is a
must, as it provides the much-needed performance and scalability of the whole system.
After several iterations of trying various mapping approaches, we found an
optimal RDF mapping with satisfying performance, represented in Table 1.</p>
        <p>RDF Triple: &lt;SUBJ1&gt; &lt;PRED1&gt; &lt;OBJ1&gt;
RDF Quad: &lt;SUBJ2&gt; &lt;PRED2&gt; &lt;OBJ2&gt; &lt;CTX2&gt;</p>
        <p>Table 1. Apache HBase table layout (single Column Family; HBase regions are
ordered by Row Keys; cell Values are not used and left empty):
SPO region:  Row Key 0&lt;SUBJ1 hash&gt;&lt;PRED1 hash&gt;&lt;OBJ1 hash&gt;, Column Qualifier &lt;SUBJ1&gt;&lt;PRED1&gt;&lt;OBJ1&gt;
             Row Key 0&lt;SUBJ2 hash&gt;&lt;PRED2 hash&gt;&lt;OBJ2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;
POS region:  Row Key 1&lt;PRED1 hash&gt;&lt;OBJ1 hash&gt;&lt;SUBJ1 hash&gt;, Column Qualifier &lt;SUBJ1&gt;&lt;PRED1&gt;&lt;OBJ1&gt;
             Row Key 1&lt;PRED2 hash&gt;&lt;OBJ2 hash&gt;&lt;SUBJ2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;
OSP region:  Row Key 2&lt;OBJ1 hash&gt;&lt;SUBJ1 hash&gt;&lt;PRED1 hash&gt;, Column Qualifier &lt;SUBJ1&gt;&lt;PRED1&gt;&lt;OBJ1&gt;
             Row Key 2&lt;OBJ2 hash&gt;&lt;SUBJ2 hash&gt;&lt;PRED2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;
CSPO region: Row Key 3&lt;CTX2 hash&gt;&lt;SUBJ2 hash&gt;&lt;PRED2 hash&gt;&lt;OBJ2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;
CPOS region: Row Key 4&lt;CTX2 hash&gt;&lt;PRED2 hash&gt;&lt;OBJ2 hash&gt;&lt;SUBJ2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;
COSP region: Row Key 5&lt;CTX2 hash&gt;&lt;OBJ2 hash&gt;&lt;SUBJ2 hash&gt;&lt;PRED2 hash&gt;, Column Qualifier &lt;SUBJ2&gt;&lt;PRED2&gt;&lt;OBJ2&gt;&lt;CTX2&gt;</p>
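        <p>The six-index mapping of Table 1 can be sketched in pure Python. The hash function (truncated MD5), the 8-byte hash width, and the one-byte region prefix encoding below are illustrative assumptions, not the exact on-disk format:</p>

```python
import hashlib

def term_hash(term: str) -> bytes:
    """Illustrative 8-byte term hash (truncated MD5); the real layout may differ."""
    return hashlib.md5(term.encode("utf-8")).digest()[:8]

def index_rows(subj, pred, obj, ctx=None):
    """Yield (row_key, column_qualifier) pairs for one statement, one per index.

    A one-byte prefix (0-2 for SPO/POS/OSP; 3-5 for the context-first
    CSPO/CPOS/COSP quad indexes) routes each key into its region; the
    qualifier carries the full statement and the cell value stays empty.
    """
    s, p, o = term_hash(subj), term_hash(pred), term_hash(obj)
    terms = (subj, pred, obj) if ctx is None else (subj, pred, obj, ctx)
    qualifier = "".join(f"<{t}>" for t in terms).encode("utf-8")
    yield bytes([0]) + s + p + o, qualifier  # SPO index
    yield bytes([1]) + p + o + s, qualifier  # POS index
    yield bytes([2]) + o + s + p, qualifier  # OSP index
    if ctx is not None:
        c = term_hash(ctx)
        yield bytes([3]) + c + s + p + o, qualifier  # CSPO index
        yield bytes([4]) + c + p + o + s, qualifier  # CPOS index
        yield bytes([5]) + c + o + s + p, qualifier  # COSP index
```

        <p>A triple thus produces three rows and a quad six; because the prefix byte sorts first, each index occupies a contiguous, independently splittable range of HBase regions.</p>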
      </sec>
      <sec id="sec-2-2">
        <title>RDF4J SAIL Implementation</title>
        <p>
          Storage and Inference Layer (SAIL) is the key interface through which the RDF4J framework
communicates with various storage and/or inference sub-systems. A minimalistic
HBase RDF4J SAIL implementation works well even though RDF4J was not
originally designed for big data operations. We further optimized performance by
stripping out transaction support and streamlining the dataflow through the RDF4J SAIL
and the rest of the RDF4J framework. A brief performance measurement was
performed using the Berlin SPARQL Benchmark [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and its Explore Use Cases query mix.
For a 10 billion triples dataset the average is 7.8 QMPH per single cluster node. With
query #5 excluded from the query mix, it shows a more promising 588 QMPH per
single cluster node involved in the benchmark.
        </p>
        <p>
          With regard to reasoning, the size of our target datasets limits us to basic inferences
over RDFS+ [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and similar rule sets. Given the batch-processing mode and the high
storage capacity of HBase, we decided to use batch processes to physically
materialize inferred triples. More precisely, we developed a tool that performs
SPARQL-based reasoning in batch mode.
        </p>
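        <p>The idea of physically materializing inferred triples can be illustrated with a pure-Python forward-chaining pass over the RDFS subclass rules (rdfs9 and rdfs11). The actual tool drives such rules with SPARQL updates over HBase; the in-memory fixpoint loop and the tiny example data below are a simplified sketch:</p>

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialize_subclass(triples):
    """Apply rdfs9/rdfs11 repeatedly until no new triple appears (fixpoint),
    returning the original triples plus all materialized inferences."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        sub = {(s, o) for s, p, o in closure if p == SUBCLASS}
        # rdfs11: subClassOf is transitive
        for a, b in sub:
            for c, d in sub:
                if b == c and (a, SUBCLASS, d) not in closure:
                    new.add((a, SUBCLASS, d))
        # rdfs9: instances of a subclass are instances of the superclass
        for s, p, o in closure:
            if p == TYPE:
                for a, b in sub:
                    if o == a and (s, TYPE, b) not in closure:
                        new.add((s, TYPE, b))
        if new:
            closure |= new
            changed = True
    return closure

# Hypothetical example data
data = {("ex:Fido", TYPE, "ex:Dog"),
        ("ex:Dog", SUBCLASS, "ex:Mammal"),
        ("ex:Mammal", SUBCLASS, "ex:Animal")}
inferred = materialize_subclass(data)  # adds 3 derived triples
```

        <p>Materializing the closure once, in batch, trades extra storage (cheap in HBase) for query-time simplicity: queries need no reasoning at all.</p>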
      </sec>
      <sec id="sec-2-3">
        <title>Parallel SPARQL Evaluation Strategies</title>
        <p>The RDF4J framework comes with a standard SPARQL 1.1 compliant evaluation
strategy. This strategy uses a single-threaded pull model where data are requested from
the underlying RDF store when needed during evaluation. This model shows
excellent results on low-latency RDF stores; however, it does not perform well enough
for distributed storage systems. Our improved SPARQL evaluation strategy uses a
parallel push model. In this model each SPARQL query tree is expanded, requests are
processed asynchronously, and data are pushed back for evaluation when ready. Proper
prioritization of the asynchronous processes is key to letting the data flow
through the system with minimal latency and without memory overflows.</p>
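        <p>The push model can be sketched with stdlib primitives: worker threads issue store requests concurrently and push each result into a shared priority queue the moment it is ready, and the consumer drains the queue in priority order. The priority scheme and worker structure here are illustrative assumptions, not RDF4J internals:</p>

```python
import queue
import threading

def parallel_push(requests, fetch, priorities=None):
    """Issue independent store requests on worker threads; each pushes its
    result into a shared priority queue as soon as it is ready. The consumer
    then drains results lowest-priority-value first, so evaluation order is
    preserved without blocking on any single slow request."""
    priorities = priorities or range(len(requests))
    ready = queue.PriorityQueue()
    threads = [threading.Thread(target=lambda pr, rq: ready.put((pr, fetch(rq))),
                                args=(prio, req))
               for prio, req in zip(priorities, requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [ready.get()[1] for _ in range(len(requests))]

# Hypothetical requests; fetch simulates a remote HBase scan
results = parallel_push(["spo-scan", "pos-scan", "osp-scan"], lambda r: r.upper())
```

        <p>Bounding the queue (PriorityQueue accepts a maxsize) is one way to provide the back-pressure that prevents memory overflows when producers outpace the consumer.</p>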
        <p>Another experimental parallelization strategy targets parallel SPARQL-based
export, which uses multiple distinct jobs working on the same SPARQL query
with a custom SPARQL filtering function to avoid duplicate work and duplicate
data. The parallel architecture also allows us to spawn multiple SPARQL endpoints and
balance the load between them. As a result, the above-mentioned export and query
performance measurements can be multiplied by the number of involved cluster
nodes, up to the cluster size.</p>
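        <p>The duplicate-avoidance idea behind such a filtering function can be sketched as a deterministic hash partition: each of N export jobs keeps only the solution bindings whose hash falls into its slot, so every binding is exported exactly once. The use of SHA-1 as the stable hash is an assumption for illustration:</p>

```python
import hashlib

def owns(binding: str, job_index: int, total_jobs: int) -> bool:
    """True iff this export job is responsible for the given solution binding.
    A stable hash (not Python's per-process-randomized hash()) keeps the
    partition consistent across independent job processes."""
    digest = hashlib.sha1(binding.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_jobs == job_index

# Hypothetical bindings split across 4 concurrent export jobs
bindings = [f"ex:subject{i}" for i in range(1000)]
parts = [[b for b in bindings if owns(b, k, 4)] for k in range(4)]
```

        <p>Each job evaluates the same query but filters with its own index, so the jobs need no coordination beyond agreeing on the hash and the job count.</p>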
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        A similar direction of research was taken by Jianling Sun et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who store
RDF in HBase with a mapping related to the one presented in Table 1, and Apache Rya [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been
developed in the same spirit. Nevertheless, Apache Rya [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is still an Apache
Incubator project and its current requirements are divergent, yet there is a vision that our
efforts might merge in the future.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper we described the process of developing a scalable RDF system by
focusing on satisfying what we deemed the minimal set of requirements for such a
system. Some worth mentioning are: storage capacity in the tens of terabytes of RDF
data and potentially up to petabytes; support for thousands of distinct datasets, each in
a single HBase table; and support for downstream services and applications
through SPARQL, HBase, and Java APIs. All things considered, we seek to
answer more questions through the implementation of such a system and to look into
integrating other improvements that need further development and testing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allemang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Semantic Web for the working ontologist: Effective modeling in RDFS and OWL</article-title>
          . Waltham, MA: Morgan Kaufmann/Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Punnoose</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crainiceanu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rapp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <source>Rya. Proceedings of the 1st International Workshop on Cloud Intelligence - Cloud-I '12.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>Apache Rya (incubating)</source>
          . (n.d.). http://rya.incubator.apache.org/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Berlin SPARQL Benchmark (BSBM). (n.d.). http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Jianling</given-names>
            <surname>Sun</surname>
          </string-name>
          and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>"Scalable RDF store based on HBase and MapReduce,"</article-title>
          International Conference on Advanced Computer Theory and Engineering (ICACTE), Chengdu,
          <year>2010</year>
          , pp. V1-633-V1-636.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>