<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vaibhav Khadilkar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murat Kantarcioglu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bhavani Thuraisingham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Castagna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Talis Systems Ltd</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Texas at Dallas</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Lack of scalability is one of the most significant problems faced by single-machine RDF data stores. The advent of Cloud Computing has paved the way for a distributed ecosystem of RDF triple stores that can potentially allow up to planet-scale storage along with distributed query processing capabilities. Towards this end, we present Jena-HBase, an HBase-backed triple store that can be used with the Jena framework. Jena-HBase provides end-users with a scalable storage and querying solution that supports all features of the RDF specification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>triples. In contrast, existing systems that use a MapReduce-based query engine
for processing RDF data are optimized for query performance; however, they
are currently unable to support all features of the RDF specification. Our
motivation with Jena-HBase is to provide end-users with a cloud-based RDF
storage and querying API that supports all features of the RDF specification.</p>
      <p>Our contributions: Jena-HBase provides (a) a variety of custom-built RDF
data storage layouts for HBase that trade off query performance against
storage, and (b) support for reification, inference and SPARQL processing
through the implementation of appropriate Jena interfaces.</p>
    </sec>
    <sec id="sec-2">
      <title>Jena-HBase Architecture</title>
      <p>
        We have performed benchmark experiments using SP2Bench (non-inference queries)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and LUBM (inference queries) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to determine the best layout currently
available in Jena-HBase, as well as to compare the performance of the best layout
with Jena TDB. We have compared Jena-HBase only with Jena TDB and not
with other Hadoop-based systems for the following reasons: (i) Jena TDB gives
      </p>
      <table-wrap id="tab-1">
        <table>
          <thead>
            <tr>
              <th>Layout Type</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Simple</td>
              <td>Three tables, each indexed by subjects, predicates and objects</td>
            </tr>
            <tr>
              <td>Vertically Partitioned (VP)</td>
              <td>For every unique predicate, two tables, each indexed by subjects and objects</td>
            </tr>
            <tr>
              <td>Indexed</td>
              <td>Six tables representing the six possible combinations of a triple, namely SPO, SOP, PSO, POS, OSP and OPS</td>
            </tr>
            <tr>
              <td>VP and Indexed</td>
              <td>VP layout with additional tables for SPO, OSP and OS</td>
            </tr>
            <tr>
              <td>Hybrid</td>
              <td>Simple + VP layouts</td>
            </tr>
            <tr>
              <td>Hash</td>
              <td>Hybrid layout with hash values for nodes and a separate table containing hash-to-node mappings</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
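      <sec id="sec-2-layout-sketch">
        <p>As an illustration only, the Simple, Vertically Partitioned (VP) and Hybrid layouts can be sketched in memory as follows. Python dictionaries stand in for HBase tables, and every class and method name (SimpleLayout, VerticallyPartitionedLayout, HybridLayout, add, find) is hypothetical rather than part of the Jena-HBase API. The rule by which HybridLayout chooses between its two sub-layouts is an assumption made for this sketch, not a description of the actual implementation.</p>
        <preformat>
```python
from collections import defaultdict


def matches(triple, s, p, o):
    """Return True if the triple agrees with every bound (non-None) term."""
    return all(want is None or want == got
               for want, got in zip((s, p, o), triple))


class SimpleLayout:
    """Three tables, indexed by subject, predicate and object."""

    def __init__(self):
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        for table, key in ((self.by_subject, s),
                           (self.by_predicate, p),
                           (self.by_object, o)):
            table[key].add((s, p, o))

    def find(self, s=None, p=None, o=None):
        # Read one row from the table of the first bound term, then filter.
        if s is not None:
            rows = self.by_subject.get(s, set())
        elif p is not None:
            rows = self.by_predicate.get(p, set())
        elif o is not None:
            rows = self.by_object.get(o, set())
        else:
            rows = set().union(*self.by_subject.values())  # full scan
        return {t for t in rows if matches(t, s, p, o)}


class VerticallyPartitionedLayout:
    """For every predicate, one subject-indexed and one object-indexed table."""

    def __init__(self):
        self.subject_tables = defaultdict(lambda: defaultdict(set))
        self.object_tables = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.subject_tables[p][s].add(o)
        self.object_tables[p][o].add(s)

    def find(self, s=None, p=None, o=None):
        # An unbound predicate forces a visit to every predicate's tables.
        predicates = [p] if p is not None else list(self.subject_tables)
        result = set()
        for pred in predicates:
            if pred not in self.subject_tables:
                continue
            if s is not None:
                objs = self.subject_tables[pred].get(s, set())
                result |= {(s, pred, x) for x in objs if o is None or x == o}
            elif o is not None:
                subs = self.object_tables[pred].get(o, set())
                result |= {(x, pred, o) for x in subs}
            else:
                for subj, objs in self.subject_tables[pred].items():
                    result |= {(subj, pred, x) for x in objs}
        return result


class HybridLayout:
    """Simple + VP: every triple is written to both layouts."""

    def __init__(self):
        self.simple = SimpleLayout()
        self.vp = VerticallyPartitionedLayout()

    def add(self, s, p, o):
        self.simple.add(s, p, o)
        self.vp.add(s, p, o)

    def find(self, s=None, p=None, o=None):
        # Assumption for this sketch: use VP when the predicate is bound,
        # otherwise fall back to the Simple layout.
        layout = self.vp if p is not None else self.simple
        return layout.find(s, p, o)


store = HybridLayout()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:name", "Alice")
store.add("ex:carol", "foaf:knows", "ex:bob")
print(sorted(store.find(p="foaf:knows", o="ex:bob")))
```
        </preformat>
        <p>In this sketch a pattern with a bound predicate touches only that predicate's two tables, while a pattern with an unbound predicate is answered from the Simple tables. This is one plausible way a combined layout could draw on both of its parts.</p>
      </sec>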
      <sec id="sec-2-10">
        <title>Hybrid: Simple + VP layouts; Hash: Hybrid layout with hash values for nodes and a separate table containing hash-to-node mappings</title>
        <p>
          the best query performance of all available Jena storage subsystems. (ii) The
available Hadoop-based systems do not implement all features from the RDF
specification. In this section, we show results only for Q1 and Q9 of SP2Bench
and Q1 and Q10 of LUBM; however, these results are indicative of the overall
trend [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Additionally, the figures only show query times and do not include
loading times. Finally, as a part of the procedure to determine the best layout,
we ran both benchmarks over several graph sizes, but we show results only for
a graph of 250137 triples for SP2Bench and for a graph of 5 universities (≈
560K triples) for LUBM (Fig. 2). Although we used a small graph size, it is still
sufficient for determining the best Jena-HBase layout. Since LUBM contains
inference queries, we used the Pellet reasoner (v2.3.0) to perform inference.
        </p>
        <fig id="fig-2">
          <label>Fig. 2</label>
          <caption>
            <p>Panels: Graph Querying - Q1 and Graph Querying - Q9 (SP2Bench); Graph Querying - Q1 and Graph Querying - Q10 (LUBM). Vertical axis: Query Time (sec). Legend: Simple, Vertically-Partitioned, Indexed, VP-Indexed, Hybrid, Hash.</p>
          </caption>
        </fig>
        <sec id="sec-2-10-10">
          <title>Graph Querying - Q10</title>
          <p>Fig. 2 shows a comparison of all layouts for Q1 and Q9 of SP2Bench and Q1
and Q10 of LUBM. We see that the Hybrid layout gives the best results since
it combines the advantages of the Simple (Q9 of SP2Bench and Q10 of LUBM)
and VP (Q1 of SP2Bench) layouts. The Indexed, VP-Indexed and Hash layouts
require longer querying times, since they require multiple row lookups in the
SPO, SOP, PSO, POS, OSP and OPS tables (Indexed case) or the SPO, OSP
and OS tables (VP-Indexed case) or the mapping table (Hash case).</p>
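          <p>The extra cost attributed to the mapping table in the Hash case can be made concrete with a small sketch. It is a deliberately reduced model: a single subject-indexed table is used instead of the full Hybrid layout, Python dictionaries stand in for HBase tables, and the hash function (a truncated SHA-1 digest) as well as the names node_hash, HashLayoutSketch, objects_of and mapping_lookups are assumptions introduced for illustration only.</p>
          <preformat>
```python
import hashlib


def node_hash(node):
    """Hypothetical node hash: first 16 hex digits of the SHA-1 digest."""
    return hashlib.sha1(node.encode("utf-8")).hexdigest()[:16]


class HashLayoutSketch:
    """Stores hash values in the data table and nodes in a mapping table."""

    def __init__(self):
        self.by_subject = {}      # subject hash: set of (predicate hash, object hash)
        self.hash_to_node = {}    # mapping table: hash value to original node
        self.mapping_lookups = 0  # counts reads of the mapping table

    def _encode(self, node):
        h = node_hash(node)
        self.hash_to_node[h] = node
        return h

    def _decode(self, h):
        self.mapping_lookups += 1
        return self.hash_to_node[h]

    def add(self, s, p, o):
        row = self.by_subject.setdefault(self._encode(s), set())
        row.add((self._encode(p), self._encode(o)))

    def objects_of(self, s, p):
        # One read of the data table ...
        row = self.by_subject.get(node_hash(s), set())
        p_hash = node_hash(p)
        # ... plus one read of the mapping table per matching object.
        return {self._decode(o_h) for p_h, o_h in row if p_h == p_hash}


store = HashLayoutSketch()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:knows", "ex:carol")
store.add("ex:alice", "foaf:name", "Alice")
friends = store.objects_of("ex:alice", "foaf:knows")
print(sorted(friends), store.mapping_lookups)
```
          </preformat>
          <p>Under these assumptions, every node returned to the caller costs one additional read of the mapping table on top of the read of the data table, which is consistent with the longer query times reported for the Hash layout.</p>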
          <p>Fig. 3 shows a comparison of the Hybrid layout with Jena TDB for increasing
graph sizes. Note that Jena-HBase values have been scaled down for Q1 (by 1000)
and Q9 (by 100) of SP2Bench and for Q1 (by 10) of LUBM for a clear comparison.
We see that TDB outperforms the Hybrid layout in the 1M to 25M range for
SP2Bench and in the N = 50 to 500 range (N is the number of universities) for</p>
        </sec>
        <sec id="sec-2-10-11">
          <fig id="fig-3">
            <label>Fig. 3</label>
            <caption>
              <p>Panels: Graph Querying - Q1, Graph Querying - Q9, Graph Querying - Q1. Legend: TDB, Hybrid.</p>
            </caption>
          </fig>
        </sec>
        <sec id="sec-2-10-16">
          <p>Q1 of LUBM. This is expected, since for these ranges TDB is able to create and
maintain the necessary B+ graph indices in memory, resulting in a shorter
query execution time. Jena-HBase requires multiple graph pattern matches on
increasing graph sizes over a distributed cluster, which makes it slower than TDB.
We also observe that TDB fails to execute Q10 of LUBM for the N = 50 to 500
range, since the test program runs out of memory during inference
for this range. The Hybrid layout successfully executes Q10 for this range, since
the reasoner is able to construct the necessary inference-related data structures.
Finally, we observe that Jena-HBase is more scalable than TDB, which fails to
construct graphs with 100M triples for SP2Bench and N = 1000 universities for
LUBM, thereby preventing the execution of any query on these graphs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we show that creating a distributed RDF storage framework with
existing cloud computing tools results in a scalable data processing solution.
Additionally, our solution maintains a reasonable query execution time overhead
when compared with a single-machine RDF storage framework (viz. Jena TDB).</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work was partially supported by The Air Force Office of Scientific
Research MURI-Grant FA-9550-08-1-0265 and Grant FA-9550-08-1-0260, National
Institutes of Health Grant 1R01LM009989, National Science Foundation (NSF)
Grant Career-CNS-0845803, and NSF Grants CNS-0964350, CNS-1016343 and
CNS-1111529. We thank Dr. Robert Herklotz for his support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Hollenbach</surname>
          </string-name>
          .
          <article-title>Scalable Semantic Web Data Management Using Vertical Partitioning</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>411</fpage>
          -
          <lpage>422</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          .
          <article-title>Efficient RDF Storage and Retrieval in Jena2</article-title>
          .
          <source>Technical report, HP Laboratories</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>V.</given-names>
            <surname>Khadilkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantarcioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castagna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thuraisingham</surname>
          </string-name>
          .
          <article-title>Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2012</year>
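          .
          <ext-link ext-link-type="uri" xlink:href="http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf">http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf</ext-link>
          .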
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hornung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lausen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Pinkel</surname>
          </string-name>
          .
          <article-title>SP2Bench: A SPARQL Performance Benchmark</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          .
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>J. Web Sem.</source>
          ,
          <volume>3</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>158</fpage>
          -
          <lpage>182</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>