<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RDFVector: An Efficient and Scalable Schema for Semantic Web Knowledge Bases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James P. McGlothlin</string-name>
          <email>jpmcglothlin@utdallas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Texas at Dallas Richardson</institution>
          ,
          <addr-line>TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>As the semantic web grows in popularity and enters the mainstream of computer
technology, RDF(Resource Description Framework) datasets are becoming larger and
more complex. As datasets grow larger and more datasets are linked together,
scalability becomes more important. As more complex ontologies are developed,
there is growing need for efficient queries that handle inference. In areas such as
research, it is vital to be able to perform queries that retrieve not just facts but also
inferred knowledge and uncertain information.</p>
      <p>Currently, scalability and performance issues limit the semantic web from reaching
its full potential. The primary goal of RDFVector is to improve the performance of
queries against large RDF datasets, including queries that involve inferred knowledge
and probabilistic reasoning. RDFVector is a new persistent data model using bit
vectors that is highly efficient. This design allows us to store inferred knowledge
with no penalty. Queries become simpler and more efficient.</p>
      <p>Query
Query</p>
      <p>Engine
POTable
PSTable</p>
      <p>At this point we would like to define our bit vector tables in more detail. The
POTable includes four columns: Property, Object, SubjectBitBector and BitCount.
The SubjectBitVector has a 1 representing every subject that appears with this
property and object in a triple in the dataset. Each URI is dictionary-encoded to a
unique identification number that represent the index into the bit vectors. For
example, all subjects matching &lt;?s type text&gt; can be retrieved from a single tuple in
the POTable and returned as a single bit vector, indexed by the identification
numbers. Joins can be performed as and operations against these bit vectors. The
BitCount is simply the number of on bits in the vector. We use this information to
determine selectivity factors during query optimization. It also important to note that
we keep the vector compressed. As the vectors are quite sparse, this substantially
reduces memory and storage requirements. Similarily, the PSTable includes property,
subject, ObjectBitVector, and bitCount.</p>
      <p>
        Our design is to execute inference at addition time rather than query time.
Consider the triple: &lt;Professor0 type AssociateProfessor&gt;. The LUBM ontology
file[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] will allow us to infer four additional triples:&lt;Professor0 type Professor&gt;,
&lt;Professor0 type Faculty&gt;, &lt;Professor0 type Employee&gt;, &lt;Professor0 type
Person&gt;. Our strategy is to store all inferred triples in the database. Inference rules
register with the system and are called when triples are added, so that inferred
knowledge can be added as well. Thus, all inferred triples are stored in the database
and available to queries. There are 21 subclasses of Person in the LUBM ontology.
Thus, to query all persons using a schema such as vertical partitioning or RDF-3X
requires 21 subqueries and 20 unions. Our solution is both simpler and more
efficient; we can query all persons by retrieving a single type from the POTable. No
unions or subqueries are required. Most database schemata pay a query performance
penalty based on the size of the dataset. Because RDFVector already includes bits in
the vectors for every possible combination of known subject, objects, and properties,
there is no such penalty.
The goal of RDFVector is efficiency and scalability. Therefore, we evaluate our
solution by testing the performance of queries against standardized datasets, and we
compare this performance to the current state of the art solutions. We test using hot
and cold runs, and, for comparison, we utilize source code from the existing research
projects or accepted evaluation research such as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>We also execute experiments to test tradeoffs including storage space and memory
consumption. For additions and deletions, we use cross validation to select and test
subsets of the dataset, and we measure the time to add or delete the selected triples.
4</p>
    </sec>
    <sec id="sec-2">
      <title>Query Evaluation and Performance Results</title>
      <p>
        We have implemented RDFVector using the MonetDB[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] database and tested it on
a single machine. Our current implementation includes support for inference, for
additions and for deletions. We implemented and evaluated experiments with two
benchmarks: the Lehigh University Benchmark(LUBM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the Barton dataset[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
All of our tests were performed on a Dell M6400 laptop with 8GB RAM.
      </p>
      <p>The LUBM dataset is a synthetic dataset, which allowed us to vary the number of
triples in the dataset from 1 to 44 million triples to test scalability. We tested 11
LUBM queries against vertical partitioning and triple store solutions. RDFVector
achieved the best performance in all 11 test cases. The performance improvement
grew as the dataset size was increased, demonstrating scalability. RDFVector
improved performance by over 80.0% in every query for the 44 million triples dataset.</p>
      <p>
        The Barton dataset is an RDF representation of the MIT library catalog. We tested
with all 12 queries defined in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and compared our results with triple store (PSO and
SPO ordered), vertical partitioning[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and RDF-3X[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. RDFVector achieved the best
performance for all 12 queries during cold runs and for 11 out of 12 queries during
hot runs. Table 1 summarizes our results for the Barton dataset.
      </p>
      <p>Table 1. Average query times for Barton dataset
Average(hot)
Geo. Mean(hot)
Average(cold)
Geo. Mean(cold)</p>
      <p>RDFVector
1.02
0.72
1.54
1.08</p>
      <p>RDF-3X
1.76
0.97
4.10
3.24</p>
      <p>VP
3.53
2.01
4.68
3.41</p>
      <p>PSO
3.28
2.21
4.51
3.67</p>
      <p>SPO
4.12
3.21
5.68
5.07
9 of the 11 LUBM test cases and 3 of the 12 Barton Dataset queries involve
inferencing, and RDFVector returned the greatest performance improvement in these
tests. In all of these instances, we implemented inference in the other solution by
backward chaining. However, we also tried materializing inference in RDF-3X (the
next fastest solution) and this actually reduced its performance. This shows that a
triple store pays a query performance penalty for increasing the dataset to add inferred
triples, and RDFVector does not. Furthermore, RDFVector was also faster in the tests
not involving inference.</p>
      <p>We also tested the four trade-offs we were able to identify: storage space, memory
consumption, time to add triples, time to delete triples. Using vector compression, the
size of the Barton Dataset was 3.4GB. The maximum memory consumption was
3.54G to perform all the tests in this paper. The time to add 1 million triples to LUBM
was 71 seconds, and the time to delete 1 million triples was 26.7 seconds.</p>
    </sec>
    <sec id="sec-3">
      <title>5 Related Work</title>
      <p>In this section, we examine related research and products that provide RDF storage
and querying.</p>
      <p>
        Vertical partitioning[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a schema using column store relational databases.
Vertical partitioning creates a two column table for each property, sorted by subject.
Vertical partitioning is similar to the PSTable, and provides subject-subject merge
joins.
      </p>
      <p>
        RDF-3X[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses a triple store schema combined with indexes, histograms and
query optimization. RDF-3X uses its own persistence mechanism, and does not
materialize inference.
      </p>
      <p>
        Hexastore[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a main memory sextuple indexing structure for RDF datasets.
RDFKB implements 2 of these indices with our bit vector tables. By providing
secondary indexing on each table and the triples table, we emulate the other four
indices.
      </p>
      <p>
        Jena[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] allows inferred triples to be forward chained and data to be persisted in a
relational database. However, Jena defines the schema for persistence so it is
restricted to a triple store schema. Furthermore, Jena will require inference to be
reprocessed in the event of updates to the dataset.
      </p>
      <p>
        BitMat[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is similar to our solution in that it uses compressed bit vectors and
performs join operations using bit operations. BitMat combines these vectors into a
complete in-memory bit matrix. BitMat cannot support additions and deletions
without recreating the entire bit matrix. Their architecture requires upfront
knowledge of the set of common subject and object URIs in order to generate the
correct matrix indices necessary to perform subject-object joins. Therefore, additions
cannot be supported because if a URI is added as a subject and object, the bit matrix
would have to be recreated. Additionally, BitMat provides persistence by storing an
image of its memory. BitMat is able to load this image fairly efficiently, but load
time is still large enough that RDFKB will always achieve better performance time on
cold runs.
      </p>
      <p>
        For uncertainty reasoning, we have relied on the final report from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], studying all
of the different solutions and use cases described in this collaborative effort.
6
      </p>
    </sec>
    <sec id="sec-4">
      <title>Future Work</title>
      <p>The first area for future work is uncertainty reasoning. Existing research has
proposed ontology definitions that include the probability that the inference applies.
We will add thresholds that allow vectors to be materialized with probability (1 in the
vector will mean the triple is known with probability&gt;=threshold). We plan to
calculated all inferred triples, and probabilities at add time and stored, and support
queries that select or rank by probability.</p>
      <p>
        A second area for future work is cloud computing. In RDFVector, a significant
portion of query time is spent accessing bit vectors. Once all the bit vectors have been
retrieved, the remaining steps of the query plan, such as joins, can be performed in
main memory through simple bit operations. Hadoop[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] could be used to distribute
the bit vectors reads across a grid of computers, as each vector can be read separately.
Join operations would be performed during the MapReduce processing.
7
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have proposed a solution for querying RDF datasets that is efficient and scalable.
Our bit vector solution is novel and contributes significantly to the field. RDFVector
stores inferred triples without a performance penalty. This enables simple inference
queries with exceptional performance. Our experimental results show that our
solution consistently outperforms vertical partitioning, triple store and RDF-3X. We
tested 23 different queries on two large dataset benchmarks. The degree of
performance improvement increases with the size of the dataset, showing that our
solution is scalable. Inference queries provide even greater performance improvement,
showing the viability of our inference solution.</p>
      <p>Migrating to a cloud computing environment will further improve our scalability.
Adding uncertainty reasoning will enable RDFVector to retrieve more information
than available from state of the art RDF storage solutions.
8</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollenbach</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          :
          <article-title>Scalable Semantic Web Data Management Using Vertical Partitioning</article-title>
          . In VLDB(
          <year>2007</year>
          )
          <fpage>411</fpage>
          -
          <lpage>422</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karras</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hexastore: sextuple indexing for semantic web data management</article-title>
          .
          <source>PVLDB</source>
          (
          <year>2008</year>
          )
          <fpage>1008</fpage>
          -
          <lpage>1019</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sidirourgos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goncalves</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersten</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manegold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Column-store support for RDF data management: not all swans are white</article-title>
          .
          <source>PVLDB</source>
          (
          <year>2008</year>
          )
          <fpage>1553</fpage>
          -
          <lpage>1563</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>RDF-3X: a RISC-style engine for RDF</article-title>
          .
          <source>PVLDB</source>
          (
          <year>2008</year>
          )
          <fpage>647</fpage>
          -
          <lpage>659</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Lehigh University Benchmark (LUBM), http://swat.cse.lehigh.edu/projects/lubm</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. The Barton dataset, http://simile.mit.edu/wiki/Dataset:_Barton</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. MonetDB, http://monetdb.cwi.nl</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jena - A Semantic Web</surname>
          </string-name>
          <article-title>Framework for Java</article-title>
          , http://jena.sourceforge.net
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>W3C</given-names>
            <surname>Uncertainty Reasoning</surname>
          </string-name>
          Incubator Group , http://www.w3.org/2005/Incubator/urw3
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hadoop</surname>
          </string-name>
          , http://hadoop.apache.org
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Atre</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>BitMat: A Main-memory Bit-matrix of RDF Triples</article-title>
          . In ISWC(
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>P.C.G.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.J.:</given-names>
          </string-name>
          <article-title>PR-OWL: A Bayesian Ontology Language for the Semantic Web</article-title>
          .
          <source>In URSW (LNCS</source>
          Vol.
          <volume>)</volume>
          (
          <year>2008</year>
          )
          <fpage>88</fpage>
          -
          <lpage>107</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>