<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RDFMatView: Indexing RDF Data for SPARQL Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Humboldt Universitaet zu Berlin</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The Semantic Web is now gaining momentum due to its e®orts to create a universal medium for the exchange of semantically tagged data. The representation and querying of semantic data have been made by means of directed labelled graphs using RDF and SPARQL, standards which have been widely accepted by the scienti¯c community. Currently, most implementations of RDF/SPARQL are based on relational database technology. But executing complex queries in these systems usually is rather slow due to the number of joins that need to be performed. In this article, we describe an indexing method using materialized SPARQL queries as indexes on RDF data sets to reduce the query processing time. We provide a formal de¯nition of materialized SPARQL queries, a cost model to evaluate their impact on query performance, a storage scheme for the materialization, and an algorithm to ¯nd the optimal set of indexes given a query. We also introduce di®erent approaches to integrate materialized queries into an existing SPARQL query engine.</p>
      </abstract>
      <kwd-group>
        <kwd>SPARQL</kwd>
        <kwd>Indexing</kwd>
        <kwd>RDF</kwd>
        <kwd>Materialized Queries</kwd>
        <kwd>Semantic Web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Semantic Web aims to create a universal medium for the exchange of data
where data can be shared and processed by automated tools as well as by people.
The basis for this proposal is a logical data model called Resource Description
Framework (RDF) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An RDF data set is a collection of statements called
triples, of the form (s,p,o) where s is a subject, p is a predicate and o is an object.
Each triple states the relation between subject and object. A set of triples can be
represented as a directed graph where subjects and objects represent nodes and
predicates represent edges connecting these nodes. The SPARQL query language
is the o±cial standard for searching over RDF repositories [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The increasing amount of RDF data have motivated the development of
approaches for e±cient RDF data management. Initially, most SPARQL
implementations were built upon relational databases (e.g. Jena [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], 3Store [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], or
Sesame [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>Some other approaches have proved to be very e±cient based on relational
database technology but following a native data scheme (see for instance [7{9]).
Basically, they propose data structures, which transform the triples table into
several tables according to di®erent conditions and create indexes by combining
triple elements (s,p,o) in all possible ways.</p>
      <p>
        In contrast to these systems, our approach fundamentally is based on the
assumption that some patterns are combined more frequently than others, and
that only indexing those combinations promises to provide large speed-ups at
manageable space and maintenance cost. Note that we do not claim that our
approach o®ers a particular fast SPARQL-processor, compared to systems such
as [7{9]. Instead, we present a new technique to speed up query execution with
SPARQL that is applicable to any SPARQL query processor. We want to
showcase its potential and compare di®erent ways to integrate it into query processing
using one particular system (namely ARQ [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), which has been chosen because
of its widespread use. An extended version of this paper can be found in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
1.1
      </p>
      <p>Contributions
In our research, we capture the following contributions:
{ We propose an indexing method, which fully exploits the RDF graph-structure.</p>
      <p>We do not index single attributes or triples, but fractions of queries that
occur frequently in an expected workload. Therefore, our approach is a native
RDF/SPARQL indexing method whose concepts are viable for all possible
implementations of RDF stores.
{ We provide an indexing method using Materialized Queries for RDF data
sets. Our method can be seen as an orthogonal indexing solution that acts
as cache between SPARQL query processors and triple stores, which may be
used in conjunction with other indexing methods.
2</p>
    </sec>
    <sec id="sec-2">
      <title>The RDFMatView Approach</title>
      <p>
        RDFMatView is a theoretical framework for using materialized SPARQL queries
as indexes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Formally, RDFMatView de¯nes an index as follows:
De¯nition 1 (Index). An index I over a data graph G is a pair I = (P; O),
where P represents a pattern1 and O represents the set of all occurrences of P
in G.
      </p>
      <p>
        Indexes are precomputed queries suitable to speed up other queries when
the index pattern is entirely contained in the query pattern. We say, an index
I is eligible for a query Q when the patterns of I are entirely contained in
the patterns of Q [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, it would be more advantageous when query
processing uses more than one index. For those cases, we require that indexes
overlap. Overlapping indexes are good candidates for reducing query processing
1 Set of triple patterns in the body of I.
because the query engine can combine occurrences of these indexes and generate
solutions without matching against the RDF dataset. To estimate which indexes
bring more savings in time to query execution, the optimizer must choose an
index, a set of indexes or decide to execute the query without using indexes.
      </p>
      <p>We propose a simple model for estimating these savings by using the concept
of selectivity of an index. It de¯nes the relation of the number of occurrences of
an index in a given graph to the possible total number of index occurrences in
the graph. Having computed the selectivity of all indexes, the query optimizer
must determine which index or set of indexes is the best for query processing.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        To integrate our approach into an existing SPARQL query processor we
developed three strategies. Our ¯rst strategy (MatView-and-ARQ) uses ARQ to
process the residual part of the query. RDFMatView extends the results of the
chosen cover by joining the partial solutions with the solutions of the residual
patterns. The second strategy (MatView-to-SQL) is based on a SPARQL-to-SQL
algorithm to translate SPARQL queries into SQL queries. The idea is to directly
access the native Jena storage tables and to combine those results with the index
tables to generate the ¯nal solution. The last strategy (Hybrid) is built from a
combination of the previous two strategies, i.e. ARQ and database execution
engine. We present these strategies for the ARQ system [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, we want
to stress that general process would be the same for any other SPARQL query
processors.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        Because of space reasons, we present an evaluation of our approach by executing
one query on ¯ve data sets [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] ranging from 250K to 25M triples using our three
RDFMatView methods and plain ARQ (without indexes). A description of the
tested query is shown in Listing 1.
      </p>
      <p>Listing 1. Query1 ¯nds products for a given set of generic features.
SELECT * WHERE {
? product rdfs : label ? label .
? product rdf : type ? ProductType .
? product bsbm : productFeature ? ProductFeature1 .
? product bsbm : productFeature ? ProductFeature2 .</p>
      <p>? product bsbm : productPropertyNumeric1 ? value1 .}
Figure 1(a) illustrates the average processing time of Query1 in 5 di®erent
datasets. Clearly, processing time signi¯cantly improves when using
MatViewand-ARQ (M1) and Hybrid (M3). The improvements are the higher, the larger
the database. However, processing time does not signi¯cantly improve when
using MatView-to-SQL (M2). The reason for this is the Jena native storage schema.
Since the values are encoded following the Jena layout, our process needs to parse
(a) Methods comparison
(b) Processing Query1
the stored values and extract the required information. Figure 1(b) show that
the costs estimated by our model roughly correlate with the real processing time
and that it manages, in all cases, to prevent the selection of exceptionally bad
plans. Actually all plans improve the total execution times when compared to
those without using indexes. However, they also show that our model can be
improved. Especially, it does not yet re°ect that using less indexes is advantageous
as this requires less joins at runtime. This fact is captured only indirectly by our
model as we concentrate on the number of covered patterns.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>We have proposed a novel logical and physical framework to speed-up the
execution of SPARQL queries by using materialized queries as indexes. The use of
RDFMatView indexes focuses on minimizing query pattern comparison against
the RDF data set and on minimizing the number of self-joins to answer a query.
We analyze query and index patterns and provide three rewriting methods to use
indexes and get the ¯nal query result set. We also provide a simple cost model
to estimate the savings in time these indexes may bring to query execution. Our
preliminary results have shown that the performance gains are considerable. As
future directions, and in addition to the previous described ideas, we plan to
use variable bindings stored in the index structures to restrict uncovered query
patterns. Another promising topic to extend our approach is index selection for
SPARQL queries taking into account the particularities of RDF and SPARQL,
especially in the area of cost models and scope of selectable indexes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Manola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          : RDF Primer (
          <year>February 2004</year>
          )
          <article-title>W3C Recommendation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Prud'hommeaux</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL Query Language for RDF (</article-title>
          <year>April 2008</year>
          )
          <article-title>W3C Recommendation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sayers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
          </string-name>
          , D.:
          <article-title>E±cient RDF storage and retrieval in Jena2</article-title>
          .
          <source>In: Proc. First International Workshop on Semantic Web and Databases</source>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.G.</surname>
          </string-name>
          :
          <article-title>3store: E±cient Bulk RDF Storage</article-title>
          .
          <source>In: 1st International Workshop on Practical and Scalable Semantic Systems (PSSS'03)</source>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>SPARQL Query Processing with Conventional Relational Database Systems</article-title>
          . In: International Workshop on Scalable Semantic Web Knowledge Base System. (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Broekstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kampman</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema</article-title>
          . In: International Semantic Web Conference. (
          <year>2002</year>
          )
          <volume>54</volume>
          {
          <fpage>68</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollenbach</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Scalable Semantic Web Data Management using Vertical Partitioning</article-title>
          .
          <source>In: VLDB '07: Proceedings of the 33rd international conference on Very large data bases</source>
          ,
          <source>VLDB Endowment</source>
          (
          <year>2007</year>
          )
          <volume>411</volume>
          {
          <fpage>422</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karras</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hexastore: Sextuple Indexing for Semantic Web Data Management</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>2008</year>
          )
          <volume>1008</volume>
          {
          <fpage>1019</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>RDF-3X: a RISC-style engine for RDF</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>2008</year>
          )
          <volume>647</volume>
          {
          <fpage>659</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <article-title>ARQJena: ARQ - A SPARQL Processor for Jena</article-title>
          . http://jena.sourceforge.net/ARQ/ (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rothe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>RDFMatView: Indexing RDF Data for SPARQL Queries</article-title>
          .
          <source>Technical Report 234</source>
          , Humboldt Universitaet zu Berlin (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Halevy</surname>
          </string-name>
          , A.Y.:
          <article-title>Answering Queries Using Views: A Survey</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>10</volume>
          (
          <issue>4</issue>
          ) (
          <year>2001</year>
          )
          <volume>270</volume>
          {
          <fpage>294</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : The Berlin SPARQL Benchmark.
          <source>International Journal On Semantic Web and Information Systems - Special Issue on Scalability and Performance of Semantic Web Systems</source>
          ,
          <year>2009</year>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>