<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Tool for E ciently Processing SPARQL Queries on RDF Quads</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anas Katib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Praveen Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasil Slavov anaskatib@mail.umkc.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>raopr@umkc.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>vgslavov@mail.umkc.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Missouri-Kansas City</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a tool called RIQ (RDF Indexing on Quads) for e ciently processing SPARQL queries on large RDF datasets containing quads. RIQ's novel design includes: (a) a vector representation of RDF graphs for e cient indexing, (b) a ltering index for e ciently organizing similar RDF graphs, and (c) a decrease-and-conquer strategy for e cient query processing using the ltering index to discard non-matching RDF graphs early on and query rewriting and optimization. During the demo, the user will visually gain insights on how RIQ's unique design leads to fast processing of SPARQL queries on named graphs in a real dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Today, datasets with billions of RDF statements are widely available for
developing new Semantic Web applications (e.g., Billion Triples Challenge 2012
(BTC 2012)1, Bio2RDF2). Such datasets contain RDF statements expressed as
quads where each statement has a subject, predicate, object, and context.
Context is powerful to model provenance information, especially when integrating
data from multiple sources. Quads with the same context can be modeled as
a directed, labeled graph. When the context is an IRI, a dataset of quads
is a collection of RDF named graphs. Here is an example of an RDF
statement (from BTC 2012) in RDF 1.1 N-Quads: &lt;http://dbpedia.org/resource/
Oswego&gt; &lt;http://dbpedia.org/ontology/country&gt; &lt;http://dbpedia.org/
resource/United_States&gt; &lt;http://dbpedia.org/data/Oswego.xml&gt;.
Interestingly, the notion of a named graph simpli es the maintenance of provenance
information for entities in triples contained within the same graph (e.g., using
PROV-DM3). That is, the provenance triples for the entities can be stored either
in the same named graph or a di erent named graph.</p>
      <p>The keyword GRAPH in a SPARQL query restricts the matching of a basic
graph pattern (BGP) to within a particular RDF named graph. If this keyword
is followed by a variable, then the variable is bound to an IRI (of a named
graph), the matching is done over all the named graphs in the dataset. Further,
provenance information can be queried for the bindings by expressing joins via
triple patterns within the same named graph as shown in this example:</p>
      <sec id="sec-1-1">
        <title>1 http://km.aifb.kit.edu/projects/btc-2012</title>
        <p>3 https://www.w3.org/TR/prov-dm</p>
      </sec>
      <sec id="sec-1-2">
        <title>2 http://bio2rdf.org</title>
        <p>
          SELECT ?x ?y ?z ?p1 ?p2 ?g WHERE { GRAPH ?g {
?x &lt;http://dbpedia.org/ontology/country&gt; ?y .
?x &lt;http://dbpedia.org/ontology/populationTotal&gt; ?z .
?x prov:wasGeneratedBy ?p1 . ?p1 prov:wasAssociatedWith ?p2 . }}
While many techniques have been proposed for indexing and query processing
of large RDF datasets containing (subject, predicate, object) triples [
          <xref ref-type="bibr" rid="ref1 ref2 ref6 ref8">6, 1, 2, 8</xref>
          ] on a
single machine, little attention has been paid towards indexing large-scale RDF
datasets with quads. A rei cation approach can be used to represent quads as
triples and use an existing approach like RDF-3X [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, doing so leads
to poor query performance due to increase in the dataset size and the number
of triple patterns in a query [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Another interesting observation is that none
of the previous approaches has investigated how large, complex BGPs (e.g.,
containing undirected cycles) in SPARQL queries can be processed e ciently on
large RDF datasets. State-of-the-art approaches for a local environment were
slow in processing SPARQL queries containing large, complex BGPs on RDF
datasets with over a billion RDF statements [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This is because of the large
number of join operations that must be performed to process a query, which
increases the cost of query optimization and execution.
        </p>
        <p>
          Motivated by the above reasons, we developed RIQ [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for e ciently
processing SPARQL named graph queries on RDF quads. RIQ yielded superior
performance for SPARQL queries on named graphs compared to Virtuoso and Jena
TDB, both of which support quads. It also outperformed a rei cation approach
using RDF-3X as the system to process triples. (While RQ-RDF-3X [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] built
quad indexes to support queries on quads and rei cation statements, its
implementation did not support named graph queries.) This demo paper is based on
our previously published work on RIQ [
          <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
          ].
2
        </p>
        <p>RIQ
We provide an overview of RIQ's novel design to e ciently process SPARQL
select queries on RDF named graphs and highlight our key contributions.
Figures 1(a) and 1(b) show the steps during indexing and query processing in RIQ.</p>
        <p>The rst design feature of RIQ is a new representation for an RDF graph
called a Pattern Vector (PV) to facilitate e cient indexing and ltering during
query processing. A PV is a vector of vectors, one for each pattern in fSP O,
SP ?, S?O, ?P O, S??, ?P ?, ??Og. These patterns are based on the nature of
triple patterns that can appear in a BGP of a query. To construct the PV of an
RDF graph, for each quad, we generate a tuple, one for each canonical pattern.
Each tuple is hashed, and the hash value is stored in the vector for the canonical
pattern. A PV is also constructed for a BGP in a query. Essentially, the concept
of PVs allow us to map a quad/triple in the data and a triple pattern in a query
to a common plane of reference. This will enable us to quickly test if a triple
pattern in a BGP has a match in the data.</p>
        <p>
          The second design feature of RIQ is the PV-Index to e ectively organize
millions of PVs in the database. This index provides two bene ts: First, it creates
(a) Index construction
(b) Query processing
groups of similar RDF graphs e ciently, which are indexed separately. Second,
it serves as the ltering index to speed up query processing. We employ locality
sensitive hashing [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] on the PVs to group similar PVs e ciently (and as a result,
similar RDF graphs) with high probability. We summarize the PVs in a group
using the notion of a union operation on PVs; this preserves a necessary condition
for nding a BGP match and is discussed later. The individual vectors in a PV
are kept sorted. So we can compute the union of the PVs in linear time. The
union of a group of similar PVs is stored compactly using a combination of
Bloom lters (BFs) and Counting Bloom lters (CBFs). The BFs and CBFs for
all the groups constitute the PV-Index. For each group, we also store the ids of
the graphs belonging to it and index these graphs (including the context of each
quad) using a SPARQL processor that supports named graphs/quads.
        </p>
        <p>The third design feature of RIQ is a streamlined approach for query
processing via a decrease-and-conquer strategy: the PV-Index is searched to quickly
discard most of the non-matching RDF graphs early on during query processing
followed by methodical query rewriting and optimization. A query plan called a
BGP Tree is generated to process individual BGPs and constructs like UNION,
OPTIONAL, etc. A key necessary condition that forms the basis for BGP matching
is as follows: If a BGP has a subgraph match in an RDF graph, then for
every canonical pattern, its vector in the BGP's PV is a subset of its vector in the
graph's PV. Thus, we identify a superset of RDF graphs that contain a subgraph
match for the BGP. On each group of the PV-Index, we evaluate the BGP Tree
and test the necessary condition for the BGPs using the group's BFs/CBFs. We
methodically rewrite the original query, generate optimized SPARQL queries
for the candidates identi ed after ltering, and execute these queries (using a
SPARQL processor that supports quads) to produce the nal (exact) output.
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Demonstration Scenarios</title>
      <p>RIQ was implemented using C++, Java, Python, and Django. A user will visually
experience three scenarios and gain insights on how and when the
decrease-andconquer approach of RIQ leads to superior performance than its competitors.
The user will use the BTC 2012 dataset containing about 1.4 billion RDF
statements. He/she will select prebuilt indexes and a few statistics related to
index construction will be shown. Next, the user will choose (or edit) from a
variety of SPARQL queries on named graphs/quads. These queries will contain
large, complex BGPs, small BGPs, or multiple BGPs with keywords like UNION
and OPTIONAL. The queries will also have di erent selectivities based on the
number of named graph matches in the dataset. The wall-clock time taken by
RIQ and its competitors will be shown. The user will also observe the bene t of
RIQ's streamlined approach to query processing and visualize key intermediate
steps. Finally, the user will observe use cases when RIQ is a better choice for local
query processing in federated SPARQL queries. A video showing the features of
RIQ is available at http://vortex.sce.umkc.edu:8080/RIQ.</p>
      <p>Acknowledgments: This work was supported by the National Science Foundation
under Grant Nos. 1115871 and 1620023. We thank Vinutha Nuchimaniyanda for
her assistance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Atre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaoji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Matrix \Bit" loaded: A scalable lightweight join query processor for RDF data</article-title>
          .
          <source>In Proc. of the 19th WWW Conf.</source>
          , pages
          <volume>41</volume>
          {
          <fpage>50</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Bornea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dolby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dantressangle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Udrea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          .
          <article-title>Building an e cient RDF store over a relational database</article-title>
          .
          <source>In Proc. of 2013 SIGMOD Conference</source>
          , pages
          <volume>121</volume>
          {
          <fpage>132</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Haveliwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Indyk</surname>
          </string-name>
          .
          <article-title>Evaluating strategies for similarity search on the Web</article-title>
          .
          <source>In Proc. of 11th WWW Conf.</source>
          , pages
          <volume>432</volume>
          {
          <fpage>442</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Katib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Slavov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rao</surname>
          </string-name>
          . RIQ:
          <article-title>Fast processing of SPARQL queries on RDF quadruples</article-title>
          .
          <source>Journal of Web Semantics</source>
          , 37(C):
          <volume>90</volume>
          {
          <fpage>111</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Leeka</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bedathur</surname>
          </string-name>
          .
          <article-title>RQ-RDF-3X: going beyond triplestores</article-title>
          .
          <source>In Proc. of the 5th Intl. Work. on Data Engineering Meets the Semantic Web</source>
          , pages
          <volume>263</volume>
          {
          <fpage>268</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <surname>G. Weikum.</surname>
          </string-name>
          <article-title>The RDF-3X engine for scalable management of RDF data</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <volume>91</volume>
          {
          <fpage>113</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>V.</given-names>
            <surname>Slavov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paturi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Barenkala</surname>
          </string-name>
          .
          <article-title>Fast processing of SPARQL queries on RDF quadruples</article-title>
          .
          <source>In Proc. of WebDB '14</source>
          , pages
          <issue>1{6</issue>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and L. Liu.
          <article-title>TripleBit: A fast and compact system for large scale RDF data</article-title>
          .
          <source>In Proc. of VLDB Endow.</source>
          ,
          <volume>6</volume>
          (
          <issue>7</issue>
          ):
          <volume>517</volume>
          {
          <fpage>528</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>