<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rewriting Queries over Summaries of Big Data Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariano P. Consens</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria Fionda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shahan Khatchadourian</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pirro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics, University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KRDB, Free University of Bozen-Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Toronto</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This short paper reports on the bene ts that traversal queries over existing graph stores (such as RDF databases) can gain from a class of optimizations based on summaries. Summaries, also known as structural indexes, have been extensively covered in the literature (see [2] for a brief overview). Despite this, summary-based optimizations are not widely implemented. To make both graph traversal queries and summaries readily available in existing RDF stores, we have devised a translation that outputs SPARQL queries that execute over summaries directly represented in RDF. In what follows, we give an overview of our proposal, illustrate it with an example, and mention preliminary evaluation results on real-world data. Overview. Our approach to summary-based optimization of graph traversal queries on top of SPARQL processors consists of three components. First, we have developed and implemented an algorithm that translates graph traversal queries (see [1, 6] for recent surveys) into SPARQL expressions. The graph traversal language we selected extends the expressiveness of Nested Regular Expressions [5] without compromising the language's low combined data and query complexity. Second, we construct and store summaries alongside the original graph in the RDF store. The summaries we focus on this work are obtianed by dual bisimulation contractions [3, 4]. Third, we build upon our translation algorithm to develop a rewriting optimization that converts graph traversal queries into equivalent SPARQL queries that execute over RDF graphs of both the original data and the summary.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Motivating example. We illustrate our approach to summary-based query rewriting using
a fragment of the linkedmdb.org open movie dataset shown in Fig. 1(a). Consider the query
that asks for actors that have acted in two movies directed by the same director such that one
movie does have information about the country in which it has been shot while the other does
not. Evaluating this query in the fragment shown has the actor labelled Joe Turkel as the
only answer, since he acted in Paths of Glory (with no country information) as well as in The
Shining (shot in GB ), and both movies were directed by Stanley Kubrick (Jack Nicholson
acts in only one movie in this input graph). This query can be expressed as a graph traversal
query starting at the director class node and ending at matching actor nodes.</p>
      <p>Fig 1 (b) shows a dual bisimulation summary materialized as an RDF graph that connects
to the original data fragment in Fig 1 (a). Each block node in the summary is identi ed by a
hash URI and represents a non-overlapping subset of the fragment's nodes using bc:hasExtent
edges to each of the data graph nodes belonging to the block partition (referred to as the
block's extent ). Blocks are also connected by edges that summarize the corresponding
semantic relationship between data nodes across partitions (e.g., :actor, :director, :country,
etc.). Speci cally, a summary edge means that each node in the source block's extent has an
edge to a node in the target block's extent, and dually, each node in the target block's extent
has an incoming edge from a node in the source block's extent. This speci cation warranties
that graph traversals in the summary graph will match exactly those nodes in the data graph
that satisfy that same graph traversal.</p>
      <p>A graph traversal of our motivating example can be evaluated on a summary as follows.
First, the graph traversal expression is rewritten into a SPARQL query over the summary
graph. Edge traversals occur between blocks instead of data graph nodes where summary
predicates can be pre xed with a custom namespace. Since a summary query returns blocks,
the last step of the rewriting consists of navigating from blocks in the summary to nodes
in the data graph by traversing bc:hasExtent edges. Evaluating the rewritten query over the
combined summary and data graph can traverse a lower number of nodes and edges than
evaluating the original query on the data graph.</p>
      <p>Preliminary experimental evaluation. Our initial experiments use several open datasets,
including linkedmdb.org, which contains information about 17K directors, 97K movies and 60K
actors. Computing a dual bisimulation contraction of linkedmdb.org produces a summary with
a total of 14K director blocks, 72K movies blocks, and 50K actor blocks (the blocks with the
largest extent sizes contain 0.4K directors, 3K movies and 2K actors). This summary has 63%
of the nodes and 70% of the edges of the original data graph.</p>
      <p>Graph traversal expressions used to generate SPARQL queries for our experiments are
similar to our motivating example. We choose graph traversal expressions that induce
neighbourhoods with a variety of path lengths, cycles, forward traversals, and backward traversals,
and that return a range of zero to millions of results. Our evaluation has been performed using
Virtuoso as the underlying triple store for each data graph and its summary. Executing the
rewritten SPARQL queries over the combined for linkedmdb.orgsummary and data graphs can
yield speed up factors over 30 times (e.g., elapsed times of tens of seconds for the summary
query compared to tens of minutes for the data graph query).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Barcelo. Querying Graph</surname>
          </string-name>
          <article-title>Patterns</article-title>
          .
          <source>In PODS</source>
          , pages
          <volume>175</volume>
          {
          <fpage>188</fpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          . Structure Indexing.
          <source>In Encyclopedia of Database Systems</source>
          , pages
          <fpage>2857</fpage>
          {
          <fpage>2862</fpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piazza</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Policriti</surname>
          </string-name>
          .
          <article-title>An e cient algorithm for computing bisimulation equivalence</article-title>
          .
          <source>Theor. Comput. Sci.</source>
          ,
          <volume>311</volume>
          (
          <issue>1-3</issue>
          ):
          <volume>221</volume>
          {
          <fpage>256</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Paige</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          .
          <article-title>Three partition re nement algorithms</article-title>
          .
          <source>SIAM Journal on Computing</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ):
          <volume>973</volume>
          {
          <fpage>989</fpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          .
          <article-title>nSPARQL: A Navigational Language for RDF</article-title>
          .
          <source>JWS</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Wood</surname>
          </string-name>
          .
          <article-title>Query Languages for Graph Databases</article-title>
          . SIGMOD Rec.,
          <volume>41</volume>
          (
          <issue>1</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>