<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Top-K SPARQL Query Benchmark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shima Zahmatkesh</string-name>
          <email>shima.zahmatkesh@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Della Valle</string-name>
          <email>emanuele.dellavalle@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Dell'Aglio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bozzon</string-name>
          <email>a.bozzon@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DEIB - Politecnico of Milano, Delft University of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The research on optimization of top-k SPARQL query would largely benefit from the establishment of a benchmark that allows comparing different approaches. For such a benchmark to be meaningful, at least two requirements should hold: 1) the benchmark should resemble reality as much as possible, and 2) it should stress the features of the topk SPARQL queries both from a syntactic and performance perspective. In this paper we propose Top-k DBPSB: an extension of the DBpedia SPARQL benchmark (DBPSB), a benchmark known to resemble reality, with the capabilities required to compare SPARQL engines on top-k queries.</p>
      </abstract>
      <kwd-group>
        <kwd>Top-k Query</kwd>
        <kwd>SPARQL</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Top-k queries – queries returning the top k results ordered according to a
userdefined scoring function – are gaining attention in the Database [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Semantic
Web communities [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2–6</xref>
        ]. Order is an important property that can be exploited
to speed up query processing, but state-of-the-art SPARQL engines such as
Virtuoso [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Sesame [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and Jena [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], do no exploit order for query
optimisation purposes. Top-k SPARQL queries are managed with a materialize-then-sort
processing schema [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that computes all the matching solutions (e.g., thousands)
even if only a limited number k (e.g., ten) are requested.
      </p>
      <p>
        Recent works [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2–5</xref>
        ] have shown that an efficient split-and-interleave
processing schema [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] could be adopted to improve the performance of top-k SPARQL
queries. To the best of our knowledge, a consistent comparison of those works
does not exist. As often occurs, the main cause for this fragmentation resides in
the partial lack of a SPARQL benchmark covering top-k SPARQL queries. To
foster the work on top-k query processing within the Semantic Web community,
we believe that it is the right time to define a top-k SPARQL benchmark.
      </p>
      <p>
        Following well known principles of benchmarking [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we formulate the
research question of this work as: is it possible to set up a benchmark for top-k
SPARQL queries, which resembles reality as much as possible and stresses the
features of top-k queries both from a syntactic (i.e., queries should contain
rankrelated clauses) and performance (i.e., the query mix should insist on
characteristics of top-k queries which stress SPARQL engine) perspective?
      </p>
    </sec>
    <sec id="sec-2">
      <title>Metodology</title>
      <p>
        In this poster, we describe our ongoing work on Top-k DBpedia SPARQL
Benchmark (Top-k DBPSB). It extends the methodology, proposed in DBPSB [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], to
build SPARQL benchmarks that resemble reality. It uses the same dataset,
performance metrics, and test driver of DBPSB, but it extends the query
variability feature of DBPSB by proposing an algorithm to automatically create top-k
queries from the 25 auxiliary queries of the DBPSB and its datasets.
      </p>
      <p>In order to present Top-k DBPSB, we need to introduce some terminology:
– Definition 1: Rankable Data Property is a RDF property whose range is
in xsd:int, xsd:long, xsd:float, xsd:integer, xsd:decimal, xsd:double, xsd:date,
xsd:dateTime, xsd:time, or xsd:duration.
– Definition 2: Rankable Triple Pattern is a triple pattern that has a
Rankable Data Property in the property position of the pattern.
– Definition 3: When a variable, in the object position of a Rankable Triple
Pattern, appears in a scoring criteria of the scoring function, we call it
Scoring Variable and we call Rankable Variable the one appearing in the subject
position.</p>
      <p>In the first step, Top-k DBPSB looks for all rankable variables, in each
auxiliary query of each BDPSB query template, which could be used as part of a
ranking criterion in a scoring function. For each DBPSB auxiliary query, Top-k
DBPSB first checks if the variables in the query fit the definition of scoring
variable. To find additional rankable variable Top-k DBPSB considers all variables
in the query and tries also to find rankable triple pattern related to them.</p>
      <p>For the sake of simplicity, Top-k DBPSB generates scoring function as a
weighted sum of normalized ranking criteria, therefore, for each rankable
variable, Top-k DBPSB needs to compute its maximum and minimum value. So, for
each DBPSB auxiliary query and each rankable triple pattern identified in the
previous step, Top-k DBPSB generates a query to find those values.</p>
      <p>After having collected those values, for each rankable triple pattern, Top-k
DBPSB creates a new top-k query template. For instance, Query 2 of Figure
1 shows such a query as generated by the Top-K DBPSB. In order to obtain
an executable query, for each scoring variable that appears in scoring function
Top-K DBPSB adds the related rankable triple pattern to the body of the query.
1.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>Given that we extended DBPSB we give for positively answered the first part of
our research question (i.e., Top-k DBPSB resembles reality). In this section, we
provide preliminary evidence that the query variability feature of Top-k DBPSB
positively answers also the second part of our research question. We do so by
operationalising our research question in three hypotheses:
H.1 The more are the number of the Rankable Variables, the longer is the average
execution time.</p>
      <p>H.2 The more are the number of the Scoring Variables in the scoring function,
the longer is the average execution time.</p>
      <p>H.3 The value of the LIMIT clause has not any significant impact on the average
execution time.</p>
      <p>In order to evaluate our hypothesis, we carry out our experiments by using
the SPARQL engines Virtuoso, and Jena. After preparing the experimental
environment, we load the DBpedia dataset with the scale factors of 10% on the
SPARQL engines. In order to evaluate the performance of SPARQL engines,
we use the DBpedia SPARQL Benchmark test driver and modify it for Top-k
queries. We execute the generated Top-k SPARQL queries against these two
SPARQL engines. The information gains from the execution of randomly
generated query against the SPARQL engines is used to evaluate our hypotheses.</p>
      <p>As a result we got that most of the queries templates generated by Top-k
DBPSB are compatible with our hypotheses H.2 and H.3. On the contrary,
Topk DBPSB does not generate adequate query templates to test the hypothesis H.1:
only 6 queries templates out of the 25 in DBPSB can be use in validating H.1
while the others have only one rankable variable. Further investigation is needed
to refine the hypothesis H.1 and better qualify the cases that stress SPARQL
engines when answering top-k queries.
1 The implementation code is available
//bitbucket.org/sh_zahmatkesh/top − k − dbpsb
online
at
https
:</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        In this work, we presented Top-k DBPSB, an extension of DBPSB [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that adds
the possibility to automatically generate top-k queries. Top-k DBPSB satisfies
the requirement of resembling the reality extending DBPSB which automatically
derives datasets and queries from DBpedia and it query logs. Moreover, we
provide experimental evidence that Top-k DBPSB also satisfies the requirement
to stress the distinguish features of top-k queries.
      </p>
      <p>In order to support the claim that we positively answered to the research
question presented in Section 1, we experimentally showed that the query
variability provided by Top-k DBPSB stresses the SPARQL engines. To this end,
we formulated three hypothesis and we empirically demonstrated that when the
number of scoring variables increases the average execution time also does
(hypothesis H.2) and that the average execution time is independent from the value
used in the LIMIT clause (hypothesis H.3). Counterintuitively, hypothesis H.1 is
not confirmed by our experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beskales</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soliman</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>A survey of top-k query processing techniques in relational database systems</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 40(4)</source>
          (
          <year>2008</year>
          )
          <fpage>11</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Magliacane</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bozzon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Della Valle</surname>
          </string-name>
          , E.:
          <article-title>Efficient execution of top-k sparql queries</article-title>
          .
          <source>In: The Semantic Web-ISWC 2012</source>
          . Springer (
          <year>2012</year>
          )
          <fpage>344</fpage>
          -
          <lpage>360</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duc</surname>
          </string-name>
          , T.T.,
          <string-name>
            <surname>Ladwig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studer</surname>
          </string-name>
          , R.:
          <article-title>Top-k linked data query processing</article-title>
          .
          <source>In: The Semantic Web: Research and Applications</source>
          . Springer (
          <year>2012</year>
          )
          <fpage>56</fpage>
          -
          <lpage>71</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Cheng, J.,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>L.:</given-names>
          </string-name>
          <article-title>f-sparql: a flexible extension of sparql</article-title>
          .
          <source>In: Database and Expert Systems Applications</source>
          , Springer (
          <year>2010</year>
          )
          <fpage>487</fpage>
          -
          <lpage>494</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cedeno</surname>
            ,
            <given-names>J.P.:</given-names>
          </string-name>
          <article-title>A Framework for Top-K Queries over Weighted RDF Graphs</article-title>
          .
          <source>PhD thesis</source>
          , Arizona State University (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Siberski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thaden</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Querying the semantic web with preferences</article-title>
          .
          <source>In: ISWC</source>
          . (
          <year>2006</year>
          )
          <fpage>612</fpage>
          -
          <lpage>624</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Erling</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikhailov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Rdf support in the virtuoso dbms</article-title>
          . In Auer, S.,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhdanova</surname>
          </string-name>
          , A.V., eds.
          <source>: CSSW</source>
          . Volume
          <volume>113</volume>
          of LNI.,
          <source>GI</source>
          (
          <year>2007</year>
          )
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Broekstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kampman</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Sesame: A generic architecture for storing and querying rdf and rdf schema</article-title>
          . In Horrocks, I.,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A</given-names>
          </string-name>
          ., eds.:
          <source>International Semantic Web Conference</source>
          . Volume
          <volume>2342</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2002</year>
          )
          <fpage>54</fpage>
          -
          <lpage>68</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Owens</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibbins</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>mc</surname>
            <given-names>schraefel</given-names>
          </string-name>
          :
          <article-title>Clustered tdb: A clustered triple store for jena</article-title>
          .
          <source>Technical report, Electronics and Computer Science</source>
          , University of Southampton (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gray</surname>
          </string-name>
          , J., ed.:
          <article-title>The Benchmark Handbook for Database and Transaction Systems (2nd Edition)</article-title>
          . Morgan Kaufmann (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>Dbpedia sparql benchmark - performance assessment with real queries on real data</article-title>
          . In Aroyo, L.,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kagal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomqvist</surname>
          </string-name>
          , E., eds.:
          <source>International Semantic Web Conference (1)</source>
          . Volume
          <volume>7031</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2011</year>
          )
          <fpage>454</fpage>
          -
          <lpage>469</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>