<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>QCan: Normalising Congruent SPARQL Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaime Salas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IMFD Chile &amp; Department of Computer Science, University of Chile</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We demonstrate a system to canonicalise (aka. normalise) SPARQL queries for use-cases such as caching, query log analysis, query minimisation, signing queries, etc. Our canonicalisation method deterministically rewrites a given input query to an equivalent canonical form such that the results for two queries are syntactically (string) equal if and only if they give the same results on any database, modulo variable names. The method is sound and complete for a monotone fragment of SPARQL with selection (equalities), projection, join and union under both set and bag semantics. Considering other SPARQL features (e.g., optional, lter, graph, etc.), the underlying equivalence problem becomes undecidable, where we currently rather support a best-e ort canonicalisation for other SPARQL 1.0. features. We demonstrate a prototype of our canonicalisation framework, provide example rewritings, and discuss limitations, use-cases and future work. Demo link: http://qcan.dcc.uchile.cl</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Though there are hundreds of SPARQL endpoints now available on the Web,
many su er from performance problems, including unavailability, variance in
performance, timeouts, partial results, and so forth [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A lot of attention has
thus been given to improving the performance of such query services, either in
terms of exploring query optimisations, better indexing schemes, etc. Such works
aim to improve the e ciency of executing a particular query in isolation.
      </p>
      <p>
        However, a more recent trend in the literature is to consider in more detail the
past queries that such services have received from users and to try to optimise
these queries in particular. One direction in which such research has gone is to
analyse SPARQL query logs to understand what kinds of queries users typically
ask in terms of the features used, their underlying structure, the number of
results they return, and so forth [
        <xref ref-type="bibr" rid="ref11 ref2 ref3 ref4">2,11,4,3</xref>
        ]. Such analyses help to understand the
most important aspects of the SPARQL query language to optimise in practice.
Another direction is to consider caching methods that reuse static results from
previous queries to help process later queries [
        <xref ref-type="bibr" rid="ref12 ref7">12,7</xref>
        ]; such caching methods thus
aim to reduce the overall workload of the service.
      </p>
      <p>Such works are complicated by one key issue: the same \abstract query"
can be expressed in myriad ways in the SPARQL query language, with possible
variations in both the syntax and operators used. More formally, letting Q be
a query and D be a dataset, we denote by Q(D) the results of applying Q over
D. Now letting Q1 and Q2 be two SPARQL queries, we say that Q1 and Q2 are
equivalent, denoted Q1 Q2 if and only if Q1(D) = Q2(D) for any dataset D.
This means that Q1 and Q2 do not di er in semantics { in terms of what results
they will generate from a(ny) dataset { only in how they are expressed. Thus,
for example, in a caching scenario, though we have already stored the results
for Q1, we may not hit the cache for Q2 if we do not consider equivalence, even
though the results are reusable (leaving aside changes to the underlying data).</p>
      <p>Furthermore, the results for SPARQL queries are mappings from variables
to constants and thus two queries that di er only in variable names may not
be equivalent since their results may di er. Hence we arrive at another relation
between queries: we say that Q1 and Q2 are congruent, denoted Q1 = Q2, if and
only if there exists a one-to-one rewriting of variables : V ! V in Q1 such
that (Q1)(D) Q2(D); in other words, Q1 and Q2 are equivalent under some
one-to-one variable rewriting. This notion of query congruence would allow us
to capture further useful queries in, for example, a caching scenario.
Example 1. Consider two queries QA and QB asking for grandparents:</p>
      <sec id="sec-1-1">
        <title>SELECT DISTINCT ?y WHERE f</title>
        <p>ff ??xw ::mmootthheerr ??yx .. gg UUNNIIOONN ff ??xw ::ffaatthheerr ??yx.. gg
?a ?b ?c. ?c ?d ?y . # redundant
g</p>
      </sec>
      <sec id="sec-1-2">
        <title>SELECT DISTINCT ?c WHERE f</title>
        <p>f ?a :mother ?b . ?b :mother ?c g
UUNNIIOONN ff ??aa ::fmaotthheerr ??bb .. ??bb ::mfoatthheerr ??cc gg
UNION f ?a :father ?b . ?b :father ?c g g
Both queries are congruent: they return the same results for any RDF dataset
(modulo variable names). Note that the line marked redundant in QA could be
removed without changing the semantics of the query.
tu</p>
        <p>
          While detecting cases of query equivalence/congruence could be helpful for
various use-cases, such a procedure entails a very high computational complexity.
More speci cally, deciding if two queries Q1 and Q2 are equivalent, in the case
of SPARQL, is NP-complete, even when simply permitting joins (conjunctive
queries). Even worse, the problem becomes undecidable when features such as
projection and optional matches are added [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Furthermore, consider a log or
stream of n queries issued to an endpoint. Traditional methods for deciding
query equivalence take two queries and return true if they are equivalent, or
false otherwise. Using such a decision procedure, we would be forced to perform
O(n2) comparisons to nd all equivalent queries. This would clearly become
prohibitively costly for endpoints that have to deal with millions of queries [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
even if checking individual query pairs does not observe the exponential
worstcases constructed in theory. Hence we see some of the challenges associated with
detecting equivalent/congruent queries in settings such as caching.
2
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>QCan: Query Canonicalisation</title>
      <p>
        While the previous discussion paints a pessimistic view of the possibility of ever
having a procedure that can detect equivalent/congruent queries in a practical
setting such as caching, we highlight two observations that may give us hope: (1)
many real-world queries are quite simple, where the sorts of worst-cases
predicated in theory seem unlikely to occur in current practice [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; (2) we can avoid
quadratic pairwise checks by using a canonicalisation procedure that facilitates
indexing equivalent queries by their normalised string or hash value.
      </p>
      <p>
        Along these lines, in our paper accepted for the ISWC Research Track [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we
proposed a canonicalisation procedure for monotone SPARQL queries { queries
with selection, projection, join and union under set or bag semantics { such
that the canonical form of a query can(Q) is congruent to Q, and the canonical
form of two queries Q1 and Q2 are equal (can(Q1) = can(Q2)) if and only if
Q1 and Q2 are congruent. We proved this procedure to be sound and complete
for the aforementioned fragment of monotone SPARQL queries and through
experiments on real and synthetic queries, we showed it (1) to be e cient for
queries found in practice, taking milliseconds on average; (2) to be capable of
nding more duplicate queries than a baseline syntactic normalisation; (3) to
eventually break on harder synthetic cases that provoke a doubly-exponential
behaviour. Rather than reject queries using other features of SPARQL (1.0), we
discussed how the procedure can be extended to perform a best-e ort (sound
but incomplete) canonicalisation of other features of SPARQL.
      </p>
      <p>
        The canonicalisation procedure is founded on representing the algebra of a
SPARQL query in a simpler structure that we can then normalise (following
techniques known from theoretic works for deciding query equivalence). Speci cally,
we represent a SPARQL query as an RDF graph and then apply normalisation
steps on the RDF graph;1 in this graph, we represent query variables as blank
nodes. Over the RDF graph representation, our rst step is to apply a Union of
Conjunctive Query (UCQ) normalisation: in SPARQL one can express unions of
joins, joins of unions, or any nesting between; hence we rst apply a (potentially
exponential) rewriting of the query such that it is strictly a union of multiple
Conjunctive Queries (CQs; aka. BGP), where each CQ is a join of multiple triple
patterns; this UCQ rewriting will e ectively rewrite QA in Example 1 to a query
of a similar form to QB. Next, under set semantics, as shown for QA in
Example 1, a query may contain redundant patterns, where we apply a minimisation
step: in each CQ of the UCQ computed in the previous stage, we remove
redundant variables (this is done by leaning the RDF sub-graph corresponding to the
CQ), where we then remove redundant CQs from the UCQ. Finally, we apply a
canonical labelling (modulo isomorphism) of the blank nodes, which
deterministically labels the query variables [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The last step is to apply a deterministic
syntactic mapping from the RDF graph back to a SPARQL query.
      </p>
      <p>
        For space reasons, we have sketched our canonicalisation procedure. For
further details { including related work, detailed examples, formalisation, proofs
and experiments { we refer to the extended version of the full paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We
highlight, however, that while there exist frameworks to decide query
equivalence, we are not aware of any framework that canonicalises queries in this
1 This is not dissimilar to the idea of SPIN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], though we require special attention to
ensure correspondences between RDF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and SPARQL equivalence relations.
manner (for SPARQL or even for SQL); the closest work we are aware of is that
of Papailiou et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], who rewrite query variables modulo isomorphism.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>QCan Demo</title>
      <p>In the Poster and Demo track, we plan to provide a demo of our QCan
system, publicly available here: http://qcan.dcc.uchile.cl. The demo is quite
straightforward: given an input query on the left, it computes and displays the
normalised canonical version of the query on the right. Our main motivation is to
be able to give practical hands-on examples of how the canonicalisation process
works to those who may have use for it, and to highlight the current types of
queries supported, what are the current limitations (limited support for lters,
no support for SPARQL 1.1), etc. We would also be keen to discuss use-cases
for this system and what requirements they might imply.</p>
      <p>Acknowledgements This work was supported by the Millennium Institute for
Foundational Research on Data (IMFD) and by Fondecyt Grant No. 1181896.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aranda</surname>
            ,
            <given-names>C.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandenbussche</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SPARQL web-querying infrastructure: Ready for action?</article-title>
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <volume>277</volume>
          {
          <issue>293</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Arias</given-names>
            <surname>Gallego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.D.</given-names>
            ,
            <surname>Mart</surname>
          </string-name>
          nez-Prieto,
          <string-name>
            <surname>M.A.</surname>
          </string-name>
          , de la Fuente,
          <string-name>
            <surname>P.:</surname>
          </string-name>
          <article-title>An empirical study of real-world SPARQL queries</article-title>
          . In: USEWOD (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bielefeldt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonsior</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Krotzsch, M.:
          <article-title>Practical Linked Data Access via SPARQL: The Case of Wikidata</article-title>
          . In: LDOW (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bonifati</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martens</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An analytical study of large SPARQL query logs</article-title>
          .
          <source>PVLDB</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ),
          <volume>149</volume>
          {
          <fpage>161</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Furber,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Hepp</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Using SPARQL and SPIN for data quality management on the semantic web</article-title>
          .
          <source>In: Business Information Systems (BIS)</source>
          . pp.
          <volume>35</volume>
          {
          <issue>46</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Canonical forms for isomorphic and equivalent RDF graphs: Algorithms for leaning and labelling blank nodes</article-title>
          .
          <source>ACM TOW 11(4)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Papailiou</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karras</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koziris</surname>
          </string-name>
          , N.:
          <article-title>Graph-aware, workloadadaptive SPARQL query caching</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <volume>1777</volume>
          {
          <fpage>1792</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantics and complexity of SPARQL</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>34</volume>
          (
          <issue>3</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pichler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skritek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Containment and equivalence of well-designed SPARQL</article-title>
          .
          <source>In: Principles of Database Systems (PODS)</source>
          . pp.
          <volume>39</volume>
          {
          <issue>50</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Salas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Canonicalisation of Monotone SPARQL Queries</article-title>
          .
          <source>ISWC (Accepted)</source>
          (
          <year>2018</year>
          ), see extended version: http://aidanhogan.com/qcan/extended.pdf
          <article-title>(under revision for the camera-ready version)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.N.:</given-names>
          </string-name>
          <article-title>LSQ: the linked SPARQL queries dataset</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>G.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weaver</surname>
          </string-name>
          , J.:
          <article-title>Enabling ne-grained HTTP caching of SPARQL query results</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <volume>762</volume>
          {
          <issue>777</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>