QCan: Normalising Congruent SPARQL Queries

                          Jaime Salas and Aidan Hogan

        IMFD Chile & Department of Computer Science, University of Chile


      Abstract. We demonstrate a system to canonicalise (aka. normalise)
      SPARQL queries for use-cases such as caching, query log analysis, query
      minimisation, signing queries, etc. Our canonicalisation method deter-
      ministically rewrites a given input query to an equivalent canonical form
      such that the results for two queries are syntactically (string) equal if
      and only if they give the same results on any database, modulo variable
      names. The method is sound and complete for a monotone fragment of
      SPARQL with selection (equalities), projection, join and union under
      both set and bag semantics. Considering other SPARQL features (e.g.,
      optional, filter, graph, etc.), the underlying equivalence problem becomes
      undecidable, where we currently rather support a best-effort canonicali-
      sation for other SPARQL 1.0. features. We demonstrate a prototype of
      our canonicalisation framework, provide example rewritings, and discuss
      limitations, use-cases and future work.
      Demo link: http://qcan.dcc.uchile.cl


1   Introduction

Though there are hundreds of SPARQL endpoints now available on the Web,
many suffer from performance problems, including unavailability, variance in
performance, timeouts, partial results, and so forth [1]. A lot of attention has
thus been given to improving the performance of such query services, either in
terms of exploring query optimisations, better indexing schemes, etc. Such works
aim to improve the efficiency of executing a particular query in isolation.
   However, a more recent trend in the literature is to consider in more detail the
past queries that such services have received from users and to try to optimise
these queries in particular. One direction in which such research has gone is to
analyse SPARQL query logs to understand what kinds of queries users typically
ask in terms of the features used, their underlying structure, the number of
results they return, and so forth [2,11,4,3]. Such analyses help to understand the
most important aspects of the SPARQL query language to optimise in practice.
Another direction is to consider caching methods that reuse static results from
previous queries to help process later queries [12,7]; such caching methods thus
aim to reduce the overall workload of the service.
   Such works are complicated by one key issue: the same “abstract query”
can be expressed in myriad ways in the SPARQL query language, with possible
variations in both the syntax and operators used. More formally, letting Q be
a query and D be a dataset, we denote by Q(D) the results of applying Q over
D. Now letting Q1 and Q2 be two SPARQL queries, we say that Q1 and Q2 are
equivalent, denoted Q1 ≡ Q2 if and only if Q1 (D) = Q2 (D) for any dataset D.
This means that Q1 and Q2 do not differ in semantics – in terms of what results
they will generate from a(ny) dataset – only in how they are expressed. Thus,
for example, in a caching scenario, though we have already stored the results
for Q1 , we may not hit the cache for Q2 if we do not consider equivalence, even
though the results are reusable (leaving aside changes to the underlying data).
    Furthermore, the results for SPARQL queries are mappings from variables
to constants and thus two queries that differ only in variable names may not
be equivalent since their results may differ. Hence we arrive at another relation
between queries: we say that Q1 and Q2 are congruent, denoted Q1 ∼   = Q2 , if and
only if there exists a one-to-one rewriting of variables ν : V → V in Q1 such
that ν(Q1 )(D) ≡ Q2 (D); in other words, Q1 and Q2 are equivalent under some
one-to-one variable rewriting. This notion of query congruence would allow us
to capture further useful queries in, for example, a caching scenario.

Example 1. Consider two queries QA and QB asking for grandparents:
 SELECT DISTINCT ?y WHERE {                      SELECT DISTINCT ?c WHERE {
  { ?w :mother ?x . } UNION { ?w :father ?x. }    { ?a :mother ?b . ?b :mother ?c }
  { ?x :mother ?y . } UNION { ?x :father ?y. }    UNION { ?a :mother ?b . ?b :father ?c }
  ?a ?b ?c. ?c ?d ?y . # redundant                UNION { ?a :father ?b . ?b :mother ?c }
 }                                                UNION { ?a :father ?b . ?b :father ?c } }

Both queries are congruent: they return the same results for any RDF dataset
(modulo variable names). Note that the line marked redundant in QA could be
removed without changing the semantics of the query.                      t
                                                                          u

    While detecting cases of query equivalence/congruence could be helpful for
various use-cases, such a procedure entails a very high computational complexity.
More specifically, deciding if two queries Q1 and Q2 are equivalent, in the case
of SPARQL, is NP-complete, even when simply permitting joins (conjunctive
queries). Even worse, the problem becomes undecidable when features such as
projection and optional matches are added [9]. Furthermore, consider a log or
stream of n queries issued to an endpoint. Traditional methods for deciding
query equivalence take two queries and return true if they are equivalent, or
false otherwise. Using such a decision procedure, we would be forced to perform
O(n2 ) comparisons to find all equivalent queries. This would clearly become
prohibitively costly for endpoints that have to deal with millions of queries [11],
even if checking individual query pairs does not observe the exponential worst-
cases constructed in theory. Hence we see some of the challenges associated with
detecting equivalent/congruent queries in settings such as caching.


2    QCan: Query Canonicalisation

While the previous discussion paints a pessimistic view of the possibility of ever
having a procedure that can detect equivalent/congruent queries in a practical
setting such as caching, we highlight two observations that may give us hope: (1)
many real-world queries are quite simple, where the sorts of worst-cases predi-
cated in theory seem unlikely to occur in current practice [4]; (2) we can avoid
quadratic pairwise checks by using a canonicalisation procedure that facilitates
indexing equivalent queries by their normalised string or hash value.
    Along these lines, in our paper accepted for the ISWC Research Track [10], we
proposed a canonicalisation procedure for monotone SPARQL queries – queries
with selection, projection, join and union under set or bag semantics – such
that the canonical form of a query can(Q) is congruent to Q, and the canonical
form of two queries Q1 and Q2 are equal (can(Q1 ) = can(Q2 )) if and only if
Q1 and Q2 are congruent. We proved this procedure to be sound and complete
for the aforementioned fragment of monotone SPARQL queries and through
experiments on real and synthetic queries, we showed it (1) to be efficient for
queries found in practice, taking milliseconds on average; (2) to be capable of
finding more duplicate queries than a baseline syntactic normalisation; (3) to
eventually break on harder synthetic cases that provoke a doubly-exponential
behaviour. Rather than reject queries using other features of SPARQL (1.0), we
discussed how the procedure can be extended to perform a best-effort (sound
but incomplete) canonicalisation of other features of SPARQL.
    The canonicalisation procedure is founded on representing the algebra of a
SPARQL query in a simpler structure that we can then normalise (following tech-
niques known from theoretic works for deciding query equivalence). Specifically,
we represent a SPARQL query as an RDF graph and then apply normalisation
steps on the RDF graph;1 in this graph, we represent query variables as blank
nodes. Over the RDF graph representation, our first step is to apply a Union of
Conjunctive Query (UCQ) normalisation: in SPARQL one can express unions of
joins, joins of unions, or any nesting between; hence we first apply a (potentially
exponential) rewriting of the query such that it is strictly a union of multiple
Conjunctive Queries (CQs; aka. BGP), where each CQ is a join of multiple triple
patterns; this UCQ rewriting will effectively rewrite QA in Example 1 to a query
of a similar form to QB . Next, under set semantics, as shown for QA in Exam-
ple 1, a query may contain redundant patterns, where we apply a minimisation
step: in each CQ of the UCQ computed in the previous stage, we remove redun-
dant variables (this is done by leaning the RDF sub-graph corresponding to the
CQ), where we then remove redundant CQs from the UCQ. Finally, we apply a
canonical labelling (modulo isomorphism) of the blank nodes, which determin-
istically labels the query variables [6]. The last step is to apply a deterministic
syntactic mapping from the RDF graph back to a SPARQL query.
    For space reasons, we have sketched our canonicalisation procedure. For fur-
ther details – including related work, detailed examples, formalisation, proofs
and experiments – we refer to the extended version of the full paper [10]. We
highlight, however, that while there exist frameworks to decide query equiv-
alence, we are not aware of any framework that canonicalises queries in this

1
    This is not dissimilar to the idea of SPIN [5], though we require special attention to
    ensure correspondences between RDF [6] and SPARQL equivalence relations.
manner (for SPARQL or even for SQL); the closest work we are aware of is that
of Papailiou et al. [7], who rewrite query variables modulo isomorphism.

3   QCan Demo
In the Poster and Demo track, we plan to provide a demo of our QCan sys-
tem, publicly available here: http://qcan.dcc.uchile.cl. The demo is quite
straightforward: given an input query on the left, it computes and displays the
normalised canonical version of the query on the right. Our main motivation is to
be able to give practical hands-on examples of how the canonicalisation process
works to those who may have use for it, and to highlight the current types of
queries supported, what are the current limitations (limited support for filters,
no support for SPARQL 1.1), etc. We would also be keen to discuss use-cases
for this system and what requirements they might imply.
Acknowledgements This work was supported by the Millennium Institute for Foun-
dational Research on Data (IMFD) and by Fondecyt Grant No. 1181896.


References
 1. Aranda, C.B., Hogan, A., Umbrich, J., Vandenbussche, P.: SPARQL web-querying
    infrastructure: Ready for action? In: International Semantic Web Conference
    (ISWC). pp. 277–293 (2013)
 2. Arias Gallego, M., Fernández, J.D., Martı́nez-Prieto, M.A., de la Fuente, P.: An
    empirical study of real-world SPARQL queries. In: USEWOD (2011)
 3. Bielefeldt, A., Gonsior, J., Krötzsch, M.: Practical Linked Data Access via
    SPARQL: The Case of Wikidata. In: LDOW (2018)
 4. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query
    logs. PVLDB 11(2), 149–161 (2017)
 5. Fürber, C., Hepp, M.: Using SPARQL and SPIN for data quality management on
    the semantic web. In: Business Information Systems (BIS). pp. 35–46 (2010)
 6. Hogan, A.: Canonical forms for isomorphic and equivalent RDF graphs: Algorithms
    for leaning and labelling blank nodes. ACM TOW 11(4) (2017)
 7. Papailiou, N., Tsoumakos, D., Karras, P., Koziris, N.: Graph-aware, workload-
    adaptive SPARQL query caching. In: ACM SIGMOD International Conference on
    Management of Data. pp. 1777–1792. ACM (2015)
 8. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM
    Trans. Database Syst. 34(3) (2009)
 9. Pichler, R., Skritek, S.: Containment and equivalence of well-designed SPARQL.
    In: Principles of Database Systems (PODS). pp. 39–50 (2014)
10. Salas, J., Hogan, A.: Canonicalisation of Monotone SPARQL Queries. ISWC (Ac-
    cepted) (2018), see extended version: http://aidanhogan.com/qcan/extended.pdf
    (under revision for the camera-ready version)
11. Saleem, M., Ali, M.I., Hogan, A., Mehmood, Q., Ngomo, A.N.: LSQ: the linked
    SPARQL queries dataset. In: International Semantic Web Conference (ISWC)
    (2015)
12. Williams, G.T., Weaver, J.: Enabling fine-grained HTTP caching of SPARQL query
    results. In: International Semantic Web Conference (ISWC). pp. 762–777 (2011)