=Paper=
{{Paper
|id=Vol-3409/paper15
|storemode=property
|title=A Taxonomy of Basic Graph Pattern Motifs for Understanding SPARQL Query Logs.
|pdfUrl=https://ceur-ws.org/Vol-3409/paper15.pdf
|volume=Vol-3409
|authors=Jaime Salas,Aidan Hogan
|dblpUrl=https://dblp.org/rec/conf/amw/SalasH23
}}
==A Taxonomy of Basic Graph Pattern Motifs for Understanding SPARQL Query Logs.==
A Taxonomy of Basic Graph Pattern Motifs for
Understanding SPARQL Query Logs
Jaime Salas1 , Aidan Hogan1
1
    DCC, Universidad de Chile; IMFD
                                         Abstract
                                         Popular SPARQL query endpoints hosted by open knowledge graphs such as Wikidata and DBpedia
                                         process hundreds of thousands or even millions of queries per day. Making sense of queries at this scale
                                         is challenging. We propose a taxonomy of basic graph patterns (BGPs) in order to induce a hierarchical
                                         structure from such patterns found in a large query log. The leaves of this taxonomy are the raw basic
                                         graph patterns extracted from each query of the log. Each layer thereafter applies a generalisation step
                                         followed by a canonicalisation step, with each layer representing an increasingly coarse partition based
                                         on an increasingly more general motif. Generalisations are applied for constant subjects/objects (nodes),
                                         constant predicates (edge labels), direction, constant/variable distinction, and homomorphic equivalence.
                                         We discuss use-cases, define these generalisation steps, and apply them to induce a taxonomy of BGPs
                                         from a subset of the Wikidata query log.
                                         Keywords
                                         SPARQL, canonicalisation, query logs, graph queries, Wikidata
1. Introduction
There are now hundreds of public SPARQL endpoints available on the Web [1], with some of
the largest, such as those hosted by DBpedia [2] and Wikidata [3], evaluating in the order of
hundreds of thousands or millions of queries per day. Although supporting such a volume of
queries poses significant engineering and scientific challenges [1, 3], samples of these logs have
been published [3, 2], and provide significant research opportunities [4]. Specifically, these
query logs contain key insights into the distribution of queries of interest to different clients
in practice in terms of the complexity of the query patterns [5, 6], which operators are more
(or less) frequently used [6, 2], which operators tend to be used together [6], whether queries
originate from humans or bots [7, 8], how clients refine or modify queries from one request to
the next [9], etc.
   Thus, given a large SPARQL query log such as that published for Wikidata [3], or as part of
the Linked SPARQL Queries dataset [2], there is a wide range of analyses that a researcher or
database administrator can run in order to gain insights into trends in those queries. What we
argue is missing is a way to organise and “browse” the queries of such logs hierarchically. We
AMW’23: Alberto Mendelzon International Workshop on Foundations of Data Management, May 22–26, 2023, Santiago,
Chile
$ jaime.os.salas@gmail.com (J. Salas); ahogan@dcc.uchile.cl (A. Hogan)
 https://aidanhogan.com/ (A. Hogan)
 0000-0002-9353-8955 (J. Salas); 00000-0001-9482-1982 (A. Hogan)
                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
envisage the ability to make sense of large SPARQL query logs via a taxonomy that begins with
very general (equivalence) classes of queries that become increasingly more specific in lower
layers, providing first a high-level view of categories that open up into more and more specific
categories of queries.
   The design of such a taxonomy is challenging, and indeed there are potentially many ways
in which such a taxonomy can be defined. But we think the idea is worth exploring in order
to help bring order to large SPARQL query logs. In this work, we take some initial steps
in this direction by focusing on the case of basic graph patterns [10], proposing a taxonomy
to induce a hierarchical structure from them. This structure is based on generalisation and
canonicalisation [11] steps applied iteratively at each layer of the hierarchy, inducing a set of
motifs at each layer that constitute a partition of the queries considered. We currently foresee
five use-cases:
Human vs. bot detection [7, 8]: The taxonomy can help identify motifs for highly-regular
   queries, such as parameterised queries used by bots.
Caching [12, 13]: The taxonomy can identify frequent motifs that, if cached, could increase
     cache hit rates and improve query evaluation performance.
Optimisation [14]: The taxonomy could be used to decide on specialised indexes and optimi-
     sations to reduce the costs for frequent motifs.
Benchmarking [15, 16]: The taxonomy can be used to classify different types of graph pat-
    terns for benchmarking, enabling the comparison of different engines for different motifs.
Log exploration [6]: The taxonomy can help, for example, to identify unusual traffic through
     highly-specific motifs with large numbers of instances.
Paper outline In Section 2, we discuss related works that classify or generalise queries from
SPARQL logs. Section 3 presents the taxonomy we propose, while Section 4 presents the results
of applying our taxonomy to a sample of the Wikidata query logs. Section 5 concludes.
2. Related Work
The use of SPARQL query logs for understanding user demands has become a rich topic of
interest, leading to a wide body of research. For space reasons, we focus on those works that
specifically look at generalising and/or classifying real-world queries in such logs.
   The first category of related works extract high-level query patterns for the purpose of
analysing logs. Bonifati et al. [6] present a detailed analysis of the Wikidata and LSQ query
logs, where they present frequency distributions relating to query size, operator usage, etc., but
also look at the graph structure of basic graph patterns, defining a list of different graph shapes
that correspond to a specific subset of the high-level motifs explored here.
   The second category of works process log queries for the purposes of benchmarking. Saleem
et al. [15] propose a framework, called FEASIBLE, that extracts different features from queries,
such as join vertex degree and shape, number of triple pattern and their selectivity, etc.; these
features are used to select a small number of representative queries for a benchmark. Angles et
al. [16] propose high-level categories of subqueries (BGPs, property paths, etc.) for the Wikidata
query log that are then used to compare the performance of different engines.
   In terms of novelty, rather than pre-defining shapes/motifs, we instead pre-define a set of
generalisation steps that can produce arbitrary motifs. We are also not aware of works that
propose a hierarchical way to structure such motifs.
3. A Taxonomy of Basic Graph Pattern Motifs
In this section, we present our proposed taxonomy of BGPs based on increasingly high-level
motifs induced by a sequence of generalisation steps. We first present definitions, then an
example, and finally summarise how we implemented software to compute the taxonomy.
Preliminaries. Let I, L and V denote the set of all IRI, literal and variable terms, respectively.
A triple pattern 𝑡 = (𝑠, 𝑝, 𝑜) ∈ (I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V) is an RDF triple allowing
variables in any position. We use C = I ∪ L to denote constants where the distinction between
IRIs and literals does not matter. A set of triple patterns is called a basic graph pattern (BGP).1
Letting 𝐵 denote a BGP, we denote by vars(𝐵) the set of variables appearing in 𝐵, by cons(𝐵)
the set of constants appearing in 𝐵, by nodes(𝐵) the set of subject/object terms (i.e., nodes) in
𝐵, and by labs(𝐵) the set of predicate terms (i.e., edge labels) in 𝐵.
   Given a BGP 𝐵, we will define a sequence of generalisation steps that form increasingly
high-level equivalence classes of BGPs induced by patterns that we call motifs. Specifically, a
generalisation step 𝑠 is a transformation of a BGP that yields an equivalence relation ∼𝑠 on BGPs
such that, given two BGPs 𝐴 and 𝐵, we say that 𝐴 ∼𝑠 𝐵 if and only if 𝑠(𝐴) = 𝑠(𝐵). We then
call 𝑠(𝐴) (or equivalently 𝑠(𝐵)) an 𝑠-motif. To define specific generalisation steps, we introduce
a partial term mapping 𝜏 : C∪V → C∪V whose domain is denoted by dom(𝜏 ). Given a BGP 𝐵,
we denote by 𝜏 (𝐵) the image of 𝐵 under 𝜏 , i.e., 𝜏 (𝐵) = {(𝜏 ′ (𝑠), 𝜏 ′ (𝑝), 𝜏 ′ (𝑜)) | (𝑠, 𝑝, 𝑜) ∈ 𝐵}
where 𝜏 ′ (𝑥) = 𝜏 (𝑥) for all 𝑥 ∈ dom(𝜏 ), and 𝜏 ′ (𝑥) = 𝑥 for all 𝑥 ∈   / dom(𝜏 ); i.e., 𝜏 rewrites
some of the terms in 𝐵, leaving terms for which it is not defined as they are. Letting s, p and o
denote the subject, predicate and object positions, we may also use selective images to rewrite
terms only in selected positions; for example, 𝜏s,o (𝐵) = {(𝜏 ′ (𝑠), 𝑝, 𝜏 ′ (𝑜)) | (𝑠, 𝑝, 𝑜) ∈ 𝐵} only
rewrites subject and object terms with 𝜏 .
Proposed Taxonomy. There are innumerable distinct generalisation steps that one could
consider, and indeed the “best” choice may depend on external factors such as the particular
use-cases, logs, etc., involved. The order in which the generalisation steps are applied can also
affect the resulting taxonomy. In the following, we propose a concrete set of generalisation
steps inspired by the idea of incrementally generalising the graph structure of BGPs. As a
zeroth step, we abstract away variable names from the input BGPs (modulo isomorphism).
In subsequent steps we abstract away the values of particular constants in order to yield an
increasingly abstract graph structure. We then abstract away edge labels and direction in order
to yield a directed and then undirected graph, respectively. Thereafter we abstract away the
1
    We assume that blank nodes appearing in a BGP are mapped to fresh variables [11].
distinction between constants and variables, and finally we compute the core of the graph. As
aforementioned, this is one way in which generalisation steps can be applied. Exploring other
possible steps, and the motifs and taxonomies they yield, is an interesting topic for future work.
Definition 3.1. We define the following generalisation steps, where 𝐴 and 𝐵 are BGPs:
   0. The zeroth step generalises variable names, capturing isomorphism modulo variables. We
      say that 𝐴 ∼0 𝐵 if and only if there exists a one-to-one term mapping 𝛼 : vars(𝐴) →
      vars(𝐵) such that 𝛼(𝐴) = 𝐵.
   1. The first step further generalises constant nodes in a BGP, capturing isomorphism modulo
      variables and constant nodes. We say that 𝐴 ∼1 𝐵 if and only if there exists a one-to-one
      term mapping 𝛽 : nodes(𝐴) ∩ cons(𝐴) → nodes(𝐵) ∩ cons(𝐵) such that 𝛽s,o (𝐴) ∼0 𝐵.
   2. The second step further generalises constant edge labels (predicates) in a BGP, capturing
      isomorphism modulo variables and constants. We say that 𝐴 ∼2 𝐵 if and only if there
      exists a one-to-one term mapping 𝛾 : labs(𝐴) ∩ cons(𝐴) → labs(𝐵) ∩ cons(𝐵) such
      that 𝛾p (𝐴) ∼1 𝐵.
   3. The third step generalises edge labels, capturing isomorphism of the directed graph of
      the BGP while still distinguishing constants and variables. Let c ∈ C and v ∈ V denote
      a reserved constant and variable, respectively. Let 𝛿 : C ∪ V → {c, v} denote a term
      mapping such that 𝛿(𝑥) = c for all 𝑥 ∈ C and 𝛿(𝑥) = v for all 𝑥 ∈ V. We say that
      𝐴 ∼3 𝐵 if and only if 𝛿p (𝐴) ∼2 𝛿p (𝐵).
   4. The fourth step generalises edge direction, capturing isomorphism of the undirected
      graph of the BGP while still distinguishing constants and variables. Let 𝐵 ± = {(𝑠, 𝑝, 𝑜) |
      (𝑠, 𝑝, 𝑜) ∈ 𝐵 or (𝑜, 𝑝, 𝑠) ∈ 𝐵} denote the completion of 𝐵. We say that 𝐴 ∼4 𝐵 if and
      only if 𝐴± ∼3 𝐵 ± .
   5. The fifth step generalises constant and variable distinction, capturing isomorphism of the
      undirected graph of the BGP without distinguishing constants and variables. Here we
      choose to map constants to variables. We say that 𝐴 ∼5 𝐵 if and only if there exists a
      one-to-one term mapping 𝜖 : cons(𝐴) ∪ cons(𝐵) → V ∖ (vars(𝐴) ∪ vars(𝐵)) such that
      𝜖(𝐴) ∼4 𝜖(𝐵).
   6. The sixth step generalises non-core edges and nodes, capturing homomorphic equivalence
      of the undirected graph of the BGPs. We say that there is a homomorphism from 𝐴
      to 𝐵, denoted 𝐴 → 𝐵, if and only if there exists a term mapping 𝜇 : vars(𝐴) →
      vars(𝐵) ∪ cons(𝐵) such that 𝜇(𝐴) ⊆ 𝐵. We then say that 𝐴 and 𝐵 are homomorphically
      equivalent modulo variables, denoted 𝐴 ≃0 𝐵, if and only if 𝐴 → 𝐵 and 𝐵 → 𝐴. Let us
      overload the previous definitions, where we denote by ≃0,...,5 the steps ∼0,...,5 replacing
      the zeroth step ∼0 with ≃0 . We say that 𝐴 ∼6 𝐵 if and only if 𝐴 ≃5 𝐵.
Each generalisation step 𝑠 induces a partition of a set of BGPs ℬ via the quotient set ℬ/∼𝑠 .
Example. Take the following SPARQL query, which intuitively asks for information about
papers published in AMW that cite other AMW papers:
 SELECT ?x ?p ?o WHERE { ?x :venue :AMW . ?y :venue :AMW . ?x :cites ?y . ?x ?p ?o . }
             :venue                         :venue
      :cites           :AMW        :cites
             :venue                         :venue
         (a) Zeroth step               (b) First step               (c) Second step             (d) Third step
                        (e) Fourth step                 (f) Fifth step            (g) Sixth step
Figure 1: Proposed generalisation steps applied to an example SPARQL query
Figure 1 presents the result of the previously defined generalisation steps for the BGP of this
query, where we use circles to represent existential values whose particular value does not
matter, solid lines to indicate constants, and dashed lines to indicate variables. Each step creates
an increasingly high-level motif. Applying these generalisations to a set of BGPs then generates
a partition according to BGPs having the same motif at that level. By design, each generalisation
step builds upon the previous, which means that the partition induced by each step is strictly
finer that the one that came before, inducing a hierarchical taxonomy.2
Implementation. We implemented the extraction of the aforementioned taxonomy using
the QCan package [11], which offers methods for canonicalising SPARQL queries. Specifically,
QCan represents SPARQL BGPs as RDF graphs in a reified form using r-graphs. The package
can then invoke the blabel package [17], which canonically labels existential values, and can
also perform RDF leaning, which is used to calculate the core of the graph required for step 6.
4. An Analysis of Wikidata’s BGPs
In this section, we extract the taxonomy from the BGPs of a large sample of queries from the
Wikidata SPARQL logs [3]. We work with two samples: a subset of the robotic queries, and a
subset of the organic/human queries. The samples are taken from queries received between
2018-02-26 and 2018-03-25, i.e., the most recent interval available in the Wikidata SPARQL logs.
2
    An unintended consequence of layering generalisation steps is that, by renaming predicates separately from nodes,
    the second step may rename a constant in a predicate position differently from how the first step renamed the same
    constant in a node position; correspondences for constants in predicate and node positions are lost. A solution
    would be to merge the first and second steps, losing some granularity, or defining the second step directly over the
    zeroth step without using a selective image.
Table 1
The number of unique motifs, and the largest equivalence class for a single motif, found for each step in
the robotic and organic BGPs from Wikidata.
                    (a) Robotic BGPs                                (b) Organic BGPs
            Step      Unique           Max                  Step      Unique           Max
                0     107,583        239,523                    0     134,329       286,994
                1       2,380        251,956                    1      13,315       388,836
                2         240        590,529                    2       1,776       586,021
                3         198        590,880                    3       1,279       590,948
                4         143        590,880                    4         917       590,948
                5          55      1,190,351                    5         332     1,290,264
                6           3      1,780,647                    6           4     2,197,940
Though performance is not a focus of the current work, we note that computing the taxonomy
for close to four million BGPs took a little over three hours.
Robotic queries We first present the results for robotic queries. We took a random sample
of 1,000,000 queries from the set of robotic queries in the interval. We then extracted all the
BGPs contained in these queries, giving us a total of 1,781,337 BGPs. Table 1a presents some
high-level statistics, showing the number of motifs generated at each step, along with the
motif (equivalence class) with the largest number of BGPs. The results clearly show that each
successive level of the taxonomy finds less unique motifs, while finding larger equivalence
classes for these queries. This is of interest for exploring the logs: a user starting at step 6 (the
highest level) can expand to a reasonable number (55) of child motifs at step 5, ordered by size;
expanding these motifs from step 5, they can expect, on average, 2–3 child motifs, and so forth.
   In Table 2, we present the top 5 motifs at each step for these robotic queries, starting from
step 1. As before, dashed lines indicate variables while solid lines indicate constants. At step
1, the motifs represent parameterised queries, where nodes (subjects/objects), specifically, are
generalised. The Wikidata properties P300 and P214 seen in this step indicate an ISO 3166-2
code for countries, and a VIAF ID used by libraries, respectively. The results show that the most
common BGPs are composed of single triple patterns. The top motif is unusual in that it has
no variable: it stems from a syntactic shortcut used in Wikidata for selecting the languages of
labels returned by a custom service. We also see that the third, fourth and fifth motifs contain
two variables; such BGPs may form part of a larger query, and be contained, for example, in an
OPTIONAL clause. Steps 2 and 3 are quite similar to each other. We see that the most common
motif in both steps refers to a singleton BGP with a constant predicate and variable nodes, but
in the second and fifth positions, we see popular join shapes. Step 4 generalises edge direction,
where we see a similar set of results to the previous two steps; the motif with a single edge
having a constant and variable node is more prominent here as it merges two cases from step 3:
one where the subject is the variable, and the other where the object is the variable. Removing
the constant/variable distinction in step 5, we see many of the simpler motifs merged into the
first two motifs, but thereafter, we see different types of joins, including disconnected patterns
Table 2
Top 5 motifs for robotic queries after each step
                                    Rank of motif per number of occurrences
  Step
                 1                  2                 3                 4                  5
             wb:language      s:description        rdfs:label        wdt:P300          wdt:P214
     1        251,956           135,073             95,105            88,199           83,229
             (14.14%)           (7.58%)            (5.33%)           (4.95%)          (4.67%)
                  c            c1      c2              c                c             c2       c1
     2        590,529           342,885             286,728          268,915           71,524
             (33.15%)          (19.25%)            (16.10%)         (15.10%)          (4.02%)
     3        590,880           342,886             296,520          270,846           71,664
             (33.17%)          (19.25%)            (16.65%)         (15.20%)          (4.02%)
     4        590,880           348,187             302,953          296,520           76,558
             (33.17%)          (19.55%)            (17.01%)         (16.65%)          (4.30%)
     5      1,190,353           428,815             68,540            53,511           26,998
            (66.82%)           (24.07%)            (3.85%)           (3.00%)          (1.52%)
     6      1,780,647             649                 41
                                                                       —                   —
            (99.96%)            (0.04%)            (0.002%)
in the fourth and fifth positions; these disconnected patterns are due to geographic queries
that apply a join between two variables based on geographic distance, rather than a natural
equi-join. At step 6, which looks at the cores of the step 5 graphs, we find three motifs: 1,780,647
BGPs collapse to a core with a single edge between two nodes (this includes acyclic queries and
queries with even-length cycles, for example), 649 BGPs collapse to a self-loop on one node
(this includes all and only queries with a self-loop, i.e., a triple pattern with the same subject
and object), while 41 BGPs collapse to a triangle (as per the example shown in Figure 1).
Organic queries Next we extract our taxonomy from the set of all 872,555 organic queries
in the chosen interval, which contain 2,198,557 BGPs. Table 1b presents the overall statistics
regarding the number of motifs and the largest equivalence class size at each step. Of interest is
that the number of motifs is generally much higher than in the robotic case, indicating more
diverse query shapes. This is to not surprising as one would expect robotic traffic to account
for more regular (e.g., parameterised) queries when compared with queries created by users.
   As before, Table 3 presents the top 5 motifs at each step – starting from step 1 – in the organic
case. The Wikidata properties P31, P625 and P238 seen in step 1 refer to instance of, coordinate
location and IATA airport code, respectively. We see the same top motif as in the case of robotic
queries relating to Wikidata’s label service. However, we also see a number of more complex
motifs, including the fourth and fifth motif, which we suspect may be misclassified bot traffic
(the logs classify organic and robotic traffic using a number of heuristics, such as user-agent,
the volume of similar queries originating from a single source, etc. [3]). The fourth motif is
disconnected; rather than being an equi-join, the motif represents a join based on geographic
distance, where the join condition does not appear in the BGP. The fifth motif notably includes
an edge with a variable, seemingly trying to retrieve all facts about airports. Looking at steps 2, 3
and 4, we see that although organic motifs share a lot in common with their robotic counterparts,
they tend to be more complex, with the fifth motif in these steps having a join on three edges.
At step 5, we again see the return of disconnected patterns resulting from the aforementioned
geographic queries, with a large part of the fourth motif (209,322 of 211,016 instances) being
accounted for by the fourth motif at step 1. At step 6, we again found an edge motif, a self-loop
motif, and a triangle motif, but unlike in the robotic case, we also found a pentagon motif with
a small number of instances (25 BGPs).
5. Conclusions
We believe that better tools are needed in order to summarise, structure, explore and under-
stand the contents of large SPARQL logs. In this work, we have proposed a taxonomy for
understanding the structure (specifically) of BGPs found in such a log. We proposed a sequence
of generalisation steps that, when applied to BGPs, yield increasingly high-level motifs by
incrementally abstracting away details from the BGP, generating a hierarchical taxonomy over
those BGPs. Extracting the taxonomy for a sample of Wikidata queries, we gain some interest-
ing insights into the most common motifs occurring in BGPs relating to frequently accessed
predicates, frequent join types, the rarity of acyclic queries, the presence of disconnected graph
patterns, and some potentially misclassified organic queries.
   For future work, we would like to develop a user interface that allows for exploring the
taxonomy of a given SPARQL query log: starting from step 6 (the highest level), the user is
presented the motifs and the number of associated BGPs in descending order; upon clicking a
motif, the sub-motifs at the lower level are presented in the same order. This would allow the
user to start from a high-level view of the BGPs of the log, and “drill-down” on specific motifs
for more details. We are also interested in exploring different sequences of generalisation steps
that might yield more interesting insights. Another direction would be to study how different
generalisation steps preserve different graph metrics; a key metric along these lines would be
(hyper-)treewidth, which captures how “tree-like” a query is and predicts the complexity of
evaluating the query [6].3 Finally, it would be of interest to explore query features beyond
3
    We note that step 6, which computes the core of the undirected graph, does not preserve treewidth; for example, it
Table 3
Top 5 motifs for organic queries after each step
                                   Rank of motif per number of occurrences
   Step
                  1                 2                 3                4                        5
                                                                    wb:distance
                                                                wb:center wb:radius
                              wdt:P31 rdfs:label                                      wdt:P31       wdt:P238
              wb:language                           wdt:P625         wdt:P625
      1        388,836           239,752           215,078          209,322              209,084
              (17.69%)          (10.90%)           (9.78%)          (9.52%)              (9.51%)
                   c                 c             c1      c2            c               c1          c2
      2        586,021           413,058            250,833          221,809             209,457
              (26.65%)          (18.79%)           (11.41%)         (10.09%)             (9.53%)
      3        590,948           413,175            251,147          237,488             209,474
              (26.88%)          (18.79%)           (11.42%)         (10.80%)             (9.53%)
      4        590,948           413,175            286,165          251,254             209,514
              (26.88%)          (18.79%)           (13.02%)         (11.43%)             (9.53%)
      5       1,290,288          347,870            238,128         211,016                38,859
              (58.69%)          (15.82%)           (10.83%)         (9.60%)               (1.77%)
      6       2,197,940            404               188               25
                                                                                                —
              (99.97%)           (0.02%)           (0.01%)          (0.001%)
simple BGPs; this seems non-trivial as it introduces many alternatives for generalisation steps,
but could help to glean insights into how operators like paths, union, optionals, etc., are used.
Acknowledgments
This work was supported by Fondecyt No. 1221926 and by ANID – Millennium Science Initiative
Program – Code ICN17_002.
References
 [1] P. Vandenbussche, J. Umbrich, L. Matteis, A. Hogan, C. B. Aranda, SPARQLES: monitoring
     public SPARQL endpoints, Semantic Web 8 (2017) 1049–1065. doi:10.3233/SW-170254.
 [2] C. Stadler, M. Saleem, Q. Mehmood, C. Buil-Aranda, M. Dumontier, A. Hogan, A.-C. N.
     Ngomo, LSQ 2.0: A Linked Dataset of SPARQL Query Logs, Semantic Web Journal (2023).
     (to appear).
 [3] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, A. Bielefeldt, Getting the Most Out of
     Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph, in: The Semantic
     Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, Octo-
     ber 8-12, 2018, Proceedings, Part II, 2018, pp. 376–394. doi:10.1007/978-3-030-00668-6\
     _23.
 [4] W. Martens, Towards Theory for Real-World Data, in: PODS ’22: International Conference
     on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, ACM, 2022, pp. 261–276.
     doi:10.1145/3517804.3526066.
 [5] T. Stegemann, J. Ziegler, Pattern-Based Analysis of SPARQL Queries from the LSQ Dataset,
     in: Proceedings of the ISWC 2017 Posters & Demonstrations and Industry Tracks co-located
     with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October
     23rd - to - 25th, 2017, volume 1963 of CEUR Workshop Proceedings, CEUR-WS.org, 2017.
 [6] A. Bonifati, W. Martens, T. Timm, An analytical study of large SPARQL query logs, VLDB
     J. 29 (2020) 655–679. doi:10.1007/s00778-019-00558-9.
 [7] L. Rietveld, R. Hoekstra, Man vs. machine: Differences in SPARQL queries, in: Proceedings
     of the 4th USEWOD Workshop on Usage Analysis and the Web of of Data, ESWC, 2014.
 [8] X. Zhang, M. Wang, B. Zhao, R. Liu, J. Zhang, H. Yang, Characterizing Robotic and
     Organic Query in SPARQL Search Sessions, in: Web and Big Data - 4th International
     Joint Conference, APWeb-WAIM 2020, Tianjin, China, September 18-20, 2020, Proceedings,
     Part I, volume 12317 of Lecture Notes in Computer Science, Springer, 2020, pp. 270–285.
     doi:10.1007/978-3-030-60259-8\_21.
 [9] X. Zhang, M. Wang, M. Saleem, A. N. Ngomo, G. Qi, H. Wang, Revealing secrets in SPARQL
     session level, in: International Semantic Web Conference (ISWC), volume 12506 of LNCS,
     Springer, 2020, pp. 672–690. doi:10.1007/978-3-030-62419-4\_38.
[10] S. Harris, A. Seaborne, E. Prud’hommeaux, SPARQL 1.1 Query Language, W3C Recom-
     mendation, 2013. http://www.w3.org/TR/sparql11-query/.
can collapse cycles of even length to a single edge, cycles of odd length or even cliques to a self-loop, etc.
[11] J. Salas, A. Hogan, Semantics and canonicalisation of SPARQL 1.1, Semantic Web 13 (2022)
     829–893. doi:10.3233/SW-212871.
[12] G. T. Williams, J. Weaver, Enabling fine-grained HTTP caching of SPARQL query re-
     sults, in: The Semantic Web - ISWC 2011 - 10th International Semantic Web Con-
     ference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, 2011, pp. 762–777.
     doi:10.1007/978-3-642-25073-6\_48.
[13] N. Papailiou, D. Tsoumakos, P. Karras, N. Koziris, Graph-aware, workload-adaptive
     SPARQL query caching, in: ACM SIGMOD International Conference on Management of
     Data, ACM, 2015, pp. 1777–1792. doi:10.1145/2723372.2723714.
[14] W. Ali, M. Saleem, B. Yao, A. Hogan, A. N. Ngomo, A survey of RDF stores &
     SPARQL engines for querying knowledge graphs, VLDB J. 31 (2022) 1–26. doi:10.1007/
     s00778-021-00711-3.
[15] M. Saleem, Q. Mehmood, A. N. Ngomo, FEASIBLE: A feature-based SPARQL bench-
     mark generation framework, in: The Semantic Web - ISWC 2015 - 14th International
     Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings,
     Part I, volume 9366 of Lecture Notes in Computer Science, Springer, 2015, pp. 52–69.
     doi:10.1007/978-3-319-25007-6\_4.
[16] R. Angles, C. B. Aranda, A. Hogan, C. Rojas, D. Vrgoc, WDBench: A Wikidata Graph
     Query Benchmark, in: The Semantic Web - ISWC 2022 - 21st International Semantic Web
     Conference, Virtual Event, October 23-27, 2022, Proceedings, volume 13489 of Lecture Notes
     in Computer Science, Springer, 2022, pp. 714–731. doi:10.1007/978-3-031-19433-7\_41.
[17] A. Hogan, Canonical forms for isomorphic and equivalent RDF graphs: Algorithms for
     leaning and labelling blank nodes, ACM TOW 11 (2017) 22:1–22:62. doi:10.1145/3068333.