<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pattern-based analysis of SPARQL queries from the LSQ dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timo Stegemann</string-name>
          <email>timo.stegemann@uni-due.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurgen Ziegler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Duisburg-Essen</institution>
          ,
          <addr-line>Duisburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a pattern-based analysis of the Linked Sparql Queries dataset (Lsq). The analysis showed that, of the more than 630,000 unique Select queries stored in the dataset, 99% are represented by only 120 different query patterns. In this paper, we present an analysis of the Linked Sparql Queries dataset (Lsq), collected by Saleem et al. [2]. The Lsq dataset has already been evaluated statistically by its authors as well as by others (e.g. [1]). They investigated, among other things, the usage of specific Sparql features (Union, Distinct, Filter, etc.) and of different forms of joins (star, path, sink, etc.). In contrast to these evaluations, we analysed the patterns that were used when constructing the queries. The results of our analysis might be relevant in the field of teaching, for developers of Linked Data applications that support users in the query writing process, or for providers of public Sparql endpoints.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Fig. 1. Example transformation of a Sparql query into parameterized form. Original query:</p>
      <p>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
SELECT DISTINCT *
WHERE { ?value rdfs:label "Semantic Web"@en . }
LIMIT 10</p>
      <p>Parameterized form:</p>
      <p>SELECT ?v0 WHERE { ?v0 &lt;i0&gt; l0 @lang }</p>
      <p>Describe 3.8%, Construct 0.7%). While Saleem et al. analyzed the dataset
as a whole, we focus on the query patterns used.</p>
      <p>To extract query patterns from the individual queries, we transformed the
queries in the test set into a parameterized form. In the first step, we removed
all parts of the query string that have no impact on the pattern itself, such
as Prefix, From, Limit, Offset, and Order By, as well as the Distinct
keyword and the corresponding parentheses. In the next step, we mapped all Iris,
variables, literals, and language tags of each query to a generic format (Iris:
&lt;i#&gt;, variables: ?v#, literals: l#, and language tags: @lang, where # is the
index of its first occurrence in the query). We replaced all wildcards in Select
statements with the corresponding list of variables from the Where statement
and harmonized language filter expressions. Fig. 1 shows an example
transformation of a Sparql query into parameterized form. In the last step, we merged
all identically parameterized queries into a single query pattern.</p>
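      <p>The transformation steps described above can be sketched in Python as follows. This is a minimal, regex-based illustration under stated assumptions and not the implementation used for the analysis: the names STRIP, TOKEN, and parameterize are hypothetical, the harmonization of language filter expressions is omitted, and numeric or typed literals, comments, and nested subqueries are not handled. Prefixed names such as rdfs:label are treated as Iris without expanding the prefix.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Clauses that do not affect the pattern (PREFIX, FROM, ORDER BY, LIMIT, OFFSET).
STRIP = [
    r"\bPREFIX\s+[\w-]*:\s*<[^>]*>",
    r"\bFROM\s+(?:NAMED\s+)?<[^>]*>",
    r"\bORDER\s+BY(?:\s+(?:(?:ASC|DESC)\s*\([^()]*\)|[?$]\w+))+",
    r"\b(?:LIMIT|OFFSET)\s+\d+",
]

# Quoted literals (optionally followed by a language tag), IRIs, and variables.
LITERAL = r'"(?:[^"\\]|\\.)*"' + r"|'(?:[^'\\]|\\.)*'"
TOKEN = re.compile(
    rf"(?P<literal>{LITERAL})(?P<lang>@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?"
    r"|(?P<iri><[^<>\s]*>|[A-Za-z][\w-]*:[\w-]+)"
    r"|(?P<var>[?$]\w+)"
)


def parameterize(query):
    """Return the parameterized form of a simple SELECT query (hypothetical helper)."""
    for pattern in STRIP:
        query = re.sub(pattern, " ", query, flags=re.IGNORECASE)
    # Drop DISTINCT, including the parenthesized form DISTINCT(?x).
    query = re.sub(r"\bDISTINCT\s*\(\s*([?$]\w+)\s*\)", r" \1 ", query,
                   flags=re.IGNORECASE)
    query = re.sub(r"\bDISTINCT\b", " ", query, flags=re.IGNORECASE)

    seen = {"literal": {}, "iri": {}, "var": {}}

    def index(kind, text):
        # Index of the first occurrence of this token within the query.
        return seen[kind].setdefault(text, len(seen[kind]))

    def replace(match):
        if match.group("literal") is not None:
            tag = " @lang" if match.group("lang") else ""
            return f"l{index('literal', match.group('literal'))}{tag}"
        if match.group("iri") is not None:
            return f"<i{index('iri', match.group('iri'))}>"
        return f"?v{index('var', match.group('var')[1:])}"

    query = TOKEN.sub(replace, query)

    # Replace a SELECT wildcard with the variables used in the rest of the query.
    wildcard = re.search(r"\bSELECT\s+\*", query, flags=re.IGNORECASE)
    if wildcard:
        rest = query[wildcard.end():]
        variables = dict.fromkeys(re.findall(r"\?v\d+", rest))
        query = query[:wildcard.start()] + "SELECT " + " ".join(variables) + rest

    query = re.sub(r"\s*\.\s*\}", " }", query)  # drop a trailing "." before "}"
    return " ".join(query.split())


# Illustrative input: the query from Fig. 1 plus a made-up query of the same shape.
queries = [
    'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n'
    'SELECT DISTINCT *\n'
    'WHERE { ?value rdfs:label "Semantic Web"@en . }\n'
    'LIMIT 10',
    "SELECT ?s WHERE { ?s <http://example.org/name> 'Alice'@de } LIMIT 5",
]

# Merging step: identical parameterized strings collapse into one pattern.
patterns = Counter(parameterize(q) for q in queries)
print(patterns.most_common())
```
]]></preformat>
      <p>In this sketch both example queries collapse into the same pattern, so the counter holds a single entry with a count of two. A production-quality version would more plausibly operate on a parsed query algebra than on regular expressions.</p>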
      <p>Through this, we obtained 1619 unique query patterns, of which the 120
most frequently executed patterns represent 99% of the queries
executed overall. The first 42 represent 95%, the first 21 represent 90%, and the
first 3 already represent more than 50% of the queries executed overall (see Fig.
2). Complete results are available online<sup>4</sup>.</p>
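      <p>The coverage figures above correspond to a cumulative count over the patterns ranked by execution frequency. The following Python sketch shows one way to compute such a figure; the function name patterns_needed and the toy counts are hypothetical and are not taken from the Lsq data.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def patterns_needed(counts, share):
    """Smallest number of top-ranked patterns that covers at least `share` of all executions."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, executions) in enumerate(counts.most_common(), start=1):
        covered += executions
        if covered >= share * total:
            return rank
    return len(counts)


# Toy data (assumption): executions per pattern, 100 executions in total.
counts = Counter({"A": 60, "B": 25, "C": 10, "D": 5})

for share in (0.5, 0.9, 0.99):
    print(share, patterns_needed(counts, share))
```
]]></preformat>
      <p>With the toy counts, one pattern already covers half of the executions, while three and four patterns are needed for 90% and 99%, respectively.</p>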
      <p>A separate analysis for each endpoint in the dataset resulted in 1240 unique
query patterns for the DBpedia endpoint, 289 for Lgd, and 151 for Swdf. Only 17
patterns appear in all three endpoints (but represent 24.8% of all executions), 27
appear in two different endpoints (9.5%), and 1575 appear in only one endpoint
(65.6%). The Pareto distributions of the query patterns for the DBpedia and Lgd
endpoints are similar, although it is notable that the most frequently used pattern
in the Lgd endpoint alone accounts for more than half of the executed queries.
For the Swdf endpoint, the patterns are more evenly distributed (see Fig. 3),
but it should be noted that 16.5% of the queries executed in the Swdf context were
probably issued by a Php library<sup>5</sup> that tests the feature support of an endpoint
by using a set of very characteristic queries.</p>
      <p><sup>4</sup> https://semwidg.org/page/download-research#LSQ_Analysis</p>
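      <p>The overlap between endpoints described above can be derived from per-endpoint pattern counts. The following Python sketch groups patterns by the number of endpoints in which they occur and reports their share of all executions; the function name endpoint_overlap, the pattern labels, and the counts are hypothetical toy values, not results of the analysis.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def endpoint_overlap(per_endpoint):
    """Map k -> (number of patterns seen in exactly k endpoints, their share of all executions)."""
    total = sum(sum(counts.values()) for counts in per_endpoint.values())
    n_patterns = Counter()
    n_executions = Counter()
    for pattern in set().union(*per_endpoint.values()):
        k = sum(pattern in counts for counts in per_endpoint.values())
        n_patterns[k] += 1
        n_executions[k] += sum(counts[pattern] for counts in per_endpoint.values())
    return {k: (n_patterns[k], n_executions[k] / total) for k in sorted(n_patterns)}


# Toy data (assumption): executions per pattern and endpoint, 100 in total.
per_endpoint = {
    "DBpedia": Counter({"P1": 40, "P2": 10}),
    "Lgd": Counter({"P1": 30, "P3": 10}),
    "Swdf": Counter({"P1": 5, "P2": 5}),
}

overlap = endpoint_overlap(per_endpoint)
print(overlap)
```
]]></preformat>
      <p>In the toy data, a single pattern occurs in all three endpoints and accounts for 75% of the executions, which mirrors the observation that a few shared patterns can carry a disproportionate share of the load.</p>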
      <p>The 12 most frequently used query patterns of the overall Lsq dataset,
representing more than 85% of all executed queries, are listed in Table 1 together
with their ranks in the separate analyses of the different Sparql endpoints.
Each additional pattern represents less than 0.1% of the queries executed overall
(see Fig. 3).</p>
      <p>Most of these patterns are simple subject-predicate-object relations with one
or two variables in varying positions, all of which are also part of the Select
statement. Four of them (patterns (a), (b), (f), and (i)) use Optional pattern
matching. Filter expressions are only used in very simple ways. In pattern (g),
the subject variable is restricted to a set of specific Iris. In patterns (d) and (i),
a Filter expression is used to restrict the result set to a specified language. In
pattern (k), this is done by a language tag. Pattern (k) is also the only one of
these patterns that makes use of a literal value. Patterns (b) and (k) have more
than one triple pattern sharing the same subject. Pattern (i) is the only complex
one among these patterns. In addition to the Filter expression and the Optional
pattern matching, it contains a path spanning two triple patterns and a
triple pattern that takes a predicate from a previous triple pattern as its subject.</p>
      <p>In the overall dataset, language filters (in the form of Filter expressions or
language tags) are used in 14.8% of all executed queries. Filter expressions
that do not restrict the language are used in 5.7%, Unions in 7.5%, and Group
By expressions in 0.5%. Aggregate functions are used in 1.2%, most
of which are Count expressions (0.9%). Other Sparql features such as subqueries,
Bind, or Having are used only in negligible numbers.
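</p>
      <p><sup>5</sup> http://graphite.ecs.soton.ac.uk/sparqllib/</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>The 12 most frequently used query patterns of the overall Lsq dataset with their rank overall and in the separate analyses of the individual endpoints.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Pattern</th><th>Overall</th><th>DBpedia</th><th>Lgd</th><th>Swdf</th><th>Query pattern</th></tr>
          </thead>
          <tbody>
            <tr><td>(a)</td><td>1</td><td>-</td><td>1</td><td>-</td><td>SELECT ?v0 WHERE { OPTIONAL { &lt;i0&gt; &lt;i1&gt; ?v0 } }</td></tr>
            <tr><td>(b)</td><td>2</td><td>1</td><td>-</td><td>-</td><td>SELECT ?v0 ?v1 ?v2 WHERE { &lt;i0&gt; &lt;i1&gt; ?v0 OPTIONAL { &lt;i0&gt; &lt;i2&gt; ?v1 ; &lt;i3&gt; ?v2 } }</td></tr>
            <tr><td>(c)</td><td>3</td><td>2</td><td>4</td><td>54</td><td>SELECT ?v0 WHERE { &lt;i0&gt; &lt;i1&gt; ?v0 }</td></tr>
            <tr><td>(d)</td><td>4</td><td>3</td><td>128</td><td>-</td><td>SELECT ?v0 WHERE { &lt;i0&gt; &lt;i1&gt; ?v0 FILTER langMatches( lang(?v0), l0 ) }</td></tr>
            <tr><td>(e)</td><td>5</td><td>31</td><td>2</td><td>20</td><td>SELECT ?v0 ?v1 WHERE { ?v0 &lt;i0&gt; ?v1 }</td></tr>
            <tr><td>(f)</td><td>6</td><td>-</td><td>3</td><td>-</td><td>SELECT ?v0 WHERE { OPTIONAL { ?v0 &lt;i0&gt; &lt;i1&gt; } }</td></tr>
            <tr><td>(g)</td><td>7</td><td>-</td><td>5</td><td>-</td><td>SELECT ?v0 ?v1 WHERE { ?v0 &lt;i0&gt; ?v1 FILTER (?v0 = &lt;i1&gt; || ?v0 = &lt;i2&gt; || ?v0 = &lt;i3&gt; || ?v0 = &lt;i4&gt; || ?v0 = &lt;i5&gt;) }</td></tr>
            <tr><td>(h)</td><td>8</td><td>5</td><td>19</td><td>1</td><td>SELECT ?v0 ?v1 WHERE { &lt;i0&gt; ?v0 ?v1 }</td></tr>
            <tr><td>(i)</td><td>9</td><td>4</td><td>-</td><td>-</td><td>SELECT ?v3 ?v0 ?v1 ?v2 WHERE { &lt;i0&gt; ?v0 ?v1 OPTIONAL { ?v1 &lt;i1&gt; ?v2 } OPTIONAL { ?v0 &lt;i1&gt; ?v3 } FILTER ( ( langMatches( lang(?v3), l0 ) || !langMatches( lang(?v3), * ) ) &amp;&amp; ( langMatches( lang(?v1), l0 ) || !langMatches( lang(?v1), * ) ) &amp;&amp; ( langMatches( lang(?v2), l0 ) || !langMatches( lang(?v2), * ) ) ) }</td></tr>
            <tr><td>(j)</td><td>10</td><td>6</td><td>50</td><td>54</td><td>SELECT ?v0 WHERE { ?v0 &lt;i0&gt; &lt;i1&gt; }</td></tr>
            <tr><td>(k)</td><td>11</td><td>7</td><td>80</td><td>-</td><td>SELECT ?v0 ?v1 WHERE { ?v0 &lt;i0&gt; l0 @lang ; &lt;i1&gt; ?v1 }</td></tr>
            <tr><td>(l)</td><td>12</td><td>8</td><td>-</td><td>-</td><td>SELECT ?v0 ?v1 WHERE { ?v0 ?v1 &lt;i0&gt; }</td></tr>
          </tbody>
        </table>
      </table-wrap>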
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>In summary, our analysis showed that the most frequently executed queries in
the Lsq dataset are rather simple and are represented by only a few different query
patterns. It is common to request several properties of a specific resource in a
single query and optionally filter them by their language. Other filter
expressions are rare. It should be noted that the analyzed queries are not necessarily
fully representative, since the query logs are provided by only three public
Sparql endpoints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On the statistical analysis of practical SPARQL queries</article-title>
          .
          In:
          <source>Proceedings of the 19th International Workshop on Web and Databases</source>
          . p.
          <fpage>2</fpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <source>The Semantic Web - ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II</source>
          , chap.
          <article-title>LSQ: The Linked SPARQL Queries Dataset</article-title>
          , pp.
          <fpage>261</fpage>
          –
          <lpage>269</lpage>
          .
          <publisher-name>Springer International Publishing</publisher-name>
          ,
          <publisher-loc>Cham</publisher-loc>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Stegemann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Investigating learnability, user performance, and preferences of the path query language SemwidgQL compared to SPARQL</article-title>
          . In: The Semantic Web - ISWC
          <year>2017</year>
          : 16th International Semantic Web Conference (to appear)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>