
Pattern-based analysis of SPARQL queries from the LSQ dataset

Timo Stegemann

timo.stegemann@uni-due.de

Jürgen Ziegler

University of Duisburg-Essen, Duisburg, Germany

This paper presents a pattern-based analysis of the Linked SPARQL Queries dataset (LSQ). The analysis showed that 99% of the more than 630,000 unique Select queries stored in the dataset are represented by only 120 different query patterns.

Introduction

In this paper, we present an analysis of the Linked SPARQL Queries dataset (LSQ), collected by Saleem et al. [2]. The LSQ dataset has already been evaluated statistically by its authors as well as by others (e.g. [1]). These evaluations investigated, among other things, the usage of specific SPARQL features (Union, Distinct, Filter, etc.) and different forms of joins (star, path, sink, etc.). In contrast, we analysed the patterns that were used when constructing the queries. The results of our analysis might be relevant for teaching, for developers of Linked Data applications that support users in writing queries, and for providers of public SPARQL endpoints.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT *
WHERE { ?value rdfs:label "Semantic Web"@en . }
LIMIT 10

SELECT ?v0 WHERE { ?v0 <i0> l0 @lang }

Fig. 1. Example transformation of a SPARQL query (top) into its parameterized form (bottom).

[...] Describe 3.8%, Construct 0.7%). While Saleem et al. analyzed the dataset as a whole, we focus on the query patterns used.

To extract query patterns from the individual queries, we transformed the queries in the test set into a parameterized form. In the first step, we removed all parts of the query string that have no impact on the pattern itself, such as Prefix, From, Limit, Offset, and Order By, as well as the Distinct keyword and its corresponding parentheses. In the next step, we mapped all IRIs, variables, literals, and language tags of each query to a generic format (IRIs: <i#>, variables: ?v#, literals: l#, and language tags: @lang, where # is the index of the first occurrence in the query). We replaced all wildcards in Select statements with the corresponding list of variables from the Where statement and harmonized language filter expressions. Fig. 1 shows an example transformation of a SPARQL query into parameterized form. In the last step, we merged all identically parameterized queries into a single query pattern.
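The transformation steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `parameterize`, the regex tokenizer, and the choice to number placeholders by first occurrence inside the Where block (with anything that appears only before it numbered afterwards) are assumptions. From, Order By, numeric literals, and the harmonization of language filter expressions are not handled.

```python
import re

# Tokenizer for a small SPARQL subset (assumption: no escaped quotes, no comments).
TOKEN_RE = re.compile(
    r'"[^"]*"'                   # string literal
    r"|@[A-Za-z][A-Za-z0-9-]*"   # language tag
    r"|<[^>\s]*>"                # full IRI
    r"|[?$]\w+"                  # variable
    r"|\w*:[\w-]*"               # prefixed name such as rdfs:label
    r"|\w+"                      # keyword or number
    r"|\S"                       # any other single character
)

FORMATS = {"i": "<i{}>", "v": "?v{}", "l": "l{}"}


def classify(token, prefixes):
    """Return (kind, identity) for IRIs, variables and literals, else None."""
    if token.startswith('"'):
        return ("l", token)
    if token[0] in "?$" and len(token) > 1:
        return ("v", token[1:])
    if token.startswith("<") and token.endswith(">"):
        return ("i", token[1:-1])
    if ":" in token:
        prefix, local = token.split(":", 1)
        # Expand declared prefixes so that equal IRIs share one placeholder.
        return ("i", prefixes.get(prefix + ":", prefix + ":") + local)
    return None


def parameterize(query):
    """Map a query string to its parameterized pattern string."""
    tokens = TOKEN_RE.findall(query)
    prefixes, kept, i = {}, [], 0
    while i < len(tokens):
        upper = tokens[i].upper()
        if upper == "PREFIX" and i + 2 < len(tokens):
            prefixes[tokens[i + 1]] = tokens[i + 2][1:-1]  # remember, then drop
            i += 3
        elif upper in ("LIMIT", "OFFSET"):
            i += 2                                          # drop keyword and number
        elif upper == "DISTINCT":
            i += 1
        else:
            kept.append(tokens[i])
            i += 1

    # Number placeholders, visiting the WHERE block before the SELECT list.
    start = kept.index("{") if "{" in kept else 0
    names, counts = {}, {"i": 0, "v": 0, "l": 0}
    for token in kept[start:] + kept[:start]:
        key = classify(token, prefixes)
        if key is not None and key not in names:
            names[key] = FORMATS[key[0]].format(counts[key[0]])
            counts[key[0]] += 1

    out = []
    for pos, token in enumerate(kept):
        key = classify(token, prefixes)
        if token == "*" and pos > 0 and kept[pos - 1].upper() == "SELECT":
            # Replace the wildcard by the variables in order of first occurrence.
            out.extend(name for k, name in names.items() if k[0] == "v")
        elif token == "." and kept[pos + 1 : pos + 2] == ["}"]:
            continue                                        # optional trailing dot
        elif token.startswith("@"):
            out.append("@lang")
        elif key is not None:
            out.append(names[key])
        else:
            out.append(token)
    return " ".join(out)


query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT *
WHERE { ?value rdfs:label "Semantic Web"@en . }
LIMIT 10
"""
print(parameterize(query))  # SELECT ?v0 WHERE { ?v0 <i0> l0 @lang }
```

Merging identically parameterized queries then amounts to counting equal strings, for example with `collections.Counter(parameterize(q) for q in queries)`, where `queries` is a hypothetical list of query strings.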

Through this, we obtained 1619 unique query patterns, of which the 120 most frequently executed patterns represent 99% of the queries executed overall. The first 42 represent 95%, the first 21 represent 90%, and the first 3 already represent more than 50% of the queries executed overall (see Fig. 2). Complete results are available online (footnote 4).
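Coverage figures of this kind can be derived from per-pattern execution counts by ranking the patterns and accumulating their counts. The following sketch assumes a mapping from pattern to execution count; the name `patterns_needed` and the toy counts are hypothetical and do not reproduce the LSQ numbers.

```python
from itertools import accumulate


def patterns_needed(pattern_counts, share):
    """Number of top-ranked patterns whose executions reach `share` of the total."""
    ranked = sorted(pattern_counts.values(), reverse=True)
    total = sum(ranked)
    for rank, covered in enumerate(accumulate(ranked), start=1):
        if covered >= share * total:
            return rank
    return len(ranked)


# Made-up execution counts per pattern, for illustration only.
toy_counts = {"p1": 60, "p2": 25, "p3": 10, "p4": 4, "p5": 1}
for share in (0.5, 0.9, 0.97):
    print(share, patterns_needed(toy_counts, share))
```

With the toy counts, one pattern covers 50%, three patterns cover 90%, and four cover 97% of the executions.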

A separate analysis for each endpoint in the data set resulted in 1240 unique query patterns for the DBpedia endpoint, 289 for LGD, and 151 for SWDF. Only 17 patterns appear in all three endpoints (but represent 24.8% of all executions), 27 appear in two different endpoints (9.5%), and 1575 appear in only one endpoint (65.6%). The Pareto distributions of the query patterns for the DBpedia and LGD endpoints are similar, although it is notable that the most frequently used pattern in the LGD endpoint alone accounts for more than every second executed query. For the SWDF endpoint, the patterns are more evenly distributed (see Fig. 3), but it should be noted that 16.5% of the queries executed against SWDF were probably issued by a PHP library (footnote 5) that tests the feature support of an endpoint with a set of very characteristic queries.

Footnote 4: https://semwidg.org/page/download-research#LSQ_Analysis
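The overlap between endpoints can be computed by counting, for each pattern, the number of endpoints in which it occurs. The sketch below uses made-up pattern sets and a hypothetical function name `overlap_histogram`. It also checks that the reported figures are mutually consistent: 17 + 27 + 1575 gives the 1619 unique patterns, and counting each pattern once per endpoint gives 3 * 17 + 2 * 27 + 1575 = 1680 = 1240 + 289 + 151.

```python
from collections import Counter


def overlap_histogram(patterns_by_endpoint):
    """Map k to the number of patterns that occur in exactly k endpoints."""
    endpoints_per_pattern = Counter()
    for patterns in patterns_by_endpoint.values():
        endpoints_per_pattern.update(set(patterns))
    return Counter(endpoints_per_pattern.values())


# Made-up pattern sets, for illustration only.
toy = {
    "DBpedia": {"p1", "p2", "p3"},
    "LGD": {"p1", "p2"},
    "SWDF": {"p1", "p4"},
}
histogram = overlap_histogram(toy)
print(dict(histogram))

# Consistency check on the figures reported in the text.
reported = {3: 17, 2: 27, 1: 1575}
assert sum(reported.values()) == 1619
assert sum(k * n for k, n in reported.items()) == 1240 + 289 + 151
```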

The 12 most frequently used query patterns of the overall LSQ dataset, representing more than 85% of all executed queries, are listed in Table 1 together with their rank in the separate analyses of the individual SPARQL endpoints. Each additional pattern represents less than 0.1% of the queries executed overall (see Fig. 3).

Most of these patterns are simple subject-predicate-object relations with one or two variables in varying positions, all of which also appear in the Select statement. Four of them (patterns (a), (b), (f), and (i)) use Optional pattern matching. Filter expressions are used only in very simple ways. In pattern (g), the subject variable is restricted to a set of specific IRIs. In patterns (d) and (i), a Filter expression restricts the result set to a specified language. In pattern (k), this is done by a language tag. Pattern (k) is also the only one of these patterns that makes use of a literal value. Patterns (b) and (k) have more than one triple pattern sharing the same subject. Pattern (i) is the only complex one. In addition to the Filter expression and the Optional pattern matching, it contains a path spanning two triple patterns and a triple pattern that takes a predicate from a previous triple pattern as its subject.

In the overall data set, language filters (in the form of Filter expressions or language tags) are used in 14.8% of all executed queries. Filter expressions that do not restrict the language are used in 5.7%, Unions in 7.5%, and Group By expressions in 0.5%. Aggregate functions are used in 1.2%, most of which are Count expressions (0.9%). Other SPARQL features such as subqueries, Bind, or Having are used only in negligibly small numbers.

Footnote 5: http://graphite.ecs.soton.ac.uk/sparqllib/

Table 1. The 12 most frequently used query patterns with their rank in the overall dataset and in the individual endpoints.

Pattern  Overall  DBpedia  LGD  SWDF  Query pattern
(a)      1        -        1    -     SELECT ?v0 WHERE { OPTIONAL { <i0> <i1> ?v0 } }
(b)      2        1        -    -     SELECT ?v0 ?v1 ?v2 WHERE { <i0> <i1> ?v0 OPTIONAL { <i0> <i2> ?v1 ; <i3> ?v2 } }
(c)      3        2        4    54    SELECT ?v0 WHERE { <i0> <i1> ?v0 }
(d)      4        3        128  -     SELECT ?v0 WHERE { <i0> <i1> ?v0 FILTER langMatches( lang(?v0), l0 ) }
(e)      5        31       2    20    SELECT ?v0 ?v1 WHERE { ?v0 <i0> ?v1 }
(f)      6        -        3    -     SELECT ?v0 WHERE { OPTIONAL { ?v0 <i0> <i1> } }
(g)      7        -        5    -     SELECT ?v0 ?v1 WHERE { ?v0 <i0> ?v1 FILTER (?v0 = <i1> || ?v0 = <i2> || ?v0 = <i3> || ?v0 = <i4> || ?v0 = <i5>) }
(h)      8        5        19   1     SELECT ?v0 ?v1 WHERE { <i0> ?v0 ?v1 }
(i)      9        4        -    -     SELECT ?v3 ?v0 ?v1 ?v2 WHERE { <i0> ?v0 ?v1 OPTIONAL { ?v1 <i1> ?v2 } OPTIONAL { ?v0 <i1> ?v3 } FILTER ( ( langMatches( lang(?v3), l0 ) || !langMatches( lang(?v3), * ) ) && ( langMatches( lang(?v1), l0 ) || !langMatches( lang(?v1), * ) ) && ( langMatches( lang(?v2), l0 ) || !langMatches( lang(?v2), * ) ) ) }
(j)      10       6        50   54    SELECT ?v0 WHERE { ?v0 <i0> <i1> }
(k)      11       7        80   -     SELECT ?v0 ?v1 WHERE { ?v0 <i0> l0 @lang ; <i1> ?v1 }
(l)      12       8        -    -     SELECT ?v0 ?v1 WHERE { ?v0 ?v1 <i0> }
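Feature shares such as those reported for language filters, Filter, and Union can be computed from the parameterized patterns by weighting each pattern with its execution count. The sketch below is an assumption about how such a count could be done, not the authors' procedure. The helper names and the execution counts are hypothetical, and the keyword test is a plain word match on the pattern string.

```python
import re


def share_of_executions(pattern_counts, predicate):
    """Fraction of all executions whose pattern satisfies `predicate`."""
    total = sum(pattern_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for pattern, n in pattern_counts.items() if predicate(pattern))
    return hits / total


def has_keyword(keyword):
    """Build a predicate that tests for a keyword as a whole word, ignoring case."""
    regex = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return lambda pattern: regex.search(pattern) is not None


def has_language_filter(pattern):
    """True if the pattern uses a language tag or a langMatches expression."""
    return "@lang" in pattern or has_keyword("langMatches")(pattern)


# Patterns (c), (d) and (k) with made-up execution counts, for illustration only.
toy_counts = {
    "SELECT ?v0 WHERE { <i0> <i1> ?v0 }": 50,
    "SELECT ?v0 WHERE { <i0> <i1> ?v0 FILTER langMatches( lang(?v0), l0 ) }": 30,
    "SELECT ?v0 ?v1 WHERE { ?v0 <i0> l0 @lang ; <i1> ?v1 }": 20,
}
print(share_of_executions(toy_counts, has_language_filter))
print(share_of_executions(toy_counts, has_keyword("FILTER")))
print(share_of_executions(toy_counts, has_keyword("OPTIONAL")))
```

With these toy counts, half of the executions use a language filter, 30% use a Filter expression, and none use Optional.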

Conclusion

In summary, our analysis showed that the most frequently executed queries in the LSQ dataset are rather simple and are represented by only a few different query patterns. It is common to request several properties of a specific resource in a single query and optionally filter them by their language. Other filter expressions are rare. It should be noted that the analyzed queries are not necessarily fully representative, since the query logs come from only three public SPARQL endpoints.

References

1. Han, X., Feng, Z., Zhang, X., Wang, X., Rao, G., Jiang, S.: On the statistical analysis of practical SPARQL queries. In: Proceedings of the 19th International Workshop on Web and Databases. p. 2. ACM (2016)

2. Saleem, M., Ali, M.I., Hogan, A., Mehmood, Q., Ngomo, A.C.N.: The Semantic Web - ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part [...], chap. LSQ: The Linked SPARQL Queries Dataset, pp. 261-269. Springer International Publishing, Cham (2015)

3. Stegemann, T., Ziegler, J.: Investigating learnability, user performance, and preferences of the path query language SemwidgQL compared to SPARQL. In: The Semantic Web - ISWC 2017: 16th International Semantic Web Conference (to appear)