Pattern-based analysis of SPARQL queries from the LSQ dataset*

Timo Stegemann and Jürgen Ziegler

University of Duisburg-Essen, Duisburg, Germany,
timo.stegemann@uni-due.de,
http://interactivesystems.info

* The poster is accompanying our ISWC’17 research track paper [3].

Abstract. This paper presents a pattern-based analysis of the Linked Sparql Queries dataset (Lsq). The analysis showed that, of the more than 630,000 unique Select queries stored in the dataset, 99% are represented by only 120 different query patterns.

1 Introduction

In this paper, we present an analysis of the Linked Sparql Queries dataset (Lsq), collected by Saleem et al. [2]. The Lsq dataset has already been evaluated statistically by its authors as well as by others (e.g. [1]). They investigated, among other things, the usage of specific Sparql features (Union, Distinct, Filter, etc.) and different forms of joins (star, path, sink, etc.). In contrast to these evaluations, we analysed the patterns that were used when constructing the queries. The results of our analysis might be relevant for teaching, for developers of Linked Data applications that support users in the query writing process, and for providers of public Sparql endpoints.

2 Analysis of the Linked SPARQL Query Dataset

The Linked Sparql Queries dataset (Lsq) contains nearly 1.75 million unique queries (date: July 2017) with a total of approximately 5.68 million query executions from four different public Sparql endpoints¹. From this dataset we extracted 636,876 unique Select queries with 1,526,804 executions² that did not produce any parse errors and returned a valid result at the time of their logging³. These Select queries represent 91.0% of all valid queries (Ask 4.5%, Describe 3.8%, Construct 0.7%). While Saleem et al. analyzed the dataset as a whole, we focus on the query patterns used.

¹ DBpedia, LinkedGeoData (Lgd), Semantic Web Dog Food (Swdf), British Museum
² Our number of queries differs slightly from the one that Saleem et al. obtained in their analysis. We additionally parsed all queries with functions from the Apache Jena framework. During this procedure, some queries that were marked as correct in the dataset were sorted out because of parsing errors.
³ Queries from the British Museum have been completely filtered out, since it was not recorded whether they returned a result. Furthermore, all of these requests were made by a single agent and match the same simple query pattern.

To extract query patterns from the individual queries, we transformed the queries in the test set into a parameterized form. In the first step, we removed all parts of the query string that have no impact on the pattern itself, such as Prefix, From, Limit, Offset, and Order By, as well as the Distinct keyword and the corresponding parentheses. In the next step, we mapped all Iris, variables, literals and language tags of each query to a generic format (Iris: , variables: ?v#, literals: 'l#', and language tags: @lang, where # is the index of its first occurrence in the query). We replaced all wildcards in Select statements with the corresponding list of variables from the Where statement and harmonized language filter expressions. Fig. 1 shows an example transformation of a Sparql query into parameterized form.

    PREFIX rdfs:
    SELECT DISTINCT * WHERE { ?value rdfs:label 'Semantic Web'@en . } LIMIT 10
    ⇓
    SELECT ?v0 WHERE { ?v0 'l0'@lang }

Fig. 1: Exemplary transformation of a Sparql query into a parameterized form.

In the last step, we merged all identically parameterized queries into a single query pattern. In this way, we obtained 1619 unique query patterns, of which the 120 most frequently executed represent 99% of the queries executed overall. The first 42 represent 95%, the first 21 represent 90%, and the first 3 already represent more than 50% of the queries executed overall (see Fig. 2). Complete results are available online⁴.

⁴ https://semwidg.org/page/download-research#LSQ_Analysis

    Fraction  Overall  DBpedia  LGD  SWDF
    0.50      3        2        1    5
    0.75      6        3        3    13
    0.85      12       6        5    21
    0.90      21       8        9    30
    0.95      42       14       17   44
    0.99      120      52       36   62

Fig. 2: Fitted cumulative Pareto distribution of the most frequently used query patterns overall and for each endpoint, showing how many query patterns are at least required to represent a specific fraction of the executed queries.

A separate analysis for each endpoint in the data set resulted in 1240 unique query patterns for the DBpedia endpoint, 289 for Lgd, and 151 for Swdf. Only 17 patterns appear in all three endpoints (but represent 24.8% of all executions), 27 appear in two different endpoints (9.5%), and 1575 appear in only one endpoint (65.6%).

    Rank  Overall  DBpedia  LGD    SWDF
    1     0.267    0.357    0.525  0.212
    2     0.159    0.221    0.134  0.102
    3     0.129    0.183    0.091  0.102
    4     0.081    0.042    0.060  0.076
    5     0.069    0.028    0.054  0.046
    6     0.046    0.027    0.013  0.033
    …     …        …        …      …
    12    0.001    0.005    0.006  0.021

Fig. 3: Frequency distribution of the most frequently used query patterns.

The Pareto distributions of the query patterns for the DBpedia and Lgd endpoints are similar, although it is notable that the most frequently used pattern in the Lgd endpoint alone accounts for more than half of the executed queries. For the Swdf endpoint, the patterns are more evenly distributed (see Fig. 3), but it should be noted that 16.5% of the executed queries in the Swdf context were probably issued by a Php library⁵ that tests the feature support of an endpoint with a set of very characteristic queries.
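The parameterization and merging steps described in this section, and the coverage counts of the kind reported in Fig. 2, can be approximated with a short script. The following Python sketch is a simplified, regex-based illustration and not the procedure used for the analysis, which relied on Apache Jena for parsing. The function names `parameterize` and `patterns_needed`, the Iri placeholder notation `<i#>`, and the `example.org` Iris are assumptions made for this example. Numeric literals, subqueries, parentheses around Distinct, and the harmonization of language filter expressions are not handled.

```python
import re
from collections import Counter

# Token kinds that are replaced by placeholders. Alternation order matters:
# full IRIs and literals are consumed first, so their content is never
# mistaken for a prefixed name or a language tag.
TOKEN = re.compile(
    r"""(?P<iri><[^<>\s]*>)
      | (?P<var>[?$]\w+)
      | (?P<lit>'[^']*'|"[^"]*")
      | (?P<lang>@[A-Za-z]+(?:-[A-Za-z0-9]+)*)
      | (?P<pname>\w*:\w+)
    """,
    re.VERBOSE,
)


def parameterize(query: str) -> str:
    """Return a simplified parameterized form of a SELECT query."""
    # Remember prefix declarations, then drop PREFIX and FROM clauses.
    prefixes = dict(re.findall(r"PREFIX\s+(\w*):\s*<([^>]*)>", query, flags=re.I))
    q = re.sub(r"PREFIX\s+\w*:\s*<[^>]*>", " ", query, flags=re.I)
    q = re.sub(r"\bFROM\s+(?:NAMED\s+)?<[^>]*>", " ", q, flags=re.I)

    seen = {"iri": {}, "var": {}, "lit": {}}  # value -> index of first occurrence

    def placeholder(match):
        kind, text = match.lastgroup, match.group()
        if kind == "lang":
            return "@lang"
        if kind == "pname":
            prefix, local = text.split(":", 1)
            if prefix not in prefixes:
                return text  # undeclared prefix: leave the token untouched
            kind, text = "iri", f"<{prefixes[prefix]}{local}>"
        if kind == "var":
            text = text[1:]  # ?x and $x name the same variable
        if kind == "lit":
            text = text[1:-1]  # compare literals without their quotes
        index = seen[kind].setdefault(text, len(seen[kind]))
        # '<i#>' is an assumed placeholder notation for IRIs.
        return {"iri": f"<i{index}>", "var": f"?v{index}", "lit": f"'l{index}'"}[kind]

    q = TOKEN.sub(placeholder, q)
    # Drop solution modifiers and DISTINCT (top-level queries only).
    q = re.sub(r"\bORDER\s+BY\b[^{}]*$", " ", q, flags=re.I)
    q = re.sub(r"\b(?:LIMIT|OFFSET)\s+\d+", " ", q, flags=re.I)
    q = re.sub(r"\bDISTINCT\b", " ", q, flags=re.I)
    q = re.sub(r"\b(select|where|optional|filter|union)\b",
               lambda m: m.group().upper(), q, flags=re.I)
    # Replace the SELECT wildcard with all variables in order of first occurrence.
    all_vars = " ".join(f"?v{i}" for i in range(len(seen["var"])))
    q = re.sub(r"\bSELECT\s*\*", lambda m: "SELECT " + all_vars, q, count=1)
    q = re.sub(r"\s*\.\s*}", " }", q)  # drop a final '.' before a closing brace
    return " ".join(q.split())


def patterns_needed(executions, fraction):
    """Count the most frequent patterns needed to cover `fraction` of all executions.

    `executions` is an iterable of (query string, execution count) pairs.
    """
    counts = Counter()
    for query, n in executions:
        counts[parameterize(query)] += n  # merge identically parameterized queries
    total, covered = sum(counts.values()), 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= fraction * total:
            return rank
    return len(counts)


# The query of Fig. 1, with the standard RDF Schema namespace filled in.
example = ("PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
           "SELECT DISTINCT * WHERE { ?value rdfs:label 'Semantic Web'@en . } LIMIT 10")
print(parameterize(example))  # SELECT ?v0 WHERE { ?v0 <i0> 'l0'@lang }

# A hypothetical toy log: the first two queries collapse into one pattern.
log = [
    ("SELECT ?s WHERE { ?s <http://example.org/p> ?o }", 6),
    ("SELECT ?a WHERE { ?a <http://example.org/q> ?b } LIMIT 5", 2),
    ("SELECT ?s ?o WHERE { ?s <http://example.org/p> ?o }", 2),
]
print(patterns_needed(log, 0.5), patterns_needed(log, 0.9))  # 1 2
```

In the toy log, the first two queries differ only in variable names, the Iri, and the Limit clause, so they merge into a single pattern that covers 8 of 10 executions. A parser-based implementation would be needed to handle the full Sparql grammar reliably.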
The 12 most frequently used query patterns of the overall Lsq dataset, representing more than 85% of all executed queries, are listed in Table 1 together with their ranks in the separate analyses of the individual Sparql endpoints. Each additional pattern represents less than 0.1% of the queries executed overall (see Fig. 3).

Most of these patterns are simple subject-predicate-object relations with one or two variables in varying positions, all of which are also part of the Select statement. Four of them (patterns (a), (b), (f) and (i)) use Optional pattern matching. Filter expressions are only used in very simple ways. In pattern (g), the subject variable is restricted to a set of specific Iris. In patterns (d) and (i), a Filter expression is used to restrict the result set to a specified language. In pattern (k), this is done by a language tag. Pattern (k) is also the only one of these patterns that makes use of a literal value. Patterns (b) and (k) have more than one triple pattern sharing the same subject. Pattern (i) is the only complex one of these patterns. In addition to the Filter expression and the Optional pattern matching, it contains a path spanning two triple patterns and a triple pattern that takes a predicate from a previous triple pattern as its subject.

In the overall data set, language filters (in the form of Filter expressions or language tags) are used in 14.8% of all executed queries. Filter expressions that do not restrict the language are used in 5.7%, Unions in 7.5%, and Group By expressions in 0.5%. Aggregate functions are used in 1.2%, most of which are Count expressions (0.9%). Other Sparql features such as subqueries, Bind, or Having are used only in negligibly small numbers.

⁵ http://graphite.ecs.soton.ac.uk/sparqllib/

Table 1: Ranking of the 12 most frequently used query patterns of the Lsq dataset and their positions for the single endpoints. These queries cover more than 85% of all requests.
      Overall  DBpedia  Lgd  Swdf  Query Pattern
(a)   1        -        1    -     SELECT ?v0 WHERE { OPTIONAL { ?v0 } }
(b)   2        1        -    -     SELECT ?v0 ?v1 ?v2 WHERE { ?v0 OPTIONAL { ?v1 ; ?v2 } }
(c)   3        2        4    54    SELECT ?v0 WHERE { ?v0 }
(d)   4        3        128  -     SELECT ?v0 WHERE { ?v0 FILTER langMatches( lang(?v0), 'l0' ) }
(e)   5        31       2    20    SELECT ?v0 ?v1 WHERE { ?v0 ?v1 }
(f)   6        -        3    -     SELECT ?v0 WHERE { OPTIONAL { ?v0 } }
(g)   7        -        5    -     SELECT ?v0 ?v1 WHERE { ?v0 ?v1 FILTER (?v0 = || ?v0 = || ?v0 = || ?v0 = || ?v0 = ) }
(h)   8        5        19   1     SELECT ?v0 ?v1 WHERE { ?v0 ?v1 }
(i)   9        4        -    -     SELECT ?v3 ?v0 ?v1 ?v2 WHERE { ?v0 ?v1 OPTIONAL { ?v1 ?v2 } OPTIONAL { ?v0 ?v3 } FILTER ( ( langMatches( lang(?v3), 'l0' ) || !langMatches( lang(?v3), '*' ) ) && ( langMatches( lang(?v1), 'l0' ) || !langMatches( lang(?v1), '*' ) ) && ( langMatches( lang(?v2), 'l0' ) || !langMatches( lang(?v2), '*' ) ) ) }
(j)   10       6        50   54    SELECT ?v0 WHERE { ?v0 }
(k)   11       7        80   -     SELECT ?v0 ?v1 WHERE { ?v0 'l0'@lang ; ?v1 }
(l)   12       8        -    -     SELECT ?v0 ?v1 WHERE { ?v0 ?v1 }

3 Conclusion

In summary, our analysis showed that the most frequently executed queries in the Lsq dataset are rather simple and are represented by only a few different query patterns. It is common to request several properties of a specific resource in a single query and optionally filter them by their language. Other filter expressions are rare. It should be noted that the analyzed queries are not necessarily fully representative, since the query logs are provided by only three public Sparql endpoints.

References

1. Han, X., Feng, Z., Zhang, X., Wang, X., Rao, G., Jiang, S.: On the statistical analysis of practical SPARQL queries. In: Proceedings of the 19th International Workshop on Web and Databases. p. 2. ACM (2016)
2. Saleem, M., Ali, M.I., Hogan, A., Mehmood, Q., Ngomo, A.C.N.: LSQ: The Linked SPARQL Queries Dataset. In: The Semantic Web - ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II. pp. 261–269. Springer International Publishing, Cham (2015)
3. Stegemann, T., Ziegler, J.: Investigating learnability, user performance, and preferences of the path query language SemwidgQL compared to SPARQL. In: The Semantic Web - ISWC 2017: 16th International Semantic Web Conference (to appear)