LSQ Framework: The LSQ Framework for SPARQL Query Log Processing Claus Stadler1,* , Muhammad Saleem2 and Axel-Cyrille Ngonga Ngomo2 1 AKSW Research Group, University of Leipzig, Germany 2 Data Science Group, Department of Computer Science, Paderborn University, Germany Abstract The Linked SPARQL Queries (LSQ) datasets contain real-world SPARQL queries collected from the query logs of publicly available SPARQL endpoints. In LSQ, each SPARQL query is represented as RDF with various structural and data-driven features attached. In this paper, we present the LSQ Java framework for creating rich knowledge graphs from SPARQL query logs. The framework is able to RDFize SPARQL query logs, which are available in different formats, in a scalable way. Furthermore, the framework offers a set of static and dynamic enrichers. Static enrichers derive information from the queries, such as their number of basic graph patterns and projected variables, or even a full SPIN model. Dynamic enrichment involves additional resources. For instance, the benchmark enricher executes queries against a SPARQL endpoint and collects query execution times and result set sizes. This framework has already been used to convert the query logs of 27 public SPARQL endpoints, representing 43.95 million executions of 11.56 million unique SPARQL queries. The LSQ queries have been used in many use cases, such as benchmarking based on real-world SPARQL queries, SPARQL adoption analysis, caching, query optimization, usability analysis, and meta-querying. The realization of LSQ required devising novel software components to (a) improve the scalability of RDF data processing with the Apache Spark Big Data framework and (b) ease operations on complex RDF data models, such as controlled skolemization.
Following the spirit of open-source software development and the "don't repeat yourself" (DRY) paradigm, the work on the LSQ framework also resulted in contributions to Apache Jena in order to make these improvements readily available outside of the LSQ context.

Keywords: SPARQL, RDF, LSQ, Query Log, Software Framework

6th Workshop on Storing, Querying and Benchmarking Knowledge Graphs (QuWeDa) at ISWC 2022, virtual. * Corresponding author. cstadler@informatik.uni-leipzig.de (C. Stadler); saleem@mail.uni-paderborn.de (M. Saleem); axel.ngonga@upb.de (A. N. Ngomo). https://aksw.org/ClausStadler (C. Stadler). ORCID 0000-0001-9948-6458 (C. Stadler). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Query logs allow us to bridge the theory and practice of SPARQL [1]. These query logs ensure that the research conducted by the community is guided by the requirements and trends that emerge in practice. Real-world SPARQL queries collected from public SPARQL endpoints have multiple use cases, such as performance evaluation in real-world settings, improving the caching of triplestores, analysis of SPARQL adoption, query optimization, and usability analysis [2]. The query logs produced by different SPARQL endpoints use different formats to syntactically represent records. In this work, we consider the Web Log Format and CSV. Within each format, the set of available fields – the schemas – varies. For example, the formats used by Wikidata, Bio2RDF, and Virtuoso instances in the default configuration are different schema variants of the Web Log Format [2, 3].
Furthermore, in order to utilize real-world SPARQL queries in the aforementioned use cases, it is required to parse these queries and annotate them with various structural and data-driven features, such as the number of triple patterns, the list of projection variables used, the different types of joins used, the result sizes, etc. Finally, annotating SPARQL queries in existing query logs requires a scalable processing engine. For example, the DBpedia public SPARQL endpoint receives more than 100k queries every day. Similarly, the Wikidata public endpoint receives thousands of queries on a daily basis. To the best of our knowledge, there exists no generic and scalable software framework that parses these query logs to extract SPARQL queries, annotates them with different information, and converts them into an RDF dataset. To fill this research gap, we present the LSQ framework, which converts SPARQL query logs into RDF datasets and attaches various structural and data-driven features to each SPARQL query. In order to perform scalable RDF conversion, we make use of the Apache Spark Big Data framework. By default, the framework supports query logs in nine different formats. Support for other log formats can easily be added by specifying log patterns in the configuration file. This framework has already been used to convert the query logs of 27 public SPARQL endpoints, such as DBpedia, Wikidata, Bio2RDF, and Semantic Web Dog Food. The resulting RDF datasets are named LSQ V2.0 [2]. LSQ V2.0 represents 43.95 million executions of 11.56 million unique SPARQL queries, resulting in 1.24 billion triples.
The LSQ queries have been used in many use cases, such as benchmarking based on real-world SPARQL queries [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], analysis of SPARQL adoption in different applications [16, 17, 18], improving caching strategies for SPARQL engines [19, 20, 21, 22, 23], usability analysis of SPARQL [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], and SPARQL query optimization [36, 37, 38, 39]. Since the initial release of LSQ V1.0 [3], the datasets converted by our framework have been used in more than 50 research papers [2]. The scalable components developed for the LSQ framework have also been contributed to the recent Apache Jena 4.6.0 release, with additional improvements pending1. The rest of the paper is organized as follows. In Section 2, we present the LSQ framework. Section 3 discusses various use cases where the datasets produced by our framework have already been used. In Section 4, we present the usability instructions for this framework, followed by the availability information and sustainability plans in Section 5. Finally, we conclude in Section 6.

2. The LSQ Framework

In this section, we first briefly explain the core architecture. The details of each component are explained later.

1 For details please see: https://github.com/apache/jena/pull/712, https://github.com/apache/jena/pull/1475, https://github.com/apache/jena/pull/1394, https://github.com/apache/jena/pull/1390, https://github.com/apache/jena/issues/1470, https://issues.apache.org/jira/projects/JENA/issues/JENA-2309

Figure 1: LSQ Framework Architecture. The underlying Semantic Web framework is Apache Jena.

Figure 1 shows the architecture of LSQ, which is briefly summarized as follows. Records of query logs in different formats and with different schemas are parsed and normalized as RDF.
The RDF log is inverted such that every query is related to all log records that mention it, i.e., every query is assigned a global (w.r.t. all query logs from different endpoints) identifier. In this process, queries are represented by hashes computed from their normalized strings. Static enrichment extracts the static features relevant to a SPARQL query, i.e., those that can be derived directly from the query without running it against a specific RDF dataset. It extends the RDF model with a SPIN representation2 and further static structural features, such as the number of projection variables, the number of triple patterns, join types and selectivities, the used syntactic elements (e.g. UNION, OPTIONAL), etc. Dynamic enrichment extracts data-driven SPARQL features, such as query runtime, result size, or triple pattern selectivities. Whereas static enrichment is independent of any dataset, dynamic enrichment needs a dataset as input. Benchmarking is a form of dynamic enrichment that extends a query's RDF model w.r.t. a dataset with result set sizes and execution times. In the following, we present these steps in detail and describe how LSQ inspired the creation of certain components that we contributed to other frameworks in order to facilitate reuse.

2.1. Named Graph Stream Processing

A fundamental design principle that is followed throughout LSQ is the entity graph paradigm, also referred to as (named-)graph-per-entity: An entity is represented by an IRI that appears in (the subject position of) a triple in a named graph with the same IRI, such as GRAPH :anEntity { :anEntity :p :o }. Some Semantic Web tools (such as Virtuoso and Apache Jena) already support the use of named graphs in CONSTRUCT queries due to popular demand3, although this feature is not part of the current SPARQL specification v1.1.

2 SPARQL SPIN representation: https://www.w3.org/Submission/spin-sparql/
3 https://github.com/w3c/sparql-12/issues/31
LSQ uses the entity graph approach for representing log records and SPARQL queries. The advantages of this approach are:
• The graph name acts as a natural entry point for starting exploration/traversal of the contained data. This is a convention that applications (such as viewers) can support without having to rely on a specific (ad-hoc) vocabulary.
• Retrieval and removal of all triples related to an entity is a simple operation on the named graph. This is of particular importance as certain models (e.g. SPIN) allow for arbitrarily deeply nested tree structures in RDF which are extremely hard to query without scoping by a named graph.
• Partitioning the data into self-contained named graphs is well aligned with the map-reduce paradigm: Operations on individual entities, such as enrichment operations, can be carried out naturally in parallel over a set of entity graphs using conventional Big Data frameworks. We created the necessary Hadoop-based parsers as part of LSQ and contributed them to the SANSA stack (see Section 2.7).
• An operation on an individual entity typically only requires its graph to be loaded rather than requiring access to all triples across all entities, which allows for stream processing with low memory usage.

2.2. Parsing Query Logs

Generally, LSQ parses query logs and transforms them into a set of named graphs, each of which represents an individual log record. This means that every log file is harmonized to a common RDF model. Query logs come in different formats, and within a format there can be many different schemas. By schema we refer to the set of available attributes per log record. Manually determining the format and schema of a log is a very tedious task. For this reason, LSQ features a log format registry against which log files can be probed. Currently, two types of log formats are supported: Web access log formats that are compatible with the Apache HTTP server's mod_log_config5, and CSV.
For the former, LSQ provides a custom mapping of log fields to an RDF model; for the latter, a tarql-based6 approach is supported where columns of the input file are mapped to RDF using a SPARQL CONSTRUCT query. Support for additional mapping languages, such as RML [40] or YARRRML [41], is future work. Note that, for the task of RDFizing a single CSV file, the practical difference between the approaches lies in syntax rather than functionality. An excerpt of LSQ's log format registry configuration is shown in Listing 1. So far, the default registry comprises 10 format/schema combinations.

2.3. Accessing RDF Graphs via Object Models

The LSQ data model (see Figure 2) is sufficiently complex that manipulation solely with SPARQL turned out to be infeasible. One of the main reasons is the redundancy of complex graph patterns: many patterns have to be repeated over and over again in different queries in order to address the nested resources. Especially during development, any change in the T-box requires a change in several places, and debugging SPARQL queries that yield empty result sets is a tedious process.

4 https://www.w3.org/TR/sparql11-query/
5 https://httpd.apache.org/docs/current/mod/mod_log_config.html
6 https://github.com/tarql/tarql

Listing 1: RDF-based log format registry used in LSQ

fmt:combined
  a lsq:WebAccessLogFormat ;
  lsq:pattern "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" .

fmt:bio2rdfProcessedCsv
  a lsq:CsvLogFormat ;
  lsq:pattern
    """
    PREFIX lsq:
    PREFIX prov:
    PREFIX xsd:
    CONSTRUCT {
      GRAPH ?s {
        ?s lsq:query ?query ;
           lsq:host ?domain ;
           lsq:headers [ ?agent ] ;
           prov:atTime ?t
      }
    } {
      BIND(IRI(CONCAT('urn:lsq:', MD5(CONCAT(?query, '-', ?domain, '-', ?timestamp)))) AS ?s)
      BIND(STRDT(?timestamp, xsd:dateTime) AS ?t)
    }
    """ .
Furthermore, deterministic skolemization of the model is an important aspect: We strictly want to avoid effects such as a merge of a query's RDF resulting in a doubling of the number of related BGPs, as would be the case if blank nodes were used. Therefore, we need a way to express how to compute the identity of entities, such as: the id of a triple pattern depends on the ids of its subject, predicate and object components; the id of a BGP depends on the ids of the contained triple patterns (in order); and the id of a specific triple pattern in a BGP depends on that of the BGP and its own. For this purpose, LSQ features a domain model realized as Java interfaces which must extend Jena's Resource interface. The essence is that the auto-generated implementations for this domain model provide a view over the RDF graph; all state is kept in the RDF graph, and any invocation of a "setter" method directly mutates the RDF graph, whereas a "getter" method reads from it. Annotations on these interfaces are used to describe how getters and setters relate to triples in the underlying RDF graph. We use Reprogen (Resource Proxy Generator)7 to generate Java proxies that implement the intended behavior of the annotations. In order to realize skolemization, we extended Reprogen with new @HashId and @StringId annotations. @HashIds are a way to control how to compute hash codes for an RDF term in an RDF graph w.r.t. the annotated interface. The hash code of a resource is obtained by recursively considering the hash codes of all its member methods annotated with @HashId. The @StringId annotation is used for methods that can convert hash codes to strings suitable for use in IRIs. Listing 2 shows how to programmatically generate RDF for an annotated class, whereas Listing 3 shows an example of how to annotate a class.

7 https://github.com/Scaseco/jenax/tree/develop/jenax-reprogen-parent/jenax-reprogen-core

Figure 2: LSQ2 Data Model
Listing 2: Example for setting up and skolemizing a basic LSQ RDF model

Model model = ModelFactory.createDefaultModel();
LsqTriplePattern tp = model.createResource(LsqTriplePattern.class);
TpInBgp tpInBgp = model.createResource(TpInBgp.class);
tp.setJenaTriple(new Triple());

Resource root = Reprogen.skolemize(tpInBgp, "http://lsq.aksw.org");
RDFDataMgr.write(System.out, root.getModel(), RDFFormat.TURTLE_PRETTY);

2.4. Query Id Generation

An important requirement in LSQ is determinism: Syntactically equivalent queries and their elements should always be represented by the same IRIs. The goal is to allow for the integration of different datasets generated by the same version of LSQ simply by means of merging the RDF. Generally, query strings are normalized using a parse/serialize round trip. Query strings themselves are not suitable as identifiers because they can become very lengthy. The obvious choice is thus to resort to hashing. LSQ v1 used 8 characters of MD5 hashes; however, it turned out that the number of queries was sufficiently large that collisions occurred. LSQ v2.0 uses sha256 with base64 encoding of the whole query string. The disadvantage was that queries that only differed in projection or slice received unrelated hashes. As a consequence, identifying queries that only differ in their 'basic parameterization' was not possible from the hashes.
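The v2.0 identifier scheme described above can be sketched in plain Java. This is an illustrative re-implementation, not LSQ's actual code; it assumes URL-safe, unpadded base64, which matches the 43-character hashes visible in the LSQ data.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class QueryHash {
    /**
     * sha256 of a normalized query string, encoded as URL-safe base64
     * without padding; the 32-byte digest yields a 43-character id.
     */
    public static String queryHash(String normalizedQuery) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalizedQuery.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in the JDK
        }
    }
}
```

Because the hash covers the full normalized string, any difference in projection or slice produces an unrelated id, which is exactly the limitation that v2.1 addresses.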
Listing 3: Example of LSQ's annotated Java domain model for which Reprogen generates proxy implementations that can read and write to an RDF graph

/* The identity of a bgp depends on the list of identities of the contained
 * triple patterns */
public interface Bgp
    extends Resource
{
    @HashId
    @Iri(LsqTerms.hasTp)
    List<LsqTriplePattern> getTriplePatterns();
}

/* The identity of a "triple pattern in a bgp" depends on the identity of
 * the bgp and the identity of the triple pattern */
public interface TpInBgp
    extends Resource
{
    @HashId
    @Iri(LsqTerms.hasBgp)
    SpinBgp getBgp();
    TpInBgp setBgp(Resource bgp);

    @HashId
    @Iri(LsqTerms.hasTp)
    LsqTriplePattern getTriplePattern();
    TpInBgp setTriplePattern(Resource tp);
}

LSQ v2.1 introduces additional structure: The body hash of a query is obtained by replacing the projection with 'SELECT *' and applying sha256+base64 hashing to the normalized query string. A separate hash is computed from the projection: The actual projection expression strings are first sorted lexicographically, and a projection-base-hash is computed from these ordered strings. The actual projection is then a permutation of the sorted one, and numbering schemes exist to label permutations: We use the Lehmer code to obtain a number for the actual projection. For a given sequence of 𝑛 items, the Lehmer code is 0 when that sequence is sorted, and the code reaches its maximum value 𝑛! − 1 when the sequence is reverse-sorted. Finally, the slice is appended, which leads to the new, improved pattern bodyHash/projBaseHash/lehmerCode[/offset[-limit]].
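The Lehmer-code numbering of the projection can be sketched as follows. This is a minimal stdlib illustration (assuming distinct projection expressions), not LSQ's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProjectionLehmer {
    /**
     * Lehmer code of a sequence relative to its lexicographically sorted order:
     * 0 for the sorted sequence and n! - 1 for the reverse-sorted one.
     * The digits are accumulated in the factorial number system.
     */
    public static long lehmerCode(String[] actual) {
        String[] sorted = actual.clone();
        Arrays.sort(sorted);
        List<String> remaining = new ArrayList<>(Arrays.asList(sorted));
        long code = 0;
        for (String expr : actual) {
            int idx = remaining.indexOf(expr); // rank among not-yet-consumed items
            code = code * remaining.size() + idx;
            remaining.remove(idx);
        }
        return code;
    }
}
```

Together with the body hash and the projection-base hash, this number pins down the exact projection order while still relating all permutations of the same projection to one another.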
The URL of the endpoint from which a query log file originated needs to be provided manually. The endpoint given for RDFization is purely informational; no requests will be made to it. In principle, if it were known which dataset was available at an endpoint at a certain point in time, then this information could be used to link to a dataset identifier. Furthermore, ideally a dataset identifier can be linked to a download URL of exactly the set of RDF triples (or quads) that were present at that endpoint. Although this background knowledge is currently usually not systematically available, we envision this situation to improve with advancements in data catalog and service modeling, such as with DCAT8. An example of a command line invocation and its corresponding output is shown in Listing 4 and Listing 5.

Listing 4: Command for rdfizing a SPARQL query log

lsq rx rdfize --endpoint=http://dbpedia.org/sparql virtuoso.dbpedia.log

Listing 5: The output of the rdfization is one named graph per query

:lsqQuery-X {
  :lsqQuery-X
    lsq:text "SELECT * { ?s ?p ?o }" ;
    lsq:hasRemoteExec :remoteExec-org-dbpedia-sparql_2016-04-10T01:00:00Z .

  :remoteExec-org-dbpedia-sparql_2016-04-10T01:00:00Z
    lsq:endpoint ;
    prov:atTime "2016-04-10T01:00:00Z"^^xsd:dateTime .
}
:lsqQuery-Y { ... }

Note that the output is query-centric, which means that the input sequence of records is sorted by the query hash. The rx variant uses Linux sorting, which is not portable, whereas the spark variant provides a portable Java-native solution.

2.6. Benchmarking

LSQ uses RDF to keep benchmark settings in order to relate benchmark results to the context under which they were obtained. For this reason, benchmarking is a three-step process: (1) Create a benchmark configuration. This is an RDF document which contains an IRI that carries the settings. The IRI is described with the date when the configuration was created and a custom label.
The latter is useful for organization, because benchmarking is often performed against SPARQL endpoints under a localhost URL from which no expressive name can be derived. Benchmark creation also caches the number of triples in the endpoint with the configuration. This number is used to compute certain ratios, such as the ratio of triples matched by a single triple pattern w.r.t. the total size of the RDF graph. As a consequence, a new configuration should be created whenever the settings or the data change. (2) Prepare a benchmark run: This is another small RDF document which only introduces an IRI with two pieces of information: the IRI of the configuration and the timestamp of when the run was prepared. (3) Execute the benchmark. Every benchmark task will be linked to the IRI of the benchmark run, which in turn links to the used configuration. Listing 6 demonstrates the command line invocations for setting up and running a benchmark, whereas an example configuration is shown in Listing 7.
8 https://www.w3.org/TR/vocab-dcat-2/

Listing 6: Example benchmark setup and execution

lsq benchmark create --endpoint http://localhost:8080/sparql --dataset dbpedia
# Assumed output: xc-dbpedia_2021-10-22.conf.ttl
lsq benchmark prepare -c xc-dbpedia_2021-10-22.conf.ttl
# Assumed output: xc-dbpedia_2021-10-22_2021-10-22T12_26_40_828056Z.run.ttl
lsq benchmark run -c xc-dbpedia_2021-10-22_2021-10-22T12_26_40_828056Z.run.ttl virtuoso.dbpedia.trig

Listing 7: Excerpt of benchmark configuration options

lsqr:xc-dbpedia_2021-10-22
  dct:identifier "xc-dbpedia_2021-10-22" ;
  lsqo:connectionTimeoutForRetrieval "60"^^xsd:decimal ;
  lsqo:executionTimeoutForRetrieval "300"^^xsd:decimal ;
  lsqo:maxResultCountForRetrieval "1000000"^^xsd:long ;
  lsqo:maxByteSizeForRetrieval "-1"^^xsd:long ;
  lsqo:maxResultCountForSerialization "-1"^^xsd:long ;
  lsqo:maxByteSizeForSerialization "1000000"^^xsd:long ;
  lsqo:executionTimeoutForCounting "300"^^xsd:decimal ;
  lsqo:connectionTimeoutForCounting "60"^^xsd:decimal ;
  lsqo:benchmarkSecondaryQueries true .

2.6.1. Benchmark Workflow

When benchmarking a query, an IRI is allocated based on the benchmark run id and the query id. Within an individual benchmark run, a query can only be executed once. A database (TDB2) is used to keep track of query hashes that have already been benchmarked. LSQ first attempts to benchmark and retrieve the result set of a query. If retrieval succeeds (and no thresholds are exceeded), then the retrieved result set is serialized as a literal in the output RDF dataset, as long as the serialization thresholds are adhered to. If the retrieval fails due to a threshold violation (in contrast to, e.g., a syntactic error), then an alternate strategy is employed which attempts to count the result set size of the query by wrapping it as SELECT (COUNT(*) AS ?c) { ... }. The benchmark result contains triples for any exceeded thresholds.

Benchmarking Secondary Queries. Primary queries are those originating from a query log.
Secondary queries are those derived from the syntactic elements of primary ones. The most prominent elements are basic graph patterns and individual triple patterns. A secondary query is benchmarked just like a primary one. The rule that a query will only be benchmarked once within a run thus also applies. An example RDF representation of a SPARQL query generated by the LSQ framework is shown in Listing 8. We encourage readers to have a look at the LSQ V2.0 paper [2] for further details of the RDF representation.

Listing 8: An example LSQ/RDF representation of a SPARQL query in Turtle syntax [2]

@prefix rdf: .
@prefix lsqr: .
@prefix lsqv: .
@prefix rdfs: .
@prefix swc: .
@prefix swr: .
@prefix xsd: .
@prefix prov: .

# Primary resource describing the query found with the SWDF logs
lsqr:lsqQuery-3wBd2uKotB_-vUxnngs6ZNsGPhJmIDD9c7ig0UI24y8
  lsqv:hasLocalExec lsqr:localExec-v9fBp3ElS1aVXXN1Z8zX1jxcHX3iy-axTgRrU2c7NY8 ;
  lsqv:hasRemoteExec lsqr:re-data.semanticweb.org-sparql_2014-05-22T16:08:17Z ,
    lsqr:re-data.semanticweb.org-sparql_2014-05-20T13:24:13Z ;
  lsqv:hasStructuralFeatures lsqr:lsqQuery-3wBd2uKotB_-vUxnngs6ZNsGPhJmIDD9c7ig0UI24y8-sf ;
  lsqv:hash "3wBd2uKotB_-vUxnngs6ZNsGPhJmIDD9c7ig0UI24y8" ;
  lsqv:text """PREFIX rdf:
PREFIX swc:
SELECT DISTINCT ?prop
WHERE { ?obj rdf:type swc:SessionEvent ; ?prop ?targetObj FILTER isLiteral(?targetObj) }
LIMIT 150""" .
# Static features of the query
lsqr:lsqQuery-3wBd2uKotB_-vUxnngs6ZNsGPhJmIDD9c7ig0UI24y8-sf
  lsqv:bgpCount 1 ;
  lsqv:hasBgp lsqr:bgp-_x9Mckke-V9R3ddISuw-Nj_j278nT5HwiA1WUNk7tgY ;
  lsqv:joinVertexCount 1 ;
  lsqv:joinVertexDegreeMean 2 ;
  lsqv:joinVertexDegreeMedian 2 ;
  lsqv:projectVarCount 1 ;
  lsqv:tpCount 2 ;
  lsqv:tpInBgpCountMax 2 ;
  lsqv:tpInBgpCountMean 2 ;
  lsqv:tpInBgpCountMedian 2 ;
  lsqv:tpInBgpCountMin 2 ;
  lsqv:usesFeature lsqv:fn-isLiteral , lsqv:Select , lsqv:Limit , lsqv:Functions , lsqv:Group , lsqv:Filter ,
    lsqv:Distinct , lsqv:TriplePattern .

# Remote execution no. 1 on the original endpoint
lsqr:re-data.semanticweb.org-sparql_2014-05-22T16:08:17Z
  prov:atTime "2014-05-22T16:08:17Z"^^xsd:dateTime ;
  lsqv:endpoint swr:sparql ;
  lsqv:hostHash "O5UQpDtofxAsrJk7yzGfDolFGylMFw5446KcRZDcBkU" .

# Remote execution no. 2 on the original endpoint
lsqr:re-data.semanticweb.org-sparql_2014-05-20T13:24:13Z
  prov:atTime "2014-05-20T13:24:13Z"^^xsd:dateTime ;
  lsqv:endpoint swr:sparql ;
  lsqv:hostHash "7aPNvqsgizRuEjH7_cO_dXoqLk-exKJ-xFmbCH3ew_E" .

# Local execution to extract statistics
lsqr:localExec-v9fBp3ElS1aVXXN1Z8zX1jxcHX3iy-axTgRrU2c7NY8-xc
  lsqv:benchmarkRun lsqr:xc-swdf_2020-09-23_at_23-09-2020_17:10:19 ;
  lsqv:hasQueryExec lsqr:queryExec-Cmv7SccybbBxwkep_cHvDiF3piq29tH7NWlDfIiCHqU .

# Results of local execution
lsqr:queryExec-Cmv7SccybbBxwkep_cHvDiF3piq29tH7NWlDfIiCHqU
  prov:atTime "2020-09-23T15:27:36.325Z"^^xsd:dateTime ;
  lsqv:countingDuration 0.008466651 ;
  lsqv:evalDuration 0.008868635 ;
  lsqv:resultCount 16 .

# The full data further include a SPIN description of the query, a list of BGPs within the query,
# a list of triple patterns and terms within the query, as well as execution statistics for individual
# BGPs, triple patterns and sub-BGPs induced by join variables

2.7.
Scaling LSQ with SANSA

Processing large log files with only a single core is tedious and anachronistic in times of Big Data and laptops having more than a dozen cores. Apache Spark is a framework that enables scaling computing tasks to use all available resources on a cluster – even if the "cluster" only comprises a single machine. Apache Spark features high-level abstractions for the distributed execution of operations on different types of distributed collections of records. Resilient Distributed Datasets (RDDs) are the ones used by LSQ/Spark. However, the low-level I/O for reading records from files (regardless of whether they reside in local or distributed file systems) is provided by Apache Hadoop. The now retired Apache Jena/Elephas9 project provided distributed ingestion of RDF data by wiring Jena's own I/O library (called RIOT) up with Apache Hadoop. However, while Elephas supported N-Quads, this format is both much more verbose and harder to read than pretty-printed TriG. Conversely, manually reviewing rather sophisticated LSQ models in TriG format was significantly easier; however, Elephas could not read TriG in splits, so the use of this format negated the benefits of the Big Data framework. In order to optimize processing, we created an initial distributed parser for TriG that searched Hadoop input splits by matching the { ... } pattern. This framework was continuously extended to handle RDF prefixes and even RDF data in literals. By now, this parsing framework has evolved into the Hadoop Generic Parser Framework (HGPF) and provides support for TriG (a superset of Turtle, N-Quads and N-Triples), JSON and CSV. Listing 9 shows the contribution made to SANSA in order to enable processing of large TriG files (of which N-Quads is a special case) with LSQ. The parser has been successfully used to ingest and sort 500 GB of LSQ TriG data in about 5 hours on a three-node Spark cluster.
Listing 9: Parsing named graph streams

RdfSourceFactory rdfSourceFactory = RdfSourceFactoryImpl.from(sparkSession);
RdfSource rdfSource = rdfSourceFactory.get("query-log.lsq.trig");
RDD<Quad> rddOfQuads = rdfSource.asQuads();
RDD<Dataset> rddOfDatasets = rdfSource.asDatasets();

The HGPF framework has also been used to realize a CSV parser that can handle multi-line cell values. The limitation is that, when searching for candidate record offsets in a split, the maximum size of multi-line cells needs to be configured in advance. The default value is 500KB, and the candidate record offset detector always has to exhaust this amount of data in order not to miss any cell endings. The CSV settings are based on the Frictionless Data CSV dialect, which slightly extends over the CSV on the Web (CSVW)10 specification. An example of programmatic usage is shown in Listing 10.

Listing 10: Example of tarql-style CSV processing with Spark

JavaRDD<Binding> rddOfBindings = CsvDataSources.createRddOfBindings(sparkContext, "data.csv", csvDialect);
Query query = QueryFactory.create("CONSTRUCT ...");
JavaRDD<Quad> rddOfQuads = JavaRddOfBindingsOps.tarqlQuads(rddOfBindings, query);

9 https://jena.apache.org/documentation/archive/hadoop/
10 https://www.w3.org/TR/tabular-data-model/

3. Impact

To the best of our knowledge, the framework we propose in this paper is the first generic framework for RDFizing SPARQL query logs in different formats. Furthermore, it does so in a scalable way by following Big Data processing paradigms. Previously, there was no mechanism to reuse existing query logs in a single standard format with enriched information attached to each query. Our framework is built entirely on Semantic Web technologies. It has been used to convert the query logs of 27 public SPARQL endpoints, resulting in terabytes of RDF data.
In the recent LSQ v2.0 paper [2], six potential use cases – custom benchmarking, SPARQL adoption, caching, usability analysis, query optimisation, and meta-querying – have been discussed. The RDF datasets generated using the LSQ framework have been used widely for these use cases [2]. The study [2] reported that 29 research papers have used LSQ queries for custom benchmarking, six research papers for SPARQL adoption, five research papers for caching, 12 research papers for SPARQL usability analysis, seven research papers for query optimisation, and two papers for meta-querying. Furthermore, [2] discussed a number of works which have used LSQ (mostly for evaluation) in contexts that were not originally anticipated by the aforementioned use cases. These works include predicting temporal relations between events [42], augmenting RDF data sources with completeness statements [43], finding the frequency and distribution of answerable and non-answerable query patterns [43], question answering over linked data [44], and a blockchain that allows users to propose updates to faulty or outdated data [45].

4. Reusability

We are hopeful that the LSQ framework will be used by more researchers to convert existing query logs and create LSQ datasets. The resource home page includes documentation along with examples for easy reusability. The components developed in the LSQ framework are very generic. A dockerized version of LSQ is also available from the resource homepage. The LSQ framework can also be adapted to other log formats by providing the log pattern, e.g., a CSV or other specific text pattern. Custom enrichment is also possible, i.e., additional query features can be attached to each query. The resource homepage includes a wiki explaining how others can use the framework, along with examples and CLI instructions. So far, we have not tested the framework with XML or JSON logs; however, it should work provided that the correct log pattern is specified in the configuration file.

5.
Availability and Sustainability

The resource is available from the persistent URL http://w3id.org/lsq. LSQ v2.0 [2] is the canonical citation associated with this resource. Our framework and the LSQ datasets are available under the GNU General Public License v3.0. The source code is publicly available via GitHub. All future extensions will be reflected at the same persistent URL. In addition, this framework will be sustained via the Paderborn Center for Parallel Computing (PC2), which provides computing resources as well as consulting regarding their usage to research projects at Paderborn University and also to external research groups. The Information and Media Technologies Centre (IMT) at Paderborn University also provides a permanent IT infrastructure to host the LSQ project.

6. Conclusion

In this paper, we have presented the LSQ framework, a scalable engine for representing the queries in logs as RDF, allowing users to perform their analyses on real-world SPARQL queries. We discussed the core architecture of this software framework along with its potential impact and usability instructions. We briefly discussed various use cases in which RDF datasets produced by our framework have already been used. In the future, we want to extend this framework to provide further annotations, e.g., annotating the named entities used in SPARQL queries and disambiguating them against well-known datasets such as Wikidata and DBpedia. We also aim to collect further query logs from public SPARQL endpoints and provide them as RDF datasets.

Resource Availability Statement:

• Source code of the LSQ framework is available from https://github.com/AKSW/LSQ
• Installation instructions are available from http://lsq.aksw.org/v2/setup.html
• Usage instructions are available from http://lsq.aksw.org/v2/usage/usage.html
• RDF dumps of the LSQ v2.0 datasets are available from https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/
• The LSQ v2.0 public SPARQL endpoint is available from http://lsq.aksw.org/sparql
• A set of useful SPARQL queries over LSQ datasets is available from http://lsq.aksw.org/v2/usage/usage.html
• The resource type is Software Framework, available under the GNU General Public License v3.0
• All of the above information, along with legacy data and old LSQ maintenance information, is available from the resource persistent URL http://w3id.org/lsq
• LSQ v2.0 [2] is the canonical citation associated with this paper

Acknowledgments

The authors acknowledge the financial support for 3DFed (Grant no. 01QE2114B), KnowGraphs (Grant no. 860801), and by the Federal Ministry for Economics and Climate Action in the project CoyPu (project number 01MK21007A).

References

[1] W. Martens, T. Trautner, Bridging Theory and Practice with Query Log Analysis, SIGMOD Record 48 (2019) 6–13.
[2] C. Stadler, M. Saleem, Q. Mehmood, C. Buil-Aranda, M. Dumontier, A. Hogan, A.-C. N. Ngomo, LSQ 2.0: A Linked Dataset of SPARQL Query Logs, Semantic Web Journal, 2022.
[3] M. Saleem, M. I. Ali, A. Hogan, Q. Mehmood, A. N. Ngomo, LSQ: The Linked SPARQL Queries Dataset, in: International Semantic Web Conference (ISWC), Springer, 2015, pp. 261–269.
[4] M. Saleem, Q. Mehmood, A. N. Ngomo, FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework, in: International Semantic Web Conference (ISWC), Springer, 2015, pp. 52–69.
[5] M. Saleem, Q. Mehmood, C. Stadler, J. Lehmann, A. N. Ngomo, Generating SPARQL Query Containment Benchmarks Using the SQCFramework, in: ISWC Posters & Demos, CEUR-WS.org, 2018.
[6] M. Saleem, A. Hasnain, A.-C. N. Ngomo, LargeRDFBench: A billion triples benchmark for SPARQL endpoint federation, Journal of Web Semantics 48 (2018) 85–125.
[7] M. Saleem, G. Szárnyas, F. Conrads, S. A. C. Bukhari, Q. Mehmood, A. N. Ngomo, How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benchmarks, in: World Wide Web Conference (WWW), ACM, 2019, pp.
1623–1633.
[8] D. Hernández, A. Hogan, C. Riveros, C. Rojas, E. Zerega, Querying Wikidata: Comparing SPARQL, Relational and Graph Databases, in: International Semantic Web Conference (ISWC), Springer, 2016, pp. 88–103.
[9] J. D. Fernández, J. Umbrich, A. Polleres, M. Knuth, Evaluating query and storage strategies for RDF archives, Semantic Web 10 (2019) 247–291.
[10] A. Azzam, J. D. Fernández, M. Acosta, M. Beno, A. Polleres, SMART-KG: Hybrid Shipping for SPARQL Querying on the Web, in: The Web Conference (WWW), 2020, pp. 984–994.
[11] A. Bigerl, F. Conrads, C. Behning, M. A. Sherif, M. Saleem, A.-C. Ngonga Ngomo, Tentris – A Tensor-Based Triple Store, in: International Semantic Web Conference (ISWC), Springer, 2020, pp. 56–73.
[12] A. Azzam, C. Aebeloe, G. Montoya, I. Keles, A. Polleres, K. Hose, WiseKG: Balanced Access to Web Knowledge Graphs, in: The Web Conference (WWW), ACM / IW3C2, 2021, pp. 1422–1434. doi:10.1145/3442381.3449911.
[13] A. Davoudian, L. Chen, H. Tu, M. Liu, A Workload-Adaptive Streaming Partitioner for Distributed Graph Stores, Data Science and Engineering 6 (2021) 163–179.
[14] A. A. Desouki, F. Conrads, M. Röder, A.-C. N. Ngomo, SYNTHG: Mimicking RDF Graphs Using Tensor Factorization, in: International Conference on Semantic Computing (ICSC), 2021, pp. 76–79. doi:10.1109/ICSC50631.2021.00017.
[15] M. Röder, P. T. S. Nguyen, F. Conrads, A. A. M. da Silva, A.-C. N. Ngomo, Lemming – Example-based Mimicking of Knowledge Graphs, in: International Conference on Semantic Computing (ICSC), 2021, pp. 62–69. doi:10.1109/ICSC50631.2021.00015.
[16] X. Han, Z. Feng, X. Zhang, X. Wang, G. Rao, S. Jiang, On the statistical analysis of practical SPARQL queries, in: International Workshop on Web and Databases (WebDB), ACM, 2016, p. 2.
[17] A. Bonifati, W. Martens, T. Timm, An Analytical Study of Large SPARQL Query Logs, PVLDB 11 (2017) 149–161. URL: http://www.vldb.org/pvldb/vol11/p149-bonifati.pdf.
doi:10.14778/3149193.3149196.
[18] A. Bonifati, W. Martens, T. Timm, DARQL: Deep Analysis of SPARQL Queries, in: WWW Posters & Demos, ACM, 2018, pp. 187–190.
[19] M. Knuth, O. Hartig, H. Sack, Scheduling refresh queries for keeping results from a SPARQL endpoint up-to-date, in: On The Move to Meaningful Internet Systems (OTM), Springer, 2016, pp. 780–791.
[20] U. Akhtar, M. A. Razzaq, U. U. Rehman, M. B. Amin, W. A. Khan, E.-N. Huh, S. Lee, Change-Aware Scheduling for Effectively Updating Linked Open Data Caches, IEEE Access 6 (2018) 65862–65873.
[21] U. Akhtar, A. Sant’Anna, S. Lee, A Dynamic, Cost-Aware, Optimized Maintenance Policy for Interactive Exploration of Linked Data, Applied Sciences 9 (2019) 4818.
[22] J. Salas, A. Hogan, Canonicalisation of Monotone SPARQL Queries, in: International Semantic Web Conference (ISWC), Springer, 2018, pp. 600–616.
[23] T. Safavi, C. Belth, L. Faber, D. Mottin, E. Müller, D. Koutra, Personalized knowledge graph summarization: From the cloud to your pocket, in: International Conference on Data Mining (ICDM), IEEE, 2019, pp. 528–537.
[24] M. Arenas, G. I. Diaz, E. V. Kostylev, Reverse engineering SPARQL queries, in: World Wide Web Conference (WWW), ACM, 2016, pp. 239–249.
[25] F. Benedetti, S. Bergamaschi, A model for visual building SPARQL queries, in: Symposium on Advanced Database Systems (SEBD), 2016, pp. 19–30.
[26] I. Dellal, S. Jean, A. Hadjali, B. Chardin, M. Baron, On addressing the empty answer problem in uncertain knowledge bases, in: International Conference on Database and Expert Systems Applications (DEXA), Springer, 2017, pp. 120–129.
[27] T. Stegemann, J. Ziegler, Investigating learnability, user performance, and preferences of the path query language SemwidgQL compared to SPARQL, in: International Semantic Web Conference (ISWC), Springer, 2017, pp. 611–627.
[28] A. Viswanathan, G. de Mel, J. A. Hendler, Feature-based reformulation of entities in triple pattern queries, CoRR abs/1807.01801 (2018).
URL: http://arxiv.org/abs/1807.01801.
[29] J. Potoniec, Learning SPARQL Queries from Expected Results, Computing and Informatics 38 (2019) 679–700.
[30] M. Wang, J. Liu, B. Wei, S. Yao, H. Zeng, L. Shi, Answering why-not questions on SPARQL queries, Knowledge and Information Systems (2019) 1–40.
[31] A. Bonifati, W. Martens, T. Timm, An analytical study of large SPARQL query logs, VLDB J. 29 (2020) 655–679. doi:10.1007/s00778-019-00558-9.
[32] X. Jian, Y. Wang, X. Lei, L. Zheng, L. Chen, SPARQL Rewriting: Towards Desired Results, in: SIGMOD International Conference on Management of Data, 2020, pp. 1979–1993.
[33] X. Zhang, M. Wang, M. Saleem, A.-C. N. Ngomo, G. Qi, H. Wang, Revealing Secrets in SPARQL Session Level, in: International Semantic Web Conference (ISWC), Springer, 2020, pp. 672–690.
[34] J. M. Almendros-Jiménez, A. Becerra-Terón, Discovery and diagnosis of wrong SPARQL queries with ontology and constraint reasoning, Expert Systems with Applications 165 (2021) 113772. URL: https://www.sciencedirect.com/science/article/pii/S0957417420305960. doi:10.1016/j.eswa.2020.113772.
[35] M. Wang, K. Chen, G. Xiao, X. Zhang, H. Chen, S. Wang, Explaining similarity for SPARQL queries, World Wide Web (2021) 1–23.
[36] Z. Song, Z. Feng, X. Zhang, X. Wang, G. Rao, Efficient approximation of well-designed SPARQL queries, in: International Conference on Web-Age Information Management (WAIM), Springer, 2016, pp. 315–327.
[37] W. Martens, T. Trautner, Evaluation and Enumeration Problems for Regular Path Queries, in: International Conference on Database Theory (ICDT), Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, 2018, pp. 19:1–19:21.
[38] S. Cheng, O. Hartig, OPT+: A Monotonic Alternative to OPTIONAL in SPARQL, Journal of Web Engineering 18 (2019) 169–206.
[39] D. Figueira, A. Godbole, S. N. Krishna, W. Martens, M. Niewerth, T.
Trautner, Containment of simple conjunctive regular path queries, in: International Conference on Principles of Knowledge Representation and Reasoning (KR), 2020, pp. 371–380. doi:10.24963/kr.2020/38.
[40] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data, in: LDOW, 2014.
[41] P. Heyvaert, B. De Meester, A. Dimou, R. Verborgh, Declarative Rules for Linked Data Generation at your Fingertips!, in: Proceedings of the 15th ESWC: Posters and Demos, 2018.
[42] K. Georgala, M. A. Sherif, A.-C. N. Ngomo, An efficient approach for the generation of Allen relations, in: European Conference on Artificial Intelligence (ECAI), IOS Press, 2016, pp. 948–956.
[43] P. Fafalios, Y. Tzitzikas, How many and what types of SPARQL queries can be answered through zero-knowledge link traversal?, in: ACM/SIGAPP Symposium on Applied Computing (SAC), ACM, 2019, pp. 2267–2274.
[44] K. Singh, M. Saleem, A. Nadgeri, F. Conrads, J. Z. Pan, A.-C. N. Ngomo, J. Lehmann, QaldGen: Towards Microbenchmarking of Question Answering Systems over Knowledge Graphs, in: International Semantic Web Conference (ISWC), Springer, 2019, pp. 277–292.
[45] C. Aebeloe, G. Montoya, K. Hose, ColChain: Collaborative Linked Data Networks, in: The Web Conference (WWW), ACM / IW3C2, 2021, pp. 1385–1396. doi:10.1145/3442381.3450037.