
Practical Linked Data Access via SPARQL: The Case of Wikidata

Adrian Bielefeldt

adrian.bielefeldt@tu-dresden.de

Julius Gonsior

julius.gonsior@tu-dresden.de

Markus Krötzsch

markus.kroetzsch@tu-dresden.de

cfaed, TU Dresden, Dresden, Germany

2018

SPARQL is one of the main APIs for accessing linked data collections. Compared to other modes of access, SPARQL queries carry much more information on the precise information need of users, and their analysis can therefore yield valuable insights into the practical usage of linked data sets. In this paper, we focus on Wikidata, the knowledge-graph sister of Wikipedia, which has offered linked data exports and a heavily used SPARQL endpoint since 2015. Our detailed analysis of Wikidata's server-side query logs reveals several important differences to previously studied uses of SPARQL over large knowledge graphs. Wikidata queries tend to be much more complex and varied than queries observed elsewhere. Our analysis is founded on a simple but effective separation of robotic from organic traffic. Whereas the robotic part is highly volatile and seems unpredictable even on larger time scales, the much smaller organic part shows clear trends in individual human usage. We analyse query features, structure, and content to gather further evidence that our approach is essential for obtaining meaningful results here.

1 INTRODUCTION

The SPARQL query language [ 7 ] is one of the most powerful and most widely used APIs for accessing linked data collections on the Web. Large-scale RDF publication efforts, such as DBpedia [ 2 ], routinely provide a SPARQL service, often with live data. Moreover, SPARQL has been an incentive for open data projects that are based on other formats to convert their data to RDF in order to improve query functionality, a route that was chosen by large-scale projects such as Bio2RDF [ 1 ] or the British Museum.¹ One of the most prominent such projects is Wikidata [ 17 ], the large² knowledge graph of Wikipedia, which has offered browsable linked data, RDF exports, and a live SPARQL service since September 2015.

Analysing the queries sent to SPARQL services promises unique insights into the practical usage of the underlying resources [ 12 ]. This opens the door to understanding computational demands [ 3, 9, 13 ], improving reliability and performance [ 10 ], and studying user behaviour [ 14, 15 ]. Data providers are, in addition, highly interested in learning how their content is used. This research is enabled by more and more datasets of SPARQL query logs becoming available [ 11, 16 ]. In some cases, including Wikidata, access to SPARQL logs remains strongly regulated, but research can help to safely publish useful data there as well (as we intend as part of our work).

¹ http://www.britishmuseum.org/about_us/news_and_press/press_releases/2011/semantic_web_endpoint.aspx (accessed January 2018)

² >45M entities, >400M statements, >200K editors (>37K in Jan 2018), >640M edits

LDOW’2018, April 2018, Lyon, France. © 2018 Copyright held by the owner/author(s).

Unfortunately, the initial enthusiasm in the study of practical SPARQL usage has been dampened by some severe difficulties. A well-known problem is that SPARQL services experience extremely heterogeneous traffic due to their widespread use by software tools [ 14, 15 ]. Indeed, a single user’s script can dominate query traffic for several hours, days, or weeks – and then vanish and never run again. Hopes that the impact of such extreme events would even out with wider usage have not been justified so far. Even when studying the history of a single dataset at larger time scales, we can often see no clear usage trends at all. For example, Bonifati et al. recently found that within the years 2012–2016 the keyword DISTINCT was used in 18%, 8%, 11%, 38%, and 8% of DBpedia queries, respectively [ 3 ].

As a result, the insights gathered by SPARQL log analysis so far have remained behind expectations. It seems almost impossible to generalise statistical findings, or to make any predictions for the next year (or even month). Building new optimisation methods or user interfaces based on such volatile findings seems hardly worth the effort. And yet, even recent research works rarely make any attempt to quantify or at least discuss the impact of (random) scripts on their findings. Exceptions are few: Raghuveer hypothesised that similarities in query patterns can be used to find bots, and provided basic analysis of bot requests isolated from larger logs [ 14 ]; Rietveld and Hoekstra used client-side SPARQL logs as a subset of true user queries that they compared to server-side logs [ 15 ].

In this work, we take a first look at the SPARQL usage logs of the official Wikidata query service, and we ask if and how relevant insights can be obtained from them. Our starting hypothesis is that SPARQL queries can be meaningfully partitioned into two classes: organic queries fetch data to satisfy an immediate information need of a human user, while robotic queries fetch data in an unsupervised fashion for further automated processing. We then classify queries accordingly based on user agent information, temporal distribution, and query patterns, and conduct further analysis on the results. We argue that the organic component of SPARQL query traffic can then be studied statistically, since it is relatively regular and since it can reveal the needs of many actual users. In contrast, the robotic component of query traffic should rather be subjected to a causal analysis that attempts to understand the sources of the traffic, so as to predict its current and future relevance to answering specific research questions.

Our main contributions are as follows: (1) We propose the concept of organic and robotic SPARQL traffic as a basic principle for query log analysis. (2) We present a method for classifying query logs accordingly, and we use it to partition a set of over 200M Wikidata SPARQL queries. Only 0.31% of the queries are organic, supporting our thesis that human information need is completely hidden by bots in most published analyses. (3) We evaluate our classification by analysing several aspects that we consider characteristic for organic and robotic queries, respectively. This supports our conjecture that we can effectively distinguish the two types of traffic. (4) We investigate the organic Wikidata traffic to gain basic insights into actual direct usage of Wikidata. (5) We discuss anonymisation aspects and potential privacy issues, which forms the basis for our ongoing efforts for allowing the Wikimedia Foundation to release essential parts of our datasets to the public.

Besides the concrete contributions towards understanding the use of Wikidata, we believe that our systematic study can help in advancing the research methodology in the wider field of analysing linked data access through rich query APIs. Indeed, due to the versatile use of SPARQL services – for manual and for automated requests, for transactional and analytical queries, interactively or in batch processes – the analysis of their usage requires suitable techniques and methods that are not sufficiently developed yet.

2 SPARQL ON WIKIDATA

Wikidata is the community-created knowledge base of the Wikimedia Foundation. It was founded in 2012 with the main goal of providing a central place for collecting factual data used in Wikipedia across all languages [ 17 ]. As of March 2018, Wikidata stores more than 402 million statements about over 45 million entities.³ The data is collaboratively curated by a global community, with over 18,000 registered editors making contributions each month. Wikidata is widely used in diverse applications, such as Apple’s Siri mobile app, Eurowings’ in-flight information system, and data integration initiatives such as the Virtual Integrated Authority File.

The data model of Wikidata is based on a directed, labelled graph where entities are connected by edges that are labelled by properties. Entities can have labels in many languages, but their actual identifiers are abstract: properties use identifiers such as P569 (“date of birth”), while other entities (called “items”) use identifiers such as Q42 (“Douglas Adams”). Both types of entities can be freely created by users. The model is distinct from RDF in that edges of the graph may in turn have annotations. This feature is used to record sources, temporal validity, or other contextual information. Indeed, annotations on edges use the same community-defined vocabulary as edge labels.

Since September 2015, Wikidata has provided an official SPARQL service (user interface at https://query.wikidata.org/) to query its data. For this purpose, data is first converted to RDF. Each edge is represented by a URI that can be associated with its annotations, following an encoding as laid out by Erxleben et al. [ 5 ]. In addition to this faithful encoding, the RDF export also includes simplified statements that only capture the actual edge as a single RDF triple, without any of its annotations. Different URIs are introduced for the different roles that properties can play in this encoding, so that all views of the data can be stored and queried in one database without risk of confusion. As of March 2018, the RDF encoding contains about 4.7 billion triples.⁴ Queryable data is updated at least once per minute to stay synchronised with updates.

³ See https://www.wikidata.org/wiki/Wikidata:Statistics and links therefrom

The Wikidata SPARQL service is based on the BlazeGraph database management system. SPARQL support is mostly standard, but includes some built-in operational extensions that are represented by (ab)using SPARQL’s SERVICE directive, which is normally used for federated queries to external services. Of chief practical importance is the labelling service, used to fetch optional entity labels with support for fallback languages. The widespread use of this service does affect the structure of queries, which rarely include the otherwise familiar OPTIONAL-FILTER pattern to select labels in the desired language.

Table 1 shows an example query that illustrates several aspects of Wikidata’s RDF encoding. The query returns the 100 countries that have the most cities with a female mayor. Within the query pattern, the first line finds a value for ?mayor with gender (P21) female (Q6581072). We are using the simplified property wdt:P21, whose triples cannot carry annotations or source information. The following line finds a ?city that is an instance of (P31) the class city (Q515), or of any subclass thereof (P279*). We then determine the country (P17) of this city. The fourth line of the pattern then uses the more complex RDF encoding to find a ?statement for property mayor (P6) and matches its value to ?mayor. We require that this statement has no end time (P582) to ensure that the mayor is current. Finally, the service wikibase:label is invoked to fetch labels in Russian, or, as an alternative, English. The query can readily be executed online.
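Since Table 1 itself is not reproduced here, the following Python sketch assembles a query that matches the prose description above. It is a reconstruction under assumptions, not the authors' exact query: the variable names ?country, ?count, and ?end, the aggregation and ordering clauses, and the helper name build_query are all chosen for illustration. The prefixes are the standard Wikidata ones.

```python
# Reconstruction of the query described in the text; the original Table 1
# may differ in detail (variable names and aggregation are assumptions).
PREFIXES = """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
"""


def build_query(languages=("ru", "en"), limit=100):
    """Return the SPARQL text; languages are tried in the given order."""
    lang = ",".join(languages)
    return PREFIXES + f"""
SELECT ?countryLabel (COUNT(DISTINCT ?city) AS ?count) WHERE {{
  ?mayor wdt:P21 wd:Q6581072 .                      # gender: female (simplified)
  ?city wdt:P31/wdt:P279* wd:Q515 .                 # instance of city or a subclass
  ?city wdt:P17 ?country .                          # country of the city
  ?city p:P6 ?statement . ?statement ps:P6 ?mayor . # complex statement encoding
  FILTER NOT EXISTS {{ ?statement pq:P582 ?end }}    # no end time: still current
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}" . }}
}}
GROUP BY ?countryLabel
ORDER BY DESC(?count)
LIMIT {limit}
"""


if __name__ == "__main__":
    print(build_query())
```

The printed text can be pasted into the query service's user interface. Note how the simplified property (wdt:) and the statement-level properties (p:, ps:, pq:) coexist in one pattern.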

Wikidata provides extensive documentation for using SPARQL,⁵ including a collection of over 300 example queries and pages to request help in writing new queries. The query service receives several million queries per day.

3 WIKIDATA SPARQL QUERY LOGS

We now give an overview of the datasets we are working with for this paper. Our data is based on the server-side request logs (Apache Access Log Files) of the Wikidata SPARQL query service, as exported from the internal logging infrastructure of Wikimedia. As logs contain sensitive information (especially IP addresses), this data is not publicly available, and all records are deleted after a period of three months. For this research, we therefore created less sensitive (but still internal) snapshots that contain only SPARQL queries, request times, and user agent information, but no IPs.

⁴ https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

⁵ https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/

We consider the complete query traffic in three consecutive intervals in 2017, each spanning exactly four weeks: (I1) 12th June–9th July, (I2) 10th July–6th August, and (I3) 7th August–3rd September. We process all queries with the Java module of Wikidata’s BlazeGraph instance,⁶ which is based on OpenRDF Sesame, with minimal modifications in the parsing process re-implemented to match those in BlazeGraph. In particular, BIND clauses are moved to the start of their respective sub-query after the first parsing stage.

This results in 211,703,889 valid queries. We then eliminate exact string duplicates individually for each interval to obtain a subset of unique valid queries. A specific unique query may therefore still re-occur in several intervals. The numbers of queries per interval are given in Table 2.
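The per-interval deduplication can be illustrated with a short Python sketch. It assumes that each log record has already been reduced to a pair of interval label and query string. The function name and record layout are hypothetical and not taken from the authors' tooling.

```python
from collections import defaultdict


def unique_per_interval(records):
    """Drop exact string duplicates separately within each interval.

    records: iterable of (interval_label, query_string) pairs.
    Returns a dict mapping each interval to its unique queries, in order of
    first occurrence. The same query may still appear in several intervals.
    """
    seen = defaultdict(set)
    unique = defaultdict(list)
    for interval, query in records:
        if query not in seen[interval]:
            seen[interval].add(query)
            unique[interval].append(query)
    return dict(unique)


if __name__ == "__main__":
    log = [("I1", "SELECT 1"), ("I1", "SELECT 1"), ("I2", "SELECT 1"), ("I2", "SELECT 2")]
    print(unique_per_interval(log))
```

Because duplicates are removed per interval, summing the unique counts across intervals can exceed the number of globally distinct query strings.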

The column Patterns counts unique query patterns after a further abstraction. We uniformly replace all resources in subject and object positions with a normalised placeholder, using different placeholders for URIs and literals (by type). Patterns therefore reflect basic resource types, co-occurrence of resources, and all predicates used in the query. We also normalise values in LIMIT and OFFSET. Table 3 illustrates the pattern we would obtain for the query from Table 1. Our approach follows Raghuveer who observed that programs often use query templates to construct many similar-looking queries where only certain placeholders are instantiated [ 14 ]. However, we do retain predicates since the abstraction would otherwise be too strong (in particular, most one-triple queries would lead to the same pattern when also abstracting predicates).
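A minimal Python sketch of this kind of abstraction is given below. It works on triple patterns that are already parsed into (subject, predicate, object) strings, which is a simplification: the actual analysis operates on the parsed query structure. The placeholder names, the decision to keep variable names unchanged, and the function names are assumptions made for illustration.

```python
def placeholder(term):
    """Map a subject or object term to a normalised placeholder."""
    if term.startswith("?"):
        return term  # assumption: variables are kept as they are
    if term.startswith('"'):
        # Literals get a placeholder per datatype, e.g. LITERAL^^xsd:dateTime.
        if "^^" in term:
            return "LITERAL^^" + term.rsplit("^^", 1)[1]
        return "LITERAL"
    return "URI"


def abstract_pattern(triples, limit=None, offset=None):
    """Return a hashable pattern: predicates are retained, resources are not."""
    body = tuple((placeholder(s), p, placeholder(o)) for s, p, o in triples)
    # LIMIT and OFFSET are recorded only as present or absent, not by value.
    modifiers = tuple(
        name for name, value in (("LIMIT", limit), ("OFFSET", offset)) if value is not None
    )
    return body, modifiers


if __name__ == "__main__":
    q1 = [("?x", "wdt:P31", "wd:Q5"), ("?x", "wdt:P569", '"1952-03-11"^^xsd:dateTime')]
    q2 = [("?x", "wdt:P31", "wd:Q515"), ("?x", "wdt:P569", '"2000-01-01"^^xsd:dateTime')]
    print(abstract_pattern(q1, limit=10) == abstract_pattern(q2, limit=500))
```

Two queries instantiated from the same template collapse to one pattern, while a change of predicate yields a different pattern, which is the reason given in the text for retaining predicates.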

We can see a clear trend towards a significant increase in query traffic over time, which we have witnessed for many months since the introduction of the query service, the exception being a decline in June 2017 following new measures for throttling scripts that send overly many queries in very short times. We can also see some variability in the number of query types, which roughly measures how uniform queries were in an interval.

⁶ Maven artefact com.blazegraph.sparql-grammar v2.1.4

Table 4

     Organic queries ...                                     | Robotic queries ...
(a)  ... fetch data to be delivered directly to human users  | ... fetch data to be processed algorithmically
(b)  ... are part of an ongoing human interaction            | ... are executed without close human supervision
(c)  ... reflect an immediate human information need         | ... may serve many indirect purposes (or none)
(d)  ... are typically sent from browser applications        | ... are sent from applications that rarely run in browsers
(e)  ... represent the needs and interests of many           | ... are not representative of general needs
(f)  ... are relatively diverse                              | ... are relatively uniform
(g)  ... have uniform distributions that change continuously | ... have skewed distributions that change abruptly

4 CLASSIFYING QUERY SOURCES

We conjecture that a meaningful analysis of SPARQL query logs in most cases must involve a classification of traffic into two basic forms, organic and robotic queries. In this section, we characterise both types, and we present our approach for separating them for the example case of the Wikidata query logs.

In an idealised view, organic and robotic queries are characterised as shown in Table 4. Note that we do not restrict organic queries to mean those that are manually typed in by individual users, as studied previously [ 15 ]. Indeed, we wish to include users who are not aware that SPARQL is being used at all, as long as the query is representative of their present information needs.

Nevertheless, there is a grey area of applications that may allow users to schedule and execute thousands of queries through a browser interface. There is a gradual transition between idealised organic and robotic queries, and, in theory, many intermediate applications are conceivable. If in doubt, such cases should be considered robotic, since they can then still be analysed individually without their traffic giving undue prominence to individual users among the organic queries.

To classify queries as organic or robotic, we rely on just two characteristics, (d) and (f), with some further guidance from (g). We use only three features to decide the type of a query: the user agent (for (d)), the abstract query pattern (for (f)), and (when available) comments in the query that some tools use to identify themselves. Our implementation determines the query type by custom rules that use these two aspects only. By default, we expect all browser-related user agents to indicate organic traffic, while all other agents indicate robotic traffic. However, we have implemented a more fine-grained causal analysis that tries to relate certain user-agent/pattern combinations to specific sources (programs). Individual sources can then be manually added to either of the two categories.

Concretely, we consider WikiShootMe (nearby sites to photograph), SQID (data browser), and Histropedia (interactive timelines) as sources of organic traffic, and leave other software tools in the robotic category. On the other hand, we had frequent occasion to classify queries sent from well-known browsers as robotic. Examples included an apparently browser-based application that retrieved Wikidata items with hundreds of thousands of diverse movie database identifiers, and another “browser” that issued the exact same query three million times. Clearly, these cases did not meet our criteria for organic traffic, yet they would completely change the characteristics of the organic dataset when not discovered.
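The rule-based scheme can be sketched in Python as follows. This is only an illustration of the default rule plus manual overrides, not the authors' implementation: the browser markers, the way a tool name is obtained from a query comment, and the override set are all hypothetical.

```python
# Hypothetical markers for "browser-related" user agents.
BROWSER_MARKERS = ("Mozilla", "Chrome", "Safari", "Firefox", "Opera")

# Tools named in the text as sources of organic traffic.
ORGANIC_TOOLS = frozenset({"WikiShootMe", "SQID", "Histropedia"})


def classify(user_agent, pattern, tool_comment=None, robotic_overrides=frozenset()):
    """Return "organic" or "robotic" for one query.

    user_agent:        raw user agent string of the request
    pattern:           abstract query pattern (any hashable value)
    tool_comment:      tool name found in a query comment, if any (assumed input)
    robotic_overrides: manually collected (user_agent, pattern) pairs that look
                       like browsers but were identified as scripts
    """
    if tool_comment is not None:
        # Self-identified tools are organic only if explicitly listed.
        return "organic" if tool_comment in ORGANIC_TOOLS else "robotic"
    is_browser = any(marker in user_agent for marker in BROWSER_MARKERS)
    if is_browser and (user_agent, pattern) in robotic_overrides:
        return "robotic"
    return "organic" if is_browser else "robotic"


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (X11; Linux x86_64)", "pattern-1"))
    print(classify("python-requests/2.18", "pattern-1"))
```

The override set corresponds to the cases described above, where a nominal browser issued the same query millions of times. When in doubt, the sketch defaults to the robotic class for unknown tools, in line with the rule stated earlier in this section.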

We have applied this classification to the set of valid queries, leading to a distribution of queries as shown in Table 5. The total of 658,890 organic queries accounts for less than 0.5% of the traffic, and would therefore be overlooked completely in any non-discriminative analysis. Already Table 5 clearly shows that this small fraction of traffic behaves significantly differently from the overall dataset. While robotic traffic contains 21% to 33% unique queries, organic queries are 72%–79% unique. The non-discriminative query analysis of Bonifati et al. showed that DBpedia logs have between 30% (2016) and 54% (2013) unique queries, while other datasets are 3%–30% unique [ 3 ]. Our robotic queries therefore are well within the typical range observed so far, whereas our organic queries seem to represent a very different type of traffic.

4.1 SPARQL Feature Prevalence

Further differences are revealed when analysing the SPARQL query features used in each dataset. SELECT queries make up more than 99% of queries in each dataset, so we do not report uses for DESCRIBE, CONSTRUCT, or ASK. Table 6 shows the prevalence of most common solution set modifiers, graph matching patterns, and aggregates in each of the query sets. Join refers to the (often implicit) SPARQL join operator; Filter also includes FILTER expressions using NOT EXISTS; and SERVICE calls are split between the very common Wikidata label retrieval service (lang) and others. We will discuss several aspects of this large table in the remainder of this section.

Absent Features. We have omitted features that were generally used in less than 1% of queries from Table 6. This applies to REDUCED, EXISTS, GRAPH, the occurrence of + in property paths, as well as specific aggregation functions (MIN, MAX, SUM, AVG). Any other feature that is not shown in the table has not been counted. The absence of GRAPH in our data is expected, since Wikidata does not use named graphs.

Unique Queries. Many studies restrict to unique queries. We do not find it obvious that this is most suitable. Indeed, in a continuously changing database like Wikidata, where the RDF export is updated at least once per minute, a repeated query may indicate a real and recurring information need, and it may always require a new answer to be computed. Moreover, considering organic traffic, it is also relevant if many users require the exact same data. Therefore neither human interests nor database query load can necessarily be understood any better by eliminating duplicates. Table 6 includes prevalence among unique queries to show that the choice between the two views has a significant impact on results. A lower prevalence among unique queries also indicates that queries with that feature are more likely to be repeated than queries without this feature. For example, organic queries with subqueries are less frequent when eliminating duplicates, which shows that such queries tend to re-occur more often than others. Note that not all queries contribute to the feature counts in Table 6, and in general it is possible that queries without any counted feature (e.g., single triple queries) are much more common among the unique queries.
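The effect of deduplication on a prevalence figure can be made concrete with a toy Python example. The data and the feature test are invented for illustration. The sketch only shows that a feature whose queries repeat often has a smaller share among unique queries than among all queries.

```python
def prevalence(queries, has_feature):
    """Share of queries for which has_feature(query) is true."""
    queries = list(queries)
    return sum(1 for q in queries if has_feature(q)) / len(queries)


def prevalence_all_vs_unique(queries, has_feature):
    """Return (share among all queries, share among unique query strings)."""
    queries = list(queries)
    return prevalence(queries, has_feature), prevalence(set(queries), has_feature)


if __name__ == "__main__":
    # Toy log: one subquery-style query repeated three times, two other queries.
    log = ["Q-with-subquery"] * 3 + ["Q-plain-1", "Q-plain-2"]
    share_all, share_unique = prevalence_all_vs_unique(log, lambda q: "subquery" in q)
    print(share_all, share_unique)
```

In this toy log, the feature occurs in three of five queries overall but in only one of three unique queries, so reporting either number alone would give a different impression of its importance.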

Stability of Results. The usage patterns of any technical system will evolve over time, but it is difficult to estimate the stability of specific metrics. We have split our data into three intervals to make such changes visible. Robotic traffic exhibits huge fluctuations, e.g., for joins (67%–88%) and OPTIONAL (11%–25%). Prevalence among unique queries sometimes fluctuates independently, e.g., for UNION (2.5%–8.6%). By studying one or several of the intervals as a single dataset, and by choosing to include or exclude duplicate queries, one could therefore arrive at extremely different conclusions from this data. Many previous studies of SPARQL logs – often smaller than ours, and therefore even more easily dominated by bots – should be viewed in this light.

Organic traffic tends to be more stable, but also shows significant variations in some cases. In particular, we can see some change in I3, most notably for VALUES and UNION. We discuss and explain this change in detail in Section 6. In most cases, however, we can see that metrics are fairly continuous, and that moreover, unique queries are much more representative of all queries than in the robotic case.

Feature Usage as Compared to Other Studies. The previous discussion suggests that it is generally questionable whether any insights can be obtained by comparing SPARQL log metrics based on a non-discriminative analysis of all queries. Nevertheless, there are some aspects where our results show overwhelming differences from previously reported findings. The most systematic overview across several datasets is given by Bonifati et al. [ 3 ]. They found SERVICE, VALUES, BIND, and property paths to occur in less than 1% of unique queries, while they have great relevance in all parts of our data. Similarly low prevalence was reported for subqueries, SAMPLE, and GroupConcat – our robotic traffic is similar, but our organic traffic paints a very different picture. Subqueries in particular are strikingly common there.

4.2 SPARQL Feature Co-Occurrence

The expressive power and computational complexity of SPARQL queries are determined not so much by the presence or absence of individual features, but by their co-occurrence and interaction. Investigating which combinations of features occur in queries is therefore of interest. We present the results of this analysis here.

Many studies restrict to analysing co-occurrence of join, OPTIONAL, UNION, and FILTER. In our case, however, we find that also subqueries, property paths, VALUES, and SERVICE occur in a significant share of queries. The use of SERVICE is dominated by Wikidata’s labelling service, which is mostly used to add labels after fetching query results. To reduce the number of feature combinations, we therefore ignore the labelling service entirely. We do not count queries that use any other type of service, or any other feature not mentioned explicitly (the most common such feature being BIND). Solution set modifiers (LIMIT etc.) and aggregates (COUNT etc.) are ignored: they can add to the complexity of query answering only when used in subqueries, so this feature can be considered an overestimation of the amount of such complex queries.

The results of this analysis are presented in Table 7 for all operator combinations that accounted for more than 1% of queries in either dataset. Path expressions in this case include all queries where either * or + occurs. The features we selected, possibly in combination with a labelling service, account for around 85% of all queries in either dataset, but the distributions are very different.
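The counting procedure described in this section can be sketched in Python. The sketch assumes that each query has already been reduced to a set of feature names. The feature labels, the contents of the ignored set, and the function name are hypothetical and only mirror the rules stated in the text.

```python
from collections import Counter

# Features whose combinations are counted.
COUNTED = frozenset({"Join", "Optional", "Union", "Filter", "Subquery", "Path", "Values"})

# Features that are ignored: the labelling service, solution set modifiers, aggregates.
IGNORED = frozenset({"LabelService", "Limit", "Offset", "OrderBy", "Distinct",
                     "GroupBy", "Count", "Sample", "GroupConcat"})


def combination_shares(feature_sets, min_share=0.01):
    """Share of all queries per feature combination, above a threshold.

    Queries that use any feature outside COUNTED and IGNORED (for example
    BIND or a non-label SERVICE) are not attributed to any combination,
    but still count towards the total.
    """
    counts = Counter()
    total = 0
    for features in feature_sets:
        total += 1
        relevant = frozenset(features) - IGNORED
        if relevant <= COUNTED:
            counts[relevant] += 1
    return {combo: n / total for combo, n in counts.items() if n / total > min_share}


if __name__ == "__main__":
    queries = [{"Join"}, {"Join", "LabelService"}, {"Join", "Optional", "Limit"}, {"Join", "Bind"}]
    for combo, share in combination_shares(queries).items():
        print(sorted(combo), share)
```

In the toy data, the query using BIND is left out of every combination, so the reported shares sum to less than one, just as the selected features in Table 7 do not cover all queries.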

Traditional Query Fragments. Plain conjunctive queries (CQ) that contain only joins account for 21.3% (30.9%) of organic (robotic) traffic. The frequently studied conjunctive-filter-pattern queries (CFP), which also consider FILTER, increase these values to 29.1% (35.5%). This is far below the prevalence reported for such simple queries in other datasets, typically above 65% [ 3, 13 ]. VALUES can be allowed in CQs without an increase in complexity for a coverage of 21.7% (61.4%). Especially robotic traffic contains this pattern, which is not surprising since VALUES is an efficient way to combine query batches into one.

Path Queries. The extension of CQ with property path expressions leads to conjunctive regular path queries (CRPQs [ 4, 6 ]), which account for 24.4% (44.5%) of organic (robotic) queries. While organic queries contain almost the same high amount of path expressions (Table 6), the simple CRPQ fragment does not suffice to capture as many of them as in the robotic case. A highly prominent query fragment for the robotic case are CRPQs with VALUES, which account for 74.9% of all queries.

OPTIONAL and UNION. OPTIONAL is much more popular in organic queries, where together with join it accounts for 44.9% of the traffic. We attribute this to the fact that user interfaces often try to show as much information as available. On the other hand, robotic queries have much less use for query results that may miss some of the queried values, especially since labels can be fetched with a dedicated service. UNION rarely occurs in otherwise simple queries, although it does occur in significant amounts of queries overall. Our findings again deviate from Bonifati et al. who found UNION alone enough to account for 7.5% of unique queries [ 3 ].

Subqueries. Table 7 shows that more than 10% of organic queries include subqueries while otherwise using nothing but joins, FILTER, and OPTIONAL. However, we have ignored solution set modifiers and aggregates in this analysis, since they normally play a role only in post-processing. In combination with subqueries, however, such operations may have a big impact on the expressive power of the query language.

5 ROBOTIC VS. ORGANIC: EVALUATION

We have already seen that organic and robotic traffic are significantly different in many respects. This shows that our partitioning of queries is not random, but it does not support the claim that they actually can be characterised as in Table 4. In this section, we therefore investigate whether the datasets also exhibit previously asserted characteristics that have not been used to define them in the first place.

Temporal distribution. We begin by considering the temporal distribution. According to Table 4 (b), we would expect organic queries to be correlated with the time of day, while robotic queries should not show any such relationship. Figure 1 shows the hourly temporal traffic volume (in absolute numbers) and the relative distribution across 24h-intervals. Organic traffic follows a strong daily rhythm, with most activities happening during the European and American day and evening. This strongly supports a direct human involvement.
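A relative distribution across the hours of the day, of the kind plotted in Figure 1, can be computed with a few lines of Python. The sketch assumes request times are available as datetime objects in a single common time zone. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def hourly_shares(timestamps):
    """Return a list of 24 shares: the fraction of requests in each hour of day."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no timestamps given")
    return [counts[hour] / total for hour in range(24)]


if __name__ == "__main__":
    requests = [
        datetime(2017, 6, 12, 9, 0),
        datetime(2017, 6, 12, 9, 30),
        datetime(2017, 6, 13, 21, 5),
        datetime(2017, 6, 14, 9, 59),
    ]
    shares = hourly_shares(requests)
    print(shares[9], shares[21])
```

A strong daily rhythm shows up as a few hours carrying most of the mass, whereas traffic unrelated to the time of day would spread more evenly over the 24 entries.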

This also suggests that our dataset contains only few organic queries from Asian users, which in our case simplifies the detection of the expected patterns. If usage were globally uniform, it would be promising to correlate daily usage with hints on geographical context, which can be found in queries, e.g., in the form of coordinates or language information.

Figure 1 also shows some abrupt changes in the robotic traffic, as predicted by Table 4 (g).

Content preference. According to Table 4 (c), user queries are expected to mirror more direct human interests. We investigate this by considering the predicates used in queries. Table 8 shows the top ten most frequently used Wikidata properties in organic and in robotic queries. Robotic queries frequently involve IDs of external databases, whereas organic queries often refer to properties related to locations. Human properties such as occupation (and, just outside of the top ten, date of birth and sex or gender) also rank highly in organic queries. Commons category refers to Wikimedia Commons and can be used to obtain related media for some entity. Wiki Loves Monuments ID, the only identifier in the organic top ten, relates to the eponymous content creation activity of Wikipedia, which aims at gathering more information on local sites of interest.

Diversity vs. Uniformity. We have conjectured that organic queries are more diverse, since they are not controlled by a small number of programs, and since user information needs are generally less uniform. Our datasets support this in more than one sense. We have already seen from Table 6 that organic queries tend to use more diverse SPARQL features, including some that have no significant share of robotic queries. We also found that simple combinations of operators can account for 75% of robotic queries, whereas organic queries exhibit a greater variance. For the study in Section 4.2, we found 17 feature combinations that make up over 1% of organic queries, while only 8 combinations have such high prevalence among robotic queries (Table 7). Similarly, the usage of RDF properties is more skewed in robotic traffic, where only 37 properties occur in more than 1% of queries. In contrast, 59 different RDF properties occur in more than 1% of organic queries.

Another frequently studied structural aspect of queries is their actual length in terms of the number of triples. To determine this number, we have not counted triples that occur within SERVICE clauses, since these are mostly used to call built-in BlazeGraph functions, where triples represent parameters rather than referring to actual RDF data. Figure 2 shows the results, with the usual peak of 1-triple queries in the robotic case: 55.96% of queries have at most this size, and 96% of queries have at most 7 triples. This matches findings for other datasets, where an average of 56.45% of queries had at most one triple and 91% of queries used 6 or fewer [ 3 ]. Our robotic average query size of 2.45 triples also resembles that of DBpedia, reported to be between 2.09 and 3.98 for various samples.
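The size statistics quoted in this part of the evaluation (share of queries up to a given size, the size needed to cover a target share, and the average) can be computed as in the following Python sketch. It assumes the triple count per query is already known, with triples inside SERVICE clauses excluded beforehand. The function names and the toy data are illustrative only.

```python
def size_coverage(sizes, max_triples):
    """Share of queries that have at most max_triples triples."""
    return sum(1 for s in sizes if s <= max_triples) / len(sizes)


def smallest_size_covering(sizes, target_share):
    """Smallest triple count k such that at least target_share of queries have size <= k."""
    ordered = sorted(sizes)
    n = len(ordered)
    for rank, size in enumerate(ordered, start=1):
        if rank / n >= target_share:
            return size
    return ordered[-1]


def average_size(sizes):
    """Mean number of triples per query."""
    return sum(sizes) / len(sizes)


if __name__ == "__main__":
    toy_sizes = [1, 1, 1, 1, 2, 2, 3, 5, 7, 11]  # invented triple counts
    print(size_coverage(toy_sizes, 1))
    print(smallest_size_covering(toy_sizes, 0.9))
    print(average_size(toy_sizes))
```

A wider distribution shows up as a lower coverage at size 1 and a larger size needed to reach a fixed target share.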

In contrast, the size distribution of organic queries is much wider. Queries with at most one triple only make up 17.30%, and one has to consider queries with up to 11 triples to cover 97% of the data. The average query size is 4.38 in this case. The largest organic query had 143 triples, while the largest robotic ones contained 66 triples. Again, we can see that organic queries are more diverse, and also more complex than robotic ones.

6 UNDERSTANDING WIKIDATA USAGE

Based on the above investigations, we assume that our classification of queries can successfully capture the ideas expressed in Table 4. We now turn towards exploiting this insight for obtaining a better understanding of actual Wikidata usage. Much of the previous discussion also contributes towards this goal, e.g., the analysis of temporal distribution (indicating a lack of Asian users) and the ranking of properties (showing significant user interest in local data), but we can refine these findings further.

6.1 Annotations and Complex Statements

As explained in Section 2, Wikidata supports annotations on statements, which make it possible to express contextual information, and which lead to a more complex RDF encoding. It is therefore relevant to ask whether queries make use of such information, or whether they rather use the simplified view that drops all annotations.

The analysis of URIs used in predicate positions lets us answer this question. References are linked via the RDF provenance vocabulary URI prov:wasDerivedFrom, which occurs in 1.1% (0.07%) of all organic (robotic) queries. Another annotation is the statement rank, which is used in 0.48% (0.62%) of all queries. Finally, statements can also be annotated with arbitrary Wikidata properties, and such use can also be recognised from URIs. The most frequently used properties in statement annotations for organic and robotic queries are shown in Table 9. Annotations are of course used only in a minority of queries. Their pronounced use in the case of start and end time witnesses the importance of temporal validity when interpreting statements from Wikidata.

Conversely, we can also collect the most common properties used in statements for which annotations are queried. To gauge this metric, we consider RDF properties used for encoding the complex form of statements that uses a dedicated node for representing edges. Note that this does not always indicate that the query also referred to annotations of these properties. Another motivation for using the more complex statement encoding is that the simplified statement encoding is only generated for statements of maximal rank; queries that are interested in all (e.g., historical) data therefore need to use complex statements even when not querying for annotations. Table 10 displays the most common properties that appeared in a form as used for complex statement encodings.
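Recognising the encoding from predicate URIs can be sketched as follows. The namespace URIs are those of the public Wikidata RDF export (prefixes wdt:, p:, ps:, pq:, together with prov:wasDerivedFrom and wikibase:rank); the category labels and function names are assumptions chosen for this illustration and do not reproduce our analysis code.

```python
PROP = "http://www.wikidata.org/prop/"
PREFIXES = [
    (PROP + "direct/", "simple"),              # wdt: simplified statements
    (PROP + "statement/", "statement value"),  # ps:  value of a statement node
    (PROP + "qualifier/", "qualifier"),        # pq:  annotation on a statement
]
EXACT = {
    "http://www.w3.org/ns/prov#wasDerivedFrom": "reference",
    "http://wikiba.se/ontology#rank": "rank",
}
COMPLEX = {"statement node", "statement value", "qualifier", "reference", "rank"}

def classify_predicate(uri):
    """Map a predicate URI to a (hypothetical) encoding category."""
    if uri in EXACT:
        return EXACT[uri]
    for prefix, label in PREFIXES:
        if uri.startswith(prefix):
            return label
    if uri.startswith(PROP) and "/" not in uri[len(PROP):]:
        return "statement node"  # p: links an entity to a statement node
    return "other"

def uses_complex_encoding(predicate_uris):
    """True if any predicate belongs to the complex statement encoding."""
    return any(classify_predicate(u) in COMPLEX for u in predicate_uris)
```

For example, a query whose predicates are p:P39, ps:P39 and pq:P580 (start time) would be counted as using complex statements, whereas a query that only uses wdt:P31 would not.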

We can see that complex statements are a larger fraction in organic queries. Indeed, 18.9% of all organic queries are using properties that are part of the complex RDF encoding of statements, rather than relying on simplified statements represented by single edges alone.

6.2 Language Distribution

We saw that the temporal distribution of organic queries suggests that few users from Asia are accessing Wikidata via SPARQL-based applications. For further insights, it is interesting to study the language-related information contained in queries. Indeed, the use of Wikidata by communities of different languages is a relevant research topic by itself [ 8 ].

We have therefore extracted the languages for which queries request labels using the labelling service, and grouped queries by the first (most preferred) language they use. This resulted in over 200 different language tags, including misspellings. Of those, 35 occurred in more than 100 queries. For the 16 languages that occurred in more than 1000 queries, we show the query numbers for each of the intervals in Figure 3 using a logarithmic scale. The special label “[AL]” denotes the value [AUTO_LANGUAGE], used for retrieving a language based on browser settings.
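The extraction of the first label language can be approximated by a regular expression over the query text, as in the sketch below. It assumes the usual form of the labelling service parameter (bd:serviceParam wikibase:language "..."), and it is a simplification of a parser-based extraction; the function name is hypothetical.

```python
import re

# Matches the language list passed to the labelling service.
LANGUAGE_PARAM = re.compile(r'wikibase:language\s+"([^"]*)"')

def first_label_language(query):
    """Return the most preferred label language of a query, or None."""
    match = LANGUAGE_PARAM.search(query)
    if match is None:
        return None
    first = match.group(1).split(",")[0].strip()
    # Abbreviate the browser-dependent placeholder as in Figure 3.
    return "[AL]" if first == "[AUTO_LANGUAGE]" else first

example = (
    'SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q146 . '
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "pl,en" . } }'
)
print(first_label_language(example))
```

Tags are returned exactly as written in the query, so misspelled tags are kept as separate values rather than being corrected.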

English clearly dominates the list, followed by mostly European languages (cs is Czech, sv is Swedish, ca is Catalan, and eu is Basque). Asian languages follow only at the end, with Indonesian (id), Hebrew (he), and Japanese (ja) hardly occurring in more than 1,000 queries in total. The sum of several variants of Chinese also amounts to 1,122 queries; lower still is Arabic (798). These observations suggest a strong imbalance in the global use of Wikidata via SPARQL. The choice of language may also reflect the lack of labels in some languages [ 8 ], but since the labelling service supports any number of fallback languages, users could still always put their preferred language at the front. Moreover, the number of queries asking for a certain language is only weakly related to the total number of labels available in a language. As of March 2018, Wikidata has most labels for English (32.7M), Dutch (10.7M), French (9.4M), German (8.0M), Spanish (6.5M), Italian (6.1M), Swedish (6.1M), Russian (5.4M), Cebuano (5.0M), and Bulgarian (2.9M). Polish follows at fourteenth place with 2.6M labels; Hebrew is only at 92nd place with fewer than 460K labels.

The situation for robotic traffic is similar, but the labelling service is used in a much smaller fraction of queries in this case, and English is more dominant. Some European languages still occur, but others hardly do (e.g., Polish). Chinese is slightly more prominent compared to other languages, with more than 100,000 queries, but this is still less than 0.1% of all robotic queries.

An interesting observation from Figure 3 is that some languages seem to be “trending” in that we see a clear increase in their popularity. This is particularly strong for Polish, but can also be seen for Catalan. The next section discusses methods that can help to understand such observations.

6.3 Causal Analysis

We have argued that mere statistical analysis can easily be misleading. Indeed, it is not useful to compute averages over exponential distributions (skewed data), which tend to appear on many scales in usage analysis. A better understanding can be obtained by a more fine-grained analysis that tries to attribute the observed traffic to individual causes. Beyond superficial statistics, this gives us valuable insights into the goals underlying current usage.
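A small numerical illustration may help to see why averages are uninformative for such data. The per-source query counts below are invented for this sketch and do not come from our logs.

```python
from statistics import mean, median

# Hypothetical query counts of ten sources, dominated by a single one.
counts = [5_000_000, 40_000, 900, 300, 120, 60, 25, 10, 4, 1]

average = mean(counts)                  # pulled up by the largest source
typical = median(counts)                # what a "typical" source issues
top_share = max(counts) / sum(counts)   # fraction issued by the largest source

print(average, typical, f"{top_share:.1%}")
```

In this toy setting the average exceeds the count of every source but one, while the median shows that most sources issue very few queries. Attributing traffic to individual sources avoids this distortion.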

We have conducted such an analysis by distinguishing queries that are issued by recognisable tools. This analysis started from a number of self-identifying tools, which include the tool’s name in a comment in their SPARQL queries.7 Many further tools were identified from their distinctive trafic patterns, usually marked by bursts of many similar queries, in combination with their user agents. This was mostly manual work, based on inspecting the extracted query patterns that were most frequent, as well as their temporal distribution.
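Detecting self-identifying tools amounts to finding a comment that starts with #TOOL: in the query text. The following sketch assumes that the marker is placed at the beginning of a line and that the rest of the line is the identifying string; the function name is hypothetical, and a regular expression is used here in place of a full SPARQL parser.

```python
import re

# A comment line of the form "#TOOL: <identifying string>".
TOOL_COMMENT = re.compile(r"^\s*#TOOL:\s*(.+?)\s*$", re.MULTILINE)

def self_identified_tool(query):
    """Return the tool name announced in a #TOOL: comment, or None."""
    match = TOOL_COMMENT.search(query)
    return match.group(1) if match else None

example = "#TOOL: SQID\nSELECT ?p WHERE { ?p a wikibase:Property }"
print(self_identified_tool(example))
```

Queries without such a comment yield None and have to be attributed by other signals, such as traffic patterns and user agents.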

Some of the identified sources have vanished in the following month. For the intervals analysed in this paper, we know of 95 query sources that together account for 144,787,485 (68.39%) of all valid queries. As mentioned in Section 4, we consider three self-identified tools as organic; the respective numbers of queries are: 57,564 (WikiShootMe), 46,691 (SQID), and 996 (Histropedia).

In addition, we identified a browser-based tool that looked up information on local sites based on geographic coordinates in Poland, with a total of 48,051 queries. This activity was notable only in I3 and could be traced back to I2, but not to I1. The sudden prominence of the tool is connected to the Wiki Loves Monuments competition of Wikipedia, which ran in September 2017, and indeed the respective queries account for the occurrence of property P2186 in Table 8. Moreover, the sudden rise of this new application largely explains why I3 shows somewhat different characteristics in Table 6, and why Polish is seeing a strong upwards trend in Figure 3.

Footnote 7: The convention for doing this in the Wikidata community is to start a comment with #TOOL: followed by some identifying string.

The remaining 91 sources were considered robotic. The top three of them together issue 60.06% of all queries that the Wikidata SPARQL query service has answered in our data. They are:

• auxiliary matcher: a data integration tool that identifies potential matches with external datasets; 63,745,842 queries
• PBB_core fastrun: a data integration tool for protein databases; 41,605,446 queries
• bot2: a multi-purpose bot operated by Magnus Manske;8 22,951,459 queries

Further very active query sources include several unidentified scripts and bots. Prominent tools with more than 1M queries query movie database ids, information on association football, and detailed name and label information. The most active sources also include a query that polls the time of the most recent update of the RDF database – this single query has been answered 1,861,752 times (i.e. about once every 11.5 seconds).
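To give a sense of how concentrated the traffic is, the following back-of-the-envelope sketch estimates the combined share of two of the sources listed above, auxiliary matcher and bot2. The total number of valid queries is not restated in this section, so the sketch derives it from the figure of 144,787,485 queries being 68.39% of all valid queries; the result is therefore only approximate.

```python
# Figures quoted in the text above.
known_source_queries = 144_787_485   # queries of the 95 identified sources
known_source_share = 0.6839          # their share of all valid queries

# Assumption: recover the total from the two figures above.
total_valid = known_source_queries / known_source_share

auxiliary_matcher = 63_745_842
bot2 = 22_951_459

combined_share = (auxiliary_matcher + bot2) / total_valid
print(f"{combined_share:.1%}")
```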

Our analysis strongly supports our conjecture that any non-discriminative analysis of the data is necessarily skewed towards a small number of tools. Indeed, both auxiliary matcher and bot2 are maintained by Magnus Manske, which means that he controls about 41% of all traffic. We believe that this is no unusual situation that only occurs for Wikidata, although few individuals will have the impact of a Magnus Manske.

7 ANONYMISATION AND PUBLICATION

We intend to publish our log datasets together with this study, pending the outcome of an ongoing clearance process within the Wikimedia Foundation. Since SPARQL logs are user access logs, they must be treated very carefully to avoid privacy concerns. The published logs therefore are to contain neither IP addresses nor user agents, but they will be partitioned according to our methodology to allow studies of different user groups. They will also contain time stamps for each query.

The queries will also need to be modified to remove any potentially sensitive information. All comments will be removed and each query will be parsed, normalised, and re-serialised. Longer strings, variable names, and geographic coordinates will be replaced uniformly by generic fillers of the same type. We do not expect strings or variable names to contain private information, but the query volume is too large to be certain. Geographic coordinates need to be replaced since they may contain very specific location information that could be linked to individual humans. In contrast, we think that dates (which in Wikidata are only used up to the precision of a day) do not contain enough information to be sensitive. In the current data release proposal, all references to Wikidata entities, and the exact structure of each SPARQL query would be preserved, so that the published files should be of use to other researchers.
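The replacement of strings and variable names can be sketched as follows. This is a deliberately simplified, regex-based illustration and not the parse, normalise, and re-serialise pipeline described above: the length threshold, the filler values, and the function name are assumptions, comments and geographic coordinates are not handled, and variable-like text inside short strings would also be renamed.

```python
import re

# Simplified token patterns; a real SPARQL parser also handles escapes,
# triple-quoted strings, and other syntax.
STRING = re.compile(r'"([^"\\]*)"')
VARIABLE = re.compile(r"\?[A-Za-z_][A-Za-z0-9_]*")

def anonymise(query, max_string_length=10):
    """Replace longer strings and all variable names by generic fillers."""
    def replace_string(match):
        if len(match.group(1)) > max_string_length:
            return '"string"'
        return match.group(0)

    names = {}
    def replace_variable(match):
        name = match.group(0)
        if name not in names:
            # Distinct variables stay distinct, so the query structure is kept.
            names[name] = f"?var{len(names) + 1}"
        return names[name]

    query = STRING.sub(replace_string, query)
    return VARIABLE.sub(replace_variable, query)

example = ('SELECT ?museum WHERE { ?museum rdfs:label '
           '"Private Collection of J. Doe"@en ; wdt:P17 ?country . }')
print(anonymise(example))
```

References to Wikidata entities such as wdt:P17 are left untouched, in line with the release proposal described above.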

Until the anonymised data sets have been released, researchers who wish to replicate our findings can gain access to the data by applying for a formal research collaboration with the Wikimedia Foundation, subject to signing suitable non-disclosure agreements. The source code of our analysis scripts is publicly available.9 In particular, this code also includes our manual classification rules for queries based on their shape and user agent.

Footnote 8: https://en.wikipedia.org/wiki/Magnus_Manske

8 CONCLUSION

We have presented a first detailed study of the access logs of the Wikidata SPARQL query service. Due to the varied uses of SPARQL services, the overall query traffic can roughly be classified into two parts: a large fraction of relatively simple queries that are generated by only a few bots, and a much smaller fraction of more complex queries posed (directly or indirectly) by many human users. We argued that, if we want to gain meaningful insights from query log analysis, the queries of the many should outweigh the queries of the few, since the former can more reliably represent a real human information need.

We proposed a characterisation of both components of organic and robotic traffic, and we showed that these concepts are workable in that (1) both parts can be separated with little effort using a small number of simple signals, and (2) the resulting datasets do indeed mirror many of the expected characteristics. We then continued to study three specific aspects of Wikidata’s usage: the practical use of complex RDF encodings, the global imbalance in the use of queries, and the explanation of individual traffic components by means of identifying their sources.

Our research motivates further studies of real SPARQL queries that apply our methods to other datasets. It would be extremely interesting to learn how much organic query traffic can be found in other datasets, and whether it is as distinct from the rest of the queries as it was in our case. Moreover, a causal analysis of the main robotic query sources could shed more light on the actual real-world use of RDF datasets published via SPARQL. Are data integration and systematic download also the predominant tasks in other scenarios? Which fraction of bots can account for which fraction of the traffic? How do more complicated metrics, such as treewidth [ 3 ] or specific shapes of query patterns [ 9 ], behave for organic and robotic data?

Finally, it would also be of interest to extend the tools available for classifying organic traffic in the first place. Our approach based on user agents, query shapes, and time was feasible but still mostly manual. An automated classification could be of interest, and our data may serve for training and evaluation. Moreover, the approach we took suggests that even anonymised query logs should offer some general user agent information (such as “browser or not”), and also sufficiently fine-grained temporal information. If enough information is retained, then the manual interaction might be avoided completely, allowing the creation of applications that continuously monitor organic traffic for relevant trends. Methodologically speaking, we do indeed seem to be at the very beginning of this field.

Acknowledgements. This work was partly supported by the DFG within the cfaed Cluster of Excellence, CRC 912 (HAEC), and Emmy Noether grant KR 4381/1-1.

Footnote 9: https://github.com/Wikidata/QueryAnalysis

REFERENCES

[1] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. 2008. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. of Biomedical Informatics 41, 5 (2008), 706-716.

[2] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - A crystallization point for the Web of Data. J. of Web Semantics 7, 3 (2009), 154-165.

[3] Angela Bonifati, Wim Martens, and Thomas Timm. 2017. An Analytical Study of Large SPARQL Query Logs. Proceedings of the VLDB Endowment 11 (2017), 149-161. Issue 2.

[4] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Vardi. 2003. Reasoning on regular path queries. SIGMOD Record 32, 4 (2003), 83-92.

[5] Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić. 2014. Introducing Wikidata to the Linked Data Web. In Proc. 13th Int. Semantic Web Conf. (ISWC'14) (LNCS), Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig A. Knoblock, Denny Vrandečić, Paul T. Groth, Natasha F. Noy, Krzysztof Janowicz, and Carole A. Goble (Eds.), Vol. 8796. Springer, 50-65.

[6] Daniela Florescu, Alon Levy, and Dan Suciu. 1998. Query containment for conjunctive queries with regular expressions. In Proc. 17th Symposium on Principles of Database Systems (PODS'98), Alberto O. Mendelzon and Jan Paredaens (Eds.). ACM, 139-148.

[7] Steve Harris and Andy Seaborne (Eds.). 21 March 2013. SPARQL 1.1 Query Language. W3C Recommendation. Available at http://www.w3.org/TR/sparql11-query/.

[8] Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, and Lydia Pintscher. 2017. A Glimpse into Babel: An Analysis of Multilinguality in Wikidata. In Proc. 13th Int. Symposium on Open Collaboration (OpenSym'17), Lorraine Morgan (Ed.). ACM, 14:1-14:5.

[9] Mark Kaminski and Egor V. Kostylev. 2018. Complexity and Expressive Power of Weakly Well-Designed SPARQL. Theory of Computing Systems (2018), 1-38. https://doi.org/10.1007/s00224-017-9802-9 to appear.

[10] Johannes Lorey and Felix Naumann. 2013. Detecting SPARQL Query Templates for Data Prefetching. In Proc. 10th Extended Semantic Web Conf. (ESWC'13) (LNCS), Philipp Cimiano, Óscar Corcho, Valentina Presutti, Laura Hollink, and Sebastian Rudolph (Eds.), Vol. 7882. Springer, 124-139.

[11] Markus Luczak-Roesch, Zamil Aljaloud Saud, Bettina Berendt, and Laura Hollink. 2016. USEWOD 2016 Research Dataset. (2016). https://eprints.soton.ac.uk/385344/

[12] Knud Möller, Michael Hausenblas, Richard Cyganiak, Siegfried Handschuh, and Gunnar Grimnes. 2010. Learning from Linked Open Data Usage: Patterns & Metrics. In Proc. Web Science Conf. (WebSci'10).

[13] François Picalausa and Stijn Vansummeren. 2011. What are real SPARQL queries like? In Proc. Int. Workshop on Semantic Web Information Management (SWIM'11), Roberto De Virgilio, Fausto Giunchiglia, and Letizia Tanca (Eds.). ACM, 6.

[14] Aravindan Raghuveer. 2012. Characterizing Machine Agent Behavior through SPARQL Query Mining. In Proc. 2nd Int. Workshop on Usage Analysis and the Web of Data (USEWOD'12). usewod.org.

[15] Laurens Rietveld and Rinke Hoekstra. 2014. Man vs. Machine: Differences in SPARQL Queries. In Proc. 4th USEWOD Workshop on Usage Analysis and the Web of Data. usewod.org.

[16] Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. 2015. LSQ: The Linked SPARQL Queries Dataset. In Proc. 14th Int. Semantic Web Conf. (ISWC'15), Part II (LNCS), Marcelo Arenas, Óscar Corcho, Elena Simperl, Markus Strohmaier, Mathieu d'Aquin, Kavitha Srinivas, Paul T. Groth, Michel Dumontier, Jeff Heflin, Krishnaprasad Thirunarayan, and Steffen Staab (Eds.), Vol. 9367. Springer, 261-269.

[17] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (2014).