=Paper= {{Paper |id=Vol-2073/article-03 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-2073/article-03.pdf |volume=Vol-2073 |dblpUrl=https://dblp.org/rec/conf/www/BielefeldtGK18 }} ==None== https://ceur-ws.org/Vol-2073/article-03.pdf
 Practical Linked Data Access via SPARQL: The Case of Wikidata
                 Adrian Bielefeldt                                              Julius Gonsior                                Markus Krötzsch
               cfaed, TU Dresden                                             cfaed, TU Dresden                             cfaed, TU Dresden
               Dresden, Germany                                              Dresden, Germany                              Dresden, Germany
        adrian.bielefeldt@tu-dresden.de                                julius.gonsior@tu-dresden.de                  markus.kroetzsch@tu-dresden.de

ABSTRACT                                                                                    remains strongly regulated, but research can help to safely publish
SPARQL is one of the main APIs for accessing linked data collec-                            useful data there as well (as we intend as part of our work).
tions. Compared to other modes of access, SPARQL queries carry                                 Unfortunately, the initial enthusiasm in the study of practical
much more information on the precise information need of users,                             SPARQL usage has been dampened by some severe difficulties. A
and their analysis can therefore yield valuable insights into the prac-                     well-known problem is that SPARQL services experience extremely
tical usage of linked data sets. In this paper, we focus on Wikidata,                       heterogeneous traffic due to their widespread use by software tools
the knowledge-graph sister of Wikipedia, which has offered linked data                      [14, 15]. Indeed, a single user’s script can dominate query traffic for
exports and a heavily used SPARQL endpoint since 2015. Our de-                              several hours, days, or weeks – and then vanish and never run again.
tailed analysis of Wikidata’s server-side query logs reveals several                        Hopes that the impact of such extreme events would even out with
important differences to previously studied uses of SPARQL over                             wider usage have not been justified so far. Even when studying the
large knowledge graphs. Wikidata queries tend to be much more                               history of a single dataset at larger time scales, we can often see no
complex and varied than queries observed elsewhere. Our analysis                            clear usage trends at all. For example, Bonifati et al. recently found
is founded on a simple but effective separation of robotic from or-                         that within the years 2012–2016 the keyword DISTINCT was used
ganic traffic. Whereas the robotic part is highly volatile and seems                        in 18%, 8%, 11%, 38%, and 8% of DBpedia queries, respectively [3].
unpredictable even on larger time scales, the much smaller organic                             As a result, the insights gathered by SPARQL log analysis so far
part shows clear trends in individual human usage. We analyse                               have fallen short of expectations. It seems almost impossible to
query features, structure, and content to gather further evidence                           generalise statistical findings, or to make any predictions for the
that our approach is essential for obtaining meaningful results here.                       next year (or even month). Building new optimisation methods or
                                                                                            user interfaces based on such volatile findings seems hardly worth
1     INTRODUCTION                                                                          the effort. And yet, even recent research works rarely make any
                                                                                            attempt to quantify or at least discuss the impact of (random) scripts
The SPARQL query language [7] is one of the most powerful and
                                                                                            on their findings. Exceptions are few: Raghuveer hypothesised that
most widely used APIs for accessing linked data collections on
                                                                                            similarities in query patterns can be used to find bots, and provided
the Web. Large-scale RDF publication efforts, such as DBpedia [2],
                                                                                            basic analysis of bot requests isolated from larger logs [14]; Rietveld
routinely provide a SPARQL service, often with live data. Moreover,
                                                                                            and Hoekstra used client-side SPARQL logs as a subset of true user
SPARQL has been an incentive for open data projects that are based
                                                                                            queries that they compared to server-side logs [15].
on other formats to convert their data to RDF in order to improve
                                                                                               In this work, we take a first look at the SPARQL usage logs of
query functionality, a route that was chosen by large-scale projects
                                                                                            the official Wikidata query service, and we ask if and how relevant
such as Bio2RDF [1] or the British Museum.1 One of the most
                                                                                            insights can be obtained from them. Our starting hypothesis is that
prominent such projects is Wikidata [17], the large2 knowledge
                                                                                            SPARQL queries can be meaningfully partitioned into two classes:
graph of Wikipedia, which has been offering browsable linked data, RDF
                                                                                            organic queries fetch data to satisfy an immediate information need
exports, and a live SPARQL service since September 2015.
                                                                                            of a human user, while robotic queries fetch data in an unsupervised
   Analysing the queries sent to SPARQL services promises unique
                                                                                            fashion for further automated processing. We then classify queries
insights into the practical usage of the underlying resources [12].
                                                                                            accordingly based on user agent information, temporal distribution,
This opens the door to understanding computational demands [3, 9,
                                                                                            and query patterns, and conduct further analysis on the results.
13], improving reliability and performance [10], and studying user
                                                                                            We argue that the organic component of SPARQL query traffic can
behaviour [14, 15]. Data providers in addition are highly interested
                                                                                            then be studied statistically, since it is relatively regular and since it
in learning how their content is used. This research is enabled by
                                                                                            can reveal the needs of many actual users. In contrast, the robotic
more and more datasets of SPARQL query logs becoming available
                                                                                            component of query traffic should rather be subjected to a causal
[11, 16]. In some cases, including Wikidata, access to SPARQL logs
                                                                                            analysis that attempts to understand the sources of the traffic, so
1 http://www.britishmuseum.org/about_us/news_and_press/press_releases/2011/                 as to predict its current and future relevance to answering specific
semantic_web_endpoint.aspx (accessed January 2018)
2 >45M entities, >400M statements, >200K editors (>37K in Jan 2018), >640M edits
                                                                                            research questions.
                                                                                               Our main contributions are as follows:
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation      (1) We propose the concept of organic and robotic SPARQL
on the first page. Copyrights for third-party components of this work must be honored.             traffic as a basic principle for query log analysis.
For all other uses, contact the owner/author(s).
LDOW’2018, April 2018, Lyon, France                                                            (2) We present a method for classifying query logs accordingly,
© 2018 Copyright held by the owner/author(s).                                                      and we use it to partition a set of over 200M Wikidata


       SPARQL queries. Only 0.31% of the queries are organic, sup-            Table 1: Countries by number of cities with a female mayor
       porting our thesis that human information need is com-
       pletely hidden by bots in most published analyses.                     SELECT ?country ?countryLabel (count(*) AS ?count)
   (3) We evaluate our classification by analysing several aspects            WHERE {
       that we consider characteristic for organic and robotic quer-              ?mayor wdt:P21 wd:Q6581072 .
       ies, respectively. This supports our conjecture that we can                ?city wdt:P31/wdt:P279* wd:Q515 .
       effectively distinguish the two types of traffic.                          ?city wdt:P17 ?country .
   (4) We investigate the organic Wikidata traffic to gain basic                  ?city p:P6 ?statement . ?statement ps:P6 ?mayor .
       insights into actual direct usage of Wikidata.                             FILTER NOT EXISTS { ?statement pq:P582 ?x }
   (5) We discuss anonymisation aspects and potential privacy                     SERVICE wikibase:label {
       issues, which forms the basis for our ongoing efforts for                      bd:serviceParam wikibase:language "ru,en" .
       allowing the Wikimedia Foundation to release essential parts               }
       of our datasets to the public.                                         }
   Besides the concrete contributions towards understanding the               GROUP BY ?country ?countryLabel
use of Wikidata, we believe that our systematic study can help in             ORDER BY DESC(?count) LIMIT 100
advancing the research methodology in the wider field of analysing
linked data access through rich query APIs. Indeed, due to the                about 4.7 billion triples.4 Queryable data is updated at least once
versatile use of SPARQL services – for manual and for automated               per minute to keep synchronised with updates.
requests, for transactional and analytical queries, interactively or             The Wikidata SPARQL service is based on the BlazeGraph data-
in batch processes – the analysis of their usage requires suitable            base management system. SPARQL support is mostly standard, but
techniques and methods that are not sufficiently developed yet.               includes some built-in operational extensions that are represented
                                                                              by (ab)using SPARQL’s SERVICE directive, which is normally used
2    SPARQL ON WIKIDATA                                                       for federated queries to external services. Of chief practical im-
Wikidata is the community-created knowledge base of the Wiki-                 portance is the labelling service, used to fetch optional entity labels
media Foundation. It was founded in 2012 with the main goal of                with support for fallback languages. The widespread use of this
providing a central place for collecting factual data used in Wiki-           service does affect the structure of queries, which rarely include
pedia across all languages [17]. As of March 2018, Wikidata stores            the otherwise familiar OPTIONAL-FILTER pattern to select labels
more than 402 million statements about over 45 million entities.3             in the desired language.
The data is collaboratively curated by a global community, with                  Table 1 shows an example query that illustrates several aspects
over 18,000 registered editors making contributions each month.               of Wikidata’s RDF encoding. The query returns the 100 countries
Wikidata is widely used in diverse applications, such as Apple’s              that have the most cities with a female mayor. Within the query
Siri mobile app, Eurowings’ in-flight information system, and data            pattern, the first line finds a value for ?mayor with gender (P21)
integration initiatives such as the Virtual International Authority File.     female (Q6581072). We are using the simplified property wdt:P21,
   The data model of Wikidata is based on a directed, labelled graph          whose triples cannot carry annotations or source information.
where entities are connected by edges that are labelled by properties.        The following line finds a ?city that is an instance of (P31) the class
Entities can have labels in many languages, but their actual                  city (Q515), or of any subclass thereof (P279*). We then determine the
identifiers are abstract: properties use identifiers such as P569 (“date of   country (P17) of this city. The fourth line of the pattern then uses
birth”), while other entities (called “items”) use identifiers such as        the more complex RDF encoding to find a ?statement for property
Q42 (“Douglas Adams”). Both types of entities can be freely created           mayor (P6) and matches its value to ?mayor. We require that this
by users. The model is distinct from RDF in that edges of the graph           statement has no end time (P582) to ensure that the mayor is current.
may in turn have annotations. This feature is used to record sources,         Finally, the service wikibase:label is invoked to fetch labels in
temporal validity, or other contextual information. Indeed, annota-           Russian, or, as an alternative, English. The query can readily be
tions on edges are using the same community-defined vocabulary                executed online.
as edge labels.                                                                  Wikidata provides extensive documentation for using SPARQL,5
   Since September 2015, Wikidata has provided an official SPARQL             including a collection of over 300 example queries and pages to
service (user interface at https://query.wikidata.org/) to query its          request help in writing new queries. The query service receives
data. For this purpose, data is first converted to RDF. Each edge is          several million queries per day.
represented by a URI that can be associated with its annotations,
following an encoding as laid out by Erxleben et al. [5]. In addition         3    WIKIDATA SPARQL QUERY LOGS
to this faithful encoding, the RDF export also includes simplified            We now give an overview of the datasets we are working with
statements that only capture the actual edge as a single RDF triple,          for this paper. Our data is based on the server-side request logs
without any of its annotations. Different URIs are introduced for the         (Apache Access Log Files) of the Wikidata SPARQL query service,
different roles that properties can play in this encoding, so that all        as exported from the internal logging infrastructure of Wikimedia.
views of the data can be stored and queried in one database without           As logs contain sensitive information (especially IP addresses), this
risk of confusion. As of March 2018, the RDF encoding contains
                                                                              4 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
3 see https://www.wikidata.org/wiki/Wikidata:Statistics and links therefrom   5 https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/

                     Table 2: Query Dataset Sizes                            Table 4: An idealised view of organic and robotic queries

                   Total           Valid         Unique    Patterns                 Organic queries . . .               Robotic queries . . .
        I1    70,201,736      61,250,218      12,833,923    502,083         a . . . fetch data to be delivered    . . . fetch data to be processed
        I2    73,425,065      71,853,238      19,229,539    216,126                 directly to human users             algorithmically
        I3    79,797,901      78,600,433      26,442,258    180,605         b . . . are part of an ongoing hu-    . . . are executed without close
                                                                                    man interaction                     human supervision
      Table 3: Example pattern for the query in Table 1                     c . . . reflect an immediate human    . . . may serve many indirect
                                                                                    information need                    purposes (or none)
SELECT ?var1 ?var2 (count(*) AS ?var3)                                      d . . . are typically sent from       . . . are sent from applications
WHERE {                                                                             browser applications                that rarely run in browsers
    ?var4 wdt:P21 wd:QName1.                                                e . . . represent the needs and in-   . . . are not representative of
    ?var5 wdt:P31/wdt:P279* wd:QName2 .                                             terests of many                     general needs
    ?var5 wdt:P17 ?var1 .                                                   f . . . are relatively diverse        . . . are relatively uniform
    ?var5 p:P6 ?var6 . ?var6 ps:P6 ?var4 .                                  g . . . have uniform distributions    . . . have skewed distributions
    FILTER NOT EXISTS { ?var6 pq:P582 ?var7 }                                       that change continuously            that change abruptly
    SERVICE wikibase:label {
        bd:QName3 wikibase:language "string1" .
    }                                                                       the introduction of the query service, the exception being a decline
}                                                                           in June 2017 following new measures for throttling scripts that
GROUP BY ?var1 ?var2                                                        send overly many queries in very short times. We can also see some
ORDER BY DESC(?var3) LIMIT 1                                                variability in the number of query types, which roughly measures
                                                                            how uniform queries were in an interval.
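
The pattern abstraction illustrated in Table 3 can be approximated with a short Python sketch. This is not the Java/OpenRDF-based processing used for this paper: it works on the raw query string with a regular expression instead of a parse tree, and it assumes that only names with the prefixes wd: and bd: occur in subject or object position. The names query_pattern and ENTITY_PREFIXES are hypothetical and chosen only for illustration.

```python
import re

# Assumption: prefixes whose names denote resources in subject/object
# position. Predicates (wdt:, p:, ps:, pq:, wikibase:) are retained.
ENTITY_PREFIXES = ("wd", "bd")

_TOKEN = re.compile(
    r'"(?:[^"\\]|\\.)*"'                                 # string literal
    + r'|[?$][A-Za-z_]\w*'                               # variable
    + r'|\b(?:' + "|".join(ENTITY_PREFIXES) + r'):\w+'   # entity name
    + r'|\b(?:LIMIT|OFFSET)\s+\d+',                      # solution modifier value
    re.IGNORECASE,
)


def query_pattern(query: str) -> str:
    """Return a normalised pattern for a SPARQL query string (rough sketch)."""
    variables, names, strings = {}, {}, {}

    def rename(table, key, template):
        # Number placeholders in order of first appearance.
        if key not in table:
            table[key] = template.format(len(table) + 1)
        return table[key]

    def replace(match):
        text = match.group(0)
        if text.startswith('"'):
            return '"' + rename(strings, text, "string{}") + '"'
        if text[0] in "?$":
            return "?" + rename(variables, text[1:], "var{}")
        keyword = text.split()[0]
        if keyword.upper() in ("LIMIT", "OFFSET"):
            return keyword + " 1"
        prefix = text.split(":", 1)[0]
        return prefix + ":" + rename(names, text, "QName{}")

    return _TOKEN.sub(replace, query)


a = query_pattern("SELECT ?a WHERE { ?a wdt:P31 wd:Q5 } LIMIT 10")
b = query_pattern("SELECT ?x WHERE { ?x wdt:P31 wd:Q515 } LIMIT 3")
c = query_pattern("SELECT ?x WHERE { ?x wdt:P279 wd:Q515 } LIMIT 3")
print(a)
print(a == b, a == c)
```

In this sketch, the first two queries collapse to the same pattern because they differ only in variable names, the entity, and the LIMIT value, while the third stays distinct because its predicate is retained.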
data is not publicly available, and all records are deleted after a
period of three months. For this research, we therefore created less        4   CLASSIFYING QUERY SOURCES
sensitive (but still internal) snapshots that contain only SPARQL           We conjecture that a meaningful analysis of SPARQL query logs
queries, request times, and user agent information, but no IPs.             in most cases must involve a classification of traffic into two basic
    We consider the complete query traffic in three consecutive inter-      forms, organic and robotic queries. In this section, we characterise
vals in 2017, each spanning exactly four weeks: (I1) 12th June–9th          both types, and we present our approach for separating them for
July, (I2) 10th July–6th August, and (I3) 7th August–3rd Septem-            the example case of the Wikidata query logs.
ber. We process all queries with the Java module of Wikidata’s                 In an idealised view, organic and robotic queries are characterised
BlazeGraph instance,6 which is based on OpenRDF Sesame, with                as shown in Table 4. Note that we do not restrict organic queries
minimal modifications in the parsing process re-implemented to              to mean those that are manually typed in by individual users, as
match those in BlazeGraph. In particular, BIND clauses are moved            studied previously [15]. Indeed, we wish to include users who are
to the start of their respective sub-query after the first parsing stage.   not aware that SPARQL is being used at all, as long as the query is
    This results in 211,703,889 valid queries. We then eliminate exact      representative of their present information needs.
string duplicates individually for each interval to obtain a subset of         Nevertheless, there is a grey area of applications that may al-
unique valid queries. A specific unique query may therefore still           low users to schedule and execute thousands of queries through
re-occur in several intervals. The numbers of queries per interval          a browser interface. There is a gradual transition between ideal-
are given in Table 2.                                                       ised organic and robotic queries, and, in theory, many intermediate
    The column Patterns counts unique query patterns after a further        applications are conceivable. If in doubt, such cases should be con-
abstraction. We uniformly replace all resources in subject and object       sidered robotic, since they can then still be analysed individually
positions with a normalised placeholder, using different placehold-         without their traffic giving undue prominence to individual users
ers for URIs and literals (by type). Patterns therefore reflect basic       among the organic queries.
resource types, co-occurrence of resources, and all predicates used            To classify queries as organic or robotic, we rely on just two
in the query. We also normalise values in LIMIT and OFFSET. Table 3         characteristics, (d) and (f), with some further guidance from (g). We
illustrates the pattern we would obtain for the query from Table 1.         use only three features to decide the type of a query: the user agent
Our approach follows Raghuveer who observed that programs often             (for (d)), the abstract query pattern (for (f)), and (when available)
use query templates to construct many similarly looking queries             comments in the query that some tools use to identify themselves.
where only certain placeholders are instantiated [14]. However, we          Our implementation determines the query type by custom rules
do retain predicates since the abstraction would otherwise be too           that use these features only. By default, we expect all
strong (in particular, most one-triple queries would lead to the same       browser-related user agents to indicate organic traffic, while all other agents
pattern when also abstracting predicates).                                  indicate robotic traffic. However, we have implemented a more fine-
    We can see a clear trend towards a significant increase in query        grained causal analysis that tries to relate certain user-agent/pattern
traffic over time, which we have witnessed for many months since            combinations to specific sources (programs). Individual sources can
6 Maven artefact com.blazegraph.sparql-grammar v2.1.4                       then be manually added to either of the two categories.
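
The default rule and the manual overrides described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for this paper: the browser test is a crude heuristic on the user agent string, the tool_comment field stands for a tool name that is assumed to have been extracted from a query comment beforehand, and all identifiers (LogEntry, looks_like_browser, classify, robotic_overrides) are hypothetical.

```python
from dataclasses import dataclass

# Tools named in the text as sources of organic traffic.
ORGANIC_TOOLS = {"wikishootme", "sqid", "histropedia"}

# Assumption: substrings that suggest a script rather than a browser.
SCRIPT_MARKERS = ("bot", "python", "java/", "curl")


@dataclass(frozen=True)
class LogEntry:
    user_agent: str
    pattern: str             # abstract query pattern
    tool_comment: str = ""   # hypothetical: tool name taken from a query comment


def looks_like_browser(user_agent: str) -> bool:
    """Crude heuristic (assumption): browsers announce themselves as Mozilla/..."""
    ua = user_agent.lower()
    return ua.startswith("mozilla/") and not any(m in ua for m in SCRIPT_MARKERS)


def classify(entry: LogEntry, robotic_overrides=frozenset()) -> str:
    """Return 'organic' or 'robotic' for one log entry."""
    tool = entry.tool_comment.strip().lower()
    if tool in ORGANIC_TOOLS:
        return "organic"
    if tool:
        # Other self-identified software tools stay in the robotic category.
        return "robotic"
    if (entry.user_agent, entry.pattern) in robotic_overrides:
        # Manually identified user-agent/pattern combinations.
        return "robotic"
    return "organic" if looks_like_browser(entry.user_agent) else "robotic"


firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"
overrides = frozenset({(firefox, "movie-id-lookup")})

print(classify(LogEntry(firefox, "p1")))
print(classify(LogEntry("python-requests/2.18.4", "p1")))
print(classify(LogEntry(firefox, "movie-id-lookup"), overrides))
```

The override set models the cases mentioned in the text where traffic from well-known browsers had to be moved to the robotic category by hand; doubtful cases fall back to robotic only if they are listed there or do not look like a browser.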

                        Table 5: Robotic and Organic Dataset Sizes              Unique Queries. Many studies restrict to unique queries. We do
                                                                             not find it obvious that this is most suitable. Indeed, in a
                                  Valid      Unique    Un./Val.   Patterns   continuously changing database like Wikidata, where the RDF export is
      robotic           I1   61,052,879   12,690,953    20.79%     461,974   updated at least once per minute, a repeated query may indicate a real
      robotic           I2   71,648,854   19,080,323    26.63%     175,432   and recurring information need, and it may always require a new
      robotic           I3   78,343,266   26,239,839    33.49%     145,286   answer to be computed. Moreover, considering organic traffic, it is
      organic           I1      197,339      142,970    72.45%      40,109   also relevant if many users require the exact same data. Therefore
      organic           I2      204,384      149,216    73.01%      40,694   neither human interests nor database query load can necessarily be
      organic           I3      257,167      202,419    78.71%      35,319   understood any better by eliminating duplicates. Table 6 includes
                                                                             prevalence among unique queries to show that the choice between
                                                                             the two views has a significant impact on results. A lower prevalence
                                                                             among unique queries than among all queries indicates that queries with that
   Concretely, we consider WikiShootMe (nearby sites to photo-               feature are more likely to be repeated than queries without this
graph), SQID (data browser), and Histropedia (interactive timelines)         feature. For example, organic queries with subqueries are less fre-
as sources of organic traffic, and leave other software tools in the         quent when eliminating duplicates, which shows that such queries
robotic category. On the other hand, we frequently had occasion              tend to re-occur more often than others. Note that not all queries
to classify queries sent from well-known browsers as robotic.                contribute to the feature counts in Table 6, and in general it is
Examples included an apparently browser-based application that               possible that queries without any counted feature (e.g., single triple
retrieved Wikidata items with hundreds of thousands of diverse               queries) are much more common among the unique queries.
movie database identifiers, and another “browser” that issued the ex-
act same query three million times. Clearly, these cases did not meet           Stability of Results. The usage patterns of any technical system
our criteria for organic traffic, yet they would completely change           will evolve over time, but it is difficult to estimate the stability of
the characteristics of the organic dataset if not discovered.                specific metrics. We have split our data into three intervals to make
   We have applied this classification to the set of valid queries,          such changes visible. Robotic traffic exhibits huge fluctuations, e.g.,
leading to a distribution of queries as shown in Table 5. The total of       for joins (67%–88%) and OPTIONAL (11%–25%). Prevalence among
658,890 organic queries accounts for less than 0.5% of the traffic, and      unique queries sometimes fluctuates independently, e.g., for UNION
would therefore be overlooked completely in any non-discriminative           (2.5%–8.6%). By studying one or several of the intervals as a single
analysis. Table 5 already shows clearly that this small fraction of          dataset, and by choosing to include or exclude duplicate queries,
traffic behaves significantly differently from the overall dataset. While    one could therefore arrive at extremely different conclusions from
robotic traffic contains 21% to 33% unique queries, organic queries          this data. Many previous studies of SPARQL logs – often smaller
are 72%–79% unique. The non-discriminative query analysis of Bon-            than ours, and therefore even more easily dominated by bots –
ifati et al. showed that DBpedia logs have between 30% (2016) and            should be viewed in this light.
54% (2013) unique queries, while other datasets are 3%–30% unique               Organic traffic tends to be more stable, but also shows significant
[3]. Our robotic queries therefore are well within the typical range         variations in some cases. In particular, we can see some change in
observed so far, whereas our organic queries seem to represent a             I3, most notably for VALUES and UNION. We discuss and explain
very different type of traffic.                                              this change in detail in Section 6. In most cases, however, we can
                                                                             see that metrics are fairly continuous, and that moreover, unique
4.1       SPARQL Feature Prevalence                                          queries are much more representative of all queries than in the
Further differences are revealed when analysing the SPARQL query             robotic case.
features used in each dataset. SELECT queries make up more than                 Feature Usage as Compared to Other Studies. The previous discus-
99% of queries in each dataset, so we do not report uses for DE-             sion suggests that it is generally questionable whether any insights
SCRIBE, CONSTRUCT, or ASK. Table 6 shows the prevalence of                   can be obtained by comparing SPARQL log metrics based on a
the most common solution set modifiers, graph matching patterns, and         non-discriminative analysis of all queries. Nevertheless, there are
aggregates in each of the query sets. Join refers to the (often impli-       some aspects where our results show overwhelming differences
cit) SPARQL join operator; Filter also includes FILTER expressions           from previously reported findings. The most systematic overview
using NOT EXISTS; and SERVICE calls are split between the very               across several datasets is given by Bonifati et al. [3]. They found
common Wikidata label retrieval service (lang) and others. We will           SERVICE, VALUES, BIND, and property paths to occur in less than
discuss several aspects of this large table in the remainder of this         1% of unique queries, while they have great relevance in all parts
section.                                                                     of our data. Similarly low prevalence was reported for subqueries,
   Absent Features. We have omitted from Table 6 features that were          SAMPLE, and GroupConcat – our robotic traffic is similar, but our
generally used in less than 1% of queries. This applies to REDUCED,          organic traffic paints a very different picture. Subqueries in particular
EXISTS, GRAPH, the occurrence of + in property paths, as                     are strikingly common there.
well as specific aggregation functions (MIN, MAX, SUM, AVG). Any
other feature that is not shown in the table has not been counted.           4.2    SPARQL Feature Co-Occurrence
The absence of GRAPH in our data is expected, since Wikidata does            The expressive power and computational complexity of SPARQL
not use named graphs.                                                        queries is determined not so much by the presence or absence
Practical Linked Data Access via SPARQL: The Case of Wikidata                                              LDOW’2018, April 2018, Lyon, France

         Table 6: Relative prevalence of SPARQL features among valid queries (among unique queries in parentheses)

                                                  organic                                                     robotic
              Feature            I1                  I2                  I3                 I1                    I2              I3
                 Limit    31.08% (31.04%)    39.55% (38.69%)    46.56% (48.13%)     21.12% (26.45%)       16.86% (13.23%)   17.42% (19.55%)
              Distinct    26.50% (23.12%)    31.40% (26.06%)    19.05% (15.66%)     15.84% (30.45%)        5.48% (10.18%)    4.27% ( 6.85%)
            Order By      17.29% (16.66%)    14.75% (14.23%)    13.22% (10.40%)     12.97% ( 8.06%)        8.01% ( 6.49%)    6.78% ( 1.19%)
                Offset     0.40% ( 0.51%)     2.92% ( 3.51%)     0.37% ( 0.42%)      7.73% (15.81%)        6.07% ( 3.62%)    6.29% (13.17%)
                  Join    87.59% (85.41%)    87.82% (85.60%)    89.76% (89.38%)     88.48% (71.76%)       78.53% (67.11%)   67.41% (54.89%)
             Optional     42.36% (44.74%)    46.24% (46.36%)    55.92% (63.14%)     25.08% (34.61%)       11.63% (12.93%)   11.45% ( 9.41%)
                 Filter   25.89% (23.49%)    29.12% (27.61%)    22.24% (17.27%)     21.64% (23.40%)       17.92% (13.52%)   13.79% (13.65%)
          Path with *     15.02% (13.55%)    15.59% (16.18%)    12.88% (12.41%)     16.43% ( 9.30%)       19.19% (17.34%)   14.80% ( 7.66%)
            Subquery      13.09% ( 9.07%)    15.30% ( 8.77%)    12.79% ( 7.61%)      0.34% ( 1.45%)        0.28% ( 0.82%)    0.33% ( 0.47%)
                  Bind     9.85% ( 9.03%)     9.23% ( 8.60%)     8.68% ( 7.11%)     16.29% (13.08%)       12.07% (12.60%)    9.60% ( 4.51%)
                Union      5.10% ( 3.66%)     5.76% ( 5.06%)    12.62% (14.40%)     11.26% ( 8.62%)        8.63% ( 8.50%)    7.61% ( 2.53%)
               Values      4.44% ( 4.29%)     3.07% ( 3.13%)    10.88% (12.63%)     35.72% (10.68%)       30.74% ( 8.06%)   28.92% ( 6.24%)
           Not Exists      3.31% ( 2.75%)     3.37% ( 3.08%)     2.46% ( 1.65%)      0.19% ( 0.12%)        0.21% ( 0.18%)    0.19% ( 0.07%)
                Minus      2.04% ( 1.99%)     2.91% ( 3.13%)     1.60% ( 1.60%)      0.53% ( 1.03%)        0.92% ( 1.52%)    1.07% ( 1.60%)
        Service (lang)    44.63% (42.51%)    42.09% (43.53%)    54.78% (59.59%)     10.40% (23.66%)        6.15% (10.90%)    4.27% ( 6.35%)
       Service (other)    11.49% (15.00%)    10.53% (13.73%)    10.32% (12.45%)      4.51% ( 2.89%)        0.19% ( 0.44%)    1.16% ( 1.48%)
            Group By      17.12% (13.74%)    19.93% (14.02%)    13.04% ( 9.54%)      0.41% ( 0.57%)        0.37% ( 0.43%)    0.48% ( 0.30%)
              Sample       8.85% ( 6.45%)    10.93% ( 5.79%)     4.60% ( 3.66%)      0.04% ( 0.04%)        0.04% ( 0.05%)    0.06% ( 0.07%)
                Count      7.55% ( 6.66%)     7.60% ( 7.64%)     8.15% ( 5.74%)      1.15% ( 2.38%)        4.30% ( 7.19%)    0.30% ( 0.06%)
        GroupConcat        1.80% ( 1.79%)     2.79% ( 2.26%)     1.17% ( 1.05%)      0.06% ( 0.21%)        0.09% ( 0.17%)    0.02% ( 0.04%)
              Having       1.17% ( 1.26%)     1.14% ( 1.35%)     0.72% ( 0.71%)      0.01% ( 0.01%)        0.01% ( 0.02%)    0.00% ( 0.00%)
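
As a rough illustration of how shares like those in Table 6 can be derived, the following Python sketch counts a few features among all queries and among unique queries. It detects features by keyword matching on the raw query text, which is a simplifying assumption made here: it can miscount keywords inside string literals or comments, and an analysis of parsed queries would avoid this. The names FEATURE_PATTERNS, features_of, and prevalence are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword patterns for a few of the features in Table 6.
# Assumption: plain text matching is good enough for a sketch.
FEATURE_PATTERNS = {
    "Limit": re.compile(r"\bLIMIT\b", re.IGNORECASE),
    "Distinct": re.compile(r"\bDISTINCT\b", re.IGNORECASE),
    "Optional": re.compile(r"\bOPTIONAL\b", re.IGNORECASE),
    "Union": re.compile(r"\bUNION\b", re.IGNORECASE),
    "Values": re.compile(r"\bVALUES\b", re.IGNORECASE),
}


def features_of(query):
    """Return the set of feature names whose keyword occurs in the query."""
    return {name for name, pat in FEATURE_PATTERNS.items() if pat.search(query)}


def prevalence(queries):
    """Map each feature to (share among all queries, share among unique queries)."""
    if not queries:
        return {}
    unique = set(queries)
    all_counts = Counter(f for q in queries for f in features_of(q))
    unique_counts = Counter(f for q in unique for f in features_of(q))
    return {
        name: (all_counts[name] / len(queries), unique_counts[name] / len(unique))
        for name in FEATURE_PATTERNS
    }
```

A feature that is used mainly by one frequently repeated query shows a high share among all queries but a low share among unique queries, which is the effect visible for VALUES in the robotic columns.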


of individual features, but by their co-occurrence and interaction. Investigating which
combinations of features occur in queries is therefore of interest. We present the results
of this analysis here.
   Many studies restrict themselves to analysing the co-occurrence of join, OPTIONAL, UNION,
and FILTER. In our case, however, we find that subqueries, property paths, VALUES, and
SERVICE also occur in a significant share of queries. The use of SERVICE is dominated by
Wikidata’s labelling service, which is mostly used to add labels after fetching query
results. To reduce the number of feature combinations, we therefore ignore the labelling
service entirely. We do not count queries that use any other type of service, or any other
feature not mentioned explicitly (the most common such feature being BIND). Solution set
modifiers (LIMIT etc.) and aggregates (COUNT etc.) are ignored: they can add to the
complexity of query answering only when used in subqueries, so the subquery feature can be
considered an overestimation of the number of such complex queries.
   The results of this analysis are presented in Table 7 for all operator combinations that
accounted for more than 1% of queries in either dataset. Path expressions in this case
include all queries where either * or + occurs. The features we selected, possibly in
combination with a labelling service, account for around 85% of all queries in either
dataset, but the distributions are very different.

Table 7: Co-occurrence of SPARQL features by query type
(Join, Filter, Optional, Union, Path, Values, Subquery)

   J  F  O  U  P  V  S           organic    robotic
   (none)                          8.04%     19.67%
   J                              13.29%     11.26%
      F                            1.10%      1.92%
   J  F                            6.68%      2.61%
   J           P                   2.98%     13.50%
   J  F        P                   2.48%      0.39%
   J              V                0.39%     30.42%
         O                         1.26%      0.11%
   J     O                        22.32%      1.86%
   J     O     P                   2.07%      0.35%
   J  F  O                         2.66%      2.13%
   J     O  U                      3.49%      0.02%
   J     O        V                3.38%      0.11%
   J     O     P  V                1.01%      0.16%
   J                 S             2.76%      0.06%
   J     O           S             4.78%      0.00%
   J  F              S             3.19%      0.03%
   J  F  O           S             1.02%      0.00%
   Sum of above combinations      82.89%     84.63%
   Other counted combinations      6.05%      2.67%
   Features not counted           11.06%     12.70%

   Traditional Query Fragments. Plain conjunctive queries (CQ) that contain only joins
account for 21.3% (30.9%) of organic (robotic) traffic. The frequently studied
conjunctive-filter-pattern queries (CFP), which also consider FILTER, increase these values
to 29.1% (35.5%). This is far below the prevalence reported for such simple queries in other
datasets, typically above 65% [3, 13]. VALUES can be allowed in CQs without an increase in
complexity for a coverage of 21.7% (61.4%). Especially robotic traffic contains this
pattern, which is not surprising since VALUES is an efficient way to combine query batches
into one.
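
The fragment shares quoted for CQ, CFP, and CQ with VALUES can be reproduced, up to rounding, by summing the matching rows of Table 7. The Python sketch below does this; the dictionary layout and the function name coverage are illustrative choices made here, and the sums only include combinations that Table 7 lists explicitly.

```python
# Rows of Table 7: feature combination -> (organic %, robotic %).
# Letters: J=Join, F=Filter, O=Optional, U=Union, P=Path, V=Values, S=Subquery.
TABLE7 = {
    "": (8.04, 19.67),
    "J": (13.29, 11.26),
    "F": (1.10, 1.92),
    "JF": (6.68, 2.61),
    "JP": (2.98, 13.50),
    "JFP": (2.48, 0.39),
    "JV": (0.39, 30.42),
    "O": (1.26, 0.11),
    "JO": (22.32, 1.86),
    "JOP": (2.07, 0.35),
    "JFO": (2.66, 2.13),
    "JOU": (3.49, 0.02),
    "JOV": (3.38, 0.11),
    "JOPV": (1.01, 0.16),
    "JS": (2.76, 0.06),
    "JOS": (4.78, 0.00),
    "JFS": (3.19, 0.03),
    "JFOS": (1.02, 0.00),
}


def coverage(allowed):
    """Sum the shares of all listed combinations that use only features in `allowed`."""
    allowed_set = set(allowed)
    organic = sum(o for combo, (o, _) in TABLE7.items() if set(combo) <= allowed_set)
    robotic = sum(r for combo, (_, r) in TABLE7.items() if set(combo) <= allowed_set)
    return round(organic, 2), round(robotic, 2)


print("CQ:         ", coverage("J"))
print("CFP:        ", coverage("JF"))
print("CQ + VALUES:", coverage("JV"))
```

Under these assumptions the three sums agree with the quoted percentages to within 0.1 percentage points; small deviations are expected because the table entries are themselves rounded.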


   Path Queries. The extension of CQ with property path expressions leads to conjunctive
regular path queries (CRPQs [4, 6]), which account for 24.4% (44.5%) of organic (robotic)
queries. While organic queries contain almost as many path expressions (Table 6), the simple
CRPQ fragment does not suffice to capture as many of them as in the robotic case. A highly
prominent query fragment for the robotic case is that of CRPQs with VALUES, which accounts
for 74.9% of all robotic queries.

   OPTIONAL and UNION. OPTIONAL is much more popular in organic queries, where together with
join it accounts for 44.9% of the traffic. We attribute this to the fact that user
interfaces often try to show as much information as available. On the other hand, robotic
queries have much less use for query results that may miss some of the queried values,
especially since labels can be fetched with a dedicated service. UNION rarely occurs in
otherwise simple queries, although it does occur in significant amounts of queries overall.
Our findings again deviate from Bonifati et al. who found UNION alone enough to account for
7.5% of unique queries [3].

Table 8: Most frequent Wikidata properties by query type

   Organic                           Robotic
   instance of (P31)                 VIAF ID (P214)
   image (P18)                       Libr. of Congress ID (P244)
   coordinate location (P625)        located in ... (P131)
   Commons category (P373)           ISO 3166-2 code (P300)
   subclass of (P279)                instance of (P31)
   located in ... (P131)             image (P18)
   heritage designation (P1435)      subclass of (P279)
   country (P17)                     ISSN (P236)
   occupation (P106)                 MusicBrainz artist ID (P434)
   Wiki Loves Mon. ID (P2186)        PubMed ID (P698)

                                                                        top ten most frequently used Wikidata properties in organic and in
                                                                        robotic queries. Robotic queries frequently involve IDs of external
                                                                        databases, whereas organic queries often refer to properties related
                                                                        to locations. Human properties such as occupation (and, just outside
                                                                        of the top ten, date of birth and sex or gender) also rank highly in
   Subqueries. Table 7 shows that more than 10% of organic queries      organic queries. Commons category refers to Wikimedia Commons
include subqueries while otherwise using nothing but joins, FILTER,     and can be used to obtain related media for some entity. Wiki Loves
and OPTIONAL. However, we have ignored solution set modifiers           Monuments ID, the only identifier in the organic top ten, relates to
and aggregates in this analysis, since they normally play a role only   the eponymous content creation activity of Wikipedia, which aims
in post-processing. In combination with subqueries, however, such       at gathering more information on local sites of interest.
operations may have a big impact on the expressive power of the
query language.                                                            Diversity vs. Uniformity. We have conjectured that organic quer-
                                                                        ies are more diverse, since they are not controlled by a small number
5   ROBOTIC VS. ORGANIC: EVALUATION                                     of programs, and since user information needs are generally less
                                                                        uniform. Our datasets support this in more than one sense. We have
We have already seen that organic and robotic traffic are signific-     already seen from Table 6 that organic queries tend to use more
antly different in many respects. This shows that our partitioning      diverse SPARQL features, including some that have no significant
of queries is not random, but it does not support the claim that        share of robotic queries. We also found that simple combinations of
they actually can be characterised as in Table 4. In this section, we   operators can account for 75% of robotic queries, whereas organic
therefore investigate whether the datasets also exhibit previously      queries exhibit a greater variance. For the study in Section 4.2, we
asserted characteristics that have not been used to define them in      found 17 feature combinations that make up over 1% of organic quer-
the first place.                                                        ies, while only 8 combinations have such high prevalence among
   Temporal distribution. We begin by considering the temporal          robotic queries (Table 7). Similarly, the usage of RDF properties is
distribution. According to Table 4 (b), we would expect organic         more skewed in robotic traffic, where only 37 properties occur in
queries to be correlated with the time of day, while robotic queries    more than 1% of queries. In contrast, 59 different RDF properties
should not show any such relationship. Figure 1 shows the hourly        occur in more than 1% of organic queries.
temporal traffic volume (in absolute numbers) and the relative             Another frequently studied structural aspect of queries is their
distribution across 24h-intervals. Organic traffic follows a strong     actual length in terms of the number of triples. To determine this
daily rhythm, with most activities happening during the European        number, we have not counted triples that occur within SERVICE
and American day and evening. This strongly supports a direct           clauses, since these are mostly used to call built-in BlazeGraph
human involvement.                                                      functions, where triples represent parameters rather than referring
   This also suggests that our dataset contains only a few organic      to actual RDF data. Figure 2 shows the results. The results show
queries from Asian users, which in our case simplifies the detection    the usual peak of 1-triple queries in the robotic case, with 55.96% of
of the expected patterns. If usage were globally uniform, it would be   queries having at most this size, and 96% of queries having at most 7
promising to correlate daily usage with hints on geographical con-      triples. This matches findings for other datasets, where an average
text, which can be found in queries, e.g., in the form of coordinates   of 56.45% of queries had at most one triple and 91% of queries used
or language information.                                                6 or fewer [3]. Our robotic average query size of 2.45 triples also
   Figure 1 also shows some abrupt changes in the robotic traffic,      resembles that of DBpedia, reported to be between 2.09 and 3.98
as predicted by Table 4 (g).                                            for various samples.
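
The aggregation by time of day used for the temporal analysis can be sketched in a few lines of Python. This is only an illustration: it assumes that each query comes with a timezone-aware timestamp, and the function name hourly_profile is hypothetical, since the format of the access logs is not described here.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_profile(timestamps):
    """Return the fraction of queries falling into each UTC hour of day (0-23).

    Assumption: every timestamp is a timezone-aware datetime object.
    """
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts[h] / total if total else 0.0 for h in range(24)]
```

A strongly peaked profile would indicate a daily rhythm, as expected for organic traffic, whereas a flat profile would match traffic that is independent of the time of day.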
                                                                           In contrast, the size distribution of organic queries is much wider.
  Content preference. According to Table 4 (c), user queries are        Queries with at most one triple only make up 17.30%, and one has to
expected to mirror more direct human interests. We investigate this     consider queries with up to 11 triples to cover 97% of the data. The
by considering the predicates used in queries. Table 8 shows the        average query size is 4.38 in this case. The largest organic query




 Figure 1: Hourly query volume over 12 weeks (left), and aggregated by time of day, UTC (right); robotic top/organic bottom

had 143 triples, while the largest robotic ones contained 66 triples. Again, we can see that
organic queries are more diverse, and also more complex than robotic ones.

Figure 2: Fraction of queries by number of triples, in each case for robotic (dark, left)
and organic (light, right) traffic

6     UNDERSTANDING WIKIDATA USAGE

Based on the above investigations, we assume that our classification of queries can
successfully capture the ideas expressed in Table 4. We now turn towards exploiting this
insight for obtaining a better understanding of actual Wikidata usage. Much of the previous
discussion also contributes towards this goal, e.g., the analysis of temporal distribution
(indicating a lack of Asian users) and the ranking of properties (showing significant user
interest in local data), but we can refine these findings further.

6.1    Annotations and Complex Statements
As explained in Section 2, Wikidata supports annotations on statements, which allow
contextual information to be expressed, and which lead to a more complex RDF encoding. It is
therefore relevant to ask if queries make use of such information, or if they rather use the
simplified view that drops all annotations.
   The analysis of URIs used in predicate positions lets us answer this question. References
are linked via the RDF provenance vocabulary URI prov:wasDerivedFrom, which occurs in 1.1%
(0.07%) of all organic (robotic) queries. Another annotation is the statement rank, which is
used in 0.48% (0.62%) of all queries. Finally, statements can also be annotated with
arbitrary Wikidata properties, and such use can also be recognised from URIs. The most
frequently used properties in statement annotations for organic and robotic queries are
shown in Table 9. Annotations are of course used only in a minority of queries. Their
pronounced use in the case of start and end time witnesses the importance of temporal
validity when interpreting statements from Wikidata.

Table 9: Most frequent Wikidata properties used in queries
as annotations on statements (mix of top 5 organic/robotic)

        Property                        Organic     Robotic
        end time (P582)                   2.00%       0.74%
        start time (P580)                 1.97%       0.28%
        place of publication (P291)       0.60%       0.00%
        Refseq Genome ID (P2249)          0.60%       0.00%
        academic degree (P512)            0.47%       0.00%
        catalog (P972)                    0.04%       0.55%
        ticker symbol (P249)              0.04%       0.24%
        version type (P548)               0.00%       0.13%

   Conversely, we can also collect the most common properties used in statements for which
annotations are queried. To gauge this metric, we consider RDF properties used for encoding
the complex form of statements that uses a dedicated node for representing edges. Note that
this does not always indicate that the query also referred to annotations of these
properties. Another motivation for using the more complex statement encoding is that the
simplified statement encoding is only generated for statements of maximal rank; queries that
are interested in all (e.g., historical) data therefore need to use complex statements even
when not querying for annotations. Table 10 displays the most common properties that
appeared in a form as used for complex statement encodings.
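
To make the distinction between simplified and complex statement encodings concrete, the following Python sketch classifies the property URIs that a query uses in abbreviated form. It assumes that queries use the prefixes commonly found in Wikidata queries (wdt: for simplified statements, p: and ps: for the statement node and its main value, pq: for qualifiers) and it inspects the query text rather than a parsed query; both are simplifications made here, and the function name statement_usage is hypothetical.

```python
import re

# Assumption: properties are written with the usual Wikidata prefixes,
# not as full IRIs in angle brackets.
PREDICATE = re.compile(r"\b(wdt|p|ps|pq):(P\d+)\b")


def statement_usage(query):
    """Summarise which statement encodings and annotations a query refers to."""
    usage = {
        "simple": set(),      # wdt:  simplified single-edge statements
        "complex": set(),     # p:/ps: statements encoded via a dedicated node
        "qualifiers": set(),  # pq:   properties used as statement annotations
        "references": "prov:wasDerivedFrom" in query,
        "rank": "wikibase:rank" in query,
    }
    for prefix, pid in PREDICATE.findall(query):
        if prefix == "wdt":
            usage["simple"].add(pid)
        elif prefix == "pq":
            usage["qualifiers"].add(pid)
        else:
            usage["complex"].add(pid)
    return usage
```

For example, a query that retrieves positions held together with their start time would be reported with P39 under "complex" and P580 under "qualifiers".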

Table 10: Most frequent Wikidata properties of statements
whose complex form occurs in queries (mix of top 5 each)

       Property                         Organic     Robotic
       heritage designation (P1435)       7.95%       0.00%
       Wiki Loves Mon. ID (P2186)         3.37%       0.00%
       coordinate location (P625)         1.32%       0.46%
       position held (P39)                1.32%       0.06%
       DiseasesDB (P557)                  0.63%       0.00%
       catalog code (P528)                0.04%       0.55%
       PMCID (P932)                       0.01%       1.05%
       PubMed ID (P698)                   0.00%       2.31%
       DOI (P356)                         0.00%       0.91%

                                                                         (id), Hebrew (he), and Japanese (ja) hardly occurring in more than
                                                                         1,000 queries in total. The sum of several variants of Chinese also
                                                                         amounts to 1,122 queries; lower still is Arabic (798). These
                                                                         observations suggest a strong imbalance in the global use of Wikidata
                                                                         via SPARQL. The choice of language may also reflect the lack of labels
                                                                         in some languages [8], but since the labelling service supports any
                                                                         number of fallback languages, users could still always put their
                                                                         preferred language at the front. Moreover, the number of queries
                                                                         asking for a certain language is only weakly related to the total
                                                                         number of labels available in a language. As of March 2018, Wikidata
                                                                         has most labels for English (32.7M), Dutch (10.7M), French (9.4M),
                                                                         German (8.0M), Spanish (6.5M), Italian (6.1M), Swedish (6.1M),
                                                                         Russian (5.4M), Cebuano (5.0M), and Bulgarian (2.9M). Polish follows
                                                                         at fourteenth place with 2.6M labels; Hebrew is only at 92nd place
                                                                         with fewer than 460K labels.
                                                                            The situation for robotic traffic is similar, but the labelling ser-
                                                                         vice is used in a much smaller fraction of queries in this case, and
                                                                         English is more dominant. Some European languages still occur, but
                                                                         others hardly do (e.g., Polish). Chinese is slightly more prominent
                                                                         compared to other languages, with more than 100,000 queries, but
                                                                         this is still less than 0.1% of all robotic queries.
                                                                            An interesting observation from Figure 3 is that some languages
                                                                         seem to be “trending” in that we see a clear increase in their pop-
                                                                         ularity. This is particularly strong for Polish, but can also be seen
                                                                         for Catalan. The next section discusses methods that can help to
                                                                         understand such observations.
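
The grouping of queries by their first requested label language can be sketched as follows. The Python code below assumes that the labelling service is configured in the usual way, through a wikibase:language parameter holding a comma-separated list of language tags; matching this with a regular expression on the query text is a simplification made here, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: the language list appears as a quoted string directly after
# the wikibase:language parameter of the labelling service.
LANGUAGE_PARAM = re.compile(r'wikibase:language\s+"([^"]*)"')


def primary_language(query):
    """Return the first (most preferred) language tag requested, or None."""
    match = LANGUAGE_PARAM.search(query)
    if not match:
        return None
    first = match.group(1).split(",")[0].strip()
    return first or None


def language_histogram(queries):
    """Count queries by their primary label language, skipping queries without one."""
    return Counter(lang for lang in map(primary_language, queries) if lang)
```

Tags are counted exactly as written, so misspelled tags and the special value [AUTO_LANGUAGE] show up as separate entries, in line with the counts reported above.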

Figure 3: Number of queries for popular primary languages, for each of the three intervals (I1–I3)

                                                                         6.3     Causal Analysis
                                                                         We have argued that mere statistical analysis can easily be
                                                                         misleading. Indeed, it is not useful to compute averages over exponential
                                                                         distributions (skewed data), which tend to appear on many scales in
   We can see that complex statements are a larger fraction in           usage analysis. A better understanding can be obtained by a more
organic queries. Indeed, 18.9% of all organic queries are using prop-    fine-grained analysis that tries to attribute the observed traffic to in-
erties that are part of the complex RDF encoding of statements,          dividual causes. Beyond superficial statistics, this gives us valuable
rather than relying on simplified statements represented by single       insights into the goals underlying current usage.
edges alone.                                                                We have conducted such an analysis by distinguishing queries
                                                                         that are issued by recognisable tools. This analysis started from
                                                                         a number of self-identifying tools, which include the tool’s name
6.2    Language Distribution
                                                                         in a comment in their SPARQL queries.7 Many further tools were
We saw that the temporal distribution of organic queries suggests        identified from their distinctive traffic patterns, usually marked
that few users from Asia are accessing Wikidata via SPARQL-based         by bursts of many similar queries, in combination with their user
applications. For further insights, it is interesting to study the       agents. This was mostly manual work, based on inspecting the
language-related information contained in queries. Indeed, the use       extracted query patterns that were most frequent, as well as their
of Wikidata by communities of different languages is a relevant          temporal distribution.
research topic by itself [8].                                               Some of the identified sources have vanished in the following
   We have therefore extracted the languages for which queries           month. For the intervals analysed in this paper, we know of 95
request labels using the labelling service, and grouped queries by       query sources that together account for 144,787,485 (68.39%) of all
the first (most preferred) language they use. This resulted in over      valid queries. As mentioned in Section 4, we consider three self-
200 different language tags, including misspellings. Of those, 35 oc-    identified tools as organic; the respective numbers of queries are:
curred in more than 100 queries. For the 16 languages that occurred      57,564 (WikiShootMe), 46,691 (SQID), and 996 (Histropedia).
in more than 1000 queries, we show the query numbers for each of            In addition, we identified a browser-based tool that looked up in-
the intervals in Figure 3 using a logarithmic scale. The special label   formation on local sites based on geographic coordinates in Poland,
“[AL]” denotes the value [AUTO_LANGUAGE], used for retrieving a          with a total of 48,051 queries. This activity was notable only in I3
language based on browser settings.                                      and could be traced back to I2, but not to I1. The sudden prominence
   English clearly dominates the list, followed by mostly European
languages (cs is Czech, sv is Swedish, ca is Catalan, and eu is          7 The convention for doing this in the Wikidata community is to start a comment with
Basque). Asian languages follow only at the end, with Indonesian         #TOOL: followed by some identifying string.
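
The #TOOL: convention described in footnote 7 makes self-identifying tools easy to detect automatically. The following Python sketch extracts the identifying string from such a comment; the exact spacing after the colon and the sample tool string used below are assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches a comment of the form "#TOOL: some identifying string".
# Assumption: the identifying string extends to the end of the line.
TOOL_COMMENT = re.compile(r"#TOOL:[ \t]*([^\r\n]*)")


def tool_name(query):
    """Return the self-declared tool name of a query, or None if there is none."""
    match = TOOL_COMMENT.search(query)
    if not match:
        return None
    return match.group(1).strip() or None


def queries_per_tool(queries):
    """Count queries by self-declared tool name."""
    return Counter(name for name in map(tool_name, queries) if name)
```

Queries without such a comment are not attributed by this sketch; as described in the text, further sources had to be identified manually from traffic patterns and user agents.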


of the tool is connected to the Wiki Loves Monuments competition of      applying for a formal research collaboration with the Wikimedia
Wikipedia, which ran in September 2017, and indeed the respective        Foundation, subject to signing suitable non-disclosure agreements.
queries account for the occurrence of property P2186 in Table 8.         The source code of our analysis scripts is publicly available.9 In
Moreover, the sudden rise of this new application largely explains       particular, this code also includes our manual classification rules
why I3 shows somewhat different characteristics in Table 6, and          for queries based on their shape and user agent.
why Polish is seeing a strong upwards trend in Figure 3.
   The remaining 91 sources were considered robotic. The top             8    CONCLUSION
three of them together issue 60.06% of all queries that the Wikidata
                                                                         We have presented a first detailed study of the access logs of the
SPARQL query service has answered in our data. They are:
                                                                         Wikidata SPARQL query service. Due to the varied uses of SPARQL
     • auxiliary matcher: a data integration tool that identifies potential matches with external datasets; 63,745,842 queries
     • PBB_core fastrun: a data integration tool for protein databases; 41,605,446 queries
     • bot2: a multi-purpose bot operated by Magnus Manske;8 22,951,459 queries
   Further very active query sources include several unidentified scripts and bots. Prominent tools with more than 1M queries request movie database ids, information on association football, and detailed name and label information. The most active sources also include a query that polls the time of the most recent update of the RDF database; this single query has been answered 1,861,752 times (i.e. about once every 11.5 seconds).
   Our analysis strongly supports our conjecture that any non-discriminative analysis of the data is necessarily skewed towards a small number of tools. Indeed, both auxiliary matcher and bot2 are maintained by Magnus Manske, which means that he controls about 41% of all traffic. We believe that this situation is not unusual and not specific to Wikidata, although few individuals will have the impact of a Magnus Manske.

7    ANONYMISATION AND PUBLICATION
We intend to publish our log datasets together with this study, pending the outcome of an ongoing clearance process within the Wikimedia Foundation. Since SPARQL logs are user access logs, they must be treated very carefully to avoid privacy concerns. The published logs therefore are to contain neither IP addresses nor user agents, but they will be partitioned according to our methodology to allow studies of different user groups. They will also contain time stamps for each query.
   The queries will also need to be modified to remove any potentially sensitive information. All comments will be removed and each query will be parsed, normalised, and re-serialised. Longer strings, variable names, and geographic coordinates will be replaced uniformly by generic fillers of the same type. We do not expect strings or variable names to contain private information, but the query volume is too large to be certain. Geographic coordinates need to be replaced since they may contain very specific location information that could be linked to individual humans. In contrast, we think that dates (which in Wikidata are only used up to the precision of a day) do not contain enough information to be sensitive. In the current data release proposal, all references to Wikidata entities and the exact structure of each SPARQL query would be preserved, so that the published files should be of use to other researchers.
   Until the anonymised data sets have been released, researchers who wish to replicate our findings can gain access to the data by

services, the overall query traffic can roughly be classified into two parts: a large fraction of relatively simple queries that are generated by only a few bots, and a much smaller fraction of more complex queries posed (directly or indirectly) by many human users. We argued that, if we want to gain meaningful insights from query log analysis, the queries of the many should outweigh the queries of the few, since the former can more reliably represent a real human information need.
   We proposed a characterisation of both components, organic and robotic traffic, and we showed that these concepts are workable in that (1) both parts can be separated with little effort using a small number of simple signals, and (2) the resulting datasets do indeed mirror many of the expected characteristics. We then continued to study three specific aspects of Wikidata's usage: the practical use of complex RDF encodings, the global imbalance in the use of queries, and the explanation of individual traffic components by means of identifying their sources.
   Our research motivates further studies of real SPARQL queries that apply our methods to other datasets. It would be extremely interesting to learn how much organic query traffic can be found in other datasets, and whether it is as distinct from the rest of the queries as it was in our case. Moreover, a causal analysis of the main robotic query sources could shed more light on the actual real-world use of RDF datasets published via SPARQL. Are data integration and systematic download also the predominant tasks in other scenarios? Which fraction of bots can account for which fraction of the traffic? How do more complicated metrics, such as treewidth [3] or specific shapes of query patterns [9], behave for organic and robotic data?
   Finally, it would also be of interest to extend the tools available for classifying organic traffic in the first place. Our approach based on user agents, query shapes, and time was feasible but still mostly manual. An automated classification could be of interest, and our data may serve for training and evaluation. Moreover, the approach we took suggests that even anonymised query logs should offer some general user agent information (such as "browser or not"), and also sufficiently fine-grained temporal information. If enough information is retained, then the manual interaction might be avoided completely, allowing the creation of applications that continuously monitor organic traffic for relevant trends. Methodologically speaking, we do indeed seem to be at the very beginning of this field.

   Acknowledgements. This work was partly supported by the DFG within the cfaed Cluster of Excellence, CRC 912 (HAEC), and Emmy Noether grant KR 4381/1-1.

8 https://en.wikipedia.org/wiki/Magnus_Manske
9 https://github.com/Wikidata/QueryAnalysis
LDOW’2018, April 2018, Lyon, France                                                                     Adrian Bielefeldt, Julius Gonsior, and Markus Krötzsch


REFERENCES
[1] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. 2008. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. of Biomedical Informatics 41, 5 (2008), 706–716.
[2] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia – A crystallization point for the Web of Data. J. of Web Semantics 7, 3 (2009), 154–165.
[3] Angela Bonifati, Wim Martens, and Thomas Timm. 2017. An Analytical Study of Large SPARQL Query Logs. Proceedings of the VLDB Endowment 11, 2 (2017), 149–161.
[4] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. 2003. Reasoning on regular path queries. SIGMOD Record 32, 4 (2003), 83–92.
[5] Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić. 2014. Introducing Wikidata to the Linked Data Web. In Proc. 13th Int. Semantic Web Conf. (ISWC'14) (LNCS), Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig A. Knoblock, Denny Vrandečić, Paul T. Groth, Natasha F. Noy, Krzysztof Janowicz, and Carole A. Goble (Eds.), Vol. 8796. Springer, 50–65.
[6] Daniela Florescu, Alon Levy, and Dan Suciu. 1998. Query containment for conjunctive queries with regular expressions. In Proc. 17th Symposium on Principles of Database Systems (PODS'98), Alberto O. Mendelzon and Jan Paredaens (Eds.). ACM, 139–148.
[7] Steve Harris and Andy Seaborne (Eds.). 21 March 2013. SPARQL 1.1 Query Language. W3C Recommendation. Available at http://www.w3.org/TR/sparql11-query/.
[8] Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, and Lydia Pintscher. 2017. A Glimpse into Babel: An Analysis of Multilinguality in Wikidata. In Proc. 13th Int. Symposium on Open Collaboration (OpenSym'17), Lorraine Morgan (Ed.). ACM, 14:1–14:5.
[9] Mark Kaminski and Egor V. Kostylev. 2018. Complexity and Expressive Power of Weakly Well-Designed SPARQL. Theory of Computing Systems (2018), 1–38. https://doi.org/10.1007/s00224-017-9802-9 To appear.
[10] Johannes Lorey and Felix Naumann. 2013. Detecting SPARQL Query Templates for Data Prefetching. In Proc. 10th Extended Semantic Web Conf. (ESWC'13) (LNCS), Philipp Cimiano, Óscar Corcho, Valentina Presutti, Laura Hollink, and Sebastian Rudolph (Eds.), Vol. 7882. Springer, 124–139.
[11] Markus Luczak-Roesch, Zamil Aljaloud Saud, Bettina Berendt, and Laura Hollink. 2016. USEWOD 2016 Research Dataset. (2016). https://eprints.soton.ac.uk/385344/
[12] Knud Möller, Michael Hausenblas, Richard Cyganiak, Siegfried Handschuh, and Gunnar A. Grimnes. 2010. Learning from Linked Open Data Usage: Patterns & Metrics. In Proc. Web Science Conf. (WebSci'10).
[13] François Picalausa and Stijn Vansummeren. 2011. What are real SPARQL queries like? In Proc. Int. Workshop on Semantic Web Information Management (SWIM'11), Roberto De Virgilio, Fausto Giunchiglia, and Letizia Tanca (Eds.). ACM, 6.
[14] Aravindan Raghuveer. 2012. Characterizing Machine Agent Behavior through SPARQL Query Mining. In Proc. 2nd Int. Workshop on Usage Analysis and the Web of Data (USEWOD'12). usewod.org.
[15] Laurens Rietveld and Rinke Hoekstra. 2014. Man vs. Machine: Differences in SPARQL Queries. In Proc. 4th USEWOD Workshop on Usage Analysis and the Web of Data. usewod.org.
[16] Muhammad Saleem, Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. 2015. LSQ: The Linked SPARQL Queries Dataset. In Proc. 14th Int. Semantic Web Conf. (ISWC'15), Part II (LNCS), Marcelo Arenas, Óscar Corcho, Elena Simperl, Markus Strohmaier, Mathieu d'Aquin, Kavitha Srinivas, Paul T. Groth, Michel Dumontier, Jeff Heflin, Krishnaprasad Thirunarayan, and Steffen Staab (Eds.), Vol. 9367. Springer, 261–269.
[17] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (2014).