<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Practical Linked Data Access via SPARQL: The Case of Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Bielefeldt</string-name>
          <email>adrian.bielefeldt@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julius Gonsior</string-name>
          <email>julius.gonsior@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Krötzsch</string-name>
          <email>markus.kroetzsch@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>cfaed, TU Dresden</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>SPARQL is one of the main APIs for accessing linked data collections. Compared to other modes of access, SPARQL queries carry much more information on the precise information need of users, and their analysis can therefore yield valuable insights into the practical usage of linked data sets. In this paper, we focus on Wikidata, the knowledge-graph sister of Wikipedia, which has offered linked data exports and a heavily used SPARQL endpoint since 2015. Our detailed analysis of Wikidata's server-side query logs reveals several important differences from previously studied uses of SPARQL over large knowledge graphs. Wikidata queries tend to be much more complex and varied than queries observed elsewhere. Our analysis is founded on a simple but effective separation of robotic from organic traffic. Whereas the robotic part is highly volatile and seems unpredictable even on larger time scales, the much smaller organic part shows clear trends in individual human usage. We analyse query features, structure, and content to gather further evidence that our approach is essential for obtaining meaningful results here.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The SPARQL query language [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is one of the most powerful and
most widely used APIs for accessing linked data collections on
the Web. Large-scale RDF publication efforts, such as DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
routinely provide a SPARQL service, often with live data. Moreover,
SPARQL has been an incentive for open data projects that are based
on other formats to convert their data to RDF in order to improve
query functionality, a route that was chosen by large-scale projects
such as Bio2RDF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or the British Museum.<sup>1</sup> One of the most
prominent such projects is Wikidata [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the large<sup>2</sup> knowledge
graph of Wikipedia, which has offered browsable linked data, RDF
exports, and a live SPARQL service since September 2015.
      </p>
      <p>
        Analysing the queries sent to SPARQL services promises unique
insights into the practical usage of the underlying resources [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
This opens the door to understanding computational demands [
        <xref ref-type="bibr" rid="ref13 ref3 ref9">3, 9,
13</xref>
        ], improving reliability and performance [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and studying user
behaviour [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. In addition, data providers are highly interested
in learning how their content is used. This research is enabled by
more and more datasets of SPARQL query logs becoming available
[
        <xref ref-type="bibr" rid="ref11 ref16">11, 16</xref>
        ]. In some cases, including Wikidata, access to SPARQL logs
remains strongly regulated, but research can help to safely publish
useful data there as well (as we intend as part of our work).
      </p>
      <p>1 http://www.britishmuseum.org/about_us/news_and_press/press_releases/2011/semantic_web_endpoint.aspx (accessed January 2018)</p>
      <p>2 &gt;45M entities, &gt;400M statements, &gt;200K editors (&gt;37K in Jan 2018), &gt;640M edits</p>
      <p>LDOW’2018, April 2018, Lyon, France. © 2018 Copyright held by the owner/author(s).</p>
      <p>
        Unfortunately, the initial enthusiasm in the study of practical
SPARQL usage has been dampened by some severe difficulties. A
well-known problem is that SPARQL services experience extremely
heterogeneous traffic due to their widespread use by software tools
[
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. Indeed, a single user’s script can dominate query traffic for
several hours, days, or weeks – and then vanish and never run again.
Hopes that the impact of such extreme events would even out with
wider usage have not been justified so far. Even when studying the
history of a single dataset at larger time scales, we can often see no
clear usage trends at all. For example, Bonifati et al. recently found
that within the years 2012–2016 the keyword DISTINCT was used
in 18%, 8%, 11%, 38%, and 8% of DBpedia queries, respectively [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        As a result, the insights gathered by SPARQL log analysis so far
have remained behind expectations. It seems almost impossible to
generalise statistical findings, or to make any predictions for the
next year (or even month). Building new optimisation methods or
user interfaces based on such volatile findings seems hardly worth
the effort. And yet, even recent research works rarely make any
attempt to quantify or at least discuss the impact of (random) scripts
on their findings. Exceptions are few: Raghuveer hypothesised that
similarities in query patterns can be used to find bots, and provided
basic analysis of bot requests isolated from larger logs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; Rietveld
and Hoekstra used client-side SPARQL logs as a subset of true user
queries that they compared to server-side logs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>In this work, we take a first look at the SPARQL usage logs of
the official Wikidata query service, and we ask if and how relevant
insights can be obtained from them. Our starting hypothesis is that
SPARQL queries can be meaningfully partitioned into two classes:
organic queries fetch data to satisfy an immediate information need
of a human user, while robotic queries fetch data in an unsupervised
fashion for further automated processing. We then classify queries
accordingly based on user agent information, temporal distribution,
and query patterns, and conduct further analysis on the results.
We argue that the organic component of SPARQL query traffic can
then be studied statistically, since it is relatively regular and since it
can reveal the needs of many actual users. In contrast, the robotic
component of query traffic should rather be subjected to a causal
analysis that attempts to understand the sources of the traffic, so
as to predict its current and future relevance to answering specific
research questions.</p>
      <p>Our main contributions are as follows:</p>
      <list list-type="order">
        <list-item><p>We propose the concept of organic and robotic SPARQL traffic as a basic principle for query log analysis.</p></list-item>
        <list-item><p>We present a method for classifying query logs accordingly, and we use it to partition a set of over 200M Wikidata SPARQL queries. Only 0.31% of the queries are organic, supporting our thesis that human information need is completely hidden by bots in most published analyses.</p></list-item>
        <list-item><p>We evaluate our classification by analysing several aspects that we consider characteristic for organic and robotic queries, respectively. This supports our conjecture that we can effectively distinguish the two types of traffic.</p></list-item>
        <list-item><p>We investigate the organic Wikidata traffic to gain basic insights into actual direct usage of Wikidata.</p></list-item>
        <list-item><p>We discuss anonymisation aspects and potential privacy issues, which form the basis for our ongoing efforts to allow the Wikimedia Foundation to release essential parts of our datasets to the public.</p></list-item>
      </list>
      <p>Besides the concrete contributions towards understanding the
use of Wikidata, we believe that our systematic study can help in
advancing the research methodology in the wider field of analysing
linked data access through rich query APIs. Indeed, due to the
versatile use of SPARQL services – for manual and for automated
requests, for transactional and analytical queries, interactively or
in batch processes – the analysis of their usage requires suitable
techniques and methods that are not sufficiently developed yet.</p>
    </sec>
    <sec id="sec-2">
      <title>SPARQL ON WIKIDATA</title>
      <p>
        Wikidata is the community-created knowledge base of the
Wikimedia Foundation. It was founded in 2012 with the main goal of
providing a central place for collecting factual data used in
Wikipedia across all languages [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. As of March 2018, Wikidata stores
more than 402 million statements about over 45 million entities.<sup>3</sup>
The data is collaboratively curated by a global community, with
over 18,000 registered editors making contributions each month.
Wikidata is widely used in diverse applications, such as Apple’s
Siri mobile app, Eurowings’ in-flight information system, and data
integration initiatives such as the Virtual Integrated Authority File.
      </p>
      <p>The data model of Wikidata is based on a directed, labelled graph
where entities are connected by edges that are labelled by properties.
Entities can have labels in many languages, but their actual
identifiers are abstract: properties use identifiers such as P569 (“date of
birth”), while other entities (called “items”) use identifiers such as
Q42 (“Douglas Adams”). Both types of entities can be freely created
by users. The model is distinct from RDF in that edges of the graph
may in turn have annotations. This feature is used to record sources,
temporal validity, or other contextual information. Indeed,
annotations on edges use the same community-defined vocabulary
as edge labels.</p>
      <p>
        Since September 2015, Wikidata has provided an official SPARQL
service (user interface at https://query.wikidata.org/) to query its
data. For this purpose, data is first converted to RDF. Each edge is
represented by a URI that can be associated with its annotations,
following an encoding as laid out by Erxleben et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In addition
to this faithful encoding, the RDF export also includes simplified
statements that only capture the actual edge as a single RDF triple,
without any of its annotations. Different URIs are introduced for the
different roles that properties can play in this encoding, so that all
views of the data can be stored and queried in one database without
risk of confusion. As of March 2018, the RDF encoding contains
about 4.7 billion triples.<sup>4</sup> Queryable data is updated at least once
per minute to stay synchronised with updates.
      </p>
      <p>3 See https://www.wikidata.org/wiki/Wikidata:Statistics and links therefrom</p>
      <p>The Wikidata SPARQL service is based on the BlazeGraph
database management system. SPARQL support is mostly standard, but
includes some built-in operational extensions that are represented
by (ab)using SPARQL’s SERVICE directive, which is normally used
for federated queries to external services. Of chief practical
importance is the labelling service, used to fetch optional entity labels
with support for fallback languages. The widespread use of this
service does affect the structure of queries, which rarely include
the otherwise familiar OPTIONAL-FILTER pattern to select labels
in the desired language.</p>
      <p>Table 1 shows an example query that illustrates several aspects
of Wikidata’s RDF encoding. The query returns the 100 countries
that have the most cities with a female mayor. Within the query
pattern, the first line finds a value for ?mayor with gender (P21)
female (Q6581072). We are using the simplified property wdt:P21,
whose triples cannot carry annotations or source information.
The following line finds a ?city that is an instance of (P31) the class
city (Q515), or of any subclass thereof (P279*). We then determine the
country (P17) of this city. The fourth line of the pattern then uses
the more complex RDF encoding to find a ?statement for property
mayor (P6) and matches its value to ?mayor. We require that this
statement has no end time (P582) to ensure that the mayor is current.
Finally, the service wikibase:label is invoked to fetch labels in
Russian, or, as an alternative, English. The query can readily be
executed online.</p>
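      <p>Since Table 1 is not reproduced in this text, the following Python sketch holds a reconstruction of the query from the description above. It is not the authors' verbatim query: the variable names, the COUNT aggregate, and the helper build_request_url are assumptions made for illustration. The sketch only builds a request URL for the public endpoint and does not send it.</p>

```python
from urllib.parse import urlencode

# Reconstruction from the prose description; names such as ?end and ?count
# are illustrative assumptions, not taken from the original Table 1.
QUERY = """
SELECT ?country ?countryLabel (COUNT(*) AS ?count) WHERE {
  ?mayor wdt:P21 wd:Q6581072 .
  ?city wdt:P31/wdt:P279* wd:Q515 .
  ?city wdt:P17 ?country .
  ?city p:P6 ?statement . ?statement ps:P6 ?mayor .
  FILTER NOT EXISTS { ?statement pq:P582 ?end }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ru,en" }
}
GROUP BY ?country ?countryLabel
ORDER BY DESC(?count)
LIMIT 100
"""


def build_request_url(query, endpoint="https://query.wikidata.org/sparql"):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(QUERY)
```

      <p>The simplified wdt: triples cover gender, class membership, and country, while the p:/ps:/pq: prefixes address the statement node so that the end-time qualifier can be tested.</p>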
      <p>Wikidata provides extensive documentation for using SPARQL,<sup>5</sup>
including a collection of over 300 example queries and pages to
request help in writing new queries. The query service receives
several million queries per day.</p>
    </sec>
    <sec id="sec-3">
      <title>WIKIDATA SPARQL QUERY LOGS</title>
      <p>We now give an overview of the datasets we are working with
for this paper. Our data is based on the server-side request logs
(Apache Access Log Files) of the Wikidata SPARQL query service,
as exported from the internal logging infrastructure of Wikimedia.
As logs contain sensitive information (especially IP addresses), this
data is not publicly available, and all records are deleted after a
period of three months. For this research, we therefore created less
sensitive (but still internal) snapshots that contain only SPARQL
queries, request times, and user agent information, but no IPs.</p>
      <p>4 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service</p>
      <p>5 https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/</p>
      <p>We consider the complete query traffic in three consecutive
intervals in 2017, each spanning exactly four weeks: (I1) 12th June–9th
July, (I2) 10th July–6th August, and (I3) 7th August–3rd
September. We process all queries with the Java module of Wikidata’s
BlazeGraph instance,<sup>6</sup> which is based on OpenRDF Sesame, with
minimal modifications in the parsing process re-implemented to
match those in BlazeGraph. In particular, BIND clauses are moved
to the start of their respective sub-query after the first parsing stage.</p>
      <p>This results in 211,703,889 valid queries. We then eliminate exact
string duplicates individually for each interval to obtain a subset of
unique valid queries. A specific unique query may therefore still
re-occur in several intervals. The numbers of queries per interval
are given in Table 2.</p>
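      <p>The per-interval elimination of exact string duplicates can be sketched as follows. This is a minimal illustration, assuming the log is available as (interval, query string) pairs; the function name and the sample data are hypothetical.</p>

```python
def unique_per_interval(log):
    """Map each interval label to the set of distinct query strings seen in it."""
    unique = {}
    for interval, query in log:
        unique.setdefault(interval, set()).add(query)
    return unique


# Hypothetical sample: the same query string occurs twice in I1 and once in I2.
log = [
    ("I1", "SELECT * WHERE { ?s ?p ?o } LIMIT 1"),
    ("I1", "SELECT * WHERE { ?s ?p ?o } LIMIT 1"),
    ("I2", "SELECT * WHERE { ?s ?p ?o } LIMIT 1"),
]
unique = unique_per_interval(log)
counts = {interval: len(queries) for interval, queries in unique.items()}
```

      <p>Because duplicates are removed within each interval only, the sample query is counted once in I1 and once again in I2.</p>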
      <p>
        The column Patterns counts unique query patterns after a further
abstraction. We uniformly replace all resources in subject and object
positions with a normalised placeholder, using different
placeholders for URIs and literals (by type). Patterns therefore reflect basic
resource types, co-occurrence of resources, and all predicates used
in the query. We also normalise values in LIMIT and OFFSET. Table 3
illustrates the pattern we would obtain for the query from Table 1.
Our approach follows Raghuveer who observed that programs often
use query templates to construct many similarly looking queries
where only certain placeholders are instantiated [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, we
do retain predicates since the abstraction would otherwise be too
strong (in particular, most one-triple queries would lead to the same
pattern when also abstracting predicates).
      </p>
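      <p>A simplified sketch of this abstraction is given below. It assumes that a query has already been parsed into (subject, predicate, object) triples of strings, that variables are kept unchanged, and that a single placeholder is used for all literals; the text distinguishes literals by type, so this is a coarser approximation and not the authors' implementation.</p>

```python
def abstract_term(term):
    """Replace a constant by a placeholder; variables are kept (an assumption)."""
    if term.startswith("?"):
        return term
    if term.startswith('"'):
        return "LITERAL"  # simplification: no distinction by literal type
    return "URI"


def abstract_pattern(triples):
    """Abstract subjects and objects while retaining every predicate."""
    return tuple((abstract_term(s), p, abstract_term(o)) for s, p, o in triples)


# Two hypothetical queries built from the same template with different constants.
q1 = [("?item", "wdt:P31", "wd:Q515"), ("?item", "rdfs:label", '"Dresden"')]
q2 = [("?item", "wdt:P31", "wd:Q5"), ("?item", "rdfs:label", '"Douglas Adams"')]
same_pattern = abstract_pattern(q1) == abstract_pattern(q2)
```

      <p>Both queries map to the same pattern, whereas a query with a different predicate would not, which is the reason for retaining predicates.</p>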
      <p>We can see a clear trend towards a significant increase in query
traffic over time, which we have witnessed for many months since
the introduction of the query service, the exception being a decline
in June 2017 following new measures for throttling scripts that
send overly many queries in very short times. We can also see some
variability in the number of query types, which roughly measures
how uniform queries were in an interval.</p>
      <p>6 Maven artefact com.blazegraph.sparql-grammar v2.1.4</p>
      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th /><th>Organic queries . . .</th><th>Robotic queries . . .</th></tr>
          </thead>
          <tbody>
            <tr><td>a</td><td>. . . fetch data to be delivered directly to human users</td><td>. . . fetch data to be processed algorithmically</td></tr>
            <tr><td>b</td><td>. . . are part of an ongoing human interaction</td><td>. . . are executed without close human supervision</td></tr>
            <tr><td>c</td><td>. . . reflect an immediate human information need</td><td>. . . may serve many indirect purposes (or none)</td></tr>
            <tr><td>d</td><td>. . . are typically sent from browser applications</td><td>. . . are sent from applications that rarely run in browsers</td></tr>
            <tr><td>e</td><td>. . . represent the needs and interests of many</td><td>. . . are not representative of general needs</td></tr>
            <tr><td>f</td><td>. . . are relatively diverse</td><td>. . . are relatively uniform</td></tr>
            <tr><td>g</td><td>. . . have uniform distributions that change continuously</td><td>. . . have skewed distributions that change abruptly</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>CLASSIFYING QUERY SOURCES</title>
      <p>We conjecture that a meaningful analysis of SPARQL query logs
in most cases must involve a classification of traffic into two basic
forms, organic and robotic queries. In this section, we characterise
both types, and we present our approach for separating them for
the example case of the Wikidata query logs.</p>
      <p>
        In an idealised view, organic and robotic queries are characterised
as shown in Table 4. Note that we do not restrict organic queries
to mean those that are manually typed in by individual users, as
studied previously [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Indeed, we wish to include users who are
not aware that SPARQL is being used at all, as long as the query is
representative of their present information needs.
      </p>
      <p>Nevertheless, there is a grey area of applications that may
allow users to schedule and execute thousands of queries through
a browser interface. There is a gradual transition between
idealised organic and robotic queries, and, in theory, many intermediate
applications are conceivable. If in doubt, such cases should be
considered robotic, since they can then still be analysed individually
without their traffic giving undue prominence to individual users
among the organic queries.</p>
      <p>To classify queries as organic or robotic, we rely on just two
characteristics, (d) and (f), with some further guidance from (g). We
use only three features to decide the type of a query: the user agent
(for (d)), the abstract query pattern (for (f)), and (when available)
comments in the query that some tools use to identify themselves.
Our implementation determines the query type by custom rules
that use these features only. By default, we expect all
browser-related user agents to indicate organic traffic, while all other agents
indicate robotic traffic. However, we have implemented a more
fine-grained causal analysis that tries to relate certain user-agent/pattern
combinations to specific sources (programs). Individual sources can
then be manually added to either of the two categories.</p>
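      <p>The rule structure described above might be sketched as follows. All concrete values are hypothetical: the browser markers, the override table, the pattern identifiers, and the assumption that a tool identifies itself through a query comment are illustrative only and do not reproduce the authors' actual rules.</p>

```python
# Hypothetical heuristic for "browser-related" user agents.
BROWSER_MARKERS = ("Mozilla", "Chrome", "Safari", "Firefox", "Opera")

# Hypothetical manual overrides: (user-agent substring, pattern id) -> category.
SOURCE_OVERRIDES = {
    ("Mozilla", "pattern-movie-ids"): "robotic",
    ("ExampleTool", "pattern-nearby"): "organic",
}


def classify(user_agent, pattern_id, tool_comment=None, organic_tools=("SQID",)):
    """Classify one query as 'organic' or 'robotic' from the three features."""
    # 1. A self-identifying comment from a tool listed as organic wins.
    if tool_comment in organic_tools:
        return "organic"
    # 2. Known user-agent/pattern combinations override the default.
    for (agent_part, pattern), category in SOURCE_OVERRIDES.items():
        if agent_part in user_agent and pattern == pattern_id:
            return category
    # 3. Default: browser-like agents are organic, all others robotic.
    if any(marker in user_agent for marker in BROWSER_MARKERS):
        return "organic"
    return "robotic"


default_browser = classify("Mozilla/5.0", "pattern-other")
overridden_browser = classify("Mozilla/5.0", "pattern-movie-ids")
```

      <p>Ordering the rules from most to least specific lets a manually identified source take precedence over the user-agent default, so that a browser-based bulk script ends up in the robotic category.</p>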
      <p>Concretely, we consider WikiShootMe (nearby sites to
photograph), SQID (data browser), and Histropedia (interactive timelines)
as sources of organic traffic, and leave other software tools in the
robotic category. On the other hand, we frequently had occasion
to classify queries sent from well-known browsers as robotic.
Examples included an apparently browser-based application that
retrieved Wikidata items with hundreds of thousands of diverse
movie database identifiers, and another “browser” that issued the
exact same query three million times. Clearly, these cases did not meet
our criteria for organic traffic, yet they would completely change
the characteristics of the organic dataset if left undiscovered.</p>
      <p>
        We have applied this classification to the set of valid queries,
leading to a distribution of queries as shown in Table 5. The total of
658,890 organic queries accounts for less than 0.5% of the traffic, and
would therefore be overlooked completely in any non-discriminative
analysis. Already Table 5 clearly shows that this small fraction of
traffic behaves significantly differently from the overall dataset. While
robotic traffic contains 21% to 33% unique queries, organic queries
are 72%–79% unique. The non-discriminative query analysis of
Bonifati et al. showed that DBpedia logs have between 30% (2016) and
54% (2013) unique queries, while other datasets are 3%–30% unique
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our robotic queries therefore are well within the typical range
observed so far, whereas our organic queries seem to represent a
very different type of traffic.
      </p>
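      <p>As a quick check of the proportions quoted in the text, the organic share can be recomputed from the two totals given above. The sketch assumes that every valid query was classified as either organic or robotic, so that the robotic count is the remainder.</p>

```python
# Totals as stated in the text.
valid_queries = 211_703_889
organic_queries = 658_890

# Share of organic queries in percent, and the remainder assumed to be robotic.
organic_share = 100 * organic_queries / valid_queries
robotic_queries = valid_queries - organic_queries
```

      <p>Rounded to two decimals, the share matches the 0.31% stated in the list of contributions.</p>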
    </sec>
    <sec id="sec-5">
      <title>SPARQL Feature Prevalence</title>
      <p>Further differences are revealed when analysing the SPARQL query
features used in each dataset. SELECT queries make up more than
99% of queries in each dataset, so we do not report uses for
DESCRIBE, CONSTRUCT, or ASK. Table 6 shows the prevalence of
the most common solution set modifiers, graph matching patterns, and
aggregates in each of the query sets. Join refers to the (often
implicit) SPARQL join operator; Filter also includes FILTER expressions
using NOT EXISTS; and SERVICE calls are split between the very
common Wikidata label retrieval service (lang) and others. We will
discuss several aspects of this large table in the remainder of this
section.</p>
      <p>Absent Features. We have omitted features that were generally
used in less than 1% of queries from Table 6. This applies to
REDUCED, EXISTS, GRAPH, the occurrence of + in property paths, as
well as specific aggregation functions (MIN, MAX, SUM, AVG). Any
other feature that is not shown in the table has not been counted.
The absence of GRAPH in our data is expected, since Wikidata does
not use named graphs.</p>
      <p>Unique Queries. Many studies restrict themselves to unique queries. We do
not find it obvious that this is most suitable. Indeed, in a
continuously changing database like Wikidata, where the RDF export is
updated at least once per minute, a repeated query may indicate a real
and recurring information need, and it may always require a new
answer to be computed. Moreover, considering organic traffic, it is
also relevant if many users require the exact same data. Therefore
neither human interests nor database query load can necessarily be
understood any better by eliminating duplicates. Table 6 includes
prevalence among unique queries to show that the choice between
the two views has a significant impact on results. A lower
prevalence among unique queries also indicates that queries with that
feature are more likely to be repeated than queries without this
feature. For example, organic queries with subqueries are less
frequent when eliminating duplicates, which shows that such queries
tend to re-occur more often than others. Note that not all queries
contribute to the feature counts in Table 6, and in general it is
possible that queries without any counted feature (e.g., single triple
queries) are much more common among the unique queries.</p>
      <p>Stability of Results. The usage patterns of any technical system
will evolve over time, but it is difficult to estimate the stability of
specific metrics. We have split our data into three intervals to make
such changes visible. Robotic traffic exhibits huge fluctuations, e.g.,
for joins (67%–88%) and OPTIONAL (11%–25%). Prevalence among
unique queries sometimes fluctuates independently, e.g., for UNION
(2.5%–8.6%). By studying one or several of the intervals as a single
dataset, and by choosing to include or exclude duplicate queries,
one could therefore arrive at extremely different conclusions from
this data. Many previous studies of SPARQL logs – often smaller
than ours, and therefore even more easily dominated by bots –
should be viewed in this light.</p>
      <p>Organic traffic tends to be more stable, but also shows significant
variations in some cases. In particular, we can see some change in
I3, most notably for VALUES and UNION. We discuss and explain
this change in detail in Section 6. In most cases, however, we can
see that metrics are fairly continuous, and that moreover, unique
queries are much more representative of all queries than in the
robotic case.</p>
      <p>
        Feature Usage as Compared to Other Studies. The previous
discussion suggests that it is generally questionable whether any insights
can be obtained by comparing SPARQL log metrics based on a
non-discriminative analysis of all queries. Nevertheless, there are
some aspects where our results show overwhelming differences
from previously reported findings. The most systematic overview
across several datasets is given by Bonifati et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They found
SERVICE, VALUES, BIND, and property paths to occur in less than
1% of unique queries, while they have great relevance in all parts
of our data. Similarly low prevalence was reported for subqueries,
SAMPLE, and GroupConcat – our robotic traffic is similar, but our
organic traffic paints a very different picture. Subqueries in particular
are strikingly common there.
      </p>
    </sec>
    <sec id="sec-6">
      <title>SPARQL Feature Co-Occurrence</title>
      <p>The expressive power and computational complexity of SPARQL
queries are determined not so much by the presence or absence
of individual features, but by their co-occurrence and interaction.
Investigating which combinations of features occur in queries is
therefore of interest. We present the results of this analysis here.</p>
      <p>Many studies restrict themselves to analysing co-occurrence of join,
OPTIONAL, UNION, and FILTER. In our case, however, we find that
also subqueries, property paths, VALUES, and SERVICE occur in a
significant share of queries. The use of SERVICE is dominated by
Wikidata’s labelling service, which is mostly used to add labels after
fetching query results. To reduce the number of feature
combinations, we therefore ignore the labelling service entirely. We do not
count queries that use any other type of service, or any other
feature not mentioned explicitly (the most common such feature being
BIND). Solution set modifiers (LIMIT etc.) and aggregates (COUNT
etc.) are ignored: they can add to the complexity of query answering
only when used in subqueries, so this feature can be considered an
overestimation of the amount of such complex queries.</p>
      <p>The results of this analysis are presented in Table 7 for all
operator combinations that accounted for more than 1% of queries
in either dataset. Path expressions in this case include all queries
where either * or + occurs. The features we selected, possibly in
combination with a labelling service, account for around 85% of all
queries in either dataset, but the distributions are very different.</p>
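      <p>The counting scheme behind such a co-occurrence table can be sketched as below. The feature labels and the sample data are hypothetical; the sketch assumes each query has already been reduced to a set of feature names, ignores the labelling service, solution set modifiers, and aggregates, and skips queries that use any feature outside the selected ones.</p>

```python
from collections import Counter

# Features whose combinations are reported (labels are illustrative).
COUNTED = {"Join", "Filter", "Optional", "Union", "Path", "Values", "Subquery"}
# Features ignored before forming a combination (hypothetical labels).
IGNORED = {"LabelService", "Limit", "OrderBy", "Count"}


def combination_shares(queries):
    """Return the share of each counted feature combination among all queries."""
    combos = Counter()
    for features in queries:
        relevant = set(features) - IGNORED
        if relevant - COUNTED:
            continue  # uses an unselected feature such as BIND: not counted
        combos[frozenset(relevant)] += 1
    total = len(queries)
    return {combo: n / total for combo, n in combos.items()}


# Hypothetical sample of four queries.
sample = [
    {"Join"},
    {"Join", "LabelService"},
    {"Join", "Bind"},
    {"Join", "Optional", "Limit"},
]
shares = combination_shares(sample)
```

      <p>Shares are taken relative to all queries, so the reported combinations need not add up to 100%; the gap corresponds to queries with features outside the selection.</p>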
      <p>
        Traditional Query Fragments. Plain conjunctive queries (CQ) that
contain only joins account for 21.3% (30.9%) of organic (robotic)
traffic. The frequently studied conjunctive-filter-pattern queries
(CFP), which also consider FILTER, increase these values to 29.1%
(35.5%). This is far below the prevalence reported for such simple
queries in other datasets, typically above 65% [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ]. VALUES can be
allowed in CQs without an increase in complexity for a coverage of
21.7% (61.4%). Robotic traffic in particular contains this pattern, which
is not surprising since VALUES is an efficient way to combine query
batches into one.
      </p>
      <p>
        Path Queries. The extension of CQ with property path
expressions leads to conjunctive regular path queries (CRPQs [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]), which
account for 24.4% (44.5%) of organic (robotic) queries. While organic
queries contain almost the same high amount of path expressions
(Table 6), the simple CRPQ fragment does not suffice to capture as
many of them as in the robotic case. A highly prominent query
fragment for the robotic case is CRPQs with VALUES, which account
for 74.9% of all queries.
      </p>
      <p>
        OPTIONAL and UNION. OPTIONAL is much more popular in
organic queries, where together with join it accounts for 44.9% of
the traffic. We attribute this to the fact that user interfaces often
try to show as much information as available. On the other hand,
robotic queries have much less use for query results that may miss
some of the queried values, especially since labels can be fetched
with a dedicated service. UNION rarely occurs in otherwise simple
queries, although it does occur in significant amounts of queries
overall. Our findings again deviate from Bonifati et al. who found
UNION alone enough to account for 7.5% of unique queries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Subqueries. Table 7 shows that more than 10% of organic queries
include subqueries while otherwise using nothing but joins, FILTER,
and OPTIONAL. However, we have ignored solution set modifiers
and aggregates in this analysis, since they normally play a role only
in post processing. In combination with subqueries, however, such
operations may have a big impact on the expressive power of the
query language.</p>
    </sec>
    <sec id="sec-7">
      <title>ROBOTIC VS. ORGANIC: EVALUATION</title>
      <p>We have already seen that organic and robotic traffic are
significantly different in many respects. This shows that our partitioning
of queries is not random, but it does not yet support the claim that
they actually can be characterised as in Table 4. In this section, we
therefore investigate whether the datasets also exhibit previously
asserted characteristics that have not been used to define them in
the first place.</p>
      <p>Temporal distribution. We begin by considering the temporal
distribution. According to Table 4 (b), we would expect organic
queries to be correlated with the time of day, while robotic queries
should not show any such relationship. Figure 1 shows the hourly
temporal traffic volume (in absolute numbers) and the relative
distribution across 24h-intervals. Organic traffic follows a strong
daily rhythm, with most activities happening during the European
and American day and evening. This strongly supports a direct
human involvement.</p>
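      <p>A relative distribution across the hours of the day, as plotted in Figure 1, could be computed along the following lines. The sketch assumes request times are available as ISO 8601 strings in a single time zone and that the input is non-empty; the function name and sample timestamps are hypothetical.</p>

```python
from collections import Counter
from datetime import datetime


def hourly_distribution(timestamps):
    """Return the relative share of requests for each hour of the day (0-23)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(hour, 0) / total for hour in range(24)]


# Hypothetical request times from different days.
sample_times = [
    "2017-06-12T09:15:00",
    "2017-06-12T09:45:00",
    "2017-06-13T21:00:00",
    "2017-06-14T03:30:00",
]
distribution = hourly_distribution(sample_times)
```

      <p>A strongly uneven distribution over the 24 hours would indicate a daily rhythm of the kind observed for organic traffic, while a flat one would not.</p>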
      <p>This also suggests that our dataset contains only few organic
queries from Asian users, which in our case simplifies the detection
of the expected patterns. If usage were globally uniform, it would be
promising to correlate daily usage with hints on geographical
context, which can be found in queries, e.g., in the form of coordinates
or language information.</p>
      <p>Figure 1 also shows some abrupt changes in the robotic traffic,
as predicted by Table 4 (g).</p>
      <p>Content preference. According to Table 4 (c), user queries are
expected to mirror more direct human interests. We investigate this
by considering the predicates used in queries. Table 8 shows the
top ten most frequently used Wikidata properties in organic and in
robotic queries. Robotic queries frequently involve IDs of external
databases, whereas organic queries often refer to properties related
to locations. Human properties such as occupation (and, just outside
of the top ten, date of birth and sex or gender) also rank highly in
organic queries. Commons category refers to Wikimedia Commons
and can be used to obtain related media for some entity. Wiki Loves
Monuments ID, the only identifier in the organic top ten, relates to
the eponymous content creation activity of Wikipedia, which aims
at gathering more information on local sites of interest.</p>
      <p>Diversity vs. Uniformity. We have conjectured that organic
queries are more diverse, since they are not controlled by a small number
of programs, and since user information needs are generally less
uniform. Our datasets support this in more than one sense. We have
already seen from Table 6 that organic queries tend to use more
diverse SPARQL features, including some that have no significant
share of robotic queries. We also found that simple combinations of
operators can account for 75% of robotic queries, whereas organic
queries exhibit a greater variance. For the study in Section 7, we
found 17 feature combinations that make up over 1% of organic
queries, while only 8 combinations have such high prevalence among
robotic queries (Table 7). Similarly, the usage of RDF properties is
more skewed in robotic traffic, where only 37 properties occur in
more than 1% of queries. In contrast, 59 different RDF properties
occur in more than 1% of organic queries.</p>
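      <p>Counting the properties that occur in more than 1% of queries could look roughly like the following sketch. It assumes that properties are written with the prefixes wdt:, p:, ps:, or pq: and ignores full IRIs. The function name and the sample queries are hypothetical and do not reproduce our actual analysis scripts.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Matches properties written with common Wikidata prefixes, e.g. wdt:P31.
PROPERTY_RE = re.compile(r"\b(?:wdt|p|ps|pq):(P\d+)\b")


def properties_above_threshold(queries, threshold=0.01):
    """Return {property: share} for properties used in more than `threshold` of queries."""
    usage = Counter()
    for query in queries:
        # Count each property at most once per query.
        usage.update(set(PROPERTY_RE.findall(query)))
    total = len(queries)
    if total == 0:
        return {}
    return {prop: n / total for prop, n in usage.items() if n / total > threshold}


# Hypothetical mini-log; a threshold of 50% is used here because it is so small.
sample_queries = [
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 . ?x wdt:P106 ?o }",
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q515 }",
    "SELECT ?x WHERE { ?x p:P39 ?s . ?s ps:P39 ?v }",
]
frequent = properties_above_threshold(sample_queries, threshold=0.5)
print(sorted(frequent))
```
]]></preformat>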
      <p>
        Another frequently studied structural aspect of queries is their
actual length in terms of the number of triples. To determine this
number, we have not counted triples that occur within SERVICE
clauses, since these are mostly used to call built-in Blazegraph
functions, where triples represent parameters rather than referring
to actual RDF data. Figure 2 shows the results: the robotic case exhibits
the usual peak of 1-triple queries, with 55.96% of
queries having at most this size, and 96% of queries having at most 7
triples. This matches findings for other datasets, where an average
of 56.45% of queries had at most one triple and 91% of queries used
6 or fewer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our robotic average query size of 2.45 triples also
resembles that of DBpedia, reported to be between 2.09 and 3.98
for various samples.
      </p>
      <p>In contrast, the size distribution of organic queries is much wider.
Queries with at most one triple only make up 17.30%, and one has to
consider queries with up to 11 triples to cover 97% of the data. The
average query size is 4.38 in this case. The largest organic query
had 143 triples, while the largest robotic ones contained 66 triples.
Again, we can see that organic queries are more diverse, and also
more complex than robotic ones.</p>
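      <p>The size statistics reported above (average size, share of queries with at most one triple, and the size needed to cover a given fraction of queries) can be derived from a list of per-query triple counts. The following is a minimal sketch with hypothetical names and sample data. It assumes that triples inside SERVICE clauses have already been excluded from the counts.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def size_statistics(triple_counts, coverage=0.97):
    """Return (mean size, share with at most one triple, smallest size covering `coverage`)."""
    total = len(triple_counts)
    if total == 0:
        raise ValueError("need at least one query")
    histogram = Counter(triple_counts)
    mean = sum(triple_counts) / total
    at_most_one = sum(n for size, n in histogram.items() if size <= 1) / total
    covered = 0
    largest = max(histogram)
    for size in sorted(histogram):
        covered += histogram[size]
        if covered / total >= coverage:
            return mean, at_most_one, size
    return mean, at_most_one, largest  # only reached if coverage exceeds 1


# Hypothetical triple counts for ten queries.
sample_sizes = [1, 1, 2, 3, 3, 4, 5, 7, 9, 11]
mean_size, share_small, cover_size = size_statistics(sample_sizes, coverage=0.8)
print(mean_size, share_small, cover_size)
```
]]></preformat>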
    </sec>
    <sec id="sec-8">
      <title>UNDERSTANDING WIKIDATA USAGE</title>
      <p>Based on the above investigations, we assume that our classification
of queries can successfully capture the ideas expressed in Table 4.
We now turn towards exploiting this insight for obtaining a better
understanding of actual Wikidata usage. Much of the previous
discussion also contributes towards this goal, e.g., the analysis of
temporal distribution (indicating a lack of Asian users) and the
ranking of properties (showing significant user interest in local
data), but we can refine these findings further.</p>
    </sec>
    <sec id="sec-9">
      <title>Annotations and Complex Statements</title>
      <p>As explained in Section 2, Wikidata supports annotations on
statements, which allow contextual information to be expressed, and which
lead to a more complex RDF encoding. It is therefore relevant to
ask if queries make use of such information, or if they rather use
the simplified view that drops all annotations.</p>
      <p>The analysis of URIs used in predicate positions lets us answer
this question. References are linked via the RDF provenance
vocabulary URI prov:wasDerivedFrom, which occurs in 1.1% (0.07%) of
all organic (robotic) queries. Another annotation is the statement
rank, which is used in 0.48% (0.62%) of all queries. Finally,
statements can also be annotated with arbitrary Wikidata properties,
and such use can also be recognised from URIs. The most frequently
used properties in statement annotations for organic and robotic
queries are shown in Table 9. Annotations are of course used only
in a minority of queries. Their pronounced use in the case of start
and end time attests to the importance of temporal validity when
interpreting statements from Wikidata.</p>
      <p>Conversely, we can also collect the most common properties
used in statements for which annotations are queried. To gauge this
metric, we consider RDF properties used for encoding the complex
form of statements that uses a dedicated node for representing
edges. Note that this does not always indicate that the query also
referred to annotations of these properties. Another motivation for
using the more complex statement encoding is that the simplified
statement encoding is only generated for statements of maximal
rank; queries that are interested in all (e.g., historical) data
therefore need to use complex statements even when not querying for
annotations. Table 10 displays the most common properties that
appeared in a form as used for complex statement encodings.</p>
      <p>We can see that complex statements account for a larger fraction of
organic queries. Indeed, 18.9% of all organic queries use
properties that are part of the complex RDF encoding of statements,
rather than relying on simplified statements represented by single
edges alone.</p>
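      <p>A query's use of the complex statement encoding can be recognised from the predicates it mentions. The sketch below is an assumption-laden illustration rather than our actual implementation: it only looks for prefixed names with the prefixes p:, ps:, pq:, and pr: as well as prov:wasDerivedFrom and wikibase:rank, and it would miss predicates written as full IRIs. The function name annotation_features is hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# Prefixed names that belong to the full (statement-node) encoding.
COMPLEX_RE = re.compile(r"\b(?:p|ps|pq|pr):P\d+\b")
REFERENCE_RE = re.compile(r"\bprov:wasDerivedFrom\b")
RANK_RE = re.compile(r"\bwikibase:rank\b")


def annotation_features(query):
    """Report which annotation-related predicates a query text mentions."""
    return {
        "complex_statement": bool(COMPLEX_RE.search(query)),
        "reference": bool(REFERENCE_RE.search(query)),
        "rank": bool(RANK_RE.search(query)),
    }


# Hypothetical examples: one query with a qualifier, one with simplified statements.
with_qualifier = (
    "SELECT ?school ?start WHERE { wd:Q42 p:P69 ?s . "
    "?s ps:P69 ?school ; pq:P580 ?start . }"
)
simple = "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }"
complex_features = annotation_features(with_qualifier)
simple_features = annotation_features(simple)
print(complex_features)
print(simple_features)
```
]]></preformat>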
    </sec>
    <sec id="sec-10">
      <title>Language Distribution</title>
      <p>
        We saw that the temporal distribution of organic queries suggests
that few users from Asia are accessing Wikidata via SPARQL-based
applications. For further insights, it is interesting to study the
language-related information contained in queries. Indeed, the use
of Wikidata by communities of different languages is a relevant
research topic in itself [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We have therefore extracted the languages for which queries
request labels using the labelling service, and grouped queries by
the first (most preferred) language they use. This resulted in over
200 different language tags, including misspellings. Of those, 35
occurred in more than 100 queries. For the 16 languages that occurred
in more than 1000 queries, we show the query numbers for each of
the intervals in Figure 3 using a logarithmic scale. The special label
“[AL]” denotes the value [AUTO_LANGUAGE], used for retrieving a
language based on browser settings.</p>
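      <p>Grouping queries by their most preferred label language can be sketched with a regular expression over the language parameter of the labelling service. This is a simplified illustration, not our parser-based implementation. The function names and sample queries are hypothetical, and only double-quoted language lists are recognised.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Parameter triple used inside SERVICE wikibase:label { ... } blocks.
LANGUAGE_RE = re.compile(r'bd:serviceParam\s+wikibase:language\s+"([^"]*)"')


def preferred_language(query):
    """Return the first language tag requested from the label service, or None."""
    match = LANGUAGE_RE.search(query)
    if match is None:
        return None
    first = match.group(1).split(",")[0].strip()
    if first == "[AUTO_LANGUAGE]":
        return "[AL]"  # same abbreviation as in Figure 3
    return first or None


def group_by_language(queries):
    """Count queries per first (most preferred) language tag."""
    return Counter(lang for lang in map(preferred_language, queries) if lang)


# Hypothetical sample queries.
sample_queries = [
    'SELECT ?x ?xLabel WHERE { ?x wdt:P31 wd:Q5 . '
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "pl,en". } }',
    'SELECT ?x ?xLabel WHERE { ?x wdt:P31 wd:Q5 . '
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } }',
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
]
language_counts = group_by_language(sample_queries)
print(language_counts)
```
]]></preformat>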
      <p>
        English clearly dominates the list, followed by mostly European
languages (cs is Czech, sv is Swedish, ca is Catalan, and eu is
Basque). Asian languages follow only at the end, with Indonesian
(id), Hebrew (he), and Japanese (ja) hardly occurring in more than
1,000 queries in total. The sum of several variants of Chinese also
amounts to 1,122 queries; lower still is Arabic (798). These
observations suggest a strong imbalance in the global use of Wikidata via
SPARQL. The choice of language may also reflect the lack of labels
in some languages [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but since the labelling service supports any
number of fallback languages, users could still always put their
preferred language at the front. Moreover, the number of queries
asking for a certain language is only weakly related to the total
number of labels available in that language. As of March 2018, Wikidata
has most labels for English (32.7M), Dutch (10.7M), French (9.4M),
German (8.0M), Spanish (6.5M), Italian (6.1M), Swedish (6.1M),
Russian (5.4M), Cebuano (5.0M), and Bulgarian (2.9M). Polish follows
at fourteenth place with 2.6M labels; Hebrew is only at 92nd place
with fewer than 460K labels.
      </p>
      <p>The situation for robotic traffic is similar, but the labelling
service is used in a much smaller fraction of queries in this case, and
English is more dominant. Some European languages still occur, but
others hardly do (e.g., Polish). Chinese is slightly more prominent
compared to other languages, with more than 100,000 queries, but
this is still less than 0.1% of all robotic queries.</p>
      <p>An interesting observation from Figure 3 is that some languages
seem to be “trending” in that we see a clear increase in their
popularity. This is particularly strong for Polish, but can also be seen
for Catalan. The next section discusses methods that can help to
understand such observations.</p>
    </sec>
    <sec id="sec-11">
      <title>Causal Analysis</title>
      <p>We have argued that mere statistical analysis can easily be
misleading. Indeed, it is not useful to compute averages over exponential
distributions (skewed data), which tend to appear on many scales in
usage analysis. A better understanding can be obtained by a more
fine-grained analysis that tries to attribute the observed traffic to
individual causes. Beyond superficial statistics, this gives us valuable
insights into the goals underlying current usage.</p>
      <p>We have conducted such an analysis by distinguishing queries
that are issued by recognisable tools. This analysis started from
a number of self-identifying tools, which include the tool’s name
in a comment in their SPARQL queries. (The convention for doing this
in the Wikidata community is to start a comment with
#TOOL: followed by some identifying string.) Many further tools were
identified from their distinctive traffic patterns, usually marked
by bursts of many similar queries, in combination with their user
agents. This was mostly manual work, based on inspecting the
extracted query patterns that were most frequent, as well as their
temporal distribution.</p>
      <p>Some of the identified sources have vanished in the following
month. For the intervals analysed in this paper, we know of 95
query sources that together account for 144,787,485 (68.39%) of all
valid queries. As mentioned in Section 4, we consider three
self-identified tools as organic; the respective numbers of queries are:
57,564 (WikiShootMe), 46,691 (SQID), and 996 (Histropedia).</p>
      <p>In addition, we identified a browser-based tool that looked up
information on local sites based on geographic coordinates in Poland,
with a total of 48,051 queries. This activity was notable only in I3
and could be traced back to I2, but not to I1. The sudden prominence
of the tool is connected to the Wiki Loves Monuments competition of
Wikipedia, which ran in September 2017, and indeed the respective
queries account for the occurrence of property P2186 in Table 8.
Moreover, the sudden rise of this new application largely explains
why I3 shows somewhat different characteristics in Table 6, and
why Polish is seeing a strong upwards trend in Figure 3.</p>
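      <p>Self-identifying tools that follow the #TOOL: comment convention can be detected with a simple pattern match. The following sketch is illustrative only: the function names and the tool string ExampleTool are hypothetical, and it covers only the self-identification step, not the manual inspection of traffic patterns and user agents.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# A comment line of the form "#TOOL: some identifying string".
TOOL_RE = re.compile(r"^[ \t]*#[ \t]*TOOL:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def tool_name(query):
    """Return the self-declared tool name of a query, or None if there is none."""
    match = TOOL_RE.search(query)
    return match.group(1) if match else None


def queries_per_tool(queries):
    """Count queries per self-declared tool name."""
    return Counter(name for name in map(tool_name, queries) if name)


# Hypothetical sample queries.
sample_queries = [
    "#TOOL: ExampleTool\nSELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
    "#TOOL: ExampleTool\nSELECT ?x WHERE { ?x wdt:P31 wd:Q515 }",
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
]
tool_counts = queries_per_tool(sample_queries)
print(tool_counts)
```
]]></preformat>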
      <p>The remaining 91 sources were considered robotic. The top
three of them together issue 60.06% of all queries that the Wikidata
SPARQL query service has answered in our data. They are:
• auxiliary matcher: a data integration tool that identifies
potential matches with external datasets; 63,745,842 queries
• PBB_core fastrun: a data integration tool for protein
databases; 41,605,446 queries
• bot2: a multi-purpose bot operated by Magnus Manske;8
22,951,459 queries</p>
      <p>Further very active query sources include several unidentified
scripts and bots. Prominent tools with more than 1M queries each ask for
movie database IDs, information on association football, and detailed
name and label information. The most active sources also include
a query that polls the time of the most recent update of the RDF
database – this single query has been answered 1,861,752 times (i.e.,
about once every 11.5 seconds).</p>
      <p>Our analysis strongly supports our conjecture that any
non-discriminative analysis of the data is necessarily skewed towards
a small number of tools. Indeed, both auxiliary matcher and bot2
are maintained by Magnus Manske, which means that he controls
about 41% of all traffic. We believe that this situation is not unusual
and does not occur only for Wikidata, although few individuals will have
the impact of a Magnus Manske.</p>
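      <p>The figure of about 41% can be retraced from the query counts given above. The short calculation below is a sketch under one assumption that is not stated explicitly in the text: that the 68.39% share of the known sources and the 41% share refer to the same total number of valid queries. The variable names are hypothetical.</p>
      <preformat><![CDATA[
```python
# Query counts reported in the text above.
known_source_queries = 144_787_485  # queries attributed to the 95 known sources
known_source_share = 0.6839         # their reported share of all valid queries
auxiliary_matcher = 63_745_842
bot2 = 22_951_459

# Assumption: both percentages are relative to the same total of valid queries.
total_valid_queries = known_source_queries / known_source_share
single_operator_share = (auxiliary_matcher + bot2) / total_valid_queries
print(f"{single_operator_share:.1%}")
```
]]></preformat>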
    </sec>
    <sec id="sec-12">
      <title>ANONYMISATION AND PUBLICATION</title>
      <p>We intend to publish our log datasets together with this study,
pending the outcome of an ongoing clearance process within the
Wikimedia Foundation. Since SPARQL logs are user access logs,
they must be treated very carefully to avoid privacy concerns. The
published logs therefore are to contain neither IP addresses nor user
agents, but they will be partitioned according to our methodology
to allow studies of different user groups. They will also contain
time stamps for each query.</p>
      <p>The queries will also need to be modified to remove any
potentially sensitive information. All comments will be removed and each
query will be parsed, normalised, and re-serialised. Longer strings,
variable names, and geographic coordinates will be replaced
uniformly by generic fillers of the same type. We do not expect strings
or variable names to contain private information, but the query
volume is too large to be certain. Geographic coordinates need to be
replaced since they may contain very specific location information
that could be linked to individual humans. In contrast, we think
that dates (which in Wikidata are only used up to the precision of
a day) do not contain enough information to be sensitive. In the
current data release proposal, all references to Wikidata entities
and the exact structure of each SPARQL query would be preserved,
so that the published files should be of use to other researchers.</p>
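      <p>The replacement steps described above can be illustrated with a small text-based sketch. Note that this swaps in regular expressions for the parse, normalise, and re-serialise pipeline described in the text, so it is only an approximation. The function name, the filler values, and the length threshold of 10 characters are hypothetical, and only full-line comments, double-quoted strings, and WKT Point literals are handled.</p>
      <preformat><![CDATA[
```python
import re

TOKEN_RE = re.compile(
    r'(?P<iri><[^<>\s]*>)'             # IRIs are kept unchanged
    r'|(?P<string>"(?:[^"\\]|\\.)*")'  # double-quoted string literals
    r'|(?P<var>[?$]\w+)'               # variables such as ?item or $item
)
COMMENT_LINE_RE = re.compile(r"^[ \t]*#.*\n?", re.MULTILINE)


def anonymise(query, max_string_length=10):
    """Drop comment lines and replace long strings, coordinates, and variable names."""
    names = {}

    def replace(match):
        if match.group("iri"):
            return match.group("iri")
        if match.group("string"):
            literal = match.group("string")
            if literal.startswith('"Point('):
                return '"Point(0 0)"'      # generic coordinate filler
            if len(literal) - 2 > max_string_length:
                return '"string"'          # generic string filler
            return literal
        name = match.group("var")[1:]
        # The same variable always receives the same filler name.
        names.setdefault(name, f"var{len(names) + 1}")
        return "?" + names[name]

    return TOKEN_RE.sub(replace, COMMENT_LINE_RE.sub("", query))


# Hypothetical sample query.
sample_query = (
    "#TOOL: ExampleTool\n"
    "SELECT ?place WHERE {\n"
    "  ?place wdt:P625 ?location .\n"
    '  FILTER(?location = "Point(13.73 51.05)"^^geo:wktLiteral)\n'
    '  ?place rdfs:label "Some fairly long label"@en .\n'
    "}\n"
)
anonymised = anonymise(sample_query)
print(anonymised)
```
]]></preformat>
      <p>References to Wikidata entities and properties are left untouched by this sketch, in line with the release proposal described above.</p>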
      <p>Until the anonymised data sets have been released, researchers
who wish to replicate our findings can gain access to the data by
applying for a formal research collaboration with the Wikimedia
Foundation, subject to signing suitable non-disclosure agreements.
The source code of our analysis scripts is publicly available.9 In
particular, this code also includes our manual classification rules
for queries based on their shape and user agent.
8 https://en.wikipedia.org/wiki/Magnus_Manske</p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION</title>
      <p>We have presented a first detailed study of the access logs of the
Wikidata SPARQL query service. Due to the varied uses of SPARQL
services, the overall query traffic can roughly be classified into two
parts: a large fraction of relatively simple queries that are generated
by only a few bots, and a much smaller fraction of more complex
queries posed (directly or indirectly) by many human users. We
argued that, if we want to gain meaningful insights from query log
analysis, the queries of the many should outweigh the queries of
the few, since the former can more reliably represent a real human
information need.</p>
      <p>We proposed a characterisation of both components, organic
and robotic traffic, and we showed that these concepts are workable
in that (1) both parts can be separated with little effort using a small
number of simple signals, and (2) the resulting datasets do indeed
mirror many of the expected characteristics. We then continued
to study three specific aspects of Wikidata’s usage: the practical
use of complex RDF encodings, the global imbalance in the use of
queries, and the explanation of individual traffic components by
means of identifying their sources.</p>
      <p>
        Our research motivates further studies of real SPARQL queries
that apply our methods to other datasets. It would be extremely
interesting to learn how much organic query traffic can be found
in other datasets, and whether it is as distinct from the rest of the
queries as it was in our case. Moreover, a causal analysis of the
main robotic query sources could shed more light on the actual
real-world use of RDF datasets published via SPARQL. Are data
integration and systematic download also the predominant tasks
in other scenarios? Which fraction of bots can account for which
fraction of the traffic? How do more complicated metrics, such as
treewidth [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or specific shapes of query patterns [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], behave for
organic and robotic data?
      </p>
      <p>Finally, it would also be of interest to extend the tools available
for classifying organic traffic in the first place. Our approach based
on user agents, query shapes, and time was feasible but still mostly
manual. An automated classification could be of interest, and our
data may serve for training and evaluation. Moreover, the approach
we took suggests that even anonymised query logs should offer
some general user agent information (such as “browser or not”), and
also sufficiently fine-grained temporal information. If enough
information is retained, then the manual interaction might be avoided
completely, allowing the creation of applications that continuously
monitor organic traffic for relevant trends. Methodologically
speaking, we do indeed seem to be at the very beginning of this field.</p>
      <p>Acknowledgements. This work was partly supported by the DFG
within the cfaed Cluster of Excellence, CRC 912 (HAEC), and Emmy
Noether grant KR 4381/1-1.
9 https://github.com/Wikidata/QueryAnalysis</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>François</given-names>
            <surname>Belleau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc-Alexandre</given-names>
            <surname>Nolin</surname>
          </string-name>
          , Nicole Tourigny, Philippe Rigault, and
          <string-name>
            <given-names>Jean</given-names>
            <surname>Morissette</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Bio2RDF: Towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>J. of Biomedical Informatics</source>
          <volume>41</volume>
          ,
          <issue>5</issue>
          (
          <year>2008</year>
          ),
          <fpage>706</fpage>
          -
          <lpage>716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>DBpedia - A crystallization point for the Web of Data</article-title>
          .
          <source>J. of Web Semantics</source>
          <volume>7</volume>
          ,
          <issue>3</issue>
          (
          <year>2009</year>
          ),
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Angela</given-names>
            <surname>Bonifati</surname>
          </string-name>
          , Wim Martens, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Timm</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>An Analytical Study of Large SPARQL Query Logs</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>11</volume>
          (
          <year>2017</year>
          ),
          <fpage>149</fpage>
          -
          <lpage>161</lpage>
          . Issue 2.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Diego</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , Giuseppe De Giacomo, Maurizio Lenzerini, and
          <string-name>
            <given-names>Moshe Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Reasoning on regular path queries</article-title>
          .
          <source>SIGMOD Record 32</source>
          ,
          <issue>4</issue>
          (
          <year>2003</year>
          ),
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Fredo</given-names>
            <surname>Erxleben</surname>
          </string-name>
          , Michael Günther, Markus Krötzsch, Julian Mendez, and
          <string-name>
            <given-names>Denny</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Introducing Wikidata to the Linked Data Web</article-title>
          .
          <source>In Proc. 13th Int. Semantic Web Conf. (ISWC'14)</source>
          (LNCS), Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig A.
          <string-name>
            <surname>Knoblock</surname>
          </string-name>
          , Denny Vrandečić, Paul T. Groth, Natasha F. Noy, Krzysztof Janowicz, and Carole A.
          <source>Goble (Eds.)</source>
          , Vol.
          <volume>8796</volume>
          . Springer,
          <fpage>50</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Alon Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Query containment for conjunctive queries with regular expressions</article-title>
          .
          <source>In Proc. 17th Symposium on Principles of Database Systems (PODS'98)</source>
          ,
          <source>Alberto O. Mendelzon and Jan Paredaens (Eds.)</source>
          . ACM,
          <fpage>139</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Harris</surname>
          </string-name>
          and Andy Seaborne (Eds.).
          <source>21 March</source>
          <year>2013</year>
          .
          <article-title>SPARQL 1.1 Query Language</article-title>
          .
          <source>W3C Recommendation</source>
          . Available at http://www.w3.org/TR/ sparql11-query/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Lucie-Aimée</given-names>
            <surname>Kaffee</surname>
          </string-name>
          , Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, and
          <string-name>
            <given-names>Lydia</given-names>
            <surname>Pintscher</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Glimpse into Babel: An Analysis of Multilinguality in Wikidata</article-title>
          .
          <source>In Proc. 13th Int. Symposium on Open Collaboration (OpenSym'17)</source>
          , Lorraine Morgan (Ed.). ACM,
          <volume>14</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          :
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Kaminski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Egor V.</given-names>
            <surname>Kostylev</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Complexity and Expressive Power of Weakly Well-Designed SPARQL</article-title>
          .
          <source>Theory of Computing Systems</source>
          (
          <year>2018</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . https://doi.org/10.1007/s00224-017-9802-9 to appear.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Lorey</surname>
          </string-name>
          and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Detecting SPARQL Query Templates for Data Prefetching</article-title>
          .
          <source>In Proc. 10th Extended Semantic Web Conf</source>
          .
          <article-title>(ESWC'13) (LNCS), Philipp Cimiano</article-title>
          , Óscar Corcho, Valentina Presutti,
          <source>Laura Hollink, and Sebastian Rudolph (Eds.)</source>
          , Vol.
          <volume>7882</volume>
          . Springer,
          <fpage>124</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Luczak-Roesch</surname>
          </string-name>
          , Zamil Aljaloud Saud, Bettina Berendt, and
          <string-name>
            <given-names>Laura</given-names>
            <surname>Hollink</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>USEWOD 2016 Research Dataset</article-title>
          . (
          <year>2016</year>
          ). https://eprints.soton.ac.uk/385344/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Knud</given-names>
            <surname>Möller</surname>
          </string-name>
          , Michael Hausenblas, Richard Cyganiak, Siegfried Handschuh, and
          <string-name>
            <given-names>Gunnar A.</given-names>
            <surname>Grimnes</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Learning from Linked Open Data Usage: Patterns &amp; Metrics</article-title>
          .
          <source>In Proc. Web Science Conf. (WebSci'10).</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>François</given-names>
            <surname>Picalausa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stijn</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>What are real SPARQL queries like?</article-title>
          .
          <source>In Proc. Int. Workshop on Semantic Web Information Management (SWIM'11)</source>
          , Roberto De Virgilio, Fausto Giunchiglia, and Letizia Tanca (Eds.). ACM,
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Aravindan</given-names>
            <surname>Raghuveer</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Characterizing Machine Agent Behavior through SPARQL Query Mining</article-title>
          .
          <source>In Proc. 2nd Int. Workshop on Usage Analysis and the Web of Data (USEWOD'12)</source>
          . usewod.org.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Laurens</given-names>
            <surname>Rietveld</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rinke</given-names>
            <surname>Hoekstra</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Man vs. Machine: Differences in SPARQL Queries</article-title>
          .
          <source>In Proc. 4th USEWOD Workshop on Usage Analysis and the Web of Data. usewod.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Saleem</surname>
          </string-name>
          , Muhammad Intizar Ali, Aidan Hogan, Qaiser Mehmood, and
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>LSQ: The Linked SPARQL Queries Dataset</article-title>
          .
          <source>In Proc. 14th Int. Semantic Web Conf. (ISWC'15)</source>
          ,
          <article-title>Part II (LNCS), Marcelo Arenas</article-title>
          , Óscar Corcho, Elena Simperl, Markus Strohmaier, Mathieu d'Aquin,
          <string-name>
            <given-names>Kavitha</given-names>
            <surname>Srinivas</surname>
          </string-name>
          , Paul T. Groth, Michel Dumontier, Jeff Heflin,
          <source>Krishnaprasad Thirunarayan, and Steffen Staab (Eds.)</source>
          , Vol.
          <volume>9367</volume>
          . Springer,
          <fpage>261</fpage>
          -
          <lpage>269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Denny</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Wikidata: A Free Collaborative Knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          ,
          <issue>10</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>