<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Suggesting Citations for Wikidata Claims based on Wikipedia's External References</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Curotto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <email>ahogang@dcc.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DCC, Universidad de Chile &amp; IMFD</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given a Wikidata claim, we explore automated methods for locating references that support that claim. Our goal is to assist human editors in referencing claims, and thus increase the ratio of referenced claims in Wikidata. As an initial approach, we mine links from the references section of English Wikipedia articles, download and index their content, and use standard relevance-based measures to find supporting documents. We consider various forms of search phrasings, as well as different scopes of search. We evaluate our methods in terms of the coverage of reference documents collected from Wikipedia. We also develop a gold standard of sample items for evaluating the relevance of suggestions. Our results in general reveal that the coverage of Wikipedia reference documents for claims is quite low, but where a reference document is available, we can often suggest it within the first few results.</p>
      </abstract>
      <kwd-group>
        <kwd>Wikidata</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>citations</kwd>
        <kwd>references</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Wikidata [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a collaboratively-edited knowledge graph. Much like its
sibling project Wikipedia, Wikidata is continuously extended and curated by a
large community of volunteers. Unlike Wikipedia, Wikidata manages structured
statements about items. Items include people, places, proteins, papers, printers,
planets, political parties, and many more besides. A statement consists of an
item, a property, and a value. For example, a statement might claim that the
album Pulse (item) has the performer (property) Pink Floyd (value). Values may
be items, datatype values (numbers, booleans, dates, times, etc.), or special terms
indicating an unknown value or that no such value exists. Statements can also have
qualifiers that scope the validity of the claim or provide additional details;
a qualifier may state, for example, a time period in which a claim was true, or the previous
or next item with that value for that property.
      </p>
      <p>As per Wikipedia, Wikidata does not aim to be a primary source of
knowledge, but rather a secondary source of knowledge: statements in Wikidata should
be interpreted as claims held true according to a specific, external and
authoritative source.1 Thus it is important that statements be independently verifiable,
meaning that parties other than the editor who added the statement should be
able to verify the validity of that statement. Some statements are considered
self-verifiable. The cases listed by Wikidata editor guides include:2
- Common human knowledge: statements that are obvious to most and can
be considered self-evident; for example, that Paris is an instance of city, that
Paris is the capital of France, that city is a subclass of urban area, etc.
- The value is an external source: statements that point to an external source,
such as identifiers associated with the item in external catalogues.
- The value refers to an external source: statements that point to a Wikidata
item that itself can verify the statement, such as an album stating its artist,
a book stating its author, etc.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In the case of statements not falling into one of these three categories, the onus
is on the editor who adds (modifies or restores) a statement to establish verifiability
by adding a reference for the statement based on an authoritative source.
Authoritative sources include books, publications, news media, laws, other
popular media, reputable websites, etc. Questionable sources, sponsored sources,
self-published sources, etc., may be rejected as non-authoritative sources.</p>
      <p>
        At the time of writing (August 2020), Wikidata describes 1.124 billion
statements about 88 million items and has over 23 thousand active users. Of these
statements, 771 million are referenced to external sources (68.56%), 68
million are referenced to Wikipedia (6.02%), leaving 286 million without reference
(25.42%).3 Of these items, 71 million (80.34%) have at least one referenced
statement. While collecting 771 million referenced statements is an impressive
achievement, more can be done to improve the coverage of references [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Valid
but unreferenced statements run the risk of being removed; conversely, leaving
them in the knowledge graph runs the risk of hosting invalid statements, which
may in turn cause adverse effects for applications that use Wikidata.4
Furthermore, Wikidata does not currently offer its editors much assistance in finding
references for a claim; a tool to automatically suggest references would help make
the most of these volunteers' time and effort. Finally, the aforementioned
statistics count statements with some reference, but Piscopo et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] estimate that
only 61% of Wikidata's references can be considered authoritative and relevant.
      </p>
      <p>In summary, we see a need for research on methods to (semi-)automatically
find authoritative references for statements in Wikidata. Herein we describe our
work on an initial such method based on searching over reference documents
scraped from Wikipedia. This approach seems initially quite natural: Wikipedia</p>
    </sec>
    <sec id="sec-2">
      <title>1 See https://www.wikidata.org/wiki/Wikidata:Verifiability</title>
    </sec>
    <sec id="sec-3">
      <title>2 See https://www.wikidata.org/wiki/Help:Sources/Items_not_needing_sources</title>
    </sec>
    <sec id="sec-4">
      <title>3 See https://wikidata-todo.toolforge.org/stats.php</title>
      <p>4 As an anecdotal example of the latter, we refer to Siri reporting the death of Stan
Lee, apparently based on an invalid statement added to Wikidata: https://io9.gizmodo.com/siri-erroneously-told-people-stan-lee-was-dead-1827322243.</p>
      <p>articles are linked with Wikidata items; Wikipedia references follow similar
principles of verifiability and authority as for Wikidata; Wikipedia is an older project
and thus one might expect more extensive reference lists to have developed
over time; a considerable number of Wikidata statements already reference a
Wikipedia article that should itself cite an authoritative source; the factual
nature of Wikipedia means that one could expect overlap in terms of the claims
made about the same entities/items on both sites; Wikidata guides suggest
searching Wikipedia for sources; etc. Our results, however, show that Wikipedia's
references are quite limited in terms of coverage for Wikidata claims.</p>
      <sec id="sec-4-1">
        <title>Related Works</title>
        <p>
          A number of works have analysed referencing in Wikipedia. In a study of the
quality of Wikipedia articles, Warncke-Wang et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] found that the number
of references was a key signal for predicting the quality of articles as manually
labelled through Wikipedia's peer review process. Lewoniewski et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] analyse
the differences and overlap between Wikipedia references across seven different
language versions; of the languages studied, they found that over half (25.5
million) of the total (41.2 million) references came from English Wikipedia. Kousha
and Thelwall [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] analyse whether or not Wikipedia citations predict the impact of
academic publications, finding that few indexed articles are cited. Redi et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
construct a taxonomy of reasons why claims should be cited in Wikipedia, and
then develop a machine learning model to predict which claims require citation
and for which reason. More recently, Piccardi et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] found low user
engagement with external citations in English Wikipedia, with about 1-in-300 page
visits resulting in a click-through to a reference on the article.
        </p>
        <p>
          With respect to references on Wikidata, WikiCite is a Wikimedia initiative
to develop and expand the citation data available through Wikidata.5 As part of
the WikiCite initiative, Nielsen et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] discuss how Wikipedia references
provide limited data about the source being referred to, contrasting this with
Wikidata, which contains structured data about books, articles, authors, publishers,
identifiers, etc.; they provide statistics on such data, and build a scientometric
application called Scholia on top of them. Piscopo et al. [
          <xref ref-type="bibr" rid="ref10 ref12">10,12</xref>
          ] have provided
in-depth studies comparing external references on both Wikipedia and
Wikidata, finding that there is low overlap between both in terms of the references
used and the domains of those references [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]; they further estimate that 61%
of Wikidata's external references are considered relevant and authoritative [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
Lemus-Rojas and Pintscher [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] identify the "citation gap" as a problematic
issue, suggesting that librarians are well-positioned to help address this gap, as
they have already done for Wikipedia. Piscopo and Simperl [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] discuss the
importance of references to various dimensions of Wikidata quality.
        </p>
        <p>
          Regarding datasets, Delpeuch [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and more recently Singh et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] have
published metadata for citations extracted from English Wikipedia associated
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 See https://meta.wikimedia.org/wiki/WikiCite</title>
      <p>[Figure 1: High-level architecture. Components: API, Scraper, Crawler, Index; external sources: Wikidata, Wikipedia; labelled flows: claim, suggestions, wiki urls, articles, ref urls, content, search, results.]</p>
      <sec id="sec-5-4">
        <title>Crawler</title>
        <p>
          with external identifiers (e.g., DOIs). Chou et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] also recently published a
dataset of English Wikipedia articles annotated with the aforementioned model
of Redi et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. However, these datasets do not provide the textual content of
the external references, rather focusing on metadata extracted from Wikipedia.
        </p>
        <sec id="sec-5-4-1">
          <title>Proposed Approach &amp; Research Questions</title>
          <p>We propose to scrape external reference URLs from English Wikipedia, and to
download and index their content. Thereafter, given a Wikidata claim for which
an editor requires suggestions of potential references, using the labels and aliases
of the items involved, we will convert the claim to a search using English terms
and apply the search over the inverted indexes of the content of the external
documents, using standard relevance-measures to prioritise documents. Finally,
to assist the editor, we will return not only the document itself, but also a snippet
of text from the document that contains the relevant keywords.</p>
          <p>We present the high-level architecture in Figure 1. The API provides an
interface that accepts a claim from Wikidata (along with associated metadata) and
returns suggestions of potential references. In order to provide these references, a
Scraper collects and parses the URLs of external references from articles on
English Wikipedia. These URLs are passed to a Crawler that downloads the URLs
and saves their content into an Index. The API can then formulate a search for
the claim over the Index, which returns relevant documents as results that are
returned as suggestions. We consider both an offline mode and an online mode.
In the offline mode, the Scraper and Crawler process all of (English) Wikipedia,
generating the Index over the full corpus that can be searched at runtime. We
also consider an online/lazy mode, where the Scraper instead accepts a list of
relevant Wikipedia article URLs from the API, which are passed to the Crawler,
which in turn populates the Index at runtime before the search is performed.
The offline mode has the benefit of lower latency, but a priori it is not clear that
performing such a crawl of all external references is feasible; also the Index would
require periodic updates. The online mode is easier to keep up-to-date, where
the Index rather acts as a cache, but is associated with slower runtime responses
as the Crawler operates while the editor is waiting for suggestions.</p>
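          <p>As a rough illustration of the online/lazy mode, the following Python sketch populates a toy in-memory index on demand, so that the index acts as a cache across requests. The class name LazyReferenceIndex and the scrape and fetch callables are hypothetical stand-ins for the Scraper and Crawler; they are not part of the system described here, which relies on Nutch and Solr.</p>
          <preformat>
```python
from typing import Callable, Dict, Iterable, List


class LazyReferenceIndex:
    """Toy stand-in for the Index, populated on demand (online/lazy mode)."""

    def __init__(self, scrape: Callable[[str], Iterable[str]],
                 fetch: Callable[[str], str]):
        self.scrape = scrape  # hypothetical Scraper: article URL to reference URLs
        self.fetch = fetch    # hypothetical Crawler: reference URL to page text
        self.docs: Dict[str, str] = {}              # reference URL to content
        self.by_article: Dict[str, List[str]] = {}  # article URL to reference URLs

    def ensure_indexed(self, article_url: str) -> List[str]:
        # Scrape and crawl an article only the first time it is requested,
        # so the index doubles as a cache for later requests.
        if article_url not in self.by_article:
            ref_urls = list(self.scrape(article_url))
            for url in ref_urls:
                if url not in self.docs:
                    self.docs[url] = self.fetch(url)
            self.by_article[article_url] = ref_urls
        return self.by_article[article_url]
```
          </preformat>
          <p>In this sketch the first call for an article pays the full download cost while the editor waits, whereas repeated calls are served from the cache; an offline mode would correspond to calling ensure_indexed for every article ahead of time.</p>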
          <p>Our initial goal is to study the feasibility of this overall approach, and to
establish baseline methods and datasets for further research. Within this proposal
a number of initial research questions arise:</p>
          <p>RQ1 Offline vs. online indexing: Is it feasible to scrape, download and index
the content of all of English Wikipedia's external references offline? Or would
it be better to scrape, download and index the content online/lazily for the
external references of the Wikipedia articles relevant to the claim at hand?</p>
          <p>RQ2 Coverage: How many external references can we source from the article
corresponding to each Wikidata item? Can we build a corpus with good
coverage of Wikidata items in general?</p>
          <p>RQ3 Search phrasing: How best should we phrase the search? Should we use
only primary labels, or also aliases? What connectives should we use?</p>
          <p>RQ4 Relevance: Are traditional IR measures sufficient to generate good
suggestions? Should we search only in references for Wikipedia articles
corresponding to the item(s) involved in the claim, or across the entire corpus?</p>
          <p>RQ5 Suggestion Quality: How often can we generate good suggestions of
references for claims? Are the rankings of suggestions suitable? Can we also
suggest relevant text snippets from within the documents to support the claim?</p>
          <p>In this initial work our goal is to gain insights regarding these research
questions, rather than seeking definitive answers.</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>Scraping, Crawling &amp; Indexing</title>
          <p>
            We first explore the offline approach. We start with a dump of Wikidata, from
which we extracted the mapping to Wikipedia articles. These articles were then
retrieved from a 2018 HTML corpus of Wikipedia [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. A custom scraper extracts
the external reference URLs from the articles. We used Apache Nutch6 for
crawling, which uses Apache Solr7 as an underlying index. To avoid overloading hosts
in a way that could resemble a Denial of Service (DoS) attack, we configured Nutch
to wait 5 seconds between requests to the
same website. Nutch indexes the content, title, host and URL of the successfully
retrieved webpages in Solr; we enrich this index with the Q codes of the Wikidata
items corresponding to the Wikipedia article of the external reference.
Results: A total of 32,329,989 raw external reference URLs were extracted from
5,461,401 articles. Removing repeated and ill-formed URLs yielded 23,036,318
well-formed, unique URLs. Loading the URLs into the crawler, a filter was
applied to remove URLs with extensions referring to file-types (images, videos,
etc.) that we cannot currently process. This yielded 17,781,974 crawlable URLs.
Crawling was disabled; in other words we set Nutch to download the content of
the URLs, rather than to recursively follow further URLs. The download was
run from August 2019 to December 2019, in which time 2,475,461 URLs were
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 http://nutch.apache.org/</title>
    </sec>
    <sec id="sec-7">
      <title>7 https://lucene.apache.org/solr/</title>
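      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Top 10 domains for raw URLs extracted from Wikipedia and for indexed (redirected) URLs</p>
        </caption>
        <table>
          <thead>
            <tr><th>No.</th><th>Raw</th><th>Indexed</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>archive.org</td><td>bbc.co.uk</td></tr>
            <tr><td>2</td><td>doi.org</td><td>nytimes.com</td></tr>
            <tr><td>3</td><td>nih.gov</td><td>archive.org</td></tr>
            <tr><td>4</td><td>nytimes.com</td><td>billboard.com</td></tr>
            <tr><td>5</td><td>bbc.co.uk</td><td>newspapers.com</td></tr>
            <tr><td>6</td><td>webcitation.org</td><td>thegazette.co.uk</td></tr>
            <tr><td>7</td><td>allmusic.com</td><td>sports-reference.com</td></tr>
            <tr><td>8</td><td>youtube.com</td><td>reuters.com</td></tr>
            <tr><td>9</td><td>theguardian.com</td><td>baseball-reference.com</td></tr>
            <tr><td>10</td><td>archive.is</td><td>bbc.com</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>successfully downloaded and indexed in Nutch. We let Nutch decide the order of
the URLs to be accessed; no configuration was made in this matter.8 Though not
all URLs were processed at this point, progress in the crawler had slowed to a
halt. An issue we did not anticipate was that of redirects: Nutch does not provide
a clean mechanism to retrace redirects, though from the logs it is possible to
retrace the URLs accessed and, in most cases, recover the original URLs. This was
important to match the original Wikipedia articles and external URLs with the
redirected content location of the indexed document. In total, we could recover
links for 2,058,896 documents (83%) from their original Wikipedia article.</p>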
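      <p>The URL clean-up applied before crawling (removing repeated and ill-formed URLs, then filtering by file extension) might be sketched in Python as follows. This is only an illustration under stated assumptions: the extension list and the definition of a well-formed URL (an absolute http or https URL with a host) are guesses on our part, and the actual filtering was done with a custom scraper and Nutch configuration.</p>
      <preformat>
```python
from urllib.parse import urlsplit

# Assumed, illustrative list of extensions that cannot be processed.
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg",
                      ".mp3", ".mp4", ".avi")


def is_well_formed(url):
    """Accept only absolute http(s) URLs that include a host name."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crawlable_urls(raw_urls):
    """Deduplicate, drop ill-formed URLs, then filter by file extension."""
    seen = set()
    result = []
    for url in raw_urls:
        url = url.strip()
        if url in seen or not is_well_formed(url):
            continue
        seen.add(url)
        # Compare on the lower-cased path so that .JPG and .jpg match alike.
        if urlsplit(url).path.lower().endswith(SKIPPED_EXTENSIONS):
            continue
        result.append(url)
    return result
```
      </preformat>
      <p>The function preserves the input order of the surviving URLs; the per-host delay and the ordering of requests would remain the crawler's responsibility.</p>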
      <p>In Table 1 we present the top 10 domains for raw URLs extracted from
Wikipedia and indexed (redirected) URLs. We see that the indexed URLs tend
to refer to media sources. Notably (for example) doi.org is primarily a
redirection service, and hence we do not see this domain appearing in the indexed
URLs, which follow the redirects. Regarding Coverage, we managed to associate
3,899,953 (Q-identified) items with at least one indexed external reference. Of
these, 1,136,477 items (29.1%) had more than one reference indexed.
</p>
      <p>Complete sample: Given the incompleteness of the crawl for the full reference
corpus, we decided to also develop a complete crawl for a subset of Wikidata
items. Based on some initial samples, which were largely composed of items
without English Wikipedia articles, we decided to split our sample into five groups
based on the Q identifiers: A: Q1–Q10000; B: Q10001–Q100000; C: Q100001–Q1000000;
D: Q1000001–Q10000000; E: Q10000001–Q100000000. This sampling
is based on the idea that Wikidata ids were defined chronologically, and that the
most important entities (countries, major cities, recent presidents, etc.) would
fall into the earlier groups, with later groups being populated by successively
more obscure items. From each group we sample 1,000 items and then apply the
same process as before; in this case we run the download to completion.</p>
      <p>8 By default, Nutch partitions URLs by host and then randomly selects URLs within
each partition.</p>
      <p>
We assume an inverted index of content of potential external references and
now turn to the question of how to search within the documents. We assume that
the API receives a claim as exemplified in Table 3, with the item IDs/terms,
labels and aliases in English. Note that 1982 refers to a date value, where we use
the lexical form as the label. There is no clear individual way to construct the
search. Using just the labels may run the risk of missing some potentially relevant
documents with alias terms. On the other hand using alias terms may introduce
noise and return irrelevant documents. We experiment with four options:
1. construct a query for any of the three labels;
2. construct a query for any of the three labels or any property alias;
3. construct a query for any of the three labels or any alias;
4. construct a query for at least one label or alias for each of the three elements.
We provide examples of the searches for each of the four options in Table 4. While
it may perhaps seem quite broad to use the or connective, initial experience
suggested that using and, particularly on property labels (without aliases), meant
that few documents were returned as the search was too specific. Furthermore,
Solr uses the BM25F relevance metric (based on TF-IDF), which will rank
documents with more occurrences of more terms more highly.</p>
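      <p>To make the four search options concrete, the following Python sketch builds a search string for each option from the labels and aliases of a claim. The function names and the dictionary layout of the claim are our own assumptions for illustration; the lower-case or/and connectives simply mirror the notation of Table 4, and the operator syntax would likely need adapting to the query parser actually used.</p>
      <preformat>
```python
def quote(term):
    """Quote a label as a lower-cased phrase, following Table 4's notation."""
    return '"' + term.lower() + '"'


def any_of(terms):
    """Join phrases with the or connective."""
    return " or ".join(quote(t) for t in terms)


def group(terms):
    """Parenthesise a disjunction unless it has a single term."""
    joined = any_of(terms)
    return joined if len(terms) == 1 else "(" + joined + ")"


def build_query(claim, option):
    """claim maps 'subject', 'property' and 'value' to a dict with
    a 'label' string and an 'aliases' list (assumed layout)."""
    parts = ("subject", "property", "value")
    labels = {p: [claim[p]["label"]] for p in parts}
    everything = {p: labels[p] + claim[p]["aliases"] for p in parts}
    if option == 1:         # any of the three labels
        chosen = labels
    elif option == 2:       # labels, plus the property aliases only
        chosen = dict(labels, property=everything["property"])
    elif option in (3, 4):  # labels and aliases of all three elements
        chosen = everything
    else:
        raise ValueError("option must be 1, 2, 3 or 4")
    if option == 4:         # at least one term for each element
        return " and ".join(group(chosen[p]) for p in parts)
    return any_of([t for p in parts for t in chosen[p]])
```
      </preformat>
      <p>For a claim with subject label "University of Chile", property label "inception" and value label "1842", option 1 would yield the three quoted labels joined by or, while option 4 would conjoin one parenthesised disjunction per element.</p>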
      <p>We consider searching only over the references of the article associated with
the subject item (similar to the online option) to boost relevance,9 and searching
over all documents collected for the offline corpus to boost recall.</p>
      <p>As a further feature, Solr allows for returning a snippet of each document
determined to be a highly relevant part of the document for the search. The
typical application of this feature is for building results lists, where the user can
preview the most relevant part of the text, which also fits our use-case of letting
editors preview snippets of text from different documents that might support a
given claim. We illustrate this feature in Figure 2 for an example claim.
Held-out evaluation: As an initial test of the different search options, for our
set of 5,000 items, we can use the 163 URLs that appear on a Wikidata claim
and appear in our index. We take the Wikidata claim that they appear on, and
measure the recall of the 163 URLs in the top 3 suggestions for each option
searching within the external references of the article associated with the subject
item. We also consider a baseline that selects 3 random external references for
the article of the subject item. The results are shown in Table 5, where we see
that the best results are offered by search option 1, which retrieves the known
external reference as a top-3 suggestion in 72% of the cases. It is important to
note that any result returned may be correct as we only know a subset of the
correct references, so the recall should be interpreted as a lower bound.</p>
      <p>9 Another alternative would be to further include documents for the value item. We
discarded this option in order to simplify experiments, observing that the value of
a claim is often much more general than the subject item; for example, considering
the claim that Neil Young was born in Canada, it would not make sense to search
within the external references for Canada.</p>
      <table-wrap>
        <label>Table 4</label>
        <caption>
          <p>Examples of the searches for each of the four options</p>
        </caption>
        <table>
          <tbody>
            <tr><td>Option 1</td><td>"university of chile" or "inception" or "1842"</td></tr>
            <tr><td>Option 2</td><td>"university of chile" or "inception" or "date founded" or ... or "1842"</td></tr>
            <tr><td>Option 3</td><td>"university of chile" or "la u de chile" or ... or "inception" or "date founded" or ... or "1842"</td></tr>
            <tr><td>Option 4</td><td>("university of chile" or "la u de chile" or ...) and ("inception" or "date founded" or ...) and "1842"</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
Gold standard evaluation: Given the aforementioned limitations of the held-out
experiments, we opted to manually label a subset of claims, choosing 5
items from each of the five groups A–E. The labelling
indicates which external reference in the Wikipedia article associated with the
chosen (subject) item supports which claim on that item. We first tried a
random sampling of 5 items from each group, but labelling became infeasible as
items with hundreds of associated external references and claims were found,
where manually pairing them off was considered too complex; furthermore, in
the later groups, some items had only one reference associated. Instead we chose
to sample items with a number of associated references close to the mean for
that group. We show some statistics for the gold standard in Table 6, where we
indicate the average number of claims in Wikidata per item, the average
number of references indexed from the corresponding Wikipedia articles per item,
and the average percentage of claims per item supported by at least one reference
from the corresponding Wikipedia article. The All column considers the
statistics across all groups. It is worth noting that given the low numbers of references
for groups C–E, the results for searches become somewhat trivial; for this reason
we will include random baselines. The searches of our gold standard are then
formed by the claims for which at least one supporting reference is found.
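</p>
      <table-wrap>
        <label>Table 6</label>
        <caption>
          <p>Statistics for the gold standard (per-group values of the last row were not recoverable)</p>
        </caption>
        <table>
          <thead>
            <tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th><th>E</th><th>All</th></tr>
          </thead>
          <tbody>
            <tr><td>Average claims per item</td><td>48</td><td>18</td><td>17</td><td>13</td><td>8</td><td>21</td></tr>
            <tr><td>Average references indexed per item</td><td>23</td><td>7</td><td>4</td><td>2</td><td>3</td><td>7</td></tr>
            <tr><td>Average percentage of claims supported</td><td></td><td></td><td></td><td></td><td></td><td>37%</td></tr>
          </tbody>
        </table>
      </table-wrap>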
      <p>In Figure 3 we present the normalised Discounted Cumulative Gain (nDCG)
metric for the different search options with respect to the different groups. We
also include the random baseline for comparison. Intuitively speaking, a score of
1 indicates the best possible ordering, ranking all supporting references
above all non-supporting references. The results are divided by group. We see
that for groups A and B, the search methods perform much better than the
random baseline. The best results are given for group E, but this is largely due
to the trivial nature of the task when given few references, as noted by the high
performance of the random baseline. In general there is not much difference of
note between the different search options, though we can perhaps indicate that
Option 1 performs (slightly) best and Option 4 performs worst.</p>
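      <p>For readers less familiar with nDCG, the following Python sketch computes it for a single claim under binary relevance (a reference either supports the claim or does not) and the common log2 rank discount. The exact gain and discount used for Figure 3 are not stated here, so this formulation is an assumption rather than a description of the evaluation code.</p>
      <preformat>
```python
import math


def dcg(gains):
    """Discounted cumulative gain with the common log2 rank discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(ranked_urls, supporting):
    """Binary-relevance nDCG for the ranked suggestions of one claim."""
    gains = [1.0 if url in supporting else 0.0 for url in ranked_urls]
    # The ideal ordering places every supporting reference first.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```
      </preformat>
      <p>Under this formulation, a ranking that places all supporting references above all non-supporting ones scores 1, and a claim with only one or two candidate references leaves little room for a low score, which is consistent with the strong random baseline observed for the later groups.</p>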
      <p>Given that the nDCG measure is somewhat difficult to interpret, in Figure 4
we present the Any@k measure: noting that in order to establish verifiability, in
general one reference is sufficient, we look at the percentage of claims/searches for
which at least one suggestion in the top-k was relevant. We believe that this gives
a more direct measure of how the reference suggestions perform in practice. We
see that considering the top-3 results, Options 1–3 succeed in finding supporting
references in close to 88% of cases, increasing to 90% for top-4 results.</p>
      <p>Snippets: For the 25 gold standard items, we manually evaluated the text
snippets that Solr selects to indicate why a document is relevant, where we found
that only 9% of these snippets were sufficient to support a claim by themselves,
although they were often useful to help understand more about the content of
the webpage without visiting it. In particular, we found that reference
documents often support claims in a more implicit way, requiring a more general
understanding of different parts of the text, rather than just one part.</p>
      <p>Global results: Finally we used our 25 gold standard items to run searches over the
full corpus of 2.5 million references using the search options previously outlined.
The results were largely negative: the best results were obtained using Option 1,
which yielded an Any@5 value of 19%. Manually reviewing the results, we found that
most of the documents returned by Solr were irrelevant to the topic at hand,
due to the broader corpus being used. It may, however, be possible to
fine-tune the queries to return better results.</p>
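      <p>The Any@k measure described above can be sketched in a few lines of Python. The function name and the input layout (one pair of ranked suggestions and known supporting references per claim) are assumptions made for illustration.</p>
      <preformat>
```python
def any_at_k(results, k):
    """Percentage of claims with a supporting reference in the top k.

    results is a list of (ranked_urls, supporting_urls) pairs,
    one pair per claim/search.
    """
    if not results:
        return 0.0
    hits = sum(
        1 for ranked, supporting in results
        if any(url in supporting for url in ranked[:k])
    )
    return 100.0 * hits / len(results)
```
      </preformat>
      <p>Since a single supporting reference suffices for verifiability, the measure counts a claim as a success as soon as any one of its top-k suggestions is relevant, regardless of how many further relevant suggestions follow.</p>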
      <sec id="sec-7-1">
        <title>Conclusions</title>
        <p>
          We now briefly summarise our insights regarding research questions RQ1–5.
RQ1 Offline vs. online indexing: Online indexing was slow, with references for
well-known entities taking up to 20 minutes to download and index. However,
achieving a complete corpus by offline indexing is very time-consuming.
RQ2 Coverage: Similar to the results of Piscopo et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we find low overlap
between references in Wikipedia and Wikidata; in terms of our gold standard
developed for a small sample of 25 items, we estimate that about 37% of
claims had supporting references in their corresponding Wikipedia articles.
RQ3 Search phrasing: The best results were given by using an or connective on
primary labels, though including aliases gave similar results.
        </p>
        <p>RQ4 Relevance: BM25F gave good results when searching for claims within the
references of the corresponding Wikipedia article, but poor results for the
given search phrasing options when considering the full corpus.</p>
        <p>RQ5 Suggestion Quality: When a claim has a supporting reference in the
corresponding Wikipedia article for the subject item, the proposed method will
find at least one such supporting reference in the top-5 results around 90%
of the time; however, the generated snippets rarely suffice to support the
claim, meaning the editor will often have to visit and review the documents.</p>
        <p>When claims are supported by references in the corresponding Wikipedia
article, traditional Information Retrieval methods appear sufficient to give good
recommendations. The more general issue we encountered in this initial
research is that few Wikidata claims have relevant references in the corresponding
Wikipedia article. This suggests two possible future directions:
- Offline: Given that some Wikidata items do not have an associated Wikipedia
article, that many Wikipedia articles have few references, etc., it would be
interesting to develop a broader corpus with more documents from the Web,
perhaps from the Common Crawl. In order to ensure that the documents are
authoritative, this corpus might only include content from websites with a
threshold number of references detected in Wikipedia. A challenge will be
to ensure the relevance of search results, where the connection between the
Wikidata items and the indexed documents would be lost; however, this
challenge could be addressed with more advanced relevance measures based
on the fields of the documents, comparing the similarity of each document's
content to relevant Wikipedia articles, amongst other such techniques.
- Online: We have found that our online option is too slow due to the need to
crawl references at runtime. Another option similar to the online option,
in terms of obviating the need for a local index of documents, would be to
use the existing infrastructure of major search engines to search the Web at
runtime, filtering for sites that are considered authoritative. A major benefit
of such an approach is that the (costly) retrieval, indexing and refreshing of
content could be delegated to the search engine. The downside of such an
approach would be the issues of respecting rate-limits for the search API,
plus the inability to pre-process the content for the specific task.</p>
        <p>
          In summary, a method for automatically suggesting references for Wikidata
claims would help human editors to be more productive, and would help to
make better use of their (often volunteered) time. As a result, the coverage of
references on Wikidata would increase, and its quality as a secondary source of
knowledge would improve. While this paper does not provide a definitive
solution, we have gained some important insights into the strengths and limitations
of basing suggestions on Wikipedia's references. We further provide online
material to facilitate future research, including the retrieved content of a large subset
of documents found in the reference sections of English Wikipedia.
Material online. Available on Zenodo [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Acknowledgements. This work was funded by Fondecyt Grant No. 1181896 and
ANID Millennium Science Initiative Program ICN17_002.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chou</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale</article-title>
          . In: WikiWorkshop. pp.
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Curotto</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>External References of English Wikipedia (ref-wiki-en)</article-title>
          (Aug
          <year>2020</year>
          ), https://doi.org/10.5281/zenodo.4001139
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Delpeuch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Structured citations in the English Wikipedia</article-title>
          (Jun
          <year>2016</year>
          ), https://doi.org/10.5281/zenodo.55004
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kousha</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Are Wikipedia citations important evidence of the impact of scholarly articles and books?</article-title>
          <source>J. Assoc. Inf. Sci. Technol.</source>
          <volume>68</volume>
          (
          <issue>3</issue>
          ),
          <fpage>762</fpage>
          –
          <lpage>779</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1002/asi.23694
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lemus-Rojas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pintscher</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Wikidata and Libraries: Facilitating Open Knowledge</article-title>
          .
          <source>In: Leveraging Wikipedia: Connecting Communities of Knowledge</source>
          . pp.
          <fpage>143</fpage>
          –
          <lpage>158</lpage>
          . ALA Editions (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lewoniewski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Węcel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramowicz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Analysis of References Across Wikipedia Languages</article-title>
          .
          <source>In: Information and Software Technologies (ICIST)</source>
          . pp.
          <fpage>561</fpage>
          –
          <lpage>573</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Luzuriaga</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosales</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Wikitables</article-title>
          (Oct
          <year>2019</year>
          ), https://doi.org/10.5281/zenodo.3483254
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mietchen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willighagen</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          :
          <article-title>Scholia, Scientometrics and Wikidata</article-title>
          .
          <source>In: ESWC Satellite Events</source>
          . pp.
          <fpage>237</fpage>
          –
          <lpage>259</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Piccardi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colavizza</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Quantifying Engagement with Citations on Wikipedia</article-title>
          .
          <source>In: The Web Conference (WWW)</source>
          . pp.
          <fpage>2365</fpage>
          –
          <lpage>2376</lpage>
          . ACM / IW3C2 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Piscopo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaffee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phethean</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Provenance Information in a Collaborative Knowledge Graph: An Evaluation of Wikidata External References</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <fpage>542</fpage>
          –
          <lpage>558</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Piscopo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>What we talk about when we talk about Wikidata quality: a literature survey</article-title>
          .
          <source>In: International Symposium on Open Collaboration (OpenSym)</source>
          . pp.
          <fpage>17:1</fpage>
          –
          <lpage>17:11</lpage>
          . ACM (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Piscopo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vougiouklis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaffee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phethean</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hare</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>What do Wikidata and Wikipedia Have in Common?: An Analysis of their Use of External References</article-title>
          .
          <source>In: International Symposium on Open Collaboration (OpenSym)</source>
          . pp.
          <fpage>1:1</fpage>
          –
          <lpage>1:10</lpage>
          . ACM (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Redi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fetahu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morgan</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taraborelli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability</article-title>
          .
          <source>In: The Web Conference (WWW)</source>
          . pp.
          <fpage>1567</fpage>
          –
          <lpage>1578</lpage>
          . ACM (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colavizza</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia</article-title>
          . CoRR abs/2007.07022 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: A Free Collaborative Knowledgebase</article-title>
          .
          <source>Comm. ACM</source>
          <volume>57</volume>
          ,
          <fpage>78</fpage>
          –
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Warncke-Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cosley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Tell me more: an actionable quality model for Wikipedia</article-title>
          .
          <source>In: International Symposium on Open Collaboration (OpenSym)</source>
          . pp.
          <fpage>8:1</fpage>
          –
          <lpage>8:10</lpage>
          . ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>