<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Suggesting Citations for Wikidata Claims based on Wikipedia&apos;s External References</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Curotto</surname></persName>
							<email>pcurotto@dcc.uchile.cl</email>
							<affiliation key="aff0">
								<orgName type="laboratory">DCC</orgName>
								<orgName type="institution">Universidad de Chile &amp; IMFD</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aidan</forename><surname>Hogan</surname></persName>
							<email>ahogan@dcc.uchile.cl</email>
							<affiliation key="aff0">
								<orgName type="laboratory">DCC</orgName>
								<orgName type="institution">Universidad de Chile &amp; IMFD</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Suggesting Citations for Wikidata Claims based on Wikipedia&apos;s External References</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9BA37512AD5D644F539C7AE73DE2E43E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Wikidata</term>
					<term>Wikipedia</term>
					<term>citations</term>
					<term>references</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Given a Wikidata claim, we explore automated methods for locating references that support that claim. Our goal is to assist human editors in referencing claims, and thus increase the ratio of referenced claims in Wikidata. As an initial approach, we mine links from the references section of English Wikipedia articles, download and index their content, and use standard relevance-based measures to find supporting documents. We consider various forms of search phrasings, as well as different scopes of search. We evaluate our methods in terms of the coverage of reference documents collected from Wikipedia. We also develop a gold standard of sample items for evaluating the relevance of suggestions. Our results in general reveal that the coverage of Wikipedia reference documents for claims is quite low, but where a reference document is available, we can often suggest it within the first few results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Wikidata <ref type="bibr" target="#b14">[15]</ref> is a collaboratively-edited knowledge graph. Much like its sibling project Wikipedia, Wikidata is continuously extended and curated by a large community of volunteers. Unlike Wikipedia, Wikidata manages structured statements about items. Items include people, places, proteins, papers, printers, planets, political parties, and many more besides. A statement consists of an item, a property, and a value. For example, a statement might claim that the album Pulse (item) has the performer (property) Pink Floyd (value). Values may be items, datatypes (numbers, booleans, dates, times, etc.), or special terms indicating an unknown value or that no such value exists. Statements can also have qualifiers that scope the validity of the claim or provide additional details; a qualifier may state, for example, a time period in which a claim was true, or the previous or next item with that value for that property.</p><p>As with Wikipedia, Wikidata does not aim to be a primary source of knowledge, but rather a secondary source of knowledge: statements in Wikidata should be interpreted as claims held true according to a specific, external and authoritative source. <ref type="foot" target="#foot_0">1</ref> Thus it is important that statements be independently verifiable, meaning that parties other than the editor who added the statement should be able to verify the validity of that statement. Some statements are considered self-verifiable. The cases listed by Wikidata editor guides include: <ref type="foot" target="#foot_1">2</ref>-Common human knowledge: statements that are obvious to most and can be considered self-evident; for example, that Paris is an instance of city, that Paris is the capital of France, that city is a subclass of urban area, etc. 
-The value is an external source: statements that point to an external source, such as identifiers associated with the item in external catalogues. -The value refers to an external source: statements that point to a Wikidata item that can itself verify the statement, such as an album stating its artist, a book stating its author, etc.</p><p>In the case of statements not falling into one of these three categories, the onus is on the editor who adds (modifies or restores) a statement to establish verifiability by adding a reference for the statement based on an authoritative source.</p><p>Authoritative sources include books, publications, news media, laws, other popular media, reputable websites, etc. Questionable sources, sponsored sources, self-published sources, etc., may be rejected as non-authoritative sources.</p><p>At the time of writing (August 2020), Wikidata describes 1.124 billion statements about 88 million items and has over 23 thousand active users. Of these statements, 771 million are referenced to external sources (68.56%), 68 million are referenced to Wikipedia (6.02%), leaving 286 million without reference (25.42%). <ref type="foot" target="#foot_2">3</ref> Of these items, 71 million (80.34%) have at least one referenced statement. While collecting 771 million referenced statements is an impressive achievement, more can be done to improve the coverage of references <ref type="bibr" target="#b4">[5]</ref>. Valid but unreferenced statements run the risk of being removed; conversely, leaving them in the knowledge graph runs the risk of hosting invalid statements, which may in turn cause adverse effects for applications that use Wikidata. <ref type="foot" target="#foot_3">4</ref> Furthermore, Wikidata does not currently offer its editors much assistance in finding references for a claim; a tool to automatically suggest references would help make the most of these volunteers' time and effort. 
Finally, the aforementioned statistics count statements with some reference, but Piscopo et al. <ref type="bibr" target="#b9">[10]</ref> estimate that only 61% of Wikidata's references can be considered authoritative and relevant.</p><p>In summary, we see a need for research on methods to (semi-)automatically find authoritative references for statements in Wikidata. Herein we describe our work on an initial such method based on searching over reference documents scraped from Wikipedia. This approach seems initially quite natural: Wikipedia articles are linked with Wikidata items; Wikipedia references follow similar principles of verifiability and authority as for Wikidata; Wikipedia is an older project and thus one might expect more extensive reference lists to have developed over time; a considerable number of Wikidata statements already reference a Wikipedia article that should itself cite an authoritative source; the factual nature of Wikipedia means that one could expect overlap in terms of the claims made about the same entities/items on both sites; Wikidata guides suggest searching Wikipedia for sources; etc. Our results, however, show that Wikipedia's references are quite limited in terms of coverage for Wikidata claims.</p></div>
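As an illustration of the statement model described above, the following is a minimal sketch (not from the paper) of a subject-property-value statement with optional qualifiers; the field names and the `Statement` class are illustrative assumptions, not Wikidata's actual data model.

```python
# Minimal, illustrative sketch of a Wikidata-style statement:
# an item, a property, a value, plus optional qualifiers that
# scope or contextualise the claim. Names here are assumptions.
from dataclasses import dataclass, field

@dataclass
class Statement:
    item: str        # e.g. the album Pulse
    property: str    # e.g. performer
    value: str       # an item, a datatype value, or "unknown"/"no value"
    qualifiers: dict = field(default_factory=dict)

s = Statement(
    item="Pulse",
    property="performer",
    value="Pink Floyd",
    qualifiers={"point in time": "1995"},  # hypothetical qualifier
)
print(s.value)  # Pink Floyd
```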
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>A number of works have analysed referencing in Wikipedia. In a study of the quality of Wikipedia articles, Warncke-Wang et al. <ref type="bibr" target="#b15">[16]</ref> found that the number of references was a key signal for predicting the quality of articles as manually labelled through Wikipedia's peer review process. Lewoniewski et al. <ref type="bibr" target="#b5">[6]</ref> analyse the differences and overlap between Wikipedia references across seven different language versions; of the languages studied, they found that over half (25.5 million) of the total (41.2 million) references came from English Wikipedia. Kousha and Thelwall <ref type="bibr" target="#b3">[4]</ref> analyse whether or not Wikipedia citations predict the impact of academic publications, finding that few indexed articles are cited. Redi et al. <ref type="bibr" target="#b12">[13]</ref> construct a taxonomy of reasons why claims should be cited in Wikipedia, and then develop a machine learning model to predict which claims require citation and for which reason. More recently, Piccardi et al. <ref type="bibr" target="#b8">[9]</ref> found low user engagement with external citations in English Wikipedia, with about 1-in-300 page visits resulting in a click-through to a reference on the article.</p><p>With respect to references on Wikidata, WikiCite is a Wikimedia initiative to develop and expand the citation data available through Wikidata. <ref type="foot" target="#foot_4">5</ref> As part of the WikiCite initiative, Nielsen et al. <ref type="bibr" target="#b7">[8]</ref> discuss how Wikipedia references provide limited data about the source being referred to, contrasting this with Wikidata, which contains structured data about books, articles, authors, publishers, identifiers, etc.; they provide statistics on such data, and build a scientometric application called Scholia on top of them. Piscopo et al. 
<ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b11">12]</ref> have provided in-depth studies comparing external references on both Wikipedia and Wikidata, finding that there is low overlap between both in terms of the references used and the domains of those references <ref type="bibr" target="#b11">[12]</ref>; they further estimate that 61% of Wikidata's external references are considered relevant and authoritative <ref type="bibr" target="#b9">[10]</ref>. Lemus-Rojas and Pintscher <ref type="bibr" target="#b4">[5]</ref> identify the "citation gap" as a problematic issue, suggesting that librarians are well-positioned to help address this gap, as they have already done for Wikipedia. Piscopo and Simperl <ref type="bibr" target="#b10">[11]</ref> discuss the importance of references to various dimensions of Wikidata quality.</p><p>Regarding datasets, Delpeuch <ref type="bibr" target="#b2">[3]</ref> and more recently Singh et al. <ref type="bibr" target="#b13">[14]</ref> have published metadata for citations extracted from English Wikipedia associated with external identifiers (e.g., DOIs). Chou et al. <ref type="bibr" target="#b0">[1]</ref> also recently published a dataset of English Wikipedia articles annotated with the aforementioned model of Redi et al. <ref type="bibr" target="#b12">[13]</ref>. However, these datasets do not provide the textual content of the external references, focusing rather on metadata extracted from Wikipedia.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Approach &amp; Research Questions</head><p>We propose to scrape external reference URLs from English Wikipedia, and to download and index their content. Thereafter, given a Wikidata claim for which an editor requires suggestions of potential references, using the labels and aliases of the items involved, we will convert the claim to a search using English terms and apply the search over the inverted indexes of the content of the external documents, using standard relevance measures to prioritise documents. Finally, to assist the editor, we will return not only the document itself, but also a snippet of text from the document that contains the relevant keywords.</p><p>We present the high-level architecture in Figure <ref type="figure" target="#fig_0">1</ref>. The API provides an interface that accepts a claim from Wikidata (along with associated metadata) and returns suggestions of potential references. In order to provide these references, a Scraper collects and parses the URLs of external references from articles on English Wikipedia. These URLs are passed to a Crawler that downloads the URLs and saves their content into an Index. The API can then formulate a search for the claim over the Index, which returns relevant documents that are then presented as suggestions. We consider both an offline and an online mode. In the offline mode, the Scraper and Crawler process all of (English) Wikipedia, generating the Index over the full corpus that can be searched at runtime. We also consider an online/lazy mode, where the Scraper rather accepts a list of relevant Wikipedia article URLs from the API, which are passed to the Crawler, which in turn populates the Index at runtime before the search is performed. 
The offline mode has the benefit of lower latency, but a priori it is not clear that performing such a crawl of all external references is feasible; also the Index would require periodic updates. The online mode is easier to keep up-to-date, where the Index rather acts as a cache, but is associated with slower runtime responses as the Crawler operates while the editor is waiting for suggestions.</p><p>Our initial goal is to study the feasibility of this overall approach, and to establish baseline methods and datasets for further research. Within this proposal a number of initial research questions arise: RQ1 Offline vs. online indexing: Is it feasible to scrape, download and index the content of all of English Wikipedia's external references offline? Or would it be better to scrape, download and index the content online/lazily for the external references of the Wikipedia articles relevant to the claim at hand? RQ2 Coverage: How many external references can we source from the article corresponding to each Wikidata item? Can we build a corpus with good coverage of Wikidata items in general? RQ3 Search phrasing: How best should we phrase the search? Should we use only primary labels, or also aliases? What connectives should we use? RQ4 Relevance: Are traditional IR measures sufficient to generate good suggestions? Should we search only in references for Wikipedia articles corresponding to the item(s) involved in the claim, or across the entire corpus? RQ5 Suggestion Quality: How often can we generate good suggestions of references for claims? Are the rankings of suggestions suitable? Can we also suggest relevant text snippets from within the documents to support the claim?</p><p>In this initial work our goal is to gain insights regarding these research questions, rather than seeking definitive answers.</p></div>
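To make the claim-to-search conversion concrete, the following is a minimal sketch of building an OR-connected phrase query from a claim's English labels (and, optionally, aliases). The claim structure and query syntax are illustrative assumptions; the example claim (University of Chile, inception, 1842) is the one used in the paper's own tables.

```python
# Sketch of converting a Wikidata claim into a keyword search, under
# assumed input/output formats: the claim is a dict of subject/property/
# value elements, each with an English label and optional aliases.

def build_query(claim, include_aliases=False):
    """Build an OR-connected phrase query from the labels
    (and optionally aliases) of a subject-property-value claim."""
    phrases = []
    for element in ("subject", "property", "value"):
        terms = [claim[element]["label"]]
        if include_aliases:
            terms += claim[element].get("aliases", [])
        phrases += [f'"{t.lower()}"' for t in terms]
    return " OR ".join(phrases)

claim = {  # example claim from the paper: University of Chile, inception, 1842
    "subject": {"label": "University of Chile", "aliases": ["La U de Chile"]},
    "property": {"label": "inception", "aliases": ["date founded"]},
    "value": {"label": "1842"},
}

print(build_query(claim))
# "university of chile" OR "inception" OR "1842"
```

With `include_aliases=True`, the alias phrases are appended, broadening recall at the risk of noise, in line with the trade-off discussed in the text.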
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Scraping, Crawling &amp; Indexing</head><p>We first explore the offline approach. We start with a dump of Wikidata, from which we extracted the mapping to Wikipedia articles. These articles were then retrieved from a 2018 HTML corpus of Wikipedia <ref type="bibr" target="#b6">[7]</ref>. A custom scraper extracts the external reference URLs from the articles. We used Apache Nutch<ref type="foot" target="#foot_5">6</ref> for crawling, which uses Apache Solr<ref type="foot" target="#foot_6">7</ref> as an underlying index. To avoid overloading remote servers (akin to a Denial of Service (DoS) attack), we configured Nutch to wait 5 seconds between requests to the same website. Nutch indexes the content, title, host and URL of the successfully retrieved webpages in Solr; we enrich this index with the Q codes of the Wikidata items corresponding to the Wikipedia article of the external reference.</p><p>Results: A total of 32,329,989 raw external reference URLs were extracted from 5,461,401 articles. Removing repeated and ill-formed URLs yielded 23,036,318 well-formed, unique URLs. Loading the URLs into the crawler, a filter was applied to remove URLs with extensions referring to file-types (images, videos, etc.) that we cannot currently process. This yielded 17,781,974 crawlable URLs. Recursive crawling was disabled; in other words, we set Nutch to download the content of the URLs, rather than to recursively follow further URLs. The download was run from August 2019 to December 2019, in which time 2,475,461 URLs were successfully downloaded and indexed in Nutch. We let Nutch decide the order in which the URLs were accessed; no particular configuration was applied in this regard. <ref type="foot" target="#foot_7">8</ref> Though not all URLs were processed at this point, progress in the crawler had slowed to a halt. 
An issue we did not anticipate was that of redirects: Nutch does not provide a clean mechanism to retrace redirects, though from the logs it is possible to trace the URLs accessed and, in most cases, recover the original URLs. This was important in order to match the original Wikipedia articles and external URLs with the redirected content location of the indexed document. In total, we could recover links for 2,058,896 documents (83%) from their original Wikipedia article.</p><p>In Table <ref type="table" target="#tab_0">1</ref> we present the top 10 domains for raw URLs extracted from Wikipedia and indexed (redirected) URLs. We see that the indexed URLs tend to refer to media sources. Notably (for example) doi.org is primarily a redirection service, and hence we do not see this domain appearing in the indexed URLs, which follow the redirects. Regarding coverage, we managed to associate 3,899,953 (Q-identified) items with at least one indexed external reference. Of these, 1,136,477 items (29.1%) had more than one reference indexed.</p><p>Complete sample: Given the incompleteness of the crawl for the full reference corpus, we decided to also develop a complete crawl for a subset of Wikidata items. Based on some initial samples, which were largely composed of items without English Wikipedia articles, we decided to split our sample into five groups based on the Q identifiers: A: Q1-Q10000; B: Q10001-Q100000; C: Q100001-Q1000000; D: Q1000001-Q10000000; E: Q10000001-Q100000000. This sampling is based on the idea that Wikidata IDs were assigned chronologically, and that the most important entities (countries, major cities, recent presidents, etc.) would fall into the earlier groups, with later groups being populated by successively more obscure items. From each group we sample 1,000 items and then apply the same process as before; in this case we run the download to completion. 
Table <ref type="table" target="#tab_1">2</ref> indicates the number of raw URLs extracted from Wikipedia, and the number of URLs indexed. In line with the design of the groups, in general we see more references available for earlier groups; for example, group A contains many countries, whose articles in Wikipedia contain potentially hundreds of references. The difference between raw URLs and indexed URLs refers to duplicate or malformed URLs, filtered URLs, and URLs that returned 4xx or 5xx errors. For the 5,000 Wikidata items, we found 74 (1.4%) that used some reference also found in the indexed URLs from Wikipedia. On the other hand, of the 37,983 indexed URLs, only 163 (0.43%) were found to be used on one of the Wikidata items as a reference URL. We checked for exact URL matches, which may lead to under-reporting the overlap, but these results offer strong support for the results of Piscopo et al. <ref type="bibr" target="#b11">[12]</ref> indicating a low overlap in references between Wikipedia and Wikidata. This does not necessarily imply, however, that Wikidata claims do not have support in the content of the references from Wikipedia.</p><p>It is worth noting that the download of references for some of the most popular items took tens of minutes to complete, which suggests that the online mode will often be too slow for interactive runtimes.</p></div>
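The URL cleaning steps described above (removing duplicates and ill-formed URLs, then filtering out file extensions that cannot be indexed as text) can be sketched as follows. The extension list is an illustrative assumption, not the exact suffix filter configured in Nutch.

```python
# Sketch of the URL cleaning pipeline: deduplicate, drop ill-formed
# URLs, and filter file types (images, videos, etc.) that cannot be
# processed as text. The SKIP_EXTENSIONS set is an assumption.
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mp3", ".zip"}

def crawlable_urls(raw_urls):
    seen, result = set(), []
    for url in raw_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # ill-formed URL
        if url in seen:
            continue  # duplicate
        seen.add(url)
        if any(parsed.path.lower().endswith(ext) for ext in SKIP_EXTENSIONS):
            continue  # binary file type we cannot process
        result.append(url)
    return result

urls = [
    "https://example.org/article",
    "https://example.org/article",    # duplicate
    "not a url",                      # ill-formed
    "https://example.org/photo.JPG",  # filtered file type
]
print(crawlable_urls(urls))  # ['https://example.org/article']
```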
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Search &amp; Recommendation</head><p>We assume an inverted index of the content of potential external references and now turn to the question of how to search over the documents. We assume that the API receives a claim as exemplified in Table <ref type="table">3</ref>, with the item IDs/terms, labels and aliases in English. Note that 1842 refers to a date value, where we use the lexical form as the label. There is no single obvious way to construct the search. Using just the labels may run the risk of missing some potentially relevant documents with alias terms. On the other hand, using alias terms may introduce noise and return irrelevant documents. We experiment with four options:</p><p>1. construct a query for any of the three labels; 2. construct a query for any of the three labels or any property alias; 3. construct a query for any of the three labels or any alias; 4. construct a query for at least one label or alias for each of the three elements.</p><p>We provide examples of the searches for each of the four options in Table <ref type="table">4</ref>. While it may perhaps seem quite broad to use the or connective, initial experience suggested that using and, particularly on property labels (without aliases), meant that few documents were returned as the search was too specific. Furthermore, Solr uses the BM25F relevance metric (based on TF-IDF), which will rank documents with more occurrences of more terms more highly. We consider searching only over the references of the article associated with the subject item (similar to the online option) to boost relevance, <ref type="foot" target="#foot_8">9</ref> and searching over all documents collected for the offline corpus to boost recall.</p><p>As a further feature, Solr allows for returning, for each document, a snippet determined to be a highly relevant part of that document with respect to the search. 
The typical application of this feature is for building results lists, where the user can preview the most relevant part of the text, which also fits our use-case of letting editors preview snippets of text from different documents that might support a given claim. We illustrate this feature in Figure <ref type="figure" target="#fig_1">2</ref> for an example claim.</p><p>Table 4. Example searches for the four options considered, based on Table 3. Option 1: "university of chile" or "inception" or "1842". Option 2: "university of chile" or "inception" or "date founded" or ... or "1842". Option 3: "university of chile" or "la u de chile" or ... or "inception" or "date founded" or ... or "1842". Option 4: ("university of chile" or "la u de chile" or ...) and ("inception" or "date founded" or ...) and "1842".</p><p>Held-out evaluation: As an initial test of the different search options, for our set of 5,000 items, we can use the 163 URLs that both appear on a Wikidata claim and appear in our index. We take the Wikidata claim that they appear on, and measure the recall of the 163 URLs in the top 3 suggestions for each option, searching within the external references of the article associated with the subject item. We also consider a baseline that selects 3 random external references for the article of the subject item. The results are shown in Table <ref type="table" target="#tab_2">5</ref>, where we see that the best results are offered by search option 1, which retrieves the known external reference as a top-3 suggestion in 72% of the cases. It is important to note that any result returned may be correct, as we only know a subset of the correct references, so the recall should be interpreted as a lower bound.</p><p>Gold standard evaluation: Given the aforementioned limitations of the held-out experiments, we opted to manually label a subset of claims, where we chose 5 items from each of the five groups A-E, which we then labelled. 
The labelling indicates which external reference in the Wikipedia article associated with the chosen (subject) item supports which claim on that item. We first tried a random sampling of 5 items from each group, but labelling became infeasible as items with hundreds of associated external references and claims were found, where manually pairing them off was considered too complex; furthermore, in the later groups, some items had only one associated reference. Instead we chose to sample items with a number of associated references close to the mean for that group. We show some statistics for the gold standard in Table <ref type="table" target="#tab_3">6</ref>, where we indicate the average number of claims in Wikidata per item, the average number of references indexed from the corresponding Wikipedia articles per item, and the average percentage of claims per item supported by at least one reference from the corresponding Wikipedia article. The All column considers the statistics across all groups. It is worth noting that given the low numbers of references for groups C-E, the results for searches become somewhat trivial; for this reason we will include random baselines. The searches of our gold standard are then formed by the claims for which at least one supporting reference is found. (From Table 6: the average percentage of claims supported is 42% for A, 27% for B, 26% for C, 52% for D, 31% for E, and 37% overall.)</p><p>Fig. <ref type="figure">3</ref>. nDCG for search methods on the gold standard.</p><p>In Figure <ref type="figure">3</ref> we present the normalised Discounted Cumulative Gain (nDCG) metric for the different search options with respect to the different groups. We also include the random baseline for comparison. Intuitively speaking, a score of 1 indicates the best ordering possible, ranking all supporting references above all non-supporting references. The results are divided by group. We see that for groups A and B, the search methods perform much better than the random baseline. 
The best results are given for group E, but this is largely due to the trivial nature of the task when few references are available, as indicated by the high performance of the random baseline. In general there is little difference of note between the different search options, though we can perhaps indicate that Option 1 performs (slightly) best and Option 4 performs worst.</p><p>Given that the nDCG measure is somewhat difficult to interpret, in Figure <ref type="figure">4</ref> we present the Any@k measure: noting that, in order to establish verifiability, one reference is in general sufficient, we look at the percentage of claims/searches for which at least one suggestion in the top-k was relevant. We believe that this gives a more direct measure of how the reference suggestions perform in practice. We see that, considering the top-3 results, Options 1-3 succeed in finding supporting references in close to 88% of cases, increasing to 90% for the top-4 results.</p><p>Fig. <ref type="figure">4</ref>. Any@k for search methods on the gold standard.</p><p>Snippets: For the 25 gold standard items, we manually evaluated the text snippets that Solr selects to indicate why a document is relevant, where we found that only 9% of these snippets were sufficient to support a claim by themselves, although they were often useful to help understand more about the content of the webpage without visiting it. In particular, we found that reference documents often support claims in a more implicit way, requiring a more general understanding of different parts of the text, rather than just one part.</p><p>Global results: Finally we used our 25 gold-standard items to run searches over the full corpus of 2.5 million references using the search options previously outlined. The results were largely negative: the best results were obtained using Option 1, which yielded an Any@5 value of 19%. 
Manually reviewing the results, we found that most of the documents returned by Solr were irrelevant to the topic at hand, due to the broader corpus being used. It may, however, be possible to fine-tune the queries to return better results.</p></div>
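The Any@k measure used in this evaluation can be sketched as follows: the fraction of claims for which at least one of the top-k suggested references is a known supporting reference. The input format (per claim, a ranked list of suggestions paired with the set of known supporting references) is an assumption for illustration.

```python
# Sketch of the Any@k measure: the percentage of claims/searches for
# which at least one suggestion in the top-k is relevant. The input
# format below is an illustrative assumption.

def any_at_k(rankings, k):
    """rankings: list of (ranked_urls, relevant_url_set) pairs."""
    hits = sum(
        1 for ranked, relevant in rankings
        if any(url in relevant for url in ranked[:k])
    )
    return hits / len(rankings)

rankings = [
    (["a", "b", "c", "d"], {"c"}),   # hit at rank 3
    (["a", "b", "c", "d"], {"d"}),   # hit only at rank 4
    (["a", "b", "c", "d"], {"x"}),   # no known reference retrieved
]
print(any_at_k(rankings, 3))  # 1 of 3 claims has a hit in the top 3
```

Note that, as with the held-out recall, this is a lower bound: suggestions not in the relevant set may still be correct references that were simply not labelled.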
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>We now briefly summarise our insights regarding research questions RQ1-5. RQ1 Offline vs. online indexing: Online indexing was slow, with references for well-known entities taking up to 20 minutes to download and index. However, achieving a complete corpus by offline indexing is very time consuming. RQ2 Coverage: Similar to the results of Piscopo et al. <ref type="bibr" target="#b11">[12]</ref>, we find low overlap between references in Wikipedia and Wikidata; in terms of our gold standard developed for a small sample of 25 items, we estimate that about 37% of claims had supporting references in their corresponding Wikipedia articles. RQ3 Search phrasing: The best results were given by using an or connective on primary labels, though including aliases gave similar results. RQ4 Relevance: BM25F gave good results when searching for claims within the references of the corresponding Wikipedia article, but poor results for the given search phrasing options when considering the full corpus. RQ5 Suggestion Quality: When a claim has a supporting reference in the corresponding Wikipedia article for the subject item, the proposed method will find at least one such supporting reference in the top-5 results around 90% of the time; however, the generated snippets rarely suffice to support the claim, meaning the editor will often have to visit and review the documents.</p><p>When claims are supported by references in the corresponding Wikipedia article, traditional Information Retrieval methods appear sufficient to give good recommendations. The more general issue we encountered in this initial research is that few Wikidata claims have relevant references in the corresponding Wikipedia article. 
This suggests two possible future directions:</p><p>-Offline: Given that some Wikidata items do not have an associated Wikipedia article, that many Wikipedia articles have few references, etc., it would be interesting to develop a broader corpus with more documents from the Web, perhaps from the Common Crawl. In order to ensure that the documents are authoritative, this corpus might only include content from websites with a threshold number of references detected in Wikipedia. A challenge will be to ensure the relevance of search results, where the connection between the Wikidata items and the indexed documents would be lost; however, this challenge could be addressed with more advanced relevance measures based on the fields of the documents, comparing the similarity of each document's content to relevant Wikipedia articles, amongst other such techniques. -Online: We have found that our online option is too slow due to the need to crawl references at runtime. Another option similar to the online option (in terms of obviating the need for a local index of documents) would be to use the existing infrastructure of major search engines to search the Web at runtime, filtering for sites that are considered authoritative. A major benefit of such an approach is that the (costly) retrieval, indexing and refreshing of content could be delegated to the search engine. The downside of such an approach would be the issues of respecting rate-limits for the search API, plus the inability to pre-process the content for the specific task.</p><p>In summary, a method for automatically suggesting references for Wikidata claims would help human editors to be more productive, and would help to make better use of their (often volunteered) time. As a result, the coverage of references on Wikidata would increase, and its quality as a secondary source of knowledge would improve. 
While this paper does not provide a definitive solution, we have gained some important insights into the strengths and limitations of basing suggestions on Wikipedia's references. We further provide online material to facilitate future research, including the retrieved content of a large subset of documents found in the reference sections of English Wikipedia.</p><p>Material online. Available on Zenodo <ref type="bibr" target="#b1">[2]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Proposed architecture for suggesting references</figDesc></figure>
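The retrieval setup summarised above (an OR connective over the claim's primary labels, ranked by a BM25 relevance measure) can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation, which uses Apache Solr's fielded BM25F over crawled reference documents; the plain BM25 scorer, the toy corpus, and helper names such as `claim_query` below are our own illustrative assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on any non-alphanumeric character
    return ''.join(c.lower() if c.isalnum() else ' ' for c in text).split()

class BM25:
    """Plain BM25 ranking over a small in-memory corpus (not BM25F)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        n = len(docs)
        # Smoothed IDF, as used by common BM25 variants
        self.idf = {w: math.log((n - f + 0.5) / (f + 0.5) + 1) for w, f in df.items()}

    def score(self, query, i):
        s = 0.0
        for w in tokenize(query):
            f = self.tf[i].get(w, 0)
            if f == 0:
                continue
            norm = f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl))
            s += self.idf.get(w, 0.0) * norm
        return s

    def top_k(self, query, k=5):
        # Rank document indices by descending BM25 score
        scores = [(self.score(query, i), i) for i in range(len(self.tf))]
        return [i for _, i in sorted(scores, reverse=True)[:k]]

def claim_query(subject, prop, value):
    # OR connective over primary labels: whitespace-separated terms,
    # each optional, mirroring Solr's default OR semantics
    return f"{subject} {prop} {value}"
```

For the running example "Chile capital Santiago", a reference snippet mentioning all three labels would be ranked above snippets matching only one or two of them, which is the behaviour the top-5 suggestion quality results rely on.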
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Suggestions generated for "Chile capital Santiago" in a prototype user interface.Table 3. Example of a Wikidata claim with IDs, labels and aliases</figDesc><graphic coords="8,152.06,115.84,311.28,154.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Top 10 domains in terms of raw URLs vs. indexed URLs</figDesc><table><row><cell>№ Raw URLs</cell><cell>Indexed URLs</cell></row><row><cell>1 archive.org</cell><cell>bbc.co.uk</cell></row><row><cell>2 doi.org</cell><cell>nytimes.com</cell></row><row><cell>3 nih.gov</cell><cell>archive.org</cell></row><row><cell>4 nytimes.com</cell><cell>billboard.com</cell></row><row><cell>5 bbc.co.uk</cell><cell>newspapers.com</cell></row><row><cell>6 webcitation.org</cell><cell>thegazette.co.uk</cell></row><row><cell>7 allmusic.com</cell><cell>sports-reference.com</cell></row><row><cell>8 youtube.com</cell><cell>reuters.com</cell></row><row><cell>9 theguardian.com</cell><cell>baseball-reference.com</cell></row><row><cell>10 archive.is</cell><cell>bbc.com</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Crawl for selected sample of five groups</figDesc><table><row><cell></cell><cell>A</cell><cell>B</cell><cell>C</cell><cell>D</cell><cell>E</cell><cell>Total</cell></row><row><cell>Raw URLs</cell><cell>40,666</cell><cell>12,763</cell><cell>7,111</cell><cell>4,917</cell><cell>5,365</cell><cell>70,822</cell></row><row><cell>Indexed URLs</cell><cell>22,268</cell><cell>6,945</cell><cell>3,682</cell><cell>2,399</cell><cell>2,945</cell><cell>37,983</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5 .</head><label>5</label><figDesc>Recall@3 for the four search options and a random baseline</figDesc><table><row><cell></cell><cell>Option 1</cell><cell>Option 2</cell><cell>Option 3</cell><cell>Option 4</cell><cell>Random</cell></row><row><cell>R@3</cell><cell>0.72</cell><cell>0.64</cell><cell>0.66</cell><cell>0.57</cell><cell>0.37</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 6 .</head><label>6</label><figDesc>High-level statistics from the gold standard</figDesc><table><row><cell></cell><cell>A</cell><cell>B</cell><cell>C</cell><cell>D</cell><cell cols="2">E All</cell></row><row><cell>Average claims</cell><cell>48</cell><cell>18</cell><cell>17</cell><cell>13</cell><cell>8</cell><cell>21</cell></row><row><cell>Average indexed references</cell><cell>23</cell><cell>7</cell><cell>4</cell><cell>2</cell><cell>3</cell><cell>7</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">See https://www.wikidata.org/wiki/Wikidata:Verifiability</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See https://www.wikidata.org/wiki/Help:Sources/Items_not_needing_sources</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">See https://wikidata-todo.toolforge.org/stats.php</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">As an anecdotal example of the latter, we refer to Siri reporting the death of Stan Lee, apparently based on an invalid statement added to Wikidata: https://io9.gizmodo.com/siri-erroneously-told-people-stan-lee-was-dead-1827322243.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">See https://meta.wikimedia.org/wiki/WikiCite</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://nutch.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://lucene.apache.org/solr/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">By default, Nutch partitions URLs by host and then randomly selects URLs within each partition.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">Another alternative would be to further include documents for the value item. We discarded this option in order to simplify experiments, observing that the value of a claim is often much more general than the subject item; for example, considering the claim that Neil Young was born in Canada, it would not make sense to search within the external references for Canada.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgements. This work was funded by Fondecyt Grant No. 1181896 and ANID Millennium Science Initiative Program ICN17 002.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonçalves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Redi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Wiki-Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">External References of English Wikipedia (ref-wiki-en)</title>
		<author>
			<persName><forename type="first">P</forename><surname>Curotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4001139</idno>
		<ptr target="https://doi.org/10.5281/zenodo.4001139" />
		<imprint>
			<date type="published" when="2020-08">Aug 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Delpeuch</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.55004</idno>
		<ptr target="https://doi.org/10.5281/zenodo.55004" />
		<title level="m">Structured citations in the English Wikipedia</title>
				<imprint>
			<date type="published" when="2016-06">Jun 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Are Wikipedia citations important evidence of the impact of scholarly articles and books?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kousha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Thelwall</surname></persName>
		</author>
		<idno type="DOI">10.1002/asi.23694</idno>
		<ptr target="https://doi.org/10.1002/asi.23694" />
	</analytic>
	<monogr>
		<title level="j">J. Assoc. Inf. Sci. Technol</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="762" to="779" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Wikidata and Libraries: Facilitating Open Knowledge</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lemus-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pintscher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Leveraging Wikipedia: Connecting Communities of Knowledge</title>
				<imprint>
			<publisher>ALA Editions</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="143" to="158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Analysis of References Across Wikipedia Languages</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lewoniewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wecel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Abramowicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information and Software Technologies (ICIST)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="561" to="573" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">J</forename><surname>Luzuriaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Muñoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rosales</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.3483254</idno>
		<ptr target="https://doi.org/10.5281/zenodo.3483254" />
	</analytic>
	<monogr>
		<title level="j">Wikitables</title>
		<imprint>
			<date type="published" when="2019-10">Oct 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Scholia, Scientometrics and Wikidata</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">Å</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mietchen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Willighagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC Satellite Events</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="237" to="259" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Quantifying Engagement with Citations on Wikipedia</title>
		<author>
			<persName><forename type="first">T</forename><surname>Piccardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Redi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>West</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Web Conference (WWW)</title>
				<imprint>
			<publisher>ACM / IW3C2</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2365" to="2376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Provenance Information in a Collaborative Knowledge Graph: An Evaluation of Wikidata External References</title>
		<author>
			<persName><forename type="first">A</forename><surname>Piscopo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaffee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Phethean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference (ISWC)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="542" to="558" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">What we talk about when we talk about Wikidata quality: a literature survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Piscopo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Open Collaboration (Open-Sym)</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">What do Wikidata and Wikipedia Have in Common?: An Analysis of their Use of External References</title>
		<author>
			<persName><forename type="first">A</forename><surname>Piscopo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vougiouklis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaffee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Phethean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Hare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Open Collaboration (OpenSym)</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="1" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Citation needed: A taxonomy and algorithmic assessment of Wikipedia&apos;s verifiability</title>
		<author>
			<persName><forename type="first">M</forename><surname>Redi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fetahu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Morgan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Taraborelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Web Conference (WWW)</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1567" to="1578" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia</title>
		<author>
			<persName><forename type="first">H</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>West</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
		<idno>CoRR abs/2007.07022</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Wikidata: A Free Collaborative Knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comm. ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Tell me more: an actionable quality model for Wikipedia</title>
		<author>
			<persName><forename type="first">M</forename><surname>Warncke-Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cosley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Riedl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Open Collaboration (OpenSym)</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
