OpenTapioca: Lightweight Entity Linking for Wikidata

Antonin Delpeuch [0000-0002-8612-8827]
Department of Computer Science, University of Oxford, UK
antonin.delpeuch@cs.ox.ac.uk

Abstract. We propose a simple Named Entity Linking system that can be trained from Wikidata only. This demonstrates the strengths and weaknesses of this data source for this task and provides an easily reproducible baseline to compare other systems against. Our model is lightweight to train, to run and to keep synchronous with Wikidata in real time.

Keywords: Entity linking · Wikidata

1 Introduction

Named Entity Linking is the task of detecting mentions of entities from a knowledge base in free text, as illustrated in Figure 1. Most of the entity linking literature focuses on target knowledge bases which are derived from Wikipedia, such as DBpedia [1] or YAGO [19]. These bases are curated automatically by harvesting information from the info-boxes and categories on each Wikipedia page and are therefore not editable directly.

Wikidata [21] is an editable, multilingual knowledge base which has recently gained popularity as a target database for entity linking [8,22,15,13]. As these new approaches to entity linking also introduce novel learning methods, it is hard to tell apart the benefits that come from the new models and those which come from the choice of knowledge base and the quality of its data.

We review the main differences between Wikidata and static knowledge bases extracted from Wikipedia, and analyze their implications for entity linking. We illustrate these differences by building a simple entity linker, OpenTapioca,1 which only uses data from Wikidata, and show that it is competitive with other systems with access to larger data sources for some tasks. OpenTapioca can be trained easily from a Wikidata dump only, and can be efficiently kept up to date in real time as Wikidata evolves. We also propose tools to adapt existing entity linking datasets to Wikidata, and offer a new entity linking dataset, consisting of affiliation strings extracted from research articles.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 The implementation and datasets are available at https://github.com/wetneb/opentapioca and the demo can be found at https://opentapioca.org/.

[Fig. 1. Example of an annotated sentence with mentions in red and entities in blue: in "Associated Press writer Julie Pace contributed from Washington.", Associated Press is linked to Q40469, Julie Pace to Q34666768 (related to Q40469 via employer, P108) and Washington to Washington D.C. (Q61).]

2 Particularities of Wikidata

Wikidata is a wiki itself, meaning that it can be edited by anyone, but it differs from usual wikis by its data model: information about an entity can only be input as structured data, in a format that is similar to RDF.

Wikidata stores information about the world in a collection of items, which are structured wiki pages. Items are identified by their Q-id, such as Q40469, and they are made of several data fields. The label stores the preferred name for the entity. It is supported by a description, a short phrase describing the item to disambiguate it from namesakes, and aliases are alternate names for the entity. These three fields are stored separately for each language supported by Wikidata. Items also hold a collection of statements: these are RDF-style claims which have the item as subject. They can be backed by references and be made more precise with qualifiers, which all rely on a controlled vocabulary of properties (similar to RDF predicates). Finally, items can have site links, connecting them to the corresponding page for the entity in other Wikimedia projects (such as Wikipedia).
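As an illustration of these fields (a minimal sketch, not part of the system described in this paper), the snippet below fetches an item through Wikidata's public Special:EntityData endpoint and reads its label, description, aliases, statements and site links:

```python
import requests

def fetch_item(qid, lang="en"):
    """Fetch a Wikidata item as JSON from the Special:EntityData endpoint."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]

    label = entity["labels"].get(lang, {}).get("value")
    description = entity["descriptions"].get(lang, {}).get("value")
    aliases = [a["value"] for a in entity["aliases"].get(lang, [])]
    statements = entity["claims"]            # property id -> list of claims
    sitelinks = entity.get("sitelinks", {})  # e.g. "enwiki" -> page title
    return label, description, aliases, statements, sitelinks

label, description, aliases, statements, sitelinks = fetch_item("Q40469")
print(label, "--", description)
print("aliases:", aliases)
print("statements:", sum(len(claims) for claims in statements.values()))
print("site links:", len(sitelinks))
```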
Note that Wikidata items do not need to be associated with any Wikipedia page: in fact, Wikidata's policy on the notability of the subjects it covers is much more permissive than Wikipedia's. For a more detailed introduction to Wikidata's data model we refer the reader to [21,5].

Our goal is to evaluate the usefulness of this crowdsourced structured data for entity linking. We will therefore refrain from augmenting it with any external data (such as phrases and topical information extracted from Wikipedia pages), as is generally done when working with DBpedia or YAGO. By avoiding a complex mash-up of data coming from disparate sources, our entity linking system is also simpler and easier to reproduce. Finally, it is possible to keep OpenTapioca in real-time synchronization with the live version of Wikidata, with a lag of only a few seconds. This means that users are able to fix or improve the knowledge base, for instance by adding a missing alias on an item, and immediately see the benefits on their entity linking task. This contrasts with all other systems we are aware of, where the user either cannot directly intervene on the underlying data, or there is a significant delay in propagating these updates to the entity linking system.

3 Related work

We review the dominant architecture of entity linking heuristics following [?], and assess its applicability to Wikidata. Entities in the knowledge base are associated with a set (or probability distribution) of possible surface forms. Given a text to annotate, candidate entities are generated by looking for occurrences of their surface forms in the text. Because of homonymy, many of these candidate occurrences turn out to be false matches, so a classifier is used to predict their correctness. We can group the features such classifiers tend to use into the following categories:

– local compatibility: these features assess the adequacy between an entity and the phrase that refers to it. This relies on the dictionary of surface forms mentioned above, and does not take into account the broader context of the phrase to link.
– topic similarity: this measures the compatibility between the topics in the text to annotate and the topics associated with the candidate entity. Topics can be represented in various ways, for instance with a bag-of-words model.
– mapping coherence: entities mentioned in the same text are often related, so linking decisions are inter-dependent. This relies on a notion of proximity between entities, which can be defined with random walks in the knowledge base, for instance.

3.1 Local compatibility

These features compare the phrase to annotate with the known surface forms for the entity. Collecting such forms is often done by extracting mentions from Wikipedia [3]. Link labels, redirects, disambiguation pages and bold text in abstracts can all be useful to discover alternate names for an entity. It is also possible to crawl the web for Wikipedia links to improve the coverage, often at the expense of data quality [16].

Beyond collecting a set of possible surface forms, these approaches count the number of times an entity e was mentioned by a phrase w. This makes it possible to use a Bayesian methodology: the compatibility of a candidate entity e with a given mention w is P(e|w) = P(e, w) / P(w), which can be estimated from the statistics collected.
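As a minimal sketch of this count-based estimation (the mention counts below are made up for the example, not taken from any dataset used in this paper):

```python
from collections import defaultdict

# Hypothetical mention counts: how often a phrase was used as anchor text
# for a given entity, e.g. harvested from Wikipedia links.
mention_counts = {
    ("USA", "Q30"): 950,    # United States of America
    ("USA", "Q9212"): 50,   # United States Army
}

phrase_totals = defaultdict(int)
for (phrase, _entity), count in mention_counts.items():
    phrase_totals[phrase] += count

def local_compatibility(entity, phrase):
    """Estimate P(e|w) = count(e, w) / count(w) from the collected statistics."""
    total = phrase_totals.get(phrase, 0)
    if total == 0:
        return 0.0
    return mention_counts.get((phrase, entity), 0) / total

print(local_compatibility("Q30", "USA"))    # 0.95
print(local_compatibility("Q9212", "USA"))  # 0.05
```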
In Wikidata, items have labels and aliases in multiple languages. As this information is directly curated by editors, these phrases tend to be of high quality. However, they do not come with occurrence counts. As items link to each other using their Wikidata identifiers only, it is not possible to compare the number of times USA was used to refer to the United States of America (Q30) or to the United States Army (Q9212) inside Wikidata.

Unlike Wikipedia's page titles, which must be unique in a given language, two Wikidata items can have the same label in the same language. For instance, Curry is the English label of both the item about the Curry programming language (Q2368856) and the item about the village in Alaska (Q5195194), and the description field is used to disambiguate them.

Manual curation of surface forms implies a fairly narrow coverage, which can be an issue for general-purpose entity linking. For instance, people are commonly referred to by their given or family name only, and these names are not systematically added as aliases: at the time of writing, Trump is an alias for Donald Trump (Q22686), but Cameron is not an alias for David Cameron (Q192). As a Wikidata editor, the main incentive to add aliases to an item is to make it easier to find the item with Wikidata's auto-suggest field, so that it can be edited or linked to more easily. Aliases are not designed to offer a complete set of possible surface forms found in text: for instance, adding common misspellings of a name is discouraged.2

2 The guidelines are available at https://www.wikidata.org/wiki/Help:Aliases

3.2 Topic similarity

The compatibility of the topic of a candidate entity with the rest of the document is traditionally estimated by similarity measures from information retrieval such as TF-IDF [17,14] or keyword extraction [18,10,3]. Wikidata items only consist of structured data, except in their descriptions. This makes it difficult to compute topical information using the methods above. Vector-based representations of entities can be extracted from the knowledge base alone [2,24], but it is not clear how to compare them to topic representations for plain text, which would be computed differently. In more recent work, neural word embeddings were used to represent topical information for both text and entities [4,13,9]. This requires access to large amounts of text both to train the word vectors and to derive the entity vectors from them. These vectors have been shown to encode significant semantic information by themselves [11], so we refrain from using them in this study.

3.3 Mapping coherence

Entities mentioned in the same context are often topically related, therefore it is useful not to treat linking decisions in isolation but rather to try to maximize topical coherence in the chosen items. This is the issue on which entity linking systems differ the most, as it is harder to model.

First, we need to estimate the topical coherence of a sequence of linking decisions. This is often done by first defining a pairwise relatedness score between the target entities. For instance, a popular metric introduced by [23] considers the sets a, b of wiki links made from or to two entities and computes their relatedness as

rel(a, b) = 1 − (log(max(|a|, |b|)) − log(|a ∩ b|)) / (log(|K|) − log(min(|a|, |b|)))

where |K| is the number of entities in the knowledge base.
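A small sketch of this relatedness measure, computed from the link sets of two entities (the link sets and knowledge base size below are hypothetical):

```python
import math

def relatedness(links_a, links_b, kb_size):
    """Witten-Milne relatedness between two entities, given the sets of wiki
    links made from or to each of them and the total number of entities |K|."""
    overlap = len(links_a & links_b)
    if overlap == 0:
        return 0.0  # no shared links: treat the entities as unrelated
    numerator = math.log(max(len(links_a), len(links_b))) - math.log(overlap)
    denominator = math.log(kb_size) - math.log(min(len(links_a), len(links_b)))
    return 1 - numerator / denominator

# Hypothetical link sets in a knowledge base of 5 million entities.
links_a = {"Q30", "Q61", "Q40469"}
links_b = {"Q30", "Q61", "Q9212", "Q22686"}
print(relatedness(links_a, links_b, kb_size=5_000_000))
```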
When linking to Wikidata instead of Wikipedia, it is tempting to reuse these heuristics, replacing wikilinks by statements. However, Wikidata's linking structure is quite different from Wikipedia's: statements are generally a lot sparser than links and they have a precise semantic meaning, as editors are restricted by the available properties when creating new statements. We propose in the next section a similarity measure that we find to perform well experimentally.

Once a notion of semantic similarity is chosen, we need to integrate it in the inference process. Most approaches build a graph of candidate entities, where edges indicate semantic relatedness; the heuristics differ in the way this graph is used for the matching decisions. [12] use an approximate algorithm to find the densest subgraph of the semantic graph. This determines choices of entities for each mention. In other approaches, the initial evidence given by the local compatibility score is propagated along the edges of the semantic graph [10,6] or aggregated at a global level with a Conditional Random Field [4].

4 OpenTapioca: an entity linking model for Wikidata

We propose a model that adapts previous approaches to Wikidata. Let d be a document (a piece of text). A spot s ∈ d is a pair of start and end positions in d. It defines a phrase d[s] and a set of candidate entities E[s]: these are all Wikidata items for which d[s] is a label or alias. Given two spots s, s′, we denote by |s − s′| the number of characters between them.3 We build a binary classifier which predicts, for each s ∈ d and e ∈ E[s], whether s should be linked to e.

3 This is the distance between the start position of the later spot and the end position of the earlier one.

4.1 Local compatibility

Although Wikidata makes it impossible to count how often a particular label or alias is used to refer to an entity, these surface forms are carefully curated by the community. They are therefore fairly reliable.

Given an entity e and a phrase d[s], we need to compute p(e|d[s]). Having no access to such a probability distribution, we choose to approximate this quantity by p(e) / p(d[s]), where p(e) is the probability that e is linked to, and p(d[s]) is the probability that d[s] occurs in a text. In other words, we estimate the popularity of the entity and the commonness of the phrase separately.

We estimate the popularity of an entity e by a log-linear combination of its number of statements n_e, its number of site links s_e and its PageRank log p(e). The PageRank is computed on the entire Wikidata graph, using statement values and qualifiers as edges. The probability p(d[s]) is estimated by a simple word unigram language model that can be trained on any large unannotated dataset.4 The local compatibility is therefore represented by a vector of features F(e, d[s]) and computed as follows, where λ is a weights vector:

F(e, d[s]) = (−log p(d[s]), log p(e), n_e, s_e, 1)
p(e|d[s]) ∝ exp(F(e, d[s]) · λ)

4 For the sake of respecting our constraint to use Wikidata only, we train this language model from Wikidata item labels.
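A minimal sketch of this local scoring (the input values and the weight vector below are illustrative placeholders, not the trained λ of the actual system):

```python
import math

def local_features(p_phrase, pagerank, n_statements, n_sitelinks):
    """Local compatibility features F(e, d[s]): negative log phrase probability,
    log PageRank, statement count, site link count, and a constant bias term."""
    return [-math.log(p_phrase), math.log(pagerank),
            n_statements, n_sitelinks, 1.0]

def local_compatibility(features, weights):
    """Unnormalized p(e | d[s]) as exp(F . lambda), with lambda learned by a classifier."""
    return math.exp(sum(f * w for f, w in zip(features, weights)))

# Illustrative values for one candidate entity and phrase.
F = local_features(p_phrase=1e-6, pagerank=3e-7, n_statements=42, n_sitelinks=15)
weights = [0.1, 0.5, 0.01, 0.02, -1.0]  # hypothetical lambda
print(local_compatibility(F, weights))
```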
4.2 Semantic similarity

The issue with the features above is that they ignore the context in which a mention is found. To make them context-sensitive, we adapt the approach of [6] to our setup. The general idea is to define a graph on the candidate entities, linking candidate entities which are semantically related, and then to find a combination of candidate entities which have both high local compatibility and are densely related in the graph.

For each pair of entities e, e′ we define a similarity metric S(e, e′). Let l(e) be the set of items that e links to in its statements. Consider a one-step random walk starting on e, with probability β to stay on e and probability (1 − β)/|l(e)| to reach one of the linked items. We define S(e, e′) as the probability that two such one-step random walks, starting from e and e′, end up on the same item. This can be computed explicitly as

S(e, e′) = β² δ_{e=e′} + β(1 − β) (δ_{e∈l(e′)}/|l(e′)| + δ_{e′∈l(e)}/|l(e)|) + (1 − β)² |l(e) ∩ l(e′)| / (|l(e)| |l(e′)|)

We then build a weighted graph G_d whose vertices are the pairs (s ∈ d, e ∈ E[s]). In other words, we add a vertex for each candidate entity at a given spot. We fix a maximum distance D for edges: vertices (s, e) and (s′, e′) can only be linked if |s − s′| ≤ D and s ≠ s′. In this case, we define the weight of such an edge as (η + S(e, e′)) · (D − |s − s′|)/D, where η is a smoothing parameter. In other words, the edge weight is proportional to the smoothed similarity between the entities, discounted by the distance between the mentions.

The weighted graph G_d can be represented as an adjacency matrix. We transform it into a column-stochastic matrix M_d by normalizing its columns to sum to one.5 This defines a Markov chain on the candidate entities, which we will use to propagate the local evidence.

5 The matrix M_d is a square matrix whose dimension is the number of candidate entities in the text. This can grow large, but the matrix remains sparse due to the locality assumption.

4.3 Classifying entities in context

[6] first combine the local features into a local evidence score, and then spread this local evidence using the Markov chain (where LC(d) is the vector of all p(e|d[s]) for each spot and candidate at that spot):

G(d) = (αI + (1 − α)M_d)^k · LC(d)    (1)

We propose a variant of this approach, where each individual local compatibility feature is propagated independently along the Markov chain.6 Let F be the matrix of all local features for each candidate entity: F = (F(e_1, d[s_1]), ..., F(e_n, d[s_n])). After k iterations of the Markov chain, this defines features M_d^k F. Rather than relying on these features for a fixed number of steps k, we record the features at each step, which defines the vector

(F, M_d · F, M_d² · F, ..., M_d^k · F)

This alleviates the need for an α parameter while keeping the number of features small. We train a linear support vector classifier on these features, and this defines the final score of each candidate entity. For each spot, our system picks the highest-scoring candidate entity that the classifier predicts as a match, if any.

6 It is important for this purpose that features are initially scaled to the unit interval.
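A compact sketch of the similarity and propagation steps (the link sets, β, the toy matrix M_d and the feature matrix below are illustrative; the real system derives M_d from the spots detected in a document and the edge weights defined above):

```python
import numpy as np

def similarity(e1, e2, links, beta=0.5):
    """One-step random walk similarity S(e, e'): probability that two one-step
    random walks started from e and e' land on the same item."""
    l1, l2 = links[e1], links[e2]
    s = beta ** 2 * (e1 == e2)
    s += beta * (1 - beta) * ((e1 in l2) / len(l2) + (e2 in l1) / len(l1))
    s += (1 - beta) ** 2 * len(l1 & l2) / (len(l1) * len(l2))
    return s

def stacked_features(M_d, F, k=2):
    """Propagate each local feature along the Markov chain and stack
    (F, M_d F, ..., M_d^k F) into one feature vector per candidate."""
    blocks, current = [F], F
    for _ in range(k):
        current = M_d @ current
        blocks.append(current)
    return np.hstack(blocks)

# Hypothetical statement links for three candidate entities.
links = {"Q40469": {"Q30", "Q61"}, "Q34666768": {"Q40469"}, "Q61": {"Q30"}}
print(similarity("Q40469", "Q34666768", links))

# Toy stochastic matrix (symmetric here) over 3 candidates, 2 local features each.
M_d = np.array([[0.50, 0.25, 0.25],
                [0.25, 0.50, 0.25],
                [0.25, 0.25, 0.50]])
F = np.array([[0.9, 0.1],
              [0.4, 0.7],
              [0.2, 0.3]])
print(stacked_features(M_d, F).shape)  # (3, 6): original features plus 2 propagated copies
```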
5 Experimental setup

Most entity linking datasets are annotated against DBpedia or YAGO. Wikidata contains items which do not have any corresponding Wikipedia article (in any language), so these items do not have any DBpedia or YAGO URI either.7 Therefore, converting an entity linking dataset from DBpedia to Wikidata requires more effort than simply following owl:sameAs links: we also need to annotate mentions of Wikidata items which do not have a corresponding DBpedia URI.

7 This is the case of Julie Pace (Q34666768) in Figure 1.

We used the RSS-500 dataset of news excerpts annotated against DBpedia and encoded in NIF format [20]. We first translated all DBpedia URIs to Wikidata items.8 Then, we used OpenRefine [7] to extract the entities marked as not covered by DBpedia and matched them against Wikidata. After human review, this added 63 new links (out of 476 out-of-KB entities) to the 524 converted from DBpedia.

8 We built the nifconverter tool to do this conversion for any NIF dataset.

We also annotated a new dataset from scratch. The ISTEX dataset consists of one thousand author affiliation strings extracted from research articles and exposed by the ISTEX text and data mining service.9 In this dataset, only 64 of the 2,624 Wikidata mentions do not have a corresponding DBpedia URI.

9 The original data is available under an Etalab license at https://www.istex.fr/

We use the Wikidata JSON dump of 2018-02-24 for our experiments, indexed with Solr (Lucene). We restrict the index to humans, organizations and locations, by selecting only items whose type was a subclass of (P279) human (Q5), organization (Q43229) or geographical object (Q618123). Labels and aliases in all languages are added to a case-sensitive finite state transducer index.

We trained our classifier and its hyper-parameters by five-fold cross-validation on the training sets of the ISTEX and RSS datasets. We used GERBIL [20] to evaluate OpenTapioca against other approaches. We report the InKB micro and macro F1 scores on test sets, with GERBIL's weak annotation match method.10

10 The full details can be found at http://w3id.org/gerbil/experiment?id=201904110006

Table 1 compares our approach to other entity linkers available online and compatible with the GERBIL evaluation platform. Our approach performs best on datasets of small text fragments (tweets in Microposts 2016, academic affiliations in ISTEX-1000) where context dependencies between entities are narrow. We suspect that the weakness of our language model (trained on Wikidata terms only) also hinders performance on longer texts.

                 AIDA-CoNLL          Microposts 2016
                 Micro    Macro      Micro    Macro
  AIDA           0.725    0.684      0.056    0.729
  Babelfy        0.481    0.421      0.031    0.526
  DBP Spotlight  0.528    0.456      0.053    0.306
  FREME NER      0.382    0.237      0.037    0.790
  OpenTapioca    0.482    0.399      0.087    0.515

                 ISTEX-1000          RSS-500
                 Micro    Macro      Micro    Macro
  AIDA           0.531    0.494      0.455    0.447
  Babelfy        0.461    0.447      0.314    0.304
  DBP Spotlight  0.574    0.575      0.281    0.261
  FREME NER      0.422    0.321      0.307    0.274
  OpenTapioca    0.870    0.858      0.335    0.310

Table 1. F1 scores on test datasets

6 Conclusion

The surface forms curated by Wikidata editors are sufficient to reach honourable recall, without the need to expand them with mentions extracted from Wikipedia. Our restriction to people, locations and organizations probably helps in this regard, and we anticipate worse performance for broader domains. Our approach works best for scientific affiliations, where spelling is more canonical than in newswire. The availability of Twitter identifiers directly in Wikidata helps us to reach acceptable performance in this domain. The accuracy degrades on longer texts, which require relying more on the ambient topical context. In future work, we would like to explore the use of entity embeddings to improve our approach in this regard.

References
1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. The Semantic Web, pp. 722–735 (2007). https://doi.org/10.1007/978-3-540-76298-0_52
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating Embeddings for Modeling Multi-relational Data. In: Advances in Neural Information Processing Systems. p. 9 (2013)
3. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
4. Ganea, O.E., Hofmann, T.: Deep Joint Entity Disambiguation with Local Neural Attention. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) 2017 (Apr 2017). https://doi.org/10.18653/v1/D17-1277
5. Geiß, J., Spitz, A., Gertz, M.: NECKAr: A named entity classifier for Wikidata. In: International Conference of the German Society for Computational Linguistics and Language Technology. pp. 115–129. Springer (2017). https://doi.org/10.1007/978-3-319-73706-5_10
6. Han, X., Sun, L., Zhao, J.: Collective entity linking in web text: A graph-based method. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 765–774. ACM (2011). https://doi.org/10.1145/2009916.2010019
7. Huynh, D., Morris, T., Mazzocchi, S., Sproat, I., Magdinier, M., Guidry, T., Castagnetto, J.M., Home, J., Johnson-Roberson, C., Moffat, W., Moyano, P., Leoni, D., Peilonghui, Alvarez, R., Vishal Talwar, Wiedemann, S., Verlic, M., Delpeuch, A., Shixiong Zhu, Pritchard, C., Sardesai, A., Thomas, G., Berthereau, D., Kohn, A.: OpenRefine (2019). https://doi.org/10.5281/zenodo.595996
8. Klang, M., Nugues, P.: Named Entity Disambiguation in a Question Answering System. p. 3 (2014)
9. Kolitsas, N., Ganea, O.E., Hofmann, T.: End-to-End Neural Entity Linking. arXiv:1808.07699 [cs] (Aug 2018). https://doi.org/10.18653/v1/k18-1050
10. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management. pp. 233–242. ACM (2007). https://doi.org/10.1145/1321440.1321475
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
12. Moro, A., Raganato, A., Navigli, R.: Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the Association for Computational Linguistics 2, 231–244 (May 2014)
13. Raiman, J., Raiman, O.: DeepType: Multilingual Entity Linking by Neural Type System Evolution. arXiv:1802.01021 [cs] (Feb 2018)
14. Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and Global Algorithms for Disambiguation to Wikipedia. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 1375–1384. Association for Computational Linguistics (2011)
15. Sorokin, D., Gurevych, I.: Mixing Context Granularities for Improved Entity Linking on Question Answering Data across Entity Categories. arXiv:1804.08460 [cs] (Apr 2018)
16. Spitkovsky, V.I., Chang, A.X.: A Cross-Lingual Dictionary for English Wikipedia Concepts. LREC pp. 3168–3175 (2012)
17. Štajner, T., Mladenić, D.: Entity resolution in texts using statistical learning and ontologies. In: Asian Semantic Web Conference. pp. 91–104. Springer (2009)
18. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI. vol. 6, pp. 1419–1424 (2006)
19. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web. pp. 697–706. ACM (2007)
20. Usbeck, R., Eickmann, B., Ferragina, P., Lemke, C., Moro, A., Navigli, R., Piccinno, F., Rizzo, G., Sack, H., Speck, R., Troncy, R., Röder, M., Waitelonis, J., Wesemann, L., Ngonga Ngomo, A.C., Baron, C., Both, A., Brümmer, M., Ceccarelli, D., Cornolti, M., Cherix, D.: GERBIL: General Entity Annotator Benchmarking Framework. In: Proceedings of the 24th International Conference on World Wide Web - WWW '15. pp. 1133–1143. ACM Press, Florence, Italy (2015). https://doi.org/10.1145/2736277.2741626
21. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledge base. Communications of the ACM (2014). https://doi.org/10.1145/2629489
22. Weichselbraun, A., Kuntschik, P., Braşoveanu, A.M.: Mining and Leveraging Background Knowledge for Improving Named Entity Linking. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. pp. 27:1–27:11. WIMS '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3227609.3227670
23. Witten, I.H., Milne, D.N.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links (2008)
24. Xiao, H., Huang, M., Zhu, X.: TransG: A Generative Model for Knowledge Graph Embedding. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2316–2325. Association for Computational Linguistics, Berlin, Germany (2016). https://doi.org/10.18653/v1/P16-1219