Time-Aware Entity Linking

Renato Stoffalette João
L3S Research Center, Leibniz University of Hannover, Appelstraße 9A, Hannover, 30167, Germany
joao@L3S.de

Abstract. Entity Linking is the task of automatically identifying entity mentions in a piece of text and linking them to their corresponding entries in a reference knowledge base like Wikipedia. Although there is a plethora of works on entity linking, existing state-of-the-art approaches do not explicitly consider the time aspect, and specifically the temporality of an entity's prior probability (popularity) and embedding (semantic network). Consequently, they show limited performance in annotating old documents like news or web archives, while the problem is even bigger in the case of short texts with limited context, such as archives of social media posts and query logs. This thesis focuses on this problem and proposes a model that leverages time-aware prior probabilities and word embeddings in the entity linking task.

1 Introduction and Problem Statement

One way to enhance the machine understandability of natural language documents is to add semantics to them. The Linked Open Data and Semantic Web communities have played a major role here and gained a lot of attention over the past years. Researchers have produced a great number of methods for publishing structured data on the web and interlinking related concepts coming from different sources. A diverse range of applications can benefit from these initiatives, for example query expansion and auto-completion, knowledge base population, and the ranking of search engine results, among others.

This thesis focuses on the entity linking problem [20], which consists in automatically extracting entity mentions from natural language documents and linking them to the correct concepts in a reference knowledge base, for instance Wikipedia [13], YAGO [11] or DBPedia [12]. Although there is a plethora of works trying to solve this problem, existing state-of-the-art approaches do not explicitly consider the time aspect in their modeling and thus work well mainly for documents published close to the model creation and training times. Nevertheless, the increasing number of digital archives worldwide, including news, web, and social media archives, and the need for their effective analysis and exploration, requires effective, time-aware entity annotation methods [7].

Consider for example the text snippet "Ronaldo scored and Real Madrid won". If this text belongs to an article from 2017, the mention Ronaldo probably refers to the Portuguese soccer player Cristiano Ronaldo. However, if the document is dated from 2002, then it probably refers to the Brazilian soccer player Ronaldo Luís Nazário de Lima. To handle such cases, existing entity linking approaches would need to be retrained using data and knowledge from older versions of the reference knowledge base(s), which is very laborious and often infeasible, since such training information may be neither sufficient nor available (especially for older time periods). The problem is even bigger when trying to annotate short texts with limited context, like old social media posts or query logs. For example, the query "Germany Brazil" submitted in July 2014 probably refers to the football match of the 2014 FIFA World Cup, and thus the mentions should be linked to the football teams, not the countries.
An entity's prior probability (i.e., its popularity) is an important component in most entity linking approaches and is considered a strong indicator (as well as a baseline) for selecting the correct entity for a given mention [6]. In addition, the context of an entity (i.e., its semantic network) is another important characteristic exploited by state-of-the-art entity linking approaches [17, 11]. In this thesis, we consider that both the prior probability of an entity and the context of an entity mention are temporal in nature, and thus the time aspect should be explicitly considered in entity linking. Towards this objective, we introduce and formulate the problem of "Time-Aware Entity Linking". Given a document and a time period (e.g., the publication date or an estimation of the publication date), our approach links entities in the document to a contemporary knowledge base (Wikipedia) by considering time-aware prior probabilities and word embeddings. Below, we motivate the problem, discuss related works, pose the main research questions and hypotheses, and describe the methodology that we follow as well as our evaluation plan.

2 Motivation

Easy access to historical Web information becomes more and more important as significant parts of our cultural heritage are produced and consumed online. National and international initiatives have recognized this need and started to collect and preserve parts of the Web. The most prominent one, the Internet Archive (https://archive.org/), has collected more than 2.5 petabytes of Web content since 1996. In the same direction, efforts have emerged to collect and preserve social media archives, like the Twitter Archive at the Library of Congress [24]. Despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into usable and useful information sources [2, 4]. Entity linking can play an important role in adding semantics to archived documents, which in turn enables their effective analysis and exploration [7]. However, from our experience in trying to annotate a variety of historical documents in the context of the Alexandria project (ERC Advanced Grant, Nr. 339233, http://alexandria-project.eu/), such as news articles, web pages, query logs and social media posts, existing state-of-the-art entity linking approaches show limited performance, producing a considerable number of false annotations due to their time-agnostic design. We believe that making entity linking time-aware can increase the quality of annotations, especially in the case of historical Web content.

3 Related work

Entity linking requires a knowledge base containing the entities to which entity mentions can be linked. For open-domain text, one popular approach is to construct a dictionary of entities from online encyclopedias such as Wikipedia or from semantic networks such as DBPedia, Babelnet and YAGO. Bunescu and Pasca [3] were the first to propose a solution for the entity linking problem using Wikipedia, but the term Wikification was only formally introduced in the work of Mihalcea and Csomai [13]. Wikification, sometimes also called Disambiguation to Wikipedia (D2W), consists in extracting the most important concepts in a document and identifying, for each of these concepts, an appropriate link to a Wikipedia article.
Both works are considered local approaches, because only one mention is disambiguated at a time and the methods focus on well-written documents. The main drawback of disambiguating one mention at a time is that no relatedness among the entities is assumed. To overcome this limitation, Cucerzan [5] introduced a global approach in which the disambiguation is performed for all the mentions at the same time. In his work, he recognizes coherence and a general interdependence (i.e., a semantic relation) among mentions in the same document. Milne and Witten [15] proposed a machine learning approach that combines an entity's prior probability with its relatedness to the surrounding context, but their approach relies strongly on unambiguous mentions to create context and disambiguate the ambiguous ones, and hence it is not very effective on text fragments that do not contain at least one unambiguous mention.

Other authors have tried to solve the entity linking problem using different techniques and also link to entities derived from other knowledge bases or semantic networks. DBPedia Spotlight [12], for example, links entities to DBPedia by computing the cosine similarity between the context of the entity mention and the context of each candidate entity using a Bag-of-Words representation. More robust approaches, such as AIDA [11] which links entities to YAGO and Wikipedia, Babelfy [17] which links entities to Babelnet, or Tagme [9] and WAT [19] which link entities only to Wikipedia, employ graph-based algorithms and try to find dense subgraphs as a solution. However, finding the densest subgraph is computationally NP-hard, so each system adopts a different heuristic to approximate it and resolve the entity ambiguity. More recent approaches, such as the ones proposed by Blanco et al. [1], Pappu et al. [18] and Moreno et al. [16], employ neural network models; their success is due to the lower computational complexity of neural models and the possibility of computing very accurate high-dimensional word vectors.

Despite the number of contributions and the variety of techniques applied to solve the entity linking problem, none of the previous works has incorporated the time aspect. The work most related to our objectives is that of Fang and Chang [8], in which spatiotemporal signals are modeled for resolving entity ambiguity in microblogs. Our preliminary results validate our hypothesis that entities are temporal in nature, and thus we believe that the time aspect should be taken into account in the entity linking task.

4 Research question(s)

This thesis aims at addressing the following research question:

– (Q1) Can we improve entity linking on archived content by considering the time aspect as well as characteristics of the underlying corpus?

Besides, we also focus on the following two sub-questions:

– (Q2) How can we evaluate such entity linking approaches?
– (Q3) How can we efficiently annotate large collections of documents?

Addressing the main research question (Q1) enables providing high-quality annotations for a variety of archived content, like documents in web or news archives, social media posts, query logs, etc. Since existing benchmarks and ground truths do not consider the time aspect, Q2 focuses on providing the means to evaluate the effectiveness of time-aware entity linking approaches and compare them with baseline and time-agnostic approaches.
Finally, since web and social media archives are usually huge in size, Q3 aims at providing efficient and scalable approaches.

5 Hypotheses

Our main research question (Q1) is based on the following two hypotheses:

– (H1) The prior probability of an entity is temporal in nature.
– (H2) The context of an entity mention is temporal in nature.

In simple terms, H1 considers that the popularity of an (in our case ambiguous) entity changes over time, while H2 considers that the semantic network (strongly connected words) of an ambiguous entity mention also changes over time. Consider for instance the example in Section 1. The Brazilian soccer player Ronaldo was very popular in 2002, while the Portuguese player Cristiano Ronaldo is very popular nowadays but was not in 2002. For the same reason, the context of the word "ronaldo" has changed over time. In 2002, "ronaldo" was probably co-occurring with words like "brazil", "inter" and "zidane", while nowadays it co-occurs with "madrid", "portugal", "messi", etc.

Both hypotheses have been partially validated by Fang and Chang [8] and Hamilton et al. [10]. The former have shown that entities' prior probabilities often change across time and domains, while the latter have studied the semantic evolution of words and shown that frequent words change more slowly while polysemous words change more quickly. Moreover, Zhang et al. [23] have demonstrated that entity recommendations for keyword queries change over time, while Tran et al. [21] noticed that entity relatedness is both time and context dependent. These works confirm our assumption that entities are dynamic in nature.

6 Approach

The main idea behind our approach is the use of time- and corpus-dependent word embeddings.

6.1 Modeling and Problem Definition

Let D be a corpus of documents, for example a set of news articles, covering the time period TD = [τs, τe] (where τs and τe are two different time points with τs < τe). Consider also a contemporary reference knowledge base K, for instance Wikipedia, containing information for a set of entities E. Given a piece of text s extracted from a document d ∈ D published during the time period Td ⊆ TD, the output of the "time-aware entity linking" task is a set of mappings (m, e), where m is a word or phrase from s and e ∈ E is an entity from K that determines the identity of m based on both the context of m (s, d, and D) and the time period Td. For example, given i) the sentence "Ronaldo scored and Real Madrid won" extracted from a 2002 article of a news archive, and ii) Wikipedia as the reference knowledge base, a correct mapping is the following: (Ronaldo, https://en.wikipedia.org/wiki/Ronaldo_(Brazilian_footballer)).

6.2 Approach Overview

The main components of the proposed method are depicted in Figure 1 and further described below.

Fig. 1: The main components in our approach.

Candidate Entity Generation and Mention Context Representation. For an input document d, we first generate a list of pairs (m, Em) where each mention m has a list of candidate entities Em ⊆ E. For this, we exploit the anchor texts of Wikipedia articles and create a dictionary of mentions and entities. Thus, for every mention m we select as candidate entities those that appear as link destinations for m (a simplified sketch is given below). Each mention also has a context representation that considers its surrounding text and mentions. Here we intend to investigate variations such as incorporating topical coherence or exploring different window sizes.
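As an illustration of the candidate generation step, the following is a minimal sketch (in Python) of how such a mention dictionary can be built from Wikipedia anchor texts. It assumes the anchor links of one Wikipedia edition are already available as (anchor text, target article) pairs; all function and variable names are illustrative and not part of the original work.

    from collections import defaultdict

    def build_mention_dictionary(anchor_links):
        """Map each normalized mention string to its candidate entities and link counts.

        `anchor_links` is assumed to be an iterable of (anchor_text, target_article)
        pairs extracted beforehand from the articles of a single Wikipedia edition.
        """
        dictionary = defaultdict(lambda: defaultdict(int))
        for anchor_text, target_article in anchor_links:
            mention = anchor_text.strip().lower()  # simple normalization
            dictionary[mention][target_article] += 1
        return dictionary

    def candidate_entities(dictionary, mention):
        """Return the candidate entities E_m of a mention, ranked by link frequency."""
        counts = dictionary.get(mention.strip().lower(), {})
        return sorted(counts, key=counts.get, reverse=True)

    # Toy usage example:
    links = [("Ronaldo", "Ronaldo_(Brazilian_footballer)"),
             ("Ronaldo", "Cristiano_Ronaldo"),
             ("Ronaldo", "Ronaldo_(Brazilian_footballer)")]
    d = build_mention_dictionary(links)
    print(candidate_entities(d, "Ronaldo"))  # ['Ronaldo_(Brazilian_footballer)', 'Cristiano_Ronaldo']

The same per-edition link counts can later be reused for the time-dependent entity priors described next.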
Time dependent Entity Priors. We compute time-dependent entity prior probabilities for all entities in E by exploiting different Wikipedia editions. There are two major challenges in this step. The first is how to deal with long-tail entities for which limited or no content exists in old Wikipedia versions. The second is how to deal with older time periods for which no Wikipedia version exists.

Time and Corpus dependent Mention Embeddings. We create different mention embedding models for different time periods (e.g., month-wise and year-wise) by exploiting the documents in D. For this, we use the word2vec algorithm proposed by Mikolov et al. [14] and also adopt standard techniques from vector quantization and signal compression to quantize the entries of the word vectors and encode the quantization results [1]. This enables storage and retrieval in a space- and time-efficient way.

Entity Static Representation. An entity e ∈ E has a static and up-to-date (independent of Td) representation, built by exploiting encyclopedic-like text describing e (i.e., its Wikipedia page). A challenging issue here is how to represent long-tail entities for which only limited information is provided.

7 Evaluation plan

Existing gold standard data sets used for assessing entity linking methods do not take the time dimension into account, which makes it difficult to compare our proposed approach with time-agnostic methods. Therefore, we need to create new ground truth data sets for a variety of archived corpora, including news and web archives, social media archives and query logs. Currently, we are in the process of annotating news documents extracted from the New York Times corpus with AIDA, Babelfy and Tagme. Our next steps include analyzing their overall agreement and crowdsourcing corrections of eventual mistakes. We plan to evaluate the effectiveness of our approach for both old and new documents and at several time granularities. We also intend to conduct experiments and comparisons with the entity linking approaches integrated into the Gerbil framework [22], as it already offers a web-based platform for comparing tools using multiple data sets and uniform measures.

8 Preliminary results and Reflections

Since the computation of prior probabilities is typically done over knowledge bases such as Wikipedia, we confirmed H1 by analyzing Wikipedia editions of two different time periods (2006 and 2016). For each edition, we collected all links in Wikipedia articles and created a reference knowledge base of mentions and their referring entities. The probability that a mention m links to the entity e is given by the number of times m links to e divided by the number of times m occurs in the whole corpus.
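A minimal sketch of this prior computation is given below. It assumes the link counts and mention occurrence counts for a given Wikipedia edition have already been collected (e.g., with a dictionary like the one sketched earlier); function and variable names are illustrative.

    def prior_probability(link_counts, mention_occurrences, mention, entity):
        """P(e | m) for one Wikipedia edition: the number of times the mention m
        links to entity e, divided by the number of times m occurs in the corpus."""
        m = mention.strip().lower()
        total = mention_occurrences.get(m, 0)
        if total == 0:
            return 0.0
        return link_counts.get(m, {}).get(entity, 0) / total

    def top_candidate(link_counts, mention):
        """The candidate entity with the highest prior for a given mention."""
        counts = link_counts.get(mention.strip().lower(), {})
        return max(counts, key=counts.get) if counts else None

    # A time-dependent prior is obtained by computing these counts separately per
    # Wikipedia edition (e.g., 2006 and 2016) and, as in the analysis below,
    # checking per mention whether the top-ranked candidate entity changed.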
Initially, we were only concerned with the top-ranked candidate entity for each one of the 31,123 mentions that occur in both Wikipedia editions. When considering both ambiguous and unambiguous mentions, in 9.44% of the cases the mention changed its top-ranked candidate entity, while when removing the unambiguous mentions this number increased to 15.36%. This is mainly due to the fact that most of the unambiguous mentions kept the same entity mappings. Moreover, we examined the changes in the top-5 ranked positions of the candidate entities. For that, we computed the rank correlation of the candidate entity lists, treating an element i that appears in list L1 but not in L2 as if it were ranked at position |L2| + 1. We computed the rank correlation for 18,727 mentions, since this is the number of mentions that are ambiguous and appear in both the 2006 and the 2016 Wikipedia corpus. We noticed that in 71.98% of the cases the rank correlation values were greater than 0.5, which tells us that there is a significant number of changes in the candidate entities' rank positions.
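For illustration, the following is a minimal sketch of such a padded rank correlation between two top-k candidate lists. The exact correlation coefficient is not specified in the text, so Spearman's rho is used here purely as an assumption; the function name and toy data are illustrative.

    from scipy.stats import spearmanr  # assumed choice of rank correlation coefficient

    def padded_rank_correlation(l1, l2):
        """Rank correlation between two ranked candidate entity lists.

        An element of l1 that does not appear in l2 is treated as if it were
        ranked at position len(l2) + 1, as described above. The use of
        Spearman's rho is an assumption, not stated in the original text.
        """
        pos2 = {e: i + 1 for i, e in enumerate(l2)}
        ranks1 = [i + 1 for i in range(len(l1))]
        ranks2 = [pos2.get(e, len(l2) + 1) for e in l1]
        rho, _ = spearmanr(ranks1, ranks2)
        return rho

    # Toy example: two top-5 candidate lists sharing only some entities.
    l_2006 = ["A", "B", "C", "D", "E"]
    l_2016 = ["B", "A", "C", "F", "G"]
    print(padded_rank_correlation(l_2006, l_2016))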
Table 1 shows an example of a mention and its top-5 candidate entities in two different Wikipedia editions.

Table 1: The mention "Watson" and its top-5 ranked candidate entities in two different Wikipedia editions (i.e., 2006 and 2016).

  Year   Entity                                     Prior
  2006   Doctor Watson                              0.146
  2006   James D. Watson                            0.130
  2006   Watson, Australian Capital Territory       0.115
  2006   Division of Watson                         0.076
  2006   Watson                                     0.061
  2016   Watson (computer)                          0.068
  2016   Ben Watson (footballer, born July 1985)    0.054
  2016   Je-Vaughn Watson                           0.050
  2016   Jamie Watson (soccer)                      0.047
  2016   Arthur Watson (footballer, born 1870)      0.043

In Zhang et al. [23], the authors created a dataset with 22 queries to enable evaluations for time-aware entity recommendation. We used this dataset to perform some preliminary evaluations of our approach. We created month-wise word embeddings using the provided news corpus, which spans 7 months and contains about 7 million articles. We evaluated a total of 26 ambiguous mentions (with more than one candidate) without considering context information (for queries such as "Tour de France Nibali" we counted two ambiguous mentions, namely "Tour de France" and "Nibali"). For each mention, we collected its candidate entities and evaluated whether our approach could identify the correct Wikipedia entity in a reference knowledge base created from a 2017 Wikipedia edition. Our approach reaches a promising accuracy of about 73% on ambiguous mentions only (19/26 cases of correct entity linking).

9 Conclusion

This thesis introduces and formalizes the problem of time-aware entity linking, which can be particularly useful for annotating archived collections of documents, such as news or web archives. We validated our hypothesis that the prior probability of an entity is temporal in nature and presented an entity linking model that incorporates time-aware entity priors and word embeddings. In the future, we intend to extensively evaluate variations of our model on different archived corpora and time periods, using ground truth data sets created specifically for time-aware entity linking.

Acknowledgments

I am thankful to my advisors Dr. Pavlos Fafalios and Prof. Dr. Wolfgang Nejdl, who provided expertise and helped me structure this work.

References

1. Blanco, R., Ottaviano, G., Meij, E.: Fast and space-efficient entity linking for queries. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. pp. 179–188. ACM (2015)
2. Bruns, A., Weller, K.: Twitter as a first draft of the present: and the challenges of preserving it for the future. In: Proceedings of the 8th ACM Conference on Web Science. ACM (2016)
3. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL. vol. 6, pp. 9–16 (2006)
4. Calhoun, K.: Exploring digital libraries: foundations, practice, prospects. Facet Publishing (2014)
5. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic. pp. 708–716 (2007)
6. Fader, A., Soderland, S., Etzioni, O.: Scaling Wikipedia-based named entity disambiguation to arbitrary web text. In: Proceedings of WikiAI (2009)
7. Fafalios, P., Holzmann, H., Kasturia, V., Nejdl, W.: Building and querying semantic layers for web archives. In: ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'17) (2017)
8. Fang, Y., Chang, M.W.: Entity linking on microblogs with spatial and temporal signals. Transactions of the Association for Computational Linguistics 2 (2014)
9. Ferragina, P., Scaiella, U.: Tagme: on-the-fly annotation of short text fragments (by Wikipedia entities). In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1625–1628. ACM (2010)
10. Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096 (2016)
11. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 782–792. Association for Computational Linguistics (2011)
12. Mendes, P.N., Jakob, M., Garcia-Silva, A., Bizer, C.: DBpedia Spotlight: Shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011)
13. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. pp. 233–242. ACM (2007)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
15. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM (2008)
16. Moreno, J.G., Besançon, R., Beaumont, R., Dhondt, E., Ligozat, A.L., Rosset, S., Tannier, X., Grau, B.: Combining word and entity embeddings for entity linking. In: European Semantic Web Conference. pp. 337–352. Springer (2017)
17. Moro, A., Raganato, A., Navigli, R.: Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2, 231–244 (2014)
18. Pappu, A., Blanco, R., Mehdad, Y., Stent, A., Thadani, K.: Lightweight multilingual entity extraction and linking. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp. 365–374. ACM (2017)
19. Piccinno, F., Ferragina, P.: From Tagme to WAT: a new entity annotator. In: Proceedings of the First International Workshop on Entity Recognition & Disambiguation. pp. 55–62. ACM (2014)
20. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27(2), 443–460 (2015)
21. Tran, N.K., Tran, T., Niederée, C.: Beyond time: Dynamic context-aware entity recommendation. In: European Semantic Web Conference. Springer (2017)
22. Usbeck, R., Röder, M., Ngonga Ngomo, A.C., Baron, C., Both, A., Brümmer, M., Ceccarelli, D., Cornolti, M., Cherix, D., Eickmann, B., et al.: GERBIL: general entity annotator benchmarking framework. In: Proceedings of the 24th International Conference on World Wide Web. pp. 1133–1143. ACM (2015)
23. Zhang, L., Rettinger, A., Zhang, J.: A probabilistic model for time-aware entity recommendation. In: International Semantic Web Conference. pp. 598–614. Springer (2016)
24. Zimmer, M.: The Twitter archive at the Library of Congress: Challenges for information practice and information policy. First Monday 20(7) (2015)