Linking Entities in #Microposts

Romil Bansal, Sandeep Panem, Priya Radhakrishnan, Manish Gupta, Vasudeva Varma
International Institute of Information Technology, Hyderabad

ABSTRACT

Social media has emerged as an important source of information. Entity linking in social media provides an effective way to extract useful information from microposts shared by users. Entity linking in microposts is a difficult task because microposts lack sufficient context to disambiguate the entity mentions. In this paper, we perform entity linking by first identifying entity mentions and then disambiguating the mentions based on three different features: (1) similarity between the mention and the corresponding Wikipedia entity pages; (2) similarity of the mention and the tweet text to the anchor text strings across multiple webpages; and (3) popularity of the entity on Twitter at the time of disambiguation. The system is tested on the manually annotated dataset provided by the Named Entity Extraction and Linking (NEEL) Challenge 2014, and the obtained results are on par with state-of-the-art methods.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

Keywords

Named Entity Extraction and Linking (NEEL) Challenge, Entity Linking, Entity Disambiguation, Social Media

1. INTRODUCTION

Social media networks like Twitter have emerged as major platforms for sharing information in the form of short messages (tweets). Analysis of tweets can be useful for various applications like e-commerce, entertainment, recommendations, etc. Entity linking is one such analysis task; it deals with finding the correct referent entities in a knowledge base for the various mentions in a tweet. Entity linking in social media is important as it helps in detecting, understanding and tracking information about an entity shared across social media.

Entity linking consists of two different tasks: mention detection and entity disambiguation. Entity linking from general text is a well-explored problem. Existing entity linking tools are intended for use over news corpora and similar document-based corpora of relatively long length. But as microposts lack sufficient context, these context-based approaches fail to perform well on microposts.

In this paper we describe the system we proposed for the NEEL Challenge 2014 [1]. The proposed system disambiguates the entity mentions in tweets based on three different measures: (1) Wikipedia's context based measure (§2.2.1); (2) anchor text based measure (§2.2.2); and (3) Twitter popularity based measure (§2.2.3). Mention detection is done using existing Twitter part-of-speech (POS) taggers [2, 5].

2. OUR APPROACH

2.1 Mention Detection

Mention detection is the task of finding entity mentions in the given text. We treat mentions as the named entities present inside the tweets. Various approaches for named entity recognition in tweets have been proposed recently [3, 5]. These include spotting continuous sequences of proper nouns as named entities in the tweet. But named entities like 'Statue of Liberty' or 'Game of Thrones' also include tokens other than nouns. To detect such mentions, Ritter et al. [5] proposed a machine learning based system for named entity detection in tweets. Gimpel et al. [2] present yet another approach for POS tagging of tweets. We tried both of these POS taggers to extract proper noun sequences. In our experiments, Ritter et al.'s tagger gave an accuracy of 77% while Gimpel et al.'s tagger gave an accuracy of 92%, so we merged the results from both as shown in Fig. 1. The tweet text is fed to the system, and the longest continuous sequences of proper noun tokens detected using the above approach are extracted as the entity mentions from the given tweet. The merged system provided an accuracy of 98% in predicting mentions.
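The merge-and-extract step of §2.1 can be sketched as follows. This is a minimal illustration, not the actual system: the tagsets ("^" for a proper noun in ARK-style output, "NNP" in T-NER-style output), the union-style merge rule, and all function names are assumptions for the sake of the example; the real ARK and T-NER taggers are external components.

```python
# Sketch: merge per-token POS decisions from two taggers, then extract
# the longest continuous runs of proper-noun tokens as entity mentions.

def merge_tags(ark_tags, tner_tags):
    """Mark a token as a proper noun if either tagger says so (union merge)."""
    return [a == "^" or t == "NNP" for a, t in zip(ark_tags, tner_tags)]

def extract_mentions(tokens, is_proper):
    """Return maximal contiguous sequences of proper-noun tokens."""
    mentions, run = [], []
    for tok, flag in zip(tokens, is_proper):
        if flag:
            run.append(tok)
        elif run:
            mentions.append(" ".join(run))
            run = []
    if run:
        mentions.append(" ".join(run))
    return mentions

# Hypothetical tagger outputs for one tweet; note how the merge lets the
# non-noun token "of" stay inside the mention "Game of Thrones".
tokens    = ["Watching", "Game", "of", "Thrones", "with", "Alice"]
ark_tags  = ["V",   "^",   "P",   "^",   "P",  "^"]
tner_tags = ["VBG", "NNP", "NNP", "NNP", "IN", "NNP"]

print(extract_mentions(tokens, merge_tags(ark_tags, tner_tags)))
```

Taking the union of the two taggers' proper-noun decisions trades a little precision for recall, which is what makes multi-token mentions containing function words recoverable.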
[Figure 1: System Architecture. The tweet text is fed in parallel to the ARK POS tagger (Gimpel et al. [2]) and the T-NER POS tagger (Ritter et al. [5]); their merged output yields the entity mentions (Mention Detection), which are then scored by the Wikipedia based measure, the anchor text based measure and the Twitter popularity based measure, and combined with LambdaMART (Entity Disambiguation).]

Copyright (c) 2014 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2014 Workshop proceedings, available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141). WWW'14 Companion, April 7-11, 2014, Seoul, Korea. ACM 978-1-4503-2745-9/14/04. http://dx.doi.org/XYZW. #Microposts2014, 4th Workshop on Making Sense of Microposts, @WWW2014.

2.2 Entity Disambiguation

Entity disambiguation is the task of assigning the correct referent entity from the knowledge base to a given mention. We disambiguate each entity mention using the three measures described below. The scores from these three measures are combined using the LambdaMART [7] model to arrive at the final disambiguated entity.

2.2.1 Wikipedia's Context based Measure (M1)

This measure disambiguates a mention by calculating the frequency of occurrence of the mention in the Wikipedia corpus. Wikipedia's context based measure has been used in various approaches for disambiguating mentions in tweets [4]. We query the MediaWiki API(1) with the entity mention. The MediaWiki API returns the candidate entities in ranked order, and each candidate entity is assigned its reciprocal rank as its score. Thus, a ranked list of candidate entities with their scores is created using M1.

2.2.2 Anchor Text based Measure (M2)

The Google Cross-Wiki Dictionary (GCD) [6] is a string-to-concept mapping created using anchor text from various web pages. A concept is an individual Wikipedia article, identified by its URL. The text strings are the anchor hypertexts that refer to these concepts; thus, anchor text strings represent a concept. We query the GCD with a mention along with the tweet text. Based on the similarity to the query string, a ranked list of probable candidate entities is created (the ranked list using M2). The ranking criterion is the Jaccard similarity between the anchor text and the query, so if the mention is highly similar to the anchor text, the corresponding concept receives a high score.

2.2.3 Twitter Popularity based Measure (M3)

Tweets about entities follow a bursty pattern: bursts of tweets appear after an event relating to an entity happens. We exploit this fact and measure the number of times the given mention has recently referred to a particular entity on Twitter. The mention is queried on the Twitter API(2) and the resultant tweets are analyzed. All the tweets, along with the mention, are then queried on the GCD, and the candidate entities are taken. Based on the scores returned by the GCD, all the candidate entities are ranked (the ranked list using M3). As the Twitter popularity based measure captures people's interests at a particular time, it works well for entity disambiguation on recent tweets. In essence, the methods M2 and M3 are similar but take different inputs: both use the GCD and produce candidate entities with scores as output, but M2 takes the mention and a single tweet's text as input, whereas M3 takes the mention and multiple tweets as input.

We now have three rankings, from M1, M2 and M3, and the task is to arrive at the final ranking of the candidate entities by combining the rankings of the three different models such that the overall F1 score is maximized. For this, we use LambdaMART, which combines the LambdaRank and MART models. LambdaMART creates boosted regression trees for combining the rankings of the three different systems.

3. RESULTS AND EVALUATION

The dataset comprises 2.3K tweets, each annotated with its entity mentions and their corresponding DBpedia URLs. We divided the dataset into a 7:3 (train:test) ratio. Table 1 shows the results obtained using the NEEL Challenge evaluation framework. The best results are obtained when a combination of all the measures is used for disambiguation(3). A 5-fold cross validation on the dataset gave an average F1 of 0.52 for M1+M2+M3.

Table 1: Results. M1 represents Wikipedia's Context based Measure (§2.2.1), M2 represents the Anchor Text based Measure (§2.2.2) and M3 represents the Twitter Popularity based Measure (§2.2.3).

Measure    F1-measure
M1         0.355
M2         0.100
M3         0.194
M1+M2      0.355
M2+M3      0.244
M1+M3      0.405
M1+M2+M3   0.512

4. CONCLUSION

For effective entity linking, mention detection in tweets is important. We improve the accuracy of detecting mentions by combining various Twitter POS taggers. We resolve multiple mentions, abbreviations and spelling variations of a named entity using the Google Cross-Wiki Dictionary. We also use the popularity of an entity on Twitter to improve the disambiguation. Our system performed well, with an F1 score of 0.512 on the given dataset.

5. REFERENCES

[1] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making Sense of Microposts (#Microposts2014) Named Entity Extraction & Linking Challenge. In Proc. of the 4th Workshop on Making Sense of Microposts (#Microposts2014), pages 54-60, 2014.
[2] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (ACL-HLT), pages 42-47, 2011.
[3] S. Guo, M.-W. Chang, and E. Kıcıman. To Link or Not to Link? A Study on End-to-End Tweet Entity Linking. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1020-1030, 2013.
[4] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu. Entity Linking for Tweets. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1304-1311, 2013.
[5] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. In Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
[6] V. I. Spitkovsky and A. X. Chang. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proc. of the 8th Intl. Conf. on Language Resources and Evaluation (LREC), 2012.
[7] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting Boosting for Information Retrieval Measures. Journal of Information Retrieval, 13(3):254-270, Jun 2010.

Footnotes:
(1) https://www.mediawiki.org/wiki/API:Search
(2) https://dev.twitter.com/docs/api/1.1/get/search/tweets
(3) Submitted as Agglutweet_1.tsv
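To make the anchor text based measure of §2.2.2 concrete, the following is a minimal sketch of Jaccard-based candidate ranking. The tiny anchor dictionary GCD_STUB, its entries, and the function names are all invented stand-ins for illustration; the real Google Cross-Wiki Dictionary is a large precomputed string-to-concept mapping.

```python
# Sketch of M2: score each candidate concept by the Jaccard similarity
# between the query string and the anchor text strings that refer to it,
# keeping the best-matching anchor per concept.

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# anchor text string -> Wikipedia concept (invented example entries)
GCD_STUB = {
    "liberty statue": "Statue_of_Liberty",
    "statue of liberty": "Statue_of_Liberty",
    "liberty university": "Liberty_University",
}

def rank_candidates(query, gcd):
    """Return (concept, score) pairs sorted by descending best-anchor score."""
    scores = {}
    for anchor, concept in gcd.items():
        scores[concept] = max(scores.get(concept, 0.0), jaccard(query, anchor))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Query = mention plus surrounding tweet text, as in §2.2.2.
ranked = rank_candidates("statue of liberty at night", GCD_STUB)
print(ranked[0][0])  # top-ranked concept
```

The same scoring loop, run over tweets retrieved for the mention rather than a single tweet's text, would correspond to the input difference between M2 and M3 noted in §2.2.3.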