Entity Recognition and Linking on Tweets with Random Walks

Zhaochen Guo, Department of Computing Science, University of Alberta, zhaochen@ualberta.ca
Denilson Barbosa, Department of Computing Science, University of Alberta, denilson@ualberta.ca

ABSTRACT
This paper presents our system at the #Microposts2015 NEEL Challenge [4]. The task is to recognize and type mentions in English microposts and link them to their corresponding entries in DBpedia 2014. For this task, we developed a method based on a state-of-the-art entity linking system, REL-RW [2], which exploits the entity graph of the knowledge base to compute semantic relatedness between entities and uses it for entity disambiguation. The advantage of the approach is its robustness across different types of documents. We built our system on REL-RW and employed a tweet-specific NER component to improve performance on tweets. The system achieved an overall F1 of 0.35 on the NEEL 2015 development dataset, while the disambiguation component alone achieves an F1 of 0.70.

Keywords
Entity Recognition, Entity Disambiguation, Social Media

1. INTRODUCTION
Microposts such as tweets have become very popular. Though short and simple, tweets spread information quickly and broadly: events, reviews, news, and more are posted on Twitter, making tweets a valuable resource for activities such as political opinion mining, product development (customer reviews), and social activism. Making the best use of tweets for such applications requires understanding them. Given the 140-character limit, a tweet by itself carries little useful information. Exploiting the entities mentioned in tweets can enrich the text with their contexts and semantics from knowledge bases, which is important for a better understanding of tweets.
The NEEL task aims to address this issue by automatically recognizing entities and their types in English tweets and linking them to their DBpedia 2014 resources. NER and entity linking have been active research subjects. However, most previous work focuses on traditional long documents, which do not pose the challenges found in tweets, such as noisy terms, hashtags, retweets, abbreviations, and cyber-slang. Appropriately addressing these problems while taking advantage of existing approaches is important. We developed a NEEL system for the challenge based on a state-of-the-art entity linking approach and incorporated a tweet-specific mention extraction component. Our system takes advantage of the entity graph in the knowledge base and does not rely on lexical features of the tweets, which makes it robust across different datasets. In the following sections, we describe our system and report experimental results on the challenge benchmarks.

2. OUR APPROACH

2.1 Mention Extraction
The first component of our system, mention extraction, extracts named entities from the given tweets. Our system originally employed the Stanford NER with models trained on well-formed news documents; however, it does not handle short tweets well. We therefore used TwitIE [1] from GATE, an NER tool designed specifically for tweets, to perform mention extraction in our system.
Compared to the Stanford NER, TwitIE adds several improvements. The first is the Normaliser: to address unseen tokens and noisy grammar in tweets, TwitIE uses a tweet-specific spelling dictionary to identify and correct spelling variations (a minimal sketch of this kind of dictionary-based normalisation is given at the end of this subsection). The second improvement is a tweet-adapted model for POS tagging: while still employing the Stanford POS Tagger, TwitIE replaces the original model with one trained on Twitter datasets annotated with the Penn TreeBank tagset plus extra labels for retweets, URLs, hashtags, and user mentions. With these improvements, TwitIE helps improve the NER performance of our system. Note that we use the types inferred by TwitIE as the types for mentions.
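To make the normalisation step concrete, the following Python sketch shows dictionary-based tweet normalisation of the kind TwitIE's Normaliser performs; the dictionary entries, tokenisation rule, and function names are illustrative assumptions, not TwitIE's actual resources or code.

import re

# Hypothetical slang/spelling dictionary; TwitIE's Normaliser relies on a much
# larger tweet-specific resource.
NORMALISATION_DICT = {
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
    "pls": "please",
}

# Keep URLs, hashtags, and user mentions as single tokens.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+|\S")

def normalise_tweet(text):
    """Tokenise a tweet and replace known slang or misspelled tokens,
    leaving URLs, hashtags, and user mentions untouched."""
    tokens = TOKEN_RE.findall(text)
    out = []
    for tok in tokens:
        if tok.startswith(("#", "@", "http")):
            out.append(tok)
        else:
            out.append(NORMALISATION_DICT.get(tok.lower(), tok))
    return out

# Example:
#   normalise_tweet("gr8 show 2moro @bbc #London pls RT")
#   -> ['great', 'show', 'tomorrow', '@bbc', '#London', 'please', 'RT']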
2.2 Candidate Generation
The second component, candidate generation, selects potential candidates from the knowledge base for the mentions in the tweets. Our system uses an alias dictionary collected from Wikipedia titles, redirect pages, disambiguation pages, and the anchor text of Wikilinks [2], which maps each alias to the entities it refers to in Wikipedia. We use exact string matching against the dictionary for candidate generation. Mentions that do not match any alias in the dictionary are immediately linked to NIL; otherwise, the entities mapped to the matched alias are selected as candidates. To improve efficiency, we further prune the candidates by two criteria [2]: the prior probability, defined as the probability that the alias refers to a given entity in the Wikipedia corpus, and the context similarity, measured as the cosine similarity between the contexts of the mention and the entity. For each criterion, the top 10 ranked candidates are selected, and the two lists are merged to form the final candidate list for the given mention. A minimal sketch of this generate-and-prune step is given below.
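The following Python sketch illustrates the generate-and-prune step under simplifying assumptions: the alias dictionary, prior probabilities, and bag-of-words context vectors are assumed to be precomputed from a Wikipedia dump, and the function and parameter names are placeholders rather than the actual REL-RW interfaces.

import math

TOP_K = 10  # candidates kept per pruning criterion

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def generate_candidates(mention, mention_context, alias_dict, prior, entity_context):
    """Exact-match alias lookup followed by pruning on prior probability and
    context (cosine) similarity; the two top-k lists are then merged.
    alias_dict:     alias string -> list of entity ids (from Wikipedia)
    prior:          (alias, entity) -> P(entity | alias) from anchor statistics
    entity_context: entity id -> sparse context vector"""
    entities = alias_dict.get(mention)
    if not entities:
        return []   # no alias match: the mention will be linked to NIL

    by_prior = sorted(entities, key=lambda e: prior.get((mention, e), 0.0),
                      reverse=True)[:TOP_K]
    by_context = sorted(entities,
                        key=lambda e: cosine(mention_context,
                                             entity_context.get(e, {})),
                        reverse=True)[:TOP_K]
    # Merge the two ranked lists, keeping order and dropping duplicates.
    return list(dict.fromkeys(by_prior + by_context))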
2.3 Entity Disambiguation
Entity disambiguation selects the target entity from among the candidates of a mention. We use our prior algorithm [2] for this task. The main idea is to represent the semantics of the document (tweet) and of the candidate entities by a set of related entities in DBpedia, where the weight of each entity is its semantic relatedness to the candidates. We then use this semantic representation to compute the semantic similarity between the candidates and the document. For each mention-entity pair, we measure the prior probability, context similarity, and semantic similarity, and linearly combine them into an overall similarity score. The candidate with the highest score is selected as the target entity.
The key part of the approach is the semantic representation and relatedness. Knowledge bases such as DBpedia are graphs in which entities are connected semantically. We construct an entity graph from the knowledge base and use the connectivity of the graph to measure the semantic relatedness between entities. We use random walks with restart to traverse the graph. Upon convergence, this process yields a probability distribution over the vertices corresponding to the likelihood that each vertex is visited, which can be used as an estimate of the relatedness between entities in the graph. For each target entity, every random walk restarts from that entity, producing a personalized probability distribution that we use as the entity's semantic representation. For the semantic representation of the document, we perform the random walk restarting from a set of entities representing the document. Since the true entities of the mentions in the document are not available, we choose as representative entities either the candidates of unambiguous mentions (those with only one candidate) or the candidate entities weighted by their prior probability. With these representative entities, the semantic representation of the document is the probability distribution obtained through the random walk restarted from them.
To improve efficiency, instead of using the entire DBpedia graph we construct a small entity graph by starting with the set of candidates and adding all entities adjacent to these candidates in the original graph. This subgraph contains the entities semantically related to the candidates and is large enough to compute the semantic representations of the entities and the document.
Once the semantic representations are obtained, we measure the semantic similarity between each candidate and the document using the Zero-KL Divergence [3], which is then combined with the prior probability and context similarity to disambiguate the candidates. A simplified sketch of the random walk and the score combination is given below.
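The Python sketch below illustrates the random walk with restart over the entity subgraph, a simplified Zero-KL-style similarity, and the linear score combination; the restart probability, penalty constant, and combination weights are illustrative assumptions, and the code is a simplified rendering of the approach in [2, 3] rather than the actual implementation.

import math

RESTART_PROB = 0.15      # illustrative restart probability (assumption)
GAMMA = 20.0             # illustrative penalty for zero probabilities in ZKL (assumption)
ALPHA, BETA = 0.4, 0.3   # illustrative weights for the linear combination (assumption)

def random_walk_with_restart(graph, restart_nodes, iterations=50):
    """Random walk with restart (personalized PageRank) over an entity subgraph.
    graph: entity -> list of neighbouring entities; every neighbour is assumed
           to also appear as a key of the graph.
    restart_nodes: entities the walk restarts from (uniformly).
    Returns a probability distribution over entities, i.e. the semantic
    representation of a candidate entity or of the whole document."""
    nodes = list(graph)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    prob = dict(restart)
    for _ in range(iterations):
        nxt = {n: RESTART_PROB * restart[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                continue
            share = (1.0 - RESTART_PROB) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        prob = nxt
    return prob

def zkl_similarity(p, q):
    """Similarity derived from a simplified Zero-KL Divergence [3]:
    terms where q is zero are penalised with the constant GAMMA."""
    div = 0.0
    for x, px in p.items():
        if px <= 0.0:
            continue
        qx = q.get(x, 0.0)
        div += px * (math.log(px / qx) if qx > 0.0 else GAMMA)
    return math.exp(-div)   # map the divergence to a similarity score

def overall_score(prior, context_sim, semantic_sim):
    """Linear combination of the three disambiguation signals."""
    return ALPHA * prior + BETA * context_sim + (1.0 - ALPHA - BETA) * semantic_sim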
2.4 NIL Prediction and Clustering
For NIL prediction, mentions are deemed absent from the knowledge base (and thus linked to NIL) either when no candidates are available or when their similarity with the highest-ranked entities is below a threshold. For clustering, we simply group NIL mentions by their name similarity. In the future, we plan to exploit the semantic representation of the tweets to measure their semantic similarity and use it for NIL clustering.

3. EXPERIMENTS
We built our system using a 2013 DBpedia dump, including the knowledge base and the alias dictionary. Table 1 lists the results of our system on the development dataset. As shown, the performance of mention extraction (tagging) is very poor, especially its recall; we believe more tuning would improve it. Since the novelty of our system lies in the disambiguation part, we also evaluated the entity disambiguation component separately (assuming all mentions are correctly recognised): it achieves 0.74 precision and 0.66 recall, for an F1 of 0.70 on the same dataset.

               Precision   Recall    F1
  Tagging         0.34      0.22    0.27
  Linking         0.35      0.36    0.35
  Clustering      0.45      0.29    0.35

  Table 1: Results on the development dataset.

4. CONCLUSION
In this paper, we described a system for the #Microposts2015 NEEL challenge, in which we adopted a tweet-specific NER system for mention extraction and used an entity disambiguation approach that exploits the connectivity of entities in DBpedia to capture the semantics of entities and disambiguate mentions.
Due to time limitations, our system still has much room for improvement. As shown, mention extraction is currently the bottleneck of the system and needs further work; more features from the tweets could be used to train a better model. For mention disambiguation, we will explore supervised approaches such as learning to rank to combine semantic features, such as the semantic similarity, with lexical features specific to tweets. The semantic representation also appears valuable for NIL clustering and is worth exploring.

5. REFERENCES
[1] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard, and N. Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In RANLP. ACL, 2013.
[2] Z. Guo and D. Barbosa. Robust entity linking via random walks. In CIKM, pages 499–508, 2014.
[3] T. Hughes and D. Ramage. Lexical semantic relatedness with random graph walks. In EMNLP-CoNLL, pages 581–589, 2007.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53, 2015.