Entity Recognition and Linking on Tweets with Random Walks

Zhaochen Guo, Department of Computing Science, University of Alberta, zhaochen@ualberta.ca
Denilson Barbosa, Department of Computing Science, University of Alberta, denilson@ualberta.ca

ABSTRACT
This paper presents our system at the #Microposts2015 NEEL Challenge [4]. The task is to recognize and type mentions in English microposts and link them to their corresponding entries in DBpedia 2014. For this task, we developed a method based on a state-of-the-art entity linking system, REL-RW [2], which exploits the entity graph of the knowledge base to compute semantic relatedness between entities and uses it for entity disambiguation. The advantage of the approach is its robustness across different types of documents. We built our system on REL-RW and employed a tweet-specific NER component to improve performance on tweets. The system achieved an overall F1 of 0.35 on the NEEL 2015 development dataset, while the disambiguation component alone achieves an F1 of 0.70.

Keywords
Entity Recognition, Entity Disambiguation, Social Media

1. INTRODUCTION
Microposts such as tweets have become very popular. Though short and simple, tweets spread information quickly and broadly: events, reviews, news, and more are posted on Twitter, making tweets a valuable resource for activities such as political opinion mining, product development (customer reviews), and social activism. Making the best use of tweets for such applications requires understanding them. Given the 140-character limit, a tweet by itself carries little useful information. Exploiting the entities mentioned in tweets can enrich the text with their contexts and semantics from knowledge bases, which is important for a better understanding of tweets.
The NEEL task aims to address this issue by automatically recognizing entities and their types in English tweets and linking them to their DBpedia 2014 resources. NER and entity linking have been active research subjects. However, most previous work focuses on traditional long documents, which do not pose the challenges found in tweets, such as noisy terms, hashtags, retweets, abbreviations, and cyber-slang. Appropriately addressing these problems while taking advantage of existing approaches is important. We developed a NEEL system for the challenge based on a state-of-the-art entity linking approach and incorporated a tweet-specific mention extraction component. Our system takes advantage of the entity graph in the knowledge base and does not rely on lexical features of the tweets, which makes it robust across different datasets. In the following sections, we describe our system and report experimental results on the challenge benchmarks.

2. OUR APPROACH

2.1 Mention Extraction
The first component of our system, mention extraction, extracts named entities from the given tweets. Our system originally employed the Stanford NER with models trained on well-formed news documents; however, it does not handle short tweets well. We therefore used TwitIE [1] from GATE, an NER tool designed specifically for tweets, to perform mention extraction in our system.
Compared to the Stanford NER, TwitIE adds several improvements. The first is the Normaliser: to address unseen tokens and noisy grammar in tweets, TwitIE uses a tweet-specific spelling dictionary to identify and correct spelling variations (a minimal sketch of this kind of dictionary-based normalisation is given at the end of this subsection). The second improvement is a tweet-adapted model for POS tagging: while still employing the Stanford POS Tagger, TwitIE replaces the original model with one trained on Twitter datasets annotated with the Penn TreeBank tagset plus extra labels for retweets, URLs, hashtags, and user mentions. With these improvements, TwitIE helps improve the NER performance of our system. Note that we use the types inferred by TwitIE as the types for mentions.
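To make the normalisation step concrete, the following Python sketch shows dictionary-based tweet normalisation of the kind TwitIE's Normaliser performs; the dictionary entries, tokenisation rule, and function names are illustrative assumptions, not TwitIE's actual resources or code.

import re

# Hypothetical slang/spelling dictionary; TwitIE's Normaliser relies on a much
# larger tweet-specific resource.
NORMALISATION_DICT = {
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
    "pls": "please",
}

# Keep URLs, hashtags, and user mentions as single tokens.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+|\S")

def normalise_tweet(text):
    """Tokenise a tweet and replace known slang or misspelled tokens,
    leaving URLs, hashtags, and user mentions untouched."""
    tokens = TOKEN_RE.findall(text)
    out = []
    for tok in tokens:
        if tok.startswith(("#", "@", "http")):
            out.append(tok)
        else:
            out.append(NORMALISATION_DICT.get(tok.lower(), tok))
    return out

# Example:
#   normalise_tweet("gr8 show 2moro @bbc #London pls RT")
#   -> ['great', 'show', 'tomorrow', '@bbc', '#London', 'please', 'RT']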
2.2 Candidate Generation
The second component, candidate generation, selects potential candidates from the knowledge base for the mentions in the tweets. Our system uses an alias dictionary collected from Wikipedia titles, redirect pages, disambiguation pages, and the anchor text of Wikilinks [2], which maps each alias to the entities it refers to in Wikipedia. We use exact string matching against the dictionary for candidate generation. Mentions that do not match any alias in the dictionary are immediately linked to NIL; otherwise, the entities mapped to the matched alias are selected as candidates. To improve efficiency, we further prune the candidates by two criteria [2]: the prior probability, defined as the probability that the alias refers to a given entity in the Wikipedia corpus, and the context similarity, measured as the cosine similarity between the contexts of the mention and the entity. For each criterion, the top 10 ranked candidates are selected, and the two lists are merged to form the final candidate list for the given mention. A minimal sketch of this generate-and-prune step is given below.
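The following Python sketch illustrates the generate-and-prune step under simplifying assumptions: the alias dictionary, prior probabilities, and bag-of-words context vectors are assumed to be precomputed from a Wikipedia dump, and the function and parameter names are placeholders rather than the actual REL-RW interfaces.

import math

TOP_K = 10  # candidates kept per pruning criterion

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def generate_candidates(mention, mention_context, alias_dict, prior, entity_context):
    """Exact-match alias lookup followed by pruning on prior probability and
    context (cosine) similarity; the two top-k lists are then merged.
    alias_dict:     alias string -> list of entity ids (from Wikipedia)
    prior:          (alias, entity) -> P(entity | alias) from anchor statistics
    entity_context: entity id -> sparse context vector"""
    entities = alias_dict.get(mention)
    if not entities:
        return []   # no alias match: the mention will be linked to NIL

    by_prior = sorted(entities, key=lambda e: prior.get((mention, e), 0.0),
                      reverse=True)[:TOP_K]
    by_context = sorted(entities,
                        key=lambda e: cosine(mention_context,
                                             entity_context.get(e, {})),
                        reverse=True)[:TOP_K]
    # Merge the two ranked lists, keeping order and dropping duplicates.
    return list(dict.fromkeys(by_prior + by_context))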
2.3 Entity Disambiguation
Entity disambiguation selects the target entity from among the candidates of a mention. We use our prior algorithm [2] for this task. The main idea is to represent the semantics of the document (tweet) and of the candidate entities by a set of related entities in DBpedia, where the weight of each entity is its semantic relatedness to the candidates. We then use this semantic representation to compute the semantic similarity between the candidates and the document. For each mention-entity pair, we measure the prior probability, context similarity, and semantic similarity, and linearly combine them into an overall similarity score. The candidate with the highest score is selected as the target entity.
The key part of the approach is the semantic representation and relatedness. Knowledge bases such as DBpedia are graphs in which entities are connected semantically. We construct an entity graph from the knowledge base and use the connectivity of the graph to measure the semantic relatedness between entities. We use random walks with restart to traverse the graph. Upon convergence, this process yields a probability distribution over the vertices corresponding to the likelihood that each vertex is visited, which can be used as an estimate of the relatedness between entities in the graph. For each target entity, every random walk restarts from that entity, producing a personalized probability distribution that we use as the entity's semantic representation. For the semantic representation of the document, we perform the random walk restarting from a set of entities representing the document. Since the true entities of the mentions in the document are not available, we choose as representative entities either the candidates of unambiguous mentions (those with only one candidate) or the candidate entities weighted by their prior probability. With these representative entities, the semantic representation of the document is the probability distribution obtained through the random walk restarted from them.
To improve efficiency, instead of using the entire DBpedia graph we construct a small entity graph by starting with the set of candidates and adding all entities adjacent to these candidates in the original graph. This subgraph contains the entities semantically related to the candidates and is large enough to compute the semantic representations of the entities and the document.
Once the semantic representations are obtained, we measure the semantic similarity between each candidate and the document using the Zero-KL Divergence [3], which is then combined with the prior probability and context similarity to disambiguate the candidates. A simplified sketch of the random walk and the score combination is given below.
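The Python sketch below illustrates the random walk with restart over the entity subgraph, a simplified Zero-KL-style similarity, and the linear score combination; the restart probability, penalty constant, and combination weights are illustrative assumptions, and the code is a simplified rendering of the approach in [2, 3] rather than the actual implementation.

import math

RESTART_PROB = 0.15      # illustrative restart probability (assumption)
GAMMA = 20.0             # illustrative penalty for zero probabilities in ZKL (assumption)
ALPHA, BETA = 0.4, 0.3   # illustrative weights for the linear combination (assumption)

def random_walk_with_restart(graph, restart_nodes, iterations=50):
    """Random walk with restart (personalized PageRank) over an entity subgraph.
    graph: entity -> list of neighbouring entities; every neighbour is assumed
           to also appear as a key of the graph.
    restart_nodes: entities the walk restarts from (uniformly).
    Returns a probability distribution over entities, i.e. the semantic
    representation of a candidate entity or of the whole document."""
    nodes = list(graph)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    prob = dict(restart)
    for _ in range(iterations):
        nxt = {n: RESTART_PROB * restart[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                continue
            share = (1.0 - RESTART_PROB) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        prob = nxt
    return prob

def zkl_similarity(p, q):
    """Similarity derived from a simplified Zero-KL Divergence [3]:
    terms where q is zero are penalised with the constant GAMMA."""
    div = 0.0
    for x, px in p.items():
        if px <= 0.0:
            continue
        qx = q.get(x, 0.0)
        div += px * (math.log(px / qx) if qx > 0.0 else GAMMA)
    return math.exp(-div)   # map the divergence to a similarity score

def overall_score(prior, context_sim, semantic_sim):
    """Linear combination of the three disambiguation signals."""
    return ALPHA * prior + BETA * context_sim + (1.0 - ALPHA - BETA) * semantic_sim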
2.4 NIL Prediction and Clustering
For NIL prediction, mentions are deemed absent from the knowledge base (and thus linked to NIL) either when no candidates are available or when their similarity with the highest-ranked entities is below a threshold. For clustering, we simply group NIL mentions by their name similarity. In the future, we plan to exploit the semantic representation of the tweets to measure their semantic similarity and use it for NIL clustering.

3. EXPERIMENTS
We built our system using a 2013 DBpedia dump, including the knowledge base and the alias dictionary. Table 1 lists the results of our system on the development dataset. As shown, the performance of mention extraction (tagging) is very poor, especially its recall; we believe more tuning would improve it. Since the novelty of our system lies in the disambiguation part, we also evaluated the entity disambiguation component separately (assuming all mentions are correctly recognised): it achieves 0.74 precision and 0.66 recall, for an F1 of 0.70 on the same dataset.

               Precision   Recall    F1
  Tagging         0.34      0.22    0.27
  Linking         0.35      0.36    0.35
  Clustering      0.45      0.29    0.35

  Table 1: Results on the development dataset.

4. CONCLUSION
In this paper, we described a system for the #Microposts2015 NEEL challenge, in which we adopted a tweet-specific NER system for mention extraction and used an entity disambiguation approach that exploits the connectivity of entities in DBpedia to capture the semantics of entities and disambiguate mentions.
Due to time limitations, our system still has much room for improvement. As shown, mention extraction is currently the bottleneck of the system and needs further work; more features from the tweets could be used to train a better model. For mention disambiguation, we will explore supervised approaches such as learning to rank to combine semantic features, such as the semantic similarity, with lexical features specific to tweets. The semantic representation also appears valuable for NIL clustering and is worth exploring.

5. REFERENCES
[1] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard, and N. Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In RANLP. ACL, 2013.
[2] Z. Guo and D. Barbosa. Robust entity linking via random walks. In CIKM, pages 499–508, 2014.
[3] T. Hughes and D. Ramage. Lexical semantic relatedness with random graph walks. In EMNLP-CoNLL, pages 581–589, 2007.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53, 2015.