E2E: An End-to-End Entity Linking System for Short and Noisy Text

Ming-Wei Chang, Bo-June Hsu, Hao Ma, Ricky Loynd, Kuansan Wang
Microsoft Research, Redmond, WA
{minchang|paulhsu|haoma|riloynd|kuansanw}@microsoft.com

ABSTRACT
We present E2E, an end-to-end entity linking system that is designed for the short and noisy text found in microblogs and text messages. Mining and extracting entities from short text is an essential step for many content analysis applications. By jointly optimizing entity recognition and disambiguation as a single task, our system can process short and noisy text robustly.

Keywords
Information Extraction, Social Media, Entity Linking

1. INTRODUCTION
In this paper, we describe our entity linking system, called E2E, for the #Microposts2014 NEEL Challenge [1]. Our system focuses on the task of extracting and linking entities from short and noisy text given entity databases like Wikipedia or Freebase.

An entity linking system usually needs to perform two key functions: mention recognition and entity disambiguation. In mention recognition, the system identifies each mention (surface form) of an entity in the text. In entity disambiguation, the system maps mentions to canonical entities. E2E has been carefully designed to treat entity recognition and disambiguation as a single task.

2. THE ARCHITECTURE OF E2E
When a short message is received, E2E processes the message in four stages: Text Normalization, Candidate Generation, Joint Recognition-and-Disambiguation, and Overlap Resolution.

Text Normalization.
In this stage, a short message is normalized and tokenized. For tweets, the retweet symbols and some other special symbols are removed. Punctuation symbols are generally represented as separate tokens.

Candidate Generation.
The next step is to generate a list of surface form candidates that could potentially link to entities. E2E uses a lexicon to generate the candidate surface forms. A lexicon is a dictionary that maps a surface form to its possible entity set. For example, the word "giants" could refer to "New York Giants", "San Francisco Giants", etc. Our lexicon is mainly composed by extracting information from Wikipedia and Freebase. The dictionary is constructed to support fuzzy mention matching based on edit distance. Note that we over-generate candidates at this stage, and no filtering is performed.

Joint Recognition and Disambiguation.
This stage is the key component of the E2E framework. Given a message, the goal here is to figure out the entity assignment of each candidate mention generated by the previous stages. Note that a candidate mention may be rejected altogether (mapped to the null entity).

Our model is based on a supervised learning method. Given a message m and a candidate mention a, the entity assignment is generated from the ranking of all possible entities in the entity set E(a):

    arg max_{e ∈ E(a) ∪ {∅}} f(Φ(m, a, e)),    (1)

where f is the model's scoring function, and Φ is a feature function over the input m, the mention a, and the candidate output e. Note that it is very likely that E2E rejects a candidate and does not link it to an entity (i.e., links a to ∅). The joint approach that recognizes and disambiguates entity mentions together is crucial for E2E to properly link surface forms to the corresponding entities.

Overlap Resolution.
At this point, many of the linked mentions will overlap each other. Dynamic programming resolves these conflicts by choosing the best-scoring set of non-overlapping mention-entity mappings. The experimental results show that resolving overlap improves the model's performance consistently in different settings.

3. SYSTEM IMPLEMENTATION
Our database is constructed from both Wikipedia and Freebase.
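As an illustration of the lexicon-based candidate generation described in Section 2, here is a minimal sketch with hypothetical data. The real E2E lexicon is extracted from Wikipedia and Freebase, is implemented in C#, and additionally supports fuzzy matching by edit distance, which this exact-match sketch omits:

```python
# Hypothetical toy lexicon: surface form -> set of possible entities.
# The real lexicon is far larger and supports edit-distance fuzzy lookup.
LEXICON = {
    "giants": {"New_York_Giants", "San_Francisco_Giants"},
    "new york giants": {"New_York_Giants"},
}

def generate_candidates(tokens, max_len=3):
    """Look up every n-gram (up to max_len tokens) of the tokenized
    message in the lexicon, over-generating candidate spans with no
    filtering, as in the Candidate Generation stage."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            surface = " ".join(tokens[i:j]).lower()
            if surface in LEXICON:
                candidates.append((i, j, surface, LEXICON[surface]))
    return candidates

# One candidate span for "giants", ambiguous between two entities;
# disambiguation (or rejection to the null entity) happens later.
print(generate_candidates("the giants won".split()))
```

Each candidate span then flows into the joint recognition-and-disambiguation stage, where it is either linked to one entity in its set or rejected.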
The whole system is implemented in C#.

Entity linking systems often require a large amount of memory due to the size of the structured/unstructured data for many entities. High memory consumption restricts the scale of an entity linking system, limiting the number of entities that can be handled. Long loading times also reduce the efficiency of conducting experiments. In E2E, we adopt the completion trie data structure proposed in [4] instead of a hash map dictionary. The completion trie greatly reduces the memory footprint and loading time of E2E.

We have tested two learning methods when developing E2E: a structured support vector machine algorithm [2] and a fast implementation of the MART gradient boosting algorithm [3]. The structural SVM model is a linear model that takes all of the candidates in the same tweet into account together. MART learns an ensemble of decision/regression trees with scalar values at the leaves, but treats each candidate separately. The submitted results are generated using MART due to its superior performance on our development set.

Features.
Three groups of features were used in our system. The textual features concern the textual properties of the surface form and its context. For example, one feature indicates whether the current surface form and the surrounding words are capitalized or not. We also use features generated from the output of an in-house named entity recognition system that is specially designed to be robust on non-capitalized words. The entity graph features capture the semantic cohesiveness between the entity-entity and entity-mention pairs. This group of features was mainly calculated using the entity database and its structured data. Finally, the statistical features indicate word usage and entity popularity using information collected from the web.

Among the three feature groups, the statistical feature group is the most important one. We describe some of the most important features in the following. Let a denote the surface form of a candidate, and e denote an entity.

One important feature is the link probability feature P_l(a), which indicates the probability that a phrase is used as an anchor in Wikipedia. For each phrase a, we also collect statistics about the probability that the phrase is capitalized in Wikipedia. We refer to this feature as the capitalization rate feature, P_c(a).

We also compute features that capture the relationships between an anchor a and an entity e. The probability P(e|a) captures the likelihood of an anchor linking to a Wikipedia entity. We have also downloaded Wikipedia page view counts, representing page view information from 2012 (http://dammit.lt/wikistats). Using this popularity information, we add another probability feature that captures the relative popularity of the pages that could be linked from the anchor a. More precisely, P_v(e_i|a) = v(e_i) / (Σ_{e ∈ E(a)} v(e)), where v(e) represents the view count for the page e.

4. RESULTS
In our experiments, we split the training set into two sets that contain 1534 and 800 tweets, respectively. The 800-tweet data is used as our development set. Our analysis shows that robust mention detection is often the source of errors in current entity linking systems. In order to achieve a better F1 score, we change the prediction function to

    arg max_{e ∈ E(a) ∪ {∅}} f(Φ(m, a, e)) − s[e = ∅],    (2)

where [·] is an indicator function. As s increases, the system produces more entities. From the results in Figure 1, we found that tuning s does impact results significantly. After the learned parameters and the desired values of s are chosen, we retrain E2E on the full training data and generate final results with s = 0, 2.5 and 3.5, respectively.

Figure 1: Results of E2E on the development set (precision, recall, and F1 as a function of the threshold s).

Error Analysis.
We analyze our results on the development set with s = 3.5. In the development set, there are 1304 mentions, and E2E generates a total of 18746 candidates in the candidate generation stage. Our error analysis shows that E2E misses 340 entity mentions and predicts 284 extra mentions. Among the errors, E2E has trouble with the "number" entities (e.g. 1_(number)). Further investigation reveals that the tokenization choices of E2E play a big part in these errors, given that most punctuation symbols are treated as separate tokens. Interestingly, in only 44 cases does E2E correctly recognize a mention but link it to the wrong entity. Most errors occur when E2E fails to recognize mentions correctly.

5. CONCLUSIONS
In this paper, we presented E2E, a system that performs joint entity recognition and disambiguation on short and noisy text. We found that the substance of a successful entity linking system consists of successfully combining all of the components.

Due to time limitations, the submitted system still has plenty of room to improve. For example, one important direction is to explore the relationships between different tweets to improve entity linking results. Developing a robust mention detection algorithm is an important research direction as well.

6. REFERENCES
[1] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making Sense of Microposts (#Microposts2014) Named Entity Extraction & Linking Challenge. In Proc. #Microposts2014, pages 54-60, 2014.
[2] M.-W. Chang and W.-T. Yih. Dual coordinate descent algorithms for efficient large margin structured prediction. TACL, 2013.
[3] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1999.
[4] B.-J. P. Hsu and G. Ottaviano. Space-efficient data structures for top-k completion. In WWW, 2013.

Published as part of the #Microposts2014 Workshop (4th Workshop on Making Sense of Microposts, @WWW2014) proceedings, April 7th, 2014, Seoul, Korea; available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141).
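As an illustration of the overlap-resolution stage described in Section 2, here is a minimal sketch of dynamic programming over scored mention spans, i.e., weighted interval scheduling. The span representation and scores are hypothetical; the actual E2E implementation is in C# and operates on scored mention-entity mappings:

```python
import bisect

def resolve_overlaps(spans):
    """spans: list of (start, end, score) token spans with end exclusive.
    Returns the best-scoring set of non-overlapping spans, found by
    dynamic programming (weighted interval scheduling)."""
    spans = sorted(spans, key=lambda s: s[1])  # order by end position
    ends = [s[1] for s in spans]
    best = [(0.0, [])]  # best[k] = (score, chosen spans) over first k spans
    for k, (start, end, score) in enumerate(spans):
        # p = number of earlier spans ending at or before this span starts,
        # i.e., the spans compatible with taking this one
        p = bisect.bisect_right(ends, start, 0, k)
        take_score, take_set = best[p]
        skip_score, skip_set = best[k]
        if take_score + score > skip_score:
            best.append((take_score + score, take_set + [(start, end, score)]))
        else:
            best.append((skip_score, skip_set))
    return best[-1][1]

# The middle span overlaps both neighbors; taking the two outer spans
# scores 1.0 + 1.5 = 2.5, beating the single middle span's 2.0.
print(resolve_overlaps([(0, 2, 1.0), (1, 3, 2.0), (2, 4, 1.5)]))
```

In E2E each span's score would come from the joint recognition-and-disambiguation model, so this step selects the globally best-scoring consistent set of mention-entity mappings rather than greedily keeping the first match.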