E2E: An End-to-End Entity Linking System for Short and Noisy Text

Ming-Wei Chang, Bo-June Hsu, Hao Ma, Ricky Loynd, Kuansan Wang
Microsoft Research, Redmond, WA
{minchang|paulhsu|haoma|riloynd|kuansanw}@microsoft.com

ABSTRACT
We present E2E, an end-to-end entity linking system that is designed for the short and noisy text found in microblogs and text messages. Mining and extracting entities from short text is an essential step for many content analysis applications. By jointly optimizing entity recognition and disambiguation as a single task, our system can process short and noisy text robustly.

Keywords
Information Extraction, Social Media, Entity Linking

1. INTRODUCTION
In this paper, we describe our entity linking system, called E2E, for the #Microposts2014 NEEL Challenge [1]. Our system focuses on the task of extracting and linking entities from short and noisy text given entity databases like Wikipedia or Freebase.

An entity linking system usually needs to perform two key functions: mention recognition and entity disambiguation. In mention recognition, the system identifies each mention (surface form) of an entity in the text. In entity disambiguation, the system maps mentions to canonical entities. E2E has been carefully designed to treat entity recognition and disambiguation as a single task.

2. THE ARCHITECTURE OF E2E
When a short message is received, E2E processes the message in four stages: Text Normalization, Candidate Generation, Joint Recognition-and-Disambiguation, and Overlap Resolution.

Text Normalization.
In this stage, a short message is normalized and tokenized. For tweets, the retweet symbols and some other special symbols are removed. Punctuation symbols are generally represented as separate tokens.

Candidate Generation.
The next step is to generate a list of surface form candidates that could potentially link to entities. E2E uses a lexicon to generate the candidate surface forms. A lexicon is a dictionary that maps a surface form to its possible entity set. For example, the word "giants" could refer to "New York Giants", "San Francisco Giants", etc. Our lexicon is mainly composed by extracting information from Wikipedia and Freebase. The dictionary is constructed to support fuzzy mention matching based on edit distance. Note that we over-generate candidates at this stage, and no filtering is performed.

Joint Recognition and Disambiguation.
This stage is the key component of the E2E framework. Given a message, the goal here is to figure out the entity assignment of each candidate mention generated by the previous stages. Note that a candidate mention may be rejected altogether (mapped to the null entity).

Our model is based on a supervised learning method. Given a message m and a candidate mention a, the entity assignment is generated from the ranking of all possible entities in the entity set E(a):

    arg max_{e ∈ E(a) ∪ {∅}} f(Φ(m, a, e)),    (1)

where f is the model's scoring function, and Φ is a feature function over the input m, the mention a, and the candidate output e. Note that it is very likely that E2E rejects a candidate and does not link it to an entity (i.e., links a to ∅). The joint approach that recognizes and disambiguates entity mentions together is crucial for E2E to properly link surface forms to the corresponding entities.

Overlap Resolution.
At this point, many of the linked mentions will overlap each other. Dynamic programming resolves these conflicts by choosing the best-scoring set of non-overlapping mention-entity mappings. The experimental results show that resolving overlap improves the model's performance consistently in different settings.

3. SYSTEM IMPLEMENTATION
Our database is constructed from both Wikipedia and Freebase.
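As an illustration of the lexicon-based candidate generation described in Section 2, here is a minimal sketch with hypothetical data. The real E2E lexicon is extracted from Wikipedia and Freebase, is implemented in C#, and additionally supports fuzzy matching by edit distance, which this exact-match sketch omits:

```python
# Hypothetical toy lexicon: surface form -> set of possible entities.
# The real lexicon is far larger and supports edit-distance fuzzy lookup.
LEXICON = {
    "giants": {"New_York_Giants", "San_Francisco_Giants"},
    "new york giants": {"New_York_Giants"},
}

def generate_candidates(tokens, max_len=3):
    """Look up every n-gram (up to max_len tokens) of the tokenized
    message in the lexicon, over-generating candidate spans with no
    filtering, as in the Candidate Generation stage."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            surface = " ".join(tokens[i:j]).lower()
            if surface in LEXICON:
                candidates.append((i, j, surface, LEXICON[surface]))
    return candidates

# One candidate span for "giants", ambiguous between two entities;
# disambiguation (or rejection to the null entity) happens later.
print(generate_candidates("the giants won".split()))
```

Each candidate span then flows into the joint recognition-and-disambiguation stage, where it is either linked to one entity in its set or rejected.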
The whole system is implemented in C#.

Entity linking systems often require a large amount of memory due to the size of the structured/unstructured data for many entities. High memory consumption restricts the scale of an entity linking system, limiting the number of entities that can be handled. Long loading times also reduce the efficiency of conducting experiments. In E2E, we adopt the completion trie data structure proposed in [4] instead of a hash map dictionary. The completion trie greatly reduces the memory footprint and loading time of E2E.

We have tested two learning methods when developing E2E: a structured support vector machine algorithm [2] and a fast implementation of the MART gradient boosting algorithm [3]. The structural SVM model is a linear model that takes all of the candidates in the same tweet into account together. MART learns an ensemble of decision/regression trees with scalar values at the leaves, but treats each candidate separately. The submitted results are generated using MART due to its superior performance on our development set.

Features.
Three groups of features were used in our system. The textual features concern the textual properties of the surface form and its context. For example, one feature indicates whether the current surface form and the surrounding words are capitalized or not. We also use features generated from the output of an in-house named entity recognition system that is specially designed to be robust on non-capitalized words. The entity graph features capture the semantic cohesiveness between the entity-entity and entity-mention pairs. This group of features was mainly calculated using the entity database and its structured data. Finally, the statistical features indicate word usage and entity popularity using information collected from the web.

Among the three feature groups, the statistical feature group is the most important one. We describe some of the most important features in the following. Let a denote the surface form of a candidate, and e denote an entity.

One important feature is the link probability feature P_l(a), which indicates the probability that a phrase is used as an anchor in Wikipedia. For each phrase a, we also collect statistics about the probability that the phrase is capitalized in Wikipedia. We refer to this feature as the capitalization rate feature, P_c(a).

We also compute features that capture the relationships between an anchor a and an entity e. The probability P(e|a) captures the likelihood of an anchor linking to a Wikipedia entity. We have also downloaded Wikipedia page view counts, representing page view information from 2012 (http://dammit.lt/wikistats). Using this popularity information, we add another probability feature that captures the relative popularity of the pages that could be linked from the anchor a. More precisely, P_v(e_i|a) = v(e_i) / (Σ_{e ∈ E(a)} v(e)), where v(e) represents the view count for the page e.

4. RESULTS
In our experiments, we split the training set into two sets that contain 1534 and 800 tweets, respectively. The 800-tweet data is used as our development set. Our analysis shows that robust mention detection is often the source of errors in current entity linking systems. In order to achieve a better F1 score, we change the prediction function to

    arg max_{e ∈ E(a) ∪ {∅}} f(Φ(m, a, e)) − s[e = ∅],    (2)

where [·] is an indicator function. As s increases, the system produces more entities. From the results in Figure 1, we found that tuning s does impact results significantly. After the learned parameters and the desired values of s are chosen, we retrain E2E on the full training data and generate final results with s = 0, 2.5 and 3.5, respectively.

Figure 1: Results of E2E on the development set (precision, recall, and F1 as a function of the threshold s).

Error Analysis.
We analyze our results on the development set with s = 3.5. In the development set, there are 1304 mentions, and E2E generates a total of 18746 candidates in the candidate generation stage. Our error analysis shows that E2E misses 340 entity mentions and predicts 284 extra mentions. Among the errors, E2E has trouble with the "number" entities (e.g. 1_(number)). Further investigation reveals that the tokenization choices of E2E play a big part in these errors, given that most punctuation symbols are treated as separate tokens. Interestingly, in only 44 cases does E2E correctly recognize a mention but link it to the wrong entity. Most errors occur when E2E fails to recognize mentions correctly.

5. CONCLUSIONS
In this paper, we presented E2E, a system that performs joint entity recognition and disambiguation on short and noisy text. We found that the substance of a successful entity linking system consists of successfully combining all of the components.

Due to time limitations, the submitted system still has plenty of room to improve. For example, one important direction is to explore the relationships between different tweets to improve entity linking results. Developing a robust mention detection algorithm is an important research direction as well.

6. REFERENCES
[1] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making Sense of Microposts (#Microposts2014) Named Entity Extraction & Linking Challenge. In Proc. #Microposts2014, pages 54-60, 2014.
[2] M.-W. Chang and W.-T. Yih. Dual coordinate descent algorithms for efficient large margin structured prediction. TACL, 2013.
[3] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1999.
[4] B.-J. P. Hsu and G. Ottaviano. Space-efficient data structures for top-k completion. In WWW, 2013.

Published as part of the #Microposts2014 Workshop (4th Workshop on Making Sense of Microposts, @WWW2014) proceedings, April 7th, 2014, Seoul, Korea; available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141).
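As an illustration of the overlap-resolution stage described in Section 2, here is a minimal sketch of dynamic programming over scored mention spans, i.e., weighted interval scheduling. The span representation and scores are hypothetical; the actual E2E implementation is in C# and operates on scored mention-entity mappings:

```python
import bisect

def resolve_overlaps(spans):
    """spans: list of (start, end, score) token spans with end exclusive.
    Returns the best-scoring set of non-overlapping spans, found by
    dynamic programming (weighted interval scheduling)."""
    spans = sorted(spans, key=lambda s: s[1])  # order by end position
    ends = [s[1] for s in spans]
    best = [(0.0, [])]  # best[k] = (score, chosen spans) over first k spans
    for k, (start, end, score) in enumerate(spans):
        # p = number of earlier spans ending at or before this span starts,
        # i.e., the spans compatible with taking this one
        p = bisect.bisect_right(ends, start, 0, k)
        take_score, take_set = best[p]
        skip_score, skip_set = best[k]
        if take_score + score > skip_score:
            best.append((take_score + score, take_set + [(start, end, score)]))
        else:
            best.append((skip_score, skip_set))
    return best[-1][1]

# The middle span overlaps both neighbors; taking the two outer spans
# scores 1.0 + 1.5 = 2.5, beating the single middle span's 2.0.
print(resolve_overlaps([(0, 2, 1.0), (1, 3, 2.0), (2, 4, 1.5)]))
```

In E2E each span's score would come from the joint recognition-and-disambiguation model, so this step selects the globally best-scoring consistent set of mention-entity mappings rather than greedily keeping the first match.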