=Paper=
{{Paper
|id=Vol-1691/paper_14
|storemode=property
|title=Named Entity Linking in #Tweets with KEA
|pdfUrl=https://ceur-ws.org/Vol-1691/paper_14.pdf
|volume=Vol-1691
|authors=Jörg Waitelonis,Harald Sack
|dblpUrl=https://dblp.org/rec/conf/msm/WaitelonisS16
}}
==Named Entity Linking in #Tweets with KEA==
Jörg Waitelonis, Harald Sack
Hasso-Plattner-Institute, Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany
{joerg.waitelonis|harald.sack}@hpi.de

ABSTRACT
This paper presents the KEA system at the #Microposts 2016 NEEL Challenge. Its task is to recognize and type mentions from English microposts and link them to their corresponding entries in DBpedia. For this task, we have adapted our Named Entity Disambiguation tool, originally designed for natural language text, to the special requirements of noisy, terse, and poorly worded tweets containing special functional terms and language.

Keywords
named entity linking, disambiguation, microposts

1. INTRODUCTION
Microposts have become a highly popular medium to share facts, opinions, or emotions. They provide an invaluable real-time resource of data, ready to be mined for training predictive models. However, the effectiveness of existing analysis tools faces critical challenges when applied to microposts. In fact, it is seriously compromised, since Twitter (http://twitter.com/) messages are often noisy, terse, poorly worded, and posted in many different languages. They contain special functional expressions, such as usernames, hashtags, retweets, abbreviations, and cyber-slang [2]. Moreover, Twitter, being the most popular micropost service, follows a streaming paradigm imposing that entities must be recognized in real time.

In this paper, we describe our approach to the #Micropost 2016 NEEL challenge [3]: the adaptation of an existing Named Entity Disambiguation system, KEA, originally designed for the processing of natural language texts, to the special challenges imposed by microposts.

KEA originally implements a dictionary- and knowledge-based approach to word sense disambiguation, i.e., co-occurrence analysis based on articles of the English Wikipedia (http://wikipedia.org/), combined with a link-graph analysis on the Wikipedia hyperlink graph and the DBpedia (http://dbpedia.org/) knowledge base. The basic principles of KEA named entity linking are summarized in [4]. A comparison of KEA and other state-of-the-art named entity linking systems is provided in [6]. In the subsequent sections, KEA is introduced in more detail, followed by the adaptations made especially for the NEEL challenge, and our achieved results.
2. THE KEA APPROACH
To address the tasks of the #Micropost 2016 NEEL challenge, we have adapted our NEL approach KEA. It is originally configured to be applied to natural language text and to combinations of textual metadata from heterogeneous sources, e.g., metadata generated by automated multimedia analysis or user-provided metadata such as tags, comments, and discussions. All this metadata can be of different provenance, reliability, and trustworthiness, as well as level of abstraction.

KEA uses DBpedia as a reference knowledge base for entity linking and basically follows the five-stage approach depicted in Fig. 1.

2.1 Preprocessing
The incoming text is processed by the following linguistic pipeline. The Stanford Log-linear tagger [5] as well as the Stanford Named Entity Recognizer (NER) [1] are applied to determine part-of-speech and named entity types. Next, an ASCII folding filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the "Basic Latin" Unicode block into their ASCII equivalents, e.g., "Ole Rømer" is transformed to "Ole Romer". Tokenization is performed on non-word characters, except special characters joining compound words, such as "-".

The resulting list of tokens is fed into a shingle filter to construct token n-grams from the token stream. For example, the sentence "please divide this sentence into shingles" might be tokenized into the 2-shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles". Usually, 3-shingles are created by default.
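The shingle construction described above can be sketched as follows. This is a minimal illustration only; the function name and the choice to keep the original unigrams alongside the n-grams are assumptions, not KEA's actual implementation:

```python
def shingles(tokens, n_max=3):
    """Build token n-grams ("shingles") up to n_max, keeping the unigrams."""
    out = list(tokens)  # sole words stay in the stream
    for n in range(2, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

# 2-shingles for the example sentence from the paper
print(shingles(["please", "divide", "this", "sentence", "into", "shingles"], n_max=2))
```

For proper nouns recognized by the NER, the paper raises `n_max` to 5 so that longer compound names such as "John F. Kennedy Airport" fall inside a single shingle.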
In the case of a proper noun recognized by the NER, at most 5-shingles are created with the ±2 surrounding tokens. This extension enables mapping longer compound proper names, such as "John F. Kennedy Airport", which otherwise could not be mapped correctly with a 3-shingle configuration. The token stream now contains tokens with sole words, but also tokens with 'shingled' words.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2016 Workshop proceedings, available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

Figure 1: The overall NEL process: Preprocessing (tokenizing, POS-tagging, …) → Candidate Mapping → Candidate Merging and Filtering → Scoring (Feature Generation) → Disambiguation.

2.2 Candidate Mapping
Every token is mapped against a gazetteer, which has been compiled from DBpedia entities' labels, redirect labels, and disambiguation labels, each mapped to their appropriate DBpedia entities. Since the gazetteer originally used in KEA is based on DBpedia 3.9, entities and labels from the DBpedia 2015-04 dataset were added for the NEEL challenge. Labels are indexed lowercase and finally mapped to the tokens, resulting in a list of potential entity candidates for each token. The mapping is obtained by exact matches only; a normalization of simple plural forms is applied beforehand. Hence, for each token of the token stream, a set of potential entity candidates is determined.
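A minimal sketch of this mapping step, assuming a toy in-memory gazetteer. The dictionary contents and the naive plural folding are illustrative stand-ins for KEA's actual DBpedia-derived index:

```python
# Hypothetical gazetteer: lowercase label -> DBpedia entity candidates
GAZETTEER = {
    "new york": ["dbp:New_York", "dbp:New_York_City"],
    "airport":  ["dbp:Airport"],
}

def normalize(token):
    """Lowercase the token and fold a simple plural form, if the singular is known."""
    token = token.lower()
    if token.endswith("s") and token[:-1] in GAZETTEER:
        token = token[:-1]
    return token

def candidates(token):
    """Exact-match lookup only, as in the paper."""
    return GAZETTEER.get(normalize(token), [])

print(candidates("Airports"))   # -> ['dbp:Airport']
print(candidates("Manhattan"))  # no exact match -> []
```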
2.3 Candidate Merging and Filtering
To resolve possible overlaps of tokens, longer tokens which are mapped successfully are preferred over shorter ones. Since longer tokens contain more descriptive terms, they are considered to be more specific. This means, for example, that "new york city" is preferred over "new york" and "york city". Furthermore, tokens are discarded if they do not contain nouns or consist solely of stopwords; e.g., the token "the times" will not be discarded, because it contains the noun "times".
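The longest-match preference can be sketched as a greedy selection over mapped spans. The `(start, end, surface)` span representation and the function name are assumptions for illustration:

```python
def merge(spans):
    """Keep the longest non-overlapping spans; spans are (start, end, surface)."""
    kept = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        # keep a span only if it does not overlap an already-kept (longer) one
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)

spans = [(0, 3, "new york city"), (0, 2, "new york"), (1, 3, "york city")]
print(merge(spans))  # -> [(0, 3, 'new york city')]
```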
2.4 Scoring (Feature Generation)
For every entity candidate, features are determined via a pipeline of analysis components (scorers). These components assess different characteristics of how well a candidate entity fits the given input text, which is considered the context. We distinguish between local and context-related features. Local features only consider the candidate's as well as the token's properties. For example, consider the text "Armstrong landed on earth's satellite": for a candidate, w.l.o.g. "dbp:Neil_Armstrong", from the candidate list of the token 'Armstrong', features can be determined such as the string distance between the candidate labels and the token (respectively the surface form), the candidate's link-graph popularity, its DBpedia type, the provenance of the label the surface form matches best (e.g., main label or redirect label), or the level of ambiguity of the token (e.g., approximated by the number of candidates).
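A sketch of what such a local feature vector might look like for the 'Armstrong' example. The feature names, the use of `difflib` for string similarity, and the inverse-candidate-count ambiguity proxy are assumptions, not KEA's actual scorers:

```python
import difflib

def local_features(surface, candidate, label, popularity, n_candidates):
    """Raw (unscaled) local scores for one candidate entity of a token."""
    return {
        "string_sim": difflib.SequenceMatcher(
            None, surface.lower(), label.lower()).ratio(),
        "popularity": popularity,         # e.g. in-link count in the link graph
        "ambiguity": 1.0 / n_candidates,  # fewer candidates -> less ambiguous
    }

print(local_features("Armstrong", "dbp:Neil_Armstrong", "Neil Armstrong",
                     popularity=1500, n_candidates=12))
```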
Context features assess the relation of a candidate entity to the other candidates within the given context, e.g., direct links to other context candidates in the DBpedia link graph, co-occurrence of the other tokens' surface forms in the corresponding Wikipedia article of the candidate under consideration, co-references in Wikipedia articles, as well as further graph-based features of the link graph induced by all candidates of the context (the context graph). This includes, for example, graph distance measurements, connected component analysis, or centrality and density observations.

Overall, after this processing step, every candidate gets assigned a list of scores determined via several of the mentioned methods. These lists of scores are considered the candidates' feature vectors, expressing how well a candidate entity fits the given context.

2.5 Disambiguation
Since all scores of the analyzed features have a positive but unbounded value range, a linear feature scaling is applied to standardize the ranges between 0.0 and 1.0. Different approaches, ranging from statistical analysis to machine learning techniques, can be envisaged to decide which candidate is chosen as the winner for a token. The most basic approach considers the weighted sum of the scores as a confidence score, where the weights are optimized via grid search on a given development or training dataset. The confidence score is cut off by an empirically optimized threshold to decide whether a candidate entity is to be selected as the assumed correct result.
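The scaling, weighted sum, and threshold cut-off of this step can be sketched as follows; a minimal version assuming min-max scaling across the candidate set of one token (feature names, weights, and the threshold value are illustrative):

```python
def disambiguate(cands, weights, threshold=0.5):
    """cands: {entity: {feature: raw_score}}; return winner or None (NIL)."""
    feats = {f for scores in cands.values() for f in scores}
    lo = {f: min(c[f] for c in cands.values()) for f in feats}
    hi = {f: max(c[f] for c in cands.values()) for f in feats}

    def conf(scores):
        # linear scaling to [0, 1], then weighted sum as confidence score
        return sum(weights[f] * ((scores[f] - lo[f]) / (hi[f] - lo[f])
                                 if hi[f] > lo[f] else 0.0)
                   for f in feats)

    best = max(cands, key=lambda e: conf(cands[e]))
    return best if conf(cands[best]) >= threshold else None

cands = {"dbp:Neil_Armstrong":  {"pop": 1500, "sim": 0.6},
         "dbp:Louis_Armstrong": {"pop":  900, "sim": 0.6}}
print(disambiguate(cands, weights={"pop": 0.7, "sim": 0.3}, threshold=0.4))
# -> dbp:Neil_Armstrong
```

In KEA the weights would come from a grid search on the development data; a candidate below the threshold yields a NIL annotation.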
3. ADAPTATIONS TO THE NEEL CHALLENGE
To be applicable also to microposts, as in the NEEL challenge, the KEA processing has been adapted in two ways. We distinguish between modifications made for the general domain of microposts/tweets and modifications resulting from observation of the provided training data set.

3.1 Adaptations to the Domain
For the NEEL challenge, we have utilized characteristic tweet information by excluding "@" and "#" from the tokenization to later identify Twitter user names and hashtags properly. With respect to the provided NEEL challenge annotation guidelines, the filter is extended to restrict the system to tokens containing singular and plural proper nouns, user names, and hashtags only. The stopword list is extended with Twitter-specific functional terms (e.g., "RT", "MT", etc.) to be ignored in further processing. KEA is configured to consider a single micropost (tweet) as the given context for disambiguation. Furthermore, the threshold on the achieved confidence score is used to cut off uncertain candidates, resulting in NIL annotations. Tokens identified as user names or hashtags which cannot successfully be mapped to candidate entities are also annotated with NIL.

3.2 Adaptations to the Training Set
From the provided training dataset, all surface forms have been extracted to extend the gazetteer for candidate mapping. We have optimized the scorer weights as well as the overall threshold according to the results achieved for the training and development datasets. Furthermore, the stopword list has been extended according to these results with terms constantly mapped wrongly because they have not been annotated in the datasets, such as weekdays and months.

3.3 Types
Since KEA did not support the required annotation with types out of the box, a simple extension of the original framework has been implemented. For a disambiguated mapped entity, type annotations are determined simply via lookup in the DBpedia instance types dataset. For NIL annotations, where no entity could be determined, the according NER type, if available, is chosen.
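The tweet-specific tokenization from Section 3.1 can be sketched as follows; a simplified regular-expression version, assuming a small stopword set (the pattern and the stopword contents are illustrative, not KEA's actual filter):

```python
import re

# Twitter-specific functional terms to ignore (illustrative subset)
TWITTER_STOPWORDS = {"rt", "mt"}

def tweet_tokens(text):
    """Tokenize, keeping '@'/'#' attached so user names and hashtags survive,
    and keeping '-' inside compound words."""
    tokens = re.findall(r"[@#]?\w+(?:-\w+)*", text)
    return [t for t in tokens if t.lower() not in TWITTER_STOPWORDS]

print(tweet_tokens("RT @nasa: #kyloren lands at JFK-Airport"))
# -> ['@nasa', '#kyloren', 'lands', 'at', 'JFK-Airport']
```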
4. EXPERIMENTS AND RESULTS
For the #Microposts 2016 NEEL challenge, we first analyzed the provided development dataset without the above-described adaptations to obtain a baseline (cf. Table 1), and then again with the NEEL challenge modifications (cf. Table 2).

Table 1: Results for the NEEL2016 development data set (baseline, without modifications)

Measure                     | Prec. | Recall | F1 score
strong link match           | 0.399 | 0.490  | 0.440
strong typed mention match  | 0.232 | 0.213  | 0.222
mention ceaf                | 0.611 | 0.562  | 0.586

Table 2: Results for the NEEL2016 development data set after adaptations and optimization

Measure                     | Prec. | Recall | F1 score
strong link match           | 0.667 | 0.862  | 0.752
strong typed mention match  | 0.572 | 0.660  | 0.613
mention ceaf                | 0.744 | 0.858  | 0.797

According to our expectations, the special adaptations for the NEEL challenge have resulted in significantly better results compared to the original tool configuration. A closer inspection of the achieved mappings has shown that KEA was able to find correct mappings to entities which are not provided in the NEEL ground truth, e.g.:

#wcyb -> dbp:WCYB-TV
#WSJ -> dbp:The_Wall_Street_Journal
#NSC -> dbp:National_Security_Council
#kyloren -> dbp:Kylo_Ren

Compared to the training data ground truth, the KEA system tends to detect mentions overeagerly, i.e., the system produces more extra annotations than missing annotations, which results in a loss of precision. Many of KEA's extra annotations are common nouns such as affirmative action, astronaut, petition, signature, mosque, emoji, and enemy.

5. CONCLUSION & FUTURE WORK
For the task of NEL on microposts, it is a challenge to maintain the topicality of the underlying knowledge base. New hashtags, neologisms, and cyber-slang are rather difficult to resolve correctly in an automated way because they are not present in the dictionaries. To cope with this situation, one possibility would be to include a live analysis of the Wikipedia update stream to extend or prioritize the used dictionary of surface forms as well as the underlying link graph.

From our observations, a significant part of the achieved improvement results from the fact that the training and test sets cover identical domains (i.e., Star Wars and Donald Trump). Hence, the extension of the dictionary with surface forms of the training dataset seems to be very effective. The conclusion is that a domain adaptation of a given general-purpose system might lead to significantly better results. Even if this sounds trivial, we did not expect an improvement of c. 40% in f-measure.

Unfortunately, many documents of the training data set (1951 out of 6024) do not have any annotations at all. Therefore, we are looking forward to future NEEL challenges with more complete ground truth datasets.
6. REFERENCES
[1] J. R. Finkel, T. Grenager, and C. D. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 363–370, 2005.
[2] B. Han and T. Baldwin. Lexical Normalisation of Short Text Messages: Makn Sens a #Twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA, 2011.
[3] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors, 6th Workshop on Making Sense of Microposts (#Microposts2016), pages 50–59, 2016.
[4] H. Sack. The Journey is the Reward - Towards New Paradigms in Web Search. In Business Information Systems Workshops: BIS 2015 International Workshops, Poznań, Poland, June 24-26, 2015, Revised Papers, pages 15–26. Springer International Publishing, Cham, 2015.
[5] K. Toutanova and C. D. Manning. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-speech Tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP '00, pages 63–70, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
[6] R. Usbeck, M. Röder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. Brümmer, D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, P. Ferragina, C. Lemke, A. Moro, R. Navigli, F. Piccinno, G. Rizzo, H. Sack, R. Speck, R. Troncy, J. Waitelonis, and L. Wesemann. GERBIL – General Entity Annotation Benchmark Framework. In Proceedings of the 24th International Conference on World Wide Web (WWW '15), pages 1133–1143. ACM, 2015.