Combining Multiple Signals for Semanticizing Tweets: University of Amsterdam at #Microposts2015

Cristina Gârbacea, Daan Odijk, David Graus, Isaac Sijaranamual, Maarten de Rijke
University of Amsterdam, Science Park 904, Amsterdam, The Netherlands
{G.C.Garbacea, D.Odijk, D.P.Graus, I.B.Sijaranamual, deRijke}@uva.nl

ABSTRACT
In this paper we present an approach for extracting and linking entities from short and noisy microblog posts. We describe a diverse set of approaches based on the Semanticizer, an open-source entity linking framework developed at the University of Amsterdam, adapted to the task of the #Microposts2015 challenge. We consider alternatives for dealing with ambiguity that can help in the named entity extraction and linking processes. We retrieve entity candidates from multiple sources and process them in a four-step pipeline. Results show that we correctly identify entity mentions (our best run attains an F1 score of 0.809 in terms of the strong mention match metric), but subsequent steps prove more challenging for our approach.

Keywords
Named entity extraction; Named entity linking; Social media

1. INTRODUCTION
This paper describes our participation in the named entity extraction and linking challenge at #Microposts2015. Information extraction from microblog posts is an emerging research area that poses a series of problems to the natural language processing community due to the shortness, informality and noisy lexical nature of the content. Extracting entities from tweets is a complex process, typically performed in a sequential fashion. As a first step, named entity recognition (NER) aims to detect mentions that refer to entities, e.g., names of people, locations, organizations or products (also known as entity detection), and subsequently to classify the mentions into predefined categories (entity typing). After NER, named entity linking (NEL) is performed: the identified mentions are linked to entries in a knowledge base (KB). Due to its richness in semantic content and coverage, Wikipedia is a commonly used KB for linking mentions to entities, or for deciding that a mention refers to an entity that is not in the KB, in which case it is referenced by a NIL identifier. DBpedia extracts structured information from Wikipedia and combines it into a large, cross-domain knowledge graph that provides explicit structure between concepts and the relations among them.

Our participation in this challenge revolves around existing open-source entity linking software developed at the University of Amsterdam. We use Semanticizer (https://github.com/semanticize/semanticizer), a state-of-the-art entity linking framework. So far, Semanticizer has been successfully employed for linking entities in search engine queries [1] and for linking entities in short documents in streaming scenarios [6]. Moreover, it has been extended to deal with additional types of data such as television subtitles [3]. In what follows we explain how we use Semanticizer for the task at hand and describe each of our submitted runs to the competition.

2. SYSTEM ARCHITECTURE
Our system processes each incoming tweet in four stages: mention detection, entity disambiguation and typing, NIL identification and clustering, and overlap resolution. We explain each stage in turn.

Mention detection: The first step aims to identify all entity mentions in the input text and is oriented towards high recall. We take the union of the output of two mention identification methods:

Semanticizer: the state-of-the-art system performs lexical matching of entities' surface forms. These surface forms are derived from the KB and comprise anchor texts that refer to Wikipedia pages, disambiguation and redirect pages, and page titles, as described in Table 1. For this, we use two instances of Semanticizer, running on two Wikipedia dumps: one dated May 2014 (the version used to build DBpedia 3.9), and a more recent one, dated February 2015. We perform three separate preprocessing steps on the tweet text and send the result of each step to the Semanticizer. These steps produce: i) the raw text, ii) the cleaned text (replacing @-mentions with the corresponding Twitter account names, and splitting hashtags using dynamic programming), and iii) the normalized text (e.g., case-folding and removing diacritics).

NER: For identifying entity mentions that do not exist in Wikipedia, i.e., out-of-KB entities, we employ a state-of-the-art named entity recognizer, previously applied to finding mentions of emerging entities on Twitter [2]. We train five different NER models: three using the ground truth data from the Microposts challenges (2013 through 2015), one using pseudo-ground truth (generated by linking tweets as in [2]), and one trained on all data.

Given the candidate mentions identified by NER and Semanticizer, we include a binary feature that expresses whether a mention has been detected by both systems. For each mention we end up with the set of features described in Table 1, which we use to train a Random Forest classifier (using 100 trees and rebalancing the classes per tweet by modifying instance weights) to predict whether a candidate mention is an entity mention, i.e., actually refers to an entity.

Table 1: Features used for mention detection.

Feature            Description
linkOccCount       no. of times mention appears as anchor text on Wikipedia
linkDocCount       no. of docs in which mention appears as anchor text
occCount           no. of times the mention appears on Wikipedia
senseOccCount      no. of times the mention is anchor to Wikipedia title
senseDocCount      no. of docs the mention is anchor to Wikipedia title
priorProbability   % of docs where anchor links to target Wikipedia title
linkProbability    % of docs where mention is anchor for a Wikipedia link
senseProbability   % of docs where mention links to target Wikipedia article
isCommon           the mention is found by both NER and Semanticizer
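To make the final step of this stage concrete, the sketch below shows one way such a mention classifier could be set up with scikit-learn. The toy feature rows, variable names, and the specific per-tweet weighting scheme are our own illustrative assumptions and not a description of the exact implementation used in our system.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: one row of Table 1 features per candidate mention (isCommon last),
# a binary label (1 = true entity mention), and the id of the originating tweet.
X = np.array([[12,  8, 40, 10,  6, 0.80, 0.20, 0.70, 1],
              [ 3,  2, 90,  1,  1, 0.10, 0.05, 0.10, 0],
              [25, 20, 60, 18, 15, 0.90, 0.30, 0.85, 1],
              [ 1,  1, 15,  0,  0, 0.02, 0.01, 0.02, 0]], dtype=float)
y = np.array([1, 0, 1, 0])
tweet_ids = np.array([101, 101, 102, 102])

def per_tweet_weights(y, tweet_ids):
    # Rebalance classes per tweet: within each tweet the positive and negative
    # candidates receive equal total weight, so frequent negatives do not dominate.
    weights = np.ones(len(y), dtype=float)
    for t in np.unique(tweet_ids):
        idx = np.where(tweet_ids == t)[0]
        for label in np.unique(y[idx]):
            members = idx[y[idx] == label]
            weights[members] = 1.0 / len(members)
    return weights

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y, sample_weight=per_tweet_weights(y, tweet_ids))
print(clf.predict(X))  # 1 = keep the candidate as an entity mention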
Entity disambiguation and typing: Given the entity mentions from the previous stage, the next step is to identify the referenced entities. We retrieve the full list of candidate entities, extract features, and cast the disambiguation step of identifying the correct entity for a mention as a learning to rank problem. Next to the features in Table 1, we use additional full-text search features. We index Wikipedia using ElasticSearch (ES) and issue the tweet as a query to obtain retrieval scores for the candidate entities. We also retrieve the 10 most similar entities for each candidate, using a more-like-this query. Finally, we incorporate Wikipedia page view statistics (https://dumps.wikimedia.org/other/pagecounts-raw/) from April 2014 as features. We use these features to train a RankSVM model to rank the entity candidates for each mention, and take the top-ranked candidate as the entity to link. We map the entity to its DBpedia URI and determine its type through a manual mapping of DBpedia classes to the #Microposts2015 taxonomy.
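The learning-to-rank step can be illustrated with the standard pairwise reduction of RankSVM: within each mention's candidate list, difference vectors between the gold candidate and its competitors train a linear SVM whose weight vector then scores unseen candidates. The sketch below, including the scikit-learn shortcut, the toy data, and all variable names, is our own illustration; it is not the RankSVM implementation used for our runs.

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, mention_ids):
    # Build pairwise training examples per mention: (gold - other) labelled +1
    # and (other - gold) labelled -1, the usual RankSVM reduction.
    diffs, labels = [], []
    for m in np.unique(mention_ids):
        idx = np.where(mention_ids == m)[0]
        gold = idx[y[idx] == 1]
        rest = idx[y[idx] == 0]
        for g in gold:
            for r in rest:
                diffs.append(X[g] - X[r]); labels.append(1)
                diffs.append(X[r] - X[g]); labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy data: three candidates for one mention and two for another (label 1 = gold entity).
X = np.array([[0.9, 12.0, 0.8], [0.2, 3.0, 0.1], [0.4, 5.0, 0.3],
              [0.7,  9.0, 0.6], [0.3, 2.0, 0.2]], dtype=float)
y = np.array([1, 0, 0, 1, 0])
mention_ids = np.array([0, 0, 0, 1, 1])

X_pairs, y_pairs = pairwise_transform(X, y, mention_ids)
ranker = LinearSVC(C=1.0).fit(X_pairs, y_pairs)

def best_candidate(X_cand):
    # Score the candidates for one mention and return the index of the top-ranked one.
    return int(np.argmax(X_cand @ ranker.coef_.ravel()))

print(best_candidate(X[mention_ids == 0]))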
NIL identification and clustering: To decide whether the top-ranked entity is correct, or whether the mention refers to an out-of-KB entity, we compute meta-features based on the scores of the RankSVM model. We use these meta-features to train a Random Forest classifier for NIL detection. We cluster NILs by linking identical mentions to a single NIL identifier based on their surface forms.

Overlap resolution: Finally, we resolve all overlapping mentions output by the mention identification step. We create a graph of all non-overlapping mentions and assign each mention its link score (non-linked mentions get a fixed score). We then find the highest-scoring path through the graph using dynamic programming, and return the mentions on this path as our resolved list of mentions.
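Selecting the highest-scoring path through the graph of non-overlapping mentions amounts to weighted interval scheduling over character spans, which the following sketch implements. The function and parameter names, and the particular fixed score for non-linked mentions, are illustrative assumptions rather than the authors' code.

from bisect import bisect_right

def resolve_overlaps(mentions, default_score=0.1):
    # mentions: list of (start, end, score) character spans; score is None for
    # non-linked mentions, which receive a fixed default score.
    spans = sorted(((s, e, sc if sc is not None else default_score)
                    for s, e, sc in mentions), key=lambda m: m[1])
    ends = [e for _, e, _ in spans]
    # best[i] = (total score, chosen spans) considering only the first i spans
    best = [(0.0, [])]
    for i, (s, e, sc) in enumerate(spans):
        j = bisect_right(ends, s, 0, i)  # latest compatible span ends at or before s
        take_score = best[j][0] + sc
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(s, e, sc)]))
        else:
            best.append(best[i])
    return best[-1][1]

# "Obama" (0-5) overlaps "Obama administration" (0-20); the higher-scoring span wins,
# and the non-overlapping, non-linked mention at (25-31) is kept with the default score.
print(resolve_overlaps([(0, 5, 0.4), (0, 20, 0.9), (25, 31, None)]))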
Our submitted runs rely on this scheme and variations thereof; see Table 2 for an overview. We hypothesize that the Semanticizer will yield high entity recall but low precision, and that filtering the resulting candidates by senseProbability will increase precision. We expect the NER runs to be superior to Semanticizer-only or ES-only runs. Finally, we believe that combining the NER and Semanticizer outputs with additional candidates returned by ES will outperform all our other runs.

Table 2: Description of our runs.

RunID   NER   Semanticizer  Disambiguation    Filter
Run 1   2015  -             -                 -
Run 2   2015  -             full-text search  -
Run 3   2015  -             full-text search  NIL
Run 4   all   -             full-text search  -
Run 5   -     2014          senseProbability  -
Run 6   Same as Run 5 without overlap resolution.
Run 7   all   all           full-text search  NIL
Run 8   2015  all           RankSVM           NIL
Run 9   all   all           RankSVM           NIL
Run 10  Same as Run 9 with a lower mention detection threshold.

3. RESULTS
We evaluate our approach on the dev set of 500 tweets made available by the organizers [4, 5]. In Table 3 we report the official metrics for entity detection, tagging, clustering and linking. Our best performing runs in terms of mention detection and typing (Run 1 and Run 2) rely mainly on NER and ES features. Even though the Semanticizer detects candidates with high recall, our analysis indicates that most errors occur when the system fails to recognize mentions correctly, which negatively impacts the linking scores. Since each step in the pipeline relies on the output of the previous step, cascading errors influence our results, and we believe a more in-depth error analysis of each stage is desirable. Despite its simplicity, our clustering approach performs reasonably well.

Table 3: F1 scores on the dev set for strong mention match (SMM), strong typed mention match (STMM), strong link match (SLM), and mention ceaf (MC) metrics.

RunID   SMM    STMM   SLM    MC
Run 1   0.809  0.456  0.164  0.715
Run 2   0.809  0.460  0.330  0.731
Run 3   0.809  0.455  0.291  0.730
Run 4   0.554  0.311  0.213  0.497
Run 5   0.411  0.288  0.280  0.374
Run 6   0.620  0.389  0.280  0.567
Run 7   0.533  0.330  0.210  0.486
Run 8   0.732  0.418  0.334  0.633
Run 9   0.577  0.365  0.247  0.525
Run 10  0.566  0.355  0.280  0.515

4. CONCLUSION
We have presented a system that performs entity mention detection, disambiguation and clustering on short and noisy text by drawing candidates from multiple sources and combining them. We observe that our simple NER and ES runs perform better than our more complex runs. We believe that more robust methods are needed to deal with the errors introduced at each step of the pipeline. For future work we plan to improve mention detection with additional Semanticizer features.

Acknowledgements. This research was partially supported by the Netherlands Organisation for Scientific Research (NWO) under project numbers 727.011.005 (SEED) and 640.006.013 (DADAISM), by Amsterdam Data Science, and by the Dutch national program COMMIT.

REFERENCES
[1] D. Graus, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Semanticizing search engine queries: the University of Amsterdam at the ERD 2014 challenge. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, 2014.
[2] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke. Generating pseudo-ground truth for predicting new concepts in social streams. In ECIR 2014. Springer, 2014.
[3] D. Odijk, E. Meij, and M. de Rijke. Feeding the second screen: Semantic linking based on subtitles. In OAIR 2013, 2013.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In Rowe et al. [5], pages 44–53.
[5] M. Rowe, M. Stankovic, and A.-S. Dadzie, editors. Proceedings, 5th Workshop on Making Sense of Microposts (#Microposts2015): Big things come in small packages, Florence, Italy, 18th of May 2015, 2015.
[6] N. Voskarides, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent contextualization of streaming data. In ECIR 2014. Springer, 2014.