Combining Multiple Signals for Semanticizing Tweets: University of Amsterdam at #Microposts2015

Cristina Gârbacea, Daan Odijk, David Graus, Isaac Sijaranamual, Maarten de Rijke
University of Amsterdam, Science Park 904, Amsterdam, The Netherlands
{G.C.Garbacea, D.Odijk, D.P.Graus, I.B.Sijaranamual, deRijke}@uva.nl

ABSTRACT
In this paper we present an approach for extracting and linking entities from short and noisy microblog posts. We describe a diverse set of approaches based on the Semanticizer, an open-source entity linking framework developed at the University of Amsterdam, adapted to the task of the #Microposts2015 challenge. We consider alternatives for dealing with ambiguity that can help in the named entity extraction and linking processes. We retrieve entity candidates from multiple sources and process them in a four-step pipeline. Results show that we correctly identify entity mentions (our best run attains an F1 score of 0.809 in terms of the strong mention match metric), but subsequent steps prove more challenging for our approach.

Keywords
Named entity extraction; Named entity linking; Social media

1. INTRODUCTION
This paper describes our participation in the named entity extraction and linking challenge at #Microposts2015. Information extraction from microblog posts is an emerging research area that poses a series of problems to the natural language processing community due to the shortness, informality and noisy lexical nature of the content. Extracting entities from tweets is a complex process, typically performed in a sequential fashion. As a first step, named entity recognition (NER) aims to detect mentions that refer to entities, e.g., names of people, locations, organizations or products (also known as entity detection), and subsequently to classify the mentions into predefined categories (entity typing). After NER, named entity linking (NEL) is performed: the identified mentions are linked to entries in a knowledge base (KB). Due to its richness in semantic content and coverage, Wikipedia is a commonly used KB for linking mentions to entities, or for deciding that a mention refers to an entity that is not in the KB, in which case it is referenced by a NIL identifier. DBpedia extracts structured information from Wikipedia and combines it into a large, cross-domain knowledge graph that provides explicit structure between concepts and the relations among them.

Our participation in this challenge revolves around existing open-source entity linking software developed at the University of Amsterdam. We use Semanticizer (https://github.com/semanticize/semanticizer), a state-of-the-art entity linking framework. So far, Semanticizer has been successfully employed for linking entities in search engine queries [1] and for linking entities in short documents in streaming scenarios [6]. Moreover, it has been extended to deal with additional types of data such as television subtitles [3]. In what follows we explain how we use Semanticizer for the task at hand and describe each of our submitted runs to the competition.

2. SYSTEM ARCHITECTURE
Our system processes each incoming tweet in four stages: mention detection, entity disambiguation and typing, NIL identification and clustering, and overlap resolution. We explain each stage in turn.

Mention detection: The first step aims to identify all entity mentions in the input text and is oriented towards high recall. We take the union of the output of two mention identification methods:

Semanticizer: the state-of-the-art system performs lexical matching of entities' surface forms. These surface forms are derived from the KB and comprise anchor texts that refer to Wikipedia pages, disambiguation and redirect pages, and page titles, as described in Table 1. For this, we use two instances of Semanticizer, running on two Wikipedia dumps: one dated May 2014 (the version used to build DBpedia 3.9), and a more recent one, dated February 2015. We perform three separate preprocessing steps on the tweet text and send the result of each step to the Semanticizer. These steps produce: i) the raw text, ii) the cleaned text (replacing @-mentions with the corresponding Twitter account names, and splitting hashtags using dynamic programming), and iii) the normalized text (e.g., case-folding and removing diacritics).

NER: For identifying entity mentions that do not exist in Wikipedia, i.e., out-of-KB entities, we employ a state-of-the-art named entity recognizer, previously applied to finding mentions of emerging entities on Twitter [2]. We train five different NER models: three using the ground truth data from the Microposts challenges (2013 through 2015), one using pseudo-ground truth (generated by linking tweets as in [2]), and one trained on all data.

Given the candidate mentions identified by NER and Semanticizer, we include a binary feature that expresses whether a mention has been detected by both systems. For each mention we end up with the set of features described in Table 1, which we use to train a Random Forest classifier (using 100 trees and rebalancing the classes per tweet by modifying instance weights) to predict whether a candidate mention is an entity mention, i.e., actually refers to an entity.

Table 1: Features used for mention detection.

Feature            Description
linkOccCount       no. of times mention appears as anchor text on Wikipedia
linkDocCount       no. of docs in which mention appears as anchor text
occCount           no. of times the mention appears on Wikipedia
senseOccCount      no. of times the mention is anchor to Wikipedia title
senseDocCount      no. of docs the mention is anchor to Wikipedia title
priorProbability   % of docs where anchor links to target Wikipedia title
linkProbability    % of docs where mention is anchor for a Wikipedia link
senseProbability   % of docs where mention links to target Wikipedia article
isCommon           the mention is found by both NER and Semanticizer
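To make the final step of this stage concrete, the sketch below shows one way such a mention classifier could be set up with scikit-learn. The toy feature rows, variable names, and the specific per-tweet weighting scheme are our own illustrative assumptions and not a description of the exact implementation used in our system.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: one row of Table 1 features per candidate mention (isCommon last),
# a binary label (1 = true entity mention), and the id of the originating tweet.
X = np.array([[12,  8, 40, 10,  6, 0.80, 0.20, 0.70, 1],
              [ 3,  2, 90,  1,  1, 0.10, 0.05, 0.10, 0],
              [25, 20, 60, 18, 15, 0.90, 0.30, 0.85, 1],
              [ 1,  1, 15,  0,  0, 0.02, 0.01, 0.02, 0]], dtype=float)
y = np.array([1, 0, 1, 0])
tweet_ids = np.array([101, 101, 102, 102])

def per_tweet_weights(y, tweet_ids):
    # Rebalance classes per tweet: within each tweet the positive and negative
    # candidates receive equal total weight, so frequent negatives do not dominate.
    weights = np.ones(len(y), dtype=float)
    for t in np.unique(tweet_ids):
        idx = np.where(tweet_ids == t)[0]
        for label in np.unique(y[idx]):
            members = idx[y[idx] == label]
            weights[members] = 1.0 / len(members)
    return weights

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y, sample_weight=per_tweet_weights(y, tweet_ids))
print(clf.predict(X))  # 1 = keep the candidate as an entity mention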
Entity disambiguation and typing: Given the entity mentions from the previous stage, the next step is to identify the referenced entities. We retrieve the full list of candidate entities, extract features, and cast the disambiguation step of identifying the correct entity for a mention as a learning to rank problem. Next to the features in Table 1, we use additional full-text search features. We index Wikipedia using ElasticSearch (ES) and issue the tweet as a query to obtain retrieval scores for the candidate entities. We also retrieve the 10 most similar entities for each candidate, using a more-like-this query. Finally, we incorporate Wikipedia page view statistics (https://dumps.wikimedia.org/other/pagecounts-raw/) from April 2014 as features. We use these features to train a RankSVM model to rank the entity candidates for each mention, and take the top-ranked candidate as the entity to link. We map the entity to its DBpedia URI and determine its type through a manual mapping of DBpedia classes to the #Microposts2015 taxonomy.
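The learning-to-rank step can be illustrated with the standard pairwise reduction of RankSVM: within each mention's candidate list, difference vectors between the gold candidate and its competitors train a linear SVM whose weight vector then scores unseen candidates. The sketch below, including the scikit-learn shortcut, the toy data, and all variable names, is our own illustration; it is not the RankSVM implementation used for our runs.

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, mention_ids):
    # Build pairwise training examples per mention: (gold - other) labelled +1
    # and (other - gold) labelled -1, the usual RankSVM reduction.
    diffs, labels = [], []
    for m in np.unique(mention_ids):
        idx = np.where(mention_ids == m)[0]
        gold = idx[y[idx] == 1]
        rest = idx[y[idx] == 0]
        for g in gold:
            for r in rest:
                diffs.append(X[g] - X[r]); labels.append(1)
                diffs.append(X[r] - X[g]); labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy data: three candidates for one mention and two for another (label 1 = gold entity).
X = np.array([[0.9, 12.0, 0.8], [0.2, 3.0, 0.1], [0.4, 5.0, 0.3],
              [0.7,  9.0, 0.6], [0.3, 2.0, 0.2]], dtype=float)
y = np.array([1, 0, 0, 1, 0])
mention_ids = np.array([0, 0, 0, 1, 1])

X_pairs, y_pairs = pairwise_transform(X, y, mention_ids)
ranker = LinearSVC(C=1.0).fit(X_pairs, y_pairs)

def best_candidate(X_cand):
    # Score the candidates for one mention and return the index of the top-ranked one.
    return int(np.argmax(X_cand @ ranker.coef_.ravel()))

print(best_candidate(X[mention_ids == 0]))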
NIL identification and clustering: To decide whether the top-ranked entity is correct, or whether the mention refers to an out-of-KB entity, we compute meta-features based on the scores of the RankSVM model. We use these meta-features to train a Random Forest classifier for NIL detection. We cluster NILs by linking identical mentions to a single NIL identifier based on their surface forms.

Overlap resolution: Finally, we resolve all overlapping mentions output by the mention identification step. We create a graph of all non-overlapping mentions and assign each mention its link score (non-linked mentions get a fixed score). We then find the highest-scoring path through the graph using dynamic programming, and return the mentions on this path as our resolved list of mentions.
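Selecting the highest-scoring path through the graph of non-overlapping mentions amounts to weighted interval scheduling over character spans, which the following sketch implements. The function and parameter names, and the particular fixed score for non-linked mentions, are illustrative assumptions rather than the authors' code.

from bisect import bisect_right

def resolve_overlaps(mentions, default_score=0.1):
    # mentions: list of (start, end, score) character spans; score is None for
    # non-linked mentions, which receive a fixed default score.
    spans = sorted(((s, e, sc if sc is not None else default_score)
                    for s, e, sc in mentions), key=lambda m: m[1])
    ends = [e for _, e, _ in spans]
    # best[i] = (total score, chosen spans) considering only the first i spans
    best = [(0.0, [])]
    for i, (s, e, sc) in enumerate(spans):
        j = bisect_right(ends, s, 0, i)  # latest compatible span ends at or before s
        take_score = best[j][0] + sc
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(s, e, sc)]))
        else:
            best.append(best[i])
    return best[-1][1]

# "Obama" (0-5) overlaps "Obama administration" (0-20); the higher-scoring span wins,
# and the non-overlapping, non-linked mention at (25-31) is kept with the default score.
print(resolve_overlaps([(0, 5, 0.4), (0, 20, 0.9), (25, 31, None)]))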
Our submitted runs rely on this scheme and variations thereof; see Table 2 for an overview. We hypothesize that the Semanticizer will yield high entity recall but low precision, and that filtering the resulting candidates by senseProbability will increase precision. We expect the NER runs to be superior to Semanticizer-only or ES-only runs. Finally, we believe that combining the NER and Semanticizer outputs with additional candidates returned by ES will outperform all our other runs.

Table 2: Description of our runs.

RunID   NER   Semanticizer  Disambiguation    Filter
Run 1   2015  -             -                 -
Run 2   2015  -             full-text search  -
Run 3   2015  -             full-text search  NIL
Run 4   all   -             full-text search  -
Run 5   -     2014          senseProbability  -
Run 6   Same as Run 5 without overlap resolution.
Run 7   all   all           full-text search  NIL
Run 8   2015  all           RankSVM           NIL
Run 9   all   all           RankSVM           NIL
Run 10  Same as Run 9 with a lower mention detection threshold.

3. RESULTS
We evaluate our approach on the dev set of 500 tweets made available by the organizers [4, 5]. In Table 3 we report the official metrics for entity detection, tagging, clustering and linking. Our best performing runs in terms of mention detection and typing (Run 1 and Run 2) rely mainly on NER and ES features. Even though the Semanticizer detects candidates with high recall, our analysis indicates that most errors occur when the system fails to recognize mentions correctly, which negatively impacts the linking scores. Since each step in the pipeline relies on the output of the previous step, cascading errors influence our results, and we believe a more in-depth error analysis of each stage is desirable. Despite its simplicity, our clustering approach performs reasonably well.

Table 3: F1 scores on the dev set for strong mention match (SMM), strong typed mention match (STMM), strong link match (SLM), and mention ceaf (MC) metrics.

RunID   SMM    STMM   SLM    MC
Run 1   0.809  0.456  0.164  0.715
Run 2   0.809  0.460  0.330  0.731
Run 3   0.809  0.455  0.291  0.730
Run 4   0.554  0.311  0.213  0.497
Run 5   0.411  0.288  0.280  0.374
Run 6   0.620  0.389  0.280  0.567
Run 7   0.533  0.330  0.210  0.486
Run 8   0.732  0.418  0.334  0.633
Run 9   0.577  0.365  0.247  0.525
Run 10  0.566  0.355  0.280  0.515

4. CONCLUSION
We have presented a system that performs entity mention detection, disambiguation and clustering on short and noisy text by drawing candidates from multiple sources and combining them. We observe that our simple NER and ES runs perform better than our more complex runs. We believe that more robust methods are needed to deal with the errors introduced at each step of the pipeline. For future work we plan to improve mention detection with additional Semanticizer features.

Acknowledgements. This research was partially supported by the Netherlands Organisation for Scientific Research (NWO) under project numbers 727.011.005 (SEED) and 640.006.013 (DADAISM), by Amsterdam Data Science, and by the Dutch national program COMMIT.

REFERENCES
[1] D. Graus, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Semanticizing search engine queries: the University of Amsterdam at the ERD 2014 challenge. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, 2014.
[2] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke. Generating pseudo-ground truth for predicting new concepts in social streams. In ECIR 2014. Springer, 2014.
[3] D. Odijk, E. Meij, and M. de Rijke. Feeding the second screen: Semantic linking based on subtitles. In OAIR 2013, 2013.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In Rowe et al. [5], pages 44–53.
[5] M. Rowe, M. Stankovic, and A.-S. Dadzie, editors. Proceedings, 5th Workshop on Making Sense of Microposts (#Microposts2015): Big things come in small packages, Florence, Italy, 18th of May 2015, 2015.
[6] N. Voskarides, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent contextualization of streaming data. In ECIR 2014. Springer, 2014.