UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets

Pierpaolo Basile, Annalina Caputo, Giovanni Semeraro, Fedelucio Narducci
University of Bari Aldo Moro
pierpaolo.basile@uniba.it, annalina.caputo@uniba.it, giovanni.semeraro@uniba.it, fedelucio.narducci@uniba.it

ABSTRACT
This paper describes the participation of the UNIBA team in the Named Entity rEcognition and Linking (NEEL) Challenge. We propose a knowledge-based algorithm able to recognize and link named entities in English tweets. The approach combines the simple Lesk algorithm with information coming from both a distributional semantic model and the usage frequency of Wikipedia concepts. The algorithm performs poorly in the entity recognition step, while it achieves good results in the disambiguation step.

Keywords
Named Entity Linking, Distributional Semantic Models, Lesk Algorithm

1. INTRODUCTION
In this paper we describe our participation in the Named Entity rEcognition and Linking (NEEL) Challenge [4]. The task is composed of three steps: 1) identify entities in a tweet; 2) link entities to the appropriate concepts in DBpedia (an entity can belong to several concepts); 3) cluster entities that belong to specific classes (entity types) defined by the organizers.

We propose two approaches that share the same methodology to disambiguate entities, while differing in the approach used to recognize entities in the tweet. We implement two algorithms for entity detection. The former (UNIBAsup) exploits PoS-tag information to detect a list of candidate entities, while the latter (UNIBAunsup) tries to find sequences of tokens (n-grams) that are titles of Wikipedia pages or surface forms which refer to Wikipedia pages.

The disambiguation and linking steps rely on a knowledge-based method that combines a Distributional Semantic Model (DSM) with the prior probability assigned to each DBpedia concept. A DSM represents words as points in a mathematical space; words that are close in this space are similar. The word space is built by analyzing word co-occurrences in a large corpus. Our algorithm disambiguates an entity by computing the similarity between the context and the glosses associated with all possible entity concepts. Such similarity is computed through the vector similarity in the DSM. Section 2 provides details about the adopted strategies for 1) Entity Recognition and 2) Linking. The experimental evaluation, along with a commentary on the results, is presented in Section 3.

2. THE METHODOLOGY
Our methodology is a two-step algorithm consisting of an initial identification of all possible entities mentioned in a tweet, followed by the linking (disambiguation) of those entities through the disambiguation algorithm. DBpedia is exploited twice in order to 1) extract all the possible surface forms related to entities, and 2) retrieve the glosses used in the disambiguation process. In this case we use as gloss the extended abstract assigned to each DBpedia concept.

2.1 Entity Recognition
In order to speed up the entity recognition step, we build an index where each surface form (entity) is paired with the set of all its possible DBpedia concepts. The index is built by exploiting the Lucene API (http://lucene.apache.org/): for each surface form (lexeme) occurring as the title of a DBpedia concept, a document composed of two fields is created. The first field stores the surface form, while the second one contains the list of all possible DBpedia concepts that refer to the surface form in the first field. We extend the list of possible surface forms using also the resource available at http://wifo5-04.informatik.uni-mannheim.de/downloads/datasets/. The entity recognition module exploits this index in order to find entities in a tweet.
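The following Python sketch illustrates the structure of this surface-form index with a plain in-memory dictionary standing in for the Lucene index; the helper names and the sample entries are illustrative assumptions, not the actual implementation.

# Illustrative stand-in for the surface-form index: each surface form
# is paired with the set of DBpedia concepts it may refer to.
from collections import defaultdict

surface_index = defaultdict(set)

def add_entry(surface_form, dbpedia_uri):
    # In the real system this is a two-field Lucene document; here a
    # dictionary keyed by the lower-cased surface form is enough.
    surface_index[surface_form.lower()].add(dbpedia_uri)

def lookup(candidate, top_k=25):
    # Return candidate concepts for a surface form (exact match only;
    # the actual index also returns partial matches ranked by Lucene).
    return list(surface_index.get(candidate.lower(), set()))[:top_k]

add_entry("Rome", "http://dbpedia.org/resource/Rome")
add_entry("Rome", "http://dbpedia.org/resource/Rome_(TV_series)")
print(lookup("rome"))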
Given a tweet, the module performs the following steps:

1) Tokenizing and PoS-tagging the tweet via Tweet NLP (http://www.ark.cs.cmu.edu/TweetNLP/).
2) Building a list of candidate entities. We exploit two approaches: all n-grams up to five words (UNIBAunsup); all sequences of tokens tagged as proper nouns by the PoS tagger (UNIBAsup).
3) Querying the index and retrieving the list of the top 25 matching surface forms for each candidate entity.
4) Scoring each surface form as the linear combination of: a) the score provided by the search engine; b) a string similarity function based on the Levenshtein Distance between the candidate entity and the surface form in the index; c) the Jaccard Index in terms of common words between the candidate entity and the surface form in the index.
5) Filtering the candidate entities recognized in the previous steps: entities are removed if the score computed in the previous step is below a given threshold, which we set to 0.85 in this scenario.

The output of the entity recognition module is a list of candidate entities in which a set of possible DBpedia concepts is assigned to each surface form in the list. The scoring and filtering steps are illustrated in the sketch below.
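A minimal Python sketch of the scoring and filtering steps, under the assumption of equal weights in the linear combination and an already normalised search engine score; the function names are hypothetical.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_similarity(a, b):
    # Levenshtein distance turned into a [0, 1] similarity.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def jaccard(a, b):
    # Word overlap between the candidate entity and the surface form.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def score_surface_form(candidate, surface_form, engine_score,
                       weights=(1/3, 1/3, 1/3)):
    # Linear combination of a) search engine score, b) string similarity,
    # c) Jaccard index.
    w1, w2, w3 = weights
    return (w1 * engine_score
            + w2 * string_similarity(candidate.lower(), surface_form.lower())
            + w3 * jaccard(candidate, surface_form))

THRESHOLD = 0.85

def filter_candidates(scored_candidates):
    # Keep only the candidates whose combined score reaches the threshold.
    return [(c, s) for c, s in scored_candidates if s >= THRESHOLD]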
2.2 Linking
We exploit an adaptation of the distributional Lesk algorithm proposed by Basile et al. [1] for disambiguating named entities. The algorithm replaces the concept of word overlap initially introduced by Lesk [2] with the broader concept of semantic similarity computed in a distributional semantic space. Let e1, e2, ..., en be the sequence of entities extracted from the tweet; the algorithm disambiguates each target entity ei by computing the semantic similarity between the glosses of the concepts associated with the target entity and its context. This similarity is computed by representing in a DSM both the gloss and the context as the sum of the words they are composed of; the similarity thus takes into account the co-occurrence evidence previously collected through a corpus of documents. The corpus plays a key role: the richer it is, the higher the probability that each word is fully represented in all its contexts of use. We exploit the word2vec tool (https://code.google.com/p/word2vec/) [3] in order to build a DSM by analyzing all the pages in the last English Wikipedia dump; we use 400 dimensions for the vectors and analyse only terms that occur at least 25 times. The correct concept for an entity is the one whose gloss maximizes the semantic similarity with the word/entity context. The algorithm consists of four steps:

1. Building the glosses. We retrieve the set Ci = {ci1, ci2, ..., cik} of DBpedia concepts associated to the entity ei. For each concept cij, the algorithm builds the gloss representation gij by retrieving the extended abstract from DBpedia.

2. Building the context. The context T for the entity ei is represented by all the words that occur in the tweet except for the surface form of the entity.

3. Building the vector representations. The context T and each gloss gij are represented as vectors (using the vector sum) in the DSM.

4. Sense ranking. The algorithm computes the cosine similarity between the vector representation of each extended gloss gij and that of the context T. Then, the cosine similarity is linearly combined with a function that takes into account the usage of the DBpedia concepts. We analyse a function that computes the probability assigned to each DBpedia concept given a candidate entity. The probability of a concept cij is computed as the number of times the entity ei is tagged with the concept cij in Wikipedia. Zero probabilities are avoided by introducing additive (Laplace) smoothing.

We exploit the rdf:type relation in DBpedia to map each DBpedia concept to the types defined in the task. In particular, we provide a manual mapping from all the types defined in the dbpedia-owl ontology to the respective types provided by the organizers. A sketch of the sense-ranking step is given below.
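The sense-ranking step can be sketched as follows with numpy; the mixing weight alpha, the data structures holding the word vectors and the Wikipedia tag counts, and the helper names are assumptions made for illustration only.

import numpy as np

def text_vector(words, word_vectors, dim=400):
    # A text (gloss or context) is the vector sum of the words it contains.
    v = np.zeros(dim)
    for w in words:
        if w in word_vectors:
            v += word_vectors[w]
    return v

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def prior(concept, entity, counts, n_candidates):
    # P(concept | entity) estimated from how often the entity is tagged
    # with the concept in Wikipedia, with add-one (Laplace) smoothing.
    c = counts.get((entity, concept), 0)
    total = sum(v for (e, _), v in counts.items() if e == entity)
    return (c + 1) / (total + n_candidates)

def rank_senses(entity, context_words, candidates, glosses,
                word_vectors, counts, alpha=0.5):
    # candidates: DBpedia concepts retrieved for `entity`;
    # glosses[c]: tokenised extended abstract of concept c.
    ctx = text_vector(context_words, word_vectors)
    scores = {}
    for c in candidates:
        sim = cosine(ctx, text_vector(glosses[c], word_vectors))
        scores[c] = alpha * sim + (1 - alpha) * prior(c, entity, counts, len(candidates))
    # The concept whose gloss best matches the context, weighted by its prior, wins.
    return max(scores, key=scores.get)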
3. EVALUATION AND RESULTS
This section reports the results of our system on the development set provided by the organizers. The dataset consists of 500 manually annotated tweets. Results are reported in Table 1. The first column shows the entity recognition strategy; the other columns report the F-measure of strong link match (SLM), strong typed mention match (STMM), and mention ceaf (MC), respectively. SLM measures the linking performance, while STMM takes into account both link and type. MC measures both recognition and classification.

Table 1: Results on the development set
ER Strategy    F-SLM    F-STMM    F-MC
UNIBAsup       0.362    0.267     0.389
UNIBAunsup     0.258    0.191     0.306

We cannot discuss the quality of the overall performance since we have no information about either the baseline or the other participants. However, we can observe that the recognition method based on PoS-tags obtains the best performance. We performed an additional evaluation in which we removed the entity recognition module and took the entities directly from the gold standard. The idea is to evaluate only the linking step. The results of this evaluation are very encouraging: we obtain an F-SLM of 0.563, while excluding the NIL instances we achieve a link match of 0.825. These results prove the effectiveness of the proposed disambiguation approach based on DSM.

Acknowledgments
This work fulfils the research objectives of the PON project EFFEDIL (PON 02 00323 2938699). The computational work has been executed on the IT resources made available by two PON projects financed by the MIUR: ReCaS (PONa3 00052) and PRISMA (PON04a2 A).

4. REFERENCES
[1] P. Basile, A. Caputo, and G. Semeraro. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. In Proc. of COLING 2014: Technical Papers, pages 1591-1600. ACL, August 2014.
[2] M. Lesk. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proc. of SIGDOC '86, pages 24-26. ACM, 1986.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In Proc. of ICLR Workshops, 2013.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44-53, 2015.