UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets

Pierpaolo Basile, Annalina Caputo, Giovanni Semeraro, Fedelucio Narducci
University of Bari Aldo Moro
pierpaolo.basile@uniba.it, annalina.caputo@uniba.it, giovanni.semeraro@uniba.it, fedelucio.narducci@uniba.it

ABSTRACT
This paper describes the participation of the UNIBA team in the Named Entity rEcognition and Linking (NEEL) Challenge. We propose a knowledge-based algorithm able to recognize and link named entities in English tweets. The approach combines the simple Lesk algorithm with information coming from both a distributional semantic model and the usage frequency of Wikipedia concepts. The algorithm performs poorly in the entity recognition step, while it achieves good results in the disambiguation step.

Keywords
Named Entity Linking, Distributional Semantic Models, Lesk Algorithm

1. INTRODUCTION
In this paper we describe our participation in the Named Entity rEcognition and Linking (NEEL) Challenge [4]. The task is composed of three steps: 1) identify entities in a tweet; 2) link entities to the appropriate concepts in DBpedia (an entity can belong to several concepts); 3) cluster entities that belong to specific classes (entity types) defined by the organizers.

We propose two approaches that share the same methodology to disambiguate entities, while differing in the approach used to recognize entities in the tweet. We implement two algorithms for entity detection. The former (UNIBAsup) exploits PoS-tag information to detect a list of candidate entities, while the latter (UNIBAunsup) tries to find sequences of tokens (n-grams) that are titles of Wikipedia pages or surface forms which refer to Wikipedia pages.

The disambiguation and linking steps rely on a knowledge-based method that combines a Distributional Semantic Model (DSM) with the prior probability assigned to each DBpedia concept. A DSM represents words as points in a mathematical space; words that are close in this space are similar. The word space is built by analyzing word co-occurrences in a large corpus. Our algorithm disambiguates an entity by computing the similarity between the context and the glosses associated with all possible entity concepts. Such similarity is computed through the vector similarity in the DSM. Section 2 provides details about the adopted strategies for 1) Entity Recognition and 2) Linking. The experimental evaluation, along with a commentary on the results, is presented in Section 3.

2. THE METHODOLOGY
Our methodology is a two-step algorithm consisting of an initial identification of all possible entities mentioned in a tweet, followed by the linking (disambiguation) of those entities through the disambiguation algorithm. DBpedia is exploited twice in order to 1) extract all the possible surface forms related to entities, and 2) retrieve the glosses used in the disambiguation process. In this case we use as gloss the extended abstract assigned to each DBpedia concept.

2.1 Entity Recognition
In order to speed up the entity recognition step, we build an index where each surface form (entity) is paired with the set of all its possible DBpedia concepts. The index is built by exploiting the Lucene API (http://lucene.apache.org/): for each surface form (lexeme) occurring as the title of a DBpedia concept, a document composed of two fields is created. The first field stores the surface form, while the second one contains the list of all possible DBpedia concepts that refer to the surface form in the first field. We extend the list of possible surface forms using also the resource available at http://wifo5-04.informatik.uni-mannheim.de/downloads/datasets/. The entity recognition module exploits this index in order to find entities in a tweet.
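The following Python sketch illustrates the structure of this surface-form index with a plain in-memory dictionary standing in for the Lucene index; the helper names and the sample entries are illustrative assumptions, not the actual implementation.

# Illustrative stand-in for the surface-form index: each surface form
# is paired with the set of DBpedia concepts it may refer to.
from collections import defaultdict

surface_index = defaultdict(set)

def add_entry(surface_form, dbpedia_uri):
    # In the real system this is a two-field Lucene document; here a
    # dictionary keyed by the lower-cased surface form is enough.
    surface_index[surface_form.lower()].add(dbpedia_uri)

def lookup(candidate, top_k=25):
    # Return candidate concepts for a surface form (exact match only;
    # the actual index also returns partial matches ranked by Lucene).
    return list(surface_index.get(candidate.lower(), set()))[:top_k]

add_entry("Rome", "http://dbpedia.org/resource/Rome")
add_entry("Rome", "http://dbpedia.org/resource/Rome_(TV_series)")
print(lookup("rome"))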
Given a tweet, the module performs the following steps:

1) Tokenizing and PoS-tagging the tweet via Tweet NLP (http://www.ark.cs.cmu.edu/TweetNLP/).
2) Building a list of candidate entities. We exploit two approaches: all n-grams up to five words (UNIBAunsup); all sequences of tokens tagged as proper nouns by the PoS tagger (UNIBAsup).
3) Querying the index and retrieving the list of the top 25 matching surface forms for each candidate entity.
4) Scoring each surface form as the linear combination of: a) the score provided by the search engine; b) a string similarity function based on the Levenshtein Distance between the candidate entity and the surface form in the index; c) the Jaccard Index in terms of common words between the candidate entity and the surface form in the index.
5) Filtering the candidate entities recognized in the previous steps: entities are removed if the score computed in the previous step is below a given threshold, which we set to 0.85 in this scenario.

The output of the entity recognition module is a list of candidate entities in which a set of possible DBpedia concepts is assigned to each surface form in the list. The scoring and filtering steps are illustrated in the sketch below.
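A minimal Python sketch of the scoring and filtering steps, under the assumption of equal weights in the linear combination and an already normalised search engine score; the function names are hypothetical.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_similarity(a, b):
    # Levenshtein distance turned into a [0, 1] similarity.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def jaccard(a, b):
    # Word overlap between the candidate entity and the surface form.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def score_surface_form(candidate, surface_form, engine_score,
                       weights=(1/3, 1/3, 1/3)):
    # Linear combination of a) search engine score, b) string similarity,
    # c) Jaccard index.
    w1, w2, w3 = weights
    return (w1 * engine_score
            + w2 * string_similarity(candidate.lower(), surface_form.lower())
            + w3 * jaccard(candidate, surface_form))

THRESHOLD = 0.85

def filter_candidates(scored_candidates):
    # Keep only the candidates whose combined score reaches the threshold.
    return [(c, s) for c, s in scored_candidates if s >= THRESHOLD]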
2.2 Linking
We exploit an adaptation of the distributional Lesk algorithm proposed by Basile et al. [1] for disambiguating named entities. The algorithm replaces the concept of word overlap initially introduced by Lesk [2] with the broader concept of semantic similarity computed in a distributional semantic space. Let e1, e2, ..., en be the sequence of entities extracted from the tweet; the algorithm disambiguates each target entity ei by computing the semantic similarity between the glosses of the concepts associated with the target entity and its context. This similarity is computed by representing in a DSM both the gloss and the context as the sum of the words they are composed of; the similarity thus takes into account the co-occurrence evidence previously collected through a corpus of documents. The corpus plays a key role: the richer it is, the higher the probability that each word is fully represented in all its contexts of use. We exploit the word2vec tool (https://code.google.com/p/word2vec/) [3] in order to build a DSM by analyzing all the pages in the last English Wikipedia dump; we use 400 dimensions for the vectors and analyse only terms that occur at least 25 times. The correct concept for an entity is the one whose gloss maximizes the semantic similarity with the word/entity context. The algorithm consists of four steps:

1. Building the glosses. We retrieve the set Ci = {ci1, ci2, ..., cik} of DBpedia concepts associated to the entity ei. For each concept cij, the algorithm builds the gloss representation gij by retrieving the extended abstract from DBpedia.

2. Building the context. The context T for the entity ei is represented by all the words that occur in the tweet except for the surface form of the entity.

3. Building the vector representations. The context T and each gloss gij are represented as vectors (using the vector sum) in the DSM.

4. Sense ranking. The algorithm computes the cosine similarity between the vector representation of each extended gloss gij and that of the context T. Then, the cosine similarity is linearly combined with a function that takes into account the usage of the DBpedia concepts. We analyse a function that computes the probability assigned to each DBpedia concept given a candidate entity. The probability of a concept cij is computed as the number of times the entity ei is tagged with the concept cij in Wikipedia. Zero probabilities are avoided by introducing additive (Laplace) smoothing.

We exploit the rdf:type relation in DBpedia to map each DBpedia concept to the types defined in the task. In particular, we provide a manual mapping from all the types defined in the dbpedia-owl ontology to the respective types provided by the organizers. A sketch of the sense-ranking step is given below.
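The sense-ranking step can be sketched as follows with numpy; the mixing weight alpha, the data structures holding the word vectors and the Wikipedia tag counts, and the helper names are assumptions made for illustration only.

import numpy as np

def text_vector(words, word_vectors, dim=400):
    # A text (gloss or context) is the vector sum of the words it contains.
    v = np.zeros(dim)
    for w in words:
        if w in word_vectors:
            v += word_vectors[w]
    return v

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def prior(concept, entity, counts, n_candidates):
    # P(concept | entity) estimated from how often the entity is tagged
    # with the concept in Wikipedia, with add-one (Laplace) smoothing.
    c = counts.get((entity, concept), 0)
    total = sum(v for (e, _), v in counts.items() if e == entity)
    return (c + 1) / (total + n_candidates)

def rank_senses(entity, context_words, candidates, glosses,
                word_vectors, counts, alpha=0.5):
    # candidates: DBpedia concepts retrieved for `entity`;
    # glosses[c]: tokenised extended abstract of concept c.
    ctx = text_vector(context_words, word_vectors)
    scores = {}
    for c in candidates:
        sim = cosine(ctx, text_vector(glosses[c], word_vectors))
        scores[c] = alpha * sim + (1 - alpha) * prior(c, entity, counts, len(candidates))
    # The concept whose gloss best matches the context, weighted by its prior, wins.
    return max(scores, key=scores.get)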
3. EVALUATION AND RESULTS
This section reports the results of our system on the development set provided by the organizers. The dataset consists of 500 manually annotated tweets. Results are reported in Table 1. The first column shows the entity recognition strategy; the other columns report the F-measure of strong link match (SLM), strong typed mention match (STMM), and mention ceaf (MC), respectively. SLM measures the linking performance, while STMM takes into account both link and type. MC measures both recognition and classification.

Table 1: Results on the development set
ER Strategy    F-SLM    F-STMM    F-MC
UNIBAsup       0.362    0.267     0.389
UNIBAunsup     0.258    0.191     0.306

We cannot discuss the quality of the overall performance since we have no information about either the baseline or the other participants. However, we can observe that the recognition method based on PoS-tags obtains the best performance. We performed an additional evaluation in which we removed the entity recognition module and took the entities directly from the gold standard. The idea is to evaluate only the linking step. The results of this evaluation are very encouraging: we obtain an F-SLM of 0.563, while excluding the NIL instances we achieve a link match of 0.825. These results prove the effectiveness of the proposed disambiguation approach based on DSM.

Acknowledgments
This work fulfils the research objectives of the PON project EFFEDIL (PON 02 00323 2938699). The computational work has been executed on the IT resources made available by two PON projects financed by the MIUR: ReCaS (PONa3 00052) and PRISMA (PON04a2 A).

4. REFERENCES
[1] P. Basile, A. Caputo, and G. Semeraro. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. In Proc. of COLING 2014: Technical Papers, pages 1591-1600. ACL, August 2014.
[2] M. Lesk. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proc. of SIGDOC '86, pages 24-26. ACM, 1986.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In Proc. of ICLR Workshops, 2013.
[4] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44-53, 2015.