UniMiB: Entity Linking in Tweets using Jaro-Winkler Distance, Popularity and Coherence

Davide Caliano, Elisabetta Fersini, Pikakshi Manchanda, Matteo Palmonari, Enza Messina
Università degli Studi di Milano-Bicocca, Italy
d.caliano@campus.unimib.it, fersini@disco.unimib.it, pikakshi.manchanda@disco.unimib.it, palmonari@disco.unimib.it, messina@disco.unimib.it

ABSTRACT
This paper summarizes the participation of the UNIMIB team in the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. We propose a knowledge-base approach for identifying and linking named entities in tweets. The identified entities are further classified using evidence provided by our entity linking algorithm and type-cast into the Microposts categories.

Keywords
Knowledge base; Named entity recognition; Named entity linking

1. INTRODUCTION
Microblogging platforms such as Twitter have become a rich source of real-time information. Today, information is readily extracted from such platforms in the form of named entities, relations and events. The tasks of this challenge comprise the identification and classification of named entities in a set of tweets, and the linking of the identified entities to corresponding KB resources if a match is found, or to a NIL reference if no candidate resource can be retrieved [5].

In order to identify named entities, we use a pre-trained, state-of-the-art Named Entity Recognition (NER) system [4]. Using this system, we tokenize and segment the tweets to distinguish entities from non-entities. Our linking algorithm then follows a greedy approach that disambiguates and links all identified entities to DBpedia resources. Finally, the entities are classified using evidence from the linking phase.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2016 Workshop proceedings (6th Workshop on Making Sense of Microposts, @WWW2016), available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235.

2. METHODOLOGY

2.1 Named Entity Identification
For the task of identifying named entities, we use a state-of-the-art NER system, T-NER [4], a supervised model based on Conditional Random Fields (CRF) and pre-trained on a gold standard of tweets [4]. Given a tweet t as input, the CRF model of T-NER identifies the candidate entities e_1, e_2, ..., e_n in t; in other words, it segments the tweet into entities and non-entities.

Before running T-NER, we remove the special characters (@, #, ...) as a pre-processing step and process the tweets in UTF-8 format in order to deal with emoticons. T-NER is not trained to recognize @usernames as entities, and the current version of our system does not resolve username references. This has a significant impact on the overall performance of our system.

2.2 Candidate Resource Selection & Ranking
For the task of selecting a candidate resource for an entity, we use DBpedia (http://wiki.dbpedia.org/) as our KB. We first perform a pre-processing step: every identified entity containing a segment that begins with a capital letter is split into tokens at the capital letters. For instance, the entity mention 'StarWars' is treated as 'Star Wars' during the candidate retrieval phase, so as to obtain better candidate matches. To this end, we extract the titles of all Wikipedia articles (http://dbpedia.org/Downloads2015-04) from DBpedia using rdfs:label and index them with Lucene (http://lucene.apache.org/). For each identified entity, the top-k candidate KB resources are retrieved using a high-recall approach; here we set k = 500. We estimate a knowledge-base score KB(e_j, c_k) for each candidate resource c_k of an entity e_j as follows:

    KB(e_j, c_k) = α · lex(e_j, l_ck) + (1 − α) · cos_k(e*_j, a_ck) + R(c_k)    (1)

where:

• lex(e_j, l_ck) denotes the lexical similarity between an entity e_j and the label l_ck of a candidate resource;
• cos_k(e*_j, a_ck) represents a discounted cosine similarity between an entity context e*_j and a candidate KB abstract description a_ck;
• R(c_k) is a popularity measure of a given candidate in the KB.

An entity context e*_j is modelled as a vector composed of an identified entity e_j in a tweet t_i and the words in the tweet tagged as noun, verb or adjective.

More formally, lex(e_j, l_ck) is defined as follows:

    lex(e_j, l_ck) = lcs(e_j, l_ck) + (WD · JW(e_j, l_ck)) / (WD + 1)    (2)

where lcs(e_j, l_ck) denotes a normalized Lucene Conceptual Score (https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) between e_j and l_ck, while the term (WD · JW(e_j, l_ck)) / (WD + 1) is a string distance measure based on the well-known Jaro-Winkler distance between an entity and the label of a candidate resource. The coefficient WD is set to 3.0 and acts as a boosting coefficient that gives more weight to syntactically close matches.
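The score combination of equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative fragment with our own naming, not the authors' code: the component scores (normalized Lucene Conceptual Score, Jaro-Winkler similarity, discounted cosine, popularity) are assumed to be computed elsewhere and passed in as floats, and the default α = 0.7 is the optimum reported later in this section.

```python
# Illustrative sketch of equations (1) and (2); names are ours, and the
# component scores are assumed to be precomputed elsewhere.

WD = 3.0  # boosting coefficient for syntactically close matches

def lex_score(lcs: float, jw: float) -> float:
    """Equation (2): normalized Lucene score plus a boosted Jaro-Winkler term."""
    return lcs + (WD * jw) / (WD + 1)

def kb_score(lcs: float, jw: float, cos_k: float, r: float,
             alpha: float = 0.7) -> float:
    """Equation (1): alpha-weighted mix of lexical and contextual
    similarity, plus the candidate's popularity R(c_k)."""
    return alpha * lex_score(lcs, jw) + (1 - alpha) * cos_k + r
```

Candidates would then simply be sorted by `kb_score` to obtain the final ranking.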
The asymmetric Jaro-Winkler distance gives more weight to edit differences occurring in the first subsequences of two strings, and is defined as:

    JW(e_j, l_ck) = Jaro(e_j, l_ck) + (P'/10) · (1 − Jaro(e_j, l_ck))    (3)

where Jaro is a similarity metric [2] and P' is a measure that takes into account the length of the longest common prefix of e_j and l_ck. Moreover, when a candidate label l_ck is composed of more than one token, we calculate JW(e_j, l_ck) as follows:

    JW(e_j, l_ck) = max(JW(e_j, P_1^{l_ck}), ..., JW(e_j, P_n^{l_ck}))    (4)

where P_i^{l_ck} denotes one of the possible permutations of the tokens in l_ck. This step is undertaken because users may refer to an entity in a tweet using a concise, more popular substring of the entity, which is not necessarily its first token. For instance, in the tweet

    @steph93065 shes hates me but she's no bigot, intelligent and correct most of the time. #Trump

we observe that the candidate KB resources for the entity mention 'Trump' comprise Trump (card game, rdf:type Thing), Donald Trump (rdf:type Person) and Trump (comics) (rdf:type CartoonCharacter), amongst other resources. Using equation (4), we compute the JW distance for 'Trump' not only with 'Donald Trump', which yields a low JW similarity, but also with 'Trump', which yields a high one.

To evaluate the second component cos_k(e*_j, a_ck) of the KB score in equation (1), we have indexed the extended abstracts of all DBpedia resources. The objective is to disambiguate an entity with a candidate label using the entity's usage context in the tweet, on the one hand, and contextual evidence from the KB, on the other. The measure cos_k(e*_j, a_ck), which denotes the contextual similarity between an entity e_j and a KB candidate resource c_k, is defined as:

    cos_k(e*_j, a_ck) = cos(e*_j, a_ck)                if k = 1
    cos_k(e*_j, a_ck) = cos(e*_j, a_ck) / log_2(k)     if k ≥ 2    (5)

where cos(e*_j, a_ck) denotes the cosine similarity between an entity context e*_j and a candidate KB abstract description a_ck. To compute equation (5), we retrieve the abstracts of all the top-k candidate resources c_1, c_2, ..., c_k from DBpedia. Equation (5) allows us to scale the similarity of each candidate abstract according to its ranking position.

Finally, the last contribution in equation (1) is provided by R(c_k), which takes into account the popularity of a given candidate in the KB for the final ranking. To this purpose, we compute the popularity R(c_k) of a KB resource c_k using the following boosted PageRank coefficient:

    R(c_k) = β · PR(c_k)    (6)

where PR(c_k) is the normalized PageRank coefficient [6] and β is a damping coefficient lying in the range [0,1], experimentally determined as 0.6.

In order to determine the optimal configuration of our system, the parameters have been evaluated experimentally. The top-k candidates are ranked by their score KB(e_j, c_k) from equation (1). The value of α in equation (1) has been varied in the range [0,1], and α = 0.7 results as the best configuration.

2.3 Entity Linking and Type Classification
We follow an unsupervised, greedy approach to link an entity to a DBpedia resource: every identified entity is linked to the candidate resource with the highest score according to equation (1). Entities for which no candidate matches are retrieved from the index are mapped to a NIL reference with the assigned type Thing. The entities are further classified via the relation rdf:type with the help of the DBpedia Ontology (http://mappings.dbpedia.org/server/ontology/classes/). For this purpose, we indexed the mapping-based types dataset of DBpedia classes (http://dbpedia.org/Downloads2015-04).

Moreover, we established a mapping between the DBpedia Ontology and the Microposts categories (Thing, Person, Location, Organization, Event, Character and Product) by following the description of the Microposts categories [5] provided by the challenge organizers. Every DBpedia Ontology class that cannot be mapped intuitively following this description, such as the class Species, is mapped to the Microposts category Thing. We adopted only one exception to this rule: the DBpedia Ontology class Name, with its subclasses GivenName and Surname, is mapped to the Microposts category Person. GivenNames and Surnames are used in tweets mostly to refer to a person in the real world, i.e., they are mentions of entities that would be classified under the Microposts category Person. This interpretation of the mapping for names and surnames is inspired by previous work on mapping semantics [1].
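The permutation-based Jaro-Winkler matching of equations (3) and (4) can be sketched as follows. This is an illustrative re-implementation, not the paper's code; the Winkler prefix weight p = 0.1 (i.e., P'/10) follows equation (3), while the usual prefix cap of 4 characters is an assumption.

```python
import itertools

def jaro(s: str, t: str) -> float:
    """Jaro similarity [2]: matching characters within a sliding window,
    penalized by transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [s[i] for i in range(len(s)) if s_hit[i]]
    t_m = [t[j] for j in range(len(t)) if t_hit[j]]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Equation (3): boost Jaro by the length P' of the common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def jw_multi_token(entity: str, label: str) -> float:
    """Equation (4): max JW over all permutations of the label's tokens."""
    return max(jaro_winkler(entity, " ".join(perm))
               for perm in itertools.permutations(label.split()))
```

With this implementation, `jw_multi_token("Trump", "Donald Trump")` scores the permutation 'Trump Donald' highly, whereas the unpermuted label alone yields 0, mirroring the 'Trump' example above.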
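The rank discount of equation (5) and the popularity boost of equation (6) amount to a few lines. Again an illustrative sketch under our naming: cosine similarities and the normalized PageRank coefficient are assumed to be precomputed elsewhere.

```python
import math

BETA = 0.6  # damping coefficient, experimentally determined by the authors

def discounted_cosine(cos_value: float, k: int) -> float:
    """Equation (5): the top-ranked candidate keeps its full cosine
    similarity; lower-ranked candidates are discounted by log2(k)."""
    return cos_value if k == 1 else cos_value / math.log2(k)

def popularity(pagerank: float) -> float:
    """Equation (6): boosted, normalized PageRank R(c_k)."""
    return BETA * pagerank
```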
2.4 Entity Boundary Re-Scoping
We performed an additional post-processing step in which an identified entity's boundary is re-scoped based on the label of the resource linked to the entity in the previous phase. We apply this step when the resource label is a substring of the entity mention. In this way, we are able to filter out noisy tokens in entities identified in the first step by the entity recognition system. For instance, in the tweet

    Day 9: Wearing a StarWars T-Shirt each day until 'The Force Awakens'. We're half way there! https://t.co/QoAOxoSCJk

the entity recognition system identifies 'StarWars T-Shirt' as an entity, due to a capitalization issue; however, our linking algorithm links this entity correctly to the KB resource Star Wars, based on contextual and KB evidence. As a result, we re-scope the boundary of the identified entity from 'StarWars T-Shirt' to 'StarWars' to improve the identification performance of the system. We evaluate our system under two configurations, viz. without entity boundary re-scoping and with entity boundary re-scoping, as reported in Section 3 below.
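A possible realization of the re-scoping step is sketched below. The matching rule used here (compare token spans after removing spaces and lowercasing) is our assumption, not necessarily the authors' exact procedure.

```python
# Illustrative sketch of entity boundary re-scoping (Section 2.4).

def rescope(mention: str, resource_label: str) -> str:
    """Trim a mention to the token span matching the linked resource
    label, e.g. 'StarWars T-Shirt' + label 'Star Wars' -> 'StarWars'."""
    def squeeze(s: str) -> str:
        return s.replace(" ", "").lower()

    target = squeeze(resource_label)
    tokens = mention.split()
    for size in range(1, len(tokens) + 1):          # shortest span first
        for start in range(len(tokens) - size + 1):
            span = tokens[start:start + size]
            if squeeze("".join(span)) == target:
                return " ".join(span)
    return mention  # label not found as a substring: keep original boundary
```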
3. RESULTS
We use the training and dev datasets to test the performance of the pre-trained NER system (supervised approach), and use the identified entities to test the performance of our linking algorithm (unsupervised approach). The training and dev gold standards consist of ≈6000 and 100 tweets, annotated with a total of 8665 and 338 entities, respectively.

Table 1: Performance: Entity Linking and Classification
Dataset   Configuration        SLM    STMM   Mention Ceaf
Training  without Re-scoping   0.327  0.297  0.380
Training  with Re-scoping      0.336  0.300  0.378
Dev       without Re-scoping   0.194  0.139  0.237
Dev       with Re-scoping      0.221  0.134  0.250

Table 1 shows the performance of our entity linking and classification approach for Strong Link Match (SLM), Strong Typed Mention Match (STMM) and Mention Ceaf. As evident, the performance of the linking approach (SLM) improves when entity boundary re-scoping is applied, for both datasets. The overall low performance of the entity linking system can be attributed to the poor performance of the entity recognition system, illustrated in Table 2. The performance of the type classification approach (STMM) also improves for the training dataset with entity re-scoping; however, the improvement is not significant.

Table 2: Performance: Entity Recognition
Dataset   Configuration        Precision  Recall  F1 Measure
Training  without Re-scoping   0.627      0.362   0.459
Training  with Re-scoping      0.625      0.347   0.446
Dev       without Re-scoping   0.514      0.166   0.251
Dev       with Re-scoping      0.545      0.178   0.268

As shown in Table 2, good precision values are obtained on both datasets; however, recall as well as F1 scores on the dev dataset are poor. A possible reason is the presence of many #hashtags and @usernames recognized as entities in the ground truth, which leads to poor performance of the entity recognition system even when @ and # are removed. An important observation is that applying entity boundary re-scoping lowers precision and recall on the training dataset, whereas the opposite holds on the dev dataset. This can again be attributed to the many #hashtags and @usernames in the dev dataset, owing to which the entity recognition system exhibits entity segmentation errors.

Table 3: NER Oracle: Entity Linking Performance
Dataset    Precision  Recall  F1 Measure
Training*  0.524      0.459   0.489
Dev*       0.452      0.387   0.417

Finally, Table 3 summarizes the performance of our entity linking algorithm in terms of precision, recall and F1 scores assuming a NER oracle. To this end, we use modified versions of the Training and Dev gold standards, denoted Training* and Dev*, which comprise linkable entities only, i.e., they are void of NIL mentions. They are annotated with 6371 and 253 linkable entities, respectively. Our linking approach correctly links ≈50% of the entities in this modified ground truth. When a NER oracle is used, the performance of the system falls under entity boundary re-scoping; hence, we report results without re-scoping for the Training* and Dev* datasets. For the test set evaluation, we provide two runs of our system on the test dataset, one for each configuration.

In previous work we defined a more sophisticated entity classification method, which combines evidence from the LabeledLDA component of T-NER and from the types of candidate entities [3]. In this challenge we could not apply this method due to problems integrating the LabeledLDA component into our current pipeline, but we plan to use it again in the near future.

4. REFERENCES
[1] M. Atencia, A. Borgida, J. Euzenat, C. Ghidini, and L. Serafini. A formal semantics for weighted ontology mappings. In The Semantic Web – ISWC 2012, pages 17–33. Springer, 2012.
[2] M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498, 1995.
[3] P. Manchanda, E. Fersini, and M. Palmonari. Leveraging entity linking to enhance entity recognition in microblogs. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pages 147–155, 2015.
[4] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.
[5] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors, 6th Workshop on Making Sense of Microposts (#Microposts2016), pages 50–59, 2016.
[6] A. Thalhammer and A. Rettinger. Browsing DBpedia entities with summaries. In The Semantic Web: ESWC 2014 Satellite Events, pages 511–515. Springer, 2014.