UniMiB: Entity Linking in Tweets using Jaro-Winkler Distance, Popularity and Coherence

Davide Caliano, Elisabetta Fersini, Pikakshi Manchanda, Matteo Palmonari, Enza Messina
Università degli Studi di Milano-Bicocca, Italy
d.caliano@campus.unimib.it, fersini@disco.unimib.it, pikakshi.manchanda@disco.unimib.it, palmonari@disco.unimib.it, messina@disco.unimib.it

ABSTRACT
This paper summarizes the participation of the UNIMIB team in the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. We propose a knowledge-base approach for identifying and linking named entities in tweets. The identified entities are further classified using evidence provided by our entity linking algorithm and type-cast into the Microposts categories.

Keywords
Knowledge base; Named entity recognition; Named entity linking

1. INTRODUCTION
Microblogging platforms such as Twitter have become a rich source of real-time information. Today, information is readily extracted from such platforms in the form of named entities, relations and events. The tasks of this challenge comprise the identification and classification of named entities in a set of tweets, and the linking of the identified entities to corresponding KB resources if a match is found, or to a NIL reference if no candidate resource can be retrieved [5].

In order to identify named entities, we use a pre-trained, state-of-the-art Named Entity Recognition (NER) system [4]. Using this system, we tokenize and segment the tweets to distinguish entities from non-entities. Our linking algorithm then follows a greedy approach that disambiguates and links all identified entities to DBpedia resources. Finally, the entities are classified using evidence from the linking phase.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2016 Workshop proceedings (6th Workshop on Making Sense of Microposts, @WWW2016), available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235.

2. METHODOLOGY

2.1 Named Entity Identification
For the task of identifying named entities, we use a state-of-the-art NER system, T-NER [4], a supervised model based on Conditional Random Fields (CRF) and pre-trained on a gold standard of tweets [4]. Given a tweet t as input, the CRF model of T-NER identifies the candidate entities e_1, e_2, ..., e_n in t; in other words, it segments the tweet into entities and non-entities.

Before running T-NER, we remove the special characters (@, #, ...) as a pre-processing step and process the tweets in UTF-8 format in order to deal with emoticons. T-NER is not trained to recognize @usernames as entities, and the current version of our system does not resolve username references. This has a significant impact on the overall performance of our system.

2.2 Candidate Resource Selection & Ranking
For the task of selecting a candidate resource for an entity, we use DBpedia (http://wiki.dbpedia.org/) as our KB. We first perform a pre-processing step: every identified entity containing a segment that begins with a capital letter is split into tokens at the capital letters. For instance, the entity mention 'StarWars' is treated as 'Star Wars' during the candidate retrieval phase, so as to obtain better candidate matches. To this end, we extract the titles of all Wikipedia articles (http://dbpedia.org/Downloads2015-04) from DBpedia using rdfs:label and index them with Lucene (http://lucene.apache.org/). For each identified entity, the top-k candidate KB resources are retrieved using a high-recall approach; here we set k = 500. We estimate a knowledge-base score KB(e_j, c_k) for each candidate resource c_k of an entity e_j as follows:

    KB(e_j, c_k) = α · lex(e_j, l_ck) + (1 − α) · cos_k(e*_j, a_ck) + R(c_k)    (1)

where:

• lex(e_j, l_ck) denotes the lexical similarity between an entity e_j and the label l_ck of a candidate resource;
• cos_k(e*_j, a_ck) represents a discounted cosine similarity between an entity context e*_j and a candidate KB abstract description a_ck;
• R(c_k) is a popularity measure of a given candidate in the KB.

An entity context e*_j is modelled as a vector composed of an identified entity e_j in a tweet t_i and the words in the tweet tagged as noun, verb or adjective.

More formally, lex(e_j, l_ck) is defined as follows:

    lex(e_j, l_ck) = lcs(e_j, l_ck) + (WD · JW(e_j, l_ck)) / (WD + 1)    (2)

where lcs(e_j, l_ck) denotes a normalized Lucene Conceptual Score (https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) between e_j and l_ck, while the term (WD · JW(e_j, l_ck)) / (WD + 1) is a string distance measure based on the well-known Jaro-Winkler distance between an entity and the label of a candidate resource. The coefficient WD is set to 3.0 and acts as a boosting coefficient that gives more weight to syntactically close matches.
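The score combination of equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative fragment with our own naming, not the authors' code: the component scores (normalized Lucene Conceptual Score, Jaro-Winkler similarity, discounted cosine, popularity) are assumed to be computed elsewhere and passed in as floats, and the default α = 0.7 is the optimum reported later in this section.

```python
# Illustrative sketch of equations (1) and (2); names are ours, and the
# component scores are assumed to be precomputed elsewhere.

WD = 3.0  # boosting coefficient for syntactically close matches

def lex_score(lcs: float, jw: float) -> float:
    """Equation (2): normalized Lucene score plus a boosted Jaro-Winkler term."""
    return lcs + (WD * jw) / (WD + 1)

def kb_score(lcs: float, jw: float, cos_k: float, r: float,
             alpha: float = 0.7) -> float:
    """Equation (1): alpha-weighted mix of lexical and contextual
    similarity, plus the candidate's popularity R(c_k)."""
    return alpha * lex_score(lcs, jw) + (1 - alpha) * cos_k + r
```

Candidates would then simply be sorted by `kb_score` to obtain the final ranking.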
The asymmetric Jaro-Winkler distance gives more weight to edit differences occurring in the first subsequences of two strings, and is defined as:

    JW(e_j, l_ck) = Jaro(e_j, l_ck) + (P'/10) · (1 − Jaro(e_j, l_ck))    (3)

where Jaro is a similarity metric [2] and P' is a measure that takes into account the length of the longest common prefix of e_j and l_ck. Moreover, when a candidate label l_ck is composed of more than one token, we calculate JW(e_j, l_ck) as follows:

    JW(e_j, l_ck) = max(JW(e_j, P_1^{l_ck}), ..., JW(e_j, P_n^{l_ck}))    (4)

where P_i^{l_ck} denotes one of the possible permutations of the tokens in l_ck. This step is undertaken because users may refer to an entity in a tweet using a concise, more popular substring of the entity, which is not necessarily its first token. For instance, in the tweet

    @steph93065 shes hates me but she's no bigot, intelligent and correct most of the time. #Trump

we observe that the candidate KB resources for the entity mention 'Trump' comprise Trump (card game, rdf:type Thing), Donald Trump (rdf:type Person) and Trump (comics) (rdf:type CartoonCharacter), amongst other resources. Using equation (4), we compute the JW distance for 'Trump' not only with 'Donald Trump', which yields a low JW similarity, but also with 'Trump', which yields a high one.

To evaluate the second component cos_k(e*_j, a_ck) of the KB score in equation (1), we have indexed the extended abstracts of all DBpedia resources. The objective is to disambiguate an entity with a candidate label using the entity's usage context in the tweet, on the one hand, and contextual evidence from the KB, on the other. The measure cos_k(e*_j, a_ck), which denotes the contextual similarity between an entity e_j and a KB candidate resource c_k, is defined as:

    cos_k(e*_j, a_ck) = cos(e*_j, a_ck)                if k = 1
    cos_k(e*_j, a_ck) = cos(e*_j, a_ck) / log_2(k)     if k ≥ 2    (5)

where cos(e*_j, a_ck) denotes the cosine similarity between an entity context e*_j and a candidate KB abstract description a_ck. To compute equation (5), we retrieve the abstracts of all the top-k candidate resources c_1, c_2, ..., c_k from DBpedia. Equation (5) allows us to scale the similarity of each candidate abstract according to its ranking position.

Finally, the last contribution in equation (1) is provided by R(c_k), which takes into account the popularity of a given candidate in the KB for the final ranking. To this purpose, we compute the popularity R(c_k) of a KB resource c_k using the following boosted PageRank coefficient:

    R(c_k) = β · PR(c_k)    (6)

where PR(c_k) is the normalized PageRank coefficient [6] and β is a damping coefficient lying in the range [0,1], experimentally determined as 0.6.

In order to determine the optimal configuration of our system, the parameters have been evaluated experimentally. The top-k candidates are ranked by their score KB(e_j, c_k) from equation (1). The value of α in equation (1) has been varied in the range [0,1], and α = 0.7 results as the best configuration.

2.3 Entity Linking and Type Classification
We follow an unsupervised, greedy approach to link an entity to a DBpedia resource: every identified entity is linked to the candidate resource with the highest score according to equation (1). Entities for which no candidate matches are retrieved from the index are mapped to a NIL reference with the assigned type Thing. The entities are further classified via the relation rdf:type with the help of the DBpedia Ontology (http://mappings.dbpedia.org/server/ontology/classes/). For this purpose, we indexed the mapping-based types dataset of DBpedia classes (http://dbpedia.org/Downloads2015-04).

Moreover, we established a mapping between the DBpedia Ontology and the Microposts categories (Thing, Person, Location, Organization, Event, Character and Product) by following the description of the Microposts categories [5] provided by the challenge organizers. Every DBpedia Ontology class that cannot be mapped intuitively following this description, such as the class Species, is mapped to the Microposts category Thing. We adopted only one exception to this rule: the DBpedia Ontology class Name, with its subclasses GivenName and Surname, is mapped to the Microposts category Person. GivenNames and Surnames are used in tweets mostly to refer to a person in the real world, i.e., they are mentions of entities that would be classified under the Microposts category Person. This interpretation of the mapping for names and surnames is inspired by previous work on mapping semantics [1].
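The permutation-based Jaro-Winkler matching of equations (3) and (4) can be sketched as follows. This is an illustrative re-implementation, not the paper's code; the Winkler prefix weight p = 0.1 (i.e., P'/10) follows equation (3), while the usual prefix cap of 4 characters is an assumption.

```python
import itertools

def jaro(s: str, t: str) -> float:
    """Jaro similarity [2]: matching characters within a sliding window,
    penalized by transpositions."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [s[i] for i in range(len(s)) if s_hit[i]]
    t_m = [t[j] for j in range(len(t)) if t_hit[j]]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Equation (3): boost Jaro by the length P' of the common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def jw_multi_token(entity: str, label: str) -> float:
    """Equation (4): max JW over all permutations of the label's tokens."""
    return max(jaro_winkler(entity, " ".join(perm))
               for perm in itertools.permutations(label.split()))
```

With this implementation, `jw_multi_token("Trump", "Donald Trump")` scores the permutation 'Trump Donald' highly, whereas the unpermuted label alone yields 0, mirroring the 'Trump' example above.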
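The rank discount of equation (5) and the popularity boost of equation (6) amount to a few lines. Again an illustrative sketch under our naming: cosine similarities and the normalized PageRank coefficient are assumed to be precomputed elsewhere.

```python
import math

BETA = 0.6  # damping coefficient, experimentally determined by the authors

def discounted_cosine(cos_value: float, k: int) -> float:
    """Equation (5): the top-ranked candidate keeps its full cosine
    similarity; lower-ranked candidates are discounted by log2(k)."""
    return cos_value if k == 1 else cos_value / math.log2(k)

def popularity(pagerank: float) -> float:
    """Equation (6): boosted, normalized PageRank R(c_k)."""
    return BETA * pagerank
```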
2.4 Entity Boundary Re-Scoping
We performed an additional post-processing step in which an identified entity's boundary is re-scoped based on the label of the resource linked to the entity in the previous phase. We apply this step when the resource label is a substring of the entity mention. In this way, we are able to filter out noisy tokens in entities identified in the first step by the entity recognition system. For instance, in the tweet

    Day 9: Wearing a StarWars T-Shirt each day until 'The Force Awakens'. We're half way there! https://t.co/QoAOxoSCJk

the entity recognition system identifies 'StarWars T-Shirt' as an entity, due to a capitalization issue; however, our linking algorithm links this entity correctly to the KB resource Star Wars, based on contextual and KB evidence. As a result, we re-scope the boundary of the identified entity from 'StarWars T-Shirt' to 'StarWars' to improve the identification performance of the system. We evaluate our system under two configurations, viz. without entity boundary re-scoping and with entity boundary re-scoping, as reported in Section 3 below.
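A possible realization of the re-scoping step is sketched below. The matching rule used here (compare token spans after removing spaces and lowercasing) is our assumption, not necessarily the authors' exact procedure.

```python
# Illustrative sketch of entity boundary re-scoping (Section 2.4).

def rescope(mention: str, resource_label: str) -> str:
    """Trim a mention to the token span matching the linked resource
    label, e.g. 'StarWars T-Shirt' + label 'Star Wars' -> 'StarWars'."""
    def squeeze(s: str) -> str:
        return s.replace(" ", "").lower()

    target = squeeze(resource_label)
    tokens = mention.split()
    for size in range(1, len(tokens) + 1):          # shortest span first
        for start in range(len(tokens) - size + 1):
            span = tokens[start:start + size]
            if squeeze("".join(span)) == target:
                return " ".join(span)
    return mention  # label not found as a substring: keep original boundary
```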
3. RESULTS
We use the training and dev datasets to test the performance of the pre-trained NER system (supervised approach), and use the identified entities to test the performance of our linking algorithm (unsupervised approach). The training and dev gold standards consist of ≈6000 and 100 tweets, annotated with a total of 8665 and 338 entities, respectively.

Table 1: Performance: Entity Linking and Classification
Dataset   Configuration        SLM    STMM   Mention Ceaf
Training  without Re-scoping   0.327  0.297  0.380
Training  with Re-scoping      0.336  0.300  0.378
Dev       without Re-scoping   0.194  0.139  0.237
Dev       with Re-scoping      0.221  0.134  0.250

Table 1 shows the performance of our entity linking and classification approach for Strong Link Match (SLM), Strong Typed Mention Match (STMM) and Mention Ceaf. As evident, the performance of the linking approach (SLM) improves when entity boundary re-scoping is applied, for both datasets. The overall low performance of the entity linking system can be attributed to the poor performance of the entity recognition system, illustrated in Table 2. The performance of the type classification approach (STMM) also improves for the training dataset with entity re-scoping; however, the improvement is not significant.

Table 2: Performance: Entity Recognition
Dataset   Configuration        Precision  Recall  F1 Measure
Training  without Re-scoping   0.627      0.362   0.459
Training  with Re-scoping      0.625      0.347   0.446
Dev       without Re-scoping   0.514      0.166   0.251
Dev       with Re-scoping      0.545      0.178   0.268

As shown in Table 2, good precision values are obtained on both datasets; however, recall as well as F1 scores on the dev dataset are poor. A possible reason is the presence of many #hashtags and @usernames recognized as entities in the ground truth, which leads to poor performance of the entity recognition system even when @ and # are removed. An important observation is that applying entity boundary re-scoping lowers precision and recall on the training dataset, whereas the opposite holds on the dev dataset. This can again be attributed to the many #hashtags and @usernames in the dev dataset, owing to which the entity recognition system exhibits entity segmentation errors.

Table 3: NER Oracle: Entity Linking Performance
Dataset    Precision  Recall  F1 Measure
Training*  0.524      0.459   0.489
Dev*       0.452      0.387   0.417

Finally, Table 3 summarizes the performance of our entity linking algorithm in terms of precision, recall and F1 scores assuming a NER oracle. To this end, we use modified versions of the Training and Dev gold standards, denoted Training* and Dev*, which comprise linkable entities only, i.e., they are void of NIL mentions. They are annotated with 6371 and 253 linkable entities, respectively. Our linking approach correctly links ≈50% of the entities in this modified ground truth. When a NER oracle is used, the performance of the system falls under entity boundary re-scoping; hence, we report results without re-scoping for the Training* and Dev* datasets. For the test set evaluation, we provide two runs of our system on the test dataset, one for each configuration.

In previous work we defined a more sophisticated entity classification method, which combines evidence from the LabeledLDA component of T-NER and from the types of candidate entities [3]. In this challenge we could not apply this method due to problems integrating the LabeledLDA component into our current pipeline, but we plan to use it again in the near future.

4. REFERENCES
[1] M. Atencia, A. Borgida, J. Euzenat, C. Ghidini, and L. Serafini. A formal semantics for weighted ontology mappings. In The Semantic Web – ISWC 2012, pages 17–33. Springer, 2012.
[2] M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498, 1995.
[3] P. Manchanda, E. Fersini, and M. Palmonari. Leveraging entity linking to enhance entity recognition in microblogs. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pages 147–155, 2015.
[4] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.
[5] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors, 6th Workshop on Making Sense of Microposts (#Microposts2016), pages 50–59, 2016.
[6] A. Thalhammer and A. Rettinger. Browsing DBpedia entities with summaries. In The Semantic Web: ESWC 2014 Satellite Events, pages 511–515. Springer, 2014.