<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/1235</article-id>
      <title-group>
        <article-title>UniMiB: Entity Linking in Tweets using Jaro-Winkler Distance, Popularity and Coherence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Caliano</string-name>
          <email>d.caliano@campus.unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Fersini</string-name>
          <email>fersini@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pikakshi Manchanda</string-name>
          <email>pikakshi.manchanda@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Palmonari</string-name>
          <email>palmonari@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enza Messina</string-name>
          <email>messina@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>6</volume>
      <fpage>70</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>This paper summarizes the participation of the UNIMIB team in the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. We propose a knowledge-base approach for identifying and linking named entities in tweets. The named entities are then classified using evidence provided by our entity linking algorithm and typecast into Microposts categories.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge base</kwd>
        <kwd>Named entity recognition</kwd>
        <kwd>Named entity linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Microblogging platforms such as Twitter have become a
rich source of real-time information. Today, information is
being readily extracted from such platforms, in the form of
named entities, relations and events. The tasks of this
challenge comprise the identification and classification of named
entities in a set of tweets, and the linking of the identified entities
to corresponding knowledge base (KB) resources if a match is found, or to a
NIL reference if no candidate resources can be retrieved [5].</p>
      <p>In order to identify named entities, we use a pre-trained,
state-of-the-art Named Entity Recognition (NER) system
[4]. Using this system, we tokenize and segment the tweets
to identify entities and non-entities. Our linking
algorithm follows a greedy approach that disambiguates
the identified entities and links them to DBpedia resources.
Finally, the entities are classified using evidence from the
linking phase.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY: 2.1 Named Entity Identification</title>
      <p>Copyright © 2016 held by author(s)/owner(s); copying permitted
only for private and academic purposes.</p>
      <p>Published as part of the #Microposts2016 Workshop proceedings,
available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691)</p>
      <p>To identify named entities, we use a
state-of-the-art NER system, T-NER [4], which is a supervised
model based on Conditional Random Fields (CRF) and
pre-trained on a state-of-the-art gold standard of tweets [4]. Given a
tweet t as input, the CRF model of T-NER identifies the candidate
entities e<sub>1</sub>, e<sub>2</sub>, ..., e<sub>n</sub> in t. In
other words, the CRF model segments a tweet into entities
and non-entities.</p>
      <p>Before performing entity recognition with T-NER, we
remove the special characters (@, #, ...) as a pre-processing
step and process the tweets in UTF-8 format in order to deal
with emoticons. T-NER is not trained to recognize
@usernames as entities, and the current version of our system does
not resolve username references. This has a significant
impact on the overall performance of our system.</p>
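      <p>The pre-processing step described above can be illustrated with a minimal Python sketch. It is not the code used in the challenge: the helper name strip_special_chars and the restriction to the two characters @ and # are assumptions made for illustration, since the full list of removed characters is not given.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: only '@' and '#' are stripped; the text lists "(@, #, ...)"
# without giving the complete set of special characters.
SPECIAL_CHARS = re.compile(r"[@#]")


def strip_special_chars(tweet: str) -> str:
    """Remove '@' and '#' while keeping the token text and any emoji."""
    return SPECIAL_CHARS.sub("", tweet)


# Python strings are Unicode, so emoji survive the substitution unchanged.
cleaned = strip_special_chars("@steph93065 she's no bigot. #Trump \U0001F600")
print(cleaned)
```
]]></preformat>
      <p>Note that stripping the markers keeps the username and hashtag text in the tweet, so the recognizer still sees these tokens even though it was not trained to treat them as entities.</p>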
    </sec>
    <sec id="sec-3">
      <title>2.2 Candidate Resource Selection &amp; Ranking</title>
      <p>To select candidate resources for an entity,
we use DBpedia<sup>1</sup> as our KB. As a pre-processing
step, every identified entity that contains a segment
beginning with a capital letter is split into a set of tokens
at the capital letters. For instance, the entity mention ‘StarWars’ is treated
as ‘Star Wars’ during the candidate retrieval phase so as to
obtain better candidate matches. To support retrieval, we extract
the titles of all Wikipedia articles<sup>2</sup> from DBpedia using
rdfs:label and index them using the Lucene API<sup>3</sup>. For each
identified entity, the top-k candidate KB resources are retrieved using
a high-recall approach, with k = 500. We estimate
a knowledge-base score, KB(e<sub>j</sub>, c<sub>k</sub>), for each candidate
resource c<sub>k</sub> of an entity e<sub>j</sub> as follows:</p>
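      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[\mathrm{KB}(e_j, c_k) = \bigl(\alpha \cdot \mathrm{lex}(e_j, l_{c_k}) + (1 - \alpha) \cdot \cos_k(e_j^{*}, a_{c_k})\bigr) + R(c_k)]]></tex-math>
        </disp-formula>
where:
        <list list-type="bullet">
          <list-item><p>lex(e<sub>j</sub>, l<sub>c<sub>k</sub></sub>) denotes a lexical similarity between an entity e<sub>j</sub> and the label l<sub>c<sub>k</sub></sub> of a candidate resource;</p></list-item>
          <list-item><p>cos<sub>k</sub>(e<sub>j</sub><sup>*</sup>, a<sub>c<sub>k</sub></sub>) represents a discounted cosine similarity between an entity context e<sub>j</sub><sup>*</sup> and a candidate KB abstract description a<sub>c<sub>k</sub></sub>;</p></list-item>
          <list-item><p>R(c<sub>k</sub>) is a popularity measure of a given candidate in the KB.</p></list-item>
        </list>
      </p>
      <p><sup>1</sup> http://wiki.dbpedia.org/ <sup>2</sup> http://dbpedia.org/Downloads2015-04 <sup>3</sup> http://lucene.apache.org/</p>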
      <p>More formally, lex(e<sub>j</sub>, l<sub>c<sub>k</sub></sub>) is defined as follows:
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[\mathrm{lex}(e_j, l_{c_k}) = \mathrm{lcs}(e_j, l_{c_k}) + W_D \, \frac{\mathrm{JW}(e_j, l_{c_k})}{W_D + 1}]]></tex-math>
        </disp-formula>
where lcs(e<sub>j</sub>, l<sub>c<sub>k</sub></sub>) denotes a normalized Lucene Conceptual
Score<sup>4</sup> between e<sub>j</sub> and l<sub>c<sub>k</sub></sub>, while the second term
is a string similarity measure, based on the well-known
Jaro-Winkler distance, between an entity and the label of
a candidate resource. The coefficient W<sub>D</sub> is set to
3.0 and acts as a boosting coefficient that gives more weight
to syntactically close matches. The asymmetric
Jaro-Winkler distance gives more weight to agreement
in the initial subsequences of the two strings, and is defined as:
      </p>
      <p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[\mathrm{JW}(e_j, l_{c_k}) = \mathrm{Jaro}(e_j, l_{c_k}) + \frac{P'}{10} \cdot \bigl(1 - \mathrm{Jaro}(e_j, l_{c_k})\bigr)]]></tex-math>
        </disp-formula>
where Jaro is a similarity metric [2] and P′ is a measure that
takes into account the length of the longest common prefix
of e<sub>j</sub> and l<sub>c<sub>k</sub></sub>. Moreover, when a candidate
label l<sub>c<sub>k</sub></sub> is composed of more than one token, we calculate
JW(e<sub>j</sub>, l<sub>c<sub>k</sub></sub>) as follows:
      </p>
      <p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math><![CDATA[\mathrm{JW}(e_j, l_{c_k}) = \max\bigl(\mathrm{JW}(e_j, P_1(l_{c_k})), \ldots, \mathrm{JW}(e_j, P_n(l_{c_k}))\bigr)]]></tex-math>
        </disp-formula>
where P<sub>i</sub>(l<sub>c<sub>k</sub></sub>) denotes the i-th of the n possible permutations of the
tokens in l<sub>c<sub>k</sub></sub>. This step is needed because
users may refer to an entity in a tweet using a concise, more
popular substring of the entity name, which is not necessarily
its first token. For instance, in the tweet
        <disp-quote>
          <p>@steph93065 shes hates me but she’s no bigot, intelligent and correct most of the time. #Trump</p>
        </disp-quote>
the candidate KB resources for the entity
mention ‘Trump’ include Trump (card game, rdf:type Thing),
Donald Trump (rdf:type Person), and Trump (comics)
(rdf:type CartoonCharacter), amongst other resources. By
using equation (4), we are able to
compute the JW similarity for the entity mention ‘Trump’ not
only with ‘Donald Trump’, which yields a low JW similarity,
but also with ‘Trump’, which yields a high JW similarity.
      </p>
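      <p>Equations (2) to (4) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the challenge: the function names are hypothetical, the common-prefix measure P′ is capped at 4 characters as in the standard Winkler variant (the cap is not stated here), comparison is case-sensitive, the max_tokens guard against factorial growth is our addition, and the Lucene score lcs is passed in as a plain number rather than computed.</p>
      <preformat><![CDATA[
```python
from itertools import permutations


def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, max_prefix: int = 4) -> float:
    """Equation (3); P' is the common-prefix length, capped (assumption)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix / 10 * (1 - base)


def jw_over_permutations(mention: str, label: str, max_tokens: int = 6) -> float:
    """Equation (4): best JW score over all token orders of the label."""
    tokens = label.split()
    if len(tokens) > max_tokens:  # assumption: avoid n! blow-up on long labels
        return jaro_winkler(mention, label)
    return max(jaro_winkler(mention, " ".join(p)) for p in permutations(tokens))


def lex(lcs_score: float, mention: str, label: str, w_d: float = 3.0) -> float:
    """Equation (2) with W_D = 3.0; lcs_score stands in for the Lucene score."""
    return lcs_score + w_d * jw_over_permutations(mention, label) / (w_d + 1)


print(jaro_winkler("Trump", "Donald Trump"))          # original token order
print(jw_over_permutations("Trump", "Donald Trump"))  # includes "Trump Donald"
```
]]></preformat>
      <p>In this sketch the permuted label ‘Trump Donald’ shares a prefix with the mention, so the maximum in equation (4) is much higher than the score obtained with the original token order.</p>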
      <p>
        To evaluate the second component, cos<sub>k</sub>(e<sub>j</sub><sup>*</sup>, a<sub>c<sub>k</sub></sub>), of the KB
score in equation (1), we indexed the extended abstracts
of all DBpedia resources. The aim is to disambiguate an entity against a candidate
label using the entity’s usage context in the tweet, on the one
hand, and contextual evidence from the KB on the other.
The measure cos<sub>k</sub>(e<sub>j</sub><sup>*</sup>, a<sub>c<sub>k</sub></sub>), which denotes the
contextual similarity between an entity e<sub>j</sub> and a KB candidate
resource c<sub>k</sub>, is defined as:
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math><![CDATA[\cos_k(e_j^{*}, a_{c_k}) =
\begin{cases}
\cos(e_j^{*}, a_{c_k}) & \text{if } k = 1 \\[4pt]
\dfrac{\cos(e_j^{*}, a_{c_k})}{\log_2(k)} & \text{if } k \geq 2
\end{cases}]]></tex-math>
        </disp-formula>
where cos(e<sub>j</sub><sup>*</sup>, a<sub>c<sub>k</sub></sub>) denotes the cosine similarity between an
entity context e<sub>j</sub><sup>*</sup> and a candidate KB abstract description
a<sub>c<sub>k</sub></sub>. To compute equation (5), we retrieve the abstracts for
all the top-k candidate resources c<sub>1</sub>, c<sub>2</sub>, ..., c<sub>k</sub> from DBpedia.
An entity context, denoted as e<sub>j</sub><sup>*</sup>, is modelled as a vector
composed of an identified entity e<sub>j</sub> in a tweet t<sub>i</sub> and the
words in the tweet that have been tagged as noun, verb or
adjective. Equation (5) allows us to scale the similarity with
respect to each candidate abstract according to its ranking
position.
      </p>
      <p><sup>4</sup> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html</p>
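      <p>A minimal Python sketch of the rank-discounted cosine of equation (5) is given below. The vector weighting is not specified here, so raw term frequencies are assumed; the function names and the toy token lists are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Sequence


def cosine(a_tokens: Sequence[str], b_tokens: Sequence[str]) -> float:
    """Cosine similarity of two bags of words (term-frequency weights assumed)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def discounted_cosine(context: Sequence[str], abstract: Sequence[str], k: int) -> float:
    """Equation (5): k is the 1-based ranking position of the candidate."""
    if k < 1:
        raise ValueError("rank k must be >= 1")
    sim = cosine(context, abstract)
    # log2(1) = 0, hence the separate case for the top-ranked candidate.
    return sim if k == 1 else sim / math.log2(k)


context = ["trump", "hates", "bigot", "intelligent", "correct"]
abstract = ["trump", "american", "businessman", "politician"]
for k in (1, 2, 4, 8):
    print(k, round(discounted_cosine(context, abstract, k), 4))
```
]]></preformat>
      <p>Because log<sub>2</sub>(2) = 1, the first two ranking positions are not discounted; the discount only takes effect from the third position onwards.</p>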
      <p>
        Finally, the last term in equation (1) is
R(c<sub>k</sub>), which takes into account
the popularity of a given candidate in the KB for the final
ranking. We compute the popularity R(c<sub>k</sub>)
of a KB resource c<sub>k</sub> using the following boosted
PageRank coefficient:
      </p>
      <p>
        <disp-formula id="eq6">
          <label>(6)</label>
          <tex-math><![CDATA[R(c_k) = \beta \cdot \mathrm{PR}(c_k)]]></tex-math>
        </disp-formula>
where PR(c<sub>k</sub>) is the normalized PageRank coefficient [6],
and β is a damping coefficient in the range [0, 1],
experimentally set to 0.6.
      </p>
      <p>
        To determine the optimal configuration of our
system, the parameters were evaluated experimentally.
The top-k candidates are ranked using equation (1), where
the score of each candidate resource is denoted by KB(e<sub>j</sub>, c<sub>k</sub>).
Finally, the value of α in equation (1) was varied
over the range [0, 1], and α = 0.7 gave the best configuration.
      </p>
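      <p>Putting the three components together, the score of equation (1) and the resulting ranking can be sketched as below, using the reported values α = 0.7 and β = 0.6. The Candidate record, its field names, and the numeric values in the usage example are hypothetical; the component scores are assumed to have been computed beforehand.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List

ALPHA = 0.7  # weight of the lexical component, as reported in the text
BETA = 0.6   # PageRank boosting coefficient, as reported in the text


@dataclass
class Candidate:
    uri: str          # hypothetical identifier of the KB resource
    lex: float        # lexical similarity, Equation (2)
    cos_k: float      # discounted contextual similarity, Equation (5)
    pagerank: float   # normalized PageRank PR(c_k)


def kb_score(c: Candidate, alpha: float = ALPHA, beta: float = BETA) -> float:
    """Equation (1): weighted lexical and contextual evidence plus popularity."""
    return alpha * c.lex + (1 - alpha) * c.cos_k + beta * c.pagerank


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest KB score."""
    return sorted(candidates, key=kb_score, reverse=True)


# Illustrative numbers only; they are not taken from the evaluation.
pool = [
    Candidate("Trump_(card_game)", lex=1.2, cos_k=0.05, pagerank=0.10),
    Candidate("Donald_Trump", lex=1.1, cos_k=0.40, pagerank=0.90),
]
for c in rank(pool):
    print(c.uri, round(kb_score(c), 3))
```
]]></preformat>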
    </sec>
    <sec id="sec-4">
      <title>2.3 Entity Linking and Type Classification</title>
      <p>
        We follow an unsupervised, greedy approach to link an
entity with a DBpedia resource: every identified entity is linked
to the candidate resource
with the highest score according to equation
(1). Entities for which no candidate matches are
retrieved from the index are mapped to a NIL
reference with the assigned type Thing. The entities are
then classified using the relation rdf:type with the help of
the dbpedia-owl ontology<sup>5</sup>. For this purpose, we indexed the
mapping-based types dataset of DBpedia classes<sup>6</sup>.
      </p>
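      <p>The greedy linking rule with its NIL fallback can be sketched as follows. The Scored record, the link_entity function and the dictionary layout of the result are hypothetical names chosen for illustration; candidate scores are assumed to come from equation (1).</p>
      <preformat><![CDATA[
```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    uri: str        # KB resource identifier
    score: float    # KB score from Equation (1)
    rdf_type: str   # DBpedia ontology class of the resource


def link_entity(mention: str, candidates: List[Scored]) -> dict:
    """Link a mention to its best-scoring candidate, or to NIL if none exist."""
    if not candidates:
        # No candidate retrieved from the index: NIL reference, typed as Thing.
        return {"mention": mention, "link": "NIL", "type": "Thing"}
    best = max(candidates, key=lambda c: c.score)
    return {"mention": mention, "link": best.uri, "type": best.rdf_type}


print(link_entity("Trump", [Scored("Donald_Trump", 1.43, "Person"),
                            Scored("Trump_(card_game)", 0.92, "Thing")]))
print(link_entity("steph93065", []))
```
]]></preformat>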
      <p>Moreover, we established a mapping between the DBpedia
Ontology and the Microposts categories (Thing, Person,
Location, Organization, Event, Character and Product) by
following the description of the Microposts categories [5] provided by
the challenge organizers. Every DBpedia Ontology class
that cannot be mapped intuitively following this
description, such as the class Species, is mapped
to the Microposts category Thing. We adopted only one
exception to this rule: we mapped the DBpedia
Ontology class Name and its subclasses GivenName and Surname
to the Microposts category Person. Given names and
surnames are used in tweets mostly to refer to a person in the
real world, i.e., they are mentions of entities that would be
re-classified under the Microposts category Person. This
interpretation of the mapping for names and surnames is inspired
by previous work on mapping semantics [1].</p>
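      <p>The class-to-category mapping can be pictured as a lookup table with Thing as the default. In the sketch below only the entries for Name, GivenName, Surname and the fallback for Species follow from the text; the remaining entries are illustrative guesses at intuitive mappings and are not the mapping actually used.</p>
      <preformat><![CDATA[
```python
# Assumption: the first five entries are illustrative examples of "intuitive"
# mappings; the text does not list the mapping that was actually used.
MICROPOSTS_CATEGORY = {
    "Person": "Person",
    "Place": "Location",
    "Organisation": "Organization",
    "Event": "Event",
    "FictionalCharacter": "Character",
    # Exception stated in the text: names are treated as persons.
    "Name": "Person",
    "GivenName": "Person",
    "Surname": "Person",
}


def to_microposts_category(dbpedia_class: str) -> str:
    """Map a DBpedia Ontology class to a Microposts category (default: Thing)."""
    return MICROPOSTS_CATEGORY.get(dbpedia_class, "Thing")


for cls in ("Surname", "Species", "Place"):
    print(cls, "->", to_microposts_category(cls))
```
]]></preformat>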
    </sec>
    <sec id="sec-5">
      <title>2.4 Entity Boundary Re-Scoping</title>
      <p><sup>5</sup> http://mappings.dbpedia.org/server/ontology/classes/ <sup>6</sup> http://dbpedia.org/Downloads2015-04</p>
      <p>We perform an additional post-processing step, in which
an identified entity’s boundary is re-scoped based on the
label of the resource linked to the entity in the previous phase.
We apply this step when the resource label is a substring of
the entity mention. In this way, we are able to filter out
noisy tokens in entities that were identified in the first step
by the entity recognition system. For instance, in the tweet
        <disp-quote>
          <p>Day 9: Wearing a StarWars T-Shirt each day until ‘The Force Awakens’. We’re half way there! https://t.co/QoAOxoSCJk</p>
        </disp-quote>
the entity recognition system identifies ‘StarWars T-Shirt’ as
an entity because of its capitalization. However, our linking
algorithm is able to link this entity correctly with the KB
resource Star Wars, based on contextual and KB evidence.
As a result, we re-scope the boundary of the identified entity
‘StarWars T-Shirt’ to ‘StarWars’ to improve the
identification performance of the system. We evaluate our system
using two configurations, namely without and with entity boundary
re-scoping, as reported in
Section 3 below.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Performance of the entity linking and classification approach. Only the STMM and Mention Ceaf columns survive in the source; the values are listed in their original order, without the SLM column and the row labels.</p>
        </caption>
        <table>
          <thead>
            <tr><th>STMM</th><th>Mention Ceaf</th></tr>
          </thead>
          <tbody>
            <tr><td>0.297</td><td>0.380</td></tr>
            <tr><td>0.300</td><td>0.378</td></tr>
            <tr><td>0.139</td><td>0.237</td></tr>
            <tr><td>0.134</td><td>0.250</td></tr>
          </tbody>
        </table>
      </table-wrap>
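      <p>The boundary re-scoping step of this section can be sketched in Python as follows. The function name rescope is hypothetical, and the substring test is assumed to ignore whitespace and letter case, so that the label ‘Star Wars’ matches inside the mention ‘StarWars T-Shirt’; the exact matching rule is not specified in the text.</p>
      <preformat><![CDATA[
```python
def rescope(mention: str, label: str) -> str:
    """Shrink `mention` to the span matching `label`, if the label occurs in it.

    Matching ignores whitespace and case (assumption). The mention is returned
    unchanged when the label is not contained in it.
    """
    chars = []  # normalized characters of the mention
    index = []  # index[i] = position in `mention` that produced chars[i]
    for pos, ch in enumerate(mention):
        if ch.isspace():
            continue
        for low in ch.lower():  # lower() may expand one character into several
            chars.append(low)
            index.append(pos)
    squeezed = "".join(chars)
    target = "".join(label.split()).lower()
    if not target:
        return mention
    start = squeezed.find(target)
    if start == -1:
        return mention
    end = start + len(target) - 1
    first, last = index[start], index[end]
    return mention[first:last + 1]


print(rescope("StarWars T-Shirt", "Star Wars"))
```
]]></preformat>
      <p>A production version would probably also require the match to align with token boundaries, since a purely character-based test can fire inside a longer word.</p>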
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
      <p>We use the training and dev datasets to test the
performance of the pre-trained NER system (supervised approach)
and use the identified entities to test the performance of
our linking algorithm (unsupervised approach). The
training and dev gold standards consist of ≈6000 and 100 tweets,
annotated with a total of 8665 and 338 entities, respectively.</p>
      <p>Table 1 shows the performance of our entity linking and
classification approach for Strong Link Match (SLM), Strong
Typed Mention Match (STMM) and Mention Ceaf. The
performance of the linking approach (SLM)
improves on both datasets when entity boundary re-scoping is
applied. The overall low performance of the entity
linking system can be attributed to the poor performance of the
entity recognition system, as illustrated in Table 2. The
performance of the type classification approach
(STMM) improves on the training dataset with entity
re-scoping, although the improvement is not significant.</p>
      <p>As shown in Table 2, high precision values are
obtained on both datasets; however, recall and F1
scores on the dev dataset are poor. A possible reason is
the presence of many #hashtags and
@usernames annotated as entities in the ground truth, which
leads to poor performance of the entity recognition system
even when @ and # are removed. An important observation is
that applying entity boundary re-scoping lowers precision and
recall on the training dataset, whereas the
opposite holds for the dev dataset. This can again be attributed to
the many #hashtags and @usernames in the dev
dataset, which cause the entity recognition system to make
entity segmentation errors.</p>
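      <p>To make the relation between missed mentions and the reported scores concrete, the sketch below computes precision, recall and F1 over sets of mention spans. It assumes exact matching of (start, end) offsets, which is a simplification and not a reimplementation of the challenge scorer; the function name and the toy spans are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a mention


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over mention spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: four gold mentions, of which the recognizer finds only one,
# plus one spurious span.
gold = {(0, 11), (20, 26), (30, 38), (40, 46)}
predicted = {(20, 26), (50, 55)}
print(precision_recall_f1(gold, predicted))
```
]]></preformat>
      <p>In this toy setting every gold mention that the recognizer misses, such as an unresolved @username, lowers recall without affecting precision, which matches the pattern described above.</p>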
      <p>Finally, Table 3 summarizes the performance of our entity
linking algorithm in terms of precision, recall and F1 scores
assuming a NER oracle. To this end, we use a modified
version of the training and dev gold standards, denoted as
Training* and Dev*, which comprise linkable entities only,
i.e., they contain no NIL mentions. They are annotated with 6371
and 253 linkable entities, respectively. Our linking approach
correctly links ≈50% of the entities in the modified
ground truth. When a NER oracle is used, the entity boundaries
are already correct, so the performance
of the system falls when entity boundary re-scoping is applied.
Hence, we report the results without entity boundary
re-scoping for the Training* and Dev* datasets. For the test
set evaluation, we provide two runs of our system on the test
dataset for both configurations.</p>
      <p>In previous work we defined a more sophisticated entity
classification method, which combines evidence from the
LabeledLDA component of T-NER and from the types of
candidate entities [3]. In this challenge we could not apply
this method because of problems integrating the LabeledLDA
component into our current pipeline, but we plan to use it
again in the near future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Atencia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borgida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghidini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          .
          <article-title>A formal semantics for weighted ontology mappings</article-title>
          .
          <source>In The Semantic Web-ISWC</source>
          <year>2012</year>
          , pages
          <fpage>17</fpage>
          -
          <lpage>33</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Jaro</surname>
          </string-name>
          .
          <article-title>Probabilistic linkage of large public health data files</article-title>
          . Statistics in medicine,
          <volume>14</volume>
          (
          <issue>5-7</issue>
          ):
          <fpage>491</fpage>
          -
          <lpage>498</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manchanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          .
          <article-title>Leveraging entity linking to enhance entity recognition in microblogs</article-title>
          .
          <source>In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          , et al.
          <article-title>Named entity recognition in tweets: an experimental study</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          . Association for Computational Linguistics,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>
          . In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors,
          <source>6th Workshop on Making Sense of Microposts (#Microposts2016)</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thalhammer</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          .
          <article-title>Browsing dbpedia entities with summaries</article-title>
          .
          <source>In The Semantic Web: ESWC 2014 Satellite Events</source>
          , pages
          <fpage>511</fpage>
          -
          <lpage>515</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>