<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linking Entities in #Microposts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Romil Bansal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandeep Panem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priya Radhakrishnan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Gupta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>Social media has emerged as an important source of information. Entity linking in social media provides an effective way to extract useful information from the microposts that users share. Entity linking in microposts is difficult because they lack sufficient context to disambiguate entity mentions. In this paper, we perform entity linking by first identifying entity mentions and then disambiguating them based on three features: (1) similarity between the mention and the corresponding Wikipedia entity pages; (2) similarity between the mention together with the tweet text and the anchor text strings across multiple webpages; and (3) popularity of the entity on Twitter at the time of disambiguation. The system is tested on the manually annotated dataset provided by the Named Entity Extraction and Linking (NEEL) Challenge 2014, and the results obtained are on par with state-of-the-art methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Disambiguation</kwd>
        <kwd>Named Entity Extraction and Linking (NEEL) Challenge</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Figure 1: System Architecture. Component labels recovered from the figure: Wikipedia based measure; Twitter popularity based measure.]</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Social media networks like Twitter have emerged as major
platforms for sharing information in the form of short messages (tweets).
Analysis of tweets can be useful for various applications such as
e-commerce, entertainment and recommendations. Entity linking is
one such analysis task: it deals with finding the correct referent
entities in the knowledge base for the various mentions in a tweet.
Entity linking in social media is important because it helps in
detecting, understanding and tracking information about an entity shared
across social media.</p>
      <p>Copyright c 2014 held by author(s)/owner(s); copying permitted
only for private and academic purposes.</p>
      <p>Published as part of the #Microposts2014 Workshop proceedings,
available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141).</p>
    </sec>
    <sec id="sec-3">
      <title>OUR APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>Mention Detection</title>
      <p>
        Mention detection is the task of finding entity mentions in the
given text. We treat the named entities present in a tweet as its
mentions. Various approaches for named entity recognition
in tweets have been proposed recently [
        <xref ref-type="bibr" rid="ref3 ref5 ref8">3, 5</xref>
        ]. These include spotting
continuous sequences of proper nouns as named entities in the tweet.
However, named entities such as ‘Statue of Liberty’ and ‘Game of
Thrones’ also include tokens other than nouns. To detect such
mentions, Ritter et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a machine learning based
system for named entity detection in tweets. Gimpel et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] present
another approach for POS tagging of tweets. We tried both of
these POS taggers to extract proper noun sequences. In our
experiments, Ritter et al.’s tagger gave an accuracy of 77%, while Gimpel
et al.’s tagger gave an accuracy of 92%. We therefore merged the
results from both taggers, as shown in Fig. 1. The tweet text is fed to the
system, and the longest continuous sequences of proper noun
tokens detected by the two taggers are extracted as the entity
mentions of the given tweet. The merged system achieved an
accuracy of 98% in predicting mentions.
      </p>
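      <p>The span extraction and merging step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the source does not state how the two taggers' outputs are merged, so the rule used here (union of spans, fusing spans that overlap or touch) is an assumption, and the per-token proper-noun flags are hypothetical stand-ins for real tagger output.</p>
      <preformat>
```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token index range [start, end)


def proper_noun_spans(is_proper: Sequence[bool]) -> List[Span]:
    """Return maximal runs of consecutive tokens flagged as proper nouns."""
    spans: List[Span] = []
    start = None
    for i, flag in enumerate(is_proper):
        if flag and start is None:
            start = i  # a new run begins
        elif not flag and start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(is_proper)))
    return spans


def merge_spans(a: List[Span], b: List[Span]) -> List[Span]:
    """Union two span lists, fusing spans that overlap or touch.

    Assumption: the merge rule is not given in the source; this is one plausible choice.
    """
    merged: List[Span] = []
    for start, end in sorted(a + b):
        if merged and merged[-1][1] >= start:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract_mentions(tokens: Sequence[str],
                     flags_a: Sequence[bool],
                     flags_b: Sequence[bool]) -> List[str]:
    """Extract mention strings from the merged proper-noun spans of two taggers."""
    spans = merge_spans(proper_noun_spans(flags_a), proper_noun_spans(flags_b))
    return [" ".join(tokens[s:e]) for s, e in spans]


# Hypothetical tagger outputs for one tweet (not real tagger output).
tokens = ["Watching", "Game", "of", "Thrones", "tonight"]
tagger_a = [False, True, False, True, False]  # misses the token "of"
tagger_b = [False, True, True, True, False]   # flags the whole title
mentions = extract_mentions(tokens, tagger_a, tagger_b)
```
      </preformat>
      <p>In this toy input, one tagger splits the title into two fragments while the other covers it entirely; the merged result keeps the single longest sequence.</p>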
      <p>
        Entity disambiguation is the task of assigning the correct referent
entity from the knowledge base to a given mention. We
disambiguate each entity mention using the three measures described below.
The scores from these three measures are combined using a
LambdaMART [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] model to arrive at the final disambiguated entity.
      </p>
      <sec id="sec-4-1">
        <title>Wikipedia’s Context based Measure (M1)</title>
        <p>
          This measure disambiguates a mention by calculating the
frequency of occurrence of the mention in the Wikipedia corpus. Wikipedia’s
context based measure has been used in various approaches for
disambiguating mentions in tweets [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We query the MediaWiki API
with the entity mention. The API returns the candidate
entities in ranked order. Each candidate entity is assigned its
reciprocal rank as its score. Thus, M1 produces a ranked list of candidate
entities with their scores.
        </p>
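        <p>The reciprocal-rank scoring used by M1 can be illustrated with a short sketch. The candidate list below is a hypothetical stand-in for a ranked API response; no real API call is made, and the handling of repeated candidates is an assumption.</p>
        <preformat>
```python
from typing import Dict, List


def reciprocal_rank_scores(ranked_candidates: List[str]) -> Dict[str, float]:
    """Map each candidate to 1/rank, where the first candidate has rank 1."""
    scores: Dict[str, float] = {}
    for rank, candidate in enumerate(ranked_candidates, start=1):
        # Assumption: if a candidate is repeated, keep its best (earliest) rank.
        scores.setdefault(candidate, 1.0 / rank)
    return scores


# Hypothetical ranked candidates for the mention "Paris".
m1_scores = reciprocal_rank_scores(["Paris", "Paris_Hilton", "Paris,_Texas"])
```
        </preformat>
        <p>Here the top candidate receives a score of 1, the second 1/2, and the third 1/3.</p>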
      </sec>
      <sec id="sec-4-2">
        <title>Anchor Text based Measure (M2)</title>
        <p>
          Google Cross-Wiki Dictionary (GCD) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a string to concept
mapping, created using anchor text from various web pages. A
concept is an individual Wikipedia article, identified by its URL.
The text strings constitute the anchor hypertexts that refer to these
concepts. Thus, anchor text strings represent a concept. We query
the GCD with the mention along with the tweet text. Based on
similarity to the query string, a ranked list of probable candidate
entities is created (this is the ranked list for M2). The ranking
criterion is the Jaccard similarity between the anchor text and
the query: if the mention is highly similar to the anchor text,
the corresponding concept receives a high score.
        </p>
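        <p>A minimal sketch of ranking concepts by Jaccard similarity follows. It makes several assumptions that the source does not specify: similarity is computed over lowercased whitespace tokens, a concept takes the best score among its anchor strings, and the small anchor dictionary is a hypothetical stand-in for the GCD.</p>
        <preformat>
```python
from typing import Dict, List, Set, Tuple


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection divided by size of union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def rank_by_anchor_similarity(query: str,
                              anchors: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank concepts by the best Jaccard score of any anchor text pointing to them."""
    query_tokens = token_set(query)
    best: Dict[str, float] = {}
    for anchor_text, concept in anchors.items():
        score = jaccard(query_tokens, token_set(anchor_text))
        best[concept] = max(best.get(concept, 0.0), score)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical anchor-text dictionary: anchor string mapped to a concept.
toy_anchors = {
    "big apple": "New_York_City",
    "apple": "Apple_Inc.",
    "apple fruit": "Apple",
}
m2_ranking = rank_by_anchor_similarity("apple", toy_anchors)
```
        </preformat>
        <p>The query here is the mention alone; following the description above, the query string could equally be the mention concatenated with the tweet text.</p>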
      </sec>
      <sec id="sec-4-3">
        <title>Twitter Popularity based Measure (M3)</title>
        <p>Tweets about entities follow a bursty pattern: a burst of tweets
appears after an event relating to an entity
happens. We exploit this fact and measure the number
of times the given mention has recently referred to a particular entity
on Twitter. The mention is queried using the Twitter API and the
resultant tweets are analyzed. All the tweets, along with the mention,
are then queried on the GCD and the candidate entities are collected.
Based on the scores returned by the GCD, all the candidate entities
are ranked (this is the ranked list for M3). As the Twitter
popularity based measure captures people’s interests at a particular
time, it works well for entity disambiguation on recent tweets. In
essence, the methods M2 and M3 are similar but take different
inputs. Both use the GCD and produce candidate entities and scores as
output. However, M2 takes the mention and a single tweet text as input,
whereas M3 takes the mention and multiple tweets as input.</p>
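        <p>The multi-tweet aggregation behind M3 might look like the sketch below. The source does not say how per-tweet scores are combined, so summing them is an assumption; the scoring function and the tweets are hypothetical stand-ins for a dictionary lookup and a Twitter search result.</p>
        <preformat>
```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Scorer = Callable[[str], Dict[str, float]]


def popularity_ranking(mention: str,
                       recent_tweets: Iterable[str],
                       score_query: Scorer) -> List[Tuple[str, float]]:
    """Sum per-tweet candidate scores (assumed aggregation) and rank candidates."""
    totals: Dict[str, float] = defaultdict(float)
    for tweet in recent_tweets:
        # Each query is the mention together with one recent tweet.
        for candidate, score in score_query(mention + " " + tweet).items():
            totals[candidate] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def toy_scorer(query: str) -> Dict[str, float]:
    """Hypothetical stand-in for a dictionary lookup: cue words vote for a concept."""
    words = set(query.lower().split())
    scores: Dict[str, float] = {}
    if "iphone" in words:
        scores["Apple_Inc."] = 1.0
    if "pie" in words:
        scores["Apple"] = 1.0
    return scores


# Hypothetical recent tweets containing the mention "apple".
tweets = ["new iphone launch today", "iphone event was great", "baking a pie"]
m3_ranking = popularity_ranking("apple", tweets, toy_scorer)
```
        </preformat>
        <p>Because two of the three recent tweets point to the company, it outranks the fruit for this time window, which is the bursty behaviour the measure is meant to capture.</p>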
        <p>We now have three rankings, from M1, M2 and M3. The remaining
task is to arrive at a final ranking of the candidate entities by
combining the rankings of the three models. The rankings
should be combined such that the overall F1 score
is maximized. For this, we use LambdaMART, which combines
the LambdaRank and MART models. LambdaMART creates boosted
regression trees for combining the rankings of the three
systems.</p>
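        <p>Before a learning-to-rank model can combine the three measures, each candidate needs a feature vector. The sketch below shows only that assembly step, under the assumption that a candidate missing from a measure receives a score of 0.0; it does not train or reproduce the LambdaMART model, and the score dictionaries are hypothetical.</p>
        <preformat>
```python
from typing import Dict, List, Tuple


def build_feature_rows(m1: Dict[str, float],
                       m2: Dict[str, float],
                       m3: Dict[str, float]) -> List[Tuple[str, List[float]]]:
    """Return one (candidate, [m1, m2, m3]) row per candidate seen by any measure."""
    candidates = sorted(set(m1) | set(m2) | set(m3))
    # Assumption: a candidate absent from a measure gets a score of 0.0 there.
    return [(c, [m1.get(c, 0.0), m2.get(c, 0.0), m3.get(c, 0.0)])
            for c in candidates]


# Hypothetical scores for the candidates of a single mention.
rows = build_feature_rows(
    {"A": 1.0, "B": 0.5},
    {"B": 0.8},
    {"A": 2.0, "C": 1.0},
)
```
        </preformat>
        <p>Rows of this kind, grouped per mention and paired with relevance labels from the annotated data, are the usual input format for boosted-tree ranking models.</p>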
      </sec>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND EVALUATION</title>
      <p>The dataset comprises 2.3K tweets, each annotated with the
entity mention and its corresponding DBpedia URL. We divided
the dataset in a 7:3 (train:test) ratio. Table 1 shows the results
obtained using the NEEL Challenge evaluation framework. The
best results are obtained when a combination of all the measures
is used for disambiguation (submitted as Agglutweet_1.tsv). A 5-fold cross validation on the
dataset gave an average F1 of 0.52 for M1+M2+M3.</p>
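      <p>For reference, precision, recall and F1 over linked mentions can be computed as in the sketch below. The matching rule of the NEEL Challenge evaluation framework is not described here, so exact matching of (tweet id, mention, entity URI) triples is an assumption, and the annotations shown are invented for illustration.</p>
      <preformat>
```python
from typing import Set, Tuple

Annotation = Tuple[str, str, str]  # (tweet id, mention, entity URI)


def precision_recall_f1(predicted: Set[Annotation],
                        gold: Set[Annotation]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 (harmonic mean of the two)."""
    correct = len(predicted.intersection(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented annotations: four gold links, three predictions, two of them correct.
gold = {
    ("t1", "Paris", "dbpedia:Paris"),
    ("t1", "Louvre", "dbpedia:Louvre"),
    ("t2", "Apple", "dbpedia:Apple_Inc."),
    ("t3", "Obama", "dbpedia:Barack_Obama"),
}
predicted = {
    ("t1", "Paris", "dbpedia:Paris"),
    ("t2", "Apple", "dbpedia:Apple"),
    ("t3", "Obama", "dbpedia:Barack_Obama"),
}
p, r, f1 = precision_recall_f1(predicted, gold)
```
      </preformat>
      <p>With two of three predictions correct against four gold links, precision is 2/3, recall is 1/2, and F1 is 4/7.</p>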
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>Mention detection in tweets is important for effective entity
linking. We improve the accuracy of detecting mentions by combining
multiple Twitter POS taggers. We resolve multiple mentions,
abbreviations and spelling variations of a named entity using the Google
Cross-Wiki Dictionary. We also use the popularity of an entity on
Twitter to improve the disambiguation. Our system performed well,
with an F1 score of 0.512 on the given dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. E. Cano</given-names>
            <surname>Basave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stankovic</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.-S.</given-names>
            <surname>Dadzie</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2014) Named Entity Extraction &amp; Linking Challenge</article-title>
          .
          <source>In Proc., 4th Workshop on Making Sense of Microposts (#Microposts2014)</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Flanigan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments</article-title>
          .
          <source>In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (ACL-HLT)</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kıcıman</surname>
          </string-name>
          .
          <article-title>To Link or Not to Link? A Study on End-to-End Tweet Entity Linking</article-title>
          .
          <source>In Proc. of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)</source>
          , pages
          <fpage>1020</fpage>
          -
          <lpage>1030</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Entity Linking for Tweets</article-title>
          .
          <source>In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pages
          <fpage>1304</fpage>
          -
          <lpage>1311</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Mausam, and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named Entity Recognition in Tweets: An Experimental Study</article-title>
          .
          <source>In Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Spitkovsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A Cross-Lingual Dictionary for English Wikipedia Concepts</article-title>
          .
          <source>In Proc. of the 8th Intl. Conf. on Language Resources and Evaluation (LREC)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Svore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <article-title>Adapting Boosting for Information Retrieval Measures</article-title>
          .
          <source>Journal of Information Retrieval</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>254</fpage>
          -
          <lpage>270</lpage>
          ,
          Jun
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          Footnote 3: submitted as Agglutweet_1.tsv
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>