<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MSM2013 IE Challenge NERTUW: Named Entity Recognition on Tweets using Wikipedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandhya Sachidanandan</string-name>
          <email>sandhya.s@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prathyush Sambaturu</string-name>
          <email>prathyush.sambaturu@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamalakar Karlapalem</string-name>
          <email>kamal@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIIT-Hyderabad</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>67</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>We propose an approach that uses Wikipedia to recognize named entities in tweets, disambiguate them, and classify them into four categories, namely person, organization, location and miscellaneous. Our approach annotates tweets on the fly, i.e., it does not require any training data.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity recognition</kwd>
        <kwd>entity disambiguation</kwd>
        <kwd>entity classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A significant number of the tweets generated each day discuss different
types of popular entities, such as persons, locations and organizations.
Most popular entities have a page in Wikipedia. Hence, Wikipedia can act
as a useful source of information for recognizing popular named entities in
tweets. Moreover, Wikipedia contains a huge number of names of different types
of entities, which helps us recognize entities that do not have an explicit
page in Wikipedia.</p>
      <p>Tweets are very short. A tweet may or may not carry enough context
to disambiguate the named entities in it. Only a small number of words in
the tweet support the disambiguation of named entities, and these need to
be utilized efficiently. If the tweet does not have enough context to
disambiguate the named entities in it, the popularity of each entity
has to be leveraged for disambiguation. Disambiguating an entity is essential
to classify it correctly as location, person, organization or miscellaneous.</p>
      <p>Our contributions are: 1) An approach that utilizes the titles, anchors and
infoboxes contained in Wikipedia, a little information from WordNet, and the
context information in tweets to recognize, disambiguate and classify named
entities in tweets. 2) Our approach does not require any training data, and
hence no human labelling effort is needed. 3) Along with the global information
from Wikipedia, our approach utilizes the context information in the tweet by
mapping words to their correct senses using a word sense disambiguation
approach, which is then used to disambiguate the named entities in the tweet.
This also helps disambiguate words in the tweet other than the named entities,
if any.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        – The input tweet is split into ngrams. The link probability of each ngram is
calculated as in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and ngrams with link probability below a threshold τ
(experimentally set to 0.01) are discarded. The link probability of a phrase p
is calculated as shown in Equation 1.
      </p>
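      <p>The ngram filtering step above can be sketched as follows; this is a minimal illustration, not our implementation. The function and variable names are hypothetical, the anchor and corpus counts are assumed to be precomputed, and the link probability follows Equation 1.</p>
      <p>
```python
# Sketch of ngram generation and link-probability filtering.
# anchor_counts: times a phrase is used as anchor text in Wikipedia, na(p, W)
# tweet_counts:  times a phrase occurs in the tweet corpus, n(p, T)
def link_probability(phrase, anchor_counts, tweet_counts):
    n_text = tweet_counts.get(phrase, 0)
    if n_text == 0:
        return 0.0
    return anchor_counts.get(phrase, 0) / n_text

def candidate_ngrams(tokens, max_len, anchor_counts, tweet_counts, tau=0.01):
    # Keep every ngram whose link probability reaches the threshold tau.
    kept = []
    for size in range(1, max_len + 1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[i:i + size])
            if link_probability(phrase, anchor_counts, tweet_counts) >= tau:
                kept.append(phrase)
    return kept
```
      </p>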
      <p>
        LProb(p) = na(p, W) / n(p, T)    (1)
where na(p, W) is the number of times the phrase p is used as anchor text
in Wikipedia W, and n(p, T) is the number of times the phrase occurs as text
in a corpus T of around one million tweets. Each concept associated with a
phrase gets the same link probability LProb(p).
– For each ngram, a set of Wikipedia article titles is obtained by
lexical match. The Wikipedia article titles mapped to the longest matching
ngrams are then treated as candidate entities for disambiguation. For each
ngram that matches the title of a disambiguation page in Wikipedia, all
the articles related to that ngram are added.
– The candidate entities are then passed to a Syntax analyser, which uses
YAGO’s type relation to extract the WordNet synsets mapped to the
candidate entities. With the synsets mapped to the candidate entities, plus
all the synsets of verbs and common nouns in the tweet, as vertices,
a syntax graph is generated using WordNet. The idea behind creating the
syntax graph is to identify the candidate entities that are supported by the
syntax of the text. Since this must be accompanied by disambiguation of the
words in the text, we found the approach proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to be appropriate.
In order to identify the candidate entities supported by the syntax of the
tweet, we modify [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by adding the WordNet words mapped to the candidate entities to the
syntax graph being generated. If a candidate entity is supported by the
syntax of the tweet, the WordNet words mapped to it get connected to the
correct senses of the words added from the tweet in the syntax graph. A
portion of the syntax graph generated for a tweet is shown in Figure 1.
– The PageRank algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is then applied to the syntax graph, setting high
prior probabilities for the synsets of common nouns and verbs added from the
tweet. The average score of all the synsets mapped to a candidate entity
is treated as its syntax score.
– With the candidate entities as vertices, a semantic graph is created. The
similarity between each pair of candidate entities is calculated, and an edge
with the similarity score as its weight is added if the score is greater than
an experimentally set threshold. This makes the most related candidate entities
      </p>
      <p>[Figure 1: portion of the syntax graph generated for a tweet, with WordNet synset vertices such as social unit, sitcom, comedy, organization, company, drama, service, activity and assistance, and a vertex/edge legend.]</p>
      <p>
        connected in the resulting semantic graph, which may result in many
connected components. An example of such a semantic graph is shown in
Figure 2.
– The weighted PageRank algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is then applied to the semantic graph, and the resulting scores assigned
to the candidate entities are treated as the final scores for ranking. The
prior for each candidate entity is set as a linear combination of the
following scores:
• Syntax score of the entity, as calculated by the Syntax analyser. This
score represents the context information in the tweet.
• Link probability of the ngram from which the candidate entity is
generated.
• Anchor probability of the candidate entity, i.e., the number of times
the entity is used as an anchor in Wikipedia. Both the link probability
and the anchor probability represent the popularity of the candidate entity,
which plays a significant role in disambiguating the candidate entities
when little or no context information is available in the tweet.
– Entity classification: Each ngram that has a candidate entity in the
semantic graph is considered a named entity. For each ngram, the candidate
entity with the highest PageRank score in the semantic graph is given to
a named entity classifier, which uses the keywords present in the infobox
of the candidate entity’s Wikipedia page to classify it as person,
location, organization or miscellaneous. We extracted the unique keywords with
the maximum occurrence pertaining to each entity type provided in the training
data to classify the named entities.
      </p>
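      <p>The final scoring and classification steps can be illustrated with the following sketch. The combination weights and the infobox keyword lists are placeholders: in our approach the keywords are extracted automatically from the training data, and the weights were set experimentally.</p>
      <p>
```python
# Sketch of the PageRank prior and the infobox-based classifier.
def entity_prior(syntax_score, link_prob, anchor_prob,
                 w_syntax=0.5, w_link=0.3, w_anchor=0.2):
    # Linear combination of the three scores described above.
    # The weights here are illustrative, not the tuned values.
    return w_syntax * syntax_score + w_link * link_prob + w_anchor * anchor_prob

# Illustrative keyword sets per entity type; the paper derives these
# from infoboxes of the entity types given in the training data.
INFOBOX_KEYWORDS = {
    "person": {"birth_date", "occupation", "spouse"},
    "location": {"coordinates", "population", "area_km2"},
    "organization": {"founded", "headquarters", "industry"},
}

def classify_entity(infobox_keys):
    # Pick the type whose keyword set overlaps the infobox most;
    # fall back to miscellaneous when nothing matches.
    best_type, best_overlap = "miscellaneous", 0
    for etype, keywords in INFOBOX_KEYWORDS.items():
        overlap = len(keywords.intersection(infobox_keys))
        if overlap > best_overlap:
            best_type, best_overlap = etype, overlap
    return best_type
```
      </p>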
    </sec>
    <sec id="sec-3">
      <title>Error analysis and Discussion</title>
      <p>– We use an automated and scalable approach to collect keywords from the
infoboxes of Wikipedia pages to identify different entity types. Though it
is able to classify a significant number of entities correctly, it fails in
cases where the articles do not contain an infobox.</p>
      <p>[Figure 2: example semantic graph with candidate entities The Artist (film), The Artist (magazine), Meryl Streep, OSCAR / Academy Award and The Oscar (film).]</p>
      <p>– Since not all entities are present in Wikipedia, we use a post-processing
step in which we merge entities of the same type that occur adjacently in
the tweet. Further post-processing could merge adjacently located entities
that are not of the same type and assign the most generic type to the
result; this is not done.</p>
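      <p>The post-processing merge of adjacent same-type entities can be sketched as follows. The span representation (token start, exclusive token end, surface text, entity type) is a hypothetical choice made for illustration.</p>
      <p>
```python
def merge_adjacent(entities):
    # entities: list of (start, end, text, etype) spans sorted by position,
    # with `end` an exclusive token index. Adjacent spans of the same
    # type are merged into a single entity.
    merged = []
    for span in entities:
        if merged:
            prev = merged[-1]
            if prev[3] == span[3] and prev[1] == span[0]:
                merged[-1] = (prev[0], span[1], prev[2] + " " + span[2], prev[3])
                continue
        merged.append(span)
    return merged
```
      </p>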
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Meij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Adding semantics to microblog posts</article-title>
          .
          <source>In WSDM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tarau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Figa</surname>
          </string-name>
          .
          <article-title>Pagerank on semantic networks, with application to word sense disambiguation</article-title>
          .
          <source>In COLING</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          .
          <article-title>Graph connectivity measures for unsupervised word sense disambiguation</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <fpage>1683</fpage>
          -
          <lpage>1688</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>W.</given-names>
            <surname>Xing</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          .
          <article-title>Weighted pagerank algorithm</article-title>
          .
          <source>In CNSR</source>
          , pages
          <fpage>305</fpage>
          -
          <lpage>314</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>