<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NER from Tweets: SRI-JU System @MSM 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amitava Das</string-name>
          <email>amitava.santu@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Utsab Burman</string-name>
          <email>utsab.barman.ju@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balamurali A R</string-name>
          <email>balamurali.ar@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaji Bandyopadhyay</string-name>
          <email>sivaji_cse_ju@yahoo.com</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>62</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>Nowadays Twitter has become an interesting source for NLP experiments such as entity extraction and user opinion analysis. Due to the noisy nature of user-generated content, it is hard for standard NLP tools to obtain good results, and named entity extraction from tweets is one such task. Traditional NER approaches do not perform well on tweets. Tweets are usually informal in nature and short (up to 140 characters). They often contain grammatical errors, misspellings, and unreliable capitalization, and these unreliable linguistic features cause traditional methods to perform poorly. This article reports the authors' participation in the Concept Extraction Challenge at Making Sense of Microposts (#MSM2013). Three different system runs have been submitted: the first run is the baseline, the second adds capitalization and syntactic features, and the last adds dictionary features. The last run outperformed all the others, scoring 79.57 (precision), 71.00 (recall), and 74.79 (F-measure).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>f. Users' wordplay in tweets. This includes phonetic spelling and intentional
misspelling for verbal effect, e.g. “that was soooooo great” (“that was so great”).</p>
      <p>g. Censor avoidance. This includes use of numbers or punctuation to disguise
vulgarities, e.g. sh1t, f***, etc.</p>
      <p>h. Presence of emoticons. While often recognized by a human reader, emoticons
are not usually understood in NLP tasks such as Machine Translation and
Information Retrieval. Examples: :) (Smiling face), &lt;3 (heart).</p>
    </sec>
    <sec id="sec-2">
      <title>2 Data</title>
      <p>The work has been done on the #MSM2013 dataset, which was made available as two subsets: training and test. Since no development set was provided, the training data was further divided into two subsets in a 70%-30% ratio. Named entities are considered to be of two types: single-word NEs and multi-word NEs. The division of the available training data was made based on the presence of the four different types of named entities, each in single-word and multi-word form. The statistics of this process are elaborated in Table 1.</p>
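      <p>The 70%-30% division described above can be sketched as follows. This is a minimal illustration: random shuffling with a fixed seed stands in for the NE-type-balanced split that the paper actually performs.</p>
      <preformat>
```python
import random

def split_70_30(examples, seed=13):
    """Shuffle the examples reproducibly and cut them 70%/30%.

    A simplified stand-in for the paper's split, which additionally
    balances the two parts over the four named-entity types.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.7)
    return items[:cut], items[cut:]

train, dev = split_70_30(range(100))
print(len(train), len(dev))  # 70 30
```
      </preformat>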
    </sec>
    <sec id="sec-3">
      <title>3 Experiment</title>
      <p>Three different runs have been submitted. Ours is a CRF-based system, with the features described below. The YamCha toolkit has been used for the CRF implementation.</p>
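      <p>As an illustration of how such features feed a sequence labeler, the sketch below builds per-token feature rows (token, stem, POS tag, capitalization) in the one-token-per-line, label-in-last-column style that YamCha-like toolkits consume. The stemming rule and the POS tags here are simplified stand-ins, not the output of the actual tagger used in the system.</p>
      <preformat>
```python
def token_features(tokens, pos_tags):
    """Build (token, stem, POS, capitalization) feature rows per token."""
    rows = []
    for tok, pos in zip(tokens, pos_tags):
        stem = tok.lower().rstrip("s")  # crude stand-in for a real stemmer
        is_cap = "CAP" if tok[:1].isupper() else "NOCAP"
        rows.append((tok, stem, pos, is_cap))
    return rows

# One row per token, tab-separated, as in YamCha-style training data.
tweet = ["Obama", "visits", "Paris", "today"]
tags = ["NNP", "VBZ", "NNP", "NN"]
for row in token_features(tweet, tags):
    print("\t".join(row))
```
      </preformat>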
      <sec id="sec-3-1">
        <title>3.1 Baseline</title>
        <p>Our baseline system incorporates part-of-speech tags and stemmed tokens to train the baseline classifier. For POS tagging of micro-posts we used the CMU POS tagger tool1, which is specialized for tweets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Capitalization</title>
        <p>Capitalization of tokens is one of the key features for recognizing named entities in micro-posts. It has been used as a binary feature in the classifier.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Predicate Rules</title>
        <p>The position of a named entity in a sentence is generally close to that of function words such as in, of, and near. N-gram rules over these function words have been developed and used to train the classifier.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Out of Vocabulary Words</title>
        <p>Most named entities are not dictionary words. We used the Samsad2 and NICTA dictionary3 in the experiment.</p>
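        <p>A minimal sketch of the resulting binary feature: tokens absent from the dictionary are flagged as out-of-vocabulary, which makes them likelier named-entity candidates. The tiny word set below is an illustrative stand-in for the Samsad and NICTA lexicons.</p>
        <preformat>
```python
# Illustrative dictionary; the real system uses the Samsad and NICTA lexicons.
VOCAB = {"the", "was", "great", "visits", "today", "in", "of"}

def oov_feature(token):
    """Return 'OOV' if the lowercased token is not a dictionary word."""
    return "OOV" if token.lower() not in VOCAB else "INV"

print([oov_feature(t) for t in ["Obama", "visits", "Paris"]])
```
        </preformat>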
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Gazetteers</title>
        <p>For the LOC and MISC types, two separate gazetteer lists have been compiled. The LOC list consists of 220 country names and 100 popular city names. The MISC list has 110 NEs of different types, covering most of the error cases in the development set.</p>
        <p>We have experimented with a series of features. Tweets are extremely noisy, and therefore a concise set of named-entity clues is very hard to finalize. The person and organization categories are relatively easy, but the location and miscellaneous categories are very hard for a classifier.</p>
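        <p>A gazetteer lookup of this kind can be sketched as a longest-match scan over token spans, so that multi-word entries (e.g. “New York”) tag every token they cover. The entries below are illustrative, not the actual lists.</p>
        <preformat>
```python
def gazetteer_tags(tokens, gazetteer, label="LOC"):
    """Tag each token covered by a (longest-match) gazetteer entry."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max(len(entry) for entry in gazetteer)
    i = 0
    while i != len(tokens):
        # Try the longest span first so "new york" beats "new".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in gazetteer:
                for j in range(i, i + n):
                    tags[j] = label
                i += n
                break
        else:
            i += 1
    return tags

LOC = {("india",), ("france",), ("new", "york")}
print(gazetteer_tags(["Flying", "to", "New", "York"], LOC))
# ['O', 'O', 'LOC', 'LOC']
```
        </preformat>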
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Performance</title>
      <p>The performance results on the development set are reported in Table 2. It should be noted that the actual results on the test set are yet to be evaluated by the MSM organizers.</p>
      <p>1 http://www.ark.cs.cmu.edu/TweetNLP/
2 http://dsal.uchicago.edu/dictionaries/biswas-bengali/
3 http://www.csse.unimelb.edu.au/~tim/etc/emnlp2012-lexnorm.tgz</p>
      <sec id="sec-4-1">
        <title>4.1 Runs</title>
        <p>We ran multiple iterations to reach the final accuracy. Broadly, they can be categorized into the five configurations listed below. Among these iterations, the three best runs (1, 3, and 5) were submitted. The features used in each run are as follows, and the scores are elaborated in Table 2.</p>
        <list list-type="order">
          <list-item><p>Baseline: POS + Stem</p></list-item>
          <list-item><p>Run 1 + Capitalization feature</p></list-item>
          <list-item><p>Run 2 + N-gram function-word predicates (in, of, etc.)</p></list-item>
          <list-item><p>Run 3 + OOV feature</p></list-item>
          <list-item><p>Run 4 + Gazetteers (LOC dictionary + MISC dictionary)</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>In this paper we have presented a novel method for the identification and classification of named entities based on the features described above, even though classifying named entities in Twitter data is hard because of its noisy and non-grammatical nature.</p>
      <p>In this article we report our scores on the development set; we will incorporate the official #MSM2013 evaluation scores to support our evaluation framework.</p>
      <p>Of the features used in our experiments, the gazetteer lists are small. We will try to include more entries in the future.</p>
      <p>We have observed that some structural information, for example URLs, mentions, and hashtags, can help improve the results. Our ongoing exploration is to find more viable features that help capture the semantics of micro-posts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Gimpel, Kevin, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. "Part-of-speech tagging for Twitter: Annotation, features, and experiments." Carnegie Mellon University, Pittsburgh, PA, School of Computer Science, 2010.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Ritter, Alan, Sam Clark, and Oren Etzioni. "Named entity recognition in tweets: an experimental study." In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1524-1534. Association for Computational Linguistics, 2011.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Finin, Tim, et al. "Annotating named entities in Twitter data with crowdsourcing." In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Han, Bo, and Timothy Baldwin. "Lexical normalisation of short text messages: Makn sens a #twitter." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 368-378. 2011.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>