<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AMRITA_CEN-NLP@FIRE 2015: CRF based Named Entity Extraction for Twitter Microposts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanjay S.P</string-name>
          <email>sanjay.poongs@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>m_anandkumar@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <email>kp_soman@amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham</institution>
          ,
          <addr-line>Ettimadai, Coimbatore,</addr-line>
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>96</fpage>
      <lpage>99</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 ABSTRACT</title>
      <p>The proposed method implements Named Entity Recognition
(NER) for four languages: English, Tamil, Malayalam, and
Hindi. The results obtained from this work were submitted to the
research evaluation workshop Forum for Information Retrieval and
Evaluation (FIRE 2015). The task is a single-layered problem that is
divided into a multi-layered one in a step called pre-processing, which
produces three levels of named entity tags referred to as the BIO format.
Data in this format is used to train a Conditional Random Field (CRF), on
which the NER system is built. The results obtained are then grouped
back into single-label (single-tagged) form, a step referred to as format
conversion. For FIRE 2015, we developed English, Tamil, Malayalam, and Hindi
NER systems using CRF. The FIRE organizers estimated the average precision
for all four languages.</p>
    </sec>
    <sec id="sec-2">
      <title>CCS Concepts</title>
      <p>• Theory of computation~Conditional random field
• Computing methodologies~Natural language processing
• Information systems~Information extraction
• Human-centered computing~Social tagging systems</p>
      <p>Keywords: Named Entity Recognition (NER); Natural Language Processing
(NLP); Conditional Random Fields (CRF); Entity Extraction from
Social Media Text - Indian Languages (ESM-IL).</p>
    </sec>
    <sec id="sec-3">
      <title>2 INTRODUCTION</title>
      <p>Named-entity recognition (NER) is also known as entity chunking, entity
identification, and entity extraction. It is an information extraction
task that seeks to locate and classify elements in text documents
into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc.</p>
      <p>Tweets are general user data that users write to
communicate with others. Tweets contain all kinds of named entities,
such as person, organization, location, money, date, time, etc. Entity
recognition on tweets is somewhat harder than ordinary entity extraction
because user-typed data has no fixed format and may contain many
short forms and mixed data. NER is used in IT sectors, tweet
and conversation monitoring, etc. The given files are converted into
BIO format for training. No other preprocessing was done and no
data was modified in any of the languages. After the BIO format conversion,
the needed features are extracted along with the POS-tagged
words, and the result is given to CRF++ for training and testing.
The remainder of this paper is organized as follows: Section 3 gives the task
description, Section 4 the system overview, Section 5 the conclusion, and
Section 6 the acknowledgement.</p>
    </sec>
    <sec id="sec-4">
      <title>3 TASK DESCRIPTION</title>
      <p>The task provided to us comes with two types of data files. The
first file contains the Twitter data, and the second is
the annotation file, which holds the tag, index,
and length for the entities in the Twitter data. The given data is first
preprocessed into BIO format, extra features are then added to it,
and it is finally trained using CRF. This work is based on Conditional
Random Fields (CRF), which are used for developing the NER system
following a machine learning approach. The CRF tool is customizable:
feature sets can be redefined and fast training is possible. The
converted BIO format is used for training the CRF, and output
results are generated. The BIO format was like:</p>
      <table-wrap>
        <label>Table 1</label>
        <caption><p>BIO-tag</p></caption>
        <table>
          <thead>
            <tr><th>Data</th></tr>
          </thead>
          <tbody>
            <tr><td>…</td></tr>
            <tr><td>Govt.may</td></tr>
            <tr><td>involve</td></tr>
            <tr><td>Mr.BAssi</td></tr>
            <tr><td>….</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The given data are like:
623056949634994177</p>
      <p>The first two numbers are the Twitter id and the user id, which are mapped
to an annotation file in a format like:</p>
      <p>Twitter id:623056949634994177
NETAG:ORGANIZATION
Index:105 Length:14</p>
      <p>The task provides data for four languages: English, Tamil, Hindi, and
Malayalam. The data provided is converted into BIO format and
then trained using the CRF. The number of sentences in the
training and testing data is given in the table below.</p>
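      <sec>
        <p>The conversion from the annotation fields (NETAG, Index, Length) to BIO tags can be sketched as follows. This is only an illustrative sketch, not the script used in this work: it assumes that Index is a character offset into the tweet text, that Length is a character count, and that tokens are separated by whitespace. The function name to_bio and the example sentence are hypothetical.</p>
        <preformat>
```python
def to_bio(text, annotations):
    """Tag whitespace-separated tokens of `text` in BIO format.

    annotations: list of (tag, index, length) tuples. `index` is ASSUMED
    to be a character offset into `text` and `length` a character count.
    """
    rows = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        label = "O"
        for tag, index, length in annotations:
            # Skip entities whose character span does not overlap the token.
            if start >= index + length or index >= end:
                continue
            # The token holding the entity start gets B-, later tokens get I-.
            label = ("B-" if index >= start else "I-") + tag
            break
        rows.append((token, label))
    return rows


# Hypothetical example sentence (not taken from the task data).
for token, label in to_bio("Govt may involve Delhi Police chief",
                           [("ORGANIZATION", 17, 12)]):
    print(token, label)
```
        </preformat>
        <p>Each printed line holds a token and its BIO tag, which is the column layout CRF++ expects, with an empty line between sentences.</p>
      </sec>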
    </sec>
    <sec id="sec-5">
      <p>[Table: number of training and testing sentences per language; columns: Language, Train Data, Test Data]</p>
      <p>The data for the four languages are taken and then trained. The training set
includes feature files with unigram features and bigram features. The
languages for which these features are obtained are given below.</p>
    </sec>
    <sec id="sec-6">
      <title>4 SYSTEM OVERVIEW</title>
      <p>The training data with the extracted features are then given to
CRF++. The template file specifies how the features are
extracted. CRF++ trains on the data according to the template file and
produces the model files. There are two model files, obtained by altering
the template file for unigram and bigram features. The
extracted feature file, together with the NE-tagged data, is then trained
using CRF++.</p>
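      <p>As an illustration of how the two template variants differ, the sketch below generates CRF++ template text. The exact templates used in this work are not reproduced here; the helper build_template, the column indices, and the window size are assumptions. In CRF++ templates, lines starting with U define unigram features through the %x[row,col] macro, and a line containing only B adds bigram features over the previous and current output tags.</p>
      <preformat>
```python
def build_template(feature_columns, window=1, use_bigram=False):
    """Return CRF++ template text (hypothetical helper).

    feature_columns: input column indices (0 is the word itself).
    window: number of rows before and after the current token to use.
    use_bigram: if True, append the 'B' line so that CRF++ also models
        the (previous tag, current tag) pair.
    """
    lines = ["# Unigram"]
    feature_id = 0
    for col in feature_columns:
        for offset in range(-window, window + 1):
            lines.append(f"U{feature_id:02d}:%x[{offset},{col}]")
            feature_id += 1
    if use_bigram:
        lines += ["", "# Bigram", "B"]
    return "\n".join(lines) + "\n"


# Column 0 = word, column 1 = POS tag (an assumed layout).
unigram_template = build_template([0, 1])
bigram_template = build_template([0, 1], use_bigram=True)
print(bigram_template)
```
      </preformat>
      <p>The generated text can be saved as the template file passed to crf_learn (crf_learn template_file train_file model_file); tagging is then done with crf_test -m model_file test_file.</p>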
    </sec>
    <sec id="sec-7">
      <title>4.1 CRF MODEL BASED NER</title>
      <p>In this task CRF++ is used for training and testing. The extracted
features are trained using CRF. The template file contains all
the information needed to extract the features. Each sentence is separated by
an empty line. CRF++ generates two model files: the first
model file uses only unigram features and the second model file uses
both unigram and bigram features. In the example data flow diagram (4.1.1), the words
w1, w2, w3, w4 are given to the feature extraction unit, where all the
binary features and the POS tag, cluster, length, and position
features are extracted and added to the BIO format file. This file is
then given to CRF++ along with the CRF model file, which holds
the trained model. CRF++ returns the tagged output
file in BIO format. The format conversion block converts the
file back to the annotation format for the evaluation.</p>
      <p>Linguistic features: the extracted features for the four languages are given below.</p>
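      <p>The format conversion step can be sketched as below. This is an assumption-laden illustration rather than the converter used in this work: the function name bio_to_annotations is hypothetical, each row is assumed to carry the character offset of its token, and a stray I- tag without a preceding B- tag is treated as the start of a new entity.</p>
      <preformat>
```python
def bio_to_annotations(rows):
    """Collapse (token, start_offset, label) rows into entity records.

    Each record is (tag, index, length), mirroring the NETAG, Index and
    Length fields of the annotation file.
    """
    entities = []
    current = None  # [tag, start, end] of the entity being built
    for token, start, label in rows:
        end = start + len(token)
        if label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = end  # extend the open entity
            continue
        if current:
            # Any other label closes the open entity.
            entities.append((current[0], current[1], current[2] - current[1]))
            current = None
        if label.startswith("B-") or label.startswith("I-"):
            current = [label[2:], start, end]  # open a new entity
    if current:
        entities.append((current[0], current[1], current[2] - current[1]))
    return entities


# Hypothetical tagged output (token, character offset, BIO label).
tagged = [("Govt", 0, "O"), ("Delhi", 17, "B-ORGANIZATION"),
          ("Police", 23, "I-ORGANIZATION"), ("chief", 30, "O")]
print(bio_to_annotations(tagged))
```
      </preformat>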
    </sec>
    <sec id="sec-8">
      <title>4.2 Features</title>
      <p>Context words:
The previous word and the next word are considered for training the
data.</p>
      <p>POS tag:
The training and testing data are POS tagged with tagger tools.
A Twitter POS tagger does not exist for languages other than English,
so we used the standard POS taggers, except for Tamil. The Twitter
POS tagger by Gimpel [<xref ref-type="bibr" rid="ref3">7</xref>] is used for English. Malayalam
POS tags are retrieved from the Malayalam POS tagger. The NLTK
Hindi POS tagger is used for tagging the Hindi tweets. The POS-tagged
data plays an important role, as it improves the accuracy.</p>
      <p>Prefix and suffix:
The prefix and suffix features check the leading and trailing letters of the word.
The number of letters checked at the beginning and at the end is 2, 3, and 4,
and these features are added for all four languages.</p>
      <p>Clusters:
Clusters are used only for English, for which Brown
clusters are used. There is no cluster tool for the Indian
languages, so this feature is not used for the other languages.</p>
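      <p>A minimal sketch of the context and prefix/suffix features is given below. It is illustrative only: the function names, the BOS/EOS boundary markers, and the "_" placeholder for words that are too short are assumptions, and the affix lengths are a parameter so that other length settings can be tried.</p>
      <preformat>
```python
def affixes(word, lengths=(2, 3, 4)):
    """Prefix and suffix of each length; '_' when the word is too short."""
    feats = []
    for n in lengths:
        long_enough = len(word) >= n
        feats.append(word[:n] if long_enough else "_")
        feats.append(word[-n:] if long_enough else "_")
    return feats


def token_rows(words, pos_tags):
    """One row per token: word, POS tag, previous word, next word, affixes."""
    rows = []
    for i, (word, pos) in enumerate(zip(words, pos_tags)):
        prev_word = words[i - 1] if i >= 1 else "BOS"
        next_word = words[i + 1] if len(words) > i + 1 else "EOS"
        rows.append([word, pos, prev_word, next_word] + affixes(word))
    return rows


# Hypothetical two-token example with made-up POS tags.
for row in token_rows(["Delhi", "Police"], ["NNP", "NNP"]):
    print("\t".join(row))
```
      </preformat>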
    </sec>
    <sec id="sec-9">
      <title>Features</title>
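      <list list-type="simple">
        <list-item><p>Context words: the previous and the next word</p></list-item>
        <list-item><p>POS tag: the part-of-speech tag for the current word</p></list-item>
        <list-item><p>Prefix and suffix: prefixes and suffixes of length 3, 4, 5 are taken</p></list-item>
        <list-item><p>Clusters: using Brown clusters</p></list-item>
        <list-item><p>Shape feature</p></list-item>
        <list-item><p>Length: the word length as a feature</p></list-item>
        <list-item><p>Position: the position of the word as a feature</p></list-item>
      </list>
      <p>The binary features for the languages are given below.</p>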
    </sec>
    <sec id="sec-10">
      <title>Binary Features</title>
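      <list list-type="simple">
        <list-item><p>Contains number</p></list-item>
        <list-item><p>Capitalization</p></list-item>
        <list-item><p>Contains dot</p></list-item>
        <list-item><p>Ends with comma</p></list-item>
        <list-item><p>Ends with !</p></list-item>
        <list-item><p>Ends with ?</p></list-item>
        <list-item><p>Contains #</p></list-item>
        <list-item><p>Contains ’s</p></list-item>
      </list>
      <p>The extracted features are combined with the BIO file and then
tested.</p>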
    </sec>
    <sec id="sec-11">
      <title>Binary features:</title>
      <p>The binary features take the value 1 or 0. A feature is
1 if the corresponding character (. , ! ? #) is present in the word. For English,
capitalization and ’s are also taken as binary features.</p>
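      <p>The binary features can be sketched as a small function that maps a word to 0/1 values. This is a hedged illustration: the function name and feature keys are hypothetical, and capitalization is assumed here to mean that the first character is upper case.</p>
      <preformat>
```python
def binary_features(word, english=False):
    """Return a dict of 0/1 features for one word (illustrative sketch)."""
    feats = {
        "contains_number": any(ch.isdigit() for ch in word),
        "contains_dot": "." in word,
        "ends_with_comma": word.endswith(","),
        "ends_with_exclamation": word.endswith("!"),
        "ends_with_question": word.endswith("?"),
        "contains_hash": "#" in word,
    }
    if english:
        # ASSUMPTION: capitalization means an upper-case first character.
        feats["capitalization"] = word[:1].isupper()
        # Accept both the ASCII apostrophe and the typographic one.
        feats["contains_apostrophe_s"] = "'s" in word or "\u2019s" in word
    return {name: int(value) for name, value in feats.items()}


print(binary_features("#FIRE2015!"))
print(binary_features("Delhi's", english=True))
```
      </preformat>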
    </sec>
    <sec id="sec-12">
      <title>4.3 SYSTEM EVALUATION</title>
      <p>An approximate match metric is used for evaluating partial correctness
of the named entities. The right boundary should match, and the named
entity tag should be the same as the gold standard tag. Tags that are
perfectly matched are given a weight of 1 and partially matched
tags are given a weight of 0.5. For example, among 10 named entities
identified by the system, if 4 are perfectly identified and 5 are
partially identified, then approximate match = ((4∗1) + (5∗0.5))/10 =
0.65.</p>
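      <p>The worked example above can be reproduced with a few lines of code. The function name is hypothetical, and returning 0.0 when no entities were identified is an assumption made only to avoid a division by zero.</p>
      <preformat>
```python
def approximate_match(perfect, partial, total):
    """Weighted score: perfect matches count 1, partial matches count 0.5."""
    if total == 0:
        return 0.0  # ASSUMPTION: score is 0 when nothing was identified
    return (perfect * 1.0 + partial * 0.5) / total


# 4 perfect and 5 partial matches among 10 identified entities.
print(approximate_match(perfect=4, partial=5, total=10))
```
      </preformat>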
      <p>[Table: evaluation results by language: Hindi, English, Tamil, Malayalam]</p>
    </sec>
    <sec id="sec-14">
      <title>5 CONCLUSION</title>
      <p>In this paper we briefly discussed NER for Twitter data. We
used CRF++ for tagging the data. The extended features were
discussed, and the tables of linguistic features and binary
features were briefly explained. The tagged data has been
identified. Since Twitter data is huge, entity
extraction is needed for various purposes.</p>
      <p>Future work will be based on adding richer features, such as
clustering the data for all the Indian languages. We also need to perform
an error analysis so that we can improve the effectiveness of the system.</p>
      <p>We would like to thank the Forum for Information Retrieval and
Evaluation (FIRE 2015) organizers for organizing a wonderful
research evaluation workshop and giving opportunities for
researchers to present their work on Natural Language Processing
(NLP). We would also like to thank the Computational Linguistics Research
Group (CLRG), AU-KBC Research Centre, for organizing the Entity
Extraction from Social Media Text Indian Languages (ESM-IL)
task.</p>
      <p>[2] P. Gupta, Kalika Bali, R. E. Banchs, M. Choudhury, P.
Rosso. Query Expansion for Mixed-Script Information
Retrieval. In Proceedings of the 37th international
ACM SIGIR conference on Research &amp; development in
information retrieval, 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Abinaya N., Neethu John, M. Anand Kumar, and K. P. Soman (Amrita University).
          <article-title>AMRITA@FIRE-2014: Named Entity Recognition for Indian Languages</article-title>.
          <source>FIRE 2014</source>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4] J. Lafferty, A. McCallum, and F. Pereira.
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>.
          <source>In Proc. of ICML</source>, pp. <fpage>282</fpage>-<lpage>289</lpage>, <year>2001</year>.
        </mixed-citation>
        <mixed-citation>
          [6] Tuan Tran, Mihai Georgescu, Xiaofei Zhu, Nattiya Kanhabua.
          <article-title>Analysing the duration of trending topics in Twitter using Wikipedia</article-title>.
          <source>Proceedings of the 2014 ACM conference on Web science</source>, June 23-26, <year>2014</year>, Bloomington, Indiana, USA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7] Kevin Gimpel, et al.
          <article-title>Part-of-speech tagging for Twitter: Annotation, features, and experiments</article-title>.
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, Volume 2</source>.
          Association for Computational Linguistics, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8] Kalika Bali, Yogarshi Vyas, Monojit Choudhury (Microsoft India and University of Maryland).
          <article-title>POS Tagging of English-Hindi Code-Mixed Social Media Content</article-title>.
          <source>Proceedings of the 2014 EMNLP</source>, pages <fpage>974</fpage>-<lpage>979</lpage>, October 25-29, <year>2014</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>