<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Extraction from Social Media Text - Indian Languages (ESM-IL)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandip Modha</string-name>
          <email>sjmodha@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chintak Mandalia LDRP Institute of Technology &amp; Research Center</institution>
          ,
          <addr-line>Gandhinagar, Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LDRP Institute of Technology &amp; Research Center</institution>
          ,
          <addr-line>Gandhinagar, Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Manthan Raval LDRP Institute of Technology &amp; Research Center</institution>
          ,
          <addr-line>Gandhinagar, Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Memon Mohammed Rahil LDRP Institute of Technology &amp; Research Center</institution>
          ,
          <addr-line>Gandhinagar, Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>100</fpage>
      <lpage>102</lpage>
      <abstract>
        <p>This paper presents an implementation of named entity recognition (NER), an application of natural language processing that is regarded as a subtask of information extraction. NER is the process of detecting named entities (NEs) in a document and categorizing them into named entity classes such as organization, person, location, sport, river, city, country, quantity, etc. Much work on NER has been done for English, but comparable success has not yet been achieved for the Indian languages. The paper discusses NER, the various approaches to NER, performance metrics, the challenges of NER in the Indian languages, and finally some results achieved by performing NER in Hindi by combining approaches such as rule-based methods and CRFsuite, with RDRPOSTagger and GENIA tagger used for tagging. It describes the working methodology and its results on named entity extraction from the social media text of FIRE 2015.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Theory of computation~Support vector machines
• Computing methodologies~Natural language processing
• Information systems~Information extraction • Human-centered
computing~Social tagging systems</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        Social media is a vast source of information from which
important data can be extracted for a specific requirement. This
paper presents a technique for named entity recognition on
English and Hindi text. Our main task is to extract named
entities from social media tweets in Indian languages (Hindi and
English) and classify them into named entity tags such as person,
location, etc.; around 22 classes are to be tagged. We used the
machine learning algorithm CRF (Conditional Random Field)[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
to identify named entities in the corpus. The CRF algorithm is
implemented using the CRFsuite[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] tool. CRFsuite[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is an
implementation of Conditional Random Fields for labeling
sequential data that provides fast training and tagging,
linear-chain CRFs, etc.
      </p>
      <p>
        Supervised learning is used with the training dataset. We used this
training dataset to train our system for tagging named entities.
CRFsuite[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] generates a model from the annotated training data
provided.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. CONDITIONAL RANDOM FIELDS (CRFs)</title>
      <p>
        A Conditional Random Field (CRF) is a type of discriminative
probabilistic model used for labeling sequential data such as
natural language text. CRFs are a class of statistical modeling
methods applied in pattern recognition and machine learning. They are
undirected graphical models, a special case of which corresponds
to conditionally trained finite state machines. In the special case
in which the output nodes of the graphical model are linked by
edges in a linear chain, CRFs[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] make a first-order Markov
assumption and can be viewed as conditionally trained probabilistic
finite automata. A CRF model consists of F=&lt;f1,…,fk&gt;, a vector of
feature functions, and θ = &lt;θ1,…,θk&gt;, a vector of weights, one for each
feature function. Let O=&lt;o1,…,ot&gt; be an observed sentence; the model
scores each candidate label sequence for O using these weighted
feature functions.
      </p>
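      <p>For reference, the linear-chain case described above is commonly written as the conditional probability below. This is the standard textbook form, given as a sketch and not taken from this paper; the label sequence S = &lt;s1,…,st&gt; and the normalizer Z_O are notation assumed here, while F, θ and O are as defined above.</p>
      <preformat>
```latex
P(S \mid O) = \frac{1}{Z_O} \exp\left( \sum_{i=1}^{t} \sum_{j=1}^{k} \theta_j \, f_j(s_{i-1}, s_i, O, i) \right),
\qquad
Z_O = \sum_{S'} \exp\left( \sum_{i=1}^{t} \sum_{j=1}^{k} \theta_j \, f_j(s'_{i-1}, s'_i, O, i) \right)
```
      </preformat>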
    </sec>
    <sec id="sec-4">
      <title>3. METHODOLOGY</title>
      <p>There are two different methods for identifying named entities in
a given text. The first uses handcrafted or automatically
generated rules for NER. The second uses
machine learning techniques for modeling. Several
machine learning techniques are available, namely supervised learning,
semi-supervised learning, and unsupervised learning.
Supervised learning gives the best performance but requires a large
amount of good-quality annotated data. Unsupervised and
semi-supervised learning are used when annotated training
data are scarce.</p>
      <p>We used a machine learning based approach to perform the NER
task on the given data because it is more efficient than the rule-based
approach and is more frequently used.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Pre Processing</title>
      <p>
        The task requires predicting named entities in social
media text, so the first step is to tag each word of a sentence.
We therefore split every sentence into words; for example, "The brown cat"
yields 'The', 'brown', 'cat'. This is done for both English and Hindi.
The next step is to assign a part-of-speech (POS)[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tag to each word; we used the RDR POS Tagger
for both languages, which identifies nouns, verbs, adverbs, etc. in
the given text. We used GENIA tagger for chunking in English.
GENIA tagger tags each word with the relevant IOB chunk tag. For
example:
      </p>
      <p>“The brown cat” gets the chunk tags The: B-NP,
brown: I-NP, cat: I-NP.</p>
      <p>NER-tagged training data were provided by
FIRE 2015. For training, we prepared a file in which each row holds the NER tag, the word,
its POS tag, and its chunk tag.</p>
      <p>For example:
Location India NNP B-NP</p>
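      <p>The preparation of such rows can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the whitespace tokenizer, the tab separator, and the second sample token are assumptions made for the example.</p>
      <preformat>
```python
def tokenize(sentence):
    """Split a sentence into words on whitespace (a simplifying assumption)."""
    return sentence.split()


def to_training_rows(tokens):
    """Format (word, pos, chunk, ner) tuples as one row per token.

    The NER tag comes first, followed by the word, POS tag and chunk tag,
    mirroring the order of the example row in the text.
    """
    rows = []
    for word, pos, chunk, ner in tokens:
        rows.append("\t".join([ner, word, pos, chunk]))
    return rows


print(tokenize("The brown cat"))

# Hypothetical annotated tokens; the tags on "wins" are illustrative only.
annotated = [
    ("India", "NNP", "B-NP", "Location"),
    ("wins", "VBZ", "B-VP", "O"),
]
for row in to_training_rows(annotated):
    print(row)
```
      </preformat>
      <p>One row per token, with the label in the first column, keeps the file easy to inspect by hand before it is passed to the CRF tool.</p>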
    </sec>
    <sec id="sec-6">
      <title>3.2 Training</title>
      <p>
        We used the open-source tool CRFsuite[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is one of
the popular implementations of CRFs (Conditional Random Fields),
both for training and for tagging the test data. CRFsuite[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
internally generates features from the attributes in a data set. In
general, this is the most important step in machine-learning
approaches because feature design greatly affects labeling
accuracy.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Testing</title>
      <p>
        The untagged test data are supplied together with their POS tags[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and chunk tags. POS tagging and chunk tagging are done with the help
of the RDR POS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tagger and GENIA tagger. The untagged
test data, with their POS and chunk tags, are then given as input to our
model to obtain the test results.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Feature Set</title>
      <p>
        The feature set used for the CRF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] based NER system
includes the prefix or suffix of a word, the length of the word, capitalization,
POS tag, chunk tag, etc. We created two different models, one for
Hindi and one for English, using different feature sets.
      </p>
      <p>The features considered are: hyphen (-), colon (:), apostrophe ('), backslash, two-digit number, four-digit number, all uppercase, all digits, currency marker ($ or Rs), POS tag NNP or QC, and gazetteers.</p>
      <p>[Feature-usage table: of its four columns, two mark all eleven features "Yes" and two mark six features "Yes".]</p>
      <p>We also included additional cue-word features for Hindi, such as जी and बजे,
in CRFsuite training.</p>
      <p>For example:
मोदी जी का मिशन है ("It is Modi ji's mission")
कार्यवाही 12 बजे तक स्थगित ("Proceedings adjourned until 12 o'clock")
Feature words of this kind are used in training the model.</p>
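      <p>The token-level features named in this section can be sketched as a small extraction function. This is only an illustration under stated assumptions: the function name, the feature keys, and the gazetteer entries are hypothetical, and the exact definitions used in the submitted runs are not given in the text.</p>
      <preformat>
```python
import re

# Hindi cue words mentioned in the text.
HINDI_CUES = {"जी", "बजे"}
# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = {"india", "gujarat"}


def word_features(word, pos):
    """Return a dict of binary features for one token and its POS tag."""
    return {
        "has_hyphen": "-" in word,
        "has_colon": ":" in word,
        "has_apostrophe": "'" in word,
        "has_backslash": "\\" in word,
        "two_digit": re.fullmatch(r"[0-9]{2}", word) is not None,
        "four_digit": re.fullmatch(r"[0-9]{4}", word) is not None,
        "all_upper": word.isupper(),
        "all_digit": word.isdigit(),
        "currency": word.startswith("$") or word.startswith("Rs"),
        "pos_nnp_or_qc": pos in ("NNP", "QC"),
        "in_gazetteer": word.lower() in GAZETTEER,
        "hindi_cue": word in HINDI_CUES,
    }


features = word_features("2015", "QC")
print(sorted(name for name, value in features.items() if value))
```
      </preformat>
      <p>Each active feature name can then be written as an attribute on the token's row in the training file.</p>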
    </sec>
    <sec id="sec-9">
      <title>3.5 Post Processing</title>
      <p>
        CRFsuite [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] gives only the NE tag as output, so we combined each output tag
with its named entity. We then prepared the output in the format of the
training file by adding the relevant information: tweet_id, user_id,
index, and length of the word. For example:
      </p>
      <p>Tweet ID:618698235092152320 User ID:2922444438 NETAG:LOCATION NE:india Index:122 Length:5</p>
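      <p>A minimal sketch of this post-processing step is shown below. The function name and the order of the fields are assumptions for illustration; the field labels and sample values come from the example above.</p>
      <preformat>
```python
def format_entity(tweet_id, user_id, ne_tag, entity, index):
    """Join a predicted NE tag with its entity string and tweet metadata."""
    fields = [
        ("Tweet ID", tweet_id),
        ("User ID", user_id),
        ("NETAG", ne_tag),
        ("NE", entity),
        ("Index", index),
        ("Length", len(entity)),  # length of the entity string in characters
    ]
    return " ".join(f"{name}:{value}" for name, value in fields)


print(format_entity("618698235092152320", "2922444438", "LOCATION", "india", 122))
```
      </preformat>
      <p>The length is derived from the entity string, so it stays consistent with the NE field.</p>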
    </sec>
    <sec id="sec-10">
      <title>4. RESULTS</title>
    </sec>
    <sec id="sec-11">
      <title>4.1 Evaluation</title>
      <p>Two standard measures are used to evaluate an NE
tagger. (I) Precision (P) is the number of entities
correctly identified divided by the number of entities identified. (II)
Recall (R) is the number of entities correctly identified
divided by the actual number of entities. Both precision and recall are
therefore based on an understanding and measure of relevance.
The F-measure, the harmonic mean of precision and recall, is also
calculated.</p>
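      <table-wrap>
        <table>
          <thead>
            <tr><th>Language</th><th>Precision (P)</th><th>Recall (R)</th><th>F1-Score</th></tr>
          </thead>
          <tbody>
            <tr><td>Hin run-1</td><td>67.11</td><td>0.76</td><td>1.51</td></tr>
            <tr><td>Hin run-2</td><td>74.73</td><td>46.84</td><td>57.59</td></tr>
            <tr><td>Eng run-1</td><td>7.30</td><td>4.17</td><td>5.31</td></tr>
            <tr><td>Eng run-2</td><td>5.35</td><td>5.67</td><td>5.50</td></tr>
          </tbody>
        </table>
      </table-wrap>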
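      <p>The measures defined in this section can be computed as in the following sketch. The function names and the count-based interface are assumptions for illustration; the final line checks the harmonic mean against the Hindi run-2 scores reported above.</p>
      <preformat>
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(correct, predicted, gold):
    """Compute precision, recall and F1 (as percentages) from entity counts.

    correct:   entities identified correctly
    predicted: entities identified by the tagger
    gold:      actual entities in the annotated data
    """
    precision = 100.0 * correct / predicted if predicted else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hindi run-2: P = 74.73, R = 46.84
print(round(f_measure(74.73, 46.84), 2))  # 57.59
```
      </preformat>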
    </sec>
    <sec id="sec-12">
      <title>5. CONCLUSION</title>
      <p>
        Conditional random fields (CRFs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are better suited to Indian languages
than other models such as HMM and MEMM, although NER models learned using
CRFs take more time to train. Because part-of-speech (POS) and
chunk tags are used in training, incorrect tagging also reduces the
accuracy of the recognized named entities. Achieving high
performance and accuracy in an NER system requires more study and a deeper
understanding of linguistic features.
      </p>
    </sec>
    <sec id="sec-13">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>We thank Mr. Sandip Modha and the other faculty members of the college for
their helpful input. This work is part of ESM-IL (Entity Extraction
from Social Media Text - Indian Languages).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          :
          <article-title>Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>RDRPOSTagger</source>
          http://rdrpostagger.sourceforge.net/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Ritter</surname>
          </string-name>
          , Sam Clark, Mausam and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named Entity Recognition in Tweets</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Naoaki</given-names>
            <surname>Okazaki</surname>
          </string-name>
          .
          <article-title>CRFsuite: Implementation of Conditional Random Fields (CRFs)</article-title>
          . http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          Dr.
          <string-name>
            <given-names>Rakesh Ch.</given-names>
            <surname>Balabantaray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suprava</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kshirabdhi Tanaya</given-names>
            <surname>Mishra</surname>
          </string-name>
          , IIIT, BBSR
          (
          <year>2013</year>
          )
          <article-title>: CRF++ based approach</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yassine</given-names>
            <surname>Benajiba</surname>
          </string-name>
          and Paolo Rosso:
          <article-title>Arabic name entity recognition using conditional Random Fields</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Genia tagger http://www.nactem.ac.uk/GENIA/tagger/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <article-title>CRF++: Yet Another CRF toolkit</article-title>. A simple, customizable, and open source implementation of Conditional Random Fields (CRFs)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>