=Paper=
{{Paper
|id=Vol-1587/T4-5
|storemode=property
|title=AMRITA_CEN-NLP@FIRE 2015:CRF Based Named Entity Extractor For Twitter Microposts
|pdfUrl=https://ceur-ws.org/Vol-1587/T4-5.pdf
|volume=Vol-1587
|authors=Sanjay S P,Anand Kumar M,Soman KP
|dblpUrl=https://dblp.org/rec/conf/fire/PMK15
}}
==AMRITA_CEN-NLP@FIRE 2015:CRF Based Named Entity Extractor For Twitter Microposts==
<pdf width="1500px">https://ceur-ws.org/Vol-1587/T4-5.pdf</pdf>
<pre>
     AMRITA_CEN-NLP@FIRE 2015:CRF based Named Entity
              Extraction for Twitter Microposts
                 Sanjay S.P                Anand Kumar M                    Soman KP
          Centre for Excellence in      Centre for Excellence in      Centre for Excellence in
       Computational Engineering and Computational Engineering and Computational Engineering and
        Networking, Amrita Vishwa     Networking, Amrita Vishwa     Networking, Amrita Vishwa
               Vidyapeetham                 Vidyapeetham                   Vidyapeetham
        Ettimadai, Coimbatore. India  Ettimadai, Coimbatore. India  Ettimadai, Coimbatore. India
         sanjay.poongs@gmail.com     m_anandkumar@cb.amrita.edu        kp_soman@amrita.edu


1 ABSTRACT
This work implements Named Entity Recognition (NER) for four
languages: English, Tamil, Malayalam, and Hindi. The results were
submitted to the research evaluation workshop Forum for Information
Retrieval and Evaluation (FIRE 2015). The single-label tagging problem
is first divided into a multi-label one in a pre-processing step, which
assigns one of three levels of named entity tags, referred to as the BIO
format. A Conditional Random Field (CRF) is trained on data in this
format to build the NER system, and the results obtained are grouped
back into single-label (single-tagged) form, a step referred to as
format conversion. For FIRE 2015, we developed English, Tamil,
Malayalam, and Hindi NER systems using CRF. FIRE estimated the average
precision for all four languages.

CCS Concepts
• Theory of computation~Conditional random field
• Computing methodologies~Natural language processing
• Information systems~Information extraction
• Human-centered computing~Social tagging systems

Keywords
Named Entity Recognition (NER); Natural Language Processing (NLP);
Conditional Random Fields (CRF); Entity Extraction from Social Media
Text - Indian Languages (ESM-IL).

2 INTRODUCTION
Named-entity recognition (NER) is also known as entity chunking, entity
identification, and entity extraction. It is an information extraction
task that seeks to locate and classify elements in text documents into
pre-defined categories such as the names of persons, organizations,
locations, expressions of time, quantities, monetary values,
percentages, etc.

Tweets are general user data that users write to communicate with
others. Tweets contain all kinds of named entities, such as person,
organization, location, money, date, time, etc. Entity recognition here
is somewhat harder than ordinary entity extraction because the data is
typed by users, has no predefined format, and may contain many short
forms and mixed data. NER is used in the IT sector, in tweet and
conversation monitoring, etc. The given files are converted into BIO
format for training. No preprocessing was done and no data was modified
in any of the languages. After the BIO format conversion, the needed
features are extracted along with the POS-tagged words, and these are
given to CRF++ for training and testing. The remainder of this paper is
organized as follows: Section 3 gives the task description, Section 4
the system overview, Section 5 the conclusion, and Section 6 the
acknowledgement.

3 TASK DESCRIPTION
The task provides two types of data. The first file contains the
Twitter data, and the second is the annotation file, which holds the
tag, index, and length of each entity in the Twitter data. The given
data is first converted into BIO format, extra features are added to
it, and it is then trained using CRF. This work is based on Conditional
Random Fields (CRF), which are used to develop the NER system following
a machine learning approach. The CRF tool is customizable: feature sets
can be redefined and fast training is possible. The converted
BIO-format data is used to train the CRF, and output results are
generated. The BIO format looks like this:

                     Table 1
Data                 BIO-tag
@aajtak Mr.BAssi     O
…                    O
…                    O
Delhi                B-ORGANIZATION
Govt.may             I-ORGANIZATION
involve              O

The given data looks like this:
623056949634994177    1945618028    @aajtak Mr.BAssi    ….    Delhi Govt.may    involve

The first two numbers are the Twitter id and the user id, which are
mapped to an annotation file in the following format:
Twitter id:623056949634994177    Userid:1945618028    NETAG:ORGANIZATION    NE:Delhi Govt.may    Index:105    Length:14
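
The conversion from annotation records to BIO tags described in the task description could look roughly like the sketch below. It is not the authors' code. It assumes that Index is a 0-based character offset into the tweet text and that tokens are split on whitespace; the paper states neither. The function name `to_bio` is hypothetical, and the example tweet is a shortened string built from the tokens of Table 1, so its offset differs from the Index:105 of the real annotation.

```python
import re

def to_bio(text, annotations):
    """Return (token, BIO-tag) pairs for one tweet.

    annotations: list of (ne_tag, index, length) tuples, where index is
    ASSUMED to be a 0-based character offset into `text`.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        start, end = match.span()
        label = "O"
        for ne_tag, index, length in annotations:
            # The token overlaps the entity span [index, index + length).
            if start < index + length and end > index:
                # The first token of the entity gets B-, later ones get I-.
                label = ("B-" if start <= index else "I-") + ne_tag
                break
        rows.append((match.group(), label))
    return rows

# Shortened, hypothetical tweet; the elided part of the real one is unknown.
tweet = "@aajtak Mr.BAssi Delhi Govt.may involve"
entity_start = tweet.index("Delhi Govt.may")
for token, label in to_bio(tweet, [("ORGANIZATION", entity_start, 14)]):
    print(f"{token}\t{label}")
```

Working from character offsets rather than string matching keeps the tagging unambiguous when the same word occurs more than once in a tweet.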

3.1 TRAINING DATA SET
The task provides two types of data, for four languages: English,
Tamil, Hindi, and Malayalam. The data provided is converted into BIO
format and then trained using CRF. The number of sentences in the
training and testing data is given in the table below.

                        Table 2
Language      English    Tamil    Malayalam    Hindi
Train Data    5941       6000     8426         7983
Test Data     9595       8222     4121         10752

The data for the four languages is taken and then trained. The training
set includes feature files with unigram features and bigram features.
The languages for which these features are obtained are listed in
Table 5.

4 SYSTEM OVERVIEW
The training data with extracted features is given to CRF++. The
template file specifies how the features are extracted. CRF++ trains on
the data according to the template file and produces the model files.
There are two model files, for which the template files are altered for
unigram and bigram features respectively. The extracted feature file,
along with the NE-tagged data, is then trained using CRF++.

                        Figure 1

The extracted model files are then used for testing.

4.1 CRF MODEL BASED NER
In this task CRF++ is used for training and testing. The extracted
features are trained using CRF. The template file contains all the
information needed to extract the features. Each sentence is separated
by an empty line. CRF++ generates two model files: the first uses only
unigram features and the second uses unigram and bigram features. The
languages for which unigram and bigram features were used are listed in
Table 5. In the example data flow diagram (4.1.1), the words w1, w2,
w3, w4 are given to the feature extraction unit, where all the binary
features and the POS tag, cluster, length, and position features are
extracted and added to the BIO-format file. This file is then given to
CRF++ along with the CRF model file produced in training. CRF++ returns
the tagged output in BIO format. The format conversion block converts
the file back to the annotation format for evaluation.

                        Figure 2

4.2 Features
Context words:
The previous word and the next word are considered for training.

POS tag:
The training and testing data are POS tagged with tagger tools. A
Twitter POS tagger does not exist for languages other than English, so
we used the standard POS taggers except for Tamil. The Twitter POS
tagger by Gimpel et al. [7] is used for English. Malayalam POS tags are
retrieved from the Malayalam POS tagger. The NLTK Hindi POS tagger is
used for tagging the Hindi tweets. The POS-tagged data plays an
important role, as it improves accuracy.

Prefix and suffix:
The prefix and suffix features capture the leading and trailing letters
of a word. The counts 2, 3, and 4 are the numbers of letters checked at
the start and at the end of the word. These features are added for all
four languages.

Clusters:
Clusters are used only for English, for which Brown clusters are used.
There is no cluster tool for the Indian languages, so this feature is
not used for the other languages.
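
The word-level features of Section 4.2 and the binary features of Table 4 might be computed along the lines of the sketch below. It is an illustration under stated assumptions, not the authors' implementation. The function name `word_features`, the padding values (`__BOS__`, `__EOS__`, `__NIL__`), the column order, and the reading of "capitalization" as "first letter is upper case" are all assumptions. POS tags and Brown clusters are left out because they come from external tools. Section 4.2 gives affix lengths 2, 3, 4 while Table 3 gives 3, 4, 5, so the lengths are a parameter.

```python
def word_features(tokens, i, affix_lengths=(2, 3, 4)):
    """Build one feature row (a list of strings) for tokens[i]."""
    word = tokens[i]
    row = [word]
    # Context words: previous and next token (padding values are assumptions).
    row.append(tokens[i - 1] if i > 0 else "__BOS__")
    row.append(tokens[i + 1] if i + 1 < len(tokens) else "__EOS__")
    # Prefixes, then suffixes, of the requested lengths.
    for n in affix_lengths:
        row.append(word[:n] if len(word) >= n else "__NIL__")
    for n in affix_lengths:
        row.append(word[-n:] if len(word) >= n else "__NIL__")
    # Word length and position of the word in the tweet.
    row.append(str(len(word)))
    row.append(str(i))
    # Binary features: "1" if the property holds, otherwise "0".
    binary = [
        any(ch.isdigit() for ch in word),  # contains number
        word[:1].isupper(),                # capitalization (English only)
        "." in word,                       # contains dot
        word.endswith(","),                # ends with comma
        word.endswith("!"),                # ends with !
        word.endswith("?"),                # ends with ?
        "#" in word,                       # contains #
        "'s" in word or "’s" in word,      # contains 's (English only)
    ]
    row.extend("1" if flag else "0" for flag in binary)
    return row

tokens = ["@aajtak", "Mr.BAssi", "Delhi", "Govt.may", "involve"]
for i in range(len(tokens)):
    # One space-separated row per token, in the column style CRF++ reads.
    print(" ".join(word_features(tokens, i)))
```

For a real CRF++ run, the BIO tag would be appended as the last column of each row and an empty line written after each tweet.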


Linguistic features: the features extracted for the four languages are given below.

                                   Table 3
Features                                                     English   Hindi   Tamil   Malayalam
Context words: the previous and the next word
POS tag: the part-of-speech tag for the current word
Prefix and suffix: prefixes and suffixes of length 3, 4, 5
Clusters: using Brown clusters                                         X       X       X
Shape feature                                                X         X               X
Length: the word length as a feature
Position: the position of the word as a feature
(X marks a feature that is not used for that language.)

 The binary features for the languages are given below.

                                   Table 4
 Binary features      English   Hindi   Tamil   Malayalam
 Contains number
 Capitalization                 X       X       X
 Contains dot
 Ends with comma
 Ends with !
 Ends with ?
 Contains #
 Contains 's                    X       X       X
 (X marks a feature that is not used for that language.)


 The extracted features are combined with the BIO file and then tested.

 Binary features:
 The binary features take the value 1 or 0. A feature is 1 if the
 corresponding character (one of . , ! ? #) is present, and 0 otherwise.
 For English, capitalization and 's are also taken as binary features.

 4.3 SYSTEM EVALUATION
 An approximate match metric is used to evaluate partial correctness of
 the named entities. The right boundary should match, and the named
 entity tag should be the same as the gold standard tag. Tags that are
 perfectly matched are given a weight of 1, and partially matched tags
 are given a weight of 0.5. Among 10 named entities identified by the
 system, if 4 are perfectly identified and 5 are partially identified,
 then approximate match = ((4∗1) + (5∗0.5))/10 = 0.65.

 4.4 Runs
 In this task we submitted two runs.

                                Table 5
 Language     Unigram features only (Run 1)   Unigram and bigram features (Run 2)
 Hindi                                        X
 English
 Tamil
 Malayalam                                    X
 (X marks a run that was not submitted.)

 Run 1: The Hindi, English, Tamil, and Malayalam data with only unigram
 features are trained with CRF and tested.
 Run 2: The English and Tamil files with unigram and bigram features are
 trained and tested.
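
The approximate match score of Section 4.3 is a weighted count divided by the number of entities the system identified. A minimal sketch that reproduces the worked example follows; the function name and argument names are hypothetical.

```python
def approximate_match(n_exact, n_partial, n_identified):
    """Weighted share of identified entities: exact = 1, partial = 0.5."""
    if n_identified <= 0:
        raise ValueError("n_identified must be positive")
    return (n_exact * 1.0 + n_partial * 0.5) / n_identified

# Worked example from Section 4.3: 4 exact and 5 partial out of 10.
print(approximate_match(4, 5, 10))
```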


4.5 Results
(P = precision, R = recall, F = F-measure, all in %)

Language   Hindi                  Tamil                  Malayalam              English
Runs       P      R      F        P      R      F        P      R      F        P      R      F
Run 1      74.65  5.26   9.83     70.11  19.81  30.89    60.05  39.94  47.97    46.78  24.90  32.50
Run 2      -      -      -        54.87  18.91  28.13    -      -      -        46.88  25.64  33.15
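
The F values in the results table are consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The paper does not state the formula, so that definition is an assumption here. A quick check against the reported numbers:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F) copied from the results table.
reported = {
    "Hindi Run 1": (74.65, 5.26, 9.83),
    "Tamil Run 1": (70.11, 19.81, 30.89),
    "Malayalam Run 1": (60.05, 39.94, 47.97),
    "English Run 1": (46.78, 24.90, 32.50),
    "Tamil Run 2": (54.87, 18.91, 28.13),
    "English Run 2": (46.88, 25.64, 33.15),
}
for name, (p, r, f) in reported.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {f:.2f}")
```

The Hindi row shows how the harmonic mean behaves: a precision of 74.65 with a recall of 5.26 still gives an F-measure below 10.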


5 CONCLUSION
In this paper we briefly discussed NER for Twitter data. We used CRF++
to tag the data. The extended features were discussed, and the tables
of linguistic features and binary features were briefly explained. The
tagged data has been identified. Since Twitter data is huge, entity
extraction is needed for various purposes.
Future work will add richer features, such as clustering the data for
all the Indian languages. We also need to perform an error analysis so
that we can improve the effectiveness of the system.

6 ACKNOWLEDGEMENT
We would like to thank the Forum for Information Retrieval and
Evaluation (FIRE 2015) organizers for organizing a wonderful research
evaluation workshop and giving researchers the opportunity to present
their work on Natural Language Processing (NLP). We would also like to
thank the Computational Linguistics Research Group (CLRG), AU-KBC
Research Centre, for organizing the Entity Extraction from Social Media
Text Indian Languages (ESM-IL) task.

REFERENCES
[1] Abinaya N., Neethu John, M. Anand Kumar, and K. P. Soman (Amrita
University). AMRITA@FIRE-2014: Named Entity Recognition for Indian
Languages. FIRE 2014.

[2] P. Gupta, Kalika Bali, R. E. Banchs, M. Choudhury, and P. Rosso.
Query Expansion for Mixed-Script Information Retrieval. In Proceedings
of the 37th International ACM SIGIR Conference on Research &
Development in Information Retrieval, 2014.

[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence
data. In Proc. of ICML, pp. 282-289, 2001.

[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence
data. In Proc. of ICML, pp. 282-289, 2001.

[5] Karen Stepanyan, George Gkotsis, Vangelis Banos, Alexandra I.
Cristea, and Mike Joy. A hybrid approach for spotting, disambiguating
and annotating places in user-generated text. Proceedings of the 22nd
International Conference on World Wide Web Companion, May 13-17, 2013,
Rio de Janeiro, Brazil.

[6] Tuan Tran, Mihai Georgescu, Xiaofei Zhu, and Nattiya Kanhabua.
Analysing the duration of trending topics in Twitter using Wikipedia.
Proceedings of the 2014 ACM Conference on Web Science, June 23-26,
2014, Bloomington, Indiana, USA.

[7] Kevin Gimpel et al. Part-of-speech tagging for Twitter:
Annotation, features, and experiments. Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human
Language Technologies: Short Papers, Volume 2. Association for
Computational Linguistics, 2011.

[8] Kalika Bali, Yogarshi Vyas, and Monojit Choudhury (Microsoft India
and University of Maryland). POS Tagging of English-Hindi Code-Mixed
Social Media Content. Proceedings of EMNLP 2014, pages 974-979,
October 25-29, 2014.



</pre>