HITS@FIRE task 2015: Twitter based Named Entity Recognizer for Indian Languages

Pallavi K P., Srividhya K., Rexiline Ragini John Victor, Ramya M. M.
Hindustan Institute of Technology and Science, Padur, Chennai.
rs.pkp0310@hindustanuniv.ac.in, rs.sk0214@hindustanuniv.ac.in, rs.jrr0913@hindustanuniv.ac.in, mmramya@hindustanuniv.ac.in

ABSTRACT
Natural Language Processing (NLP), in its pure sense, is a platform that provides the ability to transform natural language text into useful information. Named Entity Recognition (NER) is a key NLP task for classifying named entities in natural language text. Though there are several algorithms for named entity classification, identifying named entities in twitter data is a demanding task. Large amounts of information are shared by people on twitter on a daily basis. This information is unstructured and often contains important information about organizations, politics, disasters, promotional advertisements etc. In this paper, we present an NER system that can effectively classify named entities in twitter data for Indian languages such as English, Hindi and Tamil. POS, chunk, suffix and prefix information has been used for training a Conditional Random Fields (CRF) based NER model. CRF is a popular model for labelling and classification in text mining. Performance analysis was done using n-fold validation and F-measure. A maximum precision of 93.82 for English, 92.28 for Hindi and 86.94 for Tamil twitter data was achieved through n-fold validation. The precision reported by the ESM-IL shared task evaluation is 50.48 for English, 81.49 for Hindi and 70.42 for Tamil. The proposed algorithm achieves higher classification accuracy under n-fold validation.

CCS Concepts
• Human centered computing → Human machine interaction → Collaborative and social computing • Applied computing → Document managing and text processing • Computing methodologies → Artificial Intelligence and Machine Learning.

Keywords
Natural Language Processing, Named Entity Recognition, Conditional Random Fields, n-fold validation, Twitter data, English, Hindi, Tamil.

1. INTRODUCTION
NLP is gaining prominence due to the importance given to social media data from platforms such as twitter and Facebook. Twitter data are predominantly extracted and monitored by public and private organizations for analysing various trends in the industry and for opinion mining. For efficient NLP, the corpus should come from the same domain as the NLP application [1]. NLP comprises a set of computational linguistic tools for interaction between computers and natural languages. NER is one such tool: it identifies and classifies names in a given corpus. NER is a backbone for several NLP applications such as language translation, social media analysis and information mining. NER is particularly important for Indian language twitter data, as there are few surface clues for identifying named entities in them. Recognizing named entities in social media data is quite challenging because the clauses in a sentence are unstructured and the entities are more diverse in nature. There are a variety of approaches for recognizing named entities. Supervised approaches include Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models (MEnt), Support Vector Machines (SVM) and CRF. Here, for the ESM-IL task, CRF has been used to develop NER for English, Hindi and Tamil twitter data [2]. Several research works have been carried out on Named Entity Recognition [3-5], and NER is now gaining popularity for its diverse applications in the real world [6, 7].

2. PROBLEM DEFINITION
Twitter is one of the popular social networking sites where people share their opinions through their tweets. Each tweet or post can contain a maximum of 140 characters, including smilies, hash tags, other symbols and website links. We attempted to develop a twitter based NER for English, Hindi and Tamil data. Twitter training data were provided by ESM-IL for this task. The training data included raw data files and annotated named entities. Before pre-processing, the annotated named entities were mapped onto the corresponding tweets in the raw data file in order to label the named entities. The group of labels is called the tag set. A total of 22 tags were identified from the training data, including person, location, organization, date, quantity and money. Pre-processing, which consisted of tokenization and removal of noisy data, was done before applying the methodology.
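The mapping of annotated named entities onto raw tweets can be illustrated with a short sketch. The snippet below is an illustration only: it assumes each annotation supplies the entity string and its tag for a given tweet, and it uses the B-/I- labelling convention described later in Section 3; the actual ESM-IL annotation layout may use character offsets or a different column format.

```python
# Minimal sketch of the labelling step in Section 2: mapping annotated named
# entities onto a raw tweet to obtain token-level labels. The annotation
# format assumed here (entity string plus tag per tweet) is hypothetical.

def label_tweet(tweet_text, annotations):
    """Return (token, label) pairs for one tweet.

    annotations: list of (entity_string, tag) pairs for this tweet, e.g.
    [("Karnataka Vikas Grameena Bank", "ORGANIZATION")].
    """
    tokens = tweet_text.split()        # simple whitespace tokenization
    labels = ["O"] * len(tokens)       # default: token is not a named entity
    for entity, tag in annotations:
        entity_tokens = entity.split()
        span = len(entity_tokens)
        for i in range(len(tokens) - span + 1):
            if tokens[i:i + span] == entity_tokens:
                labels[i] = "B-" + tag             # first word of the entity
                for j in range(i + 1, i + span):
                    labels[j] = "I-" + tag         # remaining words
                break
    return list(zip(tokens, labels))


if __name__ == "__main__":
    pairs = label_tweet(
        "As outrage builds, Karnataka Vikas Grameena Bank looks at rescuing farmers",
        [("Karnataka Vikas Grameena Bank", "ORGANIZATION")],
    )
    print(pairs)
```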
3. METHODOLOGY USED
CRF, a probabilistic approach, is used for developing the NER classifier. CRF is a popular approach for effectively classifying named entities because it takes into account the neighbouring samples, i.e. the context information of the sentence. The disadvantage with twitter data, however, is the lack of such context information. Tweets in the raw data file were tokenized and pre-processed; noisy data such as website links, hash tags and smilies were removed. Initially, only the neighbouring samples were considered in developing the NER model. Later, to include lexical information, POS tags, chunk tags and 1st, 2nd and 3rd character suffixes and prefixes were also added as features. The methodology of the proposed algorithm is shown in Figure 1.

[Figure 1: Methodology of the NER. Pipeline: Input (raw data and annotated data) → Pre-processing and tokenization → POS tagger → Chunker → Suffix-Prefix extractor → CRF (training data split n-fold; test data) → generated NER tagger → Indexing and presentation of data.]

The input considered for NER was the raw and annotated data provided by the ESM-IL shared task. To avoid misclassification of named entities, labelling was done by assigning the first word of an entity the B-tag (Beginning of the tag) and the remaining words of the same entity the I-tag (Inside the tag). With the help of the annotated data, the named entities in the raw data were mapped. For pre-processing, rules were framed to identify the noisy data; for example, detection and removal of website links was performed by checking for the presence of tokens such as http:// and .com.

For English parts of speech tagging and chunking, the pattern.en module was used [8]. It is a python based POS tagger cum chunker for twitter data. Training with a twitter based POS tagger provides better accuracy than a normal tagger, so the pattern POS tagger was preferred. Due to the unavailability of a twitter based POS tagger for Hindi, the tagger provided by the Society for Natural Language Technology Research was used for Hindi parts of speech tagging [9]. It is a CRF based open source software. Apart from the POS and chunk taggers, other features like suffixes and prefixes were also used. The list of features used is provided in Table 1.

Table 1: Features Description

Features                          English   Hindi   Tamil
POS tags                          Yes       Yes     No
Chunk tags                        Yes       No      No
1st, 2nd, 3rd character suffix    Yes       Yes     Yes
1st, 2nd, 3rd character prefix    Yes       Yes     Yes

For n-fold cross validation, the annotated data was randomly partitioned into five subsets. Validation was performed over n rounds of testing, providing one subset of the annotated data for testing each time and using the remaining n-1 subsets for training [10]. Each fold contained an equal number of tokens. For the NER task, CRF++, an open source package, was used [11]. Conditional Random Fields is a probabilistic framework for segmenting and labelling sequence data. From the literature it is understood that CRF performs better than models such as the Hidden Markov Model and the Maximum Entropy Markov Model, since it models the conditional probability of the whole label sequence given the observation sequence [1]. Finally, indexing was done and the data was presented in the prescribed ESM-IL format.
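As an illustration of the feature set in Table 1, the sketch below prepares training data in the column format expected by CRF++. The column order, the "NA" placeholder and the helper names are assumptions made for this sketch; only the general CRF++ file layout (one token per line, whitespace-separated feature columns with the label last, and a blank line between sequences) and the crf_learn/crf_test commands come from the CRF++ package [11].

```python
# A minimal sketch of writing CRF++ training data with the Table 1 features.
# Column order and the "NA" placeholder are assumptions of this sketch.

def char_affixes(token, max_len=3):
    """1st, 2nd and 3rd character prefixes and suffixes of a token."""
    prefixes = [token[:i] if len(token) >= i else "NA" for i in range(1, max_len + 1)]
    suffixes = [token[-i:] if len(token) >= i else "NA" for i in range(1, max_len + 1)]
    return prefixes + suffixes


def write_crfpp_file(tweets, path):
    """tweets: list of tweets, each a list of (token, pos, chunk, label)
    tuples; pos/chunk may be "NA" where no tagger is available (e.g. Tamil)."""
    with open(path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            for token, pos, chunk, label in tweet:
                columns = [token, pos, chunk] + char_affixes(token) + [label]
                out.write(" ".join(columns) + "\n")
            out.write("\n")            # blank line terminates one sequence


# Training and tagging would then use the CRF++ command line tools [11]:
#   crf_learn template train.txt model
#   crf_test -m model test.txt > output.txt
```

A matching CRF++ template would refer to these columns by position, for example U01:%x[0,1] for the POS column or U02:%x[-1,0]/%x[0,0] for a token bigram; the specific template used for this system is not given in the paper.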
4. PERFORMANCE METRICS
The performance metrics used for analysis are Precision, Recall and F-measure. Precision and Recall are the most effective and frequently used measures in information retrieval. Precision can be defined as

    Precision = TP / (TP + FP)

Recall can be expressed as

    Recall = TP / (TP + FN)

F-measure is defined as the weighted harmonic mean of precision and recall; with equal weighting,

    F-measure = 2 * Precision * Recall / (Precision + Recall)

where TP (True Positives) is the total number of NEs tagged correctly with their boundaries, FP (False Positives) is the total number of words wrongly tagged by the system that are not tagged manually, and FN (False Negatives) is the total number of words left untagged by the system that are manually tagged [12].

5. RESULTS AND DISCUSSION
In this section, the results obtained with the CRF model are discussed. The twitter data for the experiment was provided by FIRE 2015. Table 2 compares the results obtained using n-fold validation with the results provided by the ESM-IL shared task.

Table 2: Precision, Recall and F-measure results for n-fold validation and the ESM-IL shared task evaluation.

             Precision            Recall               F-measure
Languages    n-fold   ESM-IL      n-fold   ESM-IL      n-fold   ESM-IL
English      93.82    50.21       80.53    37.06       86.66    42.64
Hindi        92.28    81.21       76.23    44.57       83.49    57.55
Tamil        86.94    64.52       73.87    22.14       79.87    32.97

As a part of the initial pre-processing, website links, hash tags and smilies were removed from the raw data. Further, POS tagging and chunking of the English data was done. This was given to the CRF model, a test run was generated and a maximum F-measure of 80.81 was obtained. To improve the accuracy, features like prefixes and suffixes of the NEs were included, and a maximum F-measure of 86.66 was obtained.

For Hindi, after the initial pre-processing, POS tagging was done. An efficient chunker is not available for Hindi data, and therefore chunking was not done. This was supplied as input to the CRF model and a maximum F-measure of 83.49 was obtained. When prefixes and suffixes of NEs were combined with the existing POS tags, an F-measure of 78.47 was obtained, and when only prefixes and suffixes of NEs were considered as features, the model gave an F-measure of 73.12. The classification accuracy was lower for Hindi than for English due to the unavailability of a twitter based POS tagger and chunker.

For Tamil data, since there is no proper POS tagger or chunker available for twitter or generic data, features like suffixes and prefixes of NEs were considered, and the model obtained a maximum F-measure of 79.87. Excluding the suffixes and prefixes and considering only the tokens as a feature, the model gave an F-measure of 62.95. The classification accuracy was reduced for Tamil data, similar to Hindi, due to the unavailability of a twitter based POS tagger and chunker.

Table 3 shows examples of the results obtained from the CRF generated model for English, Hindi and Tamil twitter data. The sample tweets contain the tweet-id, user-id and the tweet text.

Table 3: Sample results obtained for English, Hindi and Tamil twitter data.

Sample Tweets
1. 624133739023446016 917553836 As outrage builds, Karnataka Vikas Grameena Bank looks at rescuing farmers: Much of the farming community's outrage… http://dlvr.it/BcS9Kh
2. 623537594798739456 3016932104 विश्व टी20 फाइनल की मेजबानी करेगा ईडन गार्डन्स http://hindi.webdunia.com/latest-cricket-news/twenty20-world-cup-115072100078_1.html … pic.twitter.com/pqzycmT9b8
3. 621569431932530688 317752766 நெல்லை மாவட்டம் கல்லிடைக்குறிச்சியில் திறக்கப்பட்ட பாலத்தால் பொதுமக்கள் மகிழ்ச்சி: நெல்லை… http://goo.gl/fb/Efqp0N

NEs identified in the tweets
1. Karnataka B-ORGANIZATION
   Vikas I-ORGANIZATION
   Grameena I-ORGANIZATION
   Bank I-ORGANIZATION
2. विश्व B-ENTERTAINMENT
   टी20 I-ENTERTAINMENT
   ईडन B-LOCATION
   गार्डन्स I-LOCATION
3. நெல்லை B-LOCATION
   மாவட்டம் I-LOCATION
   கல்லிடைக்குறிச்சியில் B-LOCATION
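The scores in Table 2 follow the precision, recall and F-measure definitions given in Section 4. The sketch below shows how they could be computed from gold and predicted entity sets under an exact boundary and tag match; it is an illustration of those definitions, not the official ESM-IL scorer.

```python
# A small sketch of the evaluation described in Section 4, assuming named
# entities are compared as exact (tweet_id, start, end, tag) matches.

def precision_recall_f(gold_entities, predicted_entities):
    """Both arguments are sets of (tweet_id, start, end, tag) tuples."""
    tp = len(gold_entities & predicted_entities)   # NEs tagged correctly with boundaries
    fp = len(predicted_entities - gold_entities)   # NEs wrongly tagged by the system
    fn = len(gold_entities - predicted_entities)   # manually tagged NEs missed by the system

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = {("t1", 3, 6, "ORGANIZATION"), ("t2", 0, 1, "LOCATION")}
    predicted = {("t1", 3, 6, "ORGANIZATION"), ("t1", 8, 9, "PERSON")}
    print(precision_recall_f(gold, predicted))   # (0.5, 0.5, 0.5)
```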
6. CONCLUSION AND FUTURE WORK
Named Entity Recognizers for English, Hindi and Tamil twitter data were developed. A CRF based model was generated by POS tagging, chunking and applying other feature information to the given data. To test the accuracy of the CRF model, n-fold validation was done. The n-fold experiment on the training data gave a maximum precision of 93.82 for English, 92.28 for Hindi and 86.94 for Tamil twitter data. The ESM-IL evaluation resulted in competitive precisions of 50.21 for English, 81.49 for Hindi and 70.42 for Tamil. The F-measure was reduced because of the drop in recall; this drop in recall can be addressed by providing pre-processing rules based on lexical resources.

7. ACKNOWLEDGEMENTS
The authors thank Hindustan Institute of Technology & Science for their continuous support.

8. REFERENCES
1. Pallavi, K. P. and Anitha S. Pillai. 2015. Kannpos - Kannada Parts of Speech Tagger Using Conditional Random Fields. In Emerging Research in Computing, Information, Communication and Applications. Springer India, (Aug 2015), 479-491.
2. Lafferty, John, Andrew McCallum and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), (Jun 2001), 282-289.
3. Malarkodi, C. S., Pattabhi R. K. Rao and Sobha Lalitha Devi. 2012. Tamil NER - Coping with Real Time Challenges. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages, (Dec 2012), 23-38.
4. Biswas, S., et al. 2010. A Two Stage Language Independent Named Entity Recognition for Indian Languages. International Journal of Computer Science and Information Technologies (IJCSIT), 1(4), 285-289.
5. Saha, Sujan Kumar, et al. 2008. Named entity recognition in Hindi using maximum entropy and transliteration. Research Journal on Computer Science and Computer Engineering with Applications, (Jul 2008), 33-41.
6. Jung, Jason J. 2012. Online named entity recognition method for microtexts in social networking services: A case study of twitter. Expert Systems with Applications, 39(9), (Jan 2012), 8066-8070.
7. Zirikly, Ayah and Mona Diab. 2015. Named entity recognition for Arabic social media. In Proceedings of NAACL-HLT, (Jun 2015), 176-185.
8. www.clips.ua.ac.be/pages/pattern-en
9. nltr.org/snltr-software/
10. Hutton, John J. 2012. Pediatric Biomedical Informatics: Computer Applications in Pediatric Research. Springer Publications.
11. https://taku910.github.io/crfpp/
12. Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.