Vira@FIRE 2015: Entity Extraction from Social Media Text Indian Languages (ESM-IL)

Vira Bagiya, Charotar University of Science & Technology, Changa, Gujarat, India. virabagiya11@gmail.com
Anjana Patel, Charotar University of Science & Technology, Changa, Gujarat, India. 14pgce002@charusat.edu.in
Amit Ganatra, Charotar University of Science & Technology, Changa, Gujarat, India. amitganatra.ce@charusat.ac.in


ABSTRACT
In this paper we identify and extract named entities from social media text using conditional random fields (CRF) [3]. The paper presents our working methodology and results on the Entity Extraction from Social Media Text Indian Languages task of FIRE-2015. We extracted named entities from two languages, Hindi and English. The named entity extraction system is implemented with CRFsuite [8], a popular implementation of conditional random fields. The task is treated as sequential labelling to achieve the desired tagging output. Conditional random fields are a class of statistical modelling methods often applied in pattern recognition, machine learning and many natural language processing tasks. We obtain F1-scores of 19.82 and 3.72 for the Hindi and English text respectively.

Keywords
Machine learning; Named Entity Extraction; Named Entity Recognition.

1. INTRODUCTION
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Social media is a vast source of information from which a great deal of important data can be extracted as required. According to the Eighth Schedule of its Constitution, India has 22 official languages. NER in Indian languages is still a budding research topic in NLP, and much work remains to be done. Many NER taggers exist for English and Hindi; this paper proposes a CRF-based NER tagger built with CRFsuite (Okazaki, 2007) [8]. CRFsuite is an implementation of CRF and is faster than CRF++ [7]. It is open-source software that automatically extracts features from the training data.

The paper is organized as follows. Section 2 gives an overview of the task description, the approaches applied to NER and a complete description of our system. Section 3 describes the issues in developing such a system for different Indian languages. Section 4 presents the test results and discusses how accuracy can be increased. Finally, Section 5 concludes the paper.

1.1 Task Description
“Entity extraction from social media text in Indian Languages” is a task in which we are given a set of tweets. Our work is to annotate the named entities in these tweets and classify them into named entity tags such as Person, Organization, Location and Entertainment. The training dataset has three columns, tweet_id, user_id and tweet_text, and its processed annotated dataset has tweet_id, user_id, named entity tag (NE tag), named entity, index and length. The same has to be produced for the testing dataset provided. Our main task is to identify the named entities in the testing dataset and apply the appropriate tag to each.

1.2 System Architecture
Our named entity recognition system is developed to classify and tag named entities into 22 different classes such as Person, Location, Organization and Entertainment. We are provided with a training dataset, which is mainly used for the learning process.
The 22 unique named entity tags are listed in Table 1.

Table 1: Unique named entity tags
 1. PERSON
 2. ORGANIZATION
 3. LOCATION
 4. ENTERTAINMENT
 5. DAY
 6. MATERIALS
 7. PLANTS
 8. PERIOD
 9. LOCOMOTIVE
10. YEAR
11. MONEY
12. COUNT
13. FESTIVAL
14. DATE
15. QUANTITY
16. FACILITIES
17. DISEASE
18. ARTIFACT
19. MONTH
20. TIME
21. LIVTHINGS
22. SDAY

Figure 1: System Architecture

We use supervised learning, as we are given a training dataset. We used this training dataset to train our system for tagging named entities and kept the tags in space-separated files. CRFsuite generates a model from the training data provided. The system later uses this model to generate the output (named entity tags). The training dataset is the primary focus of our training. Figure 1 is a flow chart showing the basic flow of our system in detail.

We implemented our NER system with CRFsuite, in which features for labelling entities can be easily extracted from the provided training datasets. Because it is open-source software, we can easily add our own features by modifying a few lines of code. Features can be generated for unigrams as well as bigrams.

2. APPROACHES FOR NER
There are basically two approaches employed in Named Entity Recognition:
a. Rule Based Approach
b. Machine Learning Based Approach
The rule based approach relies on handcrafted or automatically generated rules or patterns. Machine learning techniques are used for statistical modeling, with an unsupervised, semi-supervised or supervised mode of learning. Unsupervised and semi-supervised learning are used when annotated training data is scarce, but the best performance is obtained with supervised learning, which requires a large amount of good quality annotated corpus.

We have used the machine learning based approach, also known as the automated or statistical approach. The machine learning based approach is used more frequently, and more efficiently, than the rule based approach.

We developed a system to perform NER in English and Hindi and submitted it. We use the open-source software CRFsuite [8], one of the popular implementations of Conditional Random Fields (CRF) [3], to train a model on the training dataset and then use the model to generate tags for the test dataset.

2.1 Extracted features from learning
The following features are used for both English and Hindi. CRFsuite automatically extracts some features during learning; the other features below were added to tag the different named entities.
• Gazetteer (a list look-up table): A gazetteer of location names was created and applied to identify different locations in India. In the same way, plant names, festival names, entertainment, locomotives and living things (LIVTHINGS) are tagged using separate gazetteers as features.
• Suffixes: In Hindi, a person name can be identified by the suffix “जी”: a word ending with “ji” can be a person name, and this can be specified as a feature. For example, “Modiji” is a person name. Likewise, a word followed by “ko” can be marked as a person name. For example, in “Gita ko haridwar jana hain”, ‘Gita’ is a person name followed by ‘ko’.
• Prefixes: Many prefixes can be used to identify named entities. For example, a word preceded by Mr., Miss or Mrs. would be a person name.
• Word Context: A context window of size four is used, which takes the two words before and the two words after the current word as features. This helps model the language structure, that is, how, where and with which words entities are used in a sentence. There are seven feature values in total for word context: the word itself, the two words before it, the two words after it, and the pairings of the word with its previous and its next word.
• POS tag: The part-of-speech (POS) tag of a word is also used as a feature, because all the entities are nouns.
• Regular Expressions: We used different regular expressions to identify temporal named entities such as Date, Month, Year, Period, Day and Time.

2.2 Pre-processing
Social media text is noisy in nature. People use shorthand and ungrammatical text to save time, so capitalization is not applied properly and spellings are often incorrect. Such data is hard to handle for information extraction. First, we removed all the links present in the tweets of the given testing dataset. Then tokenization, part-of-speech tagging and chunking were performed. For English we used the Stanford Part-Of-Speech tagger [5]; for Hindi we used RDRPOSTagger [6].

Input for CRFsuite (NER Tagger):
As the input to CRFsuite consists of four space-separated fields, we combined the output of the tokenizer and the POS tagger into one space-separated file for training. The same process was then applied to the testing dataset.
For example: PERSON Gitika NNP B-NP
Each tweet is preprocessed according to the requirements of CRFsuite, which needs a file in which each line holds a single word and its NER tag separated by white space; a blank line marks the end of a sentence. Two processed files were created, one with BIO tags, which mark multiword entities (for English), and another without them (for Hindi).

2.3 Post-processing
The output of the NER tagger is only the NE tag. We therefore combined it with its named entity, tweet_id and user_id, and then found the length of the named entity and its index, meaning the position of the named entity.
The result was arranged in the given format, for example:
Tweet ID:623472520352636928 User Id:241166752 NETAG:PERSON NE:Ali Index:104 Length:3

3. ANALYSIS
Over the past decade, Indian language content on media such as websites, blogs, email and chats has increased significantly. With the advent of smartphones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations and governments. Content growth is thus driven by people from non-metros and small cities, who are mostly more comfortable in their mother tongue than in English. Indian language content is still only a fraction of the English content, but it is expected to grow by more than 70% every year.

Hence there is a great need to process this huge volume of data automatically. Companies in particular are interested in ascertaining the public view of their products and processes. This requires natural language processing systems that identify entities and the associations or relations between them. Hence an automatic entity extraction system is required.

Named Entity Recognition (NER) is one of the most important information extraction techniques being developed in the NLP and IR communities. Considerable success has been achieved in English with extraction of multiple entities as per the domain of interest. However, the area poses considerable challenges in other languages, particularly Indian languages; for example, Indian languages have no capitalization.

A lot of research is going on in NER for Indian languages, for example at the workshops NERSSEAL-2008 and SANLP 2010 and 2011, but there is a lack of benchmark data for comparing existing systems, and no common evaluation method exists to judge researchers' work.

4. RESULTS AND DISCUSSION

4.1 Evaluation metrics
Two standard measures, Precision (P) and Recall (R), are used to evaluate the Named Entity (NE) tagger. Precision is the number of entities correctly identified over the number of entities identified, and recall is the number of entities correctly identified over the actual number of entities. The F measure, the weighted harmonic mean of precision and recall, is calculated as

$$F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

When β = 1, the F measure is called the F1 measure or simply the F1 score.

4.2 Test results
Table 2: Test results
Languages    Precision    Recall    F1-Score
Hindi        25.65        16.14     19.82
English       4.13         3.39      3.72

5. CONCLUSION
CRF models are appropriate for the highly inflective Indian languages and perform better than other systems such as HMM and MEMM (Vijay Sundar Ram R, 2011). CRFsuite generates a model from the training data and provides output (NE tags) according to that model. However, an NER system learned with CRF takes more time to train. The part-of-speech tag is an important feature for identifying the named entity chunk, so an incorrect part-of-speech tag for a token may reduce the accuracy of the NER system. Achieving a high-performing NER system requires more study and a deeper understanding of linguistic features. Various permutations and combinations of feature sets can be used and tested to obtain higher recall and eventually higher F1-scores.

6. FUTURE WORK
For both English and Hindi we will try to get more accurate results in identifying and tagging named entities. To that end we will optimize our feature sets.

7. ACKNOWLEDGEMENT
We would like to thank Prof. Prasenjit Majumder for his guidance throughout the work. Additionally, we would like to thank FIRE 2015 for providing a great opportunity to work on this task and for facilitating the support.
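As an illustration of the steps described in Sections 2.1 to 2.3 and of the F measure in Section 4.1, the following Python sketch shows one way they could be written. It is not the code of the submitted system. The function names, the URL pattern, the "<PAD>" placeholder for positions outside the sentence, and the reading of "index" as a 0-based character offset of the first occurrence are all assumptions made for this example.

```python
import re

# Assumed pattern for links in tweets; the paper does not specify one.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_links(tweet):
    """Drop URLs from a tweet and collapse leftover whitespace (Section 2.2)."""
    return " ".join(URL_PATTERN.sub(" ", tweet).split())


def word_context_features(tokens, i):
    """Seven word-context features for tokens[i] (Section 2.1).

    Covers the word itself, the two words before it, the two words after
    it, and the word paired with its previous and its next word.
    """
    def at(k):
        # Positions outside the sentence get a hypothetical padding token.
        return tokens[k] if 0 <= k < len(tokens) else "<PAD>"

    return {
        "w[0]": at(i),
        "w[-2]": at(i - 2),
        "w[-1]": at(i - 1),
        "w[+1]": at(i + 1),
        "w[+2]": at(i + 2),
        "w[-1]|w[0]": at(i - 1) + "|" + at(i),
        "w[0]|w[+1]": at(i) + "|" + at(i + 1),
    }


def entity_span(tweet_text, entity):
    """Index and length of an entity's first occurrence (Section 2.3).

    The index is assumed to be a 0-based character offset; it is -1 when
    the entity does not occur in the text.
    """
    return tweet_text.find(entity), len(entity)


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (Section 4.1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```

Under these assumptions, f_measure(25.65, 16.14) evaluates to about 19.81 and f_measure(4.13, 3.39) to about 3.72, which agrees with the reported F1-scores up to rounding of the published precision and recall values.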
8. REFERENCES
[1] Asif Ekbal, R.H.: Language Independent Named Entity Recognition in Indian Languages. In: IJCNLP, pp. 33–40 (2008)
[2] David Nadeau, S.S.: A survey of named entity recognition and classification. National Research Council Canada / New York University (n.d.)
[3] J. Lafferty, A. McCallum, and F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)
[4] Vipul Garg, Nikit Saraf, and Prasenjit Majumder: Named Entity Recognition for Gujarati: A CRF Based Approach
[5] Stanford POS tagger, http://www-nlp.stanford.edu/software/tagger.shtml
[6] RDRPOSTagger, http://rdrpostagger.sourceforge.net/
[7] Taku Kudo: CRF++: Yet another CRF toolkit. Software available at http://crfpp.sourceforge.net (2005)
[8] N. Okazaki: CRFsuite: A fast implementation of Conditional Random Fields (CRFs) (2007), retrieved from http://www.chokkan.org/software/crfsuite/