Entity Extraction from Social Media Text
Indian Languages (ESM-IL)

Chintak Mandalia, Memon Mohammed Rahil, Manthan Raval, Sandip Modha
LDRP Institute of Technology & Research Center, Gandhinagar, Gujarat, India
chintak.soni75@gmail.com, rmemon122@gmail.com, manthanraval249@gmail.com, sjmodha@gmail.com

ABSTRACT
This paper presents an implementation of named entity recognition (NER), which is one of the applications of Natural Language Processing and is regarded as a subtask of information extraction. NER is the process of detecting named entities (NEs) in a document and categorizing them into named entity classes such as organization, person, location, sport, river, city, country, quantity, etc. A great deal of work on NER has been accomplished for English, but so far much less success has been achieved for the Indian languages. This paper discusses NER, the various approaches to NER, performance metrics, the challenges of NER in the Indian languages, and finally some of the results achieved by performing NER in Hindi by combining approaches such as rule-based methods and CRFsuite, with RDRpostagger and geniatagger used for tagging. The paper describes our working methodology and its results on named entity extraction from the social media text of FIRE 2015.

CCS Concepts
• Theory of computation~Support vector machines
• Computing methodologies~Natural language processing
• Information systems~Information extraction
• Human-centered computing~Social tagging systems

Keywords
Entity Extraction; Features; Social Media Text; Machine Learning; Conditional Random Fields (CRFs); Supervised Algorithm

1. INTRODUCTION
Social media is a vast source of information from which a great deal of important data can be extracted for specific requirements. This paper presents a technique for named entity recognition on English and Hindi text. Our main task is to extract named entities from social media tweets in Indian languages (Hindi and English) and to classify them with named entity tags such as people, location, etc.; there are around 22 classes to be tagged. We used the machine learning algorithm CRF (Conditional Random Field) [5] to identify named entities in the corpus. The CRF algorithm is implemented using the CRFsuite [5] tool. CRFsuite [5] is an implementation of Conditional Random Fields for labeling sequential data that provides fast training and tagging, linear-chain CRFs, etc.

Supervised learning is used with the training dataset: we used this dataset to train our system to tag named entities. CRFsuite [5] generates a model from the supervised training data provided.

2. CONDITIONAL RANDOM FIELDS (CRFs)
A Conditional Random Field is a type of discriminative probabilistic model used for labeling sequential data such as natural language text. CRFs are a class of statistical modeling methods applied in pattern recognition and machine learning. CRFs are undirected graphical models, a special case of which corresponds to conditionally trained finite state machines. In the special case in which the output nodes of the graphical model are linked by edges in a linear chain, CRFs [5] make a first-order Markov assumption and can be viewed as conditionally trained probabilistic finite automata. A CRF model consists of F = <f1,…,fk>, a vector of feature functions, and θ = <θ1,…,θk>, a vector of weights, one for each feature function. Let O = <o1,…,ot> be an observed sentence and S = <s1,…,st> a sequence of labels for it. The linear-chain CRF defines the conditional probability of S given O as

$$P(S \mid O) = \frac{\exp\left(\sum_{i=1}^{t} \sum_{j=1}^{k} \theta_j f_j(s_{i-1}, s_i, O, i)\right)}{\sum_{S'} \exp\left(\sum_{i=1}^{t} \sum_{j=1}^{k} \theta_j f_j(s'_{i-1}, s'_i, O, i)\right)}$$

where the sum in the denominator runs over all possible label sequences S'.

3. METHODOLOGY
There are two different methods for identifying named entities from a given text. The first uses handcrafted or automatically generated rules for NER. The second uses machine learning techniques for modeling. Several machine learning techniques are available for modeling, i.e. supervised learning, semi-supervised learning, and unsupervised learning.

Supervised learning gives the best performance, but it requires a large amount of good-quality annotated data. Unsupervised and semi-supervised learning are used when annotated training data is scarce.
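To make the rule-based style concrete, the following is a minimal sketch of a handcrafted-rule tagger. It is an illustration only: the gazetteer entries, the two regular-expression rules, and the tag names MONEY and YEAR are hypothetical assumptions and are not taken from our system.

```python
import re

# Hypothetical gazetteer: a tiny illustrative lookup list, not the one used in our system.
LOCATION_GAZETTEER = {"india", "gujarat", "gandhinagar"}

# Hypothetical handcrafted patterns: a currency amount and a four-digit number.
MONEY_PATTERN = re.compile(r"^(?:\$|Rs\.?)\d+(?:\.\d+)?$")
YEAR_PATTERN = re.compile(r"^\d{4}$")


def rule_based_tag(tokens):
    """Assign one tag per token using only handcrafted rules ("O" means no entity)."""
    tags = []
    for token in tokens:
        if token.lower() in LOCATION_GAZETTEER:
            tags.append("LOCATION")
        elif MONEY_PATTERN.match(token):
            tags.append("MONEY")
        elif YEAR_PATTERN.match(token):
            tags.append("YEAR")
        else:
            tags.append("O")
    return tags


if __name__ == "__main__":
    print(rule_based_tag(["Gandhinagar", "is", "in", "India"]))
```

Every new entity type or spelling variant needs another rule or gazetteer entry in such a tagger, whereas a learned model picks up such patterns from annotated data.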

We used the machine-learning approach to perform NER on the given data because it is more efficient than the rule-based approach and is more widely used.

3.1 Pre Processing
The task requires predicting named entities in social media text, so the first step is to tag each word of a sentence. We therefore split each sentence into words; for example, "The brown cat" becomes 'The', 'brown', 'cat'. This is done for both English and Hindi. The next step is to assign a part-of-speech (POS) tag to each word. We used RDR POS Tagger [2] for both languages; it identifies nouns, verbs, adverbs, and so on in the given text. We used Genia tagger for chunking in English. Genia tagger labels each word with the relevant IOB chunk tag. For example, "The brown cat" receives the chunk tags The: B-NP, brown: I-NP, cat: I-NP.

We were provided with NER-tagged training data by FIRE-2015. For training, we prepared a file in which each line holds the NER tag, the word, its POS tag, and its chunk tag. For example:
Location India NNP B-NP

3.2 Training
We used the open-source tool CRFsuite [5], one of the popular implementations of CRFs (Conditional Random Fields), both to train the model and to tag the test data. CRFsuite [5] internally generates features from the attributes in a data set. In general, this is the most important step in machine-learning approaches because feature design greatly affects labeling accuracy.

3.3 Testing
The untagged test data is first annotated with POS tags and chunk tags, using RDR POS Tagger [2] and Genia tagger. The test data, together with its POS and chunk tags, is then given as input to our model to obtain the test results.

3.4 Feature Set
The feature set used for the CRF-based [5] NER system includes the prefix and suffix of a word, the length of the word, capitalization, POS tag, chunk tag, etc. We created two different models each for Hindi and English, using different feature sets.

Table 1. Feature set usage
Features              Eng model (1)   Eng model (2)   Hin model (1)   Hin model (2)
POS Tag               Yes             Yes             Yes             Yes
Chunk Tag             Yes             Yes             -               -
Prefix & Suffix       Yes             Yes             Yes             Yes
Capitalization        Yes             Yes             -               -
Token Shape           Yes             Yes             -               -
Token Type            Yes             Yes             Yes             Yes
Length                Yes             Yes             Yes             Yes
Dot (.)               Yes             Yes             Yes             Yes
Comma (,)             Yes             Yes             Yes             Yes
Hyphen (-)            Yes             Yes             Yes             Yes
Colon (:)             Yes             -               Yes             -
Apostrophe (')        Yes             -               Yes             -
Backslash             Yes             Yes             Yes             Yes
Two-Digit Number      Yes             Yes             Yes             Yes
Four-Digit Number     Yes             Yes             Yes             Yes
All Uppercase         Yes             Yes             Yes             Yes
All Digit             Yes             Yes             Yes             Yes
$ or Rs               Yes             -               Yes             -
POS Tag (NNP or QC)   Yes             -               Yes             -
Gazetteers            Yes             -               Yes             -

We also included additional Hindi cue words such as जी and बजे as features in CRFsuite training.
For example:
मोदी जी का मिशन है ("Modi ji's mission is")
कार्यवाही 12 बजे तक स्थगित ("Proceedings adjourned until 12 o'clock")
Cue words of this kind are used as features when training the model.

3.5 Post Processing
CRFsuite [5] outputs only the NE tag for each token, so we combined each output tag with its named entity. We then prepared the output in the format of the training file by adding the relevant information: tweet_id, user_id, index, and length of the word. For example:

Tweet ID:618698235092152320 User ID:2922444438 NETAG:LOCATION NE:india Index:122 Length:5

4. RESULTS
4.1 Evaluation
Two standard measures are used to evaluate an NE tagger. (I) Precision (P) is the number of entities correctly identified divided by the number of entities identified. (II) Recall (R) is the number of entities correctly identified divided by the actual number of entities. Both precision and recall are therefore based on an understanding and measure of relevance. The F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R).
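As a minimal sketch of the evaluation measures defined in Section 4.1, the following Python functions compute precision, recall, and the F-measure from entity counts. The function names and the counts in the example are illustrative assumptions; this is not the FIRE-2015 evaluation script.

```python
def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Return (precision, recall, F1) as percentages.

    num_correct:   entities identified correctly by the tagger
    num_predicted: all entities identified by the tagger
    num_gold:      actual number of entities in the annotated data
    """
    precision = 100.0 * num_correct / num_predicted if num_predicted else 0.0
    recall = 100.0 * num_correct / num_gold if num_gold else 0.0
    return precision, recall, f1_from_scores(precision, recall)


if __name__ == "__main__":
    # Hypothetical counts: 3 correct out of 4 predicted, with 6 entities in the gold data.
    p, r, f = precision_recall_f1(3, 4, 6)
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

For instance, `f1_from_scores(74.73, 46.84)` reproduces, up to rounding, the F1-Score reported for Hin run-2 in Table 2.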

4.2 Test Results

Table 2. Test results of our system
Run          Precision (P)   Recall (R)   F1-Score
Hin run-1    67.11           0.76         1.51
Hin run-2    74.73           46.84        57.59
Eng run-1    7.30            4.17         5.31
Eng run-2    5.35            5.67         5.50


5. CONCLUSION
Conditional random fields (CRFs) [5] are better suited to Indian languages than other models such as HMMs and MEMMs. However, NER models learned with CRFs take more time to train. Because part-of-speech (POS) tags and chunk tags are inputs to training, incorrect POS or chunk tagging also reduces the accuracy of the recognized named entities. Achieving high performance and accuracy in an NER system requires further study and a deeper understanding of linguistic features.

6. ACKNOWLEDGMENTS
We thank Mr. Sandip Modha and the other faculty members of the college for their helpful input. This work is part of ESM-IL (Entity Extraction from Social Media Text - Indian Languages).


7. REFERENCES
[1] Andrew McCallum and Wei Li. Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons.
[2] RDR Postagger. http://rdrpostagger.sourceforge.net/
[3] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named Entity Recognition in Tweets.
[4] John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
[5] Naoaki Okazaki. CRFsuite: Implementation of Conditional Random Fields (CRFs). http://www.chokkan.org/software/crfsuite/
[6] Dr. Rakesh Ch. Balabantaray, Suprava Das, and Kshirabdhi Tanaya Mishra, IIIT, BBSR (2013). CRF++ based approach.
[7] Yassine Benajiba and Paolo Rosso. Arabic named entity recognition using Conditional Random Fields.
[8] Genia tagger. http://www.nactem.ac.uk/GENIA/tagger/
[9] CRF++: Yet Another CRF Toolkit. A simple, customizable, and open source implementation of Conditional Random Fields (CRFs).

