=Paper= {{Paper |id=Vol-1737/T7-6 |storemode=property |title=Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach |pdfUrl=https://ceur-ws.org/Vol-1737/T7-6.pdf |volume=Vol-1737 |authors=Rupal Bhargava,Bapiraju Vamsi Tadikonda,Yashvardhan Sharma |dblpUrl=https://dblp.org/rec/conf/fire/BhargavaTS16 }} ==Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach== https://ceur-ws.org/Vol-1737/T7-6.pdf
       Named Entity Recognition for Code Mixing in Indian
              Languages using Hybrid Approach


          Rupal Bhargava1          Bapiraju Vamsi Tadikonda2          Yashvardhan Sharma3

                                       WiSoc Lab, Department of Computer Science
                                 Birla Institute of Technology and Science, Pilani Campus
                                                        Pilani-333031
                        {rupal.bhargava1 , f20130392 , yash3 } @pilani.bits-pilani.ac.in

ABSTRACT
Automating Named Entity Recognition in social media text has received a lot of attention over
the past few years. Named entities are real-world objects such as persons, organizations,
products and locations. Identifying these entities in social media text is an important and
challenging task due to the informal nature of the text. One such challenge in recognizing
named entities in Indian social media text is code mixing, the use of more than one language in
a sentence. India being a multilingual country, its people tend to know more than one language,
which in turn results in code mixing when they express their opinions. This paper describes the
proposed approach for the shared task CMEE-IL (Code Mix Entity Extraction in Indian Language),
FIRE 2016. The proposed algorithm uses a hybrid of dictionary-based and supervised
classification approaches to identify entities in code-mixed text of Indian languages such as
Hindi-English and Tamil-English.

CCS Concepts
•Computing methodologies → Natural language processing; Information extraction; Language
resources; Machine learning;

Keywords
Code Mixing, Indian Languages, Named Entity Recognition, Natural Language Processing,
Information Retrieval

1.   INTRODUCTION
  India is a multilingual country with more than 1600 languages being spoken, such as Hindi,
Punjabi, Bengali, Telugu, Marathi, Tamil, Gujarati and many more. With the introduction of
Indic keyboards and articles that use Indic languages, people started using Indic languages in
their normal conversation, and this has made it easier for people to converse on the internet.
In such a scenario, it is common that people know at least one more language apart from their
native language, due to which there is a high possibility that people mix words from two or
more different languages while writing or speaking. This mixing of words in sentences is
referred to as code mixing. Code mixing is present where people communicate informally, as on
social media.
  Growing usage of social media platforms like Facebook, Twitter and WhatsApp has led to an
increase in code-mixed data because of the interplay of Indian languages. To make the best use
of this data, it needs to be analyzed. Indic languages are widely used nowadays in general
conversations, but only a few Natural Language Processing (NLP) resources are available for
them. For code-mixed text, even fewer resources are present. Hence there is a greater need for
developing NLP tools that can handle code-mixed text with Indic languages.
  Entity recognition is a very important subtask of information extraction and finds
applications in information retrieval, machine translation and other higher NLP applications
such as coreference resolution. Named entities are names of famous persons, organizations,
locations, animals etc. Named entities have many uses, for example in sentiment analysis, where
recognizing named entities is important as they don't add much value to the statement.
Similarly, while tagging articles, named entities are required for better search results. There
are many such uses, so named entity recognition is very important. Towards this, FIRE 2016
organized a task for entity recognition in English-Hindi and Tamil-English code-mixed tweets.
The task was to identify various entities such as person names, organization names, movie names
and location names in a given tweet.
  The rest of the paper is organized as follows. Section 2 explains the related work done in
the past few years. Section 3 presents the analysis of the data set provided by the CMEE-IL
2016 task organizers. Section 4 explains the proposed technique with a block diagram. Section 5
discusses the algorithm to explain the procedure. Section 6 elaborates the evaluation,
experimental results and error analysis. Section 7 concludes the paper and presents future
work.

2.   RELATED WORK
  Significant research on Named Entity Recognition has been done so far for English, but the
same cannot be said for Indian languages due to their rich morphology. Saha et al. [10]
proposed a Named Entity Recognition (NER) system which is a hybrid of a maximum entropy model,
language-specific rules and gazetteer lists. This system performs well for Hindi and Bengali.
Malarkodi et al. [5] developed a system specific to Tamil using Conditional Random Fields
(CRF). The Entity Extraction from Social Media Text - Indian Languages (ESM-IL) task of FIRE
2015 [8] proposed identifying entities in Indian-language social media text. As a baseline, the
task organizers built a system that used only raw data and was trained with a Conditional
Random Field. They observed that most participants obtained precision similar to that of the
baseline; however, there was a significant improvement in recall over the baseline. As
submissions to the task, Pallavi et al. [6] proposed a system using Conditional Random Fields
(CRF), whereas Anand Kumar et al. [1] proposed a system using a Support Vector Machine (SVM).
Sarkar [11] used POS tags as states and developed a system using a Hidden Markov Model (HMM)
classifier for the task.
  Currently, a lot of work is going on around the recently observed trend of code-mixed text in
Indian languages. Gupta et al. [4] in 2014 formally introduced the concept of Mixed Script
Information Retrieval (MSIR) and the challenges associated with it. Code mixing received some
attention in 2015. In the MSIR shared task of FIRE 2015 [9], a problem was proposed for
identifying mixed scripts in text, along with 9 different Indian languages, named entities and
punctuation, for which significant results were obtained. It was found that the most confusing
language pair was Hindi and Gujarati. Further, it was concluded that the performance of the
system for each category depends on the tokens used for that category. Beyond this, code mixing
has found application in different areas such as query labeling [3], sentiment analysis [2],
question classification etc.

3.   DATA ANALYSIS
  The data provided by the task organizers contained two code-mixed data sets, Tamil-English
and Hindi-English. In each data set, the training data consisted of two files: a text file
containing raw tweets along with their tweetID and UserID, and another text file containing
annotations for the tweets in the raw tweets file. The raw tweet files consist of 2700 tweets
in the Hindi-English corpus and 3200 tweets in the Tamil-English corpus. All the tweets in the
Hindi-English corpus were already romanized, whereas the Tamil-English corpus had a mixture of
Tamil script and romanized script. There were 22 tags present in the corpus, as listed in
Table 1.

        Table 1: Frequency of NE in both Data Sets
         Type of NE       Tamil-English   Hindi-English
         Artifact                18              25
         Count                   94             132
         Date                    14              33
         Disease                  5               7
         Distance                 4               0
         Entertainment          260             810
         Facilities              23              10
         Livthings               16               7
         Location               188             194
         Locomotive               5              13
         Materials               28              24
         Money                   66              25
         Month                   25              10
         Organization            68             109
         Period                  53              44
         Person                 661             712
         Plants                   3               1
         Quantity                 0               2
         Sday                     6              23
         Time                    18              22
         Year                    54             143
         Total                 1609            2346

  The Named Entity (NE) tags Person, Entertainment and Location account for the majority of
instances in the Tamil-English corpus. The Person tag comprises names of famous actors,
actresses, politicians, news reporters and social media celebrities. Entertainment comprises
names of famous TV shows and movies, while Location consists of names of famous cities, Indian
towns and countries.
  Apart from these, there are some numerical and time-based tags which make up the remaining
part of the Tamil-English data set. These tags include Count, Distance, Date, Money, Month,
Time and Year. Money represents numbers along with a monetary unit, like '15 dollars'.
Organization is another tag present, which consists of names of organizations. The rest of the
tags have very few annotations in the annotated file.
  Comparing Hindi-English with Tamil-English, the percentage of annotations for the minority
tags remains almost the same. But in the Hindi-English corpus the tag Entertainment has the
highest number of annotations, followed closely by Person. The rest of the order remains the
same but with varying percentages.

4.     PROPOSED TECHNIQUE
  A word-level NE recognition system is designed to recognise named entities in a tweet. The
proposed methodology involves a pipelined approach for detecting each NE tag and is divided
into the following four phases:

     1. Pre-processing

     2. Number Based Named Entity Recognition

     3. Gazetteer List Based Named Entity Recognition

     4. Tree Based Named Entity Identifier

4.1     Pre-Processing
  The data is pre-processed before detecting named entities, to ensure that the data is uniform
and the system can benefit from that. The pre-processing consists of creating a lowercase copy
of the string for uniformity. It also removes all links present in the tweet. The pre-processed
tweet, along with the original tweet, is then passed to the next phase.

4.2     Number Based Entity Recognition
  This phase of the proposed algorithm identifies number-based entities such as date, time,
month, day, year, money, period,
quantity, distance and count using a set of regular expressions. The regular expressions are
designed based on the common patterns observed in the annotations for these tags. Regular
expressions work best for detecting these tags because there are limited variations possible
for each of them. For example, the tag Day can only be one of the 7 possible days of a week in
a language, so detecting it using regular expressions is efficient. While checking for NE tags,
there is a possibility of multiple tags being attached to the same token. To remove this
ambiguity, the proposed technique checks for tags in a particular predefined order.

Figure 1: Block Diagram for Proposed Algorithm

4.3      Gazetteer Based Entity Recognition
   As shown in Table 1, apart from Entertainment, Location, Person and Organization, the rest
of the tags have too little data to train a classifier. So gazetteer lists are used for
identifying named entities with insufficient training data. The gazetteer lists are created
from the annotations given in the training data. While checking the gazetteer lists, the # and
@ symbols were ignored.

4.4      Classification
   The rest of the NE tags are identified by creating a feature vector for each token of a
tweet. A decision tree and an extremely randomized trees classifier are then trained on these
feature vectors. The features considered for building the feature vector are listed in Table 2.

Table 2: Features used for creating feature vector.
 Sno  Features
 1    Presence of token in English dictionary
 2    Prefixes of length 1 to 3
 3    Suffixes of length 1 to 3
 4    Capitalization related features like starting letter capital, all letters capital,
      other letters capital.
 5    Features based on presence or absence of special characters like #, @, numbers,
      other symbols.
 6    Presence of emoticons
 7    Token present in gazetteer list.
 8    Is previous token a NE Tag.

   The English dictionary feature identifies whether a token is an English word, i.e. it is 1
if the token is an English word and 0 otherwise. The Python library pyenchant (footnote 1) was
used to compute this feature. For the prefix and suffix features, a dictionary was built from
the most common prefixes and suffixes, and the presence of these prefixes and suffixes (length
1 to 3) in tokens was identified using it. The gazetteer list feature checks for the presence
of the token in the gazetteer lists of the remaining tags. The previous token's tag was also
taken into account to capture the structure of the tweet. Using all the features mentioned in
Table 2, decision tree and extremely randomized trees classifiers are trained for
classification [7].

5.   ALGORITHM
   Algorithm 1 explains the proposed technique for Named Entity Recognition of code-mixed text.
The system first pre-processes the input by removing website and Twitter links (implemented by
the callable Link remover) and then converts the tweet to lowercase (implemented by the
callable Case conversion). It then checks for all numerical entities like date, time, money,
quantity, period, distance, day and count (implemented by the callable check Numerical). Before
these are added to the final predictions, overlapping tags are checked for and removed (using
add without repetition). This is the second phase of the system.
   In the third phase, the tweets are tokenized (using the function Tokenize), each token is
checked against the gazetteer lists (using check gazetteer List), and any matches are added to
the final list of tags for that tweet. This ends the third phase of the system. In the final
phase, feature vectors are created for the tokens of each tweet and tags are predicted using
the classifier (clf) already trained on the training data. The classifiers (clf) used are
decision trees and extremely randomized trees.

Algorithm 1 Algorithm for Identifying Named Entities in Code Mixed Text in Indian Languages
 Input: Code-Mixed tweets list, S
 Output: Predicted Named Entity Labels, P
 Initialization: P=[], toks=[], NE Data=[], NE Tags=[]
 for i=0 to S.length do
     Link remover(S[i])
     Case conversion(S[i])
 end for
 for i=0 to S.length do
     d = check Numerical(S[i])
     add without repetition(P, d)
     tok = Tokenize(S[i])
     for j=0 to tok.length do
         g = check gazetteer List(tok[j])
         add without repetition(P, g)
         f = Create Feature vector(tok[j])
         c = clf.predict(f)
         add without repetition(P, c)
     end for
 end for

1
    http://packages.python.org/pyenchant/
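The first three phases of Algorithm 1 (pre-processing, number-based recognition and gazetteer lookup) can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not publish its regular expressions, tag order or gazetteer lists, so the patterns, the tag labels, the gazetteer entries and the function names below are all hypothetical.

```python
import re

# Hypothetical patterns and gazetteer entries, invented for illustration only;
# the paper does not publish its regular expressions or lists.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Checked in this fixed order so that a span claimed by an earlier tag
# is not re-tagged by a later, more general pattern.
NUMERIC_PATTERNS = [
    ("MONEY", re.compile(r"\b\d+(?:\.\d+)?\s?(?:dollars|rupees|rs)\b")),
    ("YEAR", re.compile(r"\b(?:19|20)\d{2}\b")),
    ("DAY", re.compile(r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b")),
    ("COUNT", re.compile(r"\b\d+\b")),
]

GAZETTEERS = {
    "DISEASE": {"dengue", "malaria"},
    "PLANTS": {"neem", "tulsi"},
}


def preprocess(tweet):
    """Phase 1: strip links and return a lowercase copy of the tweet."""
    return URL_RE.sub("", tweet).lower().strip()


def tag_numeric(text):
    """Phase 2: regex tagging in a predefined order, skipping overlaps."""
    found, taken = [], []
    for tag, pattern in NUMERIC_PATTERNS:
        for m in pattern.finditer(text):
            # Accept the match only if it overlaps no already-tagged span.
            if all(m.end() <= s or m.start() >= e for s, e in taken):
                taken.append(m.span())
                found.append((m.group(), tag))
    return found


def tag_gazetteer(text):
    """Phase 3: exact token lookup, ignoring leading # and @ symbols."""
    found = []
    for token in text.split():
        word = token.lstrip("#@").strip(".,!?")
        for tag, entries in GAZETTEERS.items():
            if word in entries:
                found.append((word, tag))
    return found


def recognise(tweet):
    """Run phases 1 to 3 and return (entity text, tag) pairs."""
    clean = preprocess(tweet)
    return tag_numeric(clean) + tag_gazetteer(clean)
```

Listing the more specific patterns first is one simple way to realise the predefined tag order: a number already claimed as Money or Year is not tagged again as Count.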
  Table 3: Different Versions of Proposed System
 Version  Tags trained on Classifier                                            Classifier Used
 1        Person, Entertainment, Location, Organization                         Decision Tree
 2        Person, Entertainment, Location, Organization                         Extremely Randomized Tree
 3        Person, Entertainment, Location, Organization, Artifact, Facilities   Decision Tree
 4        Person, Entertainment, Location, Organization, Artifact, Facilities   Extremely Randomized Tree


Table 4: Results for Hindi English Proposed System
 Runs    Precision      Recall        F-Measure
 Run 1   58.66          32.93         42.18
 Run 2   58.84          35.32         44.14
 Run 3   59.15          34.62         43.68

Figure 2: Result Comparison for all teams (Hindi-English)
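The F-measure column of Table 4 is consistent with the usual balanced F-score, the harmonic mean of precision and recall, F = 2PR / (P + R). The paper does not state the formula it used, so this is an assumption; the short check below recomputes the column from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the Hindi-English runs in Table 4
table4 = {
    "Run 1": (58.66, 32.93, 42.18),
    "Run 2": (58.84, 35.32, 44.14),
    "Run 3": (59.15, 34.62, 43.68),
}

for run, (p, r, reported) in table4.items():
    # Print the recomputed value next to the value reported in the paper.
    print(run, round(f_measure(p, r), 2), reported)
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall of every run is what holds the F-measure down.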


6.    EXPERIMENTS & RESULTS
   CMEE-IL, FIRE 2016, proposed the task of Named Entity Recognition in Hindi-English and
Tamil-English code-mixed text. Three runs were submitted for evaluation for each language pair.
Four versions of the system were created for the different runs. In all the versions, numerical
entities were detected using the numerical function explained in Section 4.2. Of the remaining
tags, those listed in Table 3 were classified by the classifier of the respective version.
   All other tags were identified in the gazetteer list phase described in Section 4.3. This
was done because little training data was available for tags such as Plants, Disease,
Locomotive etc., as shown in Table 1. All the variations of the proposed algorithm were
evaluated using F-score for each language pair, and finally Runs 1, 2 and 3 for Hindi-English
used versions 1, 2 and 4 respectively, whereas Runs 1, 2 and 3 for Tamil-English used versions
1, 3 and 4 respectively.

6.1   Evaluation & Discussion
   As shown in Table 4, run 2 performed the best for the Hindi-English system. Similarly, run 2
performed best for Tamil-English, as indicated in Table 5. Based on the F-score, it can be
concluded that the algorithm with more gazetteer lists and extremely randomized trees
(Version 2) performed well for Hindi-English. For Tamil-English, the algorithm with fewer
gazetteer lists and a decision tree (Version 3) proved effective.
   Precision could be lower because of string matching against elements of the gazetteer lists.
Also, it can be observed that recall is low for all the runs; this could be due to the small
number of named entities in any sentence, which in turn reduces the average recall value.
Recall might have increased if partial identification of NEs were considered. The proposed
system stood fifth among the Hindi-English systems and was ranked fourth for Tamil-English, as
shown in Figures 2 and 3 respectively.

Table 5: Results for Tamil English Proposed System
 Runs    Precision      Recall        F-Measure
 Run 1   55.86          10.87         18.20
 Run 2   58.71          12.21         20.22
 Run 3   58.94          11.94         19.86

Figure 3: Result Comparison for all teams (Tamil-English)

6.2   Error Analysis
   A few phases of the proposed approach might have contributed to the misclassification of
some tags. One such phase is the gazetteer list phase, which is a dictionary-based approach and
has the disadvantages associated with it. When there is little data in the dictionary, the
precision of the system will be low, so more elements are needed in the lists for a better
recognition system. Also, if there is ambiguity in tags, there is a chance of
misclassification. For example, the token 'Honey' can represent a Person, as in 'Honey Singh',
or the tag Materials. This ambiguity can only be resolved using a classifier.
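The classifier phase works on per-token feature vectors of the kind listed in Table 2. The sketch below shows one possible way to build such a vector; it is not the authors' implementation. The small word set stands in for the pyenchant dictionary lookup, the gazetteer entry and the emoticon pattern are made up, raw prefixes and suffixes are used instead of the paper's dictionary of common affixes, and the function and feature names are hypothetical.

```python
import re

# Stand-ins for resources the paper used but did not publish (assumptions):
# a tiny word set replaces the pyenchant English dictionary, and the
# gazetteer holds a made-up entry.
ENGLISH_WORDS = {"honey", "is", "sweet", "watching", "movie"}
GAZETTEER = {"honey"}
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def token_features(token, prev_is_ne=False):
    """Build a feature dict for one token, mirroring the rows of Table 2."""
    lower = token.lower()
    feats = {
        "in_english_dict": int(lower in ENGLISH_WORDS),
        "starts_upper": int(token[:1].isupper()),
        "all_upper": int(token.isupper()),
        "inner_upper": int(any(c.isupper() for c in token[1:])),
        "has_hash": int("#" in token),
        "has_at": int("@" in token),
        "has_digit": int(any(c.isdigit() for c in token)),
        "has_emoticon": int(bool(EMOTICON_RE.search(token))),
        "in_gazetteer": int(lower.lstrip("#@") in GAZETTEER),
        "prev_is_ne": int(prev_is_ne),
    }
    # Prefixes and suffixes of length 1 to 3 as categorical features.
    for n in (1, 2, 3):
        feats[f"prefix_{n}"] = lower[:n]
        feats[f"suffix_{n}"] = lower[-n:]
    return feats


features = token_features("Honey")
print(features["in_english_dict"], features["in_gazetteer"], features["prefix_3"])
```

For the token 'Honey', both the dictionary and gazetteer features fire, so the capitalization and previous-tag features are what a tree classifier would have to rely on to separate the Person reading from the Materials reading. The string-valued affix features would need to be encoded numerically (for example with scikit-learn's DictVectorizer) before being passed to a decision tree or extremely randomized trees classifier.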
7.   CONCLUSION & FUTURE WORK
  In this paper, a hybrid of dictionary-based and supervised classification approaches for
identifying entities in code-mixed text of Indian languages such as Hindi-English and
Tamil-English is presented, as submitted to the task CMEE-IL, FIRE 2016. The proposed system
uses a pipelined approach to identify the named entities. There are four variants of the
system, based on the number of tags the classifier can detect and the classifier used. Further
improvement can be made by incorporating features related to the structure of the sentence. POS
tagging and chunking of the tweets have not been included; this can be another improvement to
the proposed algorithm. However, this requires better POS taggers for code-mixed languages,
which can tag both romanised and non-romanised tweets.

8.   REFERENCES
 [1] Anand Kumar M, Shriya Se, S. K. Amrita cen@ fire 2015: Extracting entities for social
     media texts in indian languages. In Working notes of FIRE 2015 - Forum for Information
     Retrieval Evaluation, Gandhinagar, India, December, 2015 (2015), vol. 1587 of CEUR
     Workshop Proceedings, CEUR-WS.org, pp. 87–90.
 [2] Bhargava, R., Sharma, Y., and Sharma, S. Sentiment analysis for mixed script indic
     sentences. In Advances in Computing, Communications and Informatics (ICACCI), 2016
     International Conference on (2016), IEEE, pp. 524–529.
 [3] Bhargava, R., Sharma, Y., Sharma, S., and Baid, A. Query labelling for indic languages
     using a hybrid approach. In Working notes of FIRE 2015 - Forum for Information Retrieval
     Evaluation, Gandhinagar, India, December, 2015 (2015), vol. 1587 of CEUR Workshop
     Proceedings, CEUR-WS.org, pp. 40–42.
 [4] Gupta, P., Bali, K., Banchs, R. E., Choudhury, M., and Rosso, P. Query expansion for
     mixed-script information retrieval. In Proceedings of the 37th international ACM SIGIR
     conference on Research & development in information retrieval (2014), ACM, pp. 677–686.
 [5] Malarkodi, C., Pattabhi, R., and Sobha, L. D. Tamil ner–coping with real time challenges.
     In 24th International Conference on Computational Linguistics (2012), p. 23.
 [6] Pallavi, K., Srividhya, K., and Rexiline Ragini John Victor, R. M. Hits@ fire task 2015:
     Twitter based named entity recognizer for indian languages. In Working notes of FIRE 2015
     - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015 (2015),
     vol. 1587 of CEUR Workshop Proceedings, CEUR-WS.org, pp. 83–86.
 [7] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
     M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in
     python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
 [8] Rao, P. R., Malarkodi, C., and Devi, S. L. Esm-il: Entity extraction from social media
     text for indian languages@ fire 2015–an overview. In Working notes of FIRE 2015 - Forum
     for Information Retrieval Evaluation, Gandhinagar, India, December, 2015 (2015),
     vol. 1587 of CEUR Workshop Proceedings, CEUR-WS.org, pp. 76–82.
 [9] Royal Sequiera, Choudhury, M., Gupta, P., Rosso, P., Kumar, S., Banerjee, S., Naskar,
     S. K., Bandyopadhyay, S., Chittaranjan, G., Das, A., and Chakma, K. Overview of fire-2015
     shared task on mixed script information retrieval. In Working notes of FIRE 2015 - Forum
     for Information Retrieval Evaluation, Gandhinagar, India, December, 2015 (2015),
     vol. 1587 of CEUR Workshop Proceedings, CEUR-WS.org, pp. 19–25.
[10] Saha, S. K., Chatterji, S., Dandapat, S., Sarkar, S., and Mitra, P. A hybrid approach for
     named entity recognition in indian languages. In Proceedings of the IJCNLP-08 Workshop on
     NER for South and South East Asian Languages (2008), pp. 17–24.
[11] Sarkar, K. A hidden markov model based system for entity extraction from social media
     english text at fire 2015. In Working notes of FIRE 2015 - Forum for Information
     Retrieval Evaluation, Gandhinagar, India, December, 2015 (2015), vol. 1587 of CEUR
     Workshop Proceedings, CEUR-WS.org, pp. 91–97.