A Hybrid Approach for Entity Extraction in Code-Mixed Social Media Data

Deepak Gupta, Comp. Sc. & Engg. Deptt., IIT Patna, India, deepak.pcs16@iitp.ac.in
Shweta, Comp. Sc. & Engg. Deptt., IIT Patna, India, shweta.pcs14@iitp.ac.in
Shubham Tripathi, Electrical Engg. Deptt., MNIT Jaipur, India, stripathi1770@gmail.com
Asif Ekbal, Comp. Sc. & Engg. Deptt., IIT Patna, India, asif@iitp.ac.in
Pushpak Bhattacharyya, Comp. Sc. & Engg. Deptt., IIT Patna, India, pb@iitp.ac.in

ABSTRACT
Entity extraction is an important task in many natural language processing (NLP) application areas. A significant amount of work exists on entity extraction, but mostly for a few languages (such as English, some European languages and a few Asian languages) and for domains such as newswire. Nowadays social media have become a convenient and powerful way to express one's opinion and sentiment. India is a diverse country with many linguistic and cultural variations. Texts written in social media are informal in nature, and people often use more than one script while writing. User-generated content such as tweets, blogs and personal websites is written in Roman script, or sometimes in both Roman and indigenous scripts. Entity extraction is, in general, a more challenging task for such informal text, and the mixing of codes further complicates the process. In this paper, we propose a hybrid approach for entity extraction from code-mixed language pairs such as English-Hindi and English-Tamil. We use a rich linguistic feature set to train a Conditional Random Field (CRF) classifier. The output of the classifier is post-processed with a carefully hand-crafted feature set. The proposed system achieves F-scores of 62.17% and 44.12% for the English-Hindi and English-Tamil language pairs, respectively. Our system attains the best F-score among all the systems submitted to the FIRE 2016 shared task for the English-Tamil language pair.

CCS Concepts
•Computing methodologies → Natural Language Processing; •Information System → Information Extraction; •Algorithm → Conditional Random Field (CRF);

Keywords
Code-mixing, Entity Extraction, Named Entity Recognition, Conditional Random Field (CRF), Indian Language, Social Media data

1. INTRODUCTION
Code-mixing refers to the mixing of two or more languages or language varieties. The terms code switching and code mixing are often used interchangeably. With easy internet access now widely available, involvement in social media has increased considerably. Over the past decade, Indian language content on various media types such as blogs, email, websites and chats has increased significantly. With the advent of smart phones, more people are using social media such as WhatsApp, Twitter and Facebook to share their opinions on people, products, services, organizations and governments. The abundance of social media data has created many new opportunities for information access, but also many new challenges. A great deal of research is under way to deal with these challenges, and it has become one of the prime present-day research areas.

Non-English speakers, especially Indians, do not always use Unicode when writing in Indian languages on social media. Instead, they use Roman script or transliteration, frequently use English words or phrases through code-mixing, and often mix multiple languages in addition to anglicisms to express their thoughts. Although English is the principal language for social media communications, there is a need to develop mechanisms for other languages, including Indian languages. According to the Indian constitution there are 22 official languages in India. However, the Census of India of 2001 reported that India has 122 major languages and 1599 other languages. The 2001 Census recorded 30 languages which were spoken by more than a million native speakers and 122 which were spoken by more than 10,000 people. Language diversity and dialect changes instigate frequent code-mixing in India. Hence, Indians are multi-lingual by adaptation and necessity, and frequently change and mix languages in social media contexts, which poses additional difficulties for automatic processing of social media text in Indian languages. Indian language content is expected to grow by more than 70% every year. Hence there is a great need to process this huge volume of data automatically.

Named Entity Recognition (NER) is one of the key information extraction tasks; it is concerned with identifying names of entities such as people, locations, organizations and products. It can be divided into two main phases: entity detection and entity typing (also called classification) [7]. Recently, information extraction over micro-blogs has become an active research topic [4], following early experiments which showed this genre to be extremely challenging for state-of-the-art algorithms [5, 2]. For instance, named entity recognition methods typically have 85-90% accuracy on longer texts, but 30-50% on tweets [16]. First, the shortness of micro-blogs (maximum 140 characters for tweets) makes them hard to interpret. Consequently, ambiguity is a major problem, since semantic annotation methods cannot easily make use of co-reference information. Unlike longer news articles, there is a low amount of discourse information per microblog document, and threaded structure is fragmented across multiple documents, flowing in multiple directions. Second, micro-texts exhibit much more language variation, tend to be less grammatical than longer posts, contain unorthodox capitalization, and make frequent use of emoticons, abbreviations and hashtags, which can form an important part of the meaning. To combat these problems, research has focused on microblog-specific information extraction algorithms (e.g. named entity recognition for Twitter using CRFs [16] or hybrid methods [18]). Particular attention is given to micro-text normalization [8], as a way of removing some of the linguistic noise prior to part-of-speech tagging and entity recognition.

In the literature, machine learning and rule-based approaches have primarily been used for named entity recognition. Machine learning (ML) based techniques for NER make use of a large amount of NE-annotated training data to acquire higher-level knowledge by extracting relevant features from the labeled data. Several ML techniques have already been applied to NER, such as support vector classifiers [9], Maximum Entropy [3, 10], Hidden Markov Model (HMM) [1] and Conditional Random Field (CRF) [12]. Rule-based techniques have also been explored for the task [6, 13, 19]. Hybrid approaches that combine different machine learning methods have also been used: Srihari et al. [17] combine Maximum Entropy, Hidden Markov Model and handcrafted rules to build an NER system.

Entity extraction has been actively researched for over 20 years. Most of the research has, however, focused on resource-rich languages such as English, French and Spanish. Entity extraction and recognition from social media text for Indian languages was introduced at the FIRE-15 workshop [15]. Code-mixed entity extraction from social media text for Indian language mixes was introduced in FIRE-2016. Entity extraction from code-mixed social media text poses some key challenges, which are as follows:

1. The released data set contains code-mixed as well as uni-language utterances.

2. The set of entities is not limited to the traditional set, e.g. Person Name, Location Name, Organization Name etc. There are 22 different types of entities to extract from the text.

3. There is a lack of resources/tools for Indian languages. Code-mixing makes the pre-processing tasks required for NER, such as sentence splitting, tokenization, part-of-speech tagging and chunking, more difficult.

2. PROBLEM DEFINITION
The problem of code-mixed entity extraction comprises two sub-problems: entity extraction and entity classification. Mathematically, the problem can be described as follows. Let $S$ be a code-mixed sentence having $n$ tokens $t_1, t_2, \ldots, t_n$. Let $E = \{E_1, E_2, \ldots, E_k\}$ be the set of $k$ pre-defined entity types.

1. Entity extraction step: Extract the set of tokens $T_E = \{t_i, t_j, \ldots, t_m\}$ from $S$ whose characteristics are similar to any of the entities from the entity set $E$.

2. Entity classification step: Classify each of the tokens of the set $T_E$ into one of the entity types from the entity set $E$.

3. DATASET
Data sets for two language pairs are available to evaluate system performance. They were crawled from Twitter; the crawled tweets are mainly in English-Hindi and English-Tamil language mixes. There are 22 types of entities in the training data set, the majority of which are from 'Entertainment', 'Person', 'Location' and 'Organization'. The statistics of the training data set are shown in Table-1. Some sample tweets from both language pairs are shown in Table-2. The English-Tamil data set also contains some tweets written only in Tamil. The English-Hindi tweet data set contains a total of 2700 tweets from 2699 Twitter users. Similarly, the English-Tamil tweet data set contains a total of 2183 tweets from 1866 Twitter users.

Entities          English-Hindi    English-Tamil
                  # Entity         # Entity
COUNT             132              94
PLANTS            1                3
PERIOD            44               53
LOCOMOTIVE        13               5
ENTERTAINMENT     810              260
MONEY             25               66
TIME              22               18
LIVTHINGS         7                16
DISEASE           7                5
ARTIFACT          25               18
MONTH             10               25
FACILITIES        10               23
PERSON            712              661
MATERIALS         24               28
LOCATION          194              188
YEAR              143              54
DATE              33               14
ORGANIZATION      109              68
QUANTITY          2                0
DAY               67               15
SDAY              23               6
DISTANCE          0                4
Total             2413             1624

Table 1: Training dataset statistics for both language pairs. ENTERTAINMENT is the most frequent entity type for English-Hindi, and PERSON is the most frequent for English-Tamil.

Language Pair    Sample Tweet
English-Hindi    @YOUniqueDoc Nandu,muje shaq hai ki humari notice ke bagair tere ghar ke secret route ki help se you met kya. My intution is never wrong
English-Hindi    A RiftWood Productions presents to you the season finale of Le' Bill & Giddy,La Muje'r
English-Tamil    Ungala nenachu neengale romba proud ah feel panra vishayam enna ?! Post ur comments. Will be read on sun music at 5pm live ;)
English-Tamil    IruMugan will be a Class + Mass movie like Thani Oruvan. The biggest plus is the screenplay - Thambi Ramaiyah..

Table 2: Sample tweets from both language pairs

3.1 Conditional Random Field (CRF)
Lafferty et al. [11] define the probability of a particular label sequence $y$ given an observation sequence $x$ to be a normalized product of potential functions, each of the form

\[ \exp\Big( \sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i) \Big) \tag{1} \]

where $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions $i$ and $i-1$ in the label sequence; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence; and $\lambda_j$ and $\mu_k$ are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features $b(x, i)$ of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

\[ b(x, i) = \begin{cases} 1 & \text{if the observation word at position } i \text{ is `religion'} \\ 0 & \text{otherwise} \end{cases} \]

Each feature function takes on the value of one of these real-valued observation features $b(x, i)$ if the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

\[ t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \text{B-ORG and } y_i = \text{I-ORG} \\ 0 & \text{otherwise} \end{cases} \]

This allows the probability of a label sequence $y$ given an observation sequence $x$ to be written as

\[ P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_j \lambda_j F_j(y, x) \Big) \tag{2} \]

where $F_j(y, x)$ can be written as follows:

\[ F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i) \tag{3} \]

where each $f_j(y_{i-1}, y_i, x, i)$ is either a state function $s(y_{i-1}, y_i, x, i)$ or a transition function $t(y_{i-1}, y_i, x, i)$.

4. FEATURE EXTRACTION
The proposed system uses an exhaustive set of features for NE recognition. These features are described below.

1. Context word: Local contextual information is useful to determine the type of the current word. We use the previous two and the next two words as context features.

2. Character n-gram: A character n-gram is a contiguous sequence of n characters extracted from a given word. The set of n-grams that can be generated for a given token is the result of moving a window of n characters along the text. We extracted character n-grams of length one (unigram), two (bigram) and three (trigram), and use these as features of the classifier.

3. Word normalization: Words are normalized in order to capture the similarity between two different words that share some common properties. Each uppercase letter is replaced by 'A', each lowercase letter by 'a' and each digit by '0'.

   Words    Normalization
   NH10     AA00
   Maine    Aaaaa
   NCR      AAA

4. Prefix and Suffix: Prefixes and suffixes of fixed-length character sequences (here, 3) are stripped from each token and used as features of the classifier.

5. Word Class Feature: This feature was defined to ensure that words having similar structures belong to the same class. In the first step we normalize all the words following the process mentioned above. Thereafter, consecutive identical characters are squeezed into a single character. For example, the normalized word AAAaaa is converted to Aa. We found this feature to be effective for the biomedical domain, and we adapted it directly without any modification.

6. Word Position: In order to capture the word context in the sentence, we use a numeric value to indicate the position of the word in the sentence. The normalized position of the word in the sentence is used as a feature. The feature value lies in the range between 0 and 1.

7. Number of Upper case Characters: This feature takes into account the number of uppercase letters in the word. The feature is relative in nature and ranges between 0 and 1.

8. Test Word Probability: This feature finds the probability of the test word being labeled the same as in the training data. The length of this vector feature is the total number of labels or output tags, where each bit represents an output tag. It is initialized with 0. If the test word does not appear in training, every bit retains its initial value 0. Based on the probability value, we have two features:

   (a) Top@1-Probability: For the output tag with the highest probability, its corresponding bit in the feature vector is set to 1. All other bits remain 0.

   (b) Top@2-Probability: For the output tags with the highest and second highest probability, their corresponding bits are set to 1 in the feature vector. All other bits remain 0.

9. Binary Features: These binary features were identified after a thorough analysis of the training data.

   (a) isSufficientLength: Most of the entities in the training data have a significant length. Therefore we set a binary feature to fire when the length of the token is greater than a specific threshold value. A threshold value of 4 is used to extract this binary feature.

   (b) isAllCapital: The value of this feature is set when all the characters of the current token are in uppercase.

   (c) isFirstCharacterUpper: The value of this feature is set when the first character of the current token is in uppercase.

   (d) isInitCap: This feature checks whether the current token starts with a capital letter or not. This provides evidence for the target word being of NE type for the English language.

   (e) isInitPunDigit: We define a binary-valued feature that checks whether the current token starts with a punctuation mark or a digit. It indicates that the respective word does not belong to any language. A few such examples are 4u, :D, :P etc.

   (f) isDigit: This feature fires when the current token is numeric.

   (g) isDigitAlpha: We define this feature to check whether the current token is alphanumeric. A word for which this feature is true tends not to be labeled as any named entity type.

   (h) isHashTag: Since we are dealing with Twitter data, we encounter a lot of hashtags in tweets. We define a binary feature that checks whether the current token starts with # or not.

5. EXPERIMENTAL SETUP
To extract entities from code-mixed data we followed a three-step approach, which is described in this section. Fig-1 shows an architecture diagram of our proposed approach.

Figure 1: Proposed model architecture for Code-mixed entity extraction

5.1 Pre-processing
Pre-processing is an important stage before applying any classifier. The released data set was in the form of raw text sentences with a list of entities. Two steps were performed as part of pre-processing.

1. Tokenization: Since the data set was crawled from Twitter, a tokenizer that can deal with social media data needs to be used. We used the CMU PoS tagger [14] for tokenization and PoS tagging of tweets.

2. Token Encoding: We used the IOB encoding for tagging tokens. The IOB format (Inside, Outside, Beginning) is a common format for tagging tokens in a chunking task in computational linguistics (e.g. Named Entity Recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. An O tag indicates that a token belongs to no chunk.

5.2 Sequence Labeling
In the literature, HMM, MEMM and CRF have primarily been used for the sequence labeling task. Here we used the CRF classifier to label the sequence of tokens. The feature set described in Section-4 was provided as input to our CRF classifier [11]. We used the CRF++¹ implementation of CRF. The default parameter settings were used to carry out the experiment.

5.3 Post-processing
Rule- and dictionary-based post-processing was performed on the labels obtained from the CRF classifier. The details are as follows.

English-Hindi
For English-Hindi code-mixed data we used dictionaries of Disease Names, Living Things and Special Days. A list consisting of 250 disease names was obtained from a wiki page². A list of 668 living things was created manually from different web page sources. Similarly, a manual list of 92 special days was obtained from a website³. The dictionaries were applied in the order given in Algorithm-1. Finally, regular expressions were formed to correct PERIOD, MONEY and TIME on the post-processed output.

English-Tamil
Since none of our team members is a native Tamil speaker, we could not do a deep error analysis of the CRF-predicted output. We used the same dictionaries that were used for the English-Hindi data set, because those dictionaries contain only English words. Finally, regular expressions were formed to correct PERIOD, MONEY and TIME on the post-processed output.

Algorithm 1: Post-processing algorithm for code mixed data set
Input:  Disease name list DN; living things list LT; special days list SD;
        list of (token, label) pairs obtained from CRF, L(W, C)
Output: Post-processed list of (token, label) pairs, PL(W, C')
PL(W, C') = L(W, C)
for each pair (W_i, C_i) in L(W, C) do
    if DN contains W_i then
        C'_i = DISEASE
    else if LT contains W_i then
        C'_i = LIVINGTHINGS
    else if SD contains W_i then
        C'_i = SPECIALDAYS
    else
        C'_i = C_i
    end
end
return PL(W, C')

¹ https://taku910.github.io/crfpp/
² https://simple.wikipedia.org/wiki/List_of_diseases
³ www.drikpanchang.com/calendars/indian/indiancalendar.html

6. RESULT & ANALYSIS
Entity extraction models for the English-Hindi and English-Tamil language pairs were trained using CRF as the base classifier. We tested our system on the test data for the respective language pair. The proposed approach was used to extract entities from the data of both language pairs.

The developed entity extraction and identification system has been evaluated using precision (P), recall (R) and F-measure (F). The organizers of the CMEE-IL task at FIRE 2016 released the data in two phases: in the first phase, the training data was released along with the corresponding NE annotation file. In the second phase, the test data was released without an NE annotation file. The extracted NE annotation file for the test data was sent to the organizers for evaluation. The organizers evaluated the runs submitted by the various teams and sent the official results to the participating teams.

The official results for the English-Hindi language pair are shown in Table-3, where our system is listed as Deepak-IITPatna. Our system obtained the highest precision, 81.15%, among all the submitted systems. The proposed approach achieved an F-score of 62.17% on the English-Hindi language pair. Table-4 shows the official results for the English-Tamil language pair data set, where our system is again listed as Deepak-IITPatna. Our system is the best performing system among all the submitted systems. We achieved 79.92%, 30.47% and 44.12% precision (P), recall (R) and F-score (F), respectively. The reason for the lower F-score on English-Tamil could be the lack of good features that help to recognize a Tamil word as a named entity.

S. No.  Team                  Run-1 (P / R / F)       Run-2 (P / R / F)       Run-3 (P / R / F)       Best-Run (P / R / F)
1       Irshad-IIITHyd        80.92 / 59 / 68.24      NA                      NA                      80.92 / 59.00 / 68.24
2       Deepak-IITPatna       81.15 / 50.39 / 62.17   NA                      NA                      81.15 / 50.39 / 62.17
3       VeenaAmritha-T1       75.19 / 29.46 / 42.33   75 / 29.17 / 42.00      79.88 / 41.37 / 54.51   79.88 / 41.37 / 54.51
4       BharathiAmritha-T2    76.34 / 31.15 / 44.25   77.72 / 31.84 / 45.17   NA                      77.72 / 31.84 / 45.17
5       Rupal-BITSPilani      58.66 / 32.93 / 42.18   58.84 / 35.32 / 44.14   59.15 / 34.62 / 43.68   58.84 / 35.32 / 44.14
6       SomnathJU             37.49 / 40.28 / 38.83   NA                      NA                      37.49 / 40.28 / 38.83
7       Nikhil-BITSHyd        59.28 / 19.64 / 29.50   61.8 / 26.39 / 36.99    NA                      61.80 / 26.39 / 36.99
8       ShivkaranAmritha-T3   48.17 / 24.9 / 32.83    NA                      NA                      48.17 / 24.90 / 32.83
9       AnujSaini             72.24 / 18.85 / 29.90   NA                      NA                      72.24 / 18.85 / 29.90

Table 3: Official results obtained by the various teams that participated in the CMEE-IL task at FIRE 2016 for the code mixed English-Hindi language pair. Here P, R and F denote precision, recall and F-score, respectively.

S. No.  Team                  Run-1 (P / R / F)       Run-2 (P / R / F)       Run-3 (P / R / F)       Best-Run (P / R / F)
1       Deepak-IITPatna       79.92 / 30.47 / 44.12   NA                      NA                      79.92 / 30.47 / 44.12
2       VeenaAmritha-T1       77.38 / 8.72 / 15.67    74.74 / 9.93 / 17.53    79.51 / 21.88 / 34.32   79.51 / 21.88 / 34.32
3       BharathiAmritha-T2    77.7 / 15.43 / 25.75    79.56 / 19.59 / 31.44   NA                      79.56 / 19.59 / 31.44
4       RupalBITSPilani-R2    58.66 / 10.87 / 18.20   58.71 / 12.21 / 20.22   58.94 / 11.94 / 19.86   58.71 / 12.21 / 20.22
5       ShivkaranAmritha-T3   47.62 / 13.42 / 20.94   NA                      NA                      47.62 / 13.42 / 20.94

Table 4: Official results obtained by the various teams that participated in the CMEE-IL task at FIRE 2016 for the code mixed English-Tamil language pair. Here P, R and F denote precision, recall and F-score, respectively.

7. CONCLUSION & FUTURE WORK
This paper describes a code-mixed named entity recognition system for social media text in English-Hindi and English-Tamil language pair data. Our proposed approach is a hybrid model of machine learning and a rule-based system. The experimental results show that our system is the best performer among the systems that participated in the CMEE-IL task for the code mixed English-Tamil language pair. For the English-Hindi language pair, our system achieved the highest precision value, 81.15%, among all the submitted systems. In future we would like to build a more robust code-mixed NER system using deep learning.

8. REFERENCES
[1] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the fifth conference on Applied natural language processing, pages 194–201. Association for Computational Linguistics, 1997.
[2] K. Bontcheva, L. Derczynski, and I. Roberts. Crowdsourcing named entity recognition and entity linking corpora. The Handbook of Linguistic Annotation (to appear), 2014.
[3] A. Borthwick. A maximum entropy approach to named entity recognition. PhD thesis, Citeseer, 1999.
[4] A. E. Cano Basave, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#msm2013) concept extraction challenge. 2013.
[5] L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49, 2015.
[6] R. Grishman. The NYU system for MUC-6 or where's the syntax? In Proceedings of the 6th conference on Message understanding, pages 167–175. Association for Computational Linguistics, 1995.
[7] R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
[8] B. Han and T. Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 368–378. Association for Computational Linguistics, 2011.
[9] H. Isozaki and H. Kazawa. Efficient support vector classifiers for named entity recognition. In Proceedings of the 19th international conference on Computational linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
[10] N. Kumar and P. Bhattacharyya. Named entity recognition in Hindi using MEMM. Technical Report, IIT Mumbai, 2006.
[11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[12] W. Li and A. McCallum. Rapid development of Hindi named entity recognition using conditional random fields and feature induction. ACM Transactions on Asian Language Information Processing (TALIP), 2(3):290–294, 2003.
[13] D. McDonald. Internal and external evidence in the identification and semantic categorization of proper names. Corpus processing for lexical acquisition, pages 21–39, 1996.
[14] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics, 2013.
[15] P. R. Rao, C. Malarkodi, and S. L. Devi. ESM-IL: Entity extraction from social media text for Indian languages @ FIRE 2015 - an overview.
[16] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.
[17] R. Srihari, C. Niu, and W. Li. A hybrid approach for named entity and sub-type tagging. In Proceedings of the sixth conference on Applied natural language processing, pages 247–254. Association for Computational Linguistics, 2000.
[18] M. Van Erp, G. Rizzo, and R. Troncy. Learning with the web: Spotting named entities on the intersection of NERD and machine learning. In #MSM, pages 27–30, 2013.
[19] T. Wakao, R. Gaizauskas, and Y. Wilks. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of the 16th conference on Computational linguistics - Volume 1, pages 418–423. Association for Computational Linguistics, 1996.
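The word normalization, word class, character n-gram and affix features described under FEATURE EXTRACTION can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the submitted runs: all function names are hypothetical, and leaving punctuation and other symbols unchanged during normalization is an assumption, since the text only specifies the mapping for uppercase letters, lowercase letters and digits.

```python
import re


def normalize_word(token: str) -> str:
    """Map uppercase letters to 'A', lowercase letters to 'a', digits to '0'.

    Assumption: any other character (punctuation, '#', '@') is kept as-is.
    """
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)


def word_class(token: str) -> str:
    """Normalize the token, then squeeze runs of identical characters."""
    return re.sub(r"(.)\1+", r"\1", normalize_word(token))


def char_ngrams(token: str, max_n: int = 3) -> list:
    """All character n-grams of length 1..max_n, by sliding a window."""
    return [
        token[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


def affixes(token: str, length: int = 3) -> tuple:
    """Fixed-length prefix and suffix (the whole token if it is shorter)."""
    return token[:length], token[-length:]


# Examples taken from the word normalization table in the paper
print(normalize_word("NH10"))   # AA00
print(normalize_word("Maine"))  # Aaaaa
print(normalize_word("NCR"))    # AAA
print(word_class("Maine"))      # Aa
```

Each function returns plain strings, so the values can be written directly as columns of a CRF training file; how the columns are laid out is left open here.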
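The IOB token encoding used in pre-processing can be illustrated with a short Python sketch. The function name and the span representation (start index, exclusive end index, entity type) are assumptions made for this example, and the entity spans marked on the sample tweet are hypothetical, not taken from the gold annotation.

```python
def iob_encode(tokens, entities):
    """Return one IOB tag per token.

    `entities` is a list of (start, end, label) spans over token indices,
    with `end` exclusive. Tokens outside every span receive the tag 'O'.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags


tokens = ["IruMugan", "will", "be", "a", "movie", "like", "Thani", "Oruvan"]
# Hypothetical spans, for illustration only
entities = [(0, 1, "ENTERTAINMENT"), (6, 8, "ENTERTAINMENT")]

for token, tag in zip(tokens, iob_encode(tokens, entities)):
    print(token, tag)
```

A two-token name such as "Thani Oruvan" is tagged B-ENTERTAINMENT followed by I-ENTERTAINMENT, which lets the sequence labeler distinguish a continued entity from two adjacent entities of the same type.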
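The dictionary lookup of Algorithm-1 can be sketched as follows. This is a minimal sketch under stated assumptions: matching is done on single lowercased tokens, labels are written without IOB prefixes as in the algorithm listing, and the three small dictionaries are hypothetical stand-ins for the lists of 250 disease names, 668 living things and 92 special days.

```python
def post_process(pairs, disease_names, living_things, special_days):
    """Override CRF labels with dictionary labels, in the order of Algorithm-1.

    `pairs` is a list of (token, label) tuples produced by the classifier.
    Assumption: lookups are case-insensitive and operate on single tokens.
    """
    result = []
    for token, label in pairs:
        key = token.lower()
        if key in disease_names:
            label = "DISEASE"
        elif key in living_things:
            label = "LIVINGTHINGS"
        elif key in special_days:
            label = "SPECIALDAYS"
        # Otherwise the label predicted by the CRF is kept unchanged.
        result.append((token, label))
    return result


# Hypothetical miniature dictionaries, for illustration only
DISEASES = {"malaria"}
LIVING = {"tiger"}
DAYS = {"diwali"}

crf_output = [("malaria", "O"), ("tiger", "O"), ("Diwali", "O"), ("movie", "O")]
print(post_process(crf_output, DISEASES, LIVING, DAYS))
```

Because the branches are checked in a fixed order, a token present in more than one dictionary receives the label of the first matching dictionary, which mirrors the if / else-if chain of the algorithm.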
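The F-scores reported in Table-3 and Table-4 are consistent with the balanced F-measure, F = 2PR / (P + R). The paper does not state the formula explicitly, so treating F as the harmonic mean of precision and recall is an assumption; the short check below reproduces the reported values from the published precision and recall figures.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported for Deepak-IITPatna
print(round(f_score(81.15, 50.39), 2))  # English-Hindi: 62.17
print(round(f_score(79.92, 30.47), 2))  # English-Tamil: 44.12
```

The gap between the two language pairs comes almost entirely from recall (50.39% against 30.47%), since precision is close to 80% in both cases.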