Conditional Random Fields for Code Mixed Entity Recognition [NLP_CEN_AMRITA@CMEE-IL-FIRE-2016]

Barathi Ganesh HB
Artificial Intelligence Practice, Tata Consultancy Services, Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
Entity Recognition is an essential part of Information Extraction, where explicitly available information and the relations between entities are extracted from the text. A plethora of information is available as text in social media, and its free-style representation introduces much complexity into mining information from it. This complexity is increased further when the text is written in more than one language and uses transliterated words. In this work we apply a sequential modeling algorithm with hybrid features to perform Entity Recognition on the corpus provided by the CMEE-IL (Code Mixed Entity Extraction - Indian Language) organizers. The approach performed well on both the Tamil-English and Hindi-English tweet corpora, attaining nearly 95% precision on the training corpora and F-measures of 45.17% (Hindi-English) and 31.44% (Tamil-English) on the test corpora.

Keywords
Entity Recognition; Sequential Modeling; Conditional Random Fields

1. INTRODUCTION
The information shared by people in this digital era grows continuously (Facebook: www.facebook.com, Twitter: www.twitter.com). Mining information from these social media texts has become essential for both the government and industrial sectors. Moreover, these texts serve as an information source for different text applications [13].

Entity Recognition is one of the key components in information extraction applications; it can be used to extract the implicitly and explicitly available information and the relations between pieces of information [8, 11]. Entity Recognition is the task of assigning words or phrases in a text to a predefined set of real-world entity types such as person, location, organization, etc. [10].

Due to the constraints imposed by social media platforms (number of words and formats) and the absence of proper constraints on the shared text itself (grammar and improper words), mining information from social media text has become complex. When the shared text incorporates multiple languages and transliterated words, building a fully automated analytics system becomes considerably more complex [7]. So far, text analytics applications have focused on English text alone, but in recent works it can be observed that researchers have started contributing towards code mixed text analytics applications [12, 5, 3, 1].

Observing the above, we experiment with a sequential modeling algorithm, Conditional Random Fields (CRF), together with hybrid features, to perform entity extraction on code mixed social media texts (i.e. tweets). A set of corpus-based lexicon features is extracted from the words in the tweets to build a Random Forest Tree based binary classifier (Entity, Non-Entity), which predicts whether a given word is an entity. Along with this binary result, other common lexicon features are utilized to build the CRF based entity recognizer.

The remainder of the paper details the CRF for entity recognition in Section 2 and the Random Forest Tree binary classifier in Section 3; Section 4 covers the feature engineering, the experiments and the observations on the results achieved, and Section 5 concludes.

2. SEQUENTIAL MODELING WITH CONDITIONAL RANDOM FIELDS
Over the last few years, CRF has become the leading algorithm in sequential modeling applications such as Part Of Speech tagging and Named Entity Recognition [9, 2]. CRF is a discriminative, undirected probabilistic graphical model, generally used in structured prediction applications. Unlike ordinary classification methods, CRF can classify a whole sequence of samples (i.e. it loads context from the neighbouring words).

The advantages of CRF over other sequential modeling algorithms are that it avoids the label biasing problem; that the conditional probability distribution is defined over the target label sequence (i.e. the sequence of tags) given an input sequence (i.e. the sequence of words); and that it can easily include a wide variety of arbitrary, non-independent features of the input words [6]. Let x_{1:N} be the word sequence and y_{1:N} the output label sequence; then the CRF can be mathematically represented as

    p(y_{1:N} \mid x_{1:N}) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{N} \Big[ \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k s_k(y_i, x, i) \Big] \Big)    (1)

    Z = \sum_{y_{1:N}} \exp\Big( \sum_{i=1}^{N} \Big[ \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k s_k(y_i, x, i) \Big] \Big)    (2)

In the above equations, x represents the input word sequence (e.g. [Vijay, acted, in, a, film, Sura]) and y the output label sequence ([Actor, other, other, other, other, Entertainment]). t_j(y_{i-1}, y_i, x, i) is a transition function constrained by the feature function as given in equation (3), i.e. the probability of the label changing from one label to another, learned from the training corpus, for the change of label between positions i-1 and i in the test sequence. s_k(y_i, x, i) is similar to the emission probability in a Hidden Markov Model, but is constrained by a feature function in the same way as t_j(y_{i-1}, y_i, x, i). Z is the normalization factor, and \lambda_j, \mu_k are the optimization parameters learned from the training corpus.

The transition function t_j(y_{i-1}, y_i, x, i) and emission function s_k(y_i, x, i) take on non-zero values only if b(x, i) is greater than 0. b(x, i) will be greater than 0 if the current state (in the case of the emission functions), or the previous and current states (in the case of the transition functions), take on particular values with respect to the training corpus. An example b(x, i) activation function is given below:

    t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \text{other and } y_i = \text{Entertainment} \\ 0 & \text{otherwise} \end{cases}    (3)

In the above equation, b(x, i) will be greater than 0 only if the two labels (other, Entertainment) occur consecutively in the training set. From this it is clear that the transition and emission functions are constrained by the feature function b(x, i). Incorporating relevant features from the training set leads to a high-performance sequential modeling system. The nominal and binary features utilized in the proposed approach are given in Table 1.

Feature                                                      Binary   Nominal
Type of the word (all upper, all digit, alphanumeric word,     X
  all symbols, all letter, first letter capital)
Shape of the word, e.g. Vijay -> Uuuuu,                        X        X
  11-12-1991 -> nnsnnsnnnn
Part of speech tag                                                      X
Prefix of length 1 to 4, e.g. Parking -> P, Pa, Par, Park               X
Suffix of length 1 to 4, e.g. Parking -> g, ng, ing, king               X
Length of word                                                          X
Entity or not (decision from Random Forest Tree classifier)    X

Table 1: CRF Features
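To make the feature functions concrete, the following is a minimal sketch of how Table 1 style lexical features could be fed to sklearn-crfsuite (the CRF implementation used in Section 4). The feature names, the toy sentence and the hyper-parameter values are illustrative assumptions, not the exact configuration used in this work.

    # Minimal sketch: Table 1 style lexical features fed to sklearn-crfsuite.
    # Feature names and the toy example are illustrative; the exact feature
    # set and hyper-parameters used in this work may differ.
    import sklearn_crfsuite

    def word2features(sent, i):
        word = sent[i]
        return {
            'word.lower': word.lower(),
            'word.isupper': word.isupper(),   # "all upper" binary feature
            'word.isdigit': word.isdigit(),   # "all digit" binary feature
            'word.istitle': word.istitle(),   # "first letter capital"
            'prefix1': word[:1], 'prefix2': word[:2],
            'prefix3': word[:3], 'prefix4': word[:4],
            'suffix1': word[-1:], 'suffix2': word[-2:],
            'suffix3': word[-3:], 'suffix4': word[-4:],
            'length': len(word),
        }

    # Toy training pair mirroring the running example in the text.
    X_train = [[word2features(s, i) for i in range(len(s))]
               for s in [['Vijay', 'acted', 'in', 'a', 'film', 'Sura']]]
    y_train = [['Actor', 'other', 'other', 'other', 'other', 'Entertainment']]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))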
3. ENTITY SELECTION WITH RANDOM FOREST TREE
The "Entity or not" feature in Table 1 is a binary feature derived through a Random Forest Tree classifier. More than the other features in Table 1, this binary feature constrains the feature function in the CRF when finding the distribution over the output labels.

Random Forest Tree is a classification algorithm in which the final class is the most frequently occurring class among a set of weak decision trees [4]. In this approach, lexicon-based features are extracted from the entity words, and with these features as attributes the Random Forest Tree classifier predicts the class (entity, not an entity) of a given word.

Given a training set W = w_1, w_2, w_3, ..., w_n (words) with output labels Y = y_1, y_2, y_3, ..., y_n (entity, not an entity) and feature set F = f_1, f_2, f_3, ..., f_n, bagging is performed repeatedly (B times, the number of trees) by selecting random samples and attributes from the training set and building a decision tree for each set. The prediction for the test words \hat{W} is then found by averaging the predictions of all the individual decision trees built from the training set. This can be written as

    f_b = f(W_b, Y_b, F_b)    (4)

    \hat{Y} = \frac{1}{B} \sum_{b=1}^{B} f_b(\hat{W}, \hat{F})    (5)

Corpus-based lexicon features are extracted in order to train the above classifier. Initially, a feature set is built from the entity words available in the Tamil-English and Hindi-English corpora. Taking these features as a vocabulary, a Term-Document Matrix (TDM) is built over the words. This matrix, along with the binary labels (entity, not an entity), is fed to the Random Forest Tree to make the decision. The feature set of the TDM includes the prefixes and suffixes of length 1 to 3 of the words, the length of the words, and the position of the word in its tweet.
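As an illustration, the following sketch builds such a word-level entity / not-an-entity classifier with scikit-learn. Character n-grams from CountVectorizer stand in for the prefix-suffix vocabulary, and the word list, labels and n_estimators value are placeholder assumptions rather than the actual training data or settings.

    # Sketch of the binary (entity / not-an-entity) word classifier described
    # above: prefix-suffix style character n-grams form the TDM vocabulary
    # and a Random Forest votes over it. The tiny word list is a placeholder.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    words = ['Vijay', 'Sura', 'acted', 'film', 'the']   # placeholder corpus
    labels = [1, 1, 0, 0, 0]                            # 1 = entity word

    # char_wb n-grams of length 1-3 approximate the prefix/suffix features.
    vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
    tdm = vectorizer.fit_transform(words)               # Term-Document Matrix

    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(tdm, labels)

    print(forest.predict(vectorizer.transform(['Surya'])))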
4. EXPERIMENT AND OBSERVATIONS

[Figure 1: Model Diagram of Proposed Approach]

The overall approach was run on a system with the following specification: Linux operating system, Python 3.4, 16 GB RAM and an 8-core processor. CRF is performed with sklearn-crfsuite (pypi.python.org/pypi/sklearn-crfsuite), the TDM matrix is built with the CountVectorizer from the sklearn library (scikit-learn.org), the Random Forest Tree classifier is also from sklearn, part of speech tagging is done with the NLTK library (www.nltk.org), and preprocessing uses tweet-preprocessor (github.com/s/preprocessor).

The statistics of both data-sets are given in Table 2 and Table 3. Initially, the raw tweets are tagged with the corresponding entities of Table 3 according to the annotation file provided by the Code Mixed Entity Extraction - Indian Language task organizers.

Description              Tamil-English   Hindi-English
# Tweets                 3184            2701
# Unique tweets          2821            2669
# Tags                   1624            2413
# Unique tags            21              21
# Entity words           1624            2413
# Unique entity words    1016            1200
# Words                  32142           43766
Avg # words / tweet      10.1            16.2
Entity-word ratio        5.1%            5.5%

Table 2: Data-set Statistics

Entity           Tamil-English   Hindi-English
ARTIFACT         18              25
LIVTHINGS        16              7
DISEASE          5               7
COUNT            94              132
DATE             14              33
FACILITIES       23              10
PERSON           661             712
DISTANCE         4               -
SDAY             6               23
MONTH            25              10
DAY              15              67
PLANTS           3               1
MATERIALS        28              24
TIME             18              22
MONEY            66              25
ENTERTAINMENT    260             810
LOCATION         188             194
LOCOMOTIVE       5               13
ORGANIZATION     68              109
PERIOD           53              44
YEAR             54              143
QUANTITY         -               2

Table 3: Entity Tags Statistics

Since the given data-sets consist of tweets, the tendency for noise is high, and unwanted text and non-text information would yield a sequential model with low performance. This unwanted information, namely web links and emoticons, is removed from the tweets with the twitter preprocessor.
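This cleaning step could be reproduced with the tweet-preprocessor package along the following lines; the sample tweet and the particular option set chosen here are illustrative assumptions.

    # Sketch of the noise-removal step using the tweet-preprocessor package
    # (github.com/s/preprocessor); the sample tweet is invented.
    import preprocessor as p

    # Strip only web links, emojis and smileys, keeping the remaining text.
    p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)

    tweet = 'Vijay acted in film Sura :) http://t.co/xyz'
    print(p.clean(tweet))   # -> 'Vijay acted in film Sura'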
Following the preprocessing step, a set of corpus-based features is extracted from the entity words in a tweet to build the Random Forest Tree based binary classifier. For this extraction, all the entities in the training corpus are first re-tagged as 'Entity' and all other words as 'not an Entity'. From the entity words present in the training corpus, the corresponding prefixes and suffixes of length 1 to 4 are taken to build the vocabulary for the TDM matrix using CountVectorizer.

The TDM matrix is built from the presence of this prefix and suffix information within the words. Along with the TDM matrix, the length of the word, the position of the word in its tweet and the total number of times the prefix or suffix is present in the corpus are taken as attributes to train the Random Forest Tree classifier. √N trees are used to build the Random Forest, where N is the total number of attributes. The testing corpus is passed through the same steps to decide whether a given word is an entity. To measure the training performance, 10-fold cross-validation is carried out, obtaining nearly 96% and 97% respectively for the Tamil-English and Hindi-English corpora.

With this binary feature obtained, the other features mentioned in Table 1 are extracted from the training corpus. A window of length 5 is taken to capture the context of a word as well as its features, by taking the previous two words and the following two words around the current word. Using these features as the constraint function, the CRF sequential model is built for the entity recognition task. Features are extracted in the same way for testing, and output labels are predicted for the input test word sequences. Finally, words with consecutive output labels are concatenated together to form phrases with a single tag. To assess the training performance, cross-validation is carried out as for the Random Forest Tree, obtaining nearly 94% precision for both corpora.
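A minimal sketch of this window-of-5 context loading is given below; the feature names and the padding token are illustrative assumptions rather than the exact scheme used in this work.

    # Sketch of the window-of-length-5 context loading described above:
    # each word's feature dict also carries features of the two words
    # before and after it. Feature names are illustrative.
    def sent2features(sent):
        feats = []
        for i, word in enumerate(sent):
            f = {'word': word.lower(), 'istitle': word.istitle()}
            for off in (-2, -1, 1, 2):           # previous two / next two words
                j = i + off
                if 0 <= j < len(sent):
                    f['%+d:word' % off] = sent[j].lower()
                else:
                    f['%+d:word' % off] = 'PAD'  # sentence boundary marker
            feats.append(f)
        return feats

    print(sent2features(['Vijay', 'acted', 'in', 'a', 'film', 'Sura'])[2])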
The performance of the top 5 teams against the test sets is given in Table 4 and Table 5. It can be observed that the precision of the proposed system is within about 3.5 percentage points of the top score on the Hindi-English corpus and almost equal to it on the Tamil-English corpus. The problem arises with the recall, which affects the final F-measure; hence our future work will focus on improving the recall of the proposed system.

Team                     Precision   Recall   F
Deepak-IIT-Patna         79.92       30.47    44.12
Veena-Amritha-T1         79.51       21.88    34.32
Bharathi-Amrita-T2       79.56       19.59    31.44
Rupal-BITS-Pilani-R2     58.71       12.21    20.22
Shivkaran-Amritha-T3     47.62       13.42    20.94

Table 4: Results: Tamil-English

Team                     Precision   Recall   F
Irshad-IIIT-Hyd          80.92       59.00    68.24
Deepak-IIT-Patna         81.15       50.39    62.17
Veena-Amritha-T1         79.88       41.37    54.51
Bharathi-Amrita-T2       77.72       31.84    45.17
Rupal-BITS-Pilani        58.84       35.32    44.14

Table 5: Results: Hindi-English

5. CONCLUSION
Conditional Random Field based Entity Recognition with hybrid features was experimented on the CMEE-IL (Code Mixed Entity Extraction - Indian Language) corpora and attained good performance. The approach performed well on both the Tamil-English and Hindi-English tweet corpora, attaining nearly 95% precision on the training corpora and F-measures of 45.17% (Hindi-English) and 31.44% (Tamil-English) on the test corpora. Pre-processing of social media text is an essential part of the pipeline: it improves the feature engineering (reducing sparsity) and boosts the performance of the proposed system. Hence, future work will focus on incorporating the necessary pre-processing steps along with the proposed approach.

6. REFERENCES
[1] N. Abinaya, N. John, H. B. Barathi Ganesh, M. Anand Kumar, and K. Soman. Amrita-CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. Pages 103-111, December 2014.
[2] H. B. Barathi Ganesh, N. Abinaya, M. Anand Kumar, R. Vinayakumar, and K. Soman. Amrita-CEN@NEEL: Identification and linking of Twitter entities. 2015.
[3] U. Barman, A. Das, J. Wagner, and J. Foster. Code mixing: A challenge for language identification in the language of social media. Volume 13, 2014.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, October 2001.
[5] A. Das and B. Gamback. Code-mixing in social media text.
[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282-289, June 2001.
[7] D. Maynard, K. Bontcheva, and D. Rout. Challenges in developing opinion mining tools for social media. Pages 15-22, 2012.
[8] J. Piskorski and R. Yangarber. Information extraction: Past, present and future. 2013.
[9] A. PVS and G. Karthik. Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Volume 21, 2007.
[10] A. Ritter, S. Clark, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP, pages 1524-1534, July 2011.
[11] J. Tang, M. Hong, D. Zhang, L. B, and L. J. Information extraction: Methodologies and applications. October 2007.
[12] Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. POS tagging of English-Hindi code-mixed social media content. In Proceedings of EMNLP, pages 974-979, October 2014.
[13] D. Westerman, P. Spence, and B. Van Der Heide. Social media as information source: Recency of updates and credibility of information. Volume 19, pages 171-183, January 2014.