CEN@Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text

Srinidhi Skanda V, Shivkaran Singh, Remmiya Devi G, Veena P V, Anand Kumar M, Soman K P
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
skanda9051@gmail.com, shivkaran.ssokhey@gmail.com, m_anandkumar@cb.amrita.edu

ABSTRACT
This paper presents the working methodology and results of our system for the Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared task of FIRE 2016. The aim of the task is to identify entities such as person, organization, movie, and location names in code-mixed tweets written in English mixed with Hindi or Tamil. In this work, an entity extraction system is implemented for both Hindi-English and Tamil-English code-mixed tweets. The system employs context-based character embedding features to train a Support Vector Machine (SVM) classifier. The training data was tokenized so that each line contains a single word, and these words were further split into characters. The embedding vectors of these characters, appended with I-O-B tags, were used for training the system. During the testing phase, we use the same context embedding features to predict the entity tags for the test data. We observed that the cross-validation accuracy using character embeddings was better for the Hindi-English Twitter dataset than for the Tamil-English Twitter dataset.

CCS Concepts
• Information Retrieval ➝ Retrieval tasks and goals ➝ Information Extraction
• Machine Learning ➝ Machine Learning approaches ➝ Kernel Methods ➝ Support Vector Machines

Keywords
Named-entity recognition, Information extraction, Code-mixed text, Support vector machine (SVM), Word embedding, Context-based character embedding.

1. INTRODUCTION
The explosion of social networking sites such as Facebook, Twitter, and Instagram in a linguistically diverse country like India has led many users to mix multiple languages in tweets, personal blogs, etc. The scarcity of proper input tools for Indian languages forces users to mix the Roman script with the native script, which leads to code-mixed language. This code-mixing further complicates the application of Natural Language Processing (NLP) methods. One such application is Named Entity Recognition (NER), which identifies and classifies entities into categories such as person, organization, and date-time. NER is used in text mining, information retrieval, machine translation, and other applications. NER for Indian languages is already a complex task, and code-mixing complicates it further.

The shared task on Code Mix Entity Extraction in Indian Languages (CMEE-IL)¹ held at the Forum for Information Retrieval Evaluation (2016) focuses on extracting entities from code-mixed Twitter data.

Many works have used word embeddings as features for entity recognition, and a few have used character embeddings, such as entity recognition with neural character embeddings by dos Santos et al. [1] and text segmentation using character-level text embeddings by Chrupała [2]. Other works on code-mixed data include SVM-based classification for entity extraction in Indian languages [3], entity extraction based on Conditional Random Fields [4], and identification and linking of entities in tweets [10].

We submitted a system based on character embeddings as the feature representation. To obtain character embedding features, the word2vec model [5] is used: it maps each input character to an n-dimensional vector. The feature representation obtained from word2vec is used to train the classifier, a Support Vector Machine (SVM) [6].

An outline of the task is given in Section 2. The implemented system is described in Section 3. Section 4 presents the experiments and results, and Section 5 concludes.

¹ http://www.au-kbc.org/nlp/CMEE-FIRE2016/

2. CODE MIX ENTITY EXTRACTION (CMEE-IL) TASK
Hindi-English and Tamil-English code-mixed tweets are given as training data for the CMEE-IL shared task. The task is to extract the named entities along with their named entity tags. The training and testing datasets provide raw tweets and annotations in separate files for each language pair. The given raw text file contains the Tweet-ID, User-ID, and the corresponding tweet. The annotation file is arranged in columns representing Tweet-ID, User-ID, Named Entity (NE) type, Named Entity string, start index, and length of the string. Participants were asked to submit a similar annotation file after extracting all entity-related information from the test data. There are 18 types of entities in the training dataset, among which person, location, and entertainment entities are more frequent than the other types. The number of tweets and tokens in the training and testing corpora for both language pairs is shown in Table 1.

Table 1. Dataset count

Tweet Type      # of Tweets          # of Tokens
                Train     Test       Train      Test
Hindi-English   2700      7429       130866     130868
Tamil-English   3200      1376       41466      20190

Sample tweets from the training dataset for code-mixed Hindi-English and Tamil-English can be seen in Table 2. We noticed that some tweets in the Tamil-English corpus were written completely or partially in the Tamil script; no such tweets were found in the Hindi-English corpus.

Table 2. Examples of code-mixed tweets

English-Tamil:  @_madhumitha_ next time im in chennaidef going over like "hello ennakuongapethitheriyum"
                @thoatta நான் ச ால் ல வரது டீஸர் டிரரலர் நல் லா இருக்குனு ச ான்ன படங் கள எல் லாம் சவற் றி அரடயல. over expectations nala flop airumnu STD ச ால் லுது
English-Hindi:  @aixo_anjum sister me aesy bkwas nhi krta na muje shoq he I liked your pinned tweet at that time too just wait I shall give you proof

Further, we analyzed the corpus for the counts of major entities in the training data; these are shown in Table 3. It can be observed that entities with Entertainment tags are the most frequent.

Table 3. Most frequent entity tags

Code-Mixed Lang.   Person   Location   Organization   Entertainment
Hindi-English      712      194        109            810
Tamil-English      661      188        68             260

3. SYSTEM DESCRIPTION
In the proposed system, feature extraction plays a vital role, as the accuracy of the system relies mainly on the extracted features. Before extracting any features, the dataset must be in an appropriate format for the learning algorithm; the I-O-B tagging scheme was used for this format conversion. The first step of the conversion was to tokenize the given data and arrange one word (token) per line. To identify the start and end of each tweet, we added a tag, which makes further processing easier.

The proposed system is based on character-based context embedding. The basic intuition behind it is shown in Figure 1: the word W0 ("LIKE") is split into its characters, and each character is represented by a 50-dimensional vector (the vector length is user-defined). These character embeddings are further concatenated to form a word vector.
Figure 1. Character-based context embedding. (The capitalization of the characters is for representation purposes only.)

This word vector is appended to the corresponding word. The 50-dimensional representation is then converted into a 150-dimensional one by appending the vectors of the neighboring context words: W-1 denotes the previous context word of W0, and W+1 denotes the next context word. Appending the vectors of W-1, W0, and W+1 is called context appending.

The overall methodology is split into two modules, each described in a separate figure: Figure 2 shows the method followed for feature representation, and Figure 3 shows the steps involved in classification.

The first module, depicted in Figure 2, covers feature representation. We collected additional datasets from different sources to improve the accuracy of the system. The additional dataset for Tamil-English was collected from Sentiment Analysis in Indian Languages (SAIL-2015) [9]. For Hindi-English, we used data from the ICON 2015 POS tagging task [8], Mixed Script Information Retrieval 2016 (MSIR) [7], and some Twitter data collected by web scraping. The combined file of the training data and the additional data was then tokenized, and the tokenized words were split into their characters. These characters are fed to the vector embedding module, whose output is a 1 × 50 vector for every character in our corpus. These character vectors are combined back into word vectors by an appending module.
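The appending and context-appending steps can be sketched as follows. The paper does not spell out how the per-character vectors are collapsed into a single 50-dimensional word vector, so averaging is used here as one plausible choice; the character vectors are random placeholders standing in for trained word2vec embeddings.

```python
# Hedged sketch of the appending module and context appending:
# character vectors (placeholders here) -> 1 x 50 word vector ->
# 1 x 150 context vector [W-1, W0, W+1].
import numpy as np

rng = np.random.default_rng(0)
DIM = 50
char_vecs = {c: rng.normal(size=DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def word_vector(word):
    """Collapse a word's character embeddings into one 1 x 50 vector
    (averaging is our assumption; the paper leaves this unspecified)."""
    return np.mean([char_vecs[c] for c in word], axis=0)

def context_vector(prev_word, word, next_word):
    """Append the W-1, W0 and W+1 vectors into a 1 x 150 feature vector."""
    return np.concatenate([word_vector(prev_word),
                           word_vector(word),
                           word_vector(next_word)])

features = context_vector("i", "like", "movies")
print(features.shape)   # (150,)
```

Each token thus contributes one 150-dimensional feature vector carrying information about the word and its two neighbors.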
The embedding features for each word are further subjected to context appending, resulting in a vector of length 1 × 150. These vectors carry information about the word as well as its neighboring word contexts.

Figure 2. Feature extraction

Format conversion takes the raw tweets from the given training dataset together with the annotation file and passes them to the I-O-B module, which converts the training data into I-O-B format: the training tokens along with their corresponding I-O-B tags. This formatted training set is appended with the feature vectors to form the training data that is fed to the classifier.

Figure 3. Architecture of the implemented system

4. RESULT AND ANALYSIS
This paper describes an entity extraction system for code-mixed Twitter data based on character embeddings. In this approach, the words in the raw code-mixed training data are tokenized one word per line and further split into characters. Each character is embedded as an n-dimensional vector to produce feature vectors, which are appended with I-O-B tags to form feature-label pairs. These pairs are fed to a Support Vector Machine (SVM) classifier to train a classification model. During the testing phase, the test data undergoes the same character embedding to be represented as feature vectors, which are fed to the trained model to predict the entity tags. The output for the test set is then converted into the required annotated format, containing the Tweet-ID, the User-ID, each word with its predicted tag, the start index, and the length. Table 4 shows the cross-validation results for both the Hindi-English and Tamil-English datasets: the overall accuracy of character embedding is 95.996% for Hindi-English and 94.3451% for Tamil-English.
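The conversion of per-token I-O-B predictions back into the annotation format described above can be sketched as follows. The tag names and the sample tweet are illustrative, not taken from the task data.

```python
# Hedged sketch: group B-/I- tagged tokens into entities and emit rows of
# (Tweet-ID, User-ID, entity type, entity string, start index, length).
def iob_to_annotations(tweet_id, user_id, tokens, tags, tweet_text):
    """Walk the tag sequence, merging each B- tag with its following
    I- tags of the same type, and locate the entity in the raw tweet."""
    annotations, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype, words = tags[i][2:], [tokens[i]]
            i += 1
            while i < len(tokens) and tags[i] == "I-" + etype:
                words.append(tokens[i])
                i += 1
            entity = " ".join(words)
            start = tweet_text.find(entity)
            annotations.append((tweet_id, user_id, etype, entity,
                                start, len(entity)))
        else:
            i += 1
    return annotations

tweet = "watched Kabali in Chennai"          # hypothetical example tweet
tokens = tweet.split()
tags = ["O", "B-ENTERTAINMENT", "O", "B-LOCATION"]
print(iob_to_annotations("123", "456", tokens, tags, tweet))
```

This mirrors the submission format of the task: one row per extracted entity, with its type, surface string, character offset, and length.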
Table 4. Cross-validation results for character embedding

Tweet Type      Overall accuracy   Known     Ambiguous Known   Unknown
Hindi-English   95.9962            97.6447   83.8945           92.1155
Tamil-English   94.3451            95.8483   91.7017           89.6086

Figure 3 describes the procedure involved in classification. During the training phase, the classification model is built from the feature-label pairs. During the testing phase, the test data along with its embedding features is fed to the model to predict entity tags. In Figure 3, the context vectors represent the feature-label pairs that are fed to the SVM module. We used the SVM-Light tool [6] to build the SVM-based classifier model. The classifier takes the features and labels as input and trains a classification model, from which the system predicts the entity tags for the test data. The predicted tags extracted from the output file are then converted to the required annotated format.

Tables 5 and 6 display the results of the shared task. The implemented system ranked eighth on the Hindi-English Twitter dataset with precision 48.17 and F-measure 32.83. For Tamil-English, our system ranked fifth with precision 47.62 and F-measure 20.94.

Table 5. Results on Hindi-English Twitter data (best run)

Rank   Team               Precision   Recall   F-Measure
1      Irshad-IIIT-Hyd    80.92       59.00    68.24
2      Deepak-IIT-Patna   81.15       50.39    62.17
3      Veena-Amrita-T1    79.88       41.37    54.51
8      CEN@Amrita         48.17       24.90    32.83

Table 6. Results on Tamil-English Twitter data (best run)

Rank   Team                  Precision   Recall   F-Measure
1      Deepak-IIT-Patna      79.92       30.47    44.12
2      Veena-Amrita-T1       79.51       21.88    34.32
3      Bharathi-Amritha-T2   79.56       19.59    31.44
5      CEN@Amrita            47.62       13.42    20.94

5. CONCLUSION AND FUTURE SCOPE
The proposed system for code-mixed entity extraction was submitted as part of the shared task on code-mixed entity extraction in Indian languages (CMEE-IL) conducted at FIRE 2016. The task involves extracting entities from code-mixed tweets; the given dataset consists of Hindi-English and Tamil-English code-mixed tweets. In our work, we implemented a character embedding based system. We conclude that the overall accuracy of character embedding is better for Hindi-English than for Tamil-English.

A few issues decreased the performance of the system:
1. While splitting words into characters, some characters were not properly represented, due to encoding problems and noise in the dataset.
2. The performance of the implemented system could be improved by using RNN- or CNN-based models.

6. ACKNOWLEDGMENT
We would like to thank the task organizer, the Forum for Information Retrieval Evaluation, as well as the organizers of the CMEE-IL task.

7. REFERENCES
[1] Dos Santos, C., and Guimarães, V. 2015. Boosting named entity recognition with neural character embeddings. In Proceedings of NEWS 2015, The Fifth Named Entities Workshop, p. 25.
[2] Chrupała, G. 2013. Text segmentation with character-level text embeddings. arXiv preprint arXiv:1309.4628.
[3] Anand Kumar, M., Shriya, S., and Soman, K.P. 2015. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587:85–88.
[4] Sanjay, S., Anand Kumar, M., and Soman, K.P. 2015. AMRITA-CEN-NLP@FIRE 2015: CRF based named entity extraction for Twitter microposts. CEUR Workshop Proceedings, 1587:96–99.
[5] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
[6] Joachims, T. 1999. SVM-Light: Support Vector Machine. http://svmlight.joachims.org/, University of Dortmund, 19(4).
[7] Shared task on Mixed Script Information Retrieval. https://msir2016.github.io, 2016.
[8] Vyas, Y., Gella, S., Sharma, J., Bali, K., and Choudhury, M. 2014. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 974–979.
[9] Patra, B.G., Das, D., Das, A., and Prasath, R. 2015. Shared task on sentiment analysis in Indian languages (SAIL) tweets - an overview. In International Conference on Mining Intelligence and Knowledge Exploration, pages 650–655.
[10] Barathi Ganesh, H.B., Abinaya, N., Anand Kumar, M., Vinayakumar, R., and Soman, K.P. 2015. AMRITA-CEN@NEEL: Identification and linking of Twitter entities. Making Sense of Microposts (#Microposts2015).
[11] Saha, S.K., Chatterji, S., Dandapat, S., Sarkar, S., and Mitra, P. 2008. A hybrid approach for named entity recognition in Indian languages. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pp. 17–24.
[12] Abinaya, N., Neethu John, Barathi Ganesh, H.B., Anand Kumar, M., and Soman, K.P. 2014. AMRITA_CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. In Proceedings of the Forum for Information Retrieval Evaluation, pp. 103–111.
[13] Xue, B., Fu, C., and Shaobin, Z. 2014. A study on sentiment computing and classification of Sina Weibo with word2vec. In 2014 IEEE International Congress on Big Data, pp. 358–363.
[14] Le, Q.V., and Mikolov, T. 2014. Distributed representations of sentences and documents. In ICML, vol. 14, pp. 1188–1196.