CMEE-IL: Code Mix Entity Extraction in Indian Languages from Social Media Text @ FIRE 2016 – An Overview

Pattabhi RK Rao
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet, Chennai, India
+91 44 22232711
pattabhi@au-kbc.org

Sobha Lalitha Devi
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet, Chennai, India
+91 44 22232711
sobha@au-kbc.org

ABSTRACT
The penetration of smart devices such as mobile phones and tablets has significantly changed the way people communicate. This has led to the growth of social media tools such as Twitter and Facebook chats for communication, and with it to new challenges and perspectives in language technology research. Automatic processing of such texts requires new methodologies, and there is a great need for automatic systems for information extraction, retrieval and summarization. Entity recognition is an important subtask of information extraction and finds applications in information retrieval, machine translation and other higher-level Natural Language Processing (NLP) applications such as coreference resolution. The main issues in handling such social media texts are i) spelling errors, ii) abbreviated new vocabulary such as "gr8" for "great", iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hashtags, and v) code mixing. Entity recognition and extraction has gained increased attention in the Indian research community; however, there is no benchmark data on which all these systems can be compared, per language, for this new generation of user-generated text. Towards this we organized the Code Mix Entity Extraction in Indian Languages (CMEE-IL) track on social media text at the Forum for Information Retrieval Evaluation (FIRE). We present an overview of the CMEE-IL 2016 track, describe the corpora created for Hindi-English and Tamil-English, and summarize the approaches used by the participants.
CCS Concepts
• Computing methodologies ~ Artificial intelligence
• Computing methodologies ~ Natural language processing
• Information systems ~ Information extraction

Keywords
Entity Extraction; Social Media Text; Code Mixing; Twitter; Indian Languages; Tamil; Hindi; English; Named Entity Annotation Corpora for Code Mix Twitter data.

1. INTRODUCTION
Over the past decade, Indian language content on media such as websites, blogs, email and chats has increased significantly, and with the advent of smart phones more people use social media such as Twitter and Facebook to comment on people, products, services, organizations and governments. Content growth is driven by people from non-metros and small cities who are mostly more comfortable in their own mother tongue than in English, and Indian language content is expected to grow by more than 70% every year. Hence there is a great need to process this huge volume of data automatically. Companies in particular are interested in ascertaining public opinion on their products and processes. This requires natural language processing systems that recognize entities, their associations and the relations between them; hence automatic entity extraction systems are required.

The objectives of this evaluation are:
• the creation of benchmark data for entity extraction in Indian language code-mixed social media text;
• the development of Named Entity Recognition (NER) systems for Indian language social media text.

Entity extraction has been actively researched for over 20 years. Most of the research has, however, focused on resource-rich languages such as English, French and Spanish. The scope of this work covers named entity recognition in social media text (Twitter data) for Indian languages. In the past, events such as the Workshop on NER for South and South East Asian Languages (NER-SSEA, 2008) and the Workshop on South and Southeast Asian Natural Language Processing (SANLP, 2010 and 2011) were conducted to bring the various research works on NER onto a single platform. The NERIL tracks at FIRE (Forum for Information Retrieval and Evaluation) in 2013 and 2014 contributed benchmark data and boosted research on NER for Indian languages. All these efforts used newswire texts. User-generated texts such as Twitter and Facebook posts are diverse and noisy: they contain non-standard spellings, abbreviations and unreliable punctuation. Apart from these writing-style and language challenges, another challenge is concept drift (Dredze et al., 2010; Fromreide et al., 2014): the distribution of language and topics on Twitter and Facebook is constantly shifting, leading to performance degradation of NLP tools over time.

Some of the main issues in handling such texts are i) spelling errors, ii) abbreviated new vocabulary such as "gr8" for "great", iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hashtags, and v) code mixing. For example:

"Muje kabi bhoolen gy to nhi na? :( Want ur sweet feedback about my FC ? mai dilli jaa rahi hoon".
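The code-mixed tweet above exhibits several of these phenomena at once. As a rough illustration, such noise can be flagged programmatically; the patterns and the abbreviation list below are our own simplifications, not the track's preprocessing rules:

```python
import re

# Illustrative patterns for the noise types i)-v) listed above;
# these are simplified assumptions, not the track's actual rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")
ABBREVIATIONS = {"gr8": "great", "ur": "your", "b4": "before"}  # toy list

def describe_noise(tweet: str) -> dict:
    """Report which of the listed noise phenomena appear in a tweet."""
    tokens = tweet.split()
    return {
        "urls": URL_RE.findall(tweet),
        "hashtags": HASHTAG_RE.findall(tweet),
        "mentions": MENTION_RE.findall(tweet),
        "emoticons": EMOTICON_RE.findall(tweet),
        "abbreviations": [t for t in tokens if t.lower() in ABBREVIATIONS],
    }

print(describe_noise("Muje kabi bhoolen gy to nhi na? :( Want ur sweet feedback about my FC ?"))
# {'urls': [], 'hashtags': [], 'mentions': [], 'emoticons': [':('], 'abbreviations': ['ur']}
```

Detecting the code-mixing itself requires a language identifier, which is beyond this small sketch.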
Research on analyzing social media data has been taken up for English through various shared tasks. The language identification in tweets (TweetLID) shared task held at SEPLN 2014 addressed identifying tweets in six different languages. SemEval 2013, 2014 and 2015 held shared task tracks focused on sentiment analysis in tweets, with two subtasks: contextual polarity disambiguation and message polarity classification. For Indian languages, Amitav et al. (2015) organized a shared task titled 'Sentiment Analysis in Indian Languages' as part of MIKE 2015, where sentiment analysis was done for tweets in Hindi, Bengali and Tamil. Named entity recognition in Twitter was explored through the shared task on noisy user-generated text organized by Microsoft as part of ACL-IJCNLP 2015, which had two subtasks: Twitter text normalization and named entity recognition for English.

The ESM-IL track at FIRE 2015 was the first to provide entity-annotated benchmark data for social media text, but under an idealistic scenario in which users use only one language. Nowadays we observe that users code-mix even in writing on social media platforms. Thus there is a need to develop systems that focus on code-mixed social media texts. There have also been other efforts on code-mixed social media text in information retrieval applications (the MSIR tracks at FIRE 2015 and 2016).

The paper is organized as follows: Section 2 describes the challenges in named entity recognition for Indian languages. Section 3 describes the corpus annotation, the tag set and the corpus statistics. Section 4 gives an overview of the approaches used by the participants, and Section 5 concludes the paper.

2. CHALLENGES IN INDIAN LANGUAGE ENTITY EXTRACTION
The challenges in developing entity extraction systems for Indian language social media text arise from several factors. One of the main factors is that no annotated data of this kind has been available for any of the Indian languages, as the earlier initiatives concentrated on newswire text. Apart from the lack of annotated data, the other factors which differentiate Indian languages from European languages are the following:

a) Ambiguity – ambiguity between common and proper nouns. E.g., the common word "Roja", meaning rose flower, is also a person name.

b) Spell variations – one of the major challenges is that different people spell the same entity differently. For example, in Tamil the person name "Roja" is spelt "rosa" and "roja".

c) Fewer resources – most Indian languages are resource-poor. No automated tools that can handle social media text are available for the preprocessing tasks required for NER, such as part-of-speech tagging and chunking.

Apart from these challenges, we also find that developing automatic entity recognition systems is difficult for the following reasons:

i) Tweets contain a huge range of distinct named entity types. Almost all these types (except for persons and locations) are relatively infrequent, so even a large sample of manually annotated tweets contains very few training examples.

ii) Twitter's 140-character limit means tweets often lack sufficient context to determine an entity's type without the aid of background or world knowledge.

iii) In comparison with English, Indian languages have more dialectal variation. These dialects are mainly influenced by different regions and communities.

iv) Indian language tweets are multilingual in nature and predominantly contain English words.

The following example illustrates the usage of English words and spoken, dialectal forms in tweets.

Example 1 (Tamil):
Ta: Stamp veliyittu ivaga ativaangi …..
En: stamp released these_people get_beaten ….
Ta: othavaangi …. kadasiya kovai
En: get_slapped … at_end kovai
Ta: pooyi pallakaatti kuththu vaangiyaachchu.
En: gone show_tooth punch got
("They released a stamp, got slapped and beaten … at the end even reached Kovai and got punched on the face")

This is a Tamil tweet written in a particular dialect that also uses English words.

Similarly, in Hindi we find many spelling variations: words such as "mumbai", "gaandhi", "sambandh" and "thanda" each have at least three different spelling variants.
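Such transliteration-induced spell variation is commonly handled with approximate string matching. The following sketch is our own illustration, not a technique prescribed by the track; it greedily groups candidate spellings by a character-level similarity ratio, with the threshold chosen ad hoc:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two spellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_variants(spellings, threshold=0.6):
    """Greedily cluster spellings whose similarity to a cluster's
    first member exceeds the threshold."""
    clusters = []
    for s in spellings:
        for cluster in clusters:
            if similarity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Variants of the person name "Roja" and the place name "Mumbai"
print(group_variants(["roja", "rosa", "mumbai", "mumbay", "bombai"]))
# [['roja', 'rosa'], ['mumbai', 'mumbay', 'bombai']]
```

A real system would use a transliteration-aware distance rather than raw edit similarity, but the sketch shows the basic idea of conflating variant spellings before lookup.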
3. CORPUS DESCRIPTION
The corpus was collected using the Twitter API in two different time periods: the training partition during May–June 2015 and the test partition during August–September 2015. As explained in the sections above, Twitter data exhibits concept drift; collecting the data in two different periods lets us evaluate how well the systems handle it. In the present initiative the corpus is available for two code-mixed language pairs, Hindi-English and Tamil-English. Table 1 below shows the corpus statistics.

3.1 ANNOTATION TAGSET
The corpus for each language pair was annotated manually by trained experts. The named entity recognition task requires the entities mentioned in a document to be detected, their sense disambiguated, the attributes to be assigned to each entity selected, and the entity represented with a tag. Defining the tag set is therefore an important aspect of this work. The tag set should cover the major classes or categories of entities, and it should be usable at both coarse and fine granularity depending on the application; hence a hierarchical tag set is the suitable choice.

Although the Automatic Content Extraction (ACE) NE tag set has been used in most prior work, we use a different tag set here: the ACE tag set is fine-grained towards the defense/security domain, whereas the Government of India standardized tag set used in this work is more generic. This hierarchical tag set was developed at AU-KBC Research Centre and standardized by the Ministry of Communications and Information Technology, Govt. of India. It is used widely in the Cross Lingual Information Access (CLIA) and Indian Language – Indian Language Machine Translation (IL-IL MT) consortium projects.

In this tag set the named entity hierarchy is divided into three major classes: Entity Name, Time and Numerical expressions. The Name hierarchy has eleven attributes, and the Numerical Expression and Time hierarchies have four and seven attributes respectively. Person, Organization, Location, Facilities, Cuisines, Locomotives, Artifact, Entertainment, Organisms, Plants and Diseases are the eleven types of named entities. Numerical expressions are categorized as Distance, Money, Quantity and Count. Time, Year, Month, Date, Day, Period and Special day are considered time expressions. The tag set consists of a three-level hierarchy: the top (first) level has 22 tags, the second level 49 tags and the third level 31 tags, for a total of 102 tags in the schema. The data provided to the participants was annotated with only the first level of the hierarchy, i.e., only the 22 top-level tags; the other levels of tagging were hidden. This was done to make it a little easier for the participants to develop their systems using machine learning methods.
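For concreteness, the 22 first-level tags enumerated above can be collected into a small validation helper. Only the category names come from the tag set description; the exact label strings used in the released files are an assumption here:

```python
# First-level (coarse) tags of the hierarchical tag set described above.
# The label spellings are our assumption; only the category names are
# taken from the tag set description.
NAME_TAGS = {
    "PERSON", "ORGANIZATION", "LOCATION", "FACILITIES", "CUISINES",
    "LOCOMOTIVES", "ARTIFACT", "ENTERTAINMENT", "ORGANISMS",
    "PLANTS", "DISEASES",
}
NUMEX_TAGS = {"DISTANCE", "MONEY", "QUANTITY", "COUNT"}
TIMEX_TAGS = {"TIME", "YEAR", "MONTH", "DATE", "DAY", "PERIOD", "SPECIAL_DAY"}

LEVEL1_TAGS = NAME_TAGS | NUMEX_TAGS | TIMEX_TAGS
assert len(LEVEL1_TAGS) == 22  # 11 name + 4 numerical + 7 time tags

def is_valid_tag(tag: str) -> bool:
    """Check an annotation label against the first-level tag inventory."""
    return tag.upper().replace(" ", "_") in LEVEL1_TAGS
```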
The data statistics are as follows:

Table 1. Corpus Statistics

Language        No. of Tweets   No. of NEs
Hindi-English   10129           7573
Tamil-English   4576            2454

The NE distribution in both language datasets is dominated by Person, Location and Entertainment entities, which shows that the majority of the communication is about movies and persons.

3.2 DATA FORMAT
The participants were provided the annotation markup in a separate file called the annotation file; the raw tweets had to be downloaded separately using the Twitter API. The annotation file is a column-format file, with each column tab-separated. It consists of the following columns:

i) Tweet_ID
ii) User_ID
iii) NE_TAG
iv) NE raw string
v) NE start index
vi) NE length

For example:

Tweet_ID: 123456789012345678
User_ID: 1234567890
NE_TAG: ORGANIZATION
NE Raw String: SonyTV
Index: 43
Length: 6

The Index column is the starting character position of the NE, calculated within each tweet with the count starting from 0. The participants were instructed to provide the test file annotations in the same format as the training data.
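A minimal reader for this format might look as follows. This is a sketch under the assumptions just stated (one tab-separated record per line, columns in the order listed); the record type and function names are our own:

```python
import csv
from typing import NamedTuple, List

class NERecord(NamedTuple):
    tweet_id: str
    user_id: str
    ne_tag: str      # first-level tag, e.g. ORGANIZATION
    ne_string: str   # raw entity string as it appears in the tweet
    start: int       # 0-based character offset within the tweet
    length: int      # length of the entity string in characters

def read_annotations(path: str) -> List[NERecord]:
    """Parse a tab-separated CMEE-IL style annotation file."""
    records = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 6:
                continue  # skip malformed lines
            tid, uid, tag, ne, start, length = row
            records.append(NERecord(tid, uid, tag, ne, int(start), int(length)))
    return records

def check(record: NERecord, tweet_text: str) -> bool:
    """Sanity check: the stored offsets should recover the entity string."""
    return tweet_text[record.start:record.start + record.length] == record.ne_string
```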
4. SUBMISSION OVERVIEWS
In this evaluation exercise we used precision, recall and F-measure, which are the standard metrics for this task. A total of 21 teams registered for participation in this track; later, 9 teams were able to submit their systems for evaluation, with a total of 25 test runs. All the teams participated in the Hindi-English language pair, and 5 teams participated in the Tamil-English language pair.

We also developed a baseline system without any preprocessing of the data and without any lexical resources, using Conditional Random Fields (CRFs) over the raw data with no other features (a minimal sketch of such a baseline is given below). This baseline was built to enable a better comparative study, and it was observed that all the teams outperformed it.
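As an illustration of this kind of featureless CRF baseline, the sketch below uses the sklearn-crfsuite package, which is our assumption; the track does not specify how its baseline was implemented. The surface token is the only feature:

```python
# pip install sklearn-crfsuite   (assumed library; the track's actual
# baseline implementation is not specified)
import sklearn_crfsuite

def featurize(tokens):
    # Raw tokens as the only feature, mirroring the baseline described above.
    return [{"token": tok} for tok in tokens]

# Toy example: a tokenized code-mixed tweet with first-level BIO labels.
# Both the tweet and its labels are hypothetical.
train_X = [featurize(["mai", "dilli", "jaa", "rahi", "hoon"])]
train_y = [["O", "B-LOCATION", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(train_X, train_y)

print(crf.predict([featurize(["dilli", "jaa", "rahi", "hoon"])]))
```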
The following paragraphs briefly explain the approach used by each team; the different methodologies are summarized in Table 2, and all the teams' results are given in Tables 3 and 4.

The Irshad team used neural networks to develop their system, with external Wikipedia data for creating word embeddings. They did no cleaning of the tweets, such as removal of URLs or emoticons, and no NLP preprocessing of the text beyond converting the data to BIO format. This team participated only in Hindi-English and submitted 1 run.

The Deepak team used CRFs, preprocessing the data with tokenization, and also used gazetteer lists of disease names. This team submitted results for both Hindi-English and Tamil-English.

The Veena team used the machine learning method SVM, with word2vec for feature engineering and extraction, drawing on external corpora from the MSIR 2016 and ICON 2015 track datasets. They submitted 3 runs each for Hindi-English and Tamil-English, additionally using stylometric features, suffixes and prefixes, and gazetteers in run 3. It is interesting to note that despite the many kinds of features and resources, the system's performance was not significantly higher than that of runs in which none of these features were used.

The Barathi team submitted 2 runs each for Hindi-English and Tamil-English, using CRFs and Random Forest Trees. Run 1 was based on lexical features and the CRF algorithm; run 2 added to the run 1 features an additional binary feature (entity or not) decided by the Random Forest Tree.

The Rupal team used decision tree and extremely randomized tree algorithms. The precision obtained is comparatively lower than that of the other ML methods used by the teams above. They cleaned the data of emojis and URLs as the first step of processing.

The team led by Somnath used CRFs through the popular CRF++ tool. The system's performance was relatively low, which can probably be attributed to the lack of proper feature extraction and feature engineering.

One interesting observation is that the team led by Nikhil also used neural networks, similarly to the Irshad team, but without any external resources for training; this suggests that the data size needs to grow for better machine learning.

The team led by Srinidhi used SVM with context-based character embeddings for feature engineering, drawing on several external unlabeled datasets such as the MSIR 2016 and ICON 2015 shared task datasets.

All the systems were evaluated automatically by comparison with the gold annotations. The results obtained by the participant systems are shown in Tables 3 and 4.

5. CONCLUSION
The main objective, creating benchmark data representing some of the popular Indian languages, has been achieved, and this data has been made available to the research community free of charge for research purposes. The data is user generated and is not genre specific. Efforts are still ongoing to standardize this data and make it a better data set for future researchers. We observe that the results obtained on the Hindi-English data are higher than those on Tamil-English; this is because the Tamil-English data is noisier and smaller. We hope to see more publications in this area in the coming days from the research groups that could not submit their results, and we expect more groups to start using this data for their research work.

This CMEE-IL track is one of the first efforts towards the creation of entity-annotated, user-generated, code-mixed social media text for Indian languages. The CMEE-IL annotation uses a hierarchical tag set, so the annotated data can serve many kinds of applications. The tag set is exhaustive and has finer tags: applications that require fine-grained tags can use the data with the full annotation, while for applications that do not, the finer tags can be suppressed. Since the data is generic, it can be used to develop generic systems upon which domain-specific systems can be built after customization.

6. ACKNOWLEDGMENTS
We thank the FIRE 2016 organizers for giving us the opportunity to conduct the evaluation exercise.

7. REFERENCES
[1] Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org.

[2] Mark Dredze, Tim Oates, and Christine Piatko. 2010. "We're not in Kansas anymore: detecting domain changes in streams". In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 585–595. Association for Computational Linguistics.

[3] Hege Fromreide, Dirk Hovy, and Anders Søgaard. 2014. "Crowdsourcing and annotating NER for Twitter #drift". European Language Resources Association (ELRA).

[4] H.T. Ng, C.Y. Lim, and S.K. Foo. 1999. "A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation". In Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources (SIGLEX99), Maryland, pp. 9–13.

[5] Preslav Nakov, Torsten Zesch, Daniel Cer, and David Jurgens. 2015. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

[6] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).

[7] Rajeev Sangal and M. G. Abbas Malik. 2011. Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (SANLP).

[8] Aravind K. Joshi and M. G. Abbas Malik. 2010. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP). (http://www.aclweb.org/anthology/W10-36)

[9] Rajeev Sangal, Dipti Misra Sharma, and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. (http://www.aclweb.org/anthology/I/I08/I08-03)

[10] Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R, and Sobha Lalitha Devi. 2014. Proceedings of the Named-Entity Recognition for Indian Languages track at FIRE 2014. http://au-kbc.org/nlp/NER-FIRE2014/
Table 2. Participant Team Overview – Summary

Barathi-AmrithaT2
  Languages & runs: Hindi-English: 2; Tamil-English: 2
  ML approach: Run 1: Conditional Random Field; Run 2: Conditional Random Field + Random Forest Tree
  Pre-processing: Tweet Preprocessor alone, used to eliminate http links and emoticons (both runs)
  Open-source NLP tools: Tweet Preprocessor, Scikit-Learn, sklearn-crfsuite, NLTK
  Variation between runs: along with the run 1 features, a binary feature (the outcome of the Random Forest Tree) is utilized in run 2

Deepak-IITPatna
  Languages & runs: Hindi-English: 1; Tamil-English: 1
  ML approach: machine learning (CRFs) + rule-based system
  Pre-processing: tokenization by CMU tagger + token encoding (IOB)
  Lexical resources: dictionary of disease names, living things and special days
  Open-source NLP tools: CMU ARK tagger, CRF++

Irshad-IIIT-Hyd
  Languages & runs: Hindi-English: 1
  ML approach: simple feed-forward neural network with 1 hidden layer of 200 nodes; activation function: rectifier; learning rate: 0.03; dropout: 0.5; learning rule: AdaGrad; regularization: L2; mini-batch: 200; trained for 25 iterations
  Pre-processing: converted the given data to BIO format
  Lexical resources: English Wikipedia corpus used to develop word embeddings with Gensim Word2Vec
  Open-source NLP tools: Gensim Word2Vec

Nikhil_BITSHyd (Nikhil Bharadwaj Gosala, BITS Pilani, Hyderabad Campus)
  Languages & runs: Hindi-English: 2
  ML approach: Run 1: seq2seq LSTM network with 3 layers and 192 nodes in each layer; Run 2: seq2seq LSTM network with 4 layers and 256 nodes in each layer
  Pre-processing: 1) replacement of HTML escape characters; 2) tweet tokenization; 3) stop-word removal; 4) rule tagging; 5) mapping of common misspellings
  Lexical resources: NLTK stop words
  Open-source NLP tools: NLTK word tokenizer and NLTK stop words
  Variation between runs: number of hidden layers (3 vs. 4) and nodes per layer (192 vs. 256)

Rupal_BITSPilani (Rupal Bhargava)
  Languages & runs: Hindi-English: 3; Tamil-English: 3
  ML approach: Run 1: Decision Tree (Hindi-English), Decision Tree (Tamil-English); Run 2: Extremely Randomized Tree (Hindi-English), Decision Tree (Tamil-English); Run 3: Extremely Randomized Tree (Hindi-English), Extremely Randomized Tree (Tamil-English)
  Pre-processing: convert to lowercase, remove links and tokenize
  Lexical resources: PyEnchant (a Python English dictionary); gazetteer lists created from the annotations file
  Variation between runs: differences are in the machine-learning technique used

ShivkaranAMU3 (Srinidhi Skanda V, CEN@Amrita)
  Languages & runs: Hindi-English: 1; Tamil-English: 1
  ML approach: context-based character embedding
  Pre-processing: 1) tokenizing the data into one token per line; 2) adding a special tag to identify the end of each tweet; 3) converting the input datasets to IOB format
  Lexical resources: Hindi-English: unlabeled datasets from Mixed Script Information Retrieval (MSIR) 2016 and the International Conference on Natural Language Processing (ICON) 2015 POS tagging task, plus external Twitter data collected by web scraping; Tamil-English: unlabeled datasets from Sentiment Analysis in Indian Languages (SAIL-2015)
  Open-source NLP tools: word2vec model, SVM-Light

SomnathJU (Somnath Banerjee, Jadavpur University)
  Languages & runs: Hindi-English: 1
  ML approach: Conditional Random Fields
  Pre-processing: cleaning of links and emoticons
  Open-source NLP tools: CRF++

VeenaAMU1 (Anand Kumar M, Amrita Vishwa Vidyapeetham)
  Languages & runs: Hindi-English: 3; Tamil-English: 3
  ML approach: Run 1: wang2vec (structured skip-gram) based embedding features; Run 2: word2vec based embedding features; Run 3: stylometric features
  Pre-processing: tokenization, BIO formatting
  Lexical resources: MSIR 2016 & ICON 2015, SAIL 2015, Twitter dataset
  Open-source NLP tools: wang2vec, word2vec, SVM-Light
  Variation between runs: Run 1: structured skip-gram embedding features (the structured skip-gram model takes word position into consideration when extracting features); Run 2: neural network (word2vec) based embedding features; Run 3: stylometric features (prefix, suffix, punctuation, hashtags, gazetteer features, index, length, etc.)

Table 3. Evaluation Results for Hindi-English (Precision / Recall / F-measure)

Team                   Run 1                  Run 2                  Run 3                  Best Run
Irshad-IIIT-Hyd        80.92 / 59.00 / 68.24  NA                     NA                     80.92 / 59.00 / 68.24
Deepak-IIT-Patna       81.15 / 50.39 / 62.17  NA                     NA                     81.15 / 50.39 / 62.17
Veena-Amritha-T1       75.19 / 29.46 / 42.33  75.00 / 29.17 / 42.00  79.88 / 41.37 / 54.51  79.88 / 41.37 / 54.51
Barathi-Amritha-T2     76.34 / 31.15 / 44.25  77.72 / 31.84 / 45.17  NA                     77.72 / 31.84 / 45.17
Rupal-BITS-Pilani      58.66 / 32.93 / 42.18  58.84 / 35.32 / 44.14  59.15 / 34.62 / 43.68  58.84 / 35.32 / 44.14
Somnath-JU             37.49 / 40.28 / 38.83  NA                     NA                     37.49 / 40.28 / 38.83
Nikhil-BITS-Hyd        59.28 / 19.64 / 29.50  61.80 / 26.39 / 36.99  NA                     61.80 / 26.39 / 36.99
Shivkaran-Amritha-T3   48.17 / 24.90 / 32.83  NA                     NA                     48.17 / 24.90 / 32.83
AnujSaini              72.24 / 18.85 / 29.90  NA                     NA                     72.24 / 18.85 / 29.90

Table 4. Evaluation Results for Tamil-English (Precision / Recall / F-measure)

Team                   Run 1                  Run 2                  Run 3                  Best Run
Deepak-IIT-Patna       79.92 / 30.47 / 44.12  NA                     NA                     79.92 / 30.47 / 44.12
Veena-Amritha-T1       77.38 /  8.72 / 15.67  74.74 /  9.93 / 17.53  79.51 / 21.88 / 34.32  79.51 / 21.88 / 34.32
Barathi-Amritha-T2     77.70 / 15.43 / 25.75  79.56 / 19.59 / 31.44  NA                     79.56 / 19.59 / 31.44
Rupal-BITS-Pilani-R2   55.86 / 10.87 / 18.20  58.71 / 12.21 / 20.22  58.94 / 11.94 / 19.86  58.71 / 12.21 / 20.22
Shivkaran-Amritha-T3   47.62 / 13.42 / 20.94  NA                     NA                     47.62 / 13.42 / 20.94
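For reference, the entity-level precision/recall/F-measure evaluation described in Section 4 can be reproduced with a simple scorer. Exact matching of tweet ID, offsets and tag is our assumption of the matching criterion; the track paper does not spell it out:

```python
def prf(gold, predicted):
    """Entity-level precision/recall/F1. `gold` and `predicted` are sets
    of (tweet_id, start, length, tag) tuples; exact-match scoring is an
    assumption here."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and system annotations for one tweet.
gold = {("123", 43, 6, "ORGANIZATION"), ("123", 4, 5, "LOCATION")}
pred = {("123", 43, 6, "ORGANIZATION"), ("123", 0, 3, "PERSON")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```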