Language Identification in Mixed Script Social Media Text

S. Nagesh Bhattu                        Vadlamani Ravi
CoE on Analytics, IDRBT                 CoE on Analytics, IDRBT
Castle Hills Road #1, Masab Tank        Castle Hills Road #1, Masab Tank
Hyderabad-500057, India                 Hyderabad-500057, India
nageshbs@idrbt.ac.in                    vravi@idrbt.ac.in

ABSTRACT
With the spurt in the usage of smart devices, large amounts of unstructured text are generated by numerous social media tools. This text is often filled with stylistic and linguistic variations that make text analytics with traditional machine learning tools less effective. One specific problem in the Indian context is dealing with the large number of languages used by social media users in their romanized form. As part of the FIRE-2015 shared task on mixed script information retrieval, we address the problem of word-level language identification. Our approach is a two-stage algorithm: the first stage classifies at the sentence level using character n-grams, and the second stage is a word-level character n-gram classifier. This approach effectively captures the linguistic mode of the author in a social texting environment. The overall weighted F-score for the run submitted to the FIRE shared task is 0.7692. The sentence-level classification algorithm used in achieving this result has an accuracy of 0.6887. We could improve the accuracy of the sentence-level classifier by a further 1.6% using additional social media text crawled from other sources. The Naive Bayes classifier showed the largest improvement in accuracy (5.5%) from the addition of these supplementary tuples. We also observed that with a semi-supervised learning algorithm, Expectation Maximization with Naive Bayes, the accuracy could be improved to 0.7977.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

Keywords
Classification

1. INTRODUCTION
With the proliferation of social tools like Twitter, Facebook, etc., large volumes of text are generated on a daily basis. Traditional machine learning tools used for text analysis, such as named entity recognition (NER), part-of-speech tagging or parsing, depend on the premise that the text provided to them is in a pure form; they achieve their objective using co-occurrence patterns of features. Many studies have observed that social media text, when fed to such machine learning algorithms, is plagued by excessive out-of-vocabulary words (sparsity of features). The FIRE-2015 shared task addresses language identification as well as named entity recognition in the context of Indian social fora, where the number of languages used is more than 10 and all of them share vocabulary extensively.

To understand the complexity of the task, we posed the primary problem of word-level language identification as multi-class classification using word lists gathered for each language. These word lists are obtained using the method suggested in Gupta et al. [2012]; the number of words in each list is given in Table 1. We converted the words into an n-gram representation and built classifiers based on multi-class logistic regression (McCallum [2002]) and multi-class SVM (support vector machines; Crammer and Singer [2002]), taking the n-gram representation of each word of the training data as an instance (leaving out the NE words). This experiment yielded accuracies of 57% and 54% respectively.

Table 1: Number of words in each language
Lang       No. of words    Lang        No. of words
Bengali    2207            Hindi       2457
English    5115            Kannada     1302
Gujarati   1741            Malayalam   1862
Marathi    1265            Tamil       1886
Telugu     3462

The likelihood of the multi-class logistic model is defined as

    p(y|x) = \frac{\exp(\lambda_y \cdot F(x, y))}{\sum_{y'} \exp(\lambda_{y'} \cdot F(x, y'))}    (1)

Here y is the label associated with instance x, and x is expressed in some feature representation F(x, y); in the current work this is the character n-gram representation of words. The \lambda_y are class-specific parameters learnt during maximum-likelihood training.
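To make this concrete, the following is a minimal sketch of the word-level setup of Eq. (1). scikit-learn is used here as a stand-in for the MALLET toolkit employed in our experiments, and the word lists and labels are illustrative placeholders rather than the Table 1 data:

```python
# Minimal sketch of the word-level multi-class setup of Eq. (1).
# scikit-learn stands in for MALLET; words and labels are placeholders
# for the Gupta et al. [2012] word lists summarized in Table 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

words = ["vanakkam", "nandri", "namaste", "dhanyavaad", "hello", "thanks"]
labels = ["ta", "ta", "hi", "hi", "en", "en"]

# F(x, y) is the character n-gram representation of the word; the logistic
# regression below fits a regularized form of the softmax likelihood (1).
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(words, labels)
print(model.predict(["namaskaram"]))  # classify an unseen romanized word
```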
As the text segments are typically social media posts, the number of languages within a text segment can safely be assumed to be two. Using this corpus-level prior knowledge, we built a two-stage classification algorithm. The first stage identifies the sentence-level language pair. We used character-level n-grams of each sentence as training data for the sentence-level classifier, taking 1-, 2-, 3-, 4- and 5-grams of all the words in the sentence as features. We divided the input training data into 80-20 splits using 5-fold cross-validation and built multi-class classifiers with softmax (MaxEnt), Naive Bayes, Naive Bayes EM and SVM algorithms. Among these, Naive Bayes EM is a semi-supervised learning algorithm which uses the EM algorithm to improve accuracy on the test data. In preparing the training data, we removed URLs and words tagged X or NE. The 5-fold cross-validation accuracies are reported in Table 3. We also tried varying the maximum n-gram length to 3 and 4, which depreciated the accuracy by 3-6%. The class-wise distribution of documents in the training data is given in Table 2.

Table 2: Class-wise distribution in the training data
bn-en   215     ml-en   131
en      679     mr-en   200
gu-en   149     ta-en   318
hi-en   383     te-en   525
kn-en   272

Table 3: Accuracy results of cross-validation on training data
Method            Accuracy
Naive Bayes       0.7419
MaxEnt            0.8436
Multi-Class SVM   0.8123
Naive Bayes EM    0.7454

We get 82% accuracy when we apply the multi-class logistic regression classifier trained on the above data; the full classification results on the training data are given in Table 3. We also experimented with a latent Dirichlet allocation based topic model with 100 topics, which did not provide accuracy levels beyond 60%.
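A hedged sketch of this first stage follows: character 1-5 gram features scored with 5-fold cross-validation, plus a crude hard-label EM (self-training) loop standing in for the soft-posterior Naive Bayes EM. Again scikit-learn replaces MALLET, and the sentences, labels and helper names are illustrative placeholders, not shared-task data:

```python
# Sketch of the sentence-level (first-stage) classifier: character 1-5
# grams, Naive Bayes vs. MaxEnt under 5-fold cross-validation, plus a
# hard-label EM (self-training) approximation of Naive Bayes EM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny placeholder corpus: posts with URLs and X/NE tokens already removed.
sentences = [
    "nuvvu ela unnavu my friend", "chala bagundi this movie",
    "repu kaluddam see you", "em chestunnavu now", "sare chala thanks",
    "tum kaise ho my friend", "bahut accha this movie",
    "kal milte hain see you", "kya kar rahe ho now", "thik hai thanks",
]
labels = ["te-en"] * 5 + ["hi-en"] * 5

def char_ngram_pipeline(estimator):
    # 1-5 character grams taken within word boundaries.
    return make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 5)), estimator)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, est in [("Naive Bayes", MultinomialNB()),
                  ("MaxEnt", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(char_ngram_pipeline(est), sentences, labels, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.4f}")

# Hard-label EM over unlabeled posts: label them with the current model,
# retrain on labeled + pseudo-labeled data, and repeat.
unlabeled = ["nenu kuda vastanu tomorrow", "mujhe nahi pata sorry"]
nb = char_ngram_pipeline(MultinomialNB()).fit(sentences, labels)
for _ in range(5):
    pseudo = list(nb.predict(unlabeled))
    nb.fit(sentences + unlabeled, labels + pseudo)
```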
1.1 Word-Level Classification
After identifying the language pair used in a particular posting, we further build binary classifiers for each of the language pairs, namely bn-en, gu-en, kn-en, hi-en, ml-en, mr-en, ta-en and te-en, using the words in Table 1. We used logistic regression based binary classifiers, which give 92-94% accuracy on the training data; character n-grams (with n set to 5) are used as features. The approach suggested in Täckström and McDonald [2011] similarly uses latent variable models over document-level sentiment ratings to infer a sentence-level classifier.

This approach of using a word-level binary classifier works well as long as the words are sufficiently long to capture the n-gram characteristics of the language of interest. But, as we observe, tweets often contain stylistic variations which shorten words significantly. When the length of a word is below 3, the word does not carry n-grams representative of the target language. To address this problem we use the words within a window of two on either side to make up for the sparsity of features of shorter words, a heuristic also used effectively in other work such as Han and Baldwin [2011].
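A small sketch of this windowing heuristic is given below; the function name, window parameter and length-3 threshold mirror the description above, but the code is our illustration, not the submitted implementation:

```python
# Windowing heuristic for short words: a word below 3 characters carries
# too few n-grams to identify its language, so the string fed to the
# character n-gram extractor is widened with up to two words on each side.
def ngram_context(tokens, i, window=2, min_len=3):
    """Return the text whose character n-grams represent tokens[i]."""
    if len(tokens[i]) >= min_len:
        return tokens[i]                  # long enough: use the word alone
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return " ".join(tokens[lo:hi])        # pad short words with neighbours

tokens = "nenu ok ra ippudu".split()
for i, tok in enumerate(tokens):
    print(f"{tok!r} -> {ngram_context(tokens, i)!r}")
```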
Named entity detection for short text is a much harder task, as word collocations and POS tagging do not work well with mixed script. Ritter et al. [2011] proposed a solution based on word clusters derived from a large collection of Twitter posts. For named-entity detection we use the tool provided by the authors of Ritter et al. [2011] for English tweets, and a small lexicon of named entities for all the other languages.

2. EXPERIMENTS
We used MALLET (McCallum [2002]) for multi-class classification. Table 4 contains the F1 scores for language identification and named entities; these are the results of the test run submitted for the FIRE workshop. We report the F1 score as it is a representative measure capturing both precision and recall. As we adopted a two-stage algorithm for word-level language identification, the classification accuracy of the first (sentence) level is the most important for the further processing. As the results show, there are languages, such as Gujarati, which the classifier misclassifies entirely, giving a zero F1 score. Named entity detection is typically addressed using sequence-level features, which are quite unreliable in the short-message context. Our test-run results are limited by what is present in the training data.

Table 4: Results as submitted for the test run
Tag                   F1 Score
MIX                   0.57
MIX-en-bn             0
MIX-en-kn             0
MIX-en-ml             0
MIX-en-te             0
NE                    0.387409201
NE-ml                 0
NE-L                  0.2791
NE-O                  0
NE-OA                 0
NE-P                  0.2187
NE-PA                 0
Others                0
X                     0.9555
bn                    0.7749
en                    0.831
gu                    0
hi                    0.6125
kn                    0.8215
ml                    0.8132
mr                    0.745
ta                    0.8582
te                    0.6148
tokensAccuracy        77.5231
tokensCorrect         9302
utterances            792
utterancesAccuracy    18.0556
utterancesCorrect     143
Average F-measure     0.6845007667
Weighted F-measure    0.769245171

2.1 Errors and Analysis
The error analysis would not be complete without attempting to make the classifier more accurate. To that end, we manually tagged the test data sentences as belonging to one of the languages of our interest mixed with English. The authors were confident in tagging six of these languages; we depended on other resources for distinguishing Malayalam and Tamil. As this shared task largely focuses on English as the mixing language, we had to classify the training data sentences into 9 classes. To improve the sentence-level language identification, we collected tweets in the 8 languages of our interest using seed words of each language, chosen so that the retrieved tweets belonged to the mixed-script category. We added at least 500 tweets for each class, bringing the total number of labeled sentences (sentence-level tags) to 7656.

Table 5 compares the accuracies of the various learning algorithms: the second column shows the sentence-level classification accuracy with the training data provided in the FIRE shared task, and the third column the accuracy with the expanded training dataset. The semi-supervised version of Naive Bayes (Naive Bayes EM) is superior among all the classifiers. The Naive Bayes classifier benefitted the most from the supplementary tuples (a 5.5% increase in accuracy), while Naive Bayes EM improved by approximately 3% and MaxEnt by 1.6%.

Table 5: Sentence classification accuracy on test data
Method           Training data    Expanded training data
Naive Bayes      0.7204           0.7751
MaxEnt           0.6887           0.7052
Naive Bayes EM   0.7684           0.7977

3. CONCLUSION
In the current study we addressed the language identification problem in mixed-script social media text at the word level, involving multiple Indian languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. Observing that social media mixed-script posts often involve English as the mixing language and, as the messages are quite short, can involve at most one other language, we used a two-stage classification approach: a sentence-level classifier for the language mode of the author, followed by binary classifiers distinguishing English from each of the specific languages listed above. The submitted test run gave an overall weighted F-measure of 0.7692. The sentence-level classification accuracy was 68.87%; we could further improve this to 79.77% using abundantly available social media tweets crawled with seed words of the specific languages.

Acknowledgement
We sincerely thank the members of CoE on Analytics, IDRBT, especially K. Sai Kiran and B. Shiva Krishna, for their active participation in the sentence-level and word-level labeling. We thank IDRBT for providing the research environment for executing this task.

References
Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292.

Gupta, K., Choudhury, M., and Bali, K. (2012). Mining Hindi-English transliteration pairs from online Hindi lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Han, B. and Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA. Association for Computational Linguistics.

McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011). Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524–1534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Täckström, O. and McDonald, R. (2011). Semi-supervised latent variable models for sentence-level sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 569–574. Association for Computational Linguistics.