=Paper=
{{Paper
|id=Vol-1228/paper4
|storemode=property
|title=A Language Identification Method Applied to Twitter Data
|pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-4-singh.pdf
|volume=Vol-1228
|dblpUrl=https://dblp.org/rec/conf/sepln/SinghG14
}}
==A Language Identification Method Applied to Twitter Data==
* Anil Kumar Singh, IIT (BHU), Varanasi, India (nlprnd@gmail.com)
* Pratya Goyal, NIT, Surat, India (goyalpratya@gmail.com)

Resumen: This paper presents the results of several experiments that use a simple, heuristic-guided algorithm for the purpose of identifying the language of Twitter data. These experiments are part of the shared task centred on this problem. The algorithm is based on a distance metric computed from n-grams and had previously been evaluated successfully on normal text. The distance metric used in this case is symmetric cross entropy.

Palabras clave: language identification, symmetric cross entropy, microblogging

Abstract: This paper presents the results of some experiments on using a simple algorithm, aided by a few heuristics, for the purposes of language identification on Twitter data. These experiments were a part of a shared task focused on this problem. The core algorithm is an n-gram based distance metric algorithm. This algorithm has previously been shown to work very well on normal text. The distance metric used is symmetric cross entropy.

Keywords: Language identification, symmetric cross entropy, microblogging

===1 Introduction and Objectives===

Language identification was perhaps the first natural language processing task for which a statistical method was used successfully (Beesley, 1988). Over the years, many algorithms have become available that work very well with normal text (Dunning, 1994; Combrinck and Botha, 1994; Jiang and Conrath, 1997; Teahan and Harper, 2001; Martins and Silva, 2005). However, with the recent global spread of social media, the need for language identification algorithms that work well with the data available on such media has been felt increasingly. There has been a special focus on microblogging data, for at least two main reasons. The first is that microblogs have too little data for traditional algorithms to work well directly, and the second is that microblogs use a kind of abbreviated language in which, for example, many words are not fully spelled out. Other facts about such data, such as the multilinguality of many microbloggers, only make the problem harder.

Our goal was to take one of the algorithms that has been shown to work very well for normal text, add some heuristics to it, and see how far it goes in performing language identification for microblog data.

===2 Architecture and Components of the System===

The system we have used is quite simple. There are only two components in the system. At its core there is a language identifier for normal text. The only other module is a preprocessing module, which implements some heuristics. Two main heuristics are implemented. The first one is based on the knowledge that word boundaries are an important source of linguistic information that can help a language processing system perform better. We simply wrap every word (more accurately, every token) inside two special symbols, one for word beginning and the other for word ending. The effect of this heuristic is that it not only provides additional information, it also 'expands' the short microblogging text a little, which is statistically important.

The other heuristic relates to cleaning up the data. Microblogging text, particularly Twitter text, contains extra-textual tokens such as hashtags, mentions, retweet symbols and URLs. This heuristic removes such extra-textual tokens from the data before training as well as before language identification. A minimal sketch of this preprocessing step is given below.
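The sketch below illustrates the two heuristics under stated assumptions: the boundary markers "^" and "$" and the regular expression for extra-textual tokens are hypothetical stand-ins, since the paper does not specify the exact symbols or cleanup rules used.

<pre>
import re

# Hypothetical boundary markers; the paper does not name the exact symbols it uses.
BEGIN, END = "^", "$"

# Extra-textual tokens: mentions, hashtags, retweet markers and URLs (assumed patterns).
NOISE = re.compile(r"@\w+|#\w+|\bRT\b|https?://\S+")

def preprocess(tweet):
    """Remove extra-textual tokens, then wrap every remaining token in boundary markers."""
    cleaned = NOISE.sub(" ", tweet)
    return " ".join(BEGIN + tok + END for tok in cleaned.split())

# preprocess("RT @user mira esto http://t.co/x #genial")  ->  "^mira$ ^esto$"
</pre>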
The intuitive basis of our algorithm is similar to the unique n-gram based approach, which was first used for human identification (Ingle, 1976) and later for automatic identification (Newman, 1987). The insight behind these methods is as old as the time of Ibn ad-Duraihim, who lived in the 14th century.

It is worth noting that when n-grams are used for language identification, normally no distinction is made between the orders of the n-grams; that is, unigrams, bigrams, trigrams etc. are all given the same status. Further, when using vector space based distance measures, n-grams of all orders are merged together and a single vector is formed. It is this vector over which the distance measures are applied.

===3 The Core Algorithm===

The core algorithm that we have used (Singh, 2006) is an adaptation of the one used by Cavnar and Trenkle (1994). The main difference is that instead of using the sum of the differences of ranks, we use symmetric cross entropy as the similarity or distance measure.

The algorithm can be described as follows:

# Train the system by preparing character based and (optionally) word based n-grams from the training data.
# Combine the n-grams of all orders (O<sub>c</sub> for characters and O<sub>w</sub> for words).
# Sort them by rank.
# Prune by selecting only the top N<sub>c</sub> character n-grams and N<sub>w</sub> word n-grams for each language-encoding.
# For the given test data or string, calculate the character n-gram based score sim<sub>c</sub> with every model for which the system has been trained.
# Select the t most likely language-encoding pairs (training models) based on this score.
# For each of the t best training models, calculate the score with the test model as score = sim<sub>c</sub> + a · sim<sub>w</sub> (1), where c and w represent character based and word based n-grams, respectively, and a is the weight given to the word based n-grams. In our experiments, this weight was 1 when word n-grams were considered and 0 when they were not.
# Select the most likely language-encoding pair out of the t ambiguous pairs, based on the combined score obtained from the word and character based models.

The parameters in the above algorithm are:

# Character based n-gram models P<sub>c</sub> and Q<sub>c</sub>
# Word based n-gram models P<sub>w</sub> and Q<sub>w</sub>
# Orders O<sub>c</sub> and O<sub>w</sub> of the n-gram models
# Numbers N<sub>c</sub> and N<sub>w</sub> of retained top n-grams (the pruning ranks for character based and word based n-grams, respectively)
# Number t of character based models to be disambiguated by the word based models
# Weight a of the word based models

In our case, for the Twitter data, we have not used word based n-grams, as they do not seem to help: adding them does not improve the results. Perhaps the reason is that there is too little data in terms of word n-grams. So the parameters for our case are:

O<sub>c</sub> = 7, O<sub>w</sub> = 0, N<sub>c</sub> = 1000, N<sub>w</sub> = 0, a = 0

We used an existing implementation of this algorithm, which is available as part of a library called Sanchay (version 0.3.0; http://sanchay.co.in). The parameters were selected based on repeated experiments; the ones selected are those which gave the best results. The length of the n-grams was selected as 7, and we did find that increasing the n-gram length improves the results.

In this paper we have used this technique for monolingual identification, in accordance with the task definition, but it can also be used for multilingual identification (Singh and Gorla, 2007), although the accuracies are not likely to be high when it is used directly.
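As an illustration of the method described in this section, the following is a minimal sketch under the parameter values above (O<sub>c</sub> = 7, N<sub>c</sub> = 1000, a = 0, so word n-grams are ignored). The particular normalisation and the exact form of the symmetric cross entropy are assumptions of this sketch, not a description of the Sanchay implementation.

<pre>
import math
from collections import Counter

O_C, N_C = 7, 1000   # Oc and Nc as above; word n-grams are not used (a = 0)

def ngram_profile(text, order=O_C, top_n=N_C):
    """Character n-grams of all orders 1..order, pruned to the top_n most frequent
    and normalised into a probability distribution."""
    counts = Counter(text[i:i + n]
                     for n in range(1, order + 1)
                     for i in range(len(text) - n + 1))
    top = dict(counts.most_common(top_n))
    total = sum(top.values())
    return {g: c / total for g, c in top.items()}

def symmetric_cross_entropy(p, q):
    """Similarity over the n-grams shared by two profiles (one common formulation);
    values are negative, and values closer to zero mean more similar."""
    shared = p.keys() & q.keys()
    if not shared:
        return float("-inf")
    return sum(p[g] * math.log(q[g]) + q[g] * math.log(p[g]) for g in shared)

def identify(text, models):
    """Return the language whose trained profile is most similar to the test profile."""
    test = ngram_profile(text)
    return max(models, key=lambda lang: symmetric_cross_entropy(test, models[lang]))

# Training: models = {lang: ngram_profile(corpus[lang]) for lang in corpus}
</pre>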
===4 Resources Employed===

For the experiments reported here we have used only the training data provided; we have not used any other resources. We have also, so far, not used any additional tools such as a named entity recognizer. We have implemented some heuristics, as described in the previous section.

===5 Setup and Evaluation===

We evaluated with two different setups. Before the test data for the shared task was released, we had randomly divided the training data into two sets by the usual 80-20 split: one for training and one for evaluation. We also used two evaluation methods. One was simple precision based on microaverages, while the other was the evaluation script provided by the organizers, which was based on macroaverages. The two views are sketched below.
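The sketch below contrasts the two evaluation views in simplified form. It is only an approximation for orientation: the organizers' script may differ in details, in particular in how multi-label classes such as 'en+pt' are handled.

<pre>
def micro_precision(gold, pred):
    """Fraction of tweets whose predicted label matches the gold label exactly
    (each multi-label combination such as 'en+pt' counts as a single class)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_scores(gold, pred):
    """Per-label precision, recall and F-measure, averaged over all labels."""
    labels = sorted(set(gold) | set(pred))
    prec, rec, f1 = [], [], []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        p_ = tp / (tp + fp) if tp + fp else 0.0
        r_ = tp / (tp + fn) if tp + fn else 0.0
        f_ = 2 * p_ * r_ / (p_ + r_) if p_ + r_ else 0.0
        prec.append(p_); rec.append(r_); f1.append(f_)
    n = len(labels)
    return sum(prec) / n, sum(rec) / n, sum(f1) / n
</pre>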
Under the 80-20 setup, on repeated runs, the algorithm described earlier gave, out of the box, a microaverage based precision of a little more than 70%. On adding the word boundary heuristic to the data, the precision increased to around 78%. On further adding the cleaning heuristic, the precision reached 80.80%. The corresponding macroaverage based F-score was 68.26%.

However, once the test data for the shared task was released and we used it with our algorithm, along with the heuristics, the macroaverage based F-score was 61.5%. This increased a little after we slightly improved the implementation of the preprocessing module. The corresponding microaverage based precision was 77.47%. Looking at the results for each language, we find that the performance was best for Spanish (89.38% F-measure) and worst for Galician (33.99% F-measure). These results are presented in Table 1.

{| class="wikitable"
|+ Table 1: Language-wise results in percentages (macroaverages)
! rowspan="2" | Language !! colspan="3" | Training 80-20 split !! colspan="3" | Test set
|-
! Precision !! Recall !! F-measure !! Precision !! Recall !! F-measure
|-
| Spanish || 91.62 || 82.05 || 86.57 || 93.12 || 85.93 || 89.38
|-
| Catalan || 74.84 || 84.27 || 79.28 || 63.43 || 81.99 || 71.52
|-
| Portuguese || 86.79 || 73.95 || 79.86 || 65.03 || 88.53 || 74.98
|-
| Galician || 34.97 || 55.34 || 42.86 || 25.71 || 50.12 || 33.99
|-
| Basque || 66.67 || 71.15 || 68.83 || 49.30 || 76.74 || 60.03
|-
| English || 80.53 || 80.53 || 80.53 || 71.44 || 76.53 || 73.90
|-
| Undefined || 42.11 || 16.67 || 23.88 || 42.53 || 7.84 || 13.24
|-
| Ambiguous || 1.00 || 69.62 || 82.09 || 1.00 || 78.08 || 87.69
|-
| Global || 72.19 || 67.20 || 68.26 || 63.82 || 68.25 || 63.10
|}

Tables 2 and 3 list the most frequent single label errors for the two cases (the 80-20 split of the training data and the test set). While some of the results are as expected, others are surprising. For example, Galician and Portuguese are very similar, and they are confused for one another; similarly for Spanish and Catalan. But it is surprising that Catalan is identified as English and Basque as Spanish. Also, although Galician and Portuguese are similar, the results for them are quite different. These discrepancies become a little clearer if we notice that the results differ in many ways between the two cases, the 80-20 split and the test set. The most probable reason is that, since this method is based purely on distributional similarity, differences between the training and testing distributions cause unexpected errors. The fact that there is far more data available for some languages (Spanish and Portuguese) than for others (Galician, Catalan and Basque) also contributes to these discrepancies. It may further be noted that the results were much better in terms of microaverage based precision because, in that case, our evaluation method took multi-label classification such as 'en+pt' into account: each multi-label combination was treated as a single class, both for code switching and for ambiguity. As a result, many (around half) of the errors were of the kind where, for example, 'en' was identified as 'en+pt'. This also contributed to making our results lower as evaluated by the script provided by the organizers.

{| class="wikitable"
|+ Table 2: Top single label errors on the training 80-20 split
! Language !! Identified as !! No. of times
|-
| Spanish || Catalan || 212
|-
| Portuguese || Spanish || 72
|-
| Galician || Portuguese || 37
|-
| Undef || Basque || 31
|-
| Catalan || Spanish || 29
|-
| Basque || Spanish || 20
|-
| English || Spanish || 13
|-
| Other || Spanish || 6
|}

{| class="wikitable"
|+ Table 3: Top single label errors on the test set
! Language !! Identified as !! No. of times
|-
| Spanish || Catalan || 1879
|-
| Undef || Galician || 494
|-
| Other || Portuguese || 382
|-
| Catalan || English || 214
|-
| Portuguese || Galician || 212
|-
| Galician || Portuguese || 209
|-
| Basque || Spanish || 59
|}

===6 Conclusions and Future Work===

We have presented the results of our experiments on using an existing algorithm for language identification on the Twitter data provided for the shared task. We tried the algorithm as it is and also with some heuristics. The two main heuristics were adding word boundaries to the data in the form of special symbols, and cleaning up hashtags, mentions, etc. The results were not state of the art for Twitter data (Zubiaga et al., 2014), but they may show how far an out-of-the-box, well-performing algorithm can go for this purpose. Also, the results were significantly worse for the test data than they were for the 80-20 split on the provided training data. This means that either the algorithm lacks robustness when it comes to microblogging data, or there is a data shift between the training and test data. Perhaps one important conclusion from the experiments is that adding word boundary markers to the data can significantly improve performance.

For future work, we plan to experiment with techniques along the lines suggested in recent work on language identification for Twitter data (Kiciman, 2010; Carter, Weerkamp, and Tsagkias, 2013; Lui and Baldwin, 2014).
===References===

* Adams, Gary and Philip Resnik. 1997. A language identification application built on the Java client-server platform. In Jill Burstein and Claudia Leacock, editors, From Research to Commercial Applications: Making NLP Work in Practice. Association for Computational Linguistics, pages 43–47.
* Beesley, K. 1988. Language identifier: A computer program for automatic natural-language identification of on-line text.
* Carter, Simon, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.
* Cavnar, William B. and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, US.
* Combrinck, H. and E. Botha. 1994. Automatic language identification: Performance vs. complexity. In Proceedings of the Sixth Annual South Africa Workshop on Pattern Recognition.
* Dunning, Ted. 1994. Statistical identification of language. Technical Report CRL MCCS-94-273, Computing Research Lab, New Mexico State University, March.
* Ingle, Norman C. 1976. A language identification table. The Incorporated Linguist, 15(4).
* Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy.
* Kiciman, Emre. 2010. Language differences and metadata features on Twitter. In Web N-gram Workshop at SIGIR 2010. ACM, July.
* Lui, Marco and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17–25, Gothenburg, Sweden, April. Association for Computational Linguistics.
* Martins, Bruno and Mario J. Silva. 2005. Language identification in web pages. In Proceedings of ACM-SAC-DE, the Document Engineering Track of the 20th ACM Symposium on Applied Computing.
* Newman, Patricia. 1987. Foreign language identification - first step in the translation process. In Proceedings of the 28th Annual Conference of the American Translators Association, pages 509–516.
* Simon, Kranig. 2005. Evaluation of language identification methods. BA Thesis, Universität Tübingen.
* Singh, Anil Kumar. 2006. Study of some distance measures for language and encoding identification. In Proceedings of the ACL 2006 Workshop on Linguistic Distances, Sydney, Australia. Association for Computational Linguistics.
* Singh, Anil Kumar and Jagadeesh Gorla. 2007. Identification of languages and encodings in a multilingual document. In Proceedings of the 3rd ACL SIGWAC Workshop on Web As Corpus, Louvain-la-Neuve, Belgium.
* Teahan, W. J. and D. J. Harper. 2001. Using compression based language models for text categorization. In J. Callan, B. Croft and J. Lafferty (eds.), Workshop on Language Modeling and Information Retrieval. ARDA, Carnegie Mellon University, pages 83–88.
* Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet Language Identification at SEPLN 2014. In Proceedings of TweetLID @ SEPLN 2014, Girona, Spain.