Language Segmentation of Twitter Tweets using Weakly Supervised Language Model Induction

Segmentación de Twitter Tweets con Modelos de Lenguaje Inducidos

David Alfter
University of Trier
Universitätsring 15
s2daalft@uni-trier.de

Resumen: En este artículo presentamos los primeros resultados de la inducción de modelos de lenguaje de manera semi supervisada para la segmentación por idioma de textos multilingües con especial interés en textos cortos.
Palabras clave: modelo de lenguaje, inducción, semi supervisada, segmentación, textos cortos, tuits

Abstract: This paper presents early results of a weakly supervised language model induction approach for language segmentation of multilingual texts with a special focus on short texts.
Keywords: language model, induction, weakly supervised, short text, tweet, segmentation

1 Motivation

Twitter tweets often contain non-standard language and they are limited to 140 characters. While not a problem in itself, these restrictions can pose difficulties for natural language processing systems (Lui, Lau, and Baldwin, 2014). Furthermore, tweets may be written in more than one language (Zubiaga et al., 2014). This typically happens when multilingual speakers switch between the languages known to them, between or inside sentences (Jain and Bhat, 2014). The resulting text is said to be code-switched (Jain and Bhat, 2014; Solorio et al., 2014). This further complicates matters for natural language processing systems that need at least a certain degree of knowledge of the language at hand, such as part-of-speech taggers, parsers, or machine translation (Beesley, 1988; Jain and Bhat, 2014; Zubiaga et al., 2014). The performance of “traditional” monolingual natural language processing components on mixed-language data tends to be poor, making it necessary to identify the languages in a multilingual text in order to get acceptable results (Jain and Bhat, 2014). Even when the degradation is less severe, language identification and segmentation can significantly increase the accuracy of natural language processing tools (Alex, Dubey, and Keller, 2007).

Supervised methods perform well on the task of language identification in general (King and Abney, 2013; Lui, Lau, and Baldwin, 2014) and on tweets (Mendizabal, Carandell, and Horowitz, 2014; Porta, 2014), but they cannot always be applied. For one, tweets often contain non-standard and ad hoc spellings that may or may not be due to the imposed character limit. This can be problematic if the supervised methods have only seen standard spelling in training. Also, tweets may contain languages for which there is insufficient data to train a supervised method. In these cases, unsupervised approaches might yield better results than supervised approaches.

Language segmentation consists in identifying the language borders within a multilingual text (Yamaguchi and Tanaka-Ishii, 2012). Language segmentation is not the same as language identification; the main difference is that language identification identifies the languages in a text, whereas language segmentation “only” separates the text into monolingual segments (Yamaguchi and Tanaka-Ishii, 2012). Language segmentation can be useful when direct language identification is not available.

2 Language Model Induction

King and Abney (2013) consider the task of language identification as a sequence labeling task, and Lui, Lau, and Baldwin (2014) as a multi-label classification task.
In contrast, the proposed system uses a clustering approach. The system induces n-gram language models from the text iteratively and assigns each word of the text to one of the induced language models. One induction step consists of the following steps:

• Forward generation: Generate language models by moving forward through the text
• Backward generation: Generate language models by moving backwards through the text
• Model merging: Merge the two most similar models from the forward and backward generation based on the unigram distribution

Generation starts at the beginning of the text, takes the first word and decomposes it into uni-, bi- and trigrams. These n-grams are then added to the initial language model, which is empty at the start. For each following word, the existing language models evaluate the word in question. The highest-ranking model is updated with the word. If no model scores higher than the threshold value for model creation, a new model is created. Backward generation works exactly the same way, but starts at the end of the text and moves towards its beginning. Finally, the two models that have the most similar unigram distribution are merged. This way, the language models iteratively amass information about different languages.

The induction step is repeated at least twice. At the end of the induction, while there are two models whose similarity is greater than a certain threshold value, these models are merged.

Language segmentation is then performed by assigning each word in the text to the model that yields the highest probability for the word in question.
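The following Python sketch illustrates one such induction step. It is a minimal illustration under stated assumptions, not the exact implementation described above: the character n-gram overlap used as the scoring function, the creation threshold of 0.4 and the cosine similarity over unigram counts are choices made for the example only.

from collections import Counter


def char_ngrams(word, n_max=3):
    """Decompose a word into character uni-, bi- and trigrams."""
    word = word.lower()
    return [word[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(word) - n + 1)]


def score(model, word):
    """Fraction of the word's n-grams already present in the model (assumed scoring)."""
    grams = char_ngrams(word)
    return sum(1 for g in grams if g in model) / len(grams) if grams else 0.0


def generate(words, threshold=0.4):
    """One generation pass: update the best-scoring model with the word's
    n-grams, or open a new model if nothing reaches the creation threshold."""
    models = []
    for word in words:
        best_score, best_model = 0.0, None
        for model in models:
            s = score(model, word)
            if s > best_score:
                best_score, best_model = s, model
        if best_model is not None and best_score >= threshold:
            best_model.update(char_ngrams(word))
        else:
            models.append(Counter(char_ngrams(word)))
    return models


def unigram_similarity(m1, m2):
    """Cosine similarity of the unigram (single-character) distributions."""
    u1 = {g: c for g, c in m1.items() if len(g) == 1}
    u2 = {g: c for g, c in m2.items() if len(g) == 1}
    dot = sum(u1[g] * u2[g] for g in u1.keys() & u2.keys())
    norm = (sum(c * c for c in u1.values()) ** 0.5
            * sum(c * c for c in u2.values()) ** 0.5)
    return dot / norm if norm else 0.0


def induction_step(words):
    """Forward and backward generation, then merge the two models with the
    most similar unigram distributions."""
    models = generate(words) + generate(list(reversed(words)))
    if len(models) > 1:
        _, i, j = max((unigram_similarity(a, b), i, j)
                      for i, a in enumerate(models)
                      for j, b in enumerate(models) if i < j)
        models[i].update(models.pop(j))
    return models


def segment(words, models):
    """Assign each word to the index of the highest-scoring induced model."""
    return [max(range(len(models)), key=lambda i: score(models[i], w))
            for w in words]


if __name__ == "__main__":
    tweet = "my dad comes back from poland with two crates of strawberries".split()
    models = induction_step(tweet)
    print(segment(tweet, models))

Repeating induction_step at least twice and merging any remaining models whose similarity exceeds a final threshold, as described above, would complete the induction before segmentation.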
3 Results

Table 1 shows the results for a set of example tweets manually collected from Twitter. For each tweet, a gold standard segmentation has been created manually and the system output has been evaluated against it. The evaluation is that of a clustering task; the words of a text are clustered around the different induced language models. Whenever the language model induction outperformed the supervised language models, the score is indicated in bold.

Besides the F score (F1), the F5 score is also indicated. This score sets β to 5 in F_β = (1 + β²)PR / (β²P + R), weighting recall higher than precision. This means that splitting pairs that occur together in the gold standard is penalized more strongly than throwing together pairs that are separate in the gold standard (Manning, Raghavan, and Schütze, 2008).

              Induction           Supervised
              F1        F5        F1        F5
Tweet 1       0.5294    0.4775    0.7441    0.8757
Tweet 2       0.7515    0.9325    0.7570    0.8121
Tweet 3       0.4615    0.8185    0.6060    0.8996
Tweet 4       0.5172    0.7587    0.7360    0.9545
Tweet 5       1.0000    1.0000    0.2500    0.4642

Table 1: F1 and F5 scores for the induction approach and the supervised approach

For comparison purposes, a supervised approach as described by Dunning (1994) has been implemented. For the supervised approach, language models for all relevant languages have been trained on Wikipedia dumps from June and July 2015 in the languages occurring in the data, namely Greek, English, French, Polish and Amharic. Since the Amharic Wikipedia is written in the Ge’ez script and the data only contains transliterated Amharic, all Amharic texts were transliterated prior to training. Tweets have then been segmented by assigning each word to the model with the highest probability. Training on a corpus of Twitter data, separated by language, might yield better results for the supervised approach; however, such a corpus would have to be compiled first.

For this small set of examples, the results show that the language model induction seems to work reasonably well, with scores comparable to the supervised approach, sometimes even performing better than it.

Closer inspection of the results reveals that the language model induction tends to generate too many clusters for a single language, resulting in a degradation of the accuracy, while on the other hand also being able to separate the different languages surprisingly well.

For example, the first tweet “Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό BUSINESS IT EXCELLENCE.” is decomposed into two English clusters and two Greek clusters, with one erroneous inclusion of ‘EXCELLENCE.’ in the Greek cluster.

• Things,
• Μόλις λύση διαγωνισμό EXCELLENCE.
• Internet of BUSINESS IT
• ψήφισα αυτή τη στο

The second tweet “Demain #dhiha6 Keynote 18h @dhiparis “The collective dynamics of science-publish or perish; is it all that counts?” par David” is decomposed as shown below. It is clear that we have one English cluster and one French cluster, and two other clusters, one of which could be labeled a ‘Named Entity’ cluster and the other possibly ‘English with erroneous inclusion of @dhiparis’. Interestingly, the French way of notating time, ‘18h’, is also included in the French cluster.

• Keynote “The collective of science-publish or perish; it all that counts?”
• Demain 18h par
• #dhiha6 David
• @dhiparis dynamics is

The third tweet “Food and breuvages in Edmonton are ready to go, just waiting for the fans #FWWC2015 #bilingualism” is split into one acronym group, three English clusters and one French cluster with the erroneous inclusion of ‘go’.

• #FWWC2015
• breuvages, go
• Food, Edmonton, to, for, the
• in, waiting, #bilingualism
• and, are, ready, just, fans

The fourth tweet “my dad comes back from poland with two crates of strawberries, żubrówka and adidas jackets omg” is again split into two English clusters and one Polish cluster with the erroneous inclusion of ‘back’.

• comes, from, with, two, crates, of, strawberries, jackets, omg
• my, dad, poland, and, adidas
• back, żubrówka

Finally, the last tweet “Buna dabo naw (coffee is our bread).” is decomposed as follows. The English words are split across four clusters while the transliterated Amharic text is clustered together. The splitting is due to the structure of the tweet; there is not enough overlapping information to build a single English cluster.

• (coffee
• bread).
• is
• our
• Buna dabo naw

4 Conclusion

The paper has presented the early findings of a weakly supervised approach for language segmentation that works on short texts. By taking the text itself as the basis for the induced language models, there is no need for training data. As the approach does not rely on external language knowledge, it is language independent.

The results seem promising, but the approach has to be tested on more data. Still, being able to achieve results comparable to supervised approaches with a weakly supervised method is encouraging.

5 Future work

Future work should concern the reduction of the number of generated clusters, ideally arriving at one cluster per language. Alternatively, it would be possible to smooth the frequent switching of language models by taking context into account.

Also, since the structure of the text strongly influences the presented approach, some form of text normalization could be used to increase the robustness of the system.
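As a supplement to the evaluation described in Section 3, the following minimal Python sketch shows one way to compute pairwise clustering F_β scores of the kind reported in Table 1; the pair-based precision/recall definitions and the toy clusters are illustrative assumptions, with β = 1 giving F1 and β = 5 giving F5.

from itertools import combinations


def cluster_pairs(clusters):
    """All unordered pairs of words that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_f(predicted, gold, beta=1.0):
    """Pairwise F_beta between a predicted clustering and the gold standard.

    A pair is correct if both words share a cluster in the prediction and
    in the gold standard; beta > 1 weights recall higher than precision."""
    pred, ref = cluster_pairs(predicted), cluster_pairs(gold)
    if not pred or not ref:
        return 0.0
    correct = len(pred & ref)
    precision, recall = correct / len(pred), correct / len(ref)
    if precision + recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))


if __name__ == "__main__":
    # Toy clusterings for illustration only, not taken from the evaluated tweets.
    gold = [["a", "b", "c"], ["x", "y"]]
    predicted = [["a", "b"], ["c", "x"], ["y"]]
    print(pairwise_f(predicted, gold, beta=1), pairwise_f(predicted, gold, beta=5))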
Sources

GaloTyri. “Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό BUSINESS IT EXCELLENCE.”. 19 June 2015, 12:06. Tweet.

Claudine Moulin (ClaudineMoulin). “Demain #dhiha6 Keynote 18h @dhiparis “The collective dynamics of science-publish or perish; is it all that counts?” par David”. 10 June 2015, 17:35. Tweet.

HBS (HBS_Tweets). “Food and breuvages in Edmonton are ready to go, just waiting for the fans #FWWC2015 #bilingualism”. 6 June 2015, 23:29. Tweet.

katarzyne (wifeyriddim). “my dad comes back from poland with two crates of strawberries, żubrówka and adidas jackets omg”. 8 June 2015, 08:49. Tweet.

TheCodeswitcher. “Buna dabo naw (coffee is our bread).”. 9 June 2015, 02:12. Tweet.

References

Alex, Beatrice, Amit Dubey, and Frank Keller. 2007. Using Foreign Inclusion Detection to Improve Parsing Performance. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 151–160.

Beesley, Kenneth R. 1988. Language identifier: A computer program for automatic natural-language identification of on-line text. In Proceedings of the 29th Annual Conference of the American Translators Association, volume 47, page 54.

Dunning, Ted. 1994. Statistical Identification of Language. Computing Research Laboratory, New Mexico State University.

Jain, Naman and Riyaz Ahmad Bhat. 2014. Language Identification in Code-Switching Scenario. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 87–93.

King, Ben and Steven P. Abney. 2013. Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, pages 1110–1119.

Lui, Marco, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press.

Mendizabal, Iosu, Jeroni Carandell, and Daniel Horowitz. 2014. TweetSafa: Tweet language identification. TweetLID @ SEPLN.

Porta, Jordi. 2014. Twitter Language Identification using Rational Kernels and its potential application to Sociolinguistics. TweetLID @ SEPLN.

Solorio, Thamar, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Gohneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. 2014. Overview for the First Shared Task on Language Identification in Code-Switched Data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 62–72.

Yamaguchi, Hiroshi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 969–978. Association for Computational Linguistics.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. TweetLID @ SEPLN.