EHU at TweetMT: Adapting MT Engines for Formal Tweets
EHU en TweetMT: Adaptación de sistemas MT a tuits formales

Iñaki Alegria, Mikel Artetxe, Gorka Labaka, Kepa Sarasola
University of the Basque Country
inaki.alegria@ehu.eus

Resumen: This paper describes the participation of the IXA group of the UPV/EHU in the Tweet Translation shared task at the SEPLN conference (TweetMT 2015). Two previously developed systems were adapted for es-eu and eu-es translation, obtaining good results (better than previously published ones). We describe the compilation of resources, the adaptation of the systems and the results obtained.

Palabras clave: machine translation, SMT, RBMT, tweets, social media

Abstract: This paper describes the participation of the IXA group from the UPV/EHU (University of the Basque Country) in the TweetMT shared task at the SEPLN-2015 conference. We have adapted existing MT engines for the es-eu and eu-es pairs, obtaining good results (better than other experiments reported in previous work). Three main aspects are described: resource compilation, engine adaptation and results.

Keywords: machine translation, SMT, RBMT, tweets, social media

1 Introduction

As the organizers of the workshop say on the home page1, "the machine translation of tweets is a complex task that greatly depends on the type of data we work with. The translation process of tweets is very different from that of correct texts posted, for instance, through a content manager. The texts also vary in terms of structure, where the latter include tweet-specific features such as hashtags, user mentions, and retweets, among others." The translation of tweets can be tackled as a direct translation (tweet-to-tweet) or as an indirect translation (tweet normalization to standard text, text translation and, if needed, tweet generation) (Kaufmann and Kalita, 2010).

When analyzing the released development corpus we observed that most of the messages were formal tweets, so we decided to tackle the problem following the direct approach, adapting previous engines to the structure of these texts. We have adapted three systems:

- an RBMT (rule-based machine translation) system named Matxin for the es-eu pair (Mayor et al., 2011). It is well known that automatic measures run on a single reference tend to penalise RBMT systems compared to SMT (statistical machine translation) systems, but we wanted to test the results.

- two state-of-the-art SMT systems, one for the es-eu pair and the other for the eu-es pair (Labaka, 2010).

1 http://komunitatea.elhuyar.org/tweetmt

2 Resource Compilation from Microtexts

Based on the development set provided, preliminary work was carried out to obtain useful resources to adapt the systems:

- an out-of-vocabulary (OOV) dictionary was obtained for Basque using our Basque morphological analyzer. We observed that the percentage of OOVs was low and that only a few of them were common in the development set. Even so, we built a small bilingual dictionary with the most frequent OOVs (5 entries).

- a dictionary of bilingual hashtags was obtained by aligning hashtags from parallel tweets (using a simple program and manual revision). After a manual review of the pairs with more than two occurrences, a dictionary of 60 pairs was generated.

Tweets from the monolingual corpora of previous shared tasks were also compiled in order to enrich the language models:

- Corpora from the TweetNorm shared task (Alegria et al., 2014): an initial collection of 227,855 Spanish tweets.

- Corpora from the TweetLID shared task (Zubiaga et al., 2014): 8,562 Spanish tweets and 380 Basque tweets.

In order to increase the low volume of data for the Basque LM, we used a corpus of tweets supplied by the CodeSyntax company2, which runs a tweet-oriented service for Basque called UMAP3. They identify Twitter accounts that use Basque and compile the tweets from these users. Most of these accounts are multilingual, so language identification was a key next step. We used our language identifier (LangId, a free language identifier based on word and trigram frequencies, developed by the IXA group of the University of the Basque Country and specialized in recognizing Basque and its surrounding languages: Spanish, French and English) and filtered out candidates with a high percentage of OOVs (thus prioritizing formal tweets and adding precision to the results obtained from LangId). In this way we compiled a corpus of 454,790 tweets.

2 http://www.codesyntax.com
3 http://umap.eu

3 Adaptation and Tuning of the MT Engines

The RBMT system was adapted to the task manually, and two new models for the SMT systems were trained and tuned.

3.1 RBMT

As mentioned, the Matxin system was used for es-eu translation. Matxin is an open-source Spanish-to-Basque RBMT engine which follows the traditional transfer model. It consists of three main components: 1) analysis of the source sentence into a dependency tree structure; 2) transfer from the source-language dependency tree to a target-language dependency structure; and 3) generation of the output translation from the target dependency structure.

Matxin was adapted to the idiosyncratic features of tweets (URLs, hashtags...). For this purpose, the de-formatter module in Matxin (Mayor et al., 2011), which separates the format information (RTF, HTML, etc.) from the text to be translated and sends the plain text to the analysis phase, was enriched with the following functions:

- URLs are managed as sentence boundaries.

- Hashtags at the beginning or at the end of the tweet remain untranslated.

- Hashtags inside the text are passed on for translation. Some will be translated (#Escocia / #Eskozia) while others will remain untranslated (#Hackathon, #Olasdeenergia).

- IDs (user mentions) receive the same treatment as other named entities.

3.2 Corpora for SMT

The adaptation and tuning of the SMT systems was laborious. First of all, the provided development corpus was divided into 3 subsets: training (2,000 pairs), tuning (1,500) and test (500). Because the alignment of the corpus had been done automatically, we manually reviewed the training part and observed that the error rate was high. After discarding non-parallel tweets, 1,444 pairs remained in the training corpus.

We used a previously compiled parallel corpus for the translation model. This 7.4-million-segment corpus was compiled by the Elhuyar Foundation and the University of the Basque Country. It includes public corpora, private corpora and a corpus built following the web-as-corpus paradigm (San Vicente and Manterola, 2012). In addition, the aforementioned training corpus (the 1,444 pairs from the development corpus of the shared task) was repeated 100 times for the bilingual model (this is not done for the language model because we use interpolation).

For the language model, previous models for Spanish and Basque were retrained adding the corpora described in the previous section. Table 1 shows the figures for the corpora used.
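The preparation of the in-domain bilingual corpus just described (splitting the development set, filtering misaligned pairs and oversampling the surviving pairs before concatenating them with the general corpus) can be sketched as follows. This is a minimal illustration under stated assumptions; the function and its filtering stand-in are hypothetical, not the scripts actually used in the shared task.

```python
import random

def prepare_bilingual_training_data(dev_pairs, oversample=100):
    """Sketch of the corpus preparation described above (hypothetical helper).

    dev_pairs: list of (es_tweet, eu_tweet) tuples from the 4,000-pair
    development set released by the organizers.
    """
    random.seed(0)
    random.shuffle(dev_pairs)
    # 1) Split the development corpus: 2,000 train / 1,500 tune / 500 test.
    train = dev_pairs[:2000]
    tune = dev_pairs[2000:3500]
    test = dev_pairs[3500:]
    # 2) The paper removed misaligned (non-parallel) pairs by manual review;
    #    dropping empty sides here is only a stand-in for that human step.
    train = [(src, tgt) for src, tgt in train if src.strip() and tgt.strip()]
    # 3) Repeat the in-domain pairs 100 times so they carry weight in the
    #    translation model once concatenated with the 7.4M-segment general
    #    corpus (not done for the LM, which is interpolated instead).
    oversampled_train = train * oversample
    return oversampled_train, tune, test
```

Oversampling by plain repetition is a common, simple way to bias phrase extraction toward in-domain data when the toolkit offers no per-corpus weighting.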
                          Sentences    Tokens (es)   Tokens (eu)
Bilingual        General  7,463,951    118,497,426   94,142,809
                 Tweets   1,444        21,022        18,804
Monolingual (es) General  28,823,939   866,383,394   -
                 Tweets   213,141      3,041,837     -
Monolingual (eu) General  1,290,501    -             14,894,592
                 Tweets   454,800      -             6,063,226

Table 1: Figures from the dataset

System                   BLEU-c   BLEU     NIST-c   NIST     TER
RBMT (es-eu) baseline    0.1395   0.1629   4.6073   5.1930   0.8824
RBMT (es-eu) enhanced    0.1891   0.2089   5.4024   5.7755   0.7377
SMT (es-eu) baseline     0.2108   0.2257   6.0361   6.4351   0.8116
SMT (es-eu) enhanced     0.2401   0.2635   6.2920   6.7714   0.6550
SMT (eu-es) baseline     0.2348   0.2591   6.2768   6.7493   0.7876
SMT (eu-es) enhanced     0.2826   0.3109   6.9641   7.4827   0.6153

Table 2: Results on the test corpora

3.3 Tuning

The development of the systems was carried out using publicly available state-of-the-art tools: the GIZA++ toolkit, the SRILM toolkit and the Moses decoder. More concretely, we followed the phrase-based approach with standard parameters: a maximum length of 80 tokens per sentence, translation probabilities in both directions with Good-Turing discounting, word-based translation probabilities (lexical model, in both directions), a phrase length penalty and the target language model. The weights were adjusted using MERT tuning with an n-best list of size 100.

For the idiosyncratic features of the tweets, we analyzed the errors produced when the system was applied to the test set extracted from the development corpus, and we decided to implement the following pre- and post-processing steps:

- Tokenization: special treatment of hyphens ('-') before declension suffixes attached to IDs, hashtags, figures, times...

- Post-processing: simple rules for fixing recurrent surface errors: double hyphens or colons, special symbols (e.g. '¿' is used in Spanish but not in Basque) and similar issues.

4 Results and Discussion

The systems prepared and tuned using the development corpus were directly used to process the test set. We thus presented two systems for the es-eu pair (RBMT and SMT) and one system (SMT) for the eu-es pair. For these language pairs only one other group presented results (3 systems).

Table 2 shows the results on the test corpus provided. We use the most common measures: BLEU and NIST (Doddington, 2002). Our SMT system was the best for the eu-es pair, and the second (very close to the first) for the es-eu pair. As expected, the RBMT system gets lower figures in the metrics (only one reference is supplied), but it is interesting to compare them with previous results.

We want to underline that the results for the es-eu pair are better than previous results reported in some papers (Labaka et al., 2007; Labaka et al., 2014). More specifically, the BLEU figures for the RBMT system in this task range from 0.1429 (baseline) to 0.2089 (improved system), and from 0.2257 (baseline) to 0.2635 (improved system) for SMT; while in the last reference (Labaka et al., 2014) BLEU figures range from 0.0572 to 0.1172 using RBMT and are around 0.145 using SMT.

These results are surprising if we consider tweet texts in general, but note that all the tweets used in the shared task are formal and that most of them were designed to be multilingual (and so, perhaps, to be easily translated). Therefore, we could say that the task was easier than usual MT tasks, at least for this language pair.

The good performance of the RBMT system on the formal tweets was expected, as syntax tends to be simple in the short texts from Twitter.

Table 3 shows some examples of the results. In sentences #4, #5 and #6 RBMT gets very good translations, but in the previous sentences the translations from the SMT system are more precise. In the near future we want to check whether combining both techniques can lead to improvements.

# 1
  source  Arranca la segunda mitad GOAZEN! — 0-0 #athlive
  ref.    Hasi da bigarren zatia, aupa!! — 0-0 #athlive
  SMT     Hasi da bigarren zatia GOAZEN! — 0-0 #athlive
  RBMT    Bigarren erdi GOAZEN ateratzen du! — 0-0 #athlive
# 2
  source  Jaume Matas ingresa en prisión URLURLURL
  ref.    Jaume Matas kartzelan sartu dute URLURLURL
  SMT     Jaume Matas kartzelan sartu dute URLURLURL
  RBMT    Jaume Matas espetxean sartzen da URLURLURL
# 3
  source  Retenciones de hasta 7 kilómetros en la AP-8 en Irun: URLURLURL
  ref.    7 kilometroko auto-ilarak AP-8an, Irungo ordainlekuan: URLURLURL
  SMT     7 kilometroko auto-ilarak AP-8 Irunen: URLURLURL
  RBMT    7km-taraino AP-8 Irunen erretentzioak: URLURLURL
# 4
  source  Qué es un OpenSpace? IDIDID 27 de septiembre. URLURLURL
  ref.    Zer da OpenSpace bat? IDIDID irailaren 27an. URLURLURL
  SMT     Zer da OpenSpace? IDIDID 27. URLURLURL
  RBMT    Zer da Openspace bat? Irailaren 27an IDIDID. URLURLURL
# 5
  source  Markel Olano denuncia que Bildu ha decidido actuar en contra de los intereses de los baserritarras #eajpnv URLURLURL
  ref.    Olanok salatu du Bilduk baserritarren interesen kontra egitea erabaki duela #eajpnv URLURLURL
  SMT     Markel Olano salatu du Bilduk jokatzea erabaki du interesen kontra baserritarren #eajpnv URLURLURL
  RBMT    Markel Olanok salatzen du Bilduk baserritarrasen interesen aurka jardutea erabaki duela #eajpnv URLURLURL
# 6
  source  Idoia Mendia reducirá la ejecutiva y asignará tareas a cada miembro URLURLURL
  ref.    Idoia Mendiak exekutiba murriztuko du, bakoitzari zeregin bat emanez URLURLURL
  SMT     Idoia Mendiak eta lan egingo du kide bakoitzari URLURLURL
  RBMT    Idoia Mendiak exekutiboa gutxituko du eta lanak esleituko dizkio kide bakoitzari URLURLURL

Table 3: Examples RBMT/SMT

We can draw the following general conclusions:

- These results cannot be extrapolated to the general task of translating tweets. Translating informal tweets will be much harder.

- MT can help community managers who manage multilingual Twitter accounts. A Twitter-oriented MT post-editing system could be developed and evaluated.

Acknowledgments

This work has been supported by the Spanish MICINN project Tacardi (Grant No. TIN2012-38523-C02-01). The CodeSyntax company and the Elhuyar Foundation collaborated with us by providing several corpora for the translation and language models. Thanks to Josu Azpillaga (CodeSyntax) and to Iñaki San Vicente, Igor Leturia, Itziar Cortes and Justyna Pietrzak (Elhuyar) for their assistance. We would also like to thank the anonymous referees for their comments and suggestions.

References

Alegria, Iñaki, Nora Aranberri, Pere R. Comas, Víctor Fresno, Pablo Gamallo, Lluís Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2014. TweetNorm_es corpus: an annotated corpus for Spanish microtext normalization. In Proceedings of LREC.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138-145. Morgan Kaufmann Publishers Inc.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Labaka, Gorka. 2010. EusMT: incorporating linguistic information into SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation. Lengoaia eta Sistema Informatikoak Saila (UPV/EHU), Donostia, March 29, 2010.

Labaka, Gorka, Cristina España-Bonet, Lluís Màrquez, and Kepa Sarasola. 2014. A hybrid machine translation architecture guided by syntax. Machine Translation, 28(2):91-125.

Labaka, Gorka, Nicolas Stroppa, Andy Way, and Kepa Sarasola. 2007. Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation.

Mayor, Aingeru, Iñaki Alegria, Arantza Díaz de Ilarraza, Gorka Labaka, Mikel Lersundi, and Kepa Sarasola. 2011. Matxin, an open-source rule-based machine translation system for Basque. Machine Translation, 25(1):53-82.

San Vicente, Iñaki and Iker Manterola. 2012. PaCo2: a fully automated tool for gathering parallel corpora from the web. In LREC, pages 1-6.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: tweet language identification at SEPLN 2014. TweetLID workshop at the SEPLN Conference. ceur-ws.org/Vol-1228/.