Dublin City University at the TweetMT 2015 Shared Task

Antonio Toral, Xiaofeng Wu, Tommi Pirinen, Zhengwei Qiu, Ergun Bicici, Jinhua Du
ADAPT Centre, School of Computing, Dublin City University, Ireland
{atoral, xwu, tpirinen, zhengwei.qiu2, ebicici, jdu}@computing.dcu.ie

Abstract: We describe our participation in TweetMT for three language pairs in both directions: Spanish from/to Catalan, Basque and Portuguese. We used a range of techniques: statistical and rule-based MT, morph segmentation, data selection with ParFDA and system combination. As for resources, our focus was on crawling vast amounts of tweets to perform monolingual domain adaptation. Our system was the best of all systems submitted for five out of the six language directions.
Keywords: machine translation, tweets, morph segmentation, data selection

1 Introduction and Objectives

While statistical machine translation (SMT) can be considered a mature technology nowadays, one of its requirements is the availability of considerable amounts of parallel text for the language pair of interest. Ideally, the parallel text used to train an SMT system should come from the same domain and genre as the text the system is going to be applied to. Thus, using MT to translate types of text for which no parallel data is available constitutes a challenge. This is the case for tweets and social media in general, the target text of the TweetMT shared task.

The main objective of our participation in the TweetMT 2015 shared task was to build the best MT systems for tweets we could under a clear constraint: it had to be done in a very short period and, to a large extent, be limited to available resources. We have taken part for three language pairs in both directions: Spanish (ES) from/to Catalan (CA), Basque (EU) and Portuguese (PT).

We decided to focus on making the best possible use of available techniques, tools and resources. Regarding techniques and tools, we rely on state-of-the-art SMT, morph segmentation for morphologically rich languages (EU), data selection with ParFDA for fast development of accurate SMT systems (Biçici, Liu, and Way, 2015) and domain adaptation (Biçici, 2015), the use of available open-source rule-based systems and, finally, system combination to take advantage of the strengths of the different systems we built. As for resources, we crawl vast amounts of tweets to perform monolingual domain adaptation and complement this with publicly available general-domain monolingual and parallel corpora.

The rest of the paper is organised as follows. Sections 2 and 3 detail the systems built and the resources used, respectively. Section 4 presents the evaluation and, finally, Section 5 outlines conclusions and lines of future work.

2 Architecture and Components of the System

Here we describe the components used in our translation pipeline. First, we pre-process the datasets (Section 2.1), then we use a set of MT systems (Section 2.2) that can incorporate additional functionality (Sections 2.3 and 2.4). Finally, we combine MT systems (Section 2.5).

2.1 Data Preprocessing

Before use, all the datasets used in our systems are preprocessed as follows:

1. Punctuation normalisation, with the Moses script (Koehn et al., 2007).

2. Sentence splitting and tokenisation, with Freeling (Padró and Stanilovsky, 2012).

3. Normalisation (only for tweets). We sort the vocabulary of a tweet corpus by word frequency and inspect the words that occur in at least 0.5% of the tweets, creating rules to convert informal words to their formal equivalents. This leads to just a handful of rules. E.g. in Spanish, "q", occurring in 2.62% of the tweets, is converted to its formal equivalent "que".

4. Truecasing, with a modified version of the Moses script. We added a set of start-of-sentence characters commonly used in Spanish: "-", "—", "¿", "“" and "‘".

2.2 MT Systems

We build SMT systems using two paradigms: phrase-based with Moses (Koehn et al., 2007) and hierarchical with cdec (Dyer et al., 2010). In both cases we use default settings. We also use off-the-shelf open-source rule-based MT (RBMT) systems, namely Apertium (Forcada et al., 2011) for ES↔CA, ES↔PT and EU→ES (revisions 60356, 60384 and 60356, respectively) and Matxin (Mayor et al., 2011) for ES→EU (API at http://ixa2.si.ehu.es/glabaka/Matxin.xml).

The SMT systems use 5-gram LMs with Kneser-Ney smoothing (Kneser and Ney, 1995), except for the ParFDA Moses SMT systems, which use LMs of order 8 to 10. We build LMs on individual monolingual corpora (cf. Section 3.2) and interpolate them with SRILM (Stolcke et al., 2002) to minimise the perplexity on the dev set. Table 4 shows, for each target language, the corpora used to build LMs together with their interpolation weights. We observe that tweets are given very high weights even if they are not the biggest corpora in the mixes.

2.3 Morphological Segmentation

Morphological segmentation is a popular method to deal with SMT for morphologically differing languages by simply splitting words into sub-word units. The main benefits of morphological segmentation are to reduce the out-of-vocabulary (OOV) rate and to increase the percentage of one-to-one word alignments between morphosyntactically different languages; e.g. in our case, by matching inflectional suffixes in EU to syntactic prepositions in ES, we expect to improve the MT quality for the EU–ES language pair. Segmentation followed by de-segmentation can also create word forms not present in the training data by matching a translated stem with a correct suffix.

In our participation, morphological segmentation was only used for EU–ES on the EU side, since EU's morphology is significantly more complex than that of ES. For the remaining languages of the shared task, there is no such large difference in morphological complexity (all of them are closely related, as they belong to the same family), so the expected gains do not outweigh the added complexity of segmentation.

We use unsupervised statistical segmentation as provided by Morfessor 2.0 Baseline (Virpioja et al., 2013; http://www.cis.hut.fi/projects/morpho/morfessor2.shtml). The basic setup for segmentation is the same as in the Abu-MaTran project submission to the WMT 2015 translation task (Rubino et al., 2015). However, some minor Twitter-related preprocessing has been added in order to keep URLs and hashtags intact. The parameters used for Morfessor training are the defaults of version 2.0.2-alpha, and the data for training is the EU side of the ES–EU parallel training data (cf. Section 3.1).

To gauge the effects of our method as well as the morphological complexity of EU as compared to ES, we show in Table 1 the OOV rates and vocabulary sizes of the ES and EU sides of the ES–EU training corpus, and of the EU corpus after morphological segmentation. Segmentation reduces the number of types by a factor of 6 (the type-to-token ratio falls by a factor of about 8.5) and the OOV rate by almost a factor of 10.

Corpora      Tokens        Types     OOV
ES           30,532,489    296,612   14.5%
EU           24,966,862    605,207   25.4%
EU morphs    35,293,220    100,990    2.6%

Table 1: Size of ES–EU training corpus in word tokens (ES and EU sides) and in morph tokens (EU).

2.4 ParFDA

ParFDA parallelizes instance selection with an optimized parallel implementation of FDA5. It significantly reduces the time needed to deploy accurate SMT systems, especially in the presence of large training data, while still achieving state-of-the-art SMT performance (Biçici, Liu, and Way, 2015; Biçici and Yuret, 2015). The detailed composition of the available corpora, referred to as constrained (C), is provided in Section 3. For ES, we also included LDC Gigaword corpora (Mendonça et al., 2011). The size of the LM corpora includes both the LDC and the monolingual LM corpora provided. The training and LM data selected by ParFDA yield accurate translation outputs, with the selected LM data reducing the number of OOV tokens by up to 32% and the perplexity by up to 25%, and allow us to model higher-order dependencies (Table 2).

5-gram   OOV                                   perplexity
S→T      C train  FDA train  FDA LM  %red      C train  FDA train  FDA LM  %red
CA–ES    2948     2957       2324    .21       332      336        294     .11
EU–ES    3021     3046       2443    .19       462      483        546     -.18
PT–ES    2871     2896       1951    .32       633      623        486     .23
ES–CA    3338     3345       2890    .13       325      330        338     -.04
ES–EU    4110     4129       3349    .19       745      761        637a    .15a
ES–PT    3087     3117       2216    .28       993      941        746     .25

Table 2: LM comparison built from training corpus (C train), ParFDA selected training data (FDA train), ParFDA selected LM data (FDA LM). %red is reduction proportion.
a: The ES–EU LM is recomputed after the task, removing duplicates, which slightly decreases BLEU and increases NIST.

2.5 System Combination

For each language direction we have built up to five systems, as detailed in Sections 2.2 to 2.4: (i) phrase-based and (ii) hierarchical SMT, (iii) phrase-based with morph segmentation, (iv) phrase-based with ParFDA and (v) RBMT. We hypothesise that these systems have complementary strengths, and thus we decide to perform system combination. To that end we use MEMT (Heafield and Lavie, 2010) with default settings, except for the parameter length, for which we use its default (7) for all directions except for ES→EU, for which we use 5 according to empirical results on the development set.

3 Resources Employed

3.1 Parallel Corpora

Ideally, we would use data in the same domain and genre as the test set, i.e. tweets. We have access to parallel tweets provided by the task for ES–CA and ES–EU (4,000 parallel tweets for each language pair; we use 1,000 for dev and the remaining 3,000 for training). For ES–PT we have access to 999 parallel tweets (we use them for dev) from Brazilator (http://www.cngl.ie/brazilator), a recent project by DCU and Microsoft to translate tweets from the 2014 soccer World Cup across 24 language directions.

As the availability of parallel tweets for the language pairs of TweetMT 2015 is rather limited (at most we have 4,000 per language pair), we use additional sources of parallel data. For ES–CA we use elPeriodico (eP; http://catalog.elra.info/product_info.php?products_id=1122) and a selection of contemporary novels. For ES–EU, we use translation memories (TMs) provided by the shared task (http://komunitatea.elhuyar.org/tweetmt/resources/) and two corpora from Opus (Tiedemann, 2012; http://opus.lingfil.uu.se/): Open subtitles 2013 and Tatoeba. Finally, for ES–PT we use Europarl v7 (http://www.statmt.org/europarl/) and two corpora from Opus: news-commentary and Tatoeba. Table 3 provides details on these corpora.

Pair     Corpus          # s.     # tokens
ES–CA    tweets          3K       48K, 48K
         eP              0.6M     13.5M, 14M
         novels          47K      .78M, .86M
ES–EU    tweets          3K       42K, 38K
         TMs             1.1M     28.9M, 23.5M
         OpenSubs        0.16M    1.2M, 1.0M
         Tatoeba         902      6.7K, 5.5K
ES–PT    EU (Europarl)   1.9M     54M, 53M
         NC              9K       .26M, .25M
         Tatoeba         53K      .42M, .41M

Table 3: Parallel corpora used for training. For each corpus we provide its number of sentence pairs (# s.) and tokens on both sides (# tokens).

3.2 Monolingual Corpora

Our main source of monolingual data is in-domain and comes from crawled tweets. We use TweetCat (Ljubešić, Fišer, and Erjavec, 2014) and crawl tweets for all the target languages (CA, ES, EU and PT) during March and April 2015.

For each language we create two lists of words as required by the crawler: (i) the most common discriminating words (up to 100), which are unique to the language and are used to seed the crawler so that it can find candidate tweets; and (ii) the most common words of the language (200), which are used to determine the language of crawled tweets. These two lists are derived from a list of the most common words found in a corpus of subtitles (https://onedrive.live.com/?cid=3732e80b128d016f&id=3732E80B128D016F!3584).

The crawled tweets are post-processed with langid (https://github.com/saffsd/langid.py) to identify their language. We keep the tweets whose langid confidence score is above a certain threshold, which is set empirically at 0.7 by inspecting tweets.

In addition to crawled tweets, we use the target sides of the parallel corpora (cf. Section 3.1) and a set of monolingual corpora, as follows. For CA we use caWaC (Ljubešić and Toral, 2014), a corpus crawled from the .cat top-level domain. For ES, news crawl and news-commentary from WMT'13 (http://www.statmt.org/wmt13/). For EU, a dump from Wikipedia (20150407). For PT, the news sources CETEMPublico (http://www.linguateca.pt/cetempublico/) and CETENFolha (http://www.linguateca.pt/cetenfolha/), and a dump from Wikipedia (20150510).

Table 4 shows details on these corpora, including their interpolation weights (cf. Section 2.2).

Lang   Corpus      # tokens   Weights
CA     tweets      29M        0.60
       caWaC       0.5G       0.33
       eP          14M        0.07
ES     tweets      129.2M     0.75
       news        0.4G       0.21
       europarl    60M        0.04
EU     tweets      11.3M      0.97
       Wikipedia   11.5M      0.01
       TMs         23M        0.02
PT     tweets      33M        0.93
       Wikipedia   166M       0.02
       Others      286M       0.05

Table 4: Monolingual corpora used for training. For each corpus we show its number of tokens (# tokens) and its weight in LM interpolation.

4 Evaluation

We report our results on the development set (all systems built) and then on the test set (systems submitted).

4.1 Evaluation on Development Data

Table 5 presents the results obtained on the devset by the individual systems and a set of combinations for the three language pairs we covered: ES–CA, ES–EU and ES–PT. The scores were obtained on raw MT output (i.e. tokenised and truecased) as calculated by us with BLEU (Papineni et al., 2002) (multibleu cased as included in Moses version 3) and TER (Snover et al., 2006) (as implemented in TERp version 0.1). Due to time constraints not all the possible combinations were tried. The scores of the best individual system and combination are shown in bold.

At least one of the combinations obtains better scores (both in terms of BLEU and TER) than the best individual system (except for ES↔PT with BLEU and for CA→ES with TER), supporting our hypothesis that the individual systems built are complementary. Although SMT systems outperform RBMT systems for all directions, the addition of RBMT in system combinations has a positive impact (except for ES↔PT). (When interpreting the results, it should be taken into account that automatic metrics are known to be biased towards statistical MT approaches; Callison-Burch, Osborne, and Koehn, 2006.) Phrase-based SMT outperforms hierarchical SMT for related language pairs (ES–CA and ES–PT), but the opposite is true for the unrelated language pair ES–EU. We hypothesise this is due to the fact that ES and EU follow different word orders (SVO and SOV, respectively), which leads to pervasive long reorderings in translation that are better modelled with a hierarchical approach.

Direction   System         BLEU    TER
ES→CA       Moses (1)      82.21   0.1102
            cdec (2)       81.45   0.1128
            ParFDA (3)     82.37   0.1062
            Apertium (4)   78.17   0.1310
            1+2            81.71   0.1102
            1+4            82.37   0.1057
            1+2+4          81.93   0.1085
CA→ES       Moses (1)      82.52   0.1086
            cdec (2)       81.76   0.1118
            ParFDA (3)     82.16   0.1063
            Apertium (4)   77.96   0.1329
            1+2            82.38   0.1088
            1+4            82.58   0.1077
            1+2+4          82.38   0.1083
            1+3+4          82.45   0.1074
ES→EU       Moses (1)      22.57   0.6116
            cdec (2)       23.7    0.5863
            ParFDA (3)     21.59   0.6181
            Matxin (4)     12.66   0.7436
            Morph (5)      5.20    0.8812
            1+2            23.18   0.5796
            1+4            18.36   0.6112
            1+2+4          23.58   0.5771
            1+2+4+5        24.07   0.5741
            1+2+3+4+5      24.42   0.5777
EU→ES       Moses (1)      24.21   0.6228
            cdec (2)       24.65   0.5911
            ParFDA (3)     22.25   0.6346
            Apertium (4)   18.36   0.6918
            Morph (5)      11.25   0.9655
            1+2            24.18   0.5883
            1+4            24.33   0.6076
            1+2+4          24.94   0.5831
            1+2+4+5        25.21   0.5792
ES→PT       Moses (1)      29.21   0.6052
            cdec (2)       28.14   0.5962
            ParFDA (3)     27.74   0.6164
            Apertium (4)   24.96   0.6272
            1+2            28.76   0.5891
            1+4            26.58   0.6082
            1+2+4          27.00   0.5878
PT→ES       Moses (1)      30.47   0.5267
            cdec (2)       29.42   0.5254
            ParFDA (3)     29.63   0.5338
            Apertium (4)   27.52   0.5335
            1+2            29.9    0.5230
            1+4            30.01   0.5131
            1+2+4          29.89   0.5089

Table 5: Results on the dev set.

4.2 Evaluation on Test Data

Table 6 presents the results on the test set of the systems we submitted. The scores shown are the ones reported by the organisers (case-insensitive BLEU and TER) on post-processed MT outputs (detokenised and detruecased). For each language direction we submitted the three systems that obtained the best performance on the dev set. The scores of the best submitted system are shown in bold.

Direction   System             BLEU      TER
ES→CA       DCU1 (1+4)         0.7669    0.1740
            DCU2 (1)           0.7899†   0.1626†
            DCU3 (1+2+4)       0.7630    0.1738
CA→ES       DCU1 (1+4)         0.7826    0.1506
            DCU2 (1+2+4)       0.7816    0.1500
            DCU3 (1+3+4)       0.7943†   0.1431†
ES→EU       DCU1 (1+2+4)       0.2455    0.6533
            DCU2 (1+2+3+4+5)   0.2636†   0.6469†
            DCU3 (1+2+4+5)     0.2493    0.6553
EU→ES       DCU1 (2)           0.2687    0.6512
            DCU2 (1+2+4)       0.2698    0.6406
            DCU3 (1+2+4+5)     0.2728    0.6363
ES→PT       DCU1 (1)           0.3595    0.5290
            DCU2 (1+2)         0.3711†   0.5157†
            DCU3 (1+2+4)       0.3687    0.5163
PT→ES       DCU1 (1)           0.4465    0.5767
            DCU2 (1+2)         0.4467    0.5627
            DCU3 (1+2+4)       0.4524†   0.5403†

Table 6: Results on the test set.

Out of six directions, our best submission is the top performing system for five of them (indicated with †). For most directions, the addition of an RBMT system leads to better performance. Similarly, for the directions where we have used segmentation (ES↔EU) and ParFDA (CA→ES and ES→EU), the addition of systems based on these techniques had a positive impact on the results.

We now delve deeper into the results obtained by SMT systems based on ParFDA (cf. Section 2.4). Although ParFDA systems were submitted to the shared task only as part of system combinations, we have evaluated a posteriori the performance of this technique by means of standalone systems on the test set. The ParFDA Moses SMT system obtains top results in CA→ES and ES→CA and close to top results in the other language pairs, with an average difference of 1.21 BLEU points to the top (Table 7).
TweetMT    CA–ES    EU–ES   PT–ES
ParFDA     .8012    .2713   .4374
Top        .7942    .3109   .4519
diff       -.007    .0396   .0145
LM order   8        8       8

TweetMT    ES–CA    ES–EU   ES–PT
ParFDA     .7926    .2482   .3589
Top        .7907    .2636   .3711
diff       -.0019   .0154   .0122
LM order   8        10      8

Table 7: BLEU results for ParFDA standalone systems on the test set, their difference to the top, and ParFDA LM order used. ParFDA obtains top results in CA→ES and ES→CA and 1.21 BLEU points average difference.

An interesting feature of ParFDA is its ability to build and deploy SMT systems quickly. In the specific case of TweetMT, ParFDA took about 8 hours to build for ES→CA and 28 hours for PT→ES, taking about 11 GB and 27 GB of disk space in total, respectively.

5 Conclusions and Future Work

This paper has described our participation in the TweetMT 2015 shared task. Our focus has been on rapid development of MT systems adapted to tweets by making the best possible use of available techniques, tools and resources. Our best submissions have been the ones that combine different MT systems (except for ES→CA), supporting our hypothesis that the techniques we have used are complementary.

As for future work, we consider several possible avenues. First, we would like to analyse in detail the translations produced by our systems in order to derive findings beyond the ones we can extract from the automatic evaluation metrics used in the task. Second, most of the tweets in the test set use formal language, and thus we would like to test our systems on a more representative set of tweets where informal language would be expected to be more pervasive. (The prevalence of formal language is due to the fact that the test tweets are extracted from Twitter accounts that publish tweets in multiple languages, and such accounts belong, to a large extent, to institutions that use formal language.)

Acknowledgments

This research is supported by the EU 7th Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran), by SFI as part of the ADAPT research center (07/CE/I1142) at Dublin City University and the project "Monolingual and Bilingual Text Quality Judgments with Translation Performance Prediction" (13/TIDA/I2740). We also thank the SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support. Finally, we would like to thank Mikel L. Forcada and Iacer Calixto for their advice on normalising tweets for Basque and Portuguese, respectively, and Gorka Labaka for his help with Matxin's API.

References

Biçici, Ergun. 2015. Domain adaptation for machine translation with instance selection. The Prague Bulletin of Mathematical Linguistics, 103:5–20.

Biçici, Ergun, Qun Liu, and Andy Way. 2015. ParFDA for fast deployment of accurate statistical machine translation systems, benchmarks, and statistics. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September. Association for Computational Linguistics.

Biçici, Ergun and Deniz Yuret. 2015. Optimizing instance selection for statistical machine translation with feature decay algorithms. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 23:339–350.

Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 249–256.

Dyer, Chris, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the Association for Computational Linguistics (ACL).

Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144. Special Issue: Free/Open-Source Machine Translation.

Heafield, Kenneth and Alon Lavie. 2010. Combining machine translation output with open source: The Carnegie Mellon multi-engine machine translation scheme. The Prague Bulletin of Mathematical Linguistics, 93:27–36.

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ljubešić, Nikola, Darja Fišer, and Tomaž Erjavec. 2014. TweetCaT: a tool for building Twitter corpora of smaller languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.

Ljubešić, Nikola and Antonio Toral. 2014. caWaC: a web corpus of Catalan and its application to language modeling and machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May.

Mayor, Aingeru, Iñaki Alegria, Arantza Díaz de Ilarraza Sánchez, Gorka Labaka, Mikel Lersundi, and Kepa Sarasola. 2011. Matxin, an open-source rule-based machine translation system for Basque. Machine Translation, 25(1):53–82.

Mendonça, Ângelo, Daniel Jaquette, David Graff, and Denise DiPersio. 2011. Spanish Gigaword Third Edition. Linguistic Data Consortium.

Padró, Lluís and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey. ELRA.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Rubino, Raphael, Tommi Pirinen, Miquel Esplà-Gomis, Nikola Ljubešić, Sergio Ortiz-Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, and Antonio Toral. 2015. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In Proceedings of the Tenth Workshop on Statistical Machine Translation.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Stolcke, Andreas et al. 2002. SRILM: an extensible language modeling toolkit. In INTERSPEECH.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Virpioja, Sami, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline.