The UPC TweetMT participation: Translating Formal Tweets Using Context Information [∗]
Participación de la UPC en TweetMT: Traducción de tweets formales usando información de contexto

Eva Martínez Garcia, Cristina España-Bonet
TALP Research Center, Universitat Politècnica de Catalunya
Jordi Girona, 1-3, 08034 Barcelona, Spain
{emartinez,cristinae}@cs.upc.edu

Lluís Màrquez
Qatar Computing Research Institute, Qatar Foundation
Tornado Tower, Floor 10, P.O. Box 5825, Doha, Qatar
lmarquez@qf.org.qa

Resumen: En este artículo describimos los sistemas con los que participamos en la tarea compartida TweetMT. Desarrollamos dos sistemas para el par de idiomas castellano–catalán: un traductor estadístico diseñado a nivel de frase y un sistema sensible al contexto aplicado a tweets. En el segundo caso definimos el "contexto" de un tweet como los tweets producidos por un mismo usuario durante un día. Estudiamos el impacto de este tipo de información en las traducciones finales cuando se usa un traductor a nivel de documento. Una variante de este sistema incluye modelos semánticos adicionales.
Palabras clave: Traducción automática, Twitter, Traducción sensible al contexto

Abstract: In this paper, we describe the UPC systems that participated in the TweetMT shared task. We developed two main systems that were applied to the Spanish–Catalan language pair: a state-of-the-art phrase-based statistical machine translation system and a context-aware system. In the second approach, we define the "context" of a tweet as the tweets produced by the same user on the same day, and we study the impact of this kind of information on the final translations when using a document-level decoder. A variant of this approach also considers semantic information from bilingual embeddings.
Keywords: Machine Translation, Twitter, Context Aware Translation

1 Introduction

Twitter is a very popular social network. This microblogging service allows users to share a huge amount of information quickly. Twitter users usually produce monolingual content (34% in English and 12% in Spanish, for example [1]). However, Twitter is a multilingual communication environment: many users of different nationalities post messages in their own language. To ease the spread of information, it would therefore be useful to post messages in several languages simultaneously.

One option for creating multilingual tweets is crowdsourcing manual translations. Meedan is a non-profit organization [2] that uses this resource to share news between the Arabic- and English-speaking communities. Another example, in this case applied to SMS, is the crowdsourced translation carried out during the earthquake in Haiti in 2010 (Munro, 2010). It allowed Haitian Kreyol and French-speaking communities of volunteers to translate texts into English and to categorize and geolocate the messages in real time in order to help the primary emergency responders.

A few works also apply machine translation to tweets, and interest in the topic is growing over the years. Gotti, Langlais, and Farzindar (2013) study the application of statistical machine translation (SMT) systems to the translation of tweets from Canadian Government agencies. Jehl, Hieber, and Riezler (2012) describe a system that does not rely on parallel data. Instead, they try to find similar tweets in the target language in order to train a standard phrase-based SMT pipeline.

All these papers describe some common problems that arise when translating tweets or short messages. The first and most common obstacle is the colloquial language used in the messages, closely followed by writing errors. To address these phenomena, a normalization step is needed prior to translation. Also, the Twitter 140-character constraint is hard to maintain in a translation, so an effort must be made to generate tweets of legal length. Another very common problem is the handling of hashtags: it is not clear whether they have to be translated or not, nor where they should be placed in the sentence.

TweetMT is a shared task whose aim is to translate formal tweets [3]. These are messages usually tweeted by institutions; they are well written and use no colloquial vocabulary.

In this paper we introduce the two systems presented to the competition for the Spanish–Catalan language pair, together with some of the improvements made after the submission deadline. First, we present a state-of-the-art SMT system adapted and tuned using Twitter messages. Second, we present a system that looks at the context information of a tweet to improve its translation. This second system uses a document-level decoder to take the context into account, and it can be combined with bilingual distributed vector models, which make it possible to consider additional semantic information.

This paper is organized as follows. We describe the developed systems in Section 2 and analyze the obtained results in Section 3. Finally, we present some discussion and guidelines for future work in Section 4.

[∗] This research has been partially funded by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC). The authors thank Iñaki Alegría and Gorka Labaka for providing monolingual corpora.
[1] http://www.technologyreview.com/graphiti/522376/the-many-tongues-of-twitter/
[2] http://news.meedan.net
[3] All the information and resources related to the TweetMT2015 shared task are available at: http://komunitatea.elhuyar.org/tweetmt/

2 System Description

This section describes the corpora used for training the systems (both general corpora and corpora of tweets) and their processing as a common resource for the two main translation engines.

2.1 Data

As parallel corpus we use the Spanish–Catalan corpus El periódico, a collection of news with 2,478,130 aligned sentences that is available in the ELRA catalogue [4]. The shared-task organization released 4,000 parallel tweets for development and 2,000 for testing.

In order to adapt the systems to the Twitter genre, we also gathered a collection of monolingual tweets. The Catalan corpus of tweets was collected using the Twitter API between 13th March 2015 and 8th May 2015. We selected 65 users whose accounts mainly belong to Catalan institutions, sports clubs or newspapers, so we expect them to post mostly in formal language. It is worth noting that there is some overlap between the users that we selected and the ones considered in the TweetMT corpora. In particular, we used some tweets from museupicasso, Liceu_cat, Penya1930 and RCDEspanyol. Since the TweetMT test and development data were collected in 2013–2014 and our monolingual tweet corpora in 2015, there is no overlap between the training data and the tweet corpora delivered for the task. With this methodology we obtained 90,744 tweets in Catalan. A similar corpus in Spanish was already available as a resource of the Tweet Normalization Workshop (Alegria et al., 2014) [5]. In this case, 227,199 tweets were collected in only two days, 1st and 2nd April 2013.

We also use standard monolingual corpora to build larger language models. For Catalan, we select the corpora available on the Opus site [6] (4.8M sentences). For Spanish, we use the corpora provided for the WMT13 Quality Estimation Task [7] (53.8M sentences).

We pre-processed the development dataset and the monolingual corpora of tweets to make them similar to the format of the test set. This includes replacing every URL in the data with the URLURLURL label and every username with the IDIDID label. We decided not to translate hashtags because of their difficulty and because we observed that, in the development set, approximately two thirds of them remained untranslated. In order to keep the hashtag information, we replace every hashtag in a tweet with an Hn label, where n is the number of the hashtag, and we keep a record file where the hashtags that appear in each tweet are stored. This strategy allows us to generalize the translation of every hashtag and makes it easy to restore the corresponding original value before building the final translation. The position of the hashtags in our output is determined by the position that the decoder assigns to the corresponding labels.

[4] http://catalog.elra.info/product_info.php?products_id=1122
[5] http://komunitatea.elhuyar.org/tweet-norm
[6] http://opus.lingfil.uu.se/, Corpora: DOGC, KDE4, OpenSubtitles 2012 and 2013, Ubuntu and Tatoeba corpora (Tiedemann, 2012; Tiedemann, 2009) (66.5M words).
[7] http://statmt.org/wmt13/quality-estimation-task.html, Corpora: Europarl corpus v7; United Nations; NewsCommentary 2007, 2008, 2009 and 2010; AFP, APW and Xinhua (1.59G words).
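The pre-processing of Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' script: the regular expressions, the function names `mask_tweet` and `restore_hashtags`, and the choice of numbering hashtags by their position inside each tweet are assumptions made for illustration, and the returned list stands in for the record file.

```python
import re

# Hypothetical patterns; the paper does not specify how URLs, usernames
# and hashtags were detected.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def mask_tweet(text):
    """Replace URLs, usernames and hashtags with placeholder labels.

    Returns the masked text and the list of original hashtags in order of
    appearance, which plays the role of the per-tweet record file.
    """
    text = URL_RE.sub("URLURLURL", text)
    text = USER_RE.sub("IDIDID", text)
    hashtags = []

    def _label(match):
        hashtags.append(match.group(0))
        return f"H{len(hashtags)}"  # H1, H2, ... (assumed numbering)

    return HASHTAG_RE.sub(_label, text), hashtags


def restore_hashtags(translation, hashtags):
    """Put the original hashtags back wherever the decoder left the labels."""

    def _original(match):
        index = int(match.group(1))
        # Labels with no recorded hashtag are left untouched.
        if 1 <= index <= len(hashtags):
            return hashtags[index - 1]
        return match.group(0)

    return re.sub(r"\bH(\d+)\b", _original, translation)
```

Because the labels are restored wherever they end up in the output, the position of each hashtag in the final tweet follows the decoder's reordering, as described above.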
2.2 Basic SMT System

Our basic approach is a state-of-the-art phrase-based SMT system based on the Moses decoder (Koehn et al., 2007) and GIZA++ (Och and Ney, 2003). We trained the system using the El periódico Spanish–Catalan parallel corpus.

Language models were built using the SRILM toolkit (Stolcke, 2002). The Spanish general language model is an interpolation of several 5-gram language models with interpolated Kneser-Ney discounting, as given by (Specia et al., 2013) [8]. The Catalan 5-gram language model was built with the same features on the general Catalan monolingual corpus described above. In order to adapt the Moses system to the Twitter genre, we introduced a second language model trained using only the tweet corpora described in the previous subsection. The Moses decoder uses both language models as feature functions.

Finally, the system is tuned with MERT (Och, 2003) against the BLEU measure (Papineni et al., 2002) on the tweets of the development set.

[8] Interpolation weights were trained with the interpolate-lm.perl script from Moses and the interpolated language models were binarized afterwards.

2.3 Context-Aware SMT System

A current limitation of standard SMT systems is that they translate sentences one after the other without using the information given by the surrounding ones. This problem can be even more pronounced in short sentences such as tweets, where the number of content words is very small (17 words/tweet on average in the development set for both languages).

In order to alleviate this limitation, we use a document-level decoder that takes a whole document as its translation unit. In our case, one first has to define what a document is. After analyzing the development data, we decided to define the context of a tweet as the surrounding tweets posted by the same user during the same day. In cases where this number was less than 30, we put together the tweets posted during consecutive days until reaching the threshold of at least 30 tweets. In this way, we expect to obtain collections of tweets (documents) that are closely related, since they come from the same source and were produced within a short period of time. Notice that this way of choosing the related tweets does not reflect a real scenario on Twitter, where only the previous tweets from a particular user are available. However, in an offline scenario, considering past and future context characterizes the domain of the messages better. We leave the comparison between both implementations as future work.

In our experiments, we use a document-oriented decoder: the Docent decoder (Hardmeier et al., 2013; Hardmeier, Nivre, and Tiedemann, 2012). In a nutshell, this decoder moves from a sentence search space to a document search space. It computes and maximizes the translation score for a document as a whole and not only for a sentence. However, Docent also has features that can work at phrase level. In fact, the first step in the document search of this decoder is equivalent to the SMT system that we described previously.

2.3.1 Semantic Models

The Docent framework also allows distributed models to be used as semantic space language models. We want to take advantage of this characteristic and introduce more semantic information in our system by using embeddings trained with the word2vec package (Mikolov et al., 2013a; Mikolov et al., 2013b). Since our goal is to use the embeddings for translation, we train bilingual models following the same strategy as in (Martínez-Garcia et al., 2014): the units used to train the vector models are bilingual pairs of the form targetWord_sourceWord. This kind of vector is useful to capture information related not only to the target-side or source-side words, but also to the translations themselves. For this system, we use the best configuration obtained in (Martínez-Garcia et al., 2014), that is, we train a CBOW architecture using a context window of 5 tokens to get 600-dimensional vectors. The aligned parallel corpus needed to train the models was obtained from the Opus collection and is built from the OpenSubtitles 2012 and 2013, Tatoeba and EUbookshop parallel corpora. The final semantic models contain 1,527,004 Catalan_Spanish units and 1,391,022 Spanish_Catalan units. When translating a document, Docent uses these semantic models to estimate an additional score for every phrase that is proportional to the distance between the vectors of that phrase and its local context [9].

[9] The local context of a phrase consists of its previous 30 tokens.
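The document construction of Section 2.3 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: the function name `build_documents`, the `(user, day, text)` input format, the treatment of "consecutive days" as the user's successive active days, and the handling of leftover tweets are all assumptions.

```python
from collections import defaultdict
from datetime import date


def build_documents(tweets, min_size=30):
    """Group tweets into per-user documents of at least `min_size` tweets.

    `tweets` is an iterable of (user, day, text) tuples, where `day` is a
    datetime.date. The tweets of one user on one day form a block, and blocks
    from that user's successive days are merged until the threshold is met.
    """
    by_user_day = defaultdict(lambda: defaultdict(list))
    for user, day, text in tweets:
        by_user_day[user][day].append(text)

    documents = []
    for user, days in by_user_day.items():
        current = []
        for day in sorted(days):
            current.extend(days[day])
            if len(current) >= min_size:
                documents.append((user, current))
                current = []
        if current:
            # Assumption of this sketch: leftover tweets join the user's last
            # document, or stay alone when the user has no document yet.
            if documents and documents[-1][0] == user:
                documents[-1][1].extend(current)
            else:
                documents.append((user, current))
    return documents


# Toy usage: two tweets from one user on one day form a single short document.
toy = [
    ("user_a", date(2015, 3, 13), "first tweet"),
    ("user_a", date(2015, 3, 13), "second tweet"),
]
toy_documents = build_documents(toy)
```

Since whole days are merged, a document can contain tweets written after the one being translated, which corresponds to the offline scenario discussed in Section 2.3. An online variant would only look at earlier tweets.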
3 Evaluation

In the previous section we introduced three different translation systems: a standard sentence-level SMT system (SMT), a document-level SMT system (DSMT) and a document-level SMT system enriched with additional semantic information (semDSMT). For the shared task we only submitted results with the SMT and semDSMT systems (the SMTsub and semDSMTsub systems). However, some problems with the input tokenization were found after the submission [10]. In this section, we report the results both before and after solving this issue. We also found a problem in the integration of the semantic vector models inside the document-oriented decoder (semDSMT systems) that invalidates the results of this system submitted to the task.

Automatic evaluation results for our systems are shown in Table 1. We obtained these results using the Asiya toolkit (Giménez and Màrquez, 2010) for several lexical metrics: WER, PER, TER, BLEU, NIST, GTM2 [11], MTRex [12], RGS* [13] and Ol [14] (Nießen et al., 2000; Tillmann et al., 1997; Snover et al., 2006; Snover et al., 2009; Papineni et al., 2002; Doddington, 2002; Melamed, Green, and Turian, 2003; Denkowski and Lavie, 2012; Lavie and Agarwal, 2007; Lin and Och, 2004), together with a normalized arithmetic mean of the lexical metric scores (ULC) (Giménez and Màrquez, 2008).

Catalan to Spanish, 140 chars/tweet
System        WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
SMTsub        20.17  16.40  19.42  68.20  11.22  62.71  78.46  77.31  74.72  65.04
semDSMTsub    25.10  17.09  22.25  63.12  10.93  57.92  76.44  75.56  73.76  58.62
SMT           14.96  11.82  14.16  76.67  12.07  71.48  84.08  81.34  82.11  78.62
DSMT          15.74  12.23  14.94  75.35  11.92  69.79  83.38  80.74  81.40  76.75

Catalan to Spanish, free
System        WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
SMTsub        20.13  16.31  19.38  68.25  11.22  62.71  78.52  77.35  74.76  65.05
semDSMTsub    25.07  17.01  22.21  63.17  10.94  57.92  76.50  75.60  73.80  58.62
SMT           14.92  11.74  14.13  76.73  12.08  71.49  84.14  81.39  82.15  78.65
DSMT          15.70  12.15  14.90  75.41  11.92  69.79  83.45  80.79  81.44  76.79

Spanish to Catalan, 140 chars/tweet
System        WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
SMTsub        14.35  11.25  13.63  77.93  12.04  72.69  53.98  82.19  83.18  66.51
SMT           14.32  11.30  13.58  78.07  12.04  73.02  54.08  82.21  83.29  66.62
DSMT          14.22  11.10  13.46  78.19  12.07  72.96  54.14  82.45  83.46  67.10

Spanish to Catalan, free
System        WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
SMTsub        14.33  11.24  13.61  77.93  12.04  72.69  53.99  82.19  83.20  66.51
SMT           14.31  11.30  13.56  78.06  12.04  73.02  54.09  82.21  83.31  66.62
DSMT          14.20  11.09  13.44  78.19  12.07  72.97  54.15  82.45  83.48  67.11

Table 1: Evaluation with a set of lexical metrics for our systems on the Catalan–Spanish language pair. Results include the scores obtained with the raw translations (free) and with the restriction of only considering the first 140 characters per tweet (140 chars/tweet).

Comparing the SMT and SMTsub rows in Table 1 when translating from Catalan to Spanish, it is clear that fixing the tokenization problem in the Catalan test set significantly improves the scores in all the metrics.

Note that, when translating into Spanish, the SMT system outperforms the rest, whereas when translating into Catalan the DSMT system is the one with the best scores in most metrics. We observe that the differences between the scores of the SMT and DSMT systems are not statistically significant when translating from Spanish to Catalan, but the differences in the other translation direction are indeed statistically significant, both measured at the 95% confidence level [15]. For example, the BLEU score obtained by the SMT system is 1.32 points higher than that of DSMT when translating into Spanish, but DSMT is 0.12 BLEU points above SMT in the other direction.

The similarity between the results of the SMT and DSMT systems has two main reasons. On the one hand, the DSMT system departs from the SMT one, so for an already good translation, such as the ones obtained for tweets, only a few changes are applied. On the other hand, the automatic evaluation metrics are not sensitive to the changes due to the context information. It is also important to notice that there is only one reference. This makes it more difficult to obtain an accurate evaluation of the translations, since correct variations, using synonyms for example, will be scored as wrong translations.

For instance, in the first example in Table 2, we observe how the DSMT system obtains a better translation than the SMT system with respect to the reference, but both systems actually obtain a correct translation. There are other examples where the DSMT system has a correct translation that does not match the reference, as shown in Example 2 in Table 2. In this case, both systems obtain good translations, but the SMT translation is closer to the reference since the DSMT system uses synonyms for partido and FCB (encuentro and Barça, respectively). One example where the context information is useful is Example 3 in Table 2, where DSMT uses cancha instead of pista to translate pista. This is a more specific option, since the user account that produced the message belongs to a famous Spanish basketball team that mostly tweets information about basketball.

In the other direction, we found similar phenomena. Example 4 in Table 2 shows again how both systems generate correct translations. In spite of the spelling mistake in the reference (cumpleix instead of compleix), this time the closest translation to the reference is the one from the SMT system, but the DSMT one is still correct.

Most of the problems that we found in our experiments are related to the lack of normalization of the source and to the decision to keep the hashtags untranslated. We found several examples where the original tweet is not well written and this produces errors in the translations. For instance, in "Gràcies x ls mencions sobre l'expo #PostPicasso" our systems are not able to translate the informal abbreviation ls correctly. Regarding the hashtags, we found "#elmésllegit", which appears translated as "#lomásleído" in the reference, whereas our systems preserve the original hashtags by design.

[10] There were errors when tokenizing the article form l' as well as other elided forms like 'n, d' or s'. We also fixed the tokenization of the pronouns that appear after a verb with a dash, as in animar–los or donar–nos.
[11] We use the GTM version with the parameter associated to long matches e = 2.
[12] We use the METEOR version that uses only exact matching.
[13] We use the ROUGE variant which skips bigrams without max-gap-length.
[14] Lexical overlap inspired by the Jaccard coefficient for set similarity.
[15] Significance of the difference between the systems measured for the NIST and BLEU metrics using the implementation of paired bootstrap resampling included in the Moses decoder.
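The idea behind the significance test mentioned in Section 3 (paired bootstrap resampling) can be sketched as follows. This is a simplified illustration and not the Moses implementation used in the paper: the function name `paired_bootstrap` is hypothetical, and the sketch compares sums of per-tweet scores, whereas corpus-level metrics such as BLEU or NIST have to be recomputed from aggregated n-gram statistics on every resampled test set.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    `scores_a` and `scores_b` hold one score per test tweet, aligned by
    index, with higher values meaning better translations.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two aligned, non-empty score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw a test set of the same size with replacement. The same indices
        # are used for both systems, which is what makes the test paired.
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a > total_b:
            wins += 1
    return wins / n_samples
```

Under the usual reading of this test, system A is considered better than system B at the 95% confidence level when the returned fraction is at least 0.95.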
Example 1: Catalan to Spanish
  Source     Els agents rurals capturen un voltor comú a l'Hospitalet
  Reference  Los agentes rurales capturan un buitre leonado en L'Hospitalet
  SMT        Los agentes rurales capturan un buitre común en L'Hospitalet
  DSMT       Los agentes rurales capturan a un buitre leonado en L'Hospitalet

Example 2: Catalan to Spanish
  Source     Final del partit al Vicente Calderón! ATM 0-0 FCB
  Reference  Final del partido en el Vicente Calderón! ATM 0-0 FCB
  SMT        Final del partido en el Vicente Calderón! ATM 0-0 FCB
  DSMT       Final del encuentro en el Vicente Calderón! ATM 0-0 Barça

Example 3: Catalan to Spanish
  Source     Aquesta nit, a les 20:30 hores, el IDIDID B visita la pista del IDIDID.
  Reference  Esta noche, a las 20:30 horas, el IDIDID B visita la cancha del IDIDID.
  SMT        Esta noche, a las 20: 30 horas, el IDIDID B visita la pista del IDIDID.
  DSMT       Esta noche, a las 20: 30 horas, el IDIDID B visita la cancha del IDIDID.

Example 4: Spanish to Catalan
  Source     Kim Basinger cumple hoy 60 años
  Reference  Kim Basinger cumpleix avui 60 anys
  SMT        Kim Basinger compleix avui 60 anys
  DSMT       Kim Basinger avui fa 60 anys

Table 2: Translation examples of tweets by our different systems in both translation directions: Spanish to Catalan and Catalan to Spanish.

We can also observe that the restriction of 140 characters does not have an important effect on performance. This is because, for this test set, our systems usually produce tweet translations of legal length (99.00% from Catalan to Spanish and 99.70% from Spanish to Catalan) and, furthermore, among the tweets exceeding the maximum length, the average number of extra characters is less than 6. Notice that it is hard to measure the real length of the tweets since we do not have access to the original messages; instead, we have the tweets with the URLs and IDs replaced by their corresponding labels. For the given language pair, our systems mostly respect the original length. This is the expected behaviour, since the length factor (Pouliquen, Steinberger, and Ignat, 2003) for the Catalan–Spanish language pair is close to 1.
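The two length figures reported above (share of translations of legal length and mean number of extra characters) are simple to compute. The following Python sketch shows one way to do it; the function names `length_report` and `truncate` are hypothetical. As noted in the text, lengths are measured on the labelled tweets, so URLURLURL and IDIDID count as nine and six characters rather than the length of the original URL or username.

```python
def length_report(translations, limit=140):
    """Return (percentage within the limit, mean excess of the rest)."""
    if not translations:
        raise ValueError("no translations given")
    # Number of characters above the limit, for over-long translations only.
    excess = [len(t) - limit for t in translations if len(t) > limit]
    legal_pct = 100.0 * (len(translations) - len(excess)) / len(translations)
    mean_excess = sum(excess) / len(excess) if excess else 0.0
    return legal_pct, mean_excess


def truncate(translation, limit=140):
    """Keep only the first `limit` characters (the 140 chars/tweet setting)."""
    return translation[:limit]
```

Scoring the output of `truncate` instead of the raw translation corresponds to the "140 chars/tweet" rows of Table 1, while the raw output corresponds to the "free" rows.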
4 Conclusions

We have described the systems developed for the TweetMT shared task: a standard sentence-level SMT system based on Moses and a document-level SMT system based on Docent. We adapted both systems using language models built with tweets. For the document-level SMT system, we considered as the context of a tweet the rest of the messages from the same user during the same day.

The automatic evaluation of our systems shows that both systems perform similarly. However, it must be taken into account that lexical metrics are not context sensitive and that there is only one reference available. As reported in the literature, we found problems with the correctness of the messages and with the translation of hashtags, as we showed with some examples found during the manual evaluation.

Hashtag translation and normalization of the input are interesting topics for future work, especially for extending the system to translate informal tweets. We also plan to implement a pipeline that only takes the previous context into account, to simulate an online scenario, and to compare it with the current pipeline. Currently we are enhancing the models with the introduction of semantic information using word vector embeddings. In particular, we are customizing the Docent decoder to introduce them at translation time.

References

[Alegria et al. 2014] Alegria, I., N. Aranberri, P. R. Comas, V. Fresno, P. Gamallo, L. Padró, I. San Vicente, J. Turmo, and A. Zubiaga. 2014. TweetNorm es corpus: an annotated corpus for spanish microtext normalization. In Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2274–2278.

[Denkowski and Lavie 2012] Denkowski, M. and A. Lavie. 2012. METEOR-NEXT and the METEOR paraphrase tables: Improved evaluation support for five target languages. In Proc. of the Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, pages 339–342.

[Doddington 2002] Doddington, G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. of the 2nd International Conference on Human Language Technology Research, pages 138–145.

[Giménez and Màrquez 2008] Giménez, J. and L. Màrquez. 2008. A smorgasbord of features for automatic MT evaluation. In Proc. of the Third Workshop on Statistical Machine Translation, pages 195–198. ACL.

[Giménez and Màrquez 2010] Giménez, J. and L. Màrquez. 2010. Asiya: An open toolkit for automatic machine translation (meta-)evaluation. In Prague Bulletin of Mathematical Linguistics, 94, pages 77–86.

[Gotti, Langlais, and Farzindar 2013] Gotti, F., P. Langlais, and A. Farzindar. 2013. Translating government agencies' tweet feeds: Specificities, problems and (a few) solutions. In Proc. of the NAACL 2013, pages 80–89.

[Hardmeier, Nivre, and Tiedemann 2012] Hardmeier, C., J. Nivre, and J. Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proc. of the Joint Conference on Empirical Methods in NLP and Computational Natural Language Learning, pages 1179–1190.

[Hardmeier et al. 2013] Hardmeier, C., S. Stymne, J. Tiedemann, and J. Nivre. 2013. Docent: A document-level decoder for phrase-based statistical machine translation. In Proc. of the 51st ACL Conference, pages 193–198.

[Jehl, Hieber, and Riezler 2012] Jehl, L., F. Hieber, and S. Riezler. 2012. Twitter translation using translation-based cross-lingual retrieval. In Proc. of the 7th Workshop on Statistical Machine Translation. ACL 2012, pages 410–421.

[Koehn et al. 2007] Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proc. of the 45th ACL Conference, pages 177–180.

[Lavie and Agarwal 2007] Lavie, A. and A. Agarwal. 2007. Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proc. of the Second Workshop on Statistical Machine Translation, pages 228–231.

[Lin and Och 2004] Lin, C.-Y. and F.J. Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proc. of the 42nd Annual Meeting of the ACL, pages 605–612.

[Martínez-Garcia et al. 2014] Martínez-Garcia, E., C. España-Bonet, J. Tiedemann, and L. Màrquez. 2014. Word's vector representations meet machine translation. In Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 132–134.

[Melamed, Green, and Turian 2003] Melamed, I.D., R. Green, and J.P. Turian. 2003. Precision and recall of machine translation. In Proc. of the Joint Conference on HLT-NAACL.

[Mikolov et al. 2013a] Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Proc. of Workshop at ICLR. http://code.google.com/p/word2vec.

[Mikolov et al. 2013b] Mikolov, T., I. Sutskever, G. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, pages 3111–3119.

[Munro 2010] Munro, R. 2010. Crowdsourced translation for emergency response in haiti: the global collaboration of local knowledge. In AMTA Workshop on Collaborative Crowdsourcing for Translation, pages 1–4.

[Nießen et al. 2000] Nießen, S., F. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In Proc. of the 2nd International LREC Conference, pages 339–342.

[Och 2003] Och, F. 2003. Minimum error rate training in statistical machine translation. In Proc. of the ACL Conference, pages 160–167.

[Och and Ney 2003] Och, F. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, pages 19–51.

[Papineni et al. 2002] Papineni, K., S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

[Pouliquen, Steinberger, and Ignat 2003] Pouliquen, B., R. Steinberger, and C. Ignat. 2003. Automatic identification of document translations in large multilingual document collections. In Proc. of the International Conference on Recent Advances in NLP (RANLP-2003), pages 401–408.

[Snover et al. 2006] Snover, M., B.J. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of the 7th Conference of the AMTA, pages 223–231.

[Snover et al. 2009] Snover, M., N. Madnani, B.J. Dorr, and R. Schwartz. 2009. Fluency, adequacy or HTER? Exploring different human judgments with a tunable MT metric. In Proc. of the Fourth Workshop on Statistical Machine Translation, pages 259–268.

[Specia et al. 2013] Specia, L., K. Shah, J. G. C. De Souza, and T. Cohn. 2013. QuEst - A translation quality estimation framework. In Proc. of ACL Demo Session, pages 79–84.

[Stolcke 2002] Stolcke, A. 2002. SRILM – An extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing, pages 257–286.

[Tiedemann 2009] Tiedemann, J. 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing (vol V), pages 237–248.

[Tiedemann 2012] Tiedemann, J. 2012. Parallel data, tools and interfaces in opus. In Proc. of the 8th International Conference on Language Resources and Evaluation (LREC'2012), pages 2214–2218. http://opus.lingfil.uu.se.

[Tillmann et al. 1997] Tillmann, C., S. Vogel, H. Ney, A. Zubiaga, and H. Sawaf. 1997. Accelerated DP based Search for Statistical Translation. In Proc. of European Conference on Speech Communication and Technology.