Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015

Iñaki Alegria(1), Nora Aranberri(1), Cristina España-Bonet(2), Pablo Gamallo(3), Hugo Gonçalo Oliveira(4), Eva Martínez(2), Iñaki San Vicente(5), Antonio Toral(6), Arkaitz Zubiaga(7)

(1) University of the Basque Country, (2) UPC, (3) USC, (4) University of Coimbra, (5) Elhuyar, (6) Dublin City University, (7) University of Warwick

tweetmt@elhuyar.com

Abstract: This article presents an overview of the shared task that took place as part of the TweetMT workshop held at SEPLN 2015. The task consisted in translating collections of tweets from and to several languages. The article outlines the data collection and annotation process, the development and evaluation of the shared task, as well as the results achieved by the participants.

Keywords: Machine Translation, Microblogs, Tweets, Social Media

1 Introduction

While machine translation has been studied for a long time now, the application of machine translation techniques to tweets is still in its infancy. The machine translation of tweets is a challenging task which, to a great extent, depends on the spelling and grammatical quality of the tweets to be translated. In fact, the difficulty of translating a tweet varies dramatically across types of tweets, which range from informal posts to formal announcements and news headlines posted by social media editors or community managers. The former are often written from mobile devices, which exacerbates poor spelling, and include linguistic inaccuracies, symbols and diacritics. Tweets also vary in terms of structure, including features that are exclusive to the platform, such as hashtags, user mentions and retweets, among others. These characteristics make the application of machine translation tools to tweets a new problem that requires specific processing techniques to perform effectively.

The machine translation of tweets is usually tackled in two different ways: (1) as a direct translation task (tweet-to-tweet), or (2) as an indirect translation task (tweet normalization to standard text (Kaufmann and Kalita, 2010), text translation and, if needed, tweet generation). Although direct translation would be the natural approach in an ideal scenario, the lack of parallel or comparable corpora of tweets for the working languages (Petrovic, Osborne, and Lavrenko, 2010) makes the indirect approach a more viable solution in most cases. Alternatively, researchers have also tried to gather similar tweets in other languages by leveraging cross-lingual information retrieval techniques (Jehl, Hieber, and Riezler, 2012).

Despite the paucity of research on the specific task of translating tweets, an increasing interest can be observed in the scientific community (Gotti, Langlais, and Farzindar, 2013; Peisenieks and Skadiņš, 2014). A related and highly relevant direction of research is the work on machine translation of SMS texts, such as Munro's study in the context of the 2010 Haiti earthquake (Munro, 2010).
Given the dearth of benchmark resources and of comparison studies bringing to light the potential and shortcomings of today's machine translation techniques applied to tweets, we organized TweetMT, a workshop and shared task[1] on machine translation applied to tweets. This workshop is a follow-up to two related workshops organized in previous years, also at SEPLN: TweetNorm 2013 (Alegria et al., 2013) and TweetLID 2014 (Zubiaga et al., 2014). The workshop was intended as a forum where researchers could compare their methods, systems and results; the task focuses on MT of tweets between languages of the Iberian Peninsula (Basque, Catalan, Galician, Portuguese, and Spanish).

[1] http://komunitatea.elhuyar.eus/tweetmt/

As a starting point, and especially given the little work performed so far in the field, the corpora we compiled for the shared task include tweets that are mostly formal and correctly written, while keeping the brevity inherent to tweets. While these corpora might not be fully representative of the texts that one can find on Twitter, they are intended to boost the work performed within the field, encouraging researchers to submit preliminary contributions that will help better understand the state of the art so that future work can be set forth. As this research matures, subsequent corpora will include a wider variety of informal and misspelled tweets to keep making progress.

2 Creation of a Benchmark Dataset

To the best of our knowledge, there is no parallel tweet dataset available apart from that produced by Ling et al. (2013), which differs from our purposes in that they worked on tweets that mix two languages, providing the translated text within the same tweet. Since we wanted to work on the translation of entire tweets into new tweets, we generated a corpus for the specific purposes of the TweetMT workshop.

In order to facilitate corpus generation, we developed a semi-automatic method to retrieve and align parallel tweets. It consists in identifying Twitter authors that tweet identical content, albeit in different languages, either from a single account or from two different accounts. Hence, whenever possible, the parallel corpora have been generated from multilingual Twitter accounts; this methodology was applied to the Catalan-Spanish (ca-es) and Basque-Spanish (eu-es) language pairs, as we found authors that concurrently tweet in these languages. However, we did not find authors meeting these characteristics for the other two language pairs, Portuguese-Spanish and Galician-Spanish (pt-es and gl-es); in these cases, the parallel tweets were produced manually through crowdsourcing. Unlike the language pairs that could be automatically aligned, in the latter cases only test sets were generated, due to time and budget constraints.

The following sections give details about the creation of the datasets. Table 1 shows some statistics of those datasets.
2.1 Corpus Creation from Multilingual Accounts

The corpus creation process out of multilingual Twitter accounts can be divided into two steps: (i) identifying the accounts and collecting the messages, and (ii) semi-automatic alignment of translated tweets.

2.1.1 Accounts and Collected Data

Unlike Ling et al. (2013), we do not aim for mixed-language tweets, where the source and target segments are included in the same tweet; rather, we manually select a number of authors that tend to post messages in various languages. It is worth noting that this strategy for sampling authors leads to a prevalence of accounts that belong to organizations and famous personalities.

We identified two kinds of "authors" following this strategy: (i) authors that use a single account to post messages in different languages, and (ii) authors that maintain parallel accounts, posting in each language from a separate account. The initial collection covered 23 Twitter accounts (from 16 authors) for the eu-es pair and 19 accounts (from 14 authors) for the ca-es pair. In all, 75,000 tweets were collected for eu-es and 51,000 for ca-es. The collection includes tweets posted between November 2013 and March 2015.

The initial corpus was then split into two datasets: a development set composed of 4,000 parallel tweets for each language pair and a test set composed of 2,000 parallel tweets for each language pair.

Author distribution in the development set was limited to the accounts with most tweets (2 for ca-es and 4 for eu-es). The test sets also contain tweets from the authors in the development set, but tweets from new, "unseen" authors are introduced as well. This way we can evaluate systems on both "in-domain" and "out-of-domain" scenarios.

As said before, one of the limitations of our strategy is that it is only applicable to certain language pairs. The linguistic realities of Basque and Catalan (both are co-official with Spanish in certain regions that support bilingualism) make the application of such methods viable for our purposes. Unfortunately, this was not the case for the pt-es and gl-es pairs. It is understandable that few or no users need to tweet both in Spanish and Portuguese, which have little or no geographical overlap; it was, however, a surprise not to find any such example for Galician and Spanish, given that Galician has the same co-official status as Catalan and Basque.

In consequence, we could only provide development corpora for the eu-es and ca-es language pairs. For the Galician-Spanish and Portuguese-Spanish language pairs, test sets were generated manually through crowdsourcing. Specifically, we used the CrowdFlower platform to translate tweets into the other language. Section 2.2 further discusses this process.

2.1.2 Alignment

The large volume of tweets collected in the previous step needs to be properly aligned in order to create the parallel corpus. Aligning tweets of an author within and across accounts requires both finding matching translations and occasionally discarding tweets that have no translation. We perform this process semi-automatically: first by automatically aligning tweets that are likely to be each other's translation, and then by manually checking the accuracy of those alignments.

Before we can align tweets with their likely translations, we need to identify the language each tweet is written in through language identification (Zubiaga et al., 2014). While Twitter does provide a language ID along with each tweet's metadata, Basque and Catalan are never tagged as such by Twitter, so we implemented our own language identification module for these languages. Language identification is done using TextCat[2] trained over Twitter-specific data; a hypothetical sketch of this profile-based approach follows.

[2] http://www.let.rug.nl/vannoord/TextCat/
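TextCat classifies text by comparing character n-gram frequency profiles with an "out-of-place" rank distance. For illustration only, the following is a minimal sketch of that idea; it is not the module used for the task, and the profile size, penalty and training data here are assumptions:

```python
from collections import Counter

def char_ngrams(text, n_max=4):
    """Frequency counts of character n-grams (n = 1..n_max)."""
    text = " " + text.lower() + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def rank_profile(counts, top=300):
    """TextCat-style profile: the most frequent n-grams, mapped to their rank."""
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences between document and language profiles.
    n-grams missing from the language profile pay a fixed penalty."""
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def identify(tweet, lang_profiles):
    """Return the language whose profile is closest to the tweet's."""
    doc = rank_profile(char_ngrams(tweet))
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# lang_profiles would be built from per-language Twitter training text, e.g.:
# lang_profiles = {"eu": rank_profile(char_ngrams(basque_tweets_text)), ...}
```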
Once we have an author's tweets separated by language, and hence source-language and target-language tweets apart, we need to pair each tweet with its likely translation. For the automated process, we defined a set of heuristics and statistics to find matches accurately. Specifically, we looked at the following three characteristics (a code sketch of the full cascade follows below):

• Publication date. Translations must be published within a certain period range to be flagged as possible translations of each other; the difference between source and target timestamps must not exceed a certain threshold. The threshold was set overall to 10 hours, although for a few accounts the allowed publication date difference was restricted to 1 hour after we empirically detected too much noise with the more relaxed standard threshold.

• Overlap of hashtags and user mentions in source and target tweets. User (@) mentions are usually maintained across languages; only in a few cases did we observe them change (e.g., using @FCBarcelona_ca in a tweet in Catalan and @FCBarcelona_es in a tweet written in Spanish). Hashtags are often translated, depending on the popularity of a given hashtag in the target audience. A minimum number of user mentions and hashtags were required to overlap between source and target parallel tweet candidates. The overlap is computed as the number of entities in the intersection of both tweets divided by the number of entities in their union. The threshold is empirically set to 0.76.

• Longest Common Subsequence Ratio (LCSR) (Cormen et al., 2001) between source and target tweets. LCSR is an orthographic similarity measure, as it tells us how similar two strings are. It is especially reliable when working with closely related languages, as parallel sentences are often very close to each other, both in vocabulary and in word order. We empirically set a minimum threshold of 0.45.

As for the performance of the heuristics, publication date closeness is effective for filtering out wrong candidates, but not enough to find the correct parallel tweet on its own, so it is applied first of the three. The user and hashtag overlap ratio proved very successful, to the point that the contribution of LCSR was minimal.
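Taken together, the heuristics amount to a cascade of three filters. The following is a minimal sketch under stated assumptions: tweets are represented as dicts with hypothetical "text" and "time" (datetime) fields, entities are naively extracted from whitespace tokens, and the thresholds are the values reported above. The actual implementation used for the task may differ in these details.

```python
from datetime import timedelta

def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest Common Subsequence Ratio between two strings."""
    return lcs_len(a, b) / max(len(a), len(b)) if a and b else 0.0

def entity_overlap(src_text, tgt_text):
    """Jaccard overlap of hashtags and user mentions (naive extraction)."""
    ents = lambda t: {w for w in t.split() if w.startswith(("#", "@"))}
    s, t = ents(src_text), ents(tgt_text)
    if not s and not t:
        return 1.0  # no entities to compare; defer to the other filters (an assumption)
    return len(s & t) / len(s | t)

def is_candidate_pair(src, tgt, max_hours=10, min_overlap=0.76, min_lcsr=0.45):
    """Apply the three heuristics in the order described above:
    publication date first, then entity overlap, then LCSR."""
    if abs(src["time"] - tgt["time"]) > timedelta(hours=max_hours):
        return False
    if entity_overlap(src["text"], tgt["text"]) < min_overlap:
        return False
    return lcsr(src["text"].lower(), tgt["text"].lower()) >= min_lcsr
```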
The output of this alignment is then corrected through manual checks by native speakers of the respective languages. The manual inspection showed a low error rate in the automatic alignment, especially for ca-es: for this language pair we found a 2% error rate, evaluated over a sample of 400 tweets from the development set. For eu-es the percentage increased to 15%, also evaluated over a sample of 400 tweets. The error rate over the collections manually reviewed to create the test sets was 7% for the ca-es language pair (12,500 tweets) and 32% for the eu-es language pair (15,045 tweets).

2.2 Crowdsourced Corpus Creation using CrowdFlower

As we did not find bilingual Portuguese-Spanish or Galician-Spanish Twitter accounts, we used the CrowdFlower platform[3] to build the test data for these language pairs. CrowdFlower provides a cheap and fast method for collecting annotations from a broad base of paid non-expert contributors over the Web. It works in a similar way to Amazon's Mechanical Turk (Snow et al., 2008), which could not be used in our case because it requires a US address and credit card.

[3] http://www.crowdflower.com/

In the task we defined, the contributors had to manually translate, from Spanish into Portuguese and Galician, a dataset of 2,552 Spanish tweets, taken from both our ca-es and eu-es parallel corpora and divided into working tasks of 10 tweets each.

Instructions were provided to workers to ensure that the translations were consistent. For instance, contributors were asked not to translate user mentions (keywords with a leading @) or URLs, while hashtags should only be translated if the contributor considered it natural to use the Portuguese/Galician hashtag.

The crowdsourcing platform allows the jobs to be configured with a number of options. We used some of them with the aim of obtaining translations of reasonable quality:

• Geography. One can select a set of countries from which workers are allowed to work on the job. We limited the countries to Spain for Galician and to Portugal and Brazil for Portuguese.[4]

[4] Initially, the task to translate into Portuguese was only open to users from Portugal, as the focus is on Iberian Portuguese, but after realizing we were getting no contributions, we broadened the geographical scope to Brazil as well, which helped obtain contributions more swiftly.

• Performance level. Contributors on the platform fall into three levels according to their performance. Our jobs were limited to contributors in level 3 (the top level), defined by CrowdFlower as "the highest performance contributors who account for 7% of monthly judgments and maintain the highest level of accuracy across an even larger spectrum of CrowdFlower jobs [compared to contributors in levels 1 and 2]". In the case of Galician, we had to change this setting to level 1, as the tasks were getting completed too slowly.

• Language capability. This allows restricting the contributors that can work on the job by their language skills. For translations into Portuguese, we restricted the contributors to verified speakers of Portuguese. Galician is not in the list of languages provided by CrowdFlower, so this option was not configured in that case.

• Speed trap. If set, contributors are automatically removed from the job if they take less than a specified amount of time to complete a task. Our jobs contained tasks of 10 translations each, and the time trap was set to 150 seconds. Hence, a worker taking less than 15 seconds per tweet would be automatically removed from the job.

The task of translating into Portuguese was completed by 40 different contributors, all of them from Brazil.
The contributors were surveyed about the quality of the task: they were asked to rate, out of 5, the clarity of the instructions (average 4.12), the ease of the job (3.78), the pay (4.17) and their overall satisfaction (4.05). The task of translating into Galician was carried out by 10 contributors. They rated the task as follows: clarity of the instructions (4.90), ease of the job (3.59), pay (3.82) and overall satisfaction (4.00).

As a final result, we obtained a parallel corpus with 2,500 pt-es and 777 gl-es tweets, which were split into two test datasets with 1,225 entries per translation direction for pt-es and 388 for gl-es. To verify the quality of the translations, samples of 30 tweets were evaluated both for Portuguese and for Galician. In both cases the translations were considered acceptable by the Portuguese and Galician authors of the current paper, even if some errors were detected. In the case of Galician, we found some mistakes derived from the new spelling rules in force since 2003. In the case of Portuguese, six errors (most of them lexical problems) were found in the 30 tweets evaluated.

Dataset      Tweets  Authors  Tokens  URLs   @user
eu-es dev     4,000        4    181K  2,622  1,569
ca-es dev     4,000        2    161K  3,280    823
eu-es test    2,000       16     37K  1,556    673
es-eu test    2,000       16     43K  1,535    692
ca-es test    2,000       14     45K  1,590    417
es-ca test    2,000       14     46K  1,567    502
gl-es test      434        -      7K    274    134
es-gl test      434        -      7K    291    159
pt-es test    1,250        -     19K    674    349
es-pt test    1,250        -     21K    919    583

Table 1: Statistics for the datasets generated.

2.3 Datasets Post-processing

Before delivering the datasets to the participants, the test sets were pre-processed. The development corpus includes the original tweets, with neither @user mentions nor URLs normalized; in the test corpus, however, @user mentions and URLs are standardized to IDIDID and URLURLURL, respectively.

Datasets are distributed as tab-separated (tsv) files. For each language pair two files are provided, one for each translation direction. For the language pairs where the parallel corpus was gathered exclusively from Twitter (ca, es and eu), the files contain the tweetID, userID, date and text of each tweet. For the language pairs where the corpus was obtained via crowdsourcing (gl and pt), the files contain a segmentID and the text of the tweet.

3 Evaluation Framework

The test sets just described were delivered to the participants, who had to return the translations in the following tab-separated format:

    tweetId <tab> source language text <tab> translation \n

The translated text would then be extracted, cut to a maximum length of 140 characters, and evaluated by automatic means.
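As an illustration only, a participant could serialize results in this layout and reproduce the 140-character cut as follows; the function and file names are hypothetical, not part of the task infrastructure:

```python
def write_submission(path, translations):
    """Write (tweet_id, source_text, translation) rows in the
    tab-separated submission format described above."""
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id, source, translation in translations:
            out.write(f"{tweet_id}\t{source}\t{translation}\n")

def truncate_for_scoring(translation, limit=140):
    """Only the first 140 characters per tweet were considered in evaluation."""
    return translation[:limit]

# e.g. write_submission("team1_ca-es.tsv",
#                       [("42", "Bon dia a tothom!", "¡Buenos días a todos!")])
```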
The performance of the systems is assessed with lexical and syntactic automatic evaluation measures compared against a single reference. Lexical metrics, which are mostly based on n-gram matching, are available for all the language pairs under study. Syntactic metrics, however, are only available for Spanish, and some of them for Catalan.

3.1 Evaluation Metrics

In order to study the quality of the translations at different levels, we use a wide set of metrics, defined as follows:

Lexical evaluation measures

• PER (Tillmann et al., 1997), TER (Snover et al., 2006), WER (Nießen et al., 2000): metrics based on edit distances.

• BLEU (Papineni et al., 2002), NIST (Doddington, 2002), ROUGE (RG): based on n-gram matching (lexical precision: BLEU, NIST; lexical recall: ROUGE). For ROUGE we use RGS*, i.e., a variant with skip bigrams and no max-gap-length.

• GTM (Melamed, Green, and Turian, 2003), METEOR (Banerjee and Lavie, 2005) (MTR): based on the F-measure. For GTM we use GTM2, with the parameter associated with long matches set to e = 2; for METEOR we use MTRex, i.e., only exact matching.

• Ol (Giménez and Màrquez, 2008): Lexical Overlap, a measure based on the Jaccard coefficient (Jaccard, 1912) to quantify the similarity between sets. The lexical items of the candidate and reference translations are considered as two separate sets, and the overlap is computed as the cardinality of their intersection divided by the cardinality of their union (see the sketch at the end of this section).

• ULC (Giménez and Màrquez, 2008): Uniform Linear Combination. When applied to lexical metrics it includes WER, PER, TER, BLEU, NIST, RGS*, GTM2 and MTRex.

Syntactic evaluation measures

• SP-Op, SP-Oc, SP-pNIST (Giménez and Màrquez, 2007)[5]: based on the lexical overlap according to part of speech or chunk, and the NIST score over these elements (Shallow Parsing).

• CP-Op, CP-Oc, CP-STM9 (Giménez and Màrquez, 2007)[6]: based on the lexical overlap among parts of speech or constituents of constituency parse trees (Constituency Parsing).

• ULC (Giménez and Màrquez, 2008): Uniform Linear Combination. When applied to syntactic metrics it includes the available metrics for the specific language.

[5] Family of metrics only available for Catalan and Spanish.
[6] Family of metrics only available for Spanish.

All measures have been calculated with the Asiya toolkit[7] for MT evaluation (Giménez and Màrquez, 2010).

[7] http://nlp.cs.upc.edu/asiya/
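Among these metrics, Ol and ULC are simple enough to restate in code. The following is a sketch assuming whitespace tokenization and metric scores already mapped to a comparable scale where higher is better; Asiya's exact tokenization and normalization may differ:

```python
def lexical_overlap(candidate, reference):
    """Ol: Jaccard coefficient between the token sets of the
    candidate and the reference translation."""
    c, r = set(candidate.split()), set(reference.split())
    return len(c & r) / len(c | r) if c | r else 0.0

def uniform_linear_combination(scores):
    """ULC: the uniform (plain) average of the component metric scores,
    e.g. scores = {"BLEU": 76.7, "NIST": 82.0, ...}. Error metrics such
    as WER would first be inverted (e.g. 100 - WER) so that higher is
    better -- an assumption about the normalization step."""
    return sum(scores.values()) / len(scores)
```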
4 Shared Task Results

Participants were required to register[8] in order to obtain the development and test datasets. Each participant had only 72 hours to work on the test set and send the results.

[8] http://komunitatea.elhuyar.eus/tweetmt/participation/

4.1 Overview of the Systems Submitted

Out of the 5 initially registered participants, only three teams ended up submitting results: DCU (Dublin City University) for 3 tracks (ca-es, eu-es, pt-es) (Toral et al., 2015); EHU (University of the Basque Country) for the eu-es track (Alegria et al., 2015); and UPC (Universitat Politècnica de Catalunya) for the ca-es track (Martínez-Garcia, España-Bonet, and Màrquez, 2015). In all, two teams submitted results for the eu-es and ca-es tracks, one team participated in the pt-es track, and no submissions were received for the gl-es pair.

The related shared tasks we organized in recent years (TweetNorm (Alegria et al., 2013) and TweetLID (Zubiaga et al., 2014)) attracted a higher number of participants. One of the reasons for this drop might be that English was not among the languages included in the task this time; this could have made the task less appealing to some groups, leading to fewer participants from outside the Iberian Peninsula.

The main characteristics of the systems submitted are compiled in Table 2 and can be summarized as follows:

DCU: This team submitted systems for three language pairs in both directions: Spanish from/to Catalan, Basque and Portuguese. They used a range of techniques, including state-of-the-art SMT, morphological segmentation (only for Basque, as a morphologically rich language), data selection as a means of domain adaptation, available open-source rule-based systems and, finally, system combination to combine the strengths of the different systems that were built. DCU gathered vast amounts of tweets (from 11M for Basque to 130M for Spanish) to perform monolingual domain adaptation, and complemented this with publicly available general-domain monolingual and parallel corpora. The first (DCU1), second (DCU2) and third (DCU3) systems submitted for each language direction were the individual systems or combinations that obtained the best, second best and third best result, respectively, on the development set.

EHU: This team submitted systems for the Basque-Spanish pair, adapting previous MT engines for the es-eu and eu-es directions. For translation into Basque both RBMT and SMT engines were adapted, whereas for translation from Basque only an SMT-based system was used. The main work consisted of pre- and post-processing for adaptation to tweets and collecting new resources for training and tuning the systems. For RBMT, a small dictionary of hashtags was obtained from the development set. For SMT, language models were improved using monolingual corpora from previous shared tasks and a new corpus of tweets in Basque.

UPC: The team submitted two systems for the Catalan-Spanish language pair. The first one (UPC1) is a standard SMT system built with Moses (Koehn et al., 2007) and trained with 2,178,796 parallel sentences extracted from the El Periódico parallel corpus[9]. The second system (UPC2) uses a document-level decoder, Docent (Hardmeier et al., 2013), which takes UPC1 as a first step; in addition, it uses as extra features semantic models obtained with word2vec (Mikolov et al., 2013). Besides the parallel tweets available for the shared task, both systems use monolingual tweets for genre and domain adaptation. UPC2 was only submitted for Catalan-to-Spanish. The authors report some problems with this configuration and include both the official and new results in their paper; here only the official results are shown.

[9] http://catalog.elra.info/product_info.php?products_id=1122

System  Main engine                Distinctive features
DCU1    System combination or SMT  Moses and Apertium (ES↔CA); Moses, cdec and Apertium (ES→EU); cdec (EU→ES); Moses (ES↔PT).
DCU2    System combination or SMT  Moses (ES→CA); Moses, cdec and Apertium (CA→ES, EU→ES); Moses, cdec, ParFDA, Matxin and Morph (ES→EU); Moses and cdec (ES↔PT).
DCU3    System combination or SMT  Moses, cdec and Apertium (ES→CA, ES↔PT); Moses, ParFDA and Apertium (CA→ES); Moses, cdec, Matxin and Morph (ES→EU); Moses, cdec, Apertium and Morph (EU→ES).
EHU1    SMT                        Specific language model and pre- and post-processing for tweets.
EHU2    RBMT                       Adaptation to tweets (mainly hashtags).
UPC1    SMT                        Moses system.
UPC2    SMT                        Document-level system (Docent), semantic models.

Table 2: Summary of the systems developed by the participants.
4.2 Results

Participants had a 72-hour window to work with the test set and submit up to three results per track. This section is a recap of the results for all tracks and systems.

Tables 3 and 4 show the results for the participants in the ca-es track: Table 3 reports the lexical measures introduced in the previous section and Table 4 the syntactic ones. Five systems from two teams were evaluated. DCU3 was the best system in the ca-es direction, a combination of two kinds of SMT engines plus an RBMT one. For the es-ca direction, the two simplest pure phrase-based SMT systems, UPC1 and DCU2, obtained the highest scores. The two teams used very similar corpora in their experiments, so the techniques they applied make the difference in this case.

Catalan to Spanish
System  WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
DCU1    15.24  12.49  13.25  76.73  12.09  72.75  83.80  83.37  83.70  77.84
DCU2    15.15  12.41  13.21  76.52  12.09  72.18  83.76  83.70  83.56  77.86
DCU3    14.59  11.74  12.50  77.70  12.16  73.37  84.63  84.45  84.64  79.67
UPC1    20.17  16.40  19.42  68.20  11.22  62.71  78.46  77.31  74.72  63.82
UPC2    25.10  17.09  22.25  63.12  10.93  57.92  76.44  75.56  73.76  57.45

Spanish to Catalan
System  WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
DCU1    16.70  12.42  14.46  75.79  11.88  70.45  52.08  82.65  82.88  66.23
DCU2    15.17  11.71  13.21  77.75  11.96  72.65  53.44  83.32  83.96  69.98
DCU3    17.09  13.08  14.70  75.25  11.85  70.34  51.73  82.26  82.46  64.94
UPC1    14.35  11.25  13.63  77.93  12.04  72.69  53.98  82.19  83.18  70.56

Table 3: Evaluation with a set of lexical metrics (see Section 3.1 for a description) for the participant systems on the Catalan-Spanish language pair. Results are obtained considering only the first 140 characters per tweet.

Catalan to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    80.92     81.40     74.10    81.78     83.03     10.87     99.24
DCU2    80.71     81.50     74.19    81.66     82.80     10.90     99.22
DCU3    81.64     82.27     74.48    82.40     83.73     10.92     100.00
UPC1    68.52     70.93     58.74    70.95     71.97     9.36      84.47
UPC2    70.59     72.89     63.05    73.01     73.99     10.01     88.06

Spanish to Catalan
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    -         -         -        80.77     82.10     10.78     98.41
DCU2    -         -         -        82.13     83.14     10.88     99.67
DCU3    -         -         -        80.19     81.42     10.75     97.81
UPC1    -         -         -        81.59     82.02     10.99     99.33

Table 4: Evaluation with a set of syntactic metrics (see Section 3.1 for a description) for the Catalan-Spanish language pair. Results are obtained considering only the first 140 characters per tweet. Not all syntactic metrics are available for Catalan.
Tables 5, 6, 7 and 8 show the results for the participants in the eu-es and pt-es tracks. For the eu-es track, four or five systems (depending on the direction) were presented by two teams. In general, the best translator for this language pair is the statistical system EHU1 in both directions. When translating from Spanish into Basque, however, DCU2, the combination of 5 different systems, obtains very similar scores; differences in this case are in general not statistically significant.

Finally, in the pt-es track DCU submitted the results of three systems. DCU3 was the best in the pt-es direction; as in the ca-es track, their best system is again a combination of two kinds of SMT engines and an RBMT one. In the opposite direction the best system, DCU2, does not include translation options from the RBMT engine, probably reflecting its lower quality on tweets. Notice that their best system in development does not correspond to the best system in test.

Basque to Spanish
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    62.19  44.72  56.37  25.30  6.46  32.70  45.71  34.20  44.48  59.78
DCU2    61.24  44.95  55.35  25.30  6.53  33.14  46.12  34.61  44.92  60.63
DCU3    61.04  44.78  54.99  25.44  6.56  33.34  46.32  35.31  45.50  61.31
EHU1    61.53  38.17  52.96  28.61  6.94  34.53  50.57  40.80  51.12  69.13

Spanish to Basque
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    61.48  47.56  55.81  23.22  5.96  32.45  40.00  29.92  42.87  66.27
DCU2    61.06  46.27  55.17  24.44  6.12  33.17  41.18  31.95  44.29  69.18
DCU3    61.77  47.30  56.07  23.42  5.96  32.48  40.12  30.38  43.00  66.56
EHU1    62.00  45.04  56.06  24.34  6.14  33.17  41.98  32.22  45.07  69.63
EHU2    66.43  50.13  62.46  19.54  5.29  29.29  36.36  23.30  38.15  55.33

Table 5: Evaluation with a set of lexical metrics for the Basque-Spanish language pair.

Basque to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    36.82     38.54     29.67    40.94     43.43     5.24      87.99
DCU2    37.13     38.84     29.77    41.16     43.67     5.23      88.40
DCU3    37.71     39.32     30.11    41.69     44.20     5.27      89.45
EHU1    43.26     45.19     33.59    47.42     49.80     5.48      100.00

Table 6: Evaluation with a set of syntactic metrics for the Basque-Spanish language pair. These metrics are not available for Basque.

Portuguese to Spanish
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    40.51  33.22  37.39  43.36  8.70  42.69  58.85  52.59  58.77  65.48
DCU2    39.86  33.41  36.87  43.67  8.74  43.28  59.12  52.86  58.96  66.17
DCU3    39.08  33.09  36.11  44.28  8.83  43.90  59.89  53.61  59.54  67.54

Spanish to Portuguese
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    47.68  39.40  44.45  36.13  7.57  37.71  53.78  44.10  52.38  65.27
DCU2    46.51  36.67  43.08  37.25  7.77  38.30  54.15  45.24  53.57  68.05
DCU3    47.04  36.51  43.39  36.94  7.76  38.14  53.71  45.19  53.39  67.61

Table 7: Evaluation with a set of lexical metrics for the Portuguese-Spanish language pair.

Portuguese to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    53.57     55.48     44.99    57.32     59.06     8.15      98.51
DCU2    53.85     55.66     45.30    57.48     59.24     8.17      98.89
DCU3    54.53     56.28     45.96    58.06     59.92     8.23      100.00

Table 8: Evaluation with a set of syntactic metrics for the Portuguese-Spanish language pair. These metrics are not available for Portuguese.

Based on the previous figures, as well as on the conclusions drawn by the authors of the papers submitted to the shared task (Toral et al., 2015; Alegria et al., 2015; Martínez-Garcia, España-Bonet, and Màrquez, 2015), we can emphasize the following conclusions:

• The results are in general very good when compared to previous results for the same language pairs (Alegria et al., 2015).

• Combining techniques, including RBMT and SMT, can lead to improvements (Toral et al., 2015).

• Expanding the context by using a user's tweets within the same day can help boost the performance of the machine translation system (Martínez-Garcia, España-Bonet, and Màrquez, 2015).
5 Conclusion

The shared task organized at TweetMT has enabled us to come up with a benchmark parallel corpus of tweets for translation, covering four language pairs: ca-es, eu-es, gl-es and pt-es. This has allowed participants to tune and compare their MT systems. The corpus developed for the shared task can be downloaded from the workshop's website[10], which we expect will enable further research in the field.

[10] http://komunitatea.elhuyar.eus/tweetmt/resources/

The participants of the shared task have applied and studied the suitability of state-of-the-art MT techniques. These techniques have been adapted to the specific features of tweets, including conventions such as hashtags and user mentions, as well as the brevity of the texts. The study of the results achieved by the submitted systems enables us to draw conclusions that better inform future research in the field.

The results achieved by the participants of the shared task are surprisingly high, especially considering that we are dealing with tweets, whose brevity and specific characteristics make them more challenging to translate. Still, it is worth noting that the tweets considered in this shared task can largely be deemed formal. Therefore, we could say that the task (translating formal tweets generated from multilingual Twitter accounts) was easier than usual MT tasks. A more thorough analysis of the task and of the performance of the participating systems will follow in an extended version of this paper, including the conclusions drawn from the discussion at the workshop.

We should also emphasize that these results cannot be generalized to the broader task of translating tweets. However, the fact that formal tweets can be accurately translated encourages the use of MT by community managers who tweet in different languages, by making their work easier. One of our main objectives for future work is to further generalize the machine translation task by including all kinds of tweets, to assess the ability of MT systems to translate informal tweets too. A second version of the TweetMT dataset would include:

• Tweets in English, so that we can attract a larger number of participants and compare a larger number of MT systems.

• A more general Twitter dataset including informal tweets as well, in order to test the results of MT on a corpus as large and diverse as Twitter.

One of the main remaining challenges is the need for a methodology to put together a gold standard corpus that encompasses the different types of tweets that one can find on Twitter, including more informal tweets than those considered here. To tackle such a process, we would first need to solve open questions such as whether (and how) to translate words that are not written in their normalized form, as well as how to deal with multilingualism within a single tweet. We are confident that the discussion among the attendees of the workshop, the presentations of accepted papers, and the invited talk (Gonzàlez, 2015) will help pave the way in this crucial task.

Acknowledgements

This work has been supported by the following projects: Abu-Matran (FP7-PEOPLE-2012-IAPP), PHEME (FP7, grant No. 611233), Tacardi (Spanish MICINN, TIN2012-38523-C02-01), QTLeap (FP7, grant No. 610516), HPCPLN (Galician Government, EM13/041) and Celtic (Innterconecta programme, 2012-CE138).

References

Alegria, Iñaki, Nora Aranberri, Víctor Fresno, Pablo Gamallo, Lluís Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2013. Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español. In Tweet-Norm@SEPLN, pages 1–9.
Alegria, Iñaki, Mikel Artetxe, Gorka Labaka, and Kepa Sarasola. 2015. EHU at TweetMT: Adapting MT engines for formal tweets. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Cormen, Thomas H., Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. 2001. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the 2nd International Conference on Human Language Technology (HLT), pages 138–145, San Diego, CA, USA.

Giménez, Jesús and Lluís Màrquez. 2007. Linguistic Features for Automatic Evaluation of Heterogenous MT Systems. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 256–264, Prague, Czech Republic. Association for Computational Linguistics.

Giménez, Jesús and Lluís Màrquez. 2008. A Smorgasbord of Features for Automatic MT Evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198, Columbus, Ohio, June. Association for Computational Linguistics.

Giménez, Jesús and Lluís Màrquez. 2010. Asiya: an Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, 94:77–86.

Gonzàlez, Meritxell. 2015. An analysis of twitter corpora and the differences between formal and colloquial tweets. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Gotti, Fabrizio, Philippe Langlais, and Atefeh Farzindar. 2013. Translating government agencies' tweet feeds: Specificities, problems and (a few) solutions. NAACL 2013, page 80.

Hardmeier, C., S. Stymne, J. Tiedemann, and J. Nivre. 2013. Docent: A document-level decoder for phrase-based statistical machine translation. In Proceedings of the 51st Annual Conference of the Association for Computational Linguistics, pages 193–198.

Jaccard, Paul. 1912. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50.

Jehl, Laura, Felix Hieber, and Stefan Riezler. 2012. Twitter translation using translation-based cross-lingual retrieval. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT '12, pages 410–421, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180.

Ling, Wang, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL '13. Association for Computational Linguistics.

Martínez-Garcia, Eva, Cristina España-Bonet, and Lluís Màrquez. 2015. The UPC TweetMT participation: Translating formal tweets using context information. In TweetMT@SEPLN, Proc. of the SEPLN 2015.
Melamed, I. Dan, Ryan Green, and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Joint Conference on Human Language Technology and the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 61–63, Edmonton, Canada.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR. http://code.google.com/p/word2vec.

Munro, Robert. 2010. Crowdsourced translation for emergency response in Haiti: the global collaboration of local knowledge. In AMTA Workshop on Collaborative Crowdsourcing for Translation, pages 1–4.

Nießen, Sonja, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pages 39–45, Athens, Greece.

Papineni, K., S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Peisenieks, Jānis and Raivis Skadiņš. 2014. Uses of machine translation in the sentiment analysis of tweets. In Human Language Technologies – The Baltic Perspective: Proceedings of the Sixth International Conference Baltic HLT 2014, volume 268, page 126. IOS Press.

Petrovic, Sasa, Miles Osborne, and Victor Lavrenko. 2010. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 25–26.

Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Seventh Conference of the Association for Machine Translation in the Americas (AMTA 2006), pages 223–231, Cambridge, Massachusetts, USA.

Snow, Rion, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast – but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 254–263.

Tillmann, C., S. Vogel, H. Ney, A. Zubiaga, and H. Sawaf. 1997. Accelerated DP Based Search for Statistical Translation. In Proceedings of the Fifth European Conference on Speech Communication and Technology, pages 2667–2670, Rhodes, Greece.

Toral, Antonio, Xiaofeng Wu, Tommi Pirinen, Zhengwei Qiu, Ergun Bicici, and Jinhua Du. 2015. Dublin City University at the TweetMT 2015 shared task. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID@SEPLN.