Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015

Iñaki Alegria(1), Nora Aranberri(1), Cristina España-Bonet(2), Pablo Gamallo(3), Hugo Gonçalo Oliveira(4), Eva Martínez(2), Iñaki San Vicente(5), Antonio Toral(6), Arkaitz Zubiaga(7)

(1) University of the Basque Country, (2) UPC, (3) USC, (4) University of Coimbra, (5) Elhuyar, (6) Dublin City University, (7) University of Warwick

tweetmt@elhuyar.com

Abstract: This article presents an overview of the shared task that took place as part of the TweetMT workshop held at SEPLN 2015. The task consisted in translating collections of tweets from and to several languages. The article outlines the data collection and annotation process, the development and evaluation of the shared task, as well as the results achieved by the participants.

Keywords: Machine Translation, Microblogs, Tweets, Social Media

1 Introduction

While machine translation has been studied for a long time now, the application of machine translation techniques to tweets is still in its infancy. The machine translation of tweets is a challenging task which, to a great extent, depends on the spelling and grammatical quality of the tweets to be translated. In fact, the difficulty of translating a tweet varies dramatically across types of tweets, which range from informal posts to formal announcements and news headlines posted by social media editors or community managers. The former are often written from mobile devices, which exacerbates poor spelling, and include linguistic inaccuracies, symbols and diacritics. Tweets also vary in terms of structure, including features that are exclusive to the platform, such as hashtags, user mentions and retweets, among others. These characteristics make the application of machine translation tools to tweets a new problem that requires specific processing techniques to perform effectively.

The machine translation of tweets is usually tackled in two different ways: (1) as a direct translation task (tweet-to-tweet), or (2) as an indirect translation task (tweet normalization to standard text (Kaufmann and Kalita, 2010), text translation and, if needed, tweet generation). Although direct translation would be the natural approach in an ideal scenario, the lack of parallel or comparable corpora of tweets for the working languages (Petrovic, Osborne, and Lavrenko, 2010) makes the indirect approach a more viable solution in most cases. Alternatively, researchers have also tried to gather similar tweets in other languages by leveraging cross-lingual information retrieval techniques (Jehl, Hieber, and Riezler, 2012).

Despite the paucity of research on the specific task of translating tweets, an increasing interest can be observed in the scientific community (Gotti, Langlais, and Farzindar, 2013; Peisenieks and Skadiņš, 2014). A related and highly relevant direction of research is the work on machine translation of SMS texts, such as Munro's study in the context of the 2010 Haiti earthquake (Munro, 2010).
Given the dearth of benchmark resources and of comparison studies bringing to light the potential and shortcomings of today's machine translation techniques applied to tweets, we organized TweetMT, a workshop and shared task[1] on machine translation applied to tweets. This workshop is a follow-up to two related workshops organized in previous years, also at SEPLN: TweetNorm 2013 (Alegria et al., 2013) and TweetLID 2014 (Zubiaga et al., 2014). The workshop was intended as a forum where researchers could compare their methods, systems and results; the task focuses on MT of tweets between languages of the Iberian Peninsula (Basque, Catalan, Galician, Portuguese, and Spanish).

[1] http://komunitatea.elhuyar.eus/tweetmt/

As a starting point, and especially given the little work performed so far in the field, the corpora we compiled for the shared task include tweets that are mostly formal and correctly written, while keeping the brevity inherent to tweets. While these corpora might not be fully representative of the texts that one can find on Twitter, they are intended to boost the work performed within the field, encouraging researchers to submit preliminary contributions that will help better understand the state of the art so that future work can be set forth. As this research matures, subsequent corpora will include a wider variety of informal and misspelled tweets to keep making progress.

2 Creation of a Benchmark Dataset

To the best of our knowledge, there is no parallel tweet dataset available apart from that produced by Ling et al. (2013), which differs from our purposes in that they worked on tweets that mix two languages, providing the translated text within the same tweet. Since we wanted to work on the translation of entire tweets into new tweets, we generated a corpus for the specific purposes of the TweetMT workshop.

In order to facilitate corpus generation, we developed a semi-automatic method to retrieve and align parallel tweets. It consists in identifying Twitter authors that tweet identical content, albeit in different languages, either from a single account or from two different accounts. Hence, whenever possible, the parallel corpora have been generated from multilingual Twitter accounts; this methodology was applied to the Catalan-Spanish (ca-es) and Basque-Spanish (eu-es) language pairs, as we found authors that concurrently tweet in these languages. However, we did not find authors meeting these characteristics for the other two language pairs, Portuguese-Spanish and Galician-Spanish (pt-es and gl-es); in these cases, the parallel tweets were produced manually through crowdsourcing. Unlike the language pairs that could be automatically aligned, in the latter cases only test sets were generated, due to time and budget constraints.

The following sections give details about the creation of the datasets. Table 1 shows some statistics of those datasets.
2.1 Corpus Creation from Multilingual Accounts

The corpus creation process out of multilingual Twitter accounts can be divided into two steps: (i) identifying the accounts and collecting the messages, and (ii) semi-automatic alignment of translated tweets.

2.1.1 Accounts and Collected Data

Unlike Ling et al. (2013), we do not aim for mixed-language tweets, where the source and target segments are included in the same tweet; rather, we manually select a number of authors that tend to post messages in various languages. It is worth noting that this strategy for sampling authors leads to a prevalence of accounts that belong to organizations and famous personalities.

We identified two kinds of "authors" following this strategy: (i) authors that use a single account to post messages in different languages, and (ii) authors that maintain parallel accounts, posting in each language from a separate account. The initial collection covered 23 Twitter accounts (from 16 authors) for the eu-es pair and 19 accounts (from 14 authors) for the ca-es pair. In all, 75,000 tweets were collected for eu-es and 51,000 for ca-es. The collection includes tweets posted between November 2013 and March 2015.

The initial corpus was then split into two datasets: a development set composed of 4,000 parallel tweets for each language pair and a test set composed of 2,000 parallel tweets for each language pair.

Author distribution in the development set was limited to the accounts with most tweets (2 for ca-es and 4 for eu-es). The test sets also contain tweets from the authors in the development set, but tweets from new, "unseen" authors are introduced as well. This way we can evaluate systems on both "in-domain" and "out-of-domain" scenarios.

As said before, one of the limitations of our strategy is that it is only applicable to certain language pairs. The linguistic realities of Basque and Catalan (both are co-official with Spanish in certain regions that support bilingualism) make the application of such methods viable for our purposes. Unfortunately, this was not the case for the pt-es and gl-es pairs. It is understandable that few or no users need to tweet both in Spanish and Portuguese, which have little or no geographical overlap; it was, however, a surprise not to find any such example for Galician and Spanish, given that Galician has the same co-official status as Catalan and Basque.

In consequence, we could only provide development corpora for the eu-es and ca-es language pairs. For the Galician-Spanish and Portuguese-Spanish language pairs, test sets were generated manually through crowdsourcing. Specifically, we used the CrowdFlower platform to translate tweets into the other language. Section 2.2 further discusses this process.

2.1.2 Alignment

The large volume of tweets collected in the previous step needs to be properly aligned in order to create the parallel corpus. Aligning tweets of an author within and across accounts requires both finding matching translations and occasionally discarding tweets that have no translation. We perform this process semi-automatically: first by automatically aligning tweets that are likely to be each other's translation, and then by manually checking the accuracy of those alignments.

Before we can align tweets with their likely translations, we need to identify the language each tweet is written in through language identification (Zubiaga et al., 2014). While Twitter does provide a language ID along with each tweet's metadata, Basque and Catalan are never tagged as such by Twitter, so we implemented our own language identification module for these languages. Language identification is done using TextCat[2] trained over Twitter-specific data; a hypothetical sketch of this profile-based approach follows.

[2] http://www.let.rug.nl/vannoord/TextCat/
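TextCat classifies text by comparing character n-gram frequency profiles with an "out-of-place" rank distance. For illustration only, the following is a minimal sketch of that idea; it is not the module used for the task, and the profile size, penalty and training data here are assumptions:

```python
from collections import Counter

def char_ngrams(text, n_max=4):
    """Frequency counts of character n-grams (n = 1..n_max)."""
    text = " " + text.lower() + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def rank_profile(counts, top=300):
    """TextCat-style profile: the most frequent n-grams, mapped to their rank."""
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences between document and language profiles.
    n-grams missing from the language profile pay a fixed penalty."""
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def identify(tweet, lang_profiles):
    """Return the language whose profile is closest to the tweet's."""
    doc = rank_profile(char_ngrams(tweet))
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# lang_profiles would be built from per-language Twitter training text, e.g.:
# lang_profiles = {"eu": rank_profile(char_ngrams(basque_tweets_text)), ...}
```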
Once we have an author's tweets separated by language, and hence source-language and target-language tweets apart, we need to pair each tweet with its likely translation. For the automated process, we defined a set of heuristics and statistics to find matches accurately. Specifically, we looked at the following three characteristics (a code sketch of the full cascade follows below):

• Publication date. Translations must be published within a certain period range to be flagged as possible translations of each other; the difference between source and target timestamps must not exceed a certain threshold. The threshold was set overall to 10 hours, although for a few accounts the allowed publication date difference was restricted to 1 hour after we empirically detected too much noise with the more relaxed standard threshold.

• Overlap of hashtags and user mentions in source and target tweets. User (@) mentions are usually maintained across languages; only in a few cases did we observe them change (e.g., using @FCBarcelona_ca in a tweet in Catalan and @FCBarcelona_es in a tweet written in Spanish). Hashtags are often translated, depending on the popularity of a given hashtag in the target audience. A minimum number of user mentions and hashtags were required to overlap between source and target parallel tweet candidates. The overlap is computed as the number of entities in the intersection of both tweets divided by the number of entities in their union. The threshold is empirically set to 0.76.

• Longest Common Subsequence Ratio (LCSR) (Cormen et al., 2001) between source and target tweets. LCSR is an orthographic similarity measure, as it tells us how similar two strings are. It is especially reliable when working with closely related languages, as parallel sentences are often very close to each other, both in vocabulary and in word order. We empirically set a minimum threshold of 0.45.

As for the performance of the heuristics, publication date closeness is effective for filtering out wrong candidates, but not enough to find the correct parallel tweet on its own, so it is applied first of the three. The user and hashtag overlap ratio proved very successful, to the point that the contribution of LCSR was minimal.
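Taken together, the heuristics amount to a cascade of three filters. The following is a minimal sketch under stated assumptions: tweets are represented as dicts with hypothetical "text" and "time" (datetime) fields, entities are naively extracted from whitespace tokens, and the thresholds are the values reported above. The actual implementation used for the task may differ in these details.

```python
from datetime import timedelta

def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest Common Subsequence Ratio between two strings."""
    return lcs_len(a, b) / max(len(a), len(b)) if a and b else 0.0

def entity_overlap(src_text, tgt_text):
    """Jaccard overlap of hashtags and user mentions (naive extraction)."""
    ents = lambda t: {w for w in t.split() if w.startswith(("#", "@"))}
    s, t = ents(src_text), ents(tgt_text)
    if not s and not t:
        return 1.0  # no entities to compare; defer to the other filters (an assumption)
    return len(s & t) / len(s | t)

def is_candidate_pair(src, tgt, max_hours=10, min_overlap=0.76, min_lcsr=0.45):
    """Apply the three heuristics in the order described above:
    publication date first, then entity overlap, then LCSR."""
    if abs(src["time"] - tgt["time"]) > timedelta(hours=max_hours):
        return False
    if entity_overlap(src["text"], tgt["text"]) < min_overlap:
        return False
    return lcsr(src["text"].lower(), tgt["text"].lower()) >= min_lcsr
```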
The output of this alignment is then corrected through manual checks by native speakers of the respective languages. The manual inspection showed a low error rate in the automatic alignment, especially for ca-es: for this language pair we found a 2% error rate, evaluated over a sample of 400 tweets from the development set. For eu-es the percentage increased to 15%, also evaluated over a sample of 400 tweets. The error rate over the collections manually reviewed to create the test sets was 7% for the ca-es language pair (12,500 tweets) and 32% for the eu-es language pair (15,045 tweets).

2.2 Crowdsourced Corpus Creation using CrowdFlower

As we did not find bilingual Portuguese-Spanish or Galician-Spanish Twitter accounts, we used the CrowdFlower platform[3] to build the test data for these language pairs. CrowdFlower provides a cheap and fast method for collecting annotations from a broad base of paid non-expert contributors over the Web. It works in a similar way to Amazon's Mechanical Turk (Snow et al., 2008), which could not be used in our case because it requires a US address and credit card.

[3] http://www.crowdflower.com/

In the task we defined, the contributors had to manually translate, from Spanish into Portuguese and Galician, a dataset of 2,552 Spanish tweets, taken from both our ca-es and eu-es parallel corpora and divided into working tasks of 10 tweets each.

Instructions were provided to workers to ensure that the translations were consistent. For instance, contributors were asked not to translate user mentions (keywords with a leading @) or URLs, while hashtags should only be translated if the contributor considered it natural to use the Portuguese/Galician hashtag.

The crowdsourcing platform allows the jobs to be configured with a number of options. We used some of them with the aim of obtaining translations of reasonable quality:

• Geography. One can select a set of countries from which workers are allowed to work on the job. We limited the countries to Spain for Galician and to Portugal and Brazil for Portuguese.[4]

[4] Initially, the task to translate into Portuguese was only open to users from Portugal, as the focus is on Iberian Portuguese, but after realizing we were getting no contributions, we broadened the geographical scope to Brazil as well, which helped obtain contributions more swiftly.

• Performance level. Contributors on the platform fall into three levels according to their performance. Our jobs were limited to contributors in level 3 (the top level), defined by CrowdFlower as "the highest performance contributors who account for 7% of monthly judgments and maintain the highest level of accuracy across an even larger spectrum of CrowdFlower jobs [compared to contributors in levels 1 and 2]". In the case of Galician, we had to change this setting to level 1, as the tasks were getting completed too slowly.

• Language capability. This allows restricting the contributors that can work on the job by their language skills. For translations into Portuguese, we restricted the contributors to verified speakers of Portuguese. Galician is not in the list of languages provided by CrowdFlower, so this option was not configured in that case.

• Speed trap. If set, contributors are automatically removed from the job if they take less than a specified amount of time to complete a task. Our jobs contained tasks of 10 translations each, and the time trap was set to 150 seconds. Hence, a worker taking less than 15 seconds per tweet would be automatically removed from the job.

The task of translating into Portuguese was completed by 40 different contributors, all of them from Brazil.
The contributors were surveyed about the quality of the task: they were asked to rate, out of 5, the clarity of the instructions (average 4.12), the ease of the job (3.78), the pay (4.17) and their overall satisfaction (4.05). The task of translating into Galician was carried out by 10 contributors. They rated the task as follows: clarity of the instructions (4.90), ease of the job (3.59), pay (3.82) and overall satisfaction (4.00).

As a final result, we obtained a parallel corpus with 2,500 pt-es and 777 gl-es tweets, which were split into two test datasets with 1,225 entries per translation direction for pt-es and 388 for gl-es. To verify the quality of the translations, samples of 30 tweets were evaluated both for Portuguese and for Galician. In both cases the translations were considered acceptable by the Portuguese and Galician authors of the current paper, even if some errors were detected. In the case of Galician, we found some mistakes derived from the new spelling rules in force since 2003. In the case of Portuguese, six errors (most of them lexical problems) were found in the 30 tweets evaluated.

Dataset      Tweets  Authors  Tokens  URLs   @user
eu-es dev     4,000        4    181K  2,622  1,569
ca-es dev     4,000        2    161K  3,280    823
eu-es test    2,000       16     37K  1,556    673
es-eu test    2,000       16     43K  1,535    692
ca-es test    2,000       14     45K  1,590    417
es-ca test    2,000       14     46K  1,567    502
gl-es test      434        -      7K    274    134
es-gl test      434        -      7K    291    159
pt-es test    1,250        -     19K    674    349
es-pt test    1,250        -     21K    919    583

Table 1: Statistics for the datasets generated.

2.3 Datasets Post-processing

Before delivering the datasets to the participants, the test sets were pre-processed. The development corpus includes the original tweets, with neither @user mentions nor URLs normalized; in the test corpus, however, @user mentions and URLs are standardized to IDIDID and URLURLURL, respectively.

Datasets are distributed as tab-separated (tsv) files. For each language pair two files are provided, one for each translation direction. For the language pairs where the parallel corpus was gathered exclusively from Twitter (ca, es and eu), the files contain the tweetID, userID, date and text of each tweet. For the language pairs where the corpus was obtained via crowdsourcing (gl and pt), the files contain a segmentID and the text of the tweet.

3 Evaluation Framework

The test sets just described were delivered to the participants, who had to return the translations in the following tab-separated format:

    tweetId <tab> source language text <tab> translation \n

The translated text would then be extracted, cut to a maximum length of 140 characters, and evaluated by automatic means.
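As an illustration only, a participant could serialize results in this layout and reproduce the 140-character cut as follows; the function and file names are hypothetical, not part of the task infrastructure:

```python
def write_submission(path, translations):
    """Write (tweet_id, source_text, translation) rows in the
    tab-separated submission format described above."""
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id, source, translation in translations:
            out.write(f"{tweet_id}\t{source}\t{translation}\n")

def truncate_for_scoring(translation, limit=140):
    """Only the first 140 characters per tweet were considered in evaluation."""
    return translation[:limit]

# e.g. write_submission("team1_ca-es.tsv",
#                       [("42", "Bon dia a tothom!", "¡Buenos días a todos!")])
```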
The performance of the systems is assessed with lexical and syntactic automatic evaluation measures compared against a single reference. Lexical metrics, which are mostly based on n-gram matching, are available for all the language pairs under study. Syntactic metrics, however, are only available for Spanish, and some of them for Catalan.

3.1 Evaluation Metrics

In order to study the quality of the translations at different levels, we use a wide set of metrics, defined as follows:

Lexical evaluation measures

• PER (Tillmann et al., 1997), TER (Snover et al., 2006), WER (Nießen et al., 2000): metrics based on edit distances.

• BLEU (Papineni et al., 2002), NIST (Doddington, 2002), ROUGE (RG): based on n-gram matching (lexical precision: BLEU, NIST; lexical recall: ROUGE). For ROUGE we use RGS*, i.e., a variant with skip bigrams and no max-gap-length.

• GTM (Melamed, Green, and Turian, 2003), METEOR (Banerjee and Lavie, 2005) (MTR): based on the F-measure. For GTM we use GTM2, with the parameter associated with long matches set to e = 2; for METEOR we use MTRex, i.e., only exact matching.

• Ol (Giménez and Màrquez, 2008): Lexical Overlap, a measure based on the Jaccard coefficient (Jaccard, 1912) to quantify the similarity between sets. The lexical items of the candidate and reference translations are considered as two separate sets, and the overlap is computed as the cardinality of their intersection divided by the cardinality of their union (see the sketch at the end of this section).

• ULC (Giménez and Màrquez, 2008): Uniform Linear Combination. When applied to lexical metrics it includes WER, PER, TER, BLEU, NIST, RGS*, GTM2 and MTRex.

Syntactic evaluation measures

• SP-Op, SP-Oc, SP-pNIST (Giménez and Màrquez, 2007)[5]: based on the lexical overlap according to part of speech or chunk, and the NIST score over these elements (Shallow Parsing).

• CP-Op, CP-Oc, CP-STM9 (Giménez and Màrquez, 2007)[6]: based on the lexical overlap among parts of speech or constituents of constituency parse trees (Constituency Parsing).

• ULC (Giménez and Màrquez, 2008): Uniform Linear Combination. When applied to syntactic metrics it includes the available metrics for the specific language.

[5] Family of metrics only available for Catalan and Spanish.
[6] Family of metrics only available for Spanish.

All measures have been calculated with the Asiya toolkit[7] for MT evaluation (Giménez and Màrquez, 2010).

[7] http://nlp.cs.upc.edu/asiya/
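Among these metrics, Ol and ULC are simple enough to restate in code. The following is a sketch assuming whitespace tokenization and metric scores already mapped to a comparable scale where higher is better; Asiya's exact tokenization and normalization may differ:

```python
def lexical_overlap(candidate, reference):
    """Ol: Jaccard coefficient between the token sets of the
    candidate and the reference translation."""
    c, r = set(candidate.split()), set(reference.split())
    return len(c & r) / len(c | r) if c | r else 0.0

def uniform_linear_combination(scores):
    """ULC: the uniform (plain) average of the component metric scores,
    e.g. scores = {"BLEU": 76.7, "NIST": 82.0, ...}. Error metrics such
    as WER would first be inverted (e.g. 100 - WER) so that higher is
    better -- an assumption about the normalization step."""
    return sum(scores.values()) / len(scores)
```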
4 Shared Task Results

Participants were required to register[8] in order to obtain the development and test datasets. Each participant had only 72 hours to work on the test set and send the results.

[8] http://komunitatea.elhuyar.eus/tweetmt/participation/

4.1 Overview of the Systems Submitted

Out of the 5 initially registered participants, only three teams ended up submitting results: DCU (Dublin City University) for 3 tracks (ca-es, eu-es, pt-es) (Toral et al., 2015); EHU (University of the Basque Country) for the eu-es track (Alegria et al., 2015); and UPC (Universitat Politècnica de Catalunya) for the ca-es track (Martínez-Garcia, España-Bonet, and Màrquez, 2015). In all, two teams submitted results for the eu-es and ca-es tracks, one team participated in the pt-es track, and no submissions were received for the gl-es pair.

The related shared tasks we organized in recent years (TweetNorm (Alegria et al., 2013) and TweetLID (Zubiaga et al., 2014)) attracted a higher number of participants. One of the reasons for this drop might be that English was not among the languages included in the task this time; this could have made the task less appealing to some groups, leading to fewer participants from outside the Iberian Peninsula.

The main characteristics of the systems submitted are compiled in Table 2 and can be summarized as follows:

DCU: This team submitted systems for three language pairs in both directions: Spanish from/to Catalan, Basque and Portuguese. They used a range of techniques, including state-of-the-art SMT, morphological segmentation (only for Basque, as a morphologically rich language), data selection as a means of domain adaptation, available open-source rule-based systems and, finally, system combination to combine the strengths of the different systems that were built. DCU gathered vast amounts of tweets (from 11M for Basque to 130M for Spanish) to perform monolingual domain adaptation, and complemented this with publicly available general-domain monolingual and parallel corpora. The first (DCU1), second (DCU2) and third (DCU3) systems submitted for each language direction were the individual systems or combinations that obtained the best, second best and third best result, respectively, on the development set.

EHU: This team submitted systems for the Basque-Spanish pair, adapting previous MT engines for the es-eu and eu-es directions. For translation into Basque both RBMT and SMT engines were adapted, whereas for translation from Basque only an SMT-based system was used. The main work consisted of pre- and post-processing for adaptation to tweets and collecting new resources for training and tuning the systems. For RBMT, a small dictionary of hashtags was obtained from the development set. For SMT, language models were improved using monolingual corpora from previous shared tasks and a new corpus of tweets in Basque.

UPC: The team submitted two systems for the Catalan-Spanish language pair. The first one (UPC1) is a standard SMT system built with Moses (Koehn et al., 2007) and trained with 2,178,796 parallel sentences extracted from the El Periódico parallel corpus[9]. The second system (UPC2) uses a document-level decoder, Docent (Hardmeier et al., 2013), which takes UPC1 as a first step; in addition, it uses as extra features semantic models obtained with word2vec (Mikolov et al., 2013). Besides the parallel tweets available for the shared task, both systems use monolingual tweets for genre and domain adaptation. UPC2 was only submitted for Catalan-to-Spanish. The authors report some problems with this configuration and include both the official and new results in their paper; here only the official results are shown.

[9] http://catalog.elra.info/product_info.php?products_id=1122

System  Main engine                Distinctive features
DCU1    System combination or SMT  Moses and Apertium (ES↔CA); Moses, cdec and Apertium (ES→EU); cdec (EU→ES); Moses (ES↔PT).
DCU2    System combination or SMT  Moses (ES→CA); Moses, cdec and Apertium (CA→ES, EU→ES); Moses, cdec, ParFDA, Matxin and Morph (ES→EU); Moses and cdec (ES↔PT).
DCU3    System combination or SMT  Moses, cdec and Apertium (ES→CA, ES↔PT); Moses, ParFDA and Apertium (CA→ES); Moses, cdec, Matxin and Morph (ES→EU); Moses, cdec, Apertium and Morph (EU→ES).
EHU1    SMT                        Specific language model and pre- and post-processing for tweets.
EHU2    RBMT                       Adaptation to tweets (mainly hashtags).
UPC1    SMT                        Moses system.
UPC2    SMT                        Document-level system (Docent), semantic models.

Table 2: Summary of the systems developed by the participants.
4.2 Results

Participants had a 72-hour window to work with the test set and submit up to three results per track. This section is a recap of the results for all tracks and systems.

Tables 3 and 4 show the results for the participants in the ca-es track: Table 3 reports the lexical measures introduced in the previous section and Table 4 the syntactic ones. Five systems from two teams were evaluated. DCU3 was the best system in the ca-es direction, a combination of two kinds of SMT engines plus an RBMT one. For the es-ca direction, the two simplest pure phrase-based SMT systems, UPC1 and DCU2, obtained the highest scores. The two teams used very similar corpora in their experiments, so the techniques they applied make the difference in this case.

Catalan to Spanish
System  WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
DCU1    15.24  12.49  13.25  76.73  12.09  72.75  83.80  83.37  83.70  77.84
DCU2    15.15  12.41  13.21  76.52  12.09  72.18  83.76  83.70  83.56  77.86
DCU3    14.59  11.74  12.50  77.70  12.16  73.37  84.63  84.45  84.64  79.67
UPC1    20.17  16.40  19.42  68.20  11.22  62.71  78.46  77.31  74.72  63.82
UPC2    25.10  17.09  22.25  63.12  10.93  57.92  76.44  75.56  73.76  57.45

Spanish to Catalan
System  WER    PER    TER    BLEU   NIST   GTM2   MTRex  RGS*   Ol     ULC
DCU1    16.70  12.42  14.46  75.79  11.88  70.45  52.08  82.65  82.88  66.23
DCU2    15.17  11.71  13.21  77.75  11.96  72.65  53.44  83.32  83.96  69.98
DCU3    17.09  13.08  14.70  75.25  11.85  70.34  51.73  82.26  82.46  64.94
UPC1    14.35  11.25  13.63  77.93  12.04  72.69  53.98  82.19  83.18  70.56

Table 3: Evaluation with a set of lexical metrics (see Section 3.1 for a description) for the participant systems on the Catalan-Spanish language pair. Results are obtained considering only the first 140 characters per tweet.

Catalan to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    80.92     81.40     74.10    81.78     83.03     10.87     99.24
DCU2    80.71     81.50     74.19    81.66     82.80     10.90     99.22
DCU3    81.64     82.27     74.48    82.40     83.73     10.92     100.00
UPC1    68.52     70.93     58.74    70.95     71.97     9.36      84.47
UPC2    70.59     72.89     63.05    73.01     73.99     10.01     88.06

Spanish to Catalan
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    -         -         -        80.77     82.10     10.78     98.41
DCU2    -         -         -        82.13     83.14     10.88     99.67
DCU3    -         -         -        80.19     81.42     10.75     97.81
UPC1    -         -         -        81.59     82.02     10.99     99.33

Table 4: Evaluation with a set of syntactic metrics (see Section 3.1 for a description) for the Catalan-Spanish language pair. Results are obtained considering only the first 140 characters per tweet. Not all syntactic metrics are available for Catalan.
Tables 5, 6, 7 and 8 show the results for the participants in the eu-es and pt-es tracks. For the eu-es track, four or five systems (depending on the direction) were presented by two teams. In general, the best translator for this language pair is the statistical system EHU1 in both directions. When translating from Spanish into Basque, however, DCU2, the combination of 5 different systems, obtains very similar scores; differences in this case are in general not statistically significant.

Finally, in the pt-es track DCU submitted the results of three systems. DCU3 was the best in the pt-es direction; as in the ca-es track, their best system is again a combination of two kinds of SMT engines and an RBMT one. In the opposite direction the best system, DCU2, does not include translation options from the RBMT engine, probably reflecting its lower quality on tweets. Notice that their best system in development does not correspond to the best system in test.

Basque to Spanish
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    62.19  44.72  56.37  25.30  6.46  32.70  45.71  34.20  44.48  59.78
DCU2    61.24  44.95  55.35  25.30  6.53  33.14  46.12  34.61  44.92  60.63
DCU3    61.04  44.78  54.99  25.44  6.56  33.34  46.32  35.31  45.50  61.31
EHU1    61.53  38.17  52.96  28.61  6.94  34.53  50.57  40.80  51.12  69.13

Spanish to Basque
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    61.48  47.56  55.81  23.22  5.96  32.45  40.00  29.92  42.87  66.27
DCU2    61.06  46.27  55.17  24.44  6.12  33.17  41.18  31.95  44.29  69.18
DCU3    61.77  47.30  56.07  23.42  5.96  32.48  40.12  30.38  43.00  66.56
EHU1    62.00  45.04  56.06  24.34  6.14  33.17  41.98  32.22  45.07  69.63
EHU2    66.43  50.13  62.46  19.54  5.29  29.29  36.36  23.30  38.15  55.33

Table 5: Evaluation with a set of lexical metrics for the Basque-Spanish language pair.

Basque to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    36.82     38.54     29.67    40.94     43.43     5.24      87.99
DCU2    37.13     38.84     29.77    41.16     43.67     5.23      88.40
DCU3    37.71     39.32     30.11    41.69     44.20     5.27      89.45
EHU1    43.26     45.19     33.59    47.42     49.80     5.48      100.00

Table 6: Evaluation with a set of syntactic metrics for the Basque-Spanish language pair. These metrics are not available for Basque.

Portuguese to Spanish
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    40.51  33.22  37.39  43.36  8.70  42.69  58.85  52.59  58.77  65.48
DCU2    39.86  33.41  36.87  43.67  8.74  43.28  59.12  52.86  58.96  66.17
DCU3    39.08  33.09  36.11  44.28  8.83  43.90  59.89  53.61  59.54  67.54

Spanish to Portuguese
System  WER    PER    TER    BLEU   NIST  GTM2   MTRex  RGS*   Ol     ULC
DCU1    47.68  39.40  44.45  36.13  7.57  37.71  53.78  44.10  52.38  65.27
DCU2    46.51  36.67  43.08  37.25  7.77  38.30  54.15  45.24  53.57  68.05
DCU3    47.04  36.51  43.39  36.94  7.76  38.14  53.71  45.19  53.39  67.61

Table 7: Evaluation with a set of lexical metrics for the Portuguese-Spanish language pair.

Portuguese to Spanish
System  CP-Oc(*)  CP-Op(*)  CP-STM9  SP-Op(*)  SP-Oc(*)  SP-pNIST  ULC
DCU1    53.57     55.48     44.99    57.32     59.06     8.15      98.51
DCU2    53.85     55.66     45.30    57.48     59.24     8.17      98.89
DCU3    54.53     56.28     45.96    58.06     59.92     8.23      100.00

Table 8: Evaluation with a set of syntactic metrics for the Portuguese-Spanish language pair. These metrics are not available for Portuguese.

Based on the previous figures, as well as on the conclusions drawn by the authors of the papers submitted to the shared task (Toral et al., 2015; Alegria et al., 2015; Martínez-Garcia, España-Bonet, and Màrquez, 2015), we can emphasize the following conclusions:

• The results are in general very good when compared to previous results for the same language pairs (Alegria et al., 2015).

• Combining techniques, including RBMT and SMT, can lead to improvements (Toral et al., 2015).

• Expanding the context by using a user's tweets within the same day can help boost the performance of the machine translation system (Martínez-Garcia, España-Bonet, and Màrquez, 2015).
5 Conclusion

The shared task organized at TweetMT has enabled us to come up with a benchmark parallel corpus of tweets for translation, covering four language pairs: ca-es, eu-es, gl-es and pt-es. This has allowed participants to tune and compare their MT systems. The corpus developed for the shared task can be downloaded from the workshop's website[10], which we expect will enable further research in the field.

[10] http://komunitatea.elhuyar.eus/tweetmt/resources/

The participants of the shared task have applied and studied the suitability of state-of-the-art MT techniques. These techniques have been adapted to the specific features of tweets, including conventions such as hashtags and user mentions, as well as the brevity of the texts. The study of the results achieved by the submitted systems enables us to draw conclusions that better inform future research in the field.

The results achieved by the participants of the shared task are surprisingly high, especially considering that we are dealing with tweets, whose brevity and specific characteristics make them more challenging to translate. Still, it is worth noting that the tweets considered in this shared task can largely be deemed formal. Therefore, we could say that the task (translating formal tweets generated from multilingual Twitter accounts) was easier than usual MT tasks. A more thorough analysis of the task and of the performance of the participating systems will follow in an extended version of this paper, including the conclusions drawn from the discussion at the workshop.

We should also emphasize that these results cannot be generalized to the broader task of translating tweets. However, the fact that formal tweets can be accurately translated encourages the use of MT by community managers who tweet in different languages, by making their work easier. One of our main objectives for future work is to further generalize the machine translation task by including all kinds of tweets, to assess the ability of MT systems to translate informal tweets too. A second version of the TweetMT dataset would include:

• Tweets in English, so that we can attract a larger number of participants and compare a larger number of MT systems.

• A more general Twitter dataset including informal tweets as well, in order to test the results of MT on a corpus as large and diverse as Twitter.

One of the main remaining challenges is the need for a methodology to put together a gold standard corpus that encompasses the different types of tweets that one can find on Twitter, including more informal tweets than those considered here. To tackle such a process, we would first need to solve open questions such as whether (and how) to translate words that are not written in their normalized form, as well as how to deal with multilingualism within a single tweet. We are confident that the discussion among the attendees of the workshop, the presentations of accepted papers, and the invited talk (Gonzàlez, 2015) will help pave the way in this crucial task.

Acknowledgements

This work has been supported by the following projects: Abu-Matran (FP7-PEOPLE-2012-IAPP), PHEME (FP7, grant No. 611233), Tacardi (Spanish MICINN, TIN2012-38523-C02-01), QTLeap (FP7, grant No. 610516), HPCPLN (Galician Government, EM13/041) and Celtic (Innterconecta programme, 2012-CE138).

References

Alegria, Iñaki, Nora Aranberri, Víctor Fresno, Pablo Gamallo, Lluís Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2013. Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español. In Tweet-Norm@SEPLN, pages 1–9.
Alegria, Iñaki, Mikel Artetxe, Gorka Labaka, and Kepa Sarasola. 2015. EHU at TweetMT: Adapting MT engines for formal tweets. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Cormen, Thomas H., Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. 2001. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the 2nd International Conference on Human Language Technology (HLT), pages 138–145, San Diego, CA, USA.

Giménez, Jesús and Lluís Màrquez. 2007. Linguistic Features for Automatic Evaluation of Heterogenous MT Systems. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 256–264, Prague, Czech Republic. Association for Computational Linguistics.

Giménez, Jesús and Lluís Màrquez. 2008. A Smorgasbord of Features for Automatic MT Evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198, Columbus, Ohio, June. Association for Computational Linguistics.

Giménez, Jesús and Lluís Màrquez. 2010. Asiya: an Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, 94:77–86.

Gonzàlez, Meritxell. 2015. An analysis of twitter corpora and the differences between formal and colloquial tweets. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Gotti, Fabrizio, Philippe Langlais, and Atefeh Farzindar. 2013. Translating government agencies' tweet feeds: Specificities, problems and (a few) solutions. NAACL 2013, page 80.

Hardmeier, C., S. Stymne, J. Tiedemann, and J. Nivre. 2013. Docent: A document-level decoder for phrase-based statistical machine translation. In Proceedings of the 51st Annual Conference of the Association for Computational Linguistics, pages 193–198.

Jaccard, Paul. 1912. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50.

Jehl, Laura, Felix Hieber, and Stefan Riezler. 2012. Twitter translation using translation-based cross-lingual retrieval. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT '12, pages 410–421, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180.

Ling, Wang, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL '13. Association for Computational Linguistics.

Martínez-Garcia, Eva, Cristina España-Bonet, and Lluís Màrquez. 2015. The UPC TweetMT participation: Translating formal tweets using context information. In TweetMT@SEPLN, Proc. of the SEPLN 2015.
Melamed, I. Dan, Ryan Green, and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Joint Conference on Human Language Technology and the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 61–63, Edmonton, Canada.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR. http://code.google.com/p/word2vec.

Munro, Robert. 2010. Crowdsourced translation for emergency response in Haiti: the global collaboration of local knowledge. In AMTA Workshop on Collaborative Crowdsourcing for Translation, pages 1–4.

Nießen, Sonja, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pages 39–45, Athens, Greece.

Papineni, K., S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Peisenieks, Jānis and Raivis Skadiņš. 2014. Uses of machine translation in the sentiment analysis of tweets. In Human Language Technologies – The Baltic Perspective: Proceedings of the Sixth International Conference Baltic HLT 2014, volume 268, page 126. IOS Press.

Petrovic, Sasa, Miles Osborne, and Victor Lavrenko. 2010. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 25–26.

Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Seventh Conference of the Association for Machine Translation in the Americas (AMTA 2006), pages 223–231, Cambridge, Massachusetts, USA.

Snow, Rion, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast – but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 254–263.

Tillmann, C., S. Vogel, H. Ney, A. Zubiaga, and H. Sawaf. 1997. Accelerated DP Based Search for Statistical Translation. In Proceedings of the Fifth European Conference on Speech Communication and Technology, pages 2667–2670, Rhodes, Greece.

Toral, Antonio, Xiaofeng Wu, Tommi Pirinen, Zhengwei Qiu, Ergun Bicici, and Jinhua Du. 2015. Dublin City University at the TweetMT 2015 shared task. In TweetMT@SEPLN, Proc. of the SEPLN 2015.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In TweetLID@SEPLN.