    The UPC TweetMT participation: Translating Formal Tweets
                  Using Context Information
     Participación de la UPC en TweetMT: Traducción de tweets formales
                         usando información de contexto

        Eva Martínez Garcia, Cristina España-Bonet
        TALP Research Center
        Universitat Politècnica de Catalunya
        Jordi Girona, 1-3, 08034 Barcelona, Spain
        {emartinez,cristinae}@cs.upc.edu

        Lluís Màrquez
        Qatar Computing Research Institute
        Qatar Foundation
        Tornado Tower, Floor 10, P.O. Box 5825, Doha, Qatar
        lmarquez@qf.org.qa

       Resumen: En este artículo describimos los sistemas con los que participamos en
       la tarea compartida TweetMT. Desarrollamos dos sistemas para el par de idiomas
       castellano–catalán: un traductor estadístico diseñado a nivel de frase y un sistema
       sensible al contexto aplicado a tweets. En el segundo caso definimos el “contexto”
       de un tweet como los tweets producidos por un mismo usuario durante un día.
       Estudiamos el impacto de este tipo de información en las traducciones finales cuando
       se usa un traductor a nivel de documento. Una variante de este sistema incluye
       modelos semánticos adicionales.
       Palabras clave: Traducción automática, Twitter, Traducción sensible al contexto
       Abstract: In this paper, we describe the UPC systems that participated in the
       TweetMT shared task. We developed two main systems for the Spanish–Catalan
       language pair: a state-of-the-art phrase-based statistical machine translation
       system and a context-aware system. In the second approach, we define the
       “context” of a tweet as the tweets produced by the same user on the same day,
       and we study the impact of this kind of information on the final translations
       when using a document-level decoder. A variant of this approach also considers
       semantic information from bilingual embeddings.
       Keywords: Machine Translation, Twitter, Context Aware Translation

This research has been partially funded by the TACARDI project (TIN2012-38523-C02) of the
Spanish Ministerio de Economía y Competitividad (MEC). The authors thank Iñaki Alegría and
Gorka Labaka for providing monolingual corpora.

1    Introduction
Twitter is a very popular social network. This microblogging service allows users to share
a huge amount of information quickly. Twitter users usually produce monolingual content
(34% in English and 12% in Spanish, for example [1]). However, Twitter is a multilingual
communication environment: many users of different nationalities post messages in their
own language. So, to ease the spread of information, it would be useful to post messages
in several languages simultaneously. One option for creating multilingual tweets is
crowdsourcing manual translations. Meedan is a non-profit organization [2] which uses this
resource to share news between the Arabic- and English-speaking communities. Another
example, although in this case applied to SMS, is the crowdsourced translation effort
during the 2010 earthquake in Haiti (Munro, 2010). It allowed Haitian Kreyol and
French-speaking communities of volunteers to translate texts into English and to
categorize and geolocate the messages in real time in order to help the primary emergency
responders. There are also a few works applying machine translation to tweets, and the
interest in the topic is growing over the years. In (Gotti, Langlais, and Farzindar, 2013),
the application of statistical machine translation (SMT) systems to translate tweets from
the Canadian Government Agencies is studied. The authors in (Jehl, Hieber, and Riezler,
2012) describe a system that does not rely on parallel data. Instead, they try to find
similar tweets in the target language in order to train a standard phrase-based SMT
pipeline.

   [1] 522376/the-many-tongues-of-twitter/
   [2] http://news.meedan.net

    All previous papers describe some common problems when trying to translate tweets or
short messages. The first and most usual obstacle is the colloquial language used in the
messages, closely followed by writing errors. To address these phenomena, it is necessary
to apply a normalization step prior to translation. Also, the Twitter 140-character
constraint is hard to maintain in a translation, so an effort must be made to generate
tweets of legal length. Another very common problem is handling hashtags: it is not clear
whether they have to be translated or not, nor where they should be positioned.
    TweetMT is a shared task with the aim of translating formal tweets [3]. These are
messages usually tweeted by institutions; they are well written and make no use of
colloquial vocabulary.

   [3] All the information and resources related to the TweetMT2015 shared task are
       available at: http://komunitatea.elhuyar.org/tweetmt/

    In this paper we introduce the two systems presented to the competition for the
Spanish–Catalan language pair and also some of the improvements made after the submission
deadline. First, we present a state-of-the-art SMT system adapted and tuned using Twitter
messages. Second, we present a system that looks at the context information of a tweet to
improve its translation. This second system uses a document-level decoder to take the
context into account, and it can be combined with bilingual distributed vector models,
which make it possible to consider additional semantic information.
    This paper is organized as follows. We describe the developed systems in Section 2 and
analyze the obtained results in Section 3. Finally, we present some discussion and
guidelines for future work in Section 4.

2    System Description
This section describes the corpora used for training the systems (both general corpora and
corpora of tweets) and their processing as a common resource for the two main translation
engines.

2.1    Data
As parallel corpus we use the Spanish–Catalan corpus El periódico, a collection of news
with 2,478,130 aligned sentences that is available in the ELRA catalogue [4]. The
shared-task organization released 4,000 parallel tweets for development and 2,000 for
testing.

   [4] php?products_id=1122

    In order to adapt the systems to the Twitter genre, we also gathered a collection of
monolingual tweets. The Catalan corpus of tweets was collected using the Twitter API from
13th March 2015 to 8th May 2015. We selected 65 users with accounts mainly belonging to
Catalan institutions, sports clubs or newspapers. This way, we expect users to post mostly
in formal language. It is worth noting that there is some overlap between the users that
we selected and the ones considered in the TweetMT corpora. In particular, we used some
tweets from museupicasso, Liceu cat, Penya1930 and RCDEspanyol. Since the TweetMT test and
development data were collected in 2013–2014 and our monolingual tweet corpora in 2015,
there is no overlap between the training data and the tweet corpora delivered for the
task. With this methodology we obtained 90,744 tweets in Catalan. A similar corpus in
Spanish was already available as a resource of the Tweet Normalization Workshop (Alegria
et al., 2014) [5]. In this case, 227,199 tweets were collected in only two days, the 1st
and 2nd of April 2013.
    We also use standard monolingual corpora to build larger language models. On the one
hand, the corpora available in Catalan on the Opus site [6] are selected (4.8M sentences).
On the other hand, the corpora provided for the WMT13 Quality Estimation Task [7] are used
for Spanish (53.8M sentences).

   [5] http://komunitatea.elhuyar.org/tweet-norm
   [6] http://opus.lingfil.uu.se/, Corpora: DOGC, KDE4, OpenSubtitles 2012 and 2013,
       Ubuntu and Tatoeba corpora (Tiedemann, 2012; Tiedemann, 2009) (66.5M words).
   [7] http://statmt.org/wmt13/ Corpora: Europarl corpus v7; United Nations;
       NewsCommentary 2007, 2008, 2009 and 2010; AFP, APW and Xinhua (1.59G words).

    We pre-processed the development dataset and the monolingual corpora of tweets in
order to make them similar to the format of the
test set. That includes replacing every URL in the data with the URLURLURL label and
substituting every username with the IDIDID label. We decided not to translate hashtags
due to their difficulty and because we observed that, in the development set,
approximately two thirds of them remained untranslated. In order to maintain the hashtag
information, we replace every hashtag in a tweet by an Hn label, where n is the number of
the hashtag, and we maintain a record file where the hashtags that appear in each tweet
are stored. This strategy allows us to generalize the translation for every hashtag and
eases the replacement by the corresponding original value before building the final
translation. The position of the hashtags in our systems is determined by the position
assigned to the corresponding labels by the decoder.

2.2    Basic SMT System
Our basic approach is a state-of-the-art phrase-based SMT system based on the Moses
decoder (Koehn et al., 2007) and GIZA++ (Och and Ney, 2003). We trained the system using
the El periódico Spanish–Catalan parallel corpus.
    Language models were built using the SRILM toolkit (Stolcke, 2002). The Spanish
general language model is an interpolation of several 5-gram language models with
interpolated Kneser-Ney discounting as given by (Specia et al., 2013) [8]. The Catalan
5-gram language model has been built with the same features on the general Catalan
monolingual corpus explained above. In order to adapt the Moses system to the Twitter
genre, we introduced a second language model trained using only the tweet corpora
described in the previous subsection. The Moses decoder uses both language models as
feature functions.
    Finally, the system is tuned with MERT (Och, 2003) against the BLEU measure (Papineni
et al., 2002) on the tweets of the development set.

   [8] Interpolation weights were trained with the interpolate-lm.perl script from Moses
       and the interpolated language models were binarized afterwards.

2.3    Context-Aware SMT System
A current limitation of standard SMT systems is the fact that they translate sentences one
after the other without using the information given by the surrounding ones. This problem
can be even more pronounced in short sentences such as tweets, where the number of content
words is very small (17 words/tweet on average in the development set for both languages).
    In order to alleviate this limitation, we use a document-level decoder that takes a
whole document as its translation unit. In our case, one first has to define what a
document is. After analyzing the development data, we decided to define the context of a
tweet as the surrounding tweets posted by the same user during the same day. In cases
where this number was less than 30, we put together the tweets posted during consecutive
days until reaching the threshold of at least 30 tweets. In that way, we expect to obtain
collections of tweets (documents) that are closely related, since they come from the same
source and have been produced in a short lapse of time. Notice that this way of choosing
the related tweets does not reflect a real scenario on Twitter, where only the previous
tweets from a particular user are available. However, in an offline scenario, considering
past and future context will better characterize the domain of the messages. We leave the
comparison between both implementations as future work.
    In our experiments, we use a document-oriented decoder: the Docent decoder (Hardmeier
et al., 2013; Hardmeier, Nivre, and Tiedemann, 2012). In a nutshell, this decoder moves
from a sentence search space to a document search space. It computes and maximizes the
translation score for a document as a whole and not only for a sentence. However, Docent
also has features that can work at phrase level. In fact, the first step in the document
search of this decoder is equivalent to the SMT system that we described previously.

2.3.1 Semantic Models
The Docent framework also allows the use of distributed models as semantic space language
models. We want to take advantage of this characteristic and introduce more semantic
information in our system by using embeddings trained with the word2vec package (Mikolov
et al., 2013a; Mikolov et al., 2013b). Since our goal is to use the embeddings for
translation, we train bilingual models following the same strategy as in (Martínez-Garcia
et al., 2014): the units used to train the vector models are bilingual pairs of
targetWord sourceWord. These vectors are useful to capture information related not only to
the target-side or source-side words, but also to the translations themselves. For this
system, we use the best configuration obtained in (Martínez-Garcia et al., 2014), that is,
we train a CBOW architecture using a context window of 5 tokens to get 600-dimensional
vectors. The aligned parallel corpus needed to train the models was obtained from the Opus
collection and is built up with the OpenSubtitles 2012 and 2013, Tatoeba and EUbookshop
parallel corpora. The final semantic models contain 1,527,004 Catalan Spanish units and
1,391,022 Spanish Catalan units. When translating a document, Docent uses these semantic
models to estimate an additional score for every phrase that is proportional to the
distance between the vectors of that phrase and those of its local context [9].

   [9] The local context of a phrase consists of its previous 30 tokens.

3    Evaluation
In the previous section we have introduced three different translation systems: a standard
sentence-level SMT system (SMT), a document-level SMT system (DSMT) and a document-level
SMT system enriched with additional semantic information (semDSMT). For the shared task we
only submitted results with the SMT and semDSMT systems (SMTsub and semDSMTsub systems).
However, some problems with the input tokenization were found after the submission [10].
In this section, we report the results both before and after solving this issue. We also
found a problem in the integration of the semantic vector models inside the
document-oriented decoder (semDSMT systems) that invalidates the results of this system
submitted to the task.
    Automatic evaluation results for our systems are shown in Table 1. We obtained these
results using the Asiya toolkit (Giménez and Màrquez, 2010) for several lexical metrics:
WER, PER, TER, BLEU, NIST, GTM2 [11], MTRex [12], RGS* [13], Ol [14] (Nießen et al., 2000;
Tillmann et al., 1997; Snover et al., 2006; Snover et al., 2009; Papineni et al., 2002;
Doddington, 2002; Melamed, Green, and Turian, 2003; Denkowski and Lavie, 2012; Lavie and
Agarwal, 2007; Lin and Och, 2004) and a normalized arithmetic mean of the lexical metric
scores (ULC) (Giménez and Màrquez, 2008). Comparing the SMT and SMTsub rows in Table 1
when translating from Catalan to Spanish, it is clear that fixing the tokenization problem
in the Catalan test set significantly improves the scores in all the metrics.

   [10] There were errors when tokenizing the article form l’ as well as other elided
        forms like ’n, d’ or s’. Also, we fixed the tokenization of the pronouns that
        appear after a verb with a hyphen, as in animar-los or donar-nos.
   [11] We use the GTM version with the parameter associated to long matches e = 2.
   [12] We use the METEOR version using only exact matching.
   [13] We use the ROUGE variant which skips bigrams without max-gap-length.
   [14] Lexical overlap inspired by the Jaccard coefficient for set similarity.

    Note that, when translating into Spanish, the SMT system outperforms the rest, whereas
when translating into Catalan the DSMT system is the one with the best scores in most
metrics. We observe that the differences between the scores of the SMT and DSMT systems
are not statistically significant when translating from Spanish to Catalan, but the
differences in the other translation direction are indeed statistically significant, both
measured at the 95% confidence level [15]. For example, the BLEU score obtained by the SMT
system is 1.32 points higher than that of DSMT when translating into Spanish, but DSMT has
0.12 more BLEU points than SMT in the other direction. The similarity between the results
of the SMT and DSMT systems has two main reasons. On the one hand, the DSMT system departs
from the SMT one, so, for an already good translation, such as the ones obtained for
tweets, only few changes are applied. On the other hand, the automatic evaluation metrics
are not sensitive to the changes due to the context information. It is also important to
notice that there exists only one reference. This fact makes it more difficult to obtain
an accurate evaluation of the translations, since correct variations, using synonyms for
example, will be scored as wrong translations.

   [15] Significance of the difference between the systems measured for the NIST and BLEU
        metrics using the implementation of paired bootstrap resampling included in the
        Moses decoder.

                        Catalan to Spanish, 140 chars/tweet
 System        WER     PER     TER     BLEU    NIST    GTM2    MTRex   RGS*    Ol      ULC
 SMTsub        20.17   16.40   19.42   68.20   11.22   62.71   78.46   77.31   74.72   65.04
 semDSMTsub    25.10   17.09   22.25   63.12   10.93   57.92   76.44   75.56   73.76   58.62
 SMT           14.96   11.82   14.16   76.67   12.07   71.48   84.08   81.34   82.11   78.62
 DSMT          15.74   12.23   14.94   75.35   11.92   69.79   83.38   80.74   81.40   76.75

                        Catalan to Spanish, free
 System        WER     PER     TER     BLEU    NIST    GTM2    MTRex   RGS*    Ol      ULC
 SMTsub        20.13   16.31   19.38   68.25   11.22   62.71   78.52   77.35   74.76   65.05
 semDSMTsub    25.07   17.01   22.21   63.17   10.94   57.92   76.50   75.60   73.80   58.62
 SMT           14.92   11.74   14.13   76.73   12.08   71.49   84.14   81.39   82.15   78.65
 DSMT          15.70   12.15   14.90   75.41   11.92   69.79   83.45   80.79   81.44   76.79

                        Spanish to Catalan, 140 chars/tweet
 System        WER     PER     TER     BLEU    NIST    GTM2    MTRex   RGS*    Ol      ULC
 SMTsub        14.35   11.25   13.63   77.93   12.04   72.69   53.98   82.19   83.18   66.51
 SMT           14.32   11.30   13.58   78.07   12.04   73.02   54.08   82.21   83.29   66.62
 DSMT          14.22   11.10   13.46   78.19   12.07   72.96   54.14   82.45   83.46   67.10

                        Spanish to Catalan, free
 System        WER     PER     TER     BLEU    NIST    GTM2    MTRex   RGS*    Ol      ULC
 SMTsub        14.33   11.24   13.61   77.93   12.04   72.69   53.99   82.19   83.20   66.51
 SMT           14.31   11.30   13.56   78.06   12.04   73.02   54.09   82.21   83.31   66.62
 DSMT          14.20   11.09   13.44   78.19   12.07   72.97   54.15   82.45   83.48   67.11

Table 1: Evaluation with a set of lexical metrics for our systems on the Catalan–Spanish
language pair. Results include the scores obtained with the raw translations (free) and
with the restriction of only considering the first 140 characters per tweet (140
chars/tweet).

    For instance, in the first example in Table 2, we observe how the DSMT obtains a
better translation than the SMT system with respect to the reference, but actually both
systems obtain a correct translation. There are other examples where the DSMT has a
correct translation but it does not match the reference, as shown in Example 2 from
Table 2. In this case, both systems obtain good translations but the SMT translation is
closer to the reference, since the DSMT uses synonyms for partido and FCB (encuentro and
Barça respectively). One example where the context information is useful is Example 3 in
Table 2, where DSMT uses cancha instead of pista to translate pista, which is a more
concrete option since the user account that produced the message belongs to a famous
Spanish basketball team that mostly tweets information about basketball. In the other
direction, we found similar phenomena. Example 4 in Table 2 shows again how both systems
generate correct translations. In spite of the spelling mistake in the reference (cumpleix
instead of compleix), this time the closest translation to the reference is the one from
the SMT system, but the DSMT one is still correct.
    Most of the problems that we found in our experiments are related to the lack of
normalisation of the source and to the decision of keeping the hashtags untranslated. We
found several examples where the original tweet is not well written and this produces
errors in the translations. For instance, “Gràcies x ls mencions sobre l’expo
#PostPicasso”, where our systems are not able to translate the informal abbreviation ls
correctly. Regarding the hashtags, we found “#elmésllegit”, which appears translated as
“#lomásleído” in the reference, whereas in our systems we decided to preserve the original
hashtags.
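
The label-based preprocessing of Section 2.1, which is what keeps hashtags untranslated, can be illustrated with a minimal sketch. It is not the scripts used for the submission: the regular expressions, the function names `mask_tweet` and `restore_hashtags`, and the example tweet (including its URL) are assumptions made for illustration. Only the labels URLURLURL, IDIDID and Hn come from the description above.

```python
import re

# Assumed patterns; the paper does not specify how URLs, usernames
# and hashtags were detected.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
LABEL_RE = re.compile(r"\bH(\d+)\b")


def mask_tweet(text):
    """Replace URLs, usernames and hashtags by labels.

    Returns the masked text and the record of original hashtags,
    in order of appearance (H1 is the first hashtag, H2 the second...).
    """
    text = URL_RE.sub("URLURLURL", text)
    text = USER_RE.sub("IDIDID", text)
    hashtags = []

    def to_label(match):
        hashtags.append(match.group(0))
        return f"H{len(hashtags)}"

    return HASHTAG_RE.sub(to_label, text), hashtags


def restore_hashtags(translation, hashtags):
    """Put the original hashtags back where the decoder left the labels."""

    def to_hashtag(match):
        n = int(match.group(1))
        # Leave the token untouched if it does not refer to a recorded hashtag.
        return hashtags[n - 1] if 1 <= n <= len(hashtags) else match.group(0)

    return LABEL_RE.sub(to_hashtag, translation)


# Hypothetical tweet, invented for this example.
masked, record = mask_tweet(
    "Gràcies @museupicasso #PostPicasso http://t.co/abc #elmésllegit"
)
print(masked)
print(restore_hashtags("Gracias IDIDID H1 URLURLURL H2", record))
```

Because the labels are restored by index, the hashtags end up wherever the decoder placed the corresponding labels, which matches the behaviour described in Section 2.1.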
 Example 1: Catalan to Spanish
 Source        Els agents rurals capturen un voltor comú a l’Hospitalet
 Reference     Los agentes rurales capturan un buitre leonado en L’Hospitalet
 SMT           Los agentes rurales capturan un buitre común en L’Hospitalet
 DSMT          Los agentes rurales capturan a un buitre leonado en L’Hospitalet
 Example 2: Catalan to Spanish
 Source        Final del partit al Vicente Calderón! ATM 0-0 FCB
 Reference     Final del partido en el Vicente Calderón! ATM 0-0 FCB
 SMT           Final del partido en el Vicente Calderón! ATM 0-0 FCB
 DSMT          Final del encuentro en el Vicente Calderón! ATM 0-0 Barça
 Example 3 : Catalan to Spanish
 Source        Aquesta nit, a les 20:30 hores, el IDIDID B visita la pista del IDIDID.
 Reference     Esta noche, a las 20:30 horas, el IDIDID B visita la cancha del IDIDID.
 SMT           Esta noche, a las 20: 30 horas, el IDIDID B visita la pista del IDIDID.
 DSMT          Esta noche, a las 20: 30 horas, el IDIDID B visita la cancha del IDIDID.
 Example 4 : Spanish to Catalan
 Source        Kim Basinger cumple hoy 60 años
 Reference     Kim Basinger cumpleix avui 60 anys
 SMT           Kim Basinger compleix avui 60 anys
 DSMT          Kim Basinger avui fa 60 anys

Table 2: Translation examples of tweets by our different systems in both translation directions:
Spanish to Catalan and Catalan to Spanish.
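
To make concrete why a single reference penalizes correct variants, the following sketch scores the two outputs of Example 2 with a plain Jaccard overlap over whitespace tokens. This is a simplified stand-in, not the Ol implementation of the Asiya toolkit; the function name `lexical_overlap` and the whitespace tokenization are assumptions.

```python
def lexical_overlap(candidate, reference):
    """Jaccard coefficient between the token sets of two sentences."""
    cand = set(candidate.split())
    ref = set(reference.split())
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


# Example 2 from Table 2 (Catalan to Spanish).
reference = "Final del partido en el Vicente Calderón! ATM 0-0 FCB"
smt = "Final del partido en el Vicente Calderón! ATM 0-0 FCB"
dsmt = "Final del encuentro en el Vicente Calderón! ATM 0-0 Barça"

smt_score = lexical_overlap(smt, reference)    # identical to the reference
dsmt_score = lexical_overlap(dsmt, reference)  # 8 shared tokens out of 12

print(f"SMT  {smt_score:.3f}")
print(f"DSMT {dsmt_score:.3f}")
```

Under this simplified measure the DSMT output loses a third of the score only because it uses encuentro and Barça, although both translations are acceptable.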

    We can also observe that the restriction of 140 characters does not have an important
effect on performance. This is because, for this test set, our systems usually produce
tweet translations of legal length (99.00% from Catalan to Spanish and 99.70% from Spanish
to Catalan) and, furthermore, among the tweets exceeding the maximum length, the average
number of extra characters is less than 6. Notice that it is hard to measure the real
length of the tweets since we do not have access to the original messages; instead, we
have the tweets with the URLs and IDs replaced by their corresponding labels. For the
given language pair, our systems mostly respect the original length. This is an expected
behaviour, since the length factor (Pouliquen, Steinberger, and Ignat, 2003) for the
Catalan–Spanish language pair is close to 1.

4   Conclusions
We have described the systems developed for the TweetMT shared task: a standard
sentence-level SMT system based on Moses and a document-level SMT system based on Docent.
We adapted both systems using language models built with tweets. For the document-level
SMT system, we considered as context of a tweet the rest of the messages from the same
user during the same day.
    The automatic evaluation of our systems shows that both systems perform similarly.
However, it must be taken into account that lexical metrics are not context sensitive and
that there is only one reference available. As reported in the literature, we found
problems with the correctness of the messages and with the translation of hashtags, as we
showed with some examples found during the manual evaluation.
    Hashtag translation and normalization of the input are interesting topics for future
work, especially for extending the system to translate informal tweets. We also plan to
implement a pipeline that only takes into account the previous context to simulate an
online scenario and to compare it with the current pipeline. Currently we are enhancing
the models with the introduction of semantic information using word vector embeddings. In
particular, we are customizing the Docent decoder to introduce them at translation time.
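
As an illustration of the document definition used for the context-aware system (tweets of one user on one day, extended over consecutive days until at least 30 tweets are gathered) and of the previous-context-only variant mentioned above, a minimal sketch follows. The function names, the `(user, day, text)` input format, the reading of "consecutive days" as consecutive days on which the user posted, and the handling of leftover tweets are all assumptions; the paper only fixes the threshold of 30.

```python
from collections import defaultdict

MIN_TWEETS = 30  # threshold stated in Section 2.3


def build_documents(tweets, min_tweets=MIN_TWEETS):
    """Group (user, day, text) triples into per-user documents.

    Days are assumed to be sortable (e.g. ISO date strings) and tweets
    are assumed to arrive in chronological order within a day.
    """
    by_user_day = defaultdict(lambda: defaultdict(list))
    for user, day, text in tweets:
        by_user_day[user][day].append(text)

    documents = []
    for user, days in by_user_day.items():
        user_docs, current = [], []
        for day in sorted(days):
            current.extend(days[day])
            if len(current) >= min_tweets:
                user_docs.append(current)
                current = []
        if current:
            # Assumption: leftover tweets join the user's last document,
            # or form a short document if the user has no other one.
            if user_docs:
                user_docs[-1].extend(current)
            else:
                user_docs.append(current)
        documents.extend((user, doc) for doc in user_docs)
    return documents


def online_context(document, index):
    """Context available in an online setting: only the previous tweets."""
    return document[:index]
```

In the offline setting the whole document is passed to the decoder, whereas `online_context` would restrict the context of the tweet at position `index` to what a real-time system could see.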
