CEUR Workshop Proceedings, Vol-1445. EHU at TweetMT: Adapting MT Engines for Formal Tweets. https://ceur-ws.org/Vol-1445/tweetmt-3-alegria.pdf
    EHU at TweetMT: Adapting MT Engines for Formal Tweets
          EHU en TweetMT: Adaptación de sistemas MT a tuits formales

               Iñaki Alegria, Mikel Artetxe, Gorka Labaka, Kepa Sarasola
                               University of the Basque Country
                                   inaki.alegria@ehu.eus

          Resumen: En este trabajo se describe la participación del grupo IXA de la
          UPV/EHU en la tarea sobre Traducción de Tweets en el congreso de la SEPLN
          (TweetMT 2015). Se han adaptado dos sistemas previamente desarrollados para
          la traducción es-eu y eu-es, obteniéndose buenos resultados (mejores que otros
          publicados previamente). Se describe la recopilación de recursos, la adaptación
          de los sistemas y los resultados obtenidos.
          Palabras clave: traducción automática, SMT, RBMT, tuits, social media
          Abstract: This paper describes the participation of the IXA group from the
          UPV/EHU (University of the Basque Country) in the TweetMT shared task at
          the SEPLN-2015 conference. We have adapted existing MT engines for the es-eu
          and eu-es pairs, obtaining good results (better than other experiments reported in
          previous work). Three main aspects are described: resource compilation, engine
          adaptation and results.
          Keywords: machine translation, SMT, RBMT, tweets, social media


1   Introduction

As the organizers of the workshop state on the home page1, "the machine translation of tweets is a complex task that greatly depends on the type of data we work with. The translation process of tweets is very different from that of correct texts posted for instance through a content manager. The texts also vary in terms of structure, where the latter include tweet-specific features such as hashtags, user mentions, and retweets, among others." The translation of tweets can be tackled as a direct translation (tweet-to-tweet) or as an indirect translation (tweet normalization to standard text, text translation and, if needed, tweet generation) (Kaufmann and Kalita, 2010).
   When analyzing the released development corpus we observed that most of the messages were formal tweets, and we therefore decided to tackle the problem following the direct approach, adapting previously developed engines to the structure of these texts. We have adapted three systems:

   • an RBMT (rule-based machine translation) system named Matxin for the es-eu pair (Mayor et al., 2011). It is well known that automatic measures run on a single reference tend to penalise RBMT systems compared to SMT (statistical machine translation) systems, but we wanted to test the results.

   • two state-of-the-art SMT systems, one for the es-eu pair and the other for the eu-es pair (Labaka, 2010).

2   Resource Compilation from Microtexts

Based on the development set provided, preliminary work was carried out to obtain useful resources for adapting the systems:

   • an out-of-vocabulary (OOV) dictionary was obtained for Basque using our Basque morphological analyzer. We observed that the percentage of OOVs was low and that only a few of them were common in the development set. Even though the number of entries is very small, we built a bilingual dictionary with the most frequent OOVs (5 entries).

   • a dictionary of bilingual hashtags was obtained by aligning hashtags from parallel tweets (using a simple program and manual revision). After a manual review of the pairs with more than two occurrences, a dictionary of 60 pairs was generated.

   Tweets from the monolingual corpora of previous shared tasks were compiled in order to enrich the language models:

   • Corpora from the TweetNorm shared task (Alegria et al., 2014): the initial collection of 227,855 Spanish tweets.

   • Corpora from the TweetLID shared task (Zubiaga et al., 2014): 8,562 Spanish tweets and 380 Basque tweets.

   In order to increase the low volume of data for the Basque LM we used a corpus of tweets supplied by the CodeSyntax company2, which runs a tweet-oriented service for Basque called UMAP3. They identify Twitter accounts that use Basque and compile the tweets from these users. Most of these accounts are multilingual, so language identification was a key next step. We used our language identifier LangId (a free identifier based on word and trigram frequencies, developed by the IXA group of the University of the Basque Country and specialized in recognizing Basque and its surrounding languages: Spanish, French and English) and filtered out candidates with a high percentage of OOVs (thus prioritizing formal tweets and adding precision to the results obtained from LangId), compiling a corpus of 454,790 tweets.

3   Adaptation and Tuning of the MT Engines

The RBMT system was adapted to the task manually, and two new models for the SMT systems were trained and tuned.

3.1   RBMT

As mentioned, the Matxin system was used for es-eu translation. Matxin is an open-source Spanish-to-Basque RBMT engine which follows the traditional transfer model. It consists of three main components: 1) analysis of the source sentence into a dependency tree structure; 2) transfer from the source language dependency tree to a target language dependency structure; and 3) generation of the output translation from the target dependency structure.
   Matxin was adapted to the idiosyncratic features of tweets (URLs, hashtags...). For this purpose, the de-formatter module in Matxin (Mayor et al., 2011), which separates the format information (RTF, HTML, etc.) from the text to be translated and sends the plain text to the analysis phase, was enriched with the following functions:

   • URLs are managed as sentence boundaries.

   • Hashtags at the beginning or at the end of the tweet remain untranslated.

   • Hashtags inside the text are passed on for translation. Some will be translated (#Escocia / #Eskozia) while others will remain untranslated (#Hackathon, #Olasdeenergia).

   • IDs receive the same treatment as other named entities.

3.2   Corpora for SMT

The adaptation and tuning of the SMT systems was laborious. First of all, the provided development corpus was divided into 3 subsets: training (2,000 pairs), tuning (1,500) and test (500). Because the alignment of the corpus had been done automatically, we manually reviewed the training part and observed that the error rate was high. After discarding non-parallel tweets, 1,444 pairs remained in the training corpus.
   We used a previously compiled parallel corpus for the translation model. This 7.4 million segment corpus was compiled by the Elhuyar Foundation and the University of the Basque Country. It includes public corpora, private corpora and a corpus built following the web-as-corpus paradigm (San Vicente and Manterola, 2012). In addition, the aforementioned training corpus (1,444 pairs from the development corpus of the shared task) was repeated 100 times for the bilingual model (this is not done for the language model because we use interpolation).
   For the language models, previous models for Spanish and Basque were retrained adding the corpora described in the previous section.
   Table 1 shows the figures for the corpora used.

   1 http://komunitatea.elhuyar.org/tweetmt
   2 http://www.codesyntax.com
   3 http://umap.eu
                                 Sentences    Tokens (es)    Tokens (eu)
    Bilingual         General    7,463,951    118,497,426     94,142,809
                      Tweets         1,444         21,022         18,804
    Monolingual (es)  General   28,823,939    866,383,394          -
                      Tweets       213,141      3,041,837          -
    Monolingual (eu)  General    1,290,501          -         14,894,592
                      Tweets       454,800          -          6,063,226


                       Table 1: Figures from the dataset
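The filtering used to compile the Basque tweet corpus in Section 2 (language identification followed by discarding tweets with a high share of out-of-vocabulary words) can be sketched as below. This is a minimal illustration, not the actual IXA pipeline: the `lexicon` set stands in for the vocabulary covered by the Basque morphological analyzer, and the 0.2 threshold is an assumed value.

```python
import re

def oov_ratio(tweet, lexicon):
    """Share of alphabetic tokens not found in the lexicon.

    URLs, hashtags and user mentions are stripped first, since the
    pipeline handles them separately from running text.
    """
    text = re.sub(r"(https?://\S+|[@#]\w+)", " ", tweet)
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in lexicon) / len(tokens)

def keep_formal(tweets, lexicon, max_oov=0.2):
    """Keep only tweets whose OOV ratio is at most max_oov (assumed threshold)."""
    return [t for t in tweets if oov_ratio(t, lexicon) <= max_oov]
```

A low OOV ratio is used here as a proxy for formal, standard-language tweets; informal spellings inflate the ratio and are filtered out.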
            System                     BLEU-c    BLEU      NIST-c     NIST      TER
            RBMT(es-eu)baseline         0.1395   0.1629     4.6073   5.1930   0.8824
            RBMT(es-eu)enhanced         0.1891   0.2089     5.4024   5.7755   0.7377
            SMT(es-eu)baseline          0.2108   0.2257     6.0361   6.4351   0.8116
            SMT(es-eu)enhanced          0.2401   0.2635     6.2920   6.7714   0.6550
            SMT(eu-es)baseline          0.2348   0.2591     6.2768   6.7493   0.7876
            SMT(eu-es)enhanced          0.2826   0.3109     6.9641   7.4827   0.6153

                             Table 2: Results on the test corpora
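As a reminder of what the BLEU figures in Table 2 measure, the sketch below computes a minimal single-reference, corpus-level BLEU: the geometric mean of modified n-gram precisions multiplied by a brevity penalty. It omits the smoothing and tokenization details of the standard scoring tools, so it is only illustrative of the metric, not a reimplementation of the official evaluation.

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU against a single reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order n
    totals = [0] * max_n    # hypothesis n-gram counts, per order n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
            r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            matches[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:        # unsmoothed BLEU is zero if any precision is zero
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # brevity penalty: penalise hypotheses shorter than the reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

With only one reference per segment, as in this task, legitimate paraphrases (see Table 3) count as errors, which is the penalty on RBMT output discussed in the text.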


3.3   Tuning

The development of the systems was carried out using publicly available state-of-the-art tools: the GIZA++ toolkit, the SRILM toolkit and the Moses decoder. More concretely, we followed the phrase-based approach with standard parameters: a maximum length of 80 tokens per sentence, translation probabilities in both directions with Good-Turing discounting, word-based translation probabilities (lexical model, in both directions), a phrase-length penalty and the target language model. The weights were adjusted using MERT tuning with an n-best list of size 100.
   For the idiosyncratic features of tweets, we analyzed the errors made by the system on the test subset extracted from the development corpus and decided to implement the following pre- and post-processing steps:

   • Tokenization: special treatment of hyphens ('-') preceding the declension suffixes of IDs, hashtags, figures, times...

   • Post-processing: simple rules for fixing recurrent surface errors: double hyphens or colons, special symbols (e.g. '¿' is used in Spanish but not in Basque) and similar issues.

4   Results and Discussion

The systems prepared and tuned on the development corpus were directly used to process the test set. We thus submitted two systems for the es-eu pair (RBMT and SMT) and one system (SMT) for the eu-es pair. Only one other group presented results for these language pairs (3 systems).
   Table 2 shows the results on the test corpus provided. We use the most common measures: BLEU and NIST (Doddington, 2002). Our SMT system was the best for the eu-es pair, and second (very close to the first) for the es-eu pair. As expected, the RBMT system gets lower figures on these metrics (only one reference is supplied), but it is interesting to compare them with previous results.
   We want to underline that the results for the es-eu pair are better than results reported in previous work (Labaka et al., 2007; Labaka et al., 2014). More specifically, the BLEU figures in this task range from 0.1629 (baseline) to 0.2089 (enhanced system) for RBMT, and from 0.2257 (baseline) to 0.2635 (enhanced system) for SMT; in the most recent reference (Labaka et al., 2014) BLEU figures range from 0.0572 to 0.1172 using RBMT and around 0.145 using SMT.
   These results are surprising if we consider tweet texts in general, but note that all the tweets used in the shared task are formal and
    #    System    Text
    1    source    Arranca la segunda mitad GOAZEN! — 0-0 #athlive
         ref.      Hasi da bigarren zatia, aupa!! — 0-0 #athlive
         SMT       Hasi da bigarren zatia GOAZEN! — 0-0 #athlive
         RBMT      Bigarren erdi GOAZEN ateratzen du! — 0-0 #athlive
    2    source    Jaume Matas ingresa en prisión URLURLURL
         ref.      Jaume Matas kartzelan sartu dute URLURLURL
         SMT       Jaume Matas kartzelan sartu dute URLURLURL
         RBMT      Jaume Matas espetxean sartzen da URLURLURL
    3    source    Retenciones de hasta 7 kilómetros en la AP-8 en Irun: URLURLURL
         ref.      7 kilometroko auto-ilarak AP-8an, Irungo ordainlekuan: URLURLURL
         SMT       7 kilometroko auto-ilarak AP-8 Irunen: URLURLURL
         RBMT      7km-taraino AP-8 Irunen erretentzioak: URLURLURL
    4    source    Qué es un OpenSpace? IDIDID 27 de septiembre. URLURLURL
         ref.      Zer da OpenSpace bat? IDIDID irailaren 27an. URLURLURL
         SMT       Zer da OpenSpace? IDIDID 27. URLURLURL
         RBMT      Zer da Openspace bat? Irailaren 27an IDIDID. URLURLURL
    5    source    Markel Olano denuncia que Bildu ha decidido actuar en contra de los
                     intereses de los baserritarras #eajpnv URLURLURL
         ref.      Olanok salatu du Bilduk baserritarren interesen kontra egitea
                     erabaki duela #eajpnv URLURLURL
         SMT       Markel Olano salatu du Bilduk jokatzea erabaki du interesen kontra
                     baserritarren #eajpnv URLURLURL
         RBMT      Markel Olanok salatzen du Bilduk baserritarrasen interesen aurka
                     jardutea erabaki duela #eajpnv URLURLURL
    6    source    Idoia Mendia reducirá la ejecutiva y asignará tareas
                     a cada miembro URLURLURL
         ref.      Idoia Mendiak exekutiba murriztuko du, bakoitzari zeregin bat
                     emanez URLURLURL
         SMT       Idoia Mendiak eta lan egingo du kide bakoitzari URLURLURL
         RBMT      Idoia Mendiak exekutiboa gutxituko du eta lanak esleituko
                     dizkio kide bakoitzari URLURLURL

                Table 3: Example translations from the RBMT and SMT systems


that most of them were designed to be multilingual (and so, perhaps, to be easily translated). Therefore, we could say that the task was easier than usual MT tasks, at least for this language pair.
   The good performance of the RBMT system on the formal tweets was expected, as syntax tends to be simple in the short texts from Twitter.
   Table 3 shows some examples of the results. In sentences #4, #5 and #6 RBMT gets very good translations, but in the preceding sentences the translations from the SMT system are more precise. In the near future we want to check whether combining both techniques can lead to improvements.
   We can draw the following general conclusions:

   • These results cannot be extrapolated to the general task of translating tweets. Translating informal tweets will be much harder.

   • MT can help community managers who run multilingual Twitter accounts. A Twitter-oriented MT post-editing system could be developed and evaluated.

Acknowledgments

This work has been supported by the Spanish MICINN project Tacardi (Grant No. TIN2012-38523-C02-01). The CodeSyntax company and the Elhuyar Foundation have collaborated with us, providing several corpora for the translation and language models. Thanks to Josu Azpillaga (CodeSyntax) and to Iñaki San Vicente, Igor Leturia, Itziar Cortes and Justyna Pietrzak (Elhuyar) for their assistance. We would also like to thank the anonymous referees for their comments and suggestions.

References

Alegria, Iñaki, Nora Aranberri, Pere R. Comas, Víctor Fresno, Pablo Gamallo, Lluís Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2014. TweetNorm_es corpus: an annotated corpus for Spanish microtext normalization. In Proceedings of LREC.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138-145. Morgan Kaufmann Publishers Inc.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Labaka, Gorka. 2010. EusMT: incorporating linguistic information into SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation. PhD thesis, Department of Computer Languages and Systems, University of the Basque Country (UPV/EHU), Donostia, March 29, 2010.

Labaka, Gorka, Cristina España-Bonet, Lluís Màrquez, and Kepa Sarasola. 2014. A hybrid machine translation architecture guided by syntax. Machine Translation, 28(2):91-125.

Labaka, Gorka, Nicolas Stroppa, Andy Way, and Kepa Sarasola. 2007. Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation.

Mayor, Aingeru, Iñaki Alegria, Arantza Díaz de Ilarraza, Gorka Labaka, Mikel Lersundi, and Kepa Sarasola. 2011. Matxin, an open-source rule-based machine translation system for Basque. Machine Translation, 25(1):53-82.

San Vicente, Iñaki and Iker Manterola. 2012. PaCo2: a fully automated tool for gathering parallel corpora from the web. In LREC, pages 1-6.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In Proceedings of the TweetLID workshop at the SEPLN conference. ceur-ws.org/Vol-1228/.