=Paper= {{Paper |id=Vol-1445/tweetmt-5-toral |storemode=property |title=Dublin City University at the TweetMT 2015 Shared Task |pdfUrl=https://ceur-ws.org/Vol-1445/tweetmt-5-toral.pdf |volume=Vol-1445 |dblpUrl=https://dblp.org/rec/conf/sepln/ToralWPQBD15 }} ==Dublin City University at the TweetMT 2015 Shared Task== https://ceur-ws.org/Vol-1445/tweetmt-5-toral.pdf
    Dublin City University at the TweetMT 2015 Shared Task

                   Antonio Toral, Xiaofeng Wu, Tommi Pirinen,
                      Zhengwei Qiu, Ergun Bicici, Jinhua Du
          ADAPT Centre, School of Computing, Dublin City University, Ireland
          {atoral, xwu, tpirinen, zhengwei.qiu2, ebicici, jdu}@computing.dcu.ie

      Abstract: We describe our participation in TweetMT for three language pairs in
      both directions: Spanish from/to Catalan, Basque and Portuguese. We used a range
      of techniques: statistical and rule-based MT, morph segmentation, data selection
      with ParFDA and system combination. As for resources, our focus was on crawling
      vast amounts of tweets to perform monolingual domain adaptation. Our system was
      the best of all systems submitted for five out of the six language directions.
      Keywords: machine translation, tweets, morph segmentation, data selection


1   Introduction and Objectives

While statistical machine translation (SMT) can be considered a mature technology
nowadays, one of its requirements is the availability of considerable amounts of
parallel text for the language pair of interest. Ideally, the parallel text to train
an SMT system should come from the same domain and genre as the text the system is
going to be applied to. Thus, using MT to translate types of text for which no
parallel data is available constitutes a challenge. This is the case for tweets and
social media in general, the target text of the TweetMT shared task.

The main objective of our participation in the TweetMT 2015 shared task was to build
the best MT systems for tweets we could with a clear constraint, i.e. it had to be
done in a very short period and, to a large extent, be limited to available
resources. We have taken part for three language pairs in both directions: Spanish
(ES) from/to Catalan (CA), Basque (EU) and Portuguese (PT).

We decided to focus on making the best possible use of available techniques, tools
and resources. Regarding techniques and tools, we rely on state-of-the-art SMT, morph
segmentation for morphologically rich languages (EU), data selection with ParFDA for
fast development of accurate SMT systems (Biçici, Liu, and Way, 2015) and domain
adaptation (Biçici, 2015), the use of available open-source rule-based systems and,
finally, system combination to take advantage of the strengths of the different
systems we built. As for resources, we crawl vast amounts of tweets to perform
monolingual domain adaptation and complement this with publicly available
general-domain monolingual and parallel corpora.

The rest of the paper is organised as follows. Sections 2 and 3 detail the systems
built and the resources used, respectively. Section 4 presents the evaluation and,
finally, Section 5 outlines conclusions and lines of future work.

2   Architecture and Components of the System

Here we describe the components used in our translation pipeline. First, we
pre-process the datasets (Section 2.1), then we use a set
of MT systems (Section 2.2) that can incorporate additional functionality (Sections
2.3 and 2.4). Finally, we combine MT systems (Section 2.5).

2.1   Data Preprocessing

Before use, all the datasets used in our systems are preprocessed as follows:

  1. Punctuation normalisation, with Moses' (Koehn et al., 2007) script.

  2. Sentence splitting and tokenisation, with Freeling (Padró and Stanilovsky, 2012).

  3. Normalisation (only for tweets). We sort the vocabulary of a tweet corpus by word
     frequency and inspect the words that occur in at least 0.5% of the tweets,
     creating rules to convert informal words to their formal equivalent. This leads
     to just a handful of rules. E.g. in Spanish, "q", occurring in 2.62% of the
     tweets, is converted to its formal equivalent "que".

  4. Truecasing, with a modified version of Moses' script. We added a set of
     start-of-sentence characters commonly used in Spanish: "-", "—", "¿", "“" and "‘".

2.2   MT Systems

We build SMT systems using two paradigms: phrase-based with Moses (Koehn et al., 2007)
and hierarchical with cdec (Dyer et al., 2010). In both cases we use default settings.
We also use off-the-shelf open-source rule-based MT (RBMT) systems, namely Apertium
(Forcada et al., 2011) for ES↔CA, ES↔PT and EU→ES,1 and Matxin (Mayor et al., 2011)
for ES→EU.2

The SMT systems use 5-gram LMs with Kneser-Ney smoothing (Kneser and Ney, 1995) except
for ParFDA Moses SMT systems, which use LMs of order 8 to 10. We build LMs on
individual monolingual corpora (cf. Section 3.2) and interpolate them with SRILM
(Stolcke and others, 2002) to minimise the perplexity on the dev set. The corpora used
to build the LMs for each target language, together with their interpolation weights,
are shown in Table 4. We observe that tweets are given very high weights even if they
are not the biggest corpora in the mixes.

   1 Revisions 60356, 60384, and 60356, respectively.
   2 API at http://ixa2.si.ehu.es/glabaka/Matxin.xml

2.3   Morphological Segmentation

Morphological segmentation is a popular method to deal with SMT for morphologically
differing languages by simply splitting words into sub-word units. The main benefits
of morphological segmentation are to reduce the out-of-vocabulary (OOV) rate and to
increase the percentage of 1 to 1 word alignments between morphosyntactically
different languages; e.g. in our case, by matching inflectional suffixes in EU to
syntactic prepositions in ES, we expect to improve the MT quality for the EU–ES
language pair. Segmentation and de-segmentation are able to create word-forms not
present in the training data by matching a translated stem with a correct suffix.

In our participation, morphological segmentation was only used for EU–ES on the EU
side, since EU's morphology is significantly more complex than that of ES. For the
remaining languages of the shared task, there is no such big difference in
morphological complexity (all of them are closely related as they belong to the same
family) so the expected gains do not outweigh the added complexity of segmentation.

We use unsupervised statistical segmentation as provided by Morfessor 2.0 Baseline
(Virpioja et al., 2013).3 The basic setup for segmentation is the same as in the
Abu-MaTran project submission to the WMT 2015 translation task (Rubino et al., 2015).
However, some minor Twitter-related pre-processing has been added in order to keep
URLs and hashtags intact. The parameters used for Morfessor training are the defaults
of version 2.0.2-alpha and the data for training is the EU side of the ES–EU parallel
training data (cf. Section 3.1).

To gauge the effects of our method as well as the morphological complexity of EU as
compared to ES, we show in Table 1 the OOV rates and vocabulary sizes of the ES and EU
sides of the ES–EU training corpus, and of the EU corpus after morphological
segmentation. Segmentation reduces the vocabulary size (number of types) by a factor
of 6 and the OOV rate by almost a factor of 10.

   3 http://www.cis.hut.fi/projects/morpho/morfessor2.shtml

2.4   ParFDA

ParFDA parallelizes instance selection with an optimized parallel implementation of
FDA5 and significantly reduces the time to deploy accurate SMT systems, especially in
the presence of large training data, and still achieves state-of-the-art SMT
performance (Biçici, Liu, and Way, 2015; Biçici and Yuret, 2015). The detailed
composition of the available corpora, referred to as constrained (C), is provided in
Section 3. For ES, we also included LDC Gigaword corpora (Ângelo Mendonça et al.,
2011). The size of the LM corpora includes both the LDC and the monolingual LM corpora
provided. ParFDA-selected training and LM data yield accurate translation outputs: the
selected LM data reduce the number of OOV tokens by up to 32% and the perplexity by up
to 25%, and allow us to model higher-order dependencies (Table 2).

 Corpora          Tokens        Types        OOV
 ES              30,532,489     296,612     14.5 %
 EU              24,966,862     605,207     25.4 %
 EU morphs       35,293,220     100,990      2.6 %

Table 1: Size of ES–EU training corpus in word tokens (ES and EU sides) and in morph
tokens (EU).

 5-gram                  OOV                             perplexity
 S→T      C train  FDA train  FDA LM  %red     C train  FDA train  FDA LM  %red
 CA–ES     2948      2957      2324    .21       332       336       294    .11
 EU–ES     3021      3046      2443    .19       462       483       546   -.18
 PT–ES     2871      2896      1951    .32       633       623       486    .23
 ES–CA     3338      3345      2890    .13       325       330       338   -.04
 ES–EU     4110      4129      3349    .19       745       761       637a   .15a
 ES–PT     3087      3117      2216    .28       993       941       746    .25

Table 2: LM comparison built from training corpus (C train), ParFDA selected training
data (FDA train), ParFDA selected LM data (FDA LM). %red is reduction proportion.

   a The ES–EU LM was recomputed after the task, removing duplicates, which slightly
     decreases BLEU and increases NIST.

2.5   System Combination

For each language direction we have built up to five systems, as detailed in Sections
2.2 to 2.4: (i) phrase-based and (ii) hierarchical SMT, (iii) phrase-based with morph
segmentation, (iv) phrase-based with ParFDA and (v) RBMT. We hypothesise these systems
to have complementary strengths, and thus we decide to perform system combination. To
that end we use MEMT (Heafield and Lavie, 2010), with default settings, except for the
parameter length, for which we use its default (7) for all directions except for
ES→EU, for which we use 5 according to empirical results on the development set.

3   Resources Employed

3.1   Parallel Corpora

Ideally, we would use data in the same domain and genre as the test set, i.e. tweets.
We have access to parallel tweets provided by the task for ES–CA and ES–EU (4,000
parallel tweets for each language pair; we use 1,000 for dev and the remaining 3,000
for training). For ES–PT we have access to 999 parallel tweets (we use them for dev)
from Brazilator,4 a recent project by DCU and Microsoft to translate tweets from the
2014 soccer World Cup across 24 language directions.

As the availability of parallel tweets for the language pairs of TweetMT 2015 is
rather limited (at most we have 4,000 per language pair), we use additional sources of
parallel data. For ES–CA we use elPeriodico (eP)5 and a selection of contemporary
novels. For ES–EU, translation memories (TMs) provided by the shared task6 and two
corpora from Opus (Tiedemann, 2012):7 Open subtitles 2013 and Tatoeba. Finally, for
ES–PT we use Europarl8 (v7) and two corpora from Opus: news-commentary and Tatoeba.
Table 3 provides details on these corpora.

   4 http://www.cngl.ie/brazilator
   5 http://catalog.elra.info/product_info.php?products_id=1122
   6 http://komunitatea.elhuyar.org/tweetmt/resources/
   7 http://opus.lingfil.uu.se/
   8 http://www.statmt.org/europarl/

3.2   Monolingual Corpora

Our main source of monolingual data is in-domain and comes from crawled tweets. We use
TweetCat (Ljubešić, Fišer, and Erjavec, 2014) and crawl tweets for all the target
languages (CA, ES, EU and PT) during March and April 2015.

For each language we create two lists of words as required by the crawler: (i) most
common discriminating words (up to 100), these are words that are unique to the
language and they are used to seed the crawler so that it can find candidate tweets;
and (ii) most common words of the language (200), these are used to determine the
language of
crawled tweets. These two lists are derived from a list of the most common words found
in a corpus of subtitles.9

The tweets crawled are post-processed with langid10 to identify their language. We
keep the tweets whose langid confidence score is above a certain threshold, which is
set empirically at 0.7 by inspecting tweets.

In addition to crawled tweets, we use the target sides of the parallel corpora (cf.
Section 3.1) and a set of monolingual corpora as follows. For CA we use caWaC
(Ljubešić and Toral, 2014), a corpus crawled from the .cat top level domain. For ES,
news crawl and news-commentary from WMT'13.11 For EU, a dump from Wikipedia
(20150407). For PT, the news sources CETEMPublico12 and CETENFolha,13 and a dump from
Wikipedia (20150510).

Table 4 shows details on these corpora including their interpolation weights (cf.
Section 2.2).

    9 https://onedrive.live.com/?cid=3732e80b128d016f&id=3732E80B128D016F!3584
   10 https://github.com/saffsd/langid.py
   11 http://www.statmt.org/wmt13/
   12 http://www.linguateca.pt/cetempublico/
   13 http://www.linguateca.pt/cetenfolha/

 Pair     Corpus          # s.     # tokens
 ES–CA    tweets          3K       48k, 48k
          eP              0.6M     13.5M, 14M
          novels          47K      .78M, .86M
 ES–EU    tweets          3K       42K, 38K
          TMs             1.1M     28.9M, 23.5M
          OpenSubs        0.16M    1.2M, 1.0M
          Tatoeba         902      6.7K, 5.5K
 ES–PT    EU (Europarl)   1.9M     54M, 53M
          NC              9K       .26M, .25M
          Tatoeba         53K      .42M, .41M

Table 3: Parallel corpora used for training. For each corpus we provide its number of
sentence pairs (# s.) and tokens on both sides (# tokens).

 Lang     Corpus        # tokens    Weights
 CA       tweets        29M         0.60
          caWaC         0.5G        0.33
          eP            14M         0.07
 ES       tweets        129.2M      0.75
          news          0.4G        0.21
          europarl      60M         0.04
 EU       tweets        11.3M       0.97
          Wikipedia     11.5M       0.01
          TMs           23M         0.02
 PT       tweets        33M         0.93
          Wikipedia     166M        0.02
          Others        286M        0.05

Table 4: Monolingual corpora used for training. For each corpus we show its number of
tokens (# tokens) and its weight in LM interpolation.

4   Evaluation

We report our results on the development set (all systems built) and then on the test
set (systems submitted).

4.1   Evaluation on Development Data

Table 5 presents the results obtained on the devset by the individual systems and a
set of combinations for the three language pairs we covered: ES–CA, ES–EU and ES–PT.
The scores were obtained on raw MT output (i.e. tokenised and truecased) as calculated
by us with BLEU (Papineni et al., 2002) (multibleu cased as included in Moses version
3) and TER (Snover et al., 2006) (as implemented in TERp version 0.1). Due to time
constraints not all the possible combinations were tried. The scores of the best
individual system and combination are shown in bold.

At least one of the combinations obtains better scores (both in terms of BLEU and TER)
than the best individual system (except for ES↔PT with BLEU and for CA→ES with TER),
supporting our hypothesis that the individual systems built are complementary.
Although SMT systems outperform RBMT systems for all directions,14 the addition of
RBMT in system combinations has a positive impact (except for ES↔PT). Phrase-based SMT
outperforms hierarchical SMT for related language pairs (ES–CA and ES–PT), but the
opposite is true for the unrelated language pair ES–EU. We hypothesise this is due to
the fact that ES and EU follow different word orders (SVO and SOV, respectively), and
this leads to pervasive long reorderings in translation, which are better modelled
with a hierarchical approach.

   14 When interpreting the results, it should be taken into account that automatic
      metrics are known to be biased towards statistical MT approaches
      (Callison-Burch, Osborne, and Koehn, 2006).
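To make concrete what the BLEU figures discussed in Section 4.1 measure, the following
is a minimal sketch of corpus-level BLEU in Python. It is not the multi-bleu script
from Moses that produced the reported scores: the function names ngrams and
corpus_bleu are illustrative, and the sketch assumes a single reference per segment,
whitespace tokenisation and no smoothing. Scores are on the 0-100 scale used for the
dev set results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment, on a 0-100 scale.

    Simplified illustration (assumption): whitespace tokenisation, no smoothing.
    """
    matches = [0] * max_n   # clipped n-gram matches, one slot per order
    totals = [0] * max_n    # hypothesis n-gram counts, one slot per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, an order with no matches gives BLEU = 0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

For instance, corpus_bleu(["a b c d e"], ["a b c d f"]) gives about 66.87: the n-gram
precisions are 4/5, 3/4, 2/3 and 1/2, and the brevity penalty is 1 because both
segments have the same length.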
 Direction   System           BLEU     TER
 ES→CA       Moses (1)        82.21    0.1102
             cdec (2)         81.45    0.1128
             ParFDA (3)       82.37    0.1062
             Apertium (4)     78.17    0.1310
             1+2              81.71    0.1102
             1+4              82.37    0.1057
             1+2+4            81.93    0.1085
 CA→ES       Moses (1)        82.52    0.1086
             cdec (2)         81.76    0.1118
             ParFDA (3)       82.16    0.1063
             Apertium (4)     77.96    0.1329
             1+2              82.38    0.1088
             1+4              82.58    0.1077
             1+2+4            82.38    0.1083
             1+3+4            82.45    0.1074
 ES→EU       Moses (1)        22.57    0.6116
             cdec (2)         23.7     0.5863
             ParFDA (3)       21.59    0.6181
             Matxin (4)       12.66    0.7436
             Morph (5)         5.20    0.8812
             1+2              23.18    0.5796
             1+4              18.36    0.6112
             1+2+4            23.58    0.5771
             1+2+4+5          24.07    0.5741
             1+2+3+4+5        24.42    0.5777
 EU→ES       Moses (1)        24.21    0.6228
             cdec (2)         24.65    0.5911
             ParFDA (3)       22.25    0.6346
             Apertium (4)     18.36    0.6918
             Morph (5)        11.25    0.9655
             1+2              24.18    0.5883
             1+4              24.33    0.6076
             1+2+4            24.94    0.5831
             1+2+4+5          25.21    0.5792
 ES→PT       Moses (1)        29.21    0.6052
             cdec (2)         28.14    0.5962
             ParFDA (3)       27.74    0.6164
             Apertium (4)     24.96    0.6272
             1+2              28.76    0.5891
             1+4              26.58    0.6082
             1+2+4            27.00    0.5878
 PT→ES       Moses (1)        30.47    0.5267
             cdec (2)         29.42    0.5254
             ParFDA (3)       29.63    0.5338
             Apertium (4)     27.52    0.5335
             1+2              29.9     0.5230
             1+4              30.01    0.5131
             1+2+4            29.89    0.5089

Table 5: Results on the dev set.

4.2   Evaluation on Test Data

Table 6 presents the results on the test set of the systems we submitted. The scores
shown are the ones reported by the organisers (case-insensitive BLEU and TER) on
post-processed MT outputs (detokenised and detruecased). For each language direction
we submitted the three systems that obtained the best performance on the dev set. The
scores of the best submitted system are shown in bold.

 Direction   System              BLEU       TER
 ES→CA       DCU1 (1+4)          0.7669     0.1740
             DCU2 (1)            0.7899†    0.1626†
             DCU3 (1+2+4)        0.7630     0.1738
 CA→ES       DCU1 (1+4)          0.7826     0.1506
             DCU2 (1+2+4)        0.7816     0.1500
             DCU3 (1+3+4)        0.7943†    0.1431†
 ES→EU       DCU1 (1+2+4)        0.2455     0.6533
             DCU2 (1+2+3+4+5)    0.2636†    0.6469†
             DCU3 (1+2+4+5)      0.2493     0.6553
 EU→ES       DCU1 (2)            0.2687     0.6512
             DCU2 (1+2+4)        0.2698     0.6406
             DCU3 (1+2+4+5)      0.2728     0.6363
 ES→PT       DCU1 (1)            0.3595     0.5290
             DCU2 (1+2)          0.3711†    0.5157†
             DCU3 (1+2+4)        0.3687     0.5163
 PT→ES       DCU1 (1)            0.4465     0.5767
             DCU2 (1+2)          0.4467     0.5627
             DCU3 (1+2+4)        0.4524†    0.5403†

Table 6: Results on the test set.

Out of six directions, our best submission is the top performing system for five of
them (indicated with †). For most directions, the addition of a RBMT system leads to
better performance. Similarly, for the directions where we have used segmentation
(ES↔EU) and ParFDA (CA→ES and ES→EU), the addition of systems based on these
techniques had a positive impact on the results.

We now delve deeper into the results obtained by SMT systems based on ParFDA (cf.
Section 2.4). Although ParFDA systems were submitted to the shared task only as part
of system combinations, we have evaluated a posteriori the performance of this
technique by means of standalone systems on the test set. ParFDA Moses SMT system
obtains top results in CA→ES and ES→CA and close to top results in other language
pairs with 1.21 BLEU points average difference to the top (Table 7). An interesting
feature of
ParFDA regards its ability to build and deploy SMT systems in a quick manner. In the
specific case of TweetMT, ParFDA took about 8 hours to build for ES→CA and 28 hours
for PT→ES, taking about 11 GB and 27 GB disk space in total, respectively.

 TweetMT     CA–ES      EU–ES      PT–ES
 ParFDA      .8012      .2713      .4374
 Top         .7942      .3109      .4519
 diff        -.0070     .0396      .0145
 LM order    8          8          8

             ES–CA      ES–EU      ES–PT
 ParFDA      .7926      .2482      .3589
 Top         .7907      .2636      .3711
 diff        -.0019     .0154      .0122
 LM order    8          10         8

Table 7: BLEU results for ParFDA standalone systems on the test set, their difference
to the top, and ParFDA LM order used. ParFDA obtains top results in CA→ES and ES→CA
and 1.21 BLEU points average difference.

5   Conclusions and Future Work

This paper has described our participation in the TweetMT 2015 shared task. Our focus
has been on rapid development of MT systems adapted to tweets by making the best
possible use of available techniques, tools and resources. Our best submissions have
been the ones that combine different MT systems (except for ES→CA), supporting our
hypothesis that the techniques we have used are complementary.

As for future work, we consider several possible avenues. First, we would like to
analyse in detail the translations produced by our systems in order to derive findings
beyond the ones we can extract from the automatic evaluation metrics used in the task.
Second, most of the tweets in the test set use formal language,15 and thus we would
like to test our systems on a more representative set of tweets where informal
language would be expected to be more pervasive.

   15 This is due to the fact that they are extracted from Twitter accounts that
      publish tweets in multiple languages, and such accounts belong, to a large
      extent, to institutions that use formal language.

Acknowledgments

This research is supported by the EU 7th Framework Programme FP7/2007-2013 under grant
agreement PIAP-GA-2012-324414 (Abu-MaTran), by SFI as part of the ADAPT research
center (07/CE/I1142) at Dublin City University and the project "Monolingual and
Bilingual Text Quality Judgments with Translation Performance Prediction"
(13/TIDA/I2740). We also thank the SFI/HEA Irish Centre for High-End Computing (ICHEC)
for the provision of computational facilities and support. Finally, we would like to
thank Mikel L. Forcada and Iacer Calixto for their advice on normalising tweets for
Basque and Portuguese, respectively, and Gorka Labaka for his help with Matxin's API.

References

Ângelo Mendonça, Daniel Jaquette, David Graff, and Denise DiPersio. 2011. Spanish
  Gigaword third edition, Linguistic Data Consortium.

Biçici, Ergun. 2015. Domain adaptation for machine translation with instance
  selection. The Prague Bulletin of Mathematical Linguistics, 103:5–20.

Biçici, Ergun, Qun Liu, and Andy Way. 2015. ParFDA for fast deployment of accurate
  statistical machine translation systems, benchmarks, and statistics. In Proceedings
  of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon,
  Portugal, September. Association for Computational Linguistics.

Biçici, Ergun and Deniz Yuret. 2015. Optimizing instance selection for statistical
  machine translation with feature decay algorithms. IEEE/ACM Transactions On Audio,
  Speech, and Language Processing (TASLP), 23:339–350.

Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role
  of BLEU in machine translation research. In 11th Conference of the European Chapter
  of the Association for Computational Linguistics, pages 249–256.

Dyer, Chris, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil
  Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A
  decoder, alignment, and learning framework for finite-state and context-free
  translation models. In Proceedings of the
  Association for Computational Linguistics (ACL).

Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O’Regan, Sergio
  Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe
  Sánchez-Martínez, and Francis M. Tyers. 2011. Apertium: a free/open-source
  platform for rule-based machine translation. Machine Translation,
  25(2):127–144. Special Issue: Free/Open-Source Machine Translation.

Heafield, Kenneth and Alon Lavie. 2010. Combining machine translation output
  with open source: The Carnegie Mellon multi-engine machine translation
  scheme. The Prague Bulletin of Mathematical Linguistics, 93:27–36.

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram
  language modeling. In Acoustics, Speech, and Signal Processing, 1995.
  ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello
  Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard
  Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007.
  Moses: Open source toolkit for statistical machine translation. In
  Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and
  Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA.
  Association for Computational Linguistics.

Ljubešić, Nikola, Darja Fišer, and Tomaž Erjavec. 2014. TweetCaT: a Tool for
  Building Twitter Corpora of Smaller Languages. In Proceedings of the Ninth
  International Conference on Language Resources and Evaluation (LREC’14),
  Reykjavik, Iceland.

Ljubešić, Nikola and Antonio Toral. 2014. caWaC - a web corpus of Catalan and
  its application to language modeling and machine translation. In Proceedings
  of the Ninth International Conference on Language Resources and Evaluation
  (LREC’14), Reykjavik, Iceland, May.

Mayor, Aingeru, Iñaki Alegria, Arantza Díaz de Ilarraza Sánchez, Gorka Labaka,
  Mikel Lersundi, and Kepa Sarasola. 2011. Matxin, an open-source rule-based
  machine translation system for Basque. Machine Translation, 25(1):53–82.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider
  multilinguality. In Proceedings of the Language Resources and Evaluation
  Conference (LREC 2012), Istanbul, Turkey. ELRA.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a
  method for automatic evaluation of machine translation. In Proceedings of
  the 40th Annual Meeting on Association for Computational Linguistics, pages
  311–318.

Rubino, Raphael, Tommi Pirinen, Miquel Esplà-Gomis, Nikola Ljubešić, Sergio
  Ortiz-Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, and Antonio Toral.
  2015. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation
  and Web Crawling. In Proceedings of the Tenth Workshop on Statistical
  Machine Translation.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John
  Makhoul. 2006. A study of translation edit rate with targeted human
  annotation. In Proceedings of Association for Machine Translation in the
  Americas, pages 223–231.

Stolcke, Andreas et al. 2002. SRILM - an extensible language modeling toolkit.
  In INTERSPEECH.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In
  Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck,
  Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios
  Piperidis, editors, Proceedings of the Eighth International Conference on
  Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May. European
  Language Resources Association (ELRA).

Virpioja, Sami, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013.
  Morfessor 2.0: Python implementation and extensions for Morfessor Baseline.