=Paper=
{{Paper
|id=Vol-2693/paper4
|storemode=property
|title=Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks
|pdfUrl=https://ceur-ws.org/Vol-2693/paper4.pdf
|volume=Vol-2693
|authors=Maria Khvalchik,Mikhail Galkin
|dblpUrl=https://dblp.org/rec/conf/ecai/Khvalchik020
}}
==Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks==
Proceedings of the Workshop on Hybrid Intelligence for Natural Language Processing Tasks HI4NLP (co-located at ECAI-2020), Santiago de Compostela, August 29, 2020, published at http://ceur-ws.org

Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Maria Khvalchik (Semantic Web Company, Austria, maria.khvalchik@semantic-web.com) and Mikhail Galkin (TU Dresden & Fraunhofer IAIS, Germany, mikhail.galkin@tu-dresden.de)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated against curated Spanish SQuAD datasets on both the user and the system level. Further experimental results on the XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.

1 INTRODUCTION

Numerous research studies demonstrate how important data quality is to the outcomes of neural networks and how severely they are affected by low quality data [13].

Recently, transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing. Models like T5 [11] are now showing human-level performance on most of the well-established benchmarks available for natural language understanding.

Yet, language understanding is not solved, even in well-studied languages like English and even when tremendous resources are used. This paper focuses on a less resourced language, Spanish, and pursues two goals.

First, we seek to demonstrate that data quality is an important component for training neural networks and overall increases natural language understanding capabilities; hence, the quality of data should be considered carefully. We consider two recent Neural Machine Translated (NMT) SQuAD datasets of different quality, discussed in more detail in Section 4, for machine reading comprehension (MRC) in Spanish.

Second, after providing evidence of the data quality difference, we fine-tune pre-trained multilingual and monolingual Spanish BERT models on both datasets and observe a significant performance gap in terms of Exact Match (EM) and F1 scores on the dev sets of the SQuAD family. However, unexpectedly, the results become more comparable when testing on external benchmarks recently proposed for cross-lingual extractive QA.

Hence, we tackle the effects of both data quality and neural network characteristics, aiming to demonstrate that both are major factors in the outcomes and should be given equal attention in the primary design.

2 RELATED WORK

Historically, most NLP tasks, datasets, and benchmarks were created in English, e.g., the Penn Treebank [10], SQuAD [12], GLUE [14]. Therefore, most of the large-scale pre-trained models were trained in the English-only mode, e.g., BERT [6] employed Wikipedia and the BookCorpus [17] as training datasets. Later on, the NLP community sought to increase language diversity, and multilingual models started to appear, such as mBERT or XLM [4].

However, sufficiently large and diverse pre-training corpora of high quality often do not exist. Several methods have been developed to bridge this gap, e.g., applying machine translation frameworks to English corpora, or performing cross-lingual transfer learning [16]. Multilingual language models and pre-trained non-English language models are definitely in the focus of the NLP community. Still, the language understanding capabilities (hence, the performance) of language models largely depend on the data collection and cleaning steps. In the MRC dimension, for instance, Italian SQuAD [5] is obtained via direct translation from the English version, whereas French FQuAD [7] and Russian SberQuAD [8] have been created from the language-specific part of Wikipedia, which is often much smaller than the original SQuAD source.

With the surge of language-specific pre-trained LMs, several benchmarks have been developed that aim at evaluating the multi- and cross-lingual characteristics of such LMs. Specifically, for the machine reading comprehension and question answering (QA) task there exist XQuAD [1] and MLQA [9]. In this work, we study how LMs perform in QA tasks in Spanish when fine-tuned on datasets of possibly different quality, i.e., directly machine translated and curated with a human-in-the-loop strategy.
3 PROBLEM FORMULATION

In this work, we aim at exploring the impact of machine translated corpora quality on downstream MRC tasks, which is of high importance for less resourced languages. We consider Spanish as the target language: it is one of the most spoken languages in the world that nevertheless has a relatively small amount of available corpora for pre-training modern LMs for language understanding tasks. Taking this into account, we tackle the following research questions:

• RQ1: Is there a quantifiable difference in data quality between machine translated and manually post-processed corpora for Spanish SQuAD datasets? In order to answer this question, we conduct a user study in Section 4.
• RQ2: Can we expect a performance difference of LMs in MRC QA tasks when fine-tuned on datasets of different quality? We perform an experimental study in Section 5.1.
• RQ3: Is there a performance difference in downstream transfer-learning QA tasks, i.e., on external benchmarks the LMs were not fine-tuned on? Experimental results are shown in Section 5.2.

4 USER STUDY

4.1 Data Sources

For the user study we employ the two following recent MRC Spanish translations:

TAR: prepared following the Translate-Align-Retrieve methodology, which implies a lot of post-processing to improve the translation quality [2]. TAR SQuAD is produced from the original English SQuAD corpus and contains both 1.1 and 2.0 versions. Further, each version contains datasets of two sizes, i.e., regular (or default) and small (half the size of the regular), the latter being less noisy and more refined.

MT: SQuAD 1.1 and 2.0 versions translated by a private European NMT company (the dataset will be published on GitHub upon acceptance).
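Both corpora ultimately rely on translating an English context and answer and then recovering the answer span inside the translated context. The snippet below is only a toy illustration of that span-recovery idea, not the TAR pipeline of [2], which additionally uses alignment and extensive rule-based post-processing; the example strings, function name, and fuzzy-matching fallback are illustrative assumptions.

<pre>
from difflib import SequenceMatcher

def locate_answer(translated_context: str, translated_answer: str) -> int:
    """Return the start offset of the translated answer inside the translated
    context: exact substring match first, otherwise the best fuzzy window."""
    pos = translated_context.find(translated_answer)
    if pos >= 0:
        return pos
    # Fuzzy fallback: slide a window of the answer's length over the context.
    best_pos, best_ratio, n = 0, 0.0, len(translated_answer)
    for i in range(max(1, len(translated_context) - n + 1)):
        ratio = SequenceMatcher(None, translated_context[i:i + n],
                                translated_answer).ratio()
        if ratio > best_ratio:
            best_pos, best_ratio = i, ratio
    return best_pos

# Toy usage with a pre-translated pair (in practice an NMT system produces both).
context_es = "El Super Bowl 50 fue un partido de fútbol americano."
answer_es = "partido de fútbol americano"
print(locate_answer(context_es, answer_es))  # character offset 24
</pre>

The fuzzy fallback matters precisely because NMT output rarely reproduces the answer string verbatim, which is one source of the noise studied below.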
4.2 Translation Evaluation

To estimate the quality of translation, 50 parallel examples from the translated SQuAD 1.1 datasets were selected randomly. Twelve Spanish-speaking evaluators were asked to grade 25 parallel examples each on the following scale:

• 2, if understandable with only minor mistakes;
• 1, if understandable with a few major mistakes;
• 0, if not understandable, with more than a few major mistakes.

Table 1 shows the average translation evaluation score, from which we conclude that the TAR translation is significantly better than the MT translation.

Table 1. Translation evaluation average.
Translation by | Average
TAR | 1.717
MT | 1.320

To inspect further, Figure 3 provides the histogram of score frequencies: TAR receives the best score in about 80% of the ratings, while MT does so in only around 50%.

Figure 3. Score frequencies for TAR and MT translations (ratings with score 2/1/0: TAR 240/35/25, MT 151/94/55).

Furthermore, to evaluate the agreement among raters, we aggregate the evaluations in the heat map representations in Figure 1 and Figure 2. We can observe that most of the evaluators not only favoured the TAR translations but also agreed well on their scores, whereas the MT scores appear to be more contrasting. This could possibly mean that the MT errors were so diverse that evaluators found the provided scale to some degree misleading; hence, it was difficult to strictly represent the difference between minor and major errors.

Figure 1. TAR translation evaluation heat map on 50 parallel SQuAD examples scored by 12 evaluators.
Figure 2. MT translation evaluation heat map on 50 parallel SQuAD examples scored by 12 evaluators.

Translation errors which could significantly affect the results have been collected, and some examples are shown in Table 4. The error types are the following: wrong gender inference, inaccurate translation or capitalization of named entities, and misplacement of adjectives with respect to the noun. Here, we would like to point out the following errors in the MT translation:

• the named entity "Warner Brothers" is rendered both in a literal Spanish translation and in its original form: "... y hermanos Warner. Universal, Warner Brothers ...";
• "Universal Pictures" is translated by changing it into the plural form and dropping the noun: "universales";
• "the US War Department" is translated as "al Departamento de Guerra DE NOSOTROS", an impressive rendering of the capitalized abbreviation of the United States.

Therefore, we can positively answer RQ1, as there indeed exists a substantial difference in corpora quality when applying additional post-processing over directly machine translated data.
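The aggregation behind Table 1 and Figure 3 is a plain average and frequency count over the 0/1/2 ratings. A minimal sketch of that step is given below; the rating lists are placeholders, not the actual study data.

<pre>
from collections import Counter
from statistics import mean

# Hypothetical per-system ratings: in the study, 12 evaluators scored
# 25 parallel examples each, yielding one flat list of 0/1/2 scores per system.
ratings = {
    "TAR": [2, 2, 1, 2, 2, 0, 2, 1, 2, 2],  # placeholder values
    "MT":  [1, 0, 2, 1, 1, 0, 2, 1, 0, 2],  # placeholder values
}

for system, scores in ratings.items():
    freq = Counter(scores)  # score frequencies, as visualized in Figure 3
    print(f"{system}: average={mean(scores):.3f}, "
          f"frequencies={dict(sorted(freq.items()))}")
</pre>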
5 EXPERIMENTAL STUDY

In the experimental study we evaluate the performance of Spanish LMs on machine reading comprehension tasks, fine-tuning them on language corpora obtained via machine translation and via translation with further rule-based post-processing.

Datasets. For fine-tuning the pre-trained LMs in Spanish we leverage the different versions of es-SQuAD, i.e., TAR and MT, described in Section 4. The Small and Default versions of es-SQuAD (TAR) are annotated as Sm and Def, respectively. For benchmarking we employ the dev sets of the es-SQuAD datasets as well as the test sets of MLQA [9] and XQuAD [1] where both context and question are in Spanish.

Models. We choose the pre-trained mBERT-base-cased for fine-tuning on the es-SQuAD datasets. For a broader comparison we also employ the Spanish-only LMs BETO [3] and its distilled version DistilledBETO, both already pre-trained and fine-tuned on SQuAD 2.0.

Fine-Tuning Setup. The models are trained and evaluated in the cased mode using the HuggingFace Transformers [15] framework. When fine-tuning, the default hyperparameters are used: three epochs of the Adam optimizer with an initial learning rate of 0.00005. The experiments are conducted on an Ubuntu 16.04 server equipped with one GTX 1080 Ti GPU and 256 GB RAM.
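For orientation, the sketch below shows how such a setup can be expressed with the Transformers library. It is not the authors' exact training script; only the hyperparameters reported above (cased mBERT, three epochs, initial learning rate 5e-5) come from the paper, and the output directory name is a placeholder.

<pre>
# A minimal sketch of the reported fine-tuning setup, not the authors' script.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          TrainingArguments)

model_name = "bert-base-multilingual-cased"   # cased mBERT, as used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="mbert-es-squad",   # hypothetical output directory
    num_train_epochs=3,            # three epochs, as reported above
    learning_rate=5e-5,            # initial Adam learning rate, as reported above
)

# Feature preparation (tokenizing question/context pairs and mapping answer
# spans to token positions) follows the standard SQuAD recipe of the
# Transformers examples and is omitted here; a
# Trainer(model=model, args=args, train_dataset=...) call then runs fine-tuning.
</pre>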
Metrics. We measure EM and F1 scores in each experiment as reported by the task-specific evaluation scripts. For consistency reasons, we do not evaluate models fine-tuned on SQuAD 1.1 datasets against SQuAD 2.0 versions.
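For reference, the EM and F1 metrics produced by SQuAD-style evaluation scripts reduce to answer-string normalization plus token overlap. The sketch below handles the single-gold-answer case only; the official scripts additionally take the maximum over multiple gold answers and, for SQuAD 2.0, handle unanswerable questions.

<pre>
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, squeeze whitespace
    (SQuAD-style normalization; Spanish articles would need to be added for a
    fully language-specific variant)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("el Departamento de Guerra", "Departamento de Guerra"))      # 0
print(round(f1("el Departamento de Guerra", "Departamento de Guerra"), 2))     # 0.86
</pre>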
5.1 SQuAD Performance

In the first experiment, we fine-tune mBERT on the TAR and MT datasets and evaluate the accuracy on the dev sets of the respective tasks. That is, in order to study the impact of the fine-tuning dataset, we optimize the model on TAR but evaluate on the MT dev set, and vice versa. The empirical results are shown in Table 2.

Table 2. Performance on es-SQuAD (EM / F1). All models are in the case-sensitive mode.
Model | TAR 1.1 Small | TAR 1.1 Default | TAR 2.0 Small | TAR 2.0 Default | MT 1.1 Default | MT 2.0 Default
mBERT (1.1 Sm TAR) | 57.45 / 73.34 | 55.04 / 72.38 | - | - | 56.24 / 71.12 | -
mBERT (1.1 Def TAR) | 56.30 / 73.71 | 59.65 / 76.32 | - | - | 54.80 / 71.67 | -
mBERT (2.0 Sm TAR) | 57.36 / 73.52 | 55.20 / 72.56 | 59.85 / 66.30 | 60.18 / 66.94 | 55.36 / 70.64 | 46.17 / 57.87
mBERT (2.0 Def TAR) | 56.24 / 73.11 | 59.05 / 75.48 | 59.76 / 67.17 | 62.08 / 68.90 | 54.47 / 71.06 | 46.97 / 59.47
mBERT (1.1 MT) | 54.25 / 71.42 | 53.90 / 71.90 | - | - | 61.20 / 64.19 | -
mBERT (2.0 MT) | 52.92 / 70.51 | 53.03 / 71.25 | 29.02 / 38.60 | 26.32 / 35.80 | 61.27 / 74.15 | 61.46 / 74.72
BETO | 56.72 / 74.38 | 59.71 / 76.92 | 58.55 / 67.16 | 60.01 / 67.97 | 55.93 / 72.56 | 49.49 / 62.88
DistilledBETO | 54.11 / 72.41 | 57.32 / 74.94 | 57.26 / 66.28 | 58.58 / 66.75 | 52.11 / 69.86 | 48.28 / 63.58

First, we observe that LMs fine-tuned on MT SQuAD considerably outperform other models in terms of both EM and F1 only on the MT dev set, while being significantly inferior to all other models on the TAR dev set. For instance, mBERT fine-tuned on the MT version of SQuAD 2.0 is about 12 EM and F1 points better than BETO on the MT version of SQuAD 2.0 and at the same time about 32 EM and F1 points worse than BETO on the default TAR version of SQuAD 2.0.

Similarly, mBERT trained on the MT version of SQuAD 1.1 achieves a very good EM score on the MT dev set of SQuAD 1.1 but performs poorly on the TAR versions. Considering the difference in dataset quality demonstrated in Section 4, we deem such behavior a sign of LM sensitivity to artificially created corpora with numerous syntactic and semantic mistakes.

Moreover, the TAR-trained models show more consistent scores across the given tasks, thus supporting RQ2, i.e., LMs tend to be more robust when trained and evaluated on well-prepared language corpora. Overall, in this experiment we find that LMs trained on Spanish-only corpora (e.g., BETO) perform on par with or slightly better than massive multilingual LMs like mBERT fine-tuned on a similar task in a language-specific setting.

5.2 MLQA and XQuAD Performance

In the second experiment, we probe the TAR and MT fine-tuned models against MLQA and XQuAD in the Spanish-context, Spanish-question setting. The results are presented in Table 3.

Table 3. Performance on MLQA and XQuAD. Cased models.
Model | MLQA EM | MLQA F1 | XQuAD EM | XQuAD F1
mBERT (1.1 Sm TAR) | 42.74 | 64.36 | 53.61 | 72.89
mBERT (1.1 Def TAR) | 43.14 | 66.44 | 54.62 | 75.30
mBERT (2.0 Sm TAR) | 43.31 | 65.04 | 54.45 | 74.09
mBERT (2.0 Def TAR) | 43.44 | 66.09 | 55.97 | 76.82
mBERT (1.1 MT) | 44.43 | 64.83 | 57.14 | 75.46
mBERT (2.0 MT) | 44.13 | 64.43 | 54.03 | 73.17
BETO | 45.12 | 68.77 | 56.97 | 78.15
DistilledBETO | 42.41 | 66.06 | 55.46 | 75.84

A clear winner is BETO, which is pre-trained on Spanish-only corpora and outperforms the nearest contenders by about 2 F1 points. We then observe that TAR-trained models perform consistently better than MT-trained models in terms of F1 scores. Interestingly, in terms of EM scores mBERT 1.1 MT yields better performance than the TAR-trained models and even language-specific models like BETO. Such a phenomenon can be explained by the robustness of large-scale multilingual LMs, which might tend to generalize better over translation artifacts. We leave further research of this phenomenon to future work.

Overall, discussing RQ3, we hypothesize that for downstream language-specific tasks, LMs pre-trained in that specific language are preferable. In case such a large-scale pre-training corpus is not available, well-processed machine translated sources tend to produce more robust LMs compared to purely machine translated sources.
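To make the evaluation setting concrete, a fine-tuned checkpoint can be queried on a Spanish context-question pair through the standard question-answering pipeline; the model identifier below is a placeholder for any of the checkpoints behind Tables 2 and 3, not a published model name.

<pre>
from transformers import pipeline

# "mbert-es-squad" is a hypothetical local checkpoint directory produced by
# fine-tuning; substitute the path of any model from Table 2 or Table 3.
qa = pipeline("question-answering", model="mbert-es-squad", tokenizer="mbert-es-squad")

result = qa(
    question="¿Qué grupo encabezó el espectáculo de medio tiempo del Super Bowl 50?",
    context="El espectáculo de medio tiempo del Super Bowl 50 fue encabezado "
            "por el grupo de rock británico Coldplay.",
)
print(result["answer"], result["score"])  # expected span: "Coldplay"
</pre>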
6 CONCLUSION AND FUTURE WORK

In this work, we studied the impact of machine translated corpora quality on question answering tasks. Having formulated three research questions, we employed Spanish SQuAD-style datasets for empirical evaluation. The user study confirmed that there is a significant difference in dataset quality and in the amount of language artifacts. Further experimental studies confirmed that LMs are sensitive to the quality of machine translated corpora. We also observe signs of LM robustness to translation defects in downstream transfer learning tasks.

For future work we pose the question of conducting an appropriate analysis of how neural networks overcome flaws in the data, machine translated or not, to become robust and noise resilient.

REFERENCES

[1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama, 'On the cross-lingual transferability of monolingual representations', CoRR, (2019).
[2] Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa, 'Automatic Spanish translation of the SQuAD dataset for multilingual question answering', CoRR, (2019).
[3] José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez, 'Spanish pre-trained BERT model and evaluation data', in PML4DC at ICLR 2020, (2020).
[4] Alexis Conneau and Guillaume Lample, 'Cross-lingual language model pretraining', in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 7057–7067, (2019).
[5] Danilo Croce, Alexandra Zelenanska, and Roberto Basili, 'Neural learning for question answering in Italian', in AI*IA 2018 – Advances in Artificial Intelligence, pp. 389–402, (2018).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 'BERT: pre-training of deep bidirectional transformers for language understanding', in Proceedings of NAACL-HLT 2019, pp. 4171–4186, (2019).
[7] Martin d'Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé, 'FQuAD: French question answering dataset', CoRR, (2020).
[8] Pavel Efimov, Leonid Boytsov, and Pavel Braslavski, 'SberQuAD – Russian reading comprehension dataset: description and analysis', CoRR, (2019).
[9] Patrick S. H. Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk, 'MLQA: evaluating cross-lingual extractive question answering', CoRR, (2019).
[10] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz, 'Building a large annotated corpus of English: the Penn Treebank', Computational Linguistics, 19(2), 313–330, (1993).
[11] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, 'Exploring the limits of transfer learning with a unified text-to-text transformer', CoRR, (2019).
[12] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang, 'SQuAD: 100,000+ questions for machine comprehension of text', in Proceedings of EMNLP 2016, pp. 2383–2392, (2016).
[13] Valerie Sessions and Marco Valtorta, 'The effects of data quality on machine learning algorithms', in Proceedings of the 11th International Conference on Information Quality, pp. 485–498, MIT, (2006).
[14] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, 'GLUE: a multi-task benchmark and analysis platform for natural language understanding', in 7th International Conference on Learning Representations, ICLR 2019, (2019).
[15] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew, 'HuggingFace's Transformers: state-of-the-art natural language processing', ArXiv, (2019).
[16] Shijie Wu and Mark Dredze, 'Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT', in Proceedings of EMNLP-IJCNLP 2019, pp. 833–844, (2019).
[17] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, 'Aligning books and movies: towards story-like visual explanations by watching movies and reading books', in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27, (2015).
Table 4. A sample of TAR and MT translation error examples annotated in the user study.

Gender inference (TAR avg score = 0.8, MT avg score = 0)
English: It's not clear, however that this stereotypical view reflects the reality of East Asian classrooms or that the educational goals in these countries are commensurable with those in Western countries. In Japan, for example, although average attainment on standardized tests may exceed those in Western countries, classroom discipline and behavior is highly problematic. Although, officially, schools have extremely rigid codes of behavior, in practice many teachers find the students unmanageable and do not enforce discipline at all.
TAR: Sin embargo, no está claro que esta opinión estereotipada refleje la realidad de las aulas de Asia oriental o que los objetivos educativos de esos países sean acordes con los de los países occidentales. En Japón, por ejemplo, aunque el rendimiento medio de los ensayos estandarizados puede superar los de los países occidentales, la disciplina y el comportamiento de las aulas son muy problemáticos. Aunque oficialmente las escuelas tienen códigos de conducta extremadamente rígidos, en la práctica muchos maestros consideran que los estudiantes son inmanejables y no aplican la disciplina en absoluto.
MT: No está claro, sin embargo, que esta visión estereotipada refleja la realidad de las aulas del Asia oriental o que los objetivos educativos en estos países son conmensurables con los de los países occidentales. En Japón, por ejemplo, aunque el logro promedio de las pruebas estandarizadas puede exceder las de los países occidentales, la disciplina y el comportamiento en el aula son altamente problemáticos. Aunque, oficialmente, las escuelas tienen códigos de comportamiento extremadamente rígidos, en la práctica muchos profesores encuentran a los estudiantes inmanejables y no aplican la disciplina en absoluto.

Translation and capitalization inconsistency in named entities (TAR avg score = 1, MT avg score = 0.1)
English: The motion picture, television, and music industry is centered on the Los Angeles in southern California. Hollywood, a district within Los Angeles, is also a name associated with the motion picture industry. Headquartered in southern California are The Walt Disney Company (which also owns ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, and Warner Brothers. Universal, Warner Brothers, and Sony also run major record companies as well.
TAR: La industria del cine, la televisión y la música se centra en Los Ángeles en el sur de California. Hollywood, un distrito dentro de Los Ángeles, es también un nombre asociado a la industria cinematográfica. Con sede en el sur de California están The Walt Disney Company (que también posee ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, y Warner Brothers. Universal, Warner Brothers y Sony también tienen grandes compañías discográficas.
MT: La imagen del movimiento, la televisión y la industria musical se centran en los Ángeles en el sur de California. Hollywood, un distrito de los Ángeles, es también un nombre asociado a la industria fotográfica de movimiento. Con sede en el sur de California están la compañía Walt Disney (que también posee ABC), imágenes de Sony, universales, MGM, imágenes principales, Fox del siglo 20 y hermanos Warner. Universal, Warner Brothers y Sony también dirigen grandes empresas de registro.

English: During the same year, Tesla wrote a treatise, The Art of Projecting Concentrated Non-dispersive Energy through the Natural Media, concerning charged particle beam weapons. Tesla published the document in an attempt to expound on the technical description of a "superweapon that would put an end to all war." <...> Tesla tried to interest the US War Department, the United Kingdom, the Soviet Union, and Yugoslavia in the device.
TAR: Durante el mismo año, escribió un tratado, The Art of Projecting Concentrated non-dispersive Energy through the Natural Media, sobre las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de una "superarma que pondría fin a toda guerra". <...> Tesla trató de interesar al Departamento de Guerra de los Estados Unidos, el Reino Unido, la Unión Soviética y Yugoslavia en el dispositivo.
MT: Durante el mismo año, Tesla escribió un treatise, el arte de proyectar energía concentrada no dispersa a través de los medios naturales, en relación con las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de un «superarma que pondría fin a toda guerra». <...> Tesla trató de interesar al Departamento de Guerra DE NOSOTROS, al Reino Unido, a la Unión Soviética y a Yugoslavia en el dispositivo.

Adjectives placement regarding the noun (TAR avg score = 1, MT avg score = 0)
English: CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game. The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively. It was the third-most watched U.S. broadcast ever.
TAR: CBS transmitió el Super Bowl 50 en los Estados Unidos, y cobró un promedio de $5 millones por un comercial de 30 segundos durante el juego. El espectáculo de medio tiempo del Super Bowl 50 fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales como Beyoncé y Bruno Mars, quienes encabezaron los shows de medio tiempo del Super Bowl XLVII y Super Bowl XLVIII, respectivamente. Fue el tercer programa más visto de Estados Unidos.
MT: CBS emitió 50 super bowl en los U. S. y cobró un promedio de US $5 millones por un 30 - segundo comercial durante el juego. El espectáculo de semáforo 50 de super fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales Beyoncé y Bruno Mars, que encabezaron el súper súper XLVII y los espectáculos de semestral XLVIII, respectivamente. Fue la tercera, la más observada u.