=Paper=
{{Paper
|id=Vol-2693/paper4
|storemode=property
|title=Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks
|pdfUrl=https://ceur-ws.org/Vol-2693/paper4.pdf
|volume=Vol-2693
|authors=Maria Khvalchik,Mikhail Galkin
|dblpUrl=https://dblp.org/rec/conf/ecai/Khvalchik020
}}
==Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks==
Proceedings of the Workshop on Hybrid Intelligence for Natural Language Processing Tasks HI4NLP (co-located at ECAI-2020), Santiago de Compostela, August 29, 2020, published at http://ceur-ws.org

Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Maria Khvalchik (Semantic Web Company, Austria, maria.khvalchik@semantic-web.com) and Mikhail Galkin (TU Dresden & Fraunhofer IAIS, Germany, mikhail.galkin@tu-dresden.de)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated against curated Spanish SQuAD datasets on both the user and the system level. Further experimental results on the XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.

1 INTRODUCTION

Numerous research studies demonstrate how important data quality is to the outcomes of neural networks and how severely they are affected by low quality data [13].

Recently, transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing. Models like T5 [11] are now showing human-level performance on most of the well-established benchmarks available for natural language understanding.

Yet, language understanding is not solved, even in well-studied languages like English and even when tremendous resources are used. This paper focuses on a less resourced language, Spanish, and pursues two goals.

First, we seek to demonstrate that data quality is an important component for training neural networks and overall increases natural language understanding capabilities; hence, the quality of data should be considered carefully. We consider two recent Neural Machine Translated (NMT) SQuAD datasets of different quality, discussed in more detail in Section 4, for machine reading comprehension (MRC) in Spanish.

Second, after providing evidence of the data quality difference, we fine-tune pre-trained multilingual and monolingual Spanish BERT models on both datasets and observe a significant performance gap in terms of Exact Match (EM) and F1 scores on the dev sets of the SQuAD family. However, unexpectedly, the results become more comparable when testing on external benchmarks recently proposed for cross-lingual extractive QA.

Hence, we tackle the effects of both data quality and neural network characteristics, aiming to demonstrate that both are major factors in the outcomes and should be given equal attention in the primary design.

2 RELATED WORK

Historically, most NLP tasks, datasets, and benchmarks were created in English, e.g., the Penn Treebank [10], SQuAD [12], GLUE [14]. Therefore, most of the large-scale pre-trained models were trained in the English-only mode, e.g., BERT [6] employed Wikipedia and the BookCorpus [17] as training datasets. Later on, the NLP community sought to increase language diversity, and multilingual models started to appear, such as mBERT or XLM [4].

However, sufficiently large and diverse pre-training corpora of high quality often do not exist. Several methods have been developed to bridge this gap, e.g., applying machine translation frameworks to English corpora, or performing cross-lingual transfer learning [16]. Multilingual language models and pre-trained non-English language models are definitely in the focus of the NLP community. Still, the language understanding capabilities (hence, the performance) of language models largely depend on the data collection and cleaning steps. In the MRC dimension, for instance, Italian SQuAD [5] is obtained via direct translation from the English version, whereas French FQuAD [7] and Russian SberQuAD [8] have been created from the language-specific part of Wikipedia, which is often much smaller than the original SQuAD source.

With the surge of language-specific pre-trained LMs, several benchmarks have been developed that aim at evaluating the multi- and cross-lingual characteristics of such LMs. Specifically, for the machine reading comprehension and question answering (QA) task there exist XQuAD [1] and MLQA [9]. In this work, we study how LMs perform in QA tasks in Spanish when fine-tuned on datasets of possibly different quality, i.e., directly machine translated and curated with a human-in-the-loop strategy.
3 PROBLEM FORMULATION

In this work, we aim at exploring the impact of machine translated corpora quality on downstream MRC tasks, which is of high importance for less resourced languages. We consider Spanish as the target language: it is one of the most spoken languages in the world that nevertheless has a relatively small amount of available corpora for pre-training modern LMs for language understanding tasks. Taking this into account, we tackle the following research questions:

• RQ1: Is there a quantifiable difference in data quality between machine translated and manually post-processed corpora for Spanish SQuAD datasets? In order to answer this question, we conduct a user study in Section 4.
• RQ2: Can we expect a performance difference of LMs in MRC QA tasks when fine-tuned on datasets of different quality? We perform an experimental study in Section 5.1.
• RQ3: Is there a performance difference in downstream transfer-learning QA tasks, i.e., on external benchmarks the LMs were not fine-tuned on? Experimental results are shown in Section 5.2.

4 USER STUDY

4.1 Data Sources

For the user study we employ the two following recent MRC Spanish translations:

TAR: prepared following the Translate-Align-Retrieve methodology, which implies a lot of post-processing to improve the translation quality [2]. TAR SQuAD is produced from the original English SQuAD corpus and contains both 1.1 and 2.0 versions. Further, each version contains datasets of two sizes, i.e., regular (or default) and small (half the size of the regular), the latter being less noisy and more refined.

MT: SQuAD 1.1 and 2.0 versions translated by a private European NMT company (the dataset will be published on GitHub upon acceptance).
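Both corpora ultimately rely on translating an English context and answer and then recovering the answer span inside the translated context. The snippet below is only a toy illustration of that span-recovery idea, not the TAR pipeline of [2], which additionally uses alignment and extensive rule-based post-processing; the example strings, function name, and fuzzy-matching fallback are illustrative assumptions.

<pre>
from difflib import SequenceMatcher

def locate_answer(translated_context: str, translated_answer: str) -> int:
    """Return the start offset of the translated answer inside the translated
    context: exact substring match first, otherwise the best fuzzy window."""
    pos = translated_context.find(translated_answer)
    if pos >= 0:
        return pos
    # Fuzzy fallback: slide a window of the answer's length over the context.
    best_pos, best_ratio, n = 0, 0.0, len(translated_answer)
    for i in range(max(1, len(translated_context) - n + 1)):
        ratio = SequenceMatcher(None, translated_context[i:i + n],
                                translated_answer).ratio()
        if ratio > best_ratio:
            best_pos, best_ratio = i, ratio
    return best_pos

# Toy usage with a pre-translated pair (in practice an NMT system produces both).
context_es = "El Super Bowl 50 fue un partido de fútbol americano."
answer_es = "partido de fútbol americano"
print(locate_answer(context_es, answer_es))  # character offset 24
</pre>

The fuzzy fallback matters precisely because NMT output rarely reproduces the answer string verbatim, which is one source of the noise studied below.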
4.2 Translation Evaluation

To estimate the quality of translation, 50 parallel examples from the translated SQuAD 1.1 datasets were selected randomly. Twelve Spanish-speaking evaluators were asked to grade 25 parallel examples each on the following scale:

• 2, if understandable with only minor mistakes;
• 1, if understandable with a few major mistakes;
• 0, if not understandable, with more than a few major mistakes.

Table 1 shows the average translation evaluation score, from which we conclude that the TAR translation is significantly better than the MT translation.

Table 1. Translation evaluation average.
Translation by | Average
TAR | 1.717
MT | 1.320

To inspect further, Figure 3 provides the histogram of score frequencies: TAR receives the best score in about 80% of the ratings, while MT does so in only around 50%.

Figure 3. Score frequencies for TAR and MT translations (ratings with score 2/1/0: TAR 240/35/25, MT 151/94/55).

Furthermore, to evaluate the agreement among raters, we aggregate the evaluations in the heat map representations in Figure 1 and Figure 2. We can observe that most of the evaluators not only favoured the TAR translations but also agreed well on their scores, whereas the MT scores appear to be more contrasting. This could possibly mean that the MT errors were so diverse that evaluators found the provided scale to some degree misleading; hence, it was difficult to strictly represent the difference between minor and major errors.

Figure 1. TAR translation evaluation heat map on 50 parallel SQuAD examples scored by 12 evaluators.
Figure 2. MT translation evaluation heat map on 50 parallel SQuAD examples scored by 12 evaluators.

Translation errors which could significantly affect the results have been collected, and some examples are shown in Table 4. The error types are the following: wrong gender inference, inaccurate translation or capitalization of named entities, and misplacement of adjectives with respect to the noun. Here, we would like to point out the following errors in the MT translation:

• the named entity "Warner Brothers" is rendered both in a literal Spanish translation and in its original form: "... y hermanos Warner. Universal, Warner Brothers ...";
• "Universal Pictures" is translated by changing it into the plural form and dropping the noun: "universales";
• "the US War Department" is translated as "al Departamento de Guerra DE NOSOTROS", an impressive rendering of the capitalized abbreviation of the United States.

Therefore, we can positively answer RQ1, as there indeed exists a substantial difference in corpora quality when applying additional post-processing over directly machine translated data.
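The aggregation behind Table 1 and Figure 3 is a plain average and frequency count over the 0/1/2 ratings. A minimal sketch of that step is given below; the rating lists are placeholders, not the actual study data.

<pre>
from collections import Counter
from statistics import mean

# Hypothetical per-system ratings: in the study, 12 evaluators scored
# 25 parallel examples each, yielding one flat list of 0/1/2 scores per system.
ratings = {
    "TAR": [2, 2, 1, 2, 2, 0, 2, 1, 2, 2],  # placeholder values
    "MT":  [1, 0, 2, 1, 1, 0, 2, 1, 0, 2],  # placeholder values
}

for system, scores in ratings.items():
    freq = Counter(scores)  # score frequencies, as visualized in Figure 3
    print(f"{system}: average={mean(scores):.3f}, "
          f"frequencies={dict(sorted(freq.items()))}")
</pre>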
5 EXPERIMENTAL STUDY

In the experimental study we evaluate the performance of Spanish LMs on machine reading comprehension tasks, fine-tuning them on language corpora obtained via machine translation and via translation with further rule-based post-processing.

Datasets. For fine-tuning the pre-trained LMs in Spanish we leverage the different versions of es-SQuAD, i.e., TAR and MT, described in Section 4. The Small and Default versions of es-SQuAD (TAR) are annotated as Sm and Def, respectively. For benchmarking we employ the dev sets of the es-SQuAD datasets as well as the test sets of MLQA [9] and XQuAD [1] where both context and question are in Spanish.

Models. We choose the pre-trained mBERT-base-cased for fine-tuning on the es-SQuAD datasets. For a broader comparison we also employ the Spanish-only LMs BETO [3] and its distilled version DistilledBETO, both already pre-trained and fine-tuned on SQuAD 2.0.

Fine-Tuning Setup. The models are trained and evaluated in the cased mode using the HuggingFace Transformers [15] framework. When fine-tuning, the default hyperparameters are used: three epochs of the Adam optimizer with an initial learning rate of 0.00005. The experiments are conducted on an Ubuntu 16.04 server equipped with one GTX 1080 Ti GPU and 256 GB RAM.
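For orientation, the sketch below shows how such a setup can be expressed with the Transformers library. It is not the authors' exact training script; only the hyperparameters reported above (cased mBERT, three epochs, initial learning rate 5e-5) come from the paper, and the output directory name is a placeholder.

<pre>
# A minimal sketch of the reported fine-tuning setup, not the authors' script.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          TrainingArguments)

model_name = "bert-base-multilingual-cased"   # cased mBERT, as used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="mbert-es-squad",   # hypothetical output directory
    num_train_epochs=3,            # three epochs, as reported above
    learning_rate=5e-5,            # initial Adam learning rate, as reported above
)

# Feature preparation (tokenizing question/context pairs and mapping answer
# spans to token positions) follows the standard SQuAD recipe of the
# Transformers examples and is omitted here; a
# Trainer(model=model, args=args, train_dataset=...) call then runs fine-tuning.
</pre>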
Metrics. We measure EM and F1 scores in each experiment as reported by the task-specific evaluation scripts. For consistency reasons, we do not evaluate models fine-tuned on SQuAD 1.1 datasets against SQuAD 2.0 versions.
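For reference, the EM and F1 metrics produced by SQuAD-style evaluation scripts reduce to answer-string normalization plus token overlap. The sketch below handles the single-gold-answer case only; the official scripts additionally take the maximum over multiple gold answers and, for SQuAD 2.0, handle unanswerable questions.

<pre>
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, squeeze whitespace
    (SQuAD-style normalization; Spanish articles would need to be added for a
    fully language-specific variant)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("el Departamento de Guerra", "Departamento de Guerra"))      # 0
print(round(f1("el Departamento de Guerra", "Departamento de Guerra"), 2))     # 0.86
</pre>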
5.1 SQuAD Performance

In the first experiment, we fine-tune mBERT on the TAR and MT datasets and evaluate the accuracy on the dev sets of the respective tasks. That is, in order to study the impact of the fine-tuning dataset, we optimize the model on TAR but evaluate on the MT dev set, and vice versa. The empirical results are shown in Table 2.

Table 2. Performance on es-SQuAD (EM / F1). All models are in the case-sensitive mode.
Model | TAR 1.1 Small | TAR 1.1 Default | TAR 2.0 Small | TAR 2.0 Default | MT 1.1 Default | MT 2.0 Default
mBERT (1.1 Sm TAR) | 57.45 / 73.34 | 55.04 / 72.38 | - | - | 56.24 / 71.12 | -
mBERT (1.1 Def TAR) | 56.30 / 73.71 | 59.65 / 76.32 | - | - | 54.80 / 71.67 | -
mBERT (2.0 Sm TAR) | 57.36 / 73.52 | 55.20 / 72.56 | 59.85 / 66.30 | 60.18 / 66.94 | 55.36 / 70.64 | 46.17 / 57.87
mBERT (2.0 Def TAR) | 56.24 / 73.11 | 59.05 / 75.48 | 59.76 / 67.17 | 62.08 / 68.90 | 54.47 / 71.06 | 46.97 / 59.47
mBERT (1.1 MT) | 54.25 / 71.42 | 53.90 / 71.90 | - | - | 61.20 / 64.19 | -
mBERT (2.0 MT) | 52.92 / 70.51 | 53.03 / 71.25 | 29.02 / 38.60 | 26.32 / 35.80 | 61.27 / 74.15 | 61.46 / 74.72
BETO | 56.72 / 74.38 | 59.71 / 76.92 | 58.55 / 67.16 | 60.01 / 67.97 | 55.93 / 72.56 | 49.49 / 62.88
DistilledBETO | 54.11 / 72.41 | 57.32 / 74.94 | 57.26 / 66.28 | 58.58 / 66.75 | 52.11 / 69.86 | 48.28 / 63.58

First, we observe that LMs fine-tuned on MT SQuAD considerably outperform other models in terms of both EM and F1 only on the MT dev set, while being significantly inferior to all other models on the TAR dev set. For instance, mBERT fine-tuned on the MT version of SQuAD 2.0 is about 12 EM and F1 points better than BETO on the MT version of SQuAD 2.0 and at the same time about 32 EM and F1 points worse than BETO on the default TAR version of SQuAD 2.0.

Similarly, mBERT trained on the MT version of SQuAD 1.1 achieves a very good EM score on the MT dev set of SQuAD 1.1 but performs poorly on the TAR versions. Considering the difference in dataset quality demonstrated in Section 4, we deem such behavior a sign of LM sensitivity to artificially created corpora with numerous syntactic and semantic mistakes.

Moreover, the TAR-trained models show more consistent scores across the given tasks, thus supporting RQ2, i.e., LMs tend to be more robust when trained and evaluated on well-prepared language corpora. Overall, in this experiment we find that LMs trained on Spanish-only corpora (e.g., BETO) perform on par with or slightly better than massive multilingual LMs like mBERT fine-tuned on a similar task in a language-specific setting.

5.2 MLQA and XQuAD Performance

In the second experiment, we probe the TAR and MT fine-tuned models against MLQA and XQuAD in the Spanish-context, Spanish-question setting. The results are presented in Table 3.

Table 3. Performance on MLQA and XQuAD. Cased models.
Model | MLQA EM | MLQA F1 | XQuAD EM | XQuAD F1
mBERT (1.1 Sm TAR) | 42.74 | 64.36 | 53.61 | 72.89
mBERT (1.1 Def TAR) | 43.14 | 66.44 | 54.62 | 75.30
mBERT (2.0 Sm TAR) | 43.31 | 65.04 | 54.45 | 74.09
mBERT (2.0 Def TAR) | 43.44 | 66.09 | 55.97 | 76.82
mBERT (1.1 MT) | 44.43 | 64.83 | 57.14 | 75.46
mBERT (2.0 MT) | 44.13 | 64.43 | 54.03 | 73.17
BETO | 45.12 | 68.77 | 56.97 | 78.15
DistilledBETO | 42.41 | 66.06 | 55.46 | 75.84

A clear winner is BETO, which is pre-trained on Spanish-only corpora and outperforms the nearest contenders by about 2 F1 points. We then observe that TAR-trained models perform consistently better than MT-trained models in terms of F1 scores. Interestingly, in terms of EM scores mBERT 1.1 MT yields better performance than the TAR-trained models and even language-specific models like BETO. Such a phenomenon can be explained by the robustness of large-scale multilingual LMs, which might tend to generalize better over translation artifacts. We leave further research of this phenomenon to future work.

Overall, discussing RQ3, we hypothesize that for downstream language-specific tasks, LMs pre-trained in that specific language are preferable. In case such a large-scale pre-training corpus is not available, well-processed machine translated sources tend to produce more robust LMs compared to purely machine translated sources.
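To make the evaluation setting concrete, a fine-tuned checkpoint can be queried on a Spanish context-question pair through the standard question-answering pipeline; the model identifier below is a placeholder for any of the checkpoints behind Tables 2 and 3, not a published model name.

<pre>
from transformers import pipeline

# "mbert-es-squad" is a hypothetical local checkpoint directory produced by
# fine-tuning; substitute the path of any model from Table 2 or Table 3.
qa = pipeline("question-answering", model="mbert-es-squad", tokenizer="mbert-es-squad")

result = qa(
    question="¿Qué grupo encabezó el espectáculo de medio tiempo del Super Bowl 50?",
    context="El espectáculo de medio tiempo del Super Bowl 50 fue encabezado "
            "por el grupo de rock británico Coldplay.",
)
print(result["answer"], result["score"])  # expected span: "Coldplay"
</pre>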
6 CONCLUSION AND FUTURE WORK

In this work, we studied the impact of machine translated corpora quality on question answering tasks. Having formulated three research questions, we employed Spanish SQuAD-style datasets for empirical evaluation. The user study confirmed that there is a significant difference in dataset quality and in the amount of language artifacts. Further experimental studies confirmed that LMs are sensitive to the quality of machine translated corpora. We also observe signs of LM robustness to translation defects in downstream transfer learning tasks.

For future work we pose the question of conducting an appropriate analysis of how neural networks overcome flaws in the data, machine translated or not, to become robust and noise resilient.

REFERENCES

[1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama, 'On the cross-lingual transferability of monolingual representations', CoRR, (2019).
[2] Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa, 'Automatic Spanish translation of the SQuAD dataset for multilingual question answering', CoRR, (2019).
[3] José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez, 'Spanish pre-trained BERT model and evaluation data', in PML4DC at ICLR 2020, (2020).
[4] Alexis Conneau and Guillaume Lample, 'Cross-lingual language model pretraining', in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 7057–7067, (2019).
[5] Danilo Croce, Alexandra Zelenanska, and Roberto Basili, 'Neural learning for question answering in Italian', in AI*IA 2018 – Advances in Artificial Intelligence, pp. 389–402, (2018).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 'BERT: pre-training of deep bidirectional transformers for language understanding', in Proceedings of NAACL-HLT 2019, pp. 4171–4186, (2019).
[7] Martin d'Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé, 'FQuAD: French question answering dataset', CoRR, (2020).
[8] Pavel Efimov, Leonid Boytsov, and Pavel Braslavski, 'SberQuAD – Russian reading comprehension dataset: description and analysis', CoRR, (2019).
[9] Patrick S. H. Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk, 'MLQA: evaluating cross-lingual extractive question answering', CoRR, (2019).
[10] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz, 'Building a large annotated corpus of English: the Penn Treebank', Computational Linguistics, 19(2), 313–330, (1993).
[11] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, 'Exploring the limits of transfer learning with a unified text-to-text transformer', CoRR, (2019).
[12] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang, 'SQuAD: 100,000+ questions for machine comprehension of text', in Proceedings of EMNLP 2016, pp. 2383–2392, (2016).
[13] Valerie Sessions and Marco Valtorta, 'The effects of data quality on machine learning algorithms', in Proceedings of the 11th International Conference on Information Quality, pp. 485–498, MIT, (2006).
[14] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, 'GLUE: a multi-task benchmark and analysis platform for natural language understanding', in 7th International Conference on Learning Representations, ICLR 2019, (2019).
[15] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew, 'HuggingFace's Transformers: state-of-the-art natural language processing', ArXiv, (2019).
[16] Shijie Wu and Mark Dredze, 'Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT', in Proceedings of EMNLP-IJCNLP 2019, pp. 833–844, (2019).
[17] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, 'Aligning books and movies: towards story-like visual explanations by watching movies and reading books', in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27, (2015).
Table 4. A sample of TAR and MT translation error examples annotated in the user study.

Gender inference (TAR avg score = 0.8, MT avg score = 0)
English: It's not clear, however that this stereotypical view reflects the reality of East Asian classrooms or that the educational goals in these countries are commensurable with those in Western countries. In Japan, for example, although average attainment on standardized tests may exceed those in Western countries, classroom discipline and behavior is highly problematic. Although, officially, schools have extremely rigid codes of behavior, in practice many teachers find the students unmanageable and do not enforce discipline at all.
TAR: Sin embargo, no está claro que esta opinión estereotipada refleje la realidad de las aulas de Asia oriental o que los objetivos educativos de esos países sean acordes con los de los países occidentales. En Japón, por ejemplo, aunque el rendimiento medio de los ensayos estandarizados puede superar los de los países occidentales, la disciplina y el comportamiento de las aulas son muy problemáticos. Aunque oficialmente las escuelas tienen códigos de conducta extremadamente rígidos, en la práctica muchos maestros consideran que los estudiantes son inmanejables y no aplican la disciplina en absoluto.
MT: No está claro, sin embargo, que esta visión estereotipada refleja la realidad de las aulas del Asia oriental o que los objetivos educativos en estos países son conmensurables con los de los países occidentales. En Japón, por ejemplo, aunque el logro promedio de las pruebas estandarizadas puede exceder las de los países occidentales, la disciplina y el comportamiento en el aula son altamente problemáticos. Aunque, oficialmente, las escuelas tienen códigos de comportamiento extremadamente rígidos, en la práctica muchos profesores encuentran a los estudiantes inmanejables y no aplican la disciplina en absoluto.

Translation and capitalization inconsistency in named entities (TAR avg score = 1, MT avg score = 0.1)
English: The motion picture, television, and music industry is centered on the Los Angeles in southern California. Hollywood, a district within Los Angeles, is also a name associated with the motion picture industry. Headquartered in southern California are The Walt Disney Company (which also owns ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, and Warner Brothers. Universal, Warner Brothers, and Sony also run major record companies as well.
TAR: La industria del cine, la televisión y la música se centra en Los Ángeles en el sur de California. Hollywood, un distrito dentro de Los Ángeles, es también un nombre asociado a la industria cinematográfica. Con sede en el sur de California están The Walt Disney Company (que también posee ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, y Warner Brothers. Universal, Warner Brothers y Sony también tienen grandes compañías discográficas.
MT: La imagen del movimiento, la televisión y la industria musical se centran en los Ángeles en el sur de California. Hollywood, un distrito de los Ángeles, es también un nombre asociado a la industria fotográfica de movimiento. Con sede en el sur de California están la compañía Walt Disney (que también posee ABC), imágenes de Sony, universales, MGM, imágenes principales, Fox del siglo 20 y hermanos Warner. Universal, Warner Brothers y Sony también dirigen grandes empresas de registro.

English: During the same year, Tesla wrote a treatise, The Art of Projecting Concentrated Non-dispersive Energy through the Natural Media, concerning charged particle beam weapons. Tesla published the document in an attempt to expound on the technical description of a "superweapon that would put an end to all war." <...> Tesla tried to interest the US War Department, the United Kingdom, the Soviet Union, and Yugoslavia in the device.
TAR: Durante el mismo año, escribió un tratado, The Art of Projecting Concentrated non-dispersive Energy through the Natural Media, sobre las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de una "superarma que pondría fin a toda guerra". <...> Tesla trató de interesar al Departamento de Guerra de los Estados Unidos, el Reino Unido, la Unión Soviética y Yugoslavia en el dispositivo.
MT: Durante el mismo año, Tesla escribió un treatise, el arte de proyectar energía concentrada no dispersa a través de los medios naturales, en relación con las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de un «superarma que pondría fin a toda guerra». <...> Tesla trató de interesar al Departamento de Guerra DE NOSOTROS, al Reino Unido, a la Unión Soviética y a Yugoslavia en el dispositivo.

Adjectives placement regarding the noun (TAR avg score = 1, MT avg score = 0)
English: CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game. The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively. It was the third-most watched U.S. broadcast ever.
TAR: CBS transmitió el Super Bowl 50 en los Estados Unidos, y cobró un promedio de $5 millones por un comercial de 30 segundos durante el juego. El espectáculo de medio tiempo del Super Bowl 50 fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales como Beyoncé y Bruno Mars, quienes encabezaron los shows de medio tiempo del Super Bowl XLVII y Super Bowl XLVIII, respectivamente. Fue el tercer programa más visto de Estados Unidos.
MT: CBS emitió 50 super bowl en los U. S. y cobró un promedio de US $5 millones por un 30 - segundo comercial durante el juego. El espectáculo de semáforo 50 de super fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales Beyoncé y Bruno Mars, que encabezaron el súper súper XLVII y los espectáculos de semestral XLVIII, respectivamente. Fue la tercera, la más observada u.