<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Santiago de Compostela, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Khvalchik</string-name>
          <email>maria.khvalchik@semantic-web.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikhail Galkin</string-name>
          <email>mikhail.galkin@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Web Company</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Dresden &amp; Fraunhofer IAIS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>29</volume>
      <issue>2020</issue>
      <fpage>29</fpage>
      <lpage>33</lpage>
      <abstract>
        <p>Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever-growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on the XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Numerous research studies demonstrate how important data quality is to the outcomes of neural networks and how severely they are affected by low-quality data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Recently, however, transfer learning, where a model is first pretrained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing. Models like T5 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are now showing human-level
performance on most of the well-established benchmarks available for
natural language understanding.
      </p>
      <p>Yet, language understanding is not solved, even in well-studied
languages like English, even when tremendous resources are used.
This paper focuses on a less resourced language, Spanish, and
pursues two goals.</p>
      <p>First, we seek to demonstrate that data quality is an important component in training neural networks and overall improves natural language understanding capabilities. Hence, the quality of data should be considered carefully. We consider two recent Neural Machine Translated (NMT) SQuAD datasets of different quality for machine reading comprehension (MRC) in Spanish, discussed in more detail in Section 4.</p>
      <p>Second, after providing evidence of the data quality difference, we fine-tune pre-trained multilingual and monolingual Spanish BERT models on both datasets and see that there is a significant performance gap in terms of Exact Match (EM) and F1 scores on the dev sets of the SQuAD family. However, unexpectedly, the results become more comparable when testing on external benchmarks recently proposed for cross-lingual extractive QA.</p>
      <p>
        Hence, we tackle the effects of both data quality and neural network characteristics in an effort to demonstrate that both are major factors in the outcomes and should be given equal consideration in their primary design.
      </p>
      <p>
        Historically, most NLP tasks, datasets, and benchmarks were created in English, e.g., the Penn Treebank [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], SQuAD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
GLUE [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Therefore, most of the large-scale pre-trained models
were trained in the English-only mode, e.g., BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] employed
Wikipedia and the BookCorpus [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as training datasets. Later on, the NLP community sought to increase language diversity, and multilingual models started to appear, such as mBERT or XLM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        However, large and diverse enough pre-training corpora of high
quality often do not exist. Several methods have been developed to
bridge this gap, e.g., applying machine translation frameworks to
English corpora, or performing cross-lingual transfer learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Multilingual language models and pre-trained non-English language models are clearly a focus of the NLP community. Still, the language understanding capabilities (and hence the performance) of language models largely depend on the data collection and cleaning steps. In the MRC dimension, for instance, Italian SQuAD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is
obtained via direct translation from the English version whereas French
FQuAD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Russian SberQuAD [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been created from the language-specific parts of Wikipedia, which are often much smaller than the original English SQuAD.
      </p>
      <p>
        With the surge of language-specific pre-trained LMs several
benchmarks have been developed that aim at evaluating multi- and
cross-lingual characteristics of such LMs. Specifically, for the
machine reading comprehension and question answering (QA) task
there exist XQuAD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and MLQA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we study how LMs perform in QA tasks in Spanish when fine-tuned on datasets of possibly different quality, i.e., directly machine translated versus curated with a human-in-the-loop strategy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROBLEM FORMULATION</title>
      <p>In this work, we aim at exploring the impact of machine translated corpora quality on downstream MRC tasks, which is of high importance for less resourced languages. We consider Spanish as the target language: it is one of the most spoken languages in the world, yet it has a relatively small amount of corpora available for pretraining modern LMs for language understanding tasks. Taking this into account, we tackle the following research questions.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>RQ1: Is there a quantifiable difference in data quality between
machine translated and manually post-processed corpora for
Spanish SQuAD datasets? In order to answer this question, we
conduct a user study in Section 4.</p>
      <p>RQ2: Can we expect a performance difference of LMs in MRC QA tasks when they are fine-tuned on datasets of different quality? We perform an experimental study in Section 5.1.</p>
      <p>RQ3: Is there a performance difference in downstream transfer-learning QA tasks, i.e., on external benchmarks the LMs were not fine-tuned on? Experimental results are shown in Section 5.2.</p>
    </sec>
    <sec id="sec-3">
      <title>USER STUDY</title>
    </sec>
    <sec id="sec-4">
      <title>Data Sources</title>
      <p>For the user study we employ the following two recent Spanish MRC translations:</p>
      <p>
        TAR: prepared following the Translate-Align-Retrieve methodology, which involves substantial post-processing to improve the translation quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. TAR SQuAD is produced from the original English SQuAD corpus and contains both the 1.1 and 2.0 versions. Further, each version comes in two sizes, i.e., regular (or default) and small (half the size of the regular one), the latter being less noisy and more refined.
      </p>
      <p>MT: SQuAD 1.1 and 2.0 versions translated by a private European NMT company.</p>
    </sec>
    <sec id="sec-5">
      <title>Translation Evaluation</title>
      <p>To estimate the quality of translation, 50 parallel examples from the translated SQuAD 1.1 datasets were selected randomly. Twelve Spanish-speaking evaluators were asked to give the following grades to 25 parallel examples each:</p>
      <sec id="sec-5-1">
        <title>Grading Scale</title>
        <p>2, if understandable and there are only minor mistakes;
1, if understandable and has a few major mistakes;
0, if not understandable and has more than a few major mistakes.</p>
        <p>Table 1 depicts the average translation evaluation score, from which we conclude that the TAR translation is significantly better than the MT one.</p>
        <p>To inspect further, Figure 3 provides the histogram of score frequencies, where the TAR translation receives the best score in 80% of cases while MT does in only around 50%.</p>
        <p>Furthermore, to evaluate the agreement among raters, we aggregate the evaluations in a heat map representation in Figure 1 and Figure 2. We can observe that most of the evaluators not only favoured TAR but also agreed well on its scores, whereas the MT scores appear to be more contrasting. This could mean that the MT errors were so diverse that evaluators found the provided scale misleading to some degree; hence, it was difficult to strictly represent the difference between minor and major errors.</p>
        <p>Translation errors that could significantly affect the results have been collected, and some examples are depicted in Table 4. The error types are the following: wrong gender inference, inaccurate translation or capitalization of named entities, and misplacement of adjectives relative to the noun. As an illustration, the MT translation renders the named entity "Warner Brothers" both literally in Spanish and in its original form: "... y hermanos Warner. Universal, Warner Brothers ...".</p>
        <p>We will publish the dataset on GitHub upon acceptance.</p>
        <p>Therefore, we can positively answer RQ 1, as there indeed exists a substantial difference in corpora quality when applying additional post-processing over directly machine translated data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL STUDY</title>
      <p>In the experimental study we evaluate the performance of Spanish LMs in machine reading comprehension tasks by fine-tuning them on language corpora obtained via direct machine translation and via translation with further rule-based post-processing.</p>
      <p>
        Datasets. For fine-tuning the pre-trained LMs in Spanish, we leverage the different versions of es-SQuAD, i.e., TAR and MT, described in Section 4. The small and default versions of es-SQuAD (TAR) are annotated as sm and def, respectively. For benchmarking, we employ the dev sets of the es-SQuAD datasets as well as the test sets of
MLQA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and XQuAD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where both context and question are in
Spanish.
      </p>
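      <p>All es-SQuAD variants follow the SQuAD JSON schema, so the same loading code applies to TAR and MT alike. A minimal sketch of iterating over such data is shown below; the toy record is hypothetical, but the nesting of articles, paragraphs, questions, and answers is the standard SQuAD layout.</p>

```python
# A toy record in SQuAD format (hypothetical content); the real es-SQuAD
# files follow the same nesting: data -> paragraphs -> qas -> answers.
squad_like = {
    "version": "1.1",
    "data": [{
        "title": "Super_Bowl_50",
        "paragraphs": [{
            "context": "CBS transmitió el Super Bowl 50 en los Estados Unidos.",
            "qas": [{
                "id": "q1",
                "question": "¿Quién transmitió el Super Bowl 50?",
                "answers": [{"text": "CBS", "answer_start": 0}],
            }],
        }],
    }],
}

def iter_qa_pairs(dataset):
    """Yield (context, question, answer_text) triples from a SQuAD-format dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield paragraph["context"], qa["question"], answer["text"]

pairs = list(iter_qa_pairs(squad_like))
```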
      <p>
        Models. We choose the pre-trained mBERT-base-cased model for fine-tuning on the es-SQuAD datasets. For a broader comparison, we also employ Spanish-only LMs already pre-trained and fine-tuned on SQuAD 2.0: BETO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and its distilled version DistilledBETO.
      </p>
      <p>
        Fine-Tuning Setup. The models are trained and evaluated in the
cased mode using the HuggingFace Transformers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] framework.
When fine-tuning, the default hyperparameters are used: three epochs with the Adam optimizer and an initial learning rate of 5e-5. The experiments are conducted on an Ubuntu 16.04 server equipped with one GTX 1080 Ti GPU and 256 GB of RAM.
      </p>
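      <p>The setup above can be sketched as the following command. This is a minimal sketch assuming the run_squad.py example script shipped with the HuggingFace Transformers repository; the es-SQuAD file names are hypothetical placeholders, and max_seq_length and doc_stride are the script's defaults, made explicit.</p>

```shell
# Sketch of the fine-tuning setup: mBERT-base-cased, three epochs,
# Adam with an initial learning rate of 5e-5. File names are placeholders.
python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --do_train \
  --do_eval \
  --train_file es-squad-v2.0-tar-train.json \
  --predict_file es-squad-v2.0-tar-dev.json \
  --version_2_with_negative \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./mbert-es-squad-tar
```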
      <p>Metrics. We measure EM and F1 scores in each experiment as reported by the task-specific evaluation scripts. For consistency, we do not evaluate models fine-tuned on SQuAD 1.1 datasets against SQuAD 2.0 versions.</p>
    </sec>
    <sec id="sec-7">
      <title>SQuAD Performance</title>
      <p>In the first experiment, we fine-tune mBERT on the TAR and MT datasets and evaluate their accuracy on the dev set of the respective task. That is, in order to study the impact of the fine-tuning dataset, we optimize the model on TAR but evaluate on the MT dev set, and vice versa. The empirical results are shown in Table 2.</p>
      <p>First, we observe that LMs fine-tuned on the MT SQuAD considerably outperform other models in terms of both EM and F1 only on the MT dev set while being significantly inferior to all other models on the TAR dev set. For instance, mBERT fine-tuned on the MT version of SQuAD 2.0 is about 12 EM and F1 points better than BETO on the MT version of SQuAD 2.0 and at the same time is about 32 EM and F1 points worse than BETO on the default TAR version of SQuAD 2.0.</p>
      <p>Similarly, mBERT trained on the MT version of SQuAD 1.1 achieves a very good EM score on the MT dev set of SQuAD 1.1 but performs poorly on the TAR versions. Considering the difference in dataset quality demonstrated in Section 4, we deem such behavior a sign of LMs' sensitivity to artificially created corpora with numerous syntactic and semantic mistakes.</p>
      <p>Moreover, the TAR-trained models show more consistent scores across the given tasks, thus supporting RQ 2, i.e., LMs tend to be more robust when trained and evaluated on well-prepared language corpora. Overall, in this experiment we find that LMs trained on Spanish-only corpora (e.g., BETO) perform on par with or slightly better than massive multilingual LMs like mBERT fine-tuned on a similar task in a language-specific setting.</p>
    </sec>
    <sec id="sec-8">
      <title>MLQA and XQuAD Performance</title>
      <p>outperforms nearest contenders by about 2 F1 points. We then observe that TAR-trained models perform consistently better than MT-trained models in terms of F1 scores. Interestingly, in terms of EM scores mBERT 1.1 MT yields better performance than TAR and even language-specific models like BETO. Such a phenomenon can be explained by the robustness of large-scale multilingual LMs, which might tend to generalize better over translation artifacts. We leave further research of this phenomenon to future work.</p>
      <p>Overall, discussing RQ 3, we hypothesize that for downstream language-specific tasks, LMs pre-trained in that specific language are preferable. In case such large-scale pre-training corpora are not available, well-processed machine translated sources tend to produce more robust LMs compared to purely machine translated sources.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this work, we studied the impact of machine translated corpora quality on question answering tasks. Having formulated three research questions, we employed Spanish SQuAD-style datasets for empirical evaluation. The user study confirmed that there is a significant difference in dataset quality and the amount of language artifacts. Further experimental studies confirmed that LMs are sensitive to the quality of machine translated corpora. We also observe signs of LM robustness to translation defects in downstream transfer learning tasks.</p>
      <p>For future work, we pose the question of how neural networks overcome flaws in the data, not always machine translated ones, to become robust and noise resilient.</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Examples of translation errors in the TAR and MT datasets with the corresponding English source.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>English</th>
              <th>TAR</th>
              <th>MT</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td colspan="3">Gender inference (TAR avg score = 0.8, MT avg score = 0)</td>
            </tr>
            <tr>
              <td>It's not clear, however, that this stereotypical view reflects the reality of East Asian classrooms or that the educational goals in these countries are commensurable with those in Western countries. In Japan, for example, although average attainment on standardized tests may exceed those in Western countries, classroom discipline and behavior is highly problematic. Although, officially, schools have extremely rigid codes of behavior, in practice many teachers find the students unmanageable and do not enforce discipline at all.</td>
              <td>Sin embargo, no está claro que esta opinión estereotipada refleje la realidad de las aulas de Asia oriental o que los objetivos educativos de esos países sean acordes con los de los países occidentales. En Japón, por ejemplo, aunque el rendimiento medio de los ensayos estandarizados puede superar los de los países occidentales, la disciplina y el comportamiento de las aulas son muy problemáticos. Aunque oficialmente las escuelas tienen códigos de conducta extremadamente rígidos, en la práctica muchos maestros consideran que los estudiantes son inmanejables y no aplican la disciplina en absoluto.</td>
              <td>No está claro, sin embargo, que esta visión estereotipada refleja la realidad de las aulas del Asia oriental o que los objetivos educativos en estos países son conmensurables con los de los países occidentales. En Japón, por ejemplo, aunque el logro promedio de las pruebas estandarizadas puede exceder las de los países occidentales, la disciplina y el comportamiento en el aula son altamente problemáticos. Aunque, oficialmente, las escuelas tienen códigos de comportamiento extremadamente rígidos, en la práctica muchos profesores encuentran a los estudiantes inmanejables y no aplican la disciplina en absoluto.</td>
            </tr>
            <tr>
              <td colspan="3">Translation and capitalization inconsistency in named entities (TAR avg score = 1, MT avg score = 0.1)</td>
            </tr>
            <tr>
              <td>The motion picture, television, and music industry is centered on the Los Angeles in southern California. Hollywood, a district within Los Angeles, is also a name associated with the motion picture industry. Headquartered in southern California are The Walt Disney Company (which also owns ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, and Warner Brothers. Universal, Warner Brothers, and Sony also run major record companies as well.</td>
              <td>La industria del cine, la televisión y la música se centra en Los Ángeles en el sur de California. Hollywood, un distrito dentro de Los Ángeles, es también un nombre asociado a la industria cinematográfica. Con sede en el sur de California están The Walt Disney Company (que también posee ABC), Sony Pictures, Universal, MGM, Paramount Pictures, 20th Century Fox, y Warner Brothers. Universal, Warner Brothers y Sony también tienen grandes compañías discográficas.</td>
              <td>La imagen del movimiento, la televisión y la industria musical se centran en los Ángeles en el sur de California. Hollywood, un distrito de los Ángeles, es también un nombre asociado a la industria fotográfica de movimiento. Con sede en el sur de California están la compañía Walt Disney (que también posee ABC), imágenes de Sony, universales, MGM, imágenes principales, Fox del siglo 20 y hermanos Warner. Universal, Warner Brothers y Sony también dirigen grandes empresas de registro.</td>
            </tr>
            <tr>
              <td>During the same year, Tesla wrote a treatise, The Art of Projecting Concentrated Non-dispersive Energy through the Natural Media, concerning charged particle beam weapons. Tesla published the document in an attempt to expound on the technical description of a "superweapon that would put an end to all war." &lt;...&gt; Tesla tried to interest the US War Department, the United Kingdom, the Soviet Union, and Yugoslavia in the device.</td>
              <td>Durante el mismo año, escribió un tratado, The Art of Projecting Concentrated nondispersive Energy through the Natural Media, sobre las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de una "superarma que pondría fin a toda guerra". &lt;...&gt; Tesla trató de interesar al Departamento de Guerra de los Estados Unidos, el Reino Unido, la Unión Soviética y Yugoslavia en el dispositivo.</td>
              <td>Durante el mismo año, Tesla escribió un treatise, el arte de proyectar energía concentrada no dispersa a través de los medios naturales, en relación con las armas de haz de partículas cargadas. Tesla publicó el documento en un intento de exponer la descripción técnica de un «superarma que pondría fin a toda guerra». &lt;...&gt; Tesla trató de interesar al Departamento de Guerra DE NOSOTROS, al Reino Unido, a la Unión Soviética y a Yugoslavia en el dispositivo.</td>
            </tr>
            <tr>
              <td colspan="3">Adjectives placement regarding the noun (TAR avg score = 1, MT avg score = 0)</td>
            </tr>
            <tr>
              <td>CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game. The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively. It was the third-most watched U.S. broadcast ever.</td>
              <td>CBS transmitió el Super Bowl 50 en los Estados Unidos, y cobró un promedio de $5 millones por un comercial de 30 segundos durante el juego. El espectáculo de medio tiempo del Super Bowl 50 fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales como Beyoncé y Bruno Mars, quienes encabezaron los shows de medio tiempo del Super Bowl XLVII y Super Bowl XLVIII, respectivamente. Fue el tercer programa más visto de Estados Unidos.</td>
              <td>CBS emitió 50 super bowl en los U. S. y cobró un promedio de US $5 millones por un 30 - segundo comercial durante el juego. El espectáculo de semáforo 50 de super fue encabezado por el grupo de rock británico Coldplay con artistas invitados especiales Beyoncé y Bruno Mars, que encabezaron el súper súper XLVII y los espectáculos de semestral XLVIII, respectivamente. Fue la tercera, la más observada u.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Mikel</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          , and Dani Yogatama, '
          <article-title>On the crosslingual transferability of monolingual representations'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Casimiro</given-names>
            <surname>Pio</surname>
          </string-name>
          <string-name>
            <surname>Carrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marta R.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          , and
          <string-name>
            <given-names>José A. R.</given-names>
            <surname>Fonollosa</surname>
          </string-name>
          , '
          <article-title>Automatic spanish translation of the squad dataset for multilingual question answering'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>José</given-names>
            <surname>Cañete</surname>
          </string-name>
          , Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez, '
          <article-title>Spanish pre-trained bert model and evaluation data'</article-title>
          , in to appear
          <source>in PML4DC at ICLR</source>
          <year>2020</year>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          and Guillaume Lample, '
          <article-title>Cross-lingual language model pretraining'</article-title>
          ,
          <source>in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          , NeurIPS
          <year>2019</year>
          , pp.
          <fpage>7057</fpage>
          -
          <lpage>7067</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Croce</surname>
          </string-name>
          , Alexandra Zelenanska, and Roberto Basili, '
          <article-title>Neural learning for question answering in italian'</article-title>
          ,
          <source>in AI*IA 2018 - Advances in Artificial Intelligence</source>
          , eds.,
          <string-name>
            <given-names>Chiara</given-names>
            <surname>Ghidini</surname>
          </string-name>
          , Bernardo Magnini, Andrea Passerini, and Paolo Traverso, pp.
          <fpage>389</fpage>
          -
          <lpage>402</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and Kristina Toutanova, '
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding'</article-title>
          ,
          <source>in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , NAACL-HLT
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Martin</given-names>
            <surname>d'Hoffschmidt</surname>
          </string-name>
          , Maxime Vidal, Wacim Belblidia, and Tom Brendlé, '
          <article-title>FQuAD: French question answering dataset'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Efimov</surname>
          </string-name>
          , Leonid Boytsov, and Pavel Braslavski, '
          <article-title>Sberquad - russian reading comprehension dataset: Description and analysis'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Patrick S. H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , Barlas Oguz, Ruty Rinott,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          , and Holger Schwenk, '
          <article-title>MLQA: evaluating cross-lingual extractive question answering'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mitchell P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          , Beatrice Santorini, and Mary Ann Marcinkiewicz, '
          <article-title>Building a large annotated corpus of english: The penn treebank'</article-title>
          ,
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          , (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Noam Shazeer, Adam Roberts,
          <string-name>
            <given-names>Katherine</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sharan</given-names>
            <surname>Narang</surname>
          </string-name>
          , Michael Matena,
          <string-name>
            <given-names>Yanqi</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , '
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer'</article-title>
          ,
          <source>CoRR</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Pranav</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          , Jian Zhang, Konstantin Lopyrev, and Percy Liang, '
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text'</article-title>
          ,
          <source>in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP</source>
          <year>2016</year>
          , Austin, Texas, USA, November 1-4
          ,
          <year>2016</year>
          , pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Valerie</given-names>
            <surname>Sessions</surname>
          </string-name>
          and Marco Valtorta, '
          <article-title>The effects of data quality on machine learning algorithms'</article-title>
          ,
          <source>in Proceedings of the 11th International Conference on Information Quality</source>
          , MIT, Cambridge, MA, USA, November 10-12,
          <year>2006</year>
          , eds., John R. Talburt,
          <string-name>
            <given-names>Elizabeth M.</given-names>
            <surname>Pierce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ningning</given-names>
            <surname>Wu</surname>
          </string-name>
          , and Traci Campbell, pp.
          <fpage>485</fpage>
          -
          <lpage>498</lpage>
          . MIT, (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , '
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding'</article-title>
          ,
          <source>in 7th International Conference on Learning Representations, ICLR</source>
          <year>2019</year>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew, '
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing'</article-title>
          ,
          <source>ArXiv</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Shijie</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dredze</surname>
          </string-name>
          , '
          <article-title>Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT'</article-title>
          ,
          <source>in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP</source>
          <year>2019</year>
          , pp.
          <fpage>833</fpage>
          -
          <lpage>844</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Yukun</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, '
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books'</article-title>
          ,
          <source>in 2015 IEEE International Conference on Computer Vision</source>
          , ICCV, pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>