<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Language Transformer Adaptation for Frequently Asked Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Di Liello</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Bonadiman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Giannone</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Favalli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raniero Romagnoli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Moschitti</string-name>
          <email>alessandro.moschitti@unitn.it</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Transfer learning has proven effective, especially when data for the target domain/task is scarce. Sometimes data for a similar task is only available in another language because it may be very specific. In this paper, we explore the use of machine-translated data to transfer models on a related domain. Specifically, we transfer models from the question duplication task (QDT) to similar FAQ selection tasks. The source domain is the well-known English Quora dataset, while the target domain is a collection of small Italian datasets for real-world scenarios consisting of FAQ groups retrieved by pivoting on common answers. Our results show great improvements in the zero-shot learning setting and modest improvements using the standard transfer approach for direct in-domain adaptation<sup>1</sup>.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Frequently Asked Question (FAQ) websites are an
essential service for users’ self-assistance. FAQ
websites typically present a list of questions, each
associated with an answer. When searching for
information, users have to go through the FAQs to
determine whether there is a similar question
providing a solution to their problem. However, this
process does not scale well when the number of
FAQs increases since too many questions may be
presented to the user, and a simple search by the
query may not retrieve the desired results.
Additionally, in the last decade, users started looking
for information using smartphones and voice
assistants, such as Alexa, Google Assistant, or Siri.
<fn><p>Work done prior to joining Amazon.</p></fn>
<fn><label>1</label><p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p></fn>
By design, voice assistants provide users with a
different information access paradigm: the FAQ
websites’ navigation service is replaced by
natural language dialogues, which satisfy the users’
information needs in a few interactions. To achieve
this goal, FAQ retrieval systems need to
understand the question and present the user only with a
set of strong candidates. One possible solution
offered by personal assistants consists of (i) a
FAQ retrieval system
        <xref ref-type="bibr" rid="ref3">(Caputo et al., 2016)</xref>
        for
efficiently finding relevant questions, and (ii) accurate
neural models to select the most probable FAQ.
      </p>
      <p>One of the major obstacles to building such
a system is the availability of training data for
the selection model. FAQ systems are
domain-specific in nature since they aim to provide users
with information about specific websites or
services. Moreover, the industrial setting does not
always allow for creating a large corpus of questions
for any specific domain, as the customers (the FAQ
owners) typically cannot provide such data. There
are several reasons: (i) they are not familiar with
the process of training data creation, as it is not
part of their business; (ii) the topic of the FAQ
system does not require more than tens of
question/solution pairs; (iii) it is not easy to generate
a dataset for question-question similarity from a
question-answer system.</p>
      <p>
        A traditional approach to alleviating this
problem is to use transfer learning (TL), i.e., data
from other domains/tasks is used to train a model
for the target task. TL research has been boosted
by the availability of pre-trained
transformer-based models
        <xref ref-type="bibr" rid="ref14 ref4">(Vaswani et al., 2017; Devlin et al.,
2018)</xref>
        , which provide general-purpose language
models. In this paper, we approach the problem
of FAQ selection by fine-tuning pre-trained language
models on the Question Duplication Task (QDT)
from Quora<fn><label>2</label><p>https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs</p></fn>. This task aims to identify whether
two questions are duplicates, i.e.,
semantically equivalent
        <xref ref-type="bibr" rid="ref1">(Androutsopoulos and
Malakasiotis, 2010)</xref>
        .
        [Table 1: examples taken from the QDT and FAQ datasets, each labeled True or False.]
      </p>
      <p>Although the FAQ selection task shares some
commonalities with QDT, the two tasks are
different. A FAQ task can indeed be solved by
ranking all the FAQs in the collection using a system
that computes the semantic similarity score
between two questions, i.e., a Paraphrase
Identification model. However, there are still some crucial
differences. While QDT requires inferring whether two
questions are semantically equivalent, FAQ
selection seeks questions that share the same intent and,
at the same time, the same answer.
Moreover, FAQ selection strongly depends on
the domain in which the retrieval system is
applied. For example, if a website responds to
every technical complaint with “contact us”, there
will be many positive pairs that do not share any
real answer. Every portal in which a FAQ
similarity system is needed, e.g., online services and
e-commerce, requires a different level of detail
depending on the service type and its complexity.
Table 1 provides some examples taken from the QDT
and FAQ datasets to better illustrate the
difference.</p>
      <p>One of the largest corpora for fine-tuning
on QDT is the well-known Quora dataset, sourced
from the community question answering website
of the same name. The dataset consists of
question pairs, labeled as duplicates or not.
However, the Quora dataset is only available in
English, preventing its direct use for building
Italian systems.</p>
      <p>
        In this paper, we propose to adapt Transformer
architectures to the task of FAQ selection using
machine translation. We first translated the Quora
dataset into Italian, and then we trained a
state-of-the-art QDT model for Italian. Finally, we tested
the adapted QDT model on two FAQ datasets,
showing significant improvement over the zero-shot
learning baselines (i.e., using no target domain
training data). Moreover, we show that fine-tuning
the adapted model on small target data provides a
consistent improvement over models not
exploiting our transfer learning approach. It should be
noted that our technique can be seen as an
extension of Transfer and Adapt (TANDA)
        <xref ref-type="bibr" rid="ref6">(Garg
et al., 2019)</xref>
        , with the difference that the transfer step
is carried out on a similar, approximate task using
translated data; we call it Approximated
machine-Translated TANDA (ATTANDA).
      </p>
      <p>The rest of the paper is organized as
follows: Section 2 describes related approaches to
cross-lingual transfer learning, Section 3
provides an overview of the available datasets, and
Section 4 describes the methodology we
developed. Finally, Section 5 summarizes the main
results and Section 6 draws the conclusions of this
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Works</title>
      <p>
        The current state of the art for QDT makes use
of pre-trained transformer-based frameworks, e.g.,
BERT
        <xref ref-type="bibr" rid="ref4">(Devlin et al., 2018)</xref>
        , RoBERTa
        <xref ref-type="bibr" rid="ref11">(Liu et al.,
2019)</xref>
        or XLNet
        <xref ref-type="bibr" rid="ref16">(Yang et al., 2019)</xref>
        . These
models have millions of parameters that are trained in
a two-step approach. First, they are trained as
language models using various losses (e.g., masked
language modeling or sentence order prediction
loss) on a large corpus in an unsupervised way and
then are fine-tuned on the target labeled dataset.
      </p>
      <p>In Transfer Learning, a model is transferred
(i.e., trained) on data coming from a high-resource
task and is then adapted to another, usually more
specific, task. All Transformer-based models can
be seen as Transfer Learning models: they are
first trained on large corpora of unlabelled data
and then are specialized on a downstream domain.
Nonetheless, there are scenarios where data from
similar tasks can further improve already strong
models.</p>
      <p>
        Cross-lingual transfer learning (CLTL) is an
extension in which data from a high-resource
language is used to solve a low-resource language
task. This technique is sometimes used in
combination with Cross-Lingual Word Embeddings (CLWE)
alignment. The current trend is to align word
embeddings to focus only on shared
language-independent features and then apply Transfer
Learning techniques
        <xref ref-type="bibr" rid="ref10 ref9">(Lange et al., 2020; Keung
et al., 2020)</xref>
        . However, solving a task using data
coming from a similar one has different
requirements.
      </p>
      <p>
        An approach similar to ours has been explored by
        <xref ref-type="bibr" rid="ref13">(Schuster et al., 2019)</xref>
        , who used
multilingual data to improve performance on
low-resource languages. However, even though they used
translated data, they did not explore applying the
transferred model to a related task. Another
approach
        <xref ref-type="bibr" rid="ref5">(Do and Gaspers, 2019)</xref>
        filters high-quality
samples from a high-resource language dataset to
train the model in reduced time. The authors report a
significant improvement in the target language and
task, even with only a small amount of
computation.
      </p>
      <p>
        In
        <xref ref-type="bibr" rid="ref8">(Joty et al., 2017)</xref>
        , the authors improve
performance in question-question similarity by
using an adversarial approach. Thanks to adversarial
training, they extract language-independent
features from a model trained with supervision on
a high-resource language and adapted to a
low-resource one for testing. Results show important
improvements in the target language, even in the
zero-shot setting.
      </p>
      <p>
        Also, in
        <xref ref-type="bibr" rid="ref15">(Wang et al., 2020)</xref>
        , a complete
overview of the common approaches to
cross-lingual transfer learning (CLTL) is presented.
The authors start by comparing (i) joint training, in which
a model is trained on multilingual data using both
a monolingual and a cross-lingual loss, and (ii)
CLWE alignment before training, in which
language embeddings are mapped to a shared space
before fine-tuning. They find that both
methods perform well and that there is no overall
winner. Finally, they show that combining both
approaches outperforms previous state-of-the-art
methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Datasets</title>
      <sec id="sec-3-1">
        <title>3.1 Quora Question Pairs</title>
        <p>The Quora dataset is a collection of question pairs
for QDT. It contains many semantically
equivalent questions that people asked more than once,
for example, “What is the most populous state in
the USA?” and “Which state in the United States
has the most people?”. Human experts have
assigned the labels; therefore, the dataset is not free from
subjective decisions and questionable labels. It
contains about 404K question pairs, 37% with a
positive label and 63% with a negative one.
The dataset also has inconsistencies: many ids are
used more than once (14K), and many questions
are referred to by more than a single id (76K).</p>
        <p><bold>3.2 FAQ: RDC and LCN</bold></p>
        <p>
RDC and LCN are two real-world datasets for FAQ
retrieval. They were designed to build the QA
component of conversational agent systems in
Italian, targeting specific domains. Neither dataset
is ready for FAQ retrieval out of the box, so we
needed to regroup the questions. Given that
many questions share a common answer in RDC,
we created several examples for the FAQ selection
task by clustering questions with respect to the
answers. For LCN, since the answers were simply
the name of the category in which an answer could
be found, we pivoted on the categories to create
the clusters.</p>
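        <p>The grouping step can be sketched as follows. This is a minimal illustration, assuming each entry is a (question, pivot) pair where the pivot is either the answer or the category; the function name <monospace>cluster_by_pivot</monospace> and the toy entries are hypothetical and are not taken from RDC or LCN.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def cluster_by_pivot(entries):
    """Group questions that share the same pivot (an answer or a category)."""
    clusters = defaultdict(list)
    for question, pivot in entries:
        clusters[pivot].append(question)
    return dict(clusters)


# Hypothetical toy entries: (question, pivot) pairs, not taken from RDC or LCN.
entries = [
    ("How do I reset my password?", "account"),
    ("I forgot my password, what should I do?", "account"),
    ("Where can I download my invoice?", "billing"),
]
clusters = cluster_by_pivot(entries)
```
]]></preformat>
        <p>Each resulting cluster collects the questions that should be treated as equivalent for FAQ selection.</p>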
        <p>To build the examples, we first built clusters of
equivalent questions, using their similarity gold
standard labels, i.e., the answers or the
categories. LCN consists of 388 questions, which
we grouped into 24 clusters of different sizes. The
smallest contains only two elements, while the
largest contains 50 elements. RDC contains 369
entries, which we grouped into 30 clusters with a
minimum and maximum size of 1 and 37,
respectively.<sup>3</sup></p>
        <p>Our tests show that LCN is the hardest dataset.
The reason is that its clusters were not built
by pivoting on the answers but on the category
instead (answers were not available). As a result, each
cluster contains questions that do not share a precise
answer but rather the same category.</p>
        <p><fn><label>3</label><p>There is an Italian FAQ dataset called QA4FAQ, but it is
not suitable for question similarity since annotations for the
dataset are not available. http://qa4faq.github.io</p></fn>
The transformation of a set of clusters into a
training or test set was done with the following
algorithm: N times, an element from each cluster
was chosen, called the champion, and was temporarily
removed from its cluster. Each champion was then
paired with a random element from every cluster,
assigning a positive label when the two
belonged to the same cluster. We found that N = 5
was a reasonable number of rounds since more
would have led to repeated information.</p>
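        <p>A minimal sketch of this pairing procedure is given below. It assumes that clusters are stored as a dictionary of non-empty lists and that a singleton cluster simply yields no positive pair once its champion is removed; the name <monospace>build_pairs</monospace>, the seed handling, and the toy clusters are illustrative assumptions rather than the exact implementation used for the experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def build_pairs(clusters, rounds=5, seed=0):
    """Pair one champion per cluster with a random element of every cluster."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(rounds):
        for cluster_id, members in clusters.items():
            # Pick the champion and temporarily remove it from its cluster.
            index = rng.randrange(len(members))
            champion = members[index]
            for other_id, other_members in clusters.items():
                if other_id == cluster_id:
                    candidates = members[:index] + members[index + 1:]
                else:
                    candidates = other_members
                if not candidates:  # singleton cluster: no positive partner
                    continue
                partner = rng.choice(candidates)
                pairs.append((champion, partner, other_id == cluster_id))
    return pairs


# Hypothetical toy clusters; the real ones come from RDC and LCN.
toy_clusters = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
pairs = build_pairs(toy_clusters, rounds=5)
```
]]></preformat>
        <p>With K clusters, each round produces at most K × K pairs, of which at most K are positive, so the resulting sets are dominated by negative examples.</p>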
        <p>Moreover, we needed to create both small
training and test sets to measure the models’
performance when fine-tuned on the FAQ domain. We
could not simply split the pairs generated as described above, since
the training and test sets would have had many
sentences in common. To achieve a clean
separation, 70% of the clusters were used to create the training
set, while the remaining 30% were used for the test
set.</p>
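        <p>The cluster-level split can be sketched as follows. This is an illustration under the assumption that clusters are shuffled before the 70/30 cut; the function name <monospace>split_clusters</monospace>, the seed, and the toy data are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def split_clusters(clusters, train_fraction=0.7, seed=0):
    """Split at the cluster level so no question appears in both sets."""
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    cut = round(len(cluster_ids) * train_fraction)
    train = {cid: clusters[cid] for cid in cluster_ids[:cut]}
    test = {cid: clusters[cid] for cid in cluster_ids[cut:]}
    return train, test


# Hypothetical toy clusters with two questions each.
toy_clusters = {f"c{i}": [f"q{i}a", f"q{i}b"] for i in range(10)}
train_clusters, test_clusters = split_clusters(toy_clusters)
```
]]></preformat>
        <p>Because whole clusters are assigned to one side, pairs generated afterwards from each side cannot share a question.</p>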
      </sec>
      <sec id="sec-3-2">
        <title>3.3 FAQ: ItaFAQ</title>
        <p>We built a small FAQ dataset in Italian by
scraping popular websites. Then, we asked 10
people with different backgrounds and levels
of education to create additional questions similar
to those automatically collected. The specific
request was to create questions that would have
the same or a similar answer. The dataset is
released as open source and is available for
download<sup>4</sup>. This dataset can be useful to test an
information retrieval system. However, it is easier
to solve than the previously described RDC and
LCN. The main reasons are that (i) humans tend
to create partially related new questions, and that
(ii) general FAQ datasets about well-known
companies and topics are easier to process than strongly
domain-specific data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 ATTANDA Approach</title>
      <sec id="sec-4-1">
        <title>4.1 Machine Translation of Quora</title>
        <p>
          There are no medium- or large-sized Italian datasets
for QDT or FAQ retrieval; thus, we applied
machine translation. We used Microsoft Azure
Cognitive Services to translate Quora Question Pairs
into Italian. Since the original Quora dataset had
some questions repeated in different entries, we
followed the approach in
          <xref ref-type="bibr" rid="ref2 ref7">(Haponchyk et al., 2018;
Bonadiman et al., 2019)</xref>
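          and grouped all the
questions into clusters by means of the transitive property:
if a and b are the two questions of a pair with a
positive label and <inline-formula><tex-math>$C_i$</tex-math></inline-formula> is a cluster, then <inline-formula><tex-math>$a \in C_i \Leftrightarrow b \in C_i$</tex-math></inline-formula>.
Moreover, if there is a tuple <inline-formula><tex-math>$(a, b)$</tex-math></inline-formula> with a positive
label and <inline-formula><tex-math>$a \in C_i,\ b \in C_j$</tex-math></inline-formula>, then <inline-formula><tex-math>$C_i$</tex-math></inline-formula> and <inline-formula><tex-math>$C_j$</tex-math></inline-formula> are
merged into <inline-formula><tex-math>$C_k = C_i \cup C_j$</tex-math></inline-formula>.
<fn><label>4</label><p>The dataset can be downloaded at https://github.com/lucadiliello/italian-faq-dataset</p></fn>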
        </p>
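        <p>The transitive merging described above is equivalent to computing connected components over the positive pairs. The sketch below uses a union-find structure for this; it is one possible way to realize the rule, not necessarily the procedure used in the cited works, and the function name <monospace>cluster_questions</monospace> and the toy pairs are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def cluster_questions(positive_pairs):
    """Merge questions into clusters: a and b share a cluster if linked
    by a chain of positive pairs (transitive closure via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in positive_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge C_i and C_j into one cluster

    clusters = {}
    for question in parent:
        clusters.setdefault(find(question), set()).add(question)
    return list(clusters.values())


# Hypothetical positive pairs: q1..q4 end up together, q5 and q6 together.
toy_pairs = [("q1", "q2"), ("q3", "q4"), ("q2", "q3"), ("q5", "q6")]
merged = cluster_questions(toy_pairs)
```
]]></preformat>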
        <p>After that, we translated all the questions of
the clusters with at least two members. This
allowed us to effectively reduce machine
translation costs because we avoided translating
questions that would have appeared only in negative
pairs (millions of negative pairs can be easily
generated by randomly picking questions from
different clusters). We built the transfer dataset by
labeling (i) all pairs of questions in the same cluster
as positive examples; and (ii) randomly sampled
pairs with members from different clusters as
negative examples. We limited the number of the
latter to be equal to the number of positive
examples.</p>
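        <p>The construction of the balanced transfer dataset can be sketched as follows. This is an illustration that assumes clusters are disjoint sets of question strings; the name <monospace>build_transfer_dataset</monospace>, the sampling strategy for negatives, and the toy clusters are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import itertools
import random


def build_transfer_dataset(clusters, seed=0):
    """Return (question_a, question_b, label) triples with balanced labels."""
    rng = random.Random(seed)
    # Only clusters with at least two members yield positive pairs.
    clusters = [sorted(c) for c in clusters if len(c) >= 2]
    positives = [
        (a, b, 1) for c in clusters for a, b in itertools.combinations(c, 2)
    ]
    # Number of distinct cross-cluster pairs that exist at all.
    total = sum(len(c) for c in clusters)
    cross_pairs = total * (total - 1) // 2 - len(positives)
    target = min(len(positives), cross_pairs)
    negatives = set()
    while len(negatives) < target:
        i, j = rng.sample(range(len(clusters)), 2)  # two different clusters
        a, b = rng.choice(clusters[i]), rng.choice(clusters[j])
        negatives.add((min(a, b), max(a, b)))
    return positives + [(a, b, 0) for a, b in sorted(negatives)]


# Hypothetical toy clusters standing in for the translated Quora clusters.
toy_clusters = [{"a1", "a2", "a3"}, {"b1", "b2"}, {"c1", "c2"}, {"d1"}]
dataset = build_transfer_dataset(toy_clusters)
```
]]></preformat>
        <p>Singleton clusters such as the last one in the toy input are dropped up front, mirroring the choice of translating only questions that belong to clusters with at least two members.</p>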
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Transformer architectures</title>
        <p>To reach the highest performance, we built
our models on current state-of-the-art architectures for QA.
We took into consideration:</p>
        <list list-type="bullet">
          <list-item>
            <p>Multilingual BERT (mBERT), a BERT
model trained on the 104 largest Wikipedias
in terms of the number of articles. The model
contains 177M<fn><label>5</label><p>mBERT has a bigger size since its vocabulary is
considerably larger than that of monolingual models.</p></fn> parameters and has 12
transformer layers
              <xref ref-type="bibr" rid="ref4">(Devlin et al., 2018)</xref>;</p>
          </list-item>
          <list-item>
            <p>Italian BERT<fn><label>6</label><p>Italian BERT models and code are available at https://github.com/dbmdz/berts</p></fn>, a BERT model trained only
on Italian text. The version we used was
trained over the concatenation of the OSCAR
corpus and the Italian OPUS corpus, for a
total of 81GB of text. This model features a
total of 110M parameters on 12 layers;</p>
          </list-item>
          <list-item>
            <p>GilBERTo<fn><label>7</label><p>GilBERTo models and code are available at https://github.com/idb-ita/GilBERTo</p></fn>, a RoBERTa model trained over
71GB of lowercase Italian text extracted from
the OSCAR corpus. The authors state that
this model applies masking to whole words
(WWM), as in
              <xref ref-type="bibr" rid="ref12">(Martin et al., 2020)</xref>
, instead of masking at the sub-word level, as in the
original BERT. This model has a total of
111M parameters.</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-4-3">
        <p>
          [Figure: transfer learning performance on the ItaFAQ validation set; train on LCNtrain and test on LCNtest; x-axis: Steps.]
        </p>
        <p>
We aim at exploiting data similar to the target task,
which may also come from a different language,
to train models for our FAQ target task. Our
approach can be seen as an extension of TANDA
by
          <xref ref-type="bibr" rid="ref6">(Garg et al., 2019)</xref>
          , which consists of a two-step
fine-tuning. First, the model is transferred to a
general QA task with a large dataset, and then
it is adapted to a smaller and more specific QA
benchmark such as WikiQA. The authors showed that a
transfer step can improve the final performance
if the source and target tasks are similar. We
extend this idea by creating our transfer dataset
with machine translation, as described before.
We call our approach ATTANDA (Approximated
machine-Translated TANDA).
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Results</title>
      <p>This section shows the results of testing different
models on the FAQ retrieval task. We use
Precision at 1 (P@1), which is equal to accuracy, as we
mainly need to measure if the returned FAQ is
correct. LCNtrain, LCNtest, RDCtrain and RDCtest
are the names of the splits of LCN and RDC
derived by dividing the set of clusters.</p>
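      <p>For reference, P@1 can be computed as in the sketch below. It assumes that, for each query, the model returns a list of (score, is_correct) candidates with at least one candidate per query; the function name <monospace>precision_at_1</monospace> and the toy scores are hypothetical and do not correspond to any result reported here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_at_1(queries):
    """Fraction of queries whose top-scored candidate is a correct FAQ."""
    hits = 0
    for candidates in queries:
        best_score, best_is_correct = max(candidates, key=lambda c: c[0])
        hits += int(best_is_correct)
    return hits / len(queries)


# Hypothetical model scores: one list of (score, is_correct) per query.
queries = [
    [(0.9, True), (0.2, False)],
    [(0.4, True), (0.7, False)],
    [(0.1, False), (0.8, True)],
    [(0.6, False), (0.3, True)],
]
p_at_1 = precision_at_1(queries)
```
]]></preformat>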
      <p>We start by comparing the available
transformer-based models. Table 2 shows
that Italian BERT is better than the other models
in most tests. This is not surprising since it
is specialized in the Italian language, it takes into
account the case of the input text,
and it is trained on the largest corpus.
GilBERTo also performs well, but the improvement
brought by RoBERTa is insufficient to overcome the
smaller training set and the case-insensitive
tokenizer.</p>
      <p>Once we established that the best pre-trained
model is Italian BERT, since it shows the highest
scores in 3 comparisons out of 4, we tested
different transfer methods on the LCN and RDC splits.
We compare the performance of Italian BERT in
two scenarios: (i) the model is directly fine-tuned
on the target domain, and (ii) the model is first
transferred on Quora and then fine-tuned on the
target domain (ATTANDA). We also report the
results of the model without in-domain fine-tuning.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions</title>
      <p>We explored transfer learning in a typical
industrial scenario where little (or no) data is
available in the target language. We showed that it
is possible to use machine-translated data to
improve performance on a closely related task. We
suspect that if the tasks had been more similar,
for example, Question Answering and FAQ, the
performance gain would have been even larger.
However, this was a real-world scenario where the
target datasets were used in production on real
websites, and their size and quality were limited. In
this setting, applying a transfer phase can improve
the retrieval of similar questions, and the transfer
step is a low-cost operation compared to
pre-training.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Prodromos</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A survey of paraphrasing and textual entailment methods</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>38</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Daniele</given-names>
            <surname>Bonadiman</surname>
          </string-name>
          , Anjishnu Kumar, and
          <string-name>
            <given-names>Arpit</given-names>
            <surname>Mittal</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Large scale question paraphrase retrieval with smoothed deep metric learning</article-title>
          .
          <source>In Proceedings of the 5th Workshop</source>
          on Noisy User-generated
          <string-name>
            <surname>Text</surname>
          </string-name>
          (
          <article-title>W-NUT</article-title>
          <year>2019</year>
          ), pages
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Annalina</given-names>
            <surname>Caputo</surname>
          </string-name>
          , Marco de Gemmis, Pasquale Lops, Francesco Lovecchio,
          <source>Vito Manzari, and Acquedotto Pugliese AQP Spa</source>
          .
          <year>2016</year>
          .
          <article-title>Overview of the evalita 2016 question answering for frequently asked questions (qa4faq) task</article-title>
          .
          <source>In of the Final Workshop 7 December</source>
          <year>2016</year>
          , Naples, page
          <volume>124</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Quynh</given-names>
            <surname>Do</surname>
          </string-name>
          and
          <string-name>
            <given-names>Judith</given-names>
            <surname>Gaspers</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Cross-lingual transfer learning with data selection for large-scale spoken language understanding</article-title>
          . pages
          <fpage>1455</fpage>
          -
          <lpage>1460</lpage>
          ,
          <fpage>01</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Siddhant</given-names>
            <surname>Garg</surname>
          </string-name>
          , Thuy Vu, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Moschitti</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Tanda: Transfer and adapt pre-trained transformer models for answer sentence selection</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Haponchyk</surname>
          </string-name>
          , Antonio Uva, Seunghak Yu,
          <string-name>
            <given-names>Olga</given-names>
            <surname>Uryupina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Moschitti</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Supervised clustering of questions into intents for dialog system applications</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2310</fpage>
          -
          <lpage>2321</lpage>
          , Brussels, Belgium, October-November.
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Shafiq</given-names>
            <surname>Joty</surname>
          </string-name>
          , Preslav Nakov, Lluís Màrquez, and
          <string-name>
            <given-names>Israa</given-names>
            <surname>Jaradat</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Cross-language learning with adversarial neural networks</article-title>
          .
          <source>In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL</source>
          <year>2017</year>
          ), pages
          <fpage>226</fpage>
          -
          <lpage>237</lpage>
          , Vancouver, Canada, August. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Keung</surname>
          </string-name>
          , Yichao Lu, and
          <string-name>
            <given-names>Vikas</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and ner</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Lange</surname>
          </string-name>
          , Anastasiia Iurshina, Heike Adel, and Jannik Strötgen.
          <year>2020</year>
          .
          <article-title>Adversarial alignment of multilingual models for extracting temporal expressions from text</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Louis</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Muller</surname>
          </string-name>
          , Pedro Javier Ortiz Suárez, Yoann Dupont,
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Romary</surname>
          </string-name>
          , Éric de la Clergerie, Djamé Seddah, and
          <string-name>
            <given-names>Benoît</given-names>
            <surname>Sagot</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>CamemBERT: a tasty French language model</article-title>
          .
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Tal</given-names>
            <surname>Schuster</surname>
          </string-name>
          , Ori Ram, Regina Barzilay, and
          <string-name>
            <given-names>Amir</given-names>
            <surname>Globerson</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Cross-lingual alignment of contextual word embeddings, with applications to zeroshot dependency parsing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Zirui</given-names>
            <surname>Wang</surname>
          </string-name>
          , Jiateng Xie, Ruochen Xu,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Graham</given-names>
            <surname>Neubig</surname>
          </string-name>
          , and Jaime Carbonell.
          <year>2020</year>
          .
          <article-title>Crosslingual alignment vs joint training: A comparative study and a simple unified framework</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5753</fpage>
          -
          <lpage>5763</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>