Deep Bidirectional Transformers for Italian Question Answering

Danilo Croce, Giorgio Brandi and Roberto Basili
Department of Enterprise Engineering
University of Roma, Tor Vergata
Via del Politecnico 1, 00133 Roma
{croce,basili}@info.uniroma2.it

∗ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. Deep learning continues to achieve state-of-the-art results in several NLP tasks, such as Question Answering (QA). Unfortunately, the requirements of neural QA systems are very strict with respect to the size of the involved training datasets. Recent works show that the application of Automatic Machine Translation is an enabling factor for the acquisition of large scale QA training sets in resource-poor languages such as Italian. In this work, we show how these resources can be used to train a state-of-the-art deep architecture, based on effective techniques recently proposed within the Bidirectional Encoder Representations from Transformers (BERT) paradigm.

Italiano. Recent studies on the application of Deep Learning methods have led to important results on several Natural Language Processing problems, such as the Question Answering (QA) task. Unfortunately, the requirements of such neural QA systems are very strict with respect to the size of the datasets needed to train the more complex models. However, recent works have shown that automatic translation techniques can be applied to acquire large-scale example collections and to train neural architectures for Question Answering in languages where training data are scarce, such as Italian. In this work, we show how these resources enable the training of a highly effective neural architecture, based on the paradigm known as Bidirectional Encoder Representations from Transformers (BERT), with results that constitute the state of the art.

1 Introduction

Question Answering (QA) (Hirschman and Gaizauskas, 2001) tackles the problem of returning one or more answers to a question posed by a user in natural language, using as source a large knowledge base or, even more often, a large scale text collection: in this setting, the answers correspond to sentences (or their fragments) stored in the text collection. A typical QA process consists of three main steps: the question processing step, which aims at extracting the requirements and objectives of the user's query; the retrieval phase, where documents and sentences that include the answers are retrieved from the text collection; and the answer extraction phase, which locates the answer within the candidate sentences (Harabagiu et al., 2000; Kwok et al., 2001).

Various QA architectures have been proposed so far. Some of these rely on structured resources, such as Freebase, while others use unstructured information from sources such as Wikipedia (an example of such a system is Microsoft's AskMSR (Brill et al., 2002)) or generic Web pages, e.g. the QuASE system (Sun et al., 2015). Hybrid models exist as well, which make use of both structured and unstructured information: these include IBM's DeepQA (Ferrucci et al., 2010) and YodaQA (Baudiš and Šedivý, 2015).

In order to initialize such systems, a manually constructed and annotated dataset is crucial, from which the mapping between questions and answers can be learned. Datasets designed for systems based on structured knowledge, such as WebQuestions (Berant et al., 2013), usually contain the questions, their logical forms and the answers.
On the other side, datasets over unstructured information are usually composed of question-answer pairs: WikiMovies (Miller et al., 2016) is an example of this class and consists of a collection of texts from the movie domain. Finally, some datasets contain entire triplets made of the questions, the paragraphs and the answers, the latter expressed as specific spans of the paragraphs and thus located within them. This is the case of the recently proposed SQuAD dataset (Rajpurkar et al., 2016).

State-of-the-art approaches proposed in the literature (Chen et al., 2017; Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018) are based on neural paradigms and are often portable across different languages. Among them, the neural approach presented in (Devlin et al., 2019), besides achieving state-of-the-art results in several NLP tasks, is shown to be competitive in QA even with respect to human annotators.

Unfortunately, the limited availability of training data for languages other than English remains an important problem. Even though multilingual data collections, such as Wikipedia, exist for many languages, the portability of the corresponding annotated resources for supervised learning algorithms remains limited: large-scale annotated data mostly exist only for the English language.

Recent works show that the application of Automatic Machine Translation enables the acquisition of large corpora for QA in resource-poor languages such as Italian (Croce et al., 2018; Croce et al., 2019). As a result, SQuAD-IT, a large scale dataset made of about 50,000 question/answer pairs, has been made available. It was not fully manually validated, but it still represents a valuable resource for training neural approaches.

In this work, we show how these resources enable the training of a recent and promising deep neural architecture, based on the effective techniques recently proposed within the Bidirectional Encoder Representations from Transformers (BERT) paradigm (Vaswani et al., 2017; Devlin et al., 2019). The experimental evaluation carried out on SQuAD-IT confirms the impressive results of BERT even in Italian QA, providing state-of-the-art results which are far higher than those of previous methods.

In the rest of the paper, Section 2 introduces the BERT architecture for QA, Section 3 reports the experimental evaluation, and Section 4 draws some conclusions.

2 Bidirectional Encoder Representations for QA

In the field of computer vision, researchers have repeatedly shown the beneficial contribution of transfer learning, i.e., pre-training a neural network model on a known task, for instance image classification over the ImageNet dataset, and then fine-tuning the trained network as the basis of a new purpose-specific model, e.g., (Girshick et al., 2013).

The approach proposed in (Devlin et al., 2019), namely Bidirectional Encoder Representations from Transformers (BERT), provides a very effective model to pre-train a deep and complex neural network over very large scale collections of unannotated texts and to apply it to a large variety of NLP tasks, simply extending it to each new problem by fine-tuning the entire architecture.

The building block of BERT is the Transformer, an attention-based mechanism that learns contextual relations between words (or sub-words, i.e. word pieces (Schuster and Nakajima, 2012)) in a text. In its original form, proposed in (Vaswani et al., 2017), the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the targeted Machine Translation task.

In line with (Peters et al., 2018), BERT aims at providing a sentence embedding, as well as the contextualized embeddings of each word composing the sentence: the pre-training stage aims at acquiring an expressive and robust language model, and only the encoder of the Transformer is used.
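As a purely illustrative sketch, the snippet below shows how such contextualized word and sentence embeddings can be obtained from a pre-trained multilingual BERT encoder. It assumes the Hugging Face transformers library and the public bert-base-multilingual-cased checkpoint, which are assumptions of this example and not the tooling used in the experiments described here.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: the Hugging Face "transformers" library; the original BERT
# release by Google exposes the same model through different code.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

inputs = tokenizer("Roma è la capitale d'Italia.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized embedding per word piece, including [CLS] and [SEP].
token_embeddings = outputs.last_hidden_state        # shape: (1, seq_len, 768)
# The embedding of [CLS] is commonly used as a sentence-level representation.
sentence_embedding = outputs.last_hidden_state[:, 0, :]
```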
As shown in Figure 1 (on the left), the Transformer encoder reads the entire sequence of words at once and acquires a language model by reconstructing the original sentence according to a masked language model (MLM) pre-training objective: the MLM randomly masks some of the tokens of the input, and the objective is to predict the original masked words based only on their context. In addition to the masked language model, BERT also uses a next sentence prediction task that jointly pre-trains text-pair representations. This last objective is crucial to improve the capability of the network to model relational information between text pairs, which is particularly important in tasks such as QA in order to relate an answer to a question.

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

After the language model is trained over a generic document collection, the BERT architecture allows encoding (i) specific words belonging to a sentence, (ii) the entire sentence and (iii) sentence pairs with dedicated embeddings. These can be given as input to further deep architectures to solve sentence classification, sequence labeling or relational learning tasks: fine-tuning is applied by adding simple task-specific layers on top of the architecture that acquired the language model. In a nutshell, such a layer introduces minimal task-specific parameters, and the model is trained on the targeted task by simply fine-tuning all pre-trained parameters, optimizing the performance on the specific problem.
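To make the MLM objective concrete, the following minimal sketch mimics the masking step described above. The 80/10/10 split among [MASK], random and unchanged tokens follows the recipe in (Devlin et al., 2019), while the function and variable names are illustrative and not taken from any reference implementation.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Corrupt a token sequence for MLM pre-training.

    Each token is selected with probability mask_prob; a selected token is
    replaced by [MASK] 80% of the time, by a random vocabulary token 10% of
    the time, and left unchanged 10% of the time. The model must recover
    the original token at every selected position.
    """
    corrupted, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            targets.append(token)            # position to be predicted
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(token)
        else:
            targets.append(None)             # no loss at this position
            corrupted.append(token)
    return corrupted, targets
```

During pre-training, the cross-entropy loss is computed only at the positions where targets is not None, so the network cannot simply copy the unmasked input.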
The straightforward application of BERT has shown better results than previous state-of-the-art models on a wide spectrum of natural language processing tasks. One of the most impressive results was achieved on the Question Answering task proposed by (Rajpurkar et al., 2016): given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage. An example of a paragraph, showing the Wikipedia answer to the question "What was Marie Curie the first female recipient of?", is reported in Figure 2. This specific task originated the Stanford Question Answering Dataset (SQuAD), a collection of 100k crowd-sourced question/answer pairs.

Figure 2: An example of the SQuAD dataset (Rajpurkar et al., 2016).

The fine-tuning process of BERT in the QA task (shown on the right side of Figure 1) requires encoding the input question and passage as a generic text pair, such as the ones used by the next sentence prediction task in the initial training stages.

In order to determine the correct span for the answer, (Devlin et al., 2019) introduces, on top of the embeddings encoding the words of the question/passage pair, a so-called start vector S ∈ R^H (with H the dimensionality of the embedding T_i produced for each word piece i) and an end vector E ∈ R^H. The probability of word i being the start of the answer span is then computed as the dot product between the associated embedding T_i and S, followed by a softmax layer over all the words in the paragraph:

P_i = e^{S·T_i} / Σ_j e^{S·T_j}

The analogous formula, with E in place of S, is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span with j ≥ i is used as the prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. This fine-tuning of BERT achieved state-of-the-art results in the official benchmarking campaign related to SQuAD and, most noticeably, its accuracy is comparable to the one observed for human annotators¹.

¹ https://rajpurkar.github.io/SQuAD-explorer/
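The span selection just described is simple enough to be summarized in a few lines. The sketch below assumes token_embeddings holds the contextualized embeddings T_i of the paragraph tokens, while start_vector and end_vector play the roles of S and E; the names and the cap on the span length are illustrative choices of this example, not prescriptions from (Devlin et al., 2019).

```python
import torch

def best_answer_span(token_embeddings, start_vector, end_vector, max_len=30):
    """Return the span (i, j), with j >= i, maximizing S·T_i + E·T_j.

    token_embeddings: (seq_len, H) tensor of contextualized embeddings T_i.
    start_vector, end_vector: the learned vectors S, E in R^H.
    max_len: a common practical cap on answer length, not part of the formula.
    """
    start_scores = token_embeddings @ start_vector    # S·T_i for every i
    end_scores = token_embeddings @ end_vector        # E·T_j for every j

    # P_i = softmax over the paragraph of S·T_i (and analogously for the end).
    start_probs = torch.softmax(start_scores, dim=0)
    end_probs = torch.softmax(end_scores, dim=0)

    seq_len = token_embeddings.size(0)
    best_span, best_score = (0, 0), float("-inf")
    for i in range(seq_len):
        for j in range(i, min(i + max_len, seq_len)):
            score = start_scores[i] + end_scores[j]   # S·T_i + E·T_j
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, start_probs, end_probs
```

Since paragraphs are short, the quadratic enumeration of candidate spans is inexpensive in practice.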
It is worth noting that no bias towards the input language exists, so that the language model underlying BERT can be acquired over any text collection, independently of the input language. As a consequence, a pre-trained model acquired over documents written in more than one hundred languages exists. It is applied in the next section to train and evaluate such a QA model over a dataset of examples in Italian.

3 Experimental Evaluation

In order to assess the applicability of the BERT architecture to the targeted QA task, a multilingual pre-trained model has been downloaded²: in particular, this model has been acquired over documents written in one hundred languages, it is composed of 12 layers of Transformers and it associates each input token with a word embedding of 768 dimensions. For consistency with (Devlin et al., 2019), 5 epochs have been considered to fine-tune the model.

² https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip

We trained the architecture over SQuAD-IT³, a dataset made available by (Croce et al., 2019). This dataset includes more than 50,000 question/paragraph pairs obtained by automatically translating the original SQuAD dataset. The figures of the resulting dataset are reported in Table 1, together with a comparison with the original SQuAD in English.

³ https://github.com/crux82/squad-it

Element      Training set                   Test set
             English   Italian   Percent.   English   Italian   Percent.
Paragraphs   18,896    18,506    97.9%      2,067     2,010     97.2%
Questions    87,599    54,159    61.8%      10,570    7,609     72.0%
Answers      87,599    54,159    61.8%      34,726    21,489    61.9%

Table 1: The quantities of the elements of the final dataset obtained by translating the SQuAD dataset, with the percentage of material w.r.t. the original dataset. The Italian test set was obtained from the English development set, since the English test set is not publicly available.

The parameters of the neural network were set equal to those of the original work, including the word embedding resources. Two evaluation metrics are used: the exact string match (EM) and the F1 score, which measures the weighted average of precision and recall at the token level. EM is the stricter measure, evaluated as the percentage of answers perfectly retrieved by the system, i.e. the text extracted from the span produced by the system is exactly the same as the gold standard. The adopted token-based F1 score smooths this constraint by measuring the overlap (the number of shared tokens) between the provided answer and the gold standard.
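A minimal sketch of how the two metrics can be computed for a single prediction is reported below. The whitespace tokenization and the lack of the answer normalization steps (lower-casing, punctuation and article stripping) performed by the official SQuAD evaluation script are simplifications of this example.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # EM: 1 if the predicted span is exactly the gold answer, 0 otherwise.
    return int(prediction.strip() == gold.strip())

def token_f1(prediction: str, gold: str) -> float:
    # F1: harmonic mean of precision and recall over the shared tokens.
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    shared = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are then obtained by averaging these per-answer values over the whole test set.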
Performances are reported in Table 2, together with the results achieved by a variant of the DrQA system (Chen et al., 2017) evaluated against the same SQuAD-IT dataset, as reported in (Croce et al., 2019). The improvements are impressive, as both EM and F1 are improved by more than 10%. Moreover, these results are in line with the impact of BERT on the original English dataset. In the final version of this paper we will provide an in-depth comparison between DrQA and BERT.

      DrQA-IT   BERT-IT
EM    56.1      64.96
F1    65.9      75.95

Table 2: Results of BERT-IT over the SQuAD-IT dataset.

4 Conclusions

This paper explores the application of Bidirectional Encoder Representations to the QA task in Italian, enabled by the recent availability of a large-scale annotated corpus, SQuAD-IT. The experimental results confirm the robustness of the adopted Transformer-based architecture, with a significant improvement with respect to earlier neural architectures. This result paves the way to the development of portable, robust and accurate neural models for QA in Italian, and future work will certainly consider other possible extensions of the adopted model.

References

Petr Baudiš and Jan Šedivý. 2015. Modeling of the Question Answering Task in the YodaQA System. In Josanne Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth Jones, Eric San Juan, Linda Capellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 222–228, Cham. Springer International Publishing.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544. ACL.

Eric Brill, Susan Dumais, and Michele Banko. 2002. An Analysis of the AskMSR Question-Answering System. In Proceedings of EMNLP 2002.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855, Melbourne, Australia. Association for Computational Linguistics.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in Italian. In Chiara Ghidini, Bernardo Magnini, Andrea Passerini, and Paolo Traverso, editors, AI*IA 2018 – Advances in Artificial Intelligence, pages 389–402, Cham. Springer International Publishing.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2019. Enabling deep learning for large scale question answering in Italian. Intelligenza Artificiale, 13(1):49–61.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59–79.

Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524.

Sanda M. Harabagiu, Dan I. Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan C. Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. FALCON: Boosting knowledge for answer engines. In Proceedings of the Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000.

L. Hirschman and R. Gaizauskas. 2001. Natural language question answering: the view from here. Natural Language Engineering, 7(4):275–300.

Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. In WWW, pages 150–161.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR, abs/1606.05250.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pages 5149–5152.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang. 2015. Open domain question answering via semantic enrichment. In WWW.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.