Deep Bidirectional Transformers for Italian Question Answering

Danilo Croce, Giorgio Brandi and Roberto Basili
Department of Enterprise Engineering
University of Roma, Tor Vergata
Via del Politecnico 1, 00133 Roma
{croce,basili}@info.uniroma2.it

∗ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. Deep learning continues to achieve state-of-the-art results in several NLP tasks, such as Question Answering (QA). Unfortunately, the requirements of neural QA systems are very strict with respect to the size of the involved training datasets. Recent works show that the application of Automatic Machine Translation is an enabling factor for the acquisition of large scale QA training sets in resource-poor languages such as Italian. In this work, we show how these resources can be used to train a state-of-the-art deep architecture, based on effective techniques recently proposed within the Bidirectional Encoder Representations from Transformers (BERT) paradigm.

Italiano. Recent studies on the application of Deep Learning methods have led to important results on several Natural Language Processing problems, such as the Question Answering (QA) task. Unfortunately, the requirements of such neural QA systems are very strict with respect to the size of the datasets needed to train the more complex models. However, recent works have shown that automatic translation techniques can be applied to acquire large-scale example collections and to train neural architectures for Question Answering in languages where training data are scarce, such as Italian. In this work, we show how these resources enable the training of a highly effective neural architecture, based on the paradigm known as Bidirectional Encoder Representations from Transformers (BERT), with results that constitute the state of the art.

1 Introduction

Question Answering (QA) (Hirschman and Gaizauskas, 2001) tackles the problem of returning one or more answers to a question posed by a user in natural language, using as source a large knowledge base or, even more often, a large scale text collection: in this setting, the answers correspond to sentences (or their fragments) stored in the text collection. A typical QA process consists of three main steps: the question processing step, which aims at extracting the requirements and objectives of the user's query; the retrieval phase, where documents and sentences that include the answers are retrieved from the text collection; and the answer extraction phase, which locates the answer within the candidate sentences (Harabagiu et al., 2000; Kwok et al., 2001).

Various QA architectures have been proposed so far. Some of these rely on structured resources, such as Freebase, while others use unstructured information from sources such as Wikipedia (an example of such a system is Microsoft's AskMSR (Brill et al., 2002)) or generic Web pages, e.g. the QuASE system (Sun et al., 2015). Hybrid models exist as well, which make use of both structured and unstructured information: these include IBM's DeepQA (Ferrucci et al., 2010) and YodaQA (Baudiš and Šedivý, 2015).

In order to initialize such systems, a manually constructed and annotated dataset is crucial, from which the mapping between questions and answers can be learned. Datasets designed for systems based on structured knowledge, such as WebQuestions (Berant et al., 2013), usually contain the questions, their logical forms and the answers.
On the other side, datasets over unstructured information are usually composed of question-answer pairs: WikiMovies (Miller et al., 2016) is an example of this class and consists of a collection of texts from the movie domain. Finally, some datasets contain entire triplets made of the questions, the paragraphs and the answers, the latter expressed as specific spans of the paragraphs and thus located within them. This is the case of the recently proposed SQuAD dataset (Rajpurkar et al., 2016).

State-of-the-art approaches proposed in the literature (Chen et al., 2017; Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018) are based on neural paradigms and are often portable across different languages. Among them, the neural approach presented in (Devlin et al., 2019), besides achieving state-of-the-art results in several NLP tasks, is shown to be competitive in QA even with respect to human annotators.

Unfortunately, the limited availability of training data for languages other than English remains an important problem. Even though multilingual data collections, such as Wikipedia, exist for many languages, the portability of the corresponding annotated resources for supervised learning algorithms remains limited: large-scale annotated data mostly exist only for the English language.

Recent works show that the application of Automatic Machine Translation enables the acquisition of large corpora for QA in resource-poor languages such as Italian (Croce et al., 2018; Croce et al., 2019). As a result, SQuAD-IT, a large scale dataset made of about 50,000 question/answer pairs, has been made available. It was not fully manually validated, but it still represents a valuable resource for training neural approaches.

In this work, we show how these resources enable the training of a recent and promising deep neural architecture, based on the effective techniques recently proposed within the Bidirectional Encoder Representations from Transformers (BERT) paradigm (Vaswani et al., 2017; Devlin et al., 2019). The experimental evaluation carried out on SQuAD-IT confirms the impressive results of BERT even in Italian QA, providing state-of-the-art results which are far higher than those of previous methods.

In the rest of the paper, Section 2 introduces the BERT architecture for QA, Section 3 reports the experimental evaluation, and Section 4 draws some conclusions.

2 Bidirectional Encoder Representations for QA

In the field of computer vision, researchers have repeatedly shown the beneficial contribution of transfer learning, i.e., pre-training a neural network model on a known task, for instance image classification over the ImageNet dataset, and then fine-tuning the trained network as the basis of a new purpose-specific model, e.g., (Girshick et al., 2013).

The approach proposed in (Devlin et al., 2019), namely Bidirectional Encoder Representations from Transformers (BERT), provides a very effective model to pre-train a deep and complex neural network over very large scale collections of unannotated texts and to apply it to a large variety of NLP tasks, simply extending it to each new problem by fine-tuning the entire architecture.

The building block of BERT is the Transformer, an attention-based mechanism that learns contextual relations between words (or sub-words, i.e. word pieces (Schuster and Nakajima, 2012)) in a text. In its original form, proposed in (Vaswani et al., 2017), the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the targeted Machine Translation task.

In line with (Peters et al., 2018), BERT aims at providing a sentence embedding, as well as the contextualized embeddings of each word composing the sentence: the pre-training stage aims at acquiring an expressive and robust language model, and only the encoder of the Transformer is used.
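As a purely illustrative sketch, the snippet below shows how such contextualized word and sentence embeddings can be obtained from a pre-trained multilingual BERT encoder. It assumes the Hugging Face transformers library and the public bert-base-multilingual-cased checkpoint, which are assumptions of this example and not the tooling used in the experiments described here.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: the Hugging Face "transformers" library; the original BERT
# release by Google exposes the same model through different code.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

inputs = tokenizer("Roma è la capitale d'Italia.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized embedding per word piece, including [CLS] and [SEP].
token_embeddings = outputs.last_hidden_state        # shape: (1, seq_len, 768)
# The embedding of [CLS] is commonly used as a sentence-level representation.
sentence_embedding = outputs.last_hidden_state[:, 0, :]
```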
As shown in Figure 1 (on the left), the Transformer encoder reads the entire sequence of words at once and acquires a language model by reconstructing the original sentence according to a masked language model (MLM) pre-training objective: the MLM randomly masks some of the tokens of the input, and the objective is to predict the original masked words based only on their context. In addition to the masked language model, BERT also uses a next sentence prediction task that jointly pre-trains text-pair representations. This last objective is crucial to improve the capability of the network to model relational information between text pairs, which is particularly important in tasks such as QA in order to relate an answer to a question.

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

After the language model is trained over a generic document collection, the BERT architecture allows encoding (i) specific words belonging to a sentence, (ii) the entire sentence and (iii) sentence pairs with dedicated embeddings. These can be given as input to further deep architectures to solve sentence classification, sequence labeling or relational learning tasks: fine-tuning is applied by adding simple task-specific layers on top of the architecture that acquired the language model. In a nutshell, such a layer introduces minimal task-specific parameters, and the model is trained on the targeted task by simply fine-tuning all pre-trained parameters, optimizing the performance on the specific problem.
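To make the MLM objective concrete, the following minimal sketch mimics the masking step described above. The 80/10/10 split among [MASK], random and unchanged tokens follows the recipe in (Devlin et al., 2019), while the function and variable names are illustrative and not taken from any reference implementation.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Corrupt a token sequence for MLM pre-training.

    Each token is selected with probability mask_prob; a selected token is
    replaced by [MASK] 80% of the time, by a random vocabulary token 10% of
    the time, and left unchanged 10% of the time. The model must recover
    the original token at every selected position.
    """
    corrupted, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            targets.append(token)            # position to be predicted
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(token)
        else:
            targets.append(None)             # no loss at this position
            corrupted.append(token)
    return corrupted, targets
```

During pre-training, the cross-entropy loss is computed only at the positions where targets is not None, so the network cannot simply copy the unmasked input.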
The straightforward application of BERT has shown better results than previous state-of-the-art models on a wide spectrum of natural language processing tasks. One of the most impressive results was achieved on the Question Answering task proposed by (Rajpurkar et al., 2016): given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage. An example of a paragraph, showing the Wikipedia answer to the question "What was Marie Curie the first female recipient of?", is reported in Figure 2. This specific task originated the Stanford Question Answering Dataset (SQuAD), a collection of 100k crowd-sourced question/answer pairs.

Figure 2: An example of the SQuAD dataset (Rajpurkar et al., 2016).

The fine-tuning process of BERT in the QA task (shown on the right side of Figure 1) requires encoding the input question and passage as a generic text pair, such as the ones used by the next sentence prediction task in the initial training stages.

In order to determine the correct span for the answer, (Devlin et al., 2019) introduces, on top of the embeddings encoding the words of the question/passage pair, a so-called start vector S ∈ R^H (with H the dimensionality of the embedding T_i produced for each word piece i) and an end vector E ∈ R^H. The probability of word i being the start of the answer span is then computed as the dot product between the associated embedding T_i and S, followed by a softmax layer over all the words in the paragraph:

P_i = e^{S·T_i} / Σ_j e^{S·T_j}

The analogous formula, with E in place of S, is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span with j ≥ i is used as the prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. This fine-tuning of BERT achieved state-of-the-art results in the official benchmarking campaign related to SQuAD and, most noticeably, its accuracy is comparable to the one observed for human annotators¹.

¹ https://rajpurkar.github.io/SQuAD-explorer/
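The span selection just described is simple enough to be summarized in a few lines. The sketch below assumes token_embeddings holds the contextualized embeddings T_i of the paragraph tokens, while start_vector and end_vector play the roles of S and E; the names and the cap on the span length are illustrative choices of this example, not prescriptions from (Devlin et al., 2019).

```python
import torch

def best_answer_span(token_embeddings, start_vector, end_vector, max_len=30):
    """Return the span (i, j), with j >= i, maximizing S·T_i + E·T_j.

    token_embeddings: (seq_len, H) tensor of contextualized embeddings T_i.
    start_vector, end_vector: the learned vectors S, E in R^H.
    max_len: a common practical cap on answer length, not part of the formula.
    """
    start_scores = token_embeddings @ start_vector    # S·T_i for every i
    end_scores = token_embeddings @ end_vector        # E·T_j for every j

    # P_i = softmax over the paragraph of S·T_i (and analogously for the end).
    start_probs = torch.softmax(start_scores, dim=0)
    end_probs = torch.softmax(end_scores, dim=0)

    seq_len = token_embeddings.size(0)
    best_span, best_score = (0, 0), float("-inf")
    for i in range(seq_len):
        for j in range(i, min(i + max_len, seq_len)):
            score = start_scores[i] + end_scores[j]   # S·T_i + E·T_j
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, start_probs, end_probs
```

Since paragraphs are short, the quadratic enumeration of candidate spans is inexpensive in practice.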
It is worth noting that no bias towards the input language exists, so that the language model underlying BERT can be acquired over any text collection, independently of the input language. As a consequence, a pre-trained model acquired over documents written in more than one hundred languages exists. It is applied in the next section to train and evaluate such a QA model over a dataset of examples in Italian.

3 Experimental Evaluation

In order to assess the applicability of the BERT architecture to the targeted QA task, a multilingual pre-trained model has been downloaded²: in particular, this model has been acquired over documents written in one hundred languages, it is composed of 12 layers of Transformers and it associates each input token with a word embedding of 768 dimensions. For consistency with (Devlin et al., 2019), 5 epochs have been considered to fine-tune the model.

² https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip

We trained the architecture over SQuAD-IT³, a dataset made available by (Croce et al., 2019). This dataset includes more than 50,000 question/paragraph pairs obtained by automatically translating the original SQuAD dataset. The figures of the resulting dataset are reported in Table 1, together with a comparison with the original SQuAD in English.

³ https://github.com/crux82/squad-it

Element      Training set                   Test set
             English   Italian   Percent.   English   Italian   Percent.
Paragraphs   18,896    18,506    97.9%      2,067     2,010     97.2%
Questions    87,599    54,159    61.8%      10,570    7,609     72.0%
Answers      87,599    54,159    61.8%      34,726    21,489    61.9%

Table 1: The quantities of the elements of the final dataset obtained by translating the SQuAD dataset, with the percentage of material w.r.t. the original dataset. The Italian test set was obtained from the English development set, since the English test set is not publicly available.

The parameters of the neural network were set equal to those of the original work, including the word embedding resources. Two evaluation metrics are used: the exact string match (EM) and the F1 score, which measures the weighted average of precision and recall at the token level. EM is the stricter measure, evaluated as the percentage of answers perfectly retrieved by the system, i.e. the text extracted from the span produced by the system is exactly the same as the gold standard. The adopted token-based F1 score smooths this constraint by measuring the overlap (the number of shared tokens) between the provided answer and the gold standard.
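A minimal sketch of how the two metrics can be computed for a single prediction is reported below. The whitespace tokenization and the lack of the answer normalization steps (lower-casing, punctuation and article stripping) performed by the official SQuAD evaluation script are simplifications of this example.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # EM: 1 if the predicted span is exactly the gold answer, 0 otherwise.
    return int(prediction.strip() == gold.strip())

def token_f1(prediction: str, gold: str) -> float:
    # F1: harmonic mean of precision and recall over the shared tokens.
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    shared = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are then obtained by averaging these per-answer values over the whole test set.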
Performances are reported in Table 2, together with the results achieved by a variant of the DrQA system (Chen et al., 2017) evaluated against the same SQuAD-IT dataset, as reported in (Croce et al., 2019). The improvements are impressive, as both EM and F1 are improved by more than 10%. Moreover, these results are in line with the impact of BERT on the original English dataset. In the final version of this paper we will provide an in-depth comparison between DrQA and BERT.

      DrQA-IT   BERT-IT
EM    56.1      64.96
F1    65.9      75.95

Table 2: Results of BERT-IT over the SQuAD-IT dataset.

4 Conclusions

This paper explores the application of Bidirectional Encoder Representations to the QA task in Italian, enabled by the recent availability of a large-scale annotated corpus, SQuAD-IT. The experimental results confirm the robustness of the adopted Transformer-based architecture, with a significant improvement with respect to earlier neural architectures. This result paves the way to the development of portable, robust and accurate neural models for QA in Italian, and future work will certainly consider other possible extensions of the adopted model.

References

Petr Baudiš and Jan Šedivý. 2015. Modeling of the Question Answering Task in the YodaQA System. In Josanne Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth Jones, Eric San Juan, Linda Capellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 222–228, Cham. Springer International Publishing.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544. ACL.

Eric Brill, Susan Dumais, and Michele Banko. 2002. An Analysis of the AskMSR Question-Answering System. In Proceedings of EMNLP 2002.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855, Melbourne, Australia. Association for Computational Linguistics.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in Italian. In Chiara Ghidini, Bernardo Magnini, Andrea Passerini, and Paolo Traverso, editors, AI*IA 2018 – Advances in Artificial Intelligence, pages 389–402, Cham. Springer International Publishing.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2019. Enabling deep learning for large scale question answering in Italian. Intelligenza Artificiale, 13(1):49–61.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59–79.

Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524.

Sanda M. Harabagiu, Dan I. Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan C. Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. FALCON: Boosting knowledge for answer engines. In Proceedings of the Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000.

L. Hirschman and R. Gaizauskas. 2001. Natural language question answering: the view from here. Natural Language Engineering, 7(4):275–300.

Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. In WWW, pages 150–161.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR, abs/1606.05250.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pages 5149–5152.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang. 2015. Open domain question answering via semantic enrichment. In WWW.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.