
Transfer Learning for Biomedical Question Answering

Arda Akdemir

aakdemir@hgc.jp

Tetsuo Shibuya

tshibuya@hgc.jp

University of Tokyo, Japan

Deep Neural Network (DNN) based Machine Learning models have achieved remarkable success in many fields of research. Yet many recent studies show that these approaches generalize poorly to unseen examples and to new domains such as the biomedical domain. Moreover, supervised DNN models require a substantial amount of labeled data, which is not readily available for many tasks such as biomedical question answering. Transfer learning has been shown to mitigate these challenges by transferring information from auxiliary tasks to improve the performance on a target task, and it is especially useful for low-resource tasks. These observations and findings motivated us to investigate the effect of transfer learning and multi-task learning on the biomedical question answering task. We propose a novel multi-task learning model that learns biomedical entities and questions simultaneously. In this work, we describe the three different neural models we used to participate in the BioASQ 8B challenge. Our initial results show that transferring information from the biomedical entity recognition task improves performance on the biomedical question answering task.

Pretrained language models [11, 3] have been frequently leveraged to improve performance on various downstream NLP tasks since their introduction. However, the performance of these models, which are trained on general domain corpora, drops significantly when they are tested on a new domain [14]. This performance drop is larger for domains that have significantly different word distributions, such as the biomedical domain. To mitigate this drop, a frequently used approach is to pretrain these models on the target domain, which is also called domain adaptation. Recently, Lee et al. [9] pretrained the BERT [3] language model on PubMed articles and achieved state-of-the-art results on several downstream biomedical tasks with the resulting model, BioBERT. This motivated us to use BioBERT as the baseline model in our experiments.

Transfer learning is a general term for learning schemes in which information from a source task is used to improve the performance on a target task. It has been shown to be especially useful for improving the performance on low-resource tasks [2]. Ideally, we would like to transfer information from high-resource tasks whose domain is similar to that of the target task, to make the most of transfer learning. The currently available datasets for biomedical question answering are very limited. Relative to these, the currently available biomedical entity datasets are large. These findings motivated us to apply transfer learning to improve the performance on the biomedical question answering task. Specifically, we claim that the performance on biomedical question answering can be improved by transferring information from the biomedical entity recognition task. We propose a multi-task learning model that learns both the biomedical question answering and the entity recognition tasks, which has not been done before to the best of our knowledge.

Our work can be considered an extension of the previously proposed BioBERT model. Our model differs from the BioBERT model in two main ways. First, unlike the BioBERT model, we propose a single neural architecture that simultaneously learns three question types (factoid, yes/no, list). This allows the model to transfer information between different question types. Second, we propose a multi-task learning model that learns the biomedical entity recognition and question answering tasks together. BioBERT uses separate architectures for the two downstream tasks, so the pretrained BioBERT model is fine-tuned from scratch for each task. Unlike BioBERT, our model allows information to be transferred between these two tasks during the fine-tuning step.

1.1 BioASQ Challenge

BioASQ is a challenge on biomedical semantic indexing and question answering [6]. The challenge aims to advance the state of the art in semantic indexing and question answering, and to establish a reference point for biomedical question answering. More information about the challenge can be obtained from the BioASQ homepage (http://bioasq.org/). We participated in the question answering part of the BioASQ 2020 challenge (8B) to test our claim about using transfer learning for biomedical question answering. This paper describes the models we used to make our submissions to the BioASQ 8B challenge. We participated in the challenge with three different neural architectures, and used the BioASQ datasets as our test-bed to compare the proposed models. Our main contributions are as follows:

- We implemented a novel neural architecture that uses a single model to jointly learn three question types in the BioASQ challenge.
- We proposed a novel multi-task learning model for entity recognition and question answering in the biomedical domain, which has not been employed before to the best of our knowledge.
- We analyzed the effect of transferring information from three biomedical entity recognition datasets to the biomedical question answering task.

2 Methodology

In this section we describe each model we used during BioASQ Task 8b: Biomedical Semantic Question Answering. During the task, we made submissions using five different models, three of which share an identical neural architecture and differ only in the evaluation method used to select the final model. We used the BioBERT-based Question Answering Model [17] as our baseline model, which we refer to as BioBERT baseline. The second model is an extension of the first model, which jointly learns all question types using a single architecture. We refer to this model as BioBERT allquestions. We used three variations of this model for our submissions. Finally, we used a novel multi-task learning model that learns biomedical entities and all question types simultaneously. We refer to this model as BioBERT multitask. For the BioASQ 8B challenge, we only submitted answers for the `list', `factoid', and `yes-no' type questions. `Summary' type questions require a fundamentally different approach and were beyond the scope of this work.

2.1 Pre-processing

The raw input format of the BioASQ dataset needs to be pre-processed into the format expected by the BioBERT model. Following Yoon et al. [17], we used a similar pre-processing scheme to convert the BioASQ questions into the SQuAD question answering format. In the BioASQ dataset, multiple gold answers are provided for most questions. Gold answers are denoted as spans inside the snippets provided for each question. During pre-processing, we treated each gold-label snippet and question pair as a separate example to increase the size of the training set. In all our experiments, we only made use of the gold-label snippets. We did not analyze the effect of appending additional information from external sources, such as the links to related documents provided by the BioASQ organizers. Previously, Yoon et al. [17] experimented with various pre-processing methods to bring further improvements. They observed that the benefits of each strategy depend on the question type and the test-batch. For this reason, we fixed the pre-processing method throughout all our experiments to make it clear where the improvements of each proposed model come from. Besides, using only the snippets as input to the neural networks significantly reduces the input size and the overall training time. For factoid and list type questions, each gold-label span is used to create a new Question-Passage pair. An example factoid type question and gold-label spans from the provided snippets are given in Table 1. The final predictions for the list type questions are handled during the post-processing step, and are explained in the relevant section.
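The pairing of questions and snippets described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`id`, `body`, `snippets`, `answers`), the function name `to_squad_examples`, the case-insensitive string match, and the choice to skip snippets without a matching span are all hypothetical and are not taken from the BioASQ tooling or from the submitted system.

```python
def to_squad_examples(question):
    """Create one SQuAD-style example per (question, snippet) pair.

    `question` is an assumed, simplified record with the keys
    'id', 'body', 'snippets' (list of {'text': ...}) and 'answers'
    (list of gold answer strings).
    """
    examples = []
    for idx, snippet in enumerate(question["snippets"]):
        context = snippet["text"]
        answers = []
        for answer in question["answers"]:
            # Locate the gold answer inside the snippet (case-insensitive).
            start = context.lower().find(answer.lower())
            if start != -1:
                answers.append({
                    "text": context[start:start + len(answer)],
                    "answer_start": start,
                })
        if not answers:
            continue  # assumption: snippets without a gold span are skipped
        examples.append({
            "id": f"{question['id']}_{idx:03d}",
            "question": question["body"],
            "context": context,
            "answers": answers,
        })
    return examples


# Illustrative record, not taken from the BioASQ data.
record = {
    "id": "q1",
    "body": "Which gene is mutated?",
    "answers": ["BRCA1"],
    "snippets": [
        {"text": "Mutations in BRCA1 increase risk."},
        {"text": "No gene is mentioned here."},
        {"text": "brca1 is a tumour suppressor."},
    ],
}
examples = to_squad_examples(record)
```

Each snippet that contains a gold span becomes its own training example, which is how a single question can contribute several (Q, P) pairs.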

Contrary to previous work, which directly adapts the BERT Question Answering Model [3] by modifying the `is_impossible' field of the SQuAD dataset format for the yes/no type questions, we implemented our own Yes/No component. This enabled us to use the data without adding the `is_impossible' field, making the dataset format more readable and easier to understand for researchers from the biomedical domain.

2.2 BioBERT-based Baseline Model

Pretrained subword contextual embeddings have shown remarkable progress over previous approaches on many downstream Natural Language Processing (NLP) tasks [16, 13, 12]. Specifically, the transformer-based BERT model [3] helped achieve state-of-the-art results on many downstream tasks, including question answering.

The performance of models pretrained on general domain corpora (e.g., Wikipedia articles) drops significantly when they are tested on niche domains such as the biomedical domain. Motivated by this observation, Lee et al. [9] proposed `BioBERT', a BERT architecture pretrained on PubMed articles. The proposed model obtained state-of-the-art results on three different downstream biomedical NLP tasks. Recently, Yoon et al. [17] obtained the best results in the 2019 BioASQ 7B Question Answering Challenge, and achieved state-of-the-art results on all question types (factoid, yes/no, list). In their approach, separate models are trained from scratch for the yes/no questions and for the factoid-style questions (factoid/list).

For our baseline model, we used this BioBERT-based approach, which we refer to as BioBERT baseline. The BERT model is extended with two separate additional neural layers to learn the different question types. The overall architectures are given in Figure 1. For the yes/no type questions, the final-layer BERT output for the first token ([CLS]) is given as input to a fully connected layer with a 2-dimensional output representing the scores for yes and no. This is followed by a softmax layer that converts these scores into probabilities. Given a sequence of $n$ question tokens $Q = \{q_t : 1 \le t \le n\}$ and $m$ passage tokens $P = \{p_t : 1 \le t \le m\}$, BioBERT outputs $m + n + 2$ fixed-size ($L$-dimensional) vectors $V = \{v_j : 1 \le j \le m + n + 2\}$. Next, $v_1$ is multiplied with an $(L, 2)$ dimensional matrix $W$ to generate the scores $S = \{s_{yes}, s_{no}\}$ for the yes and no answers:

$$V = \mathrm{BioBERT}(Q, P)$$
$$S = v_1^T W$$
$$O = \mathrm{Softmax}(S)$$

where $O = \{o_{yes}, o_{no}\}$ represents the probabilities for each answer, which is the final output for the yes/no type questions. Similarly, for the factoid/list type questions, each $v_j$ is multiplied with an $(L, 2)$ dimensional matrix $W_2$ to generate the scores $S_2 = \{s_{start}, s_{end}\}$, which represent the scores for token $p_j$ of the input passage $P$ being the start or the end of the answer span:

$$V = \mathrm{BioBERT}(Q, P)$$
$$S_2 = v_j^T W_2$$
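The two output layers can be illustrated with a small pure-Python sketch. The vectors standing in for the BioBERT output are random numbers, the sizes are arbitrary, and the helper names (`project`, `yes_no_head`, `span_head`) are hypothetical; only the arithmetic of the two projections follows the equations above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(vector, matrix):
    """Compute vector^T @ matrix for a length-L vector and an L x 2 matrix."""
    return [sum(v * row[k] for v, row in zip(vector, matrix))
            for k in range(len(matrix[0]))]


def yes_no_head(V, W):
    """O = Softmax(v_1^T W): only the first ([CLS]) vector is used."""
    return softmax(project(V[0], W))


def span_head(V, W2):
    """S_2 = v_j^T W_2 for every token j: returns start and end scores."""
    scores = [project(v, W2) for v in V]
    return [s[0] for s in scores], [s[1] for s in scores]


rng = random.Random(0)
L, n, m = 8, 3, 5  # arbitrary sizes chosen for the illustration
# Random stand-in for V = BioBERT(Q, P), with m + n + 2 vectors of size L.
V = [[rng.gauss(0, 1) for _ in range(L)] for _ in range(m + n + 2)]
W = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(L)]
W2 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(L)]

probabilities = yes_no_head(V, W)     # [o_yes, o_no]
start_scores, end_scores = span_head(V, W2)
```

The yes/no head produces a single probability pair per input, while the span head produces one start score and one end score per token.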

For training, each BioBERT-initialized model in Figure 1 is fine-tuned on the BioASQ 8B data for each question type separately. The main drawback of this previously proposed model is that the common BioBERT layer, which constitutes the majority of the parameters (only a single layer is added for each question type), is fine-tuned separately for each question type. The bottleneck for developing high-performing biomedical question answering systems is the scarcity of labeled training sets. This approach further limits the training dataset size, and is not ideal for low-resource domains like the biomedical domain.

Fig. 1: Overall architectures for training separate models for yes/no and factoid/list question types for the BioBERT baseline model [17]: (a) Yes/No model; (b) Factoid/List model. The common BioBERT model layers are fine-tuned from scratch for each type.

2.3 Joint-Learning Model

The baseline approach does not expose the model to all the examples in the training dataset. This observation motivated us to propose a novel joint-learning model, which uses a single architecture to learn all question types and which we refer to as BioBERT allquestions. To the best of our knowledge, learning all question types with a single BERT-based model has not been done before in this domain. The overview of the proposed joint-learning model is shown in Figure 2. This simple extension to the previously proposed BioBERT-based QA Model [17] exposes the model to all the available examples in the training dataset. The common BioBERT layer is trained jointly on all question types. This allows the model to transfer information from other question types for better generalization.

An important part of training joint-learning models is the selection of performance metrics. In the conventional single-task machine learning setting, there is usually a single performance metric. The models are evaluated on a development/validation dataset based on this metric to determine the best performing model during training. In the joint-learning setting, we can evaluate the models based on their performance on each task separately, or based on their overall performance. For our submissions to the BioASQ 8B challenge we used the following three joint-learning models:

- Overall best performer
- Best yes/no model
- Best factoid model
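Picking three checkpoints from a single training run under different criteria can be sketched as follows. The history format, the function name `select_checkpoints`, and the definition of the overall score as an unweighted mean over question types are assumptions made for the illustration, and the numbers are placeholders rather than reported results.

```python
def select_checkpoints(history):
    """Pick checkpoints of one training run under three criteria.

    `history` is an assumed list of dicts, one per saved checkpoint,
    holding the step and the validation score for each question type.
    """
    def overall(entry):
        # Assumption: overall performance is the unweighted mean.
        return (entry["yesno"] + entry["factoid"] + entry["list"]) / 3

    return {
        "overall_best": max(history, key=overall)["step"],
        "best_yesno": max(history, key=lambda e: e["yesno"])["step"],
        "best_factoid": max(history, key=lambda e: e["factoid"])["step"],
    }


# Placeholder validation scores, not results from the experiments.
history = [
    {"step": 1, "yesno": 0.70, "factoid": 0.30, "list": 0.20},
    {"step": 2, "yesno": 0.80, "factoid": 0.35, "list": 0.20},
    {"step": 3, "yesno": 0.75, "factoid": 0.40, "list": 0.30},
]
selected = select_checkpoints(history)
```

Because the criteria can disagree, the same run may yield different checkpoints for different question types, as in this toy history.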

To determine the best performer in each of the three cases, we used the average results over the five test-batches of the BioASQ 6B challenge [6]. All three models are obtained from the same training experiment, and correspond to checkpoints of the same model instance.

The multi-task learning model further extends the joint learner explained in Section 2.3. In this setting, a single neural model is trained for the Biomedical Question Answering and Gene/Protein Entity Recognition tasks simultaneously. The Question Answering component of the model is identical to that of the joint learner. In addition, the model contains an entity recognition component consisting of a fully connected (FC) layer followed by a Conditional Random Fields (CRF) layer. CRF-based models are frequently used for the named entity recognition (NER) task to take into account the tag transitions between consecutive tokens [8, 1]. For this reason, we extended the NER component of the previously proposed BioBERT-based NER model [9] with an additional CRF layer. The overall architecture of the proposed multi-task learner is shown in Figure 3.

For a sequence of $n$ tokens $\{t_i : 1 \le i \le n\}$, the NER component receives the BioBERT representation of each token. The subword token representations are then averaged to get the word-level representations. These word-level representations are fed into the FC layer to generate the scores for each entity label, for each token. The CRF layer generates the final score for each label by taking into account the transitions between labels. For the NER component, the CRF loss is used. The loss is calculated as the difference between the total score of all possible label sequences (all possible paths) and the score of the gold-label sequence (gold-label path):

$$b_j = \mathrm{BioBERT}(t_1, \dots, t_i, \dots, t_n; j)$$
$$s_j = \mathrm{FC}_{ner}(b_j)$$
$$S = [s_1, \dots, s_j, \dots, s_n]$$
$$\mathrm{crf\_loss} = \mathrm{forward\_score}(S, T) - \mathrm{path\_score}(S, T, G)$$

where $S$ denotes the scoring matrix containing the scores for each label and word pair, $G$ is the gold-label sequence, and $T$ is the transition matrix containing the transition scores between labels. $\mathrm{forward\_score}(S, T)$ denotes the total score of all paths and $\mathrm{path\_score}(S, T, G)$ is the score of the gold-label sequence. Ideally, we want all the probability mass to accumulate on the gold-label path, so that these two scores are identical.
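A minimal sketch of this loss in pure Python is given below. It assumes the common log-space formulation, in which the total score of all paths is the log of the summed exponentiated path scores, and it omits start and end transitions; both choices are assumptions, since the text does not specify them. The scores in `S` and `T` are arbitrary illustrative values.

```python
import math
from itertools import product


def logsumexp(values):
    """Stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def path_score(S, T, G):
    """Score of label sequence G: emission scores plus transition scores."""
    score = S[0][G[0]]
    for j in range(1, len(G)):
        score += T[G[j - 1]][G[j]] + S[j][G[j]]
    return score


def forward_score(S, T):
    """Log-space total score of all label sequences (forward algorithm)."""
    num_labels = len(S[0])
    alpha = list(S[0])
    for j in range(1, len(S)):
        alpha = [
            logsumexp([alpha[prev] + T[prev][cur] for prev in range(num_labels)])
            + S[j][cur]
            for cur in range(num_labels)
        ]
    return logsumexp(alpha)


def crf_loss(S, T, G):
    """crf_loss = forward_score(S, T) - path_score(S, T, G)."""
    return forward_score(S, T) - path_score(S, T, G)


# Arbitrary scores: 3 words, 3 labels.
S = [[1.0, 0.5, -0.2], [0.3, 1.2, 0.1], [0.0, 0.4, 0.9]]
T = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.4], [-0.3, 0.2, 0.5]]
G = [0, 1, 2]
loss = crf_loss(S, T, G)

# Brute-force check: enumerate every label sequence explicitly.
brute_force = logsumexp(
    [path_score(S, T, list(path)) for path in product(range(3), repeat=3)]
)
```

The forward recursion computes the same quantity as the brute-force enumeration without visiting every path, and the loss approaches zero only when the gold path dominates all others.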

Inference. During inference, we used Viterbi decoding [4] to find the highest scoring label sequence for the entity recognition task.
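Viterbi decoding over a scoring matrix and a transition matrix can be sketched as follows. As in any such sketch, details such as the absence of start and end transitions are assumptions, the function names are hypothetical, and the scores are arbitrary illustrative values.

```python
from itertools import product


def sequence_score(S, T, labels):
    """Emission plus transition score of one label sequence."""
    score = S[0][labels[0]]
    for j in range(1, len(labels)):
        score += T[labels[j - 1]][labels[j]] + S[j][labels[j]]
    return score


def viterbi_decode(S, T):
    """Return the highest scoring label sequence and its score."""
    num_labels = len(S[0])
    best = list(S[0])          # best score of any path ending in each label
    backpointers = []
    for j in range(1, len(S)):
        step_best, step_back = [], []
        for cur in range(num_labels):
            # Best previous label for reaching `cur` at position j.
            prev = max(range(num_labels), key=lambda p: best[p] + T[p][cur])
            step_back.append(prev)
            step_best.append(best[prev] + T[prev][cur] + S[j][cur])
        best = step_best
        backpointers.append(step_back)
    last = max(range(num_labels), key=lambda k: best[k])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]


# Arbitrary scores: 4 words, 3 labels.
S = [[1.0, 0.5, -0.2], [0.3, 1.2, 0.1], [0.0, 0.4, 0.9], [0.7, -0.3, 0.2]]
T = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.4], [-0.3, 0.2, 0.5]]
decoded_path, decoded_score = viterbi_decode(S, T)

# Brute-force reference: the best score over all label sequences.
reference_score = max(
    sequence_score(S, T, list(p)) for p in product(range(3), repeat=len(S))
)
```

The decoder keeps only the best partial path per label at each position, so its cost grows linearly with the sequence length rather than exponentially.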

2.5 Post-processing

As explained in the pre-processing section, we divided each question with multiple gold-label snippets into separate inputs. These examples are merged during post-processing to generate a unique answer for each question. For the post-processing step, we followed [17] to combine the predictions for the same question for factoid/list type questions. Majority voting is used to find the highest scoring predictions for each factoid/list type question. For each factoid type question, the top N highest scoring predictions are returned, where N corresponds to the maximum limit allowed in the BioASQ 8B challenge. For the list type questions, we used 0.50 as the probability threshold, and included all answers that have a higher average probability score.

For the yes/no type questions, we averaged the probability scores of all examples belonging to the same question instance to determine the final answer.
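The merging of per-snippet predictions can be sketched as below. This is a simplification under stated assumptions: candidates are ranked by their average probability rather than by vote counts, answer strings are normalized by lowercasing, `top_n=5` is a placeholder for the challenge limit, and a mean yes-probability of at least 0.5 is mapped to `yes`. All function names are hypothetical.

```python
from collections import defaultdict


def aggregate_spans(predictions):
    """Average the probability of each distinct answer across snippets.

    `predictions` is a list of (answer_text, probability) pairs collected
    from all snippets of one question.
    """
    grouped = defaultdict(list)
    for text, probability in predictions:
        grouped[text.strip().lower()].append(probability)
    averaged = {text: sum(p) / len(p) for text, p in grouped.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


def factoid_answers(predictions, top_n=5):
    """Return the top-N answers; top_n=5 is an assumed limit."""
    return [text for text, _ in aggregate_spans(predictions)[:top_n]]


def list_answers(predictions, threshold=0.5):
    """Keep every answer whose average probability exceeds the threshold."""
    return [text for text, prob in aggregate_spans(predictions)
            if prob > threshold]


def yes_no_answer(yes_probabilities):
    """Average the yes-probabilities of all examples of one question."""
    mean_yes = sum(yes_probabilities) / len(yes_probabilities)
    return "yes" if mean_yes >= 0.5 else "no"


# Illustrative predictions, not outputs of the submitted models.
predictions = [("BRCA1", 0.9), ("brca1", 0.7), ("TP53", 0.4), ("EGFR", 0.6)]
top_two = factoid_answers(predictions, top_n=2)
kept = list_answers(predictions)
decision = yes_no_answer([0.8, 0.4, 0.6])
```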

3 Experimental Settings

In this section we explain the details of the experiments we conducted. All experiments were run on a single V100 GPU. For the Question Answering task, we used the BioASQ 6B test sets as our validation set, and the examples in the BioASQ 8B training set for training. For the entity recognition task, we kept the train/dev/test split provided in [9]. For all models, it takes around 4-5 epochs on the training set to reach the highest performance on the question answering validation sets.

3.1 Datasets

The entity recognition component of the final multi-task learning model we used for our submissions is trained on the BC2GM dataset [15]. The dataset contains 20,703 entity mentions in total and is annotated using the BIO scheme. The first token of each entity is annotated with `B' and the following tokens of the entity are annotated with `I'. Non-entity tokens are annotated with `O'.
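The BIO scheme can be illustrated with a short sketch that converts token-level entity spans into tags. The function name `bio_tags`, the span format (token indices with an exclusive end), and the example sentence are hypothetical and are not taken from BC2GM.

```python
def bio_tags(tokens, entity_spans):
    """Assign B/I/O tags given entity spans as (start, end) token indices.

    `end` is exclusive. The first token of an entity gets 'B', the
    remaining entity tokens get 'I', and all other tokens get 'O'.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


# Illustrative sentence with one multi-token gene mention.
tokens = ["The", "BRCA1", "tumor", "suppressor", "gene", "is", "mutated"]
tags = bio_tags(tokens, [(1, 5)])
```

Here the four-token mention starting at `BRCA1` receives one `B` tag followed by three `I` tags.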

In order to evaluate our proposed multi-task learner, we trained the entity recognition component on three different datasets. We used the BC2GM [15], BC4CHEMD [7], and BC5CDR [10] datasets for biomedical entity recognition, which contain gene entities, chemical entities, and disease mentions, respectively. As the BioASQ 8B challenge allowed a maximum of five submissions per test-batch, we only submitted the multi-task learning model trained on the BC2GM dataset.

The BioASQ 8B training set contains 3,243 questions in total. We did not make use of the 777 summary type questions, so our overall training set contained 2,466 questions. For training our models we used only the snippets provided by the challenge organizers as the relevant passage for each question. Each snippet and question is treated as a unique $(Q, P)$ pair, which is given as input to the question answering component, where $Q$ and $P$ represent `question' and `passage', respectively.

For evaluating our proposed models, we also used the factoid questions from the BioASQ 6B test set [6]. The test set contains five batches, and the number of factoid questions in each batch is given in Table 4.

In this section, we explain how we trained each of the three models we used to make submissions to the BioASQ 8B challenge. In all three models, we initialized the weights of the BERT component using BioBERT version 1.1, provided by Lee et al. [9] and pretrained on PubMed articles. To have a fair comparison we always used a maximum sequence length of 256, as we observed that going above this value sometimes resulted in memory issues. Table 5 gives a comprehensive list of the hyperparameters we used during our experiments.

Baseline model training. The baseline model (BioBERT baseline) is composed of two completely separate neural architectures (one for yes/no and one for factoid/list type questions). In this approach, each architecture is trained separately, using only the corresponding dataset. During pre-processing, list type questions are converted into the factoid question format by treating each answer in the list of answers as a single factoid type answer. After this pre-processing step, the formats of the factoid and list type questions are identical, so the same architecture can be used for training on both types.

Joint-learning model training. The joint learner (BioBERT allquestions) is trained on all question types at once. At each iteration a $(Q, P)$ pair is picked randomly from the whole training set. If the picked example is a `yes/no' type question, the `Yes/No' component in Figure 2 is used to generate the output of the model. Otherwise, the `Factoid/List' component is used to generate the `start' and `end' scores for each token inside the given passage $P$. The loss for each input example is backpropagated to update the weights of 1) the question-type specific component, and 2) the common BioBERT component. This way, we allow information transfer between different question types. Considering the relatively small sizes of the biomedical question answering datasets, this allows a better utilization of what is available. Besides, this approach reduces the total number of parameters of the final model almost by half, as the majority of the trainable parameters are the common BioBERT weights. As we have multiple target performance metrics (overall performance and performance on each question type), we continued the training until we could not observe any improvement for any question type on the question answering validation sets.
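The routing of each sampled example to its question-type specific component can be sketched as follows. The two step functions stand in for a forward pass, loss computation, and weight update of the shared encoder plus the selected component; they are hypothetical callbacks, as is the example format with a `type` key.

```python
import random


def train_joint(examples, yes_no_step, span_step, num_iterations, seed=0):
    """Sample one (Q, P) pair per iteration and route it by question type.

    `yes_no_step` and `span_step` are assumed callbacks that would update
    the shared encoder together with the matching output component.
    """
    rng = random.Random(seed)
    usage = {"yesno": 0, "span": 0}
    for _ in range(num_iterations):
        example = rng.choice(examples)
        if example["type"] == "yesno":
            yes_no_step(example)
            usage["yesno"] += 1
        else:
            # Factoid and list questions share the span component.
            span_step(example)
            usage["span"] += 1
    return usage


# Toy training set; the question texts are illustrative only.
examples = [
    {"type": "yesno", "question": "Is gene X associated with disease Y?"},
    {"type": "factoid", "question": "Which gene is mutated in disease Y?"},
    {"type": "list", "question": "Which genes interact with gene X?"},
]
log = []
usage = train_joint(
    examples,
    yes_no_step=lambda example: log.append("yesno"),
    span_step=lambda example: log.append("span"),
    num_iterations=20,
)
```

Since both callbacks would update the same encoder weights, every sampled example contributes to the shared representation regardless of its question type.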

Multi-task learning model training. The multi-task learning model (BioBERT multitask) is simultaneously trained for the question answering and the entity recognition tasks. At each iteration we flip a coin to determine the task type (QAS or NER), and use the corresponding component from Figure 3. Similar to the joint-learning model, this allows information transfer from the NER dataset examples to the question answering task. The common BioBERT model is updated using examples from both tasks, which exposes the model to a significantly larger number of sentences from the biomedical domain. In this work, the entity recognition task is used as an auxiliary task to help improve the final performance on the target question answering task. For this reason, training continues until we can no longer observe any improvement on the question answering validation set.
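The task sampling and the stopping rule can be sketched together as below. The fair coin (probability 0.5), the evaluation interval, and the patience-based reading of "no further improvement" are assumptions, the callbacks are hypothetical stand-ins for real training and evaluation code, and the validation scores are placeholders rather than reported results.

```python
import random


def train_multitask(qa_step, ner_step, evaluate_qa,
                    eval_every=100, patience=3, max_steps=10_000, seed=0):
    """Alternate between QAS and NER updates chosen by a coin flip.

    Training stops once the QA validation score has not improved for
    `patience` consecutive evaluations (an assumed stopping rule).
    """
    rng = random.Random(seed)
    best_score = float("-inf")
    stale_evals = 0
    counts = {"QAS": 0, "NER": 0}
    for step in range(1, max_steps + 1):
        task = "QAS" if rng.random() < 0.5 else "NER"
        (qa_step if task == "QAS" else ner_step)()
        counts[task] += 1
        if step % eval_every == 0:
            score = evaluate_qa()  # only the target QA task is monitored
            if score > best_score:
                best_score, stale_evals = score, 0
            else:
                stale_evals += 1
                if stale_evals >= patience:
                    break
    return best_score, step, counts


# Placeholder validation scores used only to exercise the stopping rule.
validation_scores = iter([0.30, 0.35, 0.34, 0.35, 0.33])
best, steps, counts = train_multitask(
    qa_step=lambda: None,
    ner_step=lambda: None,
    evaluate_qa=lambda: next(validation_scores),
    eval_every=10,
    patience=3,
    max_steps=1000,
)
```

Only the question answering score drives the stopping decision, which reflects the role of entity recognition as an auxiliary task.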

4 Results

In this section we start by giving the results we obtained when evaluating our proposed multi-task learner. We compare BioBERT multitask, which learns both the entity recognition and the question answering tasks simultaneously, with the joint-learning model BioBERT allquestions, which only focuses on the question answering task. The BioASQ 8B data is used to train both models, and the factoid type questions from the BioASQ 6B challenge, which contains five different test batches, are used to evaluate them. For training the entity recognition component of the multi-task learning model, we used three different biomedical entity datasets. The results for both models are given in Table 6. For all three entity datasets, the multi-task learning model outperformed the model that only learns the question answering task on all five test-batches. These results supported our initial claim about transferring information from the entity recognition task to improve the performance on the target question answering task, and motivated us to apply the proposed multi-task learning model to the BioASQ 8B test sets.

Next, we give the results obtained in the BioASQ 8B challenge for each model explained above. For the first test-batch we only made submissions using two models: BioBERT baseline and BioBERT allquestions. For the other four test-batches we made five submissions using the three models explained above. To allow a clean comparison between the proposed models, we kept the post-processing schemes identical for all our submissions. This is necessary to evaluate our claim about using multi-task learning to improve the performance on the biomedical question answering task.

The QAS components of the joint-learning model and the multi-task learning model are identical. In order to evaluate our claim about using multi-task learning for question answering, we must compare these two models, rather than comparing them with the single-task learning model, which uses a different architecture (separate models for each question type). The results show that for the factoid questions, the multi-task learning based model outperformed all three joint-learning models on all four test-batches. This suggests that leveraging information about genes and proteins can help improve the final performance on the factoid type questions. The results for list and yes/no type questions are mixed, and the benefits of multi-task learning are unclear for these types.

5 Conclusion

In this paper we described the models we used to make submissions to the BioASQ 8B challenge. We proposed a novel multi-task learning model for the biomedical entity recognition and question answering tasks. Our results showed that transferring information from the entity recognition task consistently improved the performance on the factoid type questions of the question answering task. On all test-batches of both the BioASQ 6B and BioASQ 8B challenges, transferring information brought improvements for factoid questions. We believe that further improvements can be achieved by implementing a more sophisticated information sharing scheme between the two tasks. Analyzing the characteristics of each dataset used can help us understand why transfer learning improves or degrades the performance for each question type.

So far we have only considered using domain-adaptive pretrained models (BioBERT-based). Recent work on pretraining showed that task-adaptive pretraining brings additional improvement for low-resource tasks [5]. Our plan is to incorporate task-adaptive pretraining for the biomedical question answering task.

References

1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: COLING 2018, 27th International Conference on Computational Linguistics. pp. 1638–1649 (2018)

2. Akdemir, A.: Research on task discovery for transfer learning in deep neural networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. pp. 33–41 (2020)

3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)

4. Forney, G.D.: The viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)

5. Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)

6. Kakadiaris, I.A., Paliouras, G., Krithara, A. (eds.): Proceedings of the 6th BioASQ Workshop A challenge on large-scale biomedical semantic indexing and question answering. Association for Computational Linguistics, Brussels, Belgium (Nov 2018), https://www.aclweb.org/anthology/W18-5300

7. Krallinger, M., Rabal, O., Akhondi, S.A., et al.: Overview of the BioCreative VI chemical-protein interaction Track. In: Proceedings of the sixth BioCreative challenge evaluation workshop. vol. 1, pp. 141–146 (2017)

8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)

9. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019). https://doi.org/10.1093/bioinformatics/btz682

10. Li, J., Sun, Y., Johnson, R.J., Sciaky, D., Wei, C.H., Leaman, R., Davis, A.P., Mattingly, C.J., Wiegers, T.C., Lu, Z.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016)

11. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237 (2018)

12. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad: 100,000+ questions for machine comprehension of text. In: EMNLP (2016)

13. Reddy, S., Chen, D., Manning, C.D.: Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, 249–266 (2019)

14. Ruder, S.: Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway (2019)

15. Smith, L., Tanabe, L.K., nee Ando, R.J., Kuo, C.J., Chung, I.F., Hsu, C.N., Lin, Y.S., Klinger, R., Friedrich, C.M., Ganchev, K., et al.: Overview of biocreative ii gene mention recognition. Genome biology 9(S2), S2 (2008)

16. Wu, S., Dredze, M.: Beto, bentz, becas: The surprising cross-lingual effectiveness of bert. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 833–844 (2019)

17. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 727–740. Springer (2019)