Transfer Learning for Biomedical Question Answering

Arda Akdemir and Tetsuo Shibuya
University of Tokyo, Japan
aakdemir@hgc.jp, tshibuya@hgc.jp

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. Deep Neural Network (DNN) based machine learning models have achieved remarkable success in many fields of research. Yet, many recent studies show the limitations of these approaches in generalizing to unseen examples and to new domains such as the biomedical domain. Moreover, supervised DNN models require a substantial amount of labeled data, which is not readily available for many tasks such as biomedical question answering. Transfer learning has been shown to mitigate these challenges by transferring information from auxiliary tasks to improve performance on a target task, and to be especially useful for low-resource tasks. These observations motivated us to investigate the effect of transfer learning and multi-task learning on the biomedical question answering task. We propose a novel multi-task learning model that learns biomedical entities and questions simultaneously. In this work, we explain the three different neural models we used to participate in the BioASQ 8B challenge. Our initial results show that transferring information from the biomedical entity recognition task improves performance on the biomedical question answering task.

1 Introduction

Pretrained language models [11, 3] have been frequently leveraged to improve performance on various downstream NLP tasks since their introduction. However, it has been shown that the performance of these models, which are trained on general domain corpora, drops significantly when they are tested on a new domain [14]. This performance drop is larger for domains with significantly different word distributions, such as the biomedical domain. To mitigate this drop, a frequently used approach is to further pretrain these models on the target domain, which is also called domain adaptation. Recently, Lee et al. [9] pretrained the BERT [3] language model on PubMed articles, resulting in BioBERT, and achieved state-of-the-art results on several downstream biomedical tasks. This motivated us to use BioBERT as the baseline model in our experiments.

Transfer learning is a general term for learning schemes in which information from a source task is used to improve performance on a target task. It has been shown to be especially useful for low-resource tasks [2]. Ideally, we would like to transfer information from high-resource tasks whose domain is similar to that of the target task, to make the most out of transfer learning. The currently available datasets for biomedical question answering are very limited. Relative to the biomedical question answering datasets, the available biomedical entity recognition datasets are large. These observations motivated us to apply transfer learning to improve performance on the biomedical question answering task. Specifically, we claim that performance on biomedical question answering can be improved by transferring information from the biomedical entity recognition task. We propose a multi-task learning model that learns both the biomedical question answering and entity recognition tasks, which, to the best of our knowledge, has not been done before.
Our work can be considered an extension of the previously proposed BioBERT model. Our model differs from BioBERT in two main ways. First, unlike BioBERT, we propose a single neural architecture that simultaneously learns three question types (factoid, yes/no, list). This allows the model to transfer information between different question types. Second, we propose a multi-task learning model that learns the biomedical entity recognition and question answering tasks together. BioBERT uses separate architectures for the two downstream tasks; thus, the pretrained BioBERT model is fine-tuned from scratch for each task. Unlike BioBERT, our model allows information to be transferred between these two tasks during the fine-tuning step.

1.1 BioASQ Challenge

BioASQ is a challenge on biomedical semantic indexing and question answering [6]. The challenge aims to advance the state-of-the-art in semantic indexing and question answering, and also to establish a reference point for biomedical question answering. More information about the challenge can be obtained from the BioASQ homepage (http://bioasq.org/). We participated in the question answering part of the BioASQ 2020 challenge (8B) to test our claim on using transfer learning for biomedical question answering. This paper describes the models we used to make our submissions to the BioASQ 8B challenge. We participated in the challenge with three different neural architectures, and used the BioASQ datasets as our test-bed to compare the proposed models. Our main contributions can be listed as follows:

– We implemented a novel neural architecture that uses a single model to jointly learn the three question types in the BioASQ challenge.
– We proposed a novel multi-task learning model for entity recognition and question answering in the biomedical domain, which, to the best of our knowledge, has not been employed before.
– We analyzed the effect of transferring information from three biomedical entity recognition datasets on the biomedical question answering task.

2 Methodology

In this section we describe each model we used during BioASQ Task 8b: Biomedical Semantic Question Answering. During the task, we made submissions using five different models, three of which used an identical neural architecture but differed in how the final model checkpoint was selected. We used the BioBERT-based question answering model [17] as our baseline, which we refer to as BioBERT-baseline. The second model is an extension of the first, which jointly learns all question types using a single architecture. We refer to this model as BioBERT-allquestions; we used three variations of this model for our submissions. Finally, we used a novel multi-task learning model that learns biomedical entities and all question types simultaneously. We refer to this model as BioBERT-multitask. For the BioASQ 8B challenge, we only submitted answers for the 'list', 'factoid', and 'yes/no' type questions. 'Summary' type questions require a fundamentally different approach and were beyond the scope of this work.

2.1 Pre-processing

The raw input format of the BioASQ dataset needs to be pre-processed into the format expected by the BioBERT model. Following Yoon et al. [17], we used a similar pre-processing scheme to convert the BioASQ questions into the SQuAD question answering format. In the BioASQ dataset, multiple gold answers are provided for most questions. Gold answers are denoted as spans inside the snippets provided for each question.
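To make this structure concrete, the sketch below shows, as a Python dictionary, roughly how a single factoid question with its snippets could be represented; the field names are illustrative and do not necessarily match the exact BioASQ JSON schema.

    # Schematic example of a factoid question with multiple gold-label snippets.
    # Field names are illustrative only; the official BioASQ JSON may differ.
    example_question = {
        "body": "What is Contrave prescribed for?",
        "type": "factoid",
        "exact_answer": ["treatment of obesity"],
        "snippets": [
            {"text": "Contrave, ... for the potential treatment of obesity, is an oral, sustained ..."},
            {"text": "Contrave is a combination of ... for the treatment of obesity, and is used ..."},
            {"text": "Contrave, a bupropion and ... for the potential treatment of obesity."},
        ],
    }

Each snippet in such a record is later paired with the question to form one training example, as described next.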
During pre-processing, we treated each gold-label snippet and question pair as a separate example to increase the size of the training set. In all our experiments, we only made use of the gold-label snippets. We did not analyze the effect of appending additional information from external sources, such as the links to related documents provided by the BioASQ organizers. Previously, Yoon et al. [17] experimented with various pre-processing methods to bring further improvements. They observed that the benefits of each strategy depend on the question type and the test batch. For this reason, we fixed the pre-processing method throughout all our experiments to make it clear where the improvements of each proposed model come from. Besides, using only the snippets as input to the neural networks significantly reduces the input size and the overall training time.

For factoid and list type questions, each gold-label span is used to create a new question-passage pair. An example factoid type question and the gold-label spans from the provided snippets are given in Table 1. The final predictions for the list type questions are handled during the post-processing step and are explained in the relevant section.

Table 1: An example question and the gold-label spans from the provided snippets of the BioASQ 6B dataset. The gold-label spans are shown in bold.

Question   What is Contrave prescribed for?
Answer 1   Contrave, ... for the potential treatment of obesity, is an oral, sustained ...
Answer 2   Contrave is a combination of ... for the treatment of obesity, and is used ...
Answer 3   Contrave, a bupropion and ... for the potential treatment of obesity.

Contrary to previous work, which directly adapts the BERT question answering model [3] by modifying the 'is_impossible' field of the SQuAD dataset format for the yes/no type questions, we implemented our own yes/no component. This enabled us to use the data without adding the 'is_impossible' field, making the dataset format more readable and easier to understand for researchers from the biomedical domain.

2.2 BioBERT-based Baseline Model

Pretrained subword contextual embeddings have shown remarkable progress over previous approaches on many downstream Natural Language Processing (NLP) tasks [16, 13, 12]. In particular, the transformer-based BERT model [3] helped achieve state-of-the-art results on many downstream tasks, including question answering. The performance of models pretrained on general domain corpora (e.g., Wikipedia articles) drops significantly when they are tested on niche domains such as the biomedical domain. Motivated by this observation, Lee et al. [9] proposed BioBERT, a BERT architecture pretrained on PubMed articles. The proposed model obtained state-of-the-art results on three different downstream biomedical NLP tasks. Recently, Yoon et al. [17] obtained the best results in the 2019 BioASQ 7B question answering challenge, achieving state-of-the-art results on all question types (factoid, yes/no, list). In their approach, separate models are trained from scratch for yes/no and factoid/list type questions. For our baseline, we used this BioBERT-based approach, which we refer to as BioBERT-baseline. The BERT model is extended with two separate additional neural layers to learn the different question types. The overall architectures are given in Figure 1.
For the yes/no type questions, the output for the first token ([CLS]) of the final layer of BERT is given as input to a fully connected layer with a 2-dimensional output representing the scores for yes and no. This is followed by a softmax layer to convert these scores into probabilities. Given a sequence of n question tokens Q = {q_t : 1 ≤ t ≤ n} and m passage tokens P = {p_t : 1 ≤ t ≤ m}, BioBERT outputs m + n + 2 fixed-size (L-dimensional) vectors V = {v_j : 1 ≤ j ≤ m + n + 2}. Next, v_1 is multiplied with an (L, 2) dimensional matrix W to generate the scores S = {s_yes, s_no} for the yes and no answers:

V = BioBERT(Q, P)
S = v_1^T W
O = softmax(S)

where O = {o_yes, o_no} represents the probabilities for each answer, which is the final output for the yes/no type questions. Similarly, for the factoid/list type questions, each v_j is multiplied with an (L, 2) dimensional matrix W_2 to generate the scores S_2 = {s_start, s_end}, which represent the start and end span scores for each token p_j inside the input passage P:

V = BioBERT(Q, P)
S_2 = v_j^T W_2

For training, each BioBERT-initialized model in Figure 1 is fine-tuned on the BioASQ 8B data for each question type separately. The main drawback of this previously proposed model is that the common BioBERT layer, which constitutes the majority of the parameters (only a single layer is added for each question type), is fine-tuned separately for each question type. The bottleneck for developing high-performing biomedical question answering systems is the scarcity of labeled training sets. This approach further limits the training dataset size and is not ideal for low-resource domains like the biomedical domain.

Fig. 1: Overall architectures for training separate models for the (a) yes/no and (b) factoid/list question types in the BioBERT-baseline model [17]. The common BioBERT layers are fine-tuned from scratch for each type.

2.3 Joint-Learning Model

The baseline approach does not expose the model to all the examples in the training dataset. This observation motivated us to propose a novel joint-learning model, which uses a single architecture to learn all question types; we refer to it as BioBERT-allquestions. To the best of our knowledge, learning all question types with a single BERT-based model has not been employed before in this domain. The overview of the proposed joint-learning model is shown in Figure 2. This simple extension of the previously proposed BioBERT-based QA model [17] exposes the model to all the available examples in the training dataset. The common BioBERT layer is trained jointly on all question types, which allows the model to transfer information across question types for better generalization.

Fig. 2: The proposed joint-learning model for all question types.

An important part of training joint-learning models is the selection of performance metrics. In the conventional single-task machine learning setting, there is usually a single performance metric. The models are evaluated on a development/validation dataset based on this metric to determine the best performing model during training. In the joint-learning setting, we can evaluate the models based on their performance on each task separately, or based on their overall performance.
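To make the shared-encoder design of Figure 2 more concrete, the following is a minimal PyTorch-style sketch of a single BioBERT encoder feeding a yes/no head and a factoid/list span head. It assumes the Hugging Face transformers interface; the checkpoint name and all identifiers are illustrative rather than our exact implementation.

    import torch.nn as nn
    from transformers import BertModel

    class JointQAModel(nn.Module):
        """Shared encoder with a yes/no head and a factoid/list span head (sketch)."""

        def __init__(self, encoder_name="dmis-lab/biobert-v1.1"):
            super().__init__()
            self.encoder = BertModel.from_pretrained(encoder_name)  # shared BioBERT layers
            hidden = self.encoder.config.hidden_size                # L in the notation above
            self.yesno_head = nn.Linear(hidden, 2)   # scores for {yes, no} from [CLS]
            self.span_head = nn.Linear(hidden, 2)    # start/end scores for every token

        def forward(self, input_ids, attention_mask, question_type):
            hidden_states = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state                                      # (batch, m + n + 2, L)
            if question_type == "yesno":
                return self.yesno_head(hidden_states[:, 0, :])       # use v_1 ([CLS]) only
            start_end = self.span_head(hidden_states)                # (batch, seq_len, 2)
            return start_end[..., 0], start_end[..., 1]              # start and end scores

Because both heads share the encoder, every training example of any question type updates the same BioBERT weights, which is the property the joint-learning and multi-task settings exploit.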
For our submissions to the BioASQ 8B challenge, we used the following three joint-learning models:

– Overall best performer
– Best yes/no model
– Best factoid model

To determine the best performer in each of the three cases, we used the average results over the five test batches of the BioASQ 6B challenge [6]. All three models are obtained from the same training run and correspond to different checkpoints of the same model instance.

2.4 Multi-task Learning Model

The multi-task learning model further extends the joint learner explained in Section 2.3. In this setting, a single neural model is trained simultaneously for the biomedical question answering and gene/protein entity recognition tasks. The question answering component of the model is identical to the joint learner. In addition, the model contains an entity recognition component consisting of a fully connected (FC) layer followed by a Conditional Random Field (CRF) layer. CRF-based models are frequently used for named entity recognition to take into account the tag transitions between consecutive tokens [8, 1]. For this reason, we extended the NER component of the previously proposed BioBERT-based NER model [9] with an additional CRF layer. The overall architecture of this multi-task learner is shown in Figure 3.

Fig. 3: The proposed multi-task learning model, which learns all biomedical question types and the biomedical entities simultaneously.

For a sequence of n tokens {t_i : 1 ≤ i ≤ n}, the NER component receives the BioBERT representation of each token. The subword token representations are averaged to obtain word-level representations. These word-level representations are fed into the FC layer to generate a score for each entity label for each token. The CRF layer generates the final score for each label by taking into account the transitions between labels. For the NER component, the CRF loss is used. The loss is calculated as the difference between the total score of all possible label sequences (all possible paths) and the score of the gold-label sequence (gold-label path):

b_j = BioBERT(t_1, ..., t_i, ..., t_n; j)
s_j = FC_ner(b_j)
S = [s_1, ..., s_j, ..., s_n]
crf_loss = forward_score(S, T) − path_score(S, T, G)

where S denotes the scoring matrix containing a score for each label and word pair, G is the gold-label sequence, and T is the transition matrix containing the transition scores between labels. forward_score(S, T) denotes the total score of all paths and path_score(S, T, G) is the score of the gold-label sequence. Ideally, we want all probability mass to accumulate on the gold-label path so that these two scores become identical.

Inference. At inference time, we used Viterbi decoding [4] to find the highest scoring label sequence for the entity recognition task.

2.5 Post-processing

As explained in the pre-processing section, we divided each question with multiple gold-label snippets into separate inputs. These examples are merged during post-processing to generate a unique answer for each question. For the post-processing step, we followed [17] to combine the predictions belonging to the same question for factoid/list type questions. Majority voting is used to find the highest scoring predictions for each factoid/list type question. For each factoid type question, the top N highest scoring predictions are returned, where N corresponds to the maximum number of answers allowed in the BioASQ 8B challenge.
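As a rough illustration of this merging step, the sketch below pools the span predictions from all (Q, P) pairs of one question, ranks candidate answers by vote count (breaking ties with the average probability), and returns the top N. The names and the exact scoring are illustrative, not our precise implementation.

    from collections import defaultdict

    def aggregate_factoid_answers(predictions, top_n=5):
        """Combine span predictions from all (Q, P) pairs of one question (sketch).

        predictions: list of (answer_text, probability) tuples gathered from every
        snippet of the same question. Returns the top_n answers ranked by vote
        count, with the average probability as a tie-breaker.
        """
        votes = defaultdict(list)
        for answer, prob in predictions:
            votes[answer.strip().lower()].append(prob)
        ranked = sorted(
            votes.items(),
            key=lambda kv: (len(kv[1]), sum(kv[1]) / len(kv[1])),
            reverse=True,
        )
        return [answer for answer, _ in ranked[:top_n]]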
For the list type questions, we used 0.50 as the probability threshold and included all answers with a higher average probability score. For the yes/no type questions, we averaged the probability scores of all examples belonging to the same question instance to determine the final answer.

3 Experimental Settings

In this section we explain the details of the experiments we conducted. All experiments were run on a single V100 GPU. For the question answering task, we used the BioASQ 6B test sets as our validation set and the examples in the BioASQ 8B training set for training. For the entity recognition task, we kept the train/dev/test split provided in [9]. It takes around 4-5 epochs over the training set to reach the highest performance on the question answering validation sets for all models.

3.1 Datasets

The entity recognition component of the final multi-task learning model we used for our submissions is trained on the BC2GM dataset [15]. The dataset contains 20,703 entity mentions in total and is annotated using the BIO scheme: the first token of each entity is annotated with 'B', the following tokens with 'I', and non-entity tokens with 'O'.

Table 2: Statistics of the datasets used for the biomedical entity recognition task.

Dataset   Split         # of Entity Tokens   # of Entities
BC2GM     Train         37,301               15,197
BC2GM     Development   7,498                3,061
BC2GM     Test          15,101               6,325

In order to evaluate our proposed multi-task learner, we trained the entity recognition component on three different datasets: BC2GM [15], BC4CHEMD [7], and BC5CDR [10], which contain gene entities, chemical entities, and disease mentions, respectively. As we had a maximum limit of five submissions for each test batch of the BioASQ 8B challenge, we only used the multi-task learning model trained on the BC2GM dataset.

The BioASQ 8B training set contains 3,243 questions in total. We did not make use of the 777 summary type questions, so our overall training set contained 2,466 questions. For training our models, we used only the snippets provided by the challenge organizers as the relevant passage for each question. Each snippet and question is treated as a unique (Q, P) pair, which is given as input to the question answering component, where Q and P denote the question and the passage, respectively.

Table 3: Number of questions in the BioASQ 8B dataset. Summary type questions are not used in our work.

Dataset                   Summary   Factoid   Yes/No   List    Total
BioASQ-8B                 777       941       644      881     3,243
BioASQ-8B pre-processed   -         4,916     9,786    9,472   24,174

For evaluating our proposed models, we also used the factoid questions from the BioASQ 6B test set [6]. The test set contains five batches; the number of factoid questions in each batch is given in Table 4.

Table 4: Number of factoid questions in each test batch of the BioASQ 6B challenge used for evaluating the multi-task learning model.

Batch     Number of Factoid Questions
batch-1   31
batch-2   21
batch-3   32
batch-4   33
batch-5   44

3.2 Training

In this section we explain how we trained each of the three models used to make submissions for the BioASQ 8B challenge. In all three models, we initialized the weights of the BERT component with BioBERT version 1.1, provided by Lee et al. [9] and pretrained on PubMed articles. For a fair comparison, we always used a maximum sequence length of 256, as we observed that going above this value sometimes resulted in memory issues.
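As an illustration of how a single (Q, P) pair is encoded under this length limit, the following sketch uses a Hugging Face-style tokenizer. The checkpoint name is an assumption, and truncating only the passage is one reasonable choice rather than necessarily the exact setting we used.

    from transformers import BertTokenizer

    # Illustrative encoding of one (Q, P) pair under the 256-token cap.
    # The checkpoint name is an assumption; any BioBERT-compatible vocabulary works.
    tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

    question = "What is Contrave prescribed for?"
    passage = "Contrave is a combination of bupropion ... for the treatment of obesity."

    encoded = tokenizer(
        question,
        passage,
        max_length=256,            # maximum sequence length used in all our experiments
        truncation="only_second",  # truncate the passage, never the question
        padding="max_length",
        return_tensors="pt",
    )
    # encoded["input_ids"] has shape (1, 256): [CLS] question [SEP] passage [SEP] ...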
Table 5 gives a comprehensive list of the hyperparameters used in our experiments.

Table 5: Important hyperparameters used by the proposed models.

Name                      Final Value   Range
Maximum sequence length   256           [64 - 512]
NER learning rate         0.0015        [0.001 - 0.1]
QAS learning rate         5e-4          [5e-4 - 1e-2]
QAS Adam epsilon          1e-8          [1e-10 - 1e-5]
NER Adam epsilon          1e-6          [1e-10 - 1e-4]
Optimizer                 AdamW         Adam, AdamW, SGD
Weight decay              0.001         [0 - 0.01]

Baseline model training. The baseline model (BioBERT-baseline) is composed of two completely separate neural architectures (one for yes/no and one for factoid/list type questions). In this approach, each architecture is trained separately, using only the data of the corresponding question type. During pre-processing, list type questions are converted into the factoid question format by treating each answer in the list of answers as a single factoid answer. After this pre-processing step, the formats of the factoid and list type questions become identical, so the same architecture can be trained on both types.

Joint-learning model training. The joint learner (BioBERT-allquestions) is trained on all question types at once. At each iteration, a (Q, P) pair is picked randomly from the whole training set. If the picked example is a yes/no type question, the yes/no component in Figure 2 is used to generate the output of the model. Otherwise, the factoid/list component is used to generate the start and end scores for each token inside the given passage P. The loss for each input example is backpropagated to update the weights of 1) the question-type-specific component and 2) the common BioBERT component. This way, we allow information transfer between different question types. Considering the relatively small sizes of the biomedical question answering datasets, this allows a better utilization of the available data. Besides, this approach reduces the total number of parameters almost by half, as the majority of the trainable parameters are the common BioBERT weights. As we have multiple target performance metrics (overall performance and the performance on each question type), we continued training until we could not observe any improvement for any question type on the question answering validation sets.

Multi-task learning model training. The multi-task learning model (BioBERT-multitask) is trained simultaneously for the question answering and entity recognition tasks. At each iteration we flip a coin to determine the task type (QAS or NER) and use the corresponding component from Figure 3. Similar to the joint-learning model, this allows information transfer from the NER dataset examples to the question answering task. The common BioBERT model is updated using examples from both tasks, which exposes the model to a significantly larger number of sentences from the biomedical domain. In this work, the entity recognition task is used as an auxiliary task to help improve the final performance on the target question answering task. For this reason, training is continued until we could not observe any improvement on the question answering validation set.

4 Results

In this section we first give the results obtained when evaluating our proposed multi-task learner. We compare BioBERT-multitask, which learns both the entity recognition and question answering tasks simultaneously, with the joint-learning model BioBERT-allquestions, which only learns the question answering task.
The BioASQ 8B data is used to train both models, and the factoid type questions from the five test batches of the BioASQ 6B challenge are used to evaluate them. For training the entity recognition component of the multi-task learning model, we used three different biomedical entity datasets. The results for both models are given in Table 6. For all three entity datasets, the multi-task learning model outperformed the model that only learns the question answering task on all five test batches. These results support our initial claim that transferring information from the entity recognition task improves performance on the target question answering task, and motivated us to apply the proposed multi-task learning model to the BioASQ 8B test sets.

Table 6: Analysis of multi-task learning of QA and ER using different ER datasets. Results are given for each test batch of the BioASQ 6B challenge. The official metric used for evaluating the models is MRR. The best result for each test batch is denoted in bold.

Model                Test batch   SAcc    LAcc    MRR
QA Only (Baseline)   batch 1      0.581   0.742   0.646
                     batch 2      0.667   0.857   0.74
                     batch 3      0.594   0.781   0.682
                     batch 4      0.545   0.636   0.586
                     batch 5      0.455   0.523   0.485
QA + BC2GM           batch 1      0.581   0.71    0.645
                     batch 2      0.714   0.857   0.786
                     batch 3      0.625   0.781   0.698
                     batch 4      0.576   0.636   0.606
                     batch 5      0.477   0.523   0.5
QA + BC4CHEMD        batch 1      0.71    0.742   0.72
                     batch 2      0.714   0.857   0.778
                     batch 3      0.656   0.781   0.709
                     batch 4      0.606   0.636   0.616
                     batch 5      0.5     0.523   0.511
QA + BC5CDR          batch 1      0.677   0.742   0.7
                     batch 2      0.714   0.857   0.786
                     batch 3      0.656   0.781   0.719
                     batch 4      0.576   0.636   0.601
                     batch 5      0.5     0.523   0.511

Next, we give the results obtained in the BioASQ 8B challenge for each model explained above. For the first test batch we only made submissions with two models: BioBERT-baseline and BioBERT-allquestions. For the other four test batches we made five submissions using the three models explained above. To allow a clean comparison between the proposed models, we kept the post-processing schemes identical for all our submissions. This is necessary to evaluate our claim on using multi-task learning to improve performance on the biomedical question answering task.

The QAS components of the joint-learning model and the multi-task learning model are identical. In order to evaluate our claim on using multi-task learning for question answering, we must therefore compare these two models, rather than comparing them with the single-task learning model, which uses a different architecture (separate models for each question type). The results show that for the factoid questions, the multi-task learning model outperformed all three joint-learning models on all four test batches. This indicates that leveraging information about genes and proteins can help improve the final performance on factoid type questions. The results for the list and yes/no type questions are mixed, and the benefits of multi-task learning are unclear for these types.

Table 7: Results for all submitted models on each test batch of the BioASQ 8B question answering challenge.
                                                                           Factoid                    Yes/No               List
Model type             Name                                   Test batch   SAcc     LAcc     MRR      Accuracy  Macro-F1   Precision  Recall   F-1
Single-task learning   BioBERT-baseline                       batch 1      0.3125   0.5938   0.4266   0.6800    0.4048     0.3461     0.2933   0.2918
                                                              batch 2      0.1600   0.4400   0.2580   0.7500    0.4286     0.5361     0.4476   0.4306
                                                              batch 3      0.3214   0.3929   0.3512   0.5806    0.3673     0.4861     0.4056   0.4214
                                                              batch 4      0.4118   0.7059   0.5270   0.5000    0.3333     0.3296     0.3810   0.3161
                                                              batch 5      0.5000   0.6875   0.5844   0.6176    0.6007     0.2242     0.2173   0.2179
Joint-learning         BioBERT-allquestions (overall best)    batch 1      0.2813   0.5938   0.4099   0.6800    0.4048     0.3842     0.3200   0.3262
                                                              batch 2      0.1600   0.4800   0.2540   0.7778    0.5355     0.4107     0.3738   0.3712
                                                              batch 3      0.3214   0.3929   0.3423   0.5806    0.3673     0.5972     0.4111   0.4290
                                                              batch 4      0.4412   0.7059   0.5564   0.5000    0.3333     0.4045     0.4623   0.3886
                                                              batch 5      0.4063   0.6875   0.5063   0.5588    0.3585     0.3646     0.3333   0.3347
Joint-learning         BioBERT-allquestions (best factoid)    batch 1      -        -        -        -         -          -          -        -
                                                              batch 2      0.1200   0.4400   0.2413   0.7500    0.4286     0.4827     0.4071   0.3950
                                                              batch 3      0.2857   0.3929   0.3333   0.5806    0.4732     0.5208     0.4056   0.4107
                                                              batch 4      0.4706   0.7059   0.5564   0.5000    0.3333     0.4582     0.4153   0.4005
                                                              batch 5      0.4063   0.6563   0.5026   0.5588    0.3585     0.4965     0.4167   0.4308
Joint-learning         BioBERT-allquestions (best yes/no)     batch 1      -        -        -        -         -          -          -        -
                                                              batch 2      0.1600   0.4400   0.2580   0.7500    0.4286     0.5361     0.4476   0.4306
                                                              batch 3      0.3214   0.4286   0.3643   0.5161    0.3404     0.5139     0.3556   0.3721
                                                              batch 4      0.4412   0.7059   0.5564   0.5000    0.3333     0.4045     0.4623   0.3886
                                                              batch 5      0.4063   0.6875   0.5063   0.5588    0.3585     0.3646     0.3333   0.3347
Multi-task learning    BioBERT-multitask                      batch 1      -        -        -        -         -          -          -        -
                                                              batch 2      0.2000   0.4000   0.2800   0.8333    0.7000     0.4643     0.4214   0.4108
                                                              batch 3      0.3214   0.4286   0.3643   0.5161    0.3404     0.5139     0.3556   0.3721
                                                              batch 4      0.4706   0.6765   0.5637   0.4615    0.3158     0.3843     0.3226   0.2991
                                                              batch 5      0.4063   0.7188   0.5365   0.5000    0.3333     0.5729     0.4236   0.4667

5 Conclusion

In this paper we described the models we used to make submissions for the BioASQ 8B challenge. We proposed a novel multi-task learning model for the biomedical entity recognition and question answering tasks. Our results showed that transferring information from the entity recognition task consistently improved performance on the factoid type questions of the question answering task. On all test batches of both the BioASQ 6B and BioASQ 8B challenges, transferring information brought improvements for factoid questions. We believe that further improvements can be achieved by implementing a more sophisticated information-sharing mechanism between the two tasks. Analyzing the characteristics of each dataset used can help us understand why transfer learning improves or degrades the performance for each question type.

So far we have only considered using domain-adaptive pretrained models (BioBERT-based). Recent work on pretraining showed that task-adaptive pretraining brings additional improvements for low-resource tasks [5]. Our plan is to incorporate task-adaptive pretraining for the biomedical question answering task.

References

1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: COLING 2018, 27th International Conference on Computational Linguistics. pp. 1638–1649 (2018)
2. Akdemir, A.: Research on task discovery for transfer learning in deep neural networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. pp. 33–41 (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding.
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
4. Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)
5. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
6. Kakadiaris, I.A., Paliouras, G., Krithara, A. (eds.): Proceedings of the 6th BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering. Association for Computational Linguistics, Brussels, Belgium (Nov 2018), https://www.aclweb.org/anthology/W18-5300
7. Krallinger, M., Rabal, O., Akhondi, S.A., et al.: Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the Sixth BioCreative Challenge Evaluation Workshop. vol. 1, pp. 141–146 (2017)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
9. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019). https://doi.org/10.1093/bioinformatics/btz682
10. Li, J., Sun, Y., Johnson, R.J., Sciaky, D., Wei, C.H., Leaman, R., Davis, A.P., Mattingly, C.J., Wiegers, T.C., Lu, Z.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016)
11. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237 (2018)
12. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP (2016)
13. Reddy, S., Chen, D., Manning, C.D.: CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, 249–266 (2019)
14. Ruder, S.: Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway (2019)
15. Smith, L., Tanabe, L.K., nee Ando, R.J., Kuo, C.J., Chung, I.F., Hsu, C.N., Lin, Y.S., Klinger, R., Friedrich, C.M., Ganchev, K., et al.: Overview of BioCreative II gene mention recognition. Genome Biology 9(S2), S2 (2008)
16. Wu, S., Dredze, M.: Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 833–844 (2019)
17. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 727–740. Springer (2019)