Transformer-based Language Models for Factoid
Question Answering at BioASQ9b
Urvashi Khanna, Diego Mollá
Macquarie University, Australia


                                      Abstract
                                      In this work, we describe our experiments and participating systems in the BioASQ Task 9b Phase B chal-
                                      lenge of biomedical question answering. We have focused on finding the exact answers and investigated
                                      multi-task fine-tuning and gradual unfreezing techniques on transformer-based language models. For
                                      factoid questions, our ALBERT-based systems ranked first in test batch 1 and fourth in test batch 2. Our
                                      DistilBERT systems outperformed the ALBERT variants in test batches 4 and 5 despite having 81% fewer
                                      parameters than ALBERT. However, we observed that gradual unfreezing had no significant impact on
                                      the model’s accuracy compared to standard fine-tuning.

                                      Keywords
                                      Transfer learning, DistilBERT, ALBERT, Question Answering, BioASQ9b




1. Introduction
Nowadays, the use of language models that have been pretrained on massive amounts of data
is the norm [1, 2, 3]. Rather than making significant task-specific architecture improvements,
these pretrained models can be fine-tuned for various tasks by making minor changes to the
language model architecture, such as adding an output layer on top. Fine-tuning approaches
are critical for learning the distributions of the target task and improving the language model’s
adaptability. However, fine-tuning a language model on small datasets like BioASQ can lead to
catastrophic forgetting and overfitting. Furthermore, training all layers simultaneously on data
of different target tasks may result in poor performance and an unstable model [4]. A schedule
for updating the pretrained weights may be critical for preventing catastrophic forgetting of the
source task’s knowledge. Scheduling techniques like chain thaw [5] and gradual unfreezing [6]
have improved the performance of multiple Natural Language Processing (NLP) tasks. Gradual
unfreezing involves gradually fine-tuning model layers rather than fine-tuning all layers at
once.
   Pretrained language models are usually trained on general language and then adapted to
downstream tasks of varied domains. Many domain-specific tasks, however, face the problem
of the scarcity of labelled datasets. An auxiliary signal through multi-task fine-tuning helps the
language model adapt to smaller datasets more effectively [7, 8, 9]. Multi-task fine-tuning (also referred

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" Urvashi.Khanna@mq.edu.au (U. Khanna); Diego.Molla-Aliod@mq.edu.au (D. Mollá)
~ https://researchers.mq.edu.au/en/persons/diego-molla-aliod (D. Mollá)
 0000-0003-2345-5596 (U. Khanna); 0000-0003-4973-0963 (D. Mollá)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
to as sequential adaptation in some literature [4]) is the intermediate fine-tuning stage in which
the model is fine-tuned on a larger dataset before fine-tuning on a low-resource dataset. In this
paper, we describe the experiments of our participating systems1 at the BioASQ9b challenge2 .
We discuss two of our systems, mainly focusing on factoid questions. Both systems adapt the
multi-task fine-tuning technique of fine-tuning on a larger dataset before fine-tuning on the
BioASQ9b dataset. Our first system fine-tunes the pre-trained model ALBERT on SQuAD2.0
and then on the BioASQ9b dataset. This system performed exceedingly well on BioASQ9b
Test batches 1 and 2. Our second system investigates the effect of the gradual unfreezing
technique on the smaller, compact transformer-based model, DistilBERT. We assess this system
via two of our submissions at the BioASQ9b Challenge. One of our submissions of DistilBERT
ranked sixth in the BioASQ9b leaderboard3 . From our results, we conclude that gradually
unfreezing DistilBERT yields no significant improvement in accuracy on the BioASQ9b test
data compared with standard fine-tuning.
   The rest of this paper is structured as follows. In Section 2, we briefly discuss related work
for background. Section 3 describes the BioASQ dataset and the processing steps involved.
Section 4 details our experimental setup for both our systems. Section 5 discusses the results of
our systems on the BioASQ public leaderboard. Finally, Section 6 provides a conclusion to our
work.


2. Related Work
Transfer learning has been widely used to transfer knowledge across multiple domains. The
scarcity of sizable domain-specific datasets and the cost associated with manually annotating
them are driving this trend. In this section, we discuss previous works that used transfer learning
for the BioASQ biomedical question answering task [10].
   In the 5th BioASQ challenge, Wiese et al. [11] explored domain adaptation to transfer
knowledge from an already existing neural Question Answering (QA) system named FastQA
[12] that was trained on SQuAD [13]. They initialised their model with the pretrained FastQA
models’ parameters during the fine-tuning phase. Using a combination of fine-tuning and
biomedical Word2vec embeddings, their model achieved state-of-the-art results. They also used
optimisation approaches such as L2 weight regularisation and forgetting cost term to minimise
catastrophic forgetting.
   Lee et al. [14] demonstrated the potential of adapting the general-domain language model BERT to
the biomedical domain. They presented BioBERT, one of the first biomedical language models. In the
pretraining step, BioBERT was initialised with BERT weights and then pretrained on biomedical
domain corpora. BioBERT produced benchmark results on a wide range of biomedical text
mining tasks, including question answering, relation extraction, and named entity recognition.
Yoon et al.’s [15] submission for task 7b topped the leaderboard in the 7th BioASQ challenge.
They used a sequential adaptation technique in which pretrained BioBERT was fine-tuned first
on the SQuAD dataset and then on the BioASQ dataset.

   1
     Code associated with this paper is available at https://github.com/urvashikhanna/bioasq9b
   2
     http://bioasq.org/
   3
     http://participants-area.bioasq.org/results/9b/phaseB/
   Similarly, BioELMo [16] is a biomedical version of ELMo that outperforms BioBERT on the
authors’ probing tasks when used as a feature extractor. However, the fine-tuned BioBERT
outperforms BioELMo on named entity recognition and Natural Language Inference (NLI) tasks.
   Hosein et al. [17] studied domain portability and error propagation of BERT-based QA models
through their BioASQ7b submissions. Their results showed that general-domain language
models can generalise and give good results on domain-specific tasks. They also observed
that pretraining is more critical than fine-tuning when improving the domain portability of
BERT QA models. For yes/no questions in the BioASQ7 Phase B challenge, Resta et al. [18] used
an ensemble of classifiers with input from various transformer-based language models. They
employed contextual embeddings from multiple pretrained language models, such as BERT and
ELMo, as features to capture long-term dependencies.
   Jeong et al. [9] expanded the prior work on BioBERT models [14, 15] in the 8th BioASQ
challenge. They adapted multiple stages of fine-tuning by first fine-tuning BioBERT on the NLI
dataset [19], then on the SQuAD dataset [13], and finally on the downstream BioASQ dataset.
Their results established that tasks like NLI that capture the relationships between sentence
pairs improve the accuracy of the QA systems. Additionally, they analysed and reported the
number of unanswerable questions from the BioASQ7b dataset in the QA setting. Kazaryan et
al. [20] used ALBERT [2] as their base language model which was fine-tuned first on SQuAD
v2.0 [21], and subsequently on the BioASQ8b data.


3. BioASQ Data Processing
BioASQ [10] is an international biomedical challenge that comprises annual tasks on semantic
indexing and biomedical question answering. The ninth BioASQ challenge consists of two
shared tasks. Task 9a is a semantic indexing task that aims to annotate new PubMed articles
automatically [22] with Medical Subject Headings (MeSH). Task 9b is a question answering task
devised for systems to answer four types of biomedical questions: factoid, summary, list, and
yes/no. The participants are provided with questions along with relevant snippets. The output
generated by their systems is either an exact answer (for yes/no, factoid, and list questions) or
ideal answers (for summary questions), or both. The tasks are released in five batches over two
months, with 24 hours to submit the answers after the release of each test batch.
   We primarily concentrate on factoid questions from the BioASQ9b dataset. The dataset
contains a total of 3743 questions, 1092 of which are factoid questions. An example of a factoid
question is shown in Figure 1. Our system returns exact answers for factoid-type questions that
can either be a single entity or a list of entities. We regard the BioASQ challenge task as an
extractive QA task because the answer to the query is extracted from the relevant snippet. The
metrics used for evaluating the systems on the BioASQ leaderboard are: Strict Accuracy (SAcc),
Lenient Accuracy (LAcc), and Mean Reciprocal Rank (MRR). However, MRR is the official metric
used by the BioASQ organisers for factoid questions since it is often used to evaluate other
factoid QA tasks and challenges [10].
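   For concreteness, the listing below is a minimal sketch of how MRR can be computed over ranked
factoid predictions; the function name and the simple string normalisation are illustrative and do
not reproduce the official BioASQ evaluation code.

```python
# Sketch of the MRR metric over ranked factoid answers: only the rank of the
# first correct answer in each list contributes. The string normalisation is a
# simplification of the official BioASQ evaluation.
def mean_reciprocal_rank(predictions, gold_answers):
    """predictions: one ranked list (up to five answers) per question.
    gold_answers: one set of acceptable answer strings per question."""
    total = 0.0
    for ranked, gold in zip(predictions, gold_answers):
        gold_norm = {g.strip().lower() for g in gold}
        for rank, answer in enumerate(ranked, start=1):
            if answer.strip().lower() in gold_norm:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(predictions)
```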
   The BioASQ dataset is transformed into the SQuAD format and vice versa using pre-processing
and post-processing steps. In a typical span-extractive question answering task, the system is
provided with a passage P and a question Q, and it must identify an answer span A = (a_start, a_end)
Figure 1: Sample factoid question [23]. The answer to the question is in bold and is extracted from
snippet 2.


in P. The SQuAD dataset is an example of a span prediction QA task containing many question-
answer pairs and a passage that answers the given question. In contrast, the training dataset of
BioASQ includes a question, an answer, and multiple relevant snippets. Therefore, we begin
by pairing each snippet with its question and transforming it into multiple question-snippet
pairs. Also, based on the exact answer provided, we locate the answer’s position in the snippet
and populate it as the start position of the answer span in the dataset. After performing these
pre-processing steps, the BioASQ9b training data samples increased five-fold from 1092 to
5447. Table 1 shows the number of questions in the training and test batches before and after
pre-processing.

Table 1
Summary of BioASQ9b Training and Test data before and after pre-processing.
             Dataset      Factoid questions               Factoid questions
                          before pre-processing           after pre-processing
             Training     1092                            5447
             Batch 1      29                              139
             Batch 2      34                              151
             Batch 4      28                              132
             Batch 5      36                              148
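
   A minimal sketch of this pre-processing step is shown below. It assumes the field names of the
public BioASQ JSON release (body, snippets, exact_answer); the helper name, the SQuAD-style output
layout, and the case-insensitive answer matching are illustrative.

```python
# Sketch of the pre-processing step: each (question, snippet) pair becomes one
# SQuAD-style training example, with the answer's start offset located in the
# snippet. Field names follow the public BioASQ JSON release; the simple
# case-insensitive matching is an assumption.
def bioasq_to_squad(bioasq_questions):
    examples = []
    for q in bioasq_questions:
        if q["type"] != "factoid":
            continue
        answers = q["exact_answer"]
        if isinstance(answers, str):
            answers = [answers]
        # factoid answers may be given as nested lists of synonyms
        answers = [a[0] if isinstance(a, list) else a for a in answers]
        for snippet in q["snippets"]:
            context = snippet["text"]
            for answer in answers:
                start = context.lower().find(answer.lower())
                if start == -1:
                    continue  # this snippet does not contain the exact answer
                examples.append({
                    "id": q["id"],
                    "question": q["body"],
                    "context": context,
                    "answers": {"text": [context[start:start + len(answer)]],
                                "answer_start": [start]},
                })
                break  # one example per question-snippet pair
    return examples
```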

   Our system returns the prediction span for each question. Because we divided the snippets
into several question-snippet pairs during the pre-processing stage, we now have predictions of
multiple answer spans and their probabilities for each question. Each system must submit a
list of up to five responses for the official BioASQ evaluation. As a result, we select the top five
answers for each question in decreasing order of probability as our submission. Thus, for each
factoid question, our system returns a list of up to five responses sorted by their likelihood.
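   A minimal sketch of this post-processing step is shown below; the input format (question id,
answer text, probability) and the helper name are illustrative.

```python
from collections import defaultdict

# Sketch of the post-processing step: span predictions from all question-snippet
# pairs of the same question are pooled, and the five most probable distinct
# answers are kept in decreasing order of probability.
def top_five_answers(span_predictions):
    pooled = defaultdict(dict)   # question id -> {normalised answer: (text, prob)}
    for qid, text, prob in span_predictions:
        key = text.strip().lower()
        if prob > pooled[qid].get(key, ("", -1.0))[1]:
            pooled[qid][key] = (text, prob)
    submission = {}
    for qid, answers in pooled.items():
        ranked = sorted(answers.values(), key=lambda item: item[1], reverse=True)
        submission[qid] = [text for text, _ in ranked[:5]]
    return submission
```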


4. Systems Overview
This section describes our systems and the experimental setup of our submissions at the
BioASQ9b challenge. Our submissions in the BioASQ9b challenge are based on two pretrained
models: “DistilBERT” and “ALBERT”. As mentioned above, we focus mainly on factoid questions.
We submitted ALBERT variants for all the BioASQ9b test batches except test batch 3. DistilBERT-
based systems were submitted in test batches 2, 4, and 5. In this section, we detail the models,
the methodology used, and the experimental setup.

4.1. ALBERT
For the system using ALBERT, we follow a staged fine-tuning approach by fine-tuning on a
large dataset before fine-tuning on the smaller dataset. This preliminary stage of fine-tuning
on a large QA task is ideal due to the small size of the BioASQ dataset. However, large-scale
biomedical QA datasets that could be used for this first stage of fine-tuning are not readily
available. Therefore, we use the SQuAD dataset, a widely used extractive QA dataset. Thus, we
first fine-tune ALBERT on SQuAD2.0 and later on our downstream BioASQ task. This approach
is illustrated in Figure 2.




Figure 2: Diagram depicting our system’s fine-tuning strategy.


   ALBERT is a lighter version of BERT with considerably fewer parameters. Lan et al. [2] used
two parameter-reduction strategies to lower the memory usage and increase the training speed
of BERT. Since ALBERT models scale better than BERT, we have used the xxlarge version of
ALBERT for our experiments. The BioASQ task was set up as a span-extraction QA task in which
the model predicts the start and end span of answers for a given context and question. In both
stages of fine-tuning, the input to the model is the concatenation of passage and question with a
special token [SEP] separating them. This input is tokenized using WordPiece embeddings [24]
to handle the out-of-vocabulary issues. After WordPiece tokenization, the maximum allowable
input sequence length is 512 for both the ALBERT and DistilBERT models. The input representation
combines three embeddings: token, position, and segment (sentence). The segment embedding
differentiates the question from the passage, and the position embedding encodes the position of
each token in the sequence. The model returns start and end scores for each token, and the output
of the model is the candidate span with the highest score whose end position is greater than or
equal to its start position.
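A minimal sketch of this span-selection step is shown below; scoring a span by the sum of its
start and end logits and capping the span length are common choices that we assume here for
illustration.

```python
# Sketch of selecting the answer span from per-token start and end scores,
# keeping only candidates whose end position is greater than or equal to the
# start position. Summing the start and end logits and capping the span
# length are assumptions for illustration.
def best_span(start_logits, end_logits, max_answer_len=30):
    best_start, best_end, best_score = 0, 0, float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end
```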
   We employed “ALBERT-xxlarge” version 2 as our pretrained language model along with its
tokenizer, which are publicly available from the Huggingface Transformers Library [25]. This
model has an additional task-specific linear question answering layer on top to output the start
and end spans. Unless otherwise specified, the hyperparameters for both fine-tuning stages
were set to the default values used by the ALBERT developers. The systems were validated on
the BioASQ7b test batches 1 and 2.
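As an illustration, the sketch below loads this checkpoint and its tokenizer from the Transformers
library and encodes one question-snippet pair; the example texts are invented, and the fine-tuning
itself follows the standard extractive-QA recipe.

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Sketch of loading the pretrained ALBERT-xxlarge v2 checkpoint with a
# span-prediction (question answering) head on top, as distributed by the
# Huggingface Transformers library.
tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForQuestionAnswering.from_pretrained("albert-xxlarge-v2")

# Question and snippet are concatenated into one sequence separated by [SEP]
# and truncated to the 512-token limit; the texts below are invented examples.
inputs = tokenizer(
    "Which gene is mutated in cystic fibrosis?",
    "Cystic fibrosis is caused by mutations in the CFTR gene.",
    max_length=512, truncation="only_second", return_tensors="pt",
)
outputs = model(**inputs)  # outputs.start_logits and outputs.end_logits per token
```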
   All three ALBERT-based submissions use the same fine-tuning approach discussed above
with slight changes to the fine-tuning hyper-parameters. The systems along with hyperparame-
ters are listed in Table 2 and their results are listed in Table 3.

Table 2
ALBERT-based systems along with the hyperparameters.
             System Name     Learning Rate    Batch Size   Sequence Length    Epochs
             ALBERT 1                  3e-5            4                512         3
             ALBERT 2                  2e-5            4                512         4
             ALBERT 3                  1e-5            4                512         3



4.2. Gradual Unfreezing DistilBERT
In recent years, pretrained language models have grown bigger and deeper, with millions and
sometimes billions of parameters [2, 26]. The success of these models on NLP tasks has fueled
the race to scale them up further. However, these massive models carry a high computational
and environmental cost [27] and are difficult to deploy on mobile and edge devices, making
them unsuitable for many real-world applications. Sanh et al. [28] applied knowledge
distillation [29] and proposed a smaller language model, DistilBERT, that achieves performance
comparable to BERT on various NLP tasks. DistilBERT, a distilled, compact version of BERT,
has 60% fewer parameters than BERT.
   The focus for our second system was to study the effect of gradual unfreezing on the
transformer-based language models. We used DistilBERT as our pretrained model to con-
duct the experiments of gradually unfreezing the transformer layers. We chose DistilBERT
because of its small size and its ability to retain close to 95% of BERT’s performance across
NLP benchmarks.
   The process of fine-tuning allows the model to learn the distribution of the downstream task.
In standard fine-tuning, all the layers of the model are trained on the target task simultaneously.
Howard and Ruder [6] introduced a fine-tuning approach of gradually unfreezing one layer at a time,
starting from the top layer. They used a standard Long Short-Term Memory (LSTM) network
without any attention mechanism for their experiments. Our work investigates the gradual
unfreezing approach on DistilBERT using BioASQ9b as our target dataset.
   DistilBERT has three blocks of layers: one embedding layer, six transformer layers, and a
top task-specific layer. In our approach shown in Figure 3, we begin by fine-tuning only the
top task-specific layer for one epoch while keeping all other layers frozen. Then we unfreeze
the transformer layers consecutively in groups of three, fine-tune all the unfrozen layers for
one epoch, and repeat until all layers are fine-tuned except the embedding layer. The decision
to keep the embedding layer always frozen was based on the preliminary experiments in our
previous work [23]. As a result, DistilBERT’s trainable parameters have been reduced from 65
million to 42 million.
Figure 3: Diagram showing our unfreezing approach.
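
   A minimal sketch of this unfreezing schedule is shown below, using the module names of the
Huggingface DistilBertForQuestionAnswering implementation (distilbert.embeddings,
distilbert.transformer.layer, qa_outputs). For brevity it starts from the base checkpoint, whereas
in practice the schedule is applied to the SQuAD1.1-fine-tuned model, and the training loop itself
is omitted.

```python
from transformers import DistilBertForQuestionAnswering

# Sketch of the gradual unfreezing schedule: only the task-specific head is
# trained in the first epoch, the top three transformer layers are unfrozen in
# the second, and all six in the third, while the embedding layer stays frozen
# throughout. Module names follow the Huggingface DistilBERT QA model.
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased")

def set_trainable(module, trainable):
    for param in module.parameters():
        param.requires_grad = trainable

def apply_unfreezing_schedule(model, epoch):
    set_trainable(model.distilbert.embeddings, False)   # always frozen
    set_trainable(model.qa_outputs, True)                # head always trainable
    groups_unfrozen = min(epoch, 2)                      # 0, 1, then 2 groups of three
    for i, layer in enumerate(model.distilbert.transformer.layer):
        group = 0 if i >= 3 else 1                       # top layers unfreeze first
        set_trainable(layer, group < groups_unfrozen)

# One call per epoch of the second fine-tuning stage (epochs 0, 1 and 2).
for epoch in range(3):
    apply_unfreezing_schedule(model, epoch)
    # ... run one epoch of standard fine-tuning on the BioASQ9b data here ...
```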


   In this system, “distilbert-base-cased” [25] is first fine-tuned on the SQuAD1.1 data and then on
the BioASQ9b task. Our gradual unfreezing approach is only applied during the second stage of
fine-tuning. In the second phase of fine-tuning, we fine-tune the model at a constant learning
rate of 3e-5, sequence length of 512, and for three epochs. We evaluate the unfreezing approach
through two submissions at the BioASQ challenge. The system “DistilBERT” is our baseline
system. In this system, all the layers of DistilBERT are fine-tuned simultaneously. The system
“Unfreezing DistilBERT” is the model that was fine-tuned using our unfreezing approach. Both
systems are fine-tuned with the same hyperparameters for a fair comparison. Table 3 lists our
systems with the results, along with the top-ranked system on the BioASQ9b leaderboard.
We have reported the MRR in the results table since it is the main metric used by the BioASQ
organisers.


5. Results
The results of our submissions to the BioASQ9b Phase B challenge are shown in Table 3. From
the results, we observe that the “ALBERT 2” system was the best system for test batch 1, and
the “ALBERT 3” system ranked fourth for test batch 2 on the public leaderboard of the BioASQ9b
challenge.
Overall, the systems using the pretrained ALBERT weights have performed exceedingly well
on test batches 1 and 2. However, our ALBERT variants received poor results for test batches
4 and 5. It is worth noting that all the systems will be evaluated by human experts after the
competition. However, because this data was not accessible at the time of writing, we
rely on automatic evaluations available on the BioASQ leaderboard.
Table 3
Results of our five submissions along with the top-ranked system from the BioASQ9b leaderboard. The
first column of the table lists the unique submission identifier along with the system names as displayed
on the public leaderboard. The highest score for each batch is in bold.
                                                                    Factoid - Mean Reciprocal Rank (MRR)
   Submission (Display name)            System                      Batch 1     Batch 2     Batch 4     Batch 5
   MQ TL1 (ALBERT)                      ALBERT 1                    0.4379      0.4667      0.369       0.4468
   MQ TL2 (Ensemble)                    ALBERT 2                    0.4632      0.501       0.4167      0.4731
   MQ TL-3 (Another ALBERT)             ALBERT 3                    0.4621      0.5319      0.4375      0.4778
   MQ TL4 (Final BERT)                  DistilBERT                  -           0.5059      0.5399      0.5171
   MQ Transfer Learning (MRes)          Unfreezing DistilBERT       -           0.4887      0.5893      0.4917
   Top Ranked System                    -                           0.4632      0.5539      0.6929      0.588


Table 4
Results from our previous work [23] on the BioASQ7b dataset. The system ‘KU-DMIS Team’ [30, 15] is
a BioBERT-based system that topped the leaderboard in the BioASQ7b challenge.
                             Systems                       Mean Reciprocal Rank
                             KU-DMIS Team [30, 15]                            0.5235
                             DistilBERT-fine-tuned                            0.4844
                             DistilBERT-unfreeze-3                            0.4841


   The most noticeable difference between our DistilBERT and ALBERT variants, apart from
their sizes, is the initial fine-tuning stage. In our systems, ALBERT was fine-tuned on SQuAD2.0,
whereas DistilBERT was fine-tuned on SQuAD1.1. The SQuAD2.0 dataset is a reading com-
prehension dataset that, in addition to the SQuAD1.1 dataset, contains approximately 50,000
unanswerable questions. Once the organisers release the gold answers, we will examine whether
test batches 1 and 2 contained more unanswerable questions and, if so, how this affected the
results.
   From the results of Table 3, we observe that both “DistilBERT” and “Unfreezing DistilBERT”
outperformed the ALBERT variants for the test batches 4 and 5. Our system “Unfreezing
DistilBERT” is ranked sixth in the BioASQ9b public leaderboard. The average MRR score of test
batches 2, 4 and 5 for systems “DistilBERT” and “Unfreezing DistilBERT” is 0.5209 and 0.5232
respectively, and the difference is not statistically significant4 . Thus, we can conclude that
gradually unfreezing the transformer-based models has no significant impact on the model’s
accuracy compared to standard fine-tuning. These results further support the findings of our
previous work [23] on gradually unfreezing DistilBERT with the BioASQ7b dataset, the results
of which are shown in Table 4. The results show that gradually unfrozen models produce
promising results for a few test batches, but have no overall significant impact across all the
test batches.

     4
Paired t-tests were used to compute statistical significance since the MRR, being an average over samples, can
be assumed to be approximately normally distributed. We find no statistically significant difference between the
gradually unfrozen model and the baseline.
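As an illustration, a minimal sketch of such a paired t-test is shown below, assuming per-question
reciprocal-rank scores are available for both systems on the same questions; the values shown are
placeholders, not our actual scores.

```python
from scipy import stats

# Sketch of the paired t-test from footnote 4: reciprocal-rank scores for the
# same questions, one list per system. The values below are placeholders.
baseline_rr = [1.0, 0.5, 0.0, 1.0, 0.33]   # standard fine-tuning (DistilBERT)
unfrozen_rr = [1.0, 1.0, 0.0, 0.5, 0.33]   # gradual unfreezing

t_stat, p_value = stats.ttest_rel(unfrozen_rr, baseline_rr)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 -> no significant difference
```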
6. Conclusion
Our participation in BioASQ9b was primarily focused on generating exact answers for factoid
questions. We participated in four test batches, with our systems employing pretrained ALBERT
and DistilBERT language models. The results were mixed, with ALBERT-based systems ranking
amongst the top systems for test batches 1 and 2. For test batches 4 and 5, the compact DistilBERT
variants, despite having 81 percent fewer parameters, scored considerably better than ALBERT.
This paves the way for a biomedical version of DistilBERT on mobile and edge devices for real-
life biomedical QA applications. In addition, we investigated the effect of gradual unfreezing on
transformer-based language models using the BioASQ9b dataset. We conclude that gradually
unfreezing the layers of DistilBERT had no significant impact on the model’s accuracy in
comparison to standard fine-tuning. We also investigated an unfreezing approach that makes
use of only 66% of DistilBERT’s parameters when fine-tuning. In the future, we will aim to
investigate ensemble or hybrid models of DistilBERT and ALBERT.


References
 [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/
     anthology/N19-1423. doi:10.18653/v1/N19-1423.
 [2] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for
     self-supervised learning of language representations, arXiv preprint arXiv:1909.11942
     (2019).
 [3] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
 [4] S. Ruder, M. E. Peters, S. Swayamdipta, T. Wolf, Transfer learning in natural language
     processing, in: Proceedings of the 2019 Conference of the North American Chapter of the
     Association for Computational Linguistics: Tutorials, 2019, pp. 15–18.
 [5] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, S. Lehmann, Using millions of emoji occur-
     rences to learn any-domain representations for detecting sentiment, emotion and sarcasm,
     in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Pro-
     cessing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 1615–
     1625. URL: https://www.aclweb.org/anthology/D17-1169. doi:10.18653/v1/D17-1169.
 [6] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in:
     Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
     (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia,
     2018, pp. 328–339. URL: https://www.aclweb.org/anthology/P18-1031. doi:10.18653/v1/
     P18-1031.
 [7] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune bert for text classification?, in: China
     National Conference on Chinese Computational Linguistics, Springer, 2019, pp. 194–206.
 [8] S. Garg, T. Vu, A. Moschitti, Tanda: Transfer and adapt pre-trained transformer models
     for answer sentence selection, in: Proceedings of the AAAI Conference on Artificial
     Intelligence, volume 34, 2020, pp. 7780–7788.
 [9] M. Jeong, M. Sung, G. Kim, D. Kim, W. Yoon, J. Yoo, J. Kang, Transferability of natural
      language inference to biomedical question answering, arXiv preprint arXiv:2007.00217 (2020).
[10] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weis-
     senborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the bioasq
     large-scale biomedical semantic indexing and question answering competition, BMC
     bioinformatics 16 (2015) 1–28. doi:10.1186/s12859-015-0564-6.
[11] G. Wiese, D. Weissenborn, M. Neves, Neural domain adaptation for biomedical question
     answering, in: Proceedings of the 21st Conference on Computational Natural Language
     Learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada,
     2017, pp. 281–289. URL: https://www.aclweb.org/anthology/K17-1029. doi:10.18653/v1/
     K17-1029.
[12] D. Weissenborn, G. Wiese, L. Seiffe, Making neural QA as simple as possible but not
     simpler, in: Proceedings of the 21st Conference on Computational Natural Language
     Learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada,
     2017, pp. 271–280. URL: https://www.aclweb.org/anthology/K17-1028. doi:10.18653/v1/
     K17-1028.
[13] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine
     comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in
     Natural Language Processing, Association for Computational Linguistics, Austin, Texas,
     2016, pp. 2383–2392. URL: https://www.aclweb.org/anthology/D16-1264. doi:10.18653/
     v1/D16-1264.
[14] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: pre-trained biomedical
     language representation model for biomedical text mining, arXiv preprint arXiv:1901.08746
     (2019).
[15] W. Yoon, J. Lee, D. Kim, M. Jeong, J. Kang, Pre-trained language model for biomedical
     question answering, arXiv preprint arXiv:1909.08229 (2019).
[16] Q. Jin, B. Dhingra, W. Cohen, X. Lu, Probing biomedical embeddings from language
     models, in: Proceedings of the 3rd Workshop on Evaluating Vector Space Representations
     for NLP, Association for Computational Linguistics, Minneapolis, USA, 2019, pp. 82–89.
     URL: https://www.aclweb.org/anthology/W19-2011. doi:10.18653/v1/W19-2011.
[17] S. Hosein, D. Andor, R. McDonald, Measuring domain portability and error propagation in
     biomedical qa, arXiv preprint arXiv:1909.09704 (2019).
[18] M. Resta, D. Arioli, A. Fagnani, G. Attardi, Transformer models for question answering
     at bioasq 2019, in: Joint European Conference on Machine Learning and Knowledge
     Discovery in Databases, Springer, 2019, pp. 711–726.
[19] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence
     understanding through inference, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, 2018,
     pp. 1112–1122. URL: http://aclweb.org/anthology/N18-1101.
[20] A. Kazaryan, U. Sazanovich, V. Belyaev, Transformer-based open domain biomedical
     question answering at bioasq8 challenge (2020).
[21] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for
     SQuAD, in: Proceedings of the 56th Annual Meeting of the Association for Computational
     Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Mel-
     bourne, Australia, 2018, pp. 784–789. URL: https://www.aclweb.org/anthology/P18-2124.
     doi:10.18653/v1/P18-2124.
[22] Pubmed, Pubmed® comprises more than 30 million citations for biomedical literature from
     medline, life science journals, and online books., 2020. URL: https://pubmed.ncbi.nlm.nih.
     gov, [Online; accessed 1-December-2020].
[23] U. Khanna, Gradual unfreezing transformer-based language models for biomedical question
     answering, http://hdl.handle.net/1959.14/1280832, 2021. [Macquarie University, Sydney,
     Australia Online; accessed 03-June-2021].
[24] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,
     K. Macherey, et al., Google’s neural machine translation system: Bridging the gap between
     human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural lan-
     guage processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
     unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[27] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning
     in NLP, in: Proceedings of the 57th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3645–3650.
     URL: https://www.aclweb.org/anthology/P19-1355. doi:10.18653/v1/P19-1355.
[28] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[29] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv
     preprint arXiv:1503.02531 (2015).
[30] Tsatsaronis et al, Bioasq participants area task 7b: Test results of phase b, http://
     participants-area.bioasq.org/results/7b/phaseB/, 2019. [Online; accessed 17-January-2021].