Transferability of Natural Language Inference to
        Biomedical Question Answering

    Minbyul Jeong1[0000−0002−1346−730X]*, Mujeen Sung1[0000−0002−7978−8114]*,
    Gangwoo Kim1[0000−0003−4581−0384] , Donghyeon Kim2[0000−0002−8224−8354] ,
     Wonjin Yoon1[0000−0002−6435−548X] , Jaehyo Yoo1[0000−0002−3600−6362] , and
                        Jaewoo Kang1[0000−0001−6798−9106]
1
    Department of Computer Science and Engineering, Korea University, Seoul, Korea
                 2
                   AIR Lab, Hyundai Motor Company, Seoul, Korea
          {minbyuljeong, mujeensung, gangwoo kim, wjyoon, jaehyoyoo,
                kangj}@korea.ac.kr, donghyeon.kim@hyundai.com



        Abstract. Biomedical question answering (QA) is a challenging task
        due to the scarcity of data and the requirement of domain expertise.
        Pre-trained language models have been used to address these issues. Re-
        cently, learning relationships between sentence pairs has been shown to
        improve performance in general QA. In this paper, we focus on applying
        BioBERT to transfer the knowledge of natural language inference (NLI)
        to biomedical QA. We observe that BioBERT trained on the NLI dataset
        obtains better performance on Yes/No (+5.59%), Factoid (+0.53%), and
        List type (+13.58%) questions compared to the performance obtained in
        a previous challenge (BioASQ 7B Phase B). We present a sequential trans-
        fer learning method that performed significantly well in the 8th BioASQ
        Challenge (Phase B). In sequential transfer learning, the order in which
        tasks are fine-tuned is important. We also measure the unanswerable rate
        of the extractive QA setting when the formats of factoid and list type
        questions are converted to the format of the Stanford Question Answering
        Dataset (SQuAD).

        Keywords: Transfer Learning · Domain Adaptation · Natural Language
        Inference · Biomedical Question Answering


1      Introduction
Biomedical question answering (QA) is a challenging task due to the limited
amount of data and the requirement of domain expertise. To address these issues,
pre-trained language models [13, 26] are used and further fine-tuned on a target
task [2, 4, 7, 19, 20, 31, 32, 36]. Although the pre-trained language models improve
performance on the target tasks, the models are still short of the upper-bound
  Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
  September 2020, Thessaloniki, Greece.
*
  Equal contribution
performance in biomedical QA. Sequential transfer learning builds on transfer
learning and is used to further improve biomedical QA performance [2, 20, 36].
For example, fine-tuning on both the SQuAD dataset [28] and the BioASQ
dataset [33] results in higher performance than fine-tuning on only the BioASQ
dataset. In the general QA domain, first learning relationships between sentence
pairs is effective in sequential transfer learning [4, 11, 27, 34, 35]. Thus, in this
paper, we fine-tune on the natural language inference (NLI) task [1, 10] to improve
performance in biomedical QA. We find that performance improves when the
objective function of the fine-tuned task becomes similar to that of the downstream
task. We also find that applying the NLI task to the biomedical QA task addresses
task discrepancy. Task discrepancy refers to differences between fine-tuned tasks
in context length distribution, objective function, and domain.
    Specifically, we focus on reducing the discrepancy of context length distri-
butions between NLI and biomedical QA to improve sequential transfer learning
performance on the target task. To reduce the discrepancy, we simply unify the
context length distributions of the fine-tuned tasks. We reduce each SQuAD
context to a single sentence containing the ground truth answer span [23]. Fine-
tuning on a unified distribution reduces the time to train and perform inference
on the BioASQ dataset by 52.95% and 25%, respectively. Finally, we measure the
unanswerable rate of the extractive QA setting when the format of the BioASQ
dataset is converted to the format of the SQuAD dataset.
    Our contributions are as follows:

  (i) We show that fine-tuning on an NLI dataset is effective for Yes/No, Factoid,
      and List type questions in the BioASQ dataset.
 (ii) We demonstrate that unifying the context length distributions of the fine-
      tuned tasks improves the sequential transfer learning performance of
      biomedical QA.
(iii) For the Factoid and List type questions, we measure the unanswerable rate
      of the extractive QA setting when the format of the BioASQ dataset is
      converted to that of the SQuAD dataset.


2    Related Work

Transfer Learning Transfer learning, also known as domain adaptation, refers
to applying knowledge learned in a previous task to a subsequent task. In
various fields including image processing and natural language processing (NLP),
many studies have shown the effectiveness of transfer learning based on deep
neural networks [15, 22, 24, 31, 37]. More recently, especially in NLP, pre-trained
language models such as ELMo [26] and BERT [13] have been used for transfer
learning [4, 11, 13, 18, 19, 21, 25]. In the biomedical domain, unsupervised pre-
training has been used for biomedical contextualized representations [9, 16, 20].
BioBERT [20] was initialized with BERT and further pre-trained on biomedical
corpora (e.g., PubMed and PubMed Central), and it can be employed for various
tasks in the biomedical or clinical domain [8, 9, 16, 17, 25, 36].
Transferability of Natural Language Understanding The authors of [2]
transferred the knowledge obtained from the SQuAD dataset to the target
BioASQ dataset to address the data scarcity issue. In [20, 36], the authors
adopted sequential transfer learning (e.g., BioBERT-SQuAD-BioASQ) to im-
prove biomedical QA performance. Meanwhile, multiple NLI datasets have been
constructed for the general domain [1, 3, 10, 28, 35] and domain-specific datasets
(e.g., biomedical) have recently been introduced [25, 30]. In [4], the authors found
that fine-tuning on the MultiNLI (MNLI) dataset [1] consistently improves
performance on target tasks across the GLUE benchmark [35]. The authors
of [12] found that applying knowledge from an NLI dataset improves performance
on various yes/no type QA tasks in the general domain. Furthermore, the authors
of [34] used datasets of various sizes for question answering, text classification/
regression, and sequence labeling tasks. In this paper, we use the MNLI dataset
to improve performance in biomedical QA.


3     Methods
In this section, we outline our problem setting for the downstream task. Our
training details are provided in Appendix A. We explain our method of learning
biomedical entity representations using BioBERT. Then we describe how to
perform sequential transfer learning with BioBERT for each biomedical question
type in the BioASQ Challenge. Our method can be used to apply BioBERT,
which was first trained on an NLI dataset, to biomedical QA.

3.1   Problem Setting
We converted the format of the BioASQ dataset to the format of the SQuAD
dataset. In detail, training instances in the BioASQ dataset are composed of
a question (Q), human-annotated answers (A), and relevant contexts (C), also
called snippets. If answer spans were not provided by the human annotators,
we first found exact spans in the contexts based on the human-annotated answers
to factoid and list type questions. In this case, we enumerated all the combinations
of Q-C-A triplets only when the answer span exactly matches the context. Yes
and No answers to Yes/No type questions do not appear in the contexts; thus,
we fine-tuned a task-specific binary classifier to predict the answers.
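
A minimal sketch, assuming BioASQ-style fields (the function and field names are illustrative, not the authors' released preprocessing code), of how such Q-C-A triplets can be enumerated with an exact-match check:

```python
def bioasq_to_squad(question, snippets, gold_answers):
    """Enumerate SQuAD-style Q-C-A triplets whose answer exactly matches a snippet."""
    examples = []
    for context in snippets:
        for answer in gold_answers:
            start = context.find(answer)          # exact string match only
            if start == -1:
                continue                          # unanswerable in this snippet
            examples.append({
                "question": question,
                "context": context,
                "answers": {"text": [answer], "answer_start": [start]},
            })
    return examples
```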

3.2   Overall Architecture
Input sequence X consists of the concatenation of the BERT [CLS] token, Q,
and C, with a [SEP] token after Q and after C. The sequence is denoted as
X = {[CLS] ‖ Q ‖ [SEP] ‖ C ‖ [SEP]}, where ‖ refers to the concatenation of
tensors. The hidden representation vector of the i-th input token is denoted as
h_i ∈ R^H, where H denotes the hidden size. Finally, we fine-tuned the hidden
vectors corresponding to each question type, and the vectors were fed into a
softmax classifier or binary classifier.
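
A minimal sketch of how this input sequence can be built with the Hugging Face tokenizer; the BioBERT checkpoint name is an assumption, since the paper does not specify which release was used:

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any BioBERT checkpoint is handled the same way.
name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

question = "What causes Bathing suit Ichthyosis (BSI)?"
context = "BSI is a minor variant of ARCI caused by transglutaminase-1 (TGM1) mutations."

# The tokenizer builds [CLS] Q [SEP] C [SEP] automatically for sentence pairs.
inputs = tokenizer(question, context, return_tensors="pt",
                   truncation=True, max_length=384)
hidden = model(**inputs).last_hidden_state     # shape: (1, seq_len, H)
cls_vector = hidden[:, 0, :]                   # h_[CLS], used for Yes/No questions
```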
Yes/No Type To compute the yes probability P^{yes}, we apply a linear
transformation matrix M ∈ R^{1×H} to the hidden representation of the
[CLS] token, C ∈ R^H. In binary classification, the sigmoid function is used
to calculate the yes probability as follows:

P^{yes} = \frac{1}{1 + e^{-C \cdot M^{\top}}} \quad (1)
   The binary cross entropy loss is utilized between the yes probability P^{yes}
and its corresponding ground truth answer a^{yes}. Our total loss is computed as
below.

Loss = -(a^{yes} \log P^{yes} + (1 - a^{yes}) \log(1 - P^{yes})) \quad (2)
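
A minimal PyTorch sketch of this binary classification head (an illustrative re-implementation, not the authors' code), applying Eq. (1) and (2) to the [CLS] representation:

```python
import torch
import torch.nn as nn

class YesNoHead(nn.Module):
    """Sigmoid classifier over the [CLS] representation (Eq. 1 and 2)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)    # M in R^{1 x H}
        self.bce = nn.BCEWithLogitsLoss()        # numerically stable sigmoid + BCE

    def forward(self, cls_vector, labels=None):
        logits = self.proj(cls_vector).squeeze(-1)       # C . M^T
        p_yes = torch.sigmoid(logits)                    # Eq. (1)
        loss = self.bce(logits, labels.float()) if labels is not None else None
        return p_yes, loss                               # loss follows Eq. (2)
```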

Factoid & List type At hidden representation vectors, the start and end
vectors of answer spans were computed in one linear transformation matrix M ∈
R2×H . Let us denote the ith and j th predicted answer tokens as start and end,
respectively. The probability of (Pistart , Pjend ) can be calculated as follows:
P_i = P_i^{start} \| P_i^{end} = \frac{e^{h_i \cdot M^{\top}}}{\sum_{t=1}^{s} e^{h_t \cdot M^{\top}}}, \quad
P_j = P_j^{start} \| P_j^{end} = \frac{e^{h_j \cdot M^{\top}}}{\sum_{t=1}^{s} e^{h_t \cdot M^{\top}}} \quad (3)



where s denotes the sequence length of BioBERT and · is the dot-product. Our
objective function is the negative log-likelihood for the predicted answer with
the ground truth answer position. Start and end position losses are computed
as below:
Loss^{start} = -\frac{1}{N} \sum_{n=1}^{N} \log P_{a_s}^{start,n}, \quad
Loss^{end} = -\frac{1}{N} \sum_{n=1}^{N} \log P_{a_e}^{end,n} \quad (4)




where N denotes the batch size, and a_s and a_e are the ground truth start and
end positions of each instance, respectively. Our total loss is the
arithmetic mean of Loss^{start} and Loss^{end}.
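
A minimal PyTorch sketch of this span prediction head (an illustrative re-implementation, not the authors' code), following Eq. (3) and (4); cross-entropy over token positions is equivalent to the softmax plus negative log-likelihood described above:

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Start/end span classifier for factoid and list type questions (Eq. 3 and 4)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)      # M in R^{2 x H}

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: (N, s, H) token representations from BioBERT
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)          # (N, s)
        end_logits = end_logits.squeeze(-1)
        loss = None
        if start_positions is not None:
            ce = nn.CrossEntropyLoss()                   # softmax + NLL over positions
            loss = (ce(start_logits, start_positions) +
                    ce(end_logits, end_positions)) / 2   # mean of start and end losses
        return start_logits, end_logits, loss
```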

3.3   Transferability in Domains and Tasks
Yes/No Type Training a model to classify relationships of sentence pairs can
enhance its performance on yes or no type questions in the general domain [12].
Based on this finding, we believe that a classifier could be used for yes and no
type questions in biomedical QA. Thus, we fine-tuned BioBERT on the NLI
task so that it can be used to answer biomedical yes or no type questions. We
used the MNLI dataset because it is widely used and has a sufficient amount of
data from various genres. Furthermore, as shown in Tables 9 and 10, sequential
transfer learning models trained on the MNLI dataset obtained meaningful re-
sults. For our learning sequence, we fine-tuned BioBERT on the MNLI dataset,
which contains the relationships between hypothesis and premise sentences. We
composed a sequential transfer learning method, denoted as BioBERT-MNLI-
BioASQ. However, using the final layer of the MNLI task instead of the binary
classifier to compute P^{yes} does not improve the performance of BioBERT on
the BioASQ dataset. For this reason, we added a simple binary classifier on top
of BioBERT. Furthermore, the context length distribution of the MNLI dataset
and the distribution of the snippets of Yes/No type questions in the BioASQ
dataset are similar. Therefore, we did not unify the context length distributions
for yes and no type questions.

Factoid & List Type The order of sequential transfer learning is important
in bridging the gap between different tasks. Performance improves when the
objective function of the fine-tuned task becomes similar to that of the down-
stream task in Table 5. Thus, we used the learning sequence BioBERT-MNLI-
SQuAD-BioASQ instead of BioBERT-SQuAD-MNLI-BioASQ. To address the
discrepancy of context length distribution between the SQuAD dataset and the
BioASQ dataset, we slightly modified the original experimental setting. As sug-
gested in [23], we reorganized the context length distributions in the SQuAD
dataset which is similar to the MNLI dataset and BioASQ dataset. We devel-
oped an extractive QA setting that is scalable to minimal context and that does
not use irrelevant sentences in full abstracts [36]. Therefore, we extracted a sen-
tence containing the ground truth answer span and set as a complete paragraph
to construct the minimal context. As a result, we reduced the discrepancy of
context length distribution by unifying the context length distributions for our
sequential transfer learning. Unifying the distributions of context length reduced
the time to train and perform inference on factoid and list type questions. Our
method achieved comparable results to those of the baseline method.
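
The sketch below illustrates this minimal-context construction under simplifying assumptions (a naive regular-expression sentence splitter; the authors' exact preprocessing may differ): it returns the single sentence containing the gold answer span together with the shifted answer offset.

```python
import re

def minimal_context(paragraph: str, answer: str, answer_start: int):
    """Return the sentence containing the gold answer span plus the shifted offset."""
    for m in re.finditer(r"[^.!?]+[.!?]?\s*", paragraph):   # naive sentence spans
        if m.start() <= answer_start < m.end() and answer in m.group():
            return m.group().strip(), answer_start - m.start()
    return paragraph, answer_start          # fall back to the full paragraph
```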


4     Experiments

4.1    Datasets

Our datasets are based on the pre-processed datasets provided by [1, 28, 36]. For
the extractive QA setting, we converted the BioASQ dataset format (Yes/No,
Factoid, and List type questions) to the format of the SQuAD dataset. In [36], the
authors suggested three pre-processing strategies, and for our study, we utilized
two of the three strategies: Snippet-as-is and Full-Abstract. However, we added
the criterion of having a blank space before and after each biomedical entity.
This criterion has been shown to improve performance in distinguishing biomedical
named entities. The statistics of the pre-processed dataset are listed in Table 8.
We have made the pre-processed BioASQ datasets publicly available.3 In the
experimental setting, we removed approximately 5K training instances from the
SQuAD dataset because their answer spans do not exactly match the context.
3
    https://github.com/dmis-lab/bioasq8b
        Reference System         Yes/No (Macro F1) Factoid (MRR) List (F1)
 Dimitriadis & Tsoumakas [14]          0.5541                  -              -
 Hosein et al., [7]                       -                 0.4562            -
 Oita et al., [5]                      0.4831                  -              -
 Resta et al., [29]                    0.7873                  -              -
 Telukuntla et al., [6]                0.4486               0.4751         0.2002
 Yoon et al., [36]                     0.7169               0.5116         0.4061
 Ours                                  0.8432               0.5163         0.5419

Table 1. BioASQ 7B (Phase B) Challenge results and our results. We use a dash (-) if
the paper does not report results on a question type. Scores were averaged over batches
when batch results are reported in each paper. In each column, the best score is in
bold.


4.2   Experimental Results

In Table 1, we compare our results with the best results from last year’s BioASQ
Challenge Task 7B (Phase B) [5–7,14,29,36]. From this comparison, we observe
that training BioBERT on the MNLI dataset significantly improves its perfor-
mance on the Yes/No (+5.59%), Factoid (+0.53%), and List (+13.58%) type
questions.



                                    Yes/No Type

                                                         Evaluation Metric
 # of Tasks Sequence of Transfer Learning
                                               Accuracy Yes F1 No F1 Macro F1
             BioBERT-SQuAD-BioASQ               0.8518    0.9004 0.6896      0.7950
  6B Test
             BioBERT-MNLI-BioASQ                0.8857    0.9212 0.7798      0.8505
             BioBERT-SQuAD-BioASQ               0.8595    0.8990 0.7344      0.8167
  7B Test
             BioBERT-MNLI-BioASQ                0.8945    0.9275 0.7588      0.8432

Table 2. Yes/No type question experiments. Evaluation metrics are accuracy (Accu-
racy), F1 score, and macro F1 score (Macro F1). The F1-score of yes type questions is
denoted as Yes F1, and the F1 score of the no type questions is denoted as No F1. In
the columns, the best score obtained in each task is in bold.


    First, the Yes/No type question scores obtained by our method are shown
in Table 2. We observed that using the SQuAD dataset for intermediate fine-
tuning improves performance [2, 20, 36]. Therefore, we evaluated fine-tuning
BioBERT with the sequence BioBERT-SQuAD-BioASQ, as done in [20, 36],
where BioBERT is trained on the SQuAD dataset for the QA task. Fine-tuning
BioBERT with the sequence BioBERT-MNLI-BioASQ significantly improves its
performance: BioBERT obtains higher macro F1 scores (+5.55%, +2.65%) than
the baseline. We believe that answering yes and no type questions in the
                                 Context Length Discrepancy

                                                         Factoid (%)         List (%)
# of Tasks Setting     Sequence of Transfer Learning
                                                       SAcc LAcc MRR Prec Recall        F1
                      BioBERT-SQuAD-BioASQ      39.80 57.82 47.22 45.02 47.69 42.34
           Original
                      BioBERT-MNLI-SQuAD-BioASQ 38.80 61.34 47.42 46.60 47.01 42.44
 6B Test              BioBERT-SQuAD-BioASQ      39.71 56.37 45.81 46.81 40.26 39.63
           Document
                      BioBERT-MNLI-SQuAD-BioASQ 39.71 55.10 45.77 46.26 39.23 38.13
                      BioBERT-SQuAD-BioASQ      38.23 57.34 46.24 48.24 46.86 42.83
           Snippet
                      BioBERT-MNLI-SQuAD-BioASQ 41.41 57.40 48.05 46.01 45.95 42.75
                      BioBERT-SQuAD-BioASQ      41.95 58.30 48.66 61.32 52.83 52.36
           Original
                      BioBERT-MNLI-SQuAD-BioASQ 42.22 61.06 49.85 61.46 54.62 54.19
 7B Test              BioBERT-SQuAD-BioASQ      44.46 57.98 50.02 58.30 39.19 43.89
           Document
                      BioBERT-MNLI-SQuAD-BioASQ 43.34 58.13 49.21 61.01 41.82 45.78
                      BioBERT-SQuAD-BioASQ      40.79 58.93 48.27 60.08 53.96 53.18
           Snippet
                      BioBERT-MNLI-SQuAD-BioASQ 45.10 62.45 51.63 60.92 53.12 53.01

Table 3. Context Length Discrepancy Experiments. The metrics used to measure
performance on factoid type questions are strict accuracy (SAcc), lenient accuracy
(LAcc), and mean reciprocal rank (MRR). The metrics used to evaluate performance
on list type questions are precision (Prec), recall (Recall), and macro F1 (F1). ’Original’
indicates training BioBERT on full documents in SQuAD and snippets in BioASQ.
’Document’ indicates that BioBERT was trained on full documents in SQuAD and full
abstracts in BioASQ. ’Snippet’ denotes training on a unified distribution of minimal
context. All five batch results are averaged. In the columns, the best score obtained in
each task is in bold.


BioASQ dataset is similar to deciding the relationship between sentence pairs in
the MNLI dataset. We also replaced the binary classifier of BioBERT, which is
trained on the BioASQ dataset, with the final layer of the MNLI task, but this
did not improve performance. Thus, we fine-tuned the binary classifier to answer
yes and no type questions.
    When using the MNLI dataset for the factoid and list type questions, we
considered the discrepancy of context length distributions. The obtained results
are shown in Table 3. In the original experimental setting, full documents in
the SQuAD dataset and snippets in the BioASQ dataset were used for training
BioBERT. The performance of our method on the 6B test set did not improve.
However, we observed that its performance improves with the size of the training
set, as shown by the higher performance on the 7B test set compared with that
on the 6B test set.
    In the document setting, we used the whole paragraphs and the full abstracts
of the SQuAD and BioASQ datasets, respectively. Performance obtained in this
setting is lower than that obtained in the original setting due to using longer
context rather than snippets in the BioASQ dataset. In other words, rather
than using the human annotated corpus (i.e., snippets), the search space in
which an answer can be found was expanded to full abstracts. Nevertheless, the
                       Yes/No                      Factoid                        List
# of Batches                                                                                      Macro Avg.
                System Name       Macro F1   System Name      MRR       System Name       F1
            Ours                  0.8663 Ours                0.4438 Ours                 0.3718    0.5606
 8B batch 1 FudanLabZhu1          0.4518 FudanLabZhu1        0.4557 FudanLabZhu1         0.3408    0.4161
            Umass czi 4           0.5989 Umass czi 4         0.3005 Umass czi 4          0.3448    0.4147
            Ours                  0.8928 Ours                0.3533 Ours                0.3798     0.5420
 8B batch 2 UoT multitask learn   0.7000 UoT multitask learn 0.2800 UoT multitask learn 0.4108     0.4636
            FudanLabZhu4          0.6303 FudanLabZhu4        0.2900 FudanLabZhu4        0.4678     0.4627
            Umass czi 4           0.9016 Umass czi 4         0.3810 Umass czi 4          0.4522    0.5782
 8B batch 3 Ours                  0.9028 Ours                0.3601 Ours                 0.4520    0.5716
            pa-base               0.8995 pa-base             0.3137 pa-base              0.4585    0.5572
            Ours                  0.7636 Ours                0.6078 Ours                 0.4037    0.5917
 8B batch 4 91-initial-Bio        0.7204 91-initial-Bio      0.5735 91-initial-Bio       0.3905    0.5615
            Features Fusion       0.7097 Features Fusion     0.5745 Features Fusion      0.3625    0.5489
            Ours                 0.8518 Ours                 0.5677 Ours                 0.5582    0.6592
 8B batch 5 Parameters retrained 0.7509 Parameters retrained 0.5938 Parameters retrained 0.4004    0.5817
            Features Fusion      0.7509 Features Fusion      0.6115 Features Fusion      0.3810    0.5811


Table 4. BioASQ 8B results obtained by the top three systems. The best scores were
obtained from the BioASQ leaderboard (http://participants-area.bioasq.org/results/8b/phaseB/).
When a team submitted the same system under different names, we treated it as one
system and report its highest scores. We report the macro average scores obtained on
all types of questions in the BioASQ dataset. Our systems are in bold.


performance of our proposed method on the factoid type questions in the 7B
test set improved when BioBERT was fine-tuned on the SQuAD dataset.
    For the snippet setting, we unify the distributions of context length in the
extractive QA setting. Our method extracts the sentence containing the ground
truth answer span, i.e., the minimal context, and the performance of our method
on the 6B & 7B test sets significantly improved. We recognize that it is hard to
prove the generalization of our method because the test sets of the BioASQ
dataset are small and the variance of performance is relatively high. However,
we demonstrate superior performance by reducing the task discrepancy of
factoid type questions in 6B & 7B. Although we achieved better performance
on list type questions, reducing the discrepancy of context length distribution
does not affect them significantly. We believe that, given the objective function
of list type questions, further analyses are needed to demonstrate the gener-
alization of sequential transfer learning with fine-tuning on an NLI dataset.


5     Analysis
Order of Sequential Transfer Learning The BioASQ Challenge Task 8B
(Phase B) results are shown in Table 4. Each team was allowed to submit up
to five systems with different combinations of features. The 8B ground truth
answers were not available, so we could not use them to manually evaluate our
proposed method. Thus, we report the scores from the leaderboard.4
    In this ablation study, we explore the importance of the order of sequential
transfer learning. The results are shown in Table 5. We found that fine-tuning
4
    http://participants-area.bioasq.org/results/8b/phaseB/
                                   Order Importance

                                                   Factoid (%)           List (%)
 # of Tasks    Sequence of Transfer Learning
                                                SAcc LAcc MRR Prec Recall           F1
              BioBERT-SQuAD-BioASQ      39.80 57.82 47.22 45.02 47.69 42.34
  6B Test     BioBERT-SQuAD-MNLI-BioASQ 41.15 57.95 47.29 46.18 44.56 40.98
              BioBERT-MNLI-SQuAD-BioASQ 38.80 61.34 47.42 46.60 47.01 42.44
              BioBERT-SQuAD-BioASQ      41.95 58.30 48.66 61.32 52.83 52.36
  7B Test     BioBERT-SQuAD-MNLI-BioASQ 43.31 58.69 49.24 60.77 50.74 50.72
              BioBERT-MNLI-SQuAD-BioASQ 42.22 61.06 49.85 61.46 54.62 54.19

Table 5. Experiments on the importance of the order of sequential transfer learn-
ing. The metrics used for measuring performance on factoid-type questions are strict
accuracy (SAcc), lenient accuracy (LAcc), and mean reciprocal rank (MRR). The met-
rics used for evaluating performance on list-type questions are precision (Prec), recall
(Recall), and macro F1 (F1). The best score obtained in each task is in bold.



BioBERT on the MNLI dataset improved its performance on factoid type ques-
tions. On the other hand, its performance on list type questions improved when
the objective function of fine-tuned tasks was similar to that of the BioASQ
task. In other words, BioBERT needs to be fine-tuned on the SQuAD dataset
after fine-tuning it on the MNLI dataset.




 Type    7B Batch1    7B Batch2    7B Batch3    7B Batch4    7B Batch5      7B Total
Factoid 0.359 (14/39) 0.120 (3/25) 0.310 (9/29) 0.118 (4/34) 0.229 (8/35) 0.216 (35/162)
List    0.083 (1/12) 0.235 (4/17) 0.200 (5/25) 0.136 (3/22) 0.500 (6/12) 0.204 (18/88)

Table 6. Statistics of the unanswerable rate in the extractive QA setting, i.e., the cases
where the ground truth answer does not exactly match the context of the human anno-
tated corpus (snippet). The unanswerable rate is related to the upper-bound perfor-
mance of our proposed method.


Unanswerable rate of the Extractive QA Setting So far, the experiments
were performed in the extractive QA setting. We manually analyzed differences
between the answer span and the context of the human annotated corpus from
the BioASQ Challenge Task 7B (Phase B) test set. We used the test set instead
of the training set for measuring the unanswerable rate of the extractive QA
setting for the following two reasons. First, we wanted to measure the upper-
bound performance of our proposed method. Second, the training and test data
of the BioASQ dataset are similar to those of the dataset from the previous year.
Table 6 shows the unanswerable rate for all batch results of the 7B test set, which
contains only factoid and list type questions. We calculated the unanswerable
rate of the extractive QA setting using the rule that the ground truth answer does
not exactly match the context of the human annotated corpus (snippet). The
rule applies to the following cases: no exact match, lowercase match, additional
phrase added, and a different type of blank space between the exact answer and
the snippet. In Table 7, we randomly sample such cases. Due to the lack of space,
we provide more examples at our URL.5 Here, we use the extractive
QA setting to measure the upper-bound performance of our method. We hope
our analysis is helpful in designing experimental settings.
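
A minimal sketch, with assumed field names, of the unanswerable-rate computation reported in Table 6: the fraction of questions whose gold answers have no exact string match in any human-annotated snippet.

```python
def unanswerable_rate(instances):
    """instances: dicts with 'snippets' (list of str) and 'answers' (list of str)."""
    unanswerable = 0
    for inst in instances:
        found = any(answer in snippet
                    for answer in inst["answers"]
                    for snippet in inst["snippets"])
        unanswerable += 0 if found else 1
    return unanswerable / len(instances)
```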


                                      Limitations of the Supervised Setting

 Type                                       ID - Question - Context - Answer
        ID: 5c531d8f7e3cb0e231000017
        Question: What causes Bathing suit Ichthyosis(BSI)?
        Ground Truth Answer: transglutaminase-1 gene (TGM1) mutations
Factoid
        Context: Bathing suit ichthyosis (BSI) is an uncommon phenotype classified as a minor
        variant of autosomal recessive congenital ichthyosis (ARCI). OBJECTIVES: We report a case of
        BSI in a 3-year-old Tunisian girl with a novel mutation of the transglutaminase 1 gene (TGM1)
           ID: 5c5214207e3cb0e231000003
           Question: List potential reasons regarding why potentially important genes are ignored
           Ground Truth Answer: Identifiable chemical properties, Identifiable physical properties,
           Identifiable biological properties, Knowledge about homologous genes from model organisms
           Context: Here, we demonstrate that these differences in attention can be explained, to a large
    List
           extent, exclusively from a small set of identifiable chemical, physical, and biological properties
           of genes. Together with knowledge about homologous genes from model organisms, these
           features allow us to accurately predict the number of publications on individual human
           genes, the year of their first report, the levels of funding awarded by the National Institutes
           of Health (NIH), and the development of drugs against disease-associated genes.


Table 7. Examples of unanswerable questions in the extractive QA setting for the BioASQ
dataset. We used factoid and list type questions from the 7B test set. Context refers
to a snippet in the human annotated corpus provided by the organizers of the BioASQ
Challenge. No exact matches are in bold and exact matches in lowercase are underlined.




6          Conclusion

In this work, we used natural language inference (NLI) as a first step in fine-
tuning BioBERT for biomedical question answering (QA). Training BioBERT
to classify relationships between sentence pairs improved its performance in
biomedical QA. We empirically demonstrated that fine-tuning BioBERT on the
NLI dataset improved its performance on the BioASQ dataset from the BioASQ
Challenge. We unified the distributions of context length to mitigate the dis-
crepancy between NLI and biomedical QA. Furthermore, the order of sequential
transfer learning is important when fine-tuning BioBERT. Finally, when con-
verting the format of the BioASQ dataset to the SQuAD format, we measured
5
     https://github.com/dmis-lab/bioasq8b/tree/master/human-eval
the unanswerable rate of the extractive QA setting where an answer does not
exactly match the human annotated corpus.


References
 1. Williams, A., et al.: A broad-coverage challenge corpus for sentence understanding
    through inference. In: Proceedings of the 2018 Conference of the NAACL: Human
    Language Technologies, Volume 1 (Long Papers) (2018)
 2. Wiese, G., et al.: Neural domain adaptation for biomedical question answering. In:
    Proceedings of the 21st Conference on CoNLL (2017)
 3. Levesque, H., et al.: The winograd schema challenge. In: Thirteenth International
    Conference on the Principles of Knowledge Representation and Reasoning (2012)
 4. Phang, J., et al.: Sentence encoders on stilts: Supplementary training on interme-
    diate labeled-data tasks. arXiv preprint arXiv:1811.01088 (2018)
 5. Oita, M., et al.: Semantically corroborating neural attention for biomedical question
    answering. In: ECML PKDD (2019)
 6. Telukuntla, S.K., et al.: Uncc biomedical semantic question answering systems.
    bioasq: Task-7b, phase-b. In: ECML PKDD (2019)
 7. Hosein, S., et al.: Measuring domain portability and error propagation in biomedical
    qa. arXiv preprint arXiv:1909.09704 (2019)
 8. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jindi, D., Naumann, T., Mc-
    Dermott, M.: Publicly available clinical bert embeddings. In: Proceedings of the
    2nd Clinical Natural Language Processing Workshop (2019)
 9. Beltagy, I., Lo, K., Cohan, A.: Scibert: A pretrained language model for scientific
    text. In: Proceedings of the 2019 Conference on EMNLP-IJCNLP
10. Bowman, S., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for
    learning natural language inference. In: Proceedings of the 2015 Conference on
    EMNLP
11. Chen, S., Hou, Y., Cui, Y., Che, W., Liu, T., Yu, X.: Recall and learn: Fine-
    tuning deep pretrained language models with less forgetting. arXiv preprint
    arXiv:2004.12651 (2020)
12. Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.:
    Boolq: Exploring the surprising difficulty of natural yes/no questions. In: Pro-
    ceedings of the 2019 Conference of the NAACL: Human Language Technologies,
    Volume 1 (Long and Short Papers) (2019)
13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
    rectional transformers for language understanding. In: Proceedings of the 2019
    Conference of the NAACL: Human Language Technologies (2019)
14. Dimitriadis, D., Tsoumakas, G.: Yes/no question answering in bioasq 2019. In:
    ECML PKDD (2019)
15. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification.
    In: Proceedings of the 56th Annual Meeting of the ACL (Volume 1: Long Papers)
    (2018)
16. Jin, Q., Dhingra, B., Cohen, W., Lu, X.: Probing biomedical embeddings from
    language models. In: Proceedings of the 3rd Workshop on Evaluating Vector Space
    Representations for NLP (2019)
17. Kim, D., Lee, J., So, C.H., Jeon, H., Jeong, M., Choi, Y., Yoon, W., Sung, M.,
    Kang, J.: A neural named entity recognition and multi-type normalization tool for
    biomedical text mining. IEEE Access (2019)
18. Kim, N., Patel, R., Poliak, A., Xia, P., Wang, A., McCoy, T., Tenney, I., Ross,
    A., Linzen, T., Van Durme, B., et al.: Probing what different nlp tasks teach
    machines about function word comprehension. In: Proceedings of the Eighth Joint
    Conference on Lexical and Computational Semantics (* SEM 2019) (2019)
19. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: Albert: A
    lite bert for self-supervised learning of language representations. arXiv preprint
    arXiv:1909.11942 (2019)
20. Lee, J., Yoon, W., Kim, S., Kim, D., So, C., Kang, J.: Biobert: a pre-trained
    biomedical language representation model for biomedical text mining. Bioinfor-
    matics (Oxford, England) (2019)
21. Liu, N.F., Gardner, M., Belinkov, Y., Peters, M.E., Smith, N.A.: Linguistic knowl-
    edge and transferability of contextual representations. In: Proceedings of the 2019
    Conference of the NAACL: Human Language Technologies, Volume 1 (Long and
    Short Papers) (2019)
22. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with
    deep adaptation networks. arXiv preprint arXiv:1502.02791 (2015)
23. Min, S., Zhong, V., Socher, R., Xiong, C.: Efficient and robust question answering
    from minimal context over documents. In: Proceedings of the 56th Annual Meeting
    of the ACL (Volume 1: Long Papers) (2018)
24. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., Jin, Z.: How transferable
    are neural networks in nlp applications? In: Proceedings of the 2016 Conference
    on EMNLP
25. Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language process-
    ing: An evaluation of bert and elmo on ten benchmarking datasets. In: Proceedings
    of the 18th BioNLP Workshop and Shared Task (2019)
26. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer,
    L.: Deep contextualized word representations. In: Proceedings of the 2018 Con-
    ference of the NAACL: Human Language Technologies, Volume 1 (Long Papers)
    (2018)
27. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li,
    W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text
    transformer. arXiv preprint arXiv:1910.10683 (2019)
28. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: Squad: 100,000+ questions for
    machine comprehension of text. In: Proceedings of the 2016 Conference on EMNLP
29. Resta, M., Arioli, D., Fagnani, A., Attardi, G.: Transformer models for question
    answering at bioasq 2019. In: ECML PKDD (2019)
30. Romanov, A., Shivade, C.: Lessons from natural language inference in the clinical
    domain. In: Proceedings of the 2018 Conference on EMNLP
31. Ruder, S.: Neural transfer learning for natural language processing. Ph.D. thesis
    (2019)
32. Talmor, A., Berant, J.: Multiqa: An empirical investigation of generalization and
    transfer in reading comprehension. In: Proceedings of the 57th Annual Meeting of
    the ACL (2019)
33. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers,
    M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., et al.:
    An overview of the bioasq large-scale biomedical semantic indexing and question
    answering competition. BMC bioinformatics (2015)
34. Vu, T., Wang, T., Munkhdalai, T., Sordoni, A., Trischler, A., Mattarella-Micke,
    A., Maji, S., Iyyer, M.: Exploring and predicting transferability across nlp tasks
35. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: Glue: A multi-
    task benchmark and analysis platform for natural language understanding. In:
    Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Inter-
    preting Neural Networks for NLP (2018)
36. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for
    biomedical question answering. arXiv preprint arXiv:1909.08229 (2019)
37. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in
    deep neural networks? In: Advances in NIPS (2014)




                    MNLI                   Train             Dev
                    Original              392,702           9,815
                  SQuAD v1.1               Train             Dev
                    Original               87,412          10,570
                    Snippet                82,280           9,986
                  SQuAD v2.0               Train            Dev
                    Original              130,319          11,873
                   BioASQ                6B         7B         8B
           Type     Data Strategy    Train Test Train Test Train Test
         Yes/No Snippet-as-is        9,421 127 10,560 140 11,531 152
                 Full-Abstract    7,911     9,403     10,147
         Factoid Appended-Snippet 5,953 161 7,179 162 7,896 151
                 Snippet-as-is    3,512     4,231      4,759
                   Full-Abstract    14,008    15,719    16,879
           List    Appended-Snippet 10,878 81 12,184 88 13,251 75
                   Snippet-as-is     6,922     7,865    8,676

Table 8. Statistics of transferred dataset (MNLI & SQuAD) and target dataset
(BioASQ).




A    Training Details
We use BioBERT to learn biomedical entity representations. We utilize a single
NVIDIA Titan RTX (24GB) GPU to fine-tune each sequence of transfer learning.
For the MNLI task, we use the hyperparameters suggested by Hugging Face.6 For
fine-tuning, we select a batch size of 12 or 24, and the learning rate is within the
range 1e-6 to 9e-6. In post-processing, we use the abbreviation resolution module
Ab3P7 to remove duplicate answers that appear in a different form.


6
  https://github.com/huggingface/transformers/tree/master/examples/text-
  classification
7
  https://github.com/ncbi-nlp/Ab3P
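
A minimal sketch of the hyperparameter grid described in Appendix A; only the batch sizes and the learning rate range come from the text, and everything else (checkpoint name, training call) is an illustrative assumption.

```python
import itertools

batch_sizes = [12, 24]
learning_rates = [i * 1e-6 for i in range(1, 10)]    # 1e-6 to 9e-6

for batch_size, lr in itertools.product(batch_sizes, learning_rates):
    config = {
        "model_name": "dmis-lab/biobert-base-cased-v1.1",   # assumed checkpoint
        "per_device_train_batch_size": batch_size,
        "learning_rate": lr,
    }
    # fine_tune(config) would run one fine-tuning trial; shown as a placeholder,
    # since the actual training script is not part of this sketch.
    print(config)
```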
                     Yes/No (%)                    Factoid (%)            List (%)
 Model    Accuracy Yes F1 No F1 Macro F1 SAcc LAcc MRR Prec Recall                   F1
SQuAD      85.18     90.04 68.96      79.50    39.80 57.82 47.22 45.02 47.69 42.34
MNLI       88.57     92.12 77.98     85.05     38.80 61.34 47.42 47.86 46.89 43.33
SNLI       88.51    92.17 77.47       84.82    39.11 58.23 46.96 44.42 48.16 42.20
MedNLI     77.81     85.24 52.32      68.78    40.05 57.66 47.14 45.56 47.31 42.72

Table 9. Experiments with various NLI datasets evaluated on BioASQ 6B (Phase B).
Each NLI dataset is used as the first step of sequential transfer learning. The Yes/No
type models are fine-tuned as in Table 2, and the Factoid and List type models are
fine-tuned as in Table 3. The best score obtained in each task is in
bold.




                     Yes/No (%)                    Factoid (%)            List (%)
 Model    Accuracy Yes F1 No F1 Macro F1 SAcc LAcc MRR Prec Recall                   F1
SQuAD      85.95     89.90 73.44      81.67    41.95 58.30 48.66 61.32 52.83 52.36
MNLI       89.45    92.75 75.88      84.32     42.22 61.06 49.85 61.46 54.62 54.19
SNLI       85.40     90.11 66.95      78.53    41.84 60.03 49.31 56.20 48.07 47.70
MedNLI     78.67     85.38 49.20      67.29    41.45 60.55 49.05 58.40 48.17 48.25

Table 10. Experiments with various NLI datasets evaluated on BioASQ 7B (Phase B).
Each NLI dataset is used as the first step of sequential transfer learning. The Yes/No
type models are fine-tuned as in Table 2, and the Factoid and List type models are
fine-tuned as in Table 3. The best score obtained in each task is in
bold.