=Paper=
{{Paper
|id=Vol-2936/paper-19
|storemode=property
|title=Post-processing BioBERT And Using Voting Methods for Biomedical Question Answering
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-19.pdf
|volume=Vol-2936
|authors=Margarida M. Campos,Francisco M. Couto
|dblpUrl=https://dblp.org/rec/conf/clef/CamposC21
}}
==Post-processing BioBERT And Using Voting Methods for Biomedical Question Answering==
Margarida M. Campos¹, Francisco M. Couto¹

¹ LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal

Abstract

There have been remarkable advances in the field of Biomedical Question Answering (QA) through the application of Transfer Learning to overcome the scarcity of domain-specific corpora. Fine-tuning BioBERT on larger general-purpose datasets prior to fine-tuning on a specific biomedical task has been shown to significantly improve performance. There are, however, many post-processing techniques for the outputs of fine-tuned models still to be explored. In this paper we present the QA system developed by our team, LASIGE_ULISBOA, for Task B, Phase B of the 9th BioASQ challenge. Using the outputs of BioBERT fine-tuned on the Multi-Genre Natural Language Inference (MNLI) and Stanford Question Answering Dataset (SQuAD) corpora, we compare different post-processing strategies for retrieving predictions for Yes/No, Factoid, and List type questions. We show that applying Softmax at the proper location of the answer-retrieval pipeline leads to better performance and also increases the explainability of a prediction's confidence level in QA. We also present a method for applying voting-system algorithms to choose candidates for List type answers, show how they can increase the MacroF1 score, and show how they can be used to optimize for either Precision or Recall. The obtained results, averaged over batches, were 0.798 MacroF1 for Yes/No, 0.478 MRR for Factoid, and 0.466 F1 for List. The software used is available in an open-access repository.

CLEF 2021 - Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania
margarida.moreira.campos@gmail.com (M. M. Campos); fjcouto@edu.ulisboa.pt (F. M. Couto)

1. Introduction

BioASQ is an annual challenge that comprises different biomedical semantic indexing and question answering (QA) tasks. The presented system is a solution for Task B - Phase B, which consists of providing exact and ideal answers to questions, given related snippets. Biomedical QA is particularly challenging due to the highly domain-specific vocabulary and the limited availability of curated datasets. To minimize these limitations we used BioBERT [1] as our base model together with Transfer Learning, fine-tuning the base model on larger non-medical datasets prior to training on the task's training data. Task B contains three types of questions that require exact answers:

• Yes/No - binary answer
• Factoid - answer is a string
• List - answer is a list of strings, each identifying a different entity

Our approach is concerned only with the retrieval of exact answers, and therefore it was not designed to retrieve answers to Summary type questions or ideal answers (paragraph-sized summaries). For Factoid and List questions, predictions are always substrings of the provided passages (snippets), making the success of the preceding snippet-retrieval step paramount to obtaining good results.
Although the most significant advances in the area have been made by fine-tuning on different and larger datasets or by developing new and more complex Transformer architectures [2], we aim to show the importance of post-processing and of using proper final layers for each task. Considering that a tractable and meaningful measure of the confidence of a prediction is as important as the prediction itself, we also propose such a confidence level for Yes/No and Factoid questions. All the software used can be found at https://github.com/lasigeBioTM/BioASQ9B.

2. Related Work

Our baseline approach was inspired by the work done by the DMIS Laboratory (Korea University) for the previous edition of the BioASQ challenge [3].

2.1. BioBERT

The base model for our system is BioBERT, a BERT [4] model pre-trained on PubMed abstracts and PubMed Central (PMC) articles. BioBERT has obtained state-of-the-art results in several biomedical NLP tasks, including QA [1].

2.2. Sequential Transfer Learning

Substantial advances have been made in Natural Language Processing (NLP), especially in domain-specific tasks, through the use of Transfer Learning, i.e., reusing the model learnt for one task on a subsequent task [5]. The use of extra corpora for training is particularly important given the reduced size of the BioASQ dataset. Research has found that fine-tuning on the SQuAD dataset [6] improves the performance of QA systems where the correct answer is a segment of a provided passage. Another dataset that has proven important is the Multi-Genre Natural Language Inference (MNLI) corpus [7], which is widely used to improve Yes/No questions, but has also proven useful for Factoid and List question types, as shown by the DMIS Laboratory (DMIS) [3].

3. Methods

3.1. Data & Pre-Processing

MNLI. Training data consists of pairs of sentences, each classified with a label from {Entailment, Contradiction, Neutral}. The cardinality of each label set can be found in Table 1. Intuitively there is a mapping (MNLI ↔ BioASQ): Entailment ↔ Yes and Contradiction ↔ No.

Table 1: Statistics of MNLI training data. Number of paired sentences per class.
Relation | # Pairs
Entailment | 130,899
Contradiction | 130,903
Neutral | 130,900

Table 2: Statistics of BioASQ 8b training data, used in training. Number of unique questions (Q) and median number of related snippets (S) per question.
Question Type | #Q | Median #S/Q
Yes/No | 881 | 10
Factoid | 941 | 9
List | 644 | 11

Figure 1: Distribution of the number of associated snippets per question, in training data from task 8B.

This mapping could suggest that training without the Neutral pairs would improve performance; however, our experiments showed that our system's performance did not benefit from this strategy, hence the entire dataset was used.

SQuAD. Training data consists of {Question, Snippet} pairs together with the correct answer and its starting position. For training the QA model, the end position was identified and added as input.

BioASQ. Training of the systems was done using the BioASQ 8B training data, and evaluation was done on the BioASQ 8B test batches. In Table 2 we can see the number of questions in the BioASQ training data, and in Figure 1 the distribution of the number of snippets associated with a question. Examples of questions can be found in Table 3, and the number of train and test questions for each type of question can be found in Table 4. It is important to mention that only 177 (20%) of the Yes/No questions have the label No, making the classification extremely imbalanced. To handle this, undersampling [8] of the Yes class was performed, resulting in an even smaller set of 354 unique questions. Oversampling the No class proved to be ineffective.
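A minimal sketch of this balancing step is shown below, assuming each training question is a dict with an "exact_answer" field set to "yes" or "no" (as in the BioASQ JSON); the helper name and seed are illustrative only:

```python
import random

def undersample_yes(questions, seed=42):
    """Randomly drop Yes-labelled questions so both classes end up the same
    size (177 No + 177 Yes = 354 questions for the 8B training set)."""
    rng = random.Random(seed)
    yes = [q for q in questions if q["exact_answer"].lower() == "yes"]
    no = [q for q in questions if q["exact_answer"].lower() == "no"]
    rng.shuffle(yes)
    return no + yes[:len(no)]
```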
For list questions, each entity in the golden label was considered a correct answer for the given {Question, Snippet} pair, i.e. a pair whose golden list contains m entities will appear as m distinct input observations, each labeled with a different correct answer. A summary of the different types of inputs can be seen in Table 5. Both Factoid and List inputs were converted to the aforementioned SQuAD format, containing the answer's start and end positions. Observations whose snippets did not contain the correct answer were discarded. As with all BERT inputs, {Question, Snippet} pairs receive a [CLS] token at the beginning, used for classification, and a separation token ([SEP]) between the two input texts as well as at the end of the input. Additional biomedical datasets could have been curated for fine-tuning the system; however, this was not done due to time constraints.

Table 3: Examples of training questions, snippets and answers from BioASQ training data.
Type | Question | Snippet | Gold Answer
Yes/No | Is Baloxavir effective for influenza? | "Baloxavir marboxil is a selective inhibitor of influenza cap-dependent endonuclease. It has shown therapeutic activity in preclinical models of influenza A and B virus infections, including strains resistant to current antiviral agents." | yes
Factoid | Cemiplimab is used for treatment of which cancer? | "Cemiplimab is a PD-1 inhibitor that is approved for treatment of metastatic or locally advanced cutaneous squamous cell carcinoma." | cutaneous squamous cell carcinoma
List | Which organs are mostly affected in Systemic Lupus Erythematosus (SLE)? | "In systemic lupus erythematosus (SLE), brain and kidney are the most frequently affected organs. The heart is one of the most frequently affected organs in SLE. Any part of the heart can be affected, including the pericardium, myocardium, coronary arteries, valves, and the conduction system" | kidney, brain, heart, skin

Table 4: Number of questions and snippets in the training and test sets used for obtaining the reported experimental results.
Type of Question | Train Questions | Train Snippets | Test Questions | Test Snippets
Yes/No | 881 | 11,976 | 152 | 1,262
Factoid | 941 | 11,633 | 151 | 1,249
List | 644 | 8,836 | 75 | 662

3.2. Fine Tuning

For the fine-tuning of BioBERT, the best performing training sequences reported in [3] were used. For Yes/No questions the sequence is BioBERT-MNLI-BioASQ, whereas for Factoid and List type questions BioBERT-MNLI-SQuAD-BioASQ was used.

Table 5: Input form of each type of dataset used.
Dataset | Input
MNLI | {Sentence A, Sentence B, Label}
Yes/No | {Question, Snippet, Label}
SQuAD, Factoid, List | {Question, Context, Answer start, Answer end}

Figure 2: Simplified representation of BERT for QA.

For fine-tuning on the MNLI dataset we used a slightly altered version of the BertForSequenceClassification model from the Transformers [9] library, which consists of adding a linear layer that receives as input the hidden vector of the [CLS] token [4].
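As an illustration of this stage, the sketch below loads a BioBERT checkpoint into the standard (unaltered) BertForSequenceClassification with a 3-way head and encodes one sentence pair in the [CLS] ... [SEP] ... [SEP] format; the checkpoint name and the example sentences are assumptions, and the training loop and our extra layers are omitted:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Any BioBERT checkpoint in Hugging Face format would do; this name is an assumption.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = BertTokenizer.from_pretrained(checkpoint)

# 3 labels for MNLI: Entailment, Contradiction, Neutral. The classification head
# is a linear layer over the final hidden vector of the [CLS] token.
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Encoded as: [CLS] sentence A tokens [SEP] sentence B tokens [SEP]
inputs = tokenizer(
    "Baloxavir marboxil is a selective inhibitor of influenza cap-dependent endonuclease.",
    "Baloxavir is effective for influenza.",
    return_tensors="pt",
    truncation=True,
)
logits = model(**inputs).logits  # shape (1, 3), one score per MNLI class
```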
To train the binary classification for BioASQ Yes/No questions, experiments were made on top of the final 3-neuron MNLI layer of BertForSequenceClassification, testing the addition of:

• one extra binary layer ([CLS]-3-2)
• a fully-connected 256-neuron layer followed by a binary one ([CLS]-3-256-2)
• a fully-connected 512-neuron layer followed by a binary one ([CLS]-3-512-2)
• replacing the previous MNLI 3-neuron layer with a binary one ([CLS]-2)

For training on the SQuAD corpus, the final classification layers are removed and the architecture of the BertForQuestionAnswering model from the Transformers library is used. A simplified overview of the Input/Output of BertForQuestionAnswering can be seen in Figure 2. In QA, the provided input contains the start and end positions of the tokens delimiting the span of the correct answer within the passage. Training is done by creating two new vectors, the start logits and end logits of shape (input_length, 1), that represent the likelihood of each token being the start and the end of the answer, respectively.

3.3. Post-Processing and Output Aggregation

Given that the same question can have multiple snippets associated with it, leading to different {Question, Snippet} pairs as input, a strategy is needed to combine the different outputs into single predictions. Each type of question demands a different approach, hence they are presented separately.

Figure 3: Example of the output Start and End logits vectors for a snippet. Highlighted cells represent the allowed maximum-score pairs, identically highlighted in Figure 4.

3.3.1. Yes/No

Let $p_{ij}^{\text{yes}}$ and $p_{ij}^{\text{no}}$ represent the model's output probability of question $i$, given snippet $j$, having answer Yes and No, respectively. The predicted answer will be the one with the highest mean probability over the $J$ snippets associated with question $i$, and $P_i$ represents the level of confidence that the provided answer is correct:

$$p_i^{\text{yes}} = \frac{1}{J}\sum_{j=1}^{J} p_{ij}^{\text{yes}}, \qquad p_i^{\text{no}} = \frac{1}{J}\sum_{j=1}^{J} p_{ij}^{\text{no}}$$

$$P_i = \begin{cases} P(\text{Yes}) = p_i^{\text{yes}}, & \text{if } p_i^{\text{yes}} \ge p_i^{\text{no}} \\ P(\text{No}) = p_i^{\text{no}}, & \text{otherwise} \end{cases}$$

3.3.2. Factoid

The relevant outputs from the fine-tuned QA model are the start and end logits vectors. Figure 3 shows an output example from the BioASQ golden set of Batch 1 of task 8B. Let $s^l_{ij}$ and $e^l_{ij}$ be the start and end logit values corresponding to token $T^l_{ij}$, the $l$-th token of the $j$-th snippet associated with question $i$, and let $Pred_{ij}$ and $Prob_{ij}$ be the lists of predictions and associated confidence levels for the same input. In order to choose the best prediction for each input, one should find the span $(a, b)$ that maximizes some combination of $s^a_{ij}$ and $e^b_{ij}$. Given that the logits are not normalized, merely using the sum of start and end logits would result in an unfounded comparison between confidence levels for predictions of different snippets. To minimize this discrepancy, our approach for each input was implemented as follows:

1. Create an upper triangular matrix $M$ where $M_{p,q} = s^p_{ij} + e^q_{ij}$ for $q \ge p$ (see Figure 4), guaranteeing the end does not precede the start
2. Choose the positions $a$ and $b$ that maximize $M_{p,q}$
3. If the expression spanning from $T^a_{ij}$ to $T^b_{ij}$ satisfies the admission rules, append the expression to $Pred_{ij}$ and $M_{a,b}$ to $Prob_{ij}$
4. Remove entry $M_{a,b}$ from $M$
5. Repeat steps 2 to 4 until the lists have length $k$, where $k$ is a hyperparameter chosen by the user
6. Apply the softmax function to the vector $Prob_{ij}$ of length $k$

A code sketch of this procedure is given after Figure 4.

Figure 4: Matrix $M$ where each entry represents the sum of the $i$-th position of the start logits vector and the $j$-th position of the end logits vector. Highlighted in red are scores that are not eligible: the end position must be equal to or greater than the start position, and the end-position token must not be part of a split word, identified by the characters ## (see Figure 3).
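A minimal sketch of steps 1 to 6 for a single snippet follows, assuming the start and end logits are available as NumPy arrays; the admission rules are reduced here to a maximum span length (the sub-word ## check is omitted), and the function name and max_answer_len parameter are illustrative:

```python
import numpy as np

def top_k_candidates(start_logits, end_logits, k=5, max_answer_len=30):
    """Steps 1-6: build the upper-triangular score matrix M, collect the k best
    admissible spans, and softmax their raw scores into confidence levels."""
    n = len(start_logits)
    M = start_logits[:, None] + end_logits[None, :]   # M[p, q] = s^p + e^q
    M[np.tril_indices(n, k=-1)] = -np.inf              # end must not precede start
    flat = M.ravel()
    spans, scores = [], []
    for idx in np.argsort(flat)[::-1]:                 # best remaining entry first
        if len(spans) == k or not np.isfinite(flat[idx]):
            break
        p, q = divmod(int(idx), n)
        if q - p + 1 > max_answer_len:                 # simplified admission rule
            continue
        spans.append((p, q))
        scores.append(float(flat[idx]))
    probs = np.exp(np.array(scores) - max(scores))     # softmax over the k candidates only
    probs /= probs.sum()
    return spans, probs
```

Because the softmax is taken only over the k candidates kept for a snippet, the resulting scores are probabilities that can be compared across snippets when the per-question top 5 are selected, as described next.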
To select the top 5 predictions for question $i$, we simply select the 5 expressions from the concatenation of the $J$ vectors $Pred_{ij}$ with the 5 highest corresponding values in the concatenated $Prob_{ij}$.

3.3.3. List

Potential answers for list questions are retrieved using the same method as for Factoid questions. The process, however, requires some extra processing steps, given that for list questions different entities need to be discriminated. To select the best list of candidates, we used voting systems, treating each distinct obtained answer as a candidate and its frequency among the answers as votes. The Single Transferable Vote (STV) and Preferential Block Voting (PBV) systems were tested, with STV having the best performance. Elections are performed in rounds; in each round candidates are categorized into states: Elected, if the candidate has already won; Rejected, if the candidate is already unable to win; Hopeful, if the candidate has neither won nor yet been discarded.

Candidates for answers are obtained by splitting the predictions by the usual separator characters and words (e.g. ',', ';', 'and', 'or'). We tested doing the splitting after the voting, treating full answers as candidates for the STV (STV-PostProcess), and doing the splitting before the voting, where separate distinct entities are treated as candidates and the score in the ranked ballot is the average score of all the answers that contain that entity (STV-PreProcess); a sketch of this splitting step is given at the end of this subsection. An example of ranked candidates before and after being processed can be seen in Tables 6 and 7. E.g., the score of the candidate "dizziness" is the average of the scores of the answers that contain it: 0.21, 0.20 and 0.18 (1st, 2nd and 5th entries of Table 6). Each snippet contributes to the voting with a ballot of ranked candidates, which then enters the voting algorithm.

Table 6: Example of ranked list candidate answers from one snippet and respective scores.
Predictions | Scores
dizziness | 0.21
dizziness, orthostatic hypotension | 0.20
orthostatic hypotension | 0.20
hallucination | 0.19
hallucination, dizziness | 0.18

Table 7: Example of ranked list candidate answers after being split by separators, considering the average score of all answers that contain each entity.
Predictions | Scores
orthostatic hypotension | 0.20
dizziness | 0.197
hallucination | 0.185

A potential handicap of using voting-system algorithms for answer selection is the need to predefine the number of elected entities, since in an election the number of winners is established beforehand. This is not ideal, since the correct number of answers for a given list question is not known in advance. Two characteristics of the implemented algorithms allow us to minimize this problem:

• If the number of non-rejected candidates is smaller than the number of requested winners, they are all elected
• If there are ties in the election, all tied candidates are elected, even if this means electing a larger number of candidates

Although these factors allow for some flexibility in the number of predictions, a more flexible approach can be used. Since elections are performed in rounds, one can define the selected answers as the ones that are not rejected in the pre-final round, i.e., all candidates with states in {Hopeful, Elected}. When referencing this approach we call the number of candidates Hopeful.
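The splitting and score-averaging step used for STV-PreProcess can be sketched as follows; the separator set, the use of a plain regular expression and the function name are illustrative assumptions, and the example reproduces the data of Tables 6 and 7:

```python
import re
from collections import defaultdict

SEPARATORS = re.compile(r",|;|\band\b|\bor\b")  # assumed separator characters/words

def split_ballot(predictions, scores):
    """Split each predicted answer into entities and score every entity with the
    average score of all answers containing it (cf. Tables 6 and 7)."""
    entity_scores = defaultdict(list)
    for answer, score in zip(predictions, scores):
        for entity in SEPARATORS.split(answer):
            entity = entity.strip()
            if entity:
                entity_scores[entity].append(score)
    averaged = {e: sum(s) / len(s) for e, s in entity_scores.items()}
    # Ranked ballot: entities ordered by descending average score
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Reproduces the Table 6 -> Table 7 example (Table 7 shows the rounded averages)
print(split_ballot(
    ["dizziness", "dizziness, orthostatic hypotension", "orthostatic hypotension",
     "hallucination", "hallucination, dizziness"],
    [0.21, 0.20, 0.20, 0.19, 0.18]))
```

Each snippet's resulting ranked (entity, score) list is then used as one ballot in the STV rounds.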
3.4. Software

Our team tried to replicate the results of 4 state-of-the-art systems and found some reproducibility issues. Causes included outdated package versions and compatibility issues due to the use of conflicting code libraries, such as using both TensorFlow and PyTorch for different stages of the pipeline. To avoid the aforementioned issues our implementation was done in a modular fashion, built in Python 3.6 [10], using the PyTorch [11] versions of the model implementations from the Transformers [9] library as the main structure. In spite of its fully PyTorch architecture, the system accepts TensorFlow [12] checkpoints (a model's saved parameters) as input. Fine-tuning was performed using parallelization over 6 GPUs (Tesla M10) with 8 GB of memory each, for a total batch size of 18 (3 samples per GPU). A summary of the training details of the reported results can be found in Table 8.

Table 8: Summary of the hyperparameters used for the fine-tuning that led to the reported results.
Epochs | 3
Batch Size | 18
Optimizer | Adam
Learning Rate | 5 × 10⁻⁵

3.5. Metrics

For the evaluation and comparison of the different models, the official BioASQ performance measures were used [13]. For Yes/No questions the official metric is MacroF1, the mean of the F1 scores of the Yes and No classes. Accuracy is also calculated for completeness. Factoid questions are evaluated using the Mean Reciprocal Rank (MRR); Strict Accuracy (SAcc) and Lenient Accuracy (LAcc) are also calculated. List questions are evaluated by the average F1 score over all questions, with the mean Precision and Recall also reported.

4. Results

4.1. Experimental Results

In this section we present the experimental results of the described approaches. Training was done on the task 8B training set and evaluation on the aggregation of all 8B Phase B batches. Results are compared with the average (weighted by number of questions) of all DMIS systems' results over the 5 batches.

4.1.1. Yes/No

In Table 9 we can see the results of the different classification architectures for the Yes/No question type. Results are significantly better with the extra fully connected layer before the final binary one. Experiments showed that performance differs only slightly with the number of neurons in the middle layer when it lies between 128 and 512. The highest MacroF1 was obtained with 256 neurons ([CLS]-3-256-2).

Table 9: Experimental results of Yes/No models, trained on the training data for task 8B and evaluated on the task's 5 test batches. The [CLS]-3-2 architecture adds a binary layer on top of the MNLI classification model; [CLS]-3-256-2 and [CLS]-3-512-2 add a fully connected layer of 256 and 512 neurons, respectively, before the extra binary layer. DMIS represents the average scores of the DMIS Lab systems.
Architecture | Acc | F1no | F1yes | MacroF1
[CLS]-3-2 | 0.7039 | 0.4828 | 0.7926 | 0.6377
[CLS]-3-256-2 | 0.7434 | 0.6286 | 0.8040 | 0.7163
[CLS]-3-512-2 | 0.7368 | 0.5745 | 0.8095 | 0.6920
DMIS | 0.8513 | 0.8071 | 0.8733 | 0.8402

4.1.2. Factoid

For Factoid questions, performance increased substantially with the use of the k-candidates approach. The results can be seen in Table 10. The best results were obtained with k = 2 candidate answers per snippet. It is interesting to point out that for k > 4 the results barely differ.
This is because candidates ranked lower than fourth typically have extremely low scores and end up with probabilities close to 0, and are therefore discarded when the top 5 predictions are extracted.

Table 10: Experimental results of Factoid models, trained on the training data for task 8B and evaluated on the task's 5 test batches. Start + End represents the classic approach of applying Softmax to both the Start and End logits prior to finding the score-maximizing answers. Top k represents the approach of applying Softmax to the k selected candidates, for different values of k (2, 5, 10). DMIS represents the average scores of the DMIS Lab systems.
Strategy | k | SAcc | LAcc | MRR
Start + End | - | 0.1060 | 0.2119 | 0.1485
Top k | 2 | 0.3179 | 0.5232 | 0.3991
Top k | 5 | 0.2195 | 0.5121 | 0.3390
Top k | 10 | 0.2195 | 0.5121 | 0.3390
DMIS | - | 0.3603 | 0.5656 | 0.44

4.1.3. List

In Table 11 we can see the results of the experiments with List questions, and in particular the impact of requesting different numbers of winners from the algorithm. Unsurprisingly, a larger number of winners leads to an increase in Recall and a decrease in Precision. The best overall performance (MacroF1) is obtained with the Hopeful strategy combined with pre-voting splitting. Results show that splitting candidates prior to the voting leads to better results.

Table 11: Experimental results of List models, trained on the training data for task 8B and evaluated on the task's 5 test batches. STV-PostProcess and STV-PreProcess refer to the strategies of using the Single Transferable Vote algorithm on answer candidates split by separators after and before the voting, respectively. Elected represents the number of winners requested from the voting, with Hopeful representing the strategy of selecting the non-rejected candidates as winners. DMIS represents the average scores of the DMIS Lab systems.
Method | Elected | Precision | Recall | MacroF1
STV-PostProcess | 2 | 0.5111 | 0.3437 | 0.3921
STV-PostProcess | 5 | 0.3906 | 0.4766 | 0.4115
STV-PostProcess | Hopeful | 0.4944 | 0.3289 | 0.3751
STV-PreProcess | 2 | 0.4527 | 0.3306 | 0.3706
STV-PreProcess | 5 | 0.4333 | 0.4167 | 0.4107
STV-PreProcess | Hopeful | 0.5 | 0.4167 | 0.4524
DMIS | - | 0.4761 | 0.4206 | 0.3940

4.2. BioASQ Official Results

A summary of the official results from BioASQ Task 9B - Phase B can be seen in Table 12, where we present the results of the top teams along with ours (LASIGE), considering the BioASQ ordering. The place in each batch is considered to be the place of the best-scoring system for each team, considering all systems of each team as one.

4.3. Unanswerability

An important aspect to consider when evaluating performance is the unanswerability of some questions in the dataset. Several questions in the test sets have an answer that cannot be extracted from the provided snippets. Ideally, to measure the actual performance of answer-extraction systems, these questions would be removed from the test set. Examples of such questions can be seen in Table 13. For the test set of task 8B (resulting from the aggregation of the 5 test batches), 22.5% of Factoid questions do not contain the golden answer in any provided snippet, and 25.3% of List questions have at least one entity that is not contained in the snippets. For Yes/No questions, unanswerability would have to be assessed manually.

5. Discussion

5.1. Analysis of Results

All reported results were obtained using BioBERT Base (12 stacked encoder layers). Although we tested BioBERT Large (24 stacked encoder layers), which usually obtains better results, the results were very poor. This is probably due to memory restrictions.
Since BioBERT Large has over three times more trainable parameters, reductions in input size and batch size had to be made, which are probably the cause of the low performance.

5.1.1. Yes/No

The addition of a fully connected layer between the MNLI classification layer (3 neurons) and the BioASQ binary classification layer improved performance on the test set. This indicates that the relation between the knowledge obtained from the NLI data and the knowledge needed for the BioASQ questions is not as direct as one might expect. This is not uncommon when dealing with corpora from different domains (general purpose vs. biomedical), and might also be related to the existence of unanswerable questions in the dataset, which hinder the model's learning of what represents agreement between question and snippet, since inputs with no relation introduce noise into the binary task.

In Figure 5 we can see the distribution of the confidence levels for Yes (P(Yes)) and No (P(No)) predictions, compared against the actual correct answer. Note that the model's discriminatory power (the distance between P(Yes) and P(No)) is much greater for answers with the Yes label. This can also be seen in the differences between the F1 scores of the two classes, noting that F1yes is much higher than F1no across experiments. This is not surprising in NLI, as it is easier to identify entailment than it is to distinguish between contradiction and neutral relations. Entailment is usually distinctly expressed in the passage, whilst contradiction sometimes needs to be inferred from more complicated relations between sentences.

Table 12: Summary of the official results from BioASQ Task 9B - Phase B. The place in each batch is considered to be the place of the best-scoring system for each team, considering all systems of each team as one. Reported scores are taken from the overall best-ranked system of each team.
Batch | System | Place | Yes/No MacroF1 | Factoid MRR | List F1
1 | DMIS | 1 | 0.9258 | 0.3856 | 0.4143
1 | Ir_Sys | 2 | 0.8183 | 0.4149 | 0.2800
1 | LASIGE | 3 | 0.7699 | 0.3506 | 0.4860
2 | LASIGE | 1 | 0.9454 | 0.5539 | 0.4818
2 | bio-answerfinder | 2 | 0.8952 | 0.5000 | 0.4571
2 | DMIS | 3 | 0.8854 | 0.5294 | 0.4554
3 | lalala | 1 | 0.9532 | 0.5347 | 0.5887
3 | bio-answerfinder | 2 | 0.9023 | 0.5811 | 0.4209
3 | LASIGE | 7 | 0.7292 | 0.4919 | 0.4918
4 | DMIS | 1 | 0.9480 | 0.5310 | 0.7061
4 | Ir_sys | 2 | 0.9480 | 0.6929 | 0.6312
4 | LASIGE | 5 | 0.7807 | 0.5577 | 0.5872
5 | DMIS | 1 | 0.8246 | 0.4722 | 0.3561
5 | MDS_UNCC | 2 | 0.7841 | 0.5204 | 0.2678
5 | LASIGE | 4 | 0.7564 | 0.4546 | 0.2798

Table 13: Examples of unanswerable questions in the 8B - Phase B test set. Answers represent the golden label provided by BioASQ.
Question | Snippet | Answer
Which biological process takes place in nuclear speckles? | "here we demonstrate that mrnas containing alrex-promoting elements are trafficked through nuclear speckles" | mrna processing
Can LB-100 downregulate miR-33? | "PP2A inhibition from LB100 therapy enhances daunorubicin cytotoxicity in secondary acute myeloid leukemia via miR-181b-1 upregulation." | No

Figure 5: Confidence scores for Yes/No predictions, split by the correct golden label.

5.1.2. Factoid

Looking at the experimental results (Table 10) we can see that sorting predictions using scores obtained by applying Softmax to the k predictions for each snippet strongly improved all metrics. Moreover, we can assess the fitness of the scores by analysing Figure 6, where we compare the distribution of confidence levels for predictions when the answer was in fact correct or not.
We can see that for the classic approach there is an almost complete overlap of the scores of incorrect and correct answers, which implies the scoring is not informative. Although there is still some expected overlap in the k-candidates approach, one can distinctly see a higher level of confidence for correct answers, indicating the validity of the proposed score as a confidence-level metric.

Figure 6: Distribution of prediction confidence scores for Factoid questions of the Task 8B - Phase B test set. (a) Scores from Softmax(Start Logits) plus Softmax(End Logits); (b) Softmax(k top predictions).

5.1.3. List

Using voting systems for the selection of list answers proved to be effective, and we can see in Table 12 that the proposed system obtained overall strong results for List type questions, with the exception of Batch 5. By using the Hopeful approach, one has flexibility in the number of entities that are selected, and in fact this approach yields the best overall MacroF1 score in our experiments. With the application of the voting systems, as opposed to using a predefined threshold for answer selection, we make use not only of the confidence level of each answer but also of the frequency of the answer and its relative certainty amongst other answers from the same input.

6. Conclusion

In this paper we used transfer learning to fine-tune BioBERT on general-purpose datasets (MNLI and SQuAD) prior to fine-tuning on the BioASQ dataset. We showed how the post-processing of the model outputs greatly impacts performance, revealing that applying Softmax to the output scores of only the k selected candidates to obtain the predictions' confidence levels improves overall performance and makes the scores more meaningful. We also showed that using the Single Transferable Vote system for electing answer candidates for List questions obtains promising results, outperforming the previous approach of selecting candidates merely based on a defined threshold. To increase the current model's performance in the future, one could enrich the transfer-learning sequences with additional biomedical-domain corpora, or train the current system using BioBERT Large on GPUs with more memory, keeping the same learning parameters (input size, learning rate and batch size). Another possibility is to adapt the BERT architecture to allow training the start and end logits jointly, i.e., train the QA model to find the exact span of the answer within the text, conditioning the end of the answer on its start, instead of training them separately and doing the conditioning in the post-processing phase.

Acknowledgments

This work was supported by FCT through the DeST: Deep Semantic Tagger project, ref. PTDC/CCI-BIO/28685/2017, and the LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020. We would like to thank Doctor Maria Fernandes from the University of Luxembourg, who provided us access to larger GPUs for running experiments, for all her help and support.

References

[1] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics (2019). URL: http://dx.doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[3] M. Jeong, M. Sung, G. Kim, D. Kim, W. Yoon, J. Yoo, J. Kang, Transferability of natural language inference to biomedical question answering, 2021. arXiv:2007.00217.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[5] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, 2014. arXiv:1411.1792.
[6] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, 2016. arXiv:1606.05250.
[7] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, 2018. arXiv:1704.05426.
[8] S. Dendamrongvit, M. Kubat, Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains, 2009, pp. 40–52. doi:10.1007/978-3-642-14640-4_4.
[9] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[10] Python Core Team, Python: A dynamic, open source programming language, Python Software Foundation, Vienna, Austria, 2016. URL: https://www.python.org/.
[11] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[13] A. Nentidis, A. Krithara, K. Bougiatiotis, M. Krallinger, C. Rodriguez-Penagos, M. Villegas, G. Paliouras, Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_16.