<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-processing BioBERT And Using Voting Methods for Biomedical Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Margarida M. Campos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <addr-line>1749-016 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
<p>There have been remarkable advances in the field of Biomedical Question Answering (QA) through the application of Transfer Learning to overcome the scarcity of domain-specific corpora. Fine-tuning BioBERT on larger general purpose datasets prior to fine-tuning on a specific biomedical task has proven to significantly improve performance. There are, however, many post-processing techniques for the outputs of fine-tuned models still to be explored. In this paper we present the QA system developed by our team - LASIGE_ULISBOA - for the BioASQ 9th challenge, Task B, Phase B. Using the outputs from fine-tuning BioBERT on both the Multi-Genre Natural Language Inference (MNLI) and the Stanford Question Answering Dataset (SQuAD) datasets, we compare different post-processing strategies for prediction retrieval for Yes/No, Factoid, and List type questions. We show that applying Softmax in the proper location of the answer retrieval pipeline leads to better performance and also increases the explainability of a prediction's confidence level in QA. We also present a method for applying voting system algorithms to choose candidates for List type answers, show how they can increase the MacroF1 score, and how one can use them to optimize for either Precision or Recall. The obtained results, averaged over batches, were 0.798 MacroF1 for Yes/No, 0.478 MRR for Factoid, and 0.466 F1 for List questions. The software used is available in an open access repository.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• Yes/No - binary answer
• Factoid - answer is a string
• List - answer is a list of strings, each identifying a different entity
Our approach is concerned only with the retrieval of exact answers, and therefore it was not
designed to retrieve answers to Summary type questions or ideal answers (paragraph-sized
summaries).</p>
      <p>For factoid and list questions, predictions are always substrings of the provided passages
(snippets), making the success of the previous task of snippet retrieval paramount to obtain
good results.</p>
      <p>
Although the most significant advances in the area have been made by fine-tuning on different
and bigger datasets or by developing new and complex transformer architectures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we
aim to show the importance of post-processing and the use of proper final layers for each task.
      </p>
<p>Considering that a tractable and meaningful measure of the level of confidence of a prediction
is as important as the prediction itself, we also present a proposal for such a confidence level for
Yes/No and Factoid questions.</p>
<p>All the software used can be found at https://github.com/lasigeBioTM/BioASQ9B.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. BioBERT</title>
        <p>
Our baseline approach was inspired by the work done by the DMIS Laboratory (Korea University)
for the previous edition of the BioASQ challenge [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          The base model for our system is BioBERT, a BERT[
          <xref ref-type="bibr" rid="ref4">4</xref>
] model, pre-trained using PubMed
abstracts and PubMed Central (PMC) articles. BioBERT has obtained state-of-the-art results in
several biomedical NLP tasks, including QA [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sequential Transfer Learning</title>
        <p>
Substantial advances have been made in Natural Language Processing (NLP), especially in
domain-specific tasks, with the use of Transfer Learning - using the model learnt on one task
for a subsequent task [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. The use of extra corpora for training is particularly important given
the reduced size of the BioASQ dataset. Research has found that fine-tuning on the SQuAD
dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] improves the performance of QA systems where the correct answer is a segment
of a provided passage. Another dataset that has proven important is the Multi-Genre Natural
Language Inference (MNLI)[
          <xref ref-type="bibr" rid="ref7">7</xref>
], which is widely used to improve performance on Yes/No questions,
but has also proven to be useful for factoid and list question types, as was shown by the DMIS
Laboratory (DMIS)[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Data &amp; Pre-Processing</title>
<p>MNLI Training data consists of pairs of sentences, each classified with a label from {entailment,
neutral, contradiction}. The cardinality of each label set can be found in Table 1. Intuitively
there is a mapping (MNLI ↔ BioASQ): entailment ↔ Yes and contradiction ↔ No.
This could suggest that training without the neutral pairs could improve performance;
however, our experiments showed that our system's performance did not benefit from this strategy,
hence the entire dataset was used.</p>
<p>SQuAD Training data consists of {question, passage} pairs together with the correct answer
and its starting position. For training the QA model, the end position was identified and added as
input.</p>
        <p>
          BioASQ Training of the systems was done using BioASQ 8B training data, and evaluation was
done on BioASQ 8B test batches. In Table 2 we can see the number of questions in the BioASQ
training data, and in Figure 1 we can see the distribution of the number of snippets associated
with a question. Examples of questions can be found in Table 3, and the number of train and test
questions for each type of question can be found in Table 4. It is important to mention that
only 177 (20%) of the Yes/No questions have the label No, making the classification extremely
imbalanced. To handle this, undersampling[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] of the Yes class was performed, resulting in an
even smaller set of 354 unique questions. Oversampling the No class proved to be ineffective.
        </p>
<p>For list questions, each entity in the golden label was considered a correct answer for the
given {question, snippet} pair, i.e. a pair whose golden list contains $n$ entities will appear
as $n$ distinct input observations, each labeled with a different correct answer. A summary of
the different types of inputs can be seen in Table 5.</p>
        <p>Both factoid and list inputs were converted to the mentioned SQuAD format - containing
the answer’s start and end positions. Observations whose snippets did not contain the correct
answer were discarded.</p>
<p>As with all BERT inputs, {question, snippet} pairs are prepended with a [CLS] token
- for classification - and a separation token ([SEP]) is added between the two input texts, as
well as at the end of the input.</p>
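        <p>As an illustration, the following is a minimal Python sketch of how such a pair is encoded; the checkpoint name is an assumption (any BERT-family tokenizer behaves the same way):</p>
        <preformat>
from transformers import BertTokenizer

# assumed BioBERT checkpoint name; used here only for illustration
tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

question = "Is dizziness a side effect of rivaroxaban?"
snippet = "Common adverse reactions of rivaroxaban include dizziness and headache."

encoding = tokenizer(question, snippet)
# yields: [CLS] question tokens [SEP] snippet tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
        </preformat>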
<p>Additional biomedical datasets could have been curated for fine-tuning the system;
however, this was not done due to time constraints.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fine Tuning</title>
        <p>
          For the fine-tuning of BioBERT the best performing sequences of training reported in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] were
used. For Yes/No questions the sequence is BioBERT-MNLI-BioASQ, while for factoid and list
type questions BioBERT-MNLI-SQuAD-BioASQ was used.
        </p>
        <p>
          For fine-tuning on the MNLI dataset we used a slightly altered version of the
BertForSequenceClassification model from the Transformers [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] library, which consists of adding a linear layer that
receives as input the hidden vector of the [CLS] token [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. To train on the binary classification
of BioASQ Yes/No questions, the following variants of the layers after the final 3-neuron
layer of BertForSequenceClassification were tested (a sketch of the best-performing variant is given after the list):
• one extra binary layer ([CLS]-3-2)
• a fully-connected 256 neuron layer followed by a binary one ([CLS]-3-256-2)
• a fully-connected 512 neuron layer followed by a binary one ([CLS]-3-512-2)
• replacing the MNLI 3 neuron layer with a binary one ([CLS]-2)
        </p>
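        <p>A minimal PyTorch sketch of the best-performing head ([CLS]-3-256-2) follows; the layer names and the choice of activation function are our assumptions, not the exact implementation:</p>
        <preformat>
import torch.nn as nn

class YesNoHead(nn.Module):
    """[CLS]-3-256-2: the 3-neuron layer kept from MNLI fine-tuning,
    followed by a fully-connected 256-neuron layer and a binary layer."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mnli = nn.Linear(hidden_size, 3)  # layer inherited from MNLI
        self.middle = nn.Linear(3, 256)
        self.binary = nn.Linear(256, 2)
        self.act = nn.Tanh()  # assumed activation

    def forward(self, cls_hidden):
        # cls_hidden: hidden vector of the [CLS] token, shape (batch, hidden_size)
        x = self.act(self.mnli(cls_hidden))
        x = self.act(self.middle(x))
        return self.binary(x)  # Yes/No logits
        </preformat>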
<p>For training on the SQuAD corpus, the final classification layers are removed and the
architecture of the BertForQuestionAnswering model from the Transformers library is used. A simplified
overview of the Input/Output of BertForQuestionAnswering can be seen in Figure 2. In QA the
input provided contains the start and end positions of the tokens representing the span of the
correct answer within the passage. Training is done by creating two new vectors - the start logits
and the end logits, of shape (sequence_length, 1) - that represent the likelihood of each token being the
start and the end of the answer, respectively.</p>
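        <p>The following sketch shows how the two logits vectors are obtained; the generic checkpoint name is a placeholder, since in our pipeline the weights come from the BioBERT-MNLI-SQuAD sequence:</p>
        <preformat>
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# placeholder checkpoint; any BERT-family QA model exposes the same outputs
model = BertForQuestionAnswering.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer("What drug causes dizziness?",
                   "Rivaroxaban may cause dizziness.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one logit per input token: likelihood of being the start/end of the answer
start_logits = outputs.start_logits  # shape (1, sequence_length)
end_logits = outputs.end_logits      # shape (1, sequence_length)
        </preformat>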
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Post-Processing and Output Aggregation</title>
<p>Given that the same question can have multiple snippets associated with it, leading to different
{question, snippet} pairs as input, a strategy is needed to combine the different outputs
into single predictions. Each type of question demands a different approach, hence they are
presented separately.</p>
<p>3.3.1. Yes/No
Let $y_s$ and $n_s$ represent the model's output probabilities that question $q$ has answer Yes and No,
respectively, given snippet $s$. The predicted answer is the one with the highest mean
probability over the $S$ snippets associated with question $q$:
$$P_{yes} = \frac{1}{S}\sum_{s=1}^{S} y_s, \qquad P_{no} = \frac{1}{S}\sum_{s=1}^{S} n_s$$
The larger of the two, $\max(P_{yes}, P_{no})$, represents the level of confidence
that the provided answer is correct.</p>
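        <p>A minimal sketch of this aggregation (the probability values are illustrative):</p>
        <preformat>
import numpy as np

# per-snippet output probabilities for one question (illustrative values)
yes_probs = np.array([0.91, 0.78, 0.85])  # y_s for each of the S snippets
no_probs = 1.0 - yes_probs                # n_s

p_yes, p_no = yes_probs.mean(), no_probs.mean()
answer = "yes" if p_yes >= p_no else "no"
confidence = max(p_yes, p_no)  # level of confidence in the prediction
        </preformat>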
<p>3.3.2. Factoid
The relevant outputs from the fine-tuned QA model are the start and end logits vectors. In
Figure 3 an output example from the BioASQ golden set from Batch 1 of task 8B can be seen.</p>
<p>Let $s_i$ and $e_i$ be the start and end logit values corresponding to token $i$, the $i$th token of the
$n$th snippet associated with question $q$, and let $preds_n$ and $scores_n$ be the lists of predictions and
associated confidence levels for the same input.</p>
        <p>In order to choose the best prediction for each input, one should find the span $(i, j)$ that
maximizes some combination of $s_i$ and $e_j$. Given that the logits are not normalized, merely using
the sum of start and end logits would result in an unfounded comparison between confidence
levels for predictions of different snippets.</p>
<p>To minimize this discrepancy, our approach for each input was implemented as follows (a sketch is given after the list):
1. Create an upper triangular matrix $M$ where $M_{i,j} = s_i + e_j$ (see Figure 4), for $j \ge i$,
guaranteeing the end does not precede the start
2. Choose the positions $i$ and $j$ that maximize $M_{i,j}$
3. If the expression resulting from the span from $i$ to $j$ satisfies the admission rules, append the
expression to $preds_n$ and $M_{i,j}$ to $scores_n$
4. Remove entry $M_{i,j}$ from $M$
5. Repeat steps 2 to 4 until the lists have length $k$, where $k$ is a hyperparameter chosen by the
user
6. Apply the softmax function to the vector $scores_n$ of length $k$:
$$\mathrm{softmax}(scores_n)_j = \frac{e^{scores_{n,j}}}{\sum_{l=1}^{k} e^{scores_{n,l}}}$$</p>
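        <p>A sketch of this procedure for a single snippet; the span-length limit below stands in for the admission rules, which we do not spell out here:</p>
        <preformat>
import numpy as np

def top_k_candidates(start_logits, end_logits, k=2, max_len=30):
    # start_logits, end_logits: 1-D float arrays, one logit per token
    n = len(start_logits)
    # M[i, j] = s_i + e_j, kept upper triangular (j >= i) so the
    # end of the span never precedes its start
    M = start_logits[:, None] + end_logits[None, :]
    M[np.tril_indices(n, -1)] = -np.inf

    spans, scores = [], []
    while len(spans) &lt; k:
        i, j = np.unravel_index(np.argmax(M), M.shape)
        if np.isneginf(M[i, j]):
            break                      # matrix exhausted
        if j - i &lt; max_len:            # admission rule (assumed)
            spans.append((i, j))
            scores.append(M[i, j])
        M[i, j] = -np.inf              # remove entry and repeat

    if not spans:
        return [], np.array([])
    scores = np.asarray(scores)
    probs = np.exp(scores - scores.max())  # softmax over the k scores
    return spans, probs / probs.sum()
        </preformat>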
<p>To select the top 5 predictions for question $q$, we simply select the 5 expressions from
the concatenation of the per-snippet vectors $preds_n$ with the 5 highest corresponding values in the
concatenated $scores_n$.
3.3.3. List
Potential answers for list questions are retrieved using the same method as for factoid questions.
The process however requires some extra processing steps, given that for list questions different
entities need to be discriminated.</p>
<p>To select the best list of candidates, we used voting systems, treating each distinct obtained
answer as a candidate and its frequency in the answers as votes. The systems of Single
Transferable Vote (STV) and Preferential Block Voting (PBV) were tested, with STV having the
best performance. Elections are performed in rounds; in each round candidates are categorized
into states: Elected - if the candidate has already won; Rejected - if the candidate is already unable
to win; Hopeful - if the candidate has neither won nor yet been discarded.</p>
<p>Candidates for answers are obtained by splitting the predictions by all the usual separator
characters and words (e.g. ',', 'and', ';', 'or'). We tested doing the splitting after
the voting - treating full answers as candidates for the STV (STV + PostProcess) - and doing the
splitting before the voting, where separate distinct entities are treated as votes, with the score for the
ranked ballot being the average score of all the answers that contain that entity. An example of
ranked candidates before and after being processed can be seen in Tables 6 and 7. E.g. the score
of the candidate "dizziness" will be the average of the scores of the answers containing it: 0.21,
0.20 and 0.18 (1st, 2nd and 5th entries of Table 6). Each snippet contributes to the voting with
a ballot of ranked candidates, which then enters the voting algorithm.</p>
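        <p>A sketch of the splitting-before-voting step; the separator set and the input layout are illustrative:</p>
        <preformat>
import re
from collections import defaultdict

SEPARATORS = r",|;| and | or "  # usual separator characters and words

def entity_ballot(answers):
    """Turn one snippet's scored answers into a ranked ballot of entities;
    an entity's score is the average score of all answers containing it."""
    per_entity = defaultdict(list)
    for answer, score in answers.items():
        for entity in re.split(SEPARATORS, answer):
            entity = entity.strip().lower()
            if entity:
                per_entity[entity].append(score)
    avg = {e: sum(v) / len(v) for e, v in per_entity.items()}
    return sorted(avg, key=avg.get, reverse=True)

# "dizziness" is scored as the average of every answer it appears in
ballot = entity_ballot({"dizziness, headache": 0.21,
                        "dizziness": 0.20,
                        "nausea and dizziness": 0.18})
print(ballot)  # ['headache', 'dizziness', 'nausea']
        </preformat>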
<p>A potential handicap of using voting system algorithms for answer selection is the need
to predefine the number of elected entities, since in an election the number of winners is
established beforehand. This is not ideal, since the correct number of answers for a given list
question is not defined. Two characteristics of the implemented algorithms that allow us to
minimize this problem are:
• If the number of non-rejected candidates is smaller than the number of winners to be selected,
all of them are elected
• If there are ties in the election, all tied candidates are elected - even if this means electing
more candidates than requested</p>
<p>Although these factors allow for some flexibility in the number of predictions, a more adaptive
approach can be used. Since elections are performed in rounds, one can define the selected
answers as the ones that are not rejected in the pre-final round, i.e., all candidates with states
in {Hopeful, Elected}. When referencing this approach we call the number of candidates Hopeful.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Software</title>
<p>Our team tried to replicate the results of 4 state-of-the-art systems and found some reproducibility
issues. Among the causes were outdated versions of packages and compatibility issues due to the
use of conflicting code libraries, such as the use of both Tensorflow and PyTorch for different stages
of the pipeline.</p>
        <p>
          To avoid the aforementioned issues our implementation was done in a modularized fashion,
built in Python 3.6[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], using the Pytorch[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] versions of model implementations from the
Transformers[
          <xref ref-type="bibr" rid="ref9">9</xref>
] library as the main structure. In spite of its fully PyTorch architecture, the system
accepts as input Tensorflow [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] checkpoints (model’s saved parameters).
        </p>
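        <p>Loading a Tensorflow checkpoint into the PyTorch model is done through the Transformers library; the path below is a placeholder:</p>
        <preformat>
from transformers import BertForQuestionAnswering

# from_tf=True converts a TensorFlow checkpoint into the PyTorch
# model used throughout the rest of the pipeline
model = BertForQuestionAnswering.from_pretrained(
    "path/to/biobert_tf_checkpoint", from_tf=True)
        </preformat>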
<p>Fine-tuning was performed using parallelization on 6 GPUs (Tesla M10) with 8GB of memory
each. The total batch size is 18 (3 samples per GPU). A summary of the training details of the reported
results can be found in Table 8.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Metrics</title>
        <p>
For evaluation and comparison of the different models, the official BioASQ performance measures
were used [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
<p>For Yes/No questions the official metric is the MacroF1 - the mean of the F1 scores of the Yes and No
classes. Accuracy is also calculated for completeness. Factoid questions are evaluated using
Mean Reciprocal Rank (MRR); Strict Accuracy (SAcc) and Lenient Accuracy (LAcc) are also
calculated. List questions are evaluated by the average F1 score over all questions, with the mean
precision and recall also reported.</p>
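        <p>As an illustration, a minimal sketch of the MRR computation for factoid questions, assuming one golden answer per question and exact string matching:</p>
        <preformat>
def mean_reciprocal_rank(ranked_predictions, golden_answers):
    # reciprocal rank of the first correct answer among the (up to 5)
    # returned predictions; 0 if the answer is absent
    total = 0.0
    for preds, gold in zip(ranked_predictions, golden_answers):
        for rank, pred in enumerate(preds, start=1):
            if pred.lower() == gold.lower():
                total += 1.0 / rank
                break
    return total / len(golden_answers)
        </preformat>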
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Results</title>
<p>In this section we present the experimental results of the described approaches. Training was
done on the task 8B training set and evaluation was done on the aggregation of all 8B Phase B
batches. Results are compared with the average (weighted by the number of questions) of the
results of all DMIS systems in the 5 batches.
4.1.1. Yes/No
In Table 9 we can see the results of the different classification architectures for the Yes/No
question type. Results are significantly better with the extra fully connected layer before the
final binary one. Experiments showed that performance differs only slightly with the number of
neurons of the middle layer if it lies between 128 and 512. The highest MacroF1 was obtained with
256 neurons ([CLS]-3-256-2).
4.1.2. Factoid</p>
<p>For factoid questions performance increased substantially with the use of the k-candidates
approach. The results can be seen in Table 10. The best results were obtained with $k = 2$
candidate answers per snippet. It is interesting to point out that for $k > 4$ the results barely
differ. This is due to the fact that candidates of order higher than 4 typically have extremely
low scores and end up with probabilities close to 0, and are therefore discarded when the top 5
predictions are extracted.
4.1.3. List
In Table 11 we can see the results of the experiments with the list questions. We can observe
the impact of requesting different numbers of winners from the algorithm. Unsurprisingly, a
larger number of winners leads to an increase in Recall and a decrease in Precision. Maximum
performance (MacroF1) is obtained with the Hopeful strategy, for both processing strategies.</p>
        <p>Results show that splitting candidates prior to the voting leads to better results.</p>
      </sec>
      <sec id="sec-4-2">
<title>4.2. BioASQ Official Results</title>
        <p>A summary of the official results from BioASQ Task 9B - Phase B can be seen in Table 12, where
we present the results of the top teams along with ours (LASIGE), considering the BioASQ
ordering. The place in each batch is taken to be the place of the best scoring system of
each team, treating all systems of each team as one.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Unanswerability</title>
<p>An important aspect to consider when evaluating performance is the unanswerability of some
questions in the dataset. Several questions in the test sets have an answer that cannot be
extracted from the provided snippets. Ideally, to measure the actual performance of answer
extraction systems, these would be removed from the test set. Examples of such questions can
be seen in Table 13.</p>
<p>For the test set of task 8B (resulting from the aggregation of the 5 test batches), 22.5% of
factoid questions do not contain the golden answer in any provided snippet, and 25.3% of list
questions have at least one entity that is not contained in the snippets. For Yes/No questions
unanswerability would have to be assessed manually.</p>
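        <p>A sketch of how factoid unanswerability can be measured; the field names are assumptions based on the BioASQ JSON format:</p>
        <preformat>
def is_answerable(question):
    # a factoid question counts as answerable if the golden answer
    # appears verbatim in at least one of its snippets
    gold = question["exact_answer"].lower()
    return any(gold in snippet.lower() for snippet in question["snippets"])

# unanswerability rate over a set of factoid questions:
# rate = 1 - sum(map(is_answerable, factoids)) / len(factoids)
        </preformat>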
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Analysis of Results</title>
<p>All reported results were obtained using BioBERT Base (12 stacked encoding layers). Although
we tested BioBERT Large (24 stacked encoding layers), which usually obtains better results,
the results were very poor. This is probably due to memory restrictions: since BioBERT Large
has over three times more trainable parameters, reductions in input size and batch size had to
be made, which are probably the cause of the low performance.
5.1.1. Yes/No
The addition of a fully connected layer between the MNLI classification layer (3 neurons) and
the BioASQ binary classification layer improved performance on the test set. This indicates
that the relation between the knowledge obtained from the NLI data and the knowledge needed for the
BioASQ questions is not as direct as one might expect. This is not uncommon when we are
dealing with corpora from different domains (general purpose vs biomedical), and might also be
related to the existence of unanswerable questions in the dataset, which hinder the model's learning
of what represents agreement between question and snippet, since the inputs with no relation
induce noise for the binary task.</p>
<p>In Figure 5 we can see the distribution of the confidence levels for Yes ($P_{yes}$) and No
($P_{no}$) predictions, compared against the actual correct answer. Note that the
model's discriminatory power (the distance between $P_{yes}$ and $P_{no}$) is much greater for
answers with the Yes label. This can also be seen by looking at the differences between the F1
scores of both classes, noting that $F1_{yes}$ is much higher than $F1_{no}$ across experiments. This
is not surprising in NLI, as it is easier to identify entailment than it is to distinguish between
contradiction and neutral relations. Entailment is usually distinctly expressed in the passage,
whilst contradiction sometimes needs to be inferred from more complicated relations between
sentences.
5.1.2. Factoid
Looking at the experimental results (Table 10) we can see that sorting predictions using scores
obtained by applying Softmax to the $k$ predictions for each snippet strongly improved all
metrics. Moreover, we can look at the fitness of the scores by analysing Figure 6, where we
compare the distributions of confidence levels for predictions that were in fact correct and
incorrect. We can see that for the classic approach there is an almost 100% overlap of
incorrect scores with correct ones, which implies the scoring is not strong. Although there is
still some expected overlap in the k-candidates approach, one can distinctly see a higher level
of confidence for correct answers, indicating the validity of the proposed score as a confidence
level metric.
5.1.3. List
Using voting systems for the selection of answers to list questions proved to be effective, and we can see in
Table 12 that the proposed system obtained overall strong results for List type questions, with
the exception of Batch 5.</p>
<p>By using the Hopeful approach, one has flexibility in the number of entities that are selected,
and in fact this approach has the best MacroF1 scores across experiments. With the application
of the voting systems, as opposed to using a predefined threshold for answer selection, we make
use not only of the confidence level of each answer but also of the occurrence of the answer
and its relative certainty amongst other answers from the same input.</p>
<p>Figure 6 panel captions: (a) Scores from Softmax(Start Logits) plus Softmax(End Logits); (b) Softmax(k top predictions).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this paper we used transfer learning to fine-tune BioBERT on general purpose datasets (MNLI
and SQuAD) prior to fine-tuning on the BioASQ dataset. We showed how the post-processing of
the model outputs greatly impacts performance, revealing that applying Softmax to the output
scores of only the $k$ selected candidates, to obtain the predictions' confidence levels, improves
overall performance and makes the scores more meaningful. We also showed that using the Single
Transferable Vote system for electing answer candidates for list questions yields promising
results, outperforming the previous approach of selecting candidates merely based on a defined
threshold.</p>
<p>To increase the current model's performance in the future, one can enrich the transfer learning
sequences with additional biomedical domain corpora, or train the current system using BioBERT
Large on GPUs with more memory, with the same training parameters (input size, learning rate and
batch size). Another possibility is to adapt the BERT architecture to allow for the combined
training of start and end logits, i.e., train QA to find the exact span of the answer within the text
- conditioning the end of the answer on its start - instead of training them separately and doing the
conditioning in the post-processing phase.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by FCT through project DeST: Deep Semantic Tagger project, ref.
PTDC/CCI-BIO/28685/2017, and the LASIGE Research Unit, ref. UIDB/00408/2020 and ref.
UIDP/00408/2020.</p>
<p>We would like to thank Doctor Maria Fernandes from the University of Luxembourg, who
provided us access to larger GPUs for running experiments, for all her help and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          (
          <year>2019</year>
). URL: http://dx.doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Transferability of natural language inference to biomedical question answering</article-title>
          ,
          <year>2021</year>
. arXiv:2007.00217.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <article-title>How transferable are features in deep neural networks</article-title>
          ?,
          <year>2014</year>
. arXiv:1411.1792.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
, SQuAD: 100,000+ questions for machine comprehension of text,
          <year>2016</year>
          . arXiv:1606.05250.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>A broad-coverage challenge corpus for sentence understanding through inference</article-title>
          ,
          <year>2018</year>
. arXiv:1704.05426.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dendamrongvit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kubat</surname>
          </string-name>
          ,
          <article-title>Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains</article-title>
          ,
          <year>2009</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
. doi:10.1007/978-3-642-14640-4_4.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <surname>Python Core Team</surname>
          </string-name>
          ,
          <article-title>Python: A dynamic, open source programming language</article-title>
          ,
          <source>Python Software Foundation</source>
          , Vienna, Austria,
          <year>2016</year>
          . URL: https://www.python.org/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
, Pytorch:
          <article-title>An imperative style, high-performance deep learning library</article-title>
          , in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          ,
Curran Associates, Inc.,
          <year>2019</year>
          , pp.
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brevdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Citro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Devin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harp</surname>
          </string-name>
          , G. Irving,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jozefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kudlur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Levenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Warden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <source>TensorFlow: Large-scale machine learning on heterogeneous systems</source>
          ,
          <year>2015</year>
. URL: https://www.tensorflow.org/, software available from tensorflow.org.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez-Penagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
, G. Paliouras, Overview of BioASQ
          <year>2020</year>
          :
          <article-title>The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_16.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>