<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning for Biomedical Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arda Akdemir</string-name>
          <email>aakdemir@hgc.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetsuo Shibuya</string-name>
          <email>tshibuya@hgc.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tokyo</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep Neural Network (DNN) based Machine Learning models have achieved remarkable success in many fields of research. Yet, many recent studies show the limitations of these approaches in generalizing to unseen examples and to new domains such as the biomedical domain. Besides, supervised-learning based DNN models require a substantial amount of labeled data, which is not readily available for many tasks such as the biomedical question answering task. Transfer Learning has been shown to mitigate these challenges by transferring information from auxiliary tasks to improve the performance on a target task, and to be especially useful for low-resource tasks. These observations and findings motivated us to investigate the effect of transfer learning and multi-task learning on the biomedical question answering task. We propose a novel multi-task learning model to learn biomedical entities and questions simultaneously. In this work, we explain the three different neural models we used to participate in the BioASQ 8B challenge. Our initial results show that transferring information from the biomedical entity recognition task brings improvement for the biomedical question answering task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Pretrained language models [
        <xref ref-type="bibr" rid="ref11 ref3">11, 3</xref>
        ] have been frequently leveraged to improve
performance on various downstream NLP tasks since their introduction.
However, the performance of these models, which are trained on
general-domain corpora, has been shown to drop significantly when they are tested on a new
domain [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This performance drop is larger for domains that have significantly
different word distributions, such as the biomedical domain. To mitigate this
performance drop, a frequently used approach is to pretrain these models on the
target domain, which is also called domain adaptation. Recently, Lee et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] pretrained the BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] language model on PubMed articles. The resulting model,
called BioBERT, achieved state-of-the-art results on several downstream
biomedical tasks. This motivated us to use BioBERT as our baseline model in
our experiments.
      </p>
      <p>
        Transfer learning is a general term to describe the learning schemes where
the information from a source task is used to improve the performance on a
target task. It is shown to be especially useful to improve the performance on
low-resource tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Ideally, we would like to transfer information from
high-resource tasks whose domain is similar to that of the target task to make the
most out of transfer learning. Currently available datasets for biomedical
question answering are very limited. Relative to the biomedical question answering
datasets, the currently available biomedical entity datasets are large. These
findings motivated us to apply transfer learning to improve the performance on the
biomedical question answering task. Specifically, we claim that the performance
on biomedical question answering can be improved by transferring information
from the biomedical entity recognition task. We propose a multi-task learning
model that learns both the biomedical question answering and entity recognition
tasks, which has not been done before to the best of our knowledge.
Our work can be considered an extension of the previously proposed BioBERT
model. Our model differs from the BioBERT model in two main ways. First, unlike
the BioBERT model, we propose a single neural architecture to simultaneously
learn three question types (factoid, yes/no, list). This allows the model to
transfer information between different question types. Second, we propose a multi-task
learning model to learn the biomedical entity recognition and question
answering tasks. BioBERT uses separate architectures for the two downstream tasks.
Thus, the pretrained BioBERT model is fine-tuned from scratch for each task.
Unlike BioBERT, our model allows transferring information between these two
tasks during the fine-tuning step.
      </p>
    </sec>
    <sec id="sec-2">
      <title>BioASQ Challenge</title>
      <p>
        BioASQ is a challenge on biomedical semantic indexing and question
answering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The challenge aims to advance the state of the art in semantic
indexing and question answering, and also to establish a reference point for biomedical
question answering. More information about the challenge can be obtained from
the BioASQ homepage (http://bioasq.org/). We participated in the question answering part of
the BioASQ 2020 challenge (8B) to test our claim on using transfer learning
for biomedical question answering. This paper describes the models we used to
make our submissions to the BioASQ 8B challenge. We participated in the
challenge with three different neural architectures, and used the BioASQ datasets
as our test bed to compare these proposed models. Our main contributions can
be listed as follows:
- We implemented a novel neural architecture that uses a single model to
jointly learn three question types in the BioASQ challenge.
- We proposed a novel multi-task learning model for entity recognition and
question answering for the biomedical domain, which has not been employed
before to the best of our knowledge.
- We analyzed the effect of transferring information from three biomedical
entity recognition datasets for the biomedical question answering task.
      </p>
      <sec id="sec-2-1">
        <title>Methodology</title>
        <p>In this section we describe each model we used during BioASQ Task 8b:
Biomedical Semantic Question Answering. During the task, we made submissions
using five different models, three of which used an identical neural architecture
but differed in the evaluation method used to select the final model. We used the
BioBERT-based Question Answering Model [17] as our baseline model, which we
refer to as BioBERT baseline. The second model is an extension of the first
model, which jointly learns all question types using a single architecture. We
refer to this model as BioBERT allquestions. We used three variations of this
model for our submissions. Finally, we used a novel multi-task learning model
that learns biomedical entities and all question types simultaneously. We refer
to this model as BioBERT multitask. For the BioASQ 8B challenge, we only
submitted answers for the 'list', 'factoid', and 'yes/no' type questions. 'Summary'
type questions require a fundamentally different approach, and were beyond the
main scope of this work.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Pre-processing</title>
      <p>The raw input format of the BioASQ dataset needs to be pre-processed into the
format expected by the BioBERT model. Following Yoon et al. [17],
we used a similar pre-processing scheme to convert the BioASQ questions into
the SQuAD Question Answering format. In the BioASQ dataset, multiple
gold answers are provided for most questions. Gold answers are denoted as spans
inside the snippets provided for each question. During pre-processing, we treated
each gold-label snippet and question pair as a separate example to increase the
size of the training set. During all our experiments, we only made use of the
gold-label snippets. We did not analyze the effect of appending additional
information from external sources such as the links to related documents provided by
the BioASQ organizers. Previously, Yoon et al. [17] experimented with various
pre-processing methods to bring further improvements. They observed that the
benefits of each strategy depend on the question type and the test batch. For
this reason, we fixed the pre-processing method throughout all our experiments
to make it clear where the improvements for each proposed model come from.
Besides, using only the snippets as input to the neural networks significantly
reduces the input size and the overall training time. For factoid and list
type questions, each gold-label span is used to create a new Question-Passage
pair. An example factoid type question and gold-label spans from the provided
snippets are given in Table 1. The final predictions for the list type questions are
handled during the post-processing step, and are explained in the relevant section.</p>
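      <p>The expansion of one question into several Question-Passage pairs can be sketched as follows. This is a minimal illustration rather than the pre-processing code used for the submissions: the field names ("id", "body", "type", "snippets", "text", "exact_answer") and the case-insensitive substring match used to locate a gold answer inside a snippet are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
def bioasq_to_examples(question):
    """Expand one question into one example per (question, snippet) pair.

    The keys "id", "body", "type", "snippets", "text" and "exact_answer"
    are assumed names for the raw fields, with "exact_answer" a flat list
    of strings for factoid/list questions and "yes"/"no" otherwise.
    """
    examples = []
    for s_idx, snippet in enumerate(question["snippets"]):
        passage = snippet["text"]
        base = {
            "question": question["body"],
            "context": passage,
            "type": question["type"],
        }
        if question["type"] == "yesno":
            # Yes/no examples carry a label instead of an answer span.
            examples.append({**base, "id": f"{question['id']}_{s_idx}",
                             "label": question["exact_answer"]})
            continue
        # Factoid/list: each gold answer found in the snippet yields a new pair.
        for a_idx, answer in enumerate(question["exact_answer"]):
            start = passage.lower().find(answer.lower())
            if start == -1:
                continue  # this gold answer does not occur in this snippet
            examples.append({
                **base,
                "id": f"{question['id']}_{s_idx}_{a_idx}",
                "answers": [{"text": passage[start:start + len(answer)],
                             "answer_start": start}],
            })
    return examples
```
]]></preformat>
      <p>Each returned dictionary is one training example, so a question with several gold-label snippets contributes several examples.</p>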
      <p>
        Contrary to the previous work that directly adapts the BERT Question
Answering Model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by modifying the 'is_impossible' field of the SQuAD dataset
format for the yes/no type questions, we implemented our own Yes/No
component. This enabled us to use the data without adding the 'is_impossible'
field, making the dataset format more readable and easier to understand for
researchers from the biomedical domain.
      </p>
    </sec>
    <sec id="sec-4">
      <title>BioBERT-based Baseline Model</title>
      <p>
        Pretrained subword contextual embeddings have shown remarkable progress over
previous approaches on many downstream Natural Language Processing (NLP)
tasks [
        <xref ref-type="bibr" rid="ref12 ref13">16, 13, 12</xref>
        ]. Specifically, the transformer-based BERT model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] helped
achieve state-of-the-art results on many downstream tasks, including question
answering.
      </p>
      <p>
        The performance of models pretrained on general-domain corpora (e.g.,
Wikipedia articles) drops significantly when tested on niche domains such as
the biomedical domain. Motivated by this observation, Lee et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed
BioBERT, a BERT architecture pretrained on PubMed articles. The proposed
model obtained state-of-the-art results on three different downstream
biomedical NLP tasks. Recently, Yoon et al. [17] obtained the best results in the 2019
BioASQ 7B Question Answering Challenge, achieving state-of-the-art
results on all question types (factoid, yes/no, list). In their approach,
separate models are trained for the yes/no and the factoid/list question types,
each starting from the pretrained BioBERT weights.
      </p>
      <p>For our baseline model, we used this BioBERT-based approach, which we refer
to as BioBERT baseline. The BERT model is extended with two separate additional
neural layers to learn different question types. The overall architectures are given
in Figure 1. For the yes/no type questions, the final-layer BERT output for the first token ([CLS])
is given as input to a fully connected layer with
a 2-dimensional output representing the scores for the yes and no answers. This is followed by a
softmax layer to convert these scores into probabilities. Given a sequence of $n$
question tokens $Q = \{q_t : 1 \le t \le n\}$ and $m$ passage tokens $P = \{p_t : 1 \le t \le m\}$,
BioBERT outputs $m + n + 2$ fixed-size ($L$-dimensional) vectors $V = \{v_j : 1 \le j \le m + n + 2\}$.
Next, $v_1$ is multiplied with an $(L, 2)$-dimensional matrix $W$ to generate the scores
$S = \{s_{\mathrm{yes}}, s_{\mathrm{no}}\}$ for the yes and no answers:</p>
      <p>
        <disp-formula><tex-math>V = \mathrm{BioBERT}(Q, P)</tex-math></disp-formula>
        <disp-formula><tex-math>S = v_1^{T} W</tex-math></disp-formula>
        <disp-formula><tex-math>O = \mathrm{Softmax}(S)</tex-math></disp-formula>
      </p>
      <p>where $O = \{o_{\mathrm{yes}}, o_{\mathrm{no}}\}$ represents the probabilities for each answer, which is
the final output for the yes/no type questions. Similarly, for the factoid/list type
questions, each $v_j$ is multiplied with an $(L, 2)$-dimensional matrix $W_2$ to generate the
scores $S_2 = \{s_{\mathrm{start}}, s_{\mathrm{end}}\}$, which represent the scores for token $p_j$ of the
input passage $P$ being the start and the end of the answer span:</p>
      <p>
        <disp-formula><tex-math>V = \mathrm{BioBERT}(Q, P)</tex-math></disp-formula>
        <disp-formula><tex-math>S_2 = v_j^{T} W_2</tex-math></disp-formula>
      </p>
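      <p>The two output layers can be illustrated with a small, framework-free sketch. It assumes the encoder output V is already available as a list of L-dimensional vectors, and the function names (linear, yes_no_head, span_head) are hypothetical. It shows only the shapes of the computation and is not the implementation used for the submissions.</p>
      <preformat><![CDATA[
```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities (numerically stable)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weights):
    """Compute v^T W for a length-L vector and an (L, 2) matrix given as L rows."""
    num_outputs = len(weights[0])
    return [sum(v * row[k] for v, row in zip(vector, weights))
            for k in range(num_outputs)]


def yes_no_head(V, W):
    """Score only the first vector v_1 (the [CLS] position): O = Softmax(v_1^T W)."""
    o_yes, o_no = softmax(linear(V[0], W))
    return o_yes, o_no


def span_head(V, W2):
    """Score every vector v_j: S_2 = v_j^T W_2 gives (s_start, s_end) per token."""
    scores = [linear(v, W2) for v in V]
    start_scores = [s[0] for s in scores]
    end_scores = [s[1] for s in scores]
    return start_scores, end_scores
```
]]></preformat>
      <p>In the baseline, each head sits on top of its own copy of the encoder, so V comes from a different fine-tuned model for each question type.</p>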
      <p>For training, each BioBERT-initialized model in Figure 1 is fine-tuned on the
BioASQ 8B training set for each question type separately. The main drawback of this
previously proposed model is that the common BioBERT layer, which constitutes
the majority of the parameters (only a single layer is added for each question
type), is fine-tuned separately for each question type. The bottleneck for
developing high-performing biomedical question answering systems is the scarcity of
labeled training sets. This approach further limits the training dataset size,
and is not ideal for low-resource domains like the biomedical domain.</p>
      <p>Fig. 1: Overall architectures for training separate models for the yes/no and
factoid/list question types in the BioBERT baseline model [17]: (a) yes/no model; (b) factoid/list model. The common
BioBERT model layers are fine-tuned from scratch for each type.</p>
    </sec>
    <sec id="sec-5">
      <title>Joint-Learning Model</title>
      <p>The baseline approach does not expose the model to all the examples in the
training dataset. This observation motivated us to propose a novel joint-learning
model that uses a single architecture to learn all question types, which we refer
to as BioBERT allquestions. To the best of our knowledge, learning all question types with a single
BERT-based model has not been done before in this domain.
The overview of the proposed joint-learning model is shown in Figure 2. This
simple extension to the previously proposed BioBERT-based QA Model [17]
allows exposing the model to all the available examples in the training dataset.
The common BioBERT layer is trained jointly on all question types. This allows
the model to transfer information from other question types for better
generalization.</p>
      <p>An important part of training joint-learning models is the selection of
performance metrics. In the conventional single-task machine learning setting, there
is usually a single performance metric. The models are evaluated on a
development/validation dataset based on this metric to determine the best-performing
model during training. In the joint-learning setting, we can evaluate the models
based on their performance on each task separately, or we can evaluate them
based on their overall performance. For our submissions to the BioASQ 8B
challenge, we used the following three joint-learning models:
- Overall best performer
- Best yes/no model
- Best factoid model</p>
      <p>
        To determine the best performer in each of the three cases, we used the average
results over the five test batches of the BioASQ 6B challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All three models are
obtained from the same training experiment, and correspond to checkpoints
of the same model instance.
      </p>
      <p>
The multi-task learning model further extends the joint learner explained in
Section 2.3. In this setting, a single neural model is trained for the Biomedical Question
Answering and Gene/Protein Entity Recognition tasks simultaneously. The
Question Answering component of the model is identical to that of the
joint learner. In addition, the model contains an entity recognition component
consisting of a fully connected (FC) layer followed by a Conditional Random Field
(CRF) layer. CRF-based models are frequently used for the named entity
recognition task to take into account the tag transitions between consecutive
tokens [
        <xref ref-type="bibr" rid="ref1 ref8">8, 1</xref>
        ]. For this reason, we extended the NER-component of the previously
proposed BioBERT-based NER model in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to include an additional CRF layer.
The overall architecture of the proposed multi-task learner is shown in
Figure 3. For a sequence of $n$ tokens $\{t_i : 1 \le i \le n\}$, the NER component receives
the BioBERT representation for each token. The subword token representations
are then averaged to get the word-level representations. These word-level
representations are fed into the FC layer to generate the scores for each entity label,
for each token. The CRF layer generates the final score for each label by taking
into account the transitions between labels. For the NER component, a
CRF loss is used. The loss is calculated as the difference between the total score of
all possible label sequences (all possible paths) and the score of the gold-label
sequence (gold-label path):
      </p>
      <p>
        <disp-formula><tex-math>b_j = \mathrm{BioBERT}(t_1, \dots, t_i, \dots, t_n, j)</tex-math></disp-formula>
        <disp-formula><tex-math>s_j = \mathrm{FC}_{ner}(b_j)</tex-math></disp-formula>
        <disp-formula><tex-math>S = [s_1, \dots, s_j, \dots, s_n]</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{crf\_loss} = \mathrm{forward\_score}(S, T) - \mathrm{path\_score}(S, T, G)</tex-math></disp-formula>
where $S$ denotes the scoring matrix containing scores for each label and word
pair, $G$ is the gold-label sequence, and $T$ is the transition matrix containing
transition scores between each pair of labels. $\mathrm{forward\_score}(S, T)$ denotes the total score
of all paths and $\mathrm{path\_score}(S, T, G)$ is the score of the gold-label sequence.
Ideally, we want all probabilities to accumulate on the gold-label path so that
these two scores will be identical.</p>
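      <p>A minimal sketch of this loss is given below. It assumes that the "total score of all paths" is aggregated with a log-sum-exp, as in the usual negative log-likelihood formulation of a linear-chain CRF, and it omits start and end transitions for brevity. S is a list of per-word label scores, T[p][c] is the score of moving from label p to label c, and G is the gold label sequence. The helper names are hypothetical, and the aggregation and simplifications are assumptions, not details confirmed by the text.</p>
      <preformat><![CDATA[
```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(S, T, G):
    """Score of one label sequence: emission scores plus transition scores."""
    score = S[0][G[0]]
    for j in range(1, len(G)):
        score += T[G[j - 1]][G[j]] + S[j][G[j]]
    return score


def forward_score(S, T):
    """Log-sum-exp of the scores of all label sequences (forward algorithm)."""
    num_labels = len(S[0])
    alpha = list(S[0])  # alpha[c]: log-total score of prefixes ending in label c
    for j in range(1, len(S)):
        alpha = [
            log_sum_exp([alpha[p] + T[p][c] for p in range(num_labels)]) + S[j][c]
            for c in range(num_labels)
        ]
    return log_sum_exp(alpha)


def crf_loss(S, T, G):
    """Difference between the all-paths score and the gold-path score."""
    return forward_score(S, T) - path_score(S, T, G)
```
]]></preformat>
      <p>Under this formulation the loss is never negative, and it approaches zero as the probability mass concentrates on the gold-label path.</p>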
      <p>
        Inference. During inference, we used Viterbi decoding [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to find the
highest-scoring label sequence for the entity recognition task.
      </p>
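      <p>Viterbi decoding over the same kind of score matrices can be sketched as follows. As an assumption for the example, S holds per-word label scores and T[p][c] holds the transition score from label p to label c, with no start or end transitions. The function name viterbi_decode is hypothetical.</p>
      <preformat><![CDATA[
```python
def viterbi_decode(S, T):
    """Return the highest-scoring label sequence and its score."""
    num_labels = len(S[0])
    best = list(S[0])   # best[c]: best score of any prefix ending in label c
    backpointers = []   # one list per position j >= 1
    for j in range(1, len(S)):
        step_best, step_back = [], []
        for c in range(num_labels):
            # Best previous label when the current label is c.
            prev = max(range(num_labels), key=lambda p: best[p] + T[p][c])
            step_back.append(prev)
            step_best.append(best[prev] + T[prev][c] + S[j][c])
        best = step_best
        backpointers.append(step_back)
    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=lambda c: best[c])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, max(best)
```
]]></preformat>
      <p>The decoder replaces the sum over paths used in the loss with a maximum, so it visits each position once instead of enumerating every label sequence.</p>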
    </sec>
    <sec id="sec-6">
      <title>Post-processing</title>
      <p>As explained in the pre-processing section, we divided each question with
multiple gold-label snippets into separate inputs. These examples are merged during
post-processing to generate a unique answer for each question. For the
post-processing step, we followed [17] to combine the predictions for the same question
for factoid/list type questions. Majority voting is used to find the highest-scoring
predictions for each factoid/list type question. For each factoid type question, the
top N highest-scoring predictions are returned, where N corresponds to the
maximum limit allowed for the BioASQ 8B challenge. For the list type questions,
we used 0.50 as the probability threshold, and included all answers that have a
higher average probability score.</p>
      <p>For the yes/no type questions, we averaged the probability scores for each
example belonging to the same question instance to determine the final answer.</p>
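      <p>The merging step can be sketched as below. Several details are assumptions made for illustration: candidate answers are matched after lower-casing and trimming, an answer's score is its average probability over the snippets in which it was predicted, and a yes/no answer is "yes" when the mean probability of yes is at least 0.5. The limit N is left as a required argument rather than fixed here, and all function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def merge_span_predictions(predictions):
    """Average the probability of each distinct answer string.

    predictions: list of (answer_text, probability), one per snippet-level
    prediction for the same question. Returns (answer, mean probability)
    pairs sorted from highest to lowest mean probability.
    """
    grouped = defaultdict(list)
    for text, prob in predictions:
        grouped[text.strip().lower()].append(prob)
    averaged = {text: sum(probs) / len(probs) for text, probs in grouped.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


def factoid_answer(predictions, top_n):
    """Return the top_n highest-scoring answers for a factoid question."""
    return [text for text, _ in merge_span_predictions(predictions)[:top_n]]


def list_answer(predictions, threshold=0.50):
    """Return every answer whose average probability exceeds the threshold."""
    return [text for text, prob in merge_span_predictions(predictions)
            if prob > threshold]


def yes_no_answer(yes_probabilities):
    """Average the per-snippet probability of 'yes' for one question."""
    mean_yes = sum(yes_probabilities) / len(yes_probabilities)
    return "yes" if mean_yes >= 0.5 else "no"
```
]]></preformat>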
      <sec id="sec-6-1">
        <title>Experimental Settings</title>
        <p>
          In this section we explain the details of the experiments we conducted. All
experiments were run on a single V100 GPU. For the Question Answering task,
we used the BioASQ 6B test sets as our validation set and the examples
in the BioASQ 8B training set for training. For the entity recognition task, we
kept the train/dev/test split provided in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It takes around 4-5
epochs on the training set to achieve the highest performance on the question
answering validation sets for all models.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Datasets</title>
      <p>
        The entity recognition component of the final multi-task learning model we used
for our submissions is trained on the BC2GM dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The dataset contains
20,703 entity mentions in total and is annotated using the BIO scheme. The first token
of each entity is annotated with 'B' and the following tokens are annotated with
'I'. Non-entity tokens are annotated with 'O'.
      </p>
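      <p>As a small illustration of the BIO scheme, the sketch below converts entity spans given as token index ranges into one tag per token. The function name and the (start, end) span format with an exclusive end index are assumptions made for the example, and the sentence in the usage note is invented rather than taken from BC2GM.</p>
      <preformat><![CDATA[
```python
def bio_tags(tokens, entity_spans):
    """Tag tokens with 'B', 'I' or 'O'.

    entity_spans: list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"            # remaining tokens of the entity
    return tags
```
]]></preformat>
      <p>For the tokens "The", "BRCA1", "gene", "is", "mutated" with a single entity covering "BRCA1 gene", the tags are O, B, I, O, O.</p>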
      <p>
        In order to evaluate our proposed multi-task learner, we trained the entity
recognition component on three different datasets. We used the BC2GM [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
BC4CHEMD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and BC5CDR [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] datasets for biomedical entity recognition,
which contain gene entities, chemical entities, and disease mentions, respectively.
As we had a limit of five submissions for each test batch of
the BioASQ 8B challenge, we only submitted the multi-task learning model trained
on the BC2GM dataset.
      </p>
      <p>The BioASQ 8B training set contains 3,243 questions in total. We did not
make use of the 777 summary type questions, so our overall training set
contained 2,466 questions. For training our models, we used only the snippets
provided by the challenge organizers as the relevant passages for each question.
Each snippet and question pair is treated as a unique $(Q, P)$ pair, which is given as
input to the question answering component, where $Q$ and $P$ represent 'question'
and 'passage', respectively.</p>
      <p>
        For evaluating our proposed models, we also used the factoid questions from
the BioASQ 6B test set [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The test set contains five batches, and the number
of factoid questions in each batch is given in Table 4.
      </p>
      <p>
In this section, we explain how we trained each of the three models we used to
make submissions for the BioASQ 8B challenge. In all three models, we initialized
the weights of the BERT component using BioBERT version 1.1, provided
by Lee et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and pretrained on PubMed articles. For a fair comparison, we
always used a maximum sequence length of 256, as we observed that going above
this value sometimes resulted in memory issues. Table 5 gives a comprehensive
list of the hyperparameters we used during our experiments.
      </p>
      <p>
Baseline model training. The baseline model (BioBERT baseline) is
composed of two completely separate neural architectures (one for yes/no and one
for factoid/list type questions). In this approach, each architecture is trained
separately, using only the corresponding dataset. During pre-processing, list type
questions are converted into the factoid question format by treating each answer in
the list of answers as a single factoid type answer. After this pre-processing step,
the formats of the factoid and list type questions become identical, so that the
same architecture can be used for training on both types.
      </p>
      <p>Joint-learning model training. The joint learner (BioBERT allquestions) is
trained on all question types at once. At each iteration, a $(Q, P)$ pair is picked
randomly from the whole training set. If the picked example is a 'yes/no' type
question, the 'Yes/No' component in Figure 2 is used to generate the output
of the model. Otherwise, the 'Factoid/List' component is used to generate the
'start' and 'end' scores for each token inside the given passage $P$. The loss for
each input example is backpropagated to update the weights of 1) the
question-type-specific component, and 2) the common BioBERT component. This way,
we allow information transfer between different question types. Considering the
relatively small sizes of the biomedical question answering datasets, this allows
a better utilization of what is available. Besides, this approach reduces the total
number of parameters of the final model by almost half, as the majority of the
trainable parameters are the common BioBERT weights. As we have multiple
target performance metrics (overall performance and performance on each
question type), we continued the training until we could not observe any improvement
for any question type on the question answering validation sets.</p>
    </sec>
    <sec id="sec-8">
      <title>Multi-task learning model training</title>
      <p>The multi-task learning model (BioBERT multitask) is simultaneously trained for the question answering and
the entity recognition tasks. At each iteration we flip a coin to
determine the task type (QAS or NER), and use the corresponding component from
Figure 3. Similar to the joint-learning model, this allows information transfer
from the NER dataset examples to the question answering task. The common
BioBERT model is updated using examples from both tasks, which allows us to
expose the model to a significantly larger number of sentences from the
biomedical domain. In this work, the entity recognition task is used as an auxiliary task to
help improve the final performance on the target question answering task. For
this reason, training continues until we can no longer observe any improvement on the
question answering validation set.</p>
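      <p>The task-sampling scheme can be sketched as follows. The sketch only produces the sequence of (task, component, example) choices. The actual forward and backward passes through the shared BioBERT encoder are omitted, and the equal probability of the two tasks, the example format, and the names used are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import random


def training_schedule(qa_examples, ner_examples, num_steps, seed=0):
    """Pick a task per step by a coin flip, then an example and its component.

    qa_examples: list of dicts with a "type" key ("yesno", "factoid" or "list").
    ner_examples: list of tagged sentences (any format).
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(num_steps):
        if rng.random() < 0.5:
            # Question answering step: route by question type.
            example = rng.choice(qa_examples)
            head = "yes_no" if example["type"] == "yesno" else "factoid_list"
            schedule.append(("QAS", head, example))
        else:
            # Entity recognition step: FC layer followed by the CRF layer.
            schedule.append(("NER", "crf", rng.choice(ner_examples)))
    return schedule
```
]]></preformat>
      <p>In every step the loss would update the selected task-specific component together with the shared encoder, which is how examples from the NER dataset can influence the question answering component.</p>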
      <sec id="sec-8-1">
        <title>Results</title>
        <p>In this section, we start by giving the results we obtained when evaluating our
proposed multi-task learner. We compare BioBERT multitask, which learns both the
entity recognition and question answering tasks simultaneously, with the
joint-learning model BioBERT allquestions, which only focuses on the question
answering task. The BioASQ 8B data is used to train both models, and the factoid
type questions from the BioASQ 6B challenge, which contains five different test
batches, are used to evaluate them. For training the entity recognition
component of the multi-task learning model, we used three different biomedical entity
datasets. The results for both models are given in Table 6. For all three entity
datasets, the multi-task learning model outperformed the model that only learns the question
answering task on all five test batches. These results support our initial claim that
transferring information from the entity recognition task improves the performance
on the target question answering task, and motivated us to apply the proposed
multi-task learning model on the BioASQ 8B test sets.</p>
        <p>Next, we give the results obtained on the BioASQ 8B challenge for each
model we explained above. For the first test batch, we only made submissions
using two models: BioBERT baseline and BioBERT allquestions. For the other four
test batches, we made five submissions using the three models explained above.
To be able to make a clean comparison between the proposed models, we kept
the post-processing schemes identical for all our submissions. This is necessary
to evaluate our claim on using multi-task learning to improve the performance
on the biomedical question answering task.</p>
        <p>The QAS components of the joint-learning model and the multi-task learning
model are identical. In order to evaluate our claim on using multi-task learning
for question answering, we must compare these models with each other, rather than comparing
them with the single-task learning model, which uses a different architecture
(separate models for each question type). The results show that for the factoid
questions, the multi-task learning model outperformed all three joint-learning
models on all four test batches. This suggests that leveraging information
about genes and proteins helps improve the final performance on
the factoid type questions. The results for list and yes/no type questions are
mixed, and the benefits of multi-task learning are unclear for these types.</p>
      </sec>
      <sec id="sec-8-2">
        <title>Conclusion</title>
        <p>In this paper we described the models we used to make submissions for the
BioASQ 8B challenge. We proposed a novel multi-task learning model for the
biomedical entity recognition and question answering tasks. Our results showed that
transferring information from the entity recognition task consistently improved
the performance on the factoid type questions of the question answering task.
On all test batches of both the BioASQ 6B and BioASQ 8B challenges,
transferring information brought improvement for factoid questions. We believe that
further improvements can be achieved by implementing a more sophisticated
information-sharing scheme between the two tasks. Analyzing the characteristics of each
dataset used can help us understand why transfer learning improves or degrades
the performance for each question type.</p>
        <p>
          So far we have only considered using domain-adaptive pretrained models
(BioBERT-based). Recent work on pretraining showed that task-adaptive
pretraining brings additional improvement for low-resource tasks [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Our plan is to
incorporate task-adaptive pretraining for the biomedical question answering
task.
        </p>
        <p>
16. Wu, S., Dredze, M.: Beto, Bentz, Becas: The surprising cross-lingual effectiveness
of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP). pp. 833–844 (2019)
        </p>
        <p>
17. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model
for biomedical question answering. In: Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. pp. 727–740. Springer (2019)
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In: COLING</source>
          <year>2018</year>
          , 27th International Conference on Computational Linguistics. pp.
          <fpage>1638</fpage>
          –
          <lpage>1649</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Akdemir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Research on task discovery for transfer learning in deep neural networks</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</source>
          . pp.
          <fpage>33</fpage>
          –
          <lpage>41</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          –
          <lpage>4186</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Forney</surname>
            ,
            <given-names>G.D.</given-names>
          </string-name>
          :
          <article-title>The Viterbi algorithm</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>61</volume>
          (
          <issue>3</issue>
          ),
          <fpage>268</fpage>
          –
          <lpage>278</lpage>
          (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gururangan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marasovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swayamdipta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          .
          <source>arXiv preprint arXiv:2004.10964</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kakadiaris</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.):
          <article-title>Proceedings of the 6th BioASQ Workshop A challenge on large-scale biomedical semantic indexing and question answering</article-title>
          .
          <source>Association for Computational Linguistics</source>
          , Brussels, Belgium (Nov
          <year>2018</year>
          ), https://www.aclweb.org/anthology/W18-5300
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          , et al.:
          <article-title>Overview of the BioCreative VI chemical-protein interaction Track</article-title>
          .
          <source>In: Proceedings of the sixth BioCreative challenge evaluation workshop</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>141</fpage>
          –
          <lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>arXiv preprint arXiv:1603.01360</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1234</fpage>
          –
          <lpage>1240</lpage>
          (09
          <year>2019</year>
          ). https://doi.org/10.1093/bioinformatics/btz682
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sciaky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattingly</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegers</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>BioCreative V CDR task corpus: a resource for chemical disease relation extraction</article-title>
          .
          <source>Database</source>
          <volume>2016</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <fpage>2227</fpage>
          –
          <lpage>2237</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>CoQA: A conversational question answering challenge</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          ,
          <fpage>249</fpage>
          –
          <lpage>266</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Neural Transfer Learning for Natural Language Processing</article-title>
          .
          <source>Ph.D. thesis</source>
          , National University of Ireland, Galway (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanabe</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>nee Ando</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuo</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganchev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Overview of BioCreative II gene mention recognition</article-title>
          .
          <source>Genome Biology</source>
          <volume>9</volume>
          (
          <issue>S2</issue>
          ),
          <fpage>S2</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>