<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NCU-IISR: Using a Pre-trained Language Model and Logistic Regression Model for BioASQ Task 8b Phase B</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, National Central University</institution>
          ,
          <addr-line>Taoyuan</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent successes in pre-trained language models, such as BERT, RoBERTa, and XLNet, have yielded state-of-the-art results in the natural language processing field. BioASQ is a question answering (QA) benchmark with a public and competitive leaderboard that spurs advancement in large-scale pre-trained language models for biomedical QA. In this paper, we introduce our system for the BioASQ Task 8b Phase B. We employed a pre-trained biomedical language model, BioBERT, to generate “exact” answers for the questions, and a logistic regression model with our sentence embedding to construct the top-n sentences/snippets as a prediction for “ideal” answers. On the final test batch, our best configuration achieved the highest ROUGE-2 and ROUGE-SU4 F1 scores among all participants in the 8th BioASQ QA task (Task 8b, Phase B).</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Pre-trained Language Model</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since 2013, BioASQ (http://bioasq.org/)
        <xref ref-type="bibr" rid="ref1">(Tsatsaronis et al., 2015)</xref>
        has organized eight challenges on
biomedical semantic indexing and question answering. This year, the challenges include
three main tasks: Task 8a, Task 8b, and Task MESINESP. We only participated in
Task 8b Phase B (QA task), in which participants are given a biomedical question and a
list of question-relevant articles/snippets as input, and should return either an exact
answer or an ideal answer. The task provided 3,243 training questions, which included
the previous years' test sets with gold annotations, plus 500 test questions for evaluation,
divided into five batches of 100 questions each. All questions and answers were
constructed by a team of biomedical experts from around Europe; the questions were
categorized into four types: yes/no, factoid, list, and summary. Three types of questions
required exact answers: yes/no, factoid, and list. For all four types of questions,
participants needed to submit ideal answers. Each participant was allowed to submit a
maximum of five results in Task 8b.
      </p>
      <p>
        Some QA examples are illustrated in Fig. 1. Each BioASQ QA instance gives a
question and several relevant snippets of PubMed abstracts, including the ID of the full
PubMed article. Thus, we formulated the task as query-based multi-document (a)
extraction for exact answers and (b) summarization for ideal answers. In this paper, we
employed the pre-trained biomedical language model BioBERT
        <xref ref-type="bibr" rid="ref2">(Lee et al., 2020)</xref>
        , which
achieved the highest performance in last year's challenge. However, BioBERT had not
previously been used for generating ideal answers. BioBERT is well suited to different
natural language processing (NLP) tasks such as relation classification and identifying the
answer phrase of a question in a given paragraph. BERT uses a masking mechanism
to train its language model, thus making the model learn word meanings in different contexts.
Results on many biomedical tasks show that its language model outperforms traditional
word representations. Therefore, we further applied BioBERT's [CLS] embeddings as
input to a logistic regression model for predicting ideal answers.
      </p>
      <p>The sections are organized as follows. Section 2 briefly reviews recent works on QA.
The details of our two methods are described separately in Sections 3 and 4. Section 5
describes our configurations submitted to the BioASQ challenge. Section 6 gives a
summary of our system’s performance in the BioASQ QA task.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In most QA tasks, such as SQuAD (https://rajpurkar.github.io/SQuAD-explorer/)
        <xref ref-type="bibr" rid="ref3">(Rajpurkar, Zhang, Lopyrev, &amp; Liang, 2016)</xref>
        ,
SQuAD 2.0
        <xref ref-type="bibr" rid="ref4">(Rajpurkar, Jia, &amp; Liang, 2018)</xref>
        , and PubMedQA
        <xref ref-type="bibr" rid="ref5 ref9">(Jin, Dhingra, Liu,
Cohen, &amp; Lu, 2019)</xref>
        , only exact answers are provided for questions. Exact answers
almost always appear in the context of the given relevant articles/snippets; thus, these
tasks are usually formulated as a sequence-to-sequence problem. Recently, it was found
that significant improvements can be achieved in many natural language processing (NLP)
tasks by using pre-trained contextual representations
        <xref ref-type="bibr" rid="ref6">(Peters et al., 2018)</xref>
        rather than
simple word vectors.
      </p>
      <p>
        For instance, Google developed Bidirectional Encoder Representations from
Transformers (BERT)
        <xref ref-type="bibr" rid="ref7">(Devlin, Chang, Lee, &amp; Toutanova, 2018)</xref>
        to solve the problem of
shallow bidirectionality. BERT uses a masked language model (MLM) as its
pre-training objective: the MLM randomly masks some tokens in the unlabeled input
and then predicts the original vocabulary ID of each masked word based on its context.
Because the MLM jointly conditions on both left and right context in its representations, it can
pre-train a deep bidirectional Transformer. In BERT's framework, two steps
(pre-training and fine-tuning) have the same architectures but different output layers. During
fine-tuning, different down-stream tasks initialize models with the same pre-trained
model parameters, and all parameters are fine-tuned using labeled data from each task.
BERT is the first fine-tuning-based representation model, and its result outperforms
prior models on sentence-level and token-level NLP tasks.
      </p>
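As a rough illustration (our own sketch, not BERT's actual implementation), the masking step of the MLM objective can be written as follows; the 15% masking rate follows BERT, while the helper name and example tokens are ours.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK]; the model is
    then trained to predict the original token at every masked position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok            # label the model must recover
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the drug inhibits the kinase activity".split()
masked, targets = mask_tokens(tokens)
```

During pre-training, the loss is computed only over the positions stored in `targets`, which is what lets the model condition on both sides of each masked word.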
      <p>
        Many significant sentence-level classification tasks come from the General
Language Understanding Evaluation (GLUE, https://gluebenchmark.com/leaderboard) benchmark
        <xref ref-type="bibr" rid="ref8">(Wang et al., 2018)</xref>
        . To help
machines understand language just like humans, GLUE provides nine diverse sentence
understanding tasks; one example is inputting a pair of sentences, for which the system
must predict a relationship with one sentence as the premise and the other as the
hypothesis. Where most token-level natural language understanding (NLU) models are
designed to carry out a specific task using specific domain data, GLUE is an auxiliary
dataset for exploring models with an eye to understanding specific linguistic
phenomena across different domains; it thus provides a public online platform for evaluating
and comparing models.
      </p>
      <p>On the other hand, the two major QA tasks, the Stanford Question Answering
Dataset (SQuAD) and SQuAD 2.0, are both token-level tasks. Each instance of the
SQuAD gives a question and a passage from Wikipedia, for which the goal is to find
the answer text span (start and end position in tokens) in the passage. The SQuAD 2.0
task extends the original SQuAD problem definition by allowing there to be no short
answer in the provided paragraph. Each task has an official leaderboard.</p>
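As an illustration of this token-level formulation, here is a minimal sketch of how a SQuAD-style reader turns per-token start and end scores into an answer span; the toy logits and the length cap are our assumptions, not from any particular system.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) token pair maximizing the summed scores,
    subject to end >= start and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits over a 5-token passage: the model favors tokens 2..3.
start = [0.1, 0.2, 2.5, 0.3, 0.1]
end   = [0.1, 0.1, 0.4, 2.0, 0.2]
span = best_span(start, end)   # → (2, 3)
```

The returned token offsets are then mapped back to a character span in the passage to produce the exact answer text.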
      <p>
        Because these NLP tasks have public leaderboards, they are highly competitive and
make for rapid expansion in pre-trained models. BERT provided a good start, after
which improved models came out such as RoBERTa
        <xref ref-type="bibr" rid="ref9">(Liu et al., 2019)</xref>
        , XLNet
        <xref ref-type="bibr" rid="ref10">(Yang
et al., 2019)</xref>
        , ALBERT
        <xref ref-type="bibr" rid="ref11">(Lan et al., 2019)</xref>
        , and ELECTRA
        <xref ref-type="bibr" rid="ref12">(Clark, Luong, Le, &amp;
Manning, 2020)</xref>
        . These models also achieved state-of-the-art results upon being
released. The model Bidirectional Encoder Representations from Transformers for
Biomedical Text Mining (BioBERT), based on Google’s BERT code, is a language
representation model specific to the biomedical domain, pre-trained on large-scale
biomedical corpora (1 million articles from PubMed, https://pubmed.ncbi.nlm.nih.gov/, or 270 thousand from PubMed Central, https://www.ncbi.nlm.nih.gov/pmc/).
Taking advantage of being able to apply almost the same architecture across tasks,
BioBERT largely outperforms previous models and is state-of-the-art in a variety of
biomedical text mining tasks.
      </p>
      <p>The BioASQ QA task allows participants to take part in only some batches and to
return only exact answers or only ideal answers. The ideal answer includes prominent
supportive information, whereas the exact answer only returns yes or no for yes/no
questions, entity names for factoid questions, or lists of entity names for list questions;
ideal answers can thus be seen as the full definition of exact answers. Ideal answers are
usually written by biomedical experts and presented in a short text that answers the
question. Because most ideal answers cannot be directly mapped to the given relevant
articles/snippets, predicting appropriate ideal answers is more complicated than
predicting exact answers.</p>
    </sec>
    <sec id="sec-3">
      <title>Similarity Between a Snippet and a Question</title>
      <p>
        Although the BioASQ QA task provides biomedical questions and relevant snippets of
PubMed abstracts, in actuality, ideal answers do not appear verbatim in the relevant
snippets. The goal of our method was to select the most relevant snippet for each
question in the BioASQ QA instances. To determine the similarity between a question and
a snippet, we directly calculated relevance scores using cosine similarity. Cosine
similarity is one of the most common text similarity metrics and is thus widely utilized in NLP
tasks. Therefore, we first had to transform questions and snippets into vectors. In
general, previous works map words to corresponding vectors by taking word2vec
        <xref ref-type="bibr" rid="ref13">(Mikolov, Sutskever, Chen, Corrado, &amp; Dean, 2013)</xref>
        embeddings trained on a relevant
corpus or else adopt existing word embeddings such as GloVe (Pennington, Socher, &amp;
Manning, 2014), and Wiki-PubMed-PMC (Habibi, Weber, Neves, Wiegandt, &amp; Leser,
2017).
      </p>
      <p>Diego Mollá's features (Mollá &amp; Jones, 2019) relied heavily on word2vec
embeddings and TF-IDF vectors, and we considered that this could be improved. Although
TF-IDF regards some common words (such as articles and conjunctions) as trivial
terms so as to more readily identify the major words of sentences, such methods are
unable to represent polysemous words. Notably, on the GLUE leaderboard, methods
using word2vec embeddings (Skip-gram and CBOW) rank much lower than those using
contextual embeddings, such as ELMo ensembles and BERT. BERT provides contextual
embeddings that can solve the problem of polysemy, so it deals well with many different tasks.
Therefore, we simplified the procedure of extracting features from BioBERT and only
took the pre-trained embeddings of sentences.
</p>
      <p>In our method, before separately obtaining the embeddings of a question and a
snippet, each sentence was first pre-processed into word pieces with WordPiece
tokenization. Then, inputting all word pieces of the sentence to BioBERT, we extracted the
features from the last layer of BioBERT. In BERT, the [CLS] token was inserted into
input tokens, and its embeddings could be considered as the sentence vector (the
features). The step of extracting pre-trained contextual embeddings from BioBERT is
diagrammed in Fig. 2.</p>
      <p>Finally, we used the embeddings (vectors) of a question and snippet pair to calculate
their cosine similarity score. Because each question of a BioASQ QA instance typically
has more than one snippet, we re-ranked the snippets in order of their similarity scores
and took the top 1 snippet as our prediction of the answer (NCU-IISR_2), as that snippet
was considered the most relevant to answering the question.</p>
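The re-ranking step above can be sketched as follows; the short vectors stand in for real BioBERT [CLS] embeddings, which are much higher-dimensional.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_snippet(question_vec, snippet_vecs):
    """Re-rank snippets by cosine similarity to the question and return
    the index of the most similar one (the top-1 prediction)."""
    scores = [cosine(question_vec, s) for s in snippet_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 3-d stand-ins for BioBERT [CLS] embeddings.
q = [1.0, 0.2, 0.0]
snippets = [[0.0, 1.0, 0.5], [0.9, 0.3, 0.1], [0.1, 0.1, 1.0]]
best = top_snippet(q, snippets)   # → 1
```

In the actual system, the vectors come from the last layer of BioBERT rather than being hand-specified as here.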
    </sec>
    <sec id="sec-4">
      <title>Logistic Regression of Sentences</title>
      <p>Our approach was inspired by the framework of the logistic regression model proposed
by Diego Mollá. The method follows the two steps of his summarization process: Step
1, split the input text (snippets) into candidate sentences, and score each candidate
sentence. Step 2, return the top-n sentences with the highest scores. As stated above, we
used the pre-trained language model BioBERT to replace their word2vec and TF-IDF
features with contextual embeddings.</p>
      <p>We first used NLTK’s sentence tokenizer to divide snippets into sentences and
calculated ROUGE-SU4 F1 scores (Lin, 2004) between each sentence and the associated
question, thereby generating positive and negative instances that became the training
set for our logistic regression model. After pre-processing, our logistic regression
model was slightly different from the cosine similarity method. First, we input a
candidate sentence and a question at the same time and used the fine-tuned BioBERT model
for fitting the task. Second, we appended a dense layer with ReLU activation after the
output layer of BioBERT, and we used mean squared error as the loss function. We
took default settings from BERT trained on SQuAD. We also used [CLS] embeddings
as the feature from which to predict the ROUGE-SU4 F1 scores of the test data. In our
case, [CLS] embeddings represented the relation between a candidate sentence and a
question. Fig. 3 illustrates the modified BioBERT architecture used here. Lastly, we
used the prediction values to re-rank the candidate sentences for each question and
selected only the top n sentences as our system output (NCU-IISR_3).</p>
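A minimal sketch of this scoring-and-re-ranking idea follows; an ordinary least-squares fit stands in for the fine-tuned BioBERT plus dense ReLU layer trained with mean squared error, and all features and target scores are toy values.

```python
def fit_linear(X, y):
    """Solve a 2-parameter least-squares fit via the normal equations.
    This linear probe stands in for fine-tuning BioBERT with an MSE loss."""
    a11 = sum(x[0] * x[0] for x in X); a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X)
    b1 = sum(x[0] * t for x, t in zip(X, y))
    b2 = sum(x[1] * t for x, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Toy stand-ins: rows are [CLS]-style features of (question, sentence)
# pairs; targets are ROUGE-SU4 F1 scores against the gold ideal answer.
X_train = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.1, 0.2)]
y_train = [0.80, 0.15, 0.45, 0.05]
w = fit_linear(X_train, y_train)

def rank_sentences(features, top_n=1):
    """Score candidate sentences and return the indices of the top n."""
    scores = [x[0] * w[0] + x[1] * w[1] for x in features]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:top_n]

best = rank_sentences([(0.85, 0.2), (0.3, 0.7)], top_n=1)   # → [0]
```

At test time, the predicted ROUGE-SU4 scores replace gold labels, and only the top-ranked sentences are returned as the ideal answer.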
      <p>Due to time limitations, we did not finish aspects of the logistic regression model
such as fine-tuning the model with all instances, expanding the range of snippets to the
full abstract, and comparing activation or loss functions to find a better one. These can
be future work and updates addressed in the next challenge.</p>
    </sec>
    <sec id="sec-5">
      <title>Submission</title>
      <p>To obtain exact answers, we used the BioASQ-BioBert model (Yoon, Lee, Kim, Jeong,
&amp; Kang, 2019). This model included two pre-trained weights: one fine-tuned on
SQuAD for "yes/no" questions, and the other on SQuAD 2.0 for "factoid and list"
questions. We then further fine-tuned these weights separately on the yes/no, factoid, and list questions of the
BioASQ QA task. Because BERT performs well on SQuAD, we considered that this
method is well suited to producing BioASQ's exact answers. In addition, we used the open-source code of
the BERT and BioBERT pre-trained language models to find a paragraph-sized
answer (NCU-IISR_1) for ideal answers. For each training instance, the input
is the full PubMed abstracts, and the answer is the snippet.</p>
      <p>Our submitted configurations are summarized in Table 1. Because our submissions
for batch 3 had some errors, Table 1 only shows the results of batches 1, 2, 4, and 5. In
our internal experiments with the “NCU-IISR_3” configuration, we observed that most
predictions had lengths as long as ideal answers in the training set. Therefore, we simply
selected the top 1 sentence as the ideal answer in all types.</p>
      <p>Model performances in predicting exact answers are shown in Table 2. Irrespective
of the question type, most of our results outperformed the median scores. In particular,
we won second place on the factoid questions in batch 2 and found that “NCU-IISR_1”
generally performed better on the factoid category than on the other two question types.</p>
      <p>Model performances in predicting ideal answers are shown in Table 3. With ideal
answers, two evaluation metrics are used: ROUGE and human evaluation. Roughly
speaking, ROUGE counts the n-gram overlap between an automatically constructed
summary and a set of human-written (gold) summaries, with a higher ROUGE score
being better. Specifically, ROUGE-2 and ROUGE-SU4 were used to evaluate ideal
answers. These automatic evaluations are the most widely used versions of ROUGE
and have been discovered to correlate well with human judgments when multiple
reference summaries are available for each question.</p>
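For intuition, a simplified ROUGE-2 F1 against a single gold summary can be computed as below; the official toolkit additionally handles multiple references, stemming, and count clipping, and the example sentences are invented.

```python
def bigrams(text):
    """Lowercase, whitespace-tokenize, and list adjacent token pairs."""
    toks = text.lower().split()
    return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]

def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: bigram overlap between a candidate
    summary and a single gold summary (no count clipping)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum(1 for bg in cand if bg in ref)
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

score = rouge2_f1("the drug blocks kinase activity",
                  "the drug blocks the kinase activity")   # ≈ 0.667
```

ROUGE-SU4 extends this idea to skip-bigrams (pairs of tokens up to four positions apart) plus unigrams.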
      <p>The human evaluation results (manual scores) have not yet been reported by the
organizers. All of the systems' ideal answers will also be evaluated by biomedical experts.
For each ideal answer, the experts give a score ranging from 1-5 on each of four terms:
information recall (the answer reports all necessary information), information precision
(no irrelevant information is reported), information repetition (the answer does not
repeat information multiple times, e.g. when sentences extracted from different articles
convey the same information), and readability (the answer is easily readable and fluent).</p>
      <p>A sample of ideal answers will be evaluated by more than one expert in order to
measure the inter-annotator agreement.</p>
      <p>The automatic evaluations in BioASQ also provide a Recall metric, which shows
how many tokens from the gold answer appear in the prediction. For ideal answers, our
recall values were lower than the median. The ROUGE-2 and ROUGE-SU4 Recall
values for our best system “NCU-IISR_3” are given in Table 4. As mentioned earlier,
we only returned the top 1 sentence from the logistic regression model, so we
inevitably lost some sentences that would have contributed to the ideal answers. In contrast, Diego
Mollá's work concatenated the top-n sentences when answering questions. If we
compile answers from more sentences, we may solve the problem of poor Recall scores.
This also can be a direction for improvement in the future.</p>
      <p>In the 8th BioASQ QA task, we employed BioBERT to deal with both exact answers
and ideal answers. In generating exact answers, we used BioASQ-BioBert to find the
offset (including the start and end positions) of the answer within the given passage
(snippets). Our performance was almost always above the median for yes/no, factoid,
and list question types. However, when it comes to ideal answers, the BioASQ-BioBert
method does not readily recognize the most relevant text. In order to maintain the
completeness of ideal answers, we selected the most relevant snippet or sentences rather
than taking snippet offsets, which may focus on the wrong position and yield imperfect
answers.</p>
      <p>Our results show that in arriving at ideal answers, using the logistic regression model
to select sentences performs better than using cosine similarity to choose a snippet. One
reason for this improvement might be that a large number of snippets are too lengthy
for ideal answers, thus resulting in lower performance. In other words, snippet answers
that consist of only trivial information receive lower ROUGE scores. Our method of
selecting sentences achieved the best ROUGE-2 and ROUGE-SU4 F1 scores among
all participants, but we also note that our Recall scores were much lower than others'.
This suggests that our regression method, while precise, was unable to recover
more of the sentences that belong in the ideal answers.</p>
      <p>In future work, we may try to solve this problem by referring to other methods and
merging in their models. On the other hand, as mentioned previously, we left some
work unfinished in the regression experiment. Thus, future directions include
completely fine-tuning the model with all instances, expanding the range of snippets to
include full abstracts, and comparing activation or loss functions to find a better one. In
the regression method, we only processed the snippet context and did not use the complete
PubMed abstracts; these could be utilized in the future. All told, we hope to keep
the base of BioBERT and make an effort to combine it with different approaches.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank Po-Ting Lai for giving us suggestions during the challenge and for revising
the paper.</p>
      <p>Pennington, J., Socher, R., &amp; Manning, C. D. (2014). Glove: Global vectors for word
representation. Paper presented at the Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP).</p>
      <p>Habibi, M., Weber, L., Neves, M., Wiegandt, D. L., &amp; Leser, U. (2017). Deep learning with word
embeddings improves biomedical named entity recognition. Bioinformatics, 33(14),
i37-i48.</p>
      <p>Mollá, D., &amp; Jones, C. (2019). Classification betters regression in query-based multi-document
summarisation techniques for question answering. Paper presented at the Joint
European Conference on Machine Learning and Knowledge Discovery in Databases.
Lin, C.-Y. (2004, jul). ROUGE: A Package for Automatic Evaluation of Summaries. Paper
presented at the Text Summarization Branches Out, Barcelona, Spain.</p>
      <p>Yoon, W., Lee, J., Kim, D., Jeong, M., &amp; Kang, J. (2019). Pre-trained Language Model for
Biomedical Question Answering. arXiv preprint arXiv:1909.08229.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Tsatsaronis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balikas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malakasiotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zschunke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alvers</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          , . . .
          <string-name>
            <surname>Polychronopoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>138</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>BioBERT: a pretrained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Zhang, J.,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Know what you don't know: Unanswerable questions for SQuAD</article-title>
          .
          <source>arXiv preprint arXiv:1806.03822</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhingra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W. W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>PubMedQA: A Dataset for Biomedical Research Question Answering</article-title>
          .
          <source>arXiv preprint arXiv:1909.06146</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , M.-W.,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Glue: A multitask benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1804.07461</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , . . .
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbonell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>Advances in Neural Information Processing Systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          . arXiv preprint arXiv:1909.11942.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.-T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>ELECTRA: Pre-training text encoders as discriminators rather than generators</article-title>
          . arXiv preprint arXiv:2003.10555.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in Neural Information Processing Systems.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>