<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequential Transfer Learning in NLP for German Text Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pascal Fecht</string-name>
          <email>pfecht@inovex.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Blank</string-name>
          <email>sblank@inovex.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hans-Peter Zorn</string-name>
          <email>hzorn@inovex.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>inovex GmbH</institution>
          ,
          <addr-line>76131 Karlsruhe</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This work examines the impact of sequential transfer learning on abstractive text summarization. A current trend in Natural Language Processing (NLP) is to pre-train extensive language models and then adapt these models to solve various target tasks. Since these techniques have rarely been investigated in the context of text summarization, this work develops an approach to integrate and evaluate pre-trained language models in abstractive text summarization. Our experiments suggest that pre-trained language models can improve the summarization of texts. We find that using multilingual BERT (Devlin et al., 2018) as contextual embeddings lifts our model by about 9 points of ROUGE-1 and ROUGE-2 on a German summarization task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Summarizing is the ability to write a brief abstract of the essential content of a text. Two types of approaches to automatic summarization can be distinguished. Extractive methods aim to identify the crucial information in a written text and copy these parts verbatim into the summary
        <xref ref-type="bibr" rid="ref3 ref23">(Conroy and O’Leary, 2001; Shen et al., 2007)</xref>
        . Abstractive methods, on the other hand, aim to express summaries as coherent and fluent texts
        <xref ref-type="bibr" rid="ref21 ref12">(Rush et al., 2015; Nallapati et al., 2016)</xref>
        . This work focuses on abstractive methods with deep neural networks.
      </p>
      <p>
        A summarization system, however, is optimized for the objective of a single task only. In order to reuse previously learned knowledge, transfer learning methods share beneficial information across multiple tasks. Recently, various approaches
        <xref ref-type="bibr" rid="ref7 ref17 ref4">(Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018)</xref>
        in sequential transfer learning
        <xref ref-type="bibr" rid="ref20">(Ruder, 2019)</xref>
        have led to improvements across a wide range of NLP tasks by extensively pre-training a language model (LM) and adapting the model to specific tasks.
      </p>
      <p>Hence, this work develops an approach based on a deep neural model for abstractive summarization that applies these recent advances to the task of text summarization. Our model is evaluated on a German dataset extracted from 100,000 German Wikipedia articles.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        In sequential transfer learning
        <xref ref-type="bibr" rid="ref20">(Ruder, 2019)</xref>
        , two arbitrary tasks are learned in sequence. During pre-training on the source task, the objective is commonly very generic, with large data and high computational costs. An established approach is the adaptation of pre-trained word embeddings
        <xref ref-type="bibr" rid="ref11 ref14">(Mikolov et al., 2013; Pennington et al., 2014)</xref>
        to several target tasks. However, one shortcoming of these embeddings is that they are context-free, meaning that the representation of a word is identical in any context.
      </p>
      <p>
        An early approach with deep neural networks
incorporates context into embeddings by using the
encoder of a machine translation system with
shallow RNNs
        <xref ref-type="bibr" rid="ref9">(McCann et al., 2017)</xref>
        . ELMo
        <xref ref-type="bibr" rid="ref16">(Peters et al., 2018)</xref>
        generalizes this approach by
pre-training a language model (LM) and
extracting its features as contextual embeddings.
Subsequent contributions like GPT
        <xref ref-type="bibr" rid="ref17">(Radford et al.,
2018)</xref>
        , BERT
        <xref ref-type="bibr" rid="ref4">(Devlin et al., 2018)</xref>
        or GPT-2
        <xref ref-type="bibr" rid="ref18">(Radford et al., 2019)</xref>
        replace the shallow RNNs in
LMs with Transformers
        <xref ref-type="bibr" rid="ref25">(Vaswani et al., 2017)</xref>
        , resulting in deep representations. Further, these approaches do not only extract features from the language model but also fine-tune the entire model for several classification tasks.
      </p>
      <p>
        Recent work in abstractive text summarization is commonly based on encoder-decoder models with RNNs and additional attention
        <xref ref-type="bibr" rid="ref12 ref22">(Nallapati et al., 2016; See et al., 2017)</xref>
        . Furthermore, pointer-generator networks
        <xref ref-type="bibr" rid="ref6 ref22">(Gu et al., 2016; See et al., 2017)</xref>
        copy tokens from the source document into generated summaries. This addresses the problem that summarization systems tend to produce many out-of-vocabulary (OOV) words during inference. Another known issue of summarization systems is the repetition of words and sequences of words in generated summaries. The coverage vector
        <xref ref-type="bibr" rid="ref24">(Tu et al., 2016)</xref>
        addresses this by tracking and controlling the covered and uncovered parts of the source document
        <xref ref-type="bibr" rid="ref22">(See et al., 2017)</xref>
        . Finally, Paulus et al. (2017) apply policy-gradient learning
        <xref ref-type="bibr" rid="ref19">(Rennie et al., 2016)</xref>
        in order to use ROUGE as an auxiliary learning objective that directly measures the quality of generated summaries.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Summarization model</title>
      <p>
        Our abstractive summarization system is designed
as an encoder-decoder model with attention
        <xref ref-type="bibr" rid="ref1">(Bahdanau et al., 2014)</xref>
        and integrates a copy
mechanism
        <xref ref-type="bibr" rid="ref6">(Gu et al., 2016)</xref>
        to reduce OOV words in the
generated summaries.
      </p>
      <p>
        Encoder Given s words u_1, …, u_s in an input document, the words are embedded as x_1, …, x_s in the first layer. Subsequently, the encoder processes each embedding x_i at timestep i into a hidden state h̄_i. More specifically, the encoder is a multi-layer multi-head-attention Transformer
        <xref ref-type="bibr" rid="ref25">(Vaswani et al., 2017)</xref>
        . The set of all encoder hidden states is referred to as the memory M = {h̄_1, …, h̄_s} and is accessed during decoding. In a similar notion to fully LSTM-based encoder-decoder models, we use the final encoder hidden state h̄_s to initialize the decoder. We did not investigate whether separating the concerns of pooling the encoder’s memory into a fixed-length context representation and encoding the last word of a sequence influences performance.
      </p>
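      <p>
        As an illustration, such an encoder can be sketched with an off-the-shelf Transformer encoder. The following PyTorch fragment is our minimal, illustrative reading of the description above; the class and variable names are ours, and the 256-dimensional toy inputs stand in for the real embeddings:
      </p>
      <preformat>
# Minimal sketch of the encoder: a Transformer producing the memory M and
# the final hidden state used to initialize the decoder.
import torch
import torch.nn as nn

class SummarizationEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, s, d_model), the already-embedded input words x_1 .. x_s
        memory = self.encoder(x)      # M = {h_1, ..., h_s}
        h_last = memory[:, -1, :]     # h_s, used to initialize the decoder
        return memory, h_last

enc = SummarizationEncoder()
memory, h_last = enc(torch.randn(2, 10, 256))  # toy batch: 2 docs, 10 words
      </preformat>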
      <p>
        Decoder The decoder distinguishes between two modes. The generation mode computes the probability P_gen(•) of generating a word from a predefined vocabulary. Following a similar idea to pointer-generator networks
        <xref ref-type="bibr" rid="ref13">(Paulus et al., 2017)</xref>
        , a second copy mode outputs the probability P_copy(•) of copying a word from the source document. Both probabilities are combined to approximate the output probability of the next word y_i as
      </p>
      <disp-formula id="eq1">
        <tex-math>P(y_i \mid h_i, y_{i-1}, c_i, M) = P_{gen}(y_i, g \mid h_i, y_{i-1}, c_i, M) + P_{copy}(y_i, c \mid h_i, y_{i-1}, c_i, M) \quad (1)</tex-math>
      </disp-formula>
      <p>
        where h_i is the current state of the decoder, y_{i-1} the last decoded word, c_i the attention context, and g refers to the generation and c to the copy mode
        <xref ref-type="bibr" rid="ref6">(Gu et al., 2016)</xref>
        .
      </p>
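      <p>
        A hedged sketch of this two-mode output distribution, with joint normalization over generation and copy scores as in CopyNet (function and variable names are ours):
      </p>
      <preformat>
import torch

def next_word_distribution(gen_scores, copy_scores, src_token_ids, vocab_size):
    """Combine generation-mode and copy-mode scores into one distribution.

    gen_scores:    (vocab_size,) scores for generating from the vocabulary
    copy_scores:   (src_len,)    scores for copying each source position
    src_token_ids: (src_len,)    vocabulary ids of the source words
    """
    # Normalize both modes jointly (Gu et al., 2016)
    joint = torch.softmax(torch.cat([gen_scores, copy_scores]), dim=0)
    p_gen, p_copy = joint[:vocab_size], joint[vocab_size:]
    # Add each copy probability onto the vocabulary id of the copied word
    dist = p_gen.clone()
    dist.scatter_add_(0, src_token_ids, p_copy)
    return dist

vocab_size, src_len = 50_000, 7
dist = next_word_distribution(torch.randn(vocab_size), torch.randn(src_len),
                              torch.randint(0, vocab_size, (src_len,)),
                              vocab_size)  # dist sums to 1 over the vocabulary
      </preformat>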
      <p>
        On the one hand, P_gen(•) uses the additive attention function
        <xref ref-type="bibr" rid="ref1">(Bahdanau et al., 2014)</xref>
        of the encoder-decoder model. On the other hand, the scoring function for copying the j-th input word x_j with encoder state h̄_j is
      </p>
      <disp-formula id="eq4">
        <tex-math>f(y_i = x_j) = \tanh(\bar{h}_j^{\top} W_c)\, h_i \quad (4)</tex-math>
      </disp-formula>
      <p>
        where W_c is a learned parameter. These probabilities are jointly optimized with backpropagation during training by minimizing the negative log-likelihood.
      </p>
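      <p>
        The copy score of Eq. (4) is a bilinear interaction between the encoder state and the decoder state through the learned matrix W_c. A small sketch, with dimensions chosen to match Section 4:
      </p>
      <preformat>
import torch

d_enc = d_dec = 256
W_c = torch.nn.Parameter(torch.empty(d_enc, d_dec))
torch.nn.init.xavier_uniform_(W_c)

def copy_score(h_bar_j, h_i):
    # f(y_i = x_j) = tanh(h_bar_j^T W_c) h_i
    return torch.tanh(h_bar_j @ W_c) @ h_i

score = copy_score(torch.randn(d_enc), torch.randn(d_dec))  # scalar score
      </preformat>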
    </sec>
    <sec id="sec-4">
      <title>4 Approach and Implementation</title>
      <p>
        Our approach embeds the learned knowledge of pre-trained language models to improve the language understanding of documents for abstractive text summarization. Let e denote the word embedding and c the contextual embedding (see Section 2) of an input word u. Following recent work
        <xref ref-type="bibr" rid="ref16 ref9">(McCann
et al., 2017; Peters et al., 2018)</xref>
        , the final
embedding of words is the concatenation of word
embedding and contextual embedding x = [e; c]. To
keep track of positional information in the
Transformer encoder (see Section 3), we use relative
position encodings
        <xref ref-type="bibr" rid="ref25">(Vaswani et al., 2017)</xref>
        .
      </p>
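      <p>
        Concretely, x = [e; c] is a plain vector concatenation. A minimal sketch with the dimensions used below, where random vectors stand in for the looked-up embeddings:
      </p>
      <preformat>
import torch

e = torch.randn(300)   # pre-trained German GloVe embedding of word u
c = torch.randn(768)   # contextual embedding extracted from multilingual BERT
x = torch.cat([e, c])  # final input embedding x = [e; c], d_x = 1068
      </preformat>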
      <p>In our implementation, the word embeddings are pre-trained German GloVe embeddings1 of dimension 300. The contextual embeddings are extracted from the multilingual BERT model2 of dimension 768. The concatenated embeddings of dimension d_x = 1068 are passed to the stacked self-attention encoder with N = 4 layers, h = 8 attention heads and a hidden dimensionality of 256. Furthermore, our decoder is a single-layer LSTM of dimensionality d_dec = d_enc.</p>
      <p>1 https://deepset.ai/german-word-embeddings</p>
      <p>2 https://github.com/google-research/bert/blob/master/multilingual.md</p>
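      <p>
        Put together, a skeletal version of this architecture could look as follows. This is our sketch; in particular, the linear projection from the 1068-dimensional inputs to the 256-dimensional encoder is an assumption, since the text does not state how the dimensions are bridged:
      </p>
      <preformat>
import torch.nn as nn

class AbstractiveSummarizer(nn.Module):
    def __init__(self, vocab_size=50_000, d_x=1068, d_model=256,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Assumed projection of the concatenated embeddings to the model size
        self.input_proj = nn.Linear(d_x, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)
      </preformat>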
      <p>
        In order to avoid catastrophic forgetting
        <xref ref-type="bibr" rid="ref7">(Howard and Ruder, 2018)</xref>
        , the contextual embeddings are fixed parameters and are not optimized during training. On top of this, recent work
        <xref ref-type="bibr" rid="ref15">(Peters et al., 2019)</xref>
        suggests that feature extraction with frozen parameters is favorable if the target task is very different from the source task and requires many learned parameters.
      </p>
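      <p>
        In practice, freezing the contextual embeddings can be sketched as follows; we use the HuggingFace transformers library here as one possible tooling choice (the text does not specify its implementation):
      </p>
      <preformat>
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")
for p in bert.parameters():
    p.requires_grad = False      # contextual embeddings stay fixed
bert.eval()

with torch.no_grad():            # no gradients flow into BERT
    ids = tok("Ein kurzer deutscher Satz.", return_tensors="pt")
    c = bert(**ids).last_hidden_state   # (1, seq_len, 768) contextual features
      </preformat>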
    </sec>
    <sec id="sec-5">
      <title>5 Dataset</title>
      <p>We use an unreleased dataset3 consisting of 100,000 samples extracted from German Wikipedia articles. In this dataset, the summary is the first section of a Wikipedia article and the document comprises the subsequent sections. On average, documents consist of 602.81 words and summaries of 35.79 words.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Experiments and Results</title>
      <p>We hypothesize that contextual embeddings benefit the generation of German summaries. In order to test this hypothesis, we train with multilingual BERT embeddings and German GloVe embeddings (Section 4) and compare the results to two different baselines (Table 1). First, a plain model has randomly initialized embeddings of dimension 300. Second, embeddings of the same dimension are initialized with pre-trained German GloVe embeddings, which reveals the actual impact of the contextual BERT embeddings. In all experiments, word embeddings are fine-tuned during training.</p>
      <p>
        Experimental Setup We train the model for a
maximum of 25 epochs with early stopping and
a patience of 5. Following recent work
        <xref ref-type="bibr" rid="ref22 ref5">(See
et al., 2017; Gehrmann et al., 2018)</xref>
        , the
models are optimized with Adagrad, a learning rate of
η = 0.15 and an initial accumulator value of 0.1.
The vocabulary is pre-defined and contains the
50,000 most frequent German words of the
training dataset. The input documents are clipped to a
length of 400 words and the target summaries to a
length of 100 words. The 100,000 samples are
randomly partitioned into three subsets of 80%
training, 10% validation and 10% testing data.
During inference, the model uses beam search with
a beam size of 3. The subsequent results are obtained from a single run on the test split of the German Wikipedia dataset (Section 5).
      </p>
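      <p>
        For reference, the stated setup maps directly onto standard optimizer settings; a sketch (the stand-in model is a placeholder, and all constants are taken from the description above):
      </p>
      <preformat>
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the summarization model
optimizer = torch.optim.Adagrad(model.parameters(),
                                lr=0.15,                   # eta = 0.15
                                initial_accumulator_value=0.1)

MAX_SRC_LEN, MAX_TGT_LEN = 400, 100   # clip documents / target summaries
VOCAB_SIZE = 50_000                   # most frequent training-set words
TRAIN, VAL, TEST = 0.8, 0.1, 0.1      # random split of the 100,000 samples
MAX_EPOCHS, PATIENCE, BEAM_SIZE = 25, 5, 3
      </preformat>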
      <p>3 https://drive.switch.ch/index.php/s/YoyW9S8yml7wVhN</p>
      <sec id="sec-6-1">
        <title>6.1 Lexical word similarity</title>
        <p>
          We evaluate the lexical word similarity between
generated summaries and the given reference
summaries with ROUGE-F1
          <xref ref-type="bibr" rid="ref8">(Lin, 2004)</xref>
          .
Despite the fact that measuring lexical overlap is
counter-intuitive to the concept of abstraction, our
approach outperforms both extractive baselines,
Lead-3 and TextRank
          <xref ref-type="bibr" rid="ref10">(Mihalcea and Tarau, 2004)</xref>
          ,
by a large margin (Table 1).
        </p>
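        <p>
          For intuition, ROUGE-1 F1 reduces to clipped unigram overlap between candidate and reference; a minimal sketch (the reported scores would be computed with a standard ROUGE package):
        </p>
        <preformat>
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())  # clipped matches
    total_c, total_r = sum(cand.values()), sum(ref.values())
    if overlap == 0 or total_c == 0 or total_r == 0:
        return 0.0
    p, r = overlap / total_c, overlap / total_r
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall
        </preformat>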
        <sec id="sec-6-1-1">
          <title>GloVe + BERT GloVe Plain</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>TextRank</title>
          <p>Lead-3</p>
          <p>R-1
38.48
29.16
27.39</p>
        <p>
          Further, we find a significant improvement from the additional multilingual BERT embeddings over pre-trained GloVe embeddings and over embeddings learned from scratch. This supports our hypothesis that contextual embeddings are beneficial to the generation of summaries.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2 Copy rate</title>
        <p>
          Abstractive summaries aim to express content in different words instead of merely copying sequences of words (Section 1). However, the ROUGE scores do not indicate the level of abstraction in generated summaries. For this reason, the copy rate
          <xref ref-type="bibr" rid="ref12">(Nallapati et al., 2016)</xref>
          measures the average percentage of unigrams (words) copied from the given document.
        </p>
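        <p>
          A minimal sketch of this measure as we read it, counting the fraction of summary unigrams that also occur in the source document:
        </p>
        <preformat>
def copy_rate(summary: str, document: str) -> float:
    doc_words = set(document.split())
    summary_words = summary.split()
    if not summary_words:
        return 0.0
    copied = sum(w in doc_words for w in summary_words)
    return copied / len(summary_words)  # share of copied unigrams
        </preformat>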
        <p>
          The copy rate of the reference summaries is 72.52%, which highlights the need for abstraction in this dataset (Table 2). Both extractive baselines are unable to paraphrase and are therefore not fully capable of meeting the requirements of the task. In contrast, all of our models generate summaries with an evident degree of abstraction, although evaluating the quality of abstraction still requires human assessment.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3 Out-of-vocabulary words</title>
        <p>
          The copy mechanism of the CopyNet model (Section 4) counters the shortcoming of OOV words during inference (Section 2). However, the results demonstrate that generated summaries still contain unknown words (Table 2). In comparison to other languages and datasets
          <xref ref-type="bibr" rid="ref6">(Gu et al., 2016)</xref>
          , this suggests that the model on the German Wikipedia dataset requires greater weight on the generation mode. Hence, the CopyNet model faces a trade-off between the level of abstraction and the number of unknown tokens.
        </p>
        <p>
          Nevertheless, contextual BERT embeddings significantly reduce the number of OOV words compared to our other approaches. This also explains the aforementioned copy rate, which decreases as the number of unknown words increases, since unknown words are not part of the source document.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Repetition</title>
        <p>
          To measure the issue of repetition in text summarization models (Section 2), we use the repetition rate
          <xref ref-type="bibr" rid="ref2">(Cettolo et al., 2014)</xref>
          , which scores a summary by the number of repeated n-grams. More specifically, the repetition rate RR-n(s) of a candidate summary s is
        </p>
        <disp-formula id="eq5">
          <tex-math>\mathrm{RR}\text{-}n(s) = \left( \prod_{k=1}^{n} \frac{\lVert fng(s,k) - fng(s,k,1) \rVert}{\lVert fng(s,k) \rVert} \right)^{1/n} \quad (5)</tex-math>
        </disp-formula>
        <p>
          where n is the maximum order of the considered k-grams, fng(s, k) is a function creating the list of k-grams of s, fng(s, k, 1) consists of the unique k-grams of s, and ‖ • ‖ is the number of elements in a set.
        </p>
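        <p>
          Following the definition above, the repetition rate is the geometric mean over the per-order rates of repeated k-grams; a minimal sketch:
        </p>
        <preformat>
from collections import Counter

def repetition_rate(summary: str, n: int = 4) -> float:
    # n: maximum k-gram order (the default of 4 is our choice)
    words = summary.split()
    rate = 1.0
    for k in range(1, n + 1):
        kgrams = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        if not kgrams:
            return 0.0
        counts = Counter(kgrams)
        singletons = sum(1 for c in counts.values() if c == 1)
        rate *= (len(kgrams) - singletons) / len(kgrams)
    return rate ** (1.0 / n)
        </preformat>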
        <p>In our work, the generated summaries of the approach including contextual BERT embeddings exhibit much higher repetition than those of the other approaches (Table 2). However, this work focuses on transfer learning for text summarization and thus does not apply further techniques to reduce repetition (Section 2).</p>
      </sec>
      <sec id="sec-6-3">
        <title>Factual Incorrectness</title>
        <p>As human observations suggest, summaries may contain false facts (Table 3) and yet achieve good results across several metrics. These factual errors are particularly difficult to detect and resolve with content-based measures since the lexical overlap can still be very high. Moreover, these summaries appear to be fluent and, at first sight, coherent. Thus, these issues are critical and remain an unsolved problem.</p>
        <sec id="sec-6-3-1">
          <title>Generated summary</title>
        </sec>
        <sec id="sec-6-3-2">
          <title>Reference summary</title>
          <p>Miroslav Lazo ist Miroslav Lazo ist
ein Slowakischer ein slowakischer
Eiseishockeyspieler, hockeyspieler , der
der seit 2010 bei seit 2011 bei den</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>Awtomobilist Jeka- Malmo Redhawks</title>
        <p>
          terinburg in der in der schwedischen
neugegrundete HockeyAllsvenskan
Champions League unter Vertrag steht .
unter vertrag steht .
Sequential transfer learning with pre-trained
language models has shown to improve the
performance for many tasks in NP. While previous
research focussed on tasks like e.g. text
classification or question answering
          <xref ref-type="bibr" rid="ref17 ref4">(Devlin et al., 2018;
Radford et al., 2018)</xref>
          , this work investigates on the
impact of pre-trained language models on
abstractive summarization. Our experiments show that
leveraging contextual embeddings extracted from
multilingual BERT
          <xref ref-type="bibr" rid="ref4">(Devlin et al., 2018)</xref>
          improves
performance on a large summarization dataset in
German language.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          . arXiv:
          <volume>1409</volume>
          .0473 https://arxiv.org/abs/1409.0473.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Mauro</given-names>
            <surname>Cettolo</surname>
          </string-name>
          , Nicola Bertoldi, and
          <string-name>
            <given-names>Marcello</given-names>
            <surname>Federico</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Repetition Rate of Text as a Predictor of the Effectiveness of Machine Translation Adaptation</article-title>
          .
          <source>In Proceedings of the 11th Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2014)</source>
          . pages
          <fpage>166</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>John M.</given-names>
            <surname>Conroy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dianne P.</given-names>
            <surname>O’Leary</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Text Summarization via Hidden Markov Models and Pivoted QR Matrix Decomposition</article-title>
          .
          <source>In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM</source>
          , pages
          <fpage>406</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . arXiv:
          <volume>1810</volume>
          .04805 https://arxiv.org/abs/1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , Yuntian Deng, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Rush</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bottom-Up Abstractive Summarization</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pages
          <fpage>4098</fpage>
          -
          <lpage>4109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jiatao</given-names>
            <surname>Gu</surname>
          </string-name>
          , Zhengdong Lu,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Victor O K Li</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Incorporating Copying Mechanism in Sequence-to-Sequence Learning</article-title>
          . arXiv:
          <volume>1603</volume>
          .06393v3 https://arxiv.org/abs/1603.06393v3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal Language Model Fine-tuning for Text Classification</article-title>
          . arXiv:
          <volume>1801</volume>
          .06146 https://arxiv.org/abs/1801.06146.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Chin-Yew</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>ROUGE: A Package for Automatic Evaluation of summaries</article-title>
          .
          <source>Text Summarization Branches Out</source>
          . pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Bryan</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , Caiming Xiong, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Learned in Translation: Contextualized Word Vectors</article-title>
          . arXiv:
          <volume>1708</volume>
          .00107 https://arxiv.org/abs/1708.00107.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Tarau</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>TextRank: Bringing Order into Texts</article-title>
          .
          <source>In Proceedings of the 2004 conference on empirical methods in natural language processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and
          <string-name>
            <given-names>Bing</given-names>
            <surname>Xiang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Romain</given-names>
            <surname>Paulus</surname>
          </string-name>
          , Caiming Xiong, and Richard Socher.
          <year>2017</year>
          .
          <article-title>A Deep Reinforced Model for Abstractive Summarization</article-title>
          . arXiv:
          <volume>1705</volume>
          .04304 https://arxiv.org/abs/1705.04304.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Noah A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks</article-title>
          . arXiv:
          <volume>1903</volume>
          .05987v2 https://arxiv.org/abs/1903.05987v2.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Matthew E.</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          . arXiv:
          <volume>1802</volume>
          .05365 https://arxiv.org/abs/1802.05365.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Karthik Narasimhan, Tim Salimans, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Improving Language Understanding by Generative Pre-Training</article-title>
          .
          <source>Technical report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeffrey Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Steven J.</given-names>
            <surname>Rennie</surname>
          </string-name>
          , Etienne Marcheret, Youssef Mroueh,
          <string-name>
            <given-names>Jarret</given-names>
            <surname>Ross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vaibhava</given-names>
            <surname>Goel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Self-critical Sequence Training for Image Captioning</article-title>
          . arXiv:
          <volume>1612</volume>
          .00563 https://arxiv.org/abs/1612.00563.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Neural Transfer Learning for Natural Language Processing</article-title>
          .
          <source>Ph.D. thesis</source>
          , National University of Ireland, Galway.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Alexander M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Sumit Chopra, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Neural Attention Model for Abstractive Sentence Summarization</article-title>
          . arXiv:
          <volume>1509</volume>
          .00685 https://arxiv.org/abs/1509.00685.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Abigail</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Get To The Point: Summarization with Pointer-Generator Networks</article-title>
          .
          arXiv:
          <volume>1704</volume>
          .04368 https://arxiv.org/abs/1704.04368.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Dou</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian-Tao</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hua</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Document summarization using conditional random fields</article-title>
          .
          <source>In IJCAI</source>
          . pages
          <fpage>2862</fpage>
          -
          <lpage>2867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Zhaopeng</given-names>
            <surname>Tu</surname>
          </string-name>
          , Zhengdong Lu, Yang Liu, Xiaohua Liu, and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Modeling Coverage for Neural Machine Translation</article-title>
          . arXiv:
          <volume>1601</volume>
          .04811 https://arxiv.org/abs/1601.04811.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All You Need</article-title>
          . arXiv:
          <volume>1706</volume>
          .03762 https://arxiv.org/abs/1706.03762.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>