<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Abstractive Text Summarization using Transfer Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ekaterina Zolotareva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tsegaye Misikir Tashu</string-name>
          <email>misikir@inf.elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Horváth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ELTE- Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering, Telekom Innovation Laboratories Pázmány Péter sétány 1/C</institution>
          ,
          <addr-line>1117 Budapest, Hungary</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, abstractive text summarization has achieved success in switching from linear models with sparse, handcrafted features to nonlinear neural network models with dense inputs. This success comes from the application of deep learning models to natural language processing tasks, where these models are capable of modeling intricate patterns in data without handcrafted features. In this work, the text summarization problem has been explored using sequence-to-sequence recurrent neural networks and transfer learning with a Unified Text-to-Text Transformer. Experimental results showed that the transfer learning-based model achieved considerable improvement for abstractive text summarization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Summarization is closely related to data compression and
information understanding both of which are key to
information science and retrieval. The technology of text
summarization can improve information extraction
systems and also allows readers to quickly view a large
number of documents for important information. Indeed,
automatic summarization has recently been recognized as one of the most important natural language processing (NLP) tasks, yet one of the least solved ones.</p>
      <p>In the literature, there are two main approaches to text summarization: extractive and abstractive. While extractive methods are arguably
well suited for identifying the most relevant information,
such techniques may lack the fluency and coherency of
human-generated summaries. Abstractive text
summarization is the task of generating a summary consisting of a
few sentences that capture the salient ideas of the input
text document. The adjective ‘abstractive’ is used to
denote a summary that is not a mere selection of a few
existing passages or sentences extracted from the source, but a
compressed paraphrasing of the main contents of the
document, potentially using vocabulary unseen in the source
document [9].</p>
      <p>Abstractive summarization has shown the most promise for conveying the important information in a text document, since abstractive generation may produce sentences not seen in the original input document. Motivated by the success of neural networks
in machine translation experiments, the attention-based
encoder-decoder paradigm has recently been widely
studied in abstractive summarization. By dynamically
accessing the relevant pieces of information based on the hidden
states of the decoder during the generation of the output
sequence, the model revisits the input and attends to
important information.</p>
      <p>Recent abstractive document summarization models are not yet able to achieve convincing performance. In this paper, we investigate transfer learning for abstractive text summarization to address a key challenge in summarization, which is to optimally compress the original document while preserving its key concepts. The rest of this paper is organized as follows: Section 2 provides an overview of the existing works and approaches. In Section 3, the approach to be investigated is introduced. Section 5 presents the experimental setting, the data sets used, and the results. Finally, Section 6 concludes the paper and discusses prospective plans for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        The number of summarization models introduced every year has been increasing rapidly. Advancements in neural network architectures [
        <xref ref-type="bibr" rid="ref16">1, 11</xref>
        ] and the availability of large-scale data enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms [
        <xref ref-type="bibr" rid="ref18">3, 12</xref>
        ], multi-task and multi-reward training techniques [7], graph-based methods that arrange the input text in a graph and then use ranking or graph traversal algorithms to construct the summary [5] [
        <xref ref-type="bibr" rid="ref20">13</xref>
        ], reinforcement learning strategies [4], and hybrid extractive-abstractive models [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ].
      </p>
      <p>
        This work is based on the recent Text-to-Text Transfer Transformer (T5) [
        <xref ref-type="bibr" rid="ref15 ref19 ref3">10</xref>
        ] and on one of the main known sequence-to-sequence (Seq2Seq) models [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. The T5 model, pre-trained on the Colossal Clean Crawled Corpus (C4), achieved state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 The Transformer Model</title>
      <p>It is possible to formulate most NLP tasks in a “text-to-text” format – that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This approach provides a consistent training objective for both pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective regardless of the task.</p>
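      <p>As a rough illustration of this text-to-text convention, the sketch below shows how a summarization pair becomes a plain string-in/string-out training example. The prefix strings follow the T5 paper's examples, but the helper function itself is ours, purely for illustration.</p>
      <preformat><![CDATA[
```python
# Sketch of the "text-to-text" convention: every task is cast as
# string-in / string-out, distinguished only by a task prefix, so one
# maximum-likelihood objective covers them all. The prefix strings follow
# the T5 paper's examples; the helper itself is hypothetical.

def to_text_to_text(task, input_text, target_text):
    """Return a (source, target) string pair for a text-to-text model."""
    prefixes = {
        "summarization": "summarize: ",
        "translation_en_de": "translate English to German: ",
    }
    return prefixes[task] + input_text, target_text

src, tgt = to_text_to_text(
    "summarization",
    "The BBC News dataset consists of 2225 documents in five topic areas.",
    "BBC News has 2225 documents across five topics.",
)
print(src)  # prints: summarize: The BBC News dataset consists of ...
```
]]></preformat>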
      <sec id="sec-3-1">
        <title>The Transformer: Model Architecture</title>
        <p>
          Most competitive and successful neural sequence transduction models have an encoder-decoder structure [
          <xref ref-type="bibr" rid="ref16">14, 11</xref>
          ]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n) [14]. Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next symbol. The Transformer [14] follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively (see [14] for more).
        </p>
        <p>Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself [14]. Decoder: The decoder also consists of a stack of N = 6 identical layers. In addition to the two sub-layers, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is used around each sub-layer, followed by layer normalization. To prevent positions from attending to subsequent positions, a modified self-attention sub-layer is used in the decoder [14].</p>
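        <p>The sub-layer wrapper LayerNorm(x + Sublayer(x)) can be sketched in a few lines of plain Python; the toy sublayer passed in below is an arbitrary stand-in, not a real attention or feed-forward block.</p>
        <preformat><![CDATA[
```python
import math

# Sketch of the sub-layer wrapper described above: the output of each
# sub-layer is LayerNorm(x + Sublayer(x)). The "sublayer" used below is
# an arbitrary toy function, not a real attention block.

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    # residual connection, then layer normalization
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

print(residual_block([1.0, 2.0, 3.0], lambda x: [0.5 * v for v in x]))
```
]]></preformat>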
        <p>Attention: An attention function can be described as
mapping a query and a set of key-value pairs to an output,
where the query, the keys, the values and the output are all
vectors [14]. The output can be calculated as a weighted
sum of the values, where the weight assigned to each value
is calculated by a compatibility function of the query with
the corresponding key.</p>
        <p>Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, which a single attention head inhibits by averaging [14]. The Transformer uses multi-head attention in the following ways. In “encoder-decoder attention” layers, the queries come from the previous decoder layer, while the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence [15, 2]. The encoder contains self-attention layers. In a self-attention layer, all keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder [14].</p>
        <p>Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position [14].</p>
      </sec>
      <sec id="sec-3-2">
        <title>T5 approach</title>
        <p>
          Attention Masks: A major distinguishing factor between different architectures is the “mask” used by the attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length [
          <xref ref-type="bibr" rid="ref15 ref19 ref3">10</xref>
          ]. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let y_i refer to the i-th element of the output sequence and x_j to the j-th entry of the input sequence. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V. The output is computed as:

y_i = Σ_j w_{i,j} x_j (1)
        </p>
        <p>where w_{i,j} is the scalar weight produced by the self-attention mechanism as a function of x_i and x_j. The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output time step.</p>
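        <p>A minimal sketch of this masked weighted average: masked-out raw weights are zeroed and the surviving ones renormalized before the sum in Equation 1. The weights below are made-up toy values, not the output of a real self-attention layer.</p>
        <preformat><![CDATA[
```python
# Sketch of Equation 1 with an attention mask applied: masked-out raw
# weights are zeroed and the rest renormalized before the weighted
# average y_i = sum_j w[i][j] * x[j]. Toy weights, for illustration only.

def masked_attention_output(x, w, mask):
    """x: input scalars; w: raw weights w[i][j]; mask[i][j] is 0 or 1."""
    y = []
    for i in range(len(x)):
        row = [w[i][j] * mask[i][j] for j in range(len(x))]
        total = sum(row) or 1.0        # avoid division by zero
        y.append(sum(r / total * xj for r, xj in zip(row, x)))
    return y

x = [10.0, 20.0, 30.0]
w = [[1.0, 1.0, 1.0] for _ in range(3)]      # uniform raw weights
causal = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]   # row i attends to j <= i
print(masked_attention_output(x, w, causal))  # approx. [10.0, 15.0, 20.0]
```
]]></preformat>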
        <p>Encoder-Decoder: An encoder-decoder Transformer consists of two stacks of layers: the encoder, which is fed an input sequence, and the decoder, which generates a new output sequence. The encoder uses a “fully visible” attention mask, which allows a self-attention mechanism to attend to any entry of the input when producing each entry of its output. This form of masking is suitable when the attention is over a “prefix”, i.e. a context that is provided to the model and later used to make predictions. The self-attention operations in the decoder of the Transformer use a “causal” masking pattern: the model is prevented from attending to the j-th entry of the input when producing the i-th entry of the output for j &gt; i. This is used during training so that the model cannot “see into the future” while producing its output.</p>
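        <p>The two masking patterns can be sketched as boolean matrices:</p>
        <preformat><![CDATA[
```python
# Sketch of the two masking patterns: a fully visible mask lets every
# output position attend to every input position, while a causal mask
# zeroes attention to entries j greater than i, so the decoder cannot
# "see into the future" during training.

def fully_visible_mask(n):
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    # row i may attend to positions 0..i only
    return [[1] * (i + 1) + [0] * (n - 1 - i) for i in range(n)]

print(causal_mask(3))         # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(fully_visible_mask(3))  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```
]]></preformat>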
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Sequence to Sequence Model</title>
      <p>The Recurrent Neural Network (RNN) is a natural generalization of feed-forward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating equations 2 and 3:

h_t = sigmoid(W^{hx} x_t + W^{hh} h_{t-1}) (2)

y_t = W^{yh} h_t (3)</p>
      <p>The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths or complicated, non-monotonic relationships.</p>
      <p>Sequence learning consists of mapping the input
sequence with one RNN to a vector of fixed size and
then mapping the vector with another RNN to the target
sequence. Although it could work in principle, since the
RNN is supplied with all relevant information, it would be
difficult to train the RNNs due to the resulting long-term
dependencies. However, the Long Short-Term Memory
(LSTM) is known to learn problems with long-range
time dependencies, so an LSTM can be successful in this
setting.</p>
      <p>The objective of the LSTM is to estimate the conditional probability p(y_1, ..., y_{M′} | x_1, ..., x_M), where (x_1, ..., x_M) is an input sequence and (y_1, ..., y_{M′}) is its corresponding output sequence, whose length M′ may differ from M. The LSTM computes the conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_M), given by the last hidden state of the LSTM, and then computing the probability of (y_1, ..., y_{M′}) with a standard LSTM language model formulation whose initial hidden state is set to the representation v of (x_1, ..., x_M):

p(y_1, ..., y_{M′} | x_1, ..., x_M) = Π_{m=1}^{M′} p(y_m | v, y_1, ..., y_{m-1}) (4)</p>
      <p>In this equation, each p(y_m | v, y_1, ..., y_{m-1}) distribution is represented with a softmax over all the words in the vocabulary. The LSTM formulation from Graves has been used. It is required that each sentence ends with a special end-of-sentence symbol “&lt;EOS&gt;”, which enables the model to define a distribution over sequences of all possible lengths.</p>
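      <p>The factorization in Equation 4 can be sketched numerically; the per-step conditionals below are hand-made toy values rather than the output of a trained LSTM.</p>
      <preformat><![CDATA[
```python
import math

# Numeric sketch of Equation 4: the probability of an output sequence is
# the product of per-step conditionals p(y_m | v, y_1..y_{m-1}), ending
# at the <EOS> symbol. Toy probabilities, not model output.

def sequence_log_prob(step_probs):
    # summing logs is the numerically stable way to form the product
    return sum(math.log(p) for p in step_probs)

# toy conditionals for the sequence ["good", "summary", "<EOS>"]
step_probs = [0.5, 0.4, 0.9]
print(math.exp(sequence_log_prob(step_probs)))  # approx. 0.18 = 0.5*0.4*0.9
```
]]></preformat>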
    </sec>
    <sec id="sec-5">
      <title>5 Experimental Setting and Results</title>
      <sec id="sec-5-1">
        <title>Dataset Selection</title>
        <p>The experiment was carried out on the BBC News dataset provided by Kaggle (https://www.kaggle.com/pariza/bbc-news-summary). The dataset consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005, and includes five class labels: business, entertainment, politics, sport, and technology. In preprocessing the documents, the following tasks were performed: tokenization using the NLTK (http://www.nltk.org) tokenizer; removal of punctuation marks, determiners, and prepositions; transformation to lower case; stopword removal; and word stemming. In the stopword removal step, the words that are in the English stopword list were removed. After removing the stopwords, the words were stemmed to their roots.</p>
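        <p>A simplified sketch of this preprocessing pipeline; the stop-word list and the suffix-stripping “stemmer” below are toy stand-ins for the NLTK components actually used in the paper.</p>
        <preformat><![CDATA[
```python
import re

# Simplified sketch of the preprocessing pipeline above. The stop-word
# list and the suffix-stripping "stemmer" are toy stand-ins for the NLTK
# tokenizer, stop-word list and stemmer used in the paper.

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def simple_stem(word):
    # naive suffix stripping, nothing like a full Porter stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lower-case + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word removal
    return [simple_stem(t) for t in tokens]             # stemming

print(preprocess("The markets are reacting to the news."))
# ['market', 'are', 'react', 'new']
```
]]></preformat>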
        <p>Python was used for the implementation, with the Scikit-learn (http://scikit-learn.org/), gensim (https://radimrehurek.com/gensim/), NumPy (http://www.numpy.org/) and PyTorch (https://www.pytorch.org/) libraries.</p>
      </sec>
      <sec id="sec-5-2">
        <title>T5 Model Hyper-Parameter Setting</title>
        <p>The following parameters were selected by taking into account the computation power and resources at hand; the hyper-parameters were selected using the manual configuration method. The dataset is split into 80% training data and 20% testing data with the sample function from the pandas framework.</p>
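        <p>The 80/20 split can be sketched with the standard library; the paper samples with pandas, and this stand-in (the function name is ours) mirrors the same idea on a toy list of document-summary pairs.</p>
        <preformat><![CDATA[
```python
import random

# Sketch of the 80/20 split. The paper samples with pandas; this
# stand-in shows the same idea with the standard library.

def split_train_test(pairs, train_frac=0.8, seed=42):
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = [("doc%d" % i, "sum%d" % i) for i in range(10)]
train, test = split_train_test(data)
print(len(train), len(test))  # 8 2
```
]]></preformat>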
        <p>TRAIN_BATCH_SIZE = 2 (default: 64)
VALID_BATCH_SIZE = 2 (default: 1000)
TRAIN_EPOCHS = 2 (default: 10)
VAL_EPOCHS = 1 (default: 10)
LEARNING_RATE = 1e-4 (default: 0.01)
SEED = 42 (default: 42)</p>
      </sec>
      <sec id="sec-5-3">
        <title>Initiating Fine-Tuning of the Model on the BBC News Dataset</title>
        <p>Epoch: 0, Loss: 14.0325
Epoch: 0, Loss: 2.9507
Epoch: 1, Loss: 2.8506
Epoch: 1, Loss: 2.0221</p>
      </sec>
      <sec id="sec-5-4">
        <title>Seq2Seq Model Settings</title>
        <p>The abstractive summarization neural network model was built using the TensorFlow and Keras Python libraries.</p>
        <p>First, the maximum cleaned text and summary lengths were set based on the distribution of sequence lengths in the chosen sample. “sostok” (START) and “eostok” (END) tokens were added to the reference summary, which helps the model determine where a sequence starts and ends. The dataset is split into 80% training data and 20% testing data with the train_test_split function from sklearn.model_selection.</p>
        <p>Then, both the training and testing data are tokenized to form the vocabulary, and the word sequences are converted into equal-length integer sequences using the Tokenizer and pad_sequences modules from the keras.preprocessing package.</p>
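        <p>The tokenize-and-pad step can be sketched in plain Python, mimicking (not calling) the behaviour of Keras' Tokenizer and pad_sequences:</p>
        <preformat><![CDATA[
```python
# Sketch of the tokenize-and-pad step, mimicking Keras' Tokenizer and
# pad_sequences: words are mapped to integer ids (0 is reserved for
# padding) and each sequence is post-padded or truncated to a common
# length.

def fit_vocab(texts):
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)  # ids start at 1
    return vocab

def texts_to_padded(texts, vocab, maxlen):
    seqs = [[vocab.get(w, 0) for w in t.split()] for t in texts]
    return [s[:maxlen] + [0] * max(0, maxlen - len(s)) for s in seqs]

vocab = fit_vocab(["sostok the market rose eostok"])
print(texts_to_padded(["sostok market rose eostok"], vocab, 6))
# [[1, 3, 4, 5, 0, 0]]
```
]]></preformat>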
        <p>Our Seq2Seq model has three LSTM layers in the encoder network and a single LSTM layer in the decoder network, with an embedding layer on both the encoder and the decoder. A custom attention layer was also used to cope with lengthy sequences, and the output layer uses the softmax activation function. The hidden layers have a dimension of 256 units and the embedding layers have a size of 200 units. In addition, a dropout value of 0.4 is used in each hidden layer to reduce overfitting and improve performance. These layers were implemented and the model built using wrappers such as Input, LSTM, Embedding and Dense from tensorflow.keras.layers.</p>
        <p>Different values for each hyper-parameter were tried, and the following settings were selected during training based on their performance:
Epochs = 25
Optimizer = “rmsprop”
Batch size = 64
Latent dimension = 256
Embedding dimension = 200
Loss function = “sparse_categorical_crossentropy”
Hyper-parameters were selected using the manual configuration method, and the accuracy and loss values were determined and analyzed. After the training phase comes the inference phase, in which we input the testing data to our model and obtain the predicted summary.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Evaluation Metrics</title>
        <p>In text summarization, summary evaluation is an essential task. Manual and semi-automatic evaluation of large-scale summarization models is costly and cumbersome, so much effort has been made to develop automatic metrics that allow for fast and cheap evaluation of models. The ROUGE package introduced by Lin [8] offers a set of automatic metrics based on the lexical overlap between candidate and reference summaries.</p>
        <p>We used ROUGE metrics for our evaluation process. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation, an automatic summary evaluation benchmarking metric that is widely used by researchers to determine the quality of a produced summary by comparing the machine-generated summary with the reference summary (an ideal or human-written one). ROUGE scores are computed from the number of overlapping words between the reference summary and the machine-generated summary. There are different types of ROUGE, such as ROUGE-N, ROUGE-L, ROUGE-S and ROUGE-W, but the most commonly used are ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L, and hence we also use the same.</p>
        <p>ROUGE-N: It denotes the overlapping of n-grams between the system-generated summary and the ideal reference summary; for instance, unigram (ROUGE-1), bigram (ROUGE-2), trigram (ROUGE-3) and so on. ROUGE-N is given by:

ROUGE-N = ( Σ_{S∈RS} Σ_{gram_n∈S} Count_match(gram_n) ) / ( Σ_{S∈RS} Σ_{gram_n∈S} Count(gram_n) ) (5)

where RS is a set of reference summaries, n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a generated summary and a set of reference summaries.</p>
        <p>ROUGE-L: It denotes the Longest Common Subsequence (LCS) matching between the reference summary and the system-generated summary.</p>
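        <p>A minimal sketch of ROUGE-N as defined in Equation 5, computed as clipped n-gram overlap between a candidate summary and a set of reference summaries:</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Minimal sketch of ROUGE-N (Equation 5): clipped n-gram co-occurrence
# counts between a candidate summary and a set of reference summaries.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n=1):
    cand_counts = Counter(ngrams(candidate.split(), n))
    match, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        # Count_match(gram_n): co-occurrences, clipped per n-gram
        match += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())   # Count(gram_n) over references
    return match / total if total else 0.0

print(rouge_n("the cat sat", ["the cat sat here"], n=1))  # 0.75
```
]]></preformat>
        <p>This recall-oriented form divides by the total number of reference n-grams, matching the denominator in Equation 5.</p>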
      </sec>
      <sec id="sec-5-6">
        <title>Results</title>
        <p>The experimental results of the Text-To-Text Transfer Transformer (T5) method were compared with attention-based sequence-to-sequence methods. The experimental results are presented in Table 1 and Table 2: the results shown in Table 1 are from the Transformer (T5) method, and the results in Table 2 are from the baseline method. According to the experimental results, T5-based abstractive text summarization outperformed the baseline attention-based seq2seq approach in all of the metrics used. Sample prediction results from the test set are presented in Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>(Table 3: sample prediction results, generated text vs. actual text, for the BBC News stories “Labour's Cunningham to stand down” and “CSA could close, says minister”.)</p>
      <p>
        In this paper, we have dealt with the demanding task of abstractive document summarization. We used the newly introduced approach [
        <xref ref-type="bibr" rid="ref15 ref19 ref3">10</xref>
        ], the Text-To-Text Transfer Transformer (T5) framework, to create a multi-sentence summary. Experiments were carried out to verify the effectiveness of the proposed method. Experimental results on the BBC News dataset showed that the T5 model performed well on abstractive document summarization. A future direction is to study the Transformer method for the task of summarizing multiple documents, and also to verify the T5 approach on other benchmark datasets.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and Yoshua Bengio.
          <article-title>“Neural Machine Translation by Jointly Learning to Align and Translate”</article-title>
          .
          <source>In: 3rd International Conference on Learning Representations, ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          , Conference Track Proceedings. Ed. by
          <source>Yoshua Bengio and Yann LeCun</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , Yuntian Deng, and
          <string-name>
            <given-names>Alexander M.</given-names>
            <surname>Rush</surname>
          </string-name>
          . “
          <article-title>Bottom-Up Abstractive Summarization”</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Brussels, Belgium,
          <source>October 31 - November 4</source>
          ,
          <year>2018</year>
          . Ed. by Ellen Riloff et al.
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4098</fpage>
          -
          <lpage>4109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>DOI: 10</source>
          .18653/v1/d18-
          <fpage>1443</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Wojciech</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          et al. “
          <article-title>Improving Abstraction in Text Summarization”</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Brussels, Belgium,
          <source>October 31 - November 4</source>
          ,
          <year>2018</year>
          . Ed. by Ellen Riloff et al.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Association for Computational Linguistics</surname>
          </string-name>
          ,
          <year>2018</year>
          , pp.
          <fpage>1808</fpage>
          -
          <lpage>1817</lpage>
          . DOI:
          <volume>10</volume>
          .18653/v1/d18-
          <fpage>1207</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <article-title>“ROUGE: A Package for Automatic Evaluation of Summaries”</article-title>
          .
          <source>In: Text Summarization Branches Out. Barcelona</source>
          , Spain: Association for Computational Linguistics,
          <year>July 2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>URL: https://www.aclweb.org/anthology/ W04-1013.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Nallapati</surname>
          </string-name>
          et al. “
          <article-title>Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond”</article-title>
          .
          <source>In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning</source>
          . Berlin, Germany: Association for Computational Linguistics, Aug.
          <year>2016</year>
          , pp.
          <fpage>280</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio. “</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate”</article-title>
          .
          <source>In: arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Arman</given-names>
            <surname>Cohan</surname>
          </string-name>
          et al. “
          <article-title>A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents”</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (
          <string-name>
            <surname>Short</surname>
            <given-names>Papers). New</given-names>
          </string-name>
          <string-name>
            <surname>Orleans</surname>
          </string-name>
          , Louisiana: Association for Computational Linguistics,
          <year>June 2018</year>
          , pp.
          <fpage>615</fpage>
          -
          <lpage>621</lpage>
          . DOI: 10 .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <volume>18653</volume>
          / v1 /
          <fpage>N18</fpage>
          - 2097. URL: https : / / www .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>aclweb.org/anthology/N18-2097.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Yue</given-names>
            <surname>Dong</surname>
          </string-name>
          et al. “
          <article-title>BanditSum: Extractive Summarization as a Contextual Bandit”</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Brussels, Belgium,
          <source>October 31 - November 4</source>
          ,
          <year>2018</year>
          . Ed. by Ellen Riloff et al.
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3739</fpage>
          -
          <lpage>3748</lpage>
          . DOI:
          <volume>10</volume>
          . 18653 / v1 / d18 -
          <fpage>1409</fpage>
          . URL: https://doi.org/10.18653/v1/ d18-
          <fpage>1409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Günes</given-names>
            <surname>Erkan and Dragomir R Radev</surname>
          </string-name>
          . “Lexrank:
          <article-title>Graph-based lexical centrality as salience in text summarization”</article-title>
          .
          <source>In: Journal of artificial intelligence research 22</source>
          (
          <year>2004</year>
          ), pp.
          <fpage>457</fpage>
          -
          <lpage>479</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          et al. “
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer”</article-title>
          . In: arXiv preprint arXiv:
          <year>1910</year>
          .
          <volume>10683</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and Quoc V Le.
          <article-title>“Sequence to Sequence Learning with Neural Networks”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>27</volume>
          . Ed. by
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          et al.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Curran</given-names>
            <surname>Associates</surname>
          </string-name>
          , Inc.,
          <year>2014</year>
          , pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Wan</surname>
          </string-name>
          , and Jianguo Xiao. “
          <article-title>Abstractive Document Summarization with a GraphBased Attentional Neural Model”</article-title>
          . In:
          <article-title>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          . Vancouver, Canada: Association for Computational Linguistics,
          <year>July 2017</year>
          , pp.
          <fpage>1171</fpage>
          -
          <lpage>1181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>DOI: 10</source>
          .18653/v1/
          <fpage>P17</fpage>
          -1108.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [13]
          <string-name>
            <surname>H. Van Lierde</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tommy W.S.</given-names>
            <surname>Chow</surname>
          </string-name>
          . “
          <article-title>Queryoriented text summarization based on hypergraph transversals”</article-title>
          .
          <source>In: Information Processing Management 56.4</source>
          (
          <issue>2019</issue>
          ), pp.
          <fpage>1317</fpage>
          -
          <lpage>1338</lpage>
          . ISSN:
          <fpage>0306</fpage>
          -
          <lpage>4573</lpage>
          . DOI: https://doi.org/10.1016/j.ipm.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          . Ed. by I. Guyon et al. Curran Associates, Inc.,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Yonghui Wu</surname>
          </string-name>
          et al. “
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation”</article-title>
          .
          <source>In: arXiv preprint arXiv:1609.08144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>