Abstractive Text Summarization using Transfer Learning

Ekaterina Zolotareva, Tsegaye Misikir Tashu and Tomáš Horváth

ELTE - Eötvös Loránd University, Faculty of Informatics,
Department of Data Science and Engineering, Telekom Innovation Laboratories
Pázmány Péter sétány 1/C, 1117, Budapest, Hungary
(dnbo45, tomas.horvath, misikir)@inf.elte.hu

Abstract: Recently, abstractive text summarization has achieved success in switching from linear models with sparse, handcrafted features to nonlinear neural network models with dense inputs. This success comes from the application of deep learning models to natural language processing tasks, where such models are capable of capturing intricate patterns in data without handcrafted features. In this work, the text summarization problem is explored using sequence-to-sequence recurrent neural networks and transfer learning with a Unified Text-to-Text Transformer. Experimental results showed that the transfer-learning-based model achieved a considerable improvement for abstractive text summarization.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Summarization is closely related to data compression and information understanding, both of which are key to information science and retrieval. Text summarization technology can improve information extraction systems and also allows readers to quickly scan a large number of documents for important information. Indeed, automatic summarization has recently been recognized as one of the most important natural language processing (NLP) tasks, yet one of the least solved.

In the literature, there are two main approaches to text summarization. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherence of human-generated summaries. Abstractive text summarization is the task of generating a summary consisting of a few sentences that capture the salient ideas of the input text document. The adjective 'abstractive' denotes a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document [9].

Abstractive summarization has shown the most promise towards addressing issues in extracting important information from text documents, but abstractive generation may produce sentences not seen in the original input document. Motivated by the success of neural networks in machine translation, the attention-based encoder-decoder paradigm has recently been widely studied in abstractive summarization. By dynamically accessing the relevant pieces of information based on the hidden states of the decoder during the generation of the output sequence, the model revisits the input and attends to important information.

Recent abstractive document summarization models are still not able to achieve convincing performance. In this paper, we investigate transfer learning for abstractive text summarization to address a key challenge in summarization: optimally compressing the original document while preserving its key concepts. The rest of this paper is organized as follows: Section 2 provides an overview of existing works and approaches. In Section 3, the approach to be investigated is introduced. Section 5 presents the experimental setting, the data sets used, and the results. Finally, Section 6 concludes the paper and discusses prospective plans for future work.

2 Related work

The number of summarization models introduced every year has been increasing rapidly. Advances in neural network architectures [1, 11] and the availability of large-scale data enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms [3, 12], multi-task and multi-reward training techniques [7], graph-based methods that arrange the input text in a graph and then use ranking or graph traversal algorithms to construct the summary [5, 13], reinforcement learning strategies [4], and hybrid extractive-abstractive models [6].

This work is based on the recent Text-To-Text Transfer Transformer (T5) [10] and on one of the best-known sequence-to-sequence (Seq2Seq) models [6]. The T5 model, pre-trained on the Colossal Clean Crawled Corpus (C4), achieved state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important tasks.

3 The Transformer Model

It is possible to formulate most NLP tasks in a "text-to-text" format, that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This approach provides a consistent training objective for both pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective regardless of the task.

3.1 The Transformer: Model Architecture

Most competitive and successful neural sequence transduction models have an encoder-decoder structure [14, 11].
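The encoder-decoder contract can be sketched as two toy functions (a deliberately tiny illustration invented for this text, not the authors' model): the "encoder" turns each input symbol into a numeric representation, and the "decoder" emits outputs one step at a time, with each step consuming the previously generated outputs.

```python
def encode(x_tokens):
    # z = (z1, ..., zn): one crude numeric "representation" per input symbol
    # (here simply the token length; a real encoder learns these vectors).
    return [len(tok) for tok in x_tokens]

def decode(z):
    # Auto-regressive generation: output y_t depends on z and on y_1..y_{t-1}.
    y = []
    for z_t in z:
        prev = y[-1] if y else 0  # the previously generated symbol
        y.append(prev + z_t)
    return y
```

For example, `decode(encode(["the", "cat"]))` returns `[3, 6]`: the second output depends on the first, which is the auto-regressive behaviour the model relies on.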
Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n) [14]. Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step, the model is auto-regressive: the previously generated symbols are consumed as additional input when generating the next symbol. The Transformer [14] follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown in the left and right halves of Figure 1, respectively (see [14] for more).

Figure 1: The Transformer - model architecture [14]

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself [14].

Decoder: The decoder also consists of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection around each of the sub-layers is used, followed by layer normalization. To prevent positions from attending to subsequent positions, a modified self-attention sub-layer is used in the decoder [14].

Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, the keys, the values and the output are all vectors [14]. The output is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed together into matrices K and V, and the matrix of outputs is computed as softmax(Q K^T / sqrt(d_k)) V [14].

Multi-head attention allows the model to attend to information from different representation subspaces at different positions; with a single attention head, averaging prevents this [14]. The Transformer uses multi-head attention in the following ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence [15, 2].

• The encoder contains self-attention layers. In a self-attention layer, all keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder [14].

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position [14].

Figure 2: Multi-head attention [14]

3.2 T5 approach

Attention Masks: A major distinguishing factor between architectures is the "mask" used by the attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length [10]. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let y_i refer to the i-th element of the output sequence and x_j to the j-th entry of the input sequence:

  y_i = Σ_j w_{i,j} x_j   (1)

where w_{i,j} is the scalar weight produced by the self-attention mechanism as a function of x_i and x_j. The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output time step.

Encoder-Decoder: An encoder-decoder Transformer consists of two stacks of layers: the encoder, which is fed an input sequence, and the decoder, which generates a new output sequence. The encoder uses a "fully visible" attention mask, which allows the self-attention mechanism to attend to any entry of the input when producing each entry of its output. This form of masking is suitable when the attention is over a "prefix", i.e. a context that is provided to the model and later used to make predictions. The self-attention operations in the decoder use a "causal" masking pattern: while producing the i-th entry of the output sequence, the causal mask prevents the model from attending to the j-th entry for j > i. This is used during training so that the model cannot "see into the future" while producing its output.

4 Sequence to Sequence Model

The Recurrent Neural Network (RNN) is a natural generalization of feed-forward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating equations (2) and (3):

  h_t = sigmoid(W^{hx} x_t + W^{hh} h_{t-1})   (2)
  y_t = W^{yh} h_t   (3)

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated, non-monotonic relationships. Sequence learning then consists of mapping the input sequence with one RNN to a vector of fixed size and mapping that vector with another RNN to the target sequence. Although this could work in principle, since the RNN is supplied with all relevant information, it would be difficult to train the RNNs due to the resulting long-term dependencies. However, the Long Short-Term Memory (LSTM) is known to learn problems with long-range time dependencies, so an LSTM can be successful in this setting.

The objective of the LSTM is to estimate the conditional probability p(y_1, ..., y_{M'} | x_1, ..., x_M), where (x_1, ..., x_M) is an input sequence and (y_1, ..., y_{M'}) is its corresponding output sequence, whose length M' may differ from M. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_M), given by the last hidden state of the LSTM, and then computing the probability of (y_1, ..., y_{M'}) with a standard LSTM language-model formulation whose initial hidden state is set to v:

  p(y_1, ..., y_{M'} | x_1, ..., x_M) = Π_{m=1}^{M'} p(y_m | v, y_1, ..., y_{m-1})   (4)

In this equation, each distribution p(y_m | v, y_1, ..., y_{m-1}) is represented with a softmax over all the words in the vocabulary. The LSTM formulation from Graves has been used. Each sentence is required to end with a special end-of-sentence symbol "<EOS>", which enables the model to define a distribution over sequences of all possible lengths.

5 Experimental Setting and Results

5.1 Dataset Selection

The experiments were carried out on the BBC News dataset provided by Kaggle (https://www.kaggle.com/pariza/bbc-news-summary). The dataset consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005 and includes five class labels: business, entertainment, politics, sport, and technology.
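A minimal sketch of reading such a corpus into (document, summary) pairs is shown below. The directory layout ("News Articles/<category>/*.txt" mirrored by "Summaries/<category>/*.txt") and the helper name `load_corpus` are assumptions about the Kaggle distribution, not something specified in the paper.

```python
# Sketch: pair each BBC-style article with its reference summary by
# category and filename. Layout is assumed, not taken from the paper.
from pathlib import Path

def load_corpus(root):
    """Return a list of {category, document, summary} dicts."""
    root = Path(root)
    pairs = []
    for article in sorted((root / "News Articles").rglob("*.txt")):
        category = article.parent.name
        summary = root / "Summaries" / category / article.name
        if summary.exists():  # keep only articles with a reference summary
            pairs.append({
                "category": category,
                "document": article.read_text(encoding="utf-8"),
                "summary": summary.read_text(encoding="utf-8"),
            })
    return pairs
```

Pairing by mirrored path keeps the article-summary alignment explicit, so a later train/test split operates on whole pairs rather than on articles and summaries separately.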
5.2 Data Preprocessing

In preprocessing the documents, the following tasks were performed: tokenization using the NLTK (http://www.nltk.org) tokenizer; removal of punctuation marks, determiners, and prepositions; transformation to lower case; stopword removal; and word stemming. In the stopword removal step, the words that appear in the English stopword list were removed. After removing the stopwords, the remaining words were stemmed to their roots.

Python was used for the implementation; the Scikit-learn (http://scikit-learn.org/), gensim (https://radimrehurek.com/gensim/), NumPy (http://www.numpy.org/) and PyTorch (https://www.pytorch.org/) libraries were used.

5.3 T5 Model Hyper-Parameter Setting

The following parameters were selected by taking into account the computational power and resources at hand; the hyper-parameters were chosen using the manual configuration method. The dataset was split into 80% training data and 20% testing data with the sample function from the pandas framework.

• TRAIN_BATCH_SIZE = 2 (default: 64)
• VALID_BATCH_SIZE = 2 (default: 1000)
• TRAIN_EPOCHS = 2 (default: 10)
• VAL_EPOCHS = 1 (default: 10)
• LEARNING_RATE = 1e-4 (default: 0.01)
• SEED = 42 (default: 42)

Loss values when initiating fine-tuning of the model on the BBC News dataset:

• Epoch: 0, Loss: 14.0325
• Epoch: 0, Loss: 2.9507
• Epoch: 1, Loss: 2.8506
• Epoch: 1, Loss: 2.0221

5.4 Seq2Seq Model Settings

The abstractive summarization neural network model was built using the TensorFlow and Keras machine learning libraries. First, the maximum cleaned-text and summary lengths were set based on the distribution of sequence lengths in the chosen sample. "sostok" (START) and "eostok" (END) tokens were added to the reference summaries, which helps the model determine where a sequence starts and ends. The dataset was split into 80% training data and 20% testing data with the train_test_split function from sklearn.model_selection. Then, both the training and testing data were tokenized to form the vocabulary, and the word sequences were converted into equal-length integer sequences using the Tokenizer and pad_sequences modules from the keras.preprocessing package.

Our Seq2Seq model has three LSTM layers in the encoder network and a single LSTM layer in the decoder network, with an embedding layer on both the encoder and the decoder. A custom attention layer was also used to cope with long sequences, and the output layer uses the softmax activation function. The hidden layers have a dimension of 256 units and the embedding layers a size of 200 units. In addition, a dropout value of 0.4 is used in each hidden layer to reduce model overfitting and improve performance. These layers were implemented, and the model built, using the Input, LSTM, Embedding and Dense wrappers from tensorflow.keras.layers.

Different values for each hyper-parameter were tried, and the following settings were selected based on their performance during training:

• Epochs = 25
• Optimizer = "rmsprop"
• Batch size = 64
• Latent dimension = 256
• Embedding dimension = 200
• Loss function = "sparse_categorical_crossentropy"

The accuracy and loss values were determined and analyzed. After the training phase comes the inference phase, in which the testing data is fed to the model to obtain the predicted summaries.

5.5 Evaluation Metrics

In text summarization, summary evaluation is an essential task. Manual and semi-automatic evaluation of large-scale summarization models is costly and cumbersome, so much effort has been made to develop automatic metrics that allow fast and cheap evaluation of models. The ROUGE package introduced by Lin [8] offers a set of automatic metrics based on the lexical overlap between candidate and reference summaries. We used ROUGE metrics for our evaluation process.
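To make the metric concrete, here is a simplified re-implementation of the clipped n-gram overlap underlying ROUGE-N (an illustration only; the evaluation reported in this paper uses the ROUGE package of Lin [8], which additionally handles stemming and multiple references).

```python
# Simplified ROUGE-N against a single reference summary (illustrative).
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return precision, recall and F1 of the clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped matches, as in Count_match
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge_n("the cat sat on the mat", "the cat slept on the mat")` matches 5 of 6 unigrams, giving precision, recall and F1 of 5/6 each.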
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is an automatic summary evaluation benchmarking metric that is widely used by researchers to determine the quality of a generated summary by comparing it with a reference summary (an ideal or human-written one). ROUGE scores are computed from the number of overlapping words between the reference summary and the machine-generated summary. There are different variants, such as ROUGE-N, ROUGE-L, ROUGE-S and ROUGE-W; the most commonly used ones are ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L, and hence we use the same.

ROUGE-N denotes the overlap of n-grams between the system-generated summary and the ideal reference summary, for instance unigrams (ROUGE-1), bigrams (ROUGE-2), trigrams (ROUGE-3) and so on. ROUGE-N is given by:

  ROUGE-N = ( Σ_{S ∈ RS} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ RS} Σ_{gram_n ∈ S} Count(gram_n) )   (5)

where RS is a set of reference summaries, n stands for the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a generated summary and a set of reference summaries.

ROUGE-L denotes the Longest Common Subsequence (LCS) match between the reference summary and the system-generated summary.

5.6 Results

The experimental results of the Text-To-Text Transfer Transformer (T5) method were compared with those of the attention-based sequence-to-sequence method. The results are presented in Table 1 (the T5 method) and Table 2 (the baseline). According to the experimental results, T5-based abstractive text summarization outperformed the baseline attention-based Seq2Seq approach on all of the metrics used. Sample prediction results from the test set are presented in Table 3.

Table 1: Results on the BBC test set using the T5 model

             ROUGE-1   ROUGE-2   ROUGE-L
  F1         0.473     0.265     0.361
  Precision  0.467     0.261     0.338
  Recall     0.480     0.269     0.389

Table 2: Results on the BBC test set using the Seq2Seq model

             ROUGE-1   ROUGE-2   ROUGE-L
  F1         0.313     0.193     0.262
  Precision  0.388     0.275     0.289
  Recall     0.324     0.132     0.199

Table 3: Sample results using the T5 model

  Generated text: Veteran Labour MP and former Cabinet minister Jack
  Cunningham has said he will stand down at the next election Mr Blair
  said He was an...
  Actual text: Labour s Cunningham to stand down Veteran Labour MP and
  former Cabinet minister Jack Cunningham has said he will stand down...

  Generated text: Ministers would not rule out scrapping the Child Support
  Agency if it failed to improve Work and Pension Secretary...
  Actual text: CSA could close says minister Ministers would not rule out
  scrapping the Child Support...

6 Conclusion

In this paper, we have dealt with the demanding task of abstractive document summarization. We used the newly introduced Text-To-Text Transfer Transformer (T5) framework [10] to create multi-sentence summaries. Experiments were carried out to verify the effectiveness of the proposed method. Experimental results on the BBC News dataset showed that the T5 model performed well on abstractive document summarization. A future direction is to study the Transformer method for the task of summarizing multiple documents, and to verify the T5 approach on other benchmark datasets.

Acknowledgment

The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473 (2014).

[3] Arman Cohan et al. "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 615-621. DOI: 10.18653/v1/N18-2097. URL: https://www.aclweb.org/anthology/N18-2097.

[4] Yue Dong et al. "BanditSum: Extractive Summarization as a Contextual Bandit". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 3739-3748. DOI: 10.18653/v1/d18-1409. URL: https://doi.org/10.18653/v1/d18-1409.

[5] Günes Erkan and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization". In: Journal of Artificial Intelligence Research 22 (2004), pp. 457-479.

[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. "Bottom-Up Abstractive Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 4098-4109. DOI: 10.18653/v1/d18-1443.

[7] Wojciech Kryscinski et al. "Improving Abstraction in Text Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 1808-1817. DOI: 10.18653/v1/d18-1207.

[8] Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74-81. URL: https://www.aclweb.org/anthology/W04-1013.

[9] Ramesh Nallapati et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280-290. DOI: 10.18653/v1/K16-1028. URL: https://www.aclweb.org/anthology/K16-1028.

[10] Colin Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer". In: arXiv preprint arXiv:1910.10683 (2019).

[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 3104-3112.

[12] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. "Abstractive Document Summarization with a Graph-Based Attentional Neural Model". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1171-1181. DOI: 10.18653/v1/P17-1108.

[13] H. Van Lierde and Tommy W.S. Chow. "Query-oriented text summarization based on hypergraph transversals". In: Information Processing & Management 56.4 (2019), pp. 1317-1338. ISSN: 0306-4573. DOI: https://doi.org/10.1016/j.ipm.2019.03.003.

[14] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998-6008.

[15] Yonghui Wu et al. "Google's neural machine translation system: Bridging the gap between human and machine translation". In: arXiv preprint arXiv:1609.08144 (2016).