Abstractive Text Summarization using Transfer Learning

Ekaterina Zolotareva, Tsegaye Misikir Tashu and Tomáš Horváth

ELTE - Eötvös Loránd University, Faculty of Informatics,
Department of Data Science and Engineering, Telekom Innovation Laboratories
Pázmány Péter sétány 1/C, 1117, Budapest, Hungary
(dnbo45, tomas.horvath, misikir)@inf.elte.hu

Abstract: Recently, abstractive text summarization has achieved success in switching from linear models with sparse, handcrafted features to nonlinear neural network models with dense inputs. This success comes from the application of deep learning models to natural language processing tasks, where such models are capable of capturing intricate patterns in data without handcrafted features. In this work, the text summarization problem is explored using sequence-to-sequence recurrent neural networks and transfer learning with a Unified Text-to-Text Transformer. Experimental results showed that the transfer-learning-based model achieved a considerable improvement for abstractive text summarization.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Summarization is closely related to data compression and information understanding, both of which are key to information science and retrieval. Text summarization technology can improve information extraction systems and also allows readers to quickly scan a large number of documents for important information. Indeed, automatic summarization has recently been recognized as one of the most important natural language processing (NLP) tasks, yet one of the least solved.

In the literature, there are two main approaches to text summarization. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherence of human-generated summaries. Abstractive text summarization is the task of generating a summary consisting of a few sentences that capture the salient ideas of the input text document. The adjective 'abstractive' denotes a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document [9].

Abstractive summarization has shown the most promise towards addressing issues in extracting important information from text documents, but abstractive generation may produce sentences not seen in the original input document. Motivated by the success of neural networks in machine translation, the attention-based encoder-decoder paradigm has recently been widely studied in abstractive summarization. By dynamically accessing the relevant pieces of information based on the hidden states of the decoder during the generation of the output sequence, the model revisits the input and attends to important information.

Recent abstractive document summarization models are still not able to achieve convincing performance. In this paper, we investigate transfer learning for abstractive text summarization to address a key challenge in summarization: optimally compressing the original document while preserving its key concepts. The rest of this paper is organized as follows: Section 2 provides an overview of existing works and approaches. In Section 3, the approach to be investigated is introduced. Section 5 presents the experimental setting, the data sets used, and the results. Finally, Section 6 concludes the paper and discusses prospective plans for future work.

2 Related work

The number of summarization models introduced every year has been increasing rapidly. Advances in neural network architectures [1, 11] and the availability of large-scale data enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms [3, 12], multi-task and multi-reward training techniques [7], graph-based methods that arrange the input text in a graph and then use ranking or graph traversal algorithms to construct the summary [5, 13], reinforcement learning strategies [4], and hybrid extractive-abstractive models [6].

This work is based on the recent Text-To-Text Transfer Transformer (T5) [10] and on one of the best-known sequence-to-sequence (Seq2Seq) models [6]. The T5 model, pre-trained on the Colossal Clean Crawled Corpus (C4), achieved state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important tasks.

3 The Transformer Model

It is possible to formulate most NLP tasks in a "text-to-text" format, that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This approach provides a consistent training objective for both pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective regardless of the task.

3.1 The Transformer: Model Architecture

Most competitive and successful neural sequence transduction models have an encoder-decoder structure [14, 11].
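The encoder-decoder contract can be sketched as two toy functions (a deliberately tiny illustration invented for this text, not the authors' model): the "encoder" turns each input symbol into a numeric representation, and the "decoder" emits outputs one step at a time, with each step consuming the previously generated outputs.

```python
def encode(x_tokens):
    # z = (z1, ..., zn): one crude numeric "representation" per input symbol
    # (here simply the token length; a real encoder learns these vectors).
    return [len(tok) for tok in x_tokens]

def decode(z):
    # Auto-regressive generation: output y_t depends on z and on y_1..y_{t-1}.
    y = []
    for z_t in z:
        prev = y[-1] if y else 0  # the previously generated symbol
        y.append(prev + z_t)
    return y
```

For example, `decode(encode(["the", "cat"]))` returns `[3, 6]`: the second output depends on the first, which is the auto-regressive behaviour the model relies on.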
Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n) [14]. Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step, the model is auto-regressive: the previously generated symbols are consumed as additional input when generating the next symbol. The Transformer [14] follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown in the left and right halves of Figure 1, respectively (see [14] for more).

Figure 1: The Transformer - model architecture [14]

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself [14].

Decoder: The decoder also consists of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection around each of the sub-layers is used, followed by layer normalization. To prevent positions from attending to subsequent positions, a modified self-attention sub-layer is used in the decoder [14].

Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, the keys, the values and the output are all vectors [14]. The output is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed together into matrices K and V, and the matrix of outputs is computed as softmax(Q K^T / sqrt(d_k)) V [14].

Multi-head attention allows the model to attend to information from different representation subspaces at different positions; with a single attention head, averaging prevents this [14]. The Transformer uses multi-head attention in the following ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence [15, 2].

• The encoder contains self-attention layers. In a self-attention layer, all keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder [14].

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position [14].

Figure 2: Multi-head attention [14]

3.2 T5 approach

Attention Masks: A major distinguishing factor between architectures is the "mask" used by the attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length [10]. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let y_i refer to the i-th element of the output sequence and x_j to the j-th entry of the input sequence:

  y_i = Σ_j w_{i,j} x_j   (1)

where w_{i,j} is the scalar weight produced by the self-attention mechanism as a function of x_i and x_j. The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output time step.

Encoder-Decoder: An encoder-decoder Transformer consists of two stacks of layers: the encoder, which is fed an input sequence, and the decoder, which generates a new output sequence. The encoder uses a "fully visible" attention mask, which allows the self-attention mechanism to attend to any entry of the input when producing each entry of its output. This form of masking is suitable when the attention is over a "prefix", i.e. a context that is provided to the model and later used to make predictions. The self-attention operations in the decoder use a "causal" masking pattern: while producing the i-th entry of the output sequence, the causal mask prevents the model from attending to the j-th entry for j > i. This is used during training so that the model cannot "see into the future" while producing its output.

4 Sequence to Sequence Model

The Recurrent Neural Network (RNN) is a natural generalization of feed-forward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating equations (2) and (3):

  h_t = sigmoid(W^{hx} x_t + W^{hh} h_{t-1})   (2)
  y_t = W^{yh} h_t   (3)

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated, non-monotonic relationships. Sequence learning then consists of mapping the input sequence with one RNN to a vector of fixed size and mapping that vector with another RNN to the target sequence. Although this could work in principle, since the RNN is supplied with all relevant information, it would be difficult to train the RNNs due to the resulting long-term dependencies. However, the Long Short-Term Memory (LSTM) is known to learn problems with long-range time dependencies, so an LSTM can be successful in this setting.

The objective of the LSTM is to estimate the conditional probability p(y_1, ..., y_{M'} | x_1, ..., x_M), where (x_1, ..., x_M) is an input sequence and (y_1, ..., y_{M'}) is its corresponding output sequence, whose length M' may differ from M. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_M), given by the last hidden state of the LSTM, and then computing the probability of (y_1, ..., y_{M'}) with a standard LSTM language-model formulation whose initial hidden state is set to v:

  p(y_1, ..., y_{M'} | x_1, ..., x_M) = Π_{m=1}^{M'} p(y_m | v, y_1, ..., y_{m-1})   (4)

In this equation, each distribution p(y_m | v, y_1, ..., y_{m-1}) is represented with a softmax over all the words in the vocabulary. The LSTM formulation from Graves has been used. Each sentence is required to end with a special end-of-sentence symbol "<EOS>", which enables the model to define a distribution over sequences of all possible lengths.

5 Experimental Setting and Results

5.1 Dataset Selection

The experiments were carried out on the BBC News dataset provided by Kaggle (https://www.kaggle.com/pariza/bbc-news-summary). The dataset consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005 and includes five class labels: business, entertainment, politics, sport, and technology.
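A minimal sketch of reading such a corpus into (document, summary) pairs is shown below. The directory layout ("News Articles/<category>/*.txt" mirrored by "Summaries/<category>/*.txt") and the helper name `load_corpus` are assumptions about the Kaggle distribution, not something specified in the paper.

```python
# Sketch: pair each BBC-style article with its reference summary by
# category and filename. Layout is assumed, not taken from the paper.
from pathlib import Path

def load_corpus(root):
    """Return a list of {category, document, summary} dicts."""
    root = Path(root)
    pairs = []
    for article in sorted((root / "News Articles").rglob("*.txt")):
        category = article.parent.name
        summary = root / "Summaries" / category / article.name
        if summary.exists():  # keep only articles with a reference summary
            pairs.append({
                "category": category,
                "document": article.read_text(encoding="utf-8"),
                "summary": summary.read_text(encoding="utf-8"),
            })
    return pairs
```

Pairing by mirrored path keeps the article-summary alignment explicit, so a later train/test split operates on whole pairs rather than on articles and summaries separately.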
5.2 Data Preprocessing

In preprocessing the documents, the following tasks were performed: tokenization using the NLTK (http://www.nltk.org) tokenizer; removal of punctuation marks, determiners, and prepositions; transformation to lower case; stopword removal; and word stemming. In the stopword removal step, the words that appear in the English stopword list were removed. After removing the stopwords, the remaining words were stemmed to their roots.

Python was used for the implementation; the Scikit-learn (http://scikit-learn.org/), gensim (https://radimrehurek.com/gensim/), NumPy (http://www.numpy.org/) and PyTorch (https://www.pytorch.org/) libraries were used.

5.3 T5 Model Hyper-Parameter Setting

The following parameters were selected by taking into account the computational power and resources at hand; the hyper-parameters were chosen using the manual configuration method. The dataset was split into 80% training data and 20% testing data with the sample function from the pandas framework.

• TRAIN_BATCH_SIZE = 2 (default: 64)
• VALID_BATCH_SIZE = 2 (default: 1000)
• TRAIN_EPOCHS = 2 (default: 10)
• VAL_EPOCHS = 1 (default: 10)
• LEARNING_RATE = 1e-4 (default: 0.01)
• SEED = 42 (default: 42)

Loss values when initiating fine-tuning of the model on the BBC News dataset:

• Epoch: 0, Loss: 14.0325
• Epoch: 0, Loss: 2.9507
• Epoch: 1, Loss: 2.8506
• Epoch: 1, Loss: 2.0221

5.4 Seq2Seq Model Settings

The abstractive summarization neural network model was built using the TensorFlow and Keras machine learning libraries. First, the maximum cleaned-text and summary lengths were set based on the distribution of sequence lengths in the chosen sample. "sostok" (START) and "eostok" (END) tokens were added to the reference summaries, which helps the model determine where a sequence starts and ends. The dataset was split into 80% training data and 20% testing data with the train_test_split function from sklearn.model_selection. Then, both the training and testing data were tokenized to form the vocabulary, and the word sequences were converted into equal-length integer sequences using the Tokenizer and pad_sequences modules from the keras.preprocessing package.

Our Seq2Seq model has three LSTM layers in the encoder network and a single LSTM layer in the decoder network, with an embedding layer on both the encoder and the decoder. A custom attention layer was also used to cope with long sequences, and the output layer uses the softmax activation function. The hidden layers have a dimension of 256 units and the embedding layers a size of 200 units. In addition, a dropout value of 0.4 is used in each hidden layer to reduce model overfitting and improve performance. These layers were implemented, and the model built, using the Input, LSTM, Embedding and Dense wrappers from tensorflow.keras.layers.

Different values for each hyper-parameter were tried, and the following settings were selected based on their performance during training:

• Epochs = 25
• Optimizer = "rmsprop"
• Batch size = 64
• Latent dimension = 256
• Embedding dimension = 200
• Loss function = "sparse_categorical_crossentropy"

The accuracy and loss values were determined and analyzed. After the training phase comes the inference phase, in which the testing data is fed to the model to obtain the predicted summaries.

5.5 Evaluation Metrics

In text summarization, summary evaluation is an essential task. Manual and semi-automatic evaluation of large-scale summarization models is costly and cumbersome, so much effort has been made to develop automatic metrics that allow fast and cheap evaluation of models. The ROUGE package introduced by Lin [8] offers a set of automatic metrics based on the lexical overlap between candidate and reference summaries. We used ROUGE metrics for our evaluation process.
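To make the metric concrete, here is a simplified re-implementation of the clipped n-gram overlap underlying ROUGE-N (an illustration only; the evaluation reported in this paper uses the ROUGE package of Lin [8], which additionally handles stemming and multiple references).

```python
# Simplified ROUGE-N against a single reference summary (illustrative).
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return precision, recall and F1 of the clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped matches, as in Count_match
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge_n("the cat sat on the mat", "the cat slept on the mat")` matches 5 of 6 unigrams, giving precision, recall and F1 of 5/6 each.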
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is an automatic summary evaluation benchmarking metric that is widely used by researchers to determine the quality of a generated summary by comparing it with a reference summary (an ideal or human-written one). ROUGE scores are computed from the number of overlapping words between the reference summary and the machine-generated summary. There are different variants, such as ROUGE-N, ROUGE-L, ROUGE-S and ROUGE-W; the most commonly used ones are ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L, and hence we use the same.

ROUGE-N denotes the overlap of n-grams between the system-generated summary and the ideal reference summary, for instance unigrams (ROUGE-1), bigrams (ROUGE-2), trigrams (ROUGE-3) and so on. ROUGE-N is given by:

  ROUGE-N = ( Σ_{S ∈ RS} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ RS} Σ_{gram_n ∈ S} Count(gram_n) )   (5)

where RS is a set of reference summaries, n stands for the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a generated summary and a set of reference summaries.

ROUGE-L denotes the Longest Common Subsequence (LCS) match between the reference summary and the system-generated summary.

5.6 Results

The experimental results of the Text-To-Text Transfer Transformer (T5) method were compared with those of the attention-based sequence-to-sequence method. The results are presented in Table 1 (the T5 method) and Table 2 (the baseline). According to the experimental results, T5-based abstractive text summarization outperformed the baseline attention-based Seq2Seq approach on all of the metrics used. Sample prediction results from the test set are presented in Table 3.

Table 1: Results on the BBC test set using the T5 model

             ROUGE-1   ROUGE-2   ROUGE-L
  F1         0.473     0.265     0.361
  Precision  0.467     0.261     0.338
  Recall     0.480     0.269     0.389

Table 2: Results on the BBC test set using the Seq2Seq model

             ROUGE-1   ROUGE-2   ROUGE-L
  F1         0.313     0.193     0.262
  Precision  0.388     0.275     0.289
  Recall     0.324     0.132     0.199

Table 3: Sample results using the T5 model

  Generated text: Veteran Labour MP and former Cabinet minister Jack
  Cunningham has said he will stand down at the next election Mr Blair
  said He was an...
  Actual text: Labour s Cunningham to stand down Veteran Labour MP and
  former Cabinet minister Jack Cunningham has said he will stand down...

  Generated text: Ministers would not rule out scrapping the Child Support
  Agency if it failed to improve Work and Pension Secretary...
  Actual text: CSA could close says minister Ministers would not rule out
  scrapping the Child Support...

6 Conclusion

In this paper, we have dealt with the demanding task of abstractive document summarization. We used the newly introduced Text-To-Text Transfer Transformer (T5) framework [10] to create multi-sentence summaries. Experiments were carried out to verify the effectiveness of the proposed method. Experimental results on the BBC News dataset showed that the T5 model performed well on abstractive document summarization. A future direction is to study the Transformer method for the task of summarizing multiple documents, and to verify the T5 approach on other benchmark datasets.

Acknowledgment

The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473 (2014).

[3] Arman Cohan et al. "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 615-621. DOI: 10.18653/v1/N18-2097. URL: https://www.aclweb.org/anthology/N18-2097.

[4] Yue Dong et al. "BanditSum: Extractive Summarization as a Contextual Bandit". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 3739-3748. DOI: 10.18653/v1/d18-1409. URL: https://doi.org/10.18653/v1/d18-1409.

[5] Günes Erkan and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization". In: Journal of Artificial Intelligence Research 22 (2004), pp. 457-479.

[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. "Bottom-Up Abstractive Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 4098-4109. DOI: 10.18653/v1/d18-1443.

[7] Wojciech Kryscinski et al. "Improving Abstraction in Text Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 1808-1817. DOI: 10.18653/v1/d18-1207.

[8] Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74-81. URL: https://www.aclweb.org/anthology/W04-1013.

[9] Ramesh Nallapati et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280-290. DOI: 10.18653/v1/K16-1028. URL: https://www.aclweb.org/anthology/K16-1028.

[10] Colin Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer". In: arXiv preprint arXiv:1910.10683 (2019).

[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 3104-3112.

[12] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. "Abstractive Document Summarization with a Graph-Based Attentional Neural Model". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1171-1181. DOI: 10.18653/v1/P17-1108.

[13] H. Van Lierde and Tommy W.S. Chow. "Query-oriented text summarization based on hypergraph transversals". In: Information Processing & Management 56.4 (2019), pp. 1317-1338. ISSN: 0306-4573. DOI: https://doi.org/10.1016/j.ipm.2019.03.003.

[14] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998-6008.

[15] Yonghui Wu et al. "Google's neural machine translation system: Bridging the gap between human and machine translation". In: arXiv preprint arXiv:1609.08144 (2016).