=Paper=
{{Paper
|id=Vol-2718/paper28
|storemode=property
|title=Abstractive Text Summarization using Transfer Learning
|pdfUrl=https://ceur-ws.org/Vol-2718/paper28.pdf
|volume=Vol-2718
|authors=Ekaterina Zolotareva,Tsegaye Misikir Tashu,Tomáš Horváth
|dblpUrl=https://dblp.org/rec/conf/itat/ZolotarevaTH20
}}
==Abstractive Text Summarization using Transfer Learning==
Ekaterina Zolotareva, Tsegaye Misikir Tashu and Tomáš Horváth
ELTE Eötvös Loránd University, Faculty of Informatics,
Department of Data Science and Engineering,
Telekom Innovation Laboratories
Pázmány Péter sétány 1/C, 1117, Budapest, Hungary
(dnbo45, tomas.horvath, misikir)@inf.elte.hu
Abstract: Recently, abstractive text summarization has achieved success in switching from linear models with sparse, handcrafted features to nonlinear neural network models with dense inputs. This success comes from the application of deep learning models to natural language processing tasks, where these models are capable of modeling intricate patterns in data without handcrafted features. In this work, the text summarization problem is explored using sequence-to-sequence recurrent neural networks and a Transfer Learning approach based on the Unified Text-to-Text Transformer. Experimental results showed that the Transfer Learning-based model achieved considerable improvement for abstractive text summarization.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Summarization is closely related to data compression and information understanding, both of which are key to information science and retrieval. Text summarization technology can improve information extraction systems and also allows readers to quickly scan a large number of documents for important information. Indeed, automatic summarization has recently been recognized as one of the most important natural language processing (NLP) tasks, yet one of the least solved.

In the literature, there are two main approaches to text summarization. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherency of human-generated summaries. Abstractive text summarization is the task of generating a summary consisting of a few sentences that capture the salient ideas of the input text document. The adjective "abstractive" denotes a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document [9].

Abstractive summarization has shown the most promise towards addressing the issue of extracting important information from text documents, but abstractive generation may produce sentences not seen in the original input document. Motivated by the success of neural networks in machine translation experiments, the attention-based encoder-decoder paradigm has recently been widely studied in abstractive summarization. By dynamically accessing the relevant pieces of information based on the hidden states of the decoder during the generation of the output sequence, the model revisits the input and attends to important information.

Recent abstractive document summarization models are not yet able to achieve convincing performance. In this paper, we investigate transfer learning for abstractive text summarization to address a key challenge in summarization, which is to optimally compress the original document while preserving its key concepts. The rest of this paper is organized as follows: Section 2 provides an overview of existing works and approaches. Section 3 introduces the approach under investigation. Section 5 presents the experimental setting, the data sets used, and the results. Finally, Section 6 concludes the paper and discusses prospective plans for future work.

2 Related work

The number of summarization models introduced every year has been increasing rapidly. Advancements in neural network architectures [1, 11] and the availability of large-scale data enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms [3, 12], multi-task and multi-reward training techniques [7], graph-based methods that arrange the input text in a graph and then use ranking or graph traversal algorithms to construct the summary [5, 13], reinforcement learning strategies [4], and hybrid extractive-abstractive models [6].

This work is based on the recent Text-To-Text Transfer Transformer (T5) [10] and on one of the best-known sequence-to-sequence (Seq2Seq) models [6]. The T5 model, pre-trained on the Colossal Clean Crawled Corpus (C4), achieved state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned to a variety of important tasks.
3 The Transformer Model

It is possible to formulate most NLP tasks in a "text-to-text" format, that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This approach provides a consistent training objective both for pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective regardless of the task: for example, translation becomes the input "translate English to German: That is good." with the target "Das ist gut.", and summarization becomes the input "summarize: <article>" with the target summary [10].
3.1 The Transformer: Model Architecture

Most competitive and successful neural sequence transduction models have an encoder-decoder structure [14, 11]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n) [14]. Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next symbol. The Transformer [14] follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively (see [14] for more).

Figure 1: The Transformer - Model Architecture [14]

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself [14].

Decoder: The decoder also consists of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is used around each sub-layer, followed by layer normalization. To prevent positions from attending to subsequent positions, a modified self-attention sub-layer is used in the decoder [14].

Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, the keys, the values and the output are all vectors [14]. The output is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed together into matrices K and V. The matrix of outputs is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V [14].

Figure 2: Multi-Head Attention [14]

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging prevents this [14]. The Transformer uses multi-head attention in the following ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence [15, 2].

• The encoder contains self-attention layers. In a self-attention layer, all keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder [14].

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position [14].

3.2 T5 approach

Attention Masks: A major distinguishing factor between different architectures is the "mask" used by the attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length [10]. Each entry of the output sequence is produced by computing a weighted average of the entries of the input sequence. Specifically, let y_i refer to the i-th element of the output sequence and x_j refer to the j-th entry of the input sequence. Then

    y_i = ∑_j w_{i,j} x_j    (1)

where w_{i,j} is the scalar weight produced by the self-attention mechanism as a function of x_i and x_j. The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output time step.

Encoder-Decoder: An encoder-decoder Transformer consists of two stacks of layers: the encoder, which is fed an input sequence, and the decoder, which generates a new output sequence. The encoder uses a "fully visible" attention mask, which allows the self-attention mechanism to attend to the entire input when producing each entry of its output. This form of masking is appropriate when the attention is over a "prefix", i.e. a context that is provided to the model and later used to make predictions. The self-attention operations in the decoder of the Transformer use a "causal" masking pattern: during training, the causal mask prevents the decoder from attending to the j-th input entry when producing the i-th output entry for j > i, so that the model cannot "see into the future" while producing its output.
4 Sequence to Sequence Model

The Recurrent Neural Network (RNN) is a natural generalization of feed-forward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating Equations 2 and 3:

    h_t = sigmoid(W^{hx} x_t + W^{hh} h_{t-1})    (2)

    y_t = W^{yh} h_t    (3)

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

Sequence learning consists of mapping the input sequence with one RNN to a vector of fixed size and then mapping that vector with another RNN to the target sequence. Although this could work in principle, since the RNN is supplied with all relevant information, it would be difficult to train the RNNs due to the resulting long-term dependencies. However, the Long Short-Term Memory (LSTM) is known to learn problems with long-range time dependencies, so an LSTM can be successful in this setting.

The objective of the LSTM is to estimate the conditional probability p(y_1, ..., y_{M'} | x_1, ..., x_M), where (x_1, ..., x_M) is an input sequence and (y_1, ..., y_{M'}) is its corresponding output sequence whose length M' may differ from M. The LSTM computes the conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_M), given by the last hidden state of the LSTM, and then computing the probability of (y_1, ..., y_{M'}) with a standard LSTM language model formulation whose initial hidden state is set to the representation v:

    p(y_1, ..., y_{M'} | x_1, ..., x_M) = ∏_{m=1}^{M'} p(y_m | v, y_1, ..., y_{m-1})    (4)

In this equation, each p(y_m | v, y_1, ..., y_{m-1}) distribution is represented with a softmax over all the words in the vocabulary. The LSTM formulation from Graves has been used. Each sentence is required to end with a special end-of-sentence symbol "<EOS>", which enables the model to define a distribution over sequences of all possible lengths.
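As an illustration of how the factorization in Equation 4 is used at inference time, here is a small sketch of greedy decoding; step_fn is a hypothetical stand-in for the decoder LSTM plus softmax and is not part of the paper's code.

```python
import numpy as np

def greedy_decode(step_fn, v, eos_id, max_len=50):
    """Factorize p(y|x) as in Eq. (4): feed back each generated token.

    step_fn(v, prefix) -> probability distribution over the vocabulary
    for the next token, given the encoder representation v and the
    tokens generated so far (a stand-in for the decoder LSTM).
    """
    prefix, logp = [], 0.0
    for _ in range(max_len):
        p = step_fn(v, prefix)        # p(y_m | v, y_1..y_{m-1})
        y = int(np.argmax(p))         # greedy choice of the next token
        logp += float(np.log(p[y]))   # accumulate log of each factor
        prefix.append(y)
        if y == eos_id:               # <EOS> terminates the sequence
            break
    return prefix, logp
```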
5 Experimental Setting and Results

5.1 Dataset Selection

The experiment was carried out on the BBC News dataset provided by Kaggle (https://www.kaggle.com/pariza/bbc-news-summary). The dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004 to 2005, and includes five class labels: business, entertainment, politics, sport, and technology.
5.2 Data Preprocessing

In preprocessing the documents, the following tasks were performed: tokenization using the NLTK (http://www.nltk.org) tokenizer; removal of punctuation marks, determiners, and prepositions; transformation to lower case; stopword removal; and word stemming. In the stopword removal step, the words appearing in the English stopword list were removed. After removing the stopwords, the remaining words were stemmed to their roots.

Python was used for the implementation, together with the Scikit-learn (http://scikit-learn.org/), gensim (https://radimrehurek.com/gensim/), NumPy (http://www.numpy.org/) and PyTorch (https://www.pytorch.org/) libraries.
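A minimal sketch of this preprocessing pipeline, assuming NLTK's word tokenizer, its English stopword list, and the Porter stemmer (the paper does not name the exact stemmer used):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # Lower-case and tokenize.
    tokens = word_tokenize(text.lower())
    # Drop punctuation and stopwords (the stopword list also covers
    # determiners and most prepositions).
    tokens = [t for t in tokens
              if t not in string.punctuation and t not in stop_words]
    # Stem the remaining words to their roots.
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The ministers would not rule out scrapping the agency."))
```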
5.3 T5 Model Hyper-Parameter Setting

The following parameters were selected by taking into account the computation power and resources at hand; the hyper-parameters were therefore chosen using the manual configuration method. The dataset was split into 80% training data and 20% testing data using the sample function from the pandas framework.

• TRAIN_BATCH_SIZE = 2 (default: 64)
• VALID_BATCH_SIZE = 2 (default: 1000)
• TRAIN_EPOCHS = 2 (default: 10)
• VAL_EPOCHS = 1 (default: 10)
• LEARNING_RATE = 1e-4 (default: 0.01)
• SEED = 42 (default: 42)

Loss values observed while initiating fine-tuning of the model on the BBC News dataset:

• Epoch: 0, Loss: 14.0325
• Epoch: 0, Loss: 2.9507
• Epoch: 1, Loss: 2.8506
• Epoch: 1, Loss: 2.0221
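The paper does not include its training code; the following is a minimal sketch of one fine-tuning step with the Hugging Face transformers library under the learning rate above (the model size, the "summarize:" task prefix, and the batch keys "document" and "summary" are our assumptions):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def fine_tune_step(batch):
    # T5 casts every task as text-to-text: prefix the input with the
    # task name and train with a maximum-likelihood objective.
    inputs = tokenizer(["summarize: " + doc for doc in batch["document"]],
                       truncation=True, padding=True, return_tensors="pt")
    targets = tokenizer(batch["summary"], truncation=True, padding=True,
                        return_tensors="pt")
    # For brevity, padding tokens are not masked out of the loss here.
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=targets.input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```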
5.4 Seq2Seq Model Settings

The abstractive summarization neural network model is built using the TensorFlow and Keras machine learning libraries for Python. First, the maximum cleaned-text and summary lengths are set based on the distribution of sequence lengths in the chosen sample. "sostok" (START) and "eostok" (END) tokens are added to the reference summary; this helps the model determine when the sequence starts and ends, respectively. The dataset is split into 80% training data and 20% testing data with the train_test_split function from sklearn.model_selection.

Then, both the training and testing data are tokenized to form the vocabulary, and the word sequences are converted into equal-length integer sequences using the Tokenizer and pad_sequences modules from the keras.preprocessing package.

Our Seq2Seq model has three LSTM layers for the encoder network and a single LSTM layer for the decoder network, with an embedding layer on both the encoder and decoder networks. A custom attention layer was also used to handle lengthy sequences, and the output layer uses the softmax activation function. The hidden layers have a dimension of 256 units and the embedding layers have a size of 200 units. In addition, a dropout value of 0.4 is used in each hidden layer to reduce model overfitting and improve performance. These layers were implemented and the model built using wrappers such as Input, LSTM, Embedding, and Dense from tensorflow.keras.layers; a sketch of this architecture is given at the end of this subsection.

Different values for each hyper-parameter were tried, and the following settings were selected during training based on their performance:

• Epochs = 25
• Optimizer = "rmsprop"
• Batch size = 64
• Latent dimension = 256
• Embedding dimension = 200
• Loss function = "sparse_categorical_crossentropy"

Hyper-parameters were selected using the manual configuration method, and the accuracy and loss values were determined and analyzed. After the training phase comes the inference phase, in which we feed the testing data to our model and obtain the predicted summary as output.
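A sketch of the described architecture with tensorflow.keras; the custom attention layer is omitted for brevity, and vocab_size and max_text_len are placeholder values that would be set from the fitted Tokenizer and the observed length distribution:

```python
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

vocab_size = 10000    # assumed; set from the fitted Tokenizer
max_text_len = 400    # assumed maximum cleaned-text length
latent_dim, embed_dim = 256, 200

# Encoder: embedding followed by three stacked LSTM layers.
encoder_inputs = Input(shape=(max_text_len,))
enc = Embedding(vocab_size, embed_dim)(encoder_inputs)
enc = LSTM(latent_dim, return_sequences=True, dropout=0.4)(enc)
enc = LSTM(latent_dim, return_sequences=True, dropout=0.4)(enc)
enc_out, state_h, state_c = LSTM(latent_dim, return_state=True,
                                 dropout=0.4)(enc)

# Decoder: a single LSTM initialized with the encoder's final states.
decoder_inputs = Input(shape=(None,))
dec = Embedding(vocab_size, embed_dim)(decoder_inputs)
dec, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True,
                 dropout=0.4)(dec, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation="softmax")(dec)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```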
5.5 Evaluation Metrics

In text summarization, summary evaluation is an essential task. Manual and semi-automatic evaluation of large-scale summarization models is costly and cumbersome, so much effort has been made to develop automatic metrics that allow for fast and cheap evaluation of models. The ROUGE package introduced by Lin [8] offers a set of automatic metrics based on the lexical overlap between candidate and reference summaries. We used ROUGE metrics for our evaluation process.
ROUGE (Recall Oriented Understudy for Gisting Evaluation) is an automatic summary evaluation benchmarking metric widely used by researchers to determine the quality of a summary by comparing the machine-generated summary with a reference summary (an ideal or human-written one). ROUGE scores are computed from the number of overlapping words between the reference summary and the machine-generated summary. There are different types of ROUGE, such as ROUGE-N, ROUGE-L, ROUGE-S and ROUGE-W, but the most commonly used are ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L, and hence we use the same.

ROUGE-N: It denotes the overlap of n-grams between the system-generated summary and the ideal reference summary, for instance unigrams (ROUGE-1), bigrams (ROUGE-2), trigrams (ROUGE-3) and so on. ROUGE-n is given by:

    ROUGE-n = ( ∑_{S ∈ RS} ∑_{gram_n ∈ S} Count_match(gram_n) ) / ( ∑_{S ∈ RS} ∑_{gram_n ∈ S} Count(gram_n) )    (5)

where RS is a set of reference summaries, n stands for the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a generated summary and the set of reference summaries.

ROUGE-L: It denotes the Longest Common Subsequence (LCS) match between the reference summary and the system-generated summary.

5.6 Results

The experimental results of the Text-To-Text Transfer Transformer (T5) method were compared with the attention-based sequence-to-sequence method. The experimental results are presented in Table 1 and Table 2: the results shown in Table 1 are from the Transformer (T5) method, and the results in Table 2 are from the baseline method. According to the experimental results presented, Text-To-Text Transfer Transformer (T5) based abstractive text summarization outperformed the baseline attention-based Seq2Seq approach in all of the metrics used. Sample prediction results from the test set are presented in Table 3.

Table 1: Results on the BBC test set using the T5 model

            ROUGE-1   ROUGE-2   ROUGE-L
F1          0.473     0.265     0.361
Precision   0.467     0.261     0.338
Recall      0.480     0.269     0.389

Table 2: Results on the BBC test set using the Seq2Seq model

            ROUGE-1   ROUGE-2   ROUGE-L
F1          0.313     0.193     0.262
Precision   0.388     0.275     0.289
Recall      0.324     0.132     0.199

Table 3: Sample results using the T5 model

Generated text: "Veteran Labour MP and former Cabinet minister Jack Cunningham has said he will stand down at the next election Mr Blair said He was an..."
Actual text: "Labour s Cunningham to stand down Veteran Labour MP and former Cabinet minister Jack Cunningham has said he will stand down..."

Generated text: "Ministers would not rule out scrapping the Child Support Agency if it failed to improve Work and Pension Secretary..."
Actual text: "CSA could close says minister Ministers would not rule out scrapping the Child Support..."
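To illustrate how these scores are computed, here is a minimal sketch of the ROUGE-n recall of Equation 5; real evaluations should use the ROUGE package [8], and this toy version simply tokenizes by whitespace:

```python
from collections import Counter

def rouge_n_recall(candidate: str, references: list[str], n: int = 1) -> float:
    """ROUGE-n as in Eq. (5): matched n-grams over total reference n-grams."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        # Count_match: clipped co-occurrence count of each reference n-gram.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

# 5 of the 6 reference unigrams appear in the candidate -> 0.833...
print(rouge_n_recall("the cat sat on the mat", ["the cat is on the mat"], n=1))
```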
6 Conclusion

In this paper, we have dealt with the demanding task of abstractive document summarization. We used a newly introduced approach [10], the Text-To-Text Transfer Transformer (T5) framework, to create a multi-sentence summary. Experiments were carried out to verify the effectiveness of the proposed method. Experimental results on the BBC News dataset showed that the T5 model performed well in abstractive document summarization. The future direction is to study the Transformer method for the task of summarizing multiple documents and also to verify the T5 approach on other benchmark datasets.

Acknowledgment

The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473 (2014).

[3] Arman Cohan et al. "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 615-621. DOI: 10.18653/v1/N18-2097. URL: https://www.aclweb.org/anthology/N18-2097.

[4] Yue Dong et al. "BanditSum: Extractive Summarization as a Contextual Bandit". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 3739-3748. DOI: 10.18653/v1/d18-1409. URL: https://doi.org/10.18653/v1/d18-1409.

[5] Günes Erkan and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization". In: Journal of Artificial Intelligence Research 22 (2004), pp. 457-479.

[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. "Bottom-Up Abstractive Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 4098-4109. DOI: 10.18653/v1/d18-1443.

[7] Wojciech Kryscinski et al. "Improving Abstraction in Text Summarization". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff et al. Association for Computational Linguistics, 2018, pp. 1808-1817. DOI: 10.18653/v1/d18-1207.

[8] Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74-81. URL: https://www.aclweb.org/anthology/W04-1013.

[9] Ramesh Nallapati et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280-290. DOI: 10.18653/v1/K16-1028. URL: https://www.aclweb.org/anthology/K16-1028.

[10] Colin Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer". In: arXiv preprint arXiv:1910.10683 (2019).

[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 3104-3112.

[12] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. "Abstractive Document Summarization with a Graph-Based Attentional Neural Model". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1171-1181. DOI: 10.18653/v1/P17-1108.

[13] H. Van Lierde and Tommy W.S. Chow. "Query-oriented text summarization based on hypergraph transversals". In: Information Processing & Management 56.4 (2019), pp. 1317-1338. ISSN: 0306-4573. DOI: 10.1016/j.ipm.2019.03.003.

[14] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998-6008.

[15] Yonghui Wu et al. "Google's neural machine translation system: Bridging the gap between human and machine translation". In: arXiv preprint arXiv:1609.08144 (2016).