An Empirical Analysis of Attention-Based Solutions Applied to the Text Summarization Task

Simone Deola1,†, Ricardo A. Matamoros A.1,† and Luca Leo Del Vescovo2,†
1 Università degli Studi di Milano-Bicocca, Piazza dell'Ateneo Nuovo, 1, 20126 Milano MI
2 Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milano MI

Abstract
In recent years we have seen the blooming of attention-based solutions in the NLP field, surpassing the state-of-the-art performances set by traditional methods. While reviewing these solutions for the text summarization task, we found that each of them adopts a different evaluation and training methodology: the dataset, the resources used for training, and the scoring functions all vary, making a direct comparison hard. In this paper we present the results on text summarization of several of the solutions analyzed, obtained on the same dataset (CNN/DailyMail), with the same amount of resources and the same scoring functions (ROUGE). These results provide a fair empirical comparison on this specific task.

Keywords
Abstractive Text Summarization, Transformer

1. Introduction
Abstractive text summarization is the Natural Language Processing task of generating concise, coherent summaries by understanding and rephrasing the input text, unlike extractive summarization, which selects and rearranges existing sentences. It involves natural language processing techniques and deep learning models to produce human-like summaries from long documents or articles. In this paper we show the results of an empirical analysis of some of the solutions proposed in the literature, focusing mainly on new solutions based on the Transformer architecture. We tried to keep the comparison between these results as fair as possible by using a similar amount of resources for each training run.

2.
State-of-the-Art

The text summarization problem is an extensively studied task, explored in the Natural Language Processing field since the 1950s; the solutions proposed to solve it range from traditional methods like Bag of Words, word embeddings and Word2Vec to the more recent use of deep learning techniques such as RNNs and LSTMs [1].

Workshop on Artificial Intelligence and Applications for Business and Industries (AIABI 2023), co-located with the 22nd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023), Rome, Italy
Emails: s.deola1@campus.unimib.it (S. Deola); r.matamorosaragon@campus.unimib.it (R. A. M. A.); 10692678@polimi.it (L. L. D. Vescovo)
ORCID: 0000-0001-5531-6684 (S. Deola); 0000-0002-1957-2530 (R. A. M. A.); 0009-0007-0100-6091 (L. L. D. Vescovo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

In recent years, with the introduction of the Attention Mechanism and its subsequent use inside the Transformer model, we have seen an improvement in the performance of most NLP solutions, including those used to solve the text summarization task. The Attention Mechanism, introduced by Bahdanau [2], aims to simulate cognitive attention by giving the model the ability to focus on specific parts of the input when managing large amounts of text. The specific NLP task analyzed in that paper was machine translation, but the solution has since been applied to many other tasks. In the following years, Vaswani [3] proposed a new architecture that uses a specific kind of attention, Self-Attention, combined with Positional Encoding, improving the time efficiency of the Attention Mechanism on sequence-to-sequence NLP tasks. This new architecture showed state-of-the-art performances on different benchmark NLP tasks, setting a new standard for subsequent NLP solutions.
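To make the mechanism concrete, the scaled dot-product attention at the core of this architecture can be sketched in a few lines. The snippet below is a minimal, dependency-free, single-head version without the learned query/key/value projections that a real Transformer layer would add:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)    # one attention weight per input position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention on a toy sequence of 3 token vectors (Q = K = V)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = attention(X, X, X)
```

Each output row is a convex combination of the value vectors, weighted by the similarity of the query to each key; this is what lets the model attend to the most relevant positions of the input.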
The Transformer model has since been used to solve a variety of problems involving natural language data, computer vision, graph data, and more. In this paper we analyze mainly solutions based on the Transformer model, including a set of the most promising Large Language Models. These models are usually based on the Transformer architecture, and they use a new training paradigm, usually called pre-training/fine-tuning after the names of the two steps that compose it [4]. In the pre-training phase, the model is trained on a huge amount of data to solve one or more generic tasks. These tasks are usually based on datasets that can be built automatically from free text, so the models can be pre-trained on an arbitrarily large set of data. For example, the BERT model [4] is trained on the Masked Language Model task, which consists of predicting words "masked" inside the original text. The dataset for this task is built simply by removing words at random from the input and substituting them with the [MASK] token; the model's goal is to predict the original word. These generic tasks are meant to give the pre-trained model the ability to understand the rules that govern the language. The pre-training step is the heaviest part of the training, but it is computed only once for each model and each language. After that, the model can be fine-tuned for any specific task, such as Q&A, text summarization, or translation. During this step, the model is further trained on a supervised dataset. In recent years, some solutions have proposed removing the fine-tuning step and using the pre-trained model directly to solve supervised tasks. We consider this kind of solution out of the scope of this analysis, but we will explore some of these solutions in future publications.
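As an illustration of how such a dataset can be built automatically, the sketch below masks random words and keeps the originals as prediction targets. This is a simplified, hypothetical helper: the actual BERT procedure also sometimes replaces the selected token with a random word or leaves it unchanged.

```python
import random

def make_mlm_example(tokens, mask_rate=0.15, rng=None):
    """Build one Masked Language Model training pair from raw tokens.

    Returns (masked_tokens, targets), where targets holds the original
    token at each masked position and None elsewhere.
    """
    rng = rng or random.Random(0)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)      # the model must predict this word
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_mlm_example(tokens, mask_rate=0.3,
                                   rng=random.Random(42))
```

Because the targets come from the input itself, no human annotation is needed, which is what makes pre-training on arbitrarily large corpora feasible.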
In this paper we mainly focus on the Transformer model [3], LLMs like GPT-2 [5] and T5 [6], and some techniques of Transformer weight initialization using the LLM BERT [7].

2.1. Models

Since all the models we used are based on slightly modified versions of the Transformer architecture, we start by briefly explaining the Transformer architecture itself. Transformers are a groundbreaking neural network architecture that revolutionized natural language processing tasks. They introduced the self-attention mechanism, which allows the model to weigh the relevance of different words in a sentence. This attention mechanism enables Transformers to capture long-range dependencies and contextual information effectively. Moreover, unlike traditional recurrent neural networks, Transformers process input sequences in parallel, making them highly efficient for both training and inference. Transformers excel in various NLP tasks, including machine translation, text classification, and language generation. Their ability to capture rich semantic relationships and handle long-range dependencies has made them a fundamental building block in modern language models such as GPT and BERT.

BERT [4] is a language representation model based on a stack of Transformer encoders. Released by Google in 2018, it showed state-of-the-art performances on eleven benchmark NLP tasks. Due to the composition of its architecture, BERT is not suited for generative tasks, so, in order to apply it to text summarization, we explore the solution proposed by Rothe et al. [7]. The proposed technique uses the weights of BERT to initialize a basic Transformer model, which can then be fine-tuned to solve the task. The initialization process shows improved performance on the downstream task with respect to the non-initialized version, demonstrating the usefulness of the pre-training process for general understanding of the language.
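The initialization idea can be sketched as a name-matched weight copy: parameters that exist in the pre-trained checkpoint (the encoder, in the Bert2Rand setup) are warm-started, while the rest (the decoder) keep their random initialization. Parameter names and shapes below are purely illustrative, not the real BERT checkpoint layout:

```python
import random

def init_from_checkpoint(model_params, checkpoint):
    """Copy every pre-trained weight whose name and shape match into a
    freshly built model; everything else keeps its random init.

    Tensors are abstracted as flat lists of floats for simplicity.
    """
    initialized = []
    for name, weights in model_params.items():
        if name in checkpoint and len(checkpoint[name]) == len(weights):
            model_params[name] = list(checkpoint[name])  # warm start
            initialized.append(name)
    return initialized

rng = random.Random(0)
fresh = {
    "encoder.layer0.attn": [rng.gauss(0, 0.02) for _ in range(4)],
    "decoder.layer0.attn": [rng.gauss(0, 0.02) for _ in range(4)],
}
# hypothetical pre-trained checkpoint: encoder weights only
bert = {"encoder.layer0.attn": [0.1, 0.2, 0.3, 0.4]}
warm = init_from_checkpoint(fresh, bert)
```

After the copy, only the encoder is warm-started, which mirrors the Bert2Rand configuration used in our experiments.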
GPT-2 [5] (Generative Pre-trained Transformer 2) is a formerly state-of-the-art language model developed by OpenAI. Released in 2019, it gained significant attention due to its impressive language generation capabilities. It is a Transformer-based (decoder-only) model with 1.5B parameters, enough to learn complex patterns and generate coherent, fluent and human-like text in different NLP tasks, such as translation, summarization, and question answering. Its success paved the way for advancements in language understanding and generation models.

T5 [6] (Text-to-Text Transfer Transformer) is a state-of-the-art language model developed by Google. It is also based on the Transformer architecture (it removes the layer-norm bias, places the layer normalization outside the residual path, and uses a different position embedding scheme). It was designed to perform a wide range of NLP tasks in a unified framework by introducing the concept of "text-to-text" transfer learning, where different NLP tasks are cast as text-to-text transformations. In this way, T5 can be trained on a diverse set of tasks, including machine translation, text summarization, question answering, and more. It exhibits remarkable performance across various tasks, showcasing its ability to generalize and transfer knowledge effectively. Its flexible and modular design, coupled with its impressive results, has made T5 a widely adopted model for solving diverse NLP problems.

3. Experiments

The experiments were conducted on Vertex AI by Google, using computational resources sized according to the dimension of each model. To obtain comparable results, we used the same dataset for all the models and tried to keep the same hyperparameters, as discussed below.

3.1. Dataset

In our experiments we used the widely known CNN/DailyMail dataset [8], which is by far the most complete and is extensively used as a benchmark in the field of abstractive text summarization.
It consists of about 300k news articles paired with corresponding human-generated summaries from two major news sources, CNN and the Daily Mail. The dataset was originally created for machine reading comprehension, abstractive Q&A, and the development of automatic text summarization models, but it then became popular for abstractive text summarization too. There are two main reasons why the CNN/DailyMail dataset is suitable for studying abstractive summarization:
• it contains a large collection of different news articles, providing a diverse range of topics and writing styles. The dataset covers various domains such as politics, sports, entertainment, and more;
• the summaries in this dataset are abstractive in nature, meaning they do not simply extract sentences or phrases from the source text but instead are human-like summaries that capture the key information and essence of the articles.

3.2. Data Preprocessing

One of the most significant challenges encountered in our research is the input limitation imposed by our models, which can only handle a maximum of 512 tokens (~360 words), like T5 and BERT, or 1024 tokens (~720 words), like GPT-2. The Transformer model can, in theory, accept input of any length, but its time and memory cost are quadratic with respect to the input length, so we decided to use a Transformer with a 512-token input. This poses a problem, since our dataset consists mainly of samples between 500 and 1500 tokens in length (Fig. 1). As a result, the models are inherently unable to capture the entirety of these longer instances, potentially leading to information loss and incomplete summarization. In order to deal with such long-form texts, we excluded outliers (above 2000 tokens) and truncated the articles, since the most relevant information is usually in the first part of the article. On the other hand, this may impact the overall coherence and effectiveness of the generated summaries.
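The outlier-exclusion and truncation step just described amounts to the following sketch, with token-id lists standing in for real tokenizer output and the thresholds taken from the text:

```python
def preprocess(examples, max_tokens=512, outlier_threshold=2000):
    """Drop over-long outliers and truncate the rest, keeping the head of
    each article (where the most relevant information usually sits)."""
    kept = []
    for tokens in examples:
        if len(tokens) > outlier_threshold:
            continue                      # exclude the outlier entirely
        kept.append(tokens[:max_tokens])  # head truncation
    return kept

# three hypothetical tokenized articles of 300, 1500 and 2500 tokens
docs = [list(range(300)), list(range(1500)), list(range(2500))]
out = preprocess(docs)
```

The 2500-token article is dropped as an outlier, the 1500-token one is cut to the first 512 tokens, and the short one passes through unchanged.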
Addressing this issue is crucial to ensure comprehensive and accurate abstractive text summarization on lengthy input samples.

In addition to the data truncation, we also adapt the input texts to the format required by each specific model, using two different methodologies: concatenation and teacher forcing. The generative model (GPT-2) uses only an input sequence and is trained on the task of next-word prediction, so we used the approach explored in the original GPT-2 paper. It consists of concatenating the article text with the summary, separated by a new separator token [TL;DR] added to the original tokenizer. In this way the model is trained on the relation between a text and its summary and, during prediction, the model uses the newly added token as the starting point of the summarization process.

The Transformer models and T5 need a pair of texts as input, one for the encoder and one for the decoder. Since the decoder input needs to be the output of the previous prediction, the training process would need one record for each token of each sentence of each summary in the dataset. This procedure is time-expensive and does not fully exploit the parallelization power of the decoder self-attention. To avoid it, the teacher forcing procedure uses the target output, shifted one position to the right, as the decoder input, instead of the previously produced output. The Transformer decoder, through the use of a causal mask, can only attend to tokens at previous positions during prediction, so the shift prevents the model from attending to the correct token given as input during prediction. No other text preprocessing is performed; the standard tokenizer for each model is applied.

Figure 1: T5 tokenizer (other tokenizers produce similar results).

3.3. Hyperparameters

The models have been trained using the Adam optimizer [9], with slightly different configurations defined after some preliminary experiments.
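For reference, a single Adam update for one scalar parameter can be sketched as follows; the β1, β2 and ε values are the common defaults, not necessarily the exact values used in our runs:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are the running first/second moment estimates; t is the
    1-based step counter used for bias correction.
    """
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1, lr=0.1)
```

At the first step the bias correction makes the update approximately lr * sign(grad), which is why Adam is robust to the raw gradient scale.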
We trained the basic Transformer with a learning rate of up to 0.0005, a linear warm-up of 6715 steps, normalization by the square root of the hidden size, and square-root decay. The same schedule has been applied to the initialized Transformer, but with a learning rate of up to 0.004 and 40k warm-up steps. For the T5 and GPT-2 models we used fixed learning rates of 0.0001 and 0.0003, respectively. The batch size is mainly driven by the dimension of the models: we used a batch size of 128 for the base Transformer, 128 for the initialized Transformer, 16 for the GPT-2 model and 128 for the T5 model. The number of epochs was set to 20 for all the models and limited with an early stopping strategy on the validation loss, with a patience of at most 4 epochs; in all cases early stopping ended the training before the 20th epoch.

4. Evaluation metric

The ROUGE score [10] was chosen for abstractive text summarization due to its common usage and ease of implementation. It allows a quantitative assessment by measuring the overlap between generated summaries and references. In particular, we adopted the ROUGE-2 F1 metric, as it is widely used in research to evaluate the quality and effectiveness of generated summaries by comparing them against human-generated references. The F1 score is a metric commonly used in binary classification tasks to measure a model's performance: it combines precision and recall into a single value, providing a balanced evaluation of the model's effectiveness.
$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

where Precision and Recall are calculated as follows:

$Precision = \frac{count_{match}(gram_2)}{count(gram_2)} = \frac{\text{number of 2-grams found in both model and reference}}{\text{number of 2-grams in the model summary}}$

$Recall = \frac{count_{match}(gram_2)}{count(gram_2)} = \frac{\text{number of 2-grams found in both model and reference}}{\text{number of 2-grams in the reference}}$

However, ROUGE has limitations, notably its bias towards extractive methods, its insensitivity to semantic quality, and its inability to evaluate coherence and fluency. Hence the need for human evaluation to assess these crucial aspects and provide a more holistic understanding of the model's performance in abstractive text summarization [11]. An example of this problem is provided below:

REFERENCE                                   | MODEL                                      | rouge2-f1 | rouge2-precision | rouge2-recall
The quick brown fox jumps over the lazy dog | The fast orange fox leaps on the slow dog  | 0         | 0                | 0
The quick brown fox jumps over the lazy dog | The quick orange fox leaps on the lazy dog | 0.375     | 0.375            | 0.375

5. Results and analysis

Hyperparameters
Hyperpar.  | Gpt-2 | T5  | Bert2Rand | Transformer
batch_size | 16    | 128 | 128       | 128
epochs     | 20    | 20  | 20        | 20

ROUGE F1 scores
Rouge | Gpt-2 | T5   | Bert2Rand | Transformer
1-F1  | 0.26  | 0.40 | 0.36      | 0.30
2-F1  | 0.08  | 0.19 | 0.15      | 0.11
L-F1  | 0.18  | 0.28 | 0.26      | 0.22

Data
Summary: Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer and buried in a field. "She's a true miracle dog and she deserves a good life," says Sara Mellado, who is looking for a home for Theia.
Article: (CNN) Never mind cats having nine lives. A stray pooch in Washington State has used up at least three of her own after being hit by a car, apparently whacked on the head with a hammer in a misguided mercy killing and then buried in a field – only to survive. That's according to Washington State University, where the dog – a friendly white-and-black bully breed mix now named Theia – has been receiving care at the Veterinary Teaching Hospital.
Four days after her apparent death, the dog managed to stagger to a nearby farm, dirt-covered and emaciated, where she was found by a worker who took her to a vet for help. She was taken in by Moses Lake, Washington, resident Sara Mellado. "Considering everything that she's been through, she's incredibly gentle and loving," Mellado said, according to WSU News. "She's a true miracle dog and she deserves a good life." Theia is only one year old but the dog's brush with death did not leave her unscathed. She suffered a dislocated jaw, leg injuries and a caved-in sinus cavity – and still requires surgery to help her breathe. The veterinary hospital's Good Samaritan Fund committee awarded some money to help pay for the dog's treatment, but Mellado has set up a fundraising page to help meet the remaining cost of the dog's care. She's also created a Facebook page to keep supporters updated. Donors have already surpassed the $10,000 target, inspired by Theia's tale of survival against the odds. On the fundraising page, Mellado writes, "She is in desperate need of extensive medical procedures to fix her nasal damage and reset her jaw. I agreed to foster her until she finally found a loving home." She is dedicated to making sure Theia gets the medical attention she needs, Mellado adds, and wants to "make sure she gets placed in a family where this will never happen to her again!" Any additional funds raised will be "paid forward" to help other animals. Theia is not the only animal to apparently rise from the grave in recent weeks. A cat in Tampa, Florida, found seemingly dead after he was hit by a car in January, showed up alive in a neighbor's yard five days after he was buried by his owner. The cat was in bad shape, with maggots covering open wounds on his body and a ruined left eye, but remarkably survived with the help of treatment from the Humane Society.
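The ROUGE-2 F1 computation from Section 4 can be sketched with plain bigram counting; the toy sentences below reproduce the example given there:

```python
from collections import Counter

def rouge2_f1(reference, candidate):
    """ROUGE-2 F1: clipped bigram overlap between candidate and reference."""
    def bigrams(text):
        toks = text.lower().split()
        return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]
    ref, cand = bigrams(reference), bigrams(candidate)
    matches = sum((Counter(ref) & Counter(cand)).values())  # clipped count
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "The quick brown fox jumps over the lazy dog"
print(rouge2_f1(ref, "The fast orange fox leaps on the slow dog"))   # → 0.0
print(rouge2_f1(ref, "The quick orange fox leaps on the lazy dog"))  # → 0.375
```

The first candidate preserves the meaning but shares no bigram with the reference and scores 0, while the second shares three of its eight bigrams and scores 0.375, illustrating the metric's bias towards surface overlap.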
EVALUATION (generated summaries)
Gpt-2: "A stray cat has apparently been hit by a car, apparently whacked on the head with a hammer. The dog is in need of extensive medical procedures to help the dog recover. She has a dislocated jaw, leg injuries and a caved-in sinus cavity."
T5: "A stray pooch in Washington State has used up at least three of her own after being hit by a car. The dog staggered to a nearby farm, dirt-covered and emaciated, where she was found. She still requires surgery to help her breathe."
Bert2Rand: "a stray pooch in washington state has been hit by a car and then buried in a field. the dog was found by a worker who took her to a vet for help. theia is in desperate need of extensive medical procedures to fix her nose."
Transformer: "stray pooch killed and buried in a field in washington state, washington. dog's brush with death did not leave her in a critical condition. dog's brush with death did not leave her unscathed."

5.1. Results analysis

The comparison of the resulting performances shows that:
• the use of a pre-trained LLM improves the performance of Transformer-based architectures on the text summarization task. In fact, the Transformer initialized using BERT improves on the basic Transformer architecture, while the T5 model (a full Transformer architecture, pre-trained) outperforms both of the previous ones;
• the large difference in performance between T5 and the two Transformer models can be attributed to the difference in pre-training dataset size (T5 used the bigger Common Crawl dataset with respect to the datasets used for BERT). Also, the initialized Transformer is pre-trained only on the encoder part, while T5 is pre-trained in its entirety;
• the GPT-2 model in this experiment performs poorly with respect to the other models. This is probably due to the limited fine-tuning that we could apply to this specific model with respect to the others, especially in terms of batch size.
This is mainly due to the dimension of the model: with its 1.5B parameters, it is harder, for computational cost reasons, to fine-tune with the same parameters as the other models, which have a dimension of around 200M parameters. We do not consider the results achieved by GPT-2 comparable to those of the other models; however, we left it in this analysis to show the limitations of training bigger models. Also, the texts produced by GPT-2 are still grammatically coherent and consistent with the textual content of the input, so they will be further analyzed outside the ROUGE evaluation, for comparison.

6. Conclusion

6.1. Known limitations

To conclude the analysis, we want to address the known limitations of the methodologies applied in these experiments. The results obtained rely heavily on the hyperparameters used for training, such as the batch size, the learning rate and optimizer, the number of epochs, etc. In order to obtain the best result for each of the models, a hyperparameter tuning strategy should be applied, instead of the preliminary experiments that we used in this analysis. Also, the GPT-2 training was not sufficient, as explained in the previous section. Both limitations were mainly driven by the limited resources available for this analysis.

6.2. Final Remarks

In this paper we proposed an empirical analysis of some Transformer-based models applied to the text summarization task. We selected four models: Transformer, Transformer initialized with BERT, GPT-2 and T5. We ran an experiment on the benchmark CNN/DailyMail dataset to evaluate empirically the performance of these models on the summarization task. The results show that, when trained using similar resources, the use of LLMs (T5, BERT) improves the ability of such models to perform the task. The analysis also explores the challenges and advantages of using such models for solving abstractive text summarization, showing the limitations encountered in a real-case scenario.

7.
Citations and Bibliographies

References

[1] M. F. Mridha, et al., A survey of automatic text summarization: Progress, process and challenges, IEEE Access 9 (2021) 156043–156070.
[2] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[3] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[4] J. Devlin, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] A. Radford, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[6] C. Raffel, et al., Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
[7] S. Rothe, S. Narayan, A. Severyn, Leveraging pre-trained checkpoints for sequence generation tasks, Transactions of the Association for Computational Linguistics 8 (2020) 264–280.
[8] R. Nallapati, et al., Abstractive text summarization using sequence-to-sequence RNNs and beyond, arXiv preprint arXiv:1602.06023 (2016).
[9] Z. Zhang, Improved Adam optimizer for deep neural networks, in: 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), IEEE, 2018.
[10] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004.
[11] N. Schluter, The limits of automatic summarisation according to ROUGE, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2017.