WITS: Wikipedia for Italian Text Summarization

Silvia Casola (1,2), Alberto Lavelli (2)
1. Università degli Studi di Padova
2. Fondazione Bruno Kessler
scasola@fbk.eu, lavelli@fbk.eu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Abstractive text summarization has recently improved its performance due to the use of sequence-to-sequence models. However, while these models are extremely data-hungry, datasets in languages other than English are few. In this work, we introduce WITS (Wikipedia for Italian Text Summarization), a large-scale dataset built exploiting Wikipedia articles' structure. WITS contains almost 700,000 Wikipedia articles, together with their human-written summaries. Compared to existing data for text summarization in Italian, WITS is more than an order of magnitude larger and more challenging given its lengthy sources. We explore WITS characteristics and present some baselines for future work.

Figure 1: The lead section (from Wikipedia's own page), which we consider as the article summary. We use the remainder of the article as the source.

1 Introduction

Automatic text summarization aims at condensing one or more source documents into a shorter output that contains their most salient information. The underlying task can be framed in two different manners: extractive summarizers select the most relevant segments from the input and produce a summary that is a concatenation of such segments; as a result, the output is a subset of the original text, which the summary follows verbatim. Abstractive summarizers, on the other hand, aim to encode the whole source into an internal representation from which they generate the summary; thus, they produce a new piece of text that condenses the source without necessarily using its vocabulary and expressions.

Recently, abstractive summarization has attracted growing interest in the Natural Language Processing (NLP) community. Sequence-to-sequence models have been increasingly used for the task, with pre-trained encoder-decoder transformers becoming the de facto state of the art for abstractive text summarization. Normally pre-trained in an unsupervised manner, these models are then fine-tuned in a supervised way on the downstream dataset; during fine-tuning, the model learns to generate the summary from the source document.

While various datasets for abstractive summarization exist for English, resources in other languages are limited. This paper introduces WITS (Wikipedia for Italian Text Summarization), a large-scale dataset for abstractive summarization in Italian, built exploiting Wikipedia. Taking advantage of the structure of Wikipedia pages, which contain a lead section (Figure 1) – giving an overview of the article's topic – followed by the full-length article – describing the topic in detail – we create a large and challenging dataset for abstractive summarization in Italian, which we will make publicly available.

WITS is particularly challenging, given its long sources and its high abstractiveness. In this paper, we describe the dataset, its statistics and characteristics, and report some preliminary experiments that might be used as baselines for future work.
This paper is organized as follows: in Section 2, we describe the state of the art in text summarization, focusing on resources for Italian. We then present the dataset and its related task (Section 3.1) and describe the data collection and preprocessing process in Sections 3.2 and 3.3. In Section 4, we show our results when summarizing the dataset using some existing extractive baseline models. Finally, we draw our conclusions in Section 5.

2 State of the Art

Automatic text summarization has recently attracted increasing attention from the NLP community. However, the majority of the research work still focuses on English.

As a matter of example, out of all the papers published at the Association for Computational Linguistics (ACL) conference in 2021, 46 explicitly refer to summarization in their title; 38 of these dealt with English only, while 7 presented experiments with one or more other languages (including 2 on source code summarization). For reference, only one paper (Mastronardo and Tamburini, 2019) on text summarization (in English) has been published at the Italian Conference on Computational Linguistics (CLiC-it) since its first edition, and none experimented with Italian.

In this section, we present the state of the art in abstractive text summarization. We first present the available datasets for the task; then, we discuss some relevant learning models. We focus on the significant gap between English and Italian, for which very few resources exist.

2.1 Datasets for Automatic Text Summarization

A typical dataset for text summarization is composed of some source documents (which need to be summarized) and their corresponding summaries, used as the gold standard. A minority of datasets (e.g., the DUC 2004 dataset [1]) provide multiple gold standards; however, such datasets tend to be small and are mostly used for evaluation.

In general, summaries exploit a human-written abstract. For example, the CNN/Daily Mail Corpus (Nallapati et al., 2016) [2] leverages the bullet-point summaries published on the newspapers' websites. A similar rationale is used in datasets constructed from scientific papers (Cohan et al., 2018) [3] or patents (Sharma et al., 2019) [4]. In contrast, Rush et al. (2015) [5] frame the task of news summarization as headline generation.

To the best of our knowledge, WikiLingua (Ladhak et al., 2020) [6] is the only summarization dataset that contains data in Italian. WikiLingua is a cross-lingual dataset for abstractive text summarization built on top of WikiHow, which contains tutorials on how to perform specific tasks in the form of step-by-step instructions. The dataset constructs a summary by concatenating the first sentence of each step and uses the remaining text as the source. WikiLingua contains data in 18 languages, including Italian (50,943 source-summary pairs). Both summaries and sources are relatively short (on average, 44 and 418 tokens, respectively, for the Italian split).

[1] https://duc.nist.gov/duc2004/
[2] https://huggingface.co/datasets/cnn_dailymail
[3] https://huggingface.co/datasets/arxiv_dataset
[4] https://huggingface.co/datasets/big_patent
[5] https://huggingface.co/datasets/gigaword
[6] https://huggingface.co/datasets/wiki_lingua
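For reference, the Italian split of WikiLingua can be inspected directly through the Hugging Face hub entry in footnote [6]. The snippet below is a minimal sketch: the "italian" configuration name and the column layout are taken from the dataset card and may differ across dataset versions.

```python
from datasets import load_dataset

# Assumption: the "wiki_lingua" dataset on the Hugging Face hub exposes an
# "italian" configuration, as described in its dataset card; inspect the
# features before relying on specific field names.
wikilingua_it = load_dataset("wiki_lingua", "italian", split="train")
print(wikilingua_it)           # number of rows and available columns
print(wikilingua_it.features)  # per-method section names, summaries, and source documents
```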
2.2 Models for Abstractive Text Summarization

Abstractive text summarization is one of the most challenging tasks in NLP: it requires understanding a very long input (encoding), finding its salient passages, and performing constrained text generation. Technically, models for abstractive text summarization are generally sequence-to-sequence: they encode the input and then generate the output through a neural network. While some previous work used Recurrent Neural Networks (Chung et al., 2014), with the possible addition of an encoder-decoder attention mechanism (Chopra et al., 2016), transformer models (Vaswani et al., 2017) have later become pervasive, following a similar trend in many other NLP areas. Using self-attention, these models have proved to be superior to Recurrent Neural Networks, as they can better deal with long dependencies, a critical issue in text summarization.

Following another recent trend in NLP, many summarization models use a transfer-learning approach: after a pre-training phase, in which they are trained in an unsupervised way on a huge amount of text, they are fine-tuned for the specific downstream task on a relatively limited amount of supervised data. Summarization models either exploit encoders and decoders previously trained for other tasks or are pre-trained from scratch with a specific objective tailored for summarization. Rothe et al. (2020), for example, leveraged previously existing pre-trained models (BERT, Devlin et al. (2019); RoBERTa, Liu et al. (2019); and GPT-2 [7], Radford et al. (2019)) as encoders or decoders of the sequence-to-sequence summarizer and showed large performance improvements with respect to random initialization. More recently, summarization models (Song et al., 2019; Lewis et al., 2020) have been pre-trained with objectives specific to Natural Language Generation tasks. For example, the authors of Pegasus (Zhang et al., 2020) used two objectives: Masked Language Model (Devlin et al., 2019), which has been widely used in previous work and consists in masking a percentage of the tokens in a text, later predicted using the context; and Gap Sentences Generation, a new pre-training objective in which a percentage of the original sentences are masked and the model needs to generate them according to the context.

Following a shared practice, most summarization models have first been trained and evaluated for English only. In some cases, a subsequent multilingual version of the model was also created (Xue et al., 2021). To the best of our knowledge, few sequence-to-sequence models in Italian exist to date [8], and while they might be fine-tuned for summarization, no full-scale evaluation has been performed yet.

[7] GPT-2 has also been adapted for Italian. See: De Mattei, L., Cafagna, M., Dell'Orletta, F., Nissim, M., & Guerini, M. 2020. GePpeTto Carves Italian into a Language Model. In CLiC-it 2020.
[8] See, for example, IT5-base (https://huggingface.co/gsarti/it5-base).
3 WITS

3.1 Task and Rationale

Given a Wikipedia article, we extract the lead section (which we sometimes refer to as "Summary" in the remainder of the paper) and propose the following task:

Given all of an article's sections, summarize their content to produce its lead section.

The task is rather natural given the structure of Wikipedia pages. According to the Wikipedia Manual of Style [9], the lead section is, in fact, a high-quality summary of the body of the article. The lead "serves as an introduction to the article and a summary of its most important contents" and "gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows". Moreover, it should "stand on its own as a concise overview of the article's topic".

As for the content, according to Wikipedia, the lead must define the topic, explaining its importance and the relevant context; then, it must summarize the most prominent points of the article, emphasizing the most important material.

Moreover, the lead should only cover information that is contained in the article: "significant information should not appear in the lead if it is not covered in the remainder of the article". This is particularly relevant for abstractive summarization, as models are more prone to produce summaries that are not faithful to the source (often called hallucinations) when they are trained to generate summaries containing information not in the source (Nan et al., 2021). The problem of factuality in abstractive summarization is currently an active area of research, as previous work has shown that up to 30% of generated summaries contain non-factual information (Cao et al., 2018).

Linguistically, the lead "should be written in a clear, accessible style with a neutral point of view". It is worth noting that, in contrast to WikiLingua, where the summary is constructed as a concatenation of sentences from different parts of the article, the summary in WITS is a stand-alone piece of text with a coherent discourse structure.

3.2 Data Collection

This section describes the process of data collection and preprocessing.

We downloaded the latest XML dump of Wikipedia in Italian [10], which contains text only, and used Python and the Gensim library to process the file [11]. The original number of documents was 1,454,884. We applied the following exclusion criteria: we removed pages whose title contains numbers only (as they mostly describe years and contain lists of events and references), lists (titles starting with "Lista d"), pages whose summary is shorter than 80 characters, and pages for which the article is less than 1.5 times longer than the lead.

We then preprocessed the text in the following way. From the summary, we removed the content of parentheses, as they often contain alternative names or names in a different language, which cannot be inferred from the article. From the article, we further excluded the following sections, which are not relevant for our task: Note (Footnotes), Bibliografia (References), Voci correlate (See also), Altri progetti (Other projects), Collegamenti esterni (External links), Galleria di Immagini (Images).

[9] https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style
[10] https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
[11] https://radimrehurek.com/gensim/scripts/segment_wiki.html
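As an illustration of this pipeline, the sketch below streams a dump segmented with Gensim's segment_wiki script [11] and applies the exclusion criteria and preprocessing steps described above. It is a minimal reconstruction under the assumptions stated in the comments (Gensim's JSON output format and its "Introduction" pseudo-section for the lead), not the exact script used to build WITS.

```python
# Minimal sketch of the collection and filtering step. It assumes the dump has
# first been segmented with Gensim, e.g.:
#   python -m gensim.scripts.segment_wiki -f itwiki-latest-pages-articles.xml.bz2 -o itwiki.json.gz
# which emits one JSON object per article with "title", "section_titles", and
# "section_texts" fields, and stores the lead under the pseudo-title "Introduction".
import gzip
import json
import re

EXCLUDED_SECTIONS = {"Note", "Bibliografia", "Voci correlate",
                     "Altri progetti", "Collegamenti esterni",
                     "Galleria di Immagini"}

def iter_pairs(path="itwiki.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            title = page["title"]
            if title.isdigit() or title.startswith("Lista d"):
                continue  # year pages and lists
            sections = dict(zip(page["section_titles"], page["section_texts"]))
            summary = sections.pop("Introduction", "").strip()
            summary = re.sub(r"\([^)]*\)", "", summary)  # drop parenthesised aliases
            source = "\n".join(text for name, text in sections.items()
                               if name not in EXCLUDED_SECTIONS).strip()
            # Length-based exclusion criteria described in Section 3.2
            # (here applied at the character level).
            if len(summary) < 80 or len(source) < 1.5 * len(summary):
                continue
            yield {"title": title, "summary": summary, "source": source}
```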
3.3 Dataset Statistics

Table 1 shows some statistics on the dataset and compares WITS with the Italian split of WikiLingua (which we will refer to as IT-WikiLingua).

IT-WikiLingua contains documents from 17,673 WikiHow pages, but some of these pages describe more than one method related to the same topic. For example, the page "How to Reduce the Redness of Sunburn" contains several methods: "Healing and Concealing Sunburns", "Lessening Your Pain and Discomfort", and "Preventing a Sunburn". We consider distinct methods as separate documents, as they can be summarized in isolation. Notice that WITS is more than an order of magnitude larger than IT-WikiLingua.

                      WITS                    IT-WikiLingua
                      Summary    Source       Summary    Source
# docs                699,426                 50,943
# sentences (avg)     3.75       33.33        5.01       23.52
# tokens (avg)        70.93      956.66       23.52      418.6
Comp. ratio (avg)     16.14                   11.67

Table 1: Dataset statistics. spaCy is used for text and sentence tokenization. The number of tokens and sentences is computed for all documents and then averaged.

We computed the number of tokens and the number of sentences with the spaCy it_core_news_lg model [12]. Compared to IT-WikiLingua, documents in WITS contain more tokens both in their summary and in their source (which is more than double in length), making the dataset particularly challenging. Note that the sentences are also longer (and thus more complex) on average. For example, summaries in WITS contain on average less than 4 sentences but more than 70 words; in contrast, IT-WikiLingua's summaries consist of more than 5 sentences but contain on average 44 tokens. Not surprisingly, WITS' compression ratio is larger than IT-WikiLingua's and very high in absolute value. Finally, we also notice that the dataset is very rich in named entities: Table 2 reports the named entities extracted with spaCy from WITS and IT-WikiLingua.

                      WITS                    IT-WikiLingua
                      Summary    Source       Summary    Source
PER (avg)             1.13       26.21        0.32       1.05
LOC (avg)             2.03       24.07        0.42       1.39
ORG (avg)             0.60       6.65         0.68       0.37
MISC (avg)            19.68      19.68        0.84       3.07
All (avg)             23.44      76.61        1.65       5.88

Table 2: Named Entities in WITS and IT-WikiLingua.

[12] https://spacy.io/models/it
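The per-document counts behind Tables 1 and 2 can be reproduced with spaCy. The sketch below is a minimal illustration, assuming the (summary, source) pairs built in Section 3.2 and the it_core_news_lg model (installed with "python -m spacy download it_core_news_lg").

```python
# Minimal sketch of the per-document statistics (token, sentence, and named
# entity counts) computed with spaCy's it_core_news_lg model. Counts are
# computed per document and then averaged, as in Tables 1 and 2.
from collections import Counter
import spacy

nlp = spacy.load("it_core_news_lg")

def document_stats(text):
    doc = nlp(text)
    return {
        "tokens": len(doc),
        "sentences": len(list(doc.sents)),
        # The Italian spaCy models use the PER, LOC, ORG, and MISC entity labels.
        "entities": Counter(ent.label_ for ent in doc.ents),
    }

stats = document_stats("Padova è un comune italiano, capoluogo dell'omonima provincia in Veneto.")
print(stats["tokens"], stats["sentences"], dict(stats["entities"]))
```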
4 Baselines

We tested some preliminary baseline methods on the dataset; results are reported in Table 3. The non-neural methods are unsupervised: we obtained the summary from the source without supervision and then used the lead as the gold standard for evaluation. We evaluated the summaries using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004), an n-gram-based, recall-oriented metric for summary quality evaluation. Following previous work (Lloret et al., 2018), we report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) (recall).

We considered the following baselines:

Lead-3. We extract the first three sentences from the source. Previous work has shown that this baseline is often hard to beat (See et al., 2017), especially in news summarization, where articles follow an "inverted pyramid" structure and tend to report the most important content at the start.

TextRank (Mihalcea and Tarau, 2004). TextRank is an unsupervised algorithm that extracts the most relevant sentences from the source. The algorithm constructs a graph with sentences as nodes and sentence similarity (in terms of shared vocabulary) as edges. The sentences are then ranked using the PageRank (Page et al., 1999) algorithm.

LexRank (Erkan and Radev, 2004). LexRank works in a similar way to TextRank. However, instead of computing sentence similarity on normalized shared vocabulary, it uses the cosine similarity of TF-IDF vectors.

SumBasic (Nenkova and Vanderwende, 2005). SumBasic extracts sentences based on their word probabilities. Specifically, it scores each sentence as the mean of the probabilities of the words it contains (based on their frequency in the document). Iteratively, the sentence with the best score among those containing the most probable word is chosen. The probability of the words in the chosen sentence is then squared to limit redundancy.

IT5-small (Raffel et al., 2020). The Text-to-Text Transfer Transformer (T5) is a pre-trained sequence-to-sequence language model that treats both input and output as text strings; the rationale is to use the same model for all NLP tasks, unifying them under the sequence-to-sequence framework. We use a small version of the original model (60 million parameters) [13], pretrained on Clean Italian mC4 [14], the Italian split of the multilingual cleaned version of Common Crawl's corpus (mC4) (Raffel et al., 2020). We extracted 10,000 summary-source pairs from the dataset for the validation set and 10,000 for the test set, and trained the model on the rest of the data for 100,000 steps; this accounts for around 30% of the training data. We trained on two GeForce RTX 2080 GPUs and kept the batch size per GPU at 1. We limited the summary length to 75 tokens and the source text length to 1000 tokens.

[13] https://huggingface.co/gsarti/it5-small
[14] https://huggingface.co/datasets/gsarti/clean_mc4_it
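To make the evaluation protocol concrete, the sketch below runs the unsupervised baselines and computes ROUGE recall against the lead. It is a hedged illustration rather than the exact configuration used for Table 3: it relies on the sumy and rouge-score packages (and NLTK's Italian sentence tokenizer), and the three-sentence output length is an arbitrary choice.

```python
# Hedged sketch of the unsupervised baselines and the ROUGE evaluation.
# Requires: pip install sumy rouge-score nltk  (plus NLTK's "punkt" data for the
# Italian sentence tokenizer). Note that rouge-score's default tokenizer is
# English-oriented, so accented Italian characters may deserve a custom tokenizer.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer
from rouge_score import rouge_scorer

SUMMARIZERS = {
    "TextRank": TextRankSummarizer(),
    "LexRank": LexRankSummarizer(),
    "SumBasic": SumBasicSummarizer(),
}
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

def summarize_all(source, n_sentences=3):
    parser = PlaintextParser.from_string(source, Tokenizer("italian"))
    outputs = {
        # Lead-3: simply the first three sentences of the source.
        "Lead-3": " ".join(str(s) for s in parser.document.sentences[:3])
    }
    for name, summarizer in SUMMARIZERS.items():
        outputs[name] = " ".join(str(s) for s in summarizer(parser.document, n_sentences))
    return outputs

def rouge_recall(reference, outputs):
    # The paper reports ROUGE recall; rouge-score exposes it per metric.
    return {name: {metric: score.recall
                   for metric, score in scorer.score(reference, candidate).items()}
            for name, candidate in outputs.items()}
```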
            R-1     R-2     R-L
Lead-3      24.76   5.54    16.54
TextRank    30.20   6.57    19.67
LexRank     26.90   5.91    17.52
SumBasic    20.60   4.80    14.01
IT5-small   21.58   9.69    19.34

Table 3: ROUGE results on WITS.

Results show that the Lead-3 baseline performance is low; this is likely due to the structure of Wikipedia articles, which contain several thematic sections without a general introduction outside the lead section. Extracting the first sentence(s) from each section would likely produce better results and could be investigated in future work.

In contrast, TextRank is the best non-neural baseline, with a ROUGE-2 score of 6.57; LexRank performs comparably. SumBasic scores even lower than the Lead-3 baseline, suggesting that a purely frequency-based approach is insufficient given the dataset's complexity.

Finally, the neural baseline achieves the best results in terms of ROUGE-2, even though it is relatively small and likely severely under-trained, since only around 30% of the data were used for fine-tuning due to computational constraints. This suggests that sequence-to-sequence neural models have great potential on this dataset and should be investigated further in future work. Surprisingly, however, its ROUGE-1 results are below those of most of the other baselines; future work should investigate this discrepancy.
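As a starting point for such future work, the sketch below shows one possible fine-tuning setup for the neural baseline with the Hugging Face transformers library. It is a minimal illustration under the assumptions noted in the comments (in particular, the "source" and "summary" column names of the WITS splits are hypothetical placeholders), not the training code used for Table 3.

```python
# Hedged sketch of a fine-tuning setup for the IT5-small baseline with Hugging
# Face transformers. Hyper-parameters mirror those reported in Section 4:
# source limited to 1000 tokens, summary to 75, batch size 1 per device,
# 100,000 training steps.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "gsarti/it5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Placeholder data standing in for the real WITS train/validation splits.
train_set = Dataset.from_dict({"source": ["testo completo dell'articolo"],
                               "summary": ["sezione iniziale dell'articolo"]})
val_set = train_set

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=1000, truncation=True)
    # For T5-style models, targets are tokenized with the same tokenizer.
    model_inputs["labels"] = tokenizer(batch["summary"], max_length=75,
                                       truncation=True)["input_ids"]
    return model_inputs

train_tok = train_set.map(preprocess, batched=True, remove_columns=train_set.column_names)
val_tok = val_set.map(preprocess, batched=True, remove_columns=val_set.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="it5-small-wits",
    per_device_train_batch_size=1,
    max_steps=100_000,
    predict_with_generate=True,
    generation_max_length=75,
    logging_steps=1_000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```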
5 Conclusions

We have presented WITS, the first large-scale dataset for abstractive summarization in Italian. We have exploited the structure of Wikipedia articles to build a challenging, non-technical dataset with high-quality human-written abstracts. Given the lengthy source documents, the short summaries, and the short extractive fragments, the dataset calls for an abstractive approach. In the paper, we have explored some standard non-neural extractive baselines and a neural abstractive baseline; future work will investigate further neural baselines for the dataset. Moreover, given the structure of Wikipedia, the dataset can be easily extended by applying the procedure described in the paper to more languages, including low-resource ones. We are confident that research in summarization in languages other than English will become more active in the near future and hope that WITS can be a valuable step in this direction.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California, June. Association for Computational Linguistics.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana, June. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479, December.

Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for multilingual abstractive summarization. In Findings of EMNLP, 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Elena Lloret, Laura Plaza, and Ahmet Aker. 2018. The challenging task of summary evaluation: An overview. Language Resources and Evaluation, 52(1).

C. Mastronardo and F. Tamburini. 2019. Enhancing a text summarization system with ELMo. In CLiC-it.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain, July. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, August. Association for Computational Linguistics.
Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. 2021. Entity-level factual consistency of abstractive text summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, Online, April. Association for Computational Linguistics.

Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Technical report, Microsoft Research.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July. Association for Computational Linguistics.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy, July. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In NAACL.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.