WITS: Wikipedia for Italian Text Summarization

Silvia Casola (1,2), Alberto Lavelli (2)
1. Università degli Studi di Padova
2. Fondazione Bruno Kessler
scasola@fbk.eu, lavelli@fbk.eu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Abstractive text summarization has recently improved its performance due to the use of sequence-to-sequence models. However, while these models are extremely data-hungry, datasets in languages other than English are few. In this work, we introduce WITS (Wikipedia for Italian Text Summarization), a large-scale dataset built exploiting Wikipedia articles' structure. WITS contains almost 700,000 Wikipedia articles, together with their human-written summaries. Compared to existing data for text summarization in Italian, WITS is more than an order of magnitude larger and more challenging given its lengthy sources. We explore WITS characteristics and present some baselines for future work.

Figure 1: The lead section (from Wikipedia's own page), which we consider as the article summary. We use the remainder of the article as the source.

1 Introduction

Automatic text summarization aims at condensing one or more source documents into a shorter output that contains their most salient information. The underlying task can be framed in two different manners: extractive summarizers select the most relevant segments from the input and produce a summary that is a concatenation of such segments; as a result, the output is a subset of the original text, which the summary follows verbatim. Abstractive summarizers, on the other hand, aim to encode the whole source into an internal representation from which they generate the summary; thus, they produce a new piece of text that condenses the source without necessarily using its vocabulary and expressions.

Recently, abstractive summarization has attracted growing interest in the Natural Language Processing (NLP) community. Sequence-to-sequence models have been increasingly used for the task, with pre-trained encoder-decoder transformers becoming the de facto state of the art for abstractive text summarization. Normally pre-trained in an unsupervised manner, these models are then fine-tuned in a supervised way on the downstream dataset; during fine-tuning, the model learns to generate the summary from the source document.

While various datasets for abstractive summarization exist for English, resources in other languages are limited. This paper introduces WITS (Wikipedia for Italian Text Summarization), a large-scale dataset for abstractive summarization in Italian, built exploiting Wikipedia. Taking advantage of the structure of Wikipedia pages, which contain a lead section (Figure 1) – giving an overview of the article's topic – followed by the full-length article – describing the topic in detail – we create a large and challenging dataset for abstractive summarization in Italian, which we will make publicly available.

WITS is particularly challenging, given its long sources and its high abstractiveness. In this paper, we describe the dataset, its statistics and characteristics, and report some preliminary experiments that might be used as baselines for future work.
This paper is organized as follows: in Section 2, we describe the state of the art in text summarization, focusing on resources for Italian. We then present the dataset and its related task (Section 3.1) and describe the data collection and preprocessing process in Sections 3.2 and 3.3. In Section 4, we show our results when summarizing the dataset using some existing extractive baseline models. Finally, we draw our conclusions in Section 5.

2 State of the Art

Automatic text summarization has recently attracted increasing attention from the NLP community. However, the majority of the research work still focuses on English.

As a matter of example, out of all the papers published at the Association for Computational Linguistics (ACL) conference in 2021, 46 explicitly refer to summarization in their title; 38 of these dealt with English only, while 7 presented experiments with one or more other languages (including 2 on source code summarization). For reference, only one paper (Mastronardo and Tamburini, 2019) on text summarization (in English) has been published at the Italian Conference on Computational Linguistics (CLiC-it) since its first edition, and none experimented with Italian.

In this section, we present the state of the art in abstractive text summarization. We first present the available datasets for the task; then, we discuss some relevant learning models. We focus on the significant gap between English and Italian, for which very few resources exist.

2.1 Datasets for Automatic Text Summarization

A typical dataset for text summarization is composed of some source documents (which need to be summarized) and their corresponding summaries, used as the gold standard. A minority of datasets (e.g., the DUC 2004 dataset [1]) provide multiple gold standards; however, such datasets tend to be small and are mostly used for evaluation.

In general, summaries exploit a human-written abstract. For example, the CNN/Daily Mail Corpus (Nallapati et al., 2016) [2] leverages the bullet-point summaries published on the newspapers' websites. A similar rationale is used in datasets constructed from scientific papers (Cohan et al., 2018) [3] or patents (Sharma et al., 2019) [4]. In contrast, Rush et al. (2015) [5] frame the task of news summarization as headline generation.

To the best of our knowledge, WikiLingua (Ladhak et al., 2020) [6] is the only summarization dataset that contains data in Italian. WikiLingua is a cross-lingual dataset for abstractive text summarization built on top of WikiHow, which contains tutorials on how to perform specific tasks in the form of step-by-step instructions. The dataset constructs a summary by concatenating the first sentence of each step and uses the remaining text as the source. WikiLingua contains data in 18 languages, including Italian (50,943 source-summary pairs). Both summaries and sources are relatively short (on average, 44 and 418 tokens, respectively, for the Italian split).

[1] https://duc.nist.gov/duc2004/
[2] https://huggingface.co/datasets/cnn_dailymail
[3] https://huggingface.co/datasets/arxiv_dataset
[4] https://huggingface.co/datasets/big_patent
[5] https://huggingface.co/datasets/gigaword
[6] https://huggingface.co/datasets/wiki_lingua
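For reference, the Italian split of WikiLingua can be inspected directly through the Hugging Face hub entry in footnote [6]. The snippet below is a minimal sketch: the "italian" configuration name and the column layout are taken from the dataset card and may differ across dataset versions.

```python
from datasets import load_dataset

# Assumption: the "wiki_lingua" dataset on the Hugging Face hub exposes an
# "italian" configuration, as described in its dataset card; inspect the
# features before relying on specific field names.
wikilingua_it = load_dataset("wiki_lingua", "italian", split="train")
print(wikilingua_it)           # number of rows and available columns
print(wikilingua_it.features)  # per-method section names, summaries, and source documents
```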
2.2 Models for Abstractive Text Summarization

Abstractive text summarization is one of the most challenging tasks in NLP: it requires understanding a very long input (encoding), finding its salient passages, and performing constrained text generation. Technically, models for abstractive text summarization are generally sequence-to-sequence: they encode the input and then generate the output through a neural network. While some previous work used Recurrent Neural Networks (Chung et al., 2014), with the possible addition of an encoder-decoder attention mechanism (Chopra et al., 2016), transformer models (Vaswani et al., 2017) have later become pervasive, following a similar trend in many other NLP areas. Using self-attention, these models have proved to be superior to Recurrent Neural Networks, as they can better deal with long dependencies, a critical issue in text summarization.

Following another recent trend in NLP, many summarization models use a transfer-learning approach: after a pre-training phase, in which they are trained in an unsupervised way on a huge amount of text, they are fine-tuned for the specific downstream task on a relatively limited amount of supervised data. Summarization models either exploit encoders and decoders previously trained for other tasks or are pre-trained from scratch with a specific objective tailored for summarization. Rothe et al. (2020), for example, leveraged previously existing pre-trained models (BERT, Devlin et al. (2019); RoBERTa, Liu et al. (2019); and GPT-2 [7], Radford et al. (2019)) as encoders or decoders of the sequence-to-sequence summarizer and showed large performance improvements with respect to random initialization. More recently, summarization models (Song et al., 2019; Lewis et al., 2020) have been pre-trained with objectives specific to Natural Language Generation tasks. For example, the authors of Pegasus (Zhang et al., 2020) used two objectives: Masked Language Model (Devlin et al., 2019), which has been widely used in previous work and consists in masking a percentage of the tokens in a text, later predicted using the context; and Gap Sentences Generation, a new pre-training objective in which a percentage of the original sentences are masked and the model needs to generate them according to the context.

Following a shared practice, most summarization models have first been trained and evaluated for English only. In some cases, a subsequent multilingual version of the model was also created (Xue et al., 2021). To the best of our knowledge, few sequence-to-sequence models in Italian exist to date [8], and while they might be fine-tuned for summarization, no full-scale evaluation has been performed yet.

[7] GPT-2 has also been adapted for Italian. See: De Mattei, L., Cafagna, M., Dell'Orletta, F., Nissim, M., & Guerini, M. 2020. GePpeTto Carves Italian into a Language Model. In CLiC-it 2020.
[8] See, for example, IT5-base (https://huggingface.co/gsarti/it5-base).
3 WITS

3.1 Task and Rationale

Given a Wikipedia article, we extract the lead section (which we sometimes refer to as "Summary" in the remainder of the paper) and propose the following task:

Given all of an article's sections, summarize their content to produce its lead section.

The task is rather natural given the structure of Wikipedia pages. According to the Wikipedia Manual of Style [9], the lead section is, in fact, a high-quality summary of the body of the article. The lead "serves as an introduction to the article and a summary of its most important contents" and "gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows". Moreover, it should "stand on its own as a concise overview of the article's topic".

As for the content, according to Wikipedia, the lead must define the topic, explaining its importance and the relevant context; then, it must summarize the most prominent points of the article, emphasizing the most important material.

Moreover, the lead should only cover information that is contained in the article: "significant information should not appear in the lead if it is not covered in the remainder of the article". This is particularly relevant for abstractive summarization, as models are more prone to produce summaries that are not faithful to the source (often called hallucinations) when they are trained to generate summaries containing information not in the source (Nan et al., 2021). The problem of factuality in abstractive summarization is currently an active area of research, as previous work has shown that up to 30% of generated summaries contain non-factual information (Cao et al., 2018).

Linguistically, the lead "should be written in a clear, accessible style with a neutral point of view". It is worth noting that, in contrast to WikiLingua, where the summary is constructed as a concatenation of sentences from different parts of the article, the summary in WITS is a stand-alone piece of text with a coherent discourse structure.

3.2 Data Collection

This section describes the process of data collection and preprocessing.

We downloaded the latest XML dump of Wikipedia in Italian [10], which contains text only, and used Python and the Gensim library to process the file [11]. The original number of documents was 1,454,884. We applied the following exclusion criteria: we removed pages whose title contains numbers only (as they mostly describe years and contain lists of events and references), lists (titles starting with "Lista d"), pages whose summary is shorter than 80 characters, and pages for which the article is less than 1.5 times longer than the lead.

We then preprocessed the text in the following way. From the summary, we removed the content of parentheses, as they often contain alternative names or names in a different language, which cannot be inferred from the article. From the article, we further excluded the following sections, which are not relevant for our task: Note (Footnotes), Bibliografia (References), Voci correlate (See also), Altri progetti (Other projects), Collegamenti esterni (External links), Galleria di Immagini (Images).

[9] https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style
[10] https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
[11] https://radimrehurek.com/gensim/scripts/segment_wiki.html
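As an illustration of this pipeline, the sketch below streams a dump segmented with Gensim's segment_wiki script [11] and applies the exclusion criteria and preprocessing steps described above. It is a minimal reconstruction under the assumptions stated in the comments (Gensim's JSON output format and its "Introduction" pseudo-section for the lead), not the exact script used to build WITS.

```python
# Minimal sketch of the collection and filtering step. It assumes the dump has
# first been segmented with Gensim, e.g.:
#   python -m gensim.scripts.segment_wiki -f itwiki-latest-pages-articles.xml.bz2 -o itwiki.json.gz
# which emits one JSON object per article with "title", "section_titles", and
# "section_texts" fields, and stores the lead under the pseudo-title "Introduction".
import gzip
import json
import re

EXCLUDED_SECTIONS = {"Note", "Bibliografia", "Voci correlate",
                     "Altri progetti", "Collegamenti esterni",
                     "Galleria di Immagini"}

def iter_pairs(path="itwiki.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            title = page["title"]
            if title.isdigit() or title.startswith("Lista d"):
                continue  # year pages and lists
            sections = dict(zip(page["section_titles"], page["section_texts"]))
            summary = sections.pop("Introduction", "").strip()
            summary = re.sub(r"\([^)]*\)", "", summary)  # drop parenthesised aliases
            source = "\n".join(text for name, text in sections.items()
                               if name not in EXCLUDED_SECTIONS).strip()
            # Length-based exclusion criteria described in Section 3.2
            # (here applied at the character level).
            if len(summary) < 80 or len(source) < 1.5 * len(summary):
                continue
            yield {"title": title, "summary": summary, "source": source}
```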
3.3 Dataset Statistics

Table 1 shows some statistics on the dataset and compares WITS with the Italian split of WikiLingua (which we will refer to as IT-WikiLingua).

IT-WikiLingua contains documents from 17,673 WikiHow pages, but some of these pages describe more than one method related to the same topic. For example, the page "How to Reduce the Redness of Sunburn" contains several methods: "Healing and Concealing Sunburns", "Lessening Your Pain and Discomfort", and "Preventing a Sunburn". We consider distinct methods as separate documents, as they can be summarized in isolation. Notice that WITS is more than an order of magnitude larger than IT-WikiLingua.

                      WITS                    IT-WikiLingua
                      Summary    Source       Summary    Source
# docs                699,426                 50,943
# sentences (avg)     3.75       33.33        5.01       23.52
# tokens (avg)        70.93      956.66       23.52      418.6
Comp. ratio (avg)     16.14                   11.67

Table 1: Dataset statistics. spaCy is used for text and sentence tokenization. The number of tokens and sentences is computed for all documents and then averaged.

We computed the number of tokens and the number of sentences with the spaCy it_core_news_lg model [12]. Compared to IT-WikiLingua, documents in WITS contain more tokens both in their summary and in their source (which is more than double in length), making the dataset particularly challenging. Note that the sentences are also longer (and thus more complex) on average. For example, summaries in WITS contain on average less than 4 sentences but more than 70 words; in contrast, IT-WikiLingua's summaries consist of more than 5 sentences but contain on average 44 tokens. Not surprisingly, WITS' compression ratio is larger than IT-WikiLingua's and very high in absolute value. Finally, we also notice that the dataset is very rich in named entities: Table 2 reports the named entities extracted with spaCy from WITS and IT-WikiLingua.

                      WITS                    IT-WikiLingua
                      Summary    Source       Summary    Source
PER (avg)             1.13       26.21        0.32       1.05
LOC (avg)             2.03       24.07        0.42       1.39
ORG (avg)             0.60       6.65         0.68       0.37
MISC (avg)            19.68      19.68        0.84       3.07
All (avg)             23.44      76.61        1.65       5.88

Table 2: Named Entities in WITS and IT-WikiLingua.

[12] https://spacy.io/models/it
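The per-document counts behind Tables 1 and 2 can be reproduced with spaCy. The sketch below is a minimal illustration, assuming the (summary, source) pairs built in Section 3.2 and the it_core_news_lg model (installed with "python -m spacy download it_core_news_lg").

```python
# Minimal sketch of the per-document statistics (token, sentence, and named
# entity counts) computed with spaCy's it_core_news_lg model. Counts are
# computed per document and then averaged, as in Tables 1 and 2.
from collections import Counter
import spacy

nlp = spacy.load("it_core_news_lg")

def document_stats(text):
    doc = nlp(text)
    return {
        "tokens": len(doc),
        "sentences": len(list(doc.sents)),
        # The Italian spaCy models use the PER, LOC, ORG, and MISC entity labels.
        "entities": Counter(ent.label_ for ent in doc.ents),
    }

stats = document_stats("Padova è un comune italiano, capoluogo dell'omonima provincia in Veneto.")
print(stats["tokens"], stats["sentences"], dict(stats["entities"]))
```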
4 Baselines

We tested some preliminary baseline methods on the dataset; results are reported in Table 3. The non-neural methods are unsupervised: we obtained the summary from the source without supervision and then used the lead as the gold standard for evaluation. We evaluated the summaries using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004), an n-gram-based, recall-oriented metric for summary quality evaluation. Following previous work (Lloret et al., 2018), we report ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) (recall).

We considered the following baselines:

Lead-3. We extract the first three sentences from the source. Previous work has shown that this baseline is often hard to beat (See et al., 2017), especially in news summarization, where articles follow an "inverted pyramid" structure and tend to report the most important content at the start.

TextRank (Mihalcea and Tarau, 2004). TextRank is an unsupervised algorithm that extracts the most relevant sentences from the source. The algorithm constructs a graph with sentences as nodes and sentence similarity (in terms of shared vocabulary) as edges. The sentences are then ranked using the PageRank (Page et al., 1999) algorithm.

LexRank (Erkan and Radev, 2004). LexRank works in a similar way to TextRank. However, instead of computing sentence similarity on normalized shared vocabulary, it uses the cosine similarity of TF-IDF vectors.

SumBasic (Nenkova and Vanderwende, 2005). SumBasic extracts sentences based on their word probabilities. Specifically, it scores each sentence as the mean of the probabilities of the words it contains (based on their frequency in the document). Iteratively, the sentence with the best score among those containing the most probable word is chosen. The probability of the words in the chosen sentence is then squared to limit redundancy.

IT5-small (Raffel et al., 2020). The Text-to-Text Transfer Transformer (T5) is a pre-trained sequence-to-sequence language model that treats both input and output as text strings; the rationale is to use the same model for all NLP tasks, unifying them under the sequence-to-sequence framework. We use a small version of the original model (60 million parameters) [13], pretrained on Clean Italian mC4 [14], the Italian split of the multilingual cleaned version of Common Crawl's corpus (mC4) (Raffel et al., 2020). We extracted 10,000 summary-source pairs from the dataset for the validation set and 10,000 for the test set, and trained the model on the rest of the data for 100,000 steps; this accounts for around 30% of the training data. We trained on two GeForce RTX 2080 GPUs and kept the batch size per GPU at 1. We limited the summary length to 75 tokens and the source text length to 1000 tokens.

[13] https://huggingface.co/gsarti/it5-small
[14] https://huggingface.co/datasets/gsarti/clean_mc4_it
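To make the evaluation protocol concrete, the sketch below runs the unsupervised baselines and computes ROUGE recall against the lead. It is a hedged illustration rather than the exact configuration used for Table 3: it relies on the sumy and rouge-score packages (and NLTK's Italian sentence tokenizer), and the three-sentence output length is an arbitrary choice.

```python
# Hedged sketch of the unsupervised baselines and the ROUGE evaluation.
# Requires: pip install sumy rouge-score nltk  (plus NLTK's "punkt" data for the
# Italian sentence tokenizer). Note that rouge-score's default tokenizer is
# English-oriented, so accented Italian characters may deserve a custom tokenizer.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.sum_basic import SumBasicSummarizer
from rouge_score import rouge_scorer

SUMMARIZERS = {
    "TextRank": TextRankSummarizer(),
    "LexRank": LexRankSummarizer(),
    "SumBasic": SumBasicSummarizer(),
}
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

def summarize_all(source, n_sentences=3):
    parser = PlaintextParser.from_string(source, Tokenizer("italian"))
    outputs = {
        # Lead-3: simply the first three sentences of the source.
        "Lead-3": " ".join(str(s) for s in parser.document.sentences[:3])
    }
    for name, summarizer in SUMMARIZERS.items():
        outputs[name] = " ".join(str(s) for s in summarizer(parser.document, n_sentences))
    return outputs

def rouge_recall(reference, outputs):
    # The paper reports ROUGE recall; rouge-score exposes it per metric.
    return {name: {metric: score.recall
                   for metric, score in scorer.score(reference, candidate).items()}
            for name, candidate in outputs.items()}
```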
            R-1     R-2     R-L
Lead-3      24.76   5.54    16.54
TextRank    30.20   6.57    19.67
LexRank     26.90   5.91    17.52
SumBasic    20.60   4.80    14.01
IT5-small   21.58   9.69    19.34

Table 3: ROUGE results on WITS.

Results show that the Lead-3 baseline performance is low; this is likely due to the structure of Wikipedia articles, which contain several thematic sections without a general introduction outside the lead section. Extracting the first sentence(s) from each section would likely produce better results and could be investigated in future work.

In contrast, TextRank is the best non-neural baseline, with a ROUGE-2 score of 6.57; LexRank performs comparably. SumBasic scores even lower than the Lead-3 baseline, suggesting that a purely frequency-based approach is insufficient given the dataset's complexity.

Finally, the neural baseline achieves the best results in terms of ROUGE-2, even though it is relatively small and likely severely under-trained, since only around 30% of the data were used for fine-tuning due to computational constraints. This suggests that sequence-to-sequence neural models have great potential on this dataset and should be investigated further in future work. Surprisingly, however, its ROUGE-1 results are below those of most of the other baselines; future work should investigate this discrepancy.
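As a starting point for such future work, the sketch below shows one possible fine-tuning setup for the neural baseline with the Hugging Face transformers library. It is a minimal illustration under the assumptions noted in the comments (in particular, the "source" and "summary" column names of the WITS splits are hypothetical placeholders), not the training code used for Table 3.

```python
# Hedged sketch of a fine-tuning setup for the IT5-small baseline with Hugging
# Face transformers. Hyper-parameters mirror those reported in Section 4:
# source limited to 1000 tokens, summary to 75, batch size 1 per device,
# 100,000 training steps.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "gsarti/it5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Placeholder data standing in for the real WITS train/validation splits.
train_set = Dataset.from_dict({"source": ["testo completo dell'articolo"],
                               "summary": ["sezione iniziale dell'articolo"]})
val_set = train_set

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=1000, truncation=True)
    # For T5-style models, targets are tokenized with the same tokenizer.
    model_inputs["labels"] = tokenizer(batch["summary"], max_length=75,
                                       truncation=True)["input_ids"]
    return model_inputs

train_tok = train_set.map(preprocess, batched=True, remove_columns=train_set.column_names)
val_tok = val_set.map(preprocess, batched=True, remove_columns=val_set.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="it5-small-wits",
    per_device_train_batch_size=1,
    max_steps=100_000,
    predict_with_generate=True,
    generation_max_length=75,
    logging_steps=1_000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```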
5 Conclusions

We have presented WITS, the first large-scale dataset for abstractive summarization in Italian. We have exploited the structure of Wikipedia articles to build a challenging, non-technical dataset with high-quality human-written abstracts. Given the lengthy source documents, the short summaries, and the short extractive fragments, the dataset calls for an abstractive approach. In the paper, we have explored some standard non-neural extractive baselines and a neural abstractive baseline; future work will investigate further neural baselines for the dataset. Moreover, given the structure of Wikipedia, the dataset can be easily extended by applying the procedure described in the paper to more languages, including low-resource ones. We are confident that research in summarization in languages other than English will become more active in the near future and hope that WITS can be a valuable step in this direction.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California, June. Association for Computational Linguistics.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana, June. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479, December.

Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for multilingual abstractive summarization. In Findings of EMNLP, 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Elena Lloret, Laura Plaza, and Ahmet Aker. 2018. The challenging task of summary evaluation: An overview. Language Resources and Evaluation, 52(1).

C. Mastronardo and F. Tamburini. 2019. Enhancing a text summarization system with ELMo. In CLiC-it.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain, July. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, August. Association for Computational Linguistics.
Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. 2021. Entity-level factual consistency of abstractive text summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, Online, April. Association for Computational Linguistics.

Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Technical report, Microsoft Research.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July. Association for Computational Linguistics.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy, July. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In NAACL.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.