On Generating Extended Summaries of Long Documents

Sajad Sotudeh†, Arman Cohan‡, Nazli Goharian†
† IR Lab, Georgetown University, Washington D.C., USA
{sajad, nazli}@ir.cs.georgetown.edu
‡ Allen Institute for Artificial Intelligence, Seattle, WA, USA
armanc@allenai.org

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps get a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that cannot fit in a short summary. This is typically the case for longer documents such as a research paper, legal document, or a book. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis over the generated results, shedding insights on future research for the long-form summary generation task. Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm.

Introduction

In the past few years, there has been significant progress on both extractive (e.g., Nallapati, Zhai, and Zhou 2017; Zhou et al. 2018; Liu and Lapata 2019; Xu et al. 2020; Jia et al. 2020) and abstractive (e.g., See, Liu, and Manning 2017; Cohan et al. 2018; MacAvaney et al. 2019; Zhang et al. 2019; Sotudeh, Goharian, and Filice 2020; Dong et al. 2020) approaches for document summarization. These approaches generate a concise summary of a document, capturing its salient content. However, for a longer document containing numerous details, it is sometimes helpful to read an extended summary that provides details about its different aspects. Scientific papers are examples of such documents; while their abstracts provide a short summary of their main methods and findings, the abstract does not include details of the methods or experimental conditions. For those who seek more detailed information about a document without having to read the entire document, an extended or long summary can be desirable (Chandrasekaran et al. 2020; Sotudeh, Cohan, and Goharian 2020; Ghosh Roy et al. 2020).

Many long documents, including scientific papers, follow a certain hierarchical structure in which content is organized throughout multiple sections and sub-sections. For example, research papers often describe objectives, problem, methodology, experiments, and conclusions (Collins, Augenstein, and Riedel 2017). A few prior studies have noted the importance of documents' structure in shorter-form summary generation (Collins, Augenstein, and Riedel 2017; Cohan et al. 2018). However, we are not aware of existing summarization methods that explicitly model the document structure when it comes to generating extended summaries.
We approach the problem of generating extended summaries by incorporating the document's hierarchical structure into the summarization model. Specifically, we hypothesize that integrating the processes of sentence selection and section prediction improves the summarization model's performance over existing baseline models on the extended summarization task. To substantiate our hypothesis, we test our proposed model on three extended summarization datasets, namely arXiv-Long, PubMed-Long, and Longsumm. We further provide comprehensive analyses over the generated results for two long datasets, demonstrating the qualities of our model over the baseline. Our analysis reveals that the multi-tasking model helps adjust the sentence extraction probability to the advantage of salient sentences scattered across different sections of the document. Our contributions are threefold:

1. A multi-task learning approach for leveraging document structure in generating extended summaries of long documents.
2. In-depth and comprehensive analyses over the generated results to explore the qualities of our model in comparison with the baseline model.
3. Two large-scale extended summarization datasets with oracle labels, collected to facilitate ongoing research in the extended summarization domain.

Related Work

Scientific document summarization Summarizing scientific papers has garnered vast attention from the research community in recent years, although it has been studied for decades. The characteristics of scientific papers, namely their length, writing style, and discourse structure, call for special considerations when addressing the summarization task in the scientific domain. Researchers have utilized different approaches to address these challenges. In earlier work, Teufel and Moens (2002) proposed a Naïve Bayes classifier to perform content selection over the documents' sentences with regard to their rhetorical sentence role. More recent works have emphasized the importance of discourse structure and its usefulness in summarizing scientific papers. For example, Collins, Augenstein, and Riedel (2017) used a set of pre-defined section clusters in which source sentences appear as a categorical feature to aid the model in identifying summary-worthy sentences. Cohan et al. (2018) introduced large-scale datasets of arXiv and PubMed papers (collected from public repositories), used a hierarchical encoder to model the discourse structure of a paper, and then used an attentive decoder to generate the summary. More recently, Xiao and Carenini (2019) proposed a sequence-to-sequence model that incorporates both the global context of the entire document and the local context within the specified section. Inspired by the fact that discourse information is important when dealing with long documents (Cohan et al. 2018), we utilize this structure in scientific summarization. Unlike prior works, we integrate the sentence selection and sentence section labeling processes through a multi-task learning approach. In a different line of research, the use of citation context information has been shown to be quite effective at summarizing scientific papers (Abu-Jbara and Radev 2011). For instance, Cohan and Goharian (2015, 2018) utilized a citation-based approach, denoting how the paper is cited in the reference papers, to form the summary. Here, we do not exploit any citation context information.
Extended summarization While summarization research has been extensively explored in the literature, extended summarization has only recently gained a great deal of attention from the research community. Among the first attempts to encourage ongoing research in this field, Chandrasekaran et al. (2020) set up the Longsumm shared task[1] on producing extended summaries from scientific documents and provided an extended summarization dataset called Longsumm, over which participants were urged to generate extended summaries. To tackle this challenge, researchers used different methodologies. For instance, Sotudeh, Cohan, and Goharian (2020) proposed a multi-tasking approach to jointly learn sentence importance along with the section in which a sentence should be included in the summary. Herein, we aim at validating the multi-tasking model on a variety of extended summarization datasets and provide a comprehensive analysis to guide future research. Moreover, Ghosh Roy et al. (2020) utilized section contribution pre-computations (on the training set) to assign weights via a budget module for generating extended summaries. After specifying the section contribution, an extractive summarizer is executed over each section separately to extract salient sentences. Unlike their work, we unify the sentence selection and sentence section prediction tasks to effectively aid the model in identifying summary-worthy sentences scattered around different sections. Furthermore, Reddy et al. (2020) proposed a CNN-based classification network for extracting salient sentences. Gidiotis, Stefanidis, and Tsoumakas (2020) proposed to use a divide-and-conquer (DANCER) approach (Gidiotis and Tsoumakas 2020) to identify the key sections of the paper to be summarized; the PEGASUS abstractive summarizer (Zhang et al. 2019) then runs over each section separately to produce section summaries, which are finally concatenated to form the extended summary. Beltagy, Peters, and Cohan (2020) proposed "Longformer", which utilizes "Dilated Sliding Windows", enabling the model to achieve better long-range coverage on long documents. To the best of our knowledge, we are the first to conduct such a comprehensive analysis over the generated summarization results in the extended summarization domain.

[1] https://ornlcda.github.io/SDProc/sharedtasks.html

Dataset

We use three extended summarization datasets in this research. The first one is the Longsumm dataset, which was provided in the Longsumm 2020 shared task (Chandrasekaran et al. 2020). To further validate the model, we collect two additional datasets, called arXiv-Long and PubMed-Long, by filtering the instances of the arXiv and PubMed corpora to retain those whose abstract contains at least 350 tokens. Also, to measure how our model works on a mixed, varied-length scientific dataset, we use the arXiv summarization dataset (Cohan et al. 2018).

Longsumm The Longsumm dataset was provided for the Longsumm challenge (Chandrasekaran et al. 2020), whose aim was to generate extended summaries for scientific papers. It consists of two types of summaries:
• Extractive summaries: these summaries come from the TalkSumm dataset (Lev et al. 2019), containing 1705 extractive summaries of scientific papers derived from their video talks at conferences (i.e., ACL, NAACL, etc.). Each summary within this corpus is formed by appending the top 30 sentences of the paper.
• Abstractive summaries: an add-on dataset containing 531 abstractive summaries from several CS domains such as Machine Learning, NLP, and AI, written by NLP and ML researchers on their blogs. The length of summaries in this dataset ranges from 50 to 1500 words per paper.
In our experiments, we use the extractive set along with 50% of the abstractive set as our training set, containing 1969 papers, and 20% of the abstractive set as the validation set. Note that these splits are made for the purpose of our internal experiments, as the official test set containing 22 abstractive summaries is blind (Chandrasekaran et al. 2020).

arXiv-Long & PubMed-Long To further test our methods on additional datasets, we construct two extended summarization datasets for our task. For the first dataset, we take the arXiv summarization dataset introduced by Cohan et al. (2018) and filter the instances whose abstract (i.e., ground-truth summary) contains at least 350 tokens. We call this dataset arXiv-Long. We repeat the same process on the PubMed papers obtained from the Open Access FTP service[2] and call this dataset PubMed-Long. The motivation is that we are interested in validating our model on extended summarization datasets to investigate its effects compared to existing works, and 350 is the length threshold that we use to characterize papers with "long" summaries. The resulting sets contain 11,149 instances for arXiv-Long and 88,035 instances for PubMed-Long. Note that the abstracts of the papers are used as ground-truth summaries in these two datasets. The overall statistics of the datasets are shown in Table 1. We release these datasets to facilitate future research in extended summarization.[3]

[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp
[3] https://github.com/Georgetown-IR-Lab/ExtendedSumm

Table 1: Statistics on arXiv (Cohan et al. 2018), Longsumm (Chandrasekaran et al. 2020), and the two extended summarization datasets (arXiv-Long, PubMed-Long) collected in this work.

Datasets    | # docs | avg. doc. length (tokens) | avg. summ. length (tokens)
arXiv       | 215K   | 4938                      | 220
Longsumm    | 2.2K   | 5858                      | 920
arXiv-Long  | 11.1K  | 9221                      | 574
PubMed-Long | 88.0K  | 5359                      | 403
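The length-based filtering used to build arXiv-Long and PubMed-Long is simple to reproduce. Below is a minimal sketch, assuming a JSON-lines corpus in which each record stores the abstract as a list of sentence strings under an "abstract_text" field; the field and file names are illustrative assumptions, not part of the released datasets.

```python
# Sketch of the length-based filtering behind arXiv-Long / PubMed-Long:
# keep only papers whose ground-truth summary (abstract) has >= 350 tokens.
import json

MIN_ABSTRACT_TOKENS = 350  # threshold characterizing "long" summaries

def abstract_length(record: dict) -> int:
    """Count whitespace tokens over all abstract sentences."""
    return sum(len(sent.split()) for sent in record["abstract_text"])

def build_long_subset(src_path: str, dst_path: str) -> int:
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if abstract_length(record) >= MIN_ABSTRACT_TOKENS:
                dst.write(json.dumps(record) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = build_long_subset("arxiv_train.jsonl", "arxiv_long_train.jsonl")
    print(f"retained {n} instances")
```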
Methodology

In this section, we discuss our proposed method, which aims at jointly learning to predict sentence importance and its corresponding section. Before discussing the details of our summarization model, we review the preliminary background that provides the basis for our method.

Background

Extractive Summarization An extractive summarization system aims at extracting salient sentences to be included in the summary. Formally, let P denote a scientific paper containing sentences [s_1, s_2, s_3, ..., s_m], where m is the number of sentences. Extractive summarization is then defined as the task of assigning a binary label ŷ_i ∈ {0, 1} to each sentence s_i within the paper, signifying whether the sentence should be included in the summary.

BERTSUM: BERT for Summarization As our base model, we use the BERTSUM extractive summarization model (Liu and Lapata 2019), a BERT-based sentence classification model fine-tuned for summarization. After BERTSUM outputs sentence representations for the input document, several inter-sentence Transformer layers are stacked on top of BERTSUM to collect document-level features. The final output layer is a linear classifier with a Sigmoid activation function that decides whether each sentence should be included or not. The loss function is defined as:

L_1 = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]   (1)

where N is the output size, ŷ_i is the output of the model, and y_i is the corresponding target value. In our experiments, we use this model to extract salient sentences (i.e., those with the positive label) to form the summary. We set this model as the baseline, called BERTSUMEXT (Liu and Lapata 2019).

Our model: a section-aware summarizer

Inspired by prior works that have studied the effect of a document's hierarchical structure on the summarization task (Conroy and Davis 2017; Cohan et al. 2018), we define a section prediction task, aiming at predicting the relevant section for each sentence in the document. Specifically, we add an additional linear classification layer on top of the BERTSUM sentence representations to predict the relevant section for each sentence. The loss function for the section prediction network is defined as follows:

L_2 = -\sum_{i=1}^{S} y_i \log(\hat{y}_i)   (2)

where y_i and ŷ_i are the ground-truth and model scores for each section i in S.

Figure 1: Overview of the BERTSUMEXTMULTI model: BERTSUM sentence representations are fed to a Transformer encoder, followed by two linear layers, one for Sentence Selection and one for Section Prediction. The baseline model (i.e., BERTSUMEXT) is marked with a dashed border; the extension to the baseline is the addition of the Section Prediction linear layer (shown in the green box).

The entire extractive network is then trained to optimize both tasks (i.e., sentence selection and section prediction) in a multi-task setting:

L_Multi = \alpha L_1 + (1 - \alpha) L_2   (3)

where L_1 is the binary cross-entropy loss from the sentence selection task, L_2 is the categorical cross-entropy loss from the section prediction network, and α is a weighting parameter that balances the learning procedure between the sentence and section prediction tasks.
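As a concrete illustration of Eq. (3), the following is a minimal PyTorch sketch of the two linear output heads and the combined objective. The BERTSUM encoder is abstracted away as a tensor of sentence vectors, and all names and shapes are illustrative assumptions rather than the exact released implementation.

```python
# Sketch of the two output heads and the multi-task objective of Eq. (3).
# `sent_vecs` stands in for the BERTSUM sentence representations.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_size: int, num_sections: int):
        super().__init__()
        self.selection_head = nn.Linear(hidden_size, 1)            # sentence selection (binary)
        self.section_head = nn.Linear(hidden_size, num_sections)   # section prediction

    def forward(self, sent_vecs):                                  # (batch, n_sents, hidden)
        sel_logits = self.selection_head(sent_vecs).squeeze(-1)    # (batch, n_sents)
        sec_logits = self.section_head(sent_vecs)                  # (batch, n_sents, n_sections)
        return sel_logits, sec_logits

def multi_task_loss(sel_logits, sec_logits, sel_labels, sec_labels, alpha: float = 0.5):
    """L_Multi = alpha * L1 (binary CE, selection) + (1 - alpha) * L2 (categorical CE, section)."""
    l1 = nn.functional.binary_cross_entropy_with_logits(sel_logits, sel_labels.float())
    l2 = nn.functional.cross_entropy(sec_logits.view(-1, sec_logits.size(-1)), sec_labels.view(-1))
    return alpha * l1 + (1.0 - alpha) * l2
```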
Experimental Setup

In this section, we give details about the pre-processing steps applied to the datasets and the parameters used for the experimented models. For our baseline, we used the pre-trained BERTSUM model and implementation provided by the authors (Liu and Lapata 2019).[4] The BERTSUMEXTMULTI model is that of Sotudeh, Cohan, and Goharian (2020), but without the post-processing module at inference time, which utilizes trigram-blocking (Liu and Lapata 2019) to hinder repetitions in the final summary. We intentionally removed the post-processing part, as the model attained higher scores in the absence of this module throughout our experiments. In order to obtain ground-truth section labels associated with each sentence, we utilized the external sequential-sentence-classification package[5] by Cohan et al. (2019). To provide oracle labels for source sentences in our datasets, we use a greedy labelling approach (Liu and Lapata 2019) with a slight modification, labelling up to the top 30, 15, and 25 sentences for the Longsumm, arXiv-Long, and PubMed-Long datasets, respectively, since these numbers of oracle sentences yielded the highest oracle scores.[6] For the joint model, we tuned α (the loss weighting parameter) to 0.5, as it resulted in the highest scores throughout our experiments. In all our experiments, we pick the checkpoint that achieves the best average of ROUGE-2 and ROUGE-L scores at the validation intervals as our best model for inference.

[4] https://github.com/nlpyang/PreSumm
[5] https://github.com/allenai/sequential_sentence_classification
[6] The modification was made to assure that the oracle sentences are sampled from diverse sections.
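For readers who want to reproduce such oracle labels, below is a simplified sketch of a greedy labelling procedure in the spirit of Liu and Lapata (2019): iteratively add the source sentence that most increases ROUGE against the gold summary, up to the per-dataset cap. It uses the rouge-score package as an assumption (any ROUGE implementation works), and the modification that enforces sampling oracle sentences from diverse sections is omitted.

```python
# Simplified greedy oracle labelling sketch (not the authors' exact script).
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def _oracle_score(selected_sents, gold_summary):
    scores = _scorer.score(gold_summary, " ".join(selected_sents))
    return scores["rouge1"].fmeasure + scores["rouge2"].fmeasure

def greedy_oracle(source_sents, gold_summary, max_sents=30):
    """Return indices of sentences labelled 1 (oracle), greedily maximizing ROUGE."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = [i for i in range(len(source_sents)) if i not in selected]
        if not candidates:
            break
        gains = [(_oracle_score([source_sents[j] for j in selected + [i]], gold_summary), i)
                 for i in candidates]
        score, idx = max(gains)
        if score <= best:          # stop when no remaining sentence improves the oracle score
            break
        best, selected = score, selected + [idx]
    return sorted(selected)
```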
Results

In this section, we present the performance of the baseline and our model over the validation and test sets of the extended summarization datasets. We then discuss our proposed model's performance compared to the baseline over a mixed, varied-length summarization dataset (i.e., arXiv). As evaluation metrics, we report the summarization systems' performance in terms of ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1).

As we see in Table 2, the summarization model with the section predictor incorporated (i.e., the BERTSUMEXTMULTI model) performs fairly well compared to the baseline model. This is a particularly important finding, since it characterizes the importance of injecting documents' structure when summarizing a scientific paper. While the score gap is relatively large on the arXiv-Long and Longsumm datasets, the two models perform similarly on the PubMed-Long dataset.

Table 2: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on the extended summarization datasets. ∗ shows a statistically significant improvement (paired t-test, p < 0.01). The validation set for Longsumm refers to our internal validation set (20% of the abstractive set), as there was no official validation set provided for this dataset.

Dataset     | Model           | Val RG-1(%) | Val RG-2(%) | Val RG-L(%) | Test RG-1(%) | Test RG-2(%) | Test RG-L(%)
Longsumm    | BERTSUMEXT      | 43.2        | 12.4        | 16.8        | –            | –            | –
Longsumm    | BERTSUMEXTMULTI | 43.3        | 13.0∗       | 17.0        | 53.1         | 16.8         | 20.3
arXiv-Long  | BERTSUMEXT      | 47.1        | 18.2        | 20.8        | 47.2         | 18.4         | 21.1
arXiv-Long  | BERTSUMEXTMULTI | 47.8∗       | 18.9∗       | 21.3∗       | 47.8∗        | 19.2∗        | 21.5∗
PubMed-Long | BERTSUMEXT      | 49.1        | 24.3        | 25.7        | 49.1         | 24.5         | 25.8
PubMed-Long | BERTSUMEXTMULTI | 48.9        | 24.1        | 25.5        | 48.9         | 24.1         | 25.5

As observed in Table 3, the BERTSUMEXTMULTI approach performs at the top among the state-of-the-art long summarization methods on the blind test set of the LongSumm challenge (Chandrasekaran et al. 2020). While this model improves ROUGE-1 quite significantly over the other state-of-the-art systems, it stays competitive on the ROUGE-2 and ROUGE-L metrics. In terms of the ROUGE (F1) F-Measure average, the BERTSUMEXTMULTI model ranks first by a large margin compared to the other systems.

Table 3: ROUGE (F1) results of our multi-tasking model on the blind test set of the Longsumm shared task containing 22 abstractive summaries (Chandrasekaran et al. 2020), along with the performance of other participants' systems. We only show the top 5 participants in this table.

System                                               | RG-1  | RG-2  | RG-L  | F-Measure average
Summaformers (Ghosh Roy et al. 2020)                 | 49.38 | 16.86 | 21.38 | 29.21
Wing                                                 | 50.58 | 16.62 | 20.50 | 29.23
IIITBH-IITP (Reddy et al. 2020)                      | 49.03 | 15.74 | 20.46 | 28.41
Auth-Team (Gidiotis, Stefanidis, and Tsoumakas 2020) | 50.11 | 15.37 | 19.59 | 28.36
CIST_BUPT (Li et al. 2020)                           | 48.99 | 15.06 | 20.13 | 28.06
BERTSUMEXTMULTI (this work)                          | 53.11 | 16.77 | 20.34 | 30.07

To test the model on a mixed, varied-length summarization dataset, we trained and tested it on the arXiv dataset (Cohan et al. 2018), which contains a mix of varying-length abstracts as ground-truth summaries. Table 4 shows that our model achieves competitive performance on this dataset. While the model does not yield any improvement on the arXiv dataset, our hypothesis was that our model is superior to existing models on longer-form datasets, such as those we have used in this research, which we validated by presenting the evaluation results on the long summarization datasets.

Table 4: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on the arXiv summarization dataset.

Dataset | Model           | Val RG-1(%) | Val RG-2(%) | Val RG-L(%) | Test RG-1(%) | Test RG-2(%) | Test RG-L(%)
arXiv   | BERTSUMEXT      | 43.6        | 16.6        | 20.2        | 44.0         | 16.8         | 20.4
arXiv   | BERTSUMEXTMULTI | 43.4        | 16.5        | 19.8        | 43.5         | 16.5         | 20.0

Analysis

In order to gain insight into how our multi-tasking approach works on different long datasets, we perform an extensive analysis in this section to explore the qualities of our multi-tasking system (i.e., BERTSUMEXTMULTI) over the baseline (i.e., BERTSUMEXT). Specifically, we perform two types of analyses: 1) quantitative analysis; 2) qualitative analysis.

For the first part, we use two metrics: RGdiff, which denotes the average ROUGE (F1) difference (i.e., gap) between the baseline and our model[7] (positive values indicate an improvement, while negative values denote a decline in scores), and Fdiff, the average difference in F1 score between the baseline and our model. We create three bins sorted by RGdiff: IMPROVED, which contains the reports whose average ROUGE (F1) score is improved by the multi-tasking model; TIED, including those whose average ROUGE (F1) score the multi-tasking model leaves unchanged; and DECLINED, containing those whose average ROUGE (F1) score is decreased by the joint model. For the qualitative analysis, we specifically aim at comparing the methods in terms of section distribution, since that is where our method's improvements are expected to come from. Furthermore, we conduct an additional length analysis over the results generated by the baseline versus our model.

[7] The average is defined over the ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1) scores.
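A minimal sketch of this binning is given below; inputs are per-document ROUGE-1/2/L (F1) scores for the two systems, and the data structures are illustrative assumptions rather than the exact analysis script.

```python
# Sketch of the metric-analysis binning: per document, compute the average
# ROUGE (F1) gap between our model and the baseline (RGdiff) and assign the
# document to the IMPROVED / TIED / DECLINED bin accordingly.
from statistics import mean

def rg_diff(model_scores: dict, baseline_scores: dict) -> float:
    keys = ("rouge1", "rouge2", "rougeL")
    return mean(model_scores[k] - baseline_scores[k] for k in keys)

def bin_documents(model_results: dict, baseline_results: dict) -> dict:
    """Both arguments map document id -> {"rouge1": ..., "rouge2": ..., "rougeL": ...}."""
    bins = {"IMPROVED": [], "TIED": [], "DECLINED": []}
    for doc_id in model_results:
        diff = rg_diff(model_results[doc_id], baseline_results[doc_id])
        if diff > 0:
            bins["IMPROVED"].append((doc_id, diff))
        elif diff < 0:
            bins["DECLINED"].append((doc_id, diff))
        else:
            bins["TIED"].append((doc_id, diff))
    return bins
```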
Quantitative Analysis

We first perform the quantitative analysis over the long summarization datasets' test sets in two parts: 1) metric analysis, which compares the different bins based on the average ROUGE score difference between the baseline and our model; and 2) length analysis, which targets the correlation between the summary length in different bins and the models' performance.

Metric analysis Table 5 shows the overall quantities for the Longsumm and arXiv-Long datasets in terms of the average difference of ROUGE and F1 scores. As shown, on Longsumm the multi-tasking approach is able to improve 76 summaries with an average ROUGE (F1) improvement of 2.05%. This is even more pronounced when evaluating the model on the arXiv-Long dataset, with an average ROUGE improvement of 2.40%. Interestingly, our method consistently improves the F1 measure in general (see the total F1 scores in Table 5). Seemingly, the F1 metric directly correlates with the ROUGE (F1) metric on the arXiv-Long dataset, whereas this is not the case for the DECLINED bin of the Longsumm dataset. This might be due to the relatively small test set size of the Longsumm dataset. It has to be mentioned that the IMPROVED bin holds relatively higher counts and improved metrics than the DECLINED bin across both datasets in our evaluation.

Table 5: IMPROVED, TIED, and DECLINED bins on the test sets of the Longsumm and arXiv-Long datasets. The numbers show the improvements (positive) and drops (negative) compared to the baseline model (i.e., BERTSUMEXT).

Dataset    | Bin      | Count | RGdiff | Fdiff
Longsumm   | IMPROVED | 76    | 2.05   | 6.16
Longsumm   | TIED     | 4     | 0      | 0
Longsumm   | DECLINED | 74    | −1.47  | 1.95
Longsumm   | Total    | 154   | 0.31   | 4.11
arXiv-Long | IMPROVED | 1,084 | 2.40   | 4.47
arXiv-Long | TIED     | 67    | 0      | 0.32
arXiv-Long | DECLINED | 801   | −1.82  | −1.34
arXiv-Long | Total    | 1,952 | 0.59   | 1.94

Length analysis We analyze the results generated by both models to see if the summary length affects the models' performance, using the bar charts in Figure 2. The bar charts are intended to provide the basis for comparing both models on different length bins (x-axis), which are evenly spaced (i.e., each contains the same number of papers). We used five bins (each with 31 summaries) for the Longsumm dataset and ten bins (each with 196 summaries) for the arXiv-Long dataset.

Figure 2: Bar charts exhibiting the correlation of ground-truth summary length (in tokens) with the performance of the baseline (i.e., BERTSUMEXT) and our multi-tasking model (i.e., BERTSUMEXTMULTI): (a) ROUGE-2 scores on Longsumm; (b) ROUGE-2 scores on arXiv-Long. The diagrams are shown for the Longsumm and arXiv-Long test sets. Each bin contains 31 summaries for Longsumm and 196 summaries for arXiv-Long. As denoted, the multi-tasking model generally outperforms the baseline on later bins, which include longer-form summaries.

As shown in Figure 2, as the length of the ground-truth summary increases, the multi-tasking model generally improves over the baseline consistently on both datasets, except for the last bin on the Longsumm dataset (Figure 2 (a)), where it achieves comparable performance. This behaviour is also observed on ROUGE-1 and ROUGE-L for the Longsumm dataset. The ROUGE improvement is even more noticeable in the analysis over the arXiv-Long dataset (see Figure 2 (b)). Thus, the length analysis supports our hypothesis that the multi-tasking model outperforms the baseline more significantly when the summary is of longer form.
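A minimal sketch of this length analysis follows, under the assumption that bins are formed by sorting documents by ground-truth summary length and splitting them into equal-size groups; the exact binning procedure in the paper may differ.

```python
# Sketch of the length analysis: sort test documents by ground-truth summary
# length (tokens), split into equal-size bins (e.g., 5 for Longsumm, 10 for
# arXiv-Long), and average a per-document ROUGE-2 (F1) score within each bin.
def length_bins(doc_ids, summary_lengths, rouge2_f1, num_bins=5):
    """All inputs are dicts keyed by document id; returns one mean score per bin."""
    ordered = sorted(doc_ids, key=lambda d: summary_lengths[d])
    bin_size = len(ordered) // num_bins
    means = []
    for b in range(num_bins):
        bin_docs = ordered[b * bin_size:(b + 1) * bin_size]
        means.append(sum(rouge2_f1[d] for d in bin_docs) / len(bin_docs))
    return means
```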
Qualitative analysis

Examining the summaries in the IMPROVED bin, we found that the multi-tasking model can effectively sample sentences from diverse sections when the ground-truth summary is also sampled from diverse sections. It improves significantly over the baseline when the extractive model can detect salient sentences from important sections.

By investigating the summaries from the DECLINED bin, we noticed that in declined summaries, while our multi-tasking approach can adjust the extraction probability distribution towards diverse sections, it has difficulty picking salient sentences (i.e., positive sentences) from the corresponding section; thus, it leads to relatively lower ROUGE scores. This might be improved if the two networks (i.e., sentence selection and section prediction) were optimized in a more elegant way, such that the extractive summarizer can further select salient sentences from the specified sections once they are identified. For example, improved multi-tasking methods could involve task prioritization (Guo et al. 2018) to dynamically balance the learning process between the two tasks during training, rather than using a fixed α parameter.

In the cases where the F1 score and the ROUGE (F1) scores were not consistent with each other, we observed that adding non-salient sentences to the final summary hurts the final ROUGE (F1) scores. In other words, while the multi-tasking approach can achieve a higher F1 score than the baseline, since it chooses different non-salient (i.e., negative) sentences than the baseline, the overall ROUGE (F1) scores drop slightly. Having a conditional decoding length (i.e., number of sentences) might help with this, as done in Mao et al. (2020).

Figure 3 shows the extraction probabilities that each model outputs over the source sentences of a sample paper. It is observable that the baseline model picks most of its sentences (47%) from the beginning of the paper, while the multi-tasking approach can effectively redistribute the probability mass to summary-worthy sentences located across different sections of the paper, and pick those with higher confidence. Our model achieves an overall F1 score of 53.33% on this sample paper, while the baseline's F1 score is 33.33%.

Figure 3: Heat-maps showing the extraction probabilities over the source sentences (Paper ID: astro-ph9807040, sampled from the arXiv-Long dataset): (a) extraction probability distribution of the baseline model (i.e., BERTSUMEXT); (b) extraction probability distribution of the multi-tasking model (i.e., BERTSUMEXTMULTI). For simplicity, we only show the sentences that receive over 15% extraction probability from the models. The cells bordered in black show the models' final selections, and oracle sentences are indicated with *.
Conclusion & Future Work

In this paper, we approach the problem of generating extended summaries of long documents. Our proposed model is a multi-task learning approach that unifies the sentence selection and section prediction processes to extract summary-worthy sentences. We further collect two large-scale extended summary datasets (arXiv-Long and PubMed-Long) from scientific papers. Our results on three datasets show the efficacy of the joint multi-task model on the extended summarization task. While it achieves fairly competitive performance with the baseline on one of the three datasets, it consistently improves over the baseline on the other two evaluation datasets. We further performed extensive quantitative and qualitative analyses over the results generated by both models. These evaluations revealed our model's qualities compared to the baseline. Based on the error analysis, the performance of this model highly depends on the multi-tasking objectives. Future studies could fruitfully explore this issue further by optimizing the multi-task objectives in a way that both the sentence selection and section prediction tasks can benefit.

References

Abu-Jbara, A.; and Radev, D. R. 2011. Coherent Citation-Based Summarization of Scientific Papers. In ACL.
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. ArXiv abs/2004.05150.
Chandrasekaran, M. K.; Feigenblat, G.; Hovy, E.; Ravichander, A.; Shmueli-Scheuer, M.; and de Waard, A. 2020. Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm. In SDP.
Cohan, A.; Beltagy, I.; King, D.; Dalvi, B.; and Weld, D. S. 2019. Pretrained Language Models for Sequential Sentence Classification. In EMNLP/IJCNLP.
Cohan, A.; Dernoncourt, F.; Kim, D. S.; Bui, T.; Kim, S.; Chang, W.; and Goharian, N. 2018. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In NAACL-HLT.
Cohan, A.; and Goharian, N. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In EMNLP.
Cohan, A.; and Goharian, N. 2018. Scientific document summarization via citation contextualization and scientific discourse. International Journal on Digital Libraries 19(2-3): 287–303.
Collins, E.; Augenstein, I.; and Riedel, S. 2017. A Supervised Approach to Extractive Summarisation of Scientific Papers. In CoNLL.
Conroy, J. M.; and Davis, S. 2017. Section mixture models for scientific document summarization. International Journal on Digital Libraries 19: 305–322.
Dong, Y.; Wang, S.; Gan, Z.; Cheng, Y.; Cheung, J.; and Liu, J. 2020. Multi-Fact Correction in Abstractive Text Summarization. In EMNLP.
Ghosh Roy, S.; Pinnaparaju, N.; Jain, R.; Gupta, M.; and Varma, V. 2020. Summaformers @ LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; Stefanidis, S.; and Tsoumakas, G. 2020. AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; and Tsoumakas, G. 2020. A Divide-and-Conquer Approach to the Summarization of Academic Articles. ArXiv abs/2004.06190.
Guo, M.; Haque, A.; Huang, D.-A.; Yeung, S.; and Fei-Fei, L. 2018. Dynamic Task Prioritization for Multitask Learning. In ECCV.
Jia, R.; Cao, Y.; Tang, H.; Fang, F.; Cao, C.; and Wang, S. 2020. Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network. In EMNLP.
Lev, G.; Shmueli-Scheuer, M.; Herzig, J.; Jerbi, A.; and Konopnicki, D. 2019. TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks. In ACL.
Li, L.; Xie, Y.; Liu, W.; Liu, Y.; Jiang, Y.; Qi, S.; and Li, X. 2020. CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization. In SDP.
Liu, Y.; and Lapata, M. 2019. Text Summarization with Pretrained Encoders. In EMNLP/IJCNLP.
MacAvaney, S.; Sotudeh, S.; Cohan, A.; Goharian, N.; Talati, I.; and Filice, R. W. 2019. Ontology-Aware Clinical Abstractive Summarization. In SIGIR.
Mao, Y.; Qu, Y.; Xie, Y.; Ren, X.; and Han, J. 2020. Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning. In EMNLP.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In AAAI.
Reddy, S.; Saini, N.; Saha, S.; and Bhattacharyya, P. 2020. IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20. In SDP.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
Sotudeh, S.; Cohan, A.; and Goharian, N. 2020. GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents. In SDP.
Sotudeh, S.; Goharian, N.; and Filice, R. 2020. Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization. In ACL.
Teufel, S.; and Moens, M. 2002. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status. Computational Linguistics 28: 409–445.
Xiao, W.; and Carenini, G. 2019. Extractive Summarization of Long Documents by Combining Global and Local Context. In EMNLP/IJCNLP.
Xu, J.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. Discourse-Aware Neural Extractive Text Summarization. In ACL.
Zhang, J.; Zhao, Y.; Saleh, M.; and Liu, P. J. 2019. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In ICML.
Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; and Zhao, T. 2018. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In ACL.