                On Generating Extended Summaries of Long Documents

                              Sajad Sotudeh†, Arman Cohan‡, Nazli Goharian†
                       † IR Lab, Georgetown University, Washington D.C., USA
                              {sajad, nazli}@ir.cs.georgetown.edu
                    ‡ Allen Institute for Artificial Intelligence, Seattle, WA, USA
                                      armanc@allenai.org



                         Abstract

Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps get a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that cannot fit in a short summary. This is typically the case for longer documents such as a research paper, a legal document, or a book. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis over the generated results, shedding insight on future research for the long-form summary generation task. Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                       Introduction

In the past few years, there has been significant progress on both extractive (e.g., Nallapati, Zhai, and Zhou 2017; Zhou et al. 2018; Liu and Lapata 2019; Xu et al. 2020; Jia et al. 2020) and abstractive (e.g., See, Liu, and Manning 2017; Cohan et al. 2018; MacAvaney et al. 2019; Zhang et al. 2019; Sotudeh, Goharian, and Filice 2020; Dong et al. 2020) approaches for document summarization. These approaches generate a concise summary of a document, capturing its salient content. However, for a longer document containing numerous details, it is sometimes helpful to read an extended summary that provides details about its different aspects. Scientific papers are examples of such documents: while their abstracts provide a short summary of their main methods and findings, the abstract does not include details of the methods or experimental conditions. For those who seek more detailed information about a document without having to read the entire document, an extended or long summary can be desirable (Chandrasekaran et al. 2020; Sotudeh, Cohan, and Goharian 2020; Ghosh Roy et al. 2020).

Many long documents, including scientific papers, follow a certain hierarchical structure in which content is organized across multiple sections and sub-sections. For example, research papers often describe objectives, problem, methodology, experiments, and conclusions (Collins, Augenstein, and Riedel 2017). A few prior studies have noted the importance of documents' structure in shorter-form summary generation (Collins, Augenstein, and Riedel 2017; Cohan et al. 2018). However, we are not aware of existing summarization methods that explicitly model document structure when it comes to generating extended summaries.

We approach the problem of generating extended summaries by incorporating the document's hierarchical structure into the summarization model. Specifically, we hypothesize that integrating the processes of sentence selection and section prediction improves the summarization model's performance over existing baseline models on the extended summarization task. To substantiate our hypothesis, we test our proposed model on three extended summarization datasets, namely arXiv-Long, PubMed-Long, and Longsumm. We further provide comprehensive analyses over the generated results for two long datasets, demonstrating the qualities of our model over the baseline. Our analysis reveals that the multi-tasking model helps adjust the sentence extraction probability to the advantage of salient sentences scattered across different sections of the document. Our contributions are threefold:

1. A multi-task learning approach for leveraging document structure in generating extended summaries of long documents.

2. In-depth and comprehensive analyses over the generated results to explore the qualities of our model in comparison with the baseline model.

3. Two large-scale extended summarization datasets with oracle labels, collected to facilitate ongoing research in the extended summarization domain.
                      Related Work

Scientific document summarization   Summarizing scientific papers has garnered vast attention from the research community in recent years, although it has been studied for decades. The characteristics of scientific papers, namely their length, writing style, and discourse structure, call for special model considerations for the summarization task in the scientific domain. Researchers have utilized different approaches to address these challenges. In earlier work, Teufel and Moens (2002) proposed a Naïve Bayes classifier to perform content selection over the documents' sentences with regard to their rhetorical sentence role. More recent works have highlighted the importance of discourse structure and its usefulness in summarizing scientific papers. For example, Collins, Augenstein, and Riedel (2017) used a set of pre-defined section clusters, in which source sentences appear, as a categorical feature to aid the model in identifying summary-worthy sentences. Cohan et al. (2018) introduced the large-scale arXiv and PubMed datasets (collected from public repositories), used a hierarchical encoder to model the discourse structure of a paper, and then used an attentive decoder to generate the summary. More recently, Xiao and Carenini (2019) proposed a sequence-to-sequence model that incorporates both the global context of the entire document and the local context within the specified section. Inspired by the fact that discourse information is important when dealing with long documents (Cohan et al. 2018), we utilize this structure in scientific summarization. Unlike prior works, we integrate the sentence selection and sentence section labeling processes through a multi-task learning approach. In a different line of research, the use of citation context information has been shown to be quite effective at summarizing scientific papers (Abu-Jbara and Radev 2011). For instance, Cohan and Goharian (2015, 2018) utilized a citation-based approach, denoting how the paper is cited in the referencing papers, to form the summary. Here, we do not exploit any citation context information.

Extended summarization   While summarization research has been extensively explored in the literature, extended summarization has only recently gained a great deal of attention from the research community. Among the first attempts to encourage ongoing research in this field, Chandrasekaran et al. (2020) set up the Longsumm shared task (https://ornlcda.github.io/SDProc/sharedtasks.html) on producing extended summaries from scientific documents and provided an extended summarization dataset called Longsumm, over which participants were asked to generate extended summaries. To tackle this challenge, researchers used different methodologies. For instance, Sotudeh, Cohan, and Goharian (2020) proposed a multi-tasking approach to jointly learn a sentence's importance along with its corresponding section for inclusion in the summary. Herein, we aim at validating the multi-tasking model on a variety of extended summarization datasets and provide a comprehensive analysis to guide future research. Moreover, Ghosh Roy et al. (2020) utilized section contributions pre-computed on the training set to assign weights via a budget module for generating extended summaries. After specifying the section contribution, an extractive summarizer is executed over each section separately to extract salient sentences. Unlike their work, we unify the sentence selection and sentence section prediction tasks to effectively aid the model in identifying summary-worthy sentences scattered across different sections. Furthermore, Reddy et al. (2020) proposed a CNN-based classification network for extracting salient sentences. Gidiotis, Stefanidis, and Tsoumakas (2020) proposed to use a divide-and-conquer (DANCER) approach (Gidiotis and Tsoumakas 2020) to identify the key sections of the paper to be summarized. The PEGASUS abstractive summarizer (Zhang et al. 2019) then runs over each section separately to produce section summaries, which are finally concatenated to form the extended summary. Beltagy, Peters, and Cohan (2020) proposed Longformer, which utilizes dilated sliding windows, enabling the model to achieve better long-range coverage on long documents. In light of the above, to the best of our knowledge, we are the first to conduct a comprehensive analysis over the generated summarization results in the extended summarization domain.

                         Dataset

We use three extended summarization datasets in this research. The first is the Longsumm dataset, which was provided in the Longsumm 2020 shared task (Chandrasekaran et al. 2020). To further validate the model, we collect two additional datasets, called arXiv-Long and PubMed-Long, by filtering the instances of the arXiv and PubMed corpora to retain those whose abstract contains at least 350 tokens. Also, to measure how our model works on a mixed, varied-length scientific dataset, we use the arXiv summarization dataset (Cohan et al. 2018).

Longsumm   The Longsumm dataset was provided for the Longsumm challenge (Chandrasekaran et al. 2020), whose aim was to generate extended summaries for scientific papers. It consists of two types of summaries:
• Extractive summaries: these summaries come from the TalkSumm dataset (Lev et al. 2019), containing 1705 extractive summaries of scientific papers derived from their video talks at conferences (e.g., ACL, NAACL). Each summary in this corpus is formed by appending the top 30 sentences of the paper.

• Abstractive summaries: an add-on dataset containing 531 abstractive summaries from several CS domains, such as Machine Learning, NLP, and AI, written by NLP and ML researchers on their blogs. The length of the summaries in this dataset ranges from 50 to 1500 words per paper.

In our experiments, we use the extractive set along with 50% of the abstractive set as our training set, containing 1969 papers, and 20% of the abstractive set as our validation set. Note that these splits are made for the purpose of our internal experiments, as the official test set containing 22 abstractive summaries is blind (Chandrasekaran et al. 2020).

arXiv-Long & PubMed-Long   To further test our methods on additional datasets, we construct two extended summarization datasets for our task. For the first dataset, we take the arXiv summarization dataset introduced by Cohan et al. (2018) and keep the instances whose abstract (i.e., ground-truth summary) contains at least 350 tokens. We call this dataset arXiv-Long. We repeat the same process on the PubMed papers obtained from the Open Access FTP service (https://www.ncbi.nlm.nih.gov/pmc/tools/ftp) and call this dataset PubMed-Long. The motivation is that we are interested in validating our model on extended summarization datasets to investigate its effects compared to existing works, and 350 tokens is the length threshold that we use to characterize papers with "long" summaries. The resulting sets contain 11,149 instances for arXiv-Long and 88,035 instances for PubMed-Long. Note that the abstracts of the papers are used as ground-truth summaries in these two datasets. The overall statistics of the datasets are shown in Table 1. We release these datasets to facilitate future research in extended summarization (https://github.com/Georgetown-IR-Lab/ExtendedSumm).
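The filtering step itself is simple; the sketch below illustrates the idea, assuming each paper is stored as one JSON record per line with an "abstract" field and approximating tokens by whitespace splitting (both are our assumptions for illustration, not details given in the paper):

    import json

    MIN_ABSTRACT_TOKENS = 350  # threshold used to characterize papers with "long" summaries


    def has_long_abstract(record, min_tokens=MIN_ABSTRACT_TOKENS):
        # Whitespace tokenization is an approximation; the paper does not specify
        # the tokenizer behind the 350-token threshold.
        return len(record["abstract"].split()) >= min_tokens


    def build_long_subset(in_path, out_path):
        """Keep only papers whose abstract (ground-truth summary) has >= 350 tokens."""
        kept = 0
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                record = json.loads(line)  # one paper per line (JSON Lines)
                if has_long_abstract(record):
                    fout.write(json.dumps(record) + "\n")
                    kept += 1
        return kept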
Dataset        # docs    avg. doc. length (tokens)    avg. summ. length (tokens)
arXiv          215K      4938                         220
Longsumm       2.2K      5858                         920
arXiv-Long     11.1K     9221                         574
PubMed-Long    88.0K     5359                         403

Table 1: Statistics of arXiv (Cohan et al. 2018), Longsumm (Chandrasekaran et al. 2020), and the two extended summarization datasets (arXiv-Long, PubMed-Long) collected in this work.

                       Methodology

In this section, we discuss our proposed method, which aims at jointly learning to predict sentence importance and its corresponding section. Before discussing the details of our summarization model, we review the preliminary background on which our method is built.

Background

Extractive Summarization   An extractive summarization system aims at extracting salient sentences to be included in the summary. Formally, let P denote a scientific paper containing sentences [s_1, s_2, s_3, ..., s_m], where m is the number of sentences. Extractive summarization is then defined as the task of assigning a binary label ŷ_i ∈ {0, 1} to each sentence s_i within the paper, signifying whether the sentence should be included in the summary.
BERTSUM: BERT for Summarization   As our base model we use the BERTSUM extractive summarization model (Liu and Lapata 2019), a BERT-based sentence classification model fine-tuned for summarization. After BERTSUM outputs sentence representations for the input document, several inter-sentence Transformer layers are stacked on top of BERTSUM to collect document-level features. The final output layer is a linear classifier with a Sigmoid activation function that decides whether each sentence should be included or not. The loss function is defined as below:

    L_1 = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big]    (1)

where N is the output size, ŷ_i is the output of the model, and y_i is the corresponding target value. In our experiments, we use this model to extract salient sentences (i.e., those with a positive label) to form the summary. We set this model as the baseline, called BERTSUMEXT (Liu and Lapata 2019).
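As a rough illustration of this sentence-scoring objective, the following is a minimal PyTorch sketch under our own assumptions about tensor shapes and the hidden size; it is not the authors' implementation:

    import torch
    import torch.nn as nn


    class SentenceScorer(nn.Module):
        """Sigmoid head over sentence representations, trained with the BCE loss of Eq. (1)."""

        def __init__(self, hidden_size=768):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, 1)
            self.loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one step

        def forward(self, sent_reprs, labels=None):
            # sent_reprs: (num_sentences, hidden_size) from the inter-sentence encoder
            logits = self.classifier(sent_reprs).squeeze(-1)
            scores = torch.sigmoid(logits)  # extraction probability per sentence
            loss = self.loss_fn(logits, labels.float()) if labels is not None else None
            return scores, loss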
Our model: a section-aware summarizer

Inspired by a few prior works that have studied the effect of a document's hierarchical structure on the summarization task (Conroy and Davis 2017; Cohan et al. 2018), we define a section prediction task that aims at predicting the relevant section for each sentence in the document. Specifically, we add an additional linear classification layer on top of the BERTSUM sentence representations to predict the relevant section for each sentence. The loss function for the section prediction network is defined as follows:

    L_2 = -\sum_{i=1}^{S} y_i \log(\hat{y}_i)    (2)

where y_i and ŷ_i are the ground-truth and model scores for each section i in S.

[Figure 1: architecture diagram — the BERTSUM encoder feeds a Transformer encoder, whose sentence representations go to two linear output layers: Sentence Selection and Section Prediction.]

Figure 1: Overview of the BERTSUMEXT MULTI model. The baseline model (i.e., BERTSUMEXT) is shown with a dashed border. The extension to the baseline model is the addition of the Section Prediction linear layer (shown in the green box).

The entire extractive network is then trained to optimize both tasks (i.e., sentence selection and section prediction) in a multi-task setting:

    L_{Multi} = \alpha L_1 + (1 - \alpha) L_2    (3)

where L_1 is the binary cross-entropy loss from the sentence selection task, L_2 is the categorical cross-entropy loss from the section prediction network, and α is the weighting parameter that balances the learning procedure between the sentence and section prediction tasks.
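A minimal sketch of this multi-task objective is shown below; the hidden size, the number of section classes, and the tensor shapes are illustrative assumptions rather than the authors' exact configuration:

    import torch.nn as nn


    class SectionAwareHeads(nn.Module):
        """Two heads over shared sentence representations: sentence selection (Eq. 1)
        and section prediction (Eq. 2), combined as in Eq. (3)."""

        def __init__(self, hidden_size=768, num_sections=7, alpha=0.5):
            super().__init__()
            self.sent_head = nn.Linear(hidden_size, 1)                  # include or not
            self.section_head = nn.Linear(hidden_size, num_sections)   # which section
            self.alpha = alpha                                          # tuned to 0.5 in our experiments
            self.bce = nn.BCEWithLogitsLoss()
            self.ce = nn.CrossEntropyLoss()

        def forward(self, sent_reprs, sent_labels, section_labels):
            # sent_reprs: (num_sentences, hidden_size)
            l1 = self.bce(self.sent_head(sent_reprs).squeeze(-1), sent_labels.float())
            l2 = self.ce(self.section_head(sent_reprs), section_labels)
            return self.alpha * l1 + (1.0 - self.alpha) * l2            # L_Multi of Eq. (3)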

                   Experimental Setup

In this section, we give details about the pre-processing steps applied to the datasets and the parameters used for the experimented models.

For our baseline, we used the pre-trained BERTSUM model and the implementation provided by the authors (Liu and Lapata 2019) (https://github.com/nlpyang/PreSumm). BERTSUMEXT MULTI is the model used in (Sotudeh, Cohan, and Goharian 2020), but without the post-processing module at inference time, which utilizes trigram-blocking (Liu and Lapata 2019) to reduce repetition in the final summary. We intentionally removed the post-processing part, as the model attained higher scores in the absence of this module throughout our experiments. To obtain ground-truth section labels for each sentence, we utilized the external sequential-sentence package (https://github.com/allenai/sequential_sentence_classification) by Cohan et al. (2019). To provide oracle labels for source sentences in our datasets, we use a greedy labelling approach (Liu and Lapata 2019) with a slight modification, labelling up to the top 30, 15, and 25 sentences for the Longsumm, arXiv-Long, and PubMed-Long datasets, respectively, since these numbers of oracle sentences yielded the highest oracle scores; the modification was made to ensure that the oracle sentences are sampled from diverse sections.
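The following sketch shows the general shape of such a greedy labelling scheme (scoring with the rouge-score package); the stopping criterion and the section-diversity modification mentioned above are simplified away, so this is an approximation rather than the exact procedure we used:

    from rouge_score import rouge_scorer

    _scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)


    def _avg_rouge(selected_sents, gold_summary):
        scores = _scorer.score(gold_summary, " ".join(selected_sents))
        return (scores["rouge1"].fmeasure + scores["rouge2"].fmeasure) / 2.0


    def greedy_oracle_labels(sentences, gold_summary, k):
        """Greedily pick up to k sentences (k = 30 / 15 / 25 per dataset) that maximize
        ROUGE against the gold summary; returns one 0/1 oracle label per sentence."""
        selected, selected_idx, best = [], set(), 0.0
        for _ in range(k):
            candidates = [(i, _avg_rouge(selected + [s], gold_summary))
                          for i, s in enumerate(sentences) if i not in selected_idx]
            if not candidates:
                break
            i, score = max(candidates, key=lambda c: c[1])
            if score <= best:  # stop when no remaining sentence improves the score
                break
            best = score
            selected.append(sentences[i])
            selected_idx.add(i)
        return [1 if i in selected_idx else 0 for i in range(len(sentences))]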
For the joint model, we tuned α (the loss weighting parameter) and set it to 0.5, as this resulted in the highest scores throughout our experiments. In all our experiments, we pick the checkpoint that achieves the best average of ROUGE-2 and ROUGE-L scores over the validation intervals as our best model for inference.

                        Results

In this section, we present the performance of the baseline and our model on the validation and test sets of the extended summarization datasets. We then discuss our proposed model's performance compared to the baseline on a mixed, varied-length summarization dataset (i.e., arXiv). As evaluation metrics, we report the summarization systems' performance in terms of ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1).
Dataset        Model               Validation RG-1 / RG-2 / RG-L    Test RG-1 / RG-2 / RG-L
Longsumm       BERTSUMEXT          43.2 / 12.4  / 16.8              –    / –     / –
Longsumm       BERTSUMEXT MULTI    43.3 / 13.0* / 17.0              53.1 / 16.8  / 20.3
arXiv-Long     BERTSUMEXT          47.1 / 18.2  / 20.8              47.2 / 18.4  / 21.1
arXiv-Long     BERTSUMEXT MULTI    47.8* / 18.9* / 21.3*            47.8* / 19.2* / 21.5*
PubMed-Long    BERTSUMEXT          49.1 / 24.3  / 25.7              49.1 / 24.5  / 25.8
PubMed-Long    BERTSUMEXT MULTI    48.9 / 24.1  / 25.5              48.9 / 24.1  / 25.5

Table 2: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXT MULTI) on the extended summarization datasets. * indicates a statistically significant improvement (paired t-test, p < 0.01). The validation set for Longsumm refers to our internal validation set (20% of the abstractive set), as no official validation set was provided for this dataset.

As we see in Table 2, incorporating the section predictor into the summarization model (i.e., the BERTSUMEXT MULTI model) performs favorably compared to the baseline model. This is a particularly important finding, since it underlines the importance of injecting documents' structure when summarizing a scientific paper. While the score gap is relatively large on the arXiv-Long and Longsumm datasets, the two models perform similarly on the PubMed-Long dataset.

System                                                     RG-1     RG-2     RG-L     F-Measure average
Other systems
  Summaformers (Ghosh Roy et al. 2020)                     49.38    16.86    21.38    29.21
  Wing                                                     50.58    16.62    20.50    29.23
  IIITBH-IITP (Reddy et al. 2020)                          49.03    15.74    20.46    28.41
  Auth-Team (Gidiotis, Stefanidis, and Tsoumakas 2020)     50.11    15.37    19.59    28.36
  CIST_BUPT (Li et al. 2020)                               48.99    15.06    20.13    28.06
This work
  BERTSUMEXT MULTI                                         53.11    16.77    20.34    30.07

Table 3: ROUGE (F1) results of our multi-tasking model on the blind test set of the Longsumm shared task, containing 22 abstractive summaries (Chandrasekaran et al. 2020), along with the performance of other participants' systems. We only show the top 5 participants in this table.
As observed in Table 3, the BERTSUMEXT MULTI approach performs best among the state-of-the-art long summarization methods on the blind test set of the Longsumm challenge (Chandrasekaran et al. 2020). While this model improves ROUGE-1 quite significantly over the other state-of-the-art systems, it stays competitive on the ROUGE-2 and ROUGE-L metrics. In terms of the ROUGE (F1) F-Measure average, the BERTSUMEXT MULTI model ranks first by a considerable margin compared to the other systems.

To test the model on a mixed, varied-length summarization dataset, we trained and tested it on the arXiv dataset (Cohan et al. 2018), which contains ground-truth abstracts of varying length. Table 4 shows that our model achieves competitive performance on this dataset. While the model does not yield any improvement on the arXiv dataset, our hypothesis was that our model is superior to existing models on longer-form datasets, such as those used in this research, which we validated by presenting the evaluation results on the long summarization datasets.
Dataset    Model               Validation RG-1 / RG-2 / RG-L    Test RG-1 / RG-2 / RG-L
arXiv      BERTSUMEXT          43.6 / 16.6 / 20.2               44.0 / 16.8 / 20.4
arXiv      BERTSUMEXT MULTI    43.4 / 16.5 / 19.8               43.5 / 16.5 / 20.0

Table 4: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXT MULTI) on the arXiv summarization dataset.

                        Analysis

In order to gain insight into how our multi-tasking approach works on different long datasets, we perform an extensive analysis in this section to explore the qualities of our multi-tasking system (i.e., BERTSUMEXT MULTI) relative to the baseline (i.e., BERTSUMEXT). Specifically, we perform two types of analyses: 1) a quantitative analysis and 2) a qualitative analysis.

For the first part, we use two metrics. RGdiff denotes the average ROUGE (F1) difference (i.e., gap) between the baseline and our model, where the average is taken over the ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1) scores; positive values indicate an improvement, while negative values denote a decline in scores. Similarly, Fdiff is the average difference in F1 score between the baseline and our model. We create three bins sorted by RGdiff: IMPROVED, which contains the summaries whose average ROUGE (F1) score is improved by the multi-tasking model; TIED, including those whose average ROUGE (F1) score the multi-tasking model leaves unchanged; and DECLINED, containing those whose average ROUGE (F1) score is decreased by the joint model.
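Concretely, the binning can be expressed as below. This is a sketch assuming the per-document ROUGE F1 scores of both systems are already available as dictionaries keyed by document id; the names are illustrative:

    def avg_rouge_f1(scores):
        """Average of ROUGE-1, ROUGE-2, and ROUGE-L F1 for one summary."""
        return (scores["rouge1"] + scores["rouge2"] + scores["rougeL"]) / 3.0


    def assign_bins(baseline_scores, multitask_scores):
        """Sort test documents into IMPROVED / TIED / DECLINED by their RGdiff."""
        bins = {"IMPROVED": [], "TIED": [], "DECLINED": []}
        for doc_id, base in baseline_scores.items():
            rg_diff = avg_rouge_f1(multitask_scores[doc_id]) - avg_rouge_f1(base)
            if rg_diff > 0:
                bins["IMPROVED"].append((doc_id, rg_diff))
            elif rg_diff < 0:
                bins["DECLINED"].append((doc_id, rg_diff))
            else:
                bins["TIED"].append((doc_id, rg_diff))
        return bins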


For the qualitative analysis, we specifically aim at comparing the methods in terms of section distribution, since that is where our method's improvements are expected to come from. Furthermore, we conduct an additional length analysis over the results generated by the baseline versus our model.

Quantitative Analysis

We first perform the quantitative analysis over the long summarization datasets' test sets in two parts: 1) a metric analysis, which compares the different bins based on the average ROUGE score difference between the baseline and our model, and 2) a length analysis, which targets the correlation between the summary length of the different bins and the models' performance.

Metric analysis   Table 5 shows the overall quantities for the Longsumm and arXiv-Long datasets in terms of the average difference of ROUGE and F1 scores. As shown, on Longsumm the multi-tasking approach is able to improve 76 summaries with an average ROUGE (F1) improvement of 2.05%. The gain is even larger when evaluating the model on the arXiv-Long dataset, with an average ROUGE improvement of 2.40%.

Interestingly, our method consistently improves the F1 measure in general (see the total F1 scores in Table 5). Seemingly, the F1 metric directly correlates with the ROUGE (F1) metric on the arXiv-Long dataset, whereas this is not the case for the DECLINED bin of the Longsumm dataset. This might be due to the relatively small test set size of the Longsumm dataset. It should also be noted that the IMPROVED bin holds relatively higher counts and better metrics than the DECLINED bin across both datasets in our evaluation.
Dataset       Bin         Count    RGdiff    Fdiff
Longsumm      IMPROVED    76       2.05      6.16
Longsumm      TIED        4        0         0
Longsumm      DECLINED    74       −1.47     1.95
Longsumm      Total       154      0.31      4.11
arXiv-Long    IMPROVED    1,084    2.40      4.47
arXiv-Long    TIED        67       0         0.32
arXiv-Long    DECLINED    801      −1.82     −1.34
arXiv-Long    Total       1,952    0.59      1.94

Table 5: IMPROVED, TIED, and DECLINED bins on the test sets of the Longsumm and arXiv-Long datasets. The numbers show the improvements (positive) and drops (negative) compared to the baseline model (i.e., BERTSUMEXT).

Length analysis   We analyze the results generated by both models to see whether the summary length affects the models' performance, using the bar charts in Figure 2. The bar charts compare both models across different length bins (x-axis), which are evenly populated (i.e., each bin has the same number of papers). We used five bins (each with 31 summaries) for Longsumm and ten bins (each with 196 summaries) for arXiv-Long.
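Such evenly populated bins can be built, for instance, by sorting documents by ground-truth summary length and splitting them into equal-count groups; the exact bucketing code is not described in the paper, so the following is only a sketch:

    import numpy as np


    def equal_count_length_bins(summary_lengths, num_bins):
        """Split documents into evenly populated bins by ground-truth summary length,
        e.g. 5 bins of 31 summaries for Longsumm and 10 bins of 196 for arXiv-Long."""
        order = np.argsort(summary_lengths)      # document indices, shortest summary first
        return np.array_split(order, num_bins)   # each entry: the document indices of one bin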
[Figure 2: bar charts of ROUGE-2 scores per summary-length bin, comparing the Baseline and Multi-tasking models. Panel (a): ROUGE-2 scores on Longsumm; panel (b): ROUGE-2 scores on arXiv-Long. x-axis: Summary Length (tokens); y-axis: ROUGE-2.]

Figure 2: Bar charts exhibiting the correlation of ground-truth summary length (in tokens) with the performance of the baseline (i.e., BERTSUMEXT) and our multi-tasking model (i.e., BERTSUMEXT MULTI). The diagrams are shown for the Longsumm and arXiv-Long test sets. Each bin contains 31 summaries for Longsumm and 196 summaries for arXiv-Long. As shown, the multi-tasking model generally outperforms the baseline on the later bins, which include longer-form summaries.

As shown in Figure 2, as the length of the ground-truth summary increases, the multi-tasking model generally improves over the baseline on both datasets, except for the last bin of Longsumm (Figure 2 (a)), where it achieves comparable performance. This behaviour is also observed on ROUGE-1 and ROUGE-L for the Longsumm dataset. The ROUGE improvement is even more noticeable in the analysis of the arXiv-Long dataset (see Figure 2 (b)). Thus, the length analysis supports our hypothesis that the multi-tasking model outperforms the baseline more significantly when the summary is of longer form.


Qualitative analysis

Examining the results of the qualitative analysis on the IMPROVED bin, we found that the multi-tasking model can effectively sample sentences from diverse sections when the ground-truth summary is also sampled from diverse sections. It improves significantly over the baseline when the extractive model can detect salient sentences from important sections.

By investigating the summaries in the DECLINED bin, we noticed that for declined summaries, while our multi-tasking approach can adjust the extraction probability distribution toward diverse sections, it has difficulty picking up salient sentences (i.e., positive sentences) from the corresponding sections; thus, it ends up with relatively lower ROUGE scores. This might be improved if the two networks (i.e., sentence selection and section prediction) were optimized in a more elegant way, such that the extractive summarizer could further select salient sentences from the specified sections once they have been identified. For example, improved multi-tasking methods could involve task prioritization (Guo et al. 2018) to dynamically balance the learning process between the two tasks during training, rather than using a fixed α parameter.

In the cases where the F1 score and ROUGE (F1) were not consistent with each other, we observed that adding non-salient sentences to the final summary hurts the final ROUGE (F1) scores. In other words, while the multi-tasking approach can achieve a higher F1 score than the baseline, because it chooses different non-salient (i.e., negative) sentences than the baseline, the overall ROUGE (F1) scores drop slightly. Conditioning the decoding length (i.e., the number of selected sentences) might help with this, as done in (Mao et al. 2020).

[Figure 3: two heat-maps over the source sentence numbers (x-axis: Source sentence number; color scale: extraction probability). Panel (a): extraction probability distribution of the baseline model (i.e., BERTSUMEXT) over the source sentences. Panel (b): extraction probability distribution of the multi-tasking model (i.e., BERTSUMEXT MULTI) over the source sentences.]

Figure 3: Heat-maps showing the extraction probabilities over the source sentences (paper ID astro-ph9807040, sampled from the arXiv-Long dataset). For simplicity, we only show the sentences that receive over 15% extraction probability from the models. The cells bordered in black show the models' final selections, and oracle sentences are indicated with *.

Figure 3 shows the extraction probabilities that each model outputs over the source sentences. It is observable that the baseline model picks most of its sentences (47%) from the beginning of the paper, while the multi-tasking approach (Figure 3 (b)) effectively redistributes the probability mass toward summary-worthy sentences located across different sections of the paper, and picks those with higher confidence. Our model achieves an overall F1 score of 53.33% on this sample paper, while the baseline's F1 score is 33.33%.
               Conclusion & Future Work

In this paper, we approach the problem of generating extended summaries of long documents. Our proposed model is a multi-task learning approach that unifies the sentence selection and section prediction processes for extracting summary-worthy sentences. We further collect two large-scale extended summary datasets (arXiv-Long and PubMed-Long) from scientific papers. Our results on three datasets show the efficacy of the joint multi-task model in the extended summarization task. While it achieves fairly competitive performance with the baseline on one of the three datasets, it consistently improves over the baseline on the other two evaluation datasets. We further performed extensive quantitative and qualitative analyses over the results generated by both models. These evaluations revealed our model's qualities compared to the baseline. Based on the error analysis, the performance of this model depends heavily on the multi-tasking objectives. Future studies could fruitfully explore this issue further by optimizing the multi-task objectives in a way that both the sentence selection and section prediction tasks can benefit.

                      References

Abu-Jbara, A.; and Radev, D. R. 2011. Coherent Citation-Based Summarization of Scientific Papers. In ACL.
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. ArXiv abs/2004.05150.
Chandrasekaran, M. K.; Feigenblat, G.; Hovy, E.; Ravichander, A.; Shmueli-Scheuer, M.; and de Waard, A. 2020. Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm. In SDP.
Cohan, A.; Beltagy, I.; King, D.; Dalvi, B.; and Weld, D. S. 2019. Pretrained Language Models for Sequential Sentence Classification. In EMNLP/IJCNLP.
Cohan, A.; Dernoncourt, F.; Kim, D. S.; Bui, T.; Kim, S.; Chang, W.; and Goharian, N. 2018. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In NAACL-HLT.
Cohan, A.; and Goharian, N. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In EMNLP.
Cohan, A.; and Goharian, N. 2018. Scientific Document Summarization via Citation Contextualization and Scientific Discourse. International Journal on Digital Libraries 19(2-3): 287–303.
Collins, E.; Augenstein, I.; and Riedel, S. 2017. A Supervised Approach to Extractive Summarisation of Scientific Papers. In CoNLL.
Conroy, J. M.; and Davis, S. 2017. Section Mixture Models for Scientific Document Summarization. International Journal on Digital Libraries 19: 305–322.
Dong, Y.; Wang, S.; Gan, Z.; Cheng, Y.; Cheung, J.; and Liu, J. 2020. Multi-Fact Correction in Abstractive Text Summarization. In EMNLP.
Ghosh Roy, S.; Pinnaparaju, N.; Jain, R.; Gupta, M.; and Varma, V. 2020. Summaformers @ LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; Stefanidis, S.; and Tsoumakas, G. 2020. AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; and Tsoumakas, G. 2020. A Divide-and-Conquer Approach to the Summarization of Academic Articles. ArXiv abs/2004.06190.
Guo, M.; Haque, A.; Huang, D.-A.; Yeung, S.; and Fei-Fei, L. 2018. Dynamic Task Prioritization for Multitask Learning. In ECCV.
Jia, R.; Cao, Y.; Tang, H.; Fang, F.; Cao, C.; and Wang, S. 2020. Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network. In EMNLP.
Lev, G.; Shmueli-Scheuer, M.; Herzig, J.; Jerbi, A.; and Konopnicki, D. 2019. TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks. In ACL.
Li, L.; Xie, Y.; Liu, W.; Liu, Y.; Jiang, Y.; Qi, S.; and Li, X. 2020. CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization. In SDP.
Liu, Y.; and Lapata, M. 2019. Text Summarization with Pretrained Encoders. In EMNLP/IJCNLP.
MacAvaney, S.; Sotudeh, S.; Cohan, A.; Goharian, N.; Talati, I.; and Filice, R. W. 2019. Ontology-Aware Clinical Abstractive Summarization. In SIGIR.
Mao, Y.; Qu, Y.; Xie, Y.; Ren, X.; and Han, J. 2020. Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning. In EMNLP.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In AAAI.
Reddy, S.; Saini, N.; Saha, S.; and Bhattacharyya, P. 2020. IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20. In SDP.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
Sotudeh, S.; Cohan, A.; and Goharian, N. 2020. GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents. In SDP.
Sotudeh, S.; Goharian, N.; and Filice, R. 2020. Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization. In ACL.
Teufel, S.; and Moens, M. 2002. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status. Computational Linguistics 28: 409–445.
Xiao, W.; and Carenini, G. 2019. Extractive Summarization of Long Documents by Combining Global and Local Context. In EMNLP/IJCNLP.
Xu, J.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. Discourse-Aware Neural Extractive Text Summarization. In ACL.
Zhang, J.; Zhao, Y.; Saleh, M.; and Liu, P. J. 2019. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In ICML.
Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; and Zhao, T. 2018. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In ACL.