On Generating Extended Summaries of Long Documents

Sajad Sotudeh†, Arman Cohan‡, Nazli Goharian†
† IR Lab, Georgetown University, Washington D.C., USA
{sajad, nazli}@ir.cs.georgetown.edu
‡ Allen Institute for Artificial Intelligence, Seattle, WA, USA
armanc@allenai.org

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps get a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that cannot fit in a short summary. This is typically the case for longer documents such as a research paper, legal document, or a book. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis over the generated results, shedding insights on future research for the long-form summary generation task. Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm.

Introduction

In the past few years, there has been significant progress on both extractive (e.g., Nallapati, Zhai, and Zhou 2017; Zhou et al. 2018; Liu and Lapata 2019; Xu et al. 2020; Jia et al. 2020) and abstractive (e.g., See, Liu, and Manning 2017; Cohan et al. 2018; MacAvaney et al. 2019; Zhang et al. 2019; Sotudeh, Goharian, and Filice 2020; Dong et al. 2020) approaches for document summarization. These approaches generate a concise summary of a document, capturing its salient content. However, for a longer document containing numerous details, it is sometimes helpful to read an extended summary that provides details about its different aspects. Scientific papers are examples of such documents; while their abstracts provide a short summary of their main methods and findings, the abstract does not include details of the methods or experimental conditions. For those who seek more detailed information about a document without having to read the entire document, an extended or long summary can be desirable (Chandrasekaran et al. 2020; Sotudeh, Cohan, and Goharian 2020; Ghosh Roy et al. 2020).

Many long documents, including scientific papers, follow a certain hierarchical structure in which content is organized throughout multiple sections and sub-sections. For example, research papers often describe objectives, problem, methodology, experiments, and conclusions (Collins, Augenstein, and Riedel 2017). A few prior studies have noted the importance of documents' structure in shorter-form summary generation (Collins, Augenstein, and Riedel 2017; Cohan et al. 2018). However, we are not aware of existing summarization methods that explicitly model the document structure when it comes to generating extended summaries.
We approach the problem of generating extended summaries by incorporating the document's hierarchical structure into the summarization model. Specifically, we hypothesize that integrating the processes of sentence selection and section prediction improves the summarization model's performance over existing baseline models on the extended summarization task. To substantiate our hypothesis, we test our proposed model on three extended summarization datasets, namely arXiv-Long, PubMed-Long, and Longsumm. We further provide comprehensive analyses over the generated results for two long datasets, demonstrating the qualities of our model over the baseline. Our analysis reveals that the multi-tasking model helps adjust the sentence extraction probability to the advantage of salient sentences scattered across different sections of the document. Our contributions are threefold:

1. A multi-task learning approach for leveraging document structure in generating extended summaries of long documents.
2. In-depth and comprehensive analyses over the generated results to explore the qualities of our model in comparison with the baseline model.
3. Two large-scale extended summarization datasets with oracle labels, collected to facilitate ongoing research in the extended summarization domain.

Related Work

Scientific document summarization Summarizing scientific papers has garnered vast attention from the research community in recent years, although it has been studied for decades. The characteristics of scientific papers, namely their length, writing style, and discourse structure, call for special considerations when addressing the summarization task in the scientific domain. Researchers have utilized different approaches to address these challenges. In earlier work, Teufel and Moens (2002) proposed a Naïve Bayes classifier to perform content selection over the documents' sentences with regard to their rhetorical sentence role. More recent works have emphasized the importance of discourse structure and its usefulness in summarizing scientific papers. For example, Collins, Augenstein, and Riedel (2017) used a set of pre-defined section clusters in which source sentences appear as a categorical feature to aid the model in identifying summary-worthy sentences. Cohan et al. (2018) introduced large-scale datasets of arXiv and PubMed papers (collected from public repositories), used a hierarchical encoder to model the discourse structure of a paper, and then used an attentive decoder to generate the summary. More recently, Xiao and Carenini (2019) proposed a sequence-to-sequence model that incorporates both the global context of the entire document and the local context within the specified section. Inspired by the fact that discourse information is important when dealing with long documents (Cohan et al. 2018), we utilize this structure in scientific summarization. Unlike prior works, we integrate the sentence selection and sentence section labeling processes through a multi-task learning approach. In a different line of research, the use of citation context information has been shown to be quite effective at summarizing scientific papers (Abu-Jbara and Radev 2011). For instance, Cohan and Goharian (2015, 2018) utilized a citation-based approach, denoting how the paper is cited in the reference papers, to form the summary. Here, we do not exploit any citation context information.
Extended summarization While summarization research has been extensively explored in the literature, extended summarization has only recently gained a great deal of attention from the research community. Among the first attempts to encourage ongoing research in this field, Chandrasekaran et al. (2020) set up the Longsumm shared task[1] on producing extended summaries from scientific documents and provided an extended summarization dataset called Longsumm, over which participants were urged to generate extended summaries. To tackle this challenge, researchers used different methodologies. For instance, Sotudeh, Cohan, and Goharian (2020) proposed a multi-tasking approach to jointly learn sentence importance along with the section in which a sentence should be included in the summary. Herein, we aim at validating the multi-tasking model on a variety of extended summarization datasets and provide a comprehensive analysis to guide future research. Moreover, Ghosh Roy et al. (2020) utilized section contribution pre-computations (on the training set) to assign weights via a budget module for generating extended summaries. After specifying the section contribution, an extractive summarizer is executed over each section separately to extract salient sentences. Unlike their work, we unify the sentence selection and sentence section prediction tasks to effectively aid the model in identifying summary-worthy sentences scattered around different sections. Furthermore, Reddy et al. (2020) proposed a CNN-based classification network for extracting salient sentences. Gidiotis, Stefanidis, and Tsoumakas (2020) proposed to use a divide-and-conquer (DANCER) approach (Gidiotis and Tsoumakas 2020) to identify the key sections of the paper to be summarized; the PEGASUS abstractive summarizer (Zhang et al. 2019) then runs over each section separately to produce section summaries, which are finally concatenated to form the extended summary. Beltagy, Peters, and Cohan (2020) proposed "Longformer", which utilizes "Dilated Sliding Windows", enabling the model to achieve better long-range coverage on long documents. To the best of our knowledge, we are the first to conduct such a comprehensive analysis over the generated summarization results in the extended summarization domain.

[1] https://ornlcda.github.io/SDProc/sharedtasks.html

Dataset

We use three extended summarization datasets in this research. The first one is the Longsumm dataset, which was provided in the Longsumm 2020 shared task (Chandrasekaran et al. 2020). To further validate the model, we collect two additional datasets, called arXiv-Long and PubMed-Long, by filtering the instances of the arXiv and PubMed corpora to retain those whose abstract contains at least 350 tokens. Also, to measure how our model works on a mixed, varied-length scientific dataset, we use the arXiv summarization dataset (Cohan et al. 2018).

Longsumm The Longsumm dataset was provided for the Longsumm challenge (Chandrasekaran et al. 2020), whose aim was to generate extended summaries for scientific papers. It consists of two types of summaries:
• Extractive summaries: these summaries come from the TalkSumm dataset (Lev et al. 2019), containing 1705 extractive summaries of scientific papers derived from their video talks at conferences (i.e., ACL, NAACL, etc.). Each summary within this corpus is formed by appending the top 30 sentences of the paper.
• Abstractive summaries: an add-on dataset containing 531 abstractive summaries from several CS domains such as Machine Learning, NLP, and AI, written by NLP and ML researchers on their blogs. The length of summaries in this dataset ranges from 50 to 1500 words per paper.
In our experiments, we use the extractive set along with 50% of the abstractive set as our training set, containing 1969 papers, and 20% of the abstractive set as the validation set. Note that these splits are made for the purpose of our internal experiments, as the official test set containing 22 abstractive summaries is blind (Chandrasekaran et al. 2020).

arXiv-Long & PubMed-Long To further test our methods on additional datasets, we construct two extended summarization datasets for our task. For the first dataset, we take the arXiv summarization dataset introduced by Cohan et al. (2018) and filter the instances whose abstract (i.e., ground-truth summary) contains at least 350 tokens. We call this dataset arXiv-Long. We repeat the same process on the PubMed papers obtained from the Open Access FTP service[2] and call this dataset PubMed-Long. The motivation is that we are interested in validating our model on extended summarization datasets to investigate its effects compared to existing works, and 350 is the length threshold that we use to characterize papers with "long" summaries. The resulting sets contain 11,149 instances for arXiv-Long and 88,035 instances for PubMed-Long. Note that the abstracts of the papers are used as ground-truth summaries in these two datasets. The overall statistics of the datasets are shown in Table 1. We release these datasets to facilitate future research in extended summarization.[3]

[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp
[3] https://github.com/Georgetown-IR-Lab/ExtendedSumm

Table 1: Statistics on arXiv (Cohan et al. 2018), Longsumm (Chandrasekaran et al. 2020), and the two extended summarization datasets (arXiv-Long, PubMed-Long) collected in this work.

Datasets    | # docs | avg. doc. length (tokens) | avg. summ. length (tokens)
arXiv       | 215K   | 4938                      | 220
Longsumm    | 2.2K   | 5858                      | 920
arXiv-Long  | 11.1K  | 9221                      | 574
PubMed-Long | 88.0K  | 5359                      | 403
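The length-based filtering used to build arXiv-Long and PubMed-Long is simple to reproduce. Below is a minimal sketch, assuming a JSON-lines corpus in which each record stores the abstract as a list of sentence strings under an "abstract_text" field; the field and file names are illustrative assumptions, not part of the released datasets.

```python
# Sketch of the length-based filtering behind arXiv-Long / PubMed-Long:
# keep only papers whose ground-truth summary (abstract) has >= 350 tokens.
import json

MIN_ABSTRACT_TOKENS = 350  # threshold characterizing "long" summaries

def abstract_length(record: dict) -> int:
    """Count whitespace tokens over all abstract sentences."""
    return sum(len(sent.split()) for sent in record["abstract_text"])

def build_long_subset(src_path: str, dst_path: str) -> int:
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if abstract_length(record) >= MIN_ABSTRACT_TOKENS:
                dst.write(json.dumps(record) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = build_long_subset("arxiv_train.jsonl", "arxiv_long_train.jsonl")
    print(f"retained {n} instances")
```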
Methodology

In this section, we discuss our proposed method, which aims at jointly learning to predict sentence importance and its corresponding section. Before discussing the details of our summarization model, we review the preliminary background that provides the basis for our method.

Background

Extractive Summarization An extractive summarization system aims at extracting salient sentences to be included in the summary. Formally, let P denote a scientific paper containing sentences [s_1, s_2, s_3, ..., s_m], where m is the number of sentences. Extractive summarization is then defined as the task of assigning a binary label ŷ_i ∈ {0, 1} to each sentence s_i within the paper, signifying whether the sentence should be included in the summary.

BERTSUM: BERT for Summarization As our base model, we use the BERTSUM extractive summarization model (Liu and Lapata 2019), a BERT-based sentence classification model fine-tuned for summarization. After BERTSUM outputs sentence representations for the input document, several inter-sentence Transformer layers are stacked on top of BERTSUM to collect document-level features. The final output layer is a linear classifier with a Sigmoid activation function that decides whether each sentence should be included or not. The loss function is defined as:

L_1 = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]   (1)

where N is the output size, ŷ_i is the output of the model, and y_i is the corresponding target value. In our experiments, we use this model to extract salient sentences (i.e., those with the positive label) to form the summary. We set this model as the baseline, called BERTSUMEXT (Liu and Lapata 2019).

Our model: a section-aware summarizer

Inspired by prior works that have studied the effect of a document's hierarchical structure on the summarization task (Conroy and Davis 2017; Cohan et al. 2018), we define a section prediction task, aiming at predicting the relevant section for each sentence in the document. Specifically, we add an additional linear classification layer on top of the BERTSUM sentence representations to predict the relevant section for each sentence. The loss function for the section prediction network is defined as follows:

L_2 = -\sum_{i=1}^{S} y_i \log(\hat{y}_i)   (2)

where y_i and ŷ_i are the ground-truth and model scores for each section i in S.

Figure 1: Overview of the BERTSUMEXTMULTI model: BERTSUM sentence representations are fed to a Transformer encoder, followed by two linear layers, one for Sentence Selection and one for Section Prediction. The baseline model (i.e., BERTSUMEXT) is marked with a dashed border; the extension to the baseline is the addition of the Section Prediction linear layer (shown in the green box).

The entire extractive network is then trained to optimize both tasks (i.e., sentence selection and section prediction) in a multi-task setting:

L_Multi = \alpha L_1 + (1 - \alpha) L_2   (3)

where L_1 is the binary cross-entropy loss from the sentence selection task, L_2 is the categorical cross-entropy loss from the section prediction network, and α is a weighting parameter that balances the learning procedure between the sentence and section prediction tasks.
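As a concrete illustration of Eq. (3), the following is a minimal PyTorch sketch of the two linear output heads and the combined objective. The BERTSUM encoder is abstracted away as a tensor of sentence vectors, and all names and shapes are illustrative assumptions rather than the exact released implementation.

```python
# Sketch of the two output heads and the multi-task objective of Eq. (3).
# `sent_vecs` stands in for the BERTSUM sentence representations.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_size: int, num_sections: int):
        super().__init__()
        self.selection_head = nn.Linear(hidden_size, 1)            # sentence selection (binary)
        self.section_head = nn.Linear(hidden_size, num_sections)   # section prediction

    def forward(self, sent_vecs):                                  # (batch, n_sents, hidden)
        sel_logits = self.selection_head(sent_vecs).squeeze(-1)    # (batch, n_sents)
        sec_logits = self.section_head(sent_vecs)                  # (batch, n_sents, n_sections)
        return sel_logits, sec_logits

def multi_task_loss(sel_logits, sec_logits, sel_labels, sec_labels, alpha: float = 0.5):
    """L_Multi = alpha * L1 (binary CE, selection) + (1 - alpha) * L2 (categorical CE, section)."""
    l1 = nn.functional.binary_cross_entropy_with_logits(sel_logits, sel_labels.float())
    l2 = nn.functional.cross_entropy(sec_logits.view(-1, sec_logits.size(-1)), sec_labels.view(-1))
    return alpha * l1 + (1.0 - alpha) * l2
```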
Experimental Setup

In this section, we give details about the pre-processing steps applied to the datasets and the parameters used for the experimented models. For our baseline, we used the pre-trained BERTSUM model and implementation provided by the authors (Liu and Lapata 2019).[4] The BERTSUMEXTMULTI model is that of Sotudeh, Cohan, and Goharian (2020), but without the post-processing module at inference time, which utilizes trigram-blocking (Liu and Lapata 2019) to hinder repetitions in the final summary. We intentionally removed the post-processing part, as the model attained higher scores in the absence of this module throughout our experiments. In order to obtain ground-truth section labels associated with each sentence, we utilized the external sequential-sentence-classification package[5] by Cohan et al. (2019). To provide oracle labels for source sentences in our datasets, we use a greedy labelling approach (Liu and Lapata 2019) with a slight modification, labelling up to the top 30, 15, and 25 sentences for the Longsumm, arXiv-Long, and PubMed-Long datasets, respectively, since these numbers of oracle sentences yielded the highest oracle scores.[6] For the joint model, we tuned α (the loss weighting parameter) to 0.5, as it resulted in the highest scores throughout our experiments. In all our experiments, we pick the checkpoint that achieves the best average of ROUGE-2 and ROUGE-L scores at the validation intervals as our best model for inference.

[4] https://github.com/nlpyang/PreSumm
[5] https://github.com/allenai/sequential_sentence_classification
[6] The modification was made to assure that the oracle sentences are sampled from diverse sections.
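For readers who want to reproduce such oracle labels, below is a simplified sketch of a greedy labelling procedure in the spirit of Liu and Lapata (2019): iteratively add the source sentence that most increases ROUGE against the gold summary, up to the per-dataset cap. It uses the rouge-score package as an assumption (any ROUGE implementation works), and the modification that enforces sampling oracle sentences from diverse sections is omitted.

```python
# Simplified greedy oracle labelling sketch (not the authors' exact script).
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def _oracle_score(selected_sents, gold_summary):
    scores = _scorer.score(gold_summary, " ".join(selected_sents))
    return scores["rouge1"].fmeasure + scores["rouge2"].fmeasure

def greedy_oracle(source_sents, gold_summary, max_sents=30):
    """Return indices of sentences labelled 1 (oracle), greedily maximizing ROUGE."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = [i for i in range(len(source_sents)) if i not in selected]
        if not candidates:
            break
        gains = [(_oracle_score([source_sents[j] for j in selected + [i]], gold_summary), i)
                 for i in candidates]
        score, idx = max(gains)
        if score <= best:          # stop when no remaining sentence improves the oracle score
            break
        best, selected = score, selected + [idx]
    return sorted(selected)
```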
Results

In this section, we present the performance of the baseline and our model over the validation and test sets of the extended summarization datasets. We then discuss our proposed model's performance compared to the baseline over a mixed, varied-length summarization dataset (i.e., arXiv). As evaluation metrics, we report the summarization systems' performance in terms of ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1).

As we see in Table 2, the summarization model with the section predictor incorporated (i.e., the BERTSUMEXTMULTI model) performs fairly well compared to the baseline model. This is a particularly important finding, since it characterizes the importance of injecting documents' structure when summarizing a scientific paper. While the score gap is relatively large on the arXiv-Long and Longsumm datasets, the two models perform similarly on the PubMed-Long dataset.

Table 2: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on the extended summarization datasets. ∗ shows a statistically significant improvement (paired t-test, p < 0.01). The validation set for Longsumm refers to our internal validation set (20% of the abstractive set), as there was no official validation set provided for this dataset.

Dataset     | Model           | Val RG-1(%) | Val RG-2(%) | Val RG-L(%) | Test RG-1(%) | Test RG-2(%) | Test RG-L(%)
Longsumm    | BERTSUMEXT      | 43.2        | 12.4        | 16.8        | –            | –            | –
Longsumm    | BERTSUMEXTMULTI | 43.3        | 13.0∗       | 17.0        | 53.1         | 16.8         | 20.3
arXiv-Long  | BERTSUMEXT      | 47.1        | 18.2        | 20.8        | 47.2         | 18.4         | 21.1
arXiv-Long  | BERTSUMEXTMULTI | 47.8∗       | 18.9∗       | 21.3∗       | 47.8∗        | 19.2∗        | 21.5∗
PubMed-Long | BERTSUMEXT      | 49.1        | 24.3        | 25.7        | 49.1         | 24.5         | 25.8
PubMed-Long | BERTSUMEXTMULTI | 48.9        | 24.1        | 25.5        | 48.9         | 24.1         | 25.5

As observed in Table 3, the BERTSUMEXTMULTI approach performs at the top among the state-of-the-art long summarization methods on the blind test set of the LongSumm challenge (Chandrasekaran et al. 2020). While this model improves ROUGE-1 quite significantly over the other state-of-the-art systems, it stays competitive on the ROUGE-2 and ROUGE-L metrics. In terms of the ROUGE (F1) F-Measure average, the BERTSUMEXTMULTI model ranks first by a large margin compared to the other systems.

Table 3: ROUGE (F1) results of our multi-tasking model on the blind test set of the Longsumm shared task containing 22 abstractive summaries (Chandrasekaran et al. 2020), along with the performance of other participants' systems. We only show the top 5 participants in this table.

System                                               | RG-1  | RG-2  | RG-L  | F-Measure average
Summaformers (Ghosh Roy et al. 2020)                 | 49.38 | 16.86 | 21.38 | 29.21
Wing                                                 | 50.58 | 16.62 | 20.50 | 29.23
IIITBH-IITP (Reddy et al. 2020)                      | 49.03 | 15.74 | 20.46 | 28.41
Auth-Team (Gidiotis, Stefanidis, and Tsoumakas 2020) | 50.11 | 15.37 | 19.59 | 28.36
CIST_BUPT (Li et al. 2020)                           | 48.99 | 15.06 | 20.13 | 28.06
BERTSUMEXTMULTI (this work)                          | 53.11 | 16.77 | 20.34 | 30.07

To test the model on a mixed, varied-length summarization dataset, we trained and tested it on the arXiv dataset (Cohan et al. 2018), which contains a mix of varying-length abstracts as ground-truth summaries. Table 4 shows that our model achieves competitive performance on this dataset. While the model does not yield any improvement on the arXiv dataset, our hypothesis was that our model is superior to existing models on longer-form datasets, such as those we have used in this research, which we validated by presenting the evaluation results on the long summarization datasets.

Table 4: ROUGE (F1) results of the baseline (i.e., BERTSUMEXT) and our proposed model (i.e., BERTSUMEXTMULTI) on the arXiv summarization dataset.

Dataset | Model           | Val RG-1(%) | Val RG-2(%) | Val RG-L(%) | Test RG-1(%) | Test RG-2(%) | Test RG-L(%)
arXiv   | BERTSUMEXT      | 43.6        | 16.6        | 20.2        | 44.0         | 16.8         | 20.4
arXiv   | BERTSUMEXTMULTI | 43.4        | 16.5        | 19.8        | 43.5         | 16.5         | 20.0

Analysis

In order to gain insight into how our multi-tasking approach works on different long datasets, we perform an extensive analysis in this section to explore the qualities of our multi-tasking system (i.e., BERTSUMEXTMULTI) over the baseline (i.e., BERTSUMEXT). Specifically, we perform two types of analyses: 1) quantitative analysis; 2) qualitative analysis.

For the first part, we use two metrics: RGdiff, which denotes the average ROUGE (F1) difference (i.e., gap) between the baseline and our model[7] (positive values indicate an improvement, while negative values denote a decline in scores), and Fdiff, the average difference in F1 score between the baseline and our model. We create three bins sorted by RGdiff: IMPROVED, which contains the reports whose average ROUGE (F1) score is improved by the multi-tasking model; TIED, including those whose average ROUGE (F1) score the multi-tasking model leaves unchanged; and DECLINED, containing those whose average ROUGE (F1) score is decreased by the joint model. For the qualitative analysis, we specifically aim at comparing the methods in terms of section distribution, since that is where our method's improvements are expected to come from. Furthermore, we conduct an additional length analysis over the results generated by the baseline versus our model.

[7] The average is defined over the ROUGE-1 (F1), ROUGE-2 (F1), and ROUGE-L (F1) scores.
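A minimal sketch of this binning is given below; inputs are per-document ROUGE-1/2/L (F1) scores for the two systems, and the data structures are illustrative assumptions rather than the exact analysis script.

```python
# Sketch of the metric-analysis binning: per document, compute the average
# ROUGE (F1) gap between our model and the baseline (RGdiff) and assign the
# document to the IMPROVED / TIED / DECLINED bin accordingly.
from statistics import mean

def rg_diff(model_scores: dict, baseline_scores: dict) -> float:
    keys = ("rouge1", "rouge2", "rougeL")
    return mean(model_scores[k] - baseline_scores[k] for k in keys)

def bin_documents(model_results: dict, baseline_results: dict) -> dict:
    """Both arguments map document id -> {"rouge1": ..., "rouge2": ..., "rougeL": ...}."""
    bins = {"IMPROVED": [], "TIED": [], "DECLINED": []}
    for doc_id in model_results:
        diff = rg_diff(model_results[doc_id], baseline_results[doc_id])
        if diff > 0:
            bins["IMPROVED"].append((doc_id, diff))
        elif diff < 0:
            bins["DECLINED"].append((doc_id, diff))
        else:
            bins["TIED"].append((doc_id, diff))
    return bins
```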
Quantitative Analysis

We first perform the quantitative analysis over the long summarization datasets' test sets in two parts: 1) metric analysis, which compares the different bins based on the average ROUGE score difference between the baseline and our model; and 2) length analysis, which targets the correlation between the summary length in different bins and the models' performance.

Metric analysis Table 5 shows the overall quantities for the Longsumm and arXiv-Long datasets in terms of the average difference of ROUGE and F1 scores. As shown, on Longsumm the multi-tasking approach is able to improve 76 summaries with an average ROUGE (F1) improvement of 2.05%. This is even more pronounced when evaluating the model on the arXiv-Long dataset, with an average ROUGE improvement of 2.40%. Interestingly, our method consistently improves the F1 measure in general (see the total F1 scores in Table 5). Seemingly, the F1 metric directly correlates with the ROUGE (F1) metric on the arXiv-Long dataset, whereas this is not the case for the DECLINED bin of the Longsumm dataset. This might be due to the relatively small test set size of the Longsumm dataset. It has to be mentioned that the IMPROVED bin holds relatively higher counts and improved metrics than the DECLINED bin across both datasets in our evaluation.

Table 5: IMPROVED, TIED, and DECLINED bins on the test sets of the Longsumm and arXiv-Long datasets. The numbers show the improvements (positive) and drops (negative) compared to the baseline model (i.e., BERTSUMEXT).

Dataset    | Bin      | Count | RGdiff | Fdiff
Longsumm   | IMPROVED | 76    | 2.05   | 6.16
Longsumm   | TIED     | 4     | 0      | 0
Longsumm   | DECLINED | 74    | −1.47  | 1.95
Longsumm   | Total    | 154   | 0.31   | 4.11
arXiv-Long | IMPROVED | 1,084 | 2.40   | 4.47
arXiv-Long | TIED     | 67    | 0      | 0.32
arXiv-Long | DECLINED | 801   | −1.82  | −1.34
arXiv-Long | Total    | 1,952 | 0.59   | 1.94

Length analysis We analyze the results generated by both models to see if the summary length affects the models' performance, using the bar charts in Figure 2. The bar charts are intended to provide the basis for comparing both models on different length bins (x-axis), which are evenly spaced (i.e., each contains the same number of papers). We used five bins (each with 31 summaries) for the Longsumm dataset and ten bins (each with 196 summaries) for the arXiv-Long dataset.

Figure 2: Bar charts exhibiting the correlation of ground-truth summary length (in tokens) with the performance of the baseline (i.e., BERTSUMEXT) and our multi-tasking model (i.e., BERTSUMEXTMULTI): (a) ROUGE-2 scores on Longsumm; (b) ROUGE-2 scores on arXiv-Long. The diagrams are shown for the Longsumm and arXiv-Long test sets. Each bin contains 31 summaries for Longsumm and 196 summaries for arXiv-Long. As denoted, the multi-tasking model generally outperforms the baseline on later bins, which include longer-form summaries.

As shown in Figure 2, as the length of the ground-truth summary increases, the multi-tasking model generally improves over the baseline consistently on both datasets, except for the last bin on the Longsumm dataset (Figure 2 (a)), where it achieves comparable performance. This behaviour is also observed on ROUGE-1 and ROUGE-L for the Longsumm dataset. The ROUGE improvement is even more noticeable in the analysis over the arXiv-Long dataset (see Figure 2 (b)). Thus, the length analysis supports our hypothesis that the multi-tasking model outperforms the baseline more significantly when the summary is of longer form.
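A minimal sketch of this length analysis follows, under the assumption that bins are formed by sorting documents by ground-truth summary length and splitting them into equal-size groups; the exact binning procedure in the paper may differ.

```python
# Sketch of the length analysis: sort test documents by ground-truth summary
# length (tokens), split into equal-size bins (e.g., 5 for Longsumm, 10 for
# arXiv-Long), and average a per-document ROUGE-2 (F1) score within each bin.
def length_bins(doc_ids, summary_lengths, rouge2_f1, num_bins=5):
    """All inputs are dicts keyed by document id; returns one mean score per bin."""
    ordered = sorted(doc_ids, key=lambda d: summary_lengths[d])
    bin_size = len(ordered) // num_bins
    means = []
    for b in range(num_bins):
        bin_docs = ordered[b * bin_size:(b + 1) * bin_size]
        means.append(sum(rouge2_f1[d] for d in bin_docs) / len(bin_docs))
    return means
```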
Qualitative analysis

Examining the summaries in the IMPROVED bin, we found that the multi-tasking model can effectively sample sentences from diverse sections when the ground-truth summary is also sampled from diverse sections. It improves significantly over the baseline when the extractive model can detect salient sentences from important sections.

By investigating the summaries from the DECLINED bin, we noticed that in declined summaries, while our multi-tasking approach can adjust the extraction probability distribution towards diverse sections, it has difficulty picking salient sentences (i.e., positive sentences) from the corresponding section; thus, it leads to relatively lower ROUGE scores. This might be improved if the two networks (i.e., sentence selection and section prediction) were optimized in a more elegant way, such that the extractive summarizer can further select salient sentences from the specified sections once they are identified. For example, improved multi-tasking methods could involve task prioritization (Guo et al. 2018) to dynamically balance the learning process between the two tasks during training, rather than using a fixed α parameter.

In the cases where the F1 score and the ROUGE (F1) scores were not consistent with each other, we observed that adding non-salient sentences to the final summary hurts the final ROUGE (F1) scores. In other words, while the multi-tasking approach can achieve a higher F1 score than the baseline, since it chooses different non-salient (i.e., negative) sentences than the baseline, the overall ROUGE (F1) scores drop slightly. Having a conditional decoding length (i.e., number of sentences) might help with this, as done in Mao et al. (2020).

Figure 3 shows the extraction probabilities that each model outputs over the source sentences of a sample paper. It is observable that the baseline model picks most of its sentences (47%) from the beginning of the paper, while the multi-tasking approach can effectively redistribute the probability mass to summary-worthy sentences located across different sections of the paper, and pick those with higher confidence. Our model achieves an overall F1 score of 53.33% on this sample paper, while the baseline's F1 score is 33.33%.

Figure 3: Heat-maps showing the extraction probabilities over the source sentences (Paper ID: astro-ph9807040, sampled from the arXiv-Long dataset): (a) extraction probability distribution of the baseline model (i.e., BERTSUMEXT); (b) extraction probability distribution of the multi-tasking model (i.e., BERTSUMEXTMULTI). For simplicity, we only show the sentences that receive over 15% extraction probability from the models. The cells bordered in black show the models' final selections, and oracle sentences are indicated with *.
Conclusion & Future Work

In this paper, we approach the problem of generating extended summaries of long documents. Our proposed model is a multi-task learning approach that unifies the sentence selection and section prediction processes to extract summary-worthy sentences. We further collect two large-scale extended summary datasets (arXiv-Long and PubMed-Long) from scientific papers. Our results on three datasets show the efficacy of the joint multi-task model on the extended summarization task. While it achieves fairly competitive performance with the baseline on one of the three datasets, it consistently improves over the baseline on the other two evaluation datasets. We further performed extensive quantitative and qualitative analyses over the results generated by both models. These evaluations revealed our model's qualities compared to the baseline. Based on the error analysis, the performance of this model highly depends on the multi-tasking objectives. Future studies could fruitfully explore this issue further by optimizing the multi-task objectives in a way that both the sentence selection and section prediction tasks can benefit.

References

Abu-Jbara, A.; and Radev, D. R. 2011. Coherent Citation-Based Summarization of Scientific Papers. In ACL.
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. ArXiv abs/2004.05150.
Chandrasekaran, M. K.; Feigenblat, G.; Hovy, E.; Ravichander, A.; Shmueli-Scheuer, M.; and de Waard, A. 2020. Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm. In SDP.
Cohan, A.; Beltagy, I.; King, D.; Dalvi, B.; and Weld, D. S. 2019. Pretrained Language Models for Sequential Sentence Classification. In EMNLP/IJCNLP.
Cohan, A.; Dernoncourt, F.; Kim, D. S.; Bui, T.; Kim, S.; Chang, W.; and Goharian, N. 2018. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In NAACL-HLT.
Cohan, A.; and Goharian, N. 2015. Scientific Article Summarization Using Citation-Context and Article's Discourse Structure. In EMNLP.
Cohan, A.; and Goharian, N. 2018. Scientific document summarization via citation contextualization and scientific discourse. International Journal on Digital Libraries 19(2-3): 287–303.
Collins, E.; Augenstein, I.; and Riedel, S. 2017. A Supervised Approach to Extractive Summarisation of Scientific Papers. In CoNLL.
Conroy, J. M.; and Davis, S. 2017. Section mixture models for scientific document summarization. International Journal on Digital Libraries 19: 305–322.
Dong, Y.; Wang, S.; Gan, Z.; Cheng, Y.; Cheung, J.; and Liu, J. 2020. Multi-Fact Correction in Abstractive Text Summarization. In EMNLP.
Ghosh Roy, S.; Pinnaparaju, N.; Jain, R.; Gupta, M.; and Varma, V. 2020. Summaformers @ LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; Stefanidis, S.; and Tsoumakas, G. 2020. AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20. In SDP.
Gidiotis, A.; and Tsoumakas, G. 2020. A Divide-and-Conquer Approach to the Summarization of Academic Articles. ArXiv abs/2004.06190.
Guo, M.; Haque, A.; Huang, D.-A.; Yeung, S.; and Fei-Fei, L. 2018. Dynamic Task Prioritization for Multitask Learning. In ECCV.
Jia, R.; Cao, Y.; Tang, H.; Fang, F.; Cao, C.; and Wang, S. 2020. Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network. In EMNLP.
Lev, G.; Shmueli-Scheuer, M.; Herzig, J.; Jerbi, A.; and Konopnicki, D. 2019. TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks. In ACL.
Li, L.; Xie, Y.; Liu, W.; Liu, Y.; Jiang, Y.; Qi, S.; and Li, X. 2020. CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization. In SDP.
Liu, Y.; and Lapata, M. 2019. Text Summarization with Pretrained Encoders. In EMNLP/IJCNLP.
MacAvaney, S.; Sotudeh, S.; Cohan, A.; Goharian, N.; Talati, I.; and Filice, R. W. 2019. Ontology-Aware Clinical Abstractive Summarization. In SIGIR.
Mao, Y.; Qu, Y.; Xie, Y.; Ren, X.; and Han, J. 2020. Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning. In EMNLP.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In AAAI.
Reddy, S.; Saini, N.; Saha, S.; and Bhattacharyya, P. 2020. IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20. In SDP.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL.
Sotudeh, S.; Cohan, A.; and Goharian, N. 2020. GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents. In SDP.
Sotudeh, S.; Goharian, N.; and Filice, R. 2020. Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization. In ACL.
Teufel, S.; and Moens, M. 2002. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status. Computational Linguistics 28: 409–445.
Xiao, W.; and Carenini, G. 2019. Extractive Summarization of Long Documents by Combining Global and Local Context. In EMNLP/IJCNLP.
Xu, J.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. Discourse-Aware Neural Extractive Text Summarization. In ACL.
Zhang, J.; Zhao, Y.; Saleh, M.; and Liu, P. J. 2019. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In ICML.
Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; and Zhao, T. 2018. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In ACL.