<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Generating Extended Summaries of Long Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sajad Sotudeh</string-name>
          <email>sajad@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arman Cohan</string-name>
          <email>armanc@allenai.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazli Goharian</string-name>
          <email>nazli@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Allen Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Seattle, WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IR Lab, Georgetown University</institution>
          ,
          <addr-line>Washington D.C.</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the past few years, there has been significant
progress on both extractive
        <xref ref-type="bibr" rid="ref10 ref11 ref19 ref22 ref25 ref32">(e.g., Nallapati, Zhai,
and Zhou 2017; Zhou et al. 2018; Liu and Lapata
2019; Xu et al. 2020; Jia et al. 2020)</xref>
        and abstractive
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref17 ref23 ref27 ref28 ref31 ref6 ref8">(e.g., See, Liu, and Manning 2017; Cohan et al.
2018; MacAvaney et al. 2019; Zhang et al. 2019;
Sotudeh, Goharian, and Filice 2020; Dong et al.
2020)</xref>
        approaches for document summarization.
These approaches generate a concise summary of a
document, capturing its salient content. However, for
a longer document containing numerous details, it
is sometimes helpful to read an extended summary,
providing details about its different aspects. Scientific
papers are examples of such documents; while their
2. In-depth and comprehensive analyses over the
generated results to explore the qualities of our
model in comparison with the baseline model.
3. Collecting two large-scale extended summarization
datasets with oracle labels to facilitate ongoing
research in the extended summarization domain.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Scientific document summarization. Summarizing
scientific papers has garnered considerable attention from the
research community in recent years, although
it has been studied for decades. The characteristics
of scientific papers, namely their length, writing
style, and discourse structure, call for special model
considerations when tackling the summarization
task in the scientific domain. Researchers have utilized
different approaches to address these challenges.
In earlier work, Teufel and Moens (2002) proposed
a Naïve Bayes classifier to perform content selection
over a document’s sentences with regard to their
rhetorical role. More recent works have
highlighted the importance of discourse structure
and its usefulness in summarizing scientific papers.
For example, Collins, Augenstein, and Riedel (2017)
used the pre-defined section cluster in which a source
sentence appears as a categorical feature
to help the model identify summary-worthy
sentences.
        <xref ref-type="bibr" rid="ref6 ref8">Cohan et al. (2018)</xref>
        introduced large-scale
datasets of arXiv and PubMed (collected from public
repositories), and used a hierarchical encoder to
model the discourse structure of a paper, and then
used an attentive decoder to generate the summary.
More recently, Xiao and Carenini (2019) proposed a
sequence-to-sequence model that incorporates both
the global context of the entire document and local
context within the specified section. Inspired by the
fact that discourse information is important when
dealing with long documents
        <xref ref-type="bibr" rid="ref6 ref8">(Cohan et al. 2018)</xref>
        ,
we utilize this structure in scientific summarization.
Unlike prior works, we integrate sentence selection
and sentence section labeling processes through a
multi-task learning approach. In a different line of
research, the use of citation context information has
been shown to be quite effective at summarizing
scientific papers
        <xref ref-type="bibr" rid="ref1">(Abu-Jbara and Radev 2011)</xref>
        .
For instance,
        <xref ref-type="bibr" rid="ref7">Cohan and Goharian (2015</xref>
        , 2018)
utilized a citation-based approach, denoting how
the paper is cited in the reference papers, to form
the summary. Here, we do not exploit any citation
context information.
      </p>
      <sec id="sec-2-1">
        <title>Extended summarization</title>
        <p>
          While summarization research has been extensively explored in the literature,
extended summarization has recently gained a great
deal of attention from the research community.
Among the first attempts to encourage the ongoing
research in this field,
          <xref ref-type="bibr" rid="ref4">Chandrasekaran et al. (2020)</xref>
          set up the Longsumm shared task (https://ornlcda.github.io/SDProc/sharedtasks.html) on producing
extended summaries from scientific documents
and provided an extended summarization dataset
called Longsumm, over which participants were
urged to generate extended summaries. To tackle this
challenge, researchers used different methodologies.
For instance, Sotudeh, Cohan, and Goharian (2020)
proposed a multi-tasking approach to jointly learn
sentence importance along with its section to
be included in the summary. Herein, we aim at
validating the multi-tasking model on a variety of
extended summarization datasets and provide a
comprehensive analysis to guide future research.
Moreover,
          <xref ref-type="bibr" rid="ref14">Ghosh Roy et al. (2020</xref>
          ) utilized
section-contribution pre-computations (training set) to
assign weights via a budget module for generating
extended summaries. After specifying the section
contribution, an extractive summarizer is executed
over each section separately to extract salient
sentences. Unlike their work, we unify the sentence
selection and sentence section prediction tasks to
effectively help the model identify
summary-worthy sentences scattered across different sections.
Furthermore, Reddy et al. (2020) proposed a
CNN-based classification network for extracting salient
sentences. Gidiotis, Stefanidis, and Tsoumakas (2020)
proposed to use a divide-and-conquer (DANCER)
approach
          <xref ref-type="bibr" rid="ref15 ref17">(Gidiotis and Tsoumakas 2020)</xref>
          to identify
the key sections of the paper to be summarized. The
PEGASUS abstractive summarizer
          <xref ref-type="bibr" rid="ref31">(Zhang et al. 2019)</xref>
          then runs over each section separately to produce
section summaries, which are finally concatenated
to form the extended summary.
          <xref ref-type="bibr" rid="ref2">Beltagy, Peters,
and Cohan (2020</xref>
          ) proposed “Longformer”, which
utilizes “Dilated Sliding Windows”, enabling the
model to achieve better long-range coverage on
long documents. To the best of our knowledge,
we are the first to
conduct a comprehensive analysis of the
generated summarization results in the extended
summarization domain.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        We use three extended summarization datasets in
this research. The first one is the Longsumm dataset,
which was provided in the Longsumm 2020
shared task
        <xref ref-type="bibr" rid="ref4">(Chandrasekaran et al. 2020)</xref>
        . To further
validate the model, we collect two additional datasets
called arXiv-Long and PubMed-Long by filtering the
instances of arXiv and PubMed corpora to retain
those whose abstract contains at least 350 tokens.
Also, to measure how our model performs on a mixed
varied-length scientific dataset, we use the arXiv
summarization dataset
        <xref ref-type="bibr" rid="ref6 ref8">(Cohan et al. 2018)</xref>
        .
Longsumm. The Longsumm dataset was provided
for the Longsumm challenge
        <xref ref-type="bibr" rid="ref4">(Chandrasekaran
et al. 2020)</xref>
        , whose aim was to generate extended
summaries for scientific papers. It consists of two
types of summaries:
• Extractive summaries: these summaries come
from the TalkSumm dataset
        <xref ref-type="bibr" rid="ref20">(Lev et al. 2019)</xref>
        ,
containing 1705 extractive summaries of scientific
papers based on their video talks at conferences
(e.g., ACL and NAACL). Each summary within this
corpus is formed by appending the top 30 sentences of
the paper.
• Abstractive summaries: an add-on dataset
containing 531 abstractive summaries from several
CS domains such as Machine Learning, NLP, and
AI, written by NLP and ML researchers on
their blogs. The length of summaries in this dataset
ranges from 50 to 1500 words per paper.
      </p>
      <p>
        In our experiments, we use the extractive set
along with 50% of the abstractive set as our training
set, containing 1969 papers; and 20% of it as the
validation set. Note that these splits are made for
the purpose of our internal experiments as the
official test set containing 22 abstractive summaries
is blind
        <xref ref-type="bibr" rid="ref4">(Chandrasekaran et al. 2020)</xref>
        .
arXiv-Long &amp; PubMed-Long. To further test our
methods on additional datasets, we construct
two extended summarization datasets for our
task. For creating the first dataset, we take arXiv
summarization dataset introduced by
        <xref ref-type="bibr" rid="ref6 ref8">Cohan et al.
(2018)</xref>
        and retain the instances whose abstract
(i.e., ground-truth summary) contains at least 350
tokens. We call this dataset arXiv-Long. We repeat
the same process on the PubMed papers obtained
from the Open Access FTP service<sup>2</sup> and call this
dataset PubMed-Long. Our motivation is to
validate our model on extended
summarization datasets and to investigate its effects
compared to existing work; 350 tokens is the
length threshold that we use to characterize papers
with “long” summaries. The resulting sets contain
11,149 instances for arXiv-Long and 88,035 instances
for PubMed-Long. Note that the abstracts
of papers are used as ground-truth summaries
in these two datasets. The overall statistics of the
datasets are shown in Table 1. We release these
datasets to facilitate future research in extended
summarization.<sup>3</sup>
      </p>
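      <p>The length-based filtering used to build arXiv-Long and PubMed-Long can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer is not specified in the text, so whitespace tokenization is assumed, and the record layout with an "abstract" field as well as the helper names are hypothetical.</p>
      <preformat><![CDATA[
```python
MIN_SUMMARY_TOKENS = 350  # length threshold stated in the text


def count_tokens(text):
    # Assumption: whitespace tokenization (the actual tokenizer is not specified).
    return len(text.split())


def keep_long_summaries(instances, min_tokens=MIN_SUMMARY_TOKENS):
    # Retain only the instances whose abstract has at least min_tokens tokens.
    return [doc for doc in instances if count_tokens(doc["abstract"]) >= min_tokens]


# Toy corpus with a hypothetical record layout.
corpus = [
    {"id": "a", "abstract": "word " * 400},
    {"id": "b", "abstract": "word " * 120},
]
long_subset = keep_long_summaries(corpus)
print([doc["id"] for doc in long_subset])
```
]]></preformat>
      <p>With a different tokenizer the number of retained instances would likely differ from the counts reported above.</p>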
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>In this section, we discuss our proposed method
that aims at jointly learning to predict sentence
importance and its corresponding section. Before
discussing the details of our summarization model,
we investigate the preliminary background that
provides a fair basis for implementing our method.</p>
      <sec id="sec-4-1">
        <title>Background</title>
      </sec>
      <sec id="sec-4-2">
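        <title>Extractive Summarization</title>
        <p>The extractive summarization system aims at extracting salient
sentences to be included in the summary. Formally,
let P show a scientific paper containing sentences
<inline-formula><tex-math>[s_1, s_2, s_3, \ldots, s_m]</tex-math></inline-formula>, where <inline-formula><tex-math>m</tex-math></inline-formula> is the number of sentences.
The extractive summarization is then defined as the
task of assigning a binary label (<inline-formula><tex-math>\hat{y}_i \in \{0, 1\}</tex-math></inline-formula>) to each
sentence <inline-formula><tex-math>s_i</tex-math></inline-formula> within the paper, signifying whether the
sentence should be included in the summary.</p>
        <p><sup>2</sup>https://www.ncbi.nlm.nih.gov/pmc/tools/ftp</p>
        <p><sup>3</sup>https://github.com/Georgetown-IR-Lab/ExtendedSumm</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Datasets</th><th># docs</th><th>avg. doc. length (tokens)</th><th>avg. summ. length (tokens)</th></tr>
            </thead>
            <tbody>
              <tr><td>arXiv</td><td>215K</td><td /><td /></tr>
              <tr><td>Longsumm</td><td>2.2K</td><td /><td /></tr>
              <tr><td>arXiv-Long</td><td>11.1K</td><td /><td /></tr>
              <tr><td>PubMed-Long</td><td>88.0K</td><td /><td /></tr>
            </tbody>
          </table>
        </table-wrap>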
      </sec>
      <sec id="sec-4-3">
        <title>BERTSUM: BERT for Summarization</title>
        <p>
          As our base model, we use the BERTSUM extractive summarization
model
          <xref ref-type="bibr" rid="ref22">(Liu and Lapata 2019)</xref>
          , a BERT-based sentence
classification model fine-tuned for summarization.
        </p>
        <p>
          After BERTSUM outputs sentence representations
within the input document, several inter-sentence
Transformer layers are stacked upon BERTSUM
to collect document-level features. The final output
layer is a linear classifier with a sigmoid activation
function that decides whether each sentence should be
included or not. The loss function is defined as below:
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>L_1 = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]</tex-math>
          </disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is the output size, <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> is the output of the
model, and <inline-formula><tex-math>y_i</tex-math></inline-formula> is the corresponding target value. In
our experiments, we use this model to extract salient
sentences (i.e., those with the positive label) to form
the summary. We set this model as the baseline called
BERTSUMEXT
          <xref ref-type="bibr" rid="ref22">(Liu and Lapata 2019)</xref>
          .
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Our model: a section-aware summarizer</title>
        <p>
          Inspired by a few prior works that have studied
the effect of a document’s hierarchical structure on the
summarization task
          <xref ref-type="bibr" rid="ref11 ref6 ref8">(Conroy and Davis 2017; Cohan
et al. 2018)</xref>
          , we define a section prediction task,
aiming at predicting the relevant section for each
sentence in the document. Specifically, we add
an additional linear classification layer on top of
BERTSUM sentence representations to predict the
relevant section to each sentence. The loss function
for the section prediction network is defined as
follows:
        </p>
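        <p>
          <disp-formula id="eq-2">
            <label>(2)</label>
            <tex-math>L_2 = -\sum_{i=1}^{S} y_i \log(\hat{y}_i)</tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>y_i</tex-math></inline-formula> and <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> are the ground-truth and the model
scores for each section <inline-formula><tex-math>i</tex-math></inline-formula> in <inline-formula><tex-math>S</tex-math></inline-formula>.</p>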
        <p>[Figure: model overview. Recoverable component labels: BERTSUM, Transformer Encoder, Linear Layer (Sentence Selection), Linear Layer (Section Prediction).]</p>
        <p>The entire extractive network is then trained to
optimize both tasks (i.e., sentence selection and
section prediction) in a multi-task setting:</p>
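        <p>
          <disp-formula id="eq-3">
            <label>(3)</label>
            <tex-math>L_{Multi} = \alpha L_1 + (1 - \alpha) L_2</tex-math>
          </disp-formula>
where <inline-formula><tex-math>L_1</tex-math></inline-formula> is the binary cross-entropy loss from
the sentence selection task, <inline-formula><tex-math>L_2</tex-math></inline-formula> is the categorical
cross-entropy loss from the section prediction network, and
<inline-formula><tex-math>\alpha</tex-math></inline-formula> is the weighting parameter that balances the
learning procedure between the sentence and section
prediction tasks.</p>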
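        <p>As a worked illustration of how the two losses combine, the following minimal sketch evaluates the binary cross-entropy of the sentence selection task, the categorical cross-entropy of the section prediction task, and their weighted sum on toy values. The function names and numbers are hypothetical, plain Python is used instead of a deep learning framework, and the section loss is shown for a single sentence because its aggregation over sentences is not spelled out in the text.</p>
        <preformat><![CDATA[
```python
import math


def binary_cross_entropy(targets, probs):
    # Sentence selection loss: mean binary cross-entropy over N sentences.
    # Probabilities must lie strictly between 0 and 1.
    total = 0.0
    for y, p in zip(targets, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(targets)


def categorical_cross_entropy(one_hot, probs):
    # Section prediction loss for one sentence over S section classes.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


def multi_task_loss(l1, l2, alpha=0.5):
    # Weighted combination of the two task losses.
    return alpha * l1 + (1 - alpha) * l2


# Toy values: two sentences for selection, one sentence with three sections.
l1 = binary_cross_entropy([1, 0], [0.8, 0.3])
l2 = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
loss = multi_task_loss(l1, l2, alpha=0.5)
print(round(l1, 4), round(l2, 4), round(loss, 4))
```
]]></preformat>
        <p>Setting alpha to 1 recovers the sentence selection loss alone, while alpha equal to 0 trains only the section predictor.</p>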
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Setup</title>
      <p>In this section, we give details about the
preprocessing steps applied to the datasets and the parameters that
we used for the models in our experiments.</p>
      <p>
        For our baseline, we used the pre-trained BERTSUM
model and implementation provided by the authors
        <xref ref-type="bibr" rid="ref22">(Liu and Lapata 2019)</xref>
        .<sup>4</sup> BERTSUMEXTMULTI
is the model used in
        <xref ref-type="bibr" rid="ref17 ref28">(Sotudeh, Cohan,
and Goharian 2020)</xref>
        , but without the post-processing
module at inference time, which utilizes
trigram blocking
        <xref ref-type="bibr" rid="ref22">(Liu and Lapata 2019)</xref>
        to hinder repetitions
in the final summary. We intentionally removed the
post-processing part, as the model attained higher
scores in the absence of this module throughout
our experiments. In order to obtain ground-truth
section labels associated with each sentence, we
utilized the external sequential-sentence package<sup>5</sup>
        by
        <xref ref-type="bibr" rid="ref5">Cohan et al. (2019)</xref>
        . To provide oracle labels for
source sentences in our datasets, we use a greedy
labelling approach
        <xref ref-type="bibr" rid="ref22">(Liu and Lapata 2019)</xref>
        with a slight
modification, labelling up to the top 30, 15, and 25
sentences for the Longsumm, arXiv-Long, and
PubMed-Long datasets, respectively, since these numbers of
oracle sentences yielded the highest oracle scores.<sup>6</sup>
For the joint model, we set α (the loss weighting
parameter) to 0.5, as it resulted in the highest scores
throughout our experiments. In all our experiments,
we pick the checkpoint that achieves the best average
of ROUGE-2 and ROUGE-L scores at the validation
intervals as our best model for inference.
      </p>
      <p><sup>4</sup>https://github.com/nlpyang/PreSumm</p>
      <p><sup>5</sup>https://github.com/allenai/sequential_sentence_classification</p>
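      <p>The greedy oracle labelling step can be illustrated with the following simplified sketch. It is not the implementation used in our experiments: a unigram-overlap F1 stands in for ROUGE, the section-diversity modification is omitted, and all function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    # Simplified stand-in for ROUGE: F1 over overlapping unigram counts.
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, summary, max_sentences):
    # Repeatedly add the sentence that most improves the score; stop when
    # no sentence improves it or max_sentences have been selected.
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for idx in range(len(sentences)):
            if idx in selected:
                continue
            candidate = [tok for i in selected + [idx] for tok in tokenized[i]]
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx is None:
            break
        selected.append(best_idx)
    # Binary oracle label per source sentence.
    return [1 if i in selected else 0 for i in range(len(sentences))]


sentences = [
    "we propose a multi-task model",
    "the weather was nice",
    "it predicts sentence sections",
]
summary = "a multi-task model predicts sentence sections"
labels = greedy_oracle(sentences, summary, max_sentences=2)
print(labels)
```
]]></preformat>
      <p>In this sketch, max_sentences plays the role of the per-dataset cap on oracle sentences (30, 15, or 25).</p>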
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>In this section, we present the performance of the
baseline and our model over the validation and test
sets of the extended summarization datasets. We then
discuss our proposed model’s performance compared
to the baseline over a mixed varied-length summarization
dataset (i.e., arXiv). As evaluation metrics, we
report the summarization systems’ performance in
terms of ROUGE-1 (F1), ROUGE-2 (F1), and
ROUGE-L (F1).</p>
      <p>As Table 2 shows, incorporating the section
predictor into the summarization
model (i.e., the BERTSUMEXTMULTI model) performs
fairly well compared to the baseline model. This is a
particularly important finding since it highlights
the importance of injecting documents’ structure
when summarizing a scientific paper. While the score
gap is relatively larger on the arXiv-Long and Longsumm
datasets, the two models score similarly on the PubMed-Long dataset.</p>
      <p>
        As observed in Table 3, the
BERTSUMEXTMULTI approach performs best
among the state-of-the-art long summarization
methods on the blind test set of the LongSumm
challenge
        <xref ref-type="bibr" rid="ref4">(Chandrasekaran et al. 2020)</xref>
        . While
this model improves ROUGE-1 quite significantly
over the other state-of-the-art systems, it stays competitive on
the ROUGE-2 and ROUGE-L metrics. In terms of the average ROUGE
(F1) score, the BERTSUMEXTMULTI model
ranks first by a large margin compared to the other
systems.
      </p>
      <p><sup>6</sup>The modification was made to ensure that the oracle
sentences are sampled from diverse sections.</p>
      <p>[Table fragment: rows BERTSUMEXT and BERTSUMEXTMULTI; datasets Longsumm, arXiv-Long, and PubMed-Long; surviving values 43.2, 43.3, 47.1, 47.8*.]</p>
      <p>
        To test the model on mixed varied-length
summarization datasets, we trained and tested it
on arXiv
        <xref ref-type="bibr" rid="ref6 ref8">(Cohan et al. 2018)</xref>
        dataset, which contains
a mix of varying-length abstracts as ground-truth
summaries. Table 4 shows that our model achieves
competitive performance on this dataset. Although the
model does not yield any improvement on the arXiv
dataset, our hypothesis was that our
model is superior to existing models on longer-form
datasets such as those used in this research,
which we validated by presenting the evaluation
results on the long summarization datasets.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Analysis</title>
      <p>In order to gain insights into how our
multi-tasking approach works on different long datasets,
we perform an extensive analysis in this section
to explore the qualities of our multi-tasking system
(i.e., BERTSUMEXTMULTI) over the baseline (i.e.,
BERTSUMEXT). Specifically, we perform two types
of analyses: 1) quantitative analysis; 2) qualitative
analysis.</p>
      <p>For the first part, we use two metrics.
RGdiff denotes the average ROUGE (F1)
difference (i.e., gap) between the baseline and our
model.<sup>7</sup> Positive values indicate an improvement,
while negative values denote a decline in scores.
Similarly, Fdiff is the average difference in F1 score
between the baseline and our model. We create three
bins sorted by RGdiff: IMPROVED, which contains the
summaries whose average ROUGE (F1) score is improved
by the multi-tasking model; TIED, which includes those
whose average ROUGE (F1) score the multi-tasking model
leaves unchanged; and
DECLINED, which contains those whose average ROUGE
(F1) score is decreased by the joint model.</p>
      <p>For the qualitative analysis, we specifically
aim at comparing the methods in terms of section
distribution, since that is where our method’s
improvements are expected to come from.
Furthermore, we conduct an additional length
analysis over the results generated by the baseline
versus our model.</p>
      <p><sup>7</sup>The average is defined on ROUGE-1 (F1), ROUGE-2
(F1), and ROUGE-L (F1) scores.</p>
      <p>[Table fragment: –, 20.3, 21.1, 21.5*; Dataset: arXiv; surviving values 43.6, 43.4, 20.2, 19.8, 20.4, 20.0.]</p>
      <sec id="sec-7-1">
        <title>Quantitative Analysis</title>
        <p>We first perform the quantitative analysis over
the long summarization datasets’ test sets in two
parts: 1) metric analysis, which aims at
comparing different bins based on the average ROUGE
score difference between the baseline and our model; and 2)
length analysis, which aims at finding the correlation
between the summary length in different bins and the
models’ performance.</p>
        <p>Metric analysis. Table 5 shows the overall statistics
of the Longsumm and arXiv-Long datasets in terms
of the average differences in ROUGE and F1 scores.</p>
        <p>As shown, the multi-tasking approach is able to
improve 76 summaries with an average ROUGE (F1)
improvement of 2.05%. The improvement is even larger
on the arXiv-Long dataset,
with an average ROUGE improvement of 2.40%.</p>
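        <p>The assignment of a summary to the IMPROVED, TIED, or DECLINED bin can be sketched as follows. This is a minimal illustration: the score dictionaries, the function names, and the small tolerance used to detect ties are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
def average_rouge(scores):
    # Mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores of one summary.
    return (scores["rouge1"] + scores["rouge2"] + scores["rougeL"]) / 3.0


def assign_bin(baseline_scores, multi_scores, tolerance=1e-9):
    # RGdiff is positive when the multi-tasking model scores higher.
    rg_diff = average_rouge(multi_scores) - average_rouge(baseline_scores)
    if rg_diff > tolerance:
        return "IMPROVED", rg_diff
    if rg_diff < -tolerance:
        return "DECLINED", rg_diff
    return "TIED", rg_diff


# Hypothetical per-summary scores for the two systems.
baseline = {"rouge1": 40.0, "rouge2": 15.0, "rougeL": 20.0}
multi = {"rouge1": 42.0, "rouge2": 16.0, "rougeL": 23.0}
bin_name, rg_diff = assign_bin(baseline, multi)
print(bin_name, round(rg_diff, 2))
```
]]></preformat>
        <p>Averaging RGdiff within each bin then gives per-bin figures of the kind discussed in this analysis.</p>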
        <p>Interestingly, our method consistently improves the
F1 measure in general (see the total F1 scores in Table 5).</p>
        <p>Seemingly, the F1 metric directly correlates with the ROUGE
(F1) metric on the arXiv-Long dataset, whereas this is
not the case for the DECLINED bin of the Longsumm
dataset. This might be due to the relatively small test
set of the Longsumm dataset. Note
that the IMPROVED bin holds relatively higher counts and
improved metrics than the DECLINED bin across
both datasets in our evaluation.</p>
        <p>Length analysis. We analyze the results generated
by both models to see whether the summary length affects
the models’ performance, using the bar charts in Figure 2.</p>
        <p>The bar charts provide the basis for
comparing both models on different length bins
(x-axis), which are equally sized (i.e., contain the same
number of papers). We
used five bins (each with 31 summaries) and ten
bins (each with 196 summaries) for the Longsumm
and arXiv-Long datasets, respectively.</p>
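        <p>Equally sized length bins of this kind can be obtained by sorting the summaries by length and slicing the sorted list. The sketch below assumes a hypothetical record layout with a "length" field, and places any remainder in the last bin when the number of summaries is not divisible by the number of bins.</p>
        <preformat><![CDATA[
```python
def equal_count_bins(items, num_bins, key):
    # Sort by the given key, then cut into num_bins consecutive slices
    # holding the same number of items (the last bin takes any remainder).
    ordered = sorted(items, key=key)
    size = len(ordered) // num_bins
    bins = []
    for b in range(num_bins):
        start = b * size
        end = start + size if b < num_bins - 1 else len(ordered)
        bins.append(ordered[start:end])
    return bins


# Toy summaries with hypothetical lengths in tokens.
summaries = [
    {"id": i, "length": length}
    for i, length in enumerate([120, 800, 450, 300, 950, 610])
]
bins = equal_count_bins(summaries, num_bins=3, key=lambda s: s["length"])
print([[s["length"] for s in b] for b in bins])
```
]]></preformat>
        <p>For example, 155 summaries split into five bins would give 31 summaries per bin.</p>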
        <p>As shown in Figure 2, as the length of the ground-truth summary increases,
the multi-tasking model generally improves over the
baseline on both datasets, except for
the last bin on the Longsumm dataset (Figure 2 (a)), where it achieves
comparable performance. This behaviour is also
observed on ROUGE-1 and ROUGE-L for the Longsumm
dataset. The ROUGE improvement is even more
noticeable on the arXiv-Long
dataset (see Figure 2 (b)). Thus, the length analysis
supports our hypothesis that the multi-tasking model
outperforms the baseline more significantly when the
summary is of longer form.</p>
        <p>[Figure panel (b): Extraction probability distribution of the multi-tasking model (i.e., BERTSUMEXTMULTI) over the source sentences; x-axis: source sentence number.]</p>
      </sec>
      <sec id="sec-7-2">
        <title>Qualitative analysis</title>
        <p>From the qualitative analysis on the
IMPROVED bin, we found that the
multi-tasking model can effectively sample sentences
from diverse sections when the ground-truth
summary is also sampled from diverse sections. It
improves significantly over the baseline when the
extractive model can detect salient sentences from
important sections.</p>
        <p>
          By investigating the summaries from the DECLINED
bin, we noticed that, while
our multi-tasking approach can adjust the extraction
probability distribution toward diverse sections, it has
difficulty picking up salient sentences (i.e., positive
sentences) from the corresponding section; this
leads to a relatively lower ROUGE score. This might
be improved if the two networks (i.e., sentence selection
and section prediction) are optimized in a more
refined way, such that the extractive summarizer can
better select salient sentences from the specified
sections once they are identified. For example,
improved multi-tasking methods could involve
task prioritization
          <xref ref-type="bibr" rid="ref18">(Guo et al. 2018)</xref>
          to dynamically
balance the learning process between the two tasks
during training, rather than using a fixed α parameter.
        </p>
        <p>In the cases where the F1 score and ROUGE (F1)
were not consistent with each other, we observed
that adding non-salient sentences to the final
summary hurts the final ROUGE (F1) scores. In
other words, while the multi-tasking approach can
achieve a higher F1 score than the baseline,
since it chooses different non-salient (i.e., negative)
sentences than the baseline, the overall ROUGE (F1)
scores drop slightly. A conditional decoding
length (i.e., number of sentences) might help with this, as done
by Mao et al. (2020).</p>
        <p>Figure 3 shows the extraction probabilities that
each model outputs for the source sentences. The
baseline model picks most of its
sentences (47%) from the beginning of the paper,
while the multi-tasking approach (b) effectively
redistributes the probability distribution toward summary-worthy
sentences spread across different sections of the
paper, and picks those with higher confidence. Our
model achieves an overall F1 score of 53.33% on this
sample paper, while the baseline’s F1 score is 33.33%.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion &amp; Future Work</title>
      <p>In this paper, we approach the problem of generating
extended summaries, given a long document. Our
proposed model is a multi-task learning approach
that unifies sentence selection and section prediction
processes, extracting summary-worthy sentences. We
further collect two large-scale extended summary
datasets (arXiv-Long and PubMed-Long) from
scientific papers. Our results on three datasets show
the efficacy of the joint multi-task model in the
extended summarization task. While it achieves
fairly competitive performance with the baseline on
one of three datasets, it consistently improves over
the baseline in the other two evaluation datasets.
We further performed extensive quantitative and
qualitative analyses over the generated results by
both models. These evaluations revealed our model’s
qualities compared to the baseline. Based on the error
analysis, it could be noticed that the performance
of this model highly depends on the multi-tasking
objectives. Future studies could fruitfully explore this
issue further by optimizing the multi-task objectives
in a way that both sentence selection and section
prediction tasks can benefit.</p>
      <p>Xiao, W.; and Carenini, G. 2019. Extractive
Summarization of Long Documents by Combining
Global and Local Context. In EMNLP/IJCNLP.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abu-Jbara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D. R.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Coherent Citation-Based Summarization of Scientific Papers</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Longformer: The Long-Document Transformer</article-title>
          . ArXiv abs/2004.05150.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chandrasekaran</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Feigenblat</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ravichander</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shmueli-Scheuer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>de Waard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CLSciSumm, LaySumm and LongSumm</article-title>
          . In SDP.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>King</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dalvi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Pretrained Language Models for Sequential Sentence Classification</article-title>
          . In EMNLP/IJCNLP.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents</article-title>
          . In NAACL-HLT.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Scientific Article Summarization Using Citation-Context and Article's Discourse Structure</article-title>
          .
          <source>In EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Scientific document summarization via citation contextualization and scientific discourse</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>International Journal on Digital Libraries</source>
          <volume>19</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>287</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Collins</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Augenstein</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A Supervised Approach to Extractive Summarisation of Scientific Papers</article-title>
          . In CoNLL.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Conroy</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Section mixture models for scientific document summarization</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>International Journal on Digital Libraries</source>
          <volume>19</volume>
          :
          <fpage>305</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
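          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>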
          <year>2020</year>
          .
          <article-title>Multi-Fact Correction in Abstractive Text Summarization</article-title>
          .
          <source>In EMNLP</source>
          , volume abs/2010.02443.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Ghosh Roy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pinnaparaju</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Summaformers @ LaySumm 20, LongSumm 20</article-title>
          . In SDP.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Gidiotis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20</article-title>
          . In SDP.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Gidiotis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>A Divide-and-Conquer Approach to the Summarization of Academic Articles</article-title>
          . ArXiv abs/2004.06190.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Haque</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>D.-A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Dynamic Task Prioritization for Multitask Learning</article-title>
          .
          <source>In ECCV.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network</article-title>
          .
          <source>In EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Lev</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shmueli-Scheuer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Herzig</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jerbi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Konopnicki</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks</article-title>
          . In ACL.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization</article-title>
          .
          <source>In SDP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Text Summarization with Pretrained Encoders</article-title>
          . In EMNLP/IJCNLP.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>MacAvaney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sotudeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Talati</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Filice</surname>
            ,
            <given-names>R. W.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Ontology-Aware Clinical Abstractive Summarization</article-title>
          .
          <source>In SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          2020.
          <article-title>Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Nallapati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          2020.
          <article-title>IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20</article-title>
          . In SDP.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>See</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Get To The Point: Summarization with Pointer-Generator Networks</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Sotudeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents</article-title>
          .
          <source>In SDP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          2020.
          <article-title>Discourse-Aware Neural Extractive Text Summarization</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Saleh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Neural Document Summarization by Jointly Learning to Score and Select Sentences</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>