<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automating Scientific Highlight Generation with Transformer Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sangita Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navya Sinha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyoti Prakash Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology Patna</institution>
          ,
          <addr-line>Patna, 800005, Bihar</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sikkim Manipal Institute of Technology</institution>
          ,
          <addr-line>Majitar, Sikkim</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This work addresses the task of automatic highlight generation from scientific papers' abstracts, a challenging problem in research communication and summarization. To solve this issue, we used benchmark transformer-based models, including T5-small, PEGASUS, and LongT5, as well as an entity-aware variant of PEGASUS. Experimental results demonstrate that PEGASUS achieves the strongest overall performance in terms of ROUGE-1: 0.3272, ROUGE-2: 0.1166, ROUGE-L: 0.2345, and METEOR: 0.2841. These findings establish PEGASUS as the most effective approach for abstractive highlight generation, while also highlighting the limitations of entity-focused methods in this domain.</p>
      </abstract>
      <kwd-group>
        <kwd>highlights generation</kwd>
        <kwd>PEGASUS</kwd>
        <kwd>T5-small</kwd>
        <kwd>LongT5</kwd>
        <kwd>SciHigh-2025 dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>To address the challenges of writing highlights manually, we propose a highlight generation approach designed to enhance both the
accuracy and usefulness of generated highlights. The main contributions of our work are summarized
as follows:
1. We benchmark transformer-based generation models, including T5-small, PEGASUS, and LongT5,
and further experiment with an entity-aware variant of PEGASUS.
2. We provide a comparative evaluation using ROUGE and METEOR metrics.
3. Our method ensures that highlights are not only concise and accurate, but also tailored to what
end-users actually value in practice.</p>
      <sec id="sec-1-1">
        <title>1.1. Research Objectives</title>
        <p>The main objective was to develop and evaluate methods for automatically generating highlights (concise
summaries) from scientific papers. Highlights are short, reader-friendly descriptions that capture the
key contributions of a paper. Since writing them manually is time-consuming and subjective, our goal
was to create a system that can generate accurate, informative, and human-like highlights to assist
researchers, publishers, and readers.</p>
        <p>The work was guided by the following research questions:
RQ1: Can automatic text summarization methods generate highlights that are comparable in quality to
human-written highlights?
RQ2: Which approaches (extractive vs. abstractive models, or hybrid methods) are more effective for
highlight generation in scientific writing?
RQ3: What evaluation metrics best capture the quality of highlights (e.g., ROUGE, METEOR, BERTScore,
human evaluation)?
RQ4: How well do models trained on general summarization datasets transfer to the scientific domain
compared to models fine-tuned on domain-specific corpora?</p>
        <p>The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes
the dataset and proposed models. Section 4 describes the evaluation metrics. Section 5 presents the results
produced by the proposed models. Section 6 discusses the implications of recent pretraining strategies,
such as PEGASUS, T5-small, and LongT5, for highlight generation tasks. Section 7 concludes the paper
with potential directions for future research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Literature</title>
      <p>This section reviews prior work on highlight generation, a sub-task of text summarization that focuses on
producing concise and salient snippets capturing the core meaning of a longer document. Unlike
traditional summarization, which may yield multi-sentence abstracts, highlight generation demands
brevity, often one or two sentences, while preserving informativeness. This problem is particularly
relevant in domains such as news (where highlights serve as teasers), scientific publishing (where
highlights guide readers through dense papers), and meeting records (where highlights emphasize key
decisions or action items).</p>
      <p>Cho et al. [1] introduced a framework for sub-sentence-level highlight generation to ensure
self-contained and meaningful segments. Their method combined XLNet for extracting candidate segments
with Determinantal Point Processes (DPP) for selecting salient and non-redundant highlights. Models
were trained and evaluated on multi-document summarization datasets, including DUC-03/04 and
TAC-08/09/10/11, while a classifier for assessing segment quality was trained separately on the CNN/DM
dataset. The proposed HL-XLNetSegs model achieved ROUGE-1 = 39.2, ROUGE-2 = 10.70, and
ROUGE-SU4 = 14.47 on DUC-04. Human evaluations confirmed that the extracted highlights were highly
self-contained.</p>
      <p>Woodsend et al. [2] developed a phrase-based Integer Linear Programming (ILP) model for joint
content selection and compression in news summarization, designed to generate story highlights.
The model integrated syntactic information from PCFG parse trees and dependency graphs, encoding
grammatical constraints into the optimization process. They constructed a dataset of approximately
9,000 CNN.com article–highlight pairs (2007–2009), with 210 pairs manually annotated. Evaluations
on the DUC-2002 benchmark showed that the ILP model achieved ROUGE-1 = 0.445 and ROUGE-2
= 0.200, outperforming the lead-3 baseline, while human judges found no significant difference in
grammaticality compared to CNN highlights.</p>
      <p>Gupta et al. [3] presented a comprehensive survey of abstractive summarization approaches,
categorizing methods into three paradigms: structure-based (templates, trees, ontologies), semantic-based
(semantic graphs, predicate–argument structures), and deep learning-based (neural encoder–decoder
models). While no new experiments were introduced, they reported typical performance ranges across
benchmarks such as DUC and TAC. Neural approaches typically achieved ROUGE-1 between 0.28 and
0.47, while semantic graph-based methods ranged between 0.30 and 0.40. The authors also highlighted
the limitations of ROUGE for abstractive summaries and advocated for more semantic-aware evaluation
metrics.</p>
      <p>Rehman et al. [4] conducted one of the earliest studies on automatic highlight generation from
abstracts using deep learning. They evaluated three models: (1) a seq2seq model with attention, (2)
a pointer-generator network (PGN), and (3) a PGN with coverage. Using GloVe embeddings and the
CSPubSum dataset (10,142 papers), the PGN+Coverage model achieved the best results with ROUGE-1
= 31.46, ROUGE-2 = 8.57, ROUGE-L = 29.14, and METEOR = 12.01, outperforming both the seq2seq and
PGN baselines.</p>
      <p>Building on this, Rehman et al. [5] proposed an abstractive highlight generation model by enhancing
the pointer-generator network with Named Entity Recognition (NER). Named entities were treated
as single tokens to avoid fragmentation and preserve meaning. Using the CSPubSum dataset, the
best-performing model (NER+PGM+Cov) achieved ROUGE-1 = 38.13, ROUGE-2 = 13.68, ROUGE-L =
35.11, METEOR = 31.03, and BERTScore = 86.3, outperforming non-NER variants.</p>
      <p>In a subsequent study, Rehman et al. [6] integrated pre-trained ELMo embeddings with the
pointer-generator network and coverage mechanism. Four models (PGM, PGM+Cov, PGM+ELMo, and
PGM+ELMo+Cov) were evaluated on CSPubSum using two input settings: (a) abstract only, and (b)
abstract + introduction + conclusion. The best model, PGM+ELMo+Cov, achieved ROUGE-1 = 38.40,
ROUGE-2 = 13.32, ROUGE-L = 35.45, METEOR = 30.61, and BERTScore = 86.6, demonstrating consistent
improvements over the baselines.</p>
      <p>Further extending this line of work, Rehman et al. [7] incorporated SciBERT embeddings into a
PGN+Coverage architecture for domain-specific contextual understanding. Experiments were conducted
on both the CSPubSum and a new MixSub corpus of 19,785 multidisciplinary articles with author-written
highlights. On CSPubSum, the model achieved ROUGE-1 = 38.26, ROUGE-2 = 14.20, ROUGE-L = 35.51,
METEOR = 32.62, and BERTScore = 86.65. On MixSub, it achieved ROUGE-1 = 31.78 and METEOR =
24.00, demonstrating strong cross-domain generalization.</p>
      <p>Xiang et al. [8] investigated the use of highlights to improve unsupervised keyword extraction.
They enriched abstracts with highlight sentences and tested four strategies for combining abstracts
and highlights, including semantic filtering. Three unsupervised models—TextRank, MDERank, and
PromptRank—were evaluated on a new dataset of 1,647 computer science papers derived from Elsevier
and CSPubSum. Incorporating highlights consistently improved extraction performance; for instance,
MDERank with Highlights+Filtered Abstract (H+FA) achieved F1@10 = 15.06, while PromptRank with
Highlights+Abstract (H+A) reached F1@10 = 16.47.</p>
      <p>Recent advances in pre-trained sequence-to-sequence models, including PEGASUS, T5, and LongT5,
have further improved highlight generation, yielding more fluent and contextually accurate outputs.</p>
      <p>Necva et al. [9] introduced MoDeST, a multi-domain and multilingual dataset for scientific title
generation in English and Turkish, spanning disciplines such as social sciences, medical sciences,
and engineering. MoDeST supports generation from multiple sources—keywords, abstracts, and full
articles. Evaluations using LLMs (LLaMA-3.1, Aya-expanse) in zero-shot, few-shot, and fine-tuning
setups revealed that fine-tuning yields the best performance. For Turkish, models achieved scores of
40.12–47.22, and for English, 40.02–49.10. Abstracts were identified as the most effective input source.
This dataset highlights domain-specific and cross-lingual challenges, making it a valuable resource for
future research.</p>
      <p>Finally, Rehman et al. [10] explored three deep learning models for research highlight generation:
(1) a seq2seq model with attention using 128-dimensional GloVe embeddings, (2) a pointer-generator
network (PGN), and (3) a PGN with a coverage mechanism. All models used a vocabulary of 50K
tokens, beam search size 4, and input/output constraints of 400 and 100 tokens, respectively. In later
work, Rehman et al. [11] proposed a research plan emphasizing evaluation techniques for scientific text
summarization, underscoring the importance of reliable metrics and the need to address evaluation
challenges effectively.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>To address our research objectives, we designed an abstractive summarization framework and evaluated
it using a cleaned and enriched version of the MixSub-SciHigh dataset, released through the SciHigh
track at FIRE-2025¹ [7]. The dataset contains 10,000 training instances, 1,985 validation instances, and
1,840 test instances. Each sample consists of an abstract paired with corresponding gold-standard
highlights.</p>
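        <p>For illustration, a single MixSub-SciHigh sample can be thought of as an abstract–highlights pair; the field names below are hypothetical placeholders, while the split sizes are those reported above.</p>

```python
# Illustrative shape of one dataset sample (field names are hypothetical;
# the actual column names in the released files may differ).
sample = {
    "abstract": "Full abstract text of the paper ...",
    "highlights": "Gold-standard highlights written for the paper ...",
}

# Split sizes as reported for MixSub-SciHigh.
splits = {"train": 10_000, "validation": 1_985, "test": 1_840}
total = sum(splits.values())  # 13,825 instances overall
```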
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed Models</title>
        <p>We developed three transformer-based text generation models, organized under two teams:
Text_highlights_gen and NIT_PATNA_2025. The NIT_PATNA_2025 team built two models, T5-small
and LongT5, while the Text_highlights_gen team developed Pegasus and Pegasus with NER.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pre-processing</title>
        <p>Before training the models, we applied several preprocessing steps, including tokenization, punctuation
removal, and the identification and storage of named entities.</p>
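        <p>The preprocessing steps above can be sketched as follows. This is a minimal illustration: the entity list is assumed to come from a separate NER tagger, which this sketch does not implement.</p>

```python
import string

def preprocess(text, entities=()):
    """Minimal preprocessing sketch: record named entities found in the text,
    strip punctuation, and tokenize on whitespace. The `entities` argument is
    a hypothetical, pre-extracted list from an external NER tagger."""
    # Identification and storage of named entities present in the abstract.
    found = [e for e in entities if e.lower() in text.lower()]
    # Punctuation removal.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization (lower-cased).
    tokens = cleaned.lower().split()
    return tokens, found
```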
        <p>• T5-small [12]: A compact variant of the Text-to-Text Transfer Transformer (T5), where all
NLP tasks are reframed as text-to-text problems. Despite its smaller size compared to larger
T5 versions, it is efficient and suitable for resource-constrained environments. For highlight
generation, the task is formulated as mapping a document to its highlights.
• LongT5 [13]: An extension of T5 for handling longer input sequences. It introduces efficient
attention mechanisms such as local–global attention to reduce the quadratic cost of standard
transformers, enabling it to process full-length scientific articles rather than just abstracts. This
makes it particularly suitable for highlight generation when important information is spread
across lengthy papers.
• PEGASUS [14], which is shown in Figure 1, is pre-trained with a gap-sentence generation
(GSG) objective optimized for abstractive summarization. In GSG pre-training, entire sentences
are masked and the model learns to reconstruct them from the remaining context. This objective
closely mimics summarization, enabling PEGASUS to generate coherent, compressed highlights.
During fine-tuning, the encoder processes the scientific abstract and the decoder generates
highlights autoregressively, with beam search improving output quality. This alignment between
pre-training and summarization tasks makes PEGASUS especially effective for highlight
generation.</p>
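        <p>The text-to-text framing used by T5-small can be sketched as simple input formatting; the exact prefix string below is an assumption of this sketch, since T5 checkpoints conventionally prepend a short task prefix such as "summarize:".</p>

```python
def build_t5_input(abstract: str, prefix: str = "summarize:") -> str:
    """Frame highlight generation as a text-to-text problem by prepending a
    task prefix to the abstract (the prefix string is an assumption)."""
    return f"{prefix} {abstract.strip()}"
```

        <p>The model is then fine-tuned to map this prefixed input to the gold highlights.</p>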
        <p>Additionally, we developed an entity-aware PEGASUS variant, where named entity information
from abstracts was incorporated into the input representation to assess its effect on summarization.</p>
        <p>¹https://sites.google.com/jadavpuruniversity.in/scihigh2025/home</p>
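        <p>A minimal sketch of the entity-aware input representation, following the idea from prior work of treating named entities as single tokens: multi-word entities are joined so the tokenizer keeps them intact. The underscore-joining scheme is an assumption of this sketch, not the exact method used.</p>

```python
def mark_entities(abstract, entities):
    """Join multi-word named entities with underscores so that each entity
    behaves as a single token in the model input (the joining scheme is an
    assumption; `entities` is a hypothetical pre-extracted entity list)."""
    for ent in entities:
        abstract = abstract.replace(ent, ent.replace(" ", "_"))
    return abstract
```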
        <p>Figure 1 illustrates the gap-sentence generation objective: in the PEGASUS encoder–decoder, entire
sentences of the input are replaced by mask tokens, and the decoder is trained to generate the masked
sentences as the target text.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Hyperparameters</title>
          <p>These models were fine-tuned using the Hugging Face Transformers framework. Training was
performed with the following hyperparameters:
• Learning rate: 2e-5 with the Adam optimizer, selected to ensure stable convergence during training.
• Batch size: 2.
• Epochs: 10.
• Beam search: decoding is performed with a beam width of 4 to improve sequence generation quality.
• Maximum input length: inputs are truncated to a maximum length of 512 tokens.
• Maximum output length: outputs are limited to 100 tokens.
• Weight decay: 0.01.</p>
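          <p>As a sketch, the hyperparameters above can be collected into a single configuration. The field names mirror Hugging Face's Seq2SeqTrainingArguments and are an assumption about the exact training script, not a reproduction of it.</p>

```python
# Hedged sketch: hyperparameter configuration for fine-tuning (field names
# follow Hugging Face Seq2SeqTrainingArguments; the exact script is assumed).
train_config = {
    "learning_rate": 2e-5,             # Adam optimizer, stable convergence
    "per_device_train_batch_size": 2,  # batch size
    "num_train_epochs": 10,            # training epochs
    "weight_decay": 0.01,              # regularization strength
    "generation_num_beams": 4,         # beam search width at decoding time
    "generation_max_length": 100,      # output highlights capped at 100 tokens
}
MAX_INPUT_TOKENS = 512                 # inputs truncated before encoding
```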
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Metrics</title>
      <p>We evaluated the performance of our models using the ROUGE-N, ROUGE-L [15], BERTScore [16], and
METEOR [17] metrics. These metrics measure the similarity between the generated text and the
reference text.</p>
      <p>ROUGE-N: It measures the overlap of n-grams between a system-generated text and a reference text.
Here, we took the ROUGE-1 and ROUGE-2 metrics to evaluate our proposed models. It is defined as:</p>
      <p>Recall = (number of overlapping n-grams) / (total number of n-grams in the reference text)</p>
      <p>Precision = (number of overlapping n-grams) / (total number of n-grams in the generated text)</p>
      <p>For ROUGE-1, n = 1 (unigrams). For ROUGE-2, n = 2 (bigrams).</p>
      <p>ROUGE-L: It is based on the Longest Common Subsequence (LCS) between the system-generated
text S and a reference text R. Unlike ROUGE-N, it does not require consecutive matches, but it preserves
sentence-level word order. It is defined using recall, precision, and F-score:</p>
      <p>Recall = LCS(S, R) / |R|</p>
      <p>Precision = LCS(S, R) / |S|</p>
      <p>F = ((1 + β^2) · Recall · Precision) / (Recall + β^2 · Precision)</p>
      <p>where LCS(S, R) is the length of the longest common subsequence between system-generated text
S and reference text R, and |S|, |R| are their lengths.</p>
      <p>METEOR: It aligns candidate and reference texts using exact matches, stemming, synonyms, and
paraphrases. It combines precision and recall into a harmonic mean and applies a fragmentation penalty
to account for word order. By incorporating linguistic variations and semantic similarity, METEOR is
better correlated with human judgment than simple overlap-based metrics.</p>
      <p>F_mean = (10 · P · R) / (R + 9 · P)</p>
      <p>Penalty = 0.5 · (#chunks / #matches)^3</p>
      <p>METEOR = (1 − Penalty) · F_mean
where P is Precision, R is Recall, “matches” is the number of unigram matches (exact, stem, synonym),
and “chunks” is the number of contiguous matched word sequences.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Table 1 reports the evaluation results on the MixSub-SciHigh validation dataset using F1-scores for
ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore. Among all models, PEGASUS achieved the best
performance, obtaining the highest scores across every metric (ROUGE-1: 35.72, ROUGE-2: 15.66,
ROUGE-L: 25.45, BERTScore: 89.41). Table 2 reports the evaluation results on the MixSub-SciHigh
test dataset using F1-scores for ROUGE-1, ROUGE-2, ROUGE-L, and METEOR. Among all models,
PEGASUS again achieved the best performance, obtaining the highest scores across every metric (ROUGE-1:
32.72, ROUGE-2: 11.66, ROUGE-L: 23.45, METEOR: 28.41). Compared to T5-small, PEGASUS provided
consistent improvements of around 1–1.2 points on ROUGE-1 and ROUGE-L and a modest gain on
METEOR.</p>
      <p>Interestingly, while incorporating NER into PEGASUS was expected to boost performance, the results
instead showed a significant decline, with ROUGE-2 dropping from 11.66 to 6.89 and METEOR from
28.41 to 19.46. This suggests that the NER-based preprocessing may have introduced noise or disrupted
contextual coherence in summaries for this dataset.</p>
      <p>LongT5, despite being designed to handle longer contexts, performed poorly in this task, with
ROUGE-2 as low as 2.08 and METEOR at only 12.78, highlighting its limitations when applied to the relatively
short scientific abstracts in the MixSub-SciHigh dataset.</p>
      <p>Overall, PEGASUS emerged as the most effective model for scientific highlight generation in this
setting, demonstrating that its pre-training objectives are well-suited for abstractive summarization of
domain-specific text.</p>
      <p>Table 3 presents the ranks of all participating teams based on ROUGE-L scores in the SciHigh
shared task. We entered two teams: (1) Text_highlights_gen and (2) NIT_PATNA_2025. The team
Text_highlights_gen achieved the top performance with a ROUGE-L score of 23.45, followed closely by
AiNauts (23.24) and SVNIT_CSE (23.02). The differences among the top three submissions are relatively
small (less than 0.5 points), indicating strong competition at the upper end of the leaderboard.</p>
      <p>Our second team, NIT_PATNA_2025, secured the 6th position with a ROUGE-L score of 22.42, placing
us within the top half of all participating teams. The results demonstrate that our approach performs
competitively, outperforming several strong baselines such as MUCS (22.08) and JU_CSE_PR_KS (22.06),
while remaining close to higher-ranked systems like The NLP Explorers (22.94).</p>
      <p>Overall, the leaderboard highlights that while the leading systems achieve similar ROUGE-L scores,
even minor improvements can significantly influence rank positions. Our system’s placement within
the top tier validates the effectiveness of our summarization strategy.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The evaluation results in Table 2 highlight several important insights into the performance of different
transformer-based summarization models on the MixSub-SciHigh test dataset. Among all the models,
PEGASUS demonstrated the strongest performance, achieving the highest scores across all evaluation
metrics. Its superior ROUGE-1 (32.72), ROUGE-2 (11.66), ROUGE-L (23.45), and METEOR (28.41) indicate
that the pre-training objectives of PEGASUS, which focus on gap-sentence generation, are well aligned
with the requirements of abstractive scientific summarization.</p>
      <p>While T5-small performed slightly below PEGASUS, its results were still competitive, confirming its
capability as a baseline model for scientific text summarization. The relative gap between T5-small and
PEGASUS (e.g., +1.05 on ROUGE-1, +0.61 on ROUGE-2) suggests that PEGASUS has an advantage in
capturing context and generating more coherent summaries, albeit with only modest improvements.</p>
      <p>Surprisingly, the PEGASUS + NER variant underperformed significantly compared to its vanilla
counterpart. Despite the expectation that entity-focused preprocessing would enhance content selection
and factual consistency, the results show a sharp decline across all metrics (e.g., a 4.77-point drop in
ROUGE-1 and nearly 9 points in METEOR). This suggests that the integration of NER-based features
may have disrupted the natural flow of contextual information, leading to less fluent and incomplete
summaries. It also highlights that naive incorporation of linguistic features does not always guarantee
improvements and requires careful integration strategies.</p>
      <p>The performance of LongT5 was the weakest among all models. Its notably low scores (ROUGE-2:
2.08, METEOR: 12.78) suggest that the model struggled with the relatively short and structured nature
of the MixSub-SciHigh dataset. Although LongT5 is designed to handle long input contexts, this
characteristic may not provide an advantage when processing shorter scientific abstracts, where concise
content selection is more critical than handling extended contexts.</p>
      <p>Overall, these findings emphasize that model pre-training objectives and dataset characteristics
strongly influence summarization performance. PEGASUS emerges as the most effective choice for
highlight generation in this domain, while approaches relying on NER or long-context handling require
more tailored adaptation to achieve competitive results.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future work</title>
      <p>In this paper, we investigated the task of automatic highlight generation for scientific abstracts using
transformer-based models. We introduced a cleaned and enriched version of the MixSub-SciHigh dataset,
incorporating preprocessing steps such as tokenization, stopword removal, punctuation removal, and
named entity recognition. On this dataset, we trained multiple transformer models, including T5-small,
LongT5, PEGASUS, and an entity-aware variant of PEGASUS. Our experiments demonstrate that
PEGASUS achieved the best overall performance across ROUGE and METEOR metrics, making it the
most suitable model for scientific highlight generation.</p>
      <p>In the future, research will focus on incorporating advanced entity-aware highlights generation
techniques that integrate knowledge graphs, ontology alignment, and domain-specific entity linking to
enhance factual accuracy and semantic coherence. Controlled text generation with constraints on
readability, factual grounding, and redundancy reduction will be further investigated to ensure high-quality
highlights generation. Moreover, integrating reinforcement learning with fact-consistency objectives
and human-in-the-loop feedback can help optimize highlights for both precision and interpretability.
Expanding the dataset across multiple disciplines, including low-resource scientific domains, will
support better generalization of long-sequence models. Finally, hybrid approaches that combine symbolic
reasoning with large language models may open new directions for producing reliable, explainable, and
domain-adaptive scientific highlights.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The first author would like to acknowledge the Ministry of Education (MOE), Government of India, for
financial support during the research work through the Rajiv Gandhi fellowship Ph.D. scheme (UGC)
for computer science &amp; engineering.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>This paper includes no content generated by artificial intelligence tools beyond language editing
and formatting assistance. All intellectual contributions, including the conception, analysis, and
interpretation of results, were made by the authors.</p>
      <p>[10] T. Rehman, D. K. Sanyal, S. Chattopadhyay, P. K. Bhowmick, P. P. Das, Automatic generation
of research highlights from scientific abstracts, in: C. Zhang, P. Mayr, W. Lu, Y. Zhang (Eds.),
Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from
Scientific Documents (EEKE 2021) co-located with JCDL 2021, Virtual Event, September 30th,
2021, volume 3004 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 69–70. URL:
https://ceur-ws.org/Vol-3004/paper10.pdf.</p>
      <p>[11] T. Rehman, S. Chattopadhyay, D. K. Sanyal, Abstractive summarization of scientific documents:
Models and evaluation techniques, in: D. Ganguly, S. Majumdar, B. Mitra, P. Gupta,
S. Gangopadhyay, P. Majumder (Eds.), Proceedings of the 15th Annual Meeting of the Forum for
Information Retrieval Evaluation, FIRE 2023, Panjim, India, December 15-18, 2023, ACM, 2023,
pp. 121–124. URL: https://doi.org/10.1145/3632754.3632771. doi:10.1145/3632754.3632771.</p>
      <p>[12] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020)
140:1–140:67. URL: https://jmlr.org/papers/v21/20-074.html.</p>
      <p>[13] M. Guo, J. Ainslie, D. C. Uthus, S. Ontañón, J. Ni, Y. Sung, Y. Yang, LongT5: Efficient text-to-text
transformer for long sequences, CoRR abs/2112.07916 (2021). URL: https://arxiv.org/abs/2112.07916.
arXiv:2112.07916.</p>
      <p>[14] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: pre-training with extracted gap-sentences for
abstractive summarization, in: Proceedings of the 37th International Conference on Machine
Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning
Research, PMLR, 2020, pp. 11328–11339. URL: http://proceedings.mlr.press/v119/zhang20ae.html.</p>
      <p>[15] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, 2004, pp. 74–81.</p>
      <p>[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=
SkeHuCVFDr.</p>
      <p>[17] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation
with human judgments, in: J. Goldstein, A. Lavie, C. Lin, C. R. Voss (Eds.), Proceedings of the
Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or
Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, Association for Computational
Linguistics, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name><given-names>S.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Foroosh</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Liu</surname></string-name>, <article-title>Better highlighting: Creating sub-sentence summary highlights</article-title>, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020</source>, Online, November 16-20, <year>2020</year>, Association for Computational Linguistics, 2020, pp. <fpage>6282</fpage>-<lpage>6300</lpage>. URL: https://doi.org/10.18653/v1/2020.emnlp-main.509. doi:10.18653/V1/2020.EMNLP-MAIN.509.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>K.</given-names> <surname>Woodsend</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lapata</surname></string-name>, <article-title>Automatic generation of story highlights</article-title>, in: J. Hajic, S. Carberry, S. Clark (Eds.), <source>ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>, July 11-16, <year>2010</year>, Uppsala, Sweden, The Association for Computer Linguistics, 2010, pp. <fpage>565</fpage>-<lpage>574</lpage>. URL: https://aclanthology.org/P10-1058/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>S.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>S. K.</given-names> <surname>Gupta</surname></string-name>, <article-title>Abstractive summarization: An overview of the state of the art</article-title>, <source>Expert Systems with Applications</source> <volume>121</volume> (<year>2019</year>) <fpage>49</fpage>-<lpage>65</lpage>. URL: https://doi.org/10.1016/j.eswa.2018.12.011. doi:10.1016/j.eswa.2018.12.011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>T.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>D. K.</given-names> <surname>Sanyal</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chattopadhyay</surname></string-name>, <string-name><given-names>P. K.</given-names> <surname>Bhowmick</surname></string-name>, <string-name><given-names>P. P.</given-names> <surname>Das</surname></string-name>, <article-title>Automatic generation of research highlights from scientific articles</article-title>, in: <source>Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021), co-located with JCDL 2021</source>, CEUR Workshop Proceedings, <year>2021</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>. URL: http://ceur-ws.org/Vol-2936/paper3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><given-names>T.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>D. K.</given-names> <surname>Sanyal</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Majumder</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chattopadhyay</surname></string-name>, <article-title>Named entity recognition based automatic generation of research highlights</article-title>, <source>CoRR</source> abs/2303.12795 (<year>2023</year>). URL: https://doi.org/10.48550/arXiv.2303.12795. doi:10.48550/ARXIV.2303.12795. arXiv:2303.12795.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><given-names>T.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>D. K.</given-names> <surname>Sanyal</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chattopadhyay</surname></string-name>, <article-title>Research highlight generation with ELMo contextual embeddings</article-title>, <source>Scalable Comput. Pract. Exp.</source> <volume>24</volume> (<year>2023</year>) <fpage>181</fpage>-<lpage>190</lpage>. URL: https://doi.org/10.12694/scpe.v24i2.2238. doi:10.12694/SCPE.V24I2.2238.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><given-names>T.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>D. K.</given-names> <surname>Sanyal</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chattopadhyay</surname></string-name>, <string-name><given-names>P. K.</given-names> <surname>Bhowmick</surname></string-name>, <string-name><given-names>P. P.</given-names> <surname>Das</surname></string-name>, <article-title>Generation of highlights from research papers using pointer-generator networks and SciBERT embeddings</article-title>, <source>IEEE Access</source> <volume>11</volume> (<year>2023</year>) <fpage>91358</fpage>-<lpage>91374</lpage>. URL: https://doi.org/10.1109/ACCESS.2023.3292300. doi:10.1109/ACCESS.2023.3292300.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name><given-names>X.</given-names> <surname>Yi</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xinyi</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>, <article-title>Enhancing keyword extraction from academic articles using highlights</article-title>, <source>Proceedings of the Association for Information Science and Technology</source> <volume>61</volume> (<year>2024</year>) <fpage>1147</fpage>-<lpage>1149</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name><given-names>N.</given-names> <surname>Bölücü</surname></string-name>, <string-name><given-names>Y. C.</given-names> <surname>Bilge</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Çetintas</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Yücel</surname></string-name>, <article-title>Modest: A dataset for multi domain scientific title generation</article-title>, <source>Knowl. Based Syst.</source> <volume>321</volume> (<year>2025</year>) 113557. URL: https://doi.org/10.1016/j.knosys.2025.113557. doi:10.1016/J.KNOSYS.2025.113557.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>