<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>First International Workshop on Scholarly Information Access (SCOLIA), April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Achkar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Gollub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauhaus-Universität Weimar</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kassel University</institution>
          ,
          <addr-line>hessian.AI, ScaDS.AI</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Fraunhofer ISI Leipzig</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval, and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at https://github.com/webisde/scolia25-xsum/tree/main</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-document Summarization (MDS)</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
        <kwd>Scientific Literature Summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The rapid growth of scientific literature has made it increasingly difficult for researchers to stay
up-to-date with the latest developments. The number of papers published each month has increased
exponentially since 1994, with fields such as artificial intelligence (AI) doubling their research output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
While this growth reflects the progress of research communities, it also presents a serious challenge: how
can researchers stay informed and extract key insights from this volume of information? This overload
of information makes it difficult to manually read, understand, and summarize the growing body of
literature. This challenge becomes paramount in rapidly evolving fields such as AI, where researchers
often need to synthesize knowledge from multiple sources in order to make progress. Summarizing
research is not simply reading through papers but also identifying the most important information,
connecting ideas from different sources, and presenting them in a clear and concise way. Automated
summarization solutions are essential to help researchers save time and focus on the core information.
One promising approach to this challenge is Multi-Document Summarization (MDS), which combines
information from multiple sources into clear and concise summaries. The concept itself is not new; for
example, early work from 1999 proposed to use reference relationships between scientific papers to
generate survey-style summaries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The approach identifies key fragments of cited papers, analyzes
similarities and differences between them, and classifies citation contexts to support summarization.
Over time, summarization methods have evolved from static approaches to deep learning models and
later to pre-trained language models (PLMs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Currently, the field is dominated by Large Language
Models (LLMs), which are pre-trained on massive datasets and capable of generating high-quality text.
Retrieval-Augmented Generation (RAG) builds on these advances by combining retrieval techniques
with LLMs, enabling systems to find relevant information and synthesize it into accurate and coherent
answers. A typical RAG pipeline processes a set of documents D = {d_1, d_2, . . . , d_n} by dividing them
into smaller chunks, encoding them into dense vector embeddings with a pre-trained model, and storing
them in a vector database for later retrieval. In the context of MDS, a search query to the vector database
acts as a summarization guideline that can either be provided by the user or the MDS system. When a
query is provided, the top-k most relevant chunks are retrieved based on similarity metrics and passed
to an LLM, which generates responses grounded in the retrieved content.
      </p>
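      <p>The generic RAG loop described above can be sketched in a few lines. The snippet below is a toy illustration, not the implementation presented later in this paper: a bag-of-words count vector stands in for a pre-trained embedding model, and a plain Python list stands in for the vector database; all names are assumptions made for illustration.</p>

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc, size=8, overlap=2):
    """Split a document into overlapping word windows."""
    words = doc.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve(query, docs, k=2):
    """Index all chunks, then return the top-k most similar to the query."""
    index = [(c, embed(c)) for d in docs for c in chunk(d)]  # the "vector DB"
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]
```

      <p>A real pipeline would swap the toy encoder for a dense embedding model and the list for an index such as FAISS; the control flow, however, is the same.</p>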
      <p>
        Despite the recent advances, summarizing scientific literature remains an open research problem,
requiring not only linguistic fluency and coherence, but also robust relevance and adherence to academic
standards for citing literature correctly. To address these challenges, we present XSum, a RAG pipeline
designed for MDS in the scientific domain. XSum builds upon the typical RAG pipeline and introduces
two new innovative components: a question-generation module and an editor module. The
question-generation module formulates questions on the basis of the input papers to be summarized, which are
then passed to the RAG component. The editor module synthesizes the set of answers retrieved from
the RAG component into coherent summaries, ensuring that the resulting output is comprehensive,
reliable, and well-structured. The proposed pipeline is evaluated on the SurveySum [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] dataset, which
is designed to test MDS methods in the scientific domain. The results show that XSum outperforms
existing methods on metrics such as CheckEval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], G-Eval [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Ref-F1, demonstrating its ability to
produce high-quality summaries. We consider the quality of a generated summary to be defined by its
ability to comprehensively cover the essential content of the source documents, to maintain a coherent
and fluent narrative, and to accurately reflect the original citations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of summarizing multiple scientific documents has evolved considerably over time. Early
approaches, such as SciSumm, introduced query-driven summarization by clustering relevant text
segments from co-cited papers to generate contextualized summaries [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These methods leveraged
citation relationships but struggled with complex content relationships across documents.
      </p>
      <p>
        Later developments introduced neural network-based architectures for MDS. For instance, HiMAP
and HierSumm utilized hierarchical models and passage ranking techniques to enhance content selection
and fusion, resulting in more coherent and contextually relevant summaries [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. These methods
marked a shift from purely extractive approaches to more integrative models capable of generating
fluent summaries.
      </p>
      <p>
        The integration of extraction and abstraction further refined summarization methods. Shinde et al.
proposed a hybrid pipeline that combines BERT-based extractive models with BigBird-PEGASUS for
abstractive summarization, achieving robust performance in the biomedical domain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Similarly,
KGSum introduced knowledge graph-based encoding to model document content and relationships,
employing a two-stage decoding strategy to produce focused and cohesive summaries [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The field has taken a major step forward with the emergence of retrieval augmented generation
(RAG) pipelines. OpenScholar, for example, demonstrated a novel approach by integrating a specialized
datastore of 45 million papers with iterative retrieval and feedback loops, enabling precise,
citation-backed responses, highlighting growing interest in retrieval augmented systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Another approach
to MDS using retrieval is proposed through the SurveySum framework, which introduces two pipelines,
Pipeline 1 and Pipeline 2, both integrating retrieval-based selection with LLM-based summarization
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These pipelines are evaluated on the SurveySum dataset, a benchmark specifically designed for
MDS in scientific literature, which consists of survey sections paired with their cited papers. This is the
same dataset used in this work, and a more detailed discussion of its structure will be provided in the
Evaluation section.
      </p>
      <p>Pipeline 1 uses a neural ranking approach where full-text papers are segmented into overlapping
chunks during pre-processing. These chunks are ranked by monoT5-3B, which assigns relevance scores
based on the title of the target survey section. The highest-ranked chunks are then passed to an LLM,
such as GPT-4, to generate the final summary. Pipeline 2, on the other hand, relies on embedding-based
retrieval, where text chunks are represented as dense embeddings using SPECTER2 and stored in a
FAISS vector database. The section title (e.g. Data Generation via PLM:Explaining Models’ Decisions) is
used as a query to retrieve relevant chunks at inference time. Unlike Pipeline 1, which directly selects
the top-ranked chunks for summarization, Pipeline 2 includes a re-ranking step where an LLM evaluates
and ranks the retrieved content before summarization. The resulting chunks are then summarized into
a cohesive section. Figures 1 and 2 illustrate the structures of these pipelines.</p>
      <p>While these pipelines achieve acceptable performance, they rely on static retrieval using section titles
as queries, which can limit adaptability to different summarization contexts. Among them, Pipeline 2
is more comparable to our approach XSum, as it utilizes embedding-based retrieval rather than direct
ranking. However, XSum addresses key limitations by introducing a question generation module that
dynamically formulates structured questions based on the title and abstract of the input papers, which
serve as queries during retrieval, thereby improving retrieval relevance. Additionally, it features an
editor module that synthesizes retrieved content into a coherent, citation-rich summary, ensuring
better fluency, accuracy, and adherence to academic writing standards. The complete pipeline and the
functionality of these components will be explained in detail in the Methodology section.</p>
      <p>[Figures 1 and 2 (pipeline diagrams): Full Text → Document Pre-Processing → Chunks → Ranking → Relevant Chunks → Final Summary]</p>
      <p>
        Beyond SurveySum, several other datasets have been developed for MDS, particularly in the biomedical
domain. Datasets such as Cochrane-auto and MS2 focus on summarizing clinical trials and systematic
reviews, providing benchmarks for evaluating summarization methods in evidence-based medicine
[
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. Another relevant dataset is Multi-XScience, which was initially considered for evaluating
the proposed approach, as it focuses on synthesizing related work sections from abstracts and cited
references [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, a preliminary analysis revealed missing values in the reference papers used to
generate the related work sections, raising concerns about its completeness for reliable benchmarking.
Furthermore, while related work sections can be considered multi-document summaries, they are often
shaped by the comparative and argumentative nature of the paper’s contributions rather than being
purely extractive or abstractive. Given these considerations, the SurveySum dataset appeared to be a
more appropriate choice for evaluating our approach, as it explicitly focuses on summarizing multiple
scientific papers into structured survey sections. To our knowledge, no other experimental work has
been conducted on SurveySum beyond the evaluations presented by its authors so far.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section introduces the XSum pipeline, a modular approach to summarizing scientific literature into
coherent and traceable outputs. The initial idea for building this pipeline was inspired by the interview
paradigm, where an interviewer interacts with a domain expert. In this analogy, the interviewer
prepares a structured set of questions based on the expert’s domain knowledge, conducts the interview
in which the expert answers these questions, and finally an editor compiles the conversation into a
well-structured summary. This concept motivated the design of XSum and led to the introduction of
two key modules: a question generation module, which formulates structured questions to guide the
retrieval process, and an editor module, which synthesizes the retrieved answers into a coherent and
citation-rich summary. Each module plays an important role in ensuring that the summaries generated
are both relevant and well-structured, as described in the following subsections.</p>
      <sec id="sec-3-1">
        <title>3.1. Overview of the Pipeline</title>
        <p>The proposed pipeline for MDS, XSum, illustrated in Figure 3, transforms input reference papers into a
coherent summary through a sequence of modular steps. It begins by using the titles and abstracts of
the reference papers to generate broad and general questions using an LLM. These questions, designed
to reflect the main themes and contributions of the papers, are stored for later use. The full texts of the
reference papers are then processed by dividing them into manageable chunks, which are embedded
in dense vector representations and stored in a vector database. This pre-processing ensures efficient
retrieval of relevant content in subsequent stages. The stored questions are then used to query the
database and retrieve the most relevant chunks. The retrieval process involves an initial similarity-based
ranking of the chunks, followed by a re-ranking step to refine their relevance. The final set of retrieved
chunks is paired with the corresponding questions. These question-chunk pairs are then passed to an
LLM, which generates concise answers based on the retrieved content. If the context is insufficient,
the LLM will refrain from generating an answer, ensuring accuracy and credibility. Finally, the set of
question-answer pairs is passed to the editor module, which synthesizes them into a comprehensive
and well-structured summary. The editor ensures coherence, logical flow, and adherence to academic
standards while incorporating citations to maintain traceability.</p>
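        <p>The walkthrough above can be condensed into a single orchestration function with each stage as a pluggable callable. Every name below is a hypothetical placeholder for the corresponding XSum module, not the authors' code:</p>

```python
# High-level flow of the pipeline described above; every callable is a
# hypothetical placeholder for the corresponding XSum module.
def xsum(papers, generate_questions, preprocess, answer, edit):
    # 1. Broad questions from each paper's title and abstract.
    questions = [q for p in papers
                 for q in generate_questions(p["title"], p["abstract"])]
    # 2. Chunk, embed, and index the full texts for retrieval.
    index = preprocess([p["full_text"] for p in papers])
    # 3. Answer each question from retrieved context; if the context is
    #    insufficient the answerer returns None and the question is skipped.
    qa_pairs = [(q, answer(q, index)) for q in questions]
    qa_pairs = [(q, a) for q, a in qa_pairs if a is not None]
    # 4. The editor synthesizes the Q&A pairs into the final summary.
    return edit(qa_pairs)
```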
        <p>Full Text</p>
        <p>Chunks
Document Pre-Processing</p>
        <p>FAISS</p>
        <p>Relevant Chunks</p>
        <p>Reranking</p>
        <p>Question
Title + Abstract</p>
        <p>Questions</p>
        <p>Q&amp;A Pairs</p>
        <p>Final Summary
Reference</p>
        <p>Papers</p>
        <p>LLM
Question Generation</p>
        <p>RAG
Question Answering</p>
        <p>LLM
Final Summary Generation</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Question Generation Module</title>
        <p>
          This module is essential for aligning the retrieval and summarization stages with the specific content of
the reference documents. By leveraging the generative capabilities of LLMs, it ensures that the pipeline
is driven by structured, contextually relevant queries. The approach draws on insights from methods
such as HyDe [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], HyQE [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and reverse HyDe [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], all of which use generative techniques to improve
retrieval relevance. HyDE (Hypothetical Document Embeddings) involves generating hypothetical
content based on a query, encoding this content into embeddings, and then using these embeddings to
improve retrieval accuracy. Both HyQE (Hypothetical Query Embeddings) and reverse HyDe follow a
similar strategy, but focus on generating hypothetical questions or queries that match the content of
a document. These hypothetical questions bridge the semantic gap between queries and retrievable
content, improving the ranking of relevant results [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>In our pipeline, the title and abstract of each reference paper serve as input to a pre-trained LLM,
which generates five broad and semantically rich questions encapsulating the core themes and
contributions of the paper. These questions are stored as structured queries for subsequent stages. The
generated questions serve two primary functions: first, they refine the retrieval process by ensuring that
only the most contextually relevant content is retrieved; second, they provide a structured framework to
guide the subsequent synthesis and summarization phases. For illustration, examples of such generated
questions can be found in Appendix A.1.</p>
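        <p>A minimal sketch of this module, assuming a generic chat-completion client: build_question_prompt and call_llm are hypothetical names, and the prompt wording is illustrative rather than the exact prompt used by XSum.</p>

```python
# Hypothetical sketch of the question-generation step; `call_llm` stands in
# for any chat-completion client, and the prompt wording is illustrative.
def build_question_prompt(title, abstract, n_questions=5):
    return (
        f"Formulate {n_questions} broad, semantically rich questions that "
        "capture the core themes and contributions of the paper below. "
        "Return one question per line.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )

def generate_questions(title, abstract, call_llm, n_questions=5):
    raw = call_llm(build_question_prompt(title, abstract, n_questions))
    # Keep non-empty lines and cap at the requested number of questions.
    questions = [line.strip() for line in raw.splitlines() if line.strip()]
    return questions[:n_questions]
```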
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Document Pre-processing</title>
        <p>The pre-processing phase ensures that the reference papers are prepared for efficient retrieval and
summarization by arranging them in a format suitable for downstream tasks. This phase consists of
three main steps:
• Chunking Documents: The full texts of the reference papers are divided into interconnected
chunks of 150 tokens each, with an overlap of 20 tokens. This overlap preserves contextual
continuity between successive chunks, while respecting sentence boundaries ensures that the
division does not disrupt the semantic flow of the text. We determined this configuration by
experimentation, after trying different setups, finding that it provided the best balance between
contextual preservation and computational efficiency.
• Embedding Generation: Each chunk is encoded into dense vector representations using the
SPECTER2 model, which is specifically designed to capture the semantic relationships and
contextual meanings in academic texts.
• Vector Database Indexing: Chunks are indexed in FAISS, a high-speed similarity search database,
for efficient retrieval.</p>
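        <p>The chunking step above can be sketched as follows, assuming the sentences have already been split (e.g. with nltk) and using whitespace tokens in place of a real tokenizer; the function name and exact packing logic are illustrative:</p>

```python
# Illustrative sketch of sentence-aware chunking: pack whole sentences into
# chunks of roughly `max_tokens` tokens, carrying a trailing `overlap` of
# tokens into the next chunk. Whitespace tokens stand in for a real tokenizer.
def sentence_chunks(sentences, max_tokens=150, overlap=20):
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_tokens:
            chunks.append(" ".join(current))
            # Carry the last `overlap` tokens into the next chunk to preserve
            # contextual continuity across the boundary.
            tail = " ".join(current).split()[-overlap:]
            current = [" ".join(tail)]
    if current:
        chunks.append(" ".join(current))
    return chunks
```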
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Question Answering Module</title>
        <p>In this module, the focus is on integrating retrieval and synthesis to generate concise, contextually
relevant answers to the questions formulated in the previous stage. By combining robust retrieval
techniques with an LLM in a RAG framework, this module ensures that the pipeline produces
high-quality output that is grounded in the source material.</p>
        <p>
          Questions are embedded into dense vector representations using the same SPECTER2 model used in
the document pre-processing phase. The retrieval process proceeds in two stages:
1. Initial Retrieval: Using cosine similarity, the top 100 chunks most relevant to each question are
retrieved from the FAISS vector database, serving as an initial filtering step.
2. Reranking: The retrieved chunks are re-ranked using the ColBERT2 model [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which evaluates
token-level interactions between the question and the chunks. This refinement step ensures that
the 20 most relevant chunks are selected.
        </p>
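        <p>The two-stage retrieval can be sketched with the ranking logic factored out; here embed, cosine, and rerank_score are placeholders for SPECTER2, the FAISS similarity search, and ColBERT scoring, so this is a structural sketch rather than the exact implementation:</p>

```python
# Structural sketch of the two-stage retrieval: a coarse similarity search
# keeps the top `k_initial` chunks, then a finer scorer re-ranks them and
# only the top `k_final` survive. `embed`, `cosine`, and `rerank_score` are
# placeholders for SPECTER2, FAISS search, and ColBERT scoring.
def two_stage_retrieve(question, chunks, embed, cosine, rerank_score,
                       k_initial=100, k_final=20):
    q_vec = embed(question)
    # Stage 1: initial similarity-based filtering.
    coarse = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    candidates = coarse[:k_initial]
    # Stage 2: finer-grained re-ranking of the surviving candidates.
    fine = sorted(candidates, key=lambda c: rerank_score(question, c),
                  reverse=True)
    return fine[:k_final]
```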
        <p>The final set of 20 chunks is presented to a pre-trained LLM along with the corresponding question.
The LLM synthesizes a coherent and accurate response based solely on the retrieved context. If the
retrieved chunks do not provide sufficient information, the LLM is instructed not to generate an answer,
minimizing unsupported or speculative output.</p>
        <p>To ensure credibility and traceability, the LLM includes valid citations from the retrieved chunks
in its responses. By grounding the answers in the source material, this module adheres to academic
standards and facilitates the verification of the generated content.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Final Summary Generation (Editor Module)</title>
        <p>The Editor Module synthesizes the answers generated in the previous step into a cohesive and
comprehensive summary, aggregating all question-answer pairs into a unified narrative that reflects the
overarching themes and contributions of the papers. A pre-trained LLM is used as the editor to generate
the final summary. The model is prompted to write an extensive, coherent summary that seamlessly
integrates the individual answers while maintaining a logical flow. The Editor LLM ensures the
summary adheres to academic standards. It incorporates citations into the final summary, ensuring that all
statements are properly grounded in the retrieved source material. The prompt used in this module is
as follows:</p>
        <sec id="sec-3-5-1">
          <title>Editor Module Prompt</title>
          <p>### CONTEXT ###
You are writing the final script of an interview with an expert on the topic ’{topic}’.
The final script should summarize the key insights and findings from the questions and
answers provided.</p>
          <p>Keep the target audience in mind, which includes researchers, students, and professionals in
the field.
### QUESTIONS AND ANSWERS ###
{questions_and_answers}
### INSTRUCTIONS ###
Include the most relevant and important points discussed.</p>
          <p>Be aware of plagiarism, i.e., you should not copy the text, but use them as inspiration.
Avoid using markdown formatting in the text.</p>
          <p>Avoid splitting into subsections, or creating an introduction and conclusion for it.
Avoid introducing new information and focus on summarizing the existing content.
Always include the citations (e.g., [BIBREF14], [BIBREF16]) mentioned in the answers in the
final section.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The proposed pipeline is evaluated using a domain-specific dataset for MDS. This section details the
dataset, metrics, implementation, results, examples, and discussion, providing a comprehensive analysis
of the pipeline’s performance.</p>
        <p>The evaluation of the proposed pipeline is conducted on the SurveySum dataset, a domain-specific
resource designed for MDS tasks in scientific literature. This dataset includes 79 survey sections across
fields such as AI, natural language processing (NLP), and machine learning (ML). Each section is paired
with the full-text content of its cited papers, with an average of 7.38 papers cited per section. The
dataset is explicitly designed to test MDS models on the synthesis of content from multiple sources,
making it particularly suited for assessing the proposed pipeline.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metrics</title>
        <p>The evaluation employs a mix of traditional and LLM-based metrics to assess the quality of summaries
in terms of content coverage, coherence, and citation alignment:</p>
        <p>
          ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] measures the overlap between the
generated summaries and the reference text. It calculates n-gram overlap, word sequence matching,
and the longest common subsequences. ROUGE-1, ROUGE-2, and ROUGE-L are used in this study to
capture unigrams, bigrams, and sentence-level matches, respectively.
        </p>
        <p>
          BERTScore [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] evaluates semantic similarity between the generated and reference summaries using
contextual embeddings from PLMs like BERT.
        </p>
        <p>Reference F1 Score (Ref-F1) measures how accurately the citations in the generated summaries align
with those in the ground truth. It computes precision (proportion of correctly included references) and
recall (proportion of ground-truth references captured in the generated summary), and combines them
into an F1 score. This metric is essential in scientific summarization, where attribution and citation
accuracy are critical.</p>
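        <p>A minimal sketch of the Ref-F1 computation as described, treating the citations of each summary as a set of reference keys (the exact matching rules used by SurveySum may differ):</p>

```python
# Ref-F1 over sets of citation keys: precision on the generated references,
# recall on the ground-truth references, combined into an F1 score.
def ref_f1(generated_refs, ground_truth_refs):
    gen, gold = set(generated_refs), set(ground_truth_refs)
    if not gen or not gold:
        return 0.0
    correct = len(gen & gold)
    precision = correct / len(gen)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```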
        <p>
          G-Eval [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a framework for evaluating the output of natural language generation (NLG) using
LLMs, providing reference-free assessments based on criteria such as coherence, coverage, fluency, and
relevance. It uses Chain-of-Thought (CoT) reasoning to systematically generate detailed evaluation
steps, ensuring consistency and robustness in scoring. Scores are assigned on a fixed scale (e.g. 1
to 5) and refined using token probabilities, enabling granular, continuous analyses that capture the
nuances between outputs. By bypassing the need for reference outputs, it is particularly effective for
tasks where predefined references are unavailable or impractical, such as creative or open-ended text
generation. Experiments show that G-Eval achieves a stronger correlation with human judgments than
traditional metrics such as ROUGE, as well as neural evaluators, on benchmarks such as SummEval
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] for summarization.
        </p>
        <p>
          CheckEval [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a robust evaluation framework that uses LLMs to evaluate generated text using
a structured checklist-based approach. It supports two evaluation modes: reference-based, which
compares the generated text to reference summaries, and criteria-driven, which evaluates the text
against predefined dimensions such as coherence, fluency, and coverage. By breaking down evaluation
criteria into detailed sub-aspects, framed as Boolean (yes/no) questions, CheckEval simplifies the
evaluation process and increases its reliability and interpretability. The framework operates in three
stages: aspect selection, where key evaluation dimensions are identified; checklist generation, where
detailed questions are created and refined; and checklist-based evaluation, where LLMs respond to the
questions, with the final score calculated as the proportion of positive responses. Validated against the
SummEval benchmark, CheckEval demonstrates high correlation with human judgment and strong
inter-annotator agreement.
        </p>
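        <p>The checklist-scoring idea reduces to a small function once the per-question LLM judge is abstracted away; answer_fn below is a placeholder for that judge, so this is a structural sketch of CheckEval's final scoring step only:</p>

```python
# Final scoring step of a checklist-based evaluation: the score is the
# fraction of yes/no checklist questions answered positively.
# `answer_fn` is a placeholder for the LLM judge.
def checklist_score(summary, checklist, answer_fn):
    answers = [answer_fn(summary, question) for question in checklist]
    return sum(1 for a in answers if a) / len(answers)
```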
        <p>ROUGE and BERTScore evaluations report recall scores to emphasize content coverage, while G-Eval
and CheckEval focus on the coverage criterion, consistent with SurveySum’s methodology for assessing
core content representation.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Implementation Details</title>
        <p>
          This section outlines the tools, models, and frameworks utilized in our development process:
• Development Environment: The pipeline was implemented in Python, utilizing
sentence-transformers for embedding generation, nltk for text processing, and FAISS for
efficient vector-based retrieval. All experiments were conducted on a Tesla V100-PCIE-32GB
GPU, enabling efficient embedding generation, chunk retrieval, and summarization tasks.
• Pre-trained LLMs: For text generation, we employed gpt4o-mini_15-2-2024-preview,
while Phi-3-small-8k-instruct was utilized for evaluation. Both models were configured
with a temperature of 0.3 and a top-p of 0.95 to ensure controlled and consistent outputs. This
choice aligns with the position that the same model should not be used for both generation and
evaluation to mitigate potential bias. Research has highlighted that using identical or equally
powerful models for both tasks can lead to skewed results, as LLMs like GPT-4 tend to favor
their own outputs due to egocentric biases [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Notably, the SurveySum paper does not specify
whether identical models were used in their study for both tasks.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results</title>
        <p>Since the original SurveySum paper did not report ROUGE and BERTScore metrics for the benchmark
pipelines, we calculated these values. G-Eval and CheckEval were also computed using our
implementation to ensure consistency, employing the Phi-3-small-8k-instruct model as the evaluator. This
approach guarantees a fair comparison across all pipelines, enabling a comprehensive assessment of
XSum’s performance relative to the benchmarks. Any differences in the results (particularly for G-Eval
and CheckEval) can be attributed to differences in the evaluation settings and model configurations
compared to the original SurveySum experiments. For clarity, Pipeline_1 and Pipeline_2 are the two
pipelines that performed best in the SurveySum experiments.</p>
        <p>The results, summarized in Table 1, highlight XSum’s consistent outperformance of the benchmark
pipelines across all metrics. It achieves ROUGE-1 (0.51) and ROUGE-L (0.24), reflecting its ability to
effectively capture unigrams and sentence-level structures. Its BERTScore (0.62) highlights its strong
semantic alignment with reference summaries, reflecting its capability to retain content integrity
through paraphrasing and semantic rephrasing. Furthermore, XSum attains the highest Ref-F1 (0.76),
G-Eval (4.2), and CheckEval (0.97) scores, emphasizing its superiority in generating coherent, relevant,
and high-quality summaries.</p>
        <p>To further demonstrate the performance of XSum, we present two examples of summaries generated
by it, along with their corresponding ground truth (original section text) and evaluation scores. These
examples have been selected based on their performance across the evaluation metrics, one representing
the highest average score across all metrics and the other representing the lowest average score. Due to
their length, the full examples are provided in Appendix A.2.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Discussion</title>
        <p>The strong performance of XSum is largely driven by its two key features: the question-generation
module and the editor module. By dynamically generating queries about the document content, the
retrieval module ensures relevant and contextual results, addressing the limitations of static query
approaches such as using section titles, as in Pipeline_2. Additionally, the use of ColBERT as a reranker
may contribute to better chunk retrieval by prioritizing the most relevant and informative sections
during the ranking process. The editor module further enhances the pipeline by synthesizing retrieved
information into coherent summaries with proper citations, resulting in outputs that adhere to academic
standards.</p>
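Conceptually, the question-generation module turns each input paper’s metadata into retrieval queries before any chunk retrieval happens. A minimal sketch of such prompt construction follows; the function name and prompt wording are hypothetical and do not reproduce XSum’s actual prompts.

```python
def build_question_prompt(title: str, abstract: str, n_questions: int = 5) -> str:
    """Compose an LLM prompt asking for retrieval questions about one paper.

    Hypothetical wording; XSum's real prompt templates may differ.
    """
    return (
        "You are preparing to summarize a scientific paper.\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        f"Write {n_questions} specific questions whose answers would be "
        "needed to summarize this paper's contribution."
    )
```

The key design point is that the questions are conditioned on each paper individually, so the retrieval queries adapt to the input set rather than remaining static like a section title.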
        <p>The quality of the content generated by XSum highlights its ability to synthesize multiple sources
into a structured and coherent summary. The selected examples illustrate both the strengths and
limitations of the approach. The high-scoring example (Example 1) closely follows the human-written
text, effectively capturing key technical details while maintaining logical flow and factual consistency.
This suggests that XSum can generate summaries that are informative, well-structured, and
in line with academic standards. However, a notable difference remains in the style and clarity of
the summaries. Human-written sections tend to be more compact and nuanced, often presenting
a comparative perspective that sets different contributions in relation to each other. In contrast,
XSum summaries tend to be verbose, often providing extended explanations and additional contextual
information beyond what is strictly necessary for summarizing. This is particularly evident in the
low-scoring example (Example 2), where the generated text, while factually accurate, lacks the same
level of selectivity as the human-written version, including an unnecessary degree of background
detail rather than focusing solely on comparative insights. This contrast highlights a key challenge in
scientific summarization. Although retrieval-driven methods such as XSum excel at aggregating and
structuring information, they do not yet fully replicate the complex synthesis and prioritization that
domain experts perform when writing a summary of multiple related papers. Nevertheless, XSum still
produces highly structured and factually based summaries, demonstrating that automated MDS can be
a valuable tool for scientific literature synthesis, particularly in assisting researchers with information
overload.</p>
        <p>Despite its strengths, the low ROUGE-2 scores across all pipelines highlight a common problem in
abstractive summarization: achieving bigram overlap with reference summaries. RAG-based pipelines,
including XSum, prioritize semantic richness and coherence over strict lexical matching, which reduces
alignment with reference summaries. However, ROUGE-1 and ROUGE-L scores show moderate
alignment, reflecting the ability to capture essential unigrams and sentence-level structures. BERTScore,
which assesses semantic similarity, achieves satisfactory results, highlighting the ability of such pipelines
to capture the essence of content through paraphrasing and semantic rephrasing, even when lexical
overlap is limited. It is important to note that while the ROUGE metrics provide valuable insights into
lexical overlap and content coverage, they are inherently limited in assessing the nuances of abstractive
summarization. This limitation is addressed by incorporating a set of metrics (BERTScore, G-Eval, and
CheckEval) that more effectively capture semantic similarity, coherence, and overall quality.</p>
        <p>In addition to traditional metrics, frameworks like G-Eval and CheckEval provide refined assessments
of summary quality by leveraging LLMs to evaluate coherence, relevance, and coverage. These metrics
excel at capturing semantic and structural attributes that conventional metrics often overlook, making
them particularly effective for evaluating abstractive summaries. However, their dependence on specific
LLMs introduces challenges of consistency and reproducibility, as evaluation outcomes may vary with
different model configurations. This highlights the need for standardization in LLM-driven evaluation
practices.</p>
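A CheckEval-style evaluation ultimately reduces a summary to binary checklist verdicts produced by an LLM and aggregates them into a score in [0, 1]. The sketch below shows only that final aggregation step; the checklist items are illustrative placeholders, not taken from the CheckEval paper.

```python
def checkeval_score(verdicts: dict) -> float:
    """Aggregate binary checklist verdicts into a score between 0 and 1."""
    if not verdicts:
        raise ValueError("empty checklist")
    return sum(verdicts.values()) / len(verdicts)

# Hypothetical verdicts an LLM judge might return for one summary.
verdicts = {
    "Covers the main contributions of each cited paper": True,
    "Statements are attributable to a cited source": True,
    "Free of contradictions with the sources": False,
}
```

Because the score depends entirely on the judging LLM’s verdicts, swapping the evaluator model can shift results, which is the reproducibility concern raised above.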
        <p>Finally, XSum’s modular design offers substantial flexibility in adapting to different summarization
tasks. The question-generation module can be customized to generate domain-specific or task-specific
questions to improve relevance in different contexts. Similarly, the editor module allows customization
of tone, style, and abstraction levels, enabling outputs to be tailored for different audiences, from
academic researchers to professional practitioners. This adaptability ensures the pipeline’s scalability
and applicability to a wide range of domains, addressing the growing demand for efficient MDS in
complex settings.</p>
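This kind of adaptability can be thought of as a small set of pipeline parameters. The configuration sketch below is hypothetical; the field names are illustrative and do not correspond to XSum’s actual interface.

```python
from dataclasses import dataclass

@dataclass
class XSumConfig:
    """Hypothetical knobs for adapting the pipeline to a new domain."""
    question_style: str = "technical"    # e.g. "technical", "clinical", "legal"
    num_questions: int = 5               # questions generated per input paper
    editor_tone: str = "academic"        # target register of the summary
    abstraction_level: str = "detailed"  # "detailed" vs. "high-level"
    citation_format: str = "numeric"     # how the editor renders citations
```

Keeping such options at the pipeline boundary is what lets the same retrieval and editing machinery serve audiences from academic researchers to practitioners.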
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This work addresses the challenges of MDS in the scientific domain by introducing a modular
RAG-based pipeline featuring two key enhancements: a question-generation module and an editor module.
These components enable the pipeline to synthesize information from multiple scientific papers into
cohesive, well-structured summaries. Experimental evaluations on the SurveySum dataset demonstrate
considerable improvements in metrics such as CheckEval, G-Eval, and Ref-F1 compared to existing
approaches. By providing detailed guidance on the design and implementation of RAG-based pipelines,
this work contributes to making these systems more transparent, reproducible, and adaptable for diverse
summarization tasks.</p>
      <p>
        While the current pipeline achieves strong performance, several opportunities for future
improvements remain. A key direction is evaluating XSum against other MDS pipelines to enable a more
in-depth comparison of effectiveness and retrieval quality. Such comparisons would provide insights
into the relative strengths and limitations of different summarization strategies. Additionally,
conducting an ablation study would allow for a deeper understanding of the impact of each component
in the pipeline, particularly the question-generation module and the editor module, to assess their
individual contributions to overall performance. Optimizing data ingestion pipelines, which are often
a bottleneck in large-scale industrial applications (as emphasized in systems like ColPali [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]), could
further enhance scalability and efficiency. Moreover, integrating vision-language models to process
visually rich documents, including text, tables, and figures, offers a promising direction for improving
retrieval accuracy and extending the system’s capabilities to more complex scientific datasets.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <p>Despite its contributions, this work has several limitations that warrant further investigation:
• Scalability and Real-World Deployment: While the proposed pipeline demonstrates strong
performance in controlled environments, this work does not address the challenges of scaling the
pipeline for real-world applications. Issues such as handling extremely large datasets, ensuring
low latency, and optimizing cost-effective deployment for different organizational needs remain
unaddressed and require further research.
• Qualitative Analysis: While quantitative evaluations on metrics like CheckEval and G-Eval
demonstrate strong performance, this study lacks a comprehensive qualitative analysis of the
generated summaries.
• Document Retrieval Scope: The pipeline assumes a predefined set of input papers for
summarization and does not address the challenge of identifying or retrieving relevant documents
for a specific topic. This limitation highlights the need for further research into integrating
robust document retrieval mechanisms with summarization workflows to enhance the pipeline’s
applicability.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors used DeepL Write for sentence polishing.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <sec id="sec-8-1">
        <title>A.1. Examples of Generated Questions</title>
        <sec id="sec-8-1-1">
          <title>Generated Questions Example 1</title>
          <p>Below are examples of generated questions produced by our question-generation module.</p>
          <p>Paper Title: Automatic melody harmonization with triad chords: A comparative study
Paper Abstract: Several prior works have proposed various methods for the task of automatic
melody harmonization, in which a model aims to generate a sequence of chords to serve as the
harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a
comparative study evaluating and comparing the performance of a set of canonical approaches to
this task, including a template matching based model, a hidden Markov based model, a genetic
algorithm based model, and two deep learning based models. The evaluation is conducted on a
dataset of 9,226 melody/chord pairs we newly collect for this study, considering up to 48 triad
chords, using a standardized training/test split. We report the result of an objective evaluation
using six different metrics and a subjective study with 202 participants.</p>
          <p>Generated Questions:
• What are the key differences in performance among the various models evaluated for
automatic melody harmonization?
• How does the dataset of 9,226 melody/chord pairs contribute to the robustness of the study’s
findings?
• What specific metrics were used for the objective evaluation of the models, and how do they
compare in terms of effectiveness?
• What insights were gained from the subjective study involving 202 participants regarding
the perceived quality of the harmonizations?
• What future directions for research in automatic melody harmonization does this study
suggest based on its findings?</p>
        </sec>
        <sec id="sec-8-1-2">
          <title>Generated Questions Example 2</title>
          <p>Paper Title: Virtuosonet: A hierarchical rnn-based system for modeling expressive piano
performance
Paper Abstract: In this paper, we present our application of deep neural networks to modeling
piano performance, which imitates the expressive control of tempo, dynamics, articulations,
and pedaling from pianists. Our model consists of recurrent neural networks with hierarchical
attention and a conditional variational autoencoder. The model takes a sequence of note-level
score features extracted from MusicXML as input and predicts piano performance features of the
corresponding notes. To render musical expressions consistently over long-term sections, we first
predict tempo and dynamics at the measure level and, based on the result, refine them at the note
level. The evaluation through listening tests shows that our model achieves a more human-like
expressiveness compared to previous models. We also share the dataset used for the experiment.
Generated Questions:
• What are the key components of the hierarchical RNN architecture used in Virtuosonet for
modeling expressive piano performance?
• How does the model differentiate between measure-level and note-level predictions in terms
of tempo and dynamics?
• What specific features from MusicXML are utilized as input for the model, and how do they
contribute to the performance output?
• In what ways does the evaluation through listening tests demonstrate the model’s superiority
in expressiveness compared to previous approaches?
• What insights can be drawn from the dataset shared in the paper regarding the training and
evaluation of deep music generation models?</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Evaluation Examples</title>
        <sec id="sec-8-2-1">
          <title>Evaluation Example 1</title>
          <p>Survey Title: AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language
Processing
Survey Section: Taxonomy:Extensions:Long-Sequence T-PTLMs
Evaluation Scores: ROUGE-1: 0.61, ROUGE-2: 0.19, ROUGE-L: 0.32, BERTScore: 0.69, Ref-F1:
1.00, G-Eval: 4.8, CheckEval: 1.00</p>
        </sec>
        <sec id="sec-8-2-2">
          <title>Ground Truth (Section Text):</title>
          <p>The self-attention attention module in transformers updates the representation of each input
token by attending to all tokens in the input sequence. The quadratic time complexity of the
self-attention module limits the application of T-PTLMs to long input sequences. To overcome
this drawback, self-attention variants like sparse self-attention and linearized self-attention are
proposed to reduce its complexity and hence extend T-PTLMs to long input sequences also
BIBREF205 . Some of the popular T-PTLMs based on a) sparse self-attention are Longformer
BIBREF206 , ETC BIBREF207 , BigBird BIBREF208 and Reformer BIBREF209 and b) linearized
self-attention are Performer BIBREF210 . Sparse self-attention reduces the complexity by including
sparsity bias which reduces the number of query-key pairs that each query attends to. In linearized
self-attention, reduced complexity is achieved by disentangling the attention with kernel feature
maps and then computing the attention in reverse order.</p>
        </sec>
        <sec id="sec-8-2-3">
          <title>Generated Content (XSum Output):</title>
          <p>The exploration of Transformer-based pretrained models, particularly in the context of
long-sequence processing, has revealed significant advancements and innovations across various
architectures. Key modifications to the original Transformer model have emerged, including Adaptive
Computation Time (ACT), hierarchical structures, and dynamic halting mechanisms, all aimed at
enhancing computational efficiency and representation capabilities for complex inputs [BIBREF205].
The introduction of models like Longformer, with its linear scaling attention mechanism and
local-global attention integration, demonstrates a marked improvement in handling lengthy documents
compared to traditional self-attention approaches, outperforming models such as RoBERTa in
tasks requiring extensive context [BIBREF206]. The Extended Transformer Construction (ETC)
architecture further addresses the challenges of scaling input length and encoding structured
data through its global-local attention mechanism and the use of relative position encodings,
which significantly enhance the model’s performance on tasks like HotpotQA and WikiHop
[BIBREF207]. Similarly, BigBird’s sparse attention mechanism allows for efficient processing of
long sequences while maintaining the representational power of traditional Transformers,
achieving state-of-the-art results in question answering and summarization tasks [BIBREF208]. The
Reformer model introduces techniques such as Locality-Sensitive Hashing (LSH) attention and
reversible residual layers, which contribute to its memory efficiency and ability to handle long
sequences effectively, demonstrating comparable performance to standard Transformers across
various NLP tasks [BIBREF209]. Performers leverage the FAVOR+ mechanism to achieve nearly
linear time complexity and sub-quadratic memory consumption, enhancing both efficiency and
accuracy in large-scale applications, including protein sequence modeling and pixel-prediction
tasks [BIBREF210]. These advancements collectively indicate a promising trajectory for future
research, emphasizing the need for continued exploration of efficient architectures and pre-training
strategies that can further enhance the capabilities of Transformer-based models in processing
long sequences and complex data structures across diverse domains.</p>
        </sec>
        <sec id="sec-8-2-4">
          <title>Evaluation Example 2</title>
          <p>Survey Title: Recent Advances in Natural Language Processing via Large Pre-Trained Language
Models: A Survey
Survey Section: Data Generation via PLM:Explaining Models’ Decisions
Evaluation Scores: ROUGE-1: 0.35, ROUGE-2: 0.05, ROUGE-L: 0.13, BERTScore: 0.63, Ref-F1:
0.35, G-Eval: 4.5, CheckEval: 0.11</p>
        </sec>
        <sec id="sec-8-2-5">
          <title>Ground Truth (Section Text):</title>
          <p>Despite the impressive performance of deep learning models for various NLP tasks, a remaining
challenge to widespread adoption is the lack of explanations for the models’ decisions. This hinders
the development and debugging process, as well as user trust. This is especially true for application
domains such as healthcare, security, and online education. As such, a considerable number
of approaches have been proposed for explaining deep learning models’ behavior, including
model-intrinsic BIBREF220 , BIBREF221 , BIBREF222 and model-agnostic approaches BIBREF223 ,
BIBREF224 , BIBREF225 . While model-intrinsic explanations expose internal model state (e.g.
feature importance or attention scores), in model-agnostic (post-hoc) methods, explanations are
generated via the model predictions without inspecting the internal state. Generative models
are often applied for post-hoc explanations, aiming to obtain either counterexamples BIBREF226
, BIBREF227 , BIBREF228 or natural language texts BIBREF229 , BIBREF230 , BIBREF231 for
explaining purposes. Generating counterexamples can shed light on the decision boundaries of
the models (i.e. explaining when a model changes its decision), thus improving interpretability. To
this end, the generated counterexamples should be close to the decision boundaries so that small
modifications result in changing the model predictions. Traditionally, heuristic rules applied to the
original inputs create likely counterexamples BIBREF227 , BIBREF232 , BIBREF233 , BIBREF234 .
PLMs have been leveraged to generate more diverse examples for better evaluation BIBREF235 ,
BIBREF228 , BIBREF236 . In particular, BIBREF228 proposes a method based on GPT-2 to generate
counterfactuals that are close to the original sentences and entail specific relationships with
the original, facilitating label induction (e.g. negation, insertion, shuffle). Concretely, an input
sentence is concatenated with a relation label (e.g. negation) and a template consisting of the
special tokens [BLANK] to form the prompt for GPT-2 model. For instance, for the sentence
“It is great for kids” and the relation label “negate”, the following prompt is constructed:
“It is great for kids. [negation] It is [BLANK] great for [BLANK]. [SEP]”. Next, the GPT-2
model generates answers for the [BLANK] in the template (e.g. “not [ANSWER] children”,
separated by the special token [ANSWER] ). To fine-tune the GPT-2 model, non-parallel
datasets (e.g. CommonGen, Natural Questions and SQuAD) are automatically processed to find
the relations between pairs of sentences and to construct the templates for each relation based
on the obtained pairs. It is worth noting that the sentences generated by GPT-2 might have
the same label as the original input sentence. In addition, BIBREF228 show that the generated
counterexamples can be helpful to improve the performance of the downstream models, e.g. for
natural language inference, duplicate question detection, and sentiment analysis. Other research
is informing the task of natural language explanation generation , where the goal is to expose
the rationale behind the model decisions in automatically generated natural language text. Any
approach must critically require that the generated response is faithful to the model behavior. To
this end, BIBREF230 propose to first generate the explanations, and then employ the explanations
to obtain the final model predictions. They use natural language inference as the task requiring
explanations. Label-specific GPT-2 models are fine-tuned over concatenations of corresponding
premises, hypotheses, and human-provided explanations, so that at inference, the model generates
an explanation based on premise and hypothesis. Next, the explanations together with the premise
and the hypothesis are consumed by an explanation processor model (e.g. RoBERTa) to select the
most likely label. This process obtains a more faithful explanation for the label choice, compared
to traditional prediction-first approaches BIBREF229 . However, this approach does not provide
explanations that reference non-selected labels. To address the question of why other labels
are not chosen, BIBREF231 exploit counterexamples, deriving them from original samples with
heuristic rules. The original samples and counterexamples are provided to GPT-2 to generate an
explanation for the question “Why A not B”.</p>
        </sec>
        <sec id="sec-8-2-6">
          <title>Generated Content (XSum Output):</title>
          <p>The exploration of data generation and model explanation through large pre-trained language
models (PLMs) has revealed critical insights into the challenges and advancements in the field. A
significant challenge in the interpretability of machine learning models, particularly in natural
language processing (NLP), lies in the complexity of the models themselves. Explanations must be
comprehensible to users with varying levels of expertise, necessitating a balance between accuracy
and interpretability. Techniques such as LIME (Local Interpretable Model-agnostic Explanations)
and SHAP (SHapley Additive exPlanations) have emerged as effective methods for generating
explanations that enhance user trust in model predictions. LIME has demonstrated high recall
rates and usability, enabling users to better assess the trustworthiness of individual predictions,
while SHAP offers a theoretically robust framework that aligns closely with human intuition,
improving the interpretability of complex models [BIBREF220], [BIBREF221]. Trust is a pivotal
factor in the deployment of machine learning models, as it influences user reliance on model
predictions. Explanations play a crucial role in fostering this trust by elucidating the rationale
behind predictions, thereby enabling users to make informed decisions. The experiments
conducted with human subjects highlighted that while explanations can enhance understanding, they
may also reveal underlying issues with model predictions, leading to a decrease in trust when
users discern that predictions are based on untrustworthy reasons [BIBREF220]. Furthermore,
the model-agnostic nature of explanation methods like LIME and SHAP allows for their
application across various classifiers, facilitating feature engineering and improving model performance
[BIBREF220], [BIBREF221]. Recent advancements in instance-wise feature selection and the use of
mutual information have further enhanced model interpretation by providing tailored explanations
for specific predictions. This approach contrasts with traditional global feature importance metrics,
allowing for a more nuanced understanding of model behavior [BIBREF222]. The introduction
of multimodal explanation frameworks, which combine visual and textual justifications, has also
shown promise in improving interpretability by leveraging the strengths of different modalities
[BIBREF223]. Additionally, the development of datasets such as ACT-X and VQA-X has provided
valuable resources for evaluating multimodal explanations in activity recognition and visual
question answering tasks, thereby advancing the field of explainable AI [BIBREF223]. The integration of
counterfactual explanations, particularly through frameworks like GYC and MiCE, has introduced
innovative methodologies for generating explanations that clarify model predictions. These
approaches focus on generating plausible and diverse counterfactuals that can reveal model behavior
and facilitate debugging processes. The ability to produce minimal edits that lead to significant
changes in model outputs enhances the interpretability of complex models while maintaining
fluency and grammaticality in generated explanations [BIBREF235], [BIBREF236]. Overall, the
insights gained from these advancements underscore the importance of developing robust methods
for explanation generation, which are essential for fostering trust and understanding in machine
learning systems deployed in real-world applications.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krenn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bufoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Coutinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsevskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Moutinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sanjabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sonthalia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Valente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          <article-title>Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network</article-title>
          ,
          <source>Nat. Mac. Intell</source>
          .
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>1326</fpage>
          -
          <lpage>1335</lpage>
          . URL: https://doi.org/10.1038/s42256-023-00735-0. doi:10.1038/s42256-023-00735-0.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nanba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okumura</surname>
          </string-name>
          ,
          <article-title>Towards multi-paper summarization using reference information</article-title>
          , in: T.
          <string-name>
            <surname>Dean</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99</source>
          , Stockholm, Sweden,
          <source>July 31 - August 6</source>
          ,
          <year>1999</year>
          . 2 Volumes, 1450 pages, Morgan Kaufmann,
          <year>1999</year>
          , pp.
          <fpage>926</fpage>
          -
          <lpage>931</lpage>
          . URL: http://ijcai.org/Proceedings/99-2/Papers/038.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A systematic survey of text summarization: From statistical methods to large language models</article-title>
          ,
          <source>CoRR abs/2406</source>
          .11289 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2406.11289. doi:10.48550/ARXIV.2406.11289. arXiv:2406.11289.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Guedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Laitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Lotufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Surveysum: A dataset for summarizing multiple scientific articles into a survey section</article-title>
          ,
          <source>CoRR abs/2408</source>
          .16444 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2408.16444. doi:10.48550/ARXIV.2408.16444. arXiv:2408.16444.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>CheckEval: Robust evaluation framework using large language model via checklist</article-title>
          ,
          <source>CoRR abs/2403</source>
          .18771 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv. 2403.18771. doi:
          <volume>10</volume>
          .48550/ARXIV.2403.18771. arXiv:
          <volume>2403</volume>
          .
          <fpage>18771</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Iter</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>G-Eval: NLG evaluation using GPT-4 with better human alignment</article-title>,
          <source>CoRR abs/2303.16634</source> (<year>2023</year>).
          URL: https://doi.org/10.48550/arXiv.2303.16634. arXiv:2303.16634.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>N.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>R. S.</given-names> <surname>Reddy</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Gvr</surname></string-name>,
          <string-name><given-names>C. P.</given-names> <surname>Rosé</surname></string-name>,
          <article-title>SciSumm: A multi-document summarization system for scientific articles</article-title>,
          in: S. Kurohashi (Ed.),
          <source>Proceedings of the ACL-HLT 2011 System Demonstrations</source>,
          Association for Computational Linguistics, Portland, Oregon, <year>2011</year>,
          pp. <fpage>115</fpage>-<lpage>120</lpage>.
          URL: https://aclanthology.org/P11-4020/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>A.</given-names> <surname>Fabbri</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>She</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Radev</surname></string-name>,
          <article-title>Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model</article-title>,
          in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>,
          Association for Computational Linguistics, Florence, Italy, <year>2019</year>,
          pp. <fpage>1074</fpage>-<lpage>1084</lpage>.
          URL: https://aclanthology.org/P19-1102/. doi:10.18653/v1/P19-1102.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lapata</surname></string-name>,
          <article-title>Hierarchical transformers for multi-document summarization</article-title>,
          in: A. Korhonen, D. Traum, L. Màrquez (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>,
          Association for Computational Linguistics, Florence, Italy, <year>2019</year>,
          pp. <fpage>5070</fpage>-<lpage>5081</lpage>.
          URL: https://aclanthology.org/P19-1500/. doi:10.18653/v1/P19-1500.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>K.</given-names> <surname>Shinde</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Roy</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ghosal</surname></string-name>,
          <article-title>An extractive-abstractive approach for multi-document summarization of scientific articles for literature review</article-title>,
          in: A. Cohan, G. Feigenblat, D. Freitag, T. Ghosal, D. Herrmannova, P. Knoth, K. Lo, P. Mayr, M. Shmueli-Scheuer, A. de Waard, L. L. Wang (Eds.),
          <source>Proceedings of the Third Workshop on Scholarly Document Processing</source>,
          Association for Computational Linguistics, Gyeongju, Republic of Korea, <year>2022</year>,
          pp. <fpage>204</fpage>-<lpage>209</lpage>.
          URL: https://aclanthology.org/2022.sdp-1.25/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Pang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Multi-document scientific summarization from a knowledge graph-centric view</article-title>,
          in: N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S. Na (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022</source>,
          International Committee on Computational Linguistics, <year>2022</year>,
          pp. <fpage>6222</fpage>-<lpage>6233</lpage>.
          URL: https://aclanthology.org/2022.coling-1.543.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Asai</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Shao</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Shi</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>J. C.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Soldaini</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feldman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>D'Arcy</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Wadden</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Latzke</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Tian</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Ji</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Tong</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xiong</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Neubig</surname></string-name>,
          <string-name><given-names>D. S.</given-names> <surname>Weld</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Downey</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Yih</surname></string-name>,
          <string-name><given-names>P. W.</given-names> <surname>Koh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <article-title>OpenScholar: Synthesizing scientific literature with retrieval-augmented LMs</article-title>,
          <source>CoRR abs/2411.14199</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2411.14199. arXiv:2411.14199.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>J.</given-names> <surname>Bakker</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>,
          <article-title>Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts</article-title>,
          in: M. Shardlow, H. Saggion, F. Alva-Manchego, M. Zampieri, K. North, S. Štajner, R. Stodden (Eds.),
          <source>Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)</source>,
          Association for Computational Linguistics, Miami, Florida, USA, <year>2024</year>,
          pp. <fpage>41</fpage>-<lpage>51</lpage>.
          URL: https://aclanthology.org/2024.tsar-1.5/. doi:10.18653/v1/2024.tsar-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J.</given-names> <surname>DeYoung</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Beltagy</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Zuylen</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Kuehl</surname></string-name>,
          <string-name><given-names>L. L.</given-names> <surname>Wang</surname></string-name>,
          <article-title>MS2: Multi-document summarization of medical studies</article-title>,
          <source>CoRR abs/2104.06486</source> (<year>2021</year>).
          URL: https://arxiv.org/abs/2104.06486. arXiv:2104.06486.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Charlin</surname></string-name>,
          <article-title>Multi-XScience: A large-scale dataset for extreme multi-document summarization of scientific articles</article-title>,
          in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020</source>,
          Association for Computational Linguistics, <year>2020</year>,
          pp. <fpage>8068</fpage>-<lpage>8074</lpage>.
          URL: https://doi.org/10.18653/v1/2020.emnlp-main.648. doi:10.18653/v1/2020.emnlp-main.648.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>L.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Callan</surname></string-name>,
          <article-title>Precise zero-shot dense retrieval without relevance labels</article-title>,
          <source>CoRR abs/2212.10496</source> (<year>2022</year>).
          URL: https://doi.org/10.48550/arXiv.2212.10496. arXiv:2212.10496.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>W.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hasson</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>,
          <article-title>HyQE: Ranking contexts with hypothetical query embeddings</article-title>,
          <source>CoRR abs/2410.15262</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2410.15262. arXiv:2410.15262.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>Y.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xiong</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Jia</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bi</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dai</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>,
          <source>CoRR abs/2312.10997</source> (<year>2023</year>).
          URL: https://doi.org/10.48550/arXiv.2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>K.</given-names> <surname>Santhanam</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Khattab</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Saad-Falcon</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Potts</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zaharia</surname></string-name>,
          <article-title>ColBERTv2: Effective and efficient retrieval via lightweight late interaction</article-title>,
          <source>CoRR abs/2112.01488</source> (<year>2021</year>).
          URL: https://arxiv.org/abs/2112.01488. arXiv:2112.01488.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>,
          in: <source>Text Summarization Branches Out</source>,
          Association for Computational Linguistics, Barcelona, Spain, <year>2004</year>,
          pp. <fpage>74</fpage>-<lpage>81</lpage>.
          URL: https://aclanthology.org/W04-1013/.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Kishore</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>K. Q.</given-names> <surname>Weinberger</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Artzi</surname></string-name>,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>,
          <source>CoRR abs/1904.09675</source> (<year>2019</year>).
          URL: http://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>A. R.</given-names> <surname>Fabbri</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Kryscinski</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>McCann</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xiong</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name>,
          <string-name><given-names>D. R.</given-names> <surname>Radev</surname></string-name>,
          <article-title>SummEval: Re-evaluating summarization evaluation</article-title>,
          <source>Trans. Assoc. Comput. Linguistics</source> <volume>9</volume> (<year>2021</year>)
          <fpage>391</fpage>-<lpage>409</lpage>.
          URL: https://doi.org/10.1162/tacl_a_00373. doi:10.1162/tacl_a_00373.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Tao</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Ma</surname></string-name>,
          <article-title>Leveraging large language models for NLG evaluation: Advances and challenges</article-title>,
          in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024</source>,
          Association for Computational Linguistics, <year>2024</year>,
          pp. <fpage>16028</fpage>-<lpage>16045</lpage>.
          URL: https://aclanthology.org/2024.emnlp-main.896.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><given-names>M.</given-names> <surname>Faysse</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Sibille</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Omrani</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Viaud</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hudelot</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Colombo</surname></string-name>,
          <article-title>ColPali: Efficient document retrieval with vision language models</article-title>,
          <source>CoRR abs/2407.01449</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2407.01449. arXiv:2407.01449.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>