<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2025 SimpleText Task 1: Simplify Scientific Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Bakker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Vendeville</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents an overview of the CLEF 2025 SimpleText Task 1 on Text Simplification. The task aims to simplify scientific text. We discuss the data and benchmarks provided for these tasks, along with preliminary insights and anticipated challenges. Our main findings are the following. First, we advanced the field of text simplification by creating new biomedical corpora that support true paragraph- and document-level simplification, capturing greater variation and complex discourse suited to LLMs. Second, our CLEF 2025 document-level corpus showed for the first time that document-level simplification models clearly outperformed sentence-level methods. Third, in addressing the biomedical domain, teams developed novel approaches to generate plain language summaries that overcome key barriers for consumers, enhancing accessibility to authoritative health information. More generally, we hope and expect that the constructed corpora and evaluation data will be used by researchers to further advance text simplification approaches, both in general and specifically for the biomedical domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific text simplification</kwd>
        <kwd>Biomedical AI</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Information access</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The CLEF 2025 SimpleText track consists of three tasks:</p>
      <p>• Task 1: Text Simplification, which aims to simplify scientific text.
• Task 2: Controlled Creativity, which aims to identify and avoid hallucination.
• Task 3: SimpleText 2024 Revisited, offering selected tasks by popular request.</p>
      <p>
        This paper gives an overview of the CLEF 2025 SimpleText Task 1 on Text Simplification, which aims to
simplify scientific text. Further detail on the entire track is in the CLEF 2025 SimpleText Track Overview
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Additional details on Task 2 on Controlled Creativity are in a companion Task 2 overview paper [3].
We also refer to the respective participants’ papers for further details.
      </p>
      <p>A total of 74 teams registered for our SimpleText track at CLEF 2025. A total of 18 teams submitted
198 runs in total for Tasks 1 and 2. The statistics for the submitted runs are presented in Table 1.
However, some runs had problems that we could not resolve. We do not detail them in the rest of the
paper and leave out the 0-scoring runs. More details about individual runs and experiments can be
found in the participants’ papers, also shown in Table 1.</p>
      <p>The rest of this paper is structured in the following way. Section 2 describes the task, the data,
the format, and the evaluation measures. Section 3 describes the participants’ approaches. Section 4
provides detailed results for the task. Section 5 provides further analysis of the results. We end with a
discussion and conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Simplify Scientific Text</title>
      <sec id="sec-2-1">
        <title>This section details Task 1: Text Simplification, on simplifying scientific text.</title>
        <sec id="sec-2-1-1">
          <title>2.1. Description</title>
          <p>
            The Text Simplification task aims to simplify scientific text. We created a new CLEF 2025 SimpleText
corpus based on biomedical literature abstracts and lay summaries from Cochrane systematic reviews,
called Cochrane-auto [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. An example is shown in Figure 1. This corpus was created by closely following
the Cochrane-auto alignment procedure.
          </p>
          <p>Table 1 includes submissions in the Tasks 1 and 2 Codabench evaluation platform, where we were
privileged to have 29 (Task 1) and 13 (Task 2) participants.</p>
          <p>Complex paragraph
Fifteen heterogeneous trials, involving 1022 adults with dorsally displaced and potentially or evidently unstable distal
radial fractures, were included. While all trials compared external fixation versus plaster cast immobilisation, there was
considerable variation especially in terms of patient characteristics and interventions. Methodological weaknesses among
these trials included lack of allocation concealment and inadequate outcome assessment.</p>
          <p>Simple paragraph
Fifteen trials, involving 1022 adults with potentially or evidently unstable fractures, were included. While all trials compared
external fixation versus plaster cast immobilisation, there was considerable variation in their characteristics especially in
terms of patient characteristics and the method of external fixation.</p>
          <p>Task Description This is the core NLP task of the track, and we continue with both sentence-level
(Task 1.1) and document-level (Task 1.2) scientific text simplification. The main innovation is the very
large new corpus we constructed in 2024, and the shift to the biomedical domain.</p>
          <p>2.2. Data
As discussed above, we constructed a large scientific text simplification corpus, based on realigning
abstracts and lay summaries at scale at the sentence, paragraph, and document levels. In 2025, we
use this Cochrane-auto corpus as the training data.</p>
          <p>Train data The specific train data for Task 1 consists of 1,085 documents, 4,171 paragraphs, and 14,719
sentences, with paired content from the abstract and the plain language summary. While the track
distinguishes only between sentence- and document-level text simplification, paragraph identifiers
for the sentence input are included, also allowing paragraph-level text simplification submissions
to Task 1.2.</p>
          <p>Test data The primary test data consists of 217 new Cochrane abstracts with paired plain English
summaries, composed of 4,293 source sentences.</p>
          <p>These are new systematic reviews published by Cochrane over the last year. We process these
paired abstracts and plain language summaries in two different ways.</p>
          <p>
            • We process these as Cochrane-auto [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] to ensure a high-quality sentence and paragraph
alignment. This results in a subset of 37 abstracts and 587 sentences, paired with 37 plain
language summaries with 388 sentences. The processing is identical to Cochrane-auto and
other text simplification data sets.
• For document-level text simplification, we can also use the original pairs of abstracts and
plain language summaries, using only the results and conclusions sections, similar to [19].
This results in 217 abstracts with 4,293 source sentences, paired with 217 plain language
summaries with 3,641 sentences.
          </p>
          <p>We use the aligned subset as the main evaluation and also report the scores over the whole set.</p>
          <p>Analysis data For further analysis, we extended the test data with the Cochrane-auto validation and
test splits (part of the train data), Medline abstracts for which TREC PLABA references exist,
and SimpleText 2024 abstracts for which we have references. The combined test file, including
additional data sources, contains 666 documents with 9,160 sentences.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Formats</title>
          <p>
            2.3.1. Train data
This section outlines the format of the data used in the CLEF 2025 SimpleText Task 1.
The training data used was from the Cochrane-auto paper [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] as published on GitHub: https://github.com/JanB100/cochrane-auto/tree/main/data.
This data format follows the earlier Wiki-auto and Newsela-auto corpora and includes additional
fields for statistics. It also includes the specific transformation (copy, rephrase, split, merge,
delete) for each sentence in the source abstract.
          </p>
          <p>
            The training data already has train, validation, and test splits with references. We also included
the validation and test splits of the training data [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] in the combined source data test set, so we also
collected the predictions of each system on the training data.
          </p>
          <p>2.3.2. Sources</p>
          <p>Sentence-level simplification
The source data is in a JSON format with the following fields:
1. pair_id: Unique ID for the complex-simple document pair.
2. para_id: Index of the source paragraph in the document.
3. sent_id: Index of the source sentence in the document.
4. complex: Complex sentence from the source document.</p>
          <p>This format is a simplified version of the Cochrane-auto format in JSON. The combination of the pair_id
(the id of the document or abstract) and the sent_id (the sentence order) identifies each sentence
unambiguously. As the abstracts can be quite lengthy, we retained the paragraph identifier. This third
identifier enables paragraph-level text simplification approaches (submitted to Task 1.2).</p>
          <p>An example of the Task 1.1 JSON source input is:
[{
"pair_id": "CD012520",
"para_id": 0,
"sent_id": 0,
"complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands."
},
{
"pair_id": "CD012520",
"para_id": 0,
"sent_id": 1,
"complex": "Health professional participants (numbers not specified) included nursing, medical and allied health professionals."
},
. . .
{
"pair_id": "CD012520",
"para_id": 2,
"sent_id": 12,
"complex": "We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low."
}, . . . ]</p>
          <p>Document-level simplification
The source data is in a JSON format with the following fields:
1. pair_id: Unique ID for the complex-simple document pair.
2. source: The origin of the data (only for reference).
3. complex: The complex document’s content (all sentences).</p>
          <p>This is again a simplified version of the Cochrane-auto format in JSON. The new Cochrane-auto data is
the main evaluation of the task in 2025. However, the test input files also included the validation and
train splits of the training data, as well as other scientific abstracts from PubMed and the earlier CLEF
2024 SimpleText corpus. The source field indicates the source of the abstracts. This enables comparative
performance analysis against other existing text simplification corpora, in addition to the track’s official
evaluation scores on the new set of Cochrane-auto abstracts.</p>
          <p>An example of the Task 1.2 JSON source input is:
[{
"pair_id": "CD012520",
"source": "Cochrane",
"complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands. Health professional participants (numbers not specified) included nursing, medical and allied health professionals. ... We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low."
}, . . . ]</p>
          <p>In this case, there are 13 sentences (Task 1.1) corresponding to the Cochrane document (Task 1.2).</p>
          <p>2.3.3. Predictions
As output, participants were asked to add a column to the input data with their simplified sentences
or abstracts (prediction) as well as a unique identifier for their run. So, in addition to the fields in the
Source format above, submissions included the following fields:
• prediction: Your simplification of the complex sentence with pair_id and sent_id (for sentence-level
simplification), or of the complex document with pair_id (for document-level simplification).
• run_id: &lt;team-id&gt;_&lt;task-id&gt;_&lt;method-used&gt;.</p>
        </sec>
      </sec>
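      <p>As a concrete illustration, the prediction and run_id fields described above can be attached to the source records with a few lines of code. The sketch below is a hypothetical example, not part of the track's tooling: the simplify placeholder and the team/method names are invented for illustration.</p>
      <p>
```python
import json

def simplify(complex_text):
    # Placeholder for an actual simplification model (e.g. an LLM call);
    # here it simply returns the input unchanged.
    return complex_text

def build_submission(source_records, team_id, task_id, method):
    # Copy each source record and attach the required prediction and run_id fields.
    run_id = f"{team_id}_{task_id}_{method}"
    submission = []
    for record in source_records:
        entry = dict(record)
        entry["prediction"] = simplify(record["complex"])
        entry["run_id"] = run_id
        submission.append(entry)
    return submission

source = [{"pair_id": "CD012520", "para_id": 0, "sent_id": 0,
           "complex": "We included seven cluster-randomised trials."}]
print(json.dumps(build_submission(source, "myteam", "task11", "baseline")))
```
      </p>
      <p>Serializing the returned list with json.dumps yields a file in the submission format shown in the examples below.</p>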
      <sec id="sec-2-8">
        <title>Submission Examples</title>
        <p>Sentence-level simplification
An example of a sentence-level text simplification (Task 1.1) submission is the following:
[{"pair_id":"CD012520","para_id":0,"sent_id":0,"complex":"We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands.","prediction":"We looked at seven studies that were done in groups. These studies involved 42,489 patients from 129 hospitals in Australia, the UK, China, and the Netherlands.","run_id":"UBOnlp_task11_gpt4o"},
{"pair_id":"CD012520","para_id":0,"sent_id":1,"complex":"Health professional participants (numbers not specified) included nursing, medical and allied health professionals.","prediction":"The group of health professionals included nurses, doctors, and other healthcare workers. The exact number of people in each group wasn't mentioned.","run_id":"UBOnlp_task11_gpt4o"},
. . .
{"pair_id":"CD012520","para_id":2,"sent_id":12,"complex":"We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low.","prediction":"We're not sure if using a special plan with many parts helps doctors follow the best advice for treating strokes, compared to not using any plan. This is because we don't have strong proof yet.","run_id":"UBOnlp_task11_gpt4o"},
. . . ]</p>
        <p>Document-level simplification
An example of a document-level text simplification (Task 1.2) submission is the following:
[{"pair_id":"CD012520","source":"Cochrane","complex":"We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands. Health professional participants (numbers not specified) included nursing, medical and allied health professionals. ... We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low.","prediction":"Researchers conducted studies in hospitals across Australia, the UK, China, and the Netherlands. They wanted to see if certain strategies could help healthcare workers provide better care. These strategies were aimed at doctors, nurses, and other health professionals. ... Overall, the evidence was not strong enough to say for sure if these strategies help healthcare workers follow best practices in treating stroke patients.","run_id":"UBOnlp_task12_gpt4o"},
. . . ]</p>
        <p>2.3.4. References
The reference simplifications are based on pairs of Cochrane abstracts and plain language summaries.
One possibility for document-level text simplification is to use the original pairs of abstracts and plain
language summaries, using only the results and conclusions sections, similar to [19]. This results in
a representative and realistic set of references and a large set of abstracts, but has the disadvantage
that there may be many differences between the abstract and plain language summary. This version is
included as simple_original.</p>
        <p>
          We can also process these as Cochrane-auto [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to ensure a high-quality sentence and paragraph
alignment. This results in a smaller set of abstracts, but clear correspondence between the content in
the pair of abstract and plain language summary. This is the main evaluation reference and included as
simple_auto.
        </p>
        <p>The reference data is also in JSON:
{
"CD012520": {
"simple_original": "Implementation interventions are designed to improve the delivery of 'evidence-based' care, which is care that has been proven in research studies to help people with a particular health condition. ... What did we find? We included seven studies that involved 42,489 acute stroke patients and an unknown number of health professionals. The studies were conducted in 129 hospitals in Australia, the UK, China and the Netherlands. ... How up to date is this evidence? This review includes papers that we identified from searching in April 2022.",
"simple_auto": "We included seven studies that involved 42,489 acute stroke patients and an unknown number of health professionals. The studies were conducted in 129 hospitals in Australia, the UK, China and the Netherlands. ... We do not know if implementation interventions delivered in acute stroke units lead to better delivery of evidence-based care."
},
. . . }</p>
        <p>To have identical ground truth and directly comparable evaluation scores over both sentence- and
document-level submissions, in particular in the single leaderboard table in the Task’s Codabench, we
decided to evaluate both types of submissions at the document level. That is, a sentence-level submission
is merged into a complete, simplified abstract, and evaluated against the plain language summary.</p>
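        <p>This merging step can be sketched as follows; a minimal illustration assuming the field names of the Source format above (the function name is hypothetical, not the track's evaluation code):</p>
        <p>
```python
from collections import defaultdict

def merge_to_documents(sentence_predictions):
    # Group sentence-level predictions by document (pair_id) and restore
    # reading order via sent_id, so each merged abstract can be scored
    # against its plain language summary at the document level.
    by_doc = defaultdict(list)
    for rec in sentence_predictions:
        by_doc[rec["pair_id"]].append((rec["sent_id"], rec["prediction"]))
    return {pair_id: " ".join(p for _, p in sorted(sents))
            for pair_id, sents in by_doc.items()}

preds = [
    {"pair_id": "CD012520", "sent_id": 1, "prediction": "Second sentence."},
    {"pair_id": "CD012520", "sent_id": 0, "prediction": "First sentence."},
]
print(merge_to_documents(preds))
```
        </p>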
        <sec id="sec-2-8-1">
          <title>2.4. Codabench</title>
          <p>Submissions were made through Codabench. Due to the differences in the setup, each task had a
designated separate competition on Codabench. The Task 1 runs were submitted at:
https://www.codabench.org/competitions/8400/ (shown in Figure 2). The Task 2 runs were submitted at:
https://www.codabench.org/competitions/8327/. Codabench greatly facilitated running the track in 2025
and provided active participants (who had also registered at Codabench) with full access to the
competition, including the submission and leaderboard pages.</p>
        </sec>
        <sec id="sec-2-8-2">
          <title>2.5. Evaluation</title>
          <p>In 2025, we emphasize large-scale automatic evaluation measures (SARI, BLEU, compression, readability)
that provide a reusable test collection. For further details on these evaluation measures for scientific
text simplification, see [ 20]. This automatic evaluation will be supplemented with a detailed human
evaluation of other aspects, essential for deeper analysis.</p>
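          <p>As an illustration, the compression and readability statistics can be approximated as below. This is a rough sketch with a naive syllable heuristic, not the track's official scoring code; libraries such as textstat implement FKGL more carefully.</p>
          <p>
```python
import re

def compression_ratio(source, output):
    # Token-level compression: output length relative to source length.
    return len(output.split()) / max(1, len(source.split()))

def count_syllables(word):
    # Rough vowel-group heuristic; a stand-in for a proper syllable counter.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Flesch-Kincaid Grade Level over naively split sentences and words:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(fkgl("The cat sat on the mat."), 2))
```
      </p>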
          <p>Almost all participants used generative models for text simplification, yet existing evaluation measures
are blind to potential hallucinations with extra or distorted content. In 2025, we will continue to provide
further analysis of ways to detect and quantify spurious content in the output, potentially corresponding
to what is informally called "hallucinations."</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Participants’ Approaches</title>
      <p>A total of 18 teams submitted 132 runs in total. In the detailed results, we only include runs without
errors that received a non-zero score.</p>
      <p>AIIRLab Largey et al. [4] submitted six runs in total for Task 1. They submitted four runs for Task 1.1
and two runs for Task 1.2. They use a range of open-source models (Mistral, LLaMA), with extensive
fine-tuning and exploration of effective prompts, for sentence- and document-level text simplification.
Special precautions against noise and unwanted output were taken. The prompt instructions were
directly focused on the desired outcome evaluation measures of the task.</p>
      <p>ASM Djoudi et al. [5] submitted 10 runs in total for Task 1. They submitted no runs for Task 1.1
and 10 runs for Task 1.2. They created an extensive set of over 3,000 simplified medical definitions
compiled from multiple public sources. A Mistral 7B model was used to detect jargon in the abstracts,
and matching simplified definitions were added to the prompt. Three open-source models (Mistral 7B,
Gemma 2-9B, Med42-v2) were used for text simplification, obtaining competitive performance.</p>
      <p>DSGT Marturi and Elwazzan [6] submitted three runs in total for Task 1. They submitted two runs for
Task 1.1 and one run for Task 1.2. The paper uses an open-source LLaMA 3.3 70b model with a few-shot
prompt approach. For sentence-level simplification, they first prompt a plan to guide the simplification
and, in a second stage, prompt the model to execute the plan at the sentence level. For document-level
simplification, they first apply a summarization prompt and, in a second stage, prompt the model to
simplify the summary. Related Task 2.3 experiments are in a separate paper [7].</p>
      <p>DUTH Arampatzis and Arampatzis [8] submitted three runs in total for Task 1. They submitted
three runs for Task 1.1 and none for Task 1.2. They use open-source models, such as FLAN-T5 and
BART-SAMSum, with a zero-shot prompt for sentence-level and document-level text simplification.</p>
      <p>EngKh (no paper) submitted two runs in total for Task 1. They submitted two runs for Task 1.1 and
none for Task 1.2.</p>
      <p>Fujitsu Agüero-Torales et al. [9] submitted 19 runs in total for Task 1. They submitted 19 runs for
Task 1.1 and none for Task 1.2. They explore an in-context learning approach, with zero- and
three-shot prompting of GPT-3.5, o4-mini, and T5-small models, in an optimized pipeline for sentence-level
scientific text simplification.</p>
      <p>LIA Gallina et al. [10] submitted 9 runs in total for Task 1. They submitted no runs for Task 1.1
and 9 runs for Task 1.2. The paper does interesting experiments with a range of open-source models
(LLaMA-4, LLaMA-3.3, Mistral-Small, Gemma2, Helsinki). They use both generic prompts and specific
guidance based on the Cochrane plain language summaries instructions. The specific instructions help
the performance of their models.</p>
      <p>Mtest (no paper) submitted two runs in total for Task 1. They submitted one run for Task 1.1 and one
run for Task 1.2.</p>
      <p>PICT Vora et al. [11] submitted two runs in total for Task 1. They submitted one run for Task 1.1 and
one run for Task 1.2. They explore an advanced pipeline to create an abstract meaning representation
of the text, focusing on lexical and phrase-level simplification, sentence-level structural simplification,
and a final T5 model for generative text simplification.</p>
      <p>RECAIDS (no paper) submitted two runs in total for Task 1. They submitted one run for Task 1.1
and one run for Task 1.2. They explore a T5 model for Tasks 1.1 and 1.2 with a straightforward T5
completion prompt, and with a model fine-tuned on each task.</p>
      <p>Scalar Dongre et al. [13] submitted 11 runs in total for Task 1. They submitted ten runs for Task 1.1
and one run for Task 1.2. They perform an interesting experiment for Task 1.1, motivated by avoiding
biomedical jargon or technical terminology. They deploy earlier generation models (BioBERT/BioBART,
GPT-2), which are considerably more efficient than current LLMs and demonstrate reasonable
performance.</p>
      <p>SINAI Collado-Montañez et al. [14] submitted four runs in total for Task 1. They submitted two runs
for Task 1.1 and two runs for Task 1.2. They use a closed-source model, GPT-4.1, in a zero-shot prompt
setting. They use tailored biomedical text simplification prompts for Tasks 1.1 and 1.2, and the model
shows high performance.</p>
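      <p>As an illustration of the zero-shot prompting style used by several teams, a simplification prompt might be templated as follows. The wording and function name are hypothetical sketches, not any team's actual prompt.</p>
      <p>
```python
def build_simplification_prompt(complex_text, level="sentence"):
    # Compose a zero-shot instruction for an LLM; the wording is illustrative only.
    unit = "sentence" if level == "sentence" else "abstract"
    return (
        f"Rewrite the following scientific {unit} in plain language for a "
        "general audience. Replace biomedical jargon with everyday words, "
        "keep all factual content, and do not add new information.\n\n"
        f"{unit.capitalize()}: {complex_text}\n\nSimplified version:"
    )

print(build_simplification_prompt("Fifteen heterogeneous trials were included."))
```
      </p>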
      <p>THM Hofmann et al. [15] submitted 22 runs in total for Task 1. They submitted 22 runs for Task
1.1 and none for Task 1.2. They devote special interest to biomedical jargon or technical terminology.
The main experiment uses five different prompts with advanced closed-source models (OpenAI and
Gemini).</p>
      <p>UBO Vendeville et al. [16] submitted 12 runs in total for Task 1. They submitted five runs for Task 1.1
and seven runs for Task 1.2. The submissions were mostly test submissions, which were not discussed
in detail in the paper.</p>
      <p>UM-FHS Kocbek and Stiglic [17] submitted 9 runs in total for Task 1. They submitted four runs for Task
1.1 and five runs for Task 1.2. They utilize closed-source models (GPT-4.1 standard, mini, and nano) in
a zero-shot prompt setup with detailed prompts for Tasks 1.1 and 1.2. They also explore the value of
fine-tuning the smaller models.</p>
      <p>
        UvA Papandreou et al. [18] submitted 14 runs in total for Task 1. They submitted five runs for Task 1.1
and nine runs for Task 1.2. They submitted Cochrane-auto [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] trained BART models. These were
either operating at the sentence level, including a plan-guided version, for Task 1.1, or the paragraph or
document level for Task 1.2. They also experimented with jargon detection trained on MedReadMe [21]
to create a jargon-aware prompt for a LLaMA 3.1-8b model, for both tasks.
      </p>
      <p>Unknown team (no paper) submitted two runs in total for Task 1. They submitted two runs for Task 1.1
and none for Task 1.2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section details the task results for the sentence- and document-level text simplification subtasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Task 1.1: Sentence-level Scientific Text Simplification</title>
        <p>The main evaluation concerns the 37 abstracts, with 587 sentences aligned identically to the way
Cochrane-auto and other collections are aligned. In this track overview paper, we decided to evaluate
all submissions in Task 1.1 and Task 1.2 at the document level to ensure identical ground truth and
comparable scores across tasks.</p>
        <p>Table 3 shows the Task 1.1 (sentence-level text simplification) results. The table is restricted to
submissions without issues, and we show a maximum of five runs per team. We show several evaluation
scores against the human reference simplifications, particularly SARI and BLEU. In addition, we provide
additional text statistics on the system output, such as FKGL, and compare them to the source input.</p>
        <p>We make a number of observations. First, the table is sorted on SARI, the primary automatic text
simplification measure used in the track. We observe SARI scores above 30% for almost all systems
and above 40% for the top-scoring systems. This high overlap with the plain language reference
simplifications is encouraging, and it indicates that the effectiveness of text simplification approaches,
traditionally trained on youth news reading corpora like Newsela, also extends to scientific text.</p>
        <p>Second, in terms of the level of text complexity, readability measures like FKGL provide a rough
indicator of lexical and grammatical complexity. The original sentences have an FKGL of 13-14
corresponding to university-level text, and most systems reduce this to an FKGL of 11-12 corresponding to
the exit level of compulsory education. This is an encouraging result, as it indicates that the scientific
text simplification approach can be a viable approach to lower the textual complexity of scientific
text toward the range acceptable by a layperson. Although this indicator is positive, this approximate
measure does not consider terminological complexities.</p>
        <p>Third, the table includes various other scores that indicate that there is still considerable room for
improvement in scientific text simplification. Throughout the table, the BLEU evaluation measure
remains very low. It leads to a different ranking of systems, with some of the best systems on BLEU
demonstrating superior overlap with the human reference simplifications. The table also reveals some
runs with very high “compression” ratios, sentence splits, and high proportions of additions. While
evaluation measures like SARI are essential for understanding important aspects of text simplification
output quality, they are also known to be relatively insensitive to content outside the intersection of
manual text simplifications. Hence, high levels of content insertion can still lead to favorable SARI
scores and even improve text statistics like FKGL without conveying key content of the original text.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 1.2: Document-level Scientific Text Simplification</title>
        <p>Table 4 shows the results of Task 1.2 (document-level text simplification). Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see
similar encouraging performance levels again when evaluating against the plain language reference
simplifications. In earlier years of the track, this mainly resulted from using proven sentence-level text
simplification models with the output merged back into the entire abstract. However, this year, we see
almost exclusively large language models applied to the lengthy source abstract as a whole. This is a
clear sign of the remarkable progress in models for text simplification and other complex NLP tasks.
Second, there remains room for improvement in capturing the human simplifications more closely,
as the BLEU score remains low throughout. Here, the more conservative approaches seem to obtain
better scores. For scientific text simplification, we aim for a careful balance between simplicity and
accuracy, and being conservative is a key strength to avoid unnecessary and potentially inaccurate
changes. Third, we see less extreme values on the other indicators, but still considerable variation in
the compression ratio and number of splits, and proportions of additions and deletions. Generally, we
see more compression and deletions, indicating summarization aspects such as reducing the number of
sentences, which happens frequently.</p>
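        <p>These surface statistics are straightforward to compute directly from source-prediction pairs. A minimal sketch under common token-level definitions (the track's own statistics may differ in tokenization and exact definitions):</p>

```python
import re

def text_stats(source: str, prediction: str) -> dict:
    tokenize = lambda t: re.findall(r"\w+", t.lower())
    sentences = lambda t: [s for s in re.split(r"[.!?]+", t) if s.strip()]
    src, pred = tokenize(source), tokenize(prediction)
    src_set, pred_set = set(src), set(pred)
    return {
        # length of output relative to input
        "compression_ratio": len(pred) / len(src),
        # output sentences per input sentence (>1 indicates sentence splitting)
        "splits": len(sentences(prediction)) / len(sentences(source)),
        # fraction of output token types not present in the source (additions)
        "additions": len(pred_set - src_set) / len(pred_set),
        # fraction of source token types absent from the output (deletions)
        "deletions": len(src_set - pred_set) / len(src_set),
    }

stats = text_stats(
    "Myocardial infarction incidence decreased significantly.",
    "Heart attacks became less common. The drop was significant.",
)
```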
        <p>It is encouraging to see solid performance for the approaches that perform text simplification on
the entire abstract in one pass. This holds the promise to incorporate the discourse structure, use
more complex text simplification operations such as deletions and merges, and deploy planner-based
approaches to the text simplification of long documents. Traditional sentence-level simplification
approaches and earlier evaluation data cannot capture these aspects. This demonstrates the value of
the new test collections constructed during the CLEF 2025 SimpleText track.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on Plain Language Summaries</title>
        <p>In this section, we provide additional evaluation on the larger set of 217 abstracts with 4,293 source
sentences paired with 217 plain language summaries with 3,641 sentences. Unlike the subset discussed
above, high-quality sentence alignment is not possible for this data. However, our primary interest is in
document-level text simplification and evaluation, and our analysis explores the value of using parallel
text directly as evaluation.</p>
        <p>Table 5 shows the results of Task 1.1 (sentence-level text simplification) against a larger set of 217
abstracts and plain language summaries without further alignment. Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team. Note again that all
submissions in Task 1.1 and Task 1.2 are evaluated at the document level, to ensure identical ground
truth and comparable scores across tasks.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see again
similar encouraging performance levels when evaluating against the larger set of plain language
reference simplifications. The ranking in Table 5 is similar to the subset of Table 3 before, with some
notable shifts and upsets, particularly for runs with a low BLEU score, but overall high agreement.
Second, we see relatively low BLEU scores again, and even considerably lower than before. This is
partly a result of the less clear source to reference alignment at the sentence and paragraph level for this
larger set of references. But it also shows that document-level text simplification is a challenging task,
even for current advanced models. Third, this also indicates that real-world plain language summaries
are far removed from direct sentence-level simplifications. It also suggests that more conservative
approaches, which may be desirable from an accuracy point of view, fail to capture the complex plain
language adaptations.</p>
        <p>Table 6 shows the results of Task 1.2 (document-level text simplification) against a larger set of 217
abstracts and plain language summaries without further alignment. Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see
similar encouraging performance levels again when evaluating against the plain language reference
simplifications. The tables show some swaps and upset, but generally good agreement between Table 6
and Table 4 shown before. One exception seems to be closed-source models, such as GPT-4, which
perform less impressively on the larger set of plain language summaries. Second, the BLEU score
remains low throughout again, and notably lower than on the subset of Table 4. This seems to be a
result of the greater variation and discourse changes in the plain language summaries. However, this
also immediately suggests that this is not yet captured well by the predictions of advanced NLP models
for text simplification. Third, we see less extreme values on the other indicators for document-level
text simplification approaches. The fraction of deletions remains very high throughout all systems.
Interestingly, the better-scoring systems also seem to have more insertions. This can be an indication
that some systems are finding valuable content to insert, such as explanations of jargon or other
specialized terminology.</p>
      </sec>
      <sec id="sec-4-findings">
        <title>4.4. Findings</title>
        <p>This concludes the results for the CLEF 2025 SimpleText Task 1: Text Simplification (simplify
scientific text). Our main findings are the following. First, our analysis compared the results over the
carefully sentence-aligned abstracts in Table 3 and Table 4 with the larger unfiltered set of
document-level aligned abstracts in Table 5 and Table 6. It is encouraging to see the broad agreement in the
ranking over both sets, as this suggests that evaluation and training on document-aligned texts is a
viable option. Similar to how machine translation was able to scale up due to the availability of parallel
texts, this can help scale up text simplification by increasing the number of available corpora. Second,
this also shifts the focus of the field of text simplification beyond the traditional aspects of lexical and
grammatical simplification and introduces new and interesting aspects. Examples include dealing with
the discourse structure, the particular background knowledge needed to understand the text, and
avoiding or explaining jargon or specialized terminology. Third, while the results are encouraging and the
submitted predictions are generally of high quality compared to some years ago, there also remains
clear room for improvement, in particular when dealing with the scientific vernacular and specific
biomedical jargon. This demonstrates the value of the new test collections constructed during the
CLEF 2025 SimpleText track.</p>
      </sec>
      <sec id="sec-5-analysis">
        <title>5. Analysis</title>
        <p>This section details further analysis of the submissions to the track. We focus in particular on a
sentence-level evaluation of the Task 1.1 submissions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.1. Task 1.1: Sentence-level Scientific Text Simplification</title>
        <p>
As detailed above, we made particular efforts in Cochrane-auto [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to ensure alignment at the document,
paragraph, and sentence level. Hence, we have Cochrane-auto aligned references for 37 abstracts, with
a total of 388 source sentences, carefully aligned with 363 sentences in the plain language summaries
due to deletions.
        </p>
        <p>Table 7 shows the evaluation of the Task 1.1 submissions against the aligned sentence-level references.
We make several observations. First, we see solid agreement between the sentence-level evaluation in
Table 7 and the earlier document-level evaluation of the same runs in Table 3. This is not surprising
since both use the same ground truth references, but it still adds to the confidence in the evaluation
setup using aligned data at scale. Second, we see the same divergence between reasonable SARI scores
and relatively low BLEU scores that seem to favor more conservative approaches. Many of the high
BLEU scoring approaches also have a relatively high fraction of exact copies. This is interesting, as a
conservative approach feels appropriate for the scientific text simplification use case. Conservative
approaches may promote accuracy, faithfulness, and correctness of the simplifications. This may be a
desirable approach even if the readability and accessibility are not as high. Third, at the fine-grained
sentence level over a large set of sentences, the text statistics give a more detailed and informative
picture of the text simplification approaches. At the sentence level, we see varying and even high
fractions of copies for more conservative methods. We see quite a significant variation on almost every
indicator, highlighting great differences between the approaches, despite the relatively similar SARI
scores.</p>
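        <p>As an illustration, the fraction of exact copies can be computed directly over aligned sentence pairs; the sketch below normalizes only whitespace and case, which may differ from the exact criterion used for the reported statistics:</p>

```python
def copy_fraction(sources: list[str], predictions: list[str]) -> float:
    # A prediction counts as a copy if it matches its aligned source sentence
    # after trimming extra whitespace and lowercasing.
    norm = lambda s: " ".join(s.lower().split())
    copies = sum(norm(s) == norm(p) for s, p in zip(sources, predictions))
    return copies / len(sources)

srcs = ["The cohort was randomized.", "Adverse events were rare."]
preds = ["The cohort was randomized.", "Side effects were uncommon."]
assert copy_fraction(srcs, preds) == 0.5  # one of two sentences copied verbatim
```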
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion and Conclusions</title>
      <p>This paper describes the setup of the CLEF 2025 SimpleText track, which contains the following three
tasks. Task 1 on Text Simplification: simplify scientific text. Task 2 on Controlled Creativity: identify
and avoid hallucination. Task 3 on SimpleText 2024 Revisited: selected tasks by popular request. These
tasks address some of today’s main NLP/IR challenges. This Task overview focuses on the CLEF 2025
SimpleText Track’s Task 1 on sentence-level and document-level text simplification. The main aim
of our track, and the CLEF evaluation forum as a whole, is i) to construct corpora and evaluation
resources to stimulate research on scientific text summarization and simplification, and ii) to foster a
community of IR, NLP, and AI researchers working together on the important task of making science
more accessible for everyone.</p>
      <p>Within the CLEF 2025 SimpleText Task 1, we have constructed extensive corpora and new references
for evaluation data. First, we pushed the research frontier in text simplification by creating new
scientific text simplification corpora for biomedical literature. We focused on true paragraph-level and
document-level simplification with greater variation and took the complex discourse structure into
account. This fits current models such as LLMs, which operate on long input. Second, the
document-level text simplification corpus created at CLEF 2025 is a major advance for the field, as earlier data
was typically based on direct human simplifications at the sentence level. As a result, sentence-level
text simplification approaches were very effective and typically outcompeted true document-level
approaches, even though models have long been able to cope with long contexts. In 2025, we saw for the
first time that document-level text simplification approaches clearly outcompeted sentence-level text
simplification. Third, the move to the biomedical domain presents many important challenges to current
technology. The abstracts can be quite long, with a complex discourse structure, and the plain language
summaries avoid or explain medical jargon and provide additional background information to make
the key points of the scientific abstracts understandable for consumers. Several teams experimented
with specific approaches and models for the health and biomedical domain. In particular, several teams
explored novel ways to guide the model into producing output that addresses barriers consumers face
when directly accessing the biomedical literature. These advances hold the promise to automatically
provide plain language summaries of biomedical literature, and thereby greatly enhance the scope and
impact of authoritative health information.</p>
      <p>These reusable corpora and evaluation resources are available to participants and other researchers
who want to work on the important problem of making scientific information open and easily
accessible for everyone. In terms of building a community researching scientific text summarization and
simplification, the track saw record attendance in 2025: even with significant changes in the tasks and
the move to Codabench, more runs were submitted, and by the largest number of participating teams
ever.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We are incredibly thankful to the master’s students in translation and technical writing from the
University of Brest for participating in data annotation. We also thank each of the individual track
participants for their effort in submitting a record number of runs to Codabench and documenting
these in their papers.</p>
      <p>We thank the CLEF 2025 chairs for hosting us, and the CLEF 2025 Labs and Proceedings chairs for
their excellent assistance and flexibility. It is heartwarming to be part of such a great CLEF family.
We thank Codabench [22] for hosting the competition. Post-competition experiments are ongoing at
https://www.codabench.org/competitions/8400/ (Task 1.1, Task 1.2, and Task 2.3) and https://www.
codabench.org/competitions/8327/ (Task 2.1 and Task 2.2). We hope and expect that these “living test
collections” remain in active use until the next iteration of the track.</p>
      <p>Benjamin Vendeville and Liana Ermakova are partly funded by the French National Research Agency
(ANR-22-CE23-0019-01, Automatic Simplification of Scientific Texts). Liana Ermakova is further supported
by the CNRS research group MaDICS (https://www.madics.fr/ateliers/simpletext/).</p>
      <p>Jan Bakker and Jaap Kamps are partly funded by the Netherlands Organization for Scientific Research
(NWO NWA # 1518.22.105). Jaap Kamps is further supported by (NWO CI # CISC.CC.016), the University
of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this
paper are not necessarily shared or endorsed by those funding the research.</p>
    </sec>
    <sec id="sec-7">
      <title>Disclosure of Interests</title>
      <p>The authors have no competing interests to declare that are relevant to the content of this article.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar
and spelling checking and for paraphrasing and rewording. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
      <sec id="sec-9">
        <title>References</title>
        <p>[3] B. Vendeville, J. Bakker, H. Azarbonyad, L. Ermakova, J. Kamps, Overview of the CLEF 2025 SimpleText Task 2: Identify and Avoid Hallucination, in: [23], 2025.</p>
        <p>[4] N. Largey, D. Wu, B. Mansouri, AIIRLab Systems for CLEF 2025 SimpleText: Cross-Encoders to Avoid Spurious Generation, in: [23], 2025.</p>
        <p>[5] A. N. Djoudi, S. Nouali, M. Aabid, I. Badache, A.-G. Chifu, P. Bellot, LIS at the SimpleText 2025: Enhancing Scientific Text Accessibility with LLMs and Retrieval-Augmented Generation, in: [23], 2025.</p>
        <p>[6] K. C. Marturi, H. H. Elwazzan, Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText, in: [23], 2025.</p>
        <p>[7] K. C. Marturi, H. H. Elwazzan, LLM-Guided Planning and Summary-Based Scientific Text Simplification: DS@GT at CLEF 2025 SimpleText, in: [23], 2025.</p>
        <p>[8] G. Arampatzis, A. Arampatzis, DUTH at CLEF 2025 SimpleText Track: Tackling Scientific Text Simplification and Hallucination Detection, in: [23], 2025.</p>
        <p>[9] M. M. Agüero-Torales, C. Rodríguez-Abellán, C. A. C. Moraga, Sentence-level Scientific Text Simplification With Just a Pinch of Data, in: [23], 2025.</p>
        <p>[10] Y. Gallina, T. Jiménez, S. Huet, University of Avignon at SimpleText 2025: Guided Medical Abstract Simplification, in: [23], 2025.</p>
        <p>[11] A. Vora, T. Chaudhari, S. Hotha, S. Sonawane, S-3 Pipeline by PICT/Pune for Biomedical Text Simplification, in: [23], 2025.</p>
        <p>[12] S. Eugin, A. Ms.Beula, V. Sathvikha, V. Sangamithra, SimpleText: Simplify Scientific Text, in: [23], 2025.</p>
        <p>[13] A. A. Dongre, A. Vaadiraaju, A. K. Madasamy, NITK SCaLAR Lab at the CLEF 2025 SimpleText Track: Transformer-Based Models for Biomedical Sentence Simplification (Task 1.1), in: [23], 2025.</p>
        <p>[14] J. Collado-Montañez, J. A. Ortiz-Zambrano, C. Espin-Riofrio, A. Montejo-Ráez, SINAI in SimpleText CLEF 2025: Simplifying Biomedical Scientific Texts and Identifying Hallucinations Using GPT-4.1 and Pattern Detection, in: [23], 2025.</p>
        <p>[15] N. Hofmann, J. Dauenhauer, N. O. Dietzler, I. D. Idahor, C. K. Kreutz, THM@SimpleText 2025 Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts, in: [23], 2025.</p>
        <p>[16] B. Vendeville, L. Ermakova, P. D. Loor, J. Kamps, UBONLP Report on the SimpleText lab, in: [23], 2025.</p>
        <p>[17] P. Kocbek, G. Stiglic, UM-FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification, in: [23], 2025.</p>
        <p>[18] T. Papandreou, J. Bakker, J. Kamps, University of Amsterdam at the CLEF 2025 SimpleText Track, in: [23], 2025.</p>
        <p>[19] A. Devaraj, I. Marshall, B. Wallace, J. J. Li, Paragraph-level simplification of medical texts, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4972–4984. URL: https://aclanthology.org/2021.naacl-main.395/. doi:10.18653/v1/2021.naacl-main.395.</p>
        <p>[20] D. Davari, L. Ermakova, R. Krestel, Comparative analysis of evaluation measures for scientific text simplification, in: A. Antonacopoulos, A. Hinze, B. Piwowarski, M. Coustaty, G. M. Di Nunzio, F. Gelati, N. Vanderschantz (Eds.), Linking Theory and Practice of Digital Libraries - 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024, Ljubljana, Slovenia, September 24-27, 2024, Proceedings, Part I, volume 15177 of Lecture Notes in Computer Science, Springer, 2024, pp. 76–91. URL: https://doi.org/10.1007/978-3-031-72437-4_5. doi:10.1007/978-3-031-72437-4_5.</p>
        <p>[21] C. Jiang, W. Xu, MedReadMe: A systematic study for fine-grained sentence readability in medical domain, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 17293–17319. URL: https://aclanthology.org/2024.emnlp-main.958/. doi:10.18653/v1/2024.emnlp-main.958.</p>
        <p>[22] Z. Xu, S. Escalera, A. Pavão, M. Richard, W. Tu, Q. Yao, H. Zhao, I. Guyon, Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform, Patterns 3 (2022) 100543. URL: https://doi.org/10.1016/j.patter.2022.100543. doi:10.1016/j.patter.2022.100543.</p>
        <p>[23] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>J.</given-names> <surname>Bakker</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>,
          <article-title>Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts</article-title>,
          in: M. Shardlow, H. Saggion, F. Alva-Manchego, M. Zampieri, K. North, S. Štajner, R. Stodden (Eds.),
          <source>Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)</source>,
          Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>, pp.
          <fpage>41</fpage>-<lpage>51</lpage>.
          URL: https://aclanthology.org/2024.tsar-1.5/. doi:10.18653/v1/2024.tsar-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 SimpleText track: Simplify scientific texts (and nothing more)</article-title>
          , in: J. Carrillo de Albornoz,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>