<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2025 SimpleText Task 1: Simplify Scientific Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Bakker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Vendeville</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents an overview of the CLEF 2025 SimpleText Task 1 on Text Simplification. The task aims to simplify scientific text. We discuss the data and benchmarks provided for these tasks, along with preliminary insights and anticipated challenges. Our main findings are the following. First, we advanced the field of text simplification by creating new biomedical corpora that support true paragraph- and document-level simplification, capturing greater variation and complex discourse suited to LLMs. Second, our CLEF 2025 document-level corpus showed for the first time that document-level simplification models clearly outperformed sentence-level methods. Third, in addressing the biomedical domain, teams developed novel approaches to generate plain language summaries that overcome key barriers for consumers, enhancing accessibility to authoritative health information. More generally, we hope and expect that the constructed corpora and evaluation data will be used by researchers to further advance text simplification approaches, both in general and specifically for the biomedical domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific text simplification</kwd>
        <kwd>Biomedical AI</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Information access</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The CLEF 2025 SimpleText track consists of three tasks:</p>
      <p>• Task 1: Text Simplification, which aims to simplify scientific text.
• Task 2: Controlled Creativity, which aims to identify and avoid hallucination.
• Task 3: SimpleText 2024 Revisited, offering selected tasks by popular request.</p>
      <p>
        This paper gives an overview of the CLEF 2025 SimpleText Task 1 on Text Simplification, which aims to
simplify scientific text. Further detail on the entire track is in the CLEF 2025 SimpleText Track Overview
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Additional details on Task 2 on Controlled Creativity are in a companion Task 2 overview paper [3].
We also refer to the respective participants’ papers for further details.
      </p>
      <p>A total of 74 teams registered for our SimpleText track at CLEF 2025. A total of 18 teams submitted
198 runs in total for Tasks 1 and 2. The statistics for the submitted runs are presented in Table 1.
However, some runs had problems that we could not resolve. We do not detail them in the rest of the
paper and leave out the 0-scoring runs. More details about individual runs and experiments can be
found in the participants’ papers, also shown in Table 1.</p>
      <p>The rest of this paper is structured in the following way. Section 2 describes the task, the data,
the format, and the evaluation measures. Section 3 describes the participants’ approaches. Section 4
provides detailed results for the task. Section 5 provides further analysis of the results. We end with a
discussion and conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Simplify Scientific Text</title>
      <sec id="sec-2-1">
        <title>This section details Task 1: Text Simplification, on simplifying scientific text.</title>
        <sec id="sec-2-1-1">
          <title>2.1. Description</title>
          <p>
            The Text Simplification task aims to simplify scientific text. We created a new CLEF 2025 SimpleText
corpus based on biomedical literature abstracts and lay summaries from Cochrane systematic reviews,
called Cochrane-auto [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. An example is shown in Figure 1. This corpus was created by closely following
the Cochrane-auto alignment procedure.
          </p>
          <p>Table 1 includes submissions in the Tasks 1 and 2 Codabench evaluation platform, where we were
privileged to have 29 (Task 1) and 13 (Task 2) participants.</p>
          <p>Complex paragraph
Fifteen heterogeneous trials, involving 1022 adults with dorsally displaced and potentially or evidently unstable distal
radial fractures, were included. While all trials compared external fixation versus plaster cast immobilisation, there was
considerable variation especially in terms of patient characteristics and interventions. Methodological weaknesses among
these trials included lack of allocation concealment and inadequate outcome assessment.</p>
          <p>Simple paragraph
Fifteen trials, involving 1022 adults with potentially or evidently unstable fractures, were included. While all trials compared
external fixation versus plaster cast immobilisation, there was considerable variation in their characteristics especially in
terms of patient characteristics and the method of external fixation.</p>
          <p>Task Description This is the core NLP task of the track, and we continue with both sentence-level
(Task 1.1) and document-level (Task 1.2) scientific text simplification. The main innovation is the very
large new corpus we constructed in 2024, and the shift to the biomedical domain.</p>
          <p>2.2. Data
As discussed above, we constructed a large scientific text simplification corpus, based on realigning
abstracts and lay summaries at scale at the sentence, paragraph, and document levels. In 2025, we
use this Cochrane-auto corpus as the training data.</p>
          <p>Train data The specific train data for Task 1 consists of 1,085 documents, 4,171 paragraphs, and 14,719
sentences, with paired content from the abstract and the plain language summary. While the track
distinguishes only between sentence- and document-level text simplification, paragraph identifiers
for the sentence input are included, also allowing paragraph-level text simplification submissions
to Task 1.2.</p>
          <p>Test data The primary test data consists of 217 new Cochrane abstracts with paired plain English
summaries, composed of 4,293 source sentences.</p>
          <p>These are new systematic reviews published by Cochrane over the last year. We process these
paired abstracts and plain language summaries in two different ways.</p>
          <p>
            • We process these as Cochrane-auto [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] to ensure a high-quality sentence and paragraph
alignment. This results in a subset of 37 abstracts and 587 sentences, paired with 37 plain
language summaries with 388 sentences. The processing is identical to Cochrane-auto and
other text simplification data sets.
• For document-level text simplification, we can also use the original pairs of abstracts and
plain language summaries, using only the results and conclusions sections, similar to [19].
This results in 217 abstracts with 4,293 source sentences, paired with 217 plain language
summaries with 3,641 sentences.
          </p>
          <p>We use the aligned subset as the main evaluation and also report the scores over the whole set.</p>
          <p>Analysis data For further analysis, we extended the test data with the Cochrane-auto validation and
test splits (part of the train data), Medline abstracts for which TREC PLABA references exist,
and SimpleText 2024 abstracts for which we have references. The combined test file, including
additional data sources, contains 666 documents with 9,160 sentences.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Formats</title>
          <p>
            2.3.1. Train data
This section outlines the format of the data used in the CLEF 2025 SimpleText Task 1.
The training data used was from the Cochrane-auto paper [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] as published on GitHub: https://github.com/JanB100/cochrane-auto/tree/main/data.
This data format follows the earlier Wiki-auto and Newsela-auto corpora and includes additional
fields for statistics. It also includes the specific transformation (copy, rephrase, split, merge,
delete) for each sentence in the source abstract.
          </p>
          <p>
            The training data already has train, validation, and test splits with references. We also included
the validation and test splits of the training data [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] in the combined source data test set, so we also
collected the predictions of each system on the training data.
          </p>
          <p>2.3.2. Sources</p>
          <p>Sentence-level simplification
The source data is in a JSON format with the following fields:
1. pair_id: Unique ID for the complex-simple document pair.
2. para_id: Index of the source paragraph in the document.
3. sent_id: Index of the source sentence in the document.
4. complex: Complex sentence from the source document.</p>
          <p>This format is a simplified version of the Cochrane-auto format in JSON. The combination of the pair_id
(the id of the document or abstract) and the sent_id (the sentence order) identifies each sentence
unambiguously. As the abstracts can be quite lengthy, we retained the paragraph identifier. This third
identifier enables paragraph-level text simplification approaches (submitted to Task 1.2).</p>
          <p>An example of the Task 1.1 JSON source input is:
[{
"pair_id": "CD012520",
"para_id": 0,
"sent_id": 0,
"complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands."
},
{
"pair_id": "CD012520",
"para_id": 0,
"sent_id": 1,
"complex": "Health professional participants (numbers not specified) included nursing, medical and allied health professionals."
},
. . .
{
"pair_id": "CD012520",
"para_id": 2,
"sent_id": 12,
"complex": "We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low."
}, . . . ]</p>
          <p>Document-level simplification
The source data is in a JSON format with the following fields:
1. pair_id: Unique ID for the complex-simple document pair.
2. source: The origin of the data (only for reference).
3. complex: The complex document’s content (all sentences).</p>
          <p>This is again a simplified version of the Cochrane-auto format in JSON. The new Cochrane-auto data is
the main evaluation of the task in 2025. However, the test input files also included the validation and
train splits of the training data, as well as other scientific abstracts from PubMed and the earlier CLEF
2024 SimpleText corpus. The source field indicates the source of the abstracts. This enables comparative
performance analysis against other existing text simplification corpora, in addition to the track’s official
evaluation scores on the new set of Cochrane-auto abstracts.</p>
          <p>An example of the Task 1.2 JSON source input is:
[{
"pair_id": "CD012520",
"source": "Cochrane",
"complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands. Health professional participants (numbers not specified) included nursing, medical and allied health professionals. ... We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low."
}, . . . ]</p>
          <p>In this case, there are 13 sentences (Task 1.1) corresponding to the Cochrane document (Task 1.2).</p>
          <p>2.3.3. Predictions
As output, participants were asked to add a column to the input data with their simplified sentences
or abstracts (prediction) as well as a unique identifier for their run. So, in addition to the fields in the
Source format above, submissions included the following fields:
• prediction: Your simplification of the complex sentence with pair_id and sent_id (for sentence-level
simplification), or of the complex document with pair_id (for document-level simplification).
• run_id: &lt;team-id&gt;_&lt;task-id&gt;_&lt;method-used&gt;.</p>
        </sec>
      </sec>
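      <p>As a concrete illustration, the prediction and run_id fields described above can be attached to the source records with a few lines of code. The sketch below is a hypothetical example, not part of the track's tooling: the simplify placeholder and the team/method names are invented for illustration.</p>
      <p>
```python
import json

def simplify(complex_text):
    # Placeholder for an actual simplification model (e.g. an LLM call);
    # here it simply returns the input unchanged.
    return complex_text

def build_submission(source_records, team_id, task_id, method):
    # Copy each source record and attach the required prediction and run_id fields.
    run_id = f"{team_id}_{task_id}_{method}"
    submission = []
    for record in source_records:
        entry = dict(record)
        entry["prediction"] = simplify(record["complex"])
        entry["run_id"] = run_id
        submission.append(entry)
    return submission

source = [{"pair_id": "CD012520", "para_id": 0, "sent_id": 0,
           "complex": "We included seven cluster-randomised trials."}]
print(json.dumps(build_submission(source, "myteam", "task11", "baseline")))
```
      </p>
      <p>Serializing the returned list with json.dumps yields a file in the submission format shown in the examples below.</p>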
      <sec id="sec-2-8">
        <title>Submission Examples</title>
        <p>Sentence-level simplification
An example of a sentence-level text simplification (Task 1.1) submission is the following:
[{"pair_id":"CD012520","para_id":0,"sent_id":0,"complex":"We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands.","prediction":"We looked at seven studies that were done in groups. These studies involved 42,489 patients from 129 hospitals in Australia, the UK, China, and the Netherlands.","run_id":"UBOnlp_task11_gpt4o"},
{"pair_id":"CD012520","para_id":0,"sent_id":1,"complex":"Health professional participants (numbers not specified) included nursing, medical and allied health professionals.","prediction":"The group of health professionals included nurses, doctors, and other healthcare workers. The exact number of people in each group wasn't mentioned.","run_id":"UBOnlp_task11_gpt4o"},
. . .
{"pair_id":"CD012520","para_id":2,"sent_id":12,"complex":"We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low.","prediction":"We're not sure if using a special plan with many parts helps doctors follow the best advice for treating strokes, compared to not using any plan. This is because we don't have strong proof yet.","run_id":"UBOnlp_task11_gpt4o"},
. . . ]</p>
        <p>Document-level simplification
An example of a document-level text simplification (Task 1.2) submission is the following:
[{"pair_id":"CD012520","source":"Cochrane","complex":"We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands. Health professional participants (numbers not specified) included nursing, medical and allied health professionals. ... We are uncertain whether a multifaceted implementation intervention compared to no intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of evidence is very low.","prediction":"Researchers conducted studies in hospitals across Australia, the UK, China, and the Netherlands. They wanted to see if certain strategies could help healthcare workers provide better care. These strategies were aimed at doctors, nurses, and other health professionals. ... Overall, the evidence was not strong enough to say for sure if these strategies help healthcare workers follow best practices in treating stroke patients.","run_id":"UBOnlp_task12_gpt4o"},
. . . ]</p>
        <p>2.3.4. References
The reference simplifications are based on pairs of Cochrane abstracts and plain language summaries.
One possibility for document-level text simplification is to use the original pairs of abstracts and plain
language summaries, using only the results and conclusions sections, similar to [19]. This results in
a representative and realistic set of references and a large set of abstracts, but has the disadvantage
that there may be many differences between the abstract and plain language summary. This version is
included as simple_original.</p>
        <p>
          We can also process these as Cochrane-auto [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to ensure a high-quality sentence and paragraph
alignment. This results in a smaller set of abstracts, but clear correspondence between the content in
the pair of abstract and plain language summary. This is the main evaluation reference and included as
simple_auto.
        </p>
        <p>The reference data is also in JSON:
{
"CD012520": {
"simple_original": "Implementation interventions are designed to improve the delivery of 'evidence-based' care, which is care that has been proven in research studies to help people with a particular health condition. ... What did we find? We included seven studies that involved 42,489 acute stroke patients and an unknown number of health professionals. The studies were conducted in 129 hospitals in Australia, the UK, China and the Netherlands. ... How up to date is this evidence? This review includes papers that we identified from searching in April 2022.",
"simple_auto": "We included seven studies that involved 42,489 acute stroke patients and an unknown number of health professionals. The studies were conducted in 129 hospitals in Australia, the UK, China and the Netherlands. ... We do not know if implementation interventions delivered in acute stroke units lead to better delivery of evidence-based care."
},
. . . }</p>
        <p>To have identical ground truth and directly comparable evaluation scores over both sentence- and
document-level submissions, in particular in the single leaderboard table in the Task’s Codabench, we
decided to evaluate both types of submissions at the document level. That is, a sentence-level submission
is merged into a complete, simplified abstract, and evaluated against the plain language summary.</p>
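        <p>This merging step can be sketched as follows; a minimal illustration assuming the field names of the Source format above (the function name is hypothetical, not the track's evaluation code):</p>
        <p>
```python
from collections import defaultdict

def merge_to_documents(sentence_predictions):
    # Group sentence-level predictions by document (pair_id) and restore
    # reading order via sent_id, so each merged abstract can be scored
    # against its plain language summary at the document level.
    by_doc = defaultdict(list)
    for rec in sentence_predictions:
        by_doc[rec["pair_id"]].append((rec["sent_id"], rec["prediction"]))
    return {pair_id: " ".join(p for _, p in sorted(sents))
            for pair_id, sents in by_doc.items()}

preds = [
    {"pair_id": "CD012520", "sent_id": 1, "prediction": "Second sentence."},
    {"pair_id": "CD012520", "sent_id": 0, "prediction": "First sentence."},
]
print(merge_to_documents(preds))
```
        </p>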
        <sec id="sec-2-8-1">
          <title>2.4. Codabench</title>
          <p>Submissions were made through Codabench. Due to the differences in the setup, each task had a
designated separate competition on Codabench. The Task 1 runs were submitted at:
https://www.codabench.org/competitions/8400/ (shown in Figure 2). The Task 2 runs were submitted at:
https://www.codabench.org/competitions/8327/. Codabench greatly facilitated running the track in 2025
and provided active participants (who had also registered at Codabench) with full access to the
competition, including the submission and leaderboard pages.</p>
        </sec>
        <sec id="sec-2-8-2">
          <title>2.5. Evaluation</title>
          <p>In 2025, we emphasize large-scale automatic evaluation measures (SARI, BLEU, compression, readability)
that provide a reusable test collection. For further details on these evaluation measures for scientific
text simplification, see [ 20]. This automatic evaluation will be supplemented with a detailed human
evaluation of other aspects, essential for deeper analysis.</p>
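          <p>As an illustration, the compression and readability statistics can be approximated as below. This is a rough sketch with a naive syllable heuristic, not the track's official scoring code; libraries such as textstat implement FKGL more carefully.</p>
          <p>
```python
import re

def compression_ratio(source, output):
    # Token-level compression: output length relative to source length.
    return len(output.split()) / max(1, len(source.split()))

def count_syllables(word):
    # Rough vowel-group heuristic; a stand-in for a proper syllable counter.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Flesch-Kincaid Grade Level over naively split sentences and words:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(fkgl("The cat sat on the mat."), 2))
```
      </p>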
          <p>Almost all participants used generative models for text simplification, yet existing evaluation measures
are blind to potential hallucinations with extra or distorted content. In 2025, we will continue to provide
further analysis of ways to detect and quantify spurious content in the output, potentially corresponding
to what is informally called "hallucinations."</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Participants’ Approaches</title>
      <p>A total of 18 teams submitted 132 runs in total. In the detailed results, we only include runs without
errors that received a non-zero score.</p>
      <p>AIIRLab Largey et al. [4] submitted six runs in total for Task 1. They submitted four runs for Task 1.1
and two runs for Task 1.2. They use a range of open-source models (Mistral, LLaMA), with extensive
fine-tuning and exploration of effective prompts, for sentence- and document-level text simplification.
Special precautions against noise and unwanted output were taken. The prompt instructions were
directly focused on the desired outcome evaluation measures of the task.</p>
      <p>ASM Djoudi et al. [5] submitted 10 runs in total for Task 1. They submitted no runs for Task 1.1
and 10 runs for Task 1.2. They created an extensive set of over 3,000 simplified medical definitions
compiled from multiple public sources. A Mistral 7B model was used to detect jargon in the abstracts,
and matching simplified definitions were added to the prompt. Three open-source models (Mistral 7B,
Gemma 2-9B, Med42-v2) were used for text simplification, obtaining competitive performance.</p>
      <p>DSGT Marturi and Elwazzan [6] submitted three runs in total for Task 1. They submitted two runs for
Task 1.1 and one run for Task 1.2. The paper uses an open-source LLaMA 3.3 70b model with a few-shot
prompt approach. For sentence-level simplification, they first prompt a plan to guide the simplification
and, in a second stage, prompt the model to execute the plan at the sentence level. For document-level
simplification, they first apply a summarization prompt and, in a second stage, prompt the model to
simplify the summary. Related Task 2.3 experiments are in a separate paper [7].</p>
      <p>DUTH Arampatzis and Arampatzis [8] submitted three runs in total for Task 1. They submitted
three runs for Task 1.1 and none for Task 1.2. They use open-source models, such as FLAN-T5 and
BART-SAMSum, with a zero-shot prompt for sentence-level and document-level text simplification.</p>
      <p>EngKh (no paper) submitted two runs in total for Task 1. They submitted two runs for Task 1.1 and
none for Task 1.2.</p>
      <p>Fujitsu Agüero-Torales et al. [9] submitted 19 runs in total for Task 1. They submitted 19 runs for
Task 1.1 and none for Task 1.2. They explore an in-context learning approach, with zero- and
three-shot prompting of GPT-3.5, o4-mini, and T5-small models, in an optimized pipeline for sentence-level
scientific text simplification.</p>
      <p>LIA Gallina et al. [10] submitted 9 runs in total for Task 1. They submitted no runs for Task 1.1
and 9 runs for Task 1.2. The paper does interesting experiments with a range of open-source models
(LLaMA-4, LLaMA-3.3, Mistral-Small, Gemma2, Helsinki). They use both generic prompts and specific
guidance based on the Cochrane plain language summaries instructions. The specific instructions help
the performance of their models.</p>
      <p>Mtest (no paper) submitted two runs in total for Task 1. They submitted one run for Task 1.1 and one
run for Task 1.2.</p>
      <p>PICT Vora et al. [11] submitted two runs in total for Task 1. They submitted one run for Task 1.1 and
one run for Task 1.2. They explore an advanced pipeline to create an abstract meaning representation
of the text, focusing on lexical and phrase-level simplification, sentence-level structural simplification,
and a final T5 model for generative text simplification.</p>
      <p>RECAIDS (no paper) submitted two runs in total for Task 1. They submitted one run for Task 1.1
and one run for Task 1.2. They explore a T5 model for Tasks 1.1 and 1.2 with a straightforward T5
completion prompt, and with a model fine-tuned on each task.</p>
      <p>Scalar Dongre et al. [13] submitted 11 runs in total for Task 1. They submitted ten runs for Task 1.1
and one run for Task 1.2. They perform an interesting experiment for Task 1.1, motivated by avoiding
biomedical jargon or technical terminology. They deploy earlier generation models (BioBERT/BioBART,
GPT-2), which are considerably more efficient than current LLMs and demonstrate reasonable
performance.</p>
      <p>SINAI Collado-Montañez et al. [14] submitted four runs in total for Task 1. They submitted two runs
for Task 1.1 and two runs for Task 1.2. They use a closed-source model, GPT-4.1, in a zero-shot prompt
setting. They use tailored biomedical text simplification prompts for Tasks 1.1 and 1.2, and the model
shows high performance.</p>
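      <p>As an illustration of the zero-shot prompting style used by several teams, a simplification prompt might be templated as follows. The wording and function name are hypothetical sketches, not any team's actual prompt.</p>
      <p>
```python
def build_simplification_prompt(complex_text, level="sentence"):
    # Compose a zero-shot instruction for an LLM; the wording is illustrative only.
    unit = "sentence" if level == "sentence" else "abstract"
    return (
        f"Rewrite the following scientific {unit} in plain language for a "
        "general audience. Replace biomedical jargon with everyday words, "
        "keep all factual content, and do not add new information.\n\n"
        f"{unit.capitalize()}: {complex_text}\n\nSimplified version:"
    )

print(build_simplification_prompt("Fifteen heterogeneous trials were included."))
```
      </p>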
      <p>THM Hofmann et al. [15] submitted 22 runs in total for Task 1. They submitted 22 runs for Task
1.1 and none for Task 1.2. They devote special interest to biomedical jargon or technical terminology.
The main experiment uses five different prompts with advanced closed-source models (OpenAI and
Gemini).</p>
      <p>UBO Vendeville et al. [16] submitted 12 runs in total for Task 1. They submitted five runs for Task 1.1
and seven runs for Task 1.2. The submissions were mostly test submissions, which were not discussed
in detail in the paper.</p>
      <p>UM-FHS Kocbek and Stiglic [17] submitted 9 runs in total for Task 1. They submitted four runs for Task
1.1 and five runs for Task 1.2. They utilize closed-source models (GPT-4.1 standard, mini, and nano) in
a zero-shot prompt setup with detailed prompts for Tasks 1.1 and 1.2. They also explore the value of
fine-tuning the smaller models.</p>
      <p>
        UvA Papandreou et al. [18] submitted 14 runs in total for Task 1. They submitted five runs for Task 1.1
and nine runs for Task 1.2. They submitted Cochrane-auto [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] trained BART models. These were
either operating at the sentence level, including a plan-guided version, for Task 1.1, or the paragraph or
document level for Task 1.2. They also experimented with jargon detection trained on MedReadMe [21]
to create a jargon-aware prompt for a LLaMA 3.1-8b model, for both tasks.
      </p>
      <p>Unknown team (no paper) submitted two runs in total for Task 1. They submitted two runs for Task 1.1
and none for Task 1.2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section details the task results for the sentence- and document-level text simplification subtasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Task 1.1: Sentence-level Scientific Text Simplification</title>
        <p>The main evaluation concerns the 37 abstracts, with 587 sentences aligned identically to the way
Cochrane-auto and other collections are aligned. In this track overview paper, we decided to evaluate
all submissions in Task 1.1 and Task 1.2 at the document level to ensure identical ground truth and
comparable scores across tasks.</p>
        <p>Table 3 shows the Task 1.1 (sentence-level text simplification) results. The table is restricted to
submissions without issues, and we show a maximum of five runs per team. We show several evaluation
scores against the human reference simplifications, particularly SARI and BLEU. In addition, we provide
additional text statistics on the system output, such as FKGL, and compare them to the source input.</p>
        <p>We make a number of observations. First, the table is sorted on SARI, the primary automatic text
simplification measure used in the track. We observe SARI scores above 30% for almost all systems
and above 40% for the top-scoring systems. This high overlap with the plain language reference
simplifications is encouraging, and it indicates that the effectiveness of text simplification approaches,
traditionally trained on youth news reading corpora like Newsela, also extends to scientific text.</p>
        <p>Second, in terms of the level of text complexity, readability measures like FKGL provide a rough
indicator of lexical and grammatical complexity. The original sentences have an FKGL of 13-14
corresponding to university-level text, and most systems reduce this to an FKGL of 11-12 corresponding to
the exit level of compulsory education. This is an encouraging result, as it indicates that the scientific
text simplification approach can be a viable approach to lower the textual complexity of scientific
text toward the range acceptable by a layperson. Although this indicator is positive, this approximate
measure does not consider terminological complexities.</p>
        <p>Third, the table includes various other scores that indicate that there is still considerable room for
improvement in scientific text simplification. Throughout the table, the BLEU evaluation measure
remains very low. It leads to a different ranking of systems, with some of the best systems on BLEU
demonstrating superior overlap with the human reference simplifications. The table also reveals some
runs with very high “compression” ratios, sentence splits, and high proportions of additions. While
evaluation measures like SARI are essential for understanding important aspects of text simplification
output quality, they are also known to be relatively insensitive to content outside the intersection of
manual text simplifications. Hence, high levels of content insertion can still lead to favorable SARI
scores and even improve text statistics like FKGL without conveying key content of the original text.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 1.2: Document-level Scientific Text Simplification</title>
        <p>Table 4 shows the results of Task 1.2 (document-level text simplification). Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see
similar encouraging performance levels again when evaluating against the plain language reference
simplifications. In earlier years of the track, this mainly resulted from using proven sentence-level text
simplification models with the output merged back into the entire abstract. However, this year, we see
almost exclusively large language models applied to the lengthy source abstract as a whole. This is a
clear sign of the remarkable progress in models for text simplification and other complex NLP tasks.
Second, there remains room for improvement in capturing the human simplifications more closely,
as the BLEU score remains low throughout. Here, the more conservative approaches seem to obtain
better scores. For scientific text simplification, we aim for a careful balance between simplicity and
accuracy, and being conservative is a key strength to avoid unnecessary and potentially inaccurate
changes. Third, we see less extreme values on the other indicators, but still considerable variation in
the compression ratio and number of splits, and proportions of additions and deletions. Generally, we
see more compression and deletions, indicating summarization aspects such as reducing the number of
sentences, which happens frequently.</p>
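        <p>These surface statistics are straightforward to compute directly from source-prediction pairs. A minimal sketch under common token-level definitions (the track's own statistics may differ in tokenization and exact definitions):</p>

```python
import re

def text_stats(source: str, prediction: str) -> dict:
    tokenize = lambda t: re.findall(r"\w+", t.lower())
    sentences = lambda t: [s for s in re.split(r"[.!?]+", t) if s.strip()]
    src, pred = tokenize(source), tokenize(prediction)
    src_set, pred_set = set(src), set(pred)
    return {
        # length of output relative to input
        "compression_ratio": len(pred) / len(src),
        # output sentences per input sentence (>1 indicates sentence splitting)
        "splits": len(sentences(prediction)) / len(sentences(source)),
        # fraction of output token types not present in the source (additions)
        "additions": len(pred_set - src_set) / len(pred_set),
        # fraction of source token types absent from the output (deletions)
        "deletions": len(src_set - pred_set) / len(src_set),
    }

stats = text_stats(
    "Myocardial infarction incidence decreased significantly.",
    "Heart attacks became less common. The drop was significant.",
)
```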
        <p>It is encouraging to see solid performance for the approaches that perform text simplification on
the entire abstract in one pass. This holds the promise to incorporate the discourse structure, use
more complex text simplification operations such as deletions and merges, and deploy planner-based
approaches to the text simplification of long documents. Traditional sentence-level simplification
approaches and earlier evaluation data cannot capture these aspects. This demonstrates the value of
the new test collections constructed during the CLEF 2025 SimpleText track.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on Plain Language Summaries</title>
        <p>In this section, we provide additional evaluation on the larger set of 217 abstracts with 4,293 source
sentences paired with 217 plain language summaries with 3,641 sentences. Unlike the subset discussed
above, high-quality sentence alignment is not possible for this data. However, our primary interest is in
document-level text simplification and evaluation, and our analysis explores the value of using parallel
text directly as evaluation.</p>
        <p>Table 5 shows the results of Task 1.1 (sentence-level text simplification) against a larger set of 217
abstracts and plain language summaries without further alignment. Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team. Note again that all
submissions in Task 1.1 and Task 1.2 are evaluated at the document level, to ensure identical ground
truth and comparable scores across tasks.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see again
similar encouraging performance levels when evaluating against the larger set of plain language
reference simplifications. The ranking in Table 5 is similar to the subset of Table 3 before, with some
notable shifts and upsets, particularly for runs with a low BLEU score, but overall high agreement.
Second, we see relatively low BLEU scores again, and even considerably lower than before. This is
partly a result of the less clear source to reference alignment at the sentence and paragraph level for this
larger set of references. But it also shows that document-level text simplification is a challenging task,
even for current advanced models. Third, this also indicates that real-world plain language summaries
are far removed from direct sentence-level simplifications. It also suggests that more conservative
approaches, which may be desirable from an accuracy point of view, fail to capture the complex plain
language adaptations.</p>
        <p>Table 6 shows the results of Task 1.2 (document-level text simplification) against a larger set of 217
abstracts and plain language summaries without further alignment. Again, we restrict the table to
submissions covering a maximum of five runs with non-zero scores per team.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI, we see
similar encouraging performance levels again when evaluating against the plain language reference
simplifications. The tables show some swaps and upset, but generally good agreement between Table 6
and Table 4 shown before. One exception seems to be closed-source models, such as GPT-4, which
perform less impressively on the larger set of plain language summaries. Second, the BLEU score
remains low throughout again, and notably lower than on the subset of Table 4. This seems to be a
result of the greater variation and discourse changes in the plain language summaries. However, this
also immediately suggests that this is not yet captured well by the predictions of advanced NLP models
for text simplification. Third, we see less extreme values on the other indicators for document-level
text simplification approaches. The fraction of deletions remains very high throughout all systems.
Interestingly, the better-scoring systems also seem to have more insertions. This can be an indication
that some systems are finding valuable content to insert, such as explanations of jargon or other
specialized terminology.</p>
      </sec>
      <sec id="sec-4-findings">
        <title>4.4. Findings</title>
        <p>This concludes the results for the CLEF 2025 SimpleText Task 1: Text Simplification (simplify
scientific text). Our main findings are the following. First, our analysis compared the results over the
carefully sentence-aligned abstracts in Table 3 and Table 4 with the larger unfiltered set of
document-level aligned abstracts in Table 5 and Table 6. It is encouraging to see the broad agreement in the
ranking over both sets, as this suggests that evaluation and training on document-aligned texts is a
viable option. Similar to how machine translation was able to scale up due to the availability of parallel
texts, this can help scale up text simplification by increasing the number of available corpora. Second,
this also shifts the focus of the field of text simplification beyond the traditional aspects of lexical and
grammatical simplification and introduces new and interesting aspects. Examples include dealing with
the discourse structure, the particular background knowledge needed to understand the text, and
avoiding or explaining jargon or specialized terminology. Third, while the results are encouraging and the
submitted predictions are generally of high quality compared to some years ago, there also remains
clear room for improvement, in particular when dealing with the scientific vernacular and specific
biomedical jargon. This demonstrates the value of the new test collections constructed during the
CLEF 2025 SimpleText track.</p>
      </sec>
      <sec id="sec-5-analysis">
        <title>5. Analysis</title>
        <p>This section details further analysis of the submissions to the track. We focus in particular on a
sentence-level evaluation of the Task 1.1 submissions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.1. Task 1.1: Sentence-level Scientific Text Simplification</title>
        <p>
As detailed above, we made particular efforts in Cochrane-auto [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to ensure alignment at the document,
paragraph, and sentence level. Hence, we have Cochrane-auto aligned references for 37 abstracts, with
a total of 388 source sentences, carefully aligned with 363 sentences in the plain language summaries
due to deletions.
        </p>
        <p>Table 7 shows the evaluation of the Task 1.1 submissions against the aligned sentence-level references.
We make several observations. First, we see solid agreement between the sentence-level evaluation in
Table 7 and the earlier document-level evaluation of the same runs in Table 3. This is not surprising
since both use the same ground truth references, but it still adds to the confidence in the evaluation
setup using aligned data at scale. Second, we see the same divergence between reasonable SARI scores
and relatively low BLEU scores that seem to favor more conservative approaches. Many of the high
BLEU scoring approaches also have a relatively high fraction of exact copies. This is interesting, as a
conservative approach feels appropriate for the scientific text simplification use case. Conservative
approaches may promote accuracy, faithfulness, and correctness of the simplifications. This may be a
desirable approach even if the readability and accessibility are not as high. Third, at the fine-grained
sentence level over a large set of sentences, the text statistics give a more detailed and informative
picture of the text simplification approaches. At the sentence level, we see varying and even high
fractions of copies for more conservative methods. We see quite a significant variation on almost every
indicator, highlighting great differences between the approaches, despite the relatively similar SARI
scores.</p>
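        <p>As an illustration, the fraction of exact copies can be computed directly over aligned sentence pairs; the sketch below normalizes only whitespace and case, which may differ from the exact criterion used for the reported statistics:</p>

```python
def copy_fraction(sources: list[str], predictions: list[str]) -> float:
    # A prediction counts as a copy if it matches its aligned source sentence
    # after trimming extra whitespace and lowercasing.
    norm = lambda s: " ".join(s.lower().split())
    copies = sum(norm(s) == norm(p) for s, p in zip(sources, predictions))
    return copies / len(sources)

srcs = ["The cohort was randomized.", "Adverse events were rare."]
preds = ["The cohort was randomized.", "Side effects were uncommon."]
assert copy_fraction(srcs, preds) == 0.5  # one of two sentences copied verbatim
```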
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion and Conclusions</title>
      <p>This paper describes the setup of the CLEF 2025 SimpleText track, which contains the following three
tasks. Task 1 on Text Simplification: simplify scientific text. Task 2 on Controlled Creativity: identify
and avoid hallucination. Task 3 on SimpleText 2024 Revisited: selected tasks by popular request. These
tasks address some of today’s main NLP/IR challenges. This Task overview focuses on the CLEF 2025
SimpleText Track’s Task 1 on sentence-level and document-level text simplification. The main aim
of our track, and the CLEF evaluation forum as a whole, is i) to construct corpora and evaluation
resources to stimulate research on scientific text summarization and simplification, and ii) to foster a
community of IR, NLP, and AI researchers working together on the important task of making science
more accessible for everyone.</p>
      <p>Within the CLEF 2025 SimpleText Task 1, we have constructed extensive corpora and new references
for evaluation data. First, we pushed the research frontier in text simplification by creating new
scientific text simplification corpora for biomedical literature. We focused on true paragraph-level and
document-level simplification with greater variation and took the complex discourse structure into
account. This fits current models such as LLMs, which operate on long input. Second, the
document-level text simplification corpus created at CLEF 2025 is a major advance for the field, as earlier data
was typically based on direct human simplifications at the sentence level. As a result, sentence-level
text simplification approaches were very effective and typically outcompeted true document-level
approaches, even though models have long been able to cope with long contexts. In 2025, we saw for the
first time that document-level text simplification approaches clearly outcompeted sentence-level text
simplification. Third, the move to the biomedical domain presents many important challenges to current
technology. The abstracts can be quite long, with a complex discourse structure, and the plain language
summaries avoid or explain medical jargon and provide additional background information to make
the key points of the scientific abstracts understandable for consumers. Several teams experimented
with specific approaches and models for the health and biomedical domain. In particular, several teams
explored novel ways to guide the model into producing output that addresses barriers consumers face
when directly accessing the biomedical literature. These advances hold the promise to automatically
provide plain language summaries of biomedical literature, and thereby greatly enhance the scope and
impact of authoritative health information.</p>
      <p>These reusable corpora and evaluation resources are available to participants and other researchers
who want to work on the important problem of making scientific information open and easily
accessible for everyone. In terms of building a community researching scientific text summarization and
simplification, the track saw record attendance in 2025: even with significant changes in the tasks and
the move to Codabench, more runs were submitted, and by the largest number of participating teams
ever.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We are incredibly thankful to the master’s students in translation and technical writing from the
University of Brest for participating in data annotation. We also thank each of the individual track
participants for their effort in submitting a record number of runs to Codabench and documenting
these in their papers.</p>
      <p>We thank the CLEF 2025 chairs for hosting us, and the CLEF 2025 Labs and Proceedings chairs for
their excellent assistance and flexibility. It is heartwarming to be part of such a great CLEF family.
We thank Codabench [22] for hosting the competition. Post-competition experiments are ongoing at
https://www.codabench.org/competitions/8400/ (Task 1.1, Task 1.2, and Task 2.3) and https://www.
codabench.org/competitions/8327/ (Task 2.1 and Task 2.2). We hope and expect that these “living test
collections” remain in active use until the next iteration of the track.</p>
      <p>Benjamin Vendeville and Liana Ermakova are partly funded by the French National Research Agency
(ANR-22-CE23-0019-01, Automatic Simplification of Scientific Texts). Liana Ermakova is further supported
by the CNRS research group MaDICS (https://www.madics.fr/ateliers/simpletext/).</p>
      <p>Jan Bakker and Jaap Kamps are partly funded by the Netherlands Organization for Scientific Research
(NWO NWA # 1518.22.105). Jaap Kamps is further supported by (NWO CI # CISC.CC.016), the University
of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this
paper are not necessarily shared or endorsed by those funding the research.</p>
    </sec>
    <sec id="sec-7">
      <title>Disclosure of Interests</title>
      <p>The authors have no competing interests to declare that are relevant to the content of this article.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar
and spelling checking and for paraphrasing and rewording. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
      <sec id="sec-9">
        <title>References</title>
        <p>[3] B. Vendeville, J. Bakker, H. Azarbonyad, L. Ermakova, J. Kamps, Overview of the CLEF 2025 SimpleText Task 2: Identify and Avoid Hallucination, in: [23], 2025.</p>
        <p>[4] N. Largey, D. Wu, B. Mansouri, AIIRLab Systems for CLEF 2025 SimpleText: Cross-Encoders to Avoid Spurious Generation, in: [23], 2025.</p>
        <p>[5] A. N. Djoudi, S. Nouali, M. Aabid, I. Badache, A.-G. Chifu, P. Bellot, LIS at the SimpleText 2025: Enhancing Scientific Text Accessibility with LLMs and Retrieval-Augmented Generation, in: [23], 2025.</p>
        <p>[6] K. C. Marturi, H. H. Elwazzan, Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText, in: [23], 2025.</p>
        <p>[7] K. C. Marturi, H. H. Elwazzan, LLM-Guided Planning and Summary-Based Scientific Text Simplification: DS@GT at CLEF 2025 SimpleText, in: [23], 2025.</p>
        <p>[8] G. Arampatzis, A. Arampatzis, DUTH at CLEF 2025 SimpleText Track: Tackling Scientific Text Simplification and Hallucination Detection, in: [23], 2025.</p>
        <p>[9] M. M. Agüero-Torales, C. Rodríguez-Abellán, C. A. C. Moraga, Sentence-level Scientific Text Simplification With Just a Pinch of Data, in: [23], 2025.</p>
        <p>[10] Y. Gallina, T. Jiménez, S. Huet, University of Avignon at SimpleText 2025: Guided Medical Abstract Simplification, in: [23], 2025.</p>
        <p>[11] A. Vora, T. Chaudhari, S. Hotha, S. Sonawane, S-3 Pipeline by PICT/Pune for Biomedical Text Simplification, in: [23], 2025.</p>
        <p>[12] S. Eugin, A. Ms.Beula, V. Sathvikha, V. Sangamithra, SimpleText: Simplify Scientific Text, in: [23], 2025.</p>
        <p>[13] A. A. Dongre, A. Vaadiraaju, A. K. Madasamy, NITK SCaLAR Lab at the CLEF 2025 SimpleText Track: Transformer-Based Models for Biomedical Sentence Simplification (Task 1.1), in: [23], 2025.</p>
        <p>[14] J. Collado-Montañez, J. A. Ortiz-Zambrano, C. Espin-Riofrio, A. Montejo-Ráez, SINAI in SimpleText CLEF 2025: Simplifying Biomedical Scientific Texts and Identifying Hallucinations Using GPT-4.1 and Pattern Detection, in: [23], 2025.</p>
        <p>[15] N. Hofmann, J. Dauenhauer, N. O. Dietzler, I. D. Idahor, C. K. Kreutz, THM@SimpleText 2025 Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts, in: [23], 2025.</p>
        <p>[16] B. Vendeville, L. Ermakova, P. D. Loor, J. Kamps, UBONLP Report on the SimpleText lab, in: [23], 2025.</p>
        <p>[17] P. Kocbek, G. Stiglic, UM-FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification, in: [23], 2025.</p>
        <p>[18] T. Papandreou, J. Bakker, J. Kamps, University of Amsterdam at the CLEF 2025 SimpleText Track, in: [23], 2025.</p>
        <p>[19] A. Devaraj, I. Marshall, B. Wallace, J. J. Li, Paragraph-level simplification of medical texts, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4972–4984. URL: https://aclanthology.org/2021.naacl-main.395/. doi:10.18653/v1/2021.naacl-main.395.</p>
        <p>[20] D. Davari, L. Ermakova, R. Krestel, Comparative analysis of evaluation measures for scientific text simplification, in: A. Antonacopoulos, A. Hinze, B. Piwowarski, M. Coustaty, G. M. Di Nunzio, F. Gelati, N. Vanderschantz (Eds.), Linking Theory and Practice of Digital Libraries - 28th International Conference on Theory and Practice of Digital Libraries, TPDL 2024, Ljubljana, Slovenia, September 24-27, 2024, Proceedings, Part I, volume 15177 of Lecture Notes in Computer Science, Springer, 2024, pp. 76–91. URL: https://doi.org/10.1007/978-3-031-72437-4_5. doi:10.1007/978-3-031-72437-4_5.</p>
        <p>[21] C. Jiang, W. Xu, MedReadMe: A systematic study for fine-grained sentence readability in medical domain, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 17293–17319. URL: https://aclanthology.org/2024.emnlp-main.958/. doi:10.18653/v1/2024.emnlp-main.958.</p>
        <p>[22] Z. Xu, S. Escalera, A. Pavão, M. Richard, W. Tu, Q. Yao, H. Zhao, I. Guyon, Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform, Patterns 3 (2022) 100543. URL: https://doi.org/10.1016/j.patter.2022.100543. doi:10.1016/j.patter.2022.100543.</p>
        <p>[23] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>J.</given-names> <surname>Bakker</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>,
          <article-title>Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts</article-title>,
          in: M. Shardlow, H. Saggion, F. Alva-Manchego, M. Zampieri, K. North, S. Štajner, R. Stodden (Eds.),
          <source>Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)</source>,
          Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>, pp.
          <fpage>41</fpage>-<lpage>51</lpage>.
          URL: https://aclanthology.org/2024.tsar-1.5/. doi:10.18653/v1/2024.tsar-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 SimpleText track: Simplify scientific texts (and nothing more)</article-title>
          , in: J. Carrillo de Albornoz,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>