Overview of the CLEF 2024 SimpleText Task 3: Simplify Scientific Text

Liana Ermakova1, Valentin Laimé1, Helen McCombie2 and Jaap Kamps3
1 Université de Bretagne Occidentale, HCTI, France
2 Université de Bretagne Occidentale, BTU, France
3 University of Amsterdam, Amsterdam, The Netherlands

Abstract
This article provides a comprehensive summary of the CLEF 2024 SimpleText Task 3, which focuses on simplifying scientific text based on specific queries. We discuss in detail the motivation for lay access to scholarly literature, and provide an overview of the setup of the scientific text simplification task. One of the main innovations of the CLEF 2024 SimpleText Task 3 is to complement sentence-level text simplification with a document-level text simplification task. We describe the resulting sentence-level and document-level text simplification test collections in detail: a corpus of over 1,500 paired source and reference sentences and a corpus of over 250 paired source and reference abstracts, both containing source text from scientific abstracts with direct reference simplifications produced by human annotators. We present the results of the participants' submissions, with 15 teams submitting 52 sentence-level text simplification runs and 9 teams submitting 31 document-level text simplification runs. The article concludes with an in-depth analysis, including information distortion and potential LLM "hallucinations" in the simplified sentences submitted by participants.

Keywords
automatic text simplification, science popularization, information distortion, error analysis, lexical complexity, syntactic complexity, LLM hallucination

CLEF 2024: Conference and Labs of the Evaluation Forum, September 9–12, 2024, Grenoble, France
liana.ermakova@univ-brest.fr (L. Ermakova)
https://simpletext-project.com/ (L. Ermakova)
ORCID: 0000-0002-7598-7474 (L. Ermakova); 0000-0002-6614-0087 (J. Kamps)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Becoming science literate is more important than ever before. Objective scientific information helps any user to navigate a world where misinformation, disinformation, or unfounded generated information is only a single mouse click away. Everyone acknowledges the importance of objective scientific information. However, finding and understanding relevant scientific documents is often challenging due to complex terminology and readers' lack of prior knowledge. The question is: can we improve accessibility for everyone? Text simplification technology holds the promise of removing some of the access barriers [1, 2, 3, 4]. Despite impressive progress, the automatic removal of comprehension barriers between scientific texts and the general public remains an ongoing challenge. The paper highlights that even the most advanced language models currently available face difficulties when it comes to simplifying scientific texts. The described results demonstrate the limitations of these models in effectively tackling the task of simplification in the scientific domain.

The CLEF 2024 SimpleText track brings together researchers and practitioners working on the generation of simplified summaries of scientific texts. It is an evaluation lab that follows up on the CLEF 2021 SimpleText Workshop [5], the CLEF 2022 SimpleText Track [6], and the CLEF 2023 SimpleText Track [7]. The CLEF 2024 SimpleText track is based on four interrelated tasks:
1. Task 1 on Content Selection: retrieve passages to include in a simplified summary.
2. Task 2 on Complexity Spotting: identify and explain difficult concepts.
3. Task 3 on Text Simplification: simplify scientific text.
4. Task 4 on SOTA?: track the state-of-the-art in scholarly publications.

This paper presents an overview of the CLEF 2024 SimpleText Task 3 on Text Simplification. For a comprehensive overview of the other tasks, the task overview papers on Task 1 [8], Task 2 [9], and Task 4 [10], as well as the track overview paper [11], provide detailed information and further insights. The CLEF 2024 SimpleText Task 3 directly addresses the technical and evaluation challenges associated with making scientific information accessible to a wide audience, including students and non-experts. We describe the data and benchmarks provided for scientific text simplification, along with the participants' results and further analysis. This task on simplifying scientific text is a direct continuation of the CLEF 2023 Task 3 [12]. One of the key innovations in 2024 is the introduction of both sentence-level and document (abstract) level scientific text simplification subtasks, as Task 3.1 and Task 3.2.

A total of 45 teams registered for our SimpleText track at CLEF 2024. A total of 20 teams submitted 207 runs in total for the Track, of which 15 teams submitted a total of 83 runs for Task 3. The statistics for the Task 3 runs submitted are presented in Table 1. However, some runs had problems that we could not resolve; we do not detail these runs, nor the zero-scored runs, in this paper.

Table 1
CLEF 2024 SimpleText Task 3 official run submission statistics
Teams: Tomislav/Rowan, Frane/Andrea, Dajana/Katya, UAmsterdam, Petra/Regina, UZH Pandas, Arampatzis, Sharigans, PiTheory, AIIR Lab, AB/DPV, AMATU, Elsevier, SONAR, UniPD, SINAI, Ruby, UBO, LIA, L3S
Task 3.1: 4 4 8 11 1 1 1 1 1 1 1 1 4 2 11 — Total 52
Task 3.2: 4 4 2 10 1 1 1 6 2 — Total 31

This introduction is followed by Section 2, presenting the text simplification task with the datasets and evaluation metrics used. Section 3 gives an overview of text simplification approaches for scientific text as deployed by the participants. In Section 4, we present and discuss the results of the official submissions. In Section 5, a thorough analysis of the results is carried out, covering several important aspects: the relationship between difficult scientific terms and the simplification process, information distortion that may occur during simplification, and instances of large language models (LLMs) generating hallucinations and producing inaccurate information. We end with Section 6, which summarizes the findings and draws perspectives for future work.

2. Task 3: Simplify Scientific Text

This section details Task 3: Text Simplification, on simplifying scientific text.

2.1. Description

The goal of this task is to provide a simplified version of the sentences extracted from scientific abstracts. Participants are provided with popular science articles and queries, and with matching abstracts of scientific papers, either split into individual sentences or as entire abstracts. This year features both sentence-level (Task 3.1) and document- or abstract-level (Task 3.2) text simplification.
Table 2 shows an example of a human reference simplification, combining the input sentences belonging to the abstract of the document id = 130055196 retrieved for query G01.1. Here, we contrast the source input sentences with the human reference simplification, showing how the source is reworded, restructured, and condensed.

Table 2
Example of a SimpleText Task 3 human reference simplification of the source input (query G01.1, document 130055196)
Source: As various kinds of output devices emerged, such as highresolution printers or a display of PDA (Personal Digital Assistant), the importance of high-quality resolution conversion has been increasing. This paper proposes a new method for enlarging image with high quality. One of the largest problems on image enlargement is the exaggeration of the jaggy edges. To remedy this problem, we propose a new interpolation method, which uses artificial neural network to determine the optimal values of interpolated pixels. The experimental results are shown and evaluated. The effectiveness of our methods is discussed by comparing with the conventional methods.
Reference: The rise of output devices like high-resolution printers and PDA (Personal Digital Assistant) displays has increased the need for high-quality resolution conversion. The paper proposes a new method to make images bigger while maintaining high quality. The main issue with enlarging images is that jagged edges can become exaggerated. To solve this problem, we suggest a new interpolation method that helps us to estimate the value of the newly generated pixels using a neural network. The experiment's results are presented and analyzed. We evaluate the effectiveness of our methods by comparing them to traditional approaches.

Table 3
CLEF SimpleText Task 3 scientific text simplification corpora
Task Level Role Source Reference
3.1 Sentence Train 893 sentences 958 simplified sentences
3.1 Sentence Test 578 sentences 578 simplified sentences
3.1 Sentence Combined 1,471 sentences 1,536 simplified sentences
3.2 Document Train 175 abstracts 175 simplified abstracts
3.2 Document Test 103 abstracts 103 simplified abstracts
3.2 Document Combined 278 abstracts 278 simplified abstracts

2.1.1. Data

Task 3 uses a corpus based on the high-ranked abstracts retrieved for the requests of the CLEF 2024 SimpleText Task 1. Our training data is a truly parallel corpus of directly simplified sentences coming from scientific abstracts, drawn from the DBLP Citation Network Dataset for Computer Science and from Google Scholar and PubMed articles on Health and Medicine. Other existing text simplification corpora used post-hoc aligned sentences [e.g., 13].

In 2024, we expanded the training and evaluation data. In addition to sentence-level text simplification, we also provide document-level or abstract-level input and reference simplifications. In order to make the sentence-level and document-level tasks fairly comparable, both use the exact same reference simplifications. The sentences from the scientific abstracts were simplified either by master students in Technical Writing and Translation or by a domain expert (a computer scientist) and a professional translator (native English speaker) working together.

Table 3 gives an overview of all the SimpleText Task 3 scientific text simplification corpora constructed in 2024. The SimpleText corpus contains 1,536 directly simplified sentences, corresponding to 278 scientific abstracts. This is a useful addition to existing high-quality corpora like Newsela [13], with 2,259 sentences in Newsela-Manual. Our track is the first to focus on the simplification of scientific text, with a much higher text complexity than news articles.
Available Task 3 training data is derived from the CLEF 2023 edition [7] and includes 893 source sentences from 175 scientific abstracts paired with the corresponding manual reference simplifications. The new test data created in 2024 consists of 578 sentences paired with reference simplifications for the sentence-level task (Task 3.1), and 103 abstracts paired with reference simplifications for the document-level task (Task 3.2).

2.1.2. Formats

Sources The source data are provided in JSON format with the following fields:
1. snt_id (Task 3.1) or abs_id (Task 3.2): a unique sentence (or abstract) identifier
2. source_snt (Task 3.1) or source_abs (Task 3.2): passage text (sentence or abstract)
3. doc_id: a unique source document identifier
4. query_id: a query ID
5. query_text: the text of the query for which the abstract was retrieved; passages should be simplified with this query in mind

An example of the Task 3.1 JSON source input is:

{
  "query_id": "G11.1",
  "query_text": "drones",
  "doc_id": 2892036907,
  "snt_id": "G11.1_2892036907_2",
  "source_snt": "With the ever increasing number of unmanned aerial vehicles getting involved in activities in the civilian and commercial domain, there is an increased need for autonomy in these systems too."
},

Predictions Predictions or submissions of participants were also requested in JSON format with the following fields:
1. run_id: Run ID starting with __, e.g. UBO_Task3.1_BLOOM
2. manual: whether the run is manual {0,1}
3. snt_id (Task 3.1) or abs_id (Task 3.2): a unique sentence or abstract identifier from the input file
4. simplified_snt (Task 3.1) or simplified_abs (Task 3.2): simplified text for the sentence or abstract

An example of a Task 3.1 submission in JSON is:

{
  "run_id": "Elsevier@SimpleText_Task3.1_run1",
  "manual": 0,
  "snt_id": "G11.1_2892036907_2",
  "simplified_snt": "As more and more drones are used for civilian and commercial purposes, there is a growing need for them to operate independently."
},

References The references are provided in a format very similar to the predictions above. An example of a Task 3.1 reference in JSON is:

{
  "snt_id": "G11.1_2892036907_2",
  "simplified_snt": "Drones are increasingly used in the civilian and commercial domain and need to be autonomous."
},

2.1.3. Evaluation

In 2024, we emphasize large-scale automatic evaluation measures (SARI, BLEU, compression, readability) that provide a reusable test collection. This automatic evaluation will be supplemented with a detailed human evaluation of other aspects, essential for deeper analysis. Almost all participants used generative models for text simplification, yet existing evaluation measures are blind to potential hallucinations with extra or distorted content [12]. In 2024, we provide further analysis of ways to detect and quantify spurious content in the output, potentially corresponding to what is informally called "hallucinations."
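To make the automatic evaluation concrete, the following minimal sketch computes the three headline measures reported below (SARI, BLEU, and FKGL) over a run file in the submission format above. It is an illustration only, not the official evaluation script: it relies on the Hugging Face evaluate implementations of SARI and sacreBLEU and on the textstat package for FKGL, and the file names are hypothetical.

import json
import evaluate   # Hugging Face evaluate: provides "sari" and "sacrebleu"
import textstat   # readability statistics, including the Flesch-Kincaid grade level

def load(path, text_field):
    # Read a Task 3.1 JSON file and return a mapping from snt_id to text.
    with open(path, encoding="utf-8") as f:
        return {record["snt_id"]: record[text_field] for record in json.load(f)}

# Hypothetical file names, following the formats described in Section 2.1.2.
sources = load("task3.1_test_source.json", "source_snt")
references = load("task3.1_test_reference.json", "simplified_snt")
predictions = load("my_run.json", "simplified_snt")

ids = sorted(set(sources) & set(references) & set(predictions))
src = [sources[i] for i in ids]
refs = [[references[i]] for i in ids]   # one reference simplification per sentence
hyps = [predictions[i] for i in ids]

sari = evaluate.load("sari").compute(sources=src, predictions=hyps, references=refs)
bleu = evaluate.load("sacrebleu").compute(predictions=hyps, references=refs)
fkgl = sum(textstat.flesch_kincaid_grade(t) for t in hyps) / len(hyps)

print(f"SARI {sari['sari']:.2f}  BLEU {bleu['score']:.2f}  FKGL {fkgl:.2f}")

Note that SARI, unlike BLEU, needs the source sentences as well as the references, which is why all three files are loaded; compression ratio and the other text statistics in the result tables can be derived from the same aligned lists.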
3. Scientific Text Simplification Approaches

In this section, we discuss a range of text simplification approaches that have been applied to the scientific text provided by the track. A total of 15 teams submitted 83 runs in total.

AB/DPV Varadi and Bartulović [14] submitted one run for Task 3. Their approach is an LSTM model for the sentence-level task.

AIIRLab Largey et al. [15] submitted a total of eight runs for Task 3. Their approach uses LLaMA 3 and Mistral models with different prompting and fine-tuning, for both the sentence-level and abstract-level tasks.

Arampatzis (No paper received) submitted a total of eight runs for Task 3. Their approach is a range of models (DistilBERT, T5) for both the sentence-level and abstract-level tasks.

Dajana/Katya (No paper with run details received) submitted one run for Task 3. Their approach, which follows standard text simplification approaches, is applied to the sentence-level task.

Elsevier Capari et al. [16] submitted a total of ten runs for Task 3. Their approach is based on a GPT-3.5 model, experimenting with zero-shot and few-shot prompts for both the sentence-level and abstract-level tasks.

Frane/Andrea (No paper with run details received) submitted one run for Task 3. Their approach, which follows standard text simplification approaches, is applied to the sentence-level task.

Petra/Regina Elagina and Vučić [17] submitted one run for Task 3. Their approach is a LLaMA model for the sentence-level task.

PiTheory (No paper with run details received) submitted a total of twenty runs for Task 3. Their approach uses pre-trained BART and T5 models, but their runs contain very few results, for both the sentence-level and abstract-level tasks.

Ruby (No paper received) submitted two runs for Task 3. Their approach uses standard models for both the sentence-level and abstract-level tasks.

Sharigans Ali et al. [18] submitted a total of two runs for Task 3. Their approach is a GPT-3.5 model for both the sentence-level and abstract-level tasks.

SONAR (No paper received) submitted a single run for Task 3. Their approach is a standard model for the sentence-level task.

Tomislav/Rowan Mann and Mikulandric [19] submitted a total of two runs for Task 3. Their approach is the LLaMA-2 model with a range of prompts and post-processing for both the sentence-level and abstract-level tasks. Their submission only covers part of the train topics.

UAmsterdam Bakker et al. [20] submitted a total of ten runs for Task 3. They experiment with GPT-2 and with Wiki- and Cochrane-trained models for sentence-, paragraph-, and document-level text simplification, for both the sentence-level and document-level tasks.

UBO Vendeville et al. [21] submitted a total of four runs for Task 3. Their approach is to prompt a smaller Phi3 model for lexical and grammatical text simplifications, for both the sentence-level and abstract-level tasks.

UZHPandas Michail et al. [22] submitted a total of ten runs for Task 3. They experiment with a multi-prompt Minimum Bayes Risk (MBR) decoding approach to the sentence-level task. Their approach is a refinement of their CLEF 2023 approach, which was recognized with a prestigious Best of the Labs award and published as part of the CLEF 2024 LNCS proceedings [23].
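Most of the prompt-based approaches above follow the same basic recipe: instruct an instruction-tuned LLM to rewrite a scientific sentence for a lay reader and write the result in the submission format of Section 2.1.2. The sketch below illustrates that recipe with the OpenAI chat API; the prompt wording, model name, and run name are illustrative assumptions, not the settings of any participating team.

import json
from openai import OpenAI  # assumes the openai Python package (v1+) and an API key in the environment

client = OpenAI()

PROMPT = ("Rewrite the following sentence from a scientific abstract so that a general "
          "audience can understand it. Keep the meaning, avoid jargon, and do not add "
          "information that is not in the sentence.\n\nSentence: {snt}")

def simplify(sentence, model="gpt-3.5-turbo"):
    # Zero-shot simplification of a single sentence.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(snt=sentence)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

with open("task3.1_test_source.json", encoding="utf-8") as f:  # hypothetical input file
    records = json.load(f)

run = [{"run_id": "Example_Task3.1_zeroshot",  # hypothetical run name
        "manual": 0,
        "snt_id": r["snt_id"],
        "simplified_snt": simplify(r["source_snt"])}
       for r in records]

with open("Example_Task3.1_zeroshot.json", "w", encoding="utf-8") as f:
    json.dump(run, f, ensure_ascii=False, indent=2)

Few-shot variants simply prepend worked source/reference pairs from the training data to the prompt; chain-of-thought prompting and multi-prompt MBR decoding, as used by some of the teams above, wrap this same call in additional prompting and output-selection logic.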
4. Results

This section details the results of the task, for both the sentence-level and abstract-level text simplification subtasks.

4.1. Task 3.1: Sentence-level scientific text simplification

Table 4 shows the Task 3.1 (sentence-level text simplification) results. The table is restricted to submissions covering a sufficient number of input sentences. We show a number of evaluation scores against the human reference simplifications, in particular SARI and BLEU. In addition, we provide text statistics on the system output, such as FKGL, and a comparison to the source input. We make a number of observations.

First, the table is sorted on SARI, the main automatic text simplification measure used in the track. We observe SARI scores of 30+ % for the majority of systems and 40+ % for the top-scoring systems. This high overlap with the human reference simplifications is encouraging and indicates that the effectiveness of text simplification approaches, traditionally trained on youth news reading corpora like Newsela, also extends to scientific text.

Second, in terms of the level of text complexity, readability measures like FKGL provide a rough indicator of lexical and grammatical complexity. The original sentences have an FKGL of 13-14, corresponding to university-level text, and the majority of systems reduce this to an FKGL of 11-12, corresponding to the exit level of compulsory education. This is an encouraging result, as it indicates that scientific text simplification can be a viable approach to lowering the textual complexity of scientific text toward a range acceptable to a layperson. Although this is a positive indicator, this approximate measure does not take into account the terminological complexities studied in Task 2, or ways to retrieve all and only more accessible abstracts in Task 1 [24].

Third, the table includes various other scores that indicate that there is still considerable room for improvement in scientific text simplification. Throughout the table the BLEU evaluation measure remains very low, and leads to a different ranking of systems, with some of the best systems on BLEU demonstrating superior overlap with the human reference simplifications. The table also reveals some runs with very high "compression" ratios and sentence splits, as well as high proportions of additions. While evaluation measures like SARI are essential for understanding important aspects of text simplification output quality, they are also known to be relatively insensitive to content outside the intersection with the manual text simplifications. Hence, high levels of inserted content can still lead to favorable SARI scores, and even improve text statistics like FKGL, without conveying key content of the original text.

4.2. Task 3.2: Abstract-level scientific text simplification

Table 5 shows the Task 3.2 (abstract-level text simplification) results. Again, we restrict the table to submissions covering a sufficient number of input abstracts. We make a number of observations.

First, in terms of evaluation measures like SARI, we again see similar encouraging performance levels when evaluating against the human reference simplifications. This is partly due to the use of proven sentence-level text simplification models with the output merged back into the entire abstract.
Second, there remains room for improvement in capturing the human reference simplifications more closely, as the BLEU score remains low throughout. Here, the more conservative approaches seem to obtain better scores.

Third, we see less extreme values on the other indicators, but still considerable variation in the compression ratio and number of splits, and in the proportions of additions and deletions. We will investigate how much of the output is grounded in the source sentences and abstracts below.

Table 4
Results for CLEF 2024 SimpleText Task 3.1 sentence-level text simplification (task number removed from the run_id) on the test set
Columns: run_id, # inputs, FKGL, SARI, BLEU, compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, lexical complexity score
Source 578 13.65 12.02 19.76 1.00 1.00 1.00 1.00 0.00 0.00 8.80
Reference 578 8.86 100.00 100.00 0.70 1.06 0.60 0.01 0.27 0.54 8.51
Elsevier_run1 578 10.33 43.63 10.68 0.87 1.06 0.59 0.00 0.45 0.53 8.39
Elsevier_run4 577 11.73 43.14 12.08 0.85 1.00 0.63 0.00 0.37 0.50 8.54
Elsevier_run8 577 12.40 42.95 12.35 0.90 1.02 0.63 0.00 0.35 0.50 8.66
Elsevier_run6 577 12.65 42.88 11.76 0.95 1.00 0.64 0.00 0.38 0.47 8.63
Elsevier_run7 577 12.55 42.87 12.20 0.87 1.00 0.63 0.00 0.35 0.51 8.67
Elsevier_run9 577 12.53 42.61 12.15 0.87 1.00 0.63 0.00 0.35 0.50 8.67
Elsevier_run3 577 11.50 42.58 15.75 0.76 0.98 0.68 0.00 0.23 0.46 8.68
Elsevier_run10 577 12.57 42.49 11.91 0.91 1.02 0.63 0.00 0.34 0.50 8.67
AIIRLab_llama-3-8b_run1 578 8.39 40.58 7.53 0.90 1.37 0.56 0.00 0.48 0.58 8.45
AIIRLab_llama-3-8b_run3 578 9.47 40.36 6.26 1.17 1.52 0.53 0.00 0.53 0.56 8.51
AIIRLab_llama-3-8b_run2 578 10.33 39.76 5.46 1.03 1.19 0.51 0.00 0.60 0.56 8.34
UZHPandas_simple_cot 578 13.74 39.59 3.38 3.44 2.67 0.41 0.00 0.76 0.12 8.61
UZHPandas_simple 578 11.24 39.28 5.67 0.88 0.98 0.52 0.00 0.53 0.62 8.45
Sharingans_finetuned 578 11.39 38.61 18.18 0.83 1.07 0.77 0.11 0.16 0.32 8.70
UZHPandas_selection_sle_cot 578 6.49 38.38 1.03 4.76 6.26 0.30 0.00 0.89 0.14 8.30
UZHPandas_simple_inter_def 578 21.36 38.29 3.13 1.93 0.99 0.46 0.00 0.69 0.33 8.86
UZHPandas_selection_lens_cot 578 6.74 38.16 1.10 4.54 5.88 0.32 0.00 0.87 0.14 8.32
UZHPandas_5Y_target_cot 578 6.39 37.95 0.97 4.73 6.25 0.30 0.00 0.89 0.14 8.30
UZHPandas_selection_lens 578 21.29 37.79 2.71 1.97 1.01 0.44 0.00 0.71 0.34 8.85
UBO_Phi4mini-s 578 8.74 36.78 0.58 18.23 23.48 0.47 0.00 0.66 0.29 8.89
UZHPandas_selection_lens_1 578 7.79 36.72 3.65 0.72 0.98 0.46 0.00 0.54 0.73 8.25
UBO_Phi4mini-sl 578 6.16 36.53 0.61 6.92 9.81 0.38 0.00 0.80 0.42 8.72
UZHPandas_5Y_target_inter_def 578 19.30 36.53 2.27 1.76 1.01 0.45 0.00 0.70 0.41 8.87
UZHPandas_selection_sle 578 6.07 35.30 2.57 0.65 0.98 0.43 0.00 0.56 0.78 8.17
UZHPandas_5Y_target 578 5.94 34.91 2.29 0.66 0.99 0.43 0.00 0.57 0.78 8.17
RubyAiYoungTeam 578 8.76 34.40 15.37 0.60 1.22 0.69 0.03 0.05 0.44 8.71
SONAR_SONARnonlinreg 578 13.14 32.12 18.41 0.97 1.01 0.93 0.13 0.11 0.13 8.73
UAms_GPT2_Check 578 11.47 29.91 15.10 1.02 1.23 0.87 0.14 0.17 0.14 8.68
UAms_GPT2 578 10.91 29.73 13.07 1.30 1.50 0.79 0.06 0.29 0.12 8.63
Arampatzis_T5 578 13.18 28.92 10.66 1.12 1.10 0.72 0.03 0.34 0.37 9.06
UAms_Wiki_BART_Snt 578 12.13 27.45 21.56 0.85 0.99 0.89 0.32 0.02 0.16 8.73
Arampatzis_DistilBERT 578 5.85 19.00 13.56 1.03 3.00 0.95 0.00 0.22 0.11 8.65
UAms_Cochrane_BART_Snt 578 13.22 18.45 19.21 0.95 0.99 0.96 0.59 0.02 0.07 8.77
Many submissions rely on proven sentence-level text simplification approaches, with results closely mirroring those observed for the sentence-level task. It is encouraging to see solid performance for the approaches that perform text simplification on entire abstracts in one pass. This holds the promise of incorporating the discourse structure, using more complex text simplification operations such as deletions and merges, and deploying planner-based approaches to the text simplification of long documents.

Table 5
Results for CLEF 2024 SimpleText Task 3.2 abstract-level text simplification (task number removed from the run_id) on the test set
Columns: run_id, # inputs, FKGL, SARI, BLEU, compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, lexical complexity score
Source 103 13.64 12.81 21.36 1.00 1.00 1.00 1.00 0.00 0.00 8.88
Reference 103 8.91 100.00 100.00 0.67 1.04 0.60 0.00 0.23 0.53 8.66
AIIRLab_llama-3-8b_run1 103 9.07 43.44 11.73 1.01 1.38 0.51 0.00 0.37 0.56 8.57
AIIRLab_llama-3-8b_run3 103 10.17 43.21 11.03 1.15 1.47 0.52 0.00 0.40 0.51 8.66
Elsevier_run2 103 11.01 42.47 10.54 1.04 1.22 0.51 0.00 0.38 0.55 8.60
AIIRLab_llama-3-8b_run2 103 10.22 42.19 7.99 1.31 1.38 0.48 0.00 0.53 0.52 8.44
Elsevier_run5 103 12.08 42.15 10.96 1.04 1.15 0.52 0.00 0.36 0.53 8.75
Sharingans_finetuned 103 11.53 40.96 18.29 1.20 1.39 0.65 0.00 0.24 0.34 8.80
UBO_Phi4mini-ls 103 8.45 38.79 5.53 1.21 1.75 0.43 0.00 0.40 0.63 8.53
UBO_Phi4mini-l 103 9.96 38.41 10.01 1.29 2.11 0.55 0.00 0.24 0.51 9.03
UAms_GPT2_Check_Abs 103 12.85 36.47 13.12 0.91 0.92 0.59 0.00 0.18 0.45 8.73
UAms_Cochrane_BART_Doc 103 14.46 33.51 9.39 0.65 0.58 0.54 0.04 0.06 0.53 8.80
UAms_Cochrane_BART_Par 103 16.53 31.58 15.40 1.08 0.80 0.67 0.04 0.15 0.32 8.81
UAms_GPT2_Check_Snt 103 11.57 30.71 15.24 1.54 1.70 0.78 0.00 0.27 0.13 8.77
UAms_Wiki_BART_Doc 103 15.68 26.50 15.11 1.51 1.14 0.76 0.01 0.25 0.11 8.79
UAms_Wiki_BART_Par 103 13.11 23.92 19.49 1.39 1.37 0.81 0.01 0.11 0.10 8.86

4.3. Train results

In this section, we show the results over the train data for sentence-level and abstract-level scientific text simplification. This analysis includes those submissions restricted to the train data and left out above.

4.3.1. Task 3.1: Sentence-level scientific text simplification

Table 6 shows the sentence-level text simplification results for the train data. We make the following observations.

First, we observe very high performance, with SARI scores up to 65% for systems fine-tuned on the train data. Even more striking are the very high BLEU scores of over 50%. This is a signal of potential overfitting, although the top-performing systems on train still perform reasonably on the new test data. The majority of runs perform similarly on train and test, which is in line with expectations, as most are not particularly trained or fine-tuned on the relatively small set of train sentences and abstracts.

Second, we again observe a clear reduction of FKGL readability, in particular for systems with a high proportion of sentence splits. We make the same proviso as before: shorter sentences and shorter or more common words are only a weak proxy for text complexity, as complex terminology and brief abbreviations may remain and stay opaque for lay users. A very simple grammar is common at youth reading levels, such as those targeted by the popular Newsela-auto [13] data, making FKGL a popular readability score. However, in plain English summaries of scientific text we do not observe such a reduction [25].
Third, while we observe higher scores on the train data in Table 6 than on the test data in Table 4 above, there still seems to be room for improvement. Throughout the table, we see many low BLEU scores, and the very high fractions of additions risk the gratuitous introduction of new content, and hence "hallucination."

Table 6
Results for CLEF 2024 SimpleText Task 3.1 sentence-level text simplification (task number removed from the run_id) on the train set
Columns: run_id, # inputs, FKGL, SARI, BLEU, compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, lexical complexity score
Source 893 14.30 19.18 38.95 1.00 1.00 1.00 1.00 0.00 0.00 8.72
References 893 11.70 100.00 100.00 0.84 1.07 0.72 0.04 0.21 0.37 8.63
Sharingans_finetuned 714 11.69 64.75 52.53 0.82 1.07 0.73 0.05 0.19 0.37 8.61
Elsevier@SimpleText_run3 714 11.78 46.78 25.55 0.76 0.99 0.68 0.00 0.23 0.47 8.62
Elsevier@SimpleText_run6 714 12.58 44.36 20.64 0.90 1.02 0.64 0.00 0.37 0.47 8.56
Elsevier@SimpleText_run7 714 12.67 43.76 20.51 0.85 1.00 0.63 0.00 0.35 0.50 8.61
Elsevier@SimpleText_run8 714 12.54 43.64 20.69 0.85 1.02 0.63 0.00 0.34 0.50 8.60
Elsevier@SimpleText_run9 714 12.66 43.59 20.33 0.86 1.00 0.63 0.00 0.35 0.51 8.63
Elsevier@SimpleText_run10 714 12.57 43.37 20.29 0.86 1.02 0.63 0.00 0.34 0.50 8.61
Elsevier@SimpleText_run4 714 11.79 43.30 20.05 0.84 1.01 0.62 0.00 0.38 0.52 8.49
Elsevier@SimpleText_run1 714 10.52 41.05 15.56 0.86 1.07 0.59 0.00 0.45 0.53 8.35
Tomislav&Rowan_LLAMA 25 11.84 40.67 4.27 3.94 2.86 0.41 0.00 0.73 0.28 8.36
AIIRLab_Mistral_7B_Instruct_V0.2 893 10.64 39.36 14.07 0.74 1.05 0.58 0.00 0.32 0.58 8.62
UBO_Phi4mini-s 714 8.60 39.27 1.15 17.05 22.28 0.48 0.00 0.65 0.30 8.85
UZH_Pandas_simple_with_cot 714 13.81 38.73 4.62 3.42 2.74 0.41 0.00 0.77 0.12 8.57
AIIRLab_llama-3-8b_run1 714 8.32 38.53 11.75 0.89 1.39 0.56 0.00 0.46 0.59 8.39
AIIRLab_llama-3-8b_run3 714 9.28 37.89 9.35 1.12 1.51 0.54 0.00 0.52 0.58 8.45
UZH_Pandas_simple_with_intermediate_definitions 714 21.60 36.71 5.10 1.91 0.99 0.46 0.00 0.70 0.34 8.83
PiTheory_T5 97 9.94 36.53 11.02 1.37 1.53 0.63 0.00 0.48 0.30 8.51
team1_Petra_and_Regina_task3_ST 893 8.42 36.19 19.72 0.58 1.29 0.66 0.03 0.05 0.47 8.66
UBO_RubyAiYoungTeam 893 8.42 36.19 19.72 0.58 1.29 0.66 0.03 0.05 0.47 8.66
SONAR_SONARnonlinreg 714 13.61 36.01 29.89 0.96 1.02 0.92 0.12 0.10 0.13 8.65
UBO_RubyAiYoungTeam 714 8.67 35.97 19.73 0.59 1.27 0.68 0.04 0.05 0.45 8.67
UZH_Pandas_simple 714 10.91 35.56 8.27 0.84 0.99 0.52 0.00 0.52 0.64 8.37
UZH_Pandas_selection_with_lens 714 21.45 35.56 4.26 1.91 1.00 0.44 0.00 0.71 0.35 8.84
AIIRLab_llama-3-8b_run2 714 10.43 35.47 6.87 1.00 1.18 0.52 0.00 0.59 0.58 8.29
UAms_GPT2_Check 714 11.87 35.21 27.35 1.02 1.22 0.87 0.11 0.17 0.14 8.59
UAms_GPT2 714 11.21 34.73 23.69 1.28 1.47 0.79 0.05 0.28 0.12 8.56
UZH_Pandas_selection_with_lens_cot 714 6.41 34.32 1.34 4.44 6.16 0.32 0.00 0.88 0.14 8.28
FRANE_AND_ANDREA_t5 893 8.57 34.20 33.58 0.87 1.72 0.82 0.17 0.11 0.24 8.73
Dajana&Kathy_t5 893 8.57 34.20 33.58 0.87 1.72 0.82 0.17 0.11 0.24 8.73
UZH_Pandas_5Y_target_with_intermediate_definitions 714 19.83 34.20 3.40 1.74 0.99 0.45 0.00 0.71 0.41 8.86
UAms_Wiki_BART_Snt 714 12.34 34.19 37.18 0.83 0.99 0.88 0.29 0.02 0.19 8.64
UZH_Pandas_selection_with_sle_cot 714 6.23 34.07 1.15 4.66 6.51 0.31 0.00 0.89 0.14 8.28
UZH_Pandas_5Y_target_with_cot 714 6.16 33.98 1.13 4.66 6.53 0.30 0.00 0.89 0.14 8.26
Arampatzis_T5 893 12.15 33.12 21.85 1.09 1.25 0.72 0.03 0.35 0.38 9.07
UBO_Phi4mini-sl 714 7.02 32.94 1.02 5.49 7.03 0.39 0.00 0.79 0.44 8.69
UZH_Pandas_selection_with_lens 714 7.85 32.31 4.96 0.72 0.99 0.46 0.00 0.54 0.73 8.21
UZH_Pandas_selection_with_sle 714 6.22 30.25 2.45 0.66 0.99 0.43 0.00 0.56 0.78 8.18
UZH_Pandas_5Y_target 714 6.02 29.88 2.03 0.66 1.00 0.42 0.00 0.58 0.79 8.19
UAms_Cochrane_BART_Snt 714 13.74 26.70 36.69 0.94 0.99 0.95 0.56 0.03 0.08 8.67
Arampatzis_DistilBERT 893 6.07 26.42 29.20 1.03 2.94 0.95 0.00 0.21 0.10 8.63

4.3.2. Task 3.2: Abstract-level scientific text simplification

Table 7 shows the abstract-level text simplification results for the train data.

Table 7
Results for CLEF 2024 SimpleText Task 3.2 abstract-level text simplification (task number removed from the run_id) on the train set
Columns: run_id, # inputs, FKGL, SARI, BLEU, compression ratio, sentence splits, Levenshtein similarity, exact copies, additions proportion, deletions proportion, lexical complexity score
Source 175 14.30 19.53 39.95 1.00 1.00 1.00 1.00 0.00 0.00 8.88
References 175 11.80 100.00 100.00 0.80 1.04 0.70 0.00 0.20 0.40 8.75
Sharingans_finetuned 119 11.36 60.65 45.74 0.78 1.07 0.68 0.00 0.20 0.41 8.71
Mistral-7B-Instruct-V0.2 175 12.85 40.66 16.52 0.79 0.92 0.60 0.00 0.29 0.51 8.83
AIIRLab_llama-3-8b_run3 119 9.77 40.62 15.04 0.70 1.03 0.55 0.00 0.31 0.57 8.59
Elsevier@SimpleText_run5 119 12.16 40.30 14.23 0.71 0.84 0.55 0.00 0.30 0.57 8.62
UBO_Phi4mini-l 119 9.39 39.95 14.41 1.87 3.23 0.56 0.00 0.18 0.56 8.95
AIIRLab_llama-3-8b_run1 119 8.49 39.51 13.00 0.65 1.03 0.54 0.00 0.31 0.61 8.47
Elsevier@SimpleText_run2 119 11.09 39.32 12.43 0.68 0.86 0.53 0.00 0.31 0.60 8.56
Tomislav&Rowan_LLAMA 20 10.48 37.61 15.26 1.13 1.70 0.53 0.00 0.45 0.48 8.73
AIIRLab_llama-3-8b_run2 119 10.42 37.13 9.95 0.82 1.01 0.51 0.00 0.47 0.57 8.37
UAms_GPT2_Check_Abs 119 12.75 36.68 16.48 0.59 0.66 0.60 0.01 0.11 0.50 8.61
UAms_GPT2_Check_Snt 119 11.88 35.97 28.86 1.00 1.22 0.85 0.01 0.18 0.15 8.71
UAms_Cochrane_BART_Par 119 16.15 35.12 26.23 0.70 0.59 0.70 0.04 0.08 0.36 8.72
UBO_Phi4mini-ls 119 8.71 34.81 7.23 0.89 1.50 0.44 0.00 0.34 0.68 8.57
Arampatzis_T5 175 11.39 33.94 9.61 0.48 0.60 0.53 0.00 0.07 0.59 8.90
UAms_Wiki_BART_Doc 119 16.45 33.36 28.35 1.01 0.83 0.81 0.00 0.18 0.15 8.73
UAms_Cochrane_BART_Doc 119 14.78 33.23 9.55 0.40 0.40 0.52 0.03 0.01 0.61 8.76
UAms_Wiki_BART_Par 119 13.26 30.31 36.76 0.89 1.00 0.88 0.01 0.03 0.13 8.81
Arampatzis_DistilBERT 175 11.24 25.17 30.75 1.02 1.67 0.96 0.00 0.16 0.09 8.78

We make the following observations.

First, we observe higher scores for systems that deploy fine-tuning, which does not seem to generalize to the unseen test data evaluated above. Most systems, however, were not particularly trained or fine-tuned on the train data and show similar performance on both train and test.

Second, we observe solid performance for the more complex document-level scientific text simplification task, but this is due to many systems deploying proven sentence-level text simplification technology and merging the sentence-level output back into complete abstracts.

Third, while a sentence-level approach to document-level text simplification is a pragmatic choice and a viable strategy, several models perform direct abstract-level or paragraph-level simplification, taking the discourse structure and more complex sentence reordering and deletion into account. These document-level text simplification approaches tend to lead to far greater compression, including whole-sentence deletions, making their output far more succinct than sentence-level approaches to document-level text simplification.
Given their succinct output, and in light of the human reference simplifications being constructed at the sentence level, the scores of the direct abstract-level or paragraph-level approaches are impressive. Further research in such document-level text simplification approaches would be important for the future of the CLEF SimpleText track.

5. Analysis

This section provides further analysis of the submitted runs, and of the task as a whole.

Table 8
Example of SimpleText Task 3 system output for the same source input as in Table 2 (query G01.1, document 130055196); whole inserted sentences without a counterpart in the source are shown in square brackets
Output: As various kinds of output devices emerged, such as highresolution printers or a display of PDA (Personal Digital Assistant). The importance of high-quality resolution conversion has been increasing. This paper proposes a new method for enlarging an image with high quality. [It will involve using a combination of high-speed imaging and high-resolution video.] One of the biggest problems on image enlargement is the exaggeration of the jaggy edges. [This is especially true when the image is enlarged, as in this case.] To remedy this problem, we propose a new interpolation method. This method uses artificial neural network to determine the optimal values of interpolated pixels. The experimental results are shown and evaluated. [The results are compared to other studies and found to be inconclusive.] The effectiveness of our methods is discussed by comparing with the conventional methods. [Our methods are designed to help people with mental health problems, not just as a way to cure them.]

5.1. Human Evaluation

Due to the delayed submission deadline, as well as follow-up correspondence with teams on partial or incorrect output, the manual annotation of system output has been limited to a small sample and is still ongoing. We report here only initial observations from the translation professionals conducting this analysis, based on the expectation of what a professional editor would provide as reference output. We looked in particular at the novel document-level simplifications of the entire abstract, and at their coherence and discourse structure.

First, and foremost, something is working. The automatic text simplifications are generally of impressive quality despite the remaining limitations that are the focus of this section. The fluency and language variation are impressive, and far exceed earlier language generation technology, whose output often reflected the templates or rule-based systems underlying it.

Second, changes can be neither necessary nor helpful. Frequently, as we observed in our work on the project last year [12], the information is written in another way but does not offer simplification. Sometimes the vocabulary does not change but is simply rearranged.

Third, discourse structure matters. In other examples the resulting text is not shaped as a whole, with a proper beginning, middle, and end, but is reordered to the detriment of clarity. For example, the first sentence of the "simplified" abstract can contain a reference back to information already given. Another example is a simplification whose first sentence starts with "However, ..." when the source text started with "It is the purpose of this study, ...", or with "For example, ..." when the original first sentence presented the subject.

Fourth, brevity is not always clearer.
Although some examples shorten the sentences within an abstract, thus technically simplifying, their interrelation is not necessarily maintained, producing a choppy style. Better results were produced when the new text was split into subsections dedicated to particular subtopics, including their explanation.

Fifth, gratuitous additions are problematic. Another type of problem is illustrated by the creation of a cumbersome nominal group, "the 21st Century managed care needs of patients, ...", which does not exist in the original, where we instead had an evocative example: "the emergency room at home." Here, though, both things belong to the same domain. Elsewhere, seeming hallucinations appeared, for example, through the addition of an off-topic sentence. For example, to an abstract about digital tools to aid Parkinson's sufferers, we found the following last sentence added during simplification: "It includes advice on how to manage consultant work, such as research and development." Although, in terms of meaning, this has no equivalent in the source text, the opening sentence of the source was: "The paper also discusses how a practitioner can accomplish UCSD in the context of product development and consultant work.", which mentions the topic in a different context.

Table 9
Analysis of SimpleText Task 3.1: Spurious generation
Columns: run, # input sentences, # output sentences with spurious content, fraction
AB/DVP_SequentialLSTM 4797 4788 1.00
AIIRLab_Mistral_7B_Instruct_V0 779 23 0.03
AIIRLab_llama-3-8b_run3 4797 129 0.03
AIIRLab_llama-3-8b_run3 4797 381 0.08
AIIRLab_llama-3-8b_run3 4797 489 0.10
Dajana/Kathy_t5 779 80 0.10
Elsevier@SimpleText_run1 4797 50 0.01
Elsevier@SimpleText_run10 4796 49 0.01
Elsevier@SimpleText_run3 4795 36 0.01
Elsevier@SimpleText_run4 4795 32 0.01
Elsevier@SimpleText_run6 4796 46 0.01
Elsevier@SimpleText_run7 4796 41 0.01
Elsevier@SimpleText_run8 4796 46 0.01
Elsevier@SimpleText_run9 4796 43 0.01
FRANE_AND_ANDREA_t5 779 80 0.10
SONAR_SONARnonlinreg 4797 15 0.00
Sharingans_finetuned 4797 51 0.01
UAms-1_Cochrane_BART_Snt 4797 25 0.01
UAms-1_GPT2 4797 1390 0.29
UAms-1_GPT2_Check 4797 3 0.00
UAms-1_Wiki_BART_Snt 4797 14 0.00
UBO_Phi4mini-s 4797 2055 0.43
UBO_Phi4mini-sl 4797 1822 0.38
UBO_RubyAiYoungTeam 779 169 0.22
UBO_RubyAiYoungTeam 4797 1051 0.22
UZHPandas_5Y_target 4797 2607 0.54
UZHPandas_5Y_target_cot 4797 3383 0.71
UZHPandas_5Y_target_intermediate_defs 4797 365 0.08
UZHPandas_selection_lens 4797 283 0.06
UZHPandas_selection_lens_cot 4797 3265 0.68
UZHPandas_selection_sle 4797 2311 0.48
UZHPandas_selection_sle_cot 4797 3362 0.70
UZHPandas_simple 4797 166 0.03
UZHPandas_simple_cot 4797 2915 0.61
UZHPandas_simple_intermediate_defs 4797 79 0.02
Arampatzis_DistilBERT 5576 5575 1.00
Arampatzis_T5 5576 336 0.06
team1_Petra_and_Regina_ST 779 169 0.22

5.2. Spurious or overgeneration

We conduct a deeper analysis of how much of the generated simplified output sentences and abstracts can be traced to the source input. In particular, we look at spurious generated content and its prevalence in the submitted generated text simplifications. This content is at risk of being introduced gratuitously by the generative model and corresponds to what is informally referred to as "hallucinations."

Earlier, in Table 2, we showed an example of a human reference simplification, combining the input sentences belonging to the abstract of the document id = 130055196 retrieved for query G01.1. We can do the same for the automatically generated scientific text simplifications.
We again compare the output with the source input sentences. Table 8 shows an example output simplification from one of the participating teams, for the same input sentences as in Table 2 above. Most simplifications are revisions of the input, but we also observe that sometimes an entire sentence is inserted (shown in square brackets in Table 8). The example in Table 8 is an extreme case, picked to illustrate both the importance and the complexity of detecting such spurious content.

We provide a detailed analysis quantifying the prevalence of spurious content in the CLEF 2024 SimpleText Task 3 submissions. Table 9 quantifies how often such spurious generation occurs at the sentence level. We re-aligned the generated output with the original source sentences, and flag here only entire output sentences that do not share a single token with the input. Our analysis reveals that the amount of spurious content varies but is far from infrequent. A total of 17 out of 36 submissions (47%) have spurious whole sentences in at least 10% of the input sentences. In fact, 14 submissions (39%) have them in at least 20% of the input, and 7 submissions (19%) in at least 50% of the input sentences.

The detection of non-aligned output sentences is indicative but imperfect. For example, a significant reordering of content may lead to false positives in rare cases, and unusual tokenization or formatting may affect the alignment with the source, even systematically. Note also that the detected additions may introduce helpful background knowledge or other useful information to contextualize the information in the source sentences.
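The alignment check described above can be approximated in a few lines: split the generated output into sentences, tokenize, and flag every output sentence that shares no token with its source input. The sketch below is a simplified illustration of that idea; the crude regex tokenization and sentence splitting are our assumptions and not necessarily those of the official analysis.

import re

def tokens(text):
    # Lower-cased word tokens; a crude stand-in for proper tokenization.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def split_sentences(text):
    # Very rough sentence splitting on end-of-sentence punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def spurious_sentences(source, output):
    # Output sentences that do not share a single token with the source input.
    source_tokens = tokens(source)
    return [s for s in split_sentences(output)
            if tokens(s) and not (tokens(s) & source_tokens)]

source = ("With the ever increasing number of unmanned aerial vehicles getting involved "
          "in activities in the civilian and commercial domain, there is an increased "
          "need for autonomy in these systems too.")
output = ("Drones are used more and more for civilian purposes. "
          "They also help farmers plan their holidays.")  # second sentence is invented
print(spurious_sentences(source, output))
# ['They also help farmers plan their holidays.']

Aggregating such flags per run gives counts and fractions of the kind reported in Tables 9 and 10.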
Table 10 quantifies how often such spurious generation occurs for the abstract-level output.

Table 10
Results for SimpleText Task 3.2: Spurious generation
Columns: run, # input abstracts, # spurious, fraction
AIIRLab_llama-3-8b_run1 782 56 0.07
AIIRLab_llama-3-8b_run2 782 121 0.15
AIIRLab_llama-3-8b_run3 782 98 0.13
Elsevier@SimpleText_run2 782 28 0.04
Elsevier@SimpleText_run5 782 30 0.04
Mistral-7B-Instruct-V0 119 6 0.05
Sharingans_finetuned 782 59 0.08
UAms-2_Cochrane_BART_Doc 782 2 0.00
UAms-2_Cochrane_BART_Par 782 28 0.04
UAms-2_GPT2_Check_Abs 782 1 0.00
UAms-2_GPT2_Check_Snt 782 111 0.14
UAms-2_Wiki_BART_Doc 782 74 0.09
UAms-2_Wiki_BART_Par 782 46 0.06
UBO_Phi4mini-s 782 102 0.13
UBO_Phi4mini-sl 782 104 0.13
Arampatzis_DistilBERT 901 118 0.13
Arampatzis_T5 901 5 0.01

Here we look at spurious output at the end of the generated abstract, rather than conducting a sentence-level analysis as done above. Aligning longer texts is more complex than aligning individual sentences. For those systems generating true paragraph- or document-level simplifications, we observe more variation, involving content from multiple input sentences and leading to a more complex alignment. Hence we focus on detecting spurious content at the end of the generated abstract. As a result, for those systems aggregating sentence-level output merged into the abstracts, we are only able to detect spurious content in the final sentence.

We make a number of observations based on our analysis in this section. First, the fraction of sentences with spurious content is very low for some submissions; for other submissions, however, the fraction is very substantial. Second, the standard evaluation measures used for text simplification, and in fact for any text generation task in NLP, do not take this aspect into account. A submission with significant spurious content can still obtain very high text overlap with the reference, and hence obtain a very high performance score. Third, and more generally, human evaluation and this type of analysis are crucial for accurately evaluating generative models for the NLP and IR challenges addressed in our track and in CLEF in general.

6. Conclusions

The paper provides an overview of the CLEF 2024 SimpleText Task 3: Text Simplification, which focuses on the simplification of scientific text. The objective of the task is to simplify either the separate sentences or the entire scientific abstracts in order to enhance their accessibility and comprehensibility for a general audience. We highlighted the key aspects and goals of the task within the broader context of the CLEF 2024 SimpleText track [11]. Our main findings are the following. First, we observe competitive performance for scientific text simplification, both on evaluation against the human reference simplifications and on text statistics such as the FKGL readability score. Second, the abstract-level text simplification results reflect a mixture of sentence-level and passage-level text simplification approaches. Third, our analysis reveals a high and varying amount of spurious text generation, which is not detected by standard evaluation measures and is a major concern for the use of these models in a real-world setting. More generally, almost all participants use generative models (for the task, the track, and CLEF in general), and the track offers a unique setting to study some of the inherent limitations of generative models.

The main aim of our task, the track, and the CLEF evaluation forum as a whole is i) to foster a community of IR, NLP, and AI researchers working together on the important task of making science more accessible for everyone, and ii) to construct corpora and evaluation resources to stimulate research on scientific text summarization and simplification.

In terms of building a community researching scientific text summarization and simplification, the task saw a record attendance in 2024: due to the additional abstract-level task we received 83 runs from 15 teams, the largest number of participating teams ever. In fact, the community is broadening beyond CLEF and raising general interest in generative scientific text summarization and simplification [1].

Within the CLEF 2024 SimpleText Task 3, we have constructed extensive corpora and manually labeled evaluation data for scientific text simplification. Specifically, we added in 2024 a parallel corpus of manually simplified sentences and abstracts from the scientific literature:
• Train, sentence level: 893 source sentences from scientific abstracts paired with 958 corresponding human reference simplifications.
• Test, sentence level: 578 source sentences from scientific abstracts paired with corresponding human reference simplifications.
• Train, abstract level: 175 source scientific abstracts paired with corresponding human reference simplifications.
• Test, abstract level: 103 source scientific abstracts paired with corresponding human reference simplifications.
These reusable corpora and evaluation resources are available to participants and other researchers who want to work on the important problem of making scientific information open and easily accessible for everyone.

Acknowledgments
This track would not have been possible without the great support of numerous individuals. We want to thank in particular the colleagues and the students who participated in data construction and evaluation.
Please visit the SimpleText website for more details on the track.1 Liana Ermakova is funded by the French National Research Agency (ANR) Automatic Simplification of Scientific Texts project (ANR-22-CE23-0019-01),2 and the MaDICS research group.3 Jaap Kamps is partly funded by the Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA # 1518.22.105), the University of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab). Views expressed in this paper are not necessarily shared or endorsed by those funding the research. 1 https://simpletext-project.com/ 2 https://anr.fr/Project-ANR-22-CE23-0019 3 https://www.madics.fr/ateliers/simpletext/ References [1] G. M. D. Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024. URL: https://aclanthology.org/2024.determit-1.0. [2] S. Štajner, H. Saggio, M. Shardlow, F. Alva-Manchego (Eds.), Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023. URL: https://aclanthology.org/2023.tsar-1.0. [3] S. Štajner, H. Saggion, D. Ferrés, M. Shardlow, K. C. Sheang, K. North, M. Zampieri, W. Xu (Eds.), Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Virtual), 2022. URL: https://aclanthology.org/2022.tsar-1.0. [4] H. Saggion, S. Stajner, D. Ferrés, K. C. Sheang (Eds.), Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021) co-located with the 37th Conference of the Spanish Society for Natural Language Processing (SEPLN2021), Online (initially located in Málaga, Spain), September 21st, 2021, volume 2944 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-2944. [5] L. Ermakova, P. Bellot, P. Braslavski, J. Kamps, J. Mothe, D. Nurbakova, I. Ovchinnikova, E. SanJuan, Overview of simpletext 2021 - CLEF workshop on text simplification for scientific information access, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21-24, 2021, Proceedings, volume 12880 of Lecture Notes in Computer Science, Springer, 2021, pp. 432–449. URL: https://doi.org/10.1007/978-3-030-85251-1_27. doi:10.1007/978-3-030-85251-1\_27. [6] L. Ermakova, E. SanJuan, J. Kamps, S. Huet, I. Ovchinnikova, D. Nurbakova, S. Araújo, R. Hannachi, É. Mathurin, P. Bellot, Overview of the CLEF 2022 simpletext lab: Automatic simplification of scientific texts, in: A. Barrón-Cedeño, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5-8, 2022, Proceedings, volume 13390 of Lecture Notes in Computer Science, Springer, 2022, pp. 470–494. URL: https://doi.org/10.1007/978-3-031-13643-6_28. doi:10. 1007/978-3-031-13643-6\_28. [7] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, O. Augereau, J. 
Kamps, Overview of the CLEF 2023 simpletext lab: Automatic simplification of scientific texts, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 482–506. URL: https://doi.org/10.1007/978-3-031-42448-9_30. doi:10.1007/978-3-031-42448-9\_30. [8] E. SanJuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, in: [26], 2024. [9] G. M. Di Nunzio, F. Vezzani, V. Bonato, H. Azarbonyad, J. Kamps, L. Ermakova, Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, in: [26], 2024. [10] J. D’Souza, S. Kabongo, H. B. Giglou, Y. Zhang, Overview of the CLEF 2024 SimpleText Task 4: SOTA? Tracking the State-of-the-Art in Scholarly Publications, in: [26], 2024. [11] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza, J. Kamps, Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone, in: [27], 2024. [12] L. Ermakova, S. Bertin, H. McCombie, J. Kamps, Overview of the CLEF 2023 simpletext task 3: Simplification of scientific texts, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2855–2875. URL: https://ceur-ws.org/Vol-3497/paper-240.pdf. [13] W. Xu, C. Callison-Burch, C. Napoles, Problems in current text simplification research: New data can help, Transactions of the Association for Computational Linguistics 3 (2015) 283–297. URL: https://aclanthology.org/Q15-1021. doi:10.1162/tacl_a_00139. [14] D. P. Varadi, A. Bartulović, SimpleText 2024: Scientific Text Made Simpler Through the Use of AI, in: [26], 2024. [15] N. Largey, R. Maarefdoust, S. Durgin, B. Mansouri, AIIR Lab Systems for CLEF 2024 SimpleText: Large Language Models for Text Simplification, in: [26], 2024. [16] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, Enhancing Scientific Document Simplification through Adaptive Retrieval and Generative Models, in: [26], 2024. [17] R. Elagina, P. Vučić, AI Contributions to Simplifying Scientific Discourse in SimpleText 2024, in: [26], 2024. [18] S. M. Ali, H. Sajid, O. Aijaz, O. Waheed, F. Alvi, A. Samad, Improving Scientific Text Comprehension: A Multi-Task Approach with GPT-3.5 Turbo and Neural Ranking, in: [26], 2024. [19] R. Mann, T. Mikulandric, CLEF 2024 SimpleText Tasks 1-3: Use of LLaMA-2 for text simplification, in: [26], 2024. [20] J. Bakker, G. Yüksel, J. Kamps, University of Amsterdam at the CLEF 2024 SimpleText Track, in: [26], 2024. [21] B. Vendeville, L. Ermakova, P. De Loor, UBO NLP report on the SimpleText track at CLEF 2024, in: [26], 2024. [22] A. Michail, P. S. Andermatt, T. Fankhauser, Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding: Examining MBR’s Decisions, in: [26], 2024. [23] A. Michail, P. S. Andermatt, T. Fankhauser, Scientific text simplification using multi-prompt minimum bayes risk decoding: Simpletext best of labs in CLEF 2023, in: [27], 2024. [24] L. Ermakova, J. 
Kamps, Complexity-aware scientific literature search: Searching for relevant and accessible scientific text, in: G. M. D. Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 16–26. URL: https://aclanthology.org/2024.determit-1.2. [25] J. Bakker, J. Kamps, Plan-guided simplification of biomedical documents, in: Under Submission, 2024. [26] G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024. [27] L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.