<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2024, Grenoble, France</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2024 SimpleText Task 3: Simplify Scientific Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Laimé</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen McCombie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>BTU</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This article provides a comprehensive summary of the CLEF 2024 SimpleText Task 3, which focuses on simplifying scientific text based on specific queries. We discuss in detail the motivation for lay access to scholarly literature, and provide an overview of the setup of the scientific text simplification task. One of the main innovations of the CLEF 2024 SimpleText Task 3 is to complement sentence-level text simplification with a document-level text simplification task. We describe the resulting sentence-level and document-level text simplification test collection in detail, which consists of a corpus of over 1,500 paired source and reference sentences, and a corpus of over 250 paired source and reference abstracts, both containing the source text from scientific abstracts with direct reference simplifications produced by human annotators. We present the results of the participants’ submissions, with 15 teams submitting 52 sentence-level text simplification runs and 9 teams submitting 31 document-level text simplification runs. The article concludes with an in-depth analysis, including information distortion and potential LLM “hallucinations” in the simplified sentences submitted by participants.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic text simplification</kwd>
        <kwd>science popularization</kwd>
        <kwd>information distortion</kwd>
        <kwd>error analysis</kwd>
        <kwd>lexical complexity</kwd>
        <kwd>syntactic complexity</kwd>
        <kwd>LLMs hallucination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>1. Task 1 on Content Selection: retrieve passages to include in a simplified summary.
2. Task 2 on Complexity Spotting: identify and explain difficult concepts.</p>
      <sec id="sec-1-1">
        <title>3. Task 3 on Text Simplification: simplify scientific text.</title>
        <p>4. Task 4 on SOTA?: track the state-of-the-art in scholarly publications.</p>
        <p>
          This paper presents an overview of the CLEF 2024 SimpleText Task 3 on Text Simplification. For a
comprehensive overview of the other tasks, the task overview papers on Task 1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Task 2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and
Task 4 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], as well as the track overview paper [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], provide detailed information and further insights.
        </p>
        <p>
          The CLEF 2024 SimpleText Task 3 directly addresses the technical and evaluation challenges associated
with making scientific information accessible to a wide audience, including students and non-experts. We
describe the data and benchmarks provided for scientific text simplification, along with the participants’
results and further analysis. This task on simplifying scientific text is a direct continuation of the CLEF
2023 Task 3 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. One of the key innovations in 2024 is the introduction of both sentence-level and
document (abstract) level scientific text simplification subtasks, as Task 3.1 and Task 3.2.
        </p>
        <p>A total of 45 teams registered for our SimpleText track at CLEF 2024. Overall, 20 teams submitted
207 runs for the track, of which 15 teams submitted a total of 83 runs for Task 3. The statistics
for the submitted Task 3 runs are presented in Table 1. However, some runs had problems that we could
not resolve; we do not detail these runs, nor the zero-scored runs, in this paper.</p>
        <p>This introduction is followed by Section 2, presenting the text simplification task with the datasets and
evaluation metrics used. Section 3 gives an overview of the text simplification approaches for scientific text
deployed by the participants. In Section 4, we present and discuss the results of the official submissions.
In Section 5, a thorough analysis of the results is carried out, covering several important aspects: the
relationship between difficult scientific terms and the simplification process, information distortion that
may occur during simplification, and instances of large language models (LLMs) generating hallucinations
and producing inaccurate information. Section 6 concludes by summarizing the findings and drawing
perspectives for future work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Task 3: Simplify Scientific Text</title>
      <p>This section details Task 3: Text Simplification, on simplifying scientific text.</p>
      <sec id="sec-2-1">
        <title>2.1. Description</title>
        <p>The goal of this task is to provide a simplified version of sentences extracted from scientific abstracts.
Participants are provided with popular science articles and queries, along with matching abstracts of
scientific papers, either split into individual sentences or as entire abstracts. This year features
both sentence-level (Task 3.1) and document- or abstract-level (Task 3.2) text simplification.</p>
        <p>Table 2 shows an example of a human reference simplification, combining the input sentences
belonging to the abstract of the document with id 130055196 retrieved for query G01.1. In Table 2, the
deletions and insertions are shown relative to the source input sentences. The resulting reference
simplification reads: “The rise of output devices like high-resolution printers and PDA (Personal Digital
Assistant) displays has increased the need for high-quality resolution conversion. The paper proposes a
new method to make images bigger while maintaining high quality. The main issue with enlarging images
is that jagged edges can become exaggerated. To solve this problem, we suggest a new interpolation
method that helps us to estimate the value of the newly generated pixels using a neural network. The
experiment’s results are presented and analyzed. We evaluate the effectiveness of our methods by
comparing them to traditional approaches.”
2.1.1. Data</p>
        <p>Table 3: Statistics of the Task 3 corpora.
Level | Role | Source | Reference
Sentence | Train | 893 sentences | 958 simplified sentences
Sentence | Test | 578 sentences | 578 simplified sentences
Sentence | Combined | 1,471 sentences | 1,536 simplified sentences
Document | Train | 175 abstracts | 175 simplified abstracts
Document | Test | 103 abstracts | 103 simplified abstracts
Document | Combined | 278 abstracts | 278 simplified abstracts</p>
        <p>Task 3 uses a corpus based on the high-ranked abstracts retrieved for the requests of the CLEF 2024
SimpleText Task 1. Our training data is a truly parallel corpus of directly simplified sentences coming
from scientific abstracts from the DBLP Citation Network Dataset for Computer Science and Google
Scholar and PubMed articles on Health and Medicine. Other existing text simplification corpora used
post-hoc aligned sentences [e.g., 13].</p>
        <p>In 2024, we expanded the training and evaluation data. In addition to sentence-level text simplification,
we provide document-level or abstract-level input and reference simplifications. In order to make
the sentence-level and document-level tasks fairly comparable, both use the exact same reference
simplifications. The sentences from scientific abstracts were simplified either by master
students in Technical Writing and Translation or by a domain expert (a computer scientist) and a
professional translator (a native English speaker) working together.</p>
        <p>Table 3 gives an overview of all the SimpleText Task 3 scientific text simplification corpora constructed
in 2024. The SimpleText corpus contains 1,536 directly simplified sentences, corresponding to 278
scientific abstracts. This is a useful addition to existing high-quality corpora like Newsela [ 13], with
2,259 sentences in Newsela-Manual. Our track is the first to focus on the simplification of scientific text,
which has a much higher text complexity than news articles.</p>
        <p>
          Available Task 3 training data is derived from the CLEF 2023 edition [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and includes 893 source
sentences from 175 scientific abstracts paired with the corresponding manual reference simplifications.
The new test data created in 2024 consists of 578 sentences paired with reference simplifications
for the sentence-level task (Task 3.1), and 103 abstracts paired with reference simplifications for the
document-level task (Task 3.2).
2.1.2. Formats
Sources The source data are provided in JSON format with the following fields:
1. snt_id (Task 3.1) or abs_id (Task 3.2): a unique sentence (or abstract) identifier
2. source_snt (Task 3.1) or source_abs (Task 3.2): passage text (sentence or abstract)
3. doc_id: a unique source document identifier
4. query_id: a query ID
5. query_text: difficult terms should be extracted from sentences with regard to this query
An example of the Task 3.1 JSON source input is:
{
  "query_id": "G11.1",
  "query_text": "drones",
  "doc_id": 2892036907,
  "snt_id": "G11.1_2892036907_2",
  "source_snt": "With the ever increasing number of unmanned aerial vehicles getting involved in activities in the civilian and commercial domain, there is an increased need for autonomy in these systems too."
}
References The references are provided in a format very similar to that of the predictions described
below. An example of a Task 3.1 reference in JSON is:
{
  "snt_id": "G11.1_2892036907_2",
  "simplified_snt": "Drones are increasingly used in the civilian and commercial domain and need to be autonomous."
}
Predictions Predictions or submissions of participants were also requested in JSON format with
the following fields:
1. run_id: Run ID starting with &lt;team_id&gt;_&lt;task_id&gt;_&lt;method_used&gt;, e.g. UBO_Task3.1_BLOOM
2. manual: Whether the run is manual {0,1}
3. snt_id (Task 3.1) or abs_id (Task 3.2): a unique sentence or abstract identifier from the input file
4. simplified_snt (Task 3.1) or simplified_abs (Task 3.2): simplified text for the sentence or abstract
An example of the Task 3.1 submission in JSON is:
{
  "run_id": "Elsevier@SimpleText_Task3.1_run1",
  "manual": 0,
  "snt_id": "G11.1_2892036907_2",
  "simplified_snt": "As more and more drones are used for civilian and commercial purposes, there is a growing need for them to operate independently."
}
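As an illustration, the required fields of a submission file can be checked with a few lines of Python. This is a hypothetical helper, not part of the official task tooling; only the field names above are taken from the task description:

```python
import json

# Required fields per subtask, as listed in the format description.
REQUIRED = {"3.1": {"run_id", "manual", "snt_id", "simplified_snt"},
            "3.2": {"run_id", "manual", "abs_id", "simplified_abs"}}

def validate_run(records, task="3.1"):
    """Return a list of (record index, problems) for a parsed JSON run."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED[task] - rec.keys()
        if missing:
            problems.append((i, sorted(missing)))
        elif rec["manual"] not in (0, 1):
            problems.append((i, ["manual must be 0 or 1"]))
    return problems

run = json.loads('[{"run_id": "UBO_Task3.1_BLOOM", "manual": 0, '
                 '"snt_id": "G11.1_2892036907_2", "simplified_snt": "..."}]')
print(validate_run(run))  # → []
```

A well-formed record yields no problems; a record with missing fields or an out-of-range manual flag is reported with its index.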
2.1.3. Evaluation
In 2024, we emphasize large-scale automatic evaluation measures (SARI, BLEU, compression, readability)
that provide a reusable test collection. This automatic evaluation will be supplemented with a detailed
human evaluation of other aspects, essential for deeper analysis. Almost all participants used generative
models for text simplification, yet existing evaluation measures are blind to potential hallucinations
with extra or distorted content [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. In 2024, we provide further analysis of ways to detect and quantify
spurious content in the output, potentially corresponding to what is informally called “hallucinations.”
        </p>
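        <p>To illustrate the main automatic measure, a toy, unigram-only, single-reference variant of SARI can be sketched as follows. This is only a sketch of the idea: the official SARI uses n-grams up to length four, multiple references, and precision-based deletion scoring, so its scores will differ, and in practice an established implementation (e.g., the EASSE toolkit) should be used:

```python
def unigram_sari(source: str, prediction: str, reference: str) -> float:
    """Toy SARI: mean F1 of keep/add/delete word operations (unigrams only)."""
    s, p, r = set(source.split()), set(prediction.split()), set(reference.split())

    def f1(sys, gold):
        if not sys and not gold:   # nothing to do and nothing done: perfect
            return 1.0
        tp = len(sys & gold)
        prec = tp / len(sys) if sys else 0.0
        rec = tp / len(gold) if gold else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    keep = f1(s & p, s & r)    # words kept that the reference also keeps
    add = f1(p - s, r - s)     # words added that the reference also adds
    delete = f1(s - p, s - r)  # words deleted that the reference also deletes
    return (keep + add + delete) / 3
```

        Rewarding deletions is what distinguishes SARI from pure overlap measures like BLEU: an output that deletes the same material as the human reference is credited, not penalized.</p>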
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Scientific Text Simplification Approaches</title>
      <p>In this section, we discuss a range of text simplification approaches that have been applied to scientific
text as provided by the track. A total of 15 teams submitted 83 runs in total.</p>
      <p>AB/DPV Varadi and Bartulović [14] submitted one run for Task 3. Their approach is an LSTM model
for the sentence-level task.</p>
      <p>AIIRLab Largey et al. [15] submitted a total of eight runs for Task 3. Their approach uses LLaMA3 and
Mistral models with diferent prompting and fine-tuning, for both the sentence-level and abstract-level
tasks.</p>
      <p>Arampatzis (No paper received) submitted a total of eight runs for Task 3. Their approach is a range
of models (DistilBERT, T5) for both the sentence-level and abstract-level tasks.</p>
      <p>Dajana/Katya (No paper with run details received) submitted one run for Task 3. Their approach
which follows standard text simplification approaches is applied to the sentence-level task.
Elsevier Capari et al. [16] submitted a total of ten runs for Task 3. Their approach is based on
a GPT-3.5 model experimenting with zero-shot and few-shot prompts for both sentence-level and
abstract-level tasks.</p>
      <p>Frane/Andrea (No paper with run details received) submitted one run for Task 3. Their approach
which follows standard text simplification approaches is applied to the sentence-level task.
Petra/Diana Elagina and Vučić [17] submitted one run for Task 3. Their approach is a LLaMA model
for the sentence-level task.</p>
      <p>PiTheory (No paper with run details received) submitted a total of twenty runs for Task 3. Their
approach uses pre-trained BART and T5 models but contains very few results for both the sentence-level
and abstract-level tasks.</p>
      <p>Ruby (No paper received) submitted two runs for Task 3. Their approach uses standard models for
both sentence-level and abstract-level tasks.</p>
      <p>Sharigans Ali et al. [18] submitted a total of two runs for Task 3. Their approach is a GPT-3.5 model
for both the sentence-level and abstract-level tasks.</p>
      <p>SONAR (No paper received) submitted a single run for Task 3. Their approach is a standard model
for the sentence-level task.</p>
      <p>Tomislav/Rowan Mann and Mikulandric [19] submitted a total of two runs for Task 3. Their approach
is the LLama 2 model with a range of prompts and post-processing for both the sentence-level and
abstract-level tasks. Their submission only covers a part of the train topics.</p>
      <p>UAmsterdam Bakker et al. [20] submitted a total of ten runs for Task 3. They experiment with
GPT-2, and Wiki and Cochrane-trained models at the sentence, paragraph, and document-level text
simplification, for both sentence-level and document-level tasks.</p>
      <p>UBO Vendeville et al. [21] submitted a total of four runs for Task 3. Their approach is to prompt a
smaller Phi3 model for lexical and grammatical text simplifications, for both the sentence-level and
abstract-level tasks.</p>
      <p>UZHPandas Michail et al. [22] submitted a total of ten runs for Task 3. They experiment with a
multi-prompt Minimum Bayes Risk (MBR) decoding approach to the sentence-level task. Their approach
is a refinement of their CLEF 2023 approach, which was recognized with a prestigious Best of the Labs
award, and published as part of the CLEF 2024 LNCS proceedings [23].</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section details the results of the task, for both the sentence-level and abstract-level text simplification
subtasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Task 3.1: Sentence-level scientific text simplification</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 3.2: Abstract-level scientific text simplification</title>
        <p>Table 5 shows the Task 3.2 (abstract-level text simplification) results. Again, we restrict the table to
submissions covering a sufficient number of input abstracts.</p>
        <p>We make a number of observations. First, in terms of evaluation measures like SARI we see again
similar encouraging performance levels when evaluating against the human reference simplifications.
This is partly due to the use of proven sentence-level text simplification models with the output merged
back into the entire abstract. Second, there remains room for improvement in capturing the human
simplifications more closely, as the BLEU score remains low throughout. Here, the more conservative
approaches seem to obtain better scores. Third, we see less extreme values on the other indicators, but
still considerable variation in the compression ratio, the number of splits, and the proportions of additions
and deletions. We investigate below how much of the output is grounded in the source sentences and
abstracts.</p>
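        <p>The surface statistics reported in the result tables can be approximated with a short sketch. This is illustrative only; the tokenization and sentence splitting used in the official evaluation may differ:

```python
import re

def surface_stats(source: str, simplified: str) -> dict:
    """Compression ratio (output/input words) and number of sentence splits."""
    def sents(text):
        return [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_src, n_out = len(source.split()), len(simplified.split())
    return {
        "compression": n_out / n_src,
        "splits": max(0, len(sents(simplified)) - len(sents(source))),
    }

print(surface_stats("The jaggy edges are exaggerated on image enlargement.",
                    "Images get jagged edges. Enlarging makes this worse."))
# → {'compression': 1.0, 'splits': 1}
```

        A compression ratio well below 1 indicates deletion-heavy output; a high split count indicates that long source sentences were broken up.</p>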
        <p>Many submissions rely on proven sentence-level text simplification approaches, with results closely
mirroring those observed for the sentence-level task. It is encouraging to see solid performance for the
approaches that perform text simplification on the entire abstract in one pass. This holds the promise of
incorporating the discourse structure, using more complex text simplification operations such as deletions
and merges, and deploying planner-based approaches to the text simplification of long documents.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Train results</title>
        <p>In this section, we show the results over the train data for sentence-level and abstract-level scientific
text simplification. This analysis includes those submissions restricted to the train data and left out above.
4.3.1. Task 3.1: Sentence-level scientific text simplification
Table 6 shows the sentence-level text simplification results for the train data.</p>
        <p>We make the following observations. First, we observed very high performance, with SARI scores
up to 65% for systems fine-tuned on the train data. Even more striking are the very high BLEU scores of
over 50%. This is a signal of potential overfitting, although the top-performing systems on train still
perform reasonably on the new test data. The majority of runs perform similarly on train and test, as
expected, since most are not particularly trained or fine-tuned on the relatively small set of train
sentences and abstracts.</p>
        <p>Second, we observe again a clear reduction of FKGL readability, in particular for systems with a
high proportion of sentence splits. We repeat the proviso that shorter sentences, and shorter or more
common words, are only a weak proxy for text complexity, as complex terminology and brief
abbreviations may remain and stay opaque for lay users. A very simple grammar is common at youth
reading levels, such as those targeted by the popular Newsela-auto [13] data, making FKGL a popular
readability score. However, in plain English summaries of scientific text we do not observe such a
reduction [25].</p>
        <p>Third, while we observe higher scores on the train data in Table 6 than on the test data above,
there still seems to be room for improvement. Throughout the table, we see many low BLEU scores,
and very high fractions of additions risk the gratuitous introduction of new content, and hence
“hallucination.”
4.3.2. Task 3.2: Abstract-level scientific text simplification
Table 7 shows the abstract-level text simplification results for the train data.</p>
        <p>We make the following observations. First, we observe higher scores for systems that deploy
fine-tuning, which does not seem to generalize to the unseen test evaluation above. Most systems,
however, were not particularly trained or fine-tuned on the train data and show similar performance on
both train and test.</p>
        <p>Second, we observe solid performance for the more complex document-level scientific text
simplification task, but this is largely due to many systems deploying proven sentence-level text
simplification technology and merging the sentence-level output back into complete abstracts.</p>
        <p>Third, while a sentence-level approach to document-level text simplification is a pragmatic choice
and viable strategy, several models perform direct abstract-level or paragraph-level simplification, taking
the discourse structure and more complex sentence reordering and deletion into account. These
document-level text simplification approaches tend to lead to far greater compression, including
whole-sentence deletions, making their output far more succinct than sentence-level approaches to
document-level text simplification. Given their succinct output, and in light of the sentence-level
construction of the human reference simplifications, the scores of the direct abstract-level or
paragraph-level approaches are impressive. Further research on such document-level text simplification
approaches would be important for the future of the CLEF SimpleText track.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis</title>
      <p>This section provides further analysis of the submitted runs, and the task as a whole.</p>
      <p>[Table 8: Example system output for the abstract of document 130055196 (cf. Table 2), with whole
inserted sentences that have no counterpart in the source, e.g., “It will involve using a combination of
high-speed imaging and high-resolution video.”, “The results are compared to other studies and found to be
inconclusive.”, and “Our methods are designed to help people with mental health problems, not just as a
way to cure them.”]</p>
      <sec id="sec-5-1">
        <title>5.1. Human Evaluation</title>
        <p>Due to the delayed submission deadline, as well as follow-up correspondence with teams on partial or
incorrect output, the manual annotation of system output has been limited to a small sample and is still
ongoing. We report here only initial observations from the translation professionals conducting this
analysis, based on the expectation of what a professional editor would provide as reference output. We
looked in particular at the novel document-level simplifications of entire abstracts, and their coherence
and discourse structure.</p>
        <p>First and foremost, something is working. The automatic text simplifications are generally of
impressive quality, despite the remaining limitations that are the focus of this section. The fluency
and language variation are impressive, and far exceed earlier language generation technology, which
often reflected the protocols, templates, or rule-based systems underlying it.</p>
        <p>
          Second, changes can be neither necessary nor helpful. Frequently, as we observed in our work on the
project last year [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the information is written in another way but does not offer simplification.
Sometimes the vocabulary does not change but is simply rearranged.
        </p>
        <p>Third, discourse structure matters. In other examples, the resulting text is not shaped as a whole,
with a proper beginning, middle, and end, but is reordered to the detriment of clarity. For example, the
first sentence of the “simplified” abstract can contain a reference back to information already given.
Another example: a first sentence starting with “However, . . . ” in the simplification when the source text
started with “It is the purpose of this study, . . . ”, or with “For example, . . . ” when the original first
sentence presented the subject.</p>
        <p>Fourth, brevity does not always mean clarity. Although some examples shorten the sentences within an
abstract, thus technically simplifying, their interrelation is not necessarily maintained, producing a
choppy style. Better results were produced when the new text was split into subsections dedicated to
particular subtopics, including their explanation.</p>
        <p>Fifth, gratuitous additions are problematic. Another type of problem is illustrated by the creation
of a cumbersome nominal group, “the 21st Century managed care needs of patients, . . . ”, which does
not exist in the original, where we instead had an evocative example: “the emergency room at home.”
Here, though, both things belong to the same domain. Elsewhere, seeming hallucinations appeared,
for example, through the addition of an off-topic sentence. For example, in an abstract about digital
tools to aid Parkinson’s sufferers, we found the following last sentence added during simplification:
“It includes advice on how to manage consultant work, such as research and development.” Although, in
terms of meaning, this has no equivalent in the source text, the source text’s opening sentence was: “The
paper also discusses how a practitioner can accomplish UCSD in the context of product development and
consultant work.”, which mentions the topic in a different context.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Spurious or overgeneration</title>
        <p>We conduct a deeper analysis of how much of the generated simplified output sentences and abstracts
can be traced to the source input. In particular, we look at spurious generated content and its prevalence
in the submitted text simplifications. Such content is at risk of having been introduced gratuitously by
the generative model, and corresponds to what is informally referred to as “hallucinations.”</p>
        <p>Earlier, in Table 2, we showed an example of a human reference simplification, combining the input
sentences belonging to the abstract of the document with id 130055196 retrieved for query G01.1. We can
do the same for the automatically generated scientific text simplifications, again showing the deletions
and insertions relative to the source input sentences. Table 8 shows an example output simplification of
one of the participating teams, for the same input sentences as in Table 2 above. Most simplifications
are revisions of the input, but we also observe that sometimes an entire sentence is inserted (shown as
xxx in Table 8). The example in Table 8 is an extreme case, picked to illustrate both the importance and
complexity of detecting such spurious content.</p>
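        <p>The deletion and insertion views in Tables 2 and 8 can be reproduced with a standard sequence alignment. A minimal sketch using Python's difflib, where the [-...-] and {+...+} markup conventions are ours, for illustration:

```python
import difflib

def word_diff(source: str, simplified: str) -> str:
    """Render word-level deletions as [-...-] and insertions as {+...+}."""
    src, out = source.split(), simplified.split()
    parts = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=out).get_opcodes():
        if op in ("delete", "replace"):
            parts.append("[-" + " ".join(src[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            parts.append("{+" + " ".join(out[j1:j2]) + "+}")
        if op == "equal":
            parts.append(" ".join(src[i1:i2]))
    return " ".join(parts)

print(word_diff("The experimental results are shown and evaluated .",
                "The experiment 's results are presented and analyzed ."))
```

        The alignment is word-based; the same opcodes can drive any rendering, such as strikethrough for deletions in a print layout.</p>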
        <p>We provide a detailed analysis quantifying the prevalence of spurious content in the CLEF 2024
SimpleText Task 3 submissions. Table 9 quantifies how often such spurious generation occurs. We
re-aligned the generated output with the original source sentences, and flag here only entire output
sentences that do not share a single token with the input. Our analysis reveals that the amount of
spurious content varies but is far from infrequent. A total of 17 out of 36 submissions (47%) have spurious
whole sentences in at least 10% of the input sentences. In fact, 14 submissions (39%) have them in at least
20% of the input, and 7 submissions (19%) in at least 50% of the input sentences. The detection of
non-aligned output sentences is indicative but imperfect. For example, a significant reordering of content
may lead to false positives in rare cases, and unusual tokenization or formatting may affect the alignment
with the source, even systematically. Note also that the detected additions may introduce helpful background
knowledge or other useful information to contextualize the information in the source sentences.</p>
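        <p>The flagging of non-aligned output sentences described above can be approximated as follows. This is an illustrative reimplementation; our actual tokenization and alignment may differ:

```python
import re

def spurious_sentences(source: str, output: str) -> list:
    """Output sentences sharing no token with the source input: candidates
    for spurious ('hallucinated') content."""
    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    src = tokens(source)
    sents = [s.strip() for s in re.split(r"[.!?]+", output) if s.strip()]
    return [s for s in sents if not tokens(s) & src]

print(spurious_sentences(
    "The experimental results are shown and evaluated .",
    "The results are presented. It includes advice on managing consultant work."))
# → ['It includes advice on managing consultant work']
```

        As noted above, significant reordering or unusual tokenization can still cause occasional false positives.</p>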
        <p>Table 10 quantifies how often such spurious generation occurs for the abstract-level output. Here we
look again at the spurious output at the end of the input abstract, rather than conducting a sentence-level
analysis as done above. Aligning longer texts is more complex than aligning individual sentences. For those generating true
paragraph or document level simplifications, we observe more variation involving content of multiple
input sentences leading to a more complex alignment. Hence we focus on detecting spurious content at
the end of the generated abstract. As a result, for those aggregating sentence-level output merged into
the abstracts, we are only able to detect spurious content for the final sentence.</p>
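        <p>The end-of-abstract check described here can be sketched in the same spirit as the sentence-level flagging: walk backwards from the end of the generated abstract and collect trailing sentences with no token overlap with the source. Again, this is illustrative and the official analysis may differ in detail:

```python
import re

def trailing_spurious(source_abs: str, output_abs: str) -> list:
    """Trailing output sentences that share no token with the source abstract."""
    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    src = tokens(source_abs)
    sents = [s.strip() for s in re.split(r"[.!?]+", output_abs) if s.strip()]
    tail = []
    for sent in reversed(sents):    # scan from the end of the abstract
        if tokens(sent) & src:      # first grounded sentence ends the tail
            break
        tail.append(sent)
    return list(reversed(tail))

print(trailing_spurious("We propose an interpolation method.",
                        "A new method is proposed. It helps mental health patients."))
# → ['It helps mental health patients']
```

        Restricting the check to the tail avoids the harder problem of aligning reordered or merged content in the middle of a document-level simplification.</p>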
        <p>We make a number of observations based on our analysis in this section. First, the fraction of
sentences with spurious content is very low for some submissions; for other submissions, however, the
fraction is very substantial. Second, the standard evaluation measures used for text simplification, and
in fact for any text generation task in NLP, do not take this aspect into account. A submission with
significant spurious content can still obtain very high text overlap with the reference, and hence obtain
a very high performance score. Third, and more generally, human evaluation and this type of analysis
seem crucial to accurately evaluate generative models for the NLP and IR challenges addressed in our
track and in CLEF in general.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>
        The paper provides an overview of the CLEF 2024 SimpleText Task 3: Text Simplification, which focuses
on the simplification of scientific text. The objective of the task is to simplify either the separate
sentences or the entire scientific abstracts in order to enhance their accessibility and comprehensibility
for a general audience. We highlighted the key aspects and goals of the task within the broader context
of the CLEF 2024 SimpleText track [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Our main findings are the following. First, we observe competitive performance for scientific text
simplification, both in evaluation against the human reference simplifications and in text statistics
such as the FKGL readability score. Second, the abstract-level text simplification results are a mixture of
sentence-level and passage-level text simplification approaches. Third, our analysis reveals a high
and varying rate of spurious text generation, which is not detected by standard evaluation measures and is a
major concern for the use of these models in a real-world setting. More generally, almost all participants
use generative models (for the task, the track, and CLEF in general), and the track offers a unique setting
to study some of the inherent limitations of generative models.</p>
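      <p>For reference, the FKGL score mentioned above follows the standard Flesch-Kincaid Grade Level formula. The sketch below uses a crude vowel-group syllable heuristic of our own, so its scores can deviate from dedicated tools such as textstat:</p>

```python
# Sketch of the Flesch-Kincaid Grade Level (FKGL) formula:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# The syllable counter is an approximate heuristic, not a dictionary lookup.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (minimum 1 per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sents)) + 11.8 * (syllables / len(words)) - 15.59

complex_sentence = "The pharmacological intervention demonstrated statistically significant efficacy."
simple_sentence = "The drug worked well in our tests."
print(fkgl(complex_sentence) > fkgl(simple_sentence))  # True: the simplification reads at a lower grade level
```

      <p>Lower FKGL scores indicate text readable at a lower school grade level, which is why FKGL is a useful complement to reference-based overlap measures for this task.</p>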
      <p>
        The main aim of our task, the track, and the CLEF evaluation forum as a whole, is i) to foster a
community of IR, NLP, and AI researchers working together on the important task of making science
more accessible for everyone, and ii) to construct corpora and evaluation resources to stimulate research
on scientific text summarization and simplification. In terms of building a community researching
scientific text summarization and simplification, the task saw record attendance in 2024: thanks in part
to the additional abstract-level task, we received 83 runs from 15 teams, the largest number of participating
teams ever. In fact, the community is broadening beyond CLEF and raising general interest in generative
scientific text summarization and simplification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Within the CLEF 2024 SimpleText Task 3, we have constructed extensive corpora and manually
labeled evaluation data for scientific text simplification. Specifically, in 2024 we added a parallel
corpus of manually simplified sentences and abstracts from the scientific literature:
• Train, sentence level: 958 source sentences from scientific abstracts paired with corresponding
human reference simplifications.
• Test, sentence level: 578 source sentences from scientific abstracts paired with corresponding
human reference simplifications.
• Train, abstract level: 175 source scientific abstracts paired with corresponding human reference
simplifications.
• Test, abstract level: 103 source scientific abstracts paired with corresponding human reference
simplifications.</p>
      <p>These reusable corpora and evaluation resources are available to participants and other researchers who
want to work on the important problem of making scientific information open and easily accessible for
everyone.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This track would not have been possible without the great support of numerous individuals. We want to
thank in particular the colleagues and the students who participated in data construction and evaluation.
Please visit the SimpleText website for more details on the track.1</p>
      <p>Liana Ermakova is funded by the French National Research Agency (ANR) Automatic Simplification of
Scientific Texts project (ANR-22-CE23-0019-01),2 and the MaDICS research group.3 Jaap Kamps is partly
funded by the Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA #
1518.22.105), the University of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government
Lab). Views expressed in this paper are not necessarily shared or endorsed by those funding the
research.</p>
      <sec id="sec-7-1">
        <title>Notes</title>
        <p>1. https://simpletext-project.com/ 2. https://anr.fr/Project-ANR-22-CE23-0019 3. https://www.madics.fr/ateliers/simpletext/</p>
        <p>[12] (continued) 2023, pp. 2855–2875. URL: https://ceur-ws.org/Vol-3497/paper-240.pdf.</p>
        <p>[13] W. Xu, C. Callison-Burch, C. Napoles, Problems in current text simplification research: New data can help, Transactions of the Association for Computational Linguistics 3 (2015) 283–297. URL: https://aclanthology.org/Q15-1021. doi:10.1162/tacl_a_00139.</p>
        <p>[14] D. P. Varadi, A. Bartulović, SimpleText 2024: Scientific Text Made Simpler Through the Use of AI, in: [26], 2024.</p>
        <p>[15] N. Largey, R. Maarefdoust, S. Durgin, B. Mansouri, AIIR Lab Systems for CLEF 2024 SimpleText: Large Language Models for Text Simplification, in: [26], 2024.</p>
        <p>[16] A. Capari, H. Azarbonyad, G. Tsatsaronis, Z. Afzal, Enhancing Scientific Document Simplification through Adaptive Retrieval and Generative Models, in: [26], 2024.</p>
        <p>[17] R. Elagina, P. Vučić, AI Contributions to Simplifying Scientific Discourse in SimpleText 2024, in: [26], 2024.</p>
        <p>[18] S. M. Ali, H. Sajid, O. Aijaz, O. Waheed, F. Alvi, A. Samad, Improving Scientific Text Comprehension: A Multi-Task Approach with GPT-3.5 Turbo and Neural Ranking, in: [26], 2024.</p>
        <p>[19] R. Mann, T. Mikulandric, CLEF 2024 SimpleText Tasks 1-3: Use of LLaMA-2 for text simplification, in: [26], 2024.</p>
        <p>[20] J. Bakker, G. Yüksel, J. Kamps, University of Amsterdam at the CLEF 2024 SimpleText Track, in: [26], 2024.</p>
        <p>[21] B. Vendeville, L. Ermakova, P. De Loor, UBO NLP report on the SimpleText track at CLEF 2024, in: [26], 2024.</p>
        <p>[22] A. Michail, P. S. Andermatt, T. Fankhauser, Scientific Text Simplification Using Multi-Prompt Minimum Bayes Risk Decoding: Examining MBR’s Decisions, in: [26], 2024.</p>
        <p>[23] A. Michail, P. S. Andermatt, T. Fankhauser, Scientific text simplification using multi-prompt minimum Bayes risk decoding: SimpleText best of labs in CLEF 2023, in: [27], 2024.</p>
        <p>[24] L. Ermakova, J. Kamps, Complexity-aware scientific literature search: Searching for relevant and accessible scientific text, in: G. M. Di Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 16–26. URL: https://aclanthology.org/2024.determit-1.2.</p>
        <p>[25] J. Bakker, J. Kamps, Plan-guided simplification of biomedical documents, in: Under Submission, 2024.</p>
        <p>[26] G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024.</p>
        <p>[27] L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, 2024.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] G. M. Di Nunzio, F. Vezzani, L. Ermakova, H. Azarbonyad, J. Kamps (Eds.), Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024. URL: https://aclanthology.org/2024.determit-1.0.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Štajner, H. Saggion, M. Shardlow, F. Alva-Manchego (Eds.), Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023. URL: https://aclanthology.org/2023.tsar-1.0.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Štajner, H. Saggion, D. Ferrés, M. Shardlow, K. C. Sheang, K. North, M. Zampieri, W. Xu (Eds.), Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Virtual), 2022. URL: https://aclanthology.org/2022.tsar-1.0.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] H. Saggion, S. Štajner, D. Ferrés, K. C. Sheang (Eds.), Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021) co-located with the 37th Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), Online (initially located in Málaga, Spain), September 21st, 2021, volume 2944 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-2944.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Ermakova, P. Bellot, P. Braslavski, J. Kamps, J. Mothe, D. Nurbakova, I. Ovchinnikova, E. SanJuan, Overview of SimpleText 2021 - CLEF workshop on text simplification for scientific information access, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21-24, 2021, Proceedings, volume 12880 of Lecture Notes in Computer Science, Springer, 2021, pp. 432–449. URL: https://doi.org/10.1007/978-3-030-85251-1_27. doi:10.1007/978-3-030-85251-1_27.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Ermakova, E. SanJuan, J. Kamps, S. Huet, I. Ovchinnikova, D. Nurbakova, S. Araújo, R. Hannachi, É. Mathurin, P. Bellot, Overview of the CLEF 2022 SimpleText lab: Automatic simplification of scientific texts, in: A. Barrón-Cedeño, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5-8, 2022, Proceedings, volume 13390 of Lecture Notes in Computer Science, Springer, 2022, pp. 470–494. URL: https://doi.org/10.1007/978-3-031-13643-6_28. doi:10.1007/978-3-031-13643-6_28.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, O. Augereau, J. Kamps, Overview of the CLEF 2023 SimpleText lab: Automatic simplification of scientific texts, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 482–506. URL: https://doi.org/10.1007/978-3-031-42448-9_30. doi:10.1007/978-3-031-42448-9_30.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] E. SanJuan, S. Huet, J. Kamps, L. Ermakova, Overview of the CLEF 2024 SimpleText task 1: Retrieve passages to include in a simplified summary, in: [26], 2024.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. M. Di Nunzio, F. Vezzani, V. Bonato, H. Azarbonyad, J. Kamps, L. Ermakova, Overview of the CLEF 2024 SimpleText task 2: Identify and explain difficult concepts, in: [26], 2024.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. D'Souza, S. Kabongo, H. B. Giglou, Y. Zhang, Overview of the CLEF 2024 SimpleText Task 4: SOTA? Tracking the State-of-the-Art in Scholarly Publications, in: [26], 2024.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D'Souza, J. Kamps, Overview of the CLEF 2024 SimpleText track: Improving access to scientific texts for everyone, in: [27], 2024.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Ermakova, S. Bertin, H. McCombie, J. Kamps, Overview of the CLEF 2023 SimpleText task 3: Simplification of scientific texts, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2855–2875. URL: https://ceur-ws.org/Vol-3497/paper-240.pdf.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>