<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Vendeville</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre De Loor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HCTI</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lab-STICC (UMR CNRS 6285)</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Task 1: Text Simplification</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the UBOnlp team's participation in the SimpleText lab at CLEF 2025, focusing on scientific text simplification and controlled creativity tasks. We evaluate the performance of GPT-4o using simple prompt-based approaches across multiple subtasks without specialized training or fine-tuning. For Task 1 (Text Simplification), we applied GPT-4o to both sentence-level and document-level simplification of scientific abstracts from the Cochrane-Auto corpus. Our system achieved competitive SARI scores (42.20 for sentence-level, 43.37 for document-level) while maintaining low complexity metrics, demonstrating effective simplification through content reduction rather than lexical substitution. For Task 2 (Controlled Creativity), we addressed spurious generation detection and error classification in simplified texts. Our approach showed strong performance in fluency error detection (F1 = 0.322, ranking first) and alignment error detection (F1 = 0.381, ranking third), but struggled with general spurious content detection, particularly in post-hoc scenarios without source documents. These results highlight both the potential and limitations of large language models for specialized text simplification tasks. While GPT-4o demonstrates capabilities in linguistic quality assessment, task-specific architectures remain superior for comprehensive error detection and generation control. Our findings contribute to understanding the practical applicability of general-purpose language models in scientific text processing workflows.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic text simplification</kwd>
        <kwd>Science popularization</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper describes our participation in the CLEF 2025 SimpleText track [4], which follows previous editions of the track [1, 2, 3] and proposed the following tasks:
• Task 1: Text Simplification: simplify scientific text [5].
– Subtask 1.1: Simplify sentences.
– Subtask 1.2: Simplify abstracts.
• Task 2: Controlled Creativity: identify and avoid hallucination [6].
– Subtask 2.1: Identifying creative generation.
– Subtask 2.2: Classifying information distortion.
– Subtask 2.3: Avoiding creative generation.
• Task 3: SimpleText 2024 Revisited: selected tasks by popular request.
– Subtask 3.1: Content Selection: retrieving passages to include in a simplified summary.
– Subtask 3.2: Complexity Spotting: identifying and explaining difficult concepts.
– Subtask 3.3: Text Simplification: simplify scientific text.</p>
      <p>This paper details the participation of team UBOnlp in Tasks 1 and 2, where we used GPT-4o [7]
to generate predictions. We present the tasks and data provided, as well as the prompts we used for
prediction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Text Simplification</title>
      <sec id="sec-2-1">
        <title>2.1. Task Description</title>
        <p>The goal of this task was to generate simplifications of scientific texts. It was divided into two subtasks:
sentence-level simplification (Subtask 1.1) and document-level simplification (Subtask 1.2). This task
used the Cochrane-Auto corpus, built from the Cochrane systematic reviews and their associated
lay summaries. Cochrane-Auto consists of professionally written abstract-summary pairs, constructed
by realigning the biomedical abstracts and lay summaries at different levels of granularity: sentence,
paragraph, and full document. The alignment is restricted to ensure accurate correspondences, enabling
meaningful evaluation at each level. The dataset was split into training and test sets:
• train: 4,171 sentences (Task 1.1) and 4,171 paragraphs (Task 1.2)
• test: 4,293 sentences (Task 1.1) and 217 abstracts (Task 1.2)
Participants were welcome to use the training data to train models, but we decided to use an untrained,
prompt-based approach.</p>
        <p>We evaluate system outputs using a range of standard and simplification-specific metrics provided
by EASSE [8]. Flesch-Kincaid Grade Level (FKGL) [9] estimates the reading difficulty of a text based
on average sentence length and syllables per word, returning a U.S. school grade level; higher values
indicate more complex texts, with a theoretical lower bound of -3.40 and no upper limit.</p>
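        <p>For reference, the standard Flesch-Kincaid Grade Level formula is FKGL = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59; a text of one-word, one-syllable sentences yields the −3.40 lower bound.</p>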
        <p>BLEU [10] assesses n-gram overlap between generated and reference texts. Although originally
developed for machine translation, it is commonly applied in simplification by treating standard and
simplified English as distinct languages. Scores range from 0 (no overlap) to 1 (perfect match).</p>
        <p>SARI [11] is specifically designed for text simplification, comparing the system output not only to
references but also to the input. It evaluates the quality of additions, deletions, and words retained, with
scores ranging from 0 to 100, where higher indicates better simplification.</p>
        <p>To characterize structural transformations, we compute the compression ratio, which compares the
token count of the output to that of the reference; higher values indicate longer, less compressed outputs.
Sentence splits count the number of input sentences divided into multiple ones in the output, with
higher counts indicating more frequent segmentation.</p>
        <p>We also use Levenshtein similarity to quantify the edit distance between the input and the output,
where higher values denote greater surface similarity. The exact copy rate measures the proportion of
output sentences that are identical to sentences in the input.</p>
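        <p>For concreteness, a minimal Python sketch of these surface metrics (assuming simple whitespace tokenization; the official evaluation relies on EASSE's implementations):
import difflib

def compression_ratio(output: str, reference: str) -> float:
    # Token count of the output relative to the reference;
    # values above 1 indicate longer, less compressed outputs.
    return len(output.split()) / len(reference.split())

def levenshtein_similarity(source: str, output: str) -> float:
    # difflib's ratio is a standard-library stand-in for a
    # normalized edit similarity (1.0 = identical strings).
    return difflib.SequenceMatcher(None, source, output).ratio()

def exact_copy_rate(source_sents: list[str], output_sents: list[str]) -> float:
    # Proportion of output sentences copied verbatim from the input.
    copies = sum(sent in source_sents for sent in output_sents)
    return copies / len(output_sents)</p>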
        <p>In addition, we track the proportion of additions and deletions, indicating the extent of lexical changes
between input and output. Finally, lexical complexity is computed following Alva-Manchego et al. [8],
by aggregating the third quartile of the log-frequency ranks of words, capturing the relative rarity of
the vocabulary used.</p>
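        <p>These scores can be computed with EASSE [8]; a minimal sketch, assuming the module-level helpers exposed by recent EASSE releases (the texts shown are illustrative):
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu
from easse.fkgl import corpus_fkgl

orig_sents = ["We included seven cluster-randomised trials ..."]
sys_sents = ["We looked at seven studies ..."]
refs_sents = [["Seven trials were included ..."]]  # one inner list per reference set

sari = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents)
bleu = corpus_bleu(sys_sents=sys_sents, refs_sents=refs_sents)
fkgl = corpus_fkgl(sentences=sys_sents)
print(f"SARI={sari:.2f} BLEU={bleu:.2f} FKGL={fkgl:.2f}")</p>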
        <p>For sentence-level simplification (Task 1.1), sentences were concatenated into abstracts and evaluated
as such. Furthermore, two different sets of references were used. One was based on the plain language
summaries (PLS) from the original Cochrane reviews and contained references for 217 abstracts, while
the second was built from Cochrane-Auto and contained references for 37 abstracts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Test Data</title>
        <p>The provided test data for Task 1.1 was of the form:
{
  "pair_id": "CD012520",
  "para_id": 0,
  "sent_id": 0,
  "complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands."
}</p>
        <sec id="sec-2-2-1">
          <title>While test data for Task 1.2 was of the form:</title>
          <p>"pair_id": "CD012520",
"source": "Cochrane",
"complex": "We included seven cluster-randomised trials with 42,489 patient
participants from 129 hospitals, conducted in Australia, the UK, China, and the
Netherlands. Health professional participants (numbers not specified) included
nursing, medical and allied health professionals. Interventions in all studies
included [...]"
},</p>
        </sec>
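        <p>As an illustration, such records can be loaded and the "complex" field passed to the prompt described below (a minimal sketch; the file name is hypothetical):
import json

with open("simpletext_task1_2_test.json", encoding="utf-8") as f:  # hypothetical file name
    items = json.load(f)

sources = [item["complex"] for item in items]  # texts to simplify</p>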
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Submission Description</title>
        <p>Our goal for this task was to assess the performance of state-of-the-art models used in a simple way.
Therefore, we decided to use GPT-4o to generate simplifications based only on a simple prompt and the
source text. Decoding was done with a temperature of 0, and we used the following prompt:
prompt = f"""You are a classification expert for simplification errors. You need to
simplify the following scientific text for the general public.

The goal is to make the provided text more easily understandable.

It is important to keep an easy vocabulary, a simple semantic structure, and to not have
too much information density.

You also need to be informative and make the user understand important facts in the
source.
---------
Source: "{source}"
"""</p>
        <sec id="sec-2-3-1">
          <title>The same prompt is used for both subtasks 1.1 and 1.2.</title>
        </sec>
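        <p>A minimal sketch of the corresponding generation call, assuming the official openai Python client (client setup is illustrative; error handling omitted):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def simplify(source: str) -> str:
    prompt = f"""You are a classification expert for simplification errors. You need to
simplify the following scientific text for the general public.
[... prompt abridged; full text shown above ...]
---------
Source: "{source}"
"""
    # Temperature 0 for near-deterministic decoding, as in our submission.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content</p>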
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Results</title>
        <p>2.4.1. Task 1.1
The evaluation of our run, along with the scores of other participants, is presented in Table 1 and Table 2.
We see our system being one of the best on SARI on sentence-level simplification while keeping one of
the lowest FKGL and lexical complexity scores. Looking at the addition and deletion proportions, our
model removed more content than other models, while adding less.</p>
        <p>This suggests that our system adopts a more conservative rewriting strategy, favoring deletion
over lexical addition. While this may help reduce complexity, it could also risk omitting important
information.</p>
        <p>On Cochrane-Auto aligned data, however, we observe a notable drop in our model’s performance,
especially on SARI and BLEU, while other systems such as DSGT plan_guided_lla remain closer to the
PLS references. Interestingly, this drop coincides with a mismatch in sentence splitting behavior: while
our model tends to preserve the original sentence boundaries, the PLS references in Cochrane-Auto may
restructure content more, with significantly more sentence splits compared to those in the manually
aligned references. This difference may have penalized our system, which performs better at
sentence-level rewriting and performs well when reference simplifications follow similar segmentation. Despite
this, our model maintains competitive scores on FKGL and lexical complexity, suggesting that it still
produces fluent and accessible output, albeit less aligned with the structural edits present in the PLS
references.
2.4.2. Task 1.2
The evaluation of our system, UBOnlp GPT-4o, alongside those of other participants, is presented in
Table 3 and Table 4. Our system demonstrates competitive performance, particularly on SARI, indicating
effective simplification strategies. It produces longer outputs and performs frequent sentence splitting,
reflecting a consistent approach focused on decomposing and elaborating complex information rather
than merely shortening the text. This is further supported by the high compression ratio and addition
proportion, suggesting that the model often introduces explanatory content, such as definitions, to
enhance clarity. Despite these strengths, the lower BLEU scores point to a greater divergence from
reference phrasing, potentially impacting perceived fluency and alignment. The system also performs
well on FKGL and lexical complexity metrics, confirming its ability to adapt the vocabulary and structure
to a simpler register.</p>
        <p>These tendencies are confirmed in the evaluation against the Cochrane-Auto references, where
the results remain broadly consistent: SARI scores decrease slightly, while BLEU improves marginally,
highlighting the model’s stable behavior across reference sets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Controlled Creativity</title>
      <sec id="sec-3-1">
        <title>3.1. Task Description</title>
        <p>In practice, when generating simplifications, organizers have found a high proportion and variety of
spurious generation. The goal of this task is therefore to detect, classify and avoid spurious generation.</p>
        <sec id="sec-3-1-1">
          <title>In particular, we participated in the following subtasks 2.1 and 2.2.</title>
          <p>3.1.1. Subtask 2.1
The goal of this subtask is to detect spurious generation. Participants were presented with a system-generated
simplification and had to classify it as spurious or not. In particular, two cases were studied: one (sourced)
where participants had access to the source document of the simplification and one (posthoc) where they
did not. The dataset was constructed from system simplifications retrieved from last year's submissions
to the SimpleText lab and was automatically annotated based on token alignment: if over 10% of the
tokens in a generation were not aligned with the source, the generation was considered spurious (see
the sketch at the end of this section). This created a high prevalence of the spurious label (90%). The
train dataset contained 13,341 sentences (posthoc) and 13,514 sentences (sourced), while the test dataset
contained 3,336 sentences (posthoc) and 3,379 sentences (sourced). Results are evaluated using Accuracy,
Precision, Recall, F1 score, and AUROC.</p>
          <p>3.1.2. Subtask 2.2
The goal of this subtask is to detect and classify hallucinations with regard to the taxonomy of [12].
The taxonomy classifies errors in text simplifications into one or more of 14 different error classes,
grouped into 4 error groups:
• A. Fluency Is the answer provided in a correct form that a fluent speaker would speak?
• B. Alignment Is the format of the answer correct?
• C. Information Is the information provided accurate and relevant to the input?
• D. Simplification Does the response focus on simplification?
In addition, a "No Error" class is also considered. The training data is constructed from 42,392
synthetically generated simplifications containing targeted errors, derived from past submissions to the
SimpleText lab. The test data was constructed from 2,659 manual annotations of past submissions to
the SimpleText lab. Results are evaluated on the four aggregated error categories, using both F1 score
and AUC.</p>
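          <p>A minimal sketch of the 10% token-alignment labeling rule described above (our reading of the procedure; the organizers' exact alignment method may differ):
def is_spurious(source: str, generation: str, threshold: float = 0.10) -> bool:
    # A generation is labeled spurious when more than `threshold` of its
    # tokens cannot be aligned to (here, naively: found in) the source.
    source_tokens = set(source.lower().split())
    gen_tokens = generation.lower().split()
    unaligned = sum(tok not in source_tokens for tok in gen_tokens)
    return unaligned / len(gen_tokens) > threshold</p>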
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test Data</title>
        <p>3.2.1. Subtask 2.1
The provided test datasets were in JSON format, as follows.
• Subtask 2.1 Posthoc:
{
  "sentence": "I explained the complex terms directly within the simplified sentence:\n\n* 'Next-generation model' means a new and improved plan.",
  "anon_gen_id": "74704850//66348262//3"
}
• Subtask 2.1 Sourced:
{
  "abs_id": "G01.1_1570837852",
  "sentence": "In this paper, we share our findings on how evolutionary algorithms and multi-agent systems can be used to understand a user's preferences while they interact with a digital assistant.",
  "gen_id": "11102757//G01.1_1570837852//1"
}
The Sourced data could be merged with abstract data of the following format:
{
  "query_id": "G11.1",
  "query_text": "drones",
  "doc_id": 2892036907,
  "abs_id": "G11.1_2892036907",
  "abs_source": "In the modern era of automation and robotics, autonomous vehicles are currently the focus of academic and industrial research. With the ever increasing number of unmanned aerial vehicles getting involved in activities in the civilian and commercial domain, there is [...]"
}</p>
        <p>3.2.2. Subtask 2.2
The test data was likewise provided as a JSON file.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Submission Description</title>
        <p>In both subtasks, our goal was to measure the performance of state-of-the-art models used in a naive,
simple way: we relied on an untrained GPT-4o model using only a prompt with the test data as input.
Decoding was done with a temperature of 0.
3.3.1. Subtask 2.1
For this subtask, we used two slightly different prompts for the sourced and posthoc variations. For
posthoc we used the following prompt:
prompt = f"""
You are an expert in detecting hallucinations in simplified scientific texts.
Hallucinations include:
- Information distortion: misrepresenting or oversimplifying facts in a misleading way.
- Spurious generation: adding information not supported by scientific content.
Your task: Analyze the simplified text and respond only with:
- True -&gt; if the text likely contains a hallucination.
- False -&gt; if the text seems accurate and faithful.

Respond with **only** True or False.
---------
Simplified Text:
{simplified}
"""</p>
        <sec id="sec-3-3-1">
          <title>For the sourced variation, we used:</title>
          <p>prompt = f"""
You are an expert in detecting hallucinations in simplified scientific texts.
Hallucinations include:
- **Information distortion**: when the simplified text misrepresents or alters the
meaning of the source.
- **Spurious generation**: when the simplified text includes new information not
present or supported in the source.</p>
          <p>Your task is to compare the simplified text with the source and respond with:
- True -&gt; if the simplified text contains hallucinations (of either type).
- False -&gt; if the simplified text is faithful to the source.</p>
          <p>Respond with **only** True or False.
---------
Source Text:
{source}
Simplified Text:
{simplified}
"""</p>
          <p>3.3.2. Subtask 2.2
For Subtask 2.2, we used a prompt describing the taxonomy, as well as the required output format, and
included examples. The taxonomy is the definition of the errors as provided in [12], while the possible
codes are the codes corresponding to each error.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>3.4.1. Subtask 2.1
Results for this subtask are presented in Table 5 and Table 6. In the posthoc detection scenario, our
GPT-4o approach ranked last among the participating teams. The results reveal a characteristic pattern:
while our method achieved high precision (0.92), indicating that when it predicted spurious generation
it was usually correct, it suffered from extremely low recall (0.21). This suggests our GPT-4o approach
was overly conservative in identifying spurious content when operating without access to source
documents. The low accuracy (0.27) and near-random AUROC (0.52) indicate that our approach
struggled significantly with the posthoc detection task. Given that the dataset has a 90% prevalence of
spurious examples, our low recall particularly hurt overall performance.</p>
        <p>When source documents were available, our GPT-4o approach showed improved but still limited
performance. The recall increased from 0.21 to 0.71, and accuracy improved from 0.27 to 0.70. This
suggests that GPT-4o benefits significantly from having reference material to compare against when
detecting spurious generation. However, our approach still ranked in the lower tier of submissions,
with several teams achieving accuracy scores above 0.90 and F1 scores above 0.95.</p>
        <p>The performance difference between our approach and top-performing methods (which achieved
F1-scores above 0.95) suggests that task-specific model architectures, such as BERT-based classifiers
and ensemble methods, may still be more suitable for this type of detection task than general-purpose
language models used in a zero-shot or few-shot manner.
3.4.2. Subtask 2.2
Our system achieved the best F1 score (0.322) for fluency error detection, outperforming all competing
systems including specialized fine-tuned models. This demonstrates GPT-4o’s capabilities for identifying
grammatical errors and fluency issues. It also showed strong performance in alignment error detection
(F1 = 0.381, 3rd place), showing effective identification of format and structural issues. However, our
system showed lower performance in "No Error" classification (F1 = 0.680), suggesting a tendency toward
false positives. Information and simplification error detection showed moderate results, indicating
challenges with task-specific requirements.</p>
        <p>The results highlight GPT-4o’s strength in linguistic tasks while revealing limitations in specialized
error detection, showing the usefulness of building task-specific error detection models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper evaluated GPT-4o’s effectiveness for scientific text simplification and controlled creativity
tasks at CLEF 2025 SimpleText using straightforward prompt-based approaches without specialized
training. Our results demonstrate both strengths and limitations of general-purpose language models for
specialized text processing tasks. In text simplification, GPT-4o achieved competitive SARI scores (42.20
sentence-level, 43.37 document-level) through a conservative strategy that prioritized content reduction
over lexical substitution. For controlled creativity, the model excelled in fluency error detection (highest
F1 score among participants) and alignment error detection, but struggled with spurious generation
detection, particularly in post-hoc scenarios without source documents. These findings highlight that
while GPT-4o demonstrates strong linguistic capabilities for quality assessment tasks, task-specific
architectures remain superior for comprehensive error detection and generation control. The substantial
performance gap between our approach and specialized systems indicates that domain-specific
fine-tuning or architectural modifications are necessary for optimal performance in critical applications.
Future work should explore hybrid approaches combining the linguistic sophistication of large language
models with the precision of specialized architectures. Our results underscore the importance of careful
evaluation when deploying general-purpose language models in specialized domains where accuracy
and reliability are essential.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was funded by the French National Research Agency (ANR) under the projects
ANR-22-CE23-0019-01 and ANR-19-GURE-0001 (program Investissements d’avenir integrated into France 2030).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Claude in order to: Grammar and
spelling check, Paraphrase and reword, and Drafting content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, S. Huet,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2024 simpletext track: Improving access to scientific texts for everyone, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and
          <source>Interaction: 15th International Conference of the CLEF Association, CLEF</source>
          <year>2024</year>
          , Grenoble, France, September
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>