<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Vendeville</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre De Loor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HCTI</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lab-STICC (UMR CNRS 6285)</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>Brest</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Task 1: Text Simplification</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the UBOnlp team's participation in the SimpleText lab at CLEF 2025, focusing on scientific text simplification and controlled creativity tasks. We evaluate the performance of GPT-4o using simple prompt-based approaches across multiple subtasks without specialized training or fine-tuning. For Task 1 (Text Simplification), we applied GPT-4o to both sentence-level and document-level simplification of scientific abstracts from the Cochrane-Auto corpus. Our system achieved competitive SARI scores (42.20 for sentence-level, 43.37 for document-level) while maintaining low complexity metrics, demonstrating effective simplification through content reduction rather than lexical substitution. For Task 2 (Controlled Creativity), we addressed spurious generation detection and error classification in simplified texts. Our approach showed strong performance in fluency error detection (F1 = 0.322, ranking first) and alignment error detection (F1 = 0.381, ranking third), but struggled with general spurious content detection, particularly in post-hoc scenarios without source documents. These results highlight both the potential and limitations of large language models for specialized text simplification tasks. While GPT-4o demonstrates capabilities in linguistic quality assessment, task-specific architectures remain superior for comprehensive error detection and generation control. Our findings contribute to understanding the practical applicability of general-purpose language models in scientific text processing workflows.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic text simplification</kwd>
        <kwd>Science popularization</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper describes our participation in the CLEF 2025 SimpleText track [4], which follows previous editions of the track [1, 2, 3] and proposed the following tasks:
• Task 1: Text Simplification: simplify scientific text [5].
– Subtask 1.1: Simplify sentences.
– Subtask 1.2: Simplify abstracts.
• Task 2: Controlled Creativity: identify and avoid hallucination [6].
– Subtask 2.1: Identifying creative generation.
– Subtask 2.2: Classifying information distortion.
– Subtask 2.3: Avoiding creative generation.
• Task 3: SimpleText 2024 Revisited: selected tasks by popular request.
– Subtask 3.1: Content Selection: retrieving passages to include in a simplified summary.
– Subtask 3.2: Complexity Spotting: identifying and explaining difficult concepts.
– Subtask 3.3: Text Simplification: simplify scientific text.</p>
      <p>This paper details the participation of team UBOnlp in Tasks 1 and 2, where we used GPT-4o [7]
to generate predictions. We present the tasks and data provided, as well as the prompts we used for
prediction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Text Simplification</title>
      <sec id="sec-2-1">
        <title>2.1. Task Description</title>
        <p>The goal of this task was to generate simplifications of scientific texts. It was divided into two subtasks:
sentence-level simplification (Subtask 1.1) and document-level simplification (Subtask 1.2). This task
used the Cochrane-Auto corpus, built from the Cochrane systematic reviews and their associated
lay summaries. Cochrane-Auto consists of professionally written abstract-summary pairs, constructed
by realigning the biomedical abstracts and lay summaries at different levels of granularity: sentence,
paragraph, and full document. The alignment is restricted to ensure accurate correspondences, enabling
meaningful evaluation at each level. The dataset was split into training and test sets:
• train: 4,171 sentences (Task 1.1) and 4,171 paragraphs (Task 1.2)
• test: 4,293 sentences (Task 1.1) and 217 abstracts (Task 1.2)
Participants were welcome to use the training data to train models, but we decided to use an untrained,
prompt-based approach.</p>
        <p>We evaluate system outputs using a range of standard and simplification-specific metrics provided
by EASSE [8]. Flesch-Kincaid Grade Level (FKGL) [9] estimates the reading difficulty of a text based
on average sentence length and syllables per word, returning a U.S. school grade level; higher values
indicate more complex texts, with a theoretical lower bound of -3.40 and no upper limit.</p>
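        <p>For reference, the standard Flesch-Kincaid Grade Level formula is FKGL = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59; a text of one-word, one-syllable sentences yields the −3.40 lower bound.</p>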
        <p>BLEU [10] assesses n-gram overlap between generated and reference texts. Although originally
developed for machine translation, it is commonly applied in simplification by treating standard and
simplified English as distinct languages. Scores range from 0 (no overlap) to 1 (perfect match).</p>
        <p>SARI [11] is specifically designed for text simplification, comparing the system output not only to
references but also to the input. It evaluates the quality of additions, deletions, and words retained, with
scores ranging from 0 to 100, where higher indicates better simplification.</p>
        <p>To characterize structural transformations, we compute the compression ratio, which compares the
token count of the output to that of the reference; higher values indicate longer, less compressed outputs.
Sentence splits count the number of input sentences divided into multiple ones in the output, with
higher counts indicating more frequent segmentation.</p>
        <p>We also use Levenshtein similarity to quantify the edit distance between the input and the output,
where higher values denote greater surface similarity. The exact copy rate measures the proportion of
output sentences that are identical to sentences in the input.</p>
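        <p>For concreteness, a minimal Python sketch of these surface metrics (assuming simple whitespace tokenization; the official evaluation relies on EASSE's implementations):
import difflib

def compression_ratio(output: str, reference: str) -> float:
    # Token count of the output relative to the reference;
    # values above 1 indicate longer, less compressed outputs.
    return len(output.split()) / len(reference.split())

def levenshtein_similarity(source: str, output: str) -> float:
    # difflib's ratio is a standard-library stand-in for a
    # normalized edit similarity (1.0 = identical strings).
    return difflib.SequenceMatcher(None, source, output).ratio()

def exact_copy_rate(source_sents: list[str], output_sents: list[str]) -> float:
    # Proportion of output sentences copied verbatim from the input.
    copies = sum(sent in source_sents for sent in output_sents)
    return copies / len(output_sents)</p>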
        <p>In addition, we track the proportion of additions and deletions, indicating the extent of lexical changes
between input and output. Finally, lexical complexity is computed following Alva-Manchego et al. [8],
by aggregating the third quartile of the log-frequency ranks of words, capturing the relative rarity of
the vocabulary used.</p>
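        <p>These scores can be computed with EASSE [8]; a minimal sketch, assuming the module-level helpers exposed by recent EASSE releases (the texts shown are illustrative):
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu
from easse.fkgl import corpus_fkgl

orig_sents = ["We included seven cluster-randomised trials ..."]
sys_sents = ["We looked at seven studies ..."]
refs_sents = [["Seven trials were included ..."]]  # one inner list per reference set

sari = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents)
bleu = corpus_bleu(sys_sents=sys_sents, refs_sents=refs_sents)
fkgl = corpus_fkgl(sentences=sys_sents)
print(f"SARI={sari:.2f} BLEU={bleu:.2f} FKGL={fkgl:.2f}")</p>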
        <p>For sentence-level simplification (Task 1.1), sentences were concatenated into abstracts and evaluated
as such. Furthermore, two different sets of references were used. One was based on the plain language
summaries (PLS) from the original Cochrane reviews and contained references for 217 abstracts, while
the second was built from Cochrane-Auto and contained references for 37 abstracts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Test Data</title>
        <p>The provided test data for Task 1.1 was of the form:
{
  "pair_id": "CD012520",
  "para_id": 0,
  "sent_id": 0,
  "complex": "We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the UK, China, and the Netherlands."
}</p>
        <sec id="sec-2-2-1">
          <title>While test data for Task 1.2 was of the form:</title>
          <p>"pair_id": "CD012520",
"source": "Cochrane",
"complex": "We included seven cluster-randomised trials with 42,489 patient
participants from 129 hospitals, conducted in Australia, the UK, China, and the
Netherlands. Health professional participants (numbers not specified) included
nursing, medical and allied health professionals. Interventions in all studies
included [...]"
},</p>
        </sec>
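        <p>As an illustration, such records can be loaded and the "complex" field passed to the prompt described below (a minimal sketch; the file name is hypothetical):
import json

with open("simpletext_task1_2_test.json", encoding="utf-8") as f:  # hypothetical file name
    items = json.load(f)

sources = [item["complex"] for item in items]  # texts to simplify</p>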
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Submission Description</title>
        <p>Our goal for this task was to assess the performance of state-of-the-art models used in a simple way.
Therefore, we decided to use GPT-4o to generate simplifications based only on a simple prompt and the
source text. Decoding was done with a temperature of 0, and we used the following prompt:
prompt = f"""You are a classification expert for simplification errors. You need to
simplify the following scientific text for the general public.

The goal is to make the provided text more easily understandable.

It is important to keep an easy vocabulary, a simple semantic structure, and to not have
too much information density.

You also need to be informative and make the user understand important facts in the
source.
---------
Source: "{source}"
"""</p>
        <sec id="sec-2-3-1">
          <title>The same prompt is used for both subtasks 1.1 and 1.2.</title>
        </sec>
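        <p>A minimal sketch of the corresponding generation call, assuming the official openai Python client (client setup is illustrative; error handling omitted):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def simplify(source: str) -> str:
    prompt = f"""You are a classification expert for simplification errors. You need to
simplify the following scientific text for the general public.
[... prompt abridged; full text shown above ...]
---------
Source: "{source}"
"""
    # Temperature 0 for near-deterministic decoding, as in our submission.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content</p>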
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Results</title>
        <p>2.4.1. Task 1.1
The evaluation of our run, along with the scores of other participants, is presented in Table 1 and Table 2.
We see our system being one of the best on SARI on sentence-level simplification while keeping one of
the lowest FKGL and lexical complexity scores. Looking at the addition and deletion proportions, our
model removed more content than other models, while adding less.</p>
        <p>This suggests that our system adopts a more conservative rewriting strategy, favoring deletion
over lexical addition. While this may help reduce complexity, it could also risk omitting important
information.</p>
        <p>On Cochrane-Auto aligned data, however, we observe a notable drop in our model’s performance,
especially on SARI and BLEU, while other systems such as DSGT plan_guided_lla remain closer to the
PLS references. Interestingly, this drop coincides with a mismatch in sentence splitting behavior: while
our model tends to preserve the original sentence boundaries, the PLS references in Cochrane-Auto may
restructure content more, with significantly more sentence splits compared to those in the manually
aligned references. This difference may have penalized our system, which performs better at
sentence-level rewriting and performs well when reference simplifications follow similar segmentation. Despite
this, our model maintains competitive scores on FKGL and lexical complexity, suggesting that it still
produces fluent and accessible output, albeit less aligned with the structural edits present in the PLS
references.
2.4.2. Task 1.2
The evaluation of our system, UBOnlp GPT-4o, alongside those of other participants, is presented in
Table 3 and Table 4. Our system demonstrates competitive performance, particularly on SARI, indicating
effective simplification strategies. It produces longer outputs and performs frequent sentence splitting,
reflecting a consistent approach focused on decomposing and elaborating complex information rather
than merely shortening the text. This is further supported by the high compression ratio and addition
proportion, suggesting that the model often introduces explanatory content, such as definitions, to
enhance clarity. Despite these strengths, the lower BLEU scores point to a greater divergence from
reference phrasing, potentially impacting perceived fluency and alignment. The system also performs
well on FKGL and lexical complexity metrics, confirming its ability to adapt the vocabulary and structure
to a simpler register.</p>
        <p>These tendencies are confirmed in the evaluation against the Cochrane-Auto references, where
the results remain broadly consistent: SARI scores decrease slightly, while BLEU improves marginally,
highlighting the model’s stable behavior across reference sets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Controlled Creativity</title>
      <sec id="sec-3-1">
        <title>3.1. Task Description</title>
        <p>In practice, when generating simplifications, organizers have found a high proportion and variety of
spurious generation. The goal of this task is therefore to detect, classify and avoid spurious generation.</p>
        <sec id="sec-3-1-1">
          <title>In particular, we participated in the following subtasks 2.1 and 2.2.</title>
          <p>3.1.1. Subtask 2.1
The goal of this subtask is to detect spurious generation. Participants were presented with a system-generated
simplification and had to classify it as spurious or not. In particular, two cases were studied: one (sourced)
where participants had access to the source document of the simplification and one (posthoc) where they
did not. The dataset was constructed from system simplifications retrieved from last year's submissions
to the SimpleText lab and was automatically annotated based on token alignment: if over 10% of the
tokens in a generation were not aligned with the source, the generation was considered spurious (see
the sketch at the end of this section). This created a high prevalence of the spurious label (90%). The
train dataset contained 13,341 sentences (posthoc) and 13,514 sentences (sourced), while the test dataset
contained 3,336 sentences (posthoc) and 3,379 sentences (sourced). Results are evaluated using Accuracy,
Precision, Recall, F1 score, and AUROC.</p>
          <p>3.1.2. Subtask 2.2
The goal of this subtask is to detect and classify hallucinations with regard to the taxonomy of [12].
The taxonomy classifies errors in text simplifications into one or more of 14 different error classes,
grouped into 4 error groups:
• A. Fluency Is the answer provided in a correct form that a fluent speaker would speak?
• B. Alignment Is the format of the answer correct?
• C. Information Is the information provided accurate and relevant to the input?
• D. Simplification Does the response focus on simplification?
In addition, a "No Error" class is also considered. The training data is constructed from 42,392
synthetically generated simplifications containing targeted errors, derived from past submissions to the
SimpleText lab. The test data was constructed from 2,659 manual annotations of past submissions to
the SimpleText lab. Results are evaluated on the four aggregated error categories, using both F1 score
and AUC.</p>
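          <p>A minimal sketch of the 10% token-alignment labeling rule described above (our reading of the procedure; the organizers' exact alignment method may differ):
def is_spurious(source: str, generation: str, threshold: float = 0.10) -> bool:
    # A generation is labeled spurious when more than `threshold` of its
    # tokens cannot be aligned to (here, naively: found in) the source.
    source_tokens = set(source.lower().split())
    gen_tokens = generation.lower().split()
    unaligned = sum(tok not in source_tokens for tok in gen_tokens)
    return unaligned / len(gen_tokens) > threshold</p>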
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Test Data</title>
        <p>3.2.1. Subtask 2.1
The provided test datasets were in JSON format, as follows.
• Subtask 2.1 Posthoc:
{
  "sentence": "I explained the complex terms directly within the simplified sentence:\n\n* 'Next-generation model' means a new and improved plan.",
  "anon_gen_id": "74704850//66348262//3"
}
• Subtask 2.1 Sourced:
{
  "abs_id": "G01.1_1570837852",
  "sentence": "In this paper, we share our findings on how evolutionary algorithms and multi-agent systems can be used to understand a user's preferences while they interact with a digital assistant.",
  "gen_id": "11102757//G01.1_1570837852//1"
}
The Sourced data could be merged with abstract data of the following format:
{
  "query_id": "G11.1",
  "query_text": "drones",
  "doc_id": 2892036907,
  "abs_id": "G11.1_2892036907",
  "abs_source": "In the modern era of automation and robotics, autonomous vehicles are currently the focus of academic and industrial research. With the ever increasing number of unmanned aerial vehicles getting involved in activities in the civilian and commercial domain, there is [...]"
}</p>
        <p>3.2.2. Subtask 2.2
The test data was likewise provided as a JSON file.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Submission Description</title>
        <p>In both subtasks, our goal was to measure the performance of state-of-the-art models used in a naive,
simple way: we relied on an untrained GPT-4o model using only a prompt with the test data as input.
Decoding was done with a temperature of 0.
3.3.1. Subtask 2.1
For this subtask, we used two slightly different prompts for the sourced and posthoc variations. For
posthoc we used the following prompt:
prompt = f"""
You are an expert in detecting hallucinations in simplified scientific texts.
Hallucinations include:
- Information distortion: misrepresenting or oversimplifying facts in a misleading way.
- Spurious generation: adding information not supported by scientific content.
Your task: Analyze the simplified text and respond only with:
- True -&gt; if the text likely contains a hallucination.
- False -&gt; if the text seems accurate and faithful.

Respond with **only** True or False.
---------
Simplified Text:
{simplified}
"""</p>
        <sec id="sec-3-3-1">
          <title>For the sourced variation, we used:</title>
          <p>prompt = f"""
You are an expert in detecting hallucinations in simplified scientific texts.
Hallucinations include:
- **Information distortion**: when the simplified text misrepresents or alters the
meaning of the source.
- **Spurious generation**: when the simplified text includes new information not
present or supported in the source.</p>
          <p>Your task is to compare the simplified text with the source and respond with:
- True -&gt; if the simplified text contains hallucinations (of either type).
- False -&gt; if the simplified text is faithful to the source.</p>
          <p>Respond with **only** True or False.
---------
Source Text:
{source}
Simplified Text:
{simplified}
"""</p>
          <p>3.3.2. Subtask 2.2
For Subtask 2.2, we used a prompt describing the taxonomy, as well as the required output format, and
included examples. The taxonomy is the definition of the errors as provided in [12], while the possible
codes are the codes corresponding to each error.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>3.4.1. Subtask 2.1
Results for this subtask are presented in Table 5 and Table 6. In the posthoc detection scenario, our
GPT-4o approach ranked last among the participating teams. The results reveal a characteristic pattern:
while our method achieved high precision (0.92), indicating that when it predicted spurious generation
it was usually correct, it suffered from extremely low recall (0.21). This suggests our GPT-4o approach
was overly conservative in identifying spurious content when operating without access to source
documents. The low accuracy (0.27) and near-random AUROC (0.52) indicate that our approach
struggled significantly with the posthoc detection task. Given that the dataset has a 90% prevalence of
spurious examples, our low recall particularly hurt overall performance.</p>
        <p>When source documents were available, our GPT-4o approach showed improved but still limited
performance. The recall increased from 0.21 to 0.71, and accuracy improved from 0.27 to 0.70. This
suggests that GPT-4o benefits significantly from having reference material to compare against when
detecting spurious generation. However, our approach still ranked in the lower tier of submissions,
with several teams achieving accuracy scores above 0.90 and F1 scores above 0.95.</p>
        <p>The performance difference between our approach and top-performing methods (which achieved
F1-scores above 0.95) suggests that task-specific model architectures, such as BERT-based classifiers
and ensemble methods, may still be more suitable for this type of detection task than general-purpose
language models used in a zero-shot or few-shot manner.
3.4.2. Subtask 2.2
Our system achieved the best F1 score (0.322) for fluency error detection, outperforming all competing
systems including specialized fine-tuned models. This demonstrates GPT-4o’s capabilities for identifying
grammatical errors and fluency issues. It also showed strong performance in alignment error detection
(F1 = 0.381, 3rd place), showing effective identification of format and structural issues. However, our
system showed lower performance in "No Error" classification (F1 = 0.680), suggesting a tendency toward
false positives. Information and simplification error detection showed moderate results, indicating
challenges with task-specific requirements.</p>
        <p>The results highlight GPT-4o’s strength in linguistic tasks while revealing limitations in specialized
error detection, showing the usefulness of building task-specific error detection models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper evaluated GPT-4o’s effectiveness for scientific text simplification and controlled creativity
tasks at CLEF 2025 SimpleText using straightforward prompt-based approaches without specialized
training. Our results demonstrate both strengths and limitations of general-purpose language models for
specialized text processing tasks. In text simplification, GPT-4o achieved competitive SARI scores (42.20
sentence-level, 43.37 document-level) through a conservative strategy that prioritized content reduction
over lexical substitution. For controlled creativity, the model excelled in fluency error detection (highest
F1 score among participants) and alignment error detection, but struggled with spurious generation
detection, particularly in post-hoc scenarios without source documents. These findings highlight that
while GPT-4o demonstrates strong linguistic capabilities for quality assessment tasks, task-specific
architectures remain superior for comprehensive error detection and generation control. The substantial
performance gap between our approach and specialized systems indicates that domain-specific
fine-tuning or architectural modifications are necessary for optimal performance in critical applications.
Future work should explore hybrid approaches combining the linguistic sophistication of large language
models with the precision of specialized architectures. Our results underscore the importance of careful
evaluation when deploying general-purpose language models in specialized domains where accuracy
and reliability are essential.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was funded by the French National Research Agency (ANR) under the projects
ANR-22-CE23-0019-01 and ANR-19-GURE-0001 (program Investissements d’avenir integrated into France 2030).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Claude in order to: Grammar and
spelling check, Paraphrase and reword, and Drafting content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , E. SanJuan, S. Huet,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. D'Souza</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2024 simpletext track: Improving access to scientific texts for everyone, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and
          <source>Interaction: 15th International Conference of the CLEF Association, CLEF</source>
          <year>2024</year>
          , Grenoble, France, September
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>