<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SINAI in SimpleText CLEF 2025: Simplifying Biomedical Scientific Texts and Identifying Hallucinations Using GPT-4.1 and Pattern Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaime Collado-Montañez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Alexandra Ortiz-Zambrano</string-name>
          <email>jenny.ortiz@ug.edu.ec</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>César Espin-Riofrio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science (University of Jaén)</institution>
          ,
          <addr-line>Campus Las Lagunillas, s/n, Jaén, 23071</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Facultad de Ciencias Matemáticas y Físicas, Universidad de Guayaquil</institution>
          ,
          <addr-line>Av. Delta S/N, 090514 Guayaquil</addr-line>
          ,
          <country country="EC">Ecuador</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our participation in three different tasks of the CLEF 2025 SimpleText track. For Tasks 1.1 and 1.2, we explore the application of advanced language models, specifically GPT-4.1, for the automatic simplification of biomedical texts in English using zero-shot learning. Two versions of prompts were designed and applied to the Cochrane-auto dataset in Task 1.1 and Task 1.2, with the aim of generating texts that are more understandable for non-specialist audiences. The results show that the model successfully preserves the original semantic structure, identifying complex terms and providing clear restructuring, including brief explanations when necessary. Furthermore, an accurate listing of key elements and effective reorganization of difficult grammatical structures were observed. These characteristics indicate the adaptability of the model to facilitate access to technical information without affecting its accuracy. Finally, the results preliminarily support the effectiveness of prompt design as an approach to improve text comprehension in the biomedical field, without the need for additional supervised training. In Task 2.1, we addressed the problem of detecting creative generation in simplification outputs at the document level. Our approach combined rule-based pattern matching, developed through exploratory analysis of the training set, with the use of the llama-3.1-8b-instruct language model. Surface-level patterns such as one-word sentences, double-space endings, and near-literal context matches were leveraged to pre-label data, while remaining cases were evaluated by the language model using a token-level confidence threshold. The highest-performing run in the sourced subtask achieved an F1-score of 0.976, whereas in the posthoc subtask it was 0.953.</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical Complexity</kwd>
        <kwd>Biomedical Scientific Texts</kwd>
        <kwd>GPT-4.1</kwd>
        <kwd>Zero-Shot learning</kwd>
        <kwd>Pattern extraction</kwd>
        <kwd>Synthetic generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The way a text is written can become a considerable barrier, especially when it contains rare or
unfamiliar vocabulary, as well as complex lexical and semantic structures that make it difficult to access
its content [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This situation is especially evident in diverse populations [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], as there are numerous
groups of readers such as foreign language students [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], people with cognitive disabilities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and people
with low levels of reading comprehension, who face significant barriers to understanding written texts.
Even university students, despite their academic training and specialized knowledge,
experience difficulties reading and understanding complex texts [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This poses a critical challenge to
communicative equity.
      </p>
      <p>
        The development of information technologies has radically transformed access to information,
enabling the massive availability of content in diverse and key fields such as education, communication,
health, public administration, and scientific research. In particular, digitization has exponentially boosted
the production and dissemination of scientific literature, facilitating its consultation and analysis on
a large scale. Despite advances in digitization and open access, scientific information continues to
present a significant barrier for the general public: the high linguistic complexity of specialized texts.
This difficulty limits direct access to knowledge from original sources, especially for people without
prior training in the field, who face substantial challenges due to the lack of technical knowledge
or specialized terminology [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], because understanding a text depends largely on the reader’s prior
knowledge of the meaning of words. In this context, ensuring access to linguistically accessible content
not only responds to an educational and social need, but is also consolidated as a fundamental right
increasingly supported by international regulations and institutions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        The SimpleText Lab is part of the CLEF 2025 initiative, which aims to promote the systematic
evaluation of information access systems through shared tasks. This proposal focuses on the challenges
associated with text simplification, with a particular emphasis on the accessibility of scientific
information in recent years. In this context, SimpleText offers valuable resources and metrics for research, given
that a large part of the general public avoids consulting reliable scientific sources due to their high
linguistic complexity and lack of specialized knowledge. As a result, many people opt for simplified
content available on the internet and social media, which often serves commercial or political interests
rather than informational purposes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The main objective of this research is to demonstrate the ability of the Transformer-based GPT-4.1
language model to perform lexical simplification. To this end, several zero-shot prompt variants, i.e.
prompts without examples, were created and evaluated. This approach allows for the simplification of sentences and
entire documents extracted from the Cochrane-auto corpus, derived from biomedical literature summaries
and lay summaries of Cochrane systematic reviews, to facilitate the reader’s understanding of scientific
text.</p>
      <p>The article is organized as follows: in Section 2 we present a brief description of the tasks we
participated in. Sections 3 and 4 detail the data and methodology followed for tasks 1 and 2.1 respectively.
Section 5 shows the results achieved during our experimentation and Section 6 summarizes the main
conclusions and proposes avenues for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SimpleText@CLEF-2025 Tasks</title>
      <sec id="sec-2-1">
        <title>2.1. Task 1: Text Simplification: Simplify scientific text</title>
        <p>The objective of this task is to simplify scientific texts extracted from the Cochrane-auto corpus, both
at the level of complete sentences and entire documents, in order to facilitate the understanding of the
content by non-specialist readers. We have contributed to two subtasks:
1. Task 1.1 - Sentence-level Scientific Text Simplification. The goal of this task is to simplify whole
sentences extracted from the Cochrane-auto dataset.
2. Task 1.2 - Document-level Scientific Text Simplification. The goal of this task is to simplify whole
documents extracted from the Cochrane-auto dataset.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 2: Controlled Creativity: Identify and Avoid Hallucination</title>
        <p>The objective of Task 2 focuses on identifying and evaluating creative generation and information
distortion in text simplification. We have contributed to Subtask 2.1, Identify Creative Generation at
Document Level.</p>
        <p>This task aims to detect creative generation at the abstract or document level. Participants will analyze
system outputs from previous years, along with deliberately generated outputs from known models.
The goal is to identify which sentences are fully grounded in the source text, both without access to
the original sentences and with access to them. Additionally, sentences that introduce significant new
content must be labeled. This task serves as a post-hoc identification or explanation challenge.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 1: Experiments with Zero-Shot learning</title>
      <sec id="sec-3-1">
        <title>3.1. Cochrane-auto Corpus</title>
        <p>As part of the CLEF 2025 SimpleText program, the Cochrane-auto corpus was launched. It is composed
of biomedical scientific abstracts and their corresponding simplified versions for non-specialist readers,
derived from Cochrane systematic reviews. This resource represents a significant advance in the field
of biomedical simplification, adopting approaches previously applied to datasets such as Wiki-auto and
Newsela-auto.</p>
        <p>Compared to other traditional corpora, Cochrane-auto provides novel parallel data written by the
review authors themselves, enabling simplification processes at the whole-document level. Furthermore,
Cochrane-auto incorporates advanced techniques such as sentence merging, text restructuring, and
discourse alignment, allowing for a deeper and more coherent treatment of content. This design enables
multi-scale realignment, encompassing paragraphs, sentences, and documents, and distinguishes it
from conventional approaches focused solely on superficial simplification.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed system</title>
        <p>
          State-of-the-art deep learning architectures such as BERT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], RoBERTa [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], GPT-3 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and
GPT-4.1 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] are significantly outperforming classical techniques. In particular, GPT-4.1, a large-scale
Transformer-based text generation framework developed by OpenAI, reflects these advances. These
solutions have achieved outstanding performance across a variety of natural language processing tasks,
setting unprecedented levels of accuracy and performance in the field.
        </p>
        <p>As part of our approach, we used this model through the OpenAI API, configured with a temperature
of 0.0 and a maximum limit of 10,000 tokens per response, which allowed us to obtain detailed and
deterministic results. This configuration can be seen in Table 1. The model was integrated into our
workflow through Python code, which facilitated test automation. We also used the OpenAI Playground
environment as a complementary resource to quickly validate different inputs and generate query
prototypes.</p>
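        <p>As an illustrative sketch (not our exact code), the configuration from Table 1 can be expressed as a request payload for the OpenAI chat completions API; the helper name and the placeholder prompt below are our own, and the actual prompts are given in the Appendix.</p>

```python
# Sketch of the request configuration described above: temperature 0.0 for
# deterministic outputs and a 10,000-token response limit.
# build_request and its prompts are illustrative names, not the exact code used.

def build_request(text: str, system_prompt: str) -> dict:
    """Build the chat-completion payload for one Cochrane-auto input."""
    return {
        "model": "gpt-4.1",
        "temperature": 0.0,     # deterministic, reproducible outputs
        "max_tokens": 10_000,   # allow long, detailed simplifications
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    }
```

        <p>With the official Python client, such a payload would be sent as, e.g., <monospace>client.chat.completions.create(**build_request(sentence, prompt))</monospace>.</p>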
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt Design</title>
        <p>To run the experiments with the Cochrane-auto corpus, we designed two specific instructions
corresponding to Task 1.1 and Task 1.2. Both experiments were conducted using a zero-shot learning
approach, that is, without providing explicit training examples to the model. The prompts developed to
guide the system’s automatic generation of simplifications are presented in the Appendix.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task 2.1: Detecting patterns in the data</title>
      <sec id="sec-4-1">
        <title>4.1. Data description</title>
        <p>For this sub-task, participants are provided with system outputs from previous years, along with
deliberately generated outputs from known models. The objective is to analyze these outputs to identify
which sentences are fully grounded in the source text both with and without access to the original
sentences.</p>
        <p>An exploratory analysis of the sourced training dataset showed a high class imbalance: 12,115
out of 13,514 rows (90% of the total data) were labeled as spurious, while the remaining 1,399 were not
spurious. More interestingly, spurious sentences were significantly longer, at more than 15 words per
sentence on average, whereas not-spurious sentences averaged only about 11 words. This prompted us
to analyze the sentence length distribution further, where we found a surprisingly high number of
one-word sentences labeled as not spurious, given the reduced number of examples in this class (see Figure
1). Specifically, 262 out of the 1,399 not-spurious examples are one-word sentences, 244 of which are
simply a ‘.’. We also found that all instances of the regex ‘#.’, where ‘#’ can be any digit, were
labeled as spurious, although these accounted for just 14 examples out of the 12,115 in this subset.</p>
        <p>Looking at some of the remaining not-spurious sentences, i.e. those longer than one
word, we realized most of them appeared literally, or close to literally, within the context provided
in the sourced dataset: 809 examples fit this case, 790 of which were not spurious while only 19
were spurious.</p>
        <p>Finally, we also noticed that all 1,241 sentences ending with a double space were labeled
as spurious.</p>
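        <p>The surface patterns above can be sketched as simple checks; the function names are ours, and the exact implementation may differ from the one used in our runs.</p>

```python
import re

def is_one_word(sentence: str) -> bool:
    # One-word sentences (including a bare '.') were mostly not spurious.
    return len(sentence.split()) <= 1

def is_digit_dot(sentence: str) -> bool:
    # Sentences matching the regex '#.' (one digit plus a period) were
    # always labeled spurious in the training subset.
    return re.fullmatch(r"\d\.", sentence.strip()) is not None

def ends_double_space(sentence: str) -> bool:
    # All sentences ending with a double space were labeled spurious.
    return sentence.endswith("  ")
```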
        <p>Table 2 shows examples of all the patterns found during this exploratory analysis.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Proposed system</title>
        <p>
          Based on the patterns detailed in the data description, we developed a pre-filtering system in which
sentences matching any of these patterns were automatically assigned the corresponding label. Specifically,
our final submission relied on a filter to pre-annotate sentences matching the one-word, double-space,
and (close to) literal match patterns. This last filter uses fuzzy matching that normalizes the text
and removes stopwords before comparing strings. For all remaining sentences, we prompted
llama-3.1-8b-instruct [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to output ‘Yes’ or ‘No’, meaning spurious or not spurious respectively, given the sentence
and the sourced context.
        </p>
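        <p>A minimal sketch of the fuzzy literal-match filter follows; the stopword list and normalization details are assumptions, since they are not fully specified here.</p>

```python
import re

# Illustrative stopword list; the actual list used in our runs is not specified.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "was", "were"}

def normalize(text: str) -> str:
    """Lowercase, keep only alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

def literal_match(sentence: str, context: str) -> bool:
    """True when the normalized sentence appears verbatim in the normalized context."""
    return normalize(sentence) in normalize(context)
```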
        <p>In order to deal with class imbalance, we opted to trust the LLM output only if the probability of
the ‘No’ token (not spurious, the minority class) was larger than a given threshold. In our
submissions, we included three different thresholds: 95%, 99% and 100% (i.e. everything not matching a filtering
pattern is labeled as spurious).</p>
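        <p>The thresholding step can be sketched as follows, assuming the model exposes the log-probability of the generated token (the function name is ours):</p>

```python
import math

def label_with_threshold(logprob_no: float, threshold: float) -> str:
    """Trust a 'not spurious' prediction only when P('No') exceeds the threshold.

    logprob_no is the log-probability of the 'No' token; anything at or below
    the threshold falls back to the majority class (spurious).
    """
    p_no = math.exp(logprob_no)
    return "not spurious" if p_no > threshold else "spurious"
```

        <p>With a threshold of 100%, no probability can strictly exceed it, so every sentence left after the filters is labeled spurious.</p>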
        <p>Regarding our post-hoc approach, we generated artificial contexts with the same LLM for each
sentence and repeated the same experiments mentioned before. All prompts used during the
experimentation are included in the appendices.</p>
        <p>To summarize, we presented four different runs for both the sourced and posthoc subtasks:
• Run 1: One-word and double space filters. All remaining sentences are labeled as spurious.
• Run 2: One-word, double space, and literal match filters. Remaining sentences where the ‘No’
token probability is greater than 95% are labeled as not spurious.
• Run 3: One-word, double space, and literal match filters. Remaining sentences where the ‘No’
token probability is greater than 99% are labeled as not spurious.
• Run 4: One-word, double space, and literal match filters. All remaining sentences are labeled as
spurious.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Results of Task 1.1</title>
        <p>Table 3 shows the simplified prediction for a complex sentence extracted from the Cochrane-auto corpus.
In this instance, it is evident that the model successfully transformed a technical sentence about
cluster-randomised trials, preserving essential information (such as the number of patients and hospitals)
and including a clarification of the acronym “UK” (“United Kingdom”), as requested in the prompt.
Similarly, Table 4 presents a second example in which the model reformulates a sentence describing
types of healthcare professionals. The output generated by the model preserves the semantic structure
and provides an explicit listing of the professionals, adding explanatory examples in parentheses (e.g.,
“physical therapists and dietitians”), as requested in the prompt.</p>
        <p>In both cases, the model demonstrates accurate interpretation of the biomedical domain and applies
lexical simplification strategies, such as the replacement of technical terms, and restructures complex
sentences, generating more accessible versions of the texts. This provides preliminary validation of the
prompt design and the model’s usefulness for automatic simplification tasks without the need for prior
supervised training.</p>
        <sec id="sec-5-1-1">
          <title>Content</title>
          <p>Pair CD012520 (0, 1). Complex: Health professional participants (numbers not specified) included nursing, medical and allied
health professionals.</p>
          <p>Prediction: The people who took part in the study and work in health care (the exact number is not given)
included nurses, doctors, and other health workers (such as physical therapists or dietitians).</p>
          <p>In the second version of the prompt applied to the model using zero-shot learning for Task 1.1, we
observed a significant improvement in the clarity and accessibility of the original content. In sentence 1,
the model retains the general semantic structure but provides an explicit explanation of the professional
context (“people who took part in the study and work in health care”), in addition to providing a more
detailed and explanatory list of participants, including specific examples (“such as physical therapists or
dietitians”). Regarding sentence 2, the model develops an outstanding approach, that is, it simplifies and
reorganizes a complex grammatical structure, transforming a sentence full of technical terminology into
a more accessible expression. The use of explanations in parentheses to aid understanding (“people who
work in health care”) stands out, as does the paraphrasing of technical expressions such as “delivery
arrangements” or “financial arrangements.”</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Content</title>
          <p>Pair CD012520 (0, 2). Complex: Interventions in all studies included implementation strategies targeting healthcare workers;
three studies included delivery arrangements, no studies used financial arrangements or
governance arrangements.</p>
          <p>Prediction: All the studies used ways to help healthcare workers (people who work in health care) do their
jobs better. Three studies also changed how care was given to patients. No studies used changes
in money or rules to improve care.</p>
          <p>These results initially confirm that the design of the second prompt allows for greater adaptability
to readers with low literacy levels, without affecting semantic fidelity. Likewise, the model’s greater
sensitivity to the application context is evident, suggesting an improvement in the effectiveness of
automatic simplification strategies compared to an unsupervised model.</p>
          <p>
            According to the official results presented in Table 7 [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], the solution submitted by our team for Task
1.1, SINAI PRMZSTASK11V1, ranks among the top performers in the global CLEF 2025 SimpleText Task
1.1 ranking, reaching third overall place when all runs are sorted by the main SARI metric, which
evaluates the quality of simplification. This result indicates that SINAI’s solution achieved a SARI of
41.25, outperforming many of the variants submitted by other teams. Furthermore, its BLEU score of
4.59 is the highest among the top four results, demonstrating that the simplified version maintains a high
similarity to the human reference. Taken together, these indicators reflect that SINAI PRMZSTASK11V1
achieved a favorable balance between simplification, fidelity to the original content, and readability,
standing out as one of the best solutions to the challenge compared to the best runs submitted by
other teams.
          </p>
        </sec>
        <sec id="sec-5-1-3">
          <title>Content</title>
          <p>Pair CD012520 (Cochrane corpus). Complex: We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals,
conducted in Australia, the UK, China, and the Netherlands. Health professional participants
(numbers not specified) included nursing, medical and allied health professionals. Interventions
in all studies included implementation strategies targeting healthcare workers; three studies
included delivery arrangements, no studies used financial arrangements or governance
arrangements. Five trials compared a multifaceted implementation intervention to no intervention, two
trials compared one multifaceted implementation intervention to another multifaceted
implementation intervention. No included studies compared a single implementation intervention
to no intervention or to a multifaceted implementation intervention. Quality of care outcomes
(proportions of patients receiving evidence-based care) were included in all included studies.
All studies had low risks of selection bias and reporting bias, but high risk of performance bias.
Three studies had high risks of bias from non-blinding of outcome assessors or due to analyses
used. We are uncertain whether a multifaceted implementation intervention leads to any change
in adherence to evidence-based recommendations compared with no intervention (risk ratio (RR)
1.73; 95% confidence interval (CI) 0.83 to 3.61; 4 trials; 76 clusters; 2144 participants, I2 =92%, very
low-certainty evidence). Looking at two specific processes of care, multifaceted implementation
interventions compared to no intervention probably lead to little or no difference in the
proportion of patients with ischaemic stroke who received thrombolysis (RR 1.14, 95% CI 0.94 to 1.37, 2
trials; 32 clusters; 1228 participants, moderate-certainty evidence), but probably do increase the
proportion of patients who receive a swallow screen within 24 hours of admission (RR 6.76, 95%
CI 4.44 to 10.76; 1 trial; 19 clusters; 1,804 participants; moderate-certainty evidence). Multifaceted
implementation interventions probably make little or no difference in reducing the risk of death,
disability or dependency compared to no intervention (RR 0.93, 95% CI 0.85 to 1.02; 3 trials;
51 clusters; 1228 participants; moderate-certainty evidence), and probably make little or no
difference to hospital length of stay compared with no intervention (difference in absolute change
1.5 days; 95% CI -0.5 to 3.5; 1 trial; 19 clusters; 1804 participants; moderate-certainty evidence).
We do not know if a multifaceted implementation intervention compared to no intervention
result in changes to resource use or health professionals’ knowledge because no included studies
collected these outcomes. We are uncertain whether a multifaceted implementation intervention
compared to no intervention improves adherence to evidence-based recommendations in acute
stroke settings, because the certainty of evidence is very low.</p>
          <p>Prediction: We looked at seven studies that included 42,489 patients from 129 hospitals in Australia, the
United Kingdom, China, and the Netherlands. The studies also included nurses, doctors, and
other health workers (the exact number was not given). In all the studies, the main action was to
use different ways to help healthcare workers follow guidelines. Three studies also changed how
care was given, but none used changes in money or hospital rules. Five studies compared using
several methods to help healthcare workers with not using any methods. Two studies compared
different sets of methods with each other. No studies compared using just one method with
not using any, or with using several methods. All studies measured how many patients got care
based on the best available evidence. The studies were not likely to have problems with how
patients were chosen or how results were reported, but there was a high chance that knowing
which group patients were in could have affected the results (performance bias). Three studies
also had a high chance of mistakes because the people checking the results knew which group
patients were in, or because of the way the results were analyzed. We are not sure if using
several methods to help healthcare workers follow guidelines makes a difference in how well
they follow these guidelines compared to not using any methods (the evidence is not strong).
When looking at two specific types of care, using several methods probably does not change how
many patients with a type of stroke called ischaemic stroke (a stroke caused by a blocked blood
vessel) get a treatment called thrombolysis (a medicine to break up blood clots), but it probably
does increase how many patients get a test to check if they can swallow safely within 24 hours
of arriving at the hospital. Using several methods probably does not make much difference in
lowering the risk of death, disability, or needing help from others, or in how long patients stay
in the hospital. We do not know if using several methods changes how resources are used or
how much health workers know, because the studies did not measure these things. Overall, we
are not sure if using several methods to help healthcare workers follow guidelines improves care
for patients with sudden stroke, because the evidence is not strong.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results of Task 1.2</title>
        <p>Table 8 shows the result achieved after applying the first version of the prompt (specified in Table 2) to
the zero-shot learning-based model on the pair identified as CD012520. This pair represents a complex
biomedical document extracted from Cochrane, with dense and highly specialized content, typical of the
clinical-academic domain. The original text (complex field) presents several particularities characteristic
of lexical and structural complexity. Once the model was implemented, the result generated in the
prediction field shows a restructured text that significantly reduces technical complexity. We evaluated
the results using the following criteria: the system replaced specialized terms with
simpler vocabulary; long sentences in the original text were reorganized into simple, consecutive
sentences; and, despite the simplification, the model preserved the key ideas of the original text.
The simplified text thus meets the main objective of the proposed system, yielding an accessible
version of the original.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results of Task 2.1</title>
        <p>In this section we report all metrics provided by the organizers through Codabench for our final
submission and several ablation runs to evaluate different filters and prompts. Table 11 shows these
results.</p>
        <p>The results indicate that predicting non-spurious sentences using large language models (LLMs) is
particularly challenging in this dataset. The highest-performing runs (1 and 4) achieved strong results
by labeling all sentences that did not match specific predefined patterns as spurious. This approach
appears effective largely due to the evaluation metric used: performance is assessed primarily on the
overrepresented class (spurious), rather than through a micro or macro-averaged score across classes. As
a result, attempts to correctly identify non-spurious instances have little impact on the final evaluation
score.</p>
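        <p>A small worked example (with illustrative numbers, not the official figures) shows why scoring only the spurious class rewards the all-spurious strategy, while a macro-averaged score would penalize it:</p>

```python
def f1(y_true, y_pred, positive):
    """Per-class F1 computed from scratch."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy 90/10 imbalance, predicting 'spurious' everywhere.
y_true = ["spurious"] * 9 + ["not"]
y_pred = ["spurious"] * 10

f1_spurious = f1(y_true, y_pred, "spurious")  # 18/19, roughly 0.947
f1_not = f1(y_true, y_pred, "not")            # 0.0: minority class never predicted
macro_f1 = (f1_spurious + f1_not) / 2         # about 0.474
```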
        <p>In the posthoc subtask, our strategy of generating synthetic contexts did not yield the expected
improvements. Many generated contexts included the original sentence verbatim, causing our filtering
We included seven cluster-randomised trials with 42,489 patient participants from 129 hospitals, conducted in Australia, the
UK, China, and the Netherlands. Health professional participants (numbers not specified) included nursing, medical and
allied health professionals. Interventions in all studies included implementation strategies targeting healthcare workers;
three studies included delivery arrangements, no studies used financial arrangements or governance arrangements. Five
trials compared a multifaceted implementation intervention to no intervention, two trials compared one multifaceted
implementation intervention to another multifaceted implementation intervention. No included studies compared a single
implementation intervention to no intervention or to a multifaceted implementation intervention. Quality of care outcomes
(proportions of patients receiving evidence-based care) were included in all included studies. All studies had low risks of
selection bias and reporting bias, but high risk of performance bias. Three studies had high risks of bias from non-blinding
of outcome assessors or due to analyses used. We are uncertain whether a multifaceted implementation intervention leads
to any change in adherence to evidence-based recommendations compared with no intervention (risk ratio (RR) 1.73; 95%
confidence interval (CI) 0.83 to 3.61; 4 trials; 76 clusters; 2144 participants, I2 =92%, very low-certainty evidence). Looking
at two specific processes of care, multifaceted implementation interventions compared to no intervention probably lead
to little or no diference in the proportion of patients with ischaemic stroke who received thrombolysis (RR 1.14, 95% CI
0.94 to 1.37, 2 trials; 32 clusters; 1228 participants, moderate-certainty evidence), but probably do increase the proportion
of patients who receive a swallow screen within 24 hours of admission (RR 6.76, 95% CI 4.44 to 10.76; 1 trial; 19 clusters;
1,804 participants; moderate-certainty evidence). Multifaceted implementation interventions probably make little or no
diference in reducing the risk of death, disability or dependency compared to no intervention (RR 0.93, 95% CI 0.85 to
1.02; 3 trials; 51 clusters ; 1228 participants; moderate-certainty evidence), and probably make little or no diference to
hospital length of stay compared with no intervention (diference in absolute change 1.5 days; 95% CI -0.5 to 3.5; 1 trial; 19
clusters; 1804 participants; moderate-certainty evidence). We do not know if a multifaceted implementation intervention
compared to no intervention results in changes to resource use or health professionals’ knowledge because no included
studies collected these outcomes. We are uncertain whether a multifaceted implementation intervention compared to no
intervention improves adherence to evidence-based recommendations in acute stroke settings, because the certainty of
evidence is very low.</p>
        <p>We included seven studies where groups of patients were randomly assigned to different treatments. These studies involved
42,489 patients from 129 hospitals in Australia, the United Kingdom, China, and the Netherlands. The studies also included
health workers (such as nurses, doctors, and other healthcare staff), but the exact number of health workers was not
given. In all the studies, the treatments focused on changing how healthcare workers do their jobs. Three studies also
changed how care was delivered. None of the studies changed how money was used or how hospitals were managed. Five
studies compared a complex treatment (which used several methods to help healthcare workers follow guidelines) to no
treatment. Two studies compared one complex treatment to another complex treatment. None of the studies compared a
single method to no treatment or to a complex treatment. All studies measured how many patients received care based
on the best available evidence. All studies were unlikely to have problems with how patients were chosen or how results
were reported, but there was a high chance that knowing which treatment was given could have affected the results
(performance bias). In three studies, there was a high chance of bias because the people checking the results knew which
treatment was given, or because of the way the results were analyzed. We are not sure if using a complex treatment
changes how well healthcare workers follow evidence-based guidelines compared to no treatment (risk ratio (RR, a way to
compare groups) 1.73; 95% confidence interval (CI, a range that shows uncertainty) 0.83 to 3.61; 4 studies; 76 groups; 2144
patients; I2 =92%, very low certainty in the results). Looking at two specific parts of care, complex treatments compared to
no treatment probably make little or no difference in the number of patients with ischaemic stroke (a type of stroke caused
by a blocked blood vessel) who received thrombolysis (a treatment to break up blood clots) (RR 1.14, 95% CI 0.94 to 1.37, 2
studies; 32 groups; 1228 patients, moderate certainty in the results), but probably do increase the number of patients who
get a swallow screen (a test to check if a person can swallow safely) within 24 hours of arriving at the hospital (RR 6.76,
95% CI 4.44 to 10.76; 1 study; 19 groups; 1,804 patients; moderate certainty in the results). Complex treatments probably
make little or no difference in lowering the risk of death, disability, or needing help from others compared to no treatment
(RR 0.93, 95% CI 0.85 to 1.02; 3 studies; 51 groups; 1228 patients; moderate certainty in the results), and probably make
little or no difference to how long patients stay in the hospital (difference in absolute change 1.5 days; 95% CI -0.5 to 3.5; 1
study; 19 groups; 1804 patients; moderate certainty in the results). We do not know if complex treatments compared to no
treatment change how resources are used or improve health workers’ knowledge, because none of the studies measured
these outcomes. Overall, we are not sure if complex treatments compared to no treatment help healthcare workers follow
evidence-based guidelines in the care of patients with acute stroke (sudden stroke), because the certainty of the evidence
is very low.</p>
        <p>This caused our filtering mechanism, designed to detect literal matches, to misclassify a large number of these as non-spurious.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>For the text simplification results (Tasks 1.1 and 1.2), the model demonstrates remarkable adaptability
in identifying and replacing complex terms with clear didactic descriptions, preserving the original
meaning of the text and providing an accurate list of key aspects, such as the healthcare professionals
in the analyzed examples. This implies that the automatic simplification system is capable of producing
clear results without affecting the quality and accuracy of the information provided.</p>
      <p>For Task 2.1, our pattern-based approach combined with LLM evaluation proved highly effective for
detecting hallucinations in text simplification outputs. The strategy of identifying surface-level patterns
(one-word sentences, double spaces, and literal matches) followed by LLM-based classification achieved
outstanding performance in the sourced subtask, with our best run (Run 4) achieving an F1-score of
0.976, precision of 1.0, and accuracy of 0.958.</p>
      <p>The key insight from Task 2.1 was that simple rule-based pattern matching could effectively pre-label
a significant portion of the data, while the LLM (llama-3.1-8b-instruct) provided reliable classification
for remaining cases. However, the posthoc subtask proved more challenging, with our synthetic context
generation approach yielding lower performance (best F1-score of 0.953), primarily due to the generated
contexts including original sentences verbatim, which caused misclassification by our literal matching
filters.</p>
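As a minimal sketch of the pre-labeling stage described above (the function name, sentence splitter, and exact rule order are illustrative assumptions, not the submitted code):

```python
import re

def surface_pattern_label(simplified: str, context: str):
    """Pre-label a simplification as 'spurious', 'non-spurious', or None
    (undecided) using the surface patterns described in the text.
    Illustrative sketch only; thresholds and patterns are assumptions."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", simplified.strip()) if s]
    # One-word sentences were a surface cue for creative generation.
    if any(len(s.split()) == 1 for s in sentences):
        return "spurious"
    # So were outputs ending in a double space.
    if simplified.endswith("  "):
        return "spurious"
    # Near-literal matches against the context suggest no invented content.
    if simplified.strip().lower() in context.lower():
        return "non-spurious"
    return None  # undecided: defer to the LLM classifier
```

Only the undecided cases (None) would then be passed to llama-3.1-8b-instruct for classification.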
      <p>As future work, we propose the development of adaptive multi-specialty simplification approaches,
which allow the system to adjust its text simplification strategies to the conditions, needs or changes in
the environment according to the thematic domain, preserving terminological precision and
communicative clarity. Additionally, improvements to synthetic context generation for hallucination detection
could enhance post-hoc evaluation capabilities, particularly by developing more sophisticated methods
that avoid verbatim inclusion of target sentences in generated contexts.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>A sincere thank you to the organizers of the CLEF 2025 Conference for their motivation, dedication, and
efforts in promoting research in the search for solutions that contribute to reducing barriers to reading.</p>
      <p>Our thanks to Engineer Anthony Arteaga Burgos, a graduate of the Faculty of Mathematical and
Physical Sciences at the University of Guayaquil, for his valuable contribution to the development of
the material.</p>
      <p>This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and
Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within
the framework of the project Desarrollo Modelos ALIA. This work has also been partially supported
by Project CONSENSO (PID2021-122263OB-C21), Project MODERATES (TED2021-130145B-I00), and
Project SocialTox (PDC2022-133146-C21) funded by MCIN/AEI/10.13039/501100011033 and by the
European Union NextGenerationEU/PRTR. It has also been funded by the scholarship
(FPI-PRE2022105603) from the Ministry of Science, Innovation and Universities of the Spanish Government.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini and ChatGPT for grammar and
spelling checking. After using these tools, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Prompts</title>
      <p>I want you to act as an automatic simplification system for texts written in English in the context of
medicine. Your task is to process the input text and make it clearer, that is, generate a simplified version,
identifying difficult expressions that should be the most complex ones in the text. Remember that the
concept of lexical simplification, in terms of natural language processing (NLP), refers to the process of
replacing complex words with simpler alternatives, preserving the original meaning of the text. Try to
reorganize paragraphs of the original text that contain difficult-to-understand grammatical structures so
that the simplified text is easy to understand for the general public, second language learners, people
with low literacy levels, and non-native speakers. You should keep in mind that if two or more words
in the text form a single concept (for example, “artificial intelligence”), treat them as a unit and do not
separate them; never simplify or replace words within a title; regarding acronyms, you should replace
them with the corresponding meaning. Finally, when there are complex words in the simplified text, I
require you to include a brief explanation in parentheses in all cases. We do not need you to include
explanations of difficult words once the text has been simplified. Finally, the result must be in JSON
format. Example: { “simplification”: “...” }</p>
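The JSON object this prompt requests could be parsed with a small helper like the following; the function name and the fallback to raw text are illustrative assumptions, not part of the submitted system:

```python
import json
import re

def extract_simplification(model_output: str) -> str:
    """Parse the {"simplification": ...} object requested by the prompt;
    fall back to the raw text when the model returns malformed JSON."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))["simplification"]
        except (json.JSONDecodeError, KeyError):
            pass  # malformed or missing key: fall through to raw text
    return model_output.strip()
```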
      <p>Act as an advanced automatic lexical simplification system for medical texts written in English. Its
main objective is to process the original text and generate a significantly clearer and more understandable
version, specifically designed for audiences with low literacy levels, students of English as a second
language, non-native speakers, and the general public. The task of the advanced automatic lexical
simplification system is to identify lexical complexity, that is, to detect the most difficult words or
expressions in the text. These should be technical medical terms, uncommon words, or idiomatic phrases
that make understanding difficult. It should then replace each difficult word or expression with a
simpler and more understandable alternative, maintaining the same meaning. In cases where relatively
complex words are identified, include a brief explanation in parentheses immediately after the word. It is
necessary to preserve the meaning by ensuring that the original meaning of the source text is preserved
at all times. If a sentence or paragraph has a complex grammatical structure, reorganize it to facilitate
understanding without altering its message. The advanced automatic lexical simplification system
should not alter lexical items that represent a single concept (e.g., “artificial intelligence”), nor should it
simplify or modify words that are part of a title. In the case of acronyms, write their full meaning in
parentheses next to them at least the first time they appear in the text. Finally, the result must be in
JSON format. Example: { “simplification”: “simplificación” }</p>
      <p>You are evaluating whether a sentence contains hallucinated information based on a given context.
A sentence is considered a **hallucination** if **any part** of it presents information that is not
**explicitly and clearly stated** in the context.
- Do not assume or infer any facts.
- If the sentence goes beyond the given context, even slightly, mark it as a hallucination.
- If you are uncertain, err on the side of caution and mark it as a hallucination.</p>
      <p>Sentence: {sentence}
Context: {context}
**Answer with only one word: ‘Yes’ if it is a hallucination, or ‘No’ if it is fully supported. Do not explain.**</p>
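A minimal sketch of how this Yes/No prompt could be applied with the token-level confidence threshold mentioned in the abstract; the `generate` callable, the 0.8 threshold, and the label names are assumptions for illustration:

```python
import math

# Condensed version of the evaluation prompt above; {sentence} and {context}
# are filled in per example.
PROMPT = (
    "You are evaluating whether a sentence contains hallucinated information "
    "based on a given context. If any part of the sentence is not explicitly "
    "stated in the context, or if you are uncertain, mark it as a hallucination.\n"
    "Sentence: {sentence}\nContext: {context}\n"
    "Answer with only one word: 'Yes' if it is a hallucination, or 'No' if it "
    "is fully supported."
)

def classify(sentence: str, context: str, generate) -> str:
    """`generate` is any callable returning the model's first answer token and
    its log-probability (e.g. a wrapper around llama-3.1-8b-instruct)."""
    token, logprob = generate(PROMPT.format(sentence=sentence, context=context))
    if math.exp(logprob) < 0.8:
        # Low-confidence answers err on the side of caution, per the prompt.
        return "hallucination"
    return "hallucination" if token.strip().lower().startswith("yes") else "supported"
```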
      <p>You are an AI assistant specializing in academic research paper abstracts. Your task is to generate a full,
plausible abstract for a scientific or technical paper.</p>
      <p>**The abstract’s core content and findings MUST directly support, or clearly imply, the following simplified
core idea/sentence:** {sentence}
This means the provided sentence should either be present verbatim, or the abstract’s content should
make that sentence a straightforward, accurate, and concise summary or conclusion that could be drawn
from it.</p>
      <p>Construct a complete abstract that typically includes:
1. **Background/Problem:** Introduce the context or problem addressed by the research.
2. **Approach/Methodology:** Briefly describe how the research was conducted, what system/design
was proposed, or what data was analyzed.
3. **Key Findings/Results:** Present the main outcomes, discoveries, or the core functionality, ensuring
this section is where the provided sentence’s idea is most strongly rooted.
4. **Conclusion/Implications:** Summarize the significance, benefits, or future outlook derived from the
findings.</p>
      <p>The abstract should be between 150-250 words, maintain a formal, objective, and scientific tone, and
ensure smooth, logical transitions between sections. It should read as if it were a genuine abstract from
a published paper.</p>
      <p>Generate only the abstract text.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Drndarevic</surname>
          </string-name>
          ,
          <article-title>Making it Simplext: Implementation and evaluation of a text simplification system for Spanish</article-title>
          ,
          <source>ACM Transactions on Accessible Computing (TACCESS)</source>
          <volume>6</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>North</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shardlow</surname>
          </string-name>
          ,
          <article-title>Lexical complexity prediction: An overview</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <article-title>Simplifying drug package leaflets written in Spanish by using word embedding</article-title>
          ,
          <source>Journal of Biomedical Semantics</source>
          <volume>8</volume>
          (
          <year>2017</year>
          ).
          <source>doi:10.1186/s13326-017-0156-7.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Drndarevic</surname>
          </string-name>
          ,
          <article-title>Making it Simplext: Implementation and evaluation of a text simplification system for Spanish</article-title>
          ,
          <source>ACM Trans. Access. Comput. 6</source>
          (
          <year>2015</year>
          ). URL: https://doi.org/10.1145/2738046. doi:10.1145/2738046.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogaten</surname>
          </string-name>
          ,
          <article-title>To simplify or not? Facilitating English L2 users' comprehension and processing of open educational resources in English using text simplification</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>37</volume>
          (
          <year>2021</year>
          )
          <fpage>705</fpage>
          -
          <lpage>717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Licardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Volčanjk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haramija</surname>
          </string-name>
          ,
          <article-title>Differences in communication skills among elementary students with mild intellectual disabilities after using easy-to-read texts</article-title>
          ,
          <source>The new educational review 64</source>
          (
          <year>2021</year>
          )
          <fpage>236</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alarcón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>HULAT-ALexS CWI task: CWI for language and learning disabilities applied to university educational texts</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          , CEUR-WS, Malaga, Spain,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bertin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Augereau</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2023 SimpleText task 2: Difficult concept identification and explanation</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rojas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Salazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Ramirez</surname>
          </string-name>
          , MultiLS-SP/CA:
          <article-title>Lexical complexity prediction and lexical simplification resources for Catalan and Spanish</article-title>
          , <year>2024</year>. URL: https://arxiv.org/abs/2404.07814. arXiv:2404.07814.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>CLEF 2025 SimpleText track: Simplify scientific text (and nothing more)</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>Introducing gpt-4.1 in the api</article-title>
          , https://openai.com/index/gpt-4-1/,
          <year>2025</year>
          . Accessed: June 16,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>AI @ Meta</given-names>
            <surname>Llama Team</surname>
          </string-name>
          ,
          <source>The Llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>