<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integrating Linguistic Knowledge into Prompting Strategies for Spanish Text Simplification: Insights from the NIL-UCM participation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Fernández</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Díaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Informática and ITC, Universidad Complutense de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>90</fpage>
      <lpage>100</lpage>
      <abstract>
        <p>We present our participation in a text simplification shared task focused on Plain Language and Easy-to-Read. Our approach, based on explicit linguistic instructions, yielded good results in semantic fidelity and readability. However, the metrics used, such as the Fernández-Huerta index, are insufficient to capture the complexity of the task. Fine-tuning did not significantly outperform prompt-based generation. We highlight the need for more robust and multidimensional metrics to enable fairer and more accurate evaluation of text simplification.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic simplification</kwd>
        <kwd>Plain Language</kwd>
        <kwd>Easy-to-Read</kwd>
        <kwd>Readability metrics</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Task Description</title>
        <p>
          The competition included two distinct subtasks, allowing participants to take part in either one or
both. The goal was the same in both cases: to automatically simplify a corpus of around six hundred
texts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], with the difference being the methodology applied in each. Thus, in Subtask 1, the aim
was to ensure that administrative and news texts used clear and understandable language following
the recommendations of Plain Language, while Subtask 2 required applying the specific criteria of
Easy-to-Read, and more specifically, the guidelines established in the UNE 153101 EX standard [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Dataset Description</title>
        <p>
          The dataset provided to participants [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] consisted of five CSV files. There is a clear division between
training and test texts, as well as between Subtask 1 and Subtask 2.
        </p>
        <p>The training files, which include both the original and the simplified versions of the texts according
to each methodology, allow participants to train their models:
• Subtask1Train.csv. Consists of the 2,400 training texts manually simplified according to Plain
Language guidelines.
• Subtask2Train.csv. Comprises the same 2,400 training texts, this time manually simplified
according to the UNE 153101 EX standard [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>It is worth noting that the original texts are not particularly complex to begin with. In our view,
this is especially problematic in the case of Subtask 1, considering that Plain Language, as we will
see, is designed to adapt documents that are especially difficult for the average reader, often due
to their legal nature.</p>
        <p>To better understand the nature of the original texts and those provided to participants as references
for adaptation according to each methodology, we include below one of the texts in its three versions:</p>
        <sec id="sec-1-2-1">
          <title>Text 25. Original</title>
          <p>"Alicante, 25 de noviembre del 2022. El concejal de Deportes, José Luis Berenguer, ha presentado
esta mañana el acto del XII Encuentro Club Esportiu Aquarium Alacant de natación adaptada junto al
presidente de la entidad deportiva CE Aquarium, Jorge Chica, y el nadador Manuel Martínez. El evento
que tiene lugar mañana sábado se celebrará en la Piscina Monte Tossal ‘José Antonio Chicoy’, a las 17
horas, y está organizado por el Club Aquarium, la Federación de Deportes Adaptados de la Comunitat
Valenciana junto con la Concejalía de Deportes del Ayuntamiento de Alicante. El XII encuentro reunirá
a numerosos deportistas de diferentes clubes de natación adaptada de ámbito nacional.El edil de
Deportes José Luis Berenguer (Cs) ha comentado que ’siempre apoyamos este tipo de iniciativas porque
pensamos que practicar deporte es la mejor terapia y medicina ya que es saludable y hace mucho
bien a las personas’. El concejal ha destacado también que el Club Esportiu Aquarium Alacant ’hace
una gran labor para con el deporte de nuestra ciudad y con los deportistas de su Club en particular,
además de llevar a cabo este tipo de eventos’.Por otra parte, Berenguer ha señalado que el club ’cuando
participa en competiciones siempre obtiene muy buenos resultados, lo que demuestra que el trabajo
que hacen lo hacen muy bien’. Para el Ayuntamiento de Alicante y para la Concejalía de Deportes es
’muy satisfactorio que haya entidades como esta que se vuelcan tanto con el deporte y que llevan el
nombre de Alicante con orgullo allá donde van’".</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>Text 25. Plain Language</title>
          <p>"El concejal de Deportes de Alicante presentó el 12 Encuentro Club Esportiu Aquarium Alacant
de natación adaptada que organiza el Club Aquarium, la Federación de Deportes Adaptados de la
Comunitat Valenciana y la Concejalía de Deportes del Ayuntamiento de Alicante. El encuentro será en
la Piscina Monte Tossal el sábado 26 de noviembre a las 5 de la tarde y participarán deportistas de
diferentes clubes de natación adaptada. El concejal dijo que es muy importante apoyar estas actividades
y el buen trabajo del Club Esportiu Aquarium Alacant para promocionar el deporte en Alicante.
También dijo que el club tiene buenos resultados en las competiciones y que es bueno tener entidades
que difunden una buena imagen de Alicante".</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Text 25. Easy-to-Read</title>
          <p>"Alicante presenta el encuentro número 12 de natación adaptada
que organiza el Club Esportiu Aquarium Alacant.</p>
          <p>El concejal de Deportes de Alicante
presentó el 12 Encuentro Club Esportiu Aquarium Alacant de natación adaptada
que organiza el Club Aquarium,
la Federación de Deportes Adaptados de la Comunitat Valenciana
y la Concejalía de Deportes del Ayuntamiento de Alicante.</p>
          <p>El encuentro será en la Piscina Monte Tossal
el sábado 26 de noviembre
a las 5 de la tarde.</p>
          <p>Participarán deportistas de diferentes equipos de natación adaptada.</p>
          <p>El concejal dijo que es muy importante apoyar estas actividades
para promocionar el deporte en Alicante
y que el Club Esportiu Aquarium está dando una buena imagen de la ciudad.</p>
          <p>También dijo que el club está haciendo un buen trabajo a favor del deporte
y que tiene buenos resultados en las competiciones".</p>
          <p>As for the test files, they contain the original texts that must be adapted automatically and are
specifically intended for model evaluation. These texts belong to the same domain and are similar in
length to those in the training set. As expected, they differ from the texts used during training.
• Subtask1Test.csv. Consists of 607 original texts that must be adapted to Plain Language.
• SubTask2Test.csv. Contains 600 original texts to be adapted to Easy-to-Read. These are the
same texts as in the previous file, except for the last seven, which were omitted for reasons
that were not stated.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Plain Language and Easy-to-Read Language</title>
      <p>As we will explain in a later section, our approach to solving the two tasks in the competition has been
based on integrating certain linguistic knowledge into our models. For this reason, we believe it is
necessary to devote some attention to the sociolinguistic foundations behind both methodologies of
text adaptation.</p>
      <sec id="sec-2-1">
        <title>2.1. Plain Language</title>
        <p>
          Plain Language is an initiative that emerged in the 1960s in the United States, the United Kingdom,
Canada, and Australia, and since then it has spread to many countries, where in some cases legislation
has even been enacted on the matter. As Petelin [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] points out, "the key principle of plain language is
that the intended reader can use the document for its intended purpose." Unlike Easy-to-Read Language,
it is not aimed at a specific group, but rather seeks to prevent texts from using overly convoluted and
complex language—something that often occurs in legal and financial domains, the areas in which these
recommendations have seen the most development precisely for that reason. So much so that Tartaglia
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] directly links the complexity of legal language and complaints about its dense and incomprehensible
nature to the birth of the Plain Language Movement.
        </p>
        <p>
          In her renowned book Plain Language for Lawyers, Asprey [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] perfectly summarizes the philosophy of
this movement when she states: "Simple in this sense doesn’t mean simplistic. It means straightforward,
clear, precise. Writing in plain language is just writing in clear, straightforward language, with the needs
of the reader foremost in mind (...). The main thing to remember is that if what you have written could
be unclear or confusing for your reader, or difficult to read, you should rewrite it so that it becomes
clear, unambiguous and easy to read."
        </p>
        <p>It is recommended to consider the target user of the text and adapt its content, structure, and visual
design to their needs, eliminating anything that could cause confusion or hinder readability.</p>
        <p>Particularly relevant is the guide How to Write Clearly, published by the European Commission and
primarily addressed to EU staff: "European Commission staff have to write many different types of
documents. Whatever the type – legislation, a technical report, minutes, a press release or speech – a
clear document will be more effective and more easily and quickly understood" [9]. We based our model
on these recommendations, along with those found in the Plain English Handbook [10], published by the
Office of Investor Education and Assistance of the SEC, the U.S. agency that regulates the stock market.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Easy-to-Read</title>
        <p>
          Easy-to-Read is a cognitive accessibility tool "that brings together a set of guidelines and
recommendations regarding text drafting, document design/layout, and the validation of their comprehensibility,
aimed at making information accessible to people with reading comprehension difficulties" [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It
is part of a broader strategy aimed at facilitating cognitive accessibility by removing barriers to the
comprehension, interaction, and use of products and services. In this way, it helps to guarantee the
right of access to information that people with cognitive disabilities are legally entitled to. Although
Easy-to-Read documents can be useful for all users, they are specifically intended for individuals who
experience reading comprehension difficulties.
        </p>
        <p>The goal is to eliminate the barriers that people with reading comprehension difficulties may face in
all aspects of life. As Hurtado and Reguera [11] note, public administrations are increasingly demanding
Easy-to-Read texts, and interest in text adaptation has grown significantly.</p>
        <p>
          In order to integrate this linguistic knowledge into our model, we have compiled the most relevant
guidelines and recommendations on the subject. Our main reference has been the experimental
standard UNE 153101:2018 EX [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This standard, which is the primary reference in Spain, includes both
mandatory guidelines and recommendations, often illustrated with incorrect and correct examples. In
fact, it is the document proposed as a reference for Subtask 2. García Muñoz [12] is also highly relevant,
as he systematically presents drafting and evaluation proposals based on previous experiences.
        </p>
        <p>Although both documents also include proposals related to design and layout, we have focused on
those related to orthotypography, lexis, morphosyntax, style, and the organization of information.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluation</title>
        <p>The competition guidelines stated that the evaluation would consider both the lexical and semantic
similarity between the original and adapted texts, as well as the readability of the latter:
• Cosine similarity (Bag-of-Words) to measure lexical overlap between the reference texts and the
participants’ submissions.
• Cosine similarity (Embeddings) to assess semantic similarity between the adapted and original
texts.
• Fernández-Huerta readability index, aimed at measuring the readability of the adapted texts, in
accordance with plain and accessible language guidelines.</p>
        <p>We believe that the first two metrics—based on cosine similarity of sparse and dense vector
representations of the original and adapted texts—are appropriate. We were more doubtful about the use
of the Fernández-Huerta index. Fernández-Huerta [13] adapted Flesch’s Reading Ease Score (RES) to
Spanish, adjusting it to the linguistic characteristics of the language. Since its publication, it has become
a pioneering reference for assessing the readability of texts in Spanish, particularly in educational
contexts [14]. The formula is expressed as follows:</p>
        <p><disp-formula><tex-math>\mathrm{Readability} = 206.84 - 0.60\,P - 1.02\,F</tex-math></disp-formula>
where P is the number of syllables per 100 words and F is the number of sentences per 100 words.</p>
        <p>Based on this metric, Fernández-Huerta [13] established a classification into seven levels, each
corresponding to an educational stage:
• 0–30: very difficult – university level
• 30–50: difficult – pre-university level
• 50–60: fairly difficult – ages 13–16
• 60–70: standard – ages 10–12
• 70–80: fairly easy – age 9
• 80–90: easy – age 6</p>
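        <p>As a worked illustration of the formula and the levels listed above, the following minimal sketch computes the index from P and F and maps the score to a level name. It assumes that P and F have already been counted (Spanish syllable counting is not shown) and that each lower bound is inclusive, which the classification itself does not specify. The function names are ours and purely illustrative.</p>
        <preformat>
```python
def fernandez_huerta(syllables_per_100_words, sentences_per_100_words):
    """Readability score as defined in this section: 206.84 - 0.60*P - 1.02*F."""
    return 206.84 - 0.60 * syllables_per_100_words - 1.02 * sentences_per_100_words


# Lower bounds of the six levels listed above, highest first.
# Assumption: each lower bound is inclusive; the section does not say
# which level a boundary value such as 50 belongs to.
LEVELS = [
    (80, "easy"),
    (70, "fairly easy"),
    (60, "standard"),
    (50, "fairly difficult"),
    (30, "difficult"),
    (0, "very difficult"),
]


def readability_level(score):
    """Map a score to one of the level names listed in this section."""
    if score >= 90:
        return "above the listed levels"
    for lower_bound, name in LEVELS:
        if score >= lower_bound:
            return name
    return "below the listed levels"


# Hypothetical text with 200 syllables and 5 sentences per 100 words.
score = fernandez_huerta(200, 5)
print(round(score, 2), readability_level(score))
```
        </preformat>
        <p>For the hypothetical values P = 200 and F = 5, the score is 206.84 − 120 − 5.1 = 81.74, which falls in the "easy" band.</p>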
        <p>As can be seen, the Fernández-Huerta index, like Flesch’s RES, is based on the assumption that the
difficulty or readability of a text is determined by its length and by the number of words and sentences
it contains [14]. One advantage of this type of formula is that its measurement can be easily automated,
allowing for the evaluation of large amounts of text in a short time. However, this approach seems
somewhat simplistic to us, and it is striking that the same metric is used to evaluate both subtasks,
despite the fact that Plain Language and Easy-to-Read, while sharing some common elements, are based
on different methodologies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Model Selection</title>
        <p>To carry out the task of automatic text simplification, we employed Mistral-7B-Instruct-v0.3, a large
language model (LLM) with 7 billion parameters designed for text generation. This model is
a fine-tuned version of the base model Mistral-7B-v0.3, specifically trained to follow human instructions.
This process, known as instruction fine-tuning, enables the model to respond more helpfully,
coherently, and in a task-oriented manner. Thanks to its ability to interpret and execute instructions, the
model is particularly suitable for tasks such as text simplification, which require transforming content
so that its original meaning is preserved while its linguistic complexity is reduced<sup>1</sup>.</p>
        <p>Several features proved decisive in selecting this model, as they make it especially suitable for the
task of automatic text simplification. Its instruction-based fine-tuning allows the model to understand
and execute specific directives, which is essential for transforming complex texts into more accessible
versions without losing their original meaning. This fine-tuning equips the model to follow concrete
linguistic instructions.</p>
        <p>Mistral-7B-Instruct-v0.3 met all the requirements established by our methodology:
1. A model specialized in coherent and fluent text generation.
2. No need for additional fine-tuning with the competition’s training data.
3. Ability to follow precise linguistic instructions.
4. Significantly lower cost compared to models from OpenAI, Meta, and Gemini.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation</title>
        <p>This model was loaded using Hugging Face’s transformers library, which provides a modular and
standardized interface for working with a wide variety of language models. In particular, we used the
AutoModelForCausalLM and AutoTokenizer classes. The AutoModelForCausalLM class allows
for loading pretrained autoregressive language models designed for causal text generation tasks, i.e.,
predicting the next token based on previous ones. This abstraction simplifies the loading of the specific
model (in this case, Mistral-7B-Instruct-v0.3) without manually defining its architecture, as the class
automatically adapts the appropriate structure and weights.</p>
        <p>The AutoTokenizer class handles the tokenization and detokenization of text, converting text
strings into numerical sequences (tokens) that the model can process and vice versa. This tokenizer
also automatically adapts to the loaded model, ensuring consistency between the vocabulary and the
encoding used.</p>
        <p>The implementation was developed in Python, using PyTorch as the backend for processing, with GPU
acceleration enabled. It is worth noting that the task was particularly demanding from a computational
standpoint, requiring intensive use of GPU memory, reaching approximately 35 GB of VRAM to handle
the model and generate text.</p>
        <p><sup>1</sup> The model description is available at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3</p>
        <p>We explored two different approaches to applying the model to the simplification task: on the
one hand, we tested its performance without additional fine-tuning, relying solely on its ability to
follow explicit linguistic instructions via prompts. On the other hand, we also evaluated the model’s
performance after fine-tuning it with the competition’s training texts.</p>
        <p>The simplification procedure consists of constructing a prompt composed of externally defined
linguistic instructions (contained in a text file) followed by the original text, in the following structure:
{instructions}\nOriginal text: {text}\nRewritten text:</p>
        <p>This prompt is tokenized using the model’s tokenizer, generating the input tensors required for
inference. Text generation was carried out using the model’s generate method, with parameters
configured to optimize output quality and diversity, such as:
• max_length=10000
• do_sample=True
• top_p=0.9
• temperature=0.7</p>
        <p>In addition, the padding token was set to the end-of-sequence token
(pad_token_id=tokenizer.eos_token_id) to ensure proper padding management.</p>
        <p>The generation process was executed without gradient computation (torch.no_grad()), reducing
computational resource consumption. After generation, the portion corresponding to the simplified
text was extracted based on the textual marker “Rewritten text:”, ensuring the retrieval of relevant
content.</p>
        <p>To avoid input size limitations, it was verified that the number of tokens in the prompt did not
exceed a maximum threshold (max_tokens=10000). If this limit was exceeded, the text was marked
as unprocessed. Finally, the simplified texts were stored in a pandas DataFrame and exported to CSV
format for subsequent analysis and evaluation.</p>
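        <p>The inference procedure described in this section can be sketched roughly as follows. This is a minimal illustration rather than the code used in the competition: the function names are hypothetical, the device handling is an assumption, and taking the text after the last occurrence of the marker is our guess at how the extraction might behave when the instructions themselves contain the marker.</p>
        <preformat>
```python
MARKER = "Rewritten text:"


def build_prompt(instructions, text):
    """Assemble the prompt in the structure described in this section."""
    return f"{instructions}\nOriginal text: {text}\n{MARKER}"


def extract_rewritten(generated):
    """Return what follows the marker (assumption: its last occurrence)."""
    return generated.rsplit(MARKER, 1)[-1].strip()


def load_model(model_name="mistralai/Mistral-7B-Instruct-v0.3"):
    """Load tokenizer and model; imports are local because the download is large."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Assumption: use a GPU when one is available.
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    return tokenizer, model


def simplify(text, instructions, tokenizer, model, max_tokens=10000):
    """Generate a simplification, or return None if the prompt is too long."""
    import torch

    prompt = build_prompt(instructions, text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if inputs["input_ids"].shape[1] > max_tokens:
        return None  # the text is marked as unprocessed
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(
            **inputs,
            max_length=10000,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_rewritten(generated)
```
        </preformat>
        <p>In such a setup, the value returned by simplify for each test text would be collected in a list and written to CSV, with None standing for texts that exceeded the token limit.</p>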
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Fine-tuning</title>
        <p>As previously noted, the Mistral-7B-Instruct-v0.3 model was adapted to two specific tasks: Plain
Language simplification and Easy-to-Read simplification. For this purpose, an independent fine-tuning
process was carried out for each task, employing efficient training techniques designed for
computationally constrained environments. This procedure was repeated separately in both cases, thereby
producing two models each tailored to a specific simplification objective.</p>
        <p>Specifically, a parameter-efficient adaptation strategy known as Low-Rank Adaptation (LoRA) was
applied to a 4-bit quantized version of the base model, significantly reducing computational requirements
without substantially compromising performance.</p>
        <p>The data used for training, validation, and testing were organized into three datasets (train.csv,
val.csv, test.csv), composed of input-output pairs (text, expected), where each input
corresponds to an original text and its respective simplified version. These data were converted into instances
of the Hugging Face datasets.Dataset class, enabling seamless integration into the tokenization
and training pipeline.</p>
        <p>Each dataset instance was transformed into an instructive prompt following the format used by
instruction-tuned models:
&lt;s&gt;[INST] {instructions}:
{text} [/INST] {expected}&lt;/s&gt;</p>
        <p>where {instructions} corresponds to a set of general guidelines for the text simplification task,
loaded from an external file (prompt.md). For supervised learning purposes, the tokens corresponding
to the prompt were masked in the label sequence (labels) using the value -100, so that the loss function is
applied exclusively to the tokens of the model’s response.</p>
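        <p>The label masking described above can be illustrated with a small sketch. It assumes that the prompt and the expected response have already been tokenized separately; the token ids below are made up for the example and the function name is hypothetical.</p>
        <preformat>
```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Made-up token ids: three prompt tokens followed by three response tokens.
example = build_labels(prompt_ids=[1, 733, 16289], response_ids=[415, 2245, 2])
print(example["labels"])  # [-100, -100, -100, 415, 2245, 2]
```
        </preformat>
        <p>Because the masked positions are skipped by the loss, only the response tokens contribute to the gradient.</p>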
        <p>To optimize memory usage during training, the bitsandbytes library was employed to load the
Mistral-7B-Instruct model in 4-bit quantization mode. This configuration uses the nf4 quantization
technique with float16 computation and double quantization enabled, providing a balanced trade-off
between efficiency and precision.</p>
        <p>The LoRA technique was used to insert trainable adapters into a specific subset of model layers:
q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. The LoRA
hyperparameters were set to r=16 (decomposition rank) and lora_alpha=32, with a dropout rate (dropout) of
0.05. Adaptation was performed using the peft library, allowing only the LoRA-added layers to be
updated during training while keeping the original base model weights frozen.</p>
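        <p>A configuration along these lines could be written as shown below. This is a sketch under stated assumptions, not the competition code: the task type, the call to prepare_model_for_kbit_training, and the function name are our additions, while the quantization and LoRA values are those reported in this section.</p>
        <preformat>
```python
# LoRA hyperparameters reported in this section.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the text
}


def load_quantized_lora_model(model_name="mistralai/Mistral-7B-Instruct-v0.3"):
    """Load the base model in 4-bit mode and attach trainable LoRA adapters."""
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config
    )
    # Assumption: a common preparation step for training quantized models.
    model = prepare_model_for_kbit_training(model)
    # Base weights stay frozen; only the adapter weights are trainable.
    return get_peft_model(model, LoraConfig(**LORA_SETTINGS))
```
        </preformat>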
        <p>Training was conducted using the Hugging Face Trainer class with a configuration adapted for
resource-limited environments. An effective batch size of 4 was achieved through gradient accumulation
(gradient_accumulation_steps=4), and a learning rate of 2e-4 was used. To reduce
computational cost, training was limited to a maximum of 3 epochs and capped at
1000 steps, with evaluation and model checkpointing performed every 200 steps. The optimizer
employed was paged_adamw_8bit, a paged 8-bit variant of AdamW that reduces the memory taken by
optimizer states. Finally, the fine-tuned
model was saved to disk along with its corresponding tokenizer.</p>
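        <p>The training settings above might translate into Trainer arguments roughly as follows. The output directory and the per-device batch size of 1 are assumptions (the latter inferred from an effective batch size of 4 with 4 accumulation steps on a single device). The argument that switches on step-based evaluation is left out because its name differs between versions of the library.</p>
        <preformat>
```python
TRAINING_SETTINGS = {
    "output_dir": "lora-output",           # hypothetical path
    "per_device_train_batch_size": 1,      # assumption: 1 x 4 accumulation steps = 4
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "max_steps": 1000,                     # a positive max_steps overrides num_train_epochs
    "eval_steps": 200,
    "save_steps": 200,
    "optim": "paged_adamw_8bit",
}


def effective_batch_size(settings, num_devices=1):
    """Examples seen per optimizer update."""
    return (
        settings["per_device_train_batch_size"]
        * settings["gradient_accumulation_steps"]
        * num_devices
    )


def make_training_arguments(settings=TRAINING_SETTINGS):
    """Build the arguments object that would be passed to Trainer."""
    from transformers import TrainingArguments

    return TrainingArguments(**settings)


print(effective_batch_size(TRAINING_SETTINGS))  # 4
```
        </preformat>
        <p>Note that when both limits are set in this library, the step cap is the one that takes effect, so the number of epochs actually completed depends on the size of the training split.</p>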
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Prompt Design for Text Simplification</title>
        <p>As previously indicated, the adopted strategy was based on the use of prompts with explicit and
structured instructions for each subtask. This formulation was designed by leveraging the linguistic
knowledge acquired throughout the research on Plain Language and Easy-to-Read guidelines, with
the aim of optimizing interaction with an instruction-tuned model, thereby maximizing its ability to
generate outputs aligned with the defined simplification criteria.</p>
        <p>Rather than modifying the model’s parameters, prompts allow the integration of pretrained models
into specific tasks simply through the appropriate formulation of instructions. These instructions,
which can be expressed in natural language or as learned vector representations, guide the model to
produce the desired behavior by activating the relevant knowledge according to the provided context
[15].</p>
        <p>Our prompt-based approach offers a significant advantage in terms of computational efficiency, and
it also enables us to direct text simplification following a specific methodology, which, in the case
of Easy-to-Read, is particularly thorough. Nonetheless, our intuition was that the combined use of a
detailed prompt with linguistic instructions and fine-tuning using the original and manually adapted
texts would yield the best results. However, it was necessary to verify this and assess whether the
difference compared to the non-finetuned model justified the computational cost of fine-tuning.</p>
        <p>3.4.1. Prompt Design for Plain Language</p>
        <p>The prompt used in Subtask 1 was designed in accordance with the guidelines and recommendations of
Plain Language outlined in a previous section. Its elaboration was highly detailed and included multiple
examples, both correct and incorrect, for each instruction, since, as already mentioned, this strategy
has proven to be highly effective. As a result, the prompt reached a considerable length, with a total of
19,617 characters.</p>
        <p>It is divided into two main sections: "Mission" and "Instructions". The first section, in which we
provided the necessary context and introduced general instructions, can in turn be divided into three
parts:
1. Description of the task and the general behavior expected of the model.
2. Prohibition against including explanations of the changes applied in the simplifications.
3. General example of simplification.</p>
        <p>The prohibition against including explanations was necessary because in preliminary tests we
observed that the model had included a list of the changes made in some of the texts. It is worth noting
that even this was not sufficient to prevent such output entirely, and the changes had to be removed
from the final result using regular expressions.</p>
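        <p>The exact regular expressions are not reproduced here. As a purely hypothetical illustration of this kind of post-processing, the sketch below removes a trailing block that begins with a heading such as "Changes made:" or "Cambios realizados:". Both the pattern and the heading wording are assumptions.</p>
        <preformat>
```python
import re

# Hypothetical pattern: a line that opens a list of changes, such as
# "Changes made:" or "Cambios realizados:", plus everything after it.
CHANGES_BLOCK = re.compile(
    r"\n\s*(?:changes(?: made| applied)?|cambios(?: realizados| aplicados)?)\s*:.*\Z",
    flags=re.IGNORECASE | re.DOTALL,
)


def strip_change_list(output):
    """Drop a trailing list of changes from a generated simplification."""
    return CHANGES_BLOCK.sub("", output).strip()


raw = "El encuentro será el sábado.\n\nCambios realizados:\n- Frases más cortas."
print(strip_change_list(raw))  # El encuentro será el sábado.
```
        </preformat>
        <p>A pattern of this kind only catches headings it anticipates, so in practice the outputs would still need to be inspected for other wordings.</p>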
        <p>In the "Instructions" section, we introduced specific guidelines in a highly detailed manner, including
several correct and incorrect examples for each one. When multiple guidelines were closely related, we
included them under the same entry. We based these on the linguistic and methodological knowledge
of Plain Language discussed in a previous section, and the instructions covered lexical, syntactic,
morphological, and information structure domains.</p>
        <p>3.4.2. Prompt Design for Easy-to-Read</p>
        <p>For the development of the prompt focused on Easy-to-Read, we adopted a similar approach, this time
following the guidelines and recommendations detailed in the corresponding section. Our prompt aimed
to reflect the higher level of precision and thoroughness characteristic of Easy-to-Read, as evidenced
by its even greater length of 21,726 characters. The lexical, syntactic, morphological, and information
structure dimensions were maintained, albeit with the specific features of this methodology. In fact,
some instructions, although worded differently, were essentially the same as those in the prompt for
Subtask 1. As we have previously explained, the two methodologies share many points in common,
despite their notable differences. Additionally, we included some considerations related to formatting
and the need for glosses, which are highly significant aspects in Easy-to-Read practices.</p>
        <p>The prompt is divided into eight parts, such that general instructions are followed by seven sections
grouping a substantial number of guidelines along with their examples: general objective, punctuation
marks and symbols, line formatting, lexis, numbers, morphosyntax, and information structure.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Preliminary Evaluation with the Test Set</title>
        <p>Before proceeding with the automatic simplification of the more than six hundred texts included in the
competition—a task with high computational cost—we chose to validate our approach in advance. To
this end, we applied the three metrics described above to the test subset of the training texts, which we
split into train, validation, and test. Our main interest was to compare the results obtained with the
manually adapted texts.</p>
        <p>We carried out this validation both with the texts generated using only the prompt and with those
simplified through a combination of prompt and fine-tuning, although we limited this analysis to
Subtask 2. This decision was based on the fact that, at this stage, our goal was merely to assess the
feasibility of the approach. In any case, Subtask 1 would later be evaluated on the full set of the six
hundred and seven competition texts.</p>
        <p>In order to assess the quality of the texts generated by the automatic simplification system, we applied
three complementary metrics: lexical similarity, semantic similarity, and readability score.
1. Lexical Similarity (Bag-of-Words): We used the Bag-of-Words approach with
CountVectorizer to represent texts as frequency vectors. Then, cosine similarity was
calculated between the original texts, the reference texts, and the simplified outputs. This metric
estimates the degree of lexical overlap without taking word order or meaning into account.
2. Semantic Similarity (embeddings): To capture meaning relations beyond surface-level lexicon,
we employed the multilingual model paraphrase-multilingual-MiniLM-L12-v2 from
Sentence Transformers, which generates dense sentence embeddings. We again computed cosine
similarity between various pairs of texts, allowing us to evaluate semantic content preservation
in the simplified versions.
3. Readability (Fernández-Huerta Index): Finally, we applied the Fernández-Huerta index to
compare the readability of the original, reference, and simplified texts. It is particularly interesting
to compare not only the reference and generated texts, but also the originals, in order to assess
whether there are significant differences in readability.</p>
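        <p>The two similarity metrics can be sketched as follows. For the lexical part, the block uses a dependency-free stand-in that approximates CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters) instead of calling scikit-learn itself. The semantic part loads the sentence-embedding model named above. All function names are illustrative.</p>
        <preformat>
```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def bow_vector(text):
    """Bag-of-Words counts for one text."""
    return Counter(TOKEN.findall(text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_similarity(text_a, text_b):
    """Lexical overlap; word order and meaning are ignored."""
    return cosine(bow_vector(text_a), bow_vector(text_b))


def semantic_similarity(text_a, text_b,
                        model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Cosine similarity of dense sentence embeddings (downloads the model)."""
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```
        </preformat>
        <p>Applied to each original, reference, and generated text, these functions yield per-text scores that can then be averaged for each pair type, for example original versus generated.</p>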
        <p>Taken together, these three metrics provide a multidimensional evaluation encompassing lexical
fidelity, semantic equivalence, and textual accessibility of the generated simplifications.</p>
        <p>In the case of simplification using only the prompt, the results of this preliminary evaluation were
positive, as the generated texts outperformed the manually simplified references in both lexical and
semantic similarity, while scoring only slightly lower in readability.</p>
        <p>1. Lexical Similarity
• Original – Reference: 0.8408
• Original – Generated: 0.8963
2. Semantic Similarity
• Original – Reference: 0.8193
• Original – Generated: 0.8498
3. Fernández-Huerta
• Original: 51.402 (somewhat difficult)
• Reference: 76.3211 (somewhat easy)
• Generated: 76.0391 (somewhat easy)</p>
        <p>For the fine-tuned model, the results were similar, although the readability
index obtained in this case was slightly lower than that of the prompt-only model.</p>
        <p>1. Lexical Similarity
• Original – Reference: 0.8408
• Original – Generated: 0.9061
2. Semantic Similarity
• Original – Reference: 0.8193
• Original – Generated: 0.8593</p>
        <sec id="sec-4-1-1">
          <title>3. Fernández-Huerta</title>
          <p>• Original: 51.402 (somewhat difficult)
• Reference: 76.3211 (somewhat easy)
• Generated: 75.3677 (somewhat easy)</p>
          <p>In short, the results show that the texts generated by both the base and the fine-tuned models exhibit
high similarity to the original texts in both lexical and semantic terms. In both cases, lexical
similarity was higher between the original and the generated texts (0.8963 and 0.9061) than
between the original and the reference texts (0.8408), suggesting considerable preservation of the original
vocabulary. Similarly, semantic similarity was higher in the original–generated comparison (0.8498
and 0.8593) than in the original–reference comparison (0.8193), indicating that the simplified texts maintain
the original meaning effectively. Finally, the Fernández-Huerta readability index reveals a notable increase in
reading ease: while the original texts are classified as “somewhat difficult” (51.4), both the reference and
generated texts reach “somewhat easy” levels (roughly 75 to 76). Although the reference texts, i.e.,
those manually simplified, achieved a slightly higher readability score, the difference from the
automatically generated texts is minimal, further supporting the effectiveness of the model. Moreover,
the fact that the base model achieved such competitive results suggests that the additional computational
cost of fine-tuning may not be justified, especially given the relatively small improvements in similarity.
Nevertheless, this hypothesis will be tested on the texts to be simplified for the competition.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results of Subtask 1</title>
        <p>For the analysis of the texts generated in Subtask 1, we present below both the results obtained using
our implementation of the three metrics (lexical similarity, semantic similarity, and the
Fernández-Huerta readability index) and the official results published at the end of the competition.
This comparison enables a more comprehensive assessment of the models’ performance by contrasting
internal automatic evaluation with the external evaluation provided by the competition organizers.
Unlike the latter, the internal evaluation analyzed the performance of both the fine-tuned and
non-fine-tuned models, which allows a direct comparison of the impact of fine-tuning on the simplification
metrics.</p>
        <p>4.2.1. Internal Evaluation</p>
        <p>1. Lexical similarity
• Prompt only: 0.8615
• Fine-tuning: 0.8671
2. Semantic similarity
• Prompt only: 0.8388
• Fine-tuning: 0.8339</p>
        <sec id="sec-4-2-1">
          <title>3. Fernández-Huerta index</title>
          <p>• Original: 75.5377 (somewhat easy)
• Prompt only: 80.8373 (easy)
• Fine-tuning: 83.3521 (easy)</p>
          <p>As can be observed, the differences between the instruction-only model (prompt only) and the
fine-tuned model are minimal in terms of both lexical similarity (0.8615 vs. 0.8671) and semantic similarity
(0.8388 vs. 0.8339), suggesting that fine-tuning does not lead to substantial improvements. However,
the difference is somewhat more pronounced in the case of the Fernández-Huerta readability index,
which increases from 80.8373 with the non-fine-tuned model to 83.3521 with the fine-tuned one.</p>
          <p>4.2.2. External Evaluation</p>
          <p>For the external evaluation, the organizers computed lexical similarity with a
bag-of-words model using TF-IDF vectors, whereas our internal evaluation relied on raw frequency
vectors through CountVectorizer. Two separate rankings were published: one based on the average
of lexical and semantic similarity (cosine similarity with TF-IDF and embeddings), and another based on
the Fernández-Huerta readability index. Our team, NIL-UCM, achieved competitive results in the first
ranking, securing second place with an average cosine similarity of 0.71, just behind HULAT-UC3M
(0.75), and ahead of CARDIFFNLP and VICOMTECH (both 0.70). In terms of individual metrics,
NIL-UCM obtained a lexical similarity score of 0.67 and a semantic similarity score of 0.75. However, in
the ranking based on readability, our system ranked third, with an average Fernández-Huerta index of
70.42, behind VICOMTECH (82.98) and CARDIFFNLP (78.81), but ahead of HULAT-UC3M (69.72). This
discrepancy suggests that while our approach preserved lexical and semantic content effectively, there
may still be room for improvement in optimizing textual accessibility.</p>
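          <p>The similarity ranking can be checked with a few lines of arithmetic. The sketch below averages the published TF-IDF and embedding scores for Subtask 1 and sorts the teams by the result. It assumes the average is a plain mean rounded half-up to two decimals; since the published per-metric scores are themselves rounded, the organizers' own computation may have used unrounded values. The function name is hypothetical.</p>
          <preformat>
```python
from decimal import Decimal, ROUND_HALF_UP


def average_similarity(lexical, semantic):
    """Mean of two published scores, rounded half-up to two decimals."""
    mean = (Decimal(lexical) + Decimal(semantic)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Published TF-IDF / embedding scores for Subtask 1, kept as strings so
# that Decimal represents them exactly (no binary floating-point error).
subtask1 = {
    "HULAT-UC3M": ("0.71", "0.78"),
    "NIL-UCM": ("0.67", "0.75"),
    "CARDIFFNLP and VICOMTECH": ("0.63", "0.77"),
}

averages = {team: average_similarity(*scores) for team, scores in subtask1.items()}

# Highest average first, which is the order used in the similarity ranking.
ranking = sorted(averages, key=averages.get, reverse=True)
```
          </preformat>
          <p>Decimal arithmetic is used because a score such as 0.745 cannot be represented exactly as a binary float, and Python's built-in round would not reliably round it up to 0.75.</p>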
        </sec>
        <sec id="sec-4-2-2">
          <title>Ranking based on the average of cosine similarities (TF-IDF / Embeddings):</title>
          <p>1. HULAT-UC3M: 0.71 / 0.78 (average: 0.75)
2. NIL-UCM: 0.67 / 0.75 (average: 0.71)
3. CARDIFFNLP and VICOMTECH: 0.63 / 0.77 (average: 0.70)</p>
          <p>Ranking based on Fernández-Huerta readability index:
1. VICOMTECH: 82.98
2. CARDIFFNLP: 78.81
3. NIL-UCM: 70.42
4. HULAT-UC3M: 69.72</p>
          <p>4.3. Results of Subtask 2</p>
          <p>For Subtask 2, we present an analysis similar to that of Subtask 1, reporting internal results based on
the same three metrics and comparing them with the official evaluation. Again, we distinguish between
the fine-tuned model and the base model in order to assess the impact of fine-tuning on this subtask.</p>
          <p>4.3.1. Internal Evaluation</p>
          <p>1. Lexical similarity
• Prompt only: 0.8481
• Fine-tuning: 0.8787</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>2. Semantic similarity</title>
          <p>• Prompt only: 0.8280
• Fine-tuning: 0.8396</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>3. Fernández-Huerta Index</title>
          <p>• Original: 75.6803 (somewhat easy)
• Prompt only: 82.7251 (easy)
• Fine-tuning: 82.1256 (easy)</p>
          <p>The results obtained for Subtask 2 are generally positive. The fine-tuned model achieves slightly
better performance than the base model on the lexical (0.8787 vs. 0.8481) and semantic (0.8396 vs. 0.8280)
similarity metrics, particularly the former. In contrast, for the Fernández-Huerta readability index,
the prompt-only model achieves a slightly higher score (82.7251) than the fine-tuned model (82.1256),
although both fall within the range of texts considered easy to read.</p>
          <p>4.3.2. External Evaluation</p>
          <p>As in Subtask 1, Subtask 2 includes two different rankings: one based on the average of lexical similarity
(measured with TF-IDF) and semantic similarity (measured with embeddings), and another based on
the Fernández-Huerta readability index. Our team, NIL-UCM, achieved the highest score in the first
ranking, with a global average similarity of 0.72, due to high values in both lexical similarity (0.68) and
semantic similarity (0.75). It was followed by CARDIFFNLP (0.71) and UR (0.70), while UNED-INEDA
and VICOMTECH ranked fourth and fifth with 0.68 and 0.66, respectively. However, the ranking based
on the Fernández-Huerta index, which evaluates the readability of the generated texts, shows an inverse
pattern: VICOMTECH leads with an average of 85.44, closely followed by UR (85.12). In this case, our
team ranks fifth with a score of 69.40, indicating that although our simplifications exhibit a high degree
of similarity with the references, there is still room for improvement in terms of readability, as was
also observed in Subtask 1.</p>
        </sec>
        <sec id="sec-4-2-5">
          <title>Ranking based on the average of cosine similarities (TF-IDF / Embeddings):</title>
          <p>1. NIL-UCM: 0.68 / 0.75 (average: 0.72)
2. CARDIFFNLP: 0.65 / 0.77 (average: 0.71)
3. UR: 0.64 / 0.76 (average: 0.70)
4. UNED-INEDA: 0.60 / 0.75 (average: 0.68)
5. VICOMTECH: 0.58 / 0.74 (average: 0.66)</p>
          <p>Ranking based on Fernández-Huerta readability index:
1. VICOMTECH: 85.44
2. UR: 85.12
3. CARDIFFNLP: 77.85
4. UNED-INEDA: 72.39
5. NIL-UCM: 69.40</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>As noted in a previous section, the Fernández-Huerta index is not an appropriate metric for evaluating
readability in the context of automatic text simplification. This measure, which focuses exclusively on
superficial aspects such as word and sentence length, fails to capture the complexity of the processes
involved in tasks such as Plain Language and Easy-to-Read. Using the same metric for both subtasks is
problematic, as it does not reflect the methodological differences or specific objectives of each. Although
our results on this metric were outperformed by those of other teams in both tasks, we consider it
necessary to adopt more comprehensive and multidimensional indicators that can adequately assess
the various aspects involved in text adaptation. In this regard, the prompts employed in our proposals
provide detailed guidance that addresses multiple dimensions of simplification (lexical, morphosyntactic,
information organization, etc.); however, their effectiveness is only partially reflected through such a
limited metric.</p>
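      <p>The dependence of the index on surface features can be seen directly in its formula. The sketch below uses one commonly implemented formulation, 206.84 − 60 · (syllables per word) − 1.02 · (words per sentence); this is an assumption, since other variants circulate and the exact variant and syllabifier behind the scores reported here are not specified. The function takes raw counts so that no particular syllabification method is implied. The labels for 70 to 80, 80 to 90 and 50 to 60 are those used in the results above, while the remaining labels are our translation of the usual interpretation scale.</p>
      <preformat>
```python
def fernandez_huerta(syllables, words, sentences):
    """Fernández-Huerta score from raw counts (higher means easier to read)."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    syllables_per_word = syllables / words
    words_per_sentence = words / sentences
    # Assumed variant of the formula; only these two ratios enter the score.
    return 206.84 - 60 * syllables_per_word - 1.02 * words_per_sentence


def difficulty_band(score):
    """Map a score to an interpretation label (lower bounds, checked top-down)."""
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "somewhat easy"),
        (60, "normal"),
        (50, "somewhat difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult"


# Same 100 words and 200 syllables, split into 5 or into 10 sentences.
long_sentences = fernandez_huerta(syllables=200, words=100, sentences=5)
short_sentences = fernandez_huerta(syllables=200, words=100, sentences=10)
```
      </preformat>
      <p>In this hypothetical example, halving sentence length alone raises the score by about ten points, without any change in vocabulary, syntax, or information structure. This illustrates why a system can gain or lose readability points for reasons unrelated to the quality of the adaptation.</p>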
      <p>Another important aspect concerns the nature of the original texts used in the tasks. These texts
already exhibit a high readability index, which limits the potential for improvement. As previously
mentioned, this issue is particularly relevant in the subtask focused on Plain Language. It is worth
recalling that, in the test subset corresponding to the training texts, the average Fernández-Huerta
index of the original texts was significantly lower, which allowed the effectiveness of our simplification
methodology to stand out more clearly.</p>
      <p>As shown, our models perform particularly well in terms of lexical and semantic similarity between
the original texts and their simplified versions, indicating a high level of fidelity to the source content.
Moreover, the readability results are also satisfactory. In fact, for the test texts from the training set,
the Fernández-Huerta scores often match or even exceed those obtained by the reference simplified
texts. Regarding the competition texts, while the performance of other teams suggests there is room
for improvement, it is important to emphasize that the limitations of the metric used hinder a fair and
comprehensive evaluation of model performance.</p>
      <p>With regard to further training via fine-tuning, the results are inconclusive. In Subtask 1, the
fine-tuned model shows some improvement in readability over the base model (83.35 vs. 80.84), whereas in
Subtask 2 it scores slightly lower than the base model (82.13 vs. 82.73). In light of these findings,
one may question whether the computational cost of fine-tuning is justified, as its benefits do not appear
to be consistent or significant. It should be noted, however, that fine-tuning was limited to only three
epochs to avoid excessive computational costs, and further experimentation with longer training could
potentially yield different results.</p>
      <p>Finally, it is worth commenting on the impact of the prompts used. The prompt corresponding to
Subtask 1 appears to be the most effective for guiding textual simplification, as suggested by the fact that
the highest overall readability score was obtained with the fine-tuned model in that task. Nevertheless,
this conclusion must be qualified due to the limitations of the metric employed. Furthermore, the
difference between the best result for the Subtask 1 prompt (fine-tuned model: 83.35) and the best result
for the Subtask 2 prompt (non-fine-tuned model: 82.73) is not statistically significant. Similarly, in terms
of lexical and semantic similarity, no substantial differences are observed either between subtasks or
between base and fine-tuned models, which further highlights the need for a more in-depth analysis
using a combination of metrics for a more robust evaluation. As noted previously, since fine-tuning
was limited to only three epochs to manage computational costs, further investigation with extended
training is necessary to fully assess its potential impact.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This challenge represents a significant step forward in the field of automatic text simplification, a task
with important social implications, particularly regarding the right to access information for individuals
with cognitive disabilities. Such technologies can contribute to a more inclusive society, in which all
citizens are able to exercise their rights on equal terms.</p>
      <p>Our approach, based on the use of explicit linguistic instructions, has proven effective in terms of
semantic fidelity and readability. Nevertheless, while the results are encouraging, we believe there is
still room for improvement. To continue making progress, it is essential to adopt more precise and
comprehensive evaluation metrics that better reflect the complexity of the processes involved in text
simplification.</p>
      <p>In conclusion, we consider it a priority to move towards more accurate and multidimensional
evaluations, which allow for a fairer and more comprehensive assessment of the different approaches
to automatic simplification.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>This publication is part of the R&amp;D&amp;I project HumanAI-UI, Grant PID2023-148577OB-C22
(Human-Centered AI: User-Driven Adaptative Interfaces-HumanAI-UI) funded by
MICIU/AEI/10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
  </back>
</article>