<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preface on the Iberian Languages Evaluation Forum (IberLEF 2025)</article-title>
      </title-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>IberLEF is a shared evaluation campaign for Natural Language Processing (NLP) systems in Spanish and other Iberian languages. This annual cycle begins in December with a call for task proposals and concludes in September with an IberLEF meeting held alongside SEPLN. Throughout this period, various challenges are conducted with significant international participation from academic and industry research groups. The aim is to inspire the research community to organize competitive tasks in text processing, understanding, and generation, thereby establishing new research challenges and advancing state-of-the-art results in these languages. In its seventh edition, IberLEF 2025 has contributed to the field of NLP in Spanish and other Iberian languages with a total of 445 researchers from 196 research groups in 21 countries participating in 14 NLP challenges in Spanish, Portuguese, and English. This volume begins with an overview of all the activities conducted during IberLEF 2025, along with aggregated figures and insights about the various tasks. It also includes a collection of papers describing the participating systems. However, the task overviews are not included in these proceedings; they have been published in the September 2025 issue of the journal Procesamiento del Lenguaje Natural. IberLEF 2025 has addressed the following tasks:</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>ADoBo 2025 is the second edition of the ADoBo shared task. The first edition was held
in 2021 and addressed the automatic detection of unassimilated borrowings in the Spanish
press. This second edition of the task focused on the automatic detection of anglicisms (i.e.,
unassimilated lexical borrowings from English) in Spanish journalistic texts. Participants were
asked to return annotated spans of anglicisms from a set of Spanish sentences. Unlike the 2021
edition, no training set was provided, although a development set was made available. The
development set released was the same used in the 2021 edition of ADoBo, specifically including
only sentences that contained anglicisms and no lexical borrowings from other languages. The
test set provided was BLAS (Benchmark for Loanwords and Anglicisms in Spanish). BLAS
consists of 1,836 annotated sentences in Spanish (37,344 tokens), which contain 2,076 spans
labeled as anglicisms. The task was conducted entirely in Spanish and the evaluation was
based on strict span-level precision, recall, and F1-score. A total of 14 teams registered for the
task, out of which 6 teams submitted results on the test set and 5 teams sent working notes.
Participants submitted solutions using LLMs, deep learning models, Transformer-based models,
and rule-based systems. The best performing team, qilex, achieved an F1 score of 98.79 using an
OpenAI o3 model with an enriched prompt that included explicit guidelines along with reminders.</p>
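      <p>The strict span-level evaluation used in the task can be sketched in a few lines. The following is an illustrative snippet, assuming spans are compared by exact boundary match; it is not the official ADoBo scorer:</p>
      <preformat>
```python
# Illustrative sketch of strict span-level evaluation (not the official
# ADoBo scorer): a predicted span counts as a true positive only if its
# boundaries exactly match a gold annotation.

def span_f1(gold, pred):
    """gold, pred: sets of (sentence_id, start, end) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```
      </preformat>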
      <p>CLEARS explores automated techniques for adapting Spanish texts into plain language
and easy-to-read formats. The task is divided into two subtasks: one focused on plain language
adaptation and the other on easy-to-read adaptation. The dataset consists of 3,000 news articles
from various municipalities in the province of Alicante (Spain), covering a wide range of topics.</p>
      <p>Each article was adapted into both plain language and easy-to-read versions following the
general guidelines of the Asociación Española de Normalización (UNE), with all adaptations
reviewed and validated by a team of field experts. Participants’ submissions were evaluated
using lexical and semantic similarity measures, along with readability scores. In total, four
teams participated in Subtask 1 and five teams in Subtask 2. The top-performing systems in
both subtasks used prompting techniques with instruction-tuned LLMs. Team HULAT-UC3M
achieved the best results in Subtask 1, reaching a cosine similarity of 0.75 with a method based
on prompting a LoRA-adapted RigoChat-7B-v2 model fine-tuned on the provided dataset.
Team NIL-UCM led Subtask 2 with a cosine similarity of 0.72, using a similar approach based
on Mistral-7B-Instruct-v0.3.</p>
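      <p>The cosine-similarity scoring reported above can be illustrated as follows. The sentence-embedding step that maps each text to a vector is assumed and not shown, so this is only a sketch of the final comparison:</p>
      <preformat>
```python
import math

# Sketch of cosine similarity between two embedding vectors, as used to
# compare a system's adaptation against the reference text. How the texts
# are embedded (e.g., by a sentence-encoder model) is assumed, not shown.

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```
      </preformat>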
      <p>PROFE is designed to assess the reading comprehension abilities of NLP systems, focusing
on their linguistic competence under the same conditions used to evaluate humans. The task
includes three subtasks: (i) Multiple choice, where systems must select the correct answer from a
list of options for each question, (ii) Matching, where systems must pair texts from two different
sets, similar to natural language inference and semantic textual similarity tasks, and (iii)
Fill-in-the-gap, where systems must identify the correct position of text fragments within a masked
passage. All three subtasks were evaluated using accuracy as the metric. For this task, the
organizers created an evaluation dataset based on Spanish proficiency tests developed over the
years by the Instituto Cervantes. A total of 19 teams registered for the task, with 8 submitting
runs. The multiple-choice subtask received 24 submissions, the matching subtask 11, and the
fill-in-the-gap subtask 9. Team Vicomtech achieved the highest accuracy across all three subtasks
(above 93%) using ensembles of open-source large language models, such as Qwen-2.5-14B and
Phi-4-14B, operating in a zero-shot setup.</p>
    </sec>
    <sec id="sec-2">
      <title>Harmful and Inclusive Content</title>
      <p>DIMEMEX is the second edition of DIMEMEX at IberLEF 2024, continuing its mission to
advance research on automatic detection of inappropriate content in memes, with a particular
focus on Mexican Spanish. This year’s edition featured three subtasks: (i) Three-way
classiifcation to etermine whether a meme contains hate speech, inappropriate content, or neither,
(ii) Fine-grained classification , where systems must assign memes to specific categories of hate
speech, and (iii) LLM-focused three-way classification , same as subtask 1, but restricted to using
LLMs only. The DIMEMEX 2025 dataset is a refined version of the previous year’s, consisting
of approximately 3,000 memes manually annotated for abusive content. These memes were
collected from public Facebook groups in Mexico known for sharing such material. All subtasks
were evaluated using macro-averaged recall, precision, and F1 score. Ten teams participated in
Subtask 1, while Subtasks 2 and 3 each saw three participating teams. Team HARGP-BETO
achieved the best performance in Subtask 1 (macro-F1 score of 0.58), using a text-only gated
unit model that fuses local and global attention mechanisms based on OCR and textual
descriptions. Team UC-UCO-CICESE led Subtask 2 (macro-F1 score of 0.37) with a system that
combined text and image modalities through a late fusion of BETO (for text) and ViT (for images).</p>
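      <p>Macro-averaged scoring of the kind used here can be sketched as follows. This illustrative function (not the official scorer) shows why minority classes weigh as much as majority ones: each class's F1 is computed independently and then averaged:</p>
      <preformat>
```python
# Sketch of macro-averaged F1 (illustrative, not the official scorer):
# per-class F1 scores are computed independently and averaged, so every
# class contributes equally regardless of its frequency.

def macro_f1(gold, pred):
    labels = set(gold) | set(pred)
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```
      </preformat>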
      <p>HOMO-LAT25 continues the HOMO-MEX shared tasks from 2023 and 2024, extending
the study of polarity detection toward LGBTQ+ content in online messages to Spanish dialects
in Latin America. This year’s edition focused on Reddit posts written in Spanish from 19
Latin American countries, annotated with positive, negative or neutral polarity toward specific
LGBTQ+ identity keywords. The task comprised two tracks: (i) Track 1 evaluated polarity
detection when training and test data came from the same Spanish dialect (Argentina, Chile,
Colombia, and Mexico); and (ii) Track 2 evaluated cross-dialect generalization by testing on
countries unseen during training. A total of 30 teams registered for the task, out of which 7 submitted
valid results and 6 presented working notes. All participating teams used Transformer-based
models; two also incorporated traditional machine learning, and two leveraged large language
models (LLMs). The best results were obtained by the PLD team, achieving a macro F1-score
of 52.96 in Track 1 and 50.86 in Track 2. Their approach combined translating all input texts
into English to take advantage of highly performing pre-trained models and mitigate dialectal
variation, with a new “Context Engine” that retrieves semantically similar examples from each
sentiment class (positive, negative, neutral) to enrich the model’s inference process and improve
generalization, especially in a challenging cross-dialect setting.</p>
      <p>MentalRiskES aims to promote the early detection of mental risk disorders in Spanish. In
this third edition of the task, the focus was on the detection of gambling disorders, with two
subtasks: (1) risk detection of gambling disorders, (2) risk detection of gambling disorders and
determining the type of addiction. The task used a dataset of texts from social media, annotated
as low-risk or high-risk for different types of gambling disorders. This edition had 13 teams
submitting results, 12 of which also submitted papers. The submissions were ranked according to
overall classification, early prediction, and efficiency metrics, emphasizing the need for
sustainable practices. Team UNSL obtained the best Macro-F1 for Subtask 1 (0.567), team MCDI
obtained the best Macro-F1 for Subtask 2 (0.589), while team PLN_PPM_ISB obtained the best
results related to early prediction.</p>
      <p>MiSonGyny focused on the automatic detection and classification of misogynistic content in
Spanish song lyrics. It was designed to address the underexplored presence of symbolic violence
and hate speech in musical texts, which often include subtle and metaphorical expressions of
misogyny. The task comprised two subtasks: (i) Subtask 1, binary classification of song verses
as Misogynistic (M) or Non-Misogynistic (NM); and (ii) Subtask 2, fine-grained classification
of misogynistic content into Sexualization (S), Violence (V), Hate (H), or Not Related (NR). A
total of 13 teams participated in Subtask 1 and 9 in Subtask 2, out of which 9 submitted working
notes. Most approaches relied on transformer-based architectures, complemented by traditional
machine learning, data augmentation and, in some cases, LLMs or hierarchical pipelines. The
best-performing team in both subtasks was HULAT UC3M, achieving an F1-score of 0.8811
in Subtask 1 and 0.5895 in Subtask 2. This team developed a comprehensive pipeline that
combines data augmentation, transformer-based encoders, and traditional machine learning
methods. In addition, it addressed class imbalance through minority class oversampling using
back-translation and AEDA techniques.</p>
      <p>PolyHope aims to detect hope speech (messages that express optimism, encouragement,
or the desire for a better future) in English and Spanish social media. Two subtasks were
proposed: (a) binary hope speech detection, and (b) multiclass detection with the categories
generalized hope, realistic hope, unrealistic hope, not hope, or the sarcasm category meant to
detect hopeful language that is used in a misleading way. The dataset used contains over 30,000
tweets annotated with hope speech information, around two-thirds in Spanish and
the rest in English. 31 participants took part in the competition, with 13 papers accepted. The
best performances for the binary task were 0.852 F1 for Spanish by teddymas, and 0.871 F1 for
English by michaelibrahim. For the multiclass task, the best performances were 0.742 macro-F1
for Spanish by lephuquy, and 0.755 macro-F1 for English by supachoke. Common challenges
mentioned by teams included imbalanced data, language mixing, and cultural differences in how
people express emotions.</p>
    </sec>
    <sec id="sec-3">
      <title>Content Curation and Generation</title>
      <p>PastReader focuses on the automatic transcription of digitized Spanish historical newspapers.
The task includes two subtasks: (i) Error Correction, where participants receive the output of
an OCR system and must generate clean, corrected versions of the extracted texts; and (ii)
End-to-end Extraction, which explores full pipeline approaches that take scanned pages as input
and produce curated transcriptions as output. The corpus used in this task consists of historical
newspaper publications from the public domain, digitized by the National Library of Spain
(BNE) and available through the Hemeroteca Digital. It includes 298 press titles, 88,748 issues,
and a total of 8,302,407 pages in PDF format. For the shared task, the organizers sampled
121,295 documents and transcriptions, which were split into training, validation, and test sets in
a 74-4-22 ratio. Evaluation relied on standard text generation metrics such as Word Error Rate,
(Normalized) Levenshtein Distance, BLEU, and ROUGE, as well as sustainability metrics,
including CO2 emissions. Only Subtask 2 received participation, with three teams submitting
systems. Team OCRTIST achieved the best performance based on the primary ranking metric
(Levenshtein distance of 53.30). Their system used Gemini 2.5 PRO in a standalone setup,
relying on a single prompt to perform OCR directly from scanned images.</p>
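      <p>The Levenshtein distance named as the primary ranking metric follows a standard dynamic-programming recurrence. The snippet below is a character-level illustration, not the official scorer; the normalized variant divides by the length of the longer string:</p>
      <preformat>
```python
# Character-level Levenshtein distance (illustrative sketch, not the
# official scorer): minimum number of insertions, deletions, and
# substitutions needed to turn string a into string b.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    # Divide by the longer string's length (guarding against empty input).
    return levenshtein(a, b) / max(len(a), len(b), 1)
```
      </preformat>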
      <p>PRESTA is a task about question answering over tabular data in Spanish, where
participants had to interpret natural language questions to obtain data from tabular sources.
The dataset contains 31,000 data rows from 10 different sources, with a total of 200
question-answer pairs for training, and 100 pairs for test. The question-answer pairs could have
different expected answer types: boolean, categorical, numeric, or list (either of categories or
numbers). The evaluation metric for the task is average accuracy over all categories. Seven
teams submitted systems to the competition, with both the ITU NLP and sonrobok4 teams
obtaining the best performance of 87% accuracy. ITU NLP used a zero-shot LLM-based Python
code generation method, experimenting with several models from the OpenAI, Qwen, DeepSeek,
and LLaMA families. sonrobok4 used a multi-prompt strategy, also applying LLMs in a zero-shot
setup with OpenAI and DeepSeek models. One of the conclusions is that current LLM technologies
outperform traditional pipelines, but also that small open-source models can work on par with
bigger ones when properly used.</p>
      <p>TA1C focuses on the detection and spoiling of clickbait in Spanish news, particularly in
tweets that link to a piece of news. The task consists of two subtasks: i) Clickbait Detection,
a binary classification task to determine whether a news teaser is clickbait based on the
information gap theory where headlines deliberately omit key information to provoke curiosity;
and ii) Clickbait Spoiling, a generative task that requires producing a concise Spanish text
that fills the information gap created by the clickbait. The dataset provided includes 4,200
manually annotated Spanish tweets for the clickbait detection task and 500 human-written
spoilers for the spoiling task, all collected from 18 media outlets across 12 Spanish-speaking
countries and international sources. A total of 27 teams registered for the task, out of which 13
participated in the evaluation phase of the clickbait detection task and 3 teams in the spoiling
task. The best-performing team in the detection task, UmuTeam, achieved an F1-score of
0.8156 using an ensemble of fine-tuned transformer models, including MarIA, BERTIN, ALBETO,
and the decoder-only Gemma-2-2B-it with QLoRA fine-tuning. In the spoiling task, the top
system in manual evaluation, submitted by CogniCIC, obtained a score of 3.88 out of 5
in Accuracy/Completeness, the highest among all participants, using a few-shot prompting
approach with the Claude Sonnet 4 LLM.</p>
      <p>ASQP-PT is a shared task on aspect-based sentiment analysis in Portuguese, consisting
of four subtasks: (1) aspect term extraction, (2) opinion term extraction, (3) aspect category
detection, and (4) aspect sentiment quadruple prediction (ASQP). The task presented a corpus
of 1,236 TripAdvisor reviews in Portuguese about hotels in four cities, annotated with 5,749
(Category, Aspect, Opinion, Polarity) quadruples. Two teams took part in the task, one of
them in all four subtasks, and the other one only in the aspect term extraction subtask. The
teams could not beat the fine-tuned Portuguese BERT baseline for Subtasks 1, 2, and 3, but for
the ASQP subtask the best result was an F-measure of 0.46 by the ABCD team, using an end-to-end
encoder-decoder transformer model, beating the baseline of 0.40.</p>
      <p>REST-MEX 2025 is the fourth edition of the REST-MEX shared task, aimed at advancing
natural language processing for tourism in the Mexican context, with a focus on sentiment
analysis and classification of user-generated texts about Mexico’s Magical Towns (Pueblos
Mágicos). The task is structured into three subtasks: i) polarity prediction, a fine-grained
classification into five levels of polarity (from 1 to 5); ii) service type classification, identifying
whether the review refers to a hotel, restaurant, or tourist attraction; and iii) geographical
identification of the visited location, a multiclass classification task to determine which of the
40 predefined Magical Towns is being reviewed. The corpus consists of 297,217 TripAdvisor
reviews shared by tourists who visited representative destinations in Mexico. A total of 32 teams
participated in the shared task. The best-performing team in all three subtasks was UDENAR,
which obtained a macro F1 score of 0.64 in polarity prediction, 0.99 in service type classification,
and 0.69 in geographical identification. Their approach transformed each multiclass
task into independent binary classification problems (one per class), using centroid-based
sampling on balanced datasets to address class imbalance, particularly improving performance
on minority classes like negative polarity. They combined fine-tuned transformer models with
knowledge transfer techniques to enhance generalization and robustness across the three tasks.</p>
      <p>SatiSPeech proposes research on the automatic recognition of satire in Spanish in two
independent subtasks: the first one is text-only, based on YouTube video transcriptions; the
second one is multimodal, combining textual and acoustic information. The dataset comprised
8,000 short audio segments (no longer than 25 seconds) and their transcriptions, obtained from
popular satirical programs on YouTube; each segment was annotated as satirical or non-satirical.
A total of 11 teams participated in the task, and all of them submitted systems for
both subtasks. For Subtask 1, the best performance was 85.6 F1, obtained by the UPV-ELiRF
team combining several BERT and LLM models, while for Subtask 2, the top performance was
88.3 F1 by the UMU-Ev team using HuBERT and prosodic features combined with an SVM classifier.</p>
      <p>In the realm of Natural Language Processing, where Deep Learning and, more specifically
Large Language Models, have become the go-to solutions, defining research challenges and
creating robust evaluation methods and high-quality test collections are crucial for success.
These elements enable iterative testing and refinement. IberLEF is playing an important role in
advancing these efforts and moving the field forward.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>