<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Ángel González-Barba</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Chiruzzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salud María Jiménez-Zafra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Computación, Facultad de Ingeniería, Universidad de la República</institution>
          ,
          <country country="UY">Uruguay</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SINAI, Computer Science Department, CEATIC, Universidad de Jaén</institution>
          ,
          <addr-line>Jaén</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TransPerfect</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>IberLEF is a shared evaluation campaign for Natural Language Processing systems focused on Spanish and other Iberian languages, organized annually since 2019 as part of the conference of the Spanish Society for Natural Language Processing. Its aim is to inspire the research community to develop and participate in competitive tasks related to text processing, understanding, and generation. These efforts are geared towards defining new research challenges and setting state-of-the-art results in Iberian languages, including Spanish, Portuguese, Catalan, Basque, and Galician. This paper provides an overview of the evaluation activities conducted during IberLEF 2025, which featured 14 tasks and 33 subtasks. These tasks covered various areas such as language comprehension, harmful and inclusive content, content curation and generation, and sentiment and figurative analysis. Overall, the IberLEF 2025 activities represented a significant collaborative effort, involving more than 440 researchers from 21 countries across Europe, Asia, Africa, and the Americas.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Evaluation Challenges</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>IberLEF is a shared Natural Language Processing (NLP) evaluation campaign focused on Spanish and
other Iberian languages. It has been organized annually since 2019, as part of the conference of the Spanish
Society for Natural Language Processing. It aims to inspire the research community to develop and
participate in competitive tasks related to processing, understanding, and generation of at least one of
the Iberian languages, including Spanish, Portuguese, Catalan, Basque, and Galician. These efforts are
geared towards defining new research challenges and improving the state-of-the-art results in these
languages.</p>
      <p>In this shared evaluation campaign, the research community defines new challenges and proposes
tasks to advance the NLP state of the art. The task proposals are reviewed by the IberLEF steering
and program committees, and then evaluated by the IberLEF general chairs. The organizers of the
accepted tasks are in charge of setting up the evaluation according to their proposal, promoting the
task, and managing the submissions and scientific evaluation of system description papers written
by participants. These papers are included in this IberLEF proceedings volume, published at CEUR
Workshop Proceedings. In addition, task organizers must prepare and submit an overview of their task
and evaluation, which is reviewed by the IberLEF organizing committee and published in the journal
Procesamiento del Lenguaje Natural, vol. 75 (September 2025 issue). Finally, the task organizers report
the results of the tasks, and selected participants present descriptions of their systems at the IberLEF
workshop.</p>
      <p>IberLEF 2025 takes place on September 23, 2025, in Zaragoza (Aragón, Spain), as part of the XLI
International Conference of the Spanish Society for Natural Language Processing (SEPLN 2025). This
year, 14 shared tasks were accepted to be organized as part of IberLEF 2025, out of 18 proposals. These
tasks focus on a range of NLP challenges, including language comprehension, harmful and inclusive
content detection, content curation and generation, and sentiment and figurative analysis.</p>
      <p>In this paper, we provide a summary and analysis of the tasks organized in IberLEF 2025 to offer a
clearer understanding of this collective effort.</p>
    </sec>
    <sec id="sec-2">
      <title>2. IberLEF 2025 Tasks</title>
      <sec id="sec-2-1">
        <title>2.1. Language Comprehension</title>
        <p>The 14 tasks involved in IberLEF 2025 are presented below, grouped by theme.</p>
        <p>
          ADoBo 2025 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], Automatic Detection of Borrowings, addressed the automatic detection of anglicisms
(English lexical borrowings) in Spanish journalistic texts. This shared task was previously organized
in 2021 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], but with a specific focus on the automatic detection of unassimilated borrowings in
the Spanish press. In this edition, participants were asked to return annotated spans of anglicisms
from a set of Spanish sentences. Unlike the 2021 edition, no training set was provided, although a
development set was made available. The development set released was the same used in the 2021
edition of ADoBo, specifically including only sentences that contained anglicisms and no lexical
borrowings from other languages. The test set provided was BLAS (Benchmark for Loanwords and
Anglicisms in Spanish) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. BLAS consists of 1,836 annotated sentences in Spanish (37,344 tokens),
which contain 2,076 spans labeled as anglicisms. The task was conducted entirely in Spanish and
the evaluation was based on strict span-level precision, recall, and F1-score. A total of 14 teams
registered for the task, out of which 6 teams submitted results on the test set and 5 teams sent
working notes. Participants submitted solutions using LLMs, deep learning models, Transformer-based
models, and rule-based systems. The best performing team, qilex, achieved an F1 score of 98.79
using an OpenAI o3 model with an enriched prompt that included explicit guidelines along with reminders.
        </p>
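        <p>As an illustration of strict span-level scoring, the following minimal Python sketch counts a predicted span as correct only when its boundaries match a gold span exactly. The span layout (sentence index, start offset, end offset), the function name, and the toy data are assumptions made for this example; the official ADoBo scorer may differ in its details.</p>
        <preformat>
```python
from typing import Iterable, Set, Tuple

# Hypothetical span layout: (sentence_id, start_offset, end_offset).
Span = Tuple[int, int, int]


def strict_span_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact boundary matches."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    # A predicted span is a true positive only if an identical gold span exists.
    true_positives = len(gold_set.intersection(pred_set))
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data: the second prediction overlaps a gold span but has a wrong end
# offset, so under strict matching it is both a false positive and a miss.
gold = [(0, 4, 12), (0, 20, 27), (1, 0, 6)]
predicted = [(0, 4, 12), (0, 20, 25), (1, 0, 6), (1, 10, 15)]
precision, recall, f1 = strict_span_scores(gold, predicted)
```
        </preformat>
        <p>Under strict matching, partially overlapping predictions receive no credit, which makes the metric sensitive to boundary errors.</p>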
        <p>
          CLEARS [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Challenge for Plain Language and Easy-to-Read Adaptation for Spanish texts, explores
automated techniques for adapting Spanish texts into plain language and easy-to-read formats. The task
is divided into two subtasks: one focused on plain language adaptation and the other on easy-to-read
adaptation. The dataset consists of 3,000 news articles from various municipalities in the province of
Alicante (Spain), covering a wide range of topics. Each article was adapted into both plain language and
easy-to-read versions following the general guidelines of the Asociación Española de Normalización
(UNE), with all adaptations reviewed and validated by a team of field experts. Participants’ submissions
were evaluated using lexical and semantic similarity measures, along with readability scores. In total,
four teams participated in Subtask 1 and five teams in Subtask 2. The top-performing systems in both
subtasks used prompting techniques with instruction-tuned LLMs. Team HULAT-UC3M achieved the
best results in Subtask 1, reaching a cosine similarity of 0.75 with a method based on prompting a
LoRA-adapted RigoChat-7B-v2 model fine-tuned on the provided dataset. Team NIL-UCM led Subtask
2 with a cosine similarity of 0.72, using a similar approach based on Mistral-7B-Instruct-v0.3.
        </p>
        <p>
          PROFE [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Language Proficiency Evaluation, is designed to assess the reading comprehension
abilities of NLP systems, focusing on their linguistic competence under the same conditions used to
evaluate humans. The task includes three subtasks: (i) Multiple choice, where systems must select the
correct answer from a list of options for each question, (ii) Matching, where systems must pair texts
from two different sets, similar to natural language inference and semantic textual similarity tasks,
and (iii) Fill-in-the-gap, where systems must identify the correct position of text fragments within a
masked passage. All three subtasks were evaluated using accuracy as the metric. For this task, the
organizers created an evaluation dataset based on Spanish proficiency tests developed over the years
by the Instituto Cervantes. A total of 19 teams registered for the task, with 8 submitting runs. The
multiple-choice subtask received 24 submissions, the matching subtask 11, and the fill-in-the-gap
subtask 9. Team Vicomtech achieved the highest accuracy across all three subtasks (above 93%) using
ensembles of open-source large language models, such as Qwen-2.5-14B and Phi-4-14B, operating
in a zero-shot setup.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Harmful and Inclusive Content</title>
        <p>
          DIMEMEX [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], Detection of Inappropriate Memes from Mexico, is the second edition of DIMEMEX at
IberLEF, continuing its mission to advance research on automatic detection of inappropriate content in
memes, with a particular focus on Mexican Spanish. This year’s edition featured three subtasks: (i)
Three-way classification to determine whether a meme contains hate speech, inappropriate content,
or neither, (ii) Fine-grained classification, where systems must assign memes to specific categories of
hate speech, and (iii) LLM-focused three-way classification, same as subtask 1, but restricted to using
LLMs only. The DIMEMEX 2025 dataset is a refined version of the previous year’s, consisting of
approximately 3,000 memes manually annotated for abusive content. These memes were collected from
public Facebook groups in Mexico known for sharing such material. All subtasks were evaluated using
macro-averaged recall, precision, and F1 score. Ten teams participated in Subtask 1, while Subtasks
2 and 3 each saw three participating teams. Team HARGP-BETO achieved the best performance in
Subtask 1 (macro-F1 score of 0.58), using a text-only gated unit model that fuses local and global
attention mechanisms based on OCR and textual descriptions. Team UC-UCO-CICESE led Subtask 2
(macro-F1 score of 0.37) with a system that combined text and image modalities through a late fusion
of BETO (for text) and ViT (for images).
        </p>
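        <p>Macro-averaged F1, used here and in several other tasks, gives every class the same weight regardless of its frequency. The minimal Python sketch below shows the computation on toy data; the label strings and function names are hypothetical, chosen only to mirror the three-way setting, and the sketch is not the organizers' evaluation script.</p>
        <preformat>
```python
from typing import Sequence


def per_class_f1(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """F1 for one class, treating that class as positive and all others as negative."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(gold) | set(predicted))
    return sum(per_class_f1(gold, predicted, label) for label in labels) / len(labels)


# Toy, imbalanced data: a single error on the rare "hate" class.
gold = ["hate", "hate", "inappropriate"] + ["neither"] * 5
predicted = ["neither", "hate", "inappropriate"] + ["neither"] * 5

score = macro_f1(gold, predicted)
accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
```
        </preformat>
        <p>In this toy case the macro-F1 falls below the accuracy, because the single error hits a rare class whose F1 counts as much as that of the majority class.</p>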
        <p>
          HOMO-LAT25 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Human-centric polarity detection in Online Messages Oriented to the Latin
American-speaking LGBTQ+ population, continues the HOMO-MEX shared tasks from 2023 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and
2024 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], extending the study of polarity detection toward LGBTQ+ content in online messages to
Spanish dialects in Latin America. This year’s edition focused on Reddit posts written in Spanish from
19 Latin American countries, annotated with positive, negative or neutral polarity toward specific
LGBTQ+ identity keywords. The task comprised two tracks: (i) Track 1 evaluated polarity detection
when training and test data came from the same Spanish dialect (Argentina, Chile, Colombia, and
Mexico); and (ii) Track 2 evaluated cross-dialect generalization by testing on countries unseen during
training. Thirty teams registered for the task, of which 7 submitted valid results and 6 presented
working notes. All participating teams used Transformer-based models; two also incorporated
traditional machine learning, and two leveraged large language models (LLMs). The best results were
obtained by the PLD team, achieving a macro F1-score of 52.96 in Track 1 and 50.86 in Track 2. Their
approach combined translating all input texts into English to take advantage of highly performing
pre-trained models and mitigate dialectal variation, with a new “Context Engine” that retrieves
semantically similar examples from each sentiment class (positive, negative, neutral) to enrich the
model’s inference process and improve generalization, especially in a challenging cross-dialect setting.
        </p>
        <p>
          MentalRiskES [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], Early detection of mental disorders risk in Spanish Third edition - Detecting
Addiction, is the third edition of this task on the early detection of mental disorder risk in Spanish, this
time with a particular focus on detecting gambling disorders. Two subtasks were presented: detection
of gambling disorders risk, and classification considering different types of addiction. The task presented a
dataset of social media texts annotated with information about the risk of different types of gambling
disorders (e.g. betting, online gaming, trading/crypto, lootboxes). A total of 13 teams participated
in the task, submitting at least one result, and the submissions were evaluated according not only to overall
classification performance but also to early prediction and efficiency metrics, with the aim of emphasizing
the need for sustainable practices in NLP. The best Macro-F1 for task 1 was achieved by team UNSL
(0.567), while for task 2 the best Macro-F1 was by team MCDI (0.589). Team PLN_PPM_ISB obtained
the best results related to early prediction.
        </p>
        <p>
          MiSonGyny [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], Misogyny Speech Detection in Spanish Language Song Lyrics, focused on the
automatic detection and classification of misogynistic content in Spanish song lyrics. It was designed
to address the underexplored presence of symbolic violence and hate speech in musical texts, which
often include subtle and metaphorical expressions of misogyny. The task comprised two subtasks: (i)
Subtask 1, binary classification of song verses as Misogynistic (M) or Non-Misogynistic (NM); and (ii)
Subtask 2, fine-grained classification of misogynistic content into Sexualization (S), Violence (V), Hate
(H), or Not Related (NR). A total of 13 teams participated in Subtask 1 and 9 in Subtask 2, out of which 9
submitted working notes. Most approaches relied on transformer-based architectures, complemented
by traditional machine learning, data augmentation and, in some cases, LLMs or hierarchical pipelines.
The best-performing team in both subtasks was HULAT UC3M, achieving an F1-score of 0.8811 in
Subtask 1 and 0.5895 in Subtask 2. This team developed a comprehensive pipeline that combines data
augmentation, transformer-based encoders, and traditional machine learning methods. In addition, it
addressed class imbalance through minority class oversampling using back-translation and AEDA
techniques.
        </p>
        <p>
          PolyHope [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Optimism, Expectation or Sarcasm?, is the continuation of the HOPE tasks, which have had
several editions, two of them in previous IberLEF workshops, all related to hope speech detection
and classification (messages that express optimism, encouragement, or the desire for a better future). In
this edition, two subtasks were proposed, with variants in English and Spanish: first, binary hope speech
detection, and second, multiclass categorization as generalized hope, realistic hope, unrealistic hope, not
hope, or a novel sarcasm category meant to detect hopeful language that is used in a misleading way.
The organizers presented a dataset of 30,000 tweets labeled for hope speech, a third of them in English and
the rest in Spanish. A total of 31 teams participated in the competition, and 13 of those teams had their
papers accepted. The top system for the binary subtask in Spanish was submitted by team teddymas,
with 0.852 F1, and the best system for English was by michaelibrahim, with 0.871 F1. In the
multiclass categorization subtask, the top performance for Spanish was 0.742 macro-F1 by lephuquy,
and for English 0.755 by supachoke. Many teams mentioned common challenges, including
data imbalance, language mixing, and cultural differences in how people express emotions.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Content Curation and Generation</title>
        <p>PastReader [13], Transcribing Texts from the Past, focuses on the automatic transcription of digitized
Spanish historical newspapers. This task includes two subtasks: (i) Error Correction, where participants
receive the output of an OCR system and must generate clean, corrected versions of the extracted
texts; and (ii) End-to-end Extraction, which explores full pipeline approaches that take scanned pages as
input and produce curated transcriptions as output. The corpus used in this task consists of historical
newspaper publications from the public domain, digitized by the National Library of Spain (BNE) and
available through the Hemeroteca Digital. It includes 298 press titles, 88,748 issues, and a total of
8,302,407 pages in PDF format. For the shared task, the organizers sampled 121,295 documents and
transcriptions, which were split into training, validation, and test sets in a 74-4-22 ratio. Evaluation
relied on standard text generation metrics such as Word Error Rate, (Normalized) Levenshtein Distance,
BLEU, and ROUGE, as well as sustainability metrics, including CO2 emissions. Only Subtask 2 received
participation, with three teams submitting systems. Team OCRTIST achieved the best performance
based on the primary ranking metric (Levenshtein distance of 53.30). Their system used Gemini
2.5 PRO in a standalone setup, relying on a single prompt to perform OCR directly from scanned images.</p>
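        <p>The character- and word-level error measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the character distance is normalized by the length of the longer string, and the word error rate is the word-level edit distance divided by the number of reference words. The exact normalization and scaling used by the PastReader organizers may differ.</p>
        <preformat>
```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference item
                current[j - 1] + 1,                   # insert a hypothesis item
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(reference: str, hypothesis: str) -> float:
    """Character-level distance divided by the longer string length (an assumed normalization)."""
    longest = max(len(reference), len(hypothesis))
    return edit_distance(reference, hypothesis) / longest if longest else 0.0


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words) if ref_words else 0.0


# Toy OCR example: a lost accent and one misread letter.
reference = "el periódico de ayer"
hypothesis = "el periodico de aver"
char_score = normalized_levenshtein(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
```
        </preformat>
        <p>The same two character errors yield a much higher word-level error than character-level error, which is one reason to report both kinds of metric for transcription tasks.</p>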
        <p>PRESTA [14], Preguntas y Respuestas sobre Tablas en Español - Questions and Answers about Tables
in Spanish, is a question-answering task focusing on answering questions about tabular data in Spanish.
Participants were given natural language questions that needed to be interpreted to extract data from
tabular sources. The dataset was a collection of 10 different sources totalling 31 thousand data rows,
and 300 question-answer pairs over that data were provided, with different expected answer types:
boolean, categorical, numeric, or list (either of categories or numbers). The data was split into 200
question-answer pairs for training and 100 for testing, and overall accuracy across all categories was the
main metric of the task. There were 7 participating systems, all of them obtaining better results than the
GPT-4o baseline. Overall, the ITU NLP and sonrobok4 teams obtained the best performance (87%
accuracy), although with slightly different results for the boolean and list types. Both systems use code
generation methods with different LLMs and different prompting strategies. One of the conclusions is
that, although current LLM technologies outperform traditional pipelines, using the largest models is not
the only way to go, as small open-source models can also give good results when properly used.</p>
        <p>TA1C [15], Te Ahorré Un Click, focuses on the detection and spoiling of clickbait in Spanish news,
particularly in tweets that link to a piece of news. The task consists of two subtasks: i) Clickbait Detection,
a binary classification task to determine whether a news teaser is clickbait based on the information
gap theory where headlines deliberately omit key information to provoke curiosity; and ii) Clickbait
Spoiling, a generative task that requires producing a concise Spanish text that fills the information gap
created by the clickbait. The dataset provided includes 4,200 manually annotated Spanish tweets for the
clickbait detection task and 500 human-written spoilers for the spoiling task, all collected from 18 media
outlets across 12 Spanish-speaking countries and international sources. A total of 27 teams registered for
the task, out of which 13 participated in the evaluation phase of the clickbait detection task and 3 teams
in the spoiling task. The best-performing team in the detection task, UmuTeam, achieved an F1-score of
0.8156 using an ensemble of fine-tuned transformer models, including MarIA, BERTIN, ALBETO, and the
decoder-only Gemma-2-2B-it with QLoRA fine-tuning. In the spoiling task, the top system in manual
evaluation, submitted by CogniCIC, obtained a score of 3.88 out of 5 in Accuracy/Completeness, the
highest among all participants, using a few-shot prompting approach with the Claude Sonnet 4 LLM.</p>
        <p>One additional task had originally been accepted for IberLEF 2025: MIMIC, Multi-Modal AI Content
Detection. The goal of this task was to determine whether (image, text) pairs, consisting of images,
captions, and contexts from English and Spanish Wikipedia, were fully or partially generated by AI.
However, due to funding limitations, it was not possible to generate the planned number of instances
using Large Multimodal Models to build a high-quality dataset. As a result, the organizers decided to
cancel the task.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Sentiment and Figurative Analysis</title>
        <p>ASQP-PT [16], Aspect Sentiment Quad Prediction in Portuguese, is a shared task about aspect-based
sentiment analysis in Portuguese. It is a continuation of previous IberLEF tasks ABSAPT 2022 and
2024, and this year it consisted of four subtasks: aspect term extraction, opinion term extraction, aspect
category detection, and aspect sentiment quadruple prediction (ASQP). A corpus of 1236 Portuguese
TripAdvisor reviews about hotels in four cities was presented to the participants, with 5749 annotations of
(Category, Aspect, Opinion, Polarity) quadruples. Out of the two teams that participated in the task, only
one attempted all four subtasks, while the other only submitted results for the aspect term
extraction subtask. The baselines for the first three subtasks were not beaten by any team, but the ABCD
team outperformed the baseline for the Aspect-Sentiment Quad Prediction subtask, the most complete
of the subtasks, obtaining an F1 of 0.46.</p>
        <p>REST-MEX 2025 [17], Researching on Evaluating Sentiment and Textual instances selection for
Mexican magical towns, is the fourth edition of the REST-MEX shared task, aimed at advancing
natural language processing for tourism in the Mexican context, with a focus on sentiment analysis
and classification of user-generated texts about Mexico’s Magical Towns (Pueblos Mágicos). The
task is structured into three subtasks: i) polarity prediction, a fine-grained classification into five
levels of polarity (from 1 to 5); ii) service type classification, identifying whether the review refers to
a hotel, restaurant, or tourist attraction; and iii) geographical identification of the visited location, a
multiclass classification task to determine which of the 40 predefined Magical Towns is being reviewed.
The corpus consists of 297,217 TripAdvisor reviews shared by tourists who visited representative
destinations in Mexico. A total of 32 teams participated in the shared task. The best performing
team in all three subtasks was UDENAR, which obtained a macro F1 score of 0.64 in prediction of
polarity, 0.99 in classification of type of service, and 0.69 in geographical identification. Their approach
transformed each multiclass task into independent binary classification problems (one per class), using
centroid-based sampling on balanced datasets to address class imbalance, particularly improving
performance on minority classes like negative polarity. They combined fine-tuned transformer models
with knowledge transfer techniques to enhance generalization and robustness across the three tasks.</p>
        <p>SatiSPeech [18], Multimodal Audio-Text Satire Classification in Spanish, was a novel task proposed
for the first time this year, about the automatic recognition of satire in Spanish YouTube videos. The
task was divided into two subtasks: one considering only the video transcription (text-only),
while the other considered the transcription and the acoustic information in a multimodal setting
(text+audio). The organizers presented a dataset of 8000 audio segments no longer than 25 seconds each, together
with their transcriptions. The fragments were obtained from popular satirical shows on YouTube. Each
segment was labeled as satirical or non-satirical. Eleven teams participated in both subtasks of the
challenge, and almost all of them outperformed the baselines. The best system for the text-only task
was from UPV-ELiRF, obtaining 85.6 F1; and the best system for the multimodal text+audio task was
from UMU-Ev obtaining 88.3 F1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Aggregated Analysis of IberLEF 2025 Tasks</title>
      <sec id="sec-3-1">
        <title>3.1. Tasks characterization</title>
        <p>Looking at the distribution of subtasks by abstract task type, shown in Figure 2, we can
see that the most common task type is once again multi-class classification with 15 subtasks, followed
by binary classification with 7 subtasks. Interestingly, this year the third most common type of subtask
involves generation (5 subtasks plus one QA subtask, which is also related to generation). This is in line with a
larger trend in the NLP community as a whole, which is becoming more interested in generative tasks
due to the availability of new kinds of generative language models.</p>
        <p>The distribution of the main evaluation metrics used this year is shown in Figure 3. As in previous
years, F1 remains predominant, being used in 10 tasks, with 6 of them also incorporating Precision and
Recall. Many tasks also include the Accuracy metric, in two cases as the only metric because of the
nature of the task. Also note that metrics that correspond mainly to generative tasks, such as BLEU [19]
and ROUGE [20], are starting to appear in different tasks as well, together with other metrics like
BERTScore [21], Levenshtein Distance, Cosine Similarity, and other metrics that are only suitable for some
kinds of tasks.</p>
        <p>IberLEF 2025 incorporated eight tasks that were not organized before or were considered novel enough
to be counted as new (57%), while six of the tasks were new editions of previously run competitions.
This strikes a good balance between novelty and stability, as successful tasks from previous years
such as DIMEMEX, HOPE, MentalRiskES, and REST-MEX had new editions, while the campaign also
introduced new challenges for Iberian language processing, which attracted more researchers.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets and results</title>
        <p>In terms of dataset sizes and annotation efforts<xref ref-type="fn" rid="fn1">1</xref>, making fair comparisons is challenging due to
the diversity of data sources, variations in text lengths, and the wide range of annotation difficulties.
In almost all cases (13 out of 14 tasks), the datasets were fully manually annotated. Only one
dataset combined synthetic and manually annotated data (PolyHope). Three datasets contain more
than 15,000 instances (HOMO-LAT, PolyHope, and REST-MEX), one has between 10,000 and 15,000
instances (PastReader), two fall between 5,000 and 10,000 instances (SatiSPeech and ASQP-PT), and the
remaining datasets contain 1,000 to 5,000 samples, most of which are manually annotated. Regarding
annotation reliability, inter-annotator agreement serves as a useful indicator and is reported for 6 out
of 14 tasks. Fleiss’ Kappa and Cohen’s Kappa are the only statistical measures used to assess agreement
reliability, each appearing in 50% of the cases. Among these, one task shows high agreement, three show
moderate-high agreement, and two show low to moderate-low agreement<xref ref-type="fn" rid="fn2">2</xref>.
<fn id="fn1"><label>1</label><p>Overall, the annotation efforts in IberLEF 2025 continue to make a significant contribution to expanding test collections for
Spanish and, to a lesser extent, other languages. Once again, IberLEF has been conducted without specific funding sources,
relying instead on the resources obtained individually by the teams organizing and participating in the tasks. Implementing
a centralized funding model could undoubtedly help achieve larger and more comprehensive annotations across IberLEF as a
whole.</p></fn>
<fn id="fn2"><label>2</label><p>Generally, moderate agreement may reflect the complexity of the task rather than deficiencies in the annotation guidelines.</p></fn></p>
        <p>Regarding progress relative to the state of the art, it remains challenging to draw overarching
conclusions for the entire IberLEF effort due to the varied approaches used for establishing task baselines.
Figure 5 shows a pairwise comparison between the best system and the best baseline for each task
where at least one baseline is provided and one result was submitted, using the official ranking metric
for each task. To avoid confusion, the chart is limited to tasks where the official metric ranges from 0
(worst quality) to 1 (perfect output). One task (CLEARS) did not provide any baseline, and almost every
other task included baselines based on pre-trained Transformers, either encoder (29%), encoder-decoder
(29%), or decoder-only (14%) models, depending on the nature of the task. Majority baselines (7%)
and classical machine learning, including SVM, Random Forest, and Logistic Regression, were also used
in some tasks (29%). In the subtasks that included baselines and had at least one submission, the best
system outperformed the baseline by more than 5% in 18 cases (60% of the total), while the systems could
not beat the baseline in only 4 cases. Examining the results, only 5 subtasks had the top-performing
system scoring higher than 0.9, which suggests that there is still room for improvement in most cases.
None of the subtasks presented a baseline that performed above this 0.9 level, showing that either there
was a preference for weaker baselines, or the tasks were indeed designed to be more challenging.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Participation</title>
        <p>Despite IberLEF 2025 not being a funded initiative, participation was impressive, with a significant
portion of current research groups interested in NLP for Spanish and other Iberian languages either
organizing or participating in one or more tasks. In total, 445 researchers from 196 research groups
across 21 countries in Europe, Asia, Africa, and the Americas were involved in IberLEF tasks<xref ref-type="fn" rid="fn3">3</xref>. Compared
to IberLEF 2024, the number of participating researchers increased by 54% and the number of research
groups by 46%, highlighting the growing interest in NLP evaluation campaigns on Iberian languages.
<fn id="fn3"><label>3</label><p>Statistics were compiled from the submitted working notes, which implies two things: i) Some groups and researchers may
be counted more than once if they participated in multiple tasks; and ii) actual participation might be higher because some
teams submitted runs but did not submit their working notes, thus not being counted in the statistics.</p></fn></p>
        <p>Figure 6 shows the distribution of research groups by country. Interestingly, Mexican research
groups now have the highest participation, with 71 groups, a shift from previous years when Spanish
institutions were dominant. Mexico is followed by Spain with 62 groups, Cuba with 11, the USA with 9,
and Vietnam with 7.</p>
        <p>Figure 7 illustrates the distribution of researchers (listed as authors in the working notes) by country.
The top five countries (Mexico, Spain, Cuba, USA, and Colombia) account for approximately 85% of
the participating researchers. As with the institution-level trends, there has been a shift compared
to the previous year, with Mexican researchers now dominating participation. The presence of
non-Spanish-speaking countries such as Vietnam, Nigeria, India, and China in the top ten highlights two key
points: i) Spanish captures interest in the broader NLP community; and ii) current NLP technologies
allow researchers to work with different languages without needing language-specific tools, beyond
pre-trained language models available to the research community.</p>
        <p>Figure 8 shows the number of teams that submitted at least one run to each task.
Participation ranges from 2 to 34 teams. The task with the highest
participation is REST-MEX, which may have contributed to the increase in Mexican participation
and highlights the interest in sentiment analysis for this Spanish variety. The distribution of research
groups per task is shown in Figure 9. In this case, participation ranges from 2 to 52 groups<sup>4</sup>.
REST-MEX accounts for the largest share of research group participation (27% of the total), and
the top three tasks (REST-MEX, PolyHope, and SatiSpeech) together involve 50% of all research groups.</p>
        <p>As with other evaluation initiatives, participation appears to be influenced not only by the intrinsic
interest of the task but also by the cost of entry. Classification tasks, which are among the simplest
machine learning tasks and are well served by plug-and-play software packages, typically attract more
participants than tasks that require more complex approaches and creative algorithmic solutions.</p>
        <p>Finally, we analyzed the evolution of participation in IberLEF over the years since its
first edition. As researchers might be part of different teams, participate in different tasks, and present
different system description papers, we decided to measure the number of unique participants as the
number of unique authors that took part in at least one system description paper in the proceedings of
each year. Please note that this estimation is not perfect, as authors sometimes appear under different
names in different papers, and two or more unrelated authors may share
the same name. Figure 10 presents this estimation of participation since 2019, showing that this year
has been the edition with the largest number of unique participants, slightly ahead of the 2023 edition.</p>
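        <p>The unique-author estimate described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script used for Figure 10: it assumes the proceedings are available as a mapping from year to a list of papers, each paper being a list of author names, and it merges names only by lowercasing, stripping accents, and collapsing whitespace. The helper names normalize_author and unique_participants and the toy author names are hypothetical.</p>
        <preformat><![CDATA[
```python
import unicodedata
from collections import defaultdict


def normalize_author(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace in an author name."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def unique_participants(papers_by_year):
    """Map each year to the number of distinct normalized author names."""
    authors = defaultdict(set)
    for year, papers in papers_by_year.items():
        for author_list in papers:
            authors[year].update(normalize_author(a) for a in author_list)
    return {year: len(names) for year, names in sorted(authors.items())}


# Hypothetical toy input: each paper is a list of author names.
toy = {
    2024: [["Ana Pérez", "Luis Gómez"], ["ana perez"]],
    2025: [["Ana Pérez", "Luis Gómez"], ["Mai Tran"], ["Luis  Gomez", "Mai Tran"]],
}
counts = unique_participants(toy)
print(counts)  # {2024: 2, 2025: 3}
```
]]></preformat>
        <p>Normalization of this kind only removes spelling variation in accents, case, and spacing. It does not resolve abbreviated or reordered names, and it cannot separate unrelated authors who share a name, which are the two sources of error noted above.</p>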
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this seventh edition, IberLEF has once again proven to be a significant collective effort to advance
Natural Language Processing in Spanish and other Iberian languages. This year’s event included 14
main tasks and involved 445 researchers from 196 research groups across 21 countries in Europe, Asia,
Africa, and the Americas. Compared to IberLEF 2024, the number of participating researchers increased
by 54% and the number of research groups by 46%, highlighting the growing interest in NLP evaluation
campaigns on Iberian languages.</p>
      <p>IberLEF 2025 was one of the most diverse editions in terms of application domains, data sources, and
task types, with growing interest in multimodal scenarios and text generation tasks that leverage
recent advances in language modeling. It advanced the field in several areas, including language
comprehension, harmful and inclusive content, content curation and generation, and sentiment and
figurative analysis.</p>
      <p><sup>4</sup> A team is composed of researchers from the same or different research groups and entities who collaborate to participate
in a shared task. In contrast, a research group typically consists of researchers from the same faculty who specialize in a
particular subject and work together officially on that topic, not solely for participating in a shared task.</p>
      <p>In the realm of Natural Language Processing, where Large Language Models have become the go-to
solutions, defining research challenges and creating robust evaluation methods and high-quality test
collections are crucial for success. These elements enable iterative testing and refinement. IberLEF is
playing an important role in advancing these efforts and moving the field forward.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The research work conducted by Salud María Jiménez-Zafra is part of the grant RYC2023-044481-I,
supported by MICIU/AEI/10.13039/501100011033 and by ESF+. This work is funded by the Ministerio
para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y
Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Desarrollo Modelos
ALIA. This work has also been partially supported by Project CONSENSO (PID2021-122263OB-C21),
Project MODERATES (TED2021-130145B-I00) and Project SocialTox (PDC2022-133146-C21) funded by
MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and Project
FedDAP (PID2020-116118GA-I00) and Project Trust-ReDaS (PID2020-119478GB-I00) supported by
MICINN/AEI/10.13039/501100011033.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
  </back>
</article>