Overview of IberLEF 2022: Natural Language Processing Challenges for Spanish and other Iberian Languages

Julio Gonzalo1, Manuel Montes-y-Gómez2 and Francisco Rangel3
1 nlp.uned.es, ETSI Informática de la UNED, Madrid, Spain
2 National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
3 Symanto Research, Valencia, Spain

IberLEF 2022, September 2022, A Coruña, Spain
julio@lsi.uned.es (J. Gonzalo); mmontesg@inaoep.mx (M. Montes-y-Gómez); kico.rangel@gmail.com (F. Rangel)
https://nlp.uned.es/ (J. Gonzalo); https://ccc.inaoep.mx/~mmontesg/ (M. Montes-y-Gómez); https://kicorangel.com/ (F. Rangel)
ORCID: 0000-0002-5341-9337 (J. Gonzalo); 0000-0002-7601-501X (M. Montes-y-Gómez); 0000-0002-6583-3682 (F. Rangel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
IberLEF is a comparative evaluation campaign for Natural Language Processing systems in Spanish and other Iberian languages. Its goal is to encourage the research community to organize competitive text processing, understanding and generation tasks in order to define new research challenges and set new state-of-the-art results in those languages. This paper summarizes the evaluation activities carried out in IberLEF 2022, which included 10 tasks and 19 subtasks dealing with sentiment, stance and opinion analysis, detection and categorization of harmful content, Information Extraction, Paraphrase Identification, and Question Answering. Overall, IberLEF activities were a remarkable collective effort involving 310 researchers from 24 countries in Europe, Asia, Africa, Australia and the Americas.

Keywords
Natural Language Processing, Artificial Intelligence, Evaluation, Evaluation Challenges

1. Introduction

IberLEF is a comparative evaluation campaign for Natural Language Processing systems in Spanish and other Iberian languages. Its goal is to encourage the research community to organize competitive text processing, understanding and generation tasks in order to define new research challenges and set new state-of-the-art results in those languages. This paper summarizes the evaluation activities carried out in IberLEF 2022, which included ten tasks dealing with sentiment, stance and opinion analysis, detection and categorization of harmful content, Information Extraction and Answer Extraction, and Paraphrase Identification. Overall, IberLEF activities were a remarkable collective effort involving 310 researchers from 24 countries in Europe, Asia, Africa, Australia and the Americas. Papers with system descriptions are included in this IberLEF 2022 Proceedings volume, and papers with task overviews are published in the journal Procesamiento del Lenguaje Natural, vol. 69 (September 2022 issue).

In this paper we summarize the activities carried out in IberLEF 2022, extracting some aggregated figures for a better understanding of this collective effort.

2. IberLEF 2022 Tasks

These are the ten tasks successfully run in 2022, grouped thematically:

2.1. Sentiment, Stance and Opinions

ABSAPT [1] is an aspect-based sentiment analysis task in Portuguese, which used TripAdvisor reviews as target texts. It included (i) a subtask on aspect term extraction, devoted to the identification of aspects in reviews, and (ii) a subtask on sentiment orientation (polarity) identification for a single aspect mentioned in the review.
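As a purely illustrative sketch of the two ABSAPT subtasks (made-up review text and hypothetical field names, not the official task data format):

# Illustrative only: a made-up TripAdvisor-style review in Portuguese and the kind of
# output expected from the two ABSAPT subtasks (field names are hypothetical).
review = "O quarto era limpo, mas o atendimento foi demorado."

# Subtask (i) - aspect term extraction: identify the aspects mentioned in the review.
aspects = ["quarto", "atendimento"]

# Subtask (ii) - sentiment orientation: polarity of the review towards one given aspect.
polarity_for_aspect = {"quarto": "positive", "atendimento": "negative"}

print(aspects)
print(polarity_for_aspect)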
PoliticES [2] is an author profiling task on Twitter accounts of Spanish politicians and political journalists, where systems must determine the gender, profession and political spectrum of each profile.

Rest-Mex [3] is a task that works with Mexican Tourist Texts and addresses three problems: (i) a recommendation subtask where, given a TripAdvisor user and a Mexican tourist destination, the system must predict the degree of satisfaction (1-5) that the user will have when visiting the destination; (ii) a sentiment analysis subtask where the system must predict the polarity (1-5) of a given TripAdvisor review, and also the type of destination (hotel, restaurant, attraction); and (iii) an epidemiological semaphore prediction subtask where, given covid-related news from a Mexican region, systems must predict the semaphore color 0, 2, 4 and 8 weeks into the future.

2.2. Harmful Content

DA-VINCIS [4] is a task where systems must detect and classify tweets (in Spanish) that report violent incidents. It included two subtasks: the first is a binary classification task in which participants had to determine whether a tweet is associated with a violent incident or not, and the second is a multi-label classification task in which the category of the violent incident must be identified.

DETESTS [5] is a task where systems must detect and classify racial stereotypes in comments on online news articles written in Spanish. Subtask 1 is stereotype detection: systems must identify whether a comment contains at least one stereotype or not. Manual annotations are handled following the learning with disagreement paradigm, where there is not necessarily a single correct label for every example in the dataset. Subtask 2 is a multi-label hierarchical classification problem where systems must detect and classify stereotypes according to this set of categories: victims of xenophobia, suffering victims, economic resources, migration control, cultural and religious differences, people who take “benefits” of our social policy, problems of public health, security threat, dehumanization, and other.

EXIST [6] is a task where systems must detect and classify sexist content in Spanish and English tweets and gabs. Task 1 is about identification of sexism-related content: a tweet is positive if it is sexist itself, describes a sexist situation or criticizes a sexist behavior. Task 2 is about sexism categorization: once a message has been classified as sexist, systems must classify it into the following categories: ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence.

2.3. Information Extraction and Paraphrase Identification

LivingNER [7] is a task on named entity recognition, normalization and classification of species, pathogens and food. Source texts are medical documents (case reports) annotated by medical experts using the NCBI taxonomy. In Task 1, the LivingNER-Species NER track, systems must find all mentions of (human or non-human) species, such as “hepatitis B”, “virus herpes simple” or “paciente”. In Task 2, the LivingNER-Species Norm track, systems have to retrieve all species mentions together with their corresponding NCBI taxonomy concept identifiers. In Task 3, the LivingNER-Clinical Impact track, for each text systems must (i) detect whether the text contains information relevant to real-world clinical use cases of high impact; (ii) retrieve the list of NCBI taxonomy identifiers that support such detections; and (iii) categorize the documents according to the following information axes: pets and farm animals, animals causing injuries, food species, and nosocomial entities.
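As a rough, hand-made illustration of the first two LivingNER subtasks (the structures below are hypothetical and are not the official submission format; only the Homo sapiens identifier is an actual NCBI taxonomy id, the other is left as a placeholder):

# Illustrative only: species mentions in a made-up clinical sentence and their links
# to the NCBI taxonomy. Structures and the unresolved identifier are placeholders.
text = "Paciente con infección por virus herpes simple."

# Task 1 (LivingNER-Species NER): locate mentions of human and non-human species.
mentions = [
    {"span": (0, 8), "text": "Paciente"},               # human species mention
    {"span": (27, 46), "text": "virus herpes simple"},  # non-human species mention
]

# Task 2 (LivingNER-Species Norm): attach an NCBI taxonomy concept id to each mention.
normalized = [
    {"text": "Paciente", "ncbi_id": "9606"},            # Homo sapiens
    {"text": "virus herpes simple", "ncbi_id": "..."},  # placeholder, left unresolved
]

print(mentions)
print(normalized)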
PAR-MEX [8] is a paraphrase identification task: systems must perform sentence-level paraphrase identification in Mexican Spanish food-related texts, which have been manually generated from an original set of texts using literary creation, low paraphrase, high paraphrase and no-paraphrase methods.

2.4. Question Answering and Machine Reading

QuALES [9] is a Question Answering task where answers must be extracted from news articles written in Spanish. The input for systems is a question and a news article, and the system must find the shortest spans of text in the article (if any) that answer the question. Most questions (but not all) in the dataset deal with covid-19 issues.

ReCoRES [10] is a Reading Comprehension and Reasoning Explanation task for Spanish. Given a passage and a question about its content, reading comprehension systems must (1) select the correct answer from a given set of candidates (multiple-choice task); and (2) provide an explanation of why a given candidate was chosen as the answer (reasoning explanation). Texts in this dataset are based on university entrance examinations, and explanations are evaluated both with automatic similarity estimates with respect to manual reference explanations and with manual assessments of their accuracy, fluency and readability.

3. Aggregated Analysis of IberLEF 2022 Tasks

3.1. Tasks characterization

In terms of languages, the distribution per task (including subtasks) is shown in Figure 1. Spanish is, once again, the central language of IberLEF (17 tasks), with Portuguese and English in secondary roles (2 tasks each). The main Spanish variants considered are those from Spain, Mexico, Uruguay and Perú.

Figure 1: Distribution of languages in IberLEF 2022 tasks.

In terms of abstract task types, the distribution of tasks can be seen in Figure 2. Out of a total of 19 tasks (each subtask is counted as a task here), the most popular type of task is multi-class classification (7 tasks), followed by sequence tagging and binary classification (4 each). There are also two ordinal classification tasks, two regression tasks, two KB linking tasks (one on entity linking and another on taxonomy linking), two answer extraction tasks (one is multiple choice, which is also counted as classification, and the other is span selection, which we also count as sequence tagging) and one text generation task. Interestingly, in 2022 there are four complex tasks which involve solving more than one core task at once (for instance, sequence tagging plus entity linking). Compared with 2021, the trend is towards a less numerous (19 vs 29) but more diverse and more complex set of tasks, where binary classification is no longer the most popular type of task and several tasks involve solving several NLP problems at the same time.

Figure 2: Distribution of IberLEF 2022 tasks per abstract task type.

In terms of evaluation metrics, the distribution can be seen in Figure 3, which depicts only the main metrics used to rank systems in each task. As in previous years, there is a remarkable predominance of F1 (11 tasks), even if it does not perfectly match the problem considered. Accuracy is used by three tasks, MAE in two regression tasks, and six other metrics are used in only one task each. Some of them correspond to the complex tasks which embed subtasks (e.g. the mean of F1 scores over several subtasks is used on one occasion, the mean of inverse MAE and F1 scores for different subtasks on another, and the average of F1 measures at different points in the future, weighted according to the time distance, on a third). The rest are Average Exact Match [11] for a QA task, BERTScore [12] to compare system and gold standard explanations, and ICM [13] for a hierarchical classification task.

Overall, in IberLEF as in other competitive NLP evaluation challenges, we might still be relying too much on averages to combine different quality metrics: it has been common this year to combine F1 measures (which are harmonic averages) with other measures using some other form of averaging. This hides the actual behaviour of systems and usually gives no clues on how to improve them. Also, again in 2022 the choice of metrics is, in general, barely justified, particularly in terms of how the system output is going to be used in realistic usage scenarios.
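As a minimal, purely illustrative sketch of this concern (the scores below are invented and do not correspond to any IberLEF task or official metric), an arithmetic mean of an F1 score and another normalized score can assign the same combined value to systems with very different behaviour:

# Illustrative only: F1 is the harmonic mean of precision and recall, so averaging it
# arithmetically with other normalized scores (e.g. an inverse MAE) mixes two different
# kinds of averaging and can hide how a system actually behaves.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Two hypothetical systems: A has high precision and low recall, B is balanced.
sys_a = {"f1": f1(0.9, 0.3), "inv_mae": 0.80}
sys_b = {"f1": f1(0.5, 0.5), "inv_mae": 0.75}

for name, scores in (("A", sys_a), ("B", sys_b)):
    combined = (scores["f1"] + scores["inv_mae"]) / 2  # arithmetic mean of heterogeneous scores
    print(name, round(scores["f1"], 3), round(combined, 3))

# Both systems obtain the same combined score (0.625) despite very different
# precision/recall trade-offs, which is exactly the information the average hides.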
Finally, in terms of novelty/stability, IberLEF 2022 has brought many new problems, with seven out of the 10 primary tasks being new this year. Only Rest-Mex, EXIST and DETESTS had also been run in 2021.

Figure 3: Distribution of official evaluation metrics in IberLEF 2022 tasks.

3.2. Datasets and results

In terms of types of textual sources, Figure 4 shows how they are used in IberLEF 2022 tasks. There is more diversity than in previous years, with Twitter being less dominant: TripAdvisor reviews were used in 5 tasks, Twitter in 4, clinical cases in 3 subtasks (all belonging to the same task), exams, news comments and Gab posts in two subtasks each, and finally news and gastronomy texts in one task each.

In terms of dataset sizes and annotation effort, it is difficult to establish fair comparisons because of the diversity of text sizes and the wide variance in annotation difficulty. In any case, in the majority of cases (14 tasks) the manually annotated datasets were below 6,000 instances. Two other tasks provided annotated collections comprising between 10,000 and 15,000 instances, and one task provided over 40,000 annotated instances.

As for the reliability of the annotations, one useful indicator is inter-annotator agreement, which is reported in 9 out of 19 tasks. In the tasks where it is reported, annotator agreement is high in three cases and mid-low in another six. In general, mid-low agreement indicates the complexity of the task rather than poor annotation guidelines.

Figure 4: Types of textual sources in IberLEF 2022 tasks.

Overall, the annotation effort in IberLEF 2022 continues to be a remarkable contribution to enlarging test collections for Spanish (and, less prominently, other languages). Once more, IberLEF has been carried out without specific funding sources (other than those obtained individually by the teams organizing and participating in the tasks). A centralized funding scheme could certainly help reach larger and better annotations in IberLEF as a whole.
In terms of progress with respect to the state of the art, it is, as usual, difficult to extract aggregated conclusions for the whole IberLEF effort, in particular given the diversity of approaches for providing task baselines: in five tasks, no baseline was provided. In three, only a trivial baseline was included in the comparisons (e.g. majority class or random baselines in classification). Four tasks used SVM as a baseline, and five used some variant of transformers (BETO on two occasions, BERT on another two and T5 on one). Only two used other types of baselines. In the tasks that used baselines, the baseline was beaten (by a margin larger than 5%) by the best system in eight cases. In two cases, the difference was below 5% (one in favour of the best system, the other in favour of the baseline), and in the last two tasks, the baseline was better than any system. This is an indication that at least some of the tasks remain challenging for current systems.

In Figure 5 we display a pairwise comparison between the best system and the best baseline for each of the tasks where at least one baseline is provided, with respect to the official ranking metric used in each task. To avoid confusion, we have restricted the chart to tasks where the official metric varies between 0 (worst quality) and 1 (perfect output).

Figure 5: Performance of best systems versus baselines in IberLEF 2022 tasks. Only tasks with official evaluation metrics in the range [0-1] that include at least a baseline system are included in this graph.
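As a purely illustrative sketch of what a trivial (majority-class) baseline means in this context (toy data, not taken from any IberLEF task), such a baseline can be built with scikit-learn's DummyClassifier:

# Illustrative only: a majority-class ("trivial") baseline for a binary classification
# task, of the kind used as a reference point in several IberLEF tasks.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical toy data: the features are ignored by the dummy model, only labels matter.
y_train = [0, 0, 0, 1, 1]
X_train = [[0.0]] * len(y_train)   # placeholder features
y_test = [0, 1, 0, 1]
X_test = [[0.0]] * len(y_test)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
predictions = baseline.predict(X_test)          # always predicts the majority class (0)
print(f1_score(y_test, predictions, average="macro", zero_division=0))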
3.3. Participation

Given that IberLEF 2022 was not a funded initiative, participation has again been impressive, with a large fraction of the research groups currently interested in NLP for Spanish organizing and/or participating in one or more tasks. Overall, 310 researchers representing 169 research groups from 24 countries in Europe, Asia, Africa, Australia and the Americas were involved in IberLEF tasks1.

1 Statistics have been compiled from the submitted working notes, which means two things: i) some groups and researchers may be counted twice if they have participated in more than one task; ii) real participation may be higher due to the teams who submitted runs but did not submit their working notes afterwards, and thus have not been counted in the statistics.

Figure 6 shows the distribution of research groups per country. This year, Mexico has the largest representation, with 54 groups, followed by Spain with 44 groups (note that all figures reporting participation do not collapse duplicates: a group or a researcher participating in two tasks is counted twice).

Figure 6: Number of research groups participating in IberLEF 2022 tasks per country.

Figure 7 shows the distribution of researchers (appearing as authors in the working notes) per country. The numbers are broadly consistent with the distribution of groups per country, with some flips between the USA and Brazil, or among China, Chile and Vietnam. The top five, with Mexico, Spain, Brazil, the USA and China, represent roughly 80% of the researchers involved. The fact that there are two non-Spanish, non-Portuguese speaking countries in the top five, China and the USA, as well as others such as Vietnam or Canada in the top positions in terms of participation, indicates two things: first, that Spanish attracts the attention of the NLP community at large; and second, that current NLP technologies enable addressing different languages without language-specific machinery, other than the pre-trained language models made available to the research community.

The distribution of research groups per task is shown in Figure 8. Participation ranges between 3 and 36 groups per task. As in other evaluation initiatives, participation seems to be driven not only by the intrinsic interest of each task, but also by its cost of entry: as usual, classification tasks (the most basic machine learning setting, for which more plug-and-play software packages exist) receive more participation than tasks which require more elaborate approaches and more creativity to assemble algorithmic solutions.

Figure 7: Number of researchers participating in IberLEF 2022 tasks per country.

Figure 8: Distribution of participant groups per task in IberLEF 2022. The figure displays the number of groups that submitted at least one run.

4. Conclusions

In its fourth edition, IberLEF has again been a remarkable collective effort for the advancement of Natural Language Processing in Spanish and other Iberian languages, comprising 10 main tasks and involving 310 researchers from institutions in 24 countries in Europe, Asia, Africa, Australia and the Americas. IberLEF 2022 has been one of the most diverse editions in terms of types of tasks and application domains, and has contributed to advancing the field in the areas of sentiment, stance and opinion analysis, detection and categorization of harmful content, Information Extraction, Answer Extraction, and Paraphrase Identification. In a field where machine learning is the ubiquitous approach to solve challenges, the definition of research challenges, the development of high-quality test collections that allow for iterative evaluation, and the design of sound evaluation methodologies and metrics are perhaps the most critical aspects of research, and we believe IberLEF keeps making significant contributions to all of them.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation, Project FairTransNLP (PID2021-124361OB-C32), and by CONACyT-México, Project CB-2015-01-257383. The work of the third author has been partially funded by CDTI under grant IDI-20210776, IVACE under grant IMINOD/2021/72, and grant PLEC2021-007681 funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR.

References

[1] F. L. V. da Silva, G. d. S. Xavier, H. M. Mensenburg, R. F. Rodrigues, L. P. dos Santos, R. M. Araújo, U. B. Corrêa, L. A. de Freitas, ABSAPT 2022 at IberLEF: Overview of the Task on Aspect-Based Sentiment Analysis in Portuguese, Procesamiento del Lenguaje Natural 69 (2022).
[2] J. A. García-Díaz, S. M. Jiménez-Zafra, M.-T. Martín-Valdivia, F. García-Sánchez, L. A. Ureña-López, R. Valencia-García, Overview of PoliticEs 2022: Spanish Author Profiling for Political Ideology, Procesamiento del Lenguaje Natural 69 (2022).
[3] M. A. Álvarez Carmona, A. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado, R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of Rest-Mex at IberLEF 2022: Recommendation System, Sentiment Analysis and Covid Semaphore Prediction for Mexican Tourist Texts, Procesamiento del Lenguaje Natural 69 (2022).
[4] L. J. Arellano, H. J. Escalante, L. Villaseñor-Pineda, M. Montes-y Gómez, F. Sanchez-Vega, Overview of DA-VINCIS at IberLEF 2022: Detection of Aggressive and Violent Incidents from Social Media in Spanish, Procesamiento del Lenguaje Natural 69 (2022).
[5] A. Ariza-Casabona, W. S. Schmeisser-Nieto, M. Nofre, M. Taulé, E. Amigó, B. Chulvi, P. Rosso, Overview of DETESTS at IberLEF 2022: DETEction and classification of racial STereotypes in Spanish, Procesamiento del Lenguaje Natural 69 (2022).
[6] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón, M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2022: sEXism Identification in Social neTworks, Procesamiento del Lenguaje Natural 69 (2022).
[7] A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, D. Estrada, L. Gascó, M. Krallinger, Mention detection, normalization and classification of species, pathogens, humans and food in clinical documents: Overview of the LivingNER shared task and resources, Procesamiento del Lenguaje Natural 69 (2022).
[8] G. Bel-Enguix, G. Sierra, H. Gómez-Adorno, J.-M. Torres-Moreno, J.-G. Ortiz-Barajas, J. Vásquez, Overview of PAR-MEX at IberLEF 2022: Paraphrase Detection in Spanish Shared Task, Procesamiento del Lenguaje Natural 69 (2022).
[9] A. Rosá, L. Chiruzzo, L. Bouza, A. Dragonetti, S. Castro, M. Etcheverry, S. Góngora, S. Goycoechea, J. Machado, G. Moncecchi, J. J. Prada, D. Wonsever, Overview of QuALES at IberLEF 2022: Question Answering Learning from Examples in Spanish, Procesamiento del Lenguaje Natural 69 (2022).
[10] M. A. Sobrevilla Cabezudo, D. Diestra, R. López, E. Gómez, A. Oncevay, F. Alva-Manchego, Overview of ReCoRES at IberLEF 2022: Reading Comprehension and Reasoning Explanation for Spanish, Procesamiento del Lenguaje Natural 69 (2022).
[11] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016. URL: https://arxiv.org/abs/1606.05250. doi:10.48550/arXiv.1606.05250.
[12] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT, 2019. URL: https://arxiv.org/abs/1904.09675. doi:10.48550/arXiv.1904.09675.
[13] E. Amigó, A. Delgado, Evaluating Extreme Hierarchical Multi-label Classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.