<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2023.semeval-1.305</article-id>
      <title-group>
        <article-title>UMUTeam at EXIST 2024: Multi-modal Identification and Categorization of Sexism by Feature Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ronghao Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Antonio García-Díaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomás Bernal-Beltrán</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Valencia-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Informática, Universidad de Murcia, Campus de Espinardo</institution>
          ,
          <addr-line>30100</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2943</volume>
      <fpage>09</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The fourth edition of the EXIST shared task addresses the multimodal identification and categorization of sexism. This edition takes place as a lab at CLEF 2024. The main innovation in this edition with respect to 2023 is the addition of a multimodal task based on sexism identification and categorization in memes. As before, this shared task comprises sexist documents written in English and Spanish. For the textual tasks, we rely on the integration of features from several Large Language Models and linguistic features extracted with our custom tool. We rank 15th in Sexism Identification, 8th in Source Intention, and 20th in Sexism Categorization. For the multimodal tasks, we rely on the CLIP model to extract text and image embeddings, which we then combine by diagonal multiplication to obtain the classification models. We rank 33rd in Sexism Identification, and 18th in both Source Intention and Sexism Categorization.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism identification</kwd>
        <kwd>sexism categorization</kwd>
        <kwd>source intention detection</kwd>
        <kwd>multimodal feature engineering</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social networks have become central platforms for social activism and complaints, allowing movements
like #MeToo, #8M, and #Time’sUp to spread rapidly. Through these channels, women around the
world have shared their real-life experiences of abuse, discrimination, and other forms of sexism.
However, social networks also facilitate the spread of sexist, disrespectful, and hateful behavior. In this
context, the development of automated tools is essential. These tools can help detect and warn against
sexist behavior and discourse, estimate the frequency of sexist and abusive situations on social media,
identify the most common forms of sexism, and analyze how sexism is expressed on these platforms.
Traditional systems for detecting sexism frequently depend on predefined labels and fixed perspectives,
which can miss the complexity and subjectivity of sexist statements. The challenge in identifying
and addressing sexism stems from its inherently subjective assessment. In contrast, perspectivism
offers a promising method for enhancing detection by incorporating diverse opinions and viewpoints.
An important contribution to this field, which aims to address the problem of sexism identification
within the paradigm of learning with disagreement, has been made by the project EXIST 2024: sEXism
Identification in Social neTworks [1, 2, 3].</p>
      <p>Here we describe our participation in the fourth edition of EXIST [4, 5]. The first three editions of
EXIST focused only on the detection and classification of sexist text messages; the 2024 edition
introduces tasks that focus on images, specifically memes. Memes, generally humorous images, spread
rapidly through social networks and the Internet. They can therefore encompass a wider range of
sexist manifestations on social networks, especially those disguised as humor. Accordingly, this shared
task involves the development of automated multimodal tools capable of detecting sexism in both texts
and memes, with which the organizers aim to cover a broader spectrum of sexism on social networks.</p>
      <p>
        As a reminder, the tasks in the last edition of EXIST were three challenges, namely, (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) sexism
identification, a binary classification task in which participants had to determine whether a tweet
was sexist or not; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) source intention, a multiclass classification task focused on determining whether the
author’s intention was to post a sexist message, to report a sexist situation, or to make a judgment; and
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) sexism categorization, a multi-label classification task focused on identifying sexist characteristics.
It should be noted that in EXIST 2024, the organizers are retaining the learning with disagreements
paradigm proposed in 2023.
      </p>
      <p>Our research group has experience in the detection of hate speech [6] and misogyny [7] through the
compilation and evaluation of several corpora. We have focused mainly on Spanish-language data; our
work with English-language data is more limited, restricted to participation in the previous
editions of EXIST [8, 9, 10] and the evaluation of zero- and few-shot learning strategies on some existing
English datasets focused on hate speech detection [11].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Sexism refers to any abuse or negative sentiment directed at women based on their gender, or their
gender in combination with other identity attributes. In particular, sexism is a growing problem on the
Internet, with detrimental effects on women and other marginalized groups. It makes online spaces less
accessible and less welcoming, perpetuating asymmetries and social injustices.</p>
      <p>Automated tools are already widely used to identify and rate sexist content online. Researchers
have proposed several approaches to address this problem, ranging from rule-based methods [12]
to more complex models based on deep learning and pre-trained language models with the
Transformer architecture from a linguistic perspective [13, 14], with only a few attempts to address the
problem from a visual or multimodal perspective. One such work is described in [15], where the authors
study about 6,000 misogynistic memes using a deep learning model that determines which modality
plays a more significant role. The dataset included various characteristics such as hate speech, sexism,
or cyberbullying, among others. The authors found that all modalities were useful for identifying
misogyny, with text playing a significant role.</p>
      <p>Sexism can be further categorized into different forms depending on the author’s intent or the type
of sexism. All editions of EXIST use categories such as “Ideology and Inequality”, “Stereotyping and
Dominance”, “Objectification”, “Sexual Violence”, and “Misogyny and Non-Sexual Violence”. This is
similar to SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS) [16], which defined
a taxonomy for the more explainable classification of sexism in three hierarchical levels: binary sexism
detection, category of sexism, and fine-grained vector of sexism. In all editions of EXIST, sexism is also
considered at three levels: binary sexism detection in tweets, multiclass detection for source intention
in tweets, and category of sexism in tweets.</p>
      <p>Although the problem of sexism is primarily approached from a textual perspective, a field of research
has emerged that approaches the problem from a visual or multimodal perspective, as the multimodal
approach has shown superior performance to single modality approaches in detecting hate speech and
misogyny, such as [17] and [18]. Sexist content can also appear in images or in multimodal forms. From
a visual perspective, most research focuses on detecting offensive, non-conforming, or pornographic
content. For this reason, this edition of EXIST includes a new task focused on detecting sexism in
memes.</p>
      <p>Currently, one of the problems that needs to be addressed in sexism detection is the presence of biases
that could affect the actual performance of the models. Studies such as [19] and [20] have addressed
this issue by using textual data to detect sexism. However, many contributions of previous sexism
detection studies have not considered the complexity and multiple perspectives surrounding sexism, as
sexism can manifest itself in multiple ways influenced by cultural, social, and individual factors.</p>
      <p>Perspectivism in sexism detection involves considering multiple cultural, social, and individual
viewpoints and contexts when analyzing potentially sexist content. It is essential to better understand
the different manifestations of sexism, avoid bias, and improve the accuracy of detection models. It
promotes equity by recognizing the different ways in which sexism can manifest itself, thus contributing
to systems that are fairer and more sensitive to cultural and social diversity. For example, the authors
of [14] examine the errors made by classification models and discuss the difficulty of automatically
classifying sexism due to the subjectivity of labels and the complexity of the natural language used in
social networks.</p>
      <p>Thus, this edition of EXIST has taken a similar approach to the 2023 edition, adopting the Learning
with Disagreement (LeWiDi) paradigm for both dataset development and system evaluation. Unlike
traditional methods that rely on a single “correct” label for each example, LeWiDi trains models to
process and learn from conflicting or diverse annotations. This allows the system to incorporate different
annotator perspectives, biases, and interpretations, resulting in a more equitable learning process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The textual challenges in EXIST 2024 followed the same strategies as previous editions [21, 22]; that is,
they include tweets written in Spanish or English crawled with certain keywords that are commonly used
to undervalue the role of women. In fact, the textual dataset is the same as in 2023.</p>
      <p>For the multimodal dataset, the organizers of EXIST 2024 defined a lexicon of 250 terms and phrases
that lead to sexist memes, derived from those that have proven effective in identifying sexism in previous
editions of EXIST. This set includes 112 terms in English and 138 in Spanish, covering diverse topics and
including terms with varying degrees of use in both sexist and non-sexist contexts, all centered around
women. These terms were used as search queries on Google Images to obtain the top 100 images per
term. Rigorous manual cleaning procedures were applied to define memes and remove noise such as
textless images, text-only images, advertisements, and duplicates. This resulted in a final set of over
3,000 memes per language. Given the heterogeneous proportion of memes per term, we discarded
the most unbalanced seeds and ensured that each seed had at least five memes. The final dataset was
curated to achieve a balanced distribution of memes per seed. To avoid selection bias, memes were
randomly selected while maintaining the appropriate distribution per seed. This process resulted in
over 2,000 memes per language for the training set and over 500 memes per language for the test set.</p>
      <p>For the annotation process, the organizers considered two socio-demographic traits: gender and age
range. Each meme was then annotated by six crowdsourced annotators selected through the Prolific
platform, following guidelines developed by two experts in gender issues. The authors also provided the
annotators’ education level, ethnicity, and country of residence. The idea is to reduce the labeling bias that
may arise from cultural differences among annotators.</p>
      <p>In addition, following the Learning with Disagreements paradigm removes the assumption that items
have a single, unambiguous interpretation in a given context. Therefore, the dataset does not have a
single “gold” annotation, but participants can use the full range of annotations from all annotators.
This allows us to capture the diversity and develop more robust systems. The organizers release all
annotations from all annotators to the participants.</p>
      <p>The custom validation split is created using stratification by label in an 80-20 ratio.</p>
      <p>First, since we have all the annotations, we decided to train our models for Task 1 (sexism
identification) as a regression task rather than a classification task; thus, a text is considered sexist if the
regression model outputs a score greater than 2.5. The soft labels are the output of the regression model
normalized to a range of 1–10. Note that this approach was also used by our team in the previous
edition of EXIST. The hard labels of the textual modality can be found in Table 1.</p>
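      <p>The regression-to-label mapping can be sketched as follows. This is a minimal illustration: the vote aggregation and the linear 1–10 scaling are assumptions made for the example, not our exact trained pipeline.</p>

```python
def regression_score(annotations: list[str]) -> float:
    """Aggregate annotator votes into a score on a 1-10 scale.

    Illustrative assumption: the fraction of "YES" votes is linearly
    rescaled to [1, 10]; in the real system a trained regressor
    produces this score.
    """
    yes = sum(a == "YES" for a in annotations)
    return 1.0 + 9.0 * yes / len(annotations)


def hard_label(score: float, threshold: float = 2.5) -> str:
    """A text is considered sexist if the regression output exceeds 2.5."""
    return "YES" if score > threshold else "NO"


# One dissenting annotator out of six stays at the 2.5 threshold...
print(hard_label(regression_score(["YES", "NO", "NO", "NO", "NO", "NO"])))  # NO
# ...while two of six already cross it.
print(hard_label(regression_score(["YES", "YES", "NO", "NO", "NO", "NO"])))  # YES
```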
      <p>
        Second, Table 2 shows the label distribution for Tasks 2 and 5. The labels are: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) direct, if the intent
of the document is to write a message that is itself sexist or incites sexism; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) reported, if the
intent is to report and share a sexist situation; and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) judgmental, if the intent is to make a judgment about a sexist situation.
      </p>
      <p>
        Third, Table 3 shows the label distribution for Tasks 3 and 6. The labels are: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) ideological and
inequality, if the message downplays feminism or equality between men and women; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) stereotyping
and dominance, if the message contains stereotypes about social roles; and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) objectification, if the
message refers to physical characteristics related to beauty standards or hyper-sexualization. As can be
seen, we extracted a split from the provided dataset for individual validation (in an 80-20 ratio) using
label stratification. For Task 2, there is a significant imbalance between the labels: direct sexism is
the label with the most examples, while judgmental and reported have a similar number of examples.
For Task 3, the dataset is more balanced across categories. We have included the number of
unknown responses in the textual modality.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section explains our methodology for Tasks 1, 2, and 3 using textual modalities (see Section 4.1),
and for Tasks 4, 5, and 6 using multimodal approaches (see Section 4.2).</p>
      <sec id="sec-4-1">
        <title>4.1. Textual modalities</title>
        <p>For Tasks 1, 2, and 3 we decided to keep our previous strategy of training both languages (Spanish
and English) together in order to reduce the carbon footprint, cutting the number of models to four.
Since we are evaluating feature integration strategies, we decided to keep two Spanish LLMs, BETO
and MarIA, and two multilingual LLMs, DeBERTa v3 [23] and XLM-Twitter. In this sense, we removed
from our pipeline multilingual BERT, XLM, BERTIN, DistilBETO, and ALBETO [24]. We also extracted
the linguistic features (LFs) using the UMUTextStats tool [25].</p>
        <p>To fine-tune each LLM for each task, we performed hyperparameter tuning on 10 models. This
involved testing different learning rates, varying the number of epochs from 1 to 5, experimenting with
batch sizes of 8 and 16, and adjusting warm-up steps and weight decay to optimize the learning rate
during the initial training phases. The results of this tuning process are detailed in Table 4; the selected
learning rates range from 1.4e-05 to 4.7e-05.</p>
        <p>Next, we use SentenceBERT [26] to extract the [CLS] classification token for each tweet, LLM, and
task (1, 2, and 3). Now that each document is represented by a unique fixed-length vector, we can merge
it with the LFs using different feature integration strategies. The first strategy, known as Knowledge
Integration (KI), uses the feature vectors to train a new multi-input neural network. The second strategy
is based on Ensemble Learning (EL). For this, we train separate simple neural networks for each LLM
and for the LFs, and combine the results by averaging probabilities and outputs.</p>
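      <p>As an illustration of the KI strategy, the sketch below concatenates fixed-length [CLS] vectors from the four LLMs with the LF vector and runs a forward pass through a small network. All dimensions and weights are placeholders, not our tuned architectures.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed-length [CLS] sentence vectors from each LLM (768 dims assumed)
# for BETO, MarIA, mDeBERTa, and XLM-Twitter, plus the UMUTextStats
# linguistic-feature vector (size assumed for the example).
cls_vectors = [rng.normal(size=768) for _ in range(4)]
lf = rng.normal(size=365)

# Knowledge Integration (KI): merge all feature vectors into a single
# input for a new neural network (a two-layer brick-shaped MLP here).
x = np.concatenate(cls_vectors + [lf])

W1 = rng.normal(size=(x.size, 64)) * 0.01; b1 = np.zeros(64)
W2 = rng.normal(size=(64, 64)) * 0.01;     b2 = np.zeros(64)
Wo = rng.normal(size=(64, 2)) * 0.01;      bo = np.zeros(2)

h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer 1 (64 units)
h = np.maximum(h @ W2 + b2, 0.0)  # ReLU hidden layer 2 (64 units)
logits = h @ Wo + bo              # one logit per output class

print(x.shape)       # (3437,)
print(logits.shape)  # (2,)
```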
        <p>The training of the KI and the individual models is shown in Table 5. As expected, most
architectures are shallow (one or two hidden layers) and brick-shaped (same number of neurons in each
layer). This is expected because the vectors already capture the meaning of the sentences and are tuned
to the output tasks. However, the KI for Tasks 1 and 3 resulted in deep neural networks with few
neurons in each layer, with a funnel shape in Task 1 and a triangle shape in Task 3. Comparing
these results with those obtained in the previous edition, it draws our attention that for Task 3 we
obtained the best results in 2023 with deep neural networks and complex shapes.</p>
        <p>
          For the EL, we evaluate different strategies: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) mode of predictions, (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) averaging of probabilities,
and (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) obtaining the highest probability. The only exception is Task 1, where we only averaged the
predictions of the model, since we considered it to be a regression task.
        </p>
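        <p>The three combination strategies can be sketched with toy probabilities. The values below are invented to show that the strategies may disagree on the same document.</p>

```python
import numpy as np

# Per-model class probabilities for one document (rows: the four LLMs
# plus the LF model; columns: the classes). Values are illustrative.
probs = np.array([
    [0.70, 0.30],
    [0.60, 0.40],
    [0.05, 0.95],   # one very confident model
    [0.55, 0.45],
    [0.48, 0.52],
])

# (1) Mode of predictions: majority vote over per-model hard labels.
mode_label = int(np.bincount(probs.argmax(axis=1)).argmax())

# (2) Averaging of probabilities: argmax of the mean distribution.
mean_label = int(probs.mean(axis=0).argmax())

# (3) Highest probability: the single most confident model decides.
highest_label = int(probs.argmax() % probs.shape[1])

print(mode_label, mean_label, highest_label)  # 0 1 1
```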
        <p>Now we report our results with our custom validation split. We handled Task 1 (see Table 6) as a
regression task in order to account for the disagreement between the annotators. Therefore, we
report Explained Variance (EV), Root Mean Squared Logarithmic Error (RMSLE), Pearson R, R
Square (R2), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error
(RMSE). The best results were obtained using the KI strategy.</p>
        <p>For Tasks 2 and 3, the results with the custom validation split are shown in Table 7. The models
are scored and ranked using the macro-averaged precision, recall, and F1-Score. Similar to the previous
edition, our team achieved its best results for Task 2 with KI, which obtains the best recall and F1-Score,
while the best precision is obtained with an EL based on the mode. In the case of Task 3, the EL
strategies achieve better results, with an F1-Score of 53.085% when averaging probabilities and the
best recall of 81.242% with the highest probability.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Multi modalities</title>
        <p>As shown in Figure 1, during training with the visual input (meme) and the text input (textual
content of the meme), the text embedding vector and the image embedding vector perform diagonal
alignment. This means that the dot product between the text embedding vector and its corresponding
image embedding vector is high when they are related (i.e., for correct pairs) and low when they are not
related (i.e., for incorrect pairs). This is possible because the CLIP model trains and processes the text
and images in the same shared feature space. CLIP is a model developed by OpenAI that can efficiently
understand and associate images and text. This means that the model learns to correctly associate a
descriptive text with its corresponding image and to distinguish it from other unrelated images. Each
component of the architecture is described in detail below:
• Image Encoder (CLIP image encoder). The meme image is passed through the CLIP image
encoder, which extracts a series of embeddings {I1, I2, I3, ..., IN}. These embeddings represent the
visual properties of the image.
1https://huggingface.co/openai/clip-vit-base-patch32</p>
        <p>• Text Encoder (CLIP text encoder). The text of the meme is fed into the CLIP text encoder,
which produces a series of embeddings {T1, T2, T3, ..., TN}. These embeddings capture the semantic
features of the text.
• Diagonal multiplication. The image and text embeddings are combined using a diagonal
multiplication operation. This involves multiplying each text embedding Ti with the corresponding
image embedding Ii to create a new set of combined embeddings {T1I1, T2I2, T3I3, ..., TNIN}.
• Classification head. The combined embeddings are passed through a classification neural
network consisting of the following components:
– Dropout Layer. A layer with a dropout rate of 0.1 to prevent overfitting.
– Linear Layer. A linear layer that reduces the dimensionality to N features, where N is
the combined embedding length.
– Activation Layer (ReLU). A ReLU activation layer to introduce nonlinearities.
– Dropout Layer. Another layer with a dropout rate of 0.1 to prevent overfitting.
– Final Linear Layer. A linear layer that reduces the output to 2 neurons, corresponding to
the output classes (sexist or non-sexist).</p>
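        <p>A minimal forward pass of this fusion can be sketched as follows. The embeddings are random stand-ins for the CLIP encoder outputs, N = 512 assumes the ViT-B/32 projection size, and the weights are untrained placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512  # assumed CLIP (ViT-B/32) projection dimension

# Stand-ins for the CLIP text embedding T and image embedding I of one meme.
t = rng.normal(size=N)
i = rng.normal(size=N)

# Diagonal multiplication: element-wise product of the paired embeddings.
combined = t * i


def dropout(x: np.ndarray, rate: float = 0.1, train: bool = False) -> np.ndarray:
    """Dropout with rate 0.1; identity at inference time."""
    if not train:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


# Classification head: dropout -> linear(N) -> ReLU -> dropout -> linear(2).
W1 = rng.normal(size=(N, N)) * 0.01; b1 = np.zeros(N)
W2 = rng.normal(size=(N, 2)) * 0.01; b2 = np.zeros(2)

h = dropout(combined)
h = np.maximum(h @ W1 + b1, 0.0)   # ReLU activation
h = dropout(h)
logits = h @ W2 + b2               # sexist vs. non-sexist

print(combined.shape, logits.shape)  # (512,) (2,)
```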
        <p>For Tasks 5 and 6 we also assigned the most repeated label among the annotators; in case of a tie,
female annotators are given more weight. In addition, we have used the same approach of using the CLIP
model to obtain text and image embeddings and then, through a diagonal multiplication, obtaining a
combined embedding that serves as the input of a classification neural network.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>From an evaluation metrics perspective, organizers use two types of evaluation based on the learning
with disagreement paradigm [28].</p>
      <p>• Hard-Hard. This is a comparison between “hard” system outputs (final labels) and “hard” ground
truth. The Information Contrast Measure (ICM) metric is used to measure the similarity to the
ground truth categories. The F1-Score is also reported, although it is not ideal in this context
because it does not take into account the relationships between labels.
• Soft-Soft. This is a comparison between “soft” system outputs (probabilities) and “soft” ground
truths. In this case, the ICM-soft metric (an extension of ICM) is used as the official metric.</p>
      <sec id="sec-5-1">
        <title>5.1. Textual modality</title>
        <p>We sent three runs for Tasks 1, 2, and 3 based on the results on the custom validation split. For Task 1,
our runs are KI, EL, and LFs. Our results for Task 1 are described in Table 8. We ranked 15th, 19th, and
32nd in the soft-soft scheme for each run, respectively. As expected, our results were better in Spanish
than in English, explained by the use of two Spanish LLMs, BETO and MarIA. The model based on LFs
outperformed the proposed baselines.</p>
        <p>The official results for Task 2 are shown in Table 9. The first run is based on KI, while the second
and third runs are based on MarIA and multilingual DeBERTA. This is the task in which we obtained
our best results, ranking 8th (7th in Spanish and 10th in English) with the KI strategy, totaling 38
submissions among all teams. It is worth noting that MarIA outperformed MDeBERTA in both Spanish
and English.</p>
        <p>Next, the results for Task 3 are reported in Table 10. In this case, we focus on the hard vs. hard
scheme, as we do not calculate probabilities. The first run is based on EL averaging the probabilities,
the second run is based on KI, and the third run is based on EL based on mode. We got our best results
with the first run, except for the English results, where we got better results with EL based on mode.
Since we compete in a Hard vs. Hard scheme, the ranking also includes the Macro F1-Score, which is
49.42% with the first run, 47.38% with the second run, and 48.21% with the third run.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multimodal</title>
        <p>Table 11 shows the official results for Task 4, which evaluates hard labels only. The UMUTeam
consistently ranks around 33rd-35th in all evaluations, both overall (ALL) and language-specific (English
and Spanish). We have exceeded the two baselines suggested by the organizers, ranking 33rd with -0.2422
on ICM-Hard and 0.6963 on F1_YES (the F1-Score of the sexist label).</p>
        <p>Although the ICM-Hard metrics are negative, indicating space for improvement in similarity to ground
truth, we present less negative values compared to the baselines. This means that our predictions
are closer to the true labels. If we look only at the F1-Score of the model, our system would be in a
better position. However, in terms of similarity to the ground truth (ICM-Hard), we have a lot of room
for improvement. This may be due to the fact that our approach does not have a final softmax layer
indicating the probability of each label, since we used the model logits directly for the prediction. It is
also possible that it is due to a lack of additional adjustments and optimizations. In this case, we only
evaluated a 2e-5 learning rate, a training batch size of 16, 20 epochs, and an epoch-based validation
strategy.</p>
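        <p>For reference, turning the raw logits into per-label probabilities would amount to a single softmax layer, as sketched below.</p>

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: maps raw logits to probabilities."""
    z = logits - logits.max()   # shift by the max for stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

# Two-class example: logits for (sexist, non-sexist).
p = softmax(np.array([2.0, 0.5]))
print(p.round(3))  # [0.818 0.182]
```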
        <p>Table 12 provides the official results for Task 5, evaluating only hard labels. In summary, we
achieved 18th place in all evaluations, slightly behind the majority-class baseline (17th place) and ahead
of the minority-class baseline (21st place).</p>
        <p>Regarding the ICM-Hard values of our system (-1.1486, -1.0605, -1.2148), they are negative and higher
in magnitude than those of the majority-class baseline, indicating a lower similarity to the true labels.
However, they are significantly better than those of the minority-class baseline (-2.0637, -2.0866, -2.0410).</p>
        <p>In terms of F1_YES values, our model is superior to both baselines in all scenarios, especially in the
second ES column (0.2805), indicating better precision and recall in the positive class.</p>
        <p>Table 13 shows the official results for Task 6, which is a multi-label classification problem where only
hard labels are evaluated. Our system achieved several ranks depending on the scenario with a total
of 36 submissions among all teams: 18th in Hard-Hard ALL, 21st in the first column of Hard-Hard ES,
and 12th in the second column of Hard-Hard ES. In comparison, the model performs better than the
baselines in most scenarios, especially in the second Hard-Hard ES set.</p>
        <p>The ICM-Hard values obtained (-1.9511, -2.3853, -1.5817) are negative, but generally better than those
of the baselines, indicating a higher similarity to the true labels. Our system also clearly outperforms the
minority-class baseline in all scenarios and performs better than the majority-class baseline in terms of ICM-Hard.</p>
        <p>With respect to Macro F1-Score, we obtained significantly higher values than the baselines, especially
in the Hard-Hard ES evaluation, indicating a better balance between precision and recall and an overall
balanced performance in all classes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and further work</title>
      <p>This document describes UMUTeam’s proposal for EXIST 2024, which focuses on the identification
and categorization of sexism in Spanish and English. This is a very interesting competition, as it
deals with the learning with disagreements paradigm and binary and multiclass classification tasks, as
well as a new challenge based on multimodal features. For the textual tasks, we based our proposal on the
integration of linguistic features with sentence embeddings extracted from four LLMs, including Spanish
and multilingual variants. We achieved our best result, 8th place, in Task 2, the source intention task.</p>
      <p>As in previous editions, we are satisfied with our participation, since we achieved competitive results
in all tasks, outperforming the proposed baselines. However, in this edition, we only considered the
different annotation schemes in the binary classification task, since we treated it as a regression problem.
We are nevertheless satisfied that we were able to participate in all multimodal tasks, incorporating
image processing techniques into our pipeline.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been supported by projects LaTe4PoliticES (PID2022-138099OB-I00) funded by
MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF)-a way of making
Europe, LT-SWM (TED2021-131167B-I00) funded by MICIU/AEI/10.13039/ 501100011033 and by the
European Union NextGenerationEU/PRTR, "Services based on language technologies for political
microtargeting" (22252/PDC/23) funded by the Autonomous Community of the Region of Murcia through the
Regional Support Program for the Transfer and Valorization of Knowledge and Scientific
Entrepreneurship of the Seneca Foundation, Science and Technology Agency of the Region of Murcia. Mr. Ronghao
Pan is supported by the Programa Investigo grant, funded by the Region of Murcia, the Spanish
Ministry of Labour and Social Economy and the European Union - NextGenerationEU under the "Plan de
Recuperación, Transformación y Resiliencia (PRTR)".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>F.</given-names> <surname>Rodríguez-Sánchez</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Comet</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Donoso</surname></string-name>,
          <article-title>Overview of EXIST 2021: sexism identification in social networks</article-title>,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (<year>2021</year>)
          <fpage>195</fpage>-<lpage>207</lpage>.
          URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6389.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>F.</given-names> <surname>Rodríguez-Sánchez</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mendieta-Aragón</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Marco-Remón</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Makeienko</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>Overview of EXIST 2022: sexism identification in social networks</article-title>,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (<year>2022</year>)
          <fpage>229</fpage>-<lpage>240</lpage>.
          URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6443.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2023: sexism identification in social networks</article-title>,
          <source>in: Proceedings of ECIR'23</source>,
          <year>2023</year>,
          pp. <fpage>593</fpage>-<lpage>599</lpage>.
          doi: 10.1007/978-3-031-28241-6_68.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with disagreement for sexism identification and characterization in social networks and memes</article-title>,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>,
          Springer,
          <year>2024</year>,
          pp. <fpage>316</fpage>-<lpage>342</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with disagreement for sexism identification and characterization in social networks and memes (extended overview)</article-title>,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>,
          Springer,
          <year>2024</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>