<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UMUTeam at EXIST 2024: Multi-modal Identification and Categorization of Sexism by Feature Integration Notebook for the EXIST 2024 Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Ronghao</forename><surname>Pan</surname></persName>
							<email>ronghao.pan@um.es</email>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">José</forename><forename type="middle">Antonio</forename><surname>García-Díaz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tomás</forename><surname>Bernal-Beltrán</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rafael</forename><surname>Valencia-García</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad de Murcia</orgName>
								<address>
									<addrLine>Campus de Espinardo</addrLine>
									<postCode>30100</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UMUTeam at EXIST 2024: Multi-modal Identification and Categorization of Sexism by Feature Integration Notebook for the EXIST 2024 Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">712A42F381012E0BBFB0AA7C43DE1B4A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sexism identification</term>
					<term>sexism categorization</term>
					<term>source intention detection</term>
					<term>multimodal feature engineering</term>
					<term>natural language processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fourth edition of the EXIST shared task addresses the multimodal identification and categorization of sexism. This edition takes place as a lab at CLEF 2024. The main innovation in this edition with respect to 2023 is the addition of a multimodal task based on sexism identification and categorization in memes. As before, this shared task comprises sexist documents written in English and Spanish. For the textual tasks, we rely on the integration of features from several Large Language Models and linguistic features extracted with our custom tool. We rank 15th in Sexism Identification, 8th in Source Intention, and 20th in Sexism Categorization. For the multimodal tasks, we rely on the CLIP model to extract text and image embeddings, which we combine by diagonal multiplication to obtain the classification models. We rank 33rd in Sexism Identification, and 18th in both Source Intention and Sexism Categorization.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Social networks have become central platforms for social activism and complaints, allowing movements like #MeToo, #8M, and #Time'sUp to spread rapidly. Through these channels, women around the world have shared their real-life experiences of abuse, discrimination, and other forms of sexism. However, social networks also facilitate the spread of sexist, disrespectful, and hateful behavior. In this context, the development of automated tools is essential. These tools can help detect and warn against sexist behavior and discourse, estimate the frequency of sexist and abusive situations on social media, identify the most common forms of sexism, and analyze how sexism is expressed on these platforms. Traditional systems for detecting sexism frequently depend on predefined labels and fixed perspectives, which can miss the complexity and subjectivity of sexist statements. The challenge in identifying and addressing sexism stems from its inherently subjective assessment. In contrast, perspectivism offers a promising method for enhancing detection by incorporating diverse opinions and viewpoints. An important contribution to this field, which aims to address the problem of sexism identification within the paradigm of learning with disagreement, has been made by the project EXIST 2024: sEXism Identification in Social neTworks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>Here we describe our participation in the fourth edition of EXIST <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. The first three editions of EXIST focused only on the detection and classification of sexist text messages, while the 2024 edition introduces tasks that focus on images, specifically memes. Memes, generally humorous images, spread rapidly through social networks and the Internet. 
They can therefore encompass a wider range of sexist manifestations on social networks, especially those disguised as humor. This shared task therefore involves the development of automated multimodal tools capable of detecting sexism in both text and memes, allowing the organizers to cover a broader spectrum of sexism on social networks.</p><p>As a reminder, the previous edition of EXIST comprised three challenges: (1) sexism identification, a binary classification task in which participants had to determine whether a tweet was sexist or not; (2) source intention, a multi-class classification task focused on determining whether the author's intention was to post a sexist message, to report a sexist situation, or to make a judgment; and (3) sexism categorization, a multi-label classification task focused on identifying sexist characteristics. It should be noted that in EXIST 2024 the organizers retain the learning with disagreements paradigm proposed in 2023.</p><p>Our research group has experience in the detection of hate speech <ref type="bibr" target="#b5">[6]</ref> and misogyny <ref type="bibr" target="#b6">[7]</ref> through the compilation and evaluation of several corpora. We have focused mainly on Spanish-language data; our work with English-language data is more limited, restricted to participation in the previous editions of EXIST <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> and the evaluation of zero- and few-shot learning strategies on existing English datasets focused on hate speech detection <ref type="bibr" target="#b10">[11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Sexism refers to any abuse or negative sentiment directed at women based on their gender, or their gender in combination with other identity attributes. In particular, sexism is a growing problem on the Internet, with detrimental effects on women and other marginalized groups. It makes online spaces less accessible and less welcoming, perpetuating asymmetries and social injustices.</p><p>Automated tools are already widely used to identify and rate sexist content online. Researchers have proposed several approaches to address this problem, ranging from rule-based methods <ref type="bibr" target="#b11">[12]</ref> to more complex models based on deep learning and pre-trained language models with the Transformer architecture from a linguistic perspective <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, with only a few attempts to address the problem from a visual or multimodal perspective. One such attempt is described in <ref type="bibr" target="#b14">[15]</ref>, where the authors study about 6,000 misogynistic memes using a deep learning model that determines which modality plays the more significant role. The dataset included various characteristics such as hate speech, sexism, and cyberbullying, among others. The authors found that all modalities were useful for identifying misogyny, with text playing a significant role.</p><p>Sexism can be further categorized into different forms depending on the author's intent or the type of sexism. All editions of EXIST use categories such as "Ideology and Inequality", "Stereotyping and Dominance", "Objectification", "Sexual Violence", and "Misogyny and Non-Sexual Violence". 
This is similar to SemEval 2023 Task 10: Explainable Detection of Online Sexism (EDOS) <ref type="bibr" target="#b15">[16]</ref>, which defined a taxonomy for the more explainable classification of sexism in three hierarchical levels: binary sexism detection, category of sexism, and fine-grained vector of sexism. In all editions of EXIST, sexism is also considered at three levels: binary sexism detection in tweets, multi-class detection of source intention in tweets, and category of sexism in tweets.</p><p>Although the problem of sexism is primarily approached from a textual perspective, a field of research has emerged that approaches the problem from a visual or multimodal perspective, as the multimodal approach has shown superior performance to single-modality approaches in detecting hate speech and misogyny, such as <ref type="bibr" target="#b16">[17]</ref> and <ref type="bibr" target="#b17">[18]</ref>. Sexist content can also appear in images or in multimodal forms. From a visual perspective, most research focuses on detecting offensive, non-conforming, or pornographic content. For this reason, this edition of EXIST includes a new task focused on detecting sexism in memes.</p><p>Currently, one of the problems that needs to be addressed in sexism detection is the presence of biases that could affect the actual performance of the models. Studies such as <ref type="bibr" target="#b18">[19]</ref> and <ref type="bibr" target="#b19">[20]</ref> have addressed this issue using textual data to detect sexism. However, many of the contributions of previous sexism detection studies have not considered the complexity and multiple perspectives surrounding sexism, as sexism can manifest itself in multiple ways influenced by cultural, social, and individual factors.</p><p>Perspectivism in sexism detection involves considering multiple cultural, social, and individual viewpoints and contexts when analyzing potentially sexist content. 
It is essential to better understand the different manifestations of sexism, avoid bias, and improve the accuracy of detection models. It promotes equity by recognizing the different ways in which sexism can manifest itself, thus contributing to systems that are fairer and more sensitive to cultural and social diversity. For example, the authors of <ref type="bibr" target="#b13">[14]</ref> examine the errors made by classification models and discuss the difficulty of automatically classifying sexism due to the subjectivity of labels and the complexity of the natural language used in social networks.</p><p>Thus, this edition of EXIST takes a similar approach to the 2023 edition, adopting the Learning with Disagreement (LeWiDi) paradigm for both dataset development and system evaluation. Unlike traditional methods that rely on a single "correct" label for each example, LeWiDi trains models to process and learn from conflicting or diverse annotations. This allows the system to incorporate different annotator perspectives, biases, and interpretations, resulting in a more equitable learning process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>The textual challenges in EXIST 2024 follow the same strategy as previous editions <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref>; that is, they include tweets written in Spanish or English, crawled with keywords that are commonly used to undervalue the role of women. In fact, the textual dataset is the same as in 2023.</p><p>For the multimodal dataset, the organizers of EXIST 2024 defined a lexicon of 250 terms and phrases that lead to sexist memes, derived from those that have proven effective in identifying sexism in previous editions of EXIST. This set includes 112 terms in English and 138 in Spanish, covering diverse topics and including terms with varying degrees of use in both sexist and non-sexist contexts, all centered around women. These terms were used as search queries on Google Images to obtain the top 100 images per term. Rigorous manual cleaning procedures were applied to identify memes and remove noise such as textless images, text-only images, advertisements, and duplicates. This resulted in a final set of over 3,000 memes per language. Given the heterogeneous proportion of memes per term, the organizers discarded the most unbalanced seeds and ensured that each seed had at least five memes. The final dataset was curated to achieve a balanced distribution of memes per seed. To avoid selection bias, memes were randomly selected while maintaining the appropriate distribution per seed. This process resulted in over 2,000 memes per language for the training set and over 500 memes per language for the test set.</p><p>For the annotation process, the organizers considered two socio-demographic traits: gender and age range. Each meme was then annotated by six crowdsourced annotators selected through the Prolific app, following guidelines developed by two experts in gender issues. 
The organizers also provided the annotators' education level, ethnicity, and country of residence. The idea is to reduce labeling bias that may arise from cultural differences among annotators.</p><p>In addition, following the Learning with Disagreements paradigm removes the assumption that items have a single, unambiguous interpretation in a given context. Therefore, the dataset does not have a single "gold" annotation; instead, participants can use the full range of annotations from all annotators. This allows us to capture this diversity and develop more robust systems. The organizers release all annotations from all annotators to the participants.</p><p>Our custom validation split is created by stratifying on the label in an 80-20 ratio.</p><p>First, since we have all the annotations, we decided to train our models for Task 1 (sexism identification) as a regression task rather than a classification task; thus, a text is considered sexist if the regression model outputs a score greater than 2.5. The soft labels are the output of the regression model normalized to a range of 1-10. Note that this approach was also used by our team in the previous edition of EXIST. The hard labels of the textual modality can be found in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Second, Table <ref type="table" target="#tab_1">2</ref> shows the label distribution for Tasks 2 and 5. The labels are: (1) direct, if the intent of the document is to write a message that is itself sexist or incites sexism; (2) reported, if the intent is to report and share a sexist situation; and (3) judgmental, if the intent is to judge a sexist situation.</p><p>Third, Table <ref type="table" target="#tab_2">3</ref> shows the label distribution for Tasks 3 and 6. 
The labels are: (1) ideological and inequality, if the message downplays feminism or equality between men and women; (2) stereotyping and dominance, if the message contains stereotypes about social roles; and (3) objectification, if the message references physical characteristics related to beauty standards or hyper-sexualization. As can be seen, we extracted a split from the provided dataset for individual validation (in an 80-20 ratio) using label stratification. For Task 2, there is a significant imbalance between the labels, with direct being the most frequent label, while judgmental and reported have a similar number of examples. For Task 3, the dataset is more balanced across categories. We have included the number of unknown responses in the textual modality.</p></div>
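The Task 1 label derivation described above can be sketched as follows. This is an illustrative sketch, not the team's code: the 2.5 threshold and the 1-10 soft-label range come from the text, while the assumed [0, 6] raw score range (one point per annotator) and the function name are hypothetical.

```python
# Hedged sketch of the Task 1 label derivation (not the authors' code).
# A text counts as sexist when the regression score exceeds 2.5; the soft
# label is the score rescaled to the 1-10 range. The raw [0, 6] score range
# (one point per each of the six annotators) is an assumption.
def regression_to_labels(score, threshold=2.5, lo=0.0, hi=6.0):
    hard = "YES" if score > threshold else "NO"
    soft = 1.0 + 9.0 * (score - lo) / (hi - lo)  # rescale to [1, 10]
    return hard, soft
```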
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Textual modalities</head><p>For Tasks 1, 2, and 3, we kept our previous strategy of training both languages (Spanish and English) together, which reduces the carbon footprint by avoiding a separate model per language and limits the number of models to 4. Since we are evaluating feature integration strategies, we decided to keep two Spanish LLMs, BETO and MarIA, and two multilingual LLMs, DeBERTa v3 <ref type="bibr" target="#b22">[23]</ref> and XLM-Twitter. Accordingly, we removed multilingual BERT, XLM, BERTIN, DistilBETO, and ALBETO <ref type="bibr" target="#b23">[24]</ref> from our pipeline. We also extracted the linguistic features (LFs) using the UMUTextStats tool <ref type="bibr" target="#b24">[25]</ref>.</p><p>To fine-tune each LLM for each task, we performed hyperparameter tuning on 10 models. This involved testing different learning rates, varying the number of epochs from 1 to 5, experimenting with batch sizes of 8 and 16, and adjusting warm-up steps and weight decay to optimize the learning rate during the initial training phases. The results of this tuning process are detailed in Table <ref type="table" target="#tab_3">4</ref>. Next, we extract the classification ([CLS]) token for each tweet, LLM, and task (1, 2, and 3) using Sentence-BERT <ref type="bibr" target="#b25">[26]</ref>. Now that each document is represented by a unique fixed-length vector, we can merge it with the LFs using different feature integration strategies. The first strategy, known as Knowledge Integration (KI), uses the feature vectors to train a new multi-input neural network. The second strategy is based on Ensemble Learning (EL). 
For this, we train separate simple neural networks for each LLM and for the LFs, and combine the results by averaging probabilities and outputs.</p><p>The configurations of the KI and the individual models are shown in Table <ref type="table" target="#tab_4">5</ref>. As expected, most architectures are shallow (one or two hidden layers) and brick-shaped (the same number of neurons in each layer). This is expected because the vectors already capture the meaning of the sentences and are tuned to the output tasks. However, the KI for Tasks 1 and 3 resulted in deep neural networks with few neurons per layer: a funnel shape in Task 1 and a triangle shape in Task 3. Comparing these results with those of the previous edition, it is noteworthy that for Task 3 we obtained the best results in 2023 with deep neural networks and complex shapes.</p><p>For the EL, we evaluate different strategies: (1) mode of predictions, (2) averaging of probabilities, and (3) taking the highest probability. The only exception is Task 1, where we only averaged the predictions of the models, since we treated it as a regression task. Next, we report results on our custom validation split. Since we handled Task 1 (see Table <ref type="table" target="#tab_5">6</ref>) as a regression task to account for the disagreement between annotators, we report Explained Variance (EV), Root Mean Squared Logarithmic Error (RMSLE), Pearson's R, R-Squared (R2), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The best results were obtained using the KI strategy. For Tasks 2 and 3, the results on the custom validation split are shown in Table <ref type="table" target="#tab_6">7</ref>. The models are scored and ranked using the macro-averaged precision, recall, and F1-Score. 
Similar to the previous edition, our team achieved its best results on the custom validation split with KI for Task 2, which obtained the best recall and F1-Score, while the best precision was obtained with an EL based on the mode. In the case of Task 3, the EL strategies achieve better results, with an F1-Score of 53.085% when averaging probabilities and the best recall of 81.242% when taking the highest probability, but the best precision  </p></div>
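The EL combination strategies described above (mode of predictions and averaging of probabilities) can be sketched roughly as follows; function names, array names, and shapes are illustrative assumptions rather than the team's actual implementation.

```python
import numpy as np

# Rough sketch (not the authors' code) of two of the ensemble strategies
# described above. `prob_matrices` has shape (models, samples, classes).
def ensemble_average(prob_matrices):
    """Average the class probabilities across models, then take the argmax."""
    return np.mean(prob_matrices, axis=0).argmax(axis=1)

def ensemble_mode(prob_matrices):
    """Each model votes with its argmax; the most frequent class wins."""
    votes = np.argmax(prob_matrices, axis=2)  # shape: (models, samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```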
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Multimodal</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the architecture used to implement the model for Tasks 4, 5, and 6. We used the CLIP <ref type="bibr" target="#b26">[27]</ref> model (specifically, openai/clip-vit-base-patch32<ref type="foot" target="#foot_0">1</ref>) to extract both image and text embeddings, and then multiply those embeddings using a diagonal operation to use them as input to a classification neural network that identifies whether a meme is sexist or not. Diagonal multiplication refers to a specific mechanism used during the training process to effectively connect and align text and image representations. As shown in Figure <ref type="figure" target="#fig_0">1</ref>, during training with the visual input (meme) and the text input (textual content of the meme), the text embedding vector and the image embedding vector undergo diagonal alignment. This means that the dot product between a text embedding vector and its corresponding image embedding vector is high when they are related (i.e., for correct pairs) and low when they are not (i.e., for incorrect pairs). This is possible because the CLIP model trains and processes text and images in the same shared feature space. CLIP is a model developed by OpenAI that can efficiently understand and associate images and text; it learns to correctly associate a descriptive text with its corresponding image and to distinguish it from other, unrelated images. Each component of the architecture is described in detail below:</p><p>• Image Encoder (CLIP image encoder). The meme image is passed through the CLIP image encoder, which extracts a series of embeddings {I_1, I_2, I_3, ..., I_N}. These embeddings represent the visual properties of the image. • Text Encoder (CLIP text encoder). 
The text of the meme is fed into the CLIP text encoder, which produces a series of embeddings {T_1, T_2, T_3, ..., T_N}. These embeddings capture the semantic features of the text. • Diagonal multiplication. The image and text embeddings are combined using a diagonal multiplication operation. This involves multiplying each text embedding T_i with the corresponding image embedding I_i to create a new set of combined embeddings {T_1 I_1, T_2 I_2, T_3 I_3, ..., T_N I_N}. • Classification head. The combined embeddings are passed through a classification neural network consisting of the following components:</p><p>-Dropout Layer. A layer with a dropout rate of 0.1 to prevent overfitting.</p><p>-Linear Layer. A linear layer that reduces the dimensionality to N features, where N is the combined embedding length. -Activation Layer (ReLU). A ReLU activation layer to introduce nonlinearities. -Dropout Layer. Another layer with a dropout rate of 0.1 to prevent overfitting.</p><p>-Final Linear Layer. A linear layer that reduces the output to 2 neurons, corresponding to the output classes (sexist or non-sexist).</p><p>For Tasks 5 and 6, we assigned the most frequent label among the annotators; in case of a tie, the votes of female annotators carry more weight. In addition, we used the same approach of obtaining text and image embeddings with the CLIP model and then, through diagonal multiplication, producing a combined embedding that serves as the input to a classification neural network.</p></div>
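The classification head enumerated above can be sketched in PyTorch as follows. This is a hedged reconstruction from the description, not the team's code: loading the CLIP encoders is omitted, and the 512-dimensional embedding size of openai/clip-vit-base-patch32 is assumed.

```python
import torch
import torch.nn as nn

# Sketch of the classification head described above (not the authors' code).
# Text and image embeddings are assumed to come from CLIP and to share the
# same dimensionality (512 for openai/clip-vit-base-patch32).
class MemeClassifier(nn.Module):
    def __init__(self, embed_dim=512, num_classes=2, dropout=0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Dropout(dropout),                # dropout to prevent overfitting
            nn.Linear(embed_dim, embed_dim),    # linear layer to N features
            nn.ReLU(),                          # non-linearity
            nn.Dropout(dropout),                # second dropout layer
            nn.Linear(embed_dim, num_classes),  # 2 outputs: sexist / non-sexist
        )

    def forward(self, text_emb, image_emb):
        # "Diagonal multiplication": element-wise product T_i * I_i of the
        # aligned text and image embeddings.
        return self.head(text_emb * image_emb)
```

A batch of pre-computed CLIP embeddings of shape (batch, 512) for both modalities yields logits of shape (batch, 2).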
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>From an evaluation metrics perspective, organizers use two types of evaluation based on the learning with disagreement paradigm <ref type="bibr" target="#b27">[28]</ref>.</p><p>• Hard-Hard. This is a comparison between "hard" system outputs (final labels) and "hard" ground truth. The Information Contrast Measure (ICM) metric is used to measure the similarity to the ground truth categories. The F1-Score is also reported, although it is not ideal in this context because it does not take into account the relationships between labels. • Soft-Soft. This is a comparison between "soft" system outputs (probabilities) and "soft" ground truths. In this case, the ICM-soft metric (an extension of ICM) is used as the official metric.</p></div>
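The two output formats contrasted above (hard labels vs. soft probability distributions) can be illustrated with a small sketch; the softmax conversion and the function name are assumptions for illustration, not part of the official evaluation code.

```python
import math

# Illustrative sketch: turning per-class scores into a "hard" output (a
# single label) and a "soft" output (a probability distribution) via softmax.
def to_outputs(logits, labels=("NO", "YES")):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    soft = {lab: e / total for lab, e in zip(labels, exps)}
    hard = max(soft, key=soft.get)  # hard label = most probable class
    return hard, soft
```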
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Textual modality</head><p>We submitted three runs for Tasks 1, 2, and 3 based on the results on the custom validation split. For Task 1, our runs are KI, EL, and LFs. Our results for Task 1 are described in Table <ref type="table" target="#tab_7">8</ref>. We ranked 15th, 19th, and 32nd under the soft-soft scheme for each run, respectively. As expected, our results were better in Spanish than in English, which is explained by the use of two Spanish LLMs, BETO and MarIA. The model based on LFs outperformed the proposed baselines.</p><p>The official results for Task 2 are shown in Table <ref type="table" target="#tab_8">9</ref>. The first run is based on KI, while the second and third runs are based on MarIA and multilingual DeBERTa, respectively. This is the task in which we obtained our best results, ranking 8th (7th in Spanish and 10th in English) with the KI strategy, out of a total of 38 submissions among all teams. It is worth noting that MarIA outperformed multilingual DeBERTa in both Spanish and English.</p><p>Next, the results for Task 3 are reported in Table <ref type="table" target="#tab_9">10</ref>. In this case, we focus on the hard-hard scheme, as we do not calculate probabilities. The first run is based on EL averaging the probabilities, the second run is based on KI, and the third run is based on EL using the mode. We obtained our best results with the first run, except for English, where the EL based on the mode performed better. Since we compete under a hard-hard scheme, the ranking also includes the Macro F1-Score, which is 49.42% for the first run, 47.38% for the second run, and 48.21% for the third run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Multimodal</head><p>Table <ref type="table" target="#tab_0">11</ref> shows the official results for Task 4, which evaluates hard labels only. The UMUTeam consistently ranks around 33rd-35th in all evaluations, both overall (ALL) and language-specific (English and Spanish). We exceeded the two baselines suggested by the organizers, ranking 33rd with -0.2422 on ICM-Hard and 0.6963 on F1_YES (the F1-Score of the sexist label).</p><p>Although the ICM-Hard values are negative, indicating room for improvement in similarity to the ground truth, they are less negative than the baselines, meaning that our predictions are closer to the true labels. Considering only the F1-Score, our system would be in a better position. However, in terms of similarity to the ground truth (ICM-Hard), there is considerable room for improvement. This may be because our approach lacks a final softmax layer indicating the probability of each label, since we used the model logits directly for prediction. It may also be due to a lack of additional adjustments and optimizations: in this case, we only evaluated a 2e-5 learning rate, a training batch size of 16, 20 epochs, and an epoch-based validation strategy.</p><p>Table <ref type="table" target="#tab_1">12</ref> provides the official results for Task 5, evaluating only hard labels. In summary, we achieved 18th place in all evaluations, slightly behind the majority class baseline (17th place) and ahead of the minority class baseline.</p><p>Table <ref type="table" target="#tab_2">13</ref> shows the official results for Task 6, which is a multi-label classification problem where only hard labels are evaluated. Our system achieved several ranks depending on the scenario, with a total of 36 submissions among all teams: 18th in Hard-Hard ALL, 21st in the first column of Hard-Hard ES, and 12th in the second column of Hard-Hard ES. 
In comparison, the model performs better than the baselines in most scenarios, especially in the second Hard-Hard ES set.</p><p>The ICM-Hard values obtained (-1.9511, -2.3853, -1.5817) are negative but generally better than the baselines, indicating a higher similarity to the true labels. The system also clearly outperforms the minority class baseline in all scenarios and performs better than the majority class baseline in terms of ICM-Hard.</p><p>With respect to the Macro F1-Score, we obtained significantly higher values than the baselines, especially in the Hard-Hard ES evaluation, indicating a better balance between precision and recall and an overall balanced performance across all classes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 13</head><p>Official results for Task 6, including only hard labels. We report the rank, the ICM-Hard metric and the F1-Score of the YES label between parentheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and further work</head><p>This document describes the UMUTeam's proposal for EXIST 2024, which focuses on the identification and categorization of sexism in Spanish and English. This is a very interesting competition, as it deals with the learning with disagreements paradigm and binary and multi-class classification tasks, as well as a new challenge based on multimodal features. For the textual tasks, we based our proposal on the integration of linguistic features with sentence embeddings extracted from four LLMs, including Spanish and multilingual variants. We achieved our best result, 8th place, in Task 2, the source intention task.</p><p>As in previous editions, we are satisfied with our participation, since we achieved competitive results in all tasks, outperforming the proposed baselines. However, in this edition we only considered the different annotator schemes in the binary classification task, since it was treated as a regression problem. We are also pleased to have participated in all the multimodal tasks, incorporating image processing techniques into our pipeline.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: System architecture of Task 4.</figDesc><graphic coords="7,72.00,446.99,451.26,105.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Dataset statistics for Tasks 1 (left) and 4 (right) concerning sexism identification.</figDesc><table><row><cell></cell><cell cols="3">Task 1</cell><cell cols="3">Task 4</cell></row><row><cell>label</cell><cell>train</cell><cell>val</cell><cell>total</cell><cell>train</cell><cell>val</cell><cell>total</cell></row><row><cell>non-sexist</cell><cell>2306</cell><cell>1540</cell><cell>3846</cell><cell>2933</cell><cell>734</cell><cell>3667</cell></row><row><cell>sexist</cell><cell>2466</cell><cell>1646</cell><cell>4112</cell><cell>302</cell><cell>75</cell><cell>377</cell></row><row><cell>total</cell><cell>4772</cell><cell>3186</cell><cell>7958</cell><cell>3235</cell><cell>809</cell><cell>4044</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Dataset statistics for Tasks 2 (left) and 5 (right) concerning source intention.</figDesc><table><row><cell></cell><cell cols="3">Task 2</cell><cell cols="3">Task 5</cell></row><row><cell>label</cell><cell>train</cell><cell>val</cell><cell>total</cell><cell>train</cell><cell>val</cell><cell>total</cell></row><row><cell>-</cell><cell>3175</cell><cell>2090</cell><cell>5265</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>direct</cell><cell>992</cell><cell>665</cell><cell>1657</cell><cell>2609</cell><cell>626</cell><cell>3261</cell></row><row><cell>judgmental</cell><cell>296</cell><cell>225</cell><cell>521</cell><cell>652</cell><cell>157</cell><cell>783</cell></row><row><cell>reported</cell><cell>309</cell><cell>206</cell><cell>515</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>total</cell><cell>4772</cell><cell>3186</cell><cell>7958</cell><cell>3261</cell><cell>783</cell><cell>4044</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Distribution of the datasets for Tasks 3 (left) and 6 (right) concerning misogyny categorization.</figDesc><table><row><cell>Task 3</cell><cell>Task 6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Hyperparameter optimization of LLMs.</figDesc><table><row><cell>LLM</cell><cell>learning rate</cell><cell>epochs</cell><cell>batch size</cell><cell>warmup steps</cell><cell>weight decay</cell></row><row><cell cols="6">Task 1</cell></row><row><cell>BETO</cell><cell>3.6e-05</cell><cell>4</cell><cell>16</cell><cell>0</cell><cell>0.044</cell></row><row><cell>MARIA</cell><cell>2.9e-05</cell><cell>3</cell><cell>16</cell><cell>250</cell><cell>0.21</cell></row><row><cell>MDEBERTA</cell><cell>3.1e-05</cell><cell>4</cell><cell>16</cell><cell>500</cell><cell>0.18</cell></row><row><cell>XLMTWITTER</cell><cell>4.7e-05</cell><cell>2</cell><cell>8</cell><cell>0</cell><cell>0.22</cell></row><row><cell cols="6">Task 2</cell></row><row><cell>BETO</cell><cell>3.8e-05</cell><cell>3</cell><cell>16</cell><cell>500</cell><cell>0.17</cell></row><row><cell>MARIA</cell><cell>3.8e-05</cell><cell>2</cell><cell>8</cell><cell>250</cell><cell>0.031</cell></row><row><cell>MDEBERTA</cell><cell>1.4e-05</cell><cell>5</cell><cell>8</cell><cell>500</cell><cell>0.018</cell></row><row><cell>XLMTWITTER</cell><cell>4.2e-05</cell><cell>5</cell><cell>8</cell><cell>1000</cell><cell>0.22</cell></row><row><cell cols="6">Task 3</cell></row><row><cell>BETO</cell><cell>2.6e-05</cell><cell>4</cell><cell>16</cell><cell>0</cell><cell>0.25</cell></row><row><cell>MARIA</cell><cell>4.5e-05</cell><cell>5</cell><cell>16</cell><cell>0</cell><cell>0.054</cell></row><row><cell>MDEBERTA</cell><cell>4.5e-05</cell><cell>5</cell><cell>16</cell><cell>500</cell><cell>0.29</cell></row><row><cell>XLMTWITTER</cell><cell>2.8e-05</cell><cell>4</cell><cell>8</cell><cell>500</cell><cell>0.17</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Results of the hyper-parameter optimization stage using Keras of the LFs (LF), each LLM and the multi-input neural network using Knowledge Integration (KI).</figDesc><table><row><cell>feature set</cell><cell>shape</cell><cell>layers</cell><cell>neurons</cell><cell>dropout</cell><cell>lr</cell><cell>batch size</cell><cell>activation</cell></row><row><cell cols="8">Task 1</cell></row><row><cell>LF</cell><cell>brick</cell><cell>2</cell><cell>64</cell><cell>0</cell><cell>0.001</cell><cell>32</cell><cell>linear</cell></row><row><cell>BETO</cell><cell>triangle</cell><cell>3</cell><cell>512</cell><cell>0.3</cell><cell>0.001</cell><cell>64</cell><cell>tanh</cell></row><row><cell>MARIA</cell><cell>long funnel</cell><cell>3</cell><cell>256</cell><cell>0.1</cell><cell>0.01</cell><cell>64</cell><cell>sigmoid</cell></row><row><cell>MDEBERTA</cell><cell>brick</cell><cell>4</cell><cell>16</cell><cell>0.2</cell><cell>0.001</cell><cell>32</cell><cell>selu</cell></row><row><cell>XLMTWITTER</cell><cell>brick</cell><cell>2</cell><cell>8</cell><cell>0.3</cell><cell>0.001</cell><cell>32</cell><cell>sigmoid</cell></row><row><cell>KI</cell><cell>funnel</cell><cell>8</cell><cell>8</cell><cell>0.3</cell><cell>0.001</cell><cell>64</cell><cell>elu</cell></row><row><cell cols="8">Task 2</cell></row><row><cell>LF</cell><cell>brick</cell><cell>2</cell><cell>256</cell><cell>0</cell><cell>0.01</cell><cell>256</cell><cell>linear</cell></row><row><cell>BETO</cell><cell>brick</cell><cell>6</cell><cell>128</cell><cell>0</cell><cell>0.01</cell><cell>128</cell><cell>elu</cell></row><row><cell>MARIA</cell><cell>triangle</cell><cell>5</cell><cell>512</cell><cell>0</cell><cell>0.01</cell><cell>256</cell><cell>elu</cell></row><row><cell>MDEBERTA</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell>0</cell><cell>0.01</cell><cell>256</cell><cell>linear</cell></row><row><cell>XLMTWITTER</cell><cell>brick</cell><cell>5</cell><cell>512</cell><cell>0</cell><cell>0.01</cell><cell>256</cell><cell>selu</cell></row><row><cell>KI</cell><cell>brick</cell><cell>3</cell><cell>128</cell><cell>0.2</cell><cell>0.001</cell><cell>512</cell><cell>sigmoid</cell></row><row><cell cols="8">Task 3</cell></row><row><cell>LF</cell><cell>brick</cell><cell>2</cell><cell>256</cell><cell>0</cell><cell>0.01</cell><cell>64</cell><cell>linear</cell></row><row><cell>BETO</cell><cell>brick</cell><cell>2</cell><cell>128</cell><cell>0</cell><cell>0.001</cell><cell>64</cell><cell>tanh</cell></row><row><cell>MARIA</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell>0</cell><cell>0.01</cell><cell>64</cell><cell>linear</cell></row><row><cell>MDEBERTA</cell><cell>brick</cell><cell>2</cell><cell>512</cell><cell>0</cell><cell>0.01</cell><cell>64</cell><cell>linear</cell></row><row><cell>XLMTWITTER</cell><cell>brick</cell><cell>2</cell><cell>37</cell><cell>0</cell><cell>0.001</cell><cell>64</cell><cell>sigmoid</cell></row><row><cell>KI</cell><cell>triangle</cell><cell>7</cell><cell>16</cell><cell>False</cell><cell>0.01</cell><cell>32</cell><cell>selu</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Results with the custom validation split for Task 1, reported as a regression task. The results are organized by feature set: the first block contains the linguistic features, the second block the LLMs, and the third and fourth blocks the Knowledge Integration (KI) strategy and ensemble learning (EL), respectively.</figDesc><table><row><cell>feature-set</cell><cell>EV</cell><cell>RMSLE</cell><cell>PEARSONR</cell><cell>R2</cell><cell>MAE</cell><cell>MSE</cell><cell>RMSE</cell></row><row><cell>LF</cell><cell>0.162</cell><cell>0.414</cell><cell>0.458</cell><cell>0.161</cell><cell>1.573</cell><cell>3.629</cell><cell>1.905</cell></row><row><cell>BETO</cell><cell>0.563</cell><cell>0.224</cell><cell>0.752</cell><cell>0.562</cell><cell>1.084</cell><cell>1.892</cell><cell>1.376</cell></row><row><cell>MARIA</cell><cell>0.574</cell><cell>0.218</cell><cell>0.757</cell><cell>0.574</cell><cell>1.080</cell><cell>1.843</cell><cell>1.358</cell></row><row><cell>MDEBERTA</cell><cell>0.575</cell><cell>0.216</cell><cell>0.758</cell><cell>0.575</cell><cell>1.082</cell><cell>1.837</cell><cell>1.356</cell></row><row><cell>XLMTWITTER</cell><cell>0.541</cell><cell>0.235</cell><cell>0.735</cell><cell>0.541</cell><cell>1.130</cell><cell>1.986</cell><cell>1.409</cell></row><row><cell>EL</cell><cell>0.584</cell><cell>0.227</cell><cell>0.770</cell><cell>0.584</cell><cell>1.100</cell><cell>1.801</cell><cell>1.342</cell></row><row><cell>KI</cell><cell>0.602</cell><cell>0.203</cell><cell>0.776</cell><cell>0.602</cell><cell>1.042</cell><cell>1.720</cell><cell>1.312</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Results with custom validation for Tasks 2 and 3. The results are organized with the LFs (LF), all LLMs separately, the Knowledge Integration strategy (KI), and the three evaluated ensemble learning strategies (EL). All metrics are macro weighted.</figDesc><table><row><cell></cell><cell cols="3">Task 2</cell><cell cols="3">Task 3</cell></row><row><cell>feature-set</cell><cell>precision</cell><cell>recall</cell><cell>F1-Score</cell><cell>precision</cell><cell>recall</cell><cell>F1-Score</cell></row><row><cell>LF</cell><cell>38.788</cell><cell>32.025</cell><cell>31.999</cell><cell>36.837</cell><cell>57.090</cell><cell>42.527</cell></row><row><cell>BETO</cell><cell>52.360</cell><cell>42.170</cell><cell>43.607</cell><cell>53.073</cell><cell>48.725</cell><cell>50.538</cell></row><row><cell>MARIA</cell><cell>53.592</cell><cell>44.683</cell><cell>47.010</cell><cell>50.375</cell><cell>55.177</cell><cell>52.343</cell></row><row><cell>MDEBERTA</cell><cell>52.402</cell><cell>41.623</cell><cell>42.556</cell><cell>45.748</cell><cell>59.038</cell><cell>50.853</cell></row><row><cell>XLMTWITTER</cell><cell>52.636</cell><cell>43.477</cell><cell>45.960</cell><cell>52.535</cell><cell>50.167</cell><cell>50.969</cell></row><row><cell>KI</cell><cell>54.288</cell><cell>52.896</cell><cell>53.371</cell><cell>49.873</cell><cell>54.543</cell><cell>51.760</cell></row><row><cell>EL (HIGHEST)</cell><cell>53.906</cell><cell>40.815</cell><cell>42.943</cell><cell>33.973</cell><cell>81.242</cell><cell>47.466</cell></row><row><cell>EL (MEAN)</cell><cell>59.768</cell><cell>40.618</cell><cell>41.843</cell><cell>51.229</cell><cell>56.362</cell><cell>53.085</cell></row><row><cell>EL (MODE)</cell><cell>61.823</cell><cell>40.601</cell><cell>41.731</cell><cell>50.016</cell><cell>54.478</cell><cell>51.699</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8</head><label>8</label><figDesc>Official results for Task 1 with the Soft vs Soft scheme, including the Spanish and English results. Ranking is by runs.</figDesc><table><row><cell></cell><cell cols="2">Soft vs Soft ALL</cell><cell cols="2">Soft vs Soft ES</cell><cell cols="2">Soft vs Soft EN</cell></row><row><cell>Team</cell><cell>Rank</cell><cell>ICM-Soft</cell><cell>Rank</cell><cell>ICM-Soft</cell><cell>Rank</cell><cell>ICM-Soft</cell></row><row><cell>UMUTeam 1</cell><cell>15</cell><cell>0.6679</cell><cell>12</cell><cell>0.7510</cell><cell>16</cell><cell>0.5200</cell></row><row><cell>UMUTeam 2</cell><cell>19</cell><cell>0.50339</cell><cell>18</cell><cell>0.5720</cell><cell>20</cell><cell>0.3745</cell></row><row><cell>UMUTeam 3</cell><cell>32</cell><cell>-0.4170</cell><cell>30</cell><cell>-0.3691</cell><cell>29</cell><cell>-0.5207</cell></row><row><cell>baseline majority class</cell><cell>36</cell><cell>-2.3585</cell><cell>36</cell><cell>-2.5421</cell><cell>36</cell><cell>-2.1991</cell></row><row><cell>baseline minority class</cell><cell>40</cell><cell>-3.0717</cell><cell>37</cell><cell>-2.5742</cell><cell>40</cell><cell>-3.8158</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 9</head><label>9</label><figDesc>Official results for Task 2 with the Soft vs Soft scheme, including the Spanish and English results. Ranking is by runs.</figDesc><table><row><cell></cell><cell cols="2">Soft vs Soft ALL</cell><cell cols="2">Soft vs Soft ES</cell><cell cols="2">Soft vs Soft EN</cell></row><row><cell>Team</cell><cell>Rank</cell><cell>ICM-Soft</cell><cell>Rank</cell><cell>ICM-Soft</cell><cell>Rank</cell><cell>ICM-Soft</cell></row><row><cell>UMUTeam 1</cell><cell>8</cell><cell>-1.9569</cell><cell>7</cell><cell>-1.7667</cell><cell>10</cell><cell>-2.2517</cell></row><row><cell>UMUTeam 2</cell><cell>12</cell><cell>-2.0533</cell><cell>10</cell><cell>-1.8027</cell><cell>12</cell><cell>-2.4821</cell></row><row><cell>UMUTeam 3</cell><cell>19</cell><cell>-3.3189</cell><cell>19</cell><cell>-3.0432</cell><cell>20</cell><cell>-3.8940</cell></row><row><cell>baseline majority class</cell><cell>27</cell><cell>-5.4460</cell><cell>30</cell><cell>-5.6674</cell><cell>25</cell><cell>-5.2028</cell></row><row><cell>baseline minority class</cell><cell>35</cell><cell>-32.95527</cell><cell>35</cell><cell>-28.7093</cell><cell>35</cell><cell>-39.4948</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 10</head><label>10</label><figDesc>Official results for Task 3 with the Hard vs Hard scheme, including the Spanish and English results. Ranking is by runs.</figDesc><table><row><cell></cell><cell cols="2">Hard vs Hard ALL</cell><cell cols="2">Hard vs Hard ES</cell><cell cols="2">Hard vs Hard EN</cell></row><row><cell>Team</cell><cell>Rank</cell><cell>ICM-Hard</cell><cell>Rank</cell><cell>ICM-Hard</cell><cell>Rank</cell><cell>ICM-Hard</cell></row><row><cell>UMUTeam 1</cell><cell>20</cell><cell>-0.733</cell><cell>24</cell><cell>-0.69597</cell><cell>21</cell><cell>-0.8012</cell></row><row><cell>UMUTeam 2</cell><cell>24</cell><cell>-0.8719</cell><cell>26</cell><cell>-0.8245</cell><cell>25</cell><cell>-0.9513</cell></row><row><cell>UMUTeam 3</cell><cell>22</cell><cell>-0.7901</cell><cell>25</cell><cell>-0.81457</cell><cell>20</cell><cell>-2.4821</cell></row><row><cell>baseline majority class</cell><cell>30</cell><cell>-1.5984</cell><cell>30</cell><cell>-1.7269</cell><cell>28</cell><cell>-1.4563</cell></row><row><cell>baseline minority class</cell><cell>34</cell><cell>-3.1295</cell><cell>34</cell><cell>-3.3196</cell><cell>33</cell><cell>-2.9279</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://huggingface.co/openai/clip-vit-base-patch32</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by the projects LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF)-a way of making Europe; LT-SWM (TED2021-131167B-I00), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR; and "Services based on language technologies for political microtargeting" (22252/PDC/23), funded by the Autonomous Community of the Region of Murcia through the Regional Support Program for the Transfer and Valorization of Knowledge and Scientific Entrepreneurship of the Seneca Foundation, Science and Technology Agency of the Region of Murcia. Mr. Ronghao Pan is supported by the Programa Investigo grant, funded by the Region of Murcia, the Spanish Ministry of Labour and Social Economy and the European Union-NextGenerationEU under the "Plan de Recuperación, Transformación y Resiliencia (PRTR)".</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 11</head><p>Official results for Task 4, including only hard labels. We report the rank, the ICM-Hard metric, and the F1-Score of the YES label in parentheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hard-Hard ALL</head><p>Hard-Hard ES Hard-Hard EN of the baseline minority class (21st place).</p><p>The ICM-Hard values of our system (-1.1486, -1.0605, -1.2148) are negative and larger in magnitude than those of the baseline majority class, indicating a lower similarity to the true labels. However, they are significantly better than those of the baseline minority class (-2.0637, -2.0866, -2.0410).</p><p>In terms of F1_YES, our model is superior to both baselines in all scenarios, especially in the EN column (0.2805), indicating better precision and recall for the positive class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 12</head><p>Official results for Task 5, including only hard labels. We report the rank, the ICM-Hard metric, and the F1-Score of the YES label in parentheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hard-Hard ALL</head><p>Hard-Hard ES Hard-Hard EN</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of exist 2021: sexism identification in social networks</title>
		<author>
			<persName><forename type="first">Francisco</forename><surname>Rodríguez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorge</forename><surname>Carrillo-de-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laura</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julio</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miriam</forename><surname>Comet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Trinidad</forename><surname>Donoso</surname></persName>
		</author>
		<ptr target="http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6389" />
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page" from="195" to="207" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of exist 2022: sexism identification in social networks</title>
		<author>
			<persName><forename type="first">Francisco</forename><surname>Rodríguez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorge</forename><surname>Carrillo-de-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laura</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adrián</forename><surname>Mendieta-Aragón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guillermo</forename><surname>Marco-Remón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maryna</forename><surname>Makeienko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">María</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julio</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Damiano</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</author>
		<ptr target="http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6443" />
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="page" from="229" to="240" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of exist 2023: sexism identification in social networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-28241-6_68</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECIR&apos;23</title>
				<meeting>ECIR&apos;23</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="593" to="599" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Overview of exist 2024 -learning with disagreement for sexism identification and characterization in social networks and memes. experimental ir meets multilinguality, multimodality, and interaction</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</title>
				<meeting>the Fifteenth International Conference of the CLEF Association (CLEF 2024)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="316" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of exist 2024 -learning with disagreement for sexism identification and characterization in social networks and memes (extended overview)</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024-Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Evaluating feature combination strategies for hate-speech detection in spanish using linguistic features and transformers</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jiménez-Zafra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>García-Cumbreras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Complex &amp; Intelligent Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Detecting misogyny in spanish tweets. an approach based on linguistics features and word embeddings</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cánovas-García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.future.2020.08.032</idno>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">114</biblScope>
			<biblScope unit="page" from="506" to="518" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Umuteam at exist 2021: Sexist language identification based on linguistic features and transformers in spanish and english</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2943</biblScope>
			<biblScope unit="page" from="512" to="521" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Umuteam at exist 2022: Knowledge integration and ensemble learning for multilingual sexism identification and categorization using linguistic features and transformers</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jiménez-Zafra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum</title>
				<meeting>the Iberian Languages Evaluation Forum<address><addrLine>IberLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022. 2022</date>
			<biblScope unit="volume">3202</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
		<title level="m">Umuteam at exist 2023: Sexism identification and categorisation fine-tuning multilingual large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Leveraging zero and few-shot learning for enhanced model generality in hate speech detection in spanish and english</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">5004</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Revisiting sexism detection using psychological scales and adversarial samples</title>
		<author>
			<persName><forename type="first">M</forename><surname>Samory</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kohne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Flöck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wagner</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:235303716" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Web and Social Media</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Sexism prediction in spanish and english tweets using monolingual and multilingual bert and ensemble models</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F M</forename><surname>De Paula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Da Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">B</forename><surname>Schlicht</surname></persName>
		</author>
		<idno>ArXiv abs/2111.04551</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:237579065" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Sexism identification in tweets and gabs using deep neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kalra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zubiaga</surname></persName>
		</author>
		<idno>ArXiv abs/2111.03612</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:243832598" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Unveiling misogyny memes: A multimodal analysis of modality effects on identification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Naseem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Razzak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Salim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM on Web Conference 2024</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1864" to="1871" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SemEval-2023 task 10: Explainable detection of online sexism</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Vidgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Röttger</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.semeval-1.305</idno>
		<ptr target="https://aclanthology.org/2023.semeval-1.305" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Ojha</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Doğruöz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Tayyar Madabushi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Sartori</surname></persName>
		</editor>
		<meeting>the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2193" to="2210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Multimodal hate speech detection via multi-scale visual kernels and knowledge distillation architecture</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chhabra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Vishwakarma</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.engappai.2023.106991</idno>
		<ptr target="https://doi.org/10.1016/j.engappai.2023.106991" />
	</analytic>
	<monogr>
		<title level="j">Engineering Applications of Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">126</biblScope>
			<biblScope unit="page">106991</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">MISTRA: Misogyny detection through text-image fusion and representation analysis</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jindal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Kumaresan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ponnusamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thavareesan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rajiakodi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.nlp.2024.100073</idno>
		<ptr target="https://doi.org/10.1016/j.nlp.2024.100073" />
	</analytic>
	<monogr>
		<title level="j">Natural Language Processing Journal</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">100073</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Detection of abusive language: the problem of biased datasets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ruppenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kleinbauer</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:174799974" />
	</analytic>
	<monogr>
		<title level="m">North American Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Challenges in automated debiasing for toxic language detection</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Smith</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.274</idno>
		<ptr target="https://aclanthology.org/2021.eacl-main.274" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</editor>
		<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3143" to="3155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Automatic classification of sexism in social networks: An empirical study on Twitter data</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rodríguez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="219563" to="219576" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2021: Sexism identification in social networks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rodríguez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Comet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Donoso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page" from="195" to="207" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno>CoRR abs/2111.09543</idno>
		<ptr target="https://arxiv.org/abs/2111.09543" />
		<title level="m">DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">ALBETO and DistilBETO: Lightweight Spanish language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cañete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Donoso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bravo-Marquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Carvallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Araujo</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.lrec-1.457" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Béchet</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Blache</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Goggi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Isahara</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Mazo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Thirteenth Language Resources and Evaluation Conference, LREC 2022<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>European Language Resources Association</publisher>
			<date type="published" when="2022-06-25">20-25 June 2022</date>
			<biblScope unit="page" from="4291" to="4298" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">UMUTextStats: A linguistic feature extraction tool for Spanish</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Vivancos-Vicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="6035" to="6044" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1410</idno>
		<ptr target="https://doi.org/10.18653/v1/D19-1410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">November 3-7, 2019</date>
			<biblScope unit="page" from="3980" to="3990" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Evaluating extreme hierarchical multi-label classification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Delgado</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.399</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5809" to="5819" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
