<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2023.semeval-1.305</article-id>
      <title-group>
        <article-title>UMUTeam at EXIST 2024: Multi-modal Identification and Categorization of Sexism by Feature Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ronghao Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Antonio García-Díaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomás Bernal-Beltrán</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Valencia-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Informática, Universidad de Murcia, Campus de Espinardo</institution>
          ,
          <addr-line>30100</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2943</volume>
      <fpage>09</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The fourth edition of the EXIST shared task addresses the multimodal identification and categorization of sexism. This edition takes place as a lab at CLEF 2024. The main innovation in this edition with respect to 2023 is the addition of a multimodal task based on sexism identification and categorization in memes. As before, this shared task comprises sexist documents written in English and Spanish. For the textual tasks, we rely on the integration of features from several Large Language Models and linguistic features extracted with our custom tool. We rank 15th in Sexism Identification, 8th in Source Intention, and 20th in Sexism Categorization. For the multimodal tasks, we rely on the CLIP model to extract text and image embeddings, which we then combine by diagonal multiplication to obtain the classification models. We rank 33rd in Sexism Identification, and 18th in both Source Intention and Sexism Categorization.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism identification</kwd>
        <kwd>sexism categorization</kwd>
        <kwd>source intention detection</kwd>
        <kwd>multimodal feature engineering</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social networks have become central platforms for social activism and complaints, allowing movements
like #MeToo, #8M, and #Time’sUp to spread rapidly. Through these channels, women around the
world have shared their real-life experiences of abuse, discrimination, and other forms of sexism.
However, social networks also facilitate the spread of sexist, disrespectful, and hateful behavior. In this
context, the development of automated tools is essential. These tools can help detect and warn against
sexist behavior and discourse, estimate the frequency of sexist and abusive situations on social media,
identify the most common forms of sexism, and analyze how sexism is expressed on these platforms.
Traditional systems for detecting sexism frequently depend on predefined labels and fixed perspectives,
which can miss the complexity and subjectivity of sexist statements. The challenge in identifying
and addressing sexism stems from its inherently subjective assessment. In contrast, perspectivism
offers a promising method for enhancing detection by incorporating diverse opinions and viewpoints.
An important contribution to this field, which aims to address the problem of sexism identification
within the paradigm of learning with disagreement, has been made by the project EXIST 2024: sEXism
Identification in Social neTworks [1, 2, 3].</p>
      <p>Here we describe our participation in the fourth edition of EXIST [4, 5]. The first three editions of
EXIST focused only on the detection and classification of sexist text messages; the 2024 edition
introduces tasks that focus on images, specifically memes. Memes, generally humorous images, spread
rapidly through social networks and the Internet. They can therefore encompass a wider range of
sexist manifestations on social networks, especially those disguised as humor. Accordingly, this shared
task involves the development of automated multimodal tools capable of detecting sexism in both texts
and memes, with which the organizers aim to cover a broader spectrum of sexism on social networks.</p>
      <p>
        As a reminder, the tasks in the last edition of EXIST were three challenges, namely, (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) sexism
identification, a binary classification task in which participants had to determine whether a tweet
was sexist or not; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) source intention, a multiclass classification task focused on determining whether the
author’s intention was to post a sexist message, to report a sexist situation, or to make a judgment; and
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) sexism categorization, a multi-label classification task focused on identifying sexist characteristics.
It should be noted that in EXIST 2024, the organizers are retaining the learning with disagreements
paradigm proposed in 2023.
      </p>
      <p>Our research group has experience in the detection of hate speech [6] and misogyny [7] through the
compilation and evaluation of several corpora. We have focused mainly on Spanish-language data; our
work with English-language data is more limited, restricted to participation in the previous
editions of EXIST [8, 9, 10] and the evaluation of zero- and few-shot learning strategies on some existing
English datasets focused on hate speech detection [11].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Sexism refers to any abuse or negative sentiment directed at women based on their gender, or their
gender in combination with other identity attributes. In particular, sexism is a growing problem on the
Internet, with detrimental effects on women and other marginalized groups. It makes online spaces less
accessible and less welcoming, perpetuating asymmetries and social injustices.</p>
      <p>Automated tools are already widely used to identify and rate sexist content online. Researchers
have proposed several approaches to address this problem, ranging from rule-based methods [12]
to more complex models based on deep learning and pre-trained language models with the
Transformer architecture from a linguistic perspective [13, 14], with only a few attempts to address the
problem from a visual or multimodal perspective. One such work is described in [15], where the authors
study about 6,000 misogynistic memes using a deep learning model that determines which modality
plays a more significant role. The dataset included various characteristics such as hate speech, sexism,
or cyberbullying, among others. The authors found that all modalities were useful for identifying
misogyny, with text playing a significant role.</p>
      <p>Sexism can be further categorized into different forms depending on the author’s intent or the type
of sexism. All editions of EXIST use categories such as “Ideology and Inequality”, “Stereotyping and
Dominance”, “Objectification”, “Sexual Violence”, and “Misogyny and Non-Sexual Violence”. This is
similar to SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS) [16], which defined
a taxonomy for the more explainable classification of sexism in three hierarchical levels: binary sexism
detection, category of sexism, and fine-grained vector of sexism. In all editions of EXIST, sexism is also
considered at three levels: binary sexism detection in tweets, multiclass detection for source intention
in tweets, and category of sexism in tweets.</p>
      <p>Although the problem of sexism is primarily approached from a textual perspective, a field of research
has emerged that approaches the problem from a visual or multimodal perspective, as the multimodal
approach has shown superior performance to single modality approaches in detecting hate speech and
misogyny, such as [17] and [18]. Sexist content can also appear in images or in multimodal forms. From
a visual perspective, most research focuses on detecting offensive, non-conforming, or pornographic
content. For this reason, this edition of EXIST includes a new task focused on detecting sexism in
memes.</p>
      <p>Currently, one of the problems that needs to be addressed in sexism detection is the presence of biases
that could affect the actual performance of the models. Studies such as [19] and [20] have addressed
this issue by using textual data to detect sexism. However, many contributions of previous sexism
detection studies have not considered the complexity and multiple perspectives surrounding sexism, as
sexism can manifest itself in multiple ways influenced by cultural, social, and individual factors.</p>
      <p>Perspectivism in sexism detection involves considering multiple cultural, social, and individual
viewpoints and contexts when analyzing potentially sexist content. It is essential to better understand
the different manifestations of sexism, avoid bias, and improve the accuracy of detection models. It
promotes equity by recognizing the different ways in which sexism can manifest itself, thus contributing
to systems that are fairer and more sensitive to cultural and social diversity. For example, the authors
of [14] examine the errors made by classification models and discuss the difficulty of automatically
classifying sexism due to the subjectivity of labels and the complexity of the natural language used in
social networks.</p>
      <p>Thus, this edition of EXIST has taken a similar approach to the 2023 edition, adopting the Learning
with Disagreement (LeWiDi) paradigm for both dataset development and system evaluation. Unlike
traditional methods that rely on a single “correct” label for each example, LeWiDi trains models to
process and learn from conflicting or diverse annotations. This allows the system to incorporate different
annotator perspectives, biases, and interpretations, resulting in a more equitable learning process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The textual challenges in EXIST 2024 followed the same strategies as previous editions [21, 22]; that is,
they include tweets written in Spanish or English crawled with certain keywords that are commonly used
to undervalue the role of women. In fact, the textual dataset is the same as in 2023.</p>
      <p>For the multimodal dataset, the organizers of EXIST 2024 defined a lexicon of 250 terms and phrases
that lead to sexist memes, derived from those that have proven effective in identifying sexism in previous
editions of EXIST. This set includes 112 terms in English and 138 in Spanish, covering diverse topics and
including terms with varying degrees of use in both sexist and non-sexist contexts, all centered around
women. These terms were used as search queries on Google Images to obtain the top 100 images per
term. Rigorous manual cleaning procedures were applied to define memes and remove noise such as
textless images, text-only images, advertisements, and duplicates. This resulted in a final set of over
3,000 memes per language. Given the heterogeneous proportion of memes per term, we discarded
the most unbalanced seeds and ensured that each seed had at least five memes. The final dataset was
curated to achieve a balanced distribution of memes per seed. To avoid selection bias, memes were
randomly selected while maintaining the appropriate distribution per seed. This process resulted in
over 2,000 memes per language for the training set and over 500 memes per language for the test set.</p>
      <p>For the annotation process, the organizers considered two socio-demographic traits: gender and age
range. Each meme was then annotated by six crowdsourced annotators selected through the Prolific
platform, following guidelines developed by two experts in gender issues. The authors also provided the
annotators’ education level, ethnicity, and country of residence. The idea is to reduce the labeling bias that
may arise from cultural differences among annotators.</p>
      <p>In addition, following the Learning with Disagreements paradigm removes the assumption that items
have a single, unambiguous interpretation in a given context. Therefore, the dataset does not have a
single “gold” annotation, but participants can use the full range of annotations from all annotators.
This allows us to capture the diversity and develop more robust systems. The organizers release all
annotations from all annotators to the participants.</p>
      <p>The custom validation split is created using stratification by label in an 80-20 ratio.</p>
      <p>First, since we have all the annotations, we decided to train our models for Task 1 (sexism
identification) as a regression task rather than a classification task; thus, a text is considered sexist if the
regression model outputs a score greater than 2.5. The soft labels are the output of the regression model
normalized to a range of 1–10. Note that this approach was also used by our team in the previous
edition of EXIST. The hard labels of the textual modality can be found in Table 1.</p>
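      <p>The regression-to-label mapping can be sketched as follows. This is a minimal illustration: the vote aggregation and the linear 1–10 scaling are assumptions made for the example, not our exact trained pipeline.</p>

```python
def regression_score(annotations: list[str]) -> float:
    """Aggregate annotator votes into a score on a 1-10 scale.

    Illustrative assumption: the fraction of "YES" votes is linearly
    rescaled to [1, 10]; in the real system a trained regressor
    produces this score.
    """
    yes = sum(a == "YES" for a in annotations)
    return 1.0 + 9.0 * yes / len(annotations)


def hard_label(score: float, threshold: float = 2.5) -> str:
    """A text is considered sexist if the regression output exceeds 2.5."""
    return "YES" if score > threshold else "NO"


# One dissenting annotator out of six stays at the 2.5 threshold...
print(hard_label(regression_score(["YES", "NO", "NO", "NO", "NO", "NO"])))  # NO
# ...while two of six already cross it.
print(hard_label(regression_score(["YES", "YES", "NO", "NO", "NO", "NO"])))  # YES
```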
      <p>
        Second, Table 2 shows the label distribution for Tasks 2 and 5. The labels are: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) direct, if the intent
of the document is to write a message that is itself sexist or incites sexism; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) reported, if the
intent is to report and share a sexist situation; and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) judgmental, if the intent is to make a judgment about a sexist situation.
      </p>
      <p>
        Third, Table 3 shows the label distribution for Tasks 3 and 6. The labels are: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) ideological and
inequality, if the message downplays feminism or equality between men and women; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) stereotyping
and dominance, if the message contains stereotypes about social roles; and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) objectification, if the
message refers to physical characteristics related to beauty standards or hyper-sexualization. As can be
seen, we extracted a split from the provided dataset for individual validation (in an 80-20 ratio) using
label stratification. For Task 2, there is a significant imbalance between the labels: direct sexism is
the label with the most examples, while judgmental and reported have a similar number of examples.
For Task 3, the dataset is more balanced across categories. We have included the number of
unknown responses in the textual modality.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section explains our methodology for Tasks 1, 2, and 3 using textual modalities (see Section 4.1),
and for Tasks 4, 5, and 6 using multimodal approaches (see Section 4.2).</p>
      <sec id="sec-4-1">
        <title>4.1. Textual modalities</title>
        <p>For Tasks 1, 2, and 3 we decided to keep our previous strategy of training both languages (Spanish
and English) together in order to reduce the carbon footprint, cutting the number of models to four.
Since we are evaluating feature integration strategies, we decided to keep two Spanish LLMs, BETO
and MarIA, and two multilingual LLMs, DeBERTa v3 [23] and XLM-Twitter. In this sense, we removed
from our pipeline multilingual BERT, XLM, BERTIN, DistilBETO, and ALBETO [24]. We also extracted
the linguistic features (LFs) using the UMUTextStats tool [25].</p>
        <p>To fine-tune each LLM for each task, we performed hyperparameter tuning on 10 models. This
involved testing different learning rates, varying the number of epochs from 1 to 5, experimenting with
batch sizes of 8 and 16, and adjusting warm-up steps and weight decay to optimize the learning rate
during the initial training phases. The results of this tuning process are detailed in Table 4; the selected
learning rates range from 1.4e-05 to 4.7e-05.</p>
        <p>Next, we use SentenceBERT [26] to extract the [CLS] classification token for each tweet, LLM, and
task (1, 2, and 3). Now that each document is represented by a unique fixed-length vector, we can merge
it with the LFs using different feature integration strategies. The first strategy, known as Knowledge
Integration (KI), uses the feature vectors to train a new multi-input neural network. The second strategy
is based on Ensemble Learning (EL). For this, we train separate simple neural networks for each LLM
and for the LFs, and combine the results by averaging probabilities and outputs.</p>
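      <p>As an illustration of the KI strategy, the sketch below concatenates fixed-length [CLS] vectors from the four LLMs with the LF vector and runs a forward pass through a small network. All dimensions and weights are placeholders, not our tuned architectures.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed-length [CLS] sentence vectors from each LLM (768 dims assumed)
# for BETO, MarIA, mDeBERTa, and XLM-Twitter, plus the UMUTextStats
# linguistic-feature vector (size assumed for the example).
cls_vectors = [rng.normal(size=768) for _ in range(4)]
lf = rng.normal(size=365)

# Knowledge Integration (KI): merge all feature vectors into a single
# input for a new neural network (a two-layer brick-shaped MLP here).
x = np.concatenate(cls_vectors + [lf])

W1 = rng.normal(size=(x.size, 64)) * 0.01; b1 = np.zeros(64)
W2 = rng.normal(size=(64, 64)) * 0.01;     b2 = np.zeros(64)
Wo = rng.normal(size=(64, 2)) * 0.01;      bo = np.zeros(2)

h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer 1 (64 units)
h = np.maximum(h @ W2 + b2, 0.0)  # ReLU hidden layer 2 (64 units)
logits = h @ Wo + bo              # one logit per output class

print(x.shape)       # (3437,)
print(logits.shape)  # (2,)
```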
        <p>The training of the KI and the individual models is shown in Table 5. As expected, most
architectures are shallow (one or two hidden layers) and brick-shaped (same number of neurons in each
layer). This is expected because the vectors already capture the meaning of the sentences and are tuned
to the output tasks. However, the KI for Tasks 1 and 3 resulted in deep neural networks with few
neurons in each layer, with a funnel shape in Task 1 and a triangle shape in Task 3. Comparing
these results with those obtained in the previous edition, it draws our attention that for Task 3 we
obtained the best results in 2023 with deep neural networks and complex shapes.</p>
        <p>
          For the EL, we evaluate different strategies: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) mode of predictions, (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) averaging of probabilities,
and (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) obtaining the highest probability. The only exception is Task 1, where we only averaged the
predictions of the model, since we considered it to be a regression task.
        </p>
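        <p>The three combination strategies can be sketched with toy probabilities. The values below are invented to show that the strategies may disagree on the same document.</p>

```python
import numpy as np

# Per-model class probabilities for one document (rows: the four LLMs
# plus the LF model; columns: the classes). Values are illustrative.
probs = np.array([
    [0.70, 0.30],
    [0.60, 0.40],
    [0.05, 0.95],   # one very confident model
    [0.55, 0.45],
    [0.48, 0.52],
])

# (1) Mode of predictions: majority vote over per-model hard labels.
mode_label = int(np.bincount(probs.argmax(axis=1)).argmax())

# (2) Averaging of probabilities: argmax of the mean distribution.
mean_label = int(probs.mean(axis=0).argmax())

# (3) Highest probability: the single most confident model decides.
highest_label = int(probs.argmax() % probs.shape[1])

print(mode_label, mean_label, highest_label)  # 0 1 1
```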
        <p>Now we report our results with our custom validation split. We handled Task 1 (see Table 6) as a
regression task in order to account for the disagreement between the annotators. Therefore, we
report Explained Variance (EV), Root Mean Squared Logarithmic Error (RMSLE), Pearson R, R
Square (R2), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error
(RMSE). The best results were obtained using the KI strategy.</p>
        <p>For Tasks 2 and 3, the results with the custom validation split are shown in Table 7. The models
are scored and ranked using the macro-averaged precision, recall, and F1-Score. Similar to the previous
edition, our team achieved its best results for Task 2 with KI, which obtains the best recall and F1-Score,
while the best precision is obtained with an EL based on the mode. In the case of Task 3, the EL
strategies achieve better results, with an F1-Score of 53.085% when averaging probabilities and the
best recall of 81.242% with the highest probability.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Multi modalities</title>
        <p>As shown in Figure 1, during training with the visual input (meme) and the text input (textual
content of the meme), the text embedding vector and the image embedding vector perform diagonal
alignment. This means that the dot product between the text embedding vector and its corresponding
image embedding vector is high when they are related (i.e., for correct pairs) and low when they are not
related (i.e., for incorrect pairs). This is possible because the CLIP model trains and processes the text
and images in the same shared feature space. CLIP is a model developed by OpenAI that can efficiently
understand and associate images and text. This means that the model learns to correctly associate a
descriptive text with its corresponding image and to distinguish it from other unrelated images. Each
component of the architecture is described in detail below:
• Image Encoder (CLIP image encoder). The meme image is passed through the CLIP image
encoder, which extracts a series of embeddings {I1, I2, I3, ..., IN}. These embeddings represent the
visual properties of the image.
1https://huggingface.co/openai/clip-vit-base-patch32</p>
        <p>• Text Encoder (CLIP text encoder). The text of the meme is fed into the CLIP text encoder,
which produces a series of embeddings {T1, T2, T3, ..., TN}. These embeddings capture the semantic
features of the text.
• Diagonal multiplication. The image and text embeddings are combined using a diagonal
multiplication operation. This involves multiplying each text embedding Ti with the corresponding
image embedding Ii to create a new set of combined embeddings {T1I1, T2I2, T3I3, ..., TNIN}.
• Classification head. The combined embeddings are passed through a classification neural
network consisting of the following components:
– Dropout Layer. A layer with a dropout rate of 0.1 to prevent overfitting.
– Linear Layer. A linear layer that reduces the dimensionality to N features, where N is
the combined embedding length.
– Activation Layer (ReLU). A ReLU activation layer to introduce nonlinearities.
– Dropout Layer. Another layer with a dropout rate of 0.1 to prevent overfitting.
– Final Linear Layer. A linear layer that reduces the output to 2 neurons, corresponding to
the output classes (sexist or non-sexist).</p>
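        <p>A minimal forward pass of this fusion can be sketched as follows. The embeddings are random stand-ins for the CLIP encoder outputs, N = 512 assumes the ViT-B/32 projection size, and the weights are untrained placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512  # assumed CLIP (ViT-B/32) projection dimension

# Stand-ins for the CLIP text embedding T and image embedding I of one meme.
t = rng.normal(size=N)
i = rng.normal(size=N)

# Diagonal multiplication: element-wise product of the paired embeddings.
combined = t * i


def dropout(x: np.ndarray, rate: float = 0.1, train: bool = False) -> np.ndarray:
    """Dropout with rate 0.1; identity at inference time."""
    if not train:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


# Classification head: dropout -> linear(N) -> ReLU -> dropout -> linear(2).
W1 = rng.normal(size=(N, N)) * 0.01; b1 = np.zeros(N)
W2 = rng.normal(size=(N, 2)) * 0.01; b2 = np.zeros(2)

h = dropout(combined)
h = np.maximum(h @ W1 + b1, 0.0)   # ReLU activation
h = dropout(h)
logits = h @ W2 + b2               # sexist vs. non-sexist

print(combined.shape, logits.shape)  # (512,) (2,)
```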
        <p>For Tasks 5 and 6 we also assigned the most repeated label among the annotators; in case of a tie,
female annotators are given more weight. In addition, we have used the same approach of using the CLIP
model to obtain text and image embeddings and then, through a diagonal multiplication, obtaining a
combined embedding that serves as the input of a classification neural network.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>From an evaluation metrics perspective, organizers use two types of evaluation based on the learning
with disagreement paradigm [28].</p>
      <p>• Hard-Hard. This is a comparison between “hard” system outputs (final labels) and “hard” ground
truth. The Information Contrast Measure (ICM) metric is used to measure the similarity to the
ground truth categories. The F1-Score is also reported, although it is not ideal in this context
because it does not take into account the relationships between labels.
• Soft-Soft. This is a comparison between “soft” system outputs (probabilities) and “soft” ground
truths. In this case, the ICM-soft metric (an extension of ICM) is used as the official metric.</p>
      <sec id="sec-5-1">
        <title>5.1. Textual modality</title>
        <p>We sent three runs for Tasks 1, 2, and 3 based on the results on the custom validation split. For Task 1,
our runs are KI, EL, and LFs. Our results for Task 1 are described in Table 8. We ranked 15th, 19th, and
32nd in the soft-soft scheme for each run, respectively. As expected, our results were better in Spanish
than in English, explained by the use of two Spanish LLMs, BETO and MarIA. The model based on LFs
outperformed the proposed baselines.</p>
        <p>The official results for Task 2 are shown in Table 9. The first run is based on KI, while the second
and third runs are based on MarIA and multilingual DeBERTA. This is the task in which we obtained
our best results, ranking 8th (7th in Spanish and 10th in English) with the KI strategy, totaling 38
submissions among all teams. It is worth noting that MarIA outperformed MDeBERTA in both Spanish
and English.</p>
        <p>Next, the results for Task 3 are reported in Table 10. In this case, we focus on the hard vs. hard
scheme, as we do not calculate probabilities. The first run is based on EL averaging the probabilities,
the second run is based on KI, and the third run is based on EL based on mode. We got our best results
with the first run, except for the English results, where we got better results with EL based on mode.
Since we compete in a Hard vs. Hard scheme, the ranking also includes the Macro F1-Score, which is
49.42% with the first run, 47.38% with the second run, and 48.21% with the third run.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multimodal</title>
        <p>Table 11 shows the official results for Task 4, which evaluates hard labels only. The UMUTeam
consistently ranks around 33rd-35th in all evaluations, both overall (ALL) and language-specific (English
and Spanish). We have exceeded the two baselines suggested by the organizers, ranking 33rd with -0.2422
on ICM-Hard and 0.6963 on F1_YES (the F1-Score of the sexist label).</p>
        <p>Although the ICM-Hard metrics are negative, indicating space for improvement in similarity to ground
truth, we present less negative values compared to the baselines. This means that our predictions
are closer to the true labels. If we look only at the F1-Score of the model, our system would be in a
better position. However, in terms of similarity to the ground truth (ICM-Hard), we have a lot of room
for improvement. This may be due to the fact that our approach does not have a final softmax layer
indicating the probability of each label, since we used the model logits directly for the prediction. It is
also possible that it is due to a lack of additional adjustments and optimizations. In this case, we only
evaluated a 2e-5 learning rate, a training batch size of 16, 20 epochs, and an epoch-based validation
strategy.</p>
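        <p>For reference, turning the raw logits into per-label probabilities would amount to a single softmax layer, as sketched below.</p>

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: maps raw logits to probabilities."""
    z = logits - logits.max()   # shift by the max for stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

# Two-class example: logits for (sexist, non-sexist).
p = softmax(np.array([2.0, 0.5]))
print(p.round(3))  # [0.818 0.182]
```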
        <p>Table 12 provides the official results for Task 5, evaluating only hard labels. In summary, we
achieved 18th place in all evaluations, slightly behind the majority-class baseline (17th place) and ahead
of the minority-class baseline (21st place).</p>
        <p>Regarding the ICM-Hard values of our system (-1.1486, -1.0605, -1.2148), they are negative and higher
in magnitude than those of the majority-class baseline, indicating a lower similarity to the true labels.
However, they are significantly better than those of the minority-class baseline (-2.0637, -2.0866, -2.0410).</p>
        <p>In terms of F1_YES values, our model is superior to both baselines in all scenarios, especially in the
second ES column (0.2805), indicating better precision and recall in the positive class.</p>
        <p>Table 13 shows the official results for Task 6, which is a multi-label classification problem where only
hard labels are evaluated. Our system achieved several ranks depending on the scenario with a total
of 36 submissions among all teams: 18th in Hard-Hard ALL, 21st in the first column of Hard-Hard ES,
and 12th in the second column of Hard-Hard ES. In comparison, the model performs better than the
baselines in most scenarios, especially in the second Hard-Hard ES set.</p>
        <p>The ICM-Hard values obtained (-1.9511, -2.3853, -1.5817) are negative, but generally better than those
of the baselines, indicating a higher similarity to the true labels. Our system also clearly outperforms the
minority-class baseline in all scenarios and performs better than the majority-class baseline in terms of ICM-Hard.</p>
        <p>With respect to Macro F1-Score, we obtained significantly higher values than the baselines, especially
in the Hard-Hard ES evaluation, indicating a better balance between precision and recall and an overall
balanced performance in all classes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and further work</title>
      <p>This document describes UMUTeam’s proposal for EXIST 2024, which focuses on the identification
and categorization of sexism in Spanish and English. This is a very interesting competition, as it
deals with the learning with disagreements paradigm and binary and multiclass classification tasks, as
well as a new challenge based on multimodal features. For the textual tasks, we based our proposal on the
integration of linguistic features with sentence embeddings extracted from four LLMs, including Spanish
and multilingual variants. We achieved our best result, 8th place, in Task 2, the source intention task.</p>
      <p>As in previous editions, we are satisfied with our participation, since we achieved competitive results
in all tasks, outperforming the proposed baselines. However, in this edition, we only considered the
different annotation schemes in the binary classification task, since we treated it as a regression problem.
We are nevertheless satisfied that we were able to participate in all multimodal tasks, incorporating
image processing techniques into our pipeline.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been supported by projects LaTe4PoliticES (PID2022-138099OB-I00) funded by
MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF)-a way of making
Europe, LT-SWM (TED2021-131167B-I00) funded by MICIU/AEI/10.13039/ 501100011033 and by the
European Union NextGenerationEU/PRTR, "Services based on language technologies for political
microtargeting" (22252/PDC/23) funded by the Autonomous Community of the Region of Murcia through the
Regional Support Program for the Transfer and Valorization of Knowledge and Scientific
Entrepreneurship of the Seneca Foundation, Science and Technology Agency of the Region of Murcia. Mr. Ronghao
Pan is supported by the Programa Investigo grant, funded by the Region of Murcia, the Spanish
Ministry of Labour and Social Economy and the European Union - NextGenerationEU under the "Plan de
Recuperación, Transformación y Resiliencia (PRTR)".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>F.</given-names> <surname>Rodríguez-Sánchez</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Comet</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Donoso</surname></string-name>,
          <article-title>Overview of EXIST 2021: sexism identification in social networks</article-title>,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (<year>2021</year>)
          <fpage>195</fpage>-<lpage>207</lpage>.
          URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6389.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>F.</given-names> <surname>Rodríguez-Sánchez</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mendieta-Aragón</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Marco-Remón</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Makeienko</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>Overview of EXIST 2022: sexism identification in social networks</article-title>,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (<year>2022</year>)
          <fpage>229</fpage>-<lpage>240</lpage>.
          URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6443.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2023: sexism identification in social networks</article-title>,
          <source>in: Proceedings of ECIR'23</source>,
          <year>2023</year>,
          pp. <fpage>593</fpage>-<lpage>599</lpage>.
          doi: 10.1007/978-3-031-28241-6_68.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with disagreement for sexism identification and characterization in social networks and memes</article-title>,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>,
          Springer,
          <year>2024</year>,
          pp. <fpage>316</fpage>-<lpage>342</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with disagreement for sexism identification and characterization in social networks and memes (extended overview)</article-title>,
          <source>in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>,
          Springer,
          <year>2024</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>