DiTana-PV at sEXism Identification in Social neTworks (EXIST) Tasks 4 and 6: The Effect of Translation in Sexism Identification

Notebook for the sEXism Identification in Social neTworks (EXIST) Lab at CLEF 2024

Aitana Menárguez-Box1,*, Diego Torres-Bertomeu2,*
1 Pattern Recognition and Human Language Technology (PRHLT) Research Center, Spain
2 Valencian Research Institute for Artificial Intelligence (VRAIN), Spain

Abstract
This paper details the participation of the DiTana-PV team in the sEXism Identification in Social neTworks (EXIST) task at CLEF 2024. Specifically, we focused on Tasks 4 and 6, which involve identifying and categorizing sexism in memes. Our primary objective was to evaluate the effect of machine translation on model performance, as well as to explore data augmentation techniques and task-combination strategies. By translating Spanish data into English and leveraging a pretrained BERTweet model fine-tuned for sexism detection, we aimed to improve classification accuracy. This work highlights the potential of translation and data-handling techniques to enhance multilingual NLP tasks, contributing to more inclusive and effective AI applications in social media analysis.

Keywords
Sexism Identification, Data Augmentation through Machine Translation, Automatic Analysis of Memes, Pretrained Models Usage, BERTweet

1. Introduction
Sexism identification in social networks is an increasingly critical task given the proliferation of user-generated content that often contains harmful and discriminatory language. The Conference and Labs of the Evaluation Forum (CLEF) 2024 has organized the sEXism Identification in Social neTworks (EXIST) lab, which focuses on the automated detection and categorization of sexist content. This paper presents the efforts of the DiTana-PV team in addressing two specific tasks within this lab: Task 4 (sexism identification in memes) and Task 6 (categorization of sexism types in memes).

1.1.
Task Description
The proposed lab of CLEF 2024 was sEXism Identification in Social neTworks (EXIST) [1, 2, 3]. Among the different tasks proposed in EXIST, this paper details our team's (DiTana-PV) participation in Tasks 4 and 6: sexism identification and categorization in memes, respectively. Given an image (a meme), these tasks aim to classify it as sexist or not sexist, as well as to identify which kinds of sexism, if any, are present from the following: (i) IDEOLOGICAL AND INEQUALITY, (ii) STEREOTYPING AND DOMINANCE, (iii) OBJECTIFICATION, (iv) SEXUAL VIOLENCE and (v) MISOGYNY AND NON-SEXUAL VIOLENCE.

1.2. Data Distribution
The information provided includes the meme images and their transcriptions into text, along with some information about the annotators. We divided the dataset into three partitions: train, validation and test. Tab. 1 shows the sample distribution per language.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* These authors contributed equally.
amenbox@prhlt.upv.es (A. Menárguez-Box); dtorber@etsinf.upv.es (D. Torres-Bertomeu)
ORCID: 0009-0000-5957-0698 (A. Menárguez-Box); 0009-0009-2179-5942 (D. Torres-Bertomeu)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Table 1
Dataset sample distribution per partition. In addition to the number of samples, the table also shows the percentage over the language.

            train            val           test
English     1809 (71.70%)    201 (7.97%)   513 (20.33%)
Spanish     1830 (71.09%)    204 (7.93%)   540 (20.97%)

In Tab. 2 we can see how each class for Task 4 is distributed across both languages for the train and validation partitions. As we trained the models with the hard labels, there were some samples that half of the annotators labeled as sexist and the other half as not sexist (tie).
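Since we train on hard labels, a tie arises whenever the annotator votes split evenly between the two classes. A minimal sketch of how such ties can be flagged during label aggregation (the vote counts below are hypothetical, not taken from the dataset):

```python
from collections import Counter

def hard_label(votes):
    """Aggregate annotator votes into a hard label, flagging even splits as ties."""
    counts = Counter(votes)
    (top, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:  # half sexist, half not sexist
        return "tie"
    return top

# Hypothetical examples with six annotators per meme:
print(hard_label(["sexist"] * 4 + ["not sexist"] * 2))  # sexist
print(hard_label(["sexist"] * 3 + ["not sexist"] * 3))  # tie
```

Samples receiving the "tie" label are the ambiguous ones discussed next.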
After some experimentation, the best solution we found was to discard these ambiguous samples, treating them as noise.

Table 2
Dataset class distribution per language for the train and validation partitions. In addition to the number of samples, the table also shows the percentage over the language.

            sexist            not sexist       tie
English     965 (48.00%)      743 (36.96%)     302 (15.02%)
Spanish     1073 (52.75%)     639 (31.42%)     322 (15.83%)

1.3. Performance Measures
To measure the performance of our proposed systems, the Intermediate Concept Measure (ICM) [4] has been chosen as the official metric, although the F1-score is also provided. ICM is a similarity function that generalizes Pointwise Mutual Information (PMI) and can be used to evaluate system outputs in classification problems by computing their similarity to the ground-truth categories. These metrics were computed for the Hard-Hard evaluation, as we trained our models with hard labels.

2. Main Objectives of Experiments
Here we describe our main objectives and experimental findings for our participation in this competition. All these objectives focus on text processing, omitting image processing for the memes.

2.1. The Effect of Machine Translation
The main objective of participating in this competition was to develop models that could detect sexism in memes and categorize it. More specifically, we were interested in how Machine Translation could affect model performance. Since Machine Translation has advanced enormously and there are far more resources in English than in any other language, minority languages that lack the resources to train this kind of model could benefit from translating the content to English first and then using a model trained with all the data available in English.
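The translate-then-classify idea can be sketched with an off-the-shelf MT model. Note that this paper does not name the MT system used, so the Marian Spanish-to-English checkpoint below is an illustrative assumption, not our actual setup:

```python
from transformers import pipeline

# Illustrative assumption: Helsinki-NLP's Marian es->en checkpoint is one
# freely available option; the MT system actually used is not specified here.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

spanish_samples = ["Este es un ejemplo de transcripción de un meme."]
english_samples = [t["translation_text"] for t in translator(spanish_samples)]
# english_samples can now be fed to any English-only sexism classifier.
```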
This is an open line of research that we believe has great potential, because it could democratize the benefits of Machine Learning for all languages, regardless of their available resources.

2.2. The Power of Data
We also aimed to evaluate how Data Augmentation affects the performance of the models. It is a very widespread technique in the world of Machine Learning, especially in Computer Vision, but it has been shown that NLP tasks can also benefit from it. As the dataset was unbalanced (see Tab. 2), another of our objectives was to work with this type of dataset, in which not all classes have the same presence and a series of measures must therefore be taken to prevent the model from becoming biased.

2.3. Tasks Combination
The last of our objectives, especially useful for Task 6, was to study the effect of combining the inferences of both tasks. We wanted to help a model used for Task 6 with the predictions of a model used for Task 4. This will be further explained in the following sections.

3. Approaches Used
In this section we explain in detail the approaches that worked best for us.

3.1. Using Machine Translation
After some preliminary experiments we discovered that the results on the English dataset were consistently a few points higher than on the Spanish dataset, so we decided to automatically translate the Spanish samples into English, accepting the loss of quality that such automatic translation could entail. This approach was interesting because there are many resources in English that could improve our results.

3.2. The Base Model
For both tasks and languages (as we translated the Spanish samples to English), we used as pretrained model the BERTweet-large-sexism-detector [5] that was presented at SemEval-2023 Task 10. It is a fine-tuned version of BERTweet-large [6] on the Explainable Detection of Online Sexism (EDOS) dataset [7].
This model is intended for classifying tweets as sexist or not sexist. Even so, it met all our needs for the competition: the text in a meme is quite short, as in tweets, so using a model trained on short texts should be beneficial. It also gives us a great advantage, since we do not start from a model that has simply been pretrained on general tasks, but from one that has also been fine-tuned for tasks in the same domain as those of this competition.

3.3. Managing the Data
As seen in Tab. 2, there were two main problems with the hard labels: some of them were ambiguous, and the amounts of sexist and not sexist samples were unbalanced. We decided to discard the ambiguous samples, because taking them into account in the training process just added noise and led to worse performance. To address the class imbalance in Task 4, we applied a weighted-loss function to give more weight to the not sexist class. In some of the models, we also performed Data Augmentation to increase the amount of training data. This technique has yielded good results in the vast majority of Machine Learning tasks, and NLP is no exception [8]. In particular, we used BERT contextual embeddings to paraphrase the words in the original text. From each sample we generated three new augmented ones, and 30% of the words were substituted using the nlpaug library [9].

3.4. Combining Inferences
For Task 6, we trained the models to recognise six different labels for each meme: the five types of sexism plus an extra one for when no sexism is detected at all. As one of our purposes was to combine the information from both tasks, we also trained a model that predicts just the five main labels for the type of sexism inside the meme.
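Concretely, this combination amounts to gating the five-label categorizer with the Task 4 binary decision. A minimal sketch, where the two predictor callables are hypothetical stand-ins for the fine-tuned models and "NOT SEXIST" is a placeholder label for the non-sexist case:

```python
def combined_inference(meme_text, is_sexist, predict_types):
    """Two-stage inference: a Task 4 binary model gates a five-label
    Task 6 categorizer (both passed in as callables)."""
    if not is_sexist(meme_text):
        return ["NOT SEXIST"]  # placeholder label for the non-sexist case
    return predict_types(meme_text)

# Toy stand-ins for the two fine-tuned models (hypothetical behaviour):
is_sexist = lambda text: "example" not in text
predict_types = lambda text: ["STEREOTYPING AND DOMINANCE"]
print(combined_inference("some meme transcription", is_sexist, predict_types))
```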
We would first use our Task 4 inferences to detect whether the meme was sexist; if so, the second model would predict which type(s) of sexism the meme contained. This approach, as will be seen later in this paper, proved to work well.

4. Models
This section describes each of the six models that were submitted to the competition. All of them were fine-tuned for 10 epochs on an NVIDIA RTX 3090 with 24 GB. The relevant hyperparameters are: a batch size of 8 samples per device, and a learning rate of 5 · 10⁻⁵ with a linear scheduler.

4.1. Task 4 – Classification Models
For Task 4 we developed 3 different models following the approaches explained above. For all the scenarios described, a separate model was trained for each language. The first pair of models (M1𝑡4) was trained with all the samples from the train partition in English and in translated Spanish; for the English model we also added the validation partition in translated Spanish, and vice versa. The second pair (M2𝑡4) was trained exactly as M1𝑡4, but we added a weighted-loss function to correct the class imbalance explained above. The last pair (M3𝑡4) was trained as M2𝑡4, but enlarging the training dataset through Data Augmentation.

4.2. Task 6 – Categorization Models
For this task, we developed 5 different models, according to the approaches in the previous sections: a pair for predicting 6 labels, a pair for predicting 5 labels, and a single model for predicting six labels for both languages together. For the first two pairs, there is a model for each language (English and translated Spanish). For the first pair, M1𝑡6, the Spanish model was trained with translated Spanish samples and the English one with the English samples. The second pair, M2𝑡6, follows the same logic but makes five-label predictions.
The last one, M3𝑡6, is not a pair but a single model trained with both translated Spanish and English samples.

5. Analysis of the Results
In this section we discuss the results obtained by each of the models presented.

5.1. Task 4 – Classification Models
For the classification models, we can see that just translating the Spanish samples to English, combining all of the samples in the training process, and using as checkpoint the BERTweet model already fine-tuned on a sexism-detection task was enough to achieve a good result, although, as shown in Appendix A, it is lower than what was obtained during validation. Adding weights to the loss function to give more importance to the not sexist samples, as they appear fewer times, allowed us to slightly enhance the model's performance, as had already been observed during validation. When we applied Data Augmentation techniques we got worse results. This was surprising because during validation it improved the results for Spanish and, although it lowered the results for English, the gain in Spanish outweighed the loss in English. We hypothesize that Data Augmentation failed here because we had already inserted noise into the Spanish samples through Machine Translation, so the addition of more noise through paraphrasing with BERT was too much.

Table 3
Official final results (Task 4 Hard-Hard ALL) of the competition for the inferences presented to Task 4.

Model    Run ID              Rank    ICM-Hard    ICM-Hard Norm    F1_YES
M1𝑡4     DiTana-PV_1.json    18      0.0337      0.5171           0.6908
M2𝑡4     DiTana-PV_2.json    6       0.1150      0.5585           0.7122
M3𝑡4     DiTana-PV_3.json    10      0.0888      0.5451           0.7082

5.2. Task 6 – Categorization Models
For the categorization models in Task 6, the results indicate varied performance across the different approaches.
As shown in Table 4, the models trained to predict six labels for each language separately (M1𝑡6) achieved the highest performance in terms of the ICM-Hard and ICM-Hard Norm metrics, although their Macro F1 score was slightly lower than that of the M2𝑡6 models. The fact that the six-label M1𝑡6 models performed best on ICM-Hard and ICM-Hard Norm suggests that having a separate model for each language and focusing on the six distinct labels allowed the models to better capture the nuances of sexism categorization. Interestingly, the M2𝑡6 models, which were trained to predict five labels, did not outperform the first models, indicating that removing the not sexist label from the categorization task, together with the error possibly introduced by the Task 4 model (as the prediction was chained), might have led to a loss of valuable context needed for accurate categorization. The single model trained on both English and translated Spanish samples (M3𝑡6) performed the worst. This may be due to the increased complexity and noise introduced by combining data from two languages, even after translation. The challenges of handling translated text, which may not perfectly capture the original sentiment or nuances, likely contributed to the poorer performance.

Table 4
Official final results (Task 6 Hard-Hard ALL) of the competition for the three models presented to Task 6.

Model    Run ID              Rank    ICM-Hard    ICM-Hard Norm    Macro F1
M1𝑡6     DiTana-PV_1.json    1       -0.6996     0.3549           0.4319
M2𝑡6     DiTana-PV_2.json    2       -0.8450     0.3247           0.4430
M3𝑡6     DiTana-PV_3.json    9       -1.3691     0.2160           0.3255

6. Conclusions and Future Work
The results obtained from our participation in the sEXism Identification in Social neTworks (EXIST) competition demonstrate the potential impact of machine translation and data augmentation on improving model performance for sexism detection and categorization tasks.
Our experiments highlighted the benefits of translating minority-language datasets into English, leveraging the wealth of resources and pretrained models available in English to enhance performance. This approach showed significant improvements in classification accuracy, suggesting a promising direction for future research aimed at democratizing the benefits of machine learning across languages with fewer resources. The implementation of a weighted-loss function effectively addressed class imbalance, further improving model performance. However, data augmentation, while beneficial during validation, did not consistently enhance results in the final evaluation, indicating the need to carefully consider the noise introduced by such methods, especially when combined with machine-translated data. In Task 6, combining inferences from Task 4 to aid categorization proved to be a viable strategy, though the overall performance varied across model configurations. This underscores the complexity of multi-label classification tasks and the importance of tailored model-training approaches. Overall, this work highlights the value of leveraging translation and careful data-handling techniques to improve model accuracy in NLP tasks involving multiple languages. Future work could focus on further refining these methods, exploring visual-textual integration for meme analysis, and investigating more robust data augmentation strategies to mitigate noise. These contributions provide a foundation for advancing sexism detection in multilingual contexts, paving the way for more inclusive and effective AI applications in social network analysis.

Acknowledgement
This work is partially supported by the Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI).

References
[1] L. Plaza, J. Carrillo-de-Albornoz, E. Amigó, J. Gonzalo, R. Morante, P. Rosso, D. Spina, B. Chulvi, A. Maeso, V.
Ruiz, Exist 2024: sexism identification in social networks and memes, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 498–504.
[2] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[4] E. Amigó, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.
[5] A. Rydelek, D. Dementieva, G. Groh, AdamR at SemEval-2023 task 10: Solving the class imbalance problem in sexism detection with ensemble learning, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1371–1381. URL: https://aclanthology.org/2023.semeval-1.190.
doi:10.18653/v1/2023.semeval-1.190.
[6] D. Q. Nguyen, T. Vu, A. Tuan Nguyen, BERTweet: A pre-trained language model for English tweets, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 9–14. URL: https://aclanthology.org/2020.emnlp-demos.2. doi:10.18653/v1/2020.emnlp-demos.2.
[7] H. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 task 10: Explainable detection of online sexism, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2193–2210. URL: https://aclanthology.org/2023.semeval-1.305. doi:10.18653/v1/2023.semeval-1.305.
[8] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A survey of data augmentation approaches for NLP, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 968–988. URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.
[9] E. Ma, NLP augmentation, https://github.com/makcedward/nlpaug, 2019.
[10] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

A. Validation Results
This appendix illustrates the results that each of the models obtained on our validation partition of the dataset for each task, split by language.

A.1. Task 4 – Classification Models
As we can see in Tab. 5, the starting point was not brilliant, although we could improve our results on the Spanish split by four points just by translating the samples to English and using the BERTweet model fine-tuned for sexism detection. The weighted-loss function gained some further points, and Data Augmentation also enhanced the model's performance. In Tab. 6 we can see that, comparing just the baselines using BERT and BETO, the English partition obtains better performance. As on the Spanish validation split, the combination of the English and translated Spanish datasets together with the BERTweet-large-sexism-detector checkpoint already obtained good results. Adding the weighted loss improved them, as happened with the Spanish dataset. In this case, unlike what happened with the Spanish dataset, Data Augmentation did not improve the model's performance.

Table 5
Results for Task 4 on the Spanish partition of our validation set. ¹The baseline model is obtained by fine-tuning the BETO [10] pretrained model with the Spanish samples without translation.

Model        ICM-Hard    ICM-Hard Norm    F1_YES
Baseline¹    0.069       0.535            0.687
M1𝑡4         0.174       0.588            0.727
M2𝑡4         0.222       0.613            0.744
M3𝑡4         0.260       0.632            0.756

Table 6
Results for Task 4 on the English partition of our validation set. ²The baseline model is obtained by fine-tuning the BERT [11] pretrained model with the English samples.

Model        ICM-Hard    ICM-Hard Norm    F1_YES
Baseline²    0.148       0.575            0.719
M1𝑡4         0.343       0.674            0.784
M2𝑡4         0.381       0.694            0.797
M3𝑡4         0.374       0.690            0.795

A.2. Task 6 – Categorization Models
Model M1𝑡6, which was trained to predict six labels separately for each language, achieved moderate performance.
However, its F-Measure indicates room for improvement, suggesting that the model may struggle with certain categories or nuances of sexism categorization. The model trained to predict five labels, M2𝑡6, exhibited lower performance across all metrics compared to M1𝑡6. This could be attributed to the removal of the not sexist label from the categorization task, as well as to mixing its predictions with those of the previous model, potentially leading to a loss of valuable context for accurate classification. The results can be seen in Tab. 7. For the English partition, as shown in Tab. 8, the first model achieved moderate performance, with metrics comparable to those of its counterpart on the Spanish partition; its ICM-Hard and ICM-Hard Norm scores indicate reasonable categorization accuracy. The single model trained on both English and translated Spanish samples demonstrated the best performance among the three models in both languages. Despite the differences in validation-data composition, this model achieved significantly higher ICM-Hard and ICM-Hard Norm scores, indicating better categorization accuracy and similarity to the ground-truth labels. However, it is important to note that the validation data for M3𝑡6 differed from that of M1𝑡6 and M2𝑡6, incorporating both English and translated Spanish samples. As such, the higher performance of M3𝑡6 cannot be directly compared with that of the other models.

Table 7
Results for Task 6 on the Spanish partition of our validation set. ³The results for M3𝑡6 are not comparable with the others in the table, as the validation data is different (it contains English together with translated Spanish samples).

Model       ICM-Hard    ICM-Hard Norm    F-Measure
M1𝑡6        -1.431      0.220            0.292
M2𝑡6        -1.750      0.143            0.286
M3𝑡6³       -0.667      0.370            0.446

Table 8
Results for Task 6 on the English partition of our validation set.
⁴The results for M3𝑡6 are not comparable with the others in the table, as the validation data is different (it contains English together with translated Spanish samples).

Model       ICM-Hard    ICM-Hard Norm    F-Measure
M1𝑡6        -0.511      0.394            0.417
M2𝑡6        -1.450      0.188            0.277
M3𝑡6⁴       1.762       0.867            0.841