Multilingual Sexism Identification via Fusion of Large Language Models Sahrish Khan1,* , Gabriele Pergola1 and Arshad Jhumka3 1 Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK 3 School of Computing, University of Leeds, Leeds LS2 9JT, UK Abstract The pervasive presence of sexist content on social media platforms not only perpetuates harmful stereotypes but also fosters environments that can be exclusionary and hostile, especially towards women. Such content, which often targets people of a specific gender, i.e., sexist content, requests platforms to enhance their monitoring and policing efforts. Yet, policing such content is challenging for many reasons, including the volume of messages to check and the context of the content. Consequently, several studies have been conducted to automatically detect sexist language on social media, focusing on its identification and classification. However, variations in detection accuracy can depend on the differences in architecture, training strategies, and data of existing models, [1, 2, 3], leading to potential variances in detection accuracy. This variability, further influenced by the types of messages and input prompts, motivates our exploration into the fusion of multiple Large Language Models (LLMs). As part of EXIST Task 1, which focuses on sexism identification in multilingual contexts, we introduce two novel approaches: the Dual-Transformer Fusion Network (DTFN) and the Multimodel Fusion Ensemble (MFE). These methods utilize fusion and ensemble learning techniques to enhance detection accuracy across multilingual datasets. Our extensive experimental evaluation during the EXIST 2024 competition demonstrates that these methodologies significantly outperform existing models, with MFE and DTFN ranking 1st and 2nd , respectively, in the English segment, and 4th and 13th in the combined English and Spanish segments of the official leaderboard. Keywords Social Media, Sexism Detection, Ensemble, Transformer, Multilingual, Large Language models, EXIST Task 1 1. Introduction The proliferation of social media platforms has fundamentally transformed how individuals communi- cate. However, these platforms have also become arenas for problematic interactions, including the dissemination of sexist content. This content not only perpetuates harmful stereotypes but also fosters an online environment that can be hostile and exclusionary, particularly towards women. The urgency to address this issue is underscored by the growing body of research indicating the growing exposure to sexist language and its profound impacts. Despite the clear need to mitigate this problem, the task of detecting sexist content online presents substantial challenges. Sexist language is not uniformly explicit; it often involves subtle cues and context-dependent expressions. Addressing these challenges requires leveraging flexible computational methods and approaches that can understand and interpret the complexities of language used in these settings. Large Language Models (LLMs), which are pre-trained on vast corpora and fine-tuned for specific tasks, are promising solutions. However, while individual LLMs offer robust linguistic insights, they also have inherent limitations when applied to specific tasks or domains such as detecting sexist or harmful content. Each existing model may interpret nuances differently based on its architecture, training strategy and data [1, 2, 3, 4], leading to potential variances in detection accuracy. CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France * Corresponding author. † These authors contributed equally. $ sahrish.khan@warwick.ac.uk (S. Khan); gabriele.pergola.1@warwick.ac.uk (G. Pergola); arshad.jhumka@leeds.ac.uk (A. Jhumka) € https://www.dcs.warwick.ac.uk/~u2149613/ (S. Khan); https://www.dcs.warwick.ac.uk/~u1898418/ (G. Pergola); https://eps.leeds.ac.uk/computing/staff/9540/professor-arshad-jhumka (A. Jhumka)  0000-0002-7347-2522 (G. Pergola); 0000-0003-0540-2845 (A. Jhumka) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings This variability depends on the types of messages as well as on the input prompts, and it motivates our exploration for fusing multiple LLMs. Our research, conducted as part of the EXIST 2024 (sEXism Identification in Social Networks)[5, 6] shared task 1, which focuses on enhancing automated sexism detection. In this paper, we present two novel methodologies leveraging neural language models for sexism identification: the Dual-Transformer Fusion Network (DTFN) and the Multimodel Fusion Ensemble (MFE). These approaches utilize fusion and ensemble learning techniques to enhance detection accuracy across multilingual datasets, specifically evaluated using the EXIST 2024 dataset for both English and Spanish contexts. In particular, the DTFN is a simple yet effective approach that integrates the outputs from two distinct transformers, i.e., RoBERTa[1] and DeBERTa [2], and fuses them via a fully connected layer. We posit that by concatenating their outputs, the DTFN captures a more comprehensive understanding of the textual data. We further expand this concept by introducing the MFE approach, which applies a majority voting mechanism among multiple models to exploit their collective capabilities for better generalization across diverse linguistic contexts. Ensemble methods, such as MFE, have shown to enhance model performance by mitigating individual model weaknesses and reducing the variance of predictions [7]. By incorporating a diverse set of models like RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b [1, 2, 3], and DTFN, the MFE approach provides a more robust and accurate detection methodology. The effectiveness of these methods was evaluated in the EXIST 2024 competition - Task 1, where the MFE and DTFN approaches notably outperformed other methodologies, ranking 1st and 2nd in the English segment of the official leaderboard, and 4th and 13th in the combined English and Spanish languages, respectively. The remainder of this paper is structured as follows: Section 2 provides an overview of related work in sexism detection and ensemble learning strategies. Section 3 briefly describes the datasets. Section 4 details our methodologies, including the DTFN and MFE techniques. Section 5 details Experimental Assessment and Section 6 presents the results and analysis, followed by our conclusions in Section 7. 2. Related Work Significant research has been conducted by researchers on the detection of hate speech, cyberbullying and offensive language. However, despite the growing interest on the topic, the literature on sexism detection is still limited. In recent studies large language models (LLMs) and Transformer-based architecutres have been used for multi-modal detection of hate speech, sexism and offensive language from the text, images, memes, audio, and videos[8, 9, 10, 11, 12, 13, 14, 15]. The advent of transformer models, particularly BERT (Bidi- rectional Encoder Representations from Transformers) introduced by [16], enables a more sophisticated understanding of contextual relationships within text. Building upon BERT, [1] introduced RoBERTa (A Robustly Optimized BERT Approach), which fine-tuned the training process, and DeBERTa [2] (Decoding-enhanced BERT with Disentangled Attention), which incorporates a disentangled attention mechanism and enhanced decoding capabilities. Based on this pre-trained models, [17] fine-tuned deep learning models, such as CNN-BiLSTM and GPT-2, on the "MultiHate" dataset, achieving notable accuracy rates in sexism classification. Moreover, the problem of sexism detection involves identifying harmful and biased language, often embedded within complex social contexts. Early approaches relied heavily on traditional machine learning techniques, such as SVMs and logistic regression, combined with manually crafted features Gaydhani et al. [18], Anistya and Setiawan [19]. However, these methods struggled with the subtleties of natural language and the contextual nature of sexism. Recent studies have leveraged the transformer models to address these challenges. In particular, [20] applied BERT to detect misogyny in social media. Similarly, Singh et al. [21] focused on the automatic detection of misogyny in multimodal online content by developing a large, annotated corpus of memes involving Hindi-English code-mixed language. Ensemble approaches, which involves combining multiple models, have proven effective in various NLP tasks, and the rationale is that different models can capture different aspects of the data based on their architecture, training objectives and data; thus, their combination can mitigate individual weaknesses. [22] provided a comprehensive overview of ensemble methods, emphasizing their potential to enhance robustness and accuracy. In the context of text classification, Stacked Generalization [23] and Bagging [24] are widely used ensemble techniques. More recent studies have focused on applying these methods to deep learning models. For example, [7] employed an ensemble of BERT-based models for sentiment analysis. 3. Datasets In this study, we utilized the EXIST 2024 Tweets Dataset, specifically tailored for Task 1, which involves sexism identification in tweets. This dataset is comprehensive, containing over 10,000 labeled tweets balanced between English and Spanish. The tweets are annotated for binary classification, where the task is to determine whether a tweet contains sexist expressions or behaviors, categorized as "YES" or "NO." The dataset is split into three parts: training, development, and test sets. For our experiments, we used the training and development sets with hard labels (gold standard) for model training and validation. The detailed distribution of the dataset is described in Table 1. Table 1 Distribution of Data for Task 1: Sexism Detection in the EXIST 2024 Dataset Language Training Set Development Set Test Set English 3,260 489 978 Spanish 3,460 549 1,098 Both Languages 6,920 1,038 2,076 4. Methodology We proceed by introducing the two methodologies designed to address the classification of online sexism. First, we first present a simple yet effective method, named Dual-Transformer Fusion Network (DTFN) (see Section 4.1), based on the fusion of vector representations generated by two different neural language models; we subsequently present a more complex and effective approach, Multimodel Fusion Ensemble (MFE) (Section 4.2), based on ensemble learning of several LLMs and of the aforementioned DTFN. 4.1. Dual-Transformer Fusion Network (DTFN) In our participation in the EXIST Task 1, we introduced a methodology named Dual-Transformer Fusion Network (DTFN), which integrates two Transformer models known for their effectiveness in online post analysis, namely RoBERTa-Large and DeBERTa-V3-Large [1, 2]. The DTFN methodology leverages the distinctive characteristics of each constituent model to enhance text classification: RoBERTa-Large, optimized for deep contextual understanding across longer text sequences [1], and DeBERTa-V3-Large, designed to model the inter-token relationships through its disentangled attention mechanism [2]. Based on these observations, we design DTFN as a hybrid architecture that first processes input text —typically extracted from social media or other online platforms— through both models in parallel. Each model then independently analyzes the text and outputs dense representation from their respective last hidden layers, potentially encoding complementary aspects of the text’s semantic and contextual nuances. Formally, let x denote the input text vector. Each Transformer model 𝑇 (RoBERTa and DeBERTa) processes x independently and outputs a representation h𝑇 from its final hidden layer: RoBERTa-Large Input Data Final Fully-Connected Layer Output DeBERTa-V3-Large Figure 1: Architecture of the Concatenated Transformer Integration (DTFN) Technique. This diagram illustrates how outputs from the last layers of RoBERTa-Large and DeBERTa-V3-Large are concatenated and processed through a classifier(linear Layer) for enhanced sexism detection in multilingual contexts h𝑅 = 𝑇RoBERTa (x), h𝐷 = 𝑇DeBERTa (x) (1) These output vectors, h𝑅 and h𝐷 , capture complementary linguistic features as determined by their distinct training paradigms and architectural innovations. Following the feature extraction, the outputs are concatenated to form a unified feature representation h: h = [h𝑅 ; h𝐷 ] (2) This vector h is then passed through a fully connected linear layer 𝐿 to produce the final class prediction 𝑦^. The linear layer acts as a classifier, integrating the diverse features into a unified prediction: 𝑦^ = 𝜎(𝐿(h)) (3) where 𝜎 denotes the sigmoid activation function, mapping the linear combination of features to a probability score indicating the final class predicted. The entire architecture is trained end-to-end with the objective of minimizing the binary cross-entropy loss ℒ: ℒ(𝑦^, 𝑦) = −𝑦 log(𝑦^) − (1 − 𝑦) log(1 − 𝑦^) (4) The linear layer, as the whole architecture, is trained end-to-end on the specific task of sexism detection. Figure 1 illustrates the overall pipeline of the DTFN, highlighting the flow from the input text to the final classification. 4.2. Multimodel Fusion Ensemble (MFE) Based on the promising results of our preliminary study combining two Transformer architectures, we devise a principled approach based on ensemble to dynamically combine multiple models with different architectures and training strategies. This approach, which we named Multimodel Fusion Ensemble (MFE), combines multiple models with distinct architectures and training strategies. Specifically, MFE integrates outputs from four different Transformer-based models — RoBERTa-Large[1], DeBERTa-V3- Large[2], Mistral-7b[3], and the previously introduced Dual-Transformer Network (DTFN) — using a majority voting mechanism. Each model in the ensemble was selected for its unique capabilities in processing and understanding complex text structures, such as the dynamic masking strategy [1], the disentangled attention mechanism [2], the Grouped-Query and Sliding Window Attention [3]. Specifically, MFE integrates outputs from four different Transformer-based models - RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, and the previously introduced Dual-Transformer Network (DTFN) — using a majority voting mechanism. The individual models were first fine-tuned on the available dataset for sexism detection to optimize their Figure 2: Architecture of the Multimodel Fusion Ensemble (MFE) Technique. This diagram shows the majority voting process involving RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, and the DTFN, optimized for enhanced sexism detection across multilingual contexts. performance for the classification task. Subsequently, the ensemble was configured to employ a majority voting system to aggregate the predictions from each model. More formally, in our ensemble method the classification decision for each instance is derived through a majority voting mechanism among the outputs of the constituent models. Let 𝐶 = {𝑐1 , 𝑐2 , . . . , 𝑐𝐾 } represent the set of possible classes. For a given text instance 𝑥, each model 𝑚 in the ensemble 𝑀 = {𝑚1 , 𝑚2 , . . . , 𝑚𝑁 } predicts a class 𝑐𝑚 . The ensemble prediction ^𝑐 is determined by: 𝑁 ∑︁ ^𝑐(𝑥) = arg max 1(𝑐𝑚 (𝑥) = 𝑐) (5) 𝑐∈𝐶 𝑚=1 where 1 is the indicator function that equals 1 if the condition is true and 0 otherwise. This simple approach counts the votes for each class from all models and selects the class with the highest count. Majority Voting and Tie Handling In scenarios where the voting results in a tie, particularly when the ensemble is evenly split across classes, a predefined rule is applied to resolve the ambiguity. Considering the sensitivity and potential consequences of misclassifying sexist content, our tie-breaking strategy defaults to the "Yes" (Sexist) prediction. This decision was based on the task’s sensitivity and the potential social impact of under- detecting sexist content. 5. Experimental Assessment As part of the EXIST 2024, we conducted a thorough experimental evaluation of the presented method- ologies addressing, in particular, Task 1. In our experimental assessment, we initially evaluated the performance of individual models to establish a baseline. Then, we analysed the results yielded by the proposed Dual-Transformer Fusion Network (DTFN) and Multimodel Fusion Ensemble (MFE). We proceed by first introducing the baselines, hyperparameters, and evaluation metrics adopted. We conclude by discussing the results on the EXIST dataset and the official leaderboard, along with quantitative analyses of the ensemble mechanism. Baselines In the following, we briefly describe the baselines evaluated: • RoBERTa-Large [1]: An optimized version of BERT, whose model’s size allows for a deeper understanding of language context, making it ideal for analyzing the intricacies in English and Spanish. • DeBERTa-V3-Large [2]: It improves upon the BERT and RoBERTa designs by deciphering the dependency between words in a sentence, introducing a disentangled attention mechanism. • Mistral-7b [3]: A large-scale model optimized for both performance and throughput and tailored for multilingual understanding. Its large-scale training on diverse datasets makes it particularly adept at handling the complexities of both English and Spanish. We adopted their pre-trained versions, available through the HuggingFace library1 . Parameter Settings For each model used in our experiments, we identified a set of optimal hyperparameters through preliminary testing. The selected hyperparameters include the number of training epochs, learning rate (𝜂), batch size, and weight decay (𝜆). Table 2 presents the optimal hyperparameters for each model. Table 2 Best Hyperparameters per Model Hyperparameter RoBERTa-Large DeBERTa-V3-Large Mistral-7b DTFN Number of Epochs 30 30 10 30 Learning Rate 6 × 10−6 6 × 10−6 1 × 10−4 6 × 10−6 Batch Size 16 16 16 4 Weight Decay 5 × 10−3 5 × 10−3 5 × 10−3 5 × 10−3 Evaluation Metrics In the evaluation of Task 1 for EXIST 2024, the official metrics used are ICM-Hard, ICM-Hard Norm, and F1. The ICM metric, proposed by [25], is based on information theory and measures the similarity between system classifications and gold standard labels. The organizers have also provided a normalized version, ICM-Hard Norm, to account for dataset imbalances, ensuring fair comparisons across different test conditions. For this shared task, higher values of the ICM and ICM-Hard Norm metrics indicate a stronger alignment between system outputs and the ground truth, with higher values considered better. 6. Results 6.1. Experimental Results on the Development set The evaluation of the baseline models shows differences in performance across RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, and our proposed Dual-Transformer Fusion Network (DTFN), as re- ported in Table 3. RoBERTa-Large achieved an F1 score of 0.864 and an ICM score of 0.592, demonstrating its robustness in handling the task. DeBERTa-V3-Large marginally outperformed RoBERTa-Large. Mistral-7b, on the other hand, yielded a slightly lower F1 score and the lowest ICM score among the baselines. This indicates that despite the higher number of parameters, Mistral-7b might not be as well-suited to the specific task of sexist identification compared to the other models evaluated.– Our proposed model, the Dual-Transformer Fusion Network (DTFN), slightly surpassed the other baseline models with an F1 score of 0.868 and showed a significant improvement with an ICM score of 0.606. The higher performance of DTFN highlights the efficacy of our dual-transformer architecture in improving classification accuracy. 1 https://huggingface.co/ Table 3 Performance Metrics of Baseline and Ensemble Model Combinations on the Development Set ID Models F1 Score ICM Score Baseline Models 1 RoBERTa-Large 0.864 0.592 2 DeBERTa-V3-Large 0.866 0.598 3 Mistral-7b 0.859 0.577 Dual-Transformer Fusion Architecture 4 Dual-Transformer Fusion Network (DTFN) - Ours 0.868 0.606 Ensemble Combinations 5 RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b 0.8811 0.6438 6 RoBERTa-Large, DeBERTa-V3-Large, DTFN 0.8811 0.6439 7 RoBERTa-Large, Mistral-7b, DTFN 0.8832 0.6507 8 DeBERTa-V3-Large, Mistral-7b, DTFN 0.8747 0.6248 9 Multimodel Fusion Ensemble (MFE) - Ours RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, DTFN 0.8841 0.6548 Experimental Analysis of the Majority Voting To systematically understand the voting results, we analysed the number of times each combination of models agreed on the sexist (Yes) and non-sexist (No) classes. This is done to determine whether there is a dominant model (or combination) in the ensemble. The combinations and their respective agreement counts are detailed in Table 4. The table provides detailed insights into the agreement and non-conformity of various ensemble model combinations on the development set. Among the combinations, the ensemble of RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, and DTFN shows the highest majority agreement. Conversely, the combination of RoBERTa-Large, DeBERTa- V3-Large, and Mistral-7b without DTFN showed the lowest majority agreement. This indicates that the addition of DTFN significantly boosts the ensemble’s agreement, particularly in identifying sexist content. Out of all instances, we observed a total of 58 ties where the aforementioned tie-breaking rule was applied. Additionally, we reported the isolation frequency, i.e., how often a single model’s prediction differed from the majority vote within the ensemble, reflecting the model’s conformity with others. RoBERTa- Large had the highest isolation frequency, which could be due to the fact that it takes context into account. DeBERTa-V3-Large showed a lower isolation frequency, while Mistral-7b frequently disagreed Table 4 Majority Voting Agreement and Non-conformity Metrics in Multimodel Fusion Ensemble (MFE) on Development Set Ensemble Combination Sexist (‘Yes’) Non-Sexist (‘No’) RoBERTa-Large, DeBERTa-V3-Large, and Mistral-7b 11 11 RoBERTa-Large, DeBERTa-V3-Large, and DTFN 16 27 RoBERTa-Large, Mistral-7b, and DTFN 19 7 DeBERTa-V3-Large, Mistral-7b, and DTFN 25 13 RoBERTa-Large, DeBERTa-V3-Large, Mistral-7b, and DTFN 386 361 Isolation Frequency RoBERTa-Large 25 13 DeBERTa-V3-Large 19 7 Mistral-7b 16 27 DTFN 11 11 Table 5 Official Results on the Test Set for Task 1 (Sexism Detection) Using Dual-Transformer Fusion Network (DTFN) and Multimodel Fusion Ensemble (MFE) Evaluation Language Approach ICM-Hard ICM-Hard Norm F1 Rank Hard-Hard English MFE 0.6178 0.8153 0.7610 1st /68 Hard-Hard English DTFN 0.5953 0.8038 0.7491 2nd /68 Hard-Hard Spanish MFE 0.5497 0.7748 0.7898 12𝑡ℎ /66 Hard-Hard Spanish DTFN 0.4903 0.7452 0.7710 25𝑡ℎ /66 Hard-Hard Both MFE 0.5883 0.7956 0.7775 4𝑡ℎ /70 Hard-Hard Both DTFN 0.5447 0.7738 0.7614 13𝑡ℎ /70 with the ensemble on non-sexist classifications. DTFN had the lowest isolation frequency, suggesting it is the most conforming model within the ensemble, underscoring the DTFN’s role in enhancing ensemble cohesion. 6.2. Results of the Official Leaderboard In this section, we present the results of our participation in Task 1 of the EXIST 2024 challenge, where our team, EquityExplorers, submitted two runs: EquityExplorer-1 using the DTFN technique and EquityExplorer-2 employing the Multimodel Fusion Ensemble (MFE) approach. The MFE and the DTFN demonstrated notable performance by ranking 1𝑠𝑡 and 2𝑛𝑑 for the Task 1 in the English segment, respectively. The effectiveness of the MFE, evidenced by its ICM-Hard score of 0.6178 and ICM-Hard Norm of 0.8153, coupled with an F1 score of 0.7610, demonstrated its robust capability to discern nuances of sexist content in English tweets effectively. This ensemble approach, by combining different strategies and model outputs, has proven to be particularly effective in improving accuracy and reliability over individual models, including the Dual-Transformer Fusion Network. The patterns observed in the Spanish and Both (combining results from English and Spanish) evalua- tions align with these findings. Although the performance gap in the Spanish evaluation is wider, it highlights the robustness of MFE in a different linguistic environment. Notably, DTFN ranks 25𝑡ℎ out of 66 in Spanish, suggesting that while it is effective, it might not fully adapt to different languages as efficiently as MFE. The aggregated results for both languages demonstrate the consistent advantage of using an ensemble approach, with MFE achieving the 4𝑡ℎ rank out of 70, compared to the 13𝑡ℎ rank for DTFN. In conclusion, the official leaderboard results validate the proposed approach and highlight the significant improvements achieved through the Multimodel Fusion Ensemble (MFE). The consistent outperformance of MFE across various metrics and datasets underscores the potential of ensemble methods involving neural language models. 7. Conclusion In this work, we introduced the Dual-Transformer Fusion Network (DTFN) and the Multimodel Fusion Ensemble (MFE) for identifying sexist content across multiple languages within the context of the EXIST 2024 competition. A thorough evaluation on the development and test sets highlighted the superior performance of the MFE, particularly ranking highly in both the English (1𝑠𝑡 ) and combined language (4𝑡ℎ ) categories. This performance, paired with a comparative analysis against baseline models, allowed for a detailed assessment of the relative improvements offered by the DTFN and MFE approaches. It demonstrated the benefits of integrating diverse transformer models into an ensemble framework to leverage the characteristics of each neural language model, thereby achieving higher accuracy and reliability in detecting complex linguistic patterns associated with sexism. References [1] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692. [2] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, 2021. arXiv:2006.03654. [3] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825. [4] H. Yan, L. Gui, G. Pergola, Y. He, Position bias mitigation: A knowledge-aware graph model for emotion cause extraction, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3364–3375. [5] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi- cation and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilin- guality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024. [6] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi- cation and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024. [7] A. C. Mazari, N. Boudoukhani, A. Djeffal, Bert-based ensemble learning for multi-aspect hate speech detection, Cluster Computing 27 (2024) 325–339. [8] D. Kikkisetti, R. U. Mustafa, W. Melillo, R. Corizzo, Z. Boukouvalas, J. Gill, N. Japkowicz, Us- ing llms to discover emerging coded antisemitic hate-speech in extremist social media, 2024. arXiv:2401.10841. [9] L. Zhu, G. Pergola, L. Gui, D. Zhou, Y. He, Topic-driven and knowledge-aware transformer for dialogue emotion detection, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1571–1582. [10] G. Pergola, L. Gui, Y. He, A disentangled adversarial neural topic model for separating opinions from plots in user reviews, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2870–2883. [11] R. Wolfe, Y. Yang, B. Howe, A. Caliskan, Contrastive language-vision ai models pretrained on web-scraped multimodal data exhibit sexual objectification bias, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, Association for Computing Machinery, New York, NY, USA, 2023, p. 1174–1185. URL: https://doi.org/10.1145/3593013.3594072. doi:10.1145/3593013.3594072. [12] G. Pergola, L. Gui, Y. He, TDAM: A topic-dependent attention model for sentiment analysis, Information Processing & Management 56 (2019) 102084. [13] J. Lu, X. Tan, G. Pergola, L. Gui, Y. He, Event-centric question answering via contrastive learning and invertible event transformation, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 2377–2389. [14] J. Lu, J. Li, B. Wallace, Y. He, G. Pergola, NapSS: Paragraph-level medical text simplification via narrative prompting and sentence-matching summarization, in: A. Vlachos, I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1079–1091. [15] A. Irfan, D. Azeem, S. Narejo, N. Kumar, Multi-modal hate speech recognition through machine learning, in: 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC), 2024, pp. 1–6. doi:10.1109/KHI-HTC60760.2024.10482031. [16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805. [17] A. Vetagiri, P. Pakray, A. Das, A deep dive into automated sexism detection using fine-tuned deep learning and large language models, Available at SSRN 4791798 (2024). [18] A. Gaydhani, V. Doma, S. Kendre, L. Bhagwat, Detecting hate speech and offensive language on twitter using machine learning: An n-gram and {TFIDF} based approach, CoRR abs/1809.08651 (2018). URL: http://arxiv.org/abs/1809.08651. arXiv:1809.08651. [19] F. Anistya, E. B. Setiawan, Hate speech detection on twitter in indonesia with feature expansion us- ing glove, Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) 5 (2021) 1044 – 1051. URL: http: //www.jurnal.iaii.or.id/index.php/RESTI/article/view/3521. doi:10.29207/resti.v5i6.3521. [20] A. Rahali, M. A. Akhloufi, A.-M. Therien-Daniel, E. Brassard-Gourdeau, Automatic misogyny detection in social media platforms using attention-based bidirectional-lstm*, in: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2021, pp. 2706–2711. doi:10. 1109/SMC52423.2021.9659158. [21] A. Singh, D. Sharma, V. K. Singh, Mimic: Misogyny identification in multimodal internet content in hindi-english code-mixed language, ACM Trans. Asian Low-Resour. Lang. Inf. Process. (2024). URL: https://doi.org/10.1145/3656169. doi:10.1145/3656169, just Accepted. [22] T. G. Dietterich, Ensemble methods in machine learning, in: International workshop on multiple classifier systems, Springer, 2000, pp. 1–15. [23] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. [24] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140. URL: https://api. semanticscholar.org/CorpusID:47328136. [25] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819.