Penta ML at EXIST 2024: Tagging Sexism in Online Multimodal Content With Attention-enhanced Modal Context
Notebook for the EXIST Lab at CLEF 2024

Deeparghya Dutta Barua1,*,†, Md Sakib Ul Rahman Sourove1,*,†, Fabiha Haider1, Fariha Tanjim Shifat1, Md Farhan Ishmam1,3, Md Fahim1,2,* and Farhad Alam Bhuiyan1,*

1 Research and Development, Penta Global Limited, Bangladesh
2 CCDS Lab, IUB, Bangladesh
3 Islamic University of Technology, Bangladesh

Abstract
Content moderation at scale warrants automated systems that are capable of understanding nuance in the text and images posted online. Transformer-based models have been shown to perform well under these conditions, but the additional complexities arising from multimodality and multilinguality call for better-tuned systems that can capture richer representations of the context. This translates directly to downstream tasks such as sexism identification in online content, which is at the forefront of the EXIST 2024 shared tasks. This paper, as part of the EXIST challenge at CLEF 2024, investigates an attention-based approach that improves performance over baseline multimodal models by assigning separate importance to the textual and visual representations. The proposal is evaluated against CLIP and ViLT, two established multimodal models, and achieves state-of-the-art performance in the multi-label classification task under the hard-hard evaluation context. The study is further augmented by error analysis, including confusion matrices for the applicable tasks.

Keywords
Hateful Memes Detection, Multimodal Fusion, Vision Language Modeling

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding authors.
† These authors contributed equally.
deeparghyadutta-2018425323@cs.du.ac.bd (D. D. Barua); souroveskb@gmail.com (M. S. U. R. Sourove); fabihahaider4@gmail.com (F. Haider); fariha.tanjim.shifat@gmail.com (F. T. Shifat); farhanishmam@iut-dhaka.edu (M. F. Ishmam); fahimcse381@gmail.com (M. Fahim); pdcsedu@gmail.com (F. A. Bhuiyan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The proliferation of internet users throughout the globe has caused an upsurge in the amount of content being generated and consumed on a daily basis. This huge volume of content includes a wide range of information, personal opinions, and entertainment, reflecting the diverse perspectives of its global audience. Naturally, given the volume, a large portion of the content on the internet reinforces harmful and problematic behaviors [1]. One of the most concerning issues among them is sexism.

Memes have recently entered the cultural zeitgeist as pieces of visual or audio-visual media accompanied by a humorous piece of text [2]. However, as with any form of media, memes too can be weaponized to spread problematic views and ideologies. Consequently, a good portion of memes on the internet reinforce discriminatory behavior, such as sexism. The influence of such content is multifaceted, ranging from further polarizing and radicalizing impressionable people to causing psychological discomfort to its victims. In order to combat such problematic trends and create a safer online experience regardless of gender, it has become important for automated systems to screen and flag potentially inimical content at scale. Simple rule-based approaches may work to a certain extent for filtering textual content, but they tend to fail when multiple modalities, such as images along with text, are involved.
This is because nuance and context cannot easily be inferred from a set of rules. To account for this, transformer-based architectures have recently gained prevalence in hateful content detection due to their ability to better understand context.

A comprehensive dataset is quintessential for training these transformer-based models and employing them on downstream tasks. The Hateful Memes Challenge dataset [3] is a popular pick, containing over 10,000 multimodal samples of binary-labeled data. However, datasets that cover hateful behaviour at a broad scale fail to capture the specifics of sexism found in online content. Moreover, although Spanish is not a low-resource language, adequately labeled Spanish datasets for sexism detection are scarce. The SemEval-2019 dataset [4] does address sexism in English and Spanish textual content specifically, but its binary labels lack the resolution needed to understand intent, and it does not address sexism found in images. Similarly, the Automatic Misogyny Identification (AMI) text dataset [5] offers labeled data for sexist text classification with some granularity, but the classifications are exclusive, with each sample assigned to a single category only. Addressing multiple of these dimensions at once, the EXIST 2024 dataset [6, 7] is the only multimodal dataset for both Spanish and English content in which labels are assigned at varying levels of granularity, ranging from binary classification to multi-class and even multi-label classification, where multiple labels can co-exist for the same sample.

The multimodal nature of the EXIST 2024 memes dataset imposes a different set of challenges compared to its text-only counterpart. The multilinguality adds further complexity, since most pretrained models tend to be trained on datasets that are predominantly in English. To address these issues, we propose an attention-enhanced approach that uses both the textual and the visual context to obtain a more enriched representation of each sample, which is then used for the downstream task of sexism classification; the architecture is discussed in finer detail within this paper. We also evaluate our approach empirically against existing multimodal models, namely CLIP and ViLT, and provide an error analysis that delineates the shortcomings of our system. As per the experiments, our approach yields superior performance on the multi-class and multi-label classification problems in the hard-hard evaluation context (ICM-Hard), improving performance by 7%-12% and 6%-13% respectively.

2. Background

2.1. Dataset and Tasks

The EXIST 2024 memes dataset includes a total of 5,044 image-text pairs in English and Spanish. The training split contains 4,044 memes, each labeled by six different annotators for the respective tasks. The test split provides 1,053 unlabeled samples. Since the memes dataset does not provide a separate validation split, an 80-20 split has been performed on the original training split, resulting in 3,235 samples for training and 809 samples for validation. A detailed language-wise breakdown of the dataset splits can be found in Table 1.
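To make the data preparation concrete, the following is a minimal sketch of the 80-20 train-validation split with the fixed random seed reported in Section 4.1. The file name and loading step are assumptions for illustration and are not part of the official EXIST tooling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical loading of the EXIST 2024 memes training metadata (file name is an assumption).
train_meta = pd.read_json("EXIST2024_memes_training.json", orient="index")

# 80-20 train/validation split with the seed (42) used throughout our experiments.
train_df, val_df = train_test_split(train_meta, test_size=0.2, random_state=42)
print(len(train_df), len(val_df))  # 3235 and 809 on the 4,044 training memes
```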
The dataset also provides additional features pertaining to the annotators, such as their genders, ages, ethnicities, education levels, and countries.

The tasks for the memes dataset seek to classify sexism in memes at varying levels of granularity. A more detailed overview of the hard label distribution of each of these tasks can be found in Table 2. The soft labels for each of the tasks assign a numeric value to each category in the Learning With Disagreement (LeWiDi) format. The soft values for tasks 4 and 5 add up to 1, as the labels are unique for each sample. For task 6, they do not necessarily add up to 1, as it is a multi-label problem.

Table 1
EXIST dataset distribution by the train-validation-test splits. The train-validation split is 80%-20% of the initial training samples since the memes dataset does not provide a separate validation set.

Split        Language    Samples
Train        English     1588
             Español     1647
             All         3235
Validation   English     422
             Español     387
             All         809
Test         English     513
             Español     540
             All         1053
All                      5044

Table 2
Class/Label distribution in the dataset based on hard labels. For task 4, the labels are binary. Task 5 uses multi-class labeling where each sample belongs to a single unique class. For task 6, multiple labels can coexist for the same sample. Both tasks 5 and 6 are hierarchical, meaning that labels other than "NO" are only assignable when the meme in question is sexist.

Task     Class/Labels                    Train   Validation
Task 4   YES                             1604    434
         NO                              1111    271
Task 5   DIRECT                          1078    277
         JUDGEMENTAL                     351     109
         NO                              1111    271
Task 6   IDEOLOGICAL-INEQUALITY          672     171
         STEREOTYPING-DOMINANCE          807     216
         OBJECTIFICATION                 736     208
         SEXUAL-VIOLENCE                 372     111
         MISOGYNY-NON-SEXUAL-VIOLENCE    307     90
         NO                              1111    271

• Task 4: Given an image and its caption, the subtask is to classify whether the meme in question contains any reference to sexism or not. This includes any form of sexist content, descriptions of situations involving discrimination towards women, and even contexts where sexism is criticised. It is essentially a binary classification problem.

• Task 5: This subtask is a multi-class classification problem where the intention behind the meme needs to be identified. Given that a meme is sexist, it can be classified as either "DIRECT" or "JUDGEMENTAL". If it is not sexist to begin with, the identification should also indicate that.
  – DIRECT: The meme itself enforces sexist ideology without any ironic or satirical aspect to it.
  – JUDGEMENTAL: The meme condemns sexist behavior, either directly or satirically.

• Task 6: The image can be classified into multiple labels: "IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION", "SEXUAL-VIOLENCE" or "MISOGYNY-NON-SEXUAL-VIOLENCE". Multiple labels can coexist for a single image, and similar to the previous tasks, there is a separate label if the meme is not sexist (a small encoding sketch follows this list).
  – IDEOLOGICAL-INEQUALITY: Ideological discrediting refers to all memes that discredit the feminist movement with the intention of devaluing, belittling, and defaming the plight of women. "Inequality", on the other hand, refers to memes that establish a narrative that no gender discrimination exists in current society, or the flipped narrative where men are presented as the victims.
  – STEREOTYPING-DOMINANCE: Memes that impose the idea of specific roles being better suited for women fall under stereotyping. Dominance is characterized by the positioning of men above women in various standings.
  – OBJECTIFICATION: Memes dehumanizing or treating women as objects or commodities fall under this label. These may also include the imposition of beauty or other societal standards.
  – SEXUAL-VIOLENCE: Memes under this label contain sexual suggestions, requests for sexual favors, or sexual abuse.
  – MISOGYNY-NON-SEXUAL-VIOLENCE: Expressions of physical violence and hatred towards women fall under this label.
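Because task 6 is multi-label while tasks 4 and 5 are single-label, the targets have to be encoded differently. The snippet below is a minimal sketch of turning a hard-label annotation for task 6 into a multi-hot vector; the label list mirrors the categories above, while the helper name and example values are purely illustrative.

```python
import torch

TASK6_LABELS = [
    "IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION",
    "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE", "NO",
]

def encode_task6(labels: list[str]) -> torch.Tensor:
    """Multi-hot encoding for task 6; several categories may be active at once."""
    target = torch.zeros(len(TASK6_LABELS))
    for label in labels:
        target[TASK6_LABELS.index(label)] = 1.0
    return target

# A sexist meme annotated with two co-occurring categories:
print(encode_task6(["OBJECTIFICATION", "SEXUAL-VIOLENCE"]))
# tensor([0., 0., 1., 1., 0., 0.])
```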
3. Model Architecture

In this section, we provide a brief overview of the methodology employed to address the tasks at hand. Specifically, we were provided with memes sourced from the internet, structured as image-caption pairs denoted as $(V, T)$. The model architecture we use has five components: i) a pretrained vision-language model, ii) semantics from pooled representations, iii) an attention-enhanced context vector for each modality, iv) modality fusion, and v) a classification head.

3.1. Pretrained Vision-Language Model

Our experiment uses a pre-trained ViLT model. Each input meme sample $M = (V, T)$, comprising the image content $V$ and its caption $T$, is processed individually. The text processor tokenizes the caption into its constituent tokens $T = t_1, t_2, \ldots, t_n$. Meanwhile, the image processor divides the input image $V \in \mathbb{R}^{C \times H \times W}$ into patches, which are then flattened to $v \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the patch resolution and $N = HW/P^2$. Specifically, ViLT employs the BERT tokenizer as the text processor and the ViT (Vision Transformer) processor as the image processor. The tokens and the image patches are then fed into the pre-trained ViLT model, whose multimodal fusion provides a comprehensive understanding of the image-text pair. From the model, we obtain image-aware text representations $H_T = \{h_{t_1}, h_{t_2}, \ldots, h_{t_n}\}$ and text-aware image representations $H_V = \{h_{p_1}, h_{p_2}, \ldots, h_{p_n}\}$, where $h_{t_i}$ is the last-layer hidden representation of the $i$-th token and $h_{p_i}$ is the last-layer hidden representation of the $i$-th image patch.

3.2. Semantics from Pooled Representations

The ViLT model also gives a pooled representation of the whole multimodal input. ViLT prepends a special [CLS] token to the multimodal input, and we extract the last-layer representation of this token. A linear projection is applied to that representation to obtain the pooled representation:

$h_{pool} = W_{pool} \cdot h_{[CLS]} + b_{pool}$

3.3. Attention-Enhanced Context Vector for each Modality

Each token and patch holds its own representation, with some being more crucial for prediction than others. To combine these representations according to their significance, we use an additional attention network within each modality, ultimately producing a context vector (a short code sketch follows the two items below).

• Text Context Vector: The text representations $H_T = \{h_{t_1}, h_{t_2}, \ldots, h_{t_n}\}$ obtained from ViLT are passed into an additional attention layer to compute learnable attention scores $\alpha_{t_i}$ for each token $t_i$ in $H_T$:

$\alpha_{t_i} = \mathrm{softmax}(W_T \cdot h_{t_i} + b_T), \quad i = 1, 2, \ldots, n$

After finding the attention score for each token, we obtain the context vector for the text modality by weighting each $h_{t_i}$ with its attention score $\alpha_{t_i}$:

$c_T = \sum_{i=1}^{n} \alpha_{t_i} \cdot h_{t_i}$

• Vision Context Vector: Similarly, the vision representations $H_V = \{h_{p_1}, h_{p_2}, \ldots, h_{p_n}\}$ are fed into another attention layer to obtain the attention weights:

$\alpha_{p_i} = \mathrm{softmax}(W_V \cdot h_{p_i} + b_V), \quad i = 1, 2, \ldots, n$

Having calculated the attention scores for each patch, we determine the context vector for the vision modality by weighting each patch representation $h_{p_i}$ with its corresponding attention score $\alpha_{p_i}$:

$c_V = \sum_{i=1}^{n} \alpha_{p_i} \cdot h_{p_i}$
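The following is a minimal PyTorch sketch of the per-modality attention pooling described above. It assumes the ViLT-base hidden size of 768; the module and variable names are illustrative rather than taken from any released implementation.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learnable attention pooling over a sequence of hidden states (Section 3.3)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)  # plays the role of W and b in the formulas above

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_size), e.g. the token or patch representations from ViLT
        scores = self.scorer(h).squeeze(-1)          # (batch, seq_len)
        alpha = torch.softmax(scores, dim=-1)        # attention weight per token/patch
        return torch.einsum("bs,bsh->bh", alpha, h)  # context vector c = sum_i alpha_i * h_i

# Example: pooling image-aware text states H_T of shape (batch, n, 768) into c_T.
pool_text = AttentionPool()
c_t = pool_text(torch.randn(4, 40, 768))  # -> (4, 768)
```

The same module is instantiated separately for the text and the vision representations, so the two modalities learn independent attention weights.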
3.4. Modality Fusion

The context vectors $c_T$ and $c_V$ are each summed with the pooled representation $h_{pool}$ to obtain enhanced representations $c_T'$ and $c_V'$, where $c_T' = h_{pool} + c_T$ and $c_V' = h_{pool} + c_V$. Finally, we fuse the vision and text modalities by concatenating $c_T'$ and $c_V'$ and passing them into a Multi-Layer Perceptron (MLP) to obtain the modality-fused feature:

$c = \mathrm{concat}[c_T', c_V'] \quad (1)$

$z = \mathrm{MLP}(c) \quad (2)$

3.5. Classification Head

After obtaining the modality-fused feature representation $z$, it is fed into a classification layer that produces the logits $z'$ used for the classification process:

$z' = W \cdot z + b \quad (3)$

Finally, we calculate the Cross-Entropy (CE) loss between $z'$ and the ground truth.

Figure 1: Overview of the proposed architecture. The text is tokenized and the image is divided into fixed-size patches by the processor. These are then passed through a ViLT model to obtain the embeddings, along with an extra [CLS] token that captures the context. This contextual representation is further enhanced by passing the text and image embeddings through separate attention networks to weight them, and then adding the resulting context vectors to the [CLS] representation. Finally, this enhanced representation is used for the downstream classification task.

4. Experimental Setup

4.1. Settings

All experiments were conducted using Python (version 3.12) and PyTorch, leveraging the free NVIDIA Tesla P100 GPU provided by Kaggle. For the pretrained vision-language models, we use the HuggingFace transformers library (version 4.40.1). The models used for tasks 4 and 5 have been trained for 15 epochs, and the models for task 6 have been trained for 50 epochs from their pretrained checkpoints. We have used the schedule-free AdamW optimizer [8], which does not use a learning rate schedule and therefore requires no explicit hyperparameter for the optimization stopping step $T$. The initial learning rate is $2 \times 10^{-4}$. The epsilon value of the optimizer is $10^{-6}$, and the $\beta$ coefficients for computing the running averages of the gradient and its square are 0.9 and 0.999 respectively. The seed or random state value used across all operations is 42. The batch size used in all instances is 64. For the classification head, the dimension of the hidden layer is 256 and the dropout layer sets 10% of the input values to zero. To improve modeling and include different explainability results, we adopt training and explainability settings from the EDAL, ITPT, and HateXplain papers [9, 10, 11].
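As a concrete illustration, the sketch below wires the fusion and classification head of Sections 3.4-3.5 to the schedule-free AdamW settings above. The class name, the ReLU activation, and the binary output dimension are our assumptions rather than details confirmed by the experiments.

```python
import torch
import torch.nn as nn
import schedulefree  # pip install schedulefree; provides the schedule-free AdamW of [8]

class FusionClassifier(nn.Module):
    """Modality fusion (Section 3.4) and classification head (Section 3.5); names are illustrative."""
    def __init__(self, hidden_size: int = 768, mlp_dim: int = 256,
                 num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_size, mlp_dim),
                                 nn.ReLU(), nn.Dropout(dropout))
        self.classifier = nn.Linear(mlp_dim, num_labels)

    def forward(self, h_pool, c_t, c_v):
        c = torch.cat([h_pool + c_t, h_pool + c_v], dim=-1)  # c'_T and c'_V, concatenated
        return self.classifier(self.mlp(c))                   # logits z'

head = FusionClassifier()
optimizer = schedulefree.AdamWScheduleFree(head.parameters(),
                                           lr=2e-4, betas=(0.9, 0.999), eps=1e-6)
optimizer.train()  # schedule-free optimizers are toggled between train() and eval() modes
```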
4.2. Evaluation Metrics

Our evaluation metrics for validation follow the same scoring system selected for the EXIST test rankings. "ICM-Hard" is the primary metric in the hard-hard evaluation context and "ICM-Soft" serves the same purpose in the soft-soft evaluation context. Additionally, the "Macro F1" values have also been listed as a more traditional metric that does not consider hierarchical classifications. The PyEvALL library [12] has been used to calculate the scores throughout all the experiments.

• ICM-Hard: The Information Contrast Model-Hard (ICM-Hard) metric [13] is designed to evaluate unbalanced hierarchical multi-label classification problems by incorporating hierarchical relationships and category specificity. ICM-Hard operates by considering the information content (IC) of categories and their intersections. The IC of a category reflects the probability of an item appearing in the category or any of its descendants, providing a measure of category specificity. Given two feature sets $A$ and $B$, the metric is defined as:

$ICM(A, B) = \alpha_1 IC(A) + \alpha_2 IC(B) - \beta\, IC(A \cup B)$

$IC(A) = -\log(P(A))$

The values selected for $\alpha_1$, $\alpha_2$, and $\beta$ in these experiments are 2, 2, and 3 respectively.

• ICM-Soft: The Information Contrast Model-Soft (ICM-Soft) metric is an extension of ICM-Hard, designed to handle hierarchical multi-label classification problems in a learning with disagreement (LeWiDi) scenario. The metric is defined similarly using information content, but it accommodates soft ground truth assignments and soft system outputs. Given a category $c$ with an agreement $v$ for a given item, the IC is defined as:

$IC(\{\langle c, v \rangle\}) = -\log_2\!\big(P(\{d \in D : g_c(d) \ge v\})\big)$

A recursive function is applied to calculate the ICM-Soft over a set of assignments:

$IC\!\left(\bigcup_{i=1}^{n}\{\langle c_i, v_i \rangle\}\right) = IC(\{\langle c_1, v_1 \rangle\}) + IC\!\left(\bigcup_{i=2}^{n}\{\langle c_i, v_i \rangle\}\right) - IC\!\left(\bigcup_{i=2}^{n}\{\langle \mathrm{lca}(c_1, c_i), \min(v_1, v_i) \rangle\}\right)$

where $\mathrm{lca}(a, b)$ is defined as the lowest common ancestor of categories $a$ and $b$.

• Macro F1: The F1 score is the harmonic mean of precision and recall for a class. The macro F1 metric is the average of the F1 scores of all classes. All classes are given equal weight in this metric, which is desirable for underrepresented classes in highly imbalanced datasets.

$\text{Macro-F1} = \frac{1}{N}\sum_{i=1}^{N} F_i, \qquad F_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$
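The official scores are produced with PyEvALL; as a quick sanity check for the non-hierarchical metric, macro F1 can also be reproduced with scikit-learn. A minimal sketch with hypothetical task 4 predictions, not the official scoring pipeline:

```python
from sklearn.metrics import f1_score

# Hypothetical hard labels for task 4 on a handful of validation memes.
y_true = ["YES", "NO", "YES", "NO", "YES"]
y_pred = ["YES", "YES", "YES", "NO", "NO"]

# Macro F1 averages the per-class F1 scores, weighting both classes equally.
print(f1_score(y_true, y_pred, average="macro"))  # ~0.583
```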
5. Experimental Evaluation

For our experiments, we have chosen CLIP [14] and ViLT [15] as our baseline models due to their multimodal capabilities on image-text pairs. For standard classification using CLIP, we concatenated the representations obtained from both vision and language inputs and passed them into an MLP for classification. For ViLT, we extracted the representation of the [CLS] token and fed it into an MLP for standard classification. As ViLT outperformed CLIP in most scenarios, we selected ViLT as our final model for experiments with the proposed architecture.

5.1. Results of the Models on the Validation Dataset

Table 3
Model performances on the validation dataset. Our approach is shown to have a superior macro F1 score in tasks 4 and 6 over the baseline models CLIP and ViLT.

Task     Model          ICM-Hard   ICM-Soft   Macro F1
Task 4   CLIP           -0.2196    -1.9589    0.5592
         ViLT           -0.1414    -1.5918    0.5755
         Our Approach   -0.1553    -2.0167    0.5815
Task 5   CLIP           -1.1371    -7.0817    0.3526
         ViLT           -1.0804    -7.7936    0.3822
         Our Approach   -1.1196    -7.2194    0.3344
Task 6   CLIP           -          -14.8292   0.2812
         ViLT           -          -13.9726   0.2723
         Our Approach   -          -17.2587   0.2921

Table 3 presents the performance of the models on the validation dataset. For Task 4, the CLIP model achieves a macro F1 score of 55.92%, with ICM-Hard and ICM-Soft scores of -0.2196 and -1.9589, respectively. When using the ViLT pretrained vision-language model instead of CLIP, we observe an improvement of approximately 2% in the macro F1 score. The ICM-Hard and ICM-Soft scores also improve with ViLT. The performance of the ViLT model is further enhanced when incorporated into our model design, achieving a further 0.6-point gain and reaching a macro F1 score of 58.15%. For Task 5, ViLT again outperforms CLIP, showing a 3% improvement in the macro F1 score. However, when integrating our approach with ViLT, the performance decreases. ViLT with the [CLS] token-based classification head performs best on the validation data for this task. For Task 6, CLIP performs better than ViLT, with CLIP achieving a macro F1 score of 28% compared to ViLT's 27%. Our approach surpasses these baselines, improving by 1% over CLIP and 2% over ViLT. Nevertheless, ViLT yields the best ICM-Soft score for Task 6. The ICM-Hard scores for Task 6 could not be calculated, as there is no public-facing implementation for multi-label scenarios.

5.2. Results on Test Dataset of Task 4

Table 4
Task 4 performance on the test dataset. The ViLT model without any context enhancement consistently outperforms the other models in all splits across all metrics.

Split     Model          ICM-Hard   ICM-Soft   Macro F1
All       CLIP           -0.1745    -1.5664    0.6524
          ViLT           -0.1308    -1.2910    0.6742
          Our Approach   -0.2049    -1.7425    0.6101
Español   CLIP           -0.2666    -1.9498    0.6354
          ViLT           -0.2201    -1.7547    0.6657
          Our Approach   -0.3016    -2.2800    0.5957
English   CLIP           -0.0824    -1.2165    0.6713
          ViLT           -0.0420    -0.8683    0.6837
          Our Approach   -0.1083    -1.2548    0.6260

Table 4 displays the performance comparison of the three models (CLIP, ViLT, and our approach) on the task 4 test dataset. Overall, ViLT outperforms both CLIP and our approach, exhibiting the lowest error rates and achieving the highest macro F1 score of 0.6742. CLIP performs moderately well, with a macro F1 score of 0.6524, while our approach shows the least favorable performance with a score of 0.6101. This trend is consistent across subsets, with ViLT consistently outperforming the other models. In particular, on the English subset, ViLT achieves the highest macro F1 score of 68.37%, followed by CLIP at 67.13%, emphasizing ViLT's effectiveness in this context. On the Spanish subset, although ViLT still leads with a score of 66.57%, the performance gap between ViLT and CLIP is narrower. Unfortunately, our proposal demonstrates the highest error rates across all subsets, suggesting potential areas for improvement.

5.3. Results on Test Dataset of Task 5

Table 5
Task 5 performance on the test dataset. Our approach outperforms the other models in all splits in the hard-hard evaluation context, as indicated by the ICM-Hard scores.

Split     Model          ICM-Hard   ICM-Soft   Macro F1
All       CLIP           -0.6546    -5.3096    0.3856
          ViLT           -0.7089    -5.9832    0.3841
          Our Approach   -0.6123    -5.2668    0.3841
Español   CLIP           -0.7533    -5.7513    0.3694
          ViLT           -0.7131    -5.7680    0.3733
          Our Approach   -0.6128    -5.4847    0.3636
English   CLIP           -0.5554    -4.9616    0.4019
          ViLT           -0.7043    -6.2139    0.3835
          Our Approach   -0.6144    -5.1208    0.3949

Table 5 illustrates the performance of the models on the Task 5 test dataset. In this task, CLIP achieves a macro F1 score of 38.56%, with scores of -0.6546 for ICM-Hard and -5.3096 for ICM-Soft. ViLT performs marginally worse than CLIP with a macro F1 score of 38.41%, scoring -0.7089 for ICM-Hard and -5.9832 for ICM-Soft. Our approach shows the best performance in both ICM metrics, with scores of -0.6123 and -5.2668 respectively, while its macro F1 of 38.41% is comparable to ViLT. On the Spanish subset, CLIP performs similarly to its overall performance, with a macro F1 score of 36.94%, while ViLT slightly outperforms CLIP in macro F1 with a score of 37.33% (ICM-Soft of -5.7680). Our approach achieves the most promising results on the Spanish subset, with the best ICM-Hard and ICM-Soft scores of -0.6128 and -5.4847, respectively. On the English subset, CLIP performs slightly better with a macro F1 score of 40.19%, whereas ViLT shows a slight decrease with a score of 38.35%. Our approach remains competitive with the best ICM-Hard score and a macro F1 score of 39.49%.
Key observations indicate that performance varies across subsets and models, with CLIP and ViLT performing consistently and our approach obtaining the highest scores in the ICM-Hard metric and largely better performance in the ICM-Soft metric as well, while maintaining competitive macro F1 scores.

5.4. Results on Test Dataset of Task 6

Table 6
Task 6 performance on the test dataset. Our approach outperforms the other models in all splits in the hard-hard evaluation context, as indicated by the ICM-Hard scores. It also achieves the best performance in the macro F1 metric.

Split     Model          ICM-Hard   ICM-Soft   Macro F1
All       CLIP           -1.5499    -11.8047   0.3053
          ViLT           -1.4684    -11.2593   0.3093
          Our Approach   -1.3631    -13.2556   0.3356
Español   CLIP           -1.7780    -12.4430   0.2873
          ViLT           -1.7534    -12.8432   0.2987
          Our Approach   -1.6610    -15.5081   0.3192
English   CLIP           -1.3484    -11.7789   0.3207
          ViLT           -1.2112    -10.3388   0.3170
          Our Approach   -1.1027    -11.9271   0.3487

Table 6 reports the performance of the models on the Task 6 test dataset. For the overall dataset, our approach outperforms both CLIP and ViLT, achieving the best ICM-Hard score of -1.3631 and the highest macro F1 score of 33.56%, though it shows a lower ICM-Soft score of -13.2556 compared to ViLT's best score of -11.2593. In the Spanish subset, our approach again demonstrates superior performance with the highest macro F1 score of 31.92% and the best ICM-Hard score of -1.6610, although it has a worse ICM-Soft score of -15.5081 compared to CLIP's -12.4430. For the English subset, our approach achieves the highest overall performance, leading with a macro F1 score of 34.87% and the best ICM-Hard score of -1.1027, while ViLT shows the best ICM-Soft score of -10.3388. Across all subsets, our approach excels consistently in the macro F1 and ICM-Hard metrics, indicating its robustness in handling hard instances for multi-label tasks, while ViLT often provides the best ICM-Soft scores, demonstrating its effectiveness in handling soft instances. Overall, this table highlights the strengths of each model in different aspects of Task 6, with our approach showing the most balanced and highest performance in the key metrics across the various subsets.

6. Error Analysis

6.1. Confusion Matrices

6.1.1. Confusion Matrix Analysis for Task 4

Figure 2: Confusion matrices of the predictions of (a) CLIP, (b) ViLT, and (c) our approach for task 4. Our approach offers the best estimation for true positives and true negatives.

From the confusion matrices for task 4 in Figure 2, we can see that the CLIP model excels at identifying the 300 true positive cases, covering 69% of the actual positives. However, it has high Type-I (49% false positives) and Type-II (31% false negatives) error rates, resulting in significant misclassifications. ViLT performs better than CLIP, yielding 341 true positive cases, which accounts for 79% of the actual positives. But it has a higher Type-I (55% false positives) and moderate Type-II (21% false negatives) error rate, resulting in lower accuracy at detecting the true negative cases, which account for only 45% of the actual negatives. Our approach, on the other hand, performs consistently on both the true positives and the true negatives, with both at 63%. To be more specific, the true negative cases have been handled better than by both CLIP and ViLT. It also has consistent Type-I and Type-II error rates, both clocking in at 37%.
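The row-normalized matrices in Figures 2 and 3 can be reproduced with scikit-learn; the sketch below uses hypothetical task 4 predictions, and the variable names and values are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical hard predictions for task 4; the real analysis uses the validation split.
y_true = ["YES", "YES", "NO", "NO", "YES", "NO"]
y_pred = ["YES", "NO", "NO", "YES", "YES", "NO"]

# normalize="true" divides each row by the number of actual samples in that class,
# matching the row-normalized percentages discussed above.
cm = confusion_matrix(y_true, y_pred, labels=["YES", "NO"], normalize="true")
print(cm)  # [[0.667, 0.333], [0.333, 0.667]]
```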
6.1.2. Confusion Matrix Analysis for Task 5

Figure 3: Confusion matrices of the predictions of (a) CLIP, (b) ViLT, and (c) our approach for task 5. While CLIP and ViLT offer better estimates for the true negatives ("NO"), our proposal handles the true "DIRECT" values better than the other two.

Using the confusion matrices for task 5 in Figure 3, we observe that the CLIP model performs well on the "NO" class with 148 correct predictions, but suffers from high misclassification rates for the "DIRECT" and "JUDGEMENTAL" classes. The ViLT model shows a similar trend. It performs better than CLIP on both the "NO" and "DIRECT" classes with 58% and 30% correct predictions respectively, but it still shows a tendency to wrongly assign other classes to "JUDGEMENTAL", despite that class having very low representation in the dataset. Our approach, on the other hand, performs well in predicting the true "DIRECT" samples, with the highest rate of 57%. It has comparable performance for the "NO" class, but struggles heavily with "JUDGEMENTAL" samples, showing signs of being affected by the dataset distribution.

7. Conclusion

In this study, we addressed the pressing issue of sexism in online content, focusing on the complexity posed by multimodal and multilingual (English and Spanish) data. Our proposed attention-enhanced approach effectively integrates textual and visual contexts, providing enriched representations for sexism classification. By leveraging the EXIST 2024 dataset, which offers a thoroughly annotated look at the variants of sexism present in online content, we demonstrated improvements over existing models such as CLIP and ViLT. Our empirical evaluations and subsequent error analysis highlighted both the strengths of our system and its areas for improvement. The findings underscore the importance of comprehensive datasets and advanced model architectures in tackling online discrimination. Future work will explore further enhancements in multimodal learning and expand the approach to other forms of harmful content beyond sexism, contributing to a safer and more inclusive online environment.

Acknowledgements

This project has been sponsored by Penta Global Limited, Bangladesh. We would like to express our deepest gratitude to Penta Global for their financial support.

References

[1] J. B. Walther, Social media and online hate, Current Opinion in Psychology 45 (2022) 101298.
[2] B. Kostadinovska-Stojchevska, E. Shalevska, Internet memes and their socio-linguistic features, English Language and Linguistics 2 (2018). doi:10.5281/zenodo.1460989.
[3] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful memes challenge: Detecting hate speech in multimodal memes, 2021. arXiv:2005.04790.
[4] H. R. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 Task 10: Explainable Detection of Online Sexism, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, 2023. URL: http://arxiv.org/abs/2303.04222. doi:10.48550/arXiv.2303.04222.
[5] E. Fersini, D. Nozza, P. Rosso, Overview of the EVALITA 2018 task on automatic misogyny identification (AMI), in: EVALITA@CLiC-it, 2018.
[6] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[7] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[8] A. Defazio, X. Yang, H. Mehta, K. Mishchenko, A. Khaled, A. Cutkosky, The road less scheduled, 2024. arXiv:2405.15682.
[9] M. Fahim, M. S. Shahriar, M. R. Amin, HateXplain space model: Fusing robustness with explainability in hate speech analysis (2023).
[10] M. Fahim, Aambela at BLP-2023 Task 2: Enhancing BanglaBERT performance for Bangla sentiment analysis task with in-task pretraining and adversarial weight perturbation, in: Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), 2023, pp. 317–323.
[11] M. Fahim, A. A. Ali, M. A. Amin, A. M. Rahman, EDAL: Entropy based dynamic attention loss for hate speech classification, in: Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 775–785.
[12] UNEDLENAR, PyEvALL, https://github.com/UNEDLENAR/PyEvALL, 2024.
[13] E. Amigó, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[15] W. Kim, B. Son, I. Kim, ViLT: Vision-and-language transformer without convolution or region supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 5583–5594.