
MAMITA: Benchmarking Misogyny in Italian Memes

Elisabetta Fersini

Francesca Gasparini

Giulia Rizzi

Aurora Saibene

University of Milano-Bicocca, Milan, Italy

2025

This paper introduces MAMITA, a novel Italian multimodal benchmark dataset developed for the automatic detection of misogynistic content in online media, with a specific focus on memes. The dataset comprises 1880 memes sourced from popular social platforms (Facebook, Twitter, Instagram, Reddit) and meme-centric websites, selected using misogyny-related

Keywords: Misogynous Memes, Italian Benchmark, Expert vs Crowd Annotation, Perspectivism

1. Introduction

In recent years, the proliferation of user-generated content on social media has intensified the creation of hateful content against women, not only through textual messages that implicitly or explicitly contain harmful content, but also from a multimodal perspective¹. Among the diverse forms of online expression, memes have emerged as viral communication tools that can subtly convey harmful ideologies thanks to their combination of visual and textual elements. This kind of digital violence can be an extension of, or a precursor to, physical violence, stalking and harassment, but it can also be a way to punish, abuse or silence women, increasing the isolation of victims (Council of Europe, 2021) [2]. By combining apparently innocuous images with harmless superimposed text, misogynous memes can be easily created and spread, normalizing and trivializing detrimental stereotypes, objectification, and marginalization of women. Their viral nature, usually due to the ironic message behind them, contributes to their rapid spread across several media platforms, also fueling those communities that reinforce misogynistic ideologies.

Despite growing societal awareness and policy efforts aimed at addressing this issue, the automatic detection of multimodal misogynistic content remains a significant challenge. A major limitation in the development of robust misogyny detection systems is the scarcity of high-quality, multimodal datasets that reflect the nuanced and subjective nature of such content. Misogyny can manifest in explicit or implicit forms, often relying on cultural references, irony, or layered symbolism.

The identification of this kind of abusive content is of paramount importance not only for protecting women and guaranteeing safe online environments, but also for eventually generating counter-narratives².

In this paper, we provide three main contributions:

1. MAMITA (Multimedia Automatic Misogyny Identification in iTAlian), a novel Italian benchmark focused on misogynistic content in memes, which covers diverse forms of gender-based hate such as body shaming, objectification, stereotype, and violence.
2. A dual annotation strategy involving both domain experts and crowd annotators, enabling comparative analysis of labeling perspectives and improving the robustness of misogyny detection.
3. Perspectivist annotation, capturing for each annotator the perceived misogyny along with demographic and socio-cultural background such as age, education, and social status, to support research on disagreement in hate speech perception and detection.

² https://rm.coe.int/study-on-effectiveness-risks-and-potentials-of-using-counter-and-alter/1680b40775

[Figure 1: Examples of misogynous memes in the MAMITA dataset. Panels: (a) Shaming, (b) Stereotype, (c) Objectification, (d) Violence.]

The paper is organized as follows. Section 2 presents related work. Section 3 describes the proposed benchmark, detailing the two types of annotation, i.e., experts and crowd. Section 4 reports insights from human annotators and multimodal models. Finally, Section 5 outlines the conclusions.

2. Related Work

The automatic detection of hate speech, and misogyny in particular, has received growing attention in Natural Language Processing (NLP). Early efforts have primarily focused on text-based misogyny detection [3], using datasets sourced from Twitter and Reddit. Regarding multilingual settings, several benchmark datasets have been proposed in the literature to cover multiple languages. A few representative benchmarks are HATEVAL [4], focused on English and Spanish; BAJER [5], for the Danish language; BIASLY [6], focused on movie subtitles and colloquial expressions in North American film; ArMIS [7], for the Arabic language; and EXIST [8, 9], dealing with English and Spanish sexist expressions.

Regarding the Italian language, we can summarize two main text-related benchmarking initiatives, i.e., AMI [10, 11, 12] and PejorativITy [13]. AMI (Automatic Misogyny Identification) is a set of benchmark datasets that, starting from the initial challenge at Evalita 2018, has led to three main annotated corpora, i.e., AMI@Evalita 2018, AMI@Evalita 2020, and AMI-PRF. The AMI@Evalita 2018 dataset introduced in [10] provided one of the first benchmarks for detecting misogynistic language on social media in English and Italian tweets. Its successor, presented at AMI@Evalita 2020 [11], extends the former benchmark to also capture aggressiveness. Lastly, AMI-PRF [12] is the most recent dataset of tweets annotated for both misogyny and professional categories. A further contribution is PejorativITy [13], an Italian tweet corpus annotated at the word level for pejorativity and at the sentence level for misogyny.

While these efforts advanced text-based detection, they did not address the complexity of multimodal content such as memes, which often rely on implicit visual cues, humor, and cultural references to communicate harmful messages. Among the general hateful meme benchmarks, we can highlight five main initiatives focused on the English language, i.e., Facebook Hateful Memes [14], Memotion2 [15], Harmful Memes [16], MultiOFF [17], and Intervening Cyberbullying in Multimodal Memes (ICMM) [18]. However, these benchmarks do not capture the specificity of misogyny, which often relies on gender norms, implicit bias, and culturally coded references that differ significantly from general offensive content or other forms of targeted hate (e.g., against immigrants or people with disabilities). Only a few benchmarks have been proposed to deal with the peculiarity of hate against women in a multimodal setting, i.e., MAMI [19] for the English language, MIMIC [20] for Hindi, EXIST [21, 22] for English and Spanish, and the Dravidian corpus [23] focused on the Tamil and Malayalam languages.

Although all the previous initiatives represent a fundamental step towards the identification of hateful memes against women, to the best of our knowledge no benchmark dataset has been developed to specifically address misogynistic content in the Italian language, resulting in a remarkable gap in the resources available for the systematic investigation of this phenomenon within the Italian context. To this purpose, we propose MAMITA (Multimedia Automatic Misogyny Identification in iTAlian), a novel benchmark dataset for the Italian language that focuses on misogynous memes, composed of a wide range of multimodal expressions denoting body shaming, objectification, stereotyping, and violence. The dataset is developed using a dual annotation strategy that combines input from both domain experts and crowd annotators, enabling robust analysis of labeling perspectives.

3. MAMITA

The meme collection was primarily carried out using visual search engines such as Google Images and Pinterest, based on the keywords reported in Table 1. All the keywords were defined to capture four main categories of misogynous content, i.e., body shaming, objectification, stereotyping, and violence. The websites considered are typically dedicated to meme sharing (e.g., me.me and memedroid.com), as well as Instagram accounts focused on themes related to femininity (e.g., alpha woman and scaricatricidiporto). Additional content was sourced from Facebook groups intentionally created for the dissemination of misogynistic memes (e.g., facciaabuco, ignoranza soffocotti pecorina, and Io sono vaginatariano). The initial dataset consisted of approximately 2,000 memes. Pornographic content, low-quality images, and items that could not be clearly categorized as memes were subsequently removed. Memes were also normalized to a maximum resolution of 640×640 pixels, preserving their aspect ratio. The final dataset comprises 1880 memes, with the textual content transcribed using Optical Character Recognition (OCR) tools (https://www.onlineocr.net/). Examples of misogynous memes available in the MAMITA dataset are reported in Figure 1. The dataset was subsequently labeled by two distinct groups, i.e., expert and crowd annotators. The full dataset can be accessed by filling in the form https://forms.gle/5Xz1gcxJdrh6GHnq5.

Table 1: Keywords used for the meme collection, given as English (Italian).
bitch (stronza); blondes (bionde); call girl (escort); cheap (squallida); cheat (tradire / imbrogliare); clean (pulire); cleaning (pulizia); cold (fredda); complicated (complicata); cooking (cucinare); cougar (cugar); couple (coppia); crazy (pazza); cunt (cagna); dirty (sporca); dishwasher (lavastoviglie); driving (guida); dumb (stupida); equal rights (pari diritti); escort (escort); fat (grassa); female (femmina); feminism (femminismo); feminist (femminista); fuck (fottiti / scopare); girl (ragazza); girlfriend (fidanzata); girl power (potere femminile); girls (ragazze); gold digger (arrampicatrice sociale); harsch (dura / severa); hooker (prostituta); hore (puttana); house (casa); housewife (casalinga); inferior (inferiore); kitchen (cucina); lazy (pigra); marriage (matrimonio); Mars & Venus (Marte e Venere); milf (milf); misogynist (misogino); misogyny (misoginia); nazifeminist (nazifemminista); pregnancy (gravidanza); promiscuous (promiscua); prostitute (prostituta); rape (stupro); sandwich (panino); sex (sesso); sexism (sessismo); sexist (sessista); slut (zoccola); stupid (stupida); tits (tette); trixie (ragazza superficiale); unstable (instabile); wife (moglie); witch (strega); woman (donna)

3.1. Expert Annotation

For the annotation process performed by the experts, we involved two male and three female annotators. To label each meme, they adopted the definitions originally provided in [19], suitably adapted to cover the multimodal scenario. Each meme was reviewed by one male and two female experts. Each expert involved in the evaluation process analyzed the memes, classifying them as either misogynistic or non-misogynistic. In cases where a meme was perceived as misogynistic, evaluators were also asked to specify the type of misogyny, selecting among violence, body shaming, stereotyping, and objectification. In cases of uncertainty about the categorization, evaluators were allowed to select multiple types of misogyny.

The annotation process performed by the experts led to full agreement on 81.43% of the memes; in 70.86% of such cases, the memes were labeled as misogynous by all three annotators. We computed the Fleiss' Kappa statistic [24] to assess the level of agreement among the experts. The resulting score was 0.749, indicating substantial inter-annotator reliability in the perception of memes. This value suggests a strong consistency in the evaluators' judgments, particularly in distinguishing between misogynistic and non-misogynistic content.

The annotations given by the experts have also been aggregated following a majority voting strategy to assign a final golden label about misogyny. The dataset labeled by the experts finally consists of 57.71% misogynous and 42.29% not misogynous memes. Regarding the category of misogyny, since multiple overlapping annotations were possible, the final dataset evaluated by the experts contains, among those memes considered misogynous by the majority of the experts, 76.12% of the memes labeled as Objectification, 48.29% as Stereotype, 20.18% as Violence, and 8.84% as Body Shaming by at least one annotator. Considering that multiple labels are allowed for the type of misogyny, the dataset is provided with soft labels denoting a probability distribution for each category.

3.2. Crowd Annotation

For the annotation process performed by the crowd, we prepared a Google Form and engaged trusted voluntary annotators (from 4 to 10 labelers for each meme). The total number of volunteers involved is 231 (116 male, 110 female, and 5 non-responders). The most frequent age range is 25-34 years old, i.e., about 41% of the annotators. The native language is Italian for 99% of the participants, while the remaining three annotators speak Italian fluently. The dataset was divided into groups of 40 memes each, balanced in terms of classification (20 misogynistic, 20 non-misogynistic) according to the experts' preliminary evaluations, to be subsequently evaluated by the engaged crowd annotators. The choice of presenting a limited number of memes is due to the fact that sensory habituation causes people to reduce their response to repeated or continuous stimuli over time [25].

Each meme was independently reviewed by a varying number of labelers. Each annotator labeled the memes as either misogynistic or non-misogynistic and, when applicable, selected the primary Category of misogyny that they perceived most, together with the Intensity of the perceived misogyny. Moreover, in order to provide a benchmark characterized by perspectivist information, we acquired a few variables to characterize the annotators. In particular, participants were required to provide some information about themselves. Specifically, the following details were requested:

• Subjective Social Status (SSS): we introduced a variable that measures an individual's perception of his/her social position compared to others. To this purpose, we adopted the MacArthur scale introduced in [26]. Participants are asked to place themselves on a graduated scale consisting of ten steps, ranging from the highest to the lowest socioeconomic status. At the top of the scale (10) are individuals with the highest levels of income, education, and occupational prestige. At the bottom of the scale (1) are those with the lowest income, minimal education, and the least respected jobs, or who may be unemployed. This self-placement invites participants to express a subjective evaluation of their social position.
• Familiarity with memes: Yes/No response to whether they know what memes are.
• Frequency of meme visualization: how often the participant encounters memes, using a 7-point Likert scale ranging from Never to Very Often.
• Primary source of meme stimuli: social media, messaging apps, websites and forums, other.

Since the number of annotators varies for each meme, a meme was finally labeled as misogynous if at least 50% of its annotators provided the misogynous label. Based on the crowd annotations, the resulting dataset consists of 58.82% misogynous and 41.17% non-misogynous memes. The annotation process led to full agreement for 43.14% of the memes. If we focus on each class, 37.97% of the misogynous memes and 50.45% of the not misogynous ones show full agreement, denoting (as expected) a higher disagreement on misogynous content. To evaluate the overall level of agreement, we also computed Krippendorff's Alpha statistic [27], which yielded a score of 0.43. While the percentage of full agreement suggests some level of consistency, the Krippendorff's Alpha value indicates that a substantial portion of the agreement may be attributable to chance, highlighting the extremely subjective interpretation of what can be considered misogynous. As for the specific categories of misogyny, the dataset includes 70.97% of misogynous memes labeled as objectification, 55.87% as stereotype, 30.47% as violence, and 22.47% as body shaming by at least one annotator. Also in this case, the dataset is provided with soft labels denoting a probability distribution for each category, derived through the crowd annotation process.
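As an illustration of the aggregation and agreement measures described in Section 3, the following minimal sketch computes a gold label from binary votes, per-category soft labels, and Fleiss' Kappa. It is not the authors' code: the function names (`gold_label`, `soft_labels`, `fleiss_kappa`), the toy votes, and the reading of a soft label as the fraction of annotators who selected a category are assumptions made for this example.

```python
def gold_label(votes, threshold=0.5):
    """Return 1 (misogynous) when at least `threshold` of the binary votes are 1."""
    return int(sum(votes) / len(votes) >= threshold)


def soft_labels(selections, categories):
    """Fraction of annotators who selected each category.

    `selections` holds one set of categories per annotator; since several
    categories may be chosen, the fractions need not sum to 1.
    """
    n = len(selections)
    return {c: sum(c in s for s in selections) / n for c in categories}


def fleiss_kappa(counts):
    """Fleiss' Kappa; counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    props = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in props)
    return (p_obs - p_exp) / (1 - p_exp)


CATEGORIES = ["shaming", "stereotype", "objectification", "violence"]

# Toy data (hypothetical): three expert votes on four memes, 1 = misogynous.
expert_votes = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]]
gold = [gold_label(v) for v in expert_votes]
full_agreement = sum(len(set(v)) == 1 for v in expert_votes) / len(expert_votes)
# Per-meme counts of [misogynous, not misogynous] votes.
kappa = fleiss_kappa([[sum(v), len(v) - sum(v)] for v in expert_votes])
# Categories chosen by three hypothetical annotators for one meme.
soft = soft_labels([{"objectification"}, {"objectification", "stereotype"}, set()], CATEGORIES)
```

With three votes per meme, the 0.5 threshold coincides with majority voting, so the same helper covers both the expert and the crowd rule. Fleiss' Kappa requires a fixed number of raters per item, which is why it fits the expert setting, whereas the varying number of crowd labelers calls for a measure such as Krippendorff's Alpha.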

4. Insights from MAMITA

In this section, we present a twofold analysis of the MAMITA dataset. First, we investigate how socio-demographic and cognitive characteristics of human annotators, such as gender, age, and Subjective Social Status, influence the perception and labeling of misogynistic content. Then, we evaluate the performance of multimodal baseline models, specifically mCLIP and mBLIP, in detecting misogyny and disagreement in memes, providing a comparative perspective between human subjectivity and machine predictions.

4.1. Human Perspectives

To better understand how individual differences influence the perception of misogynistic content, we formulated three research questions.

[Q1] Does the perceived intensity of misogyny significantly differ between male and female annotators? The aim is to determine whether the observed differences in the perception of misogyny intensity between men and women are statistically significant or could be due to chance. To this purpose, the Welch t-test has been adopted, which does not assume the same variance between the two populations. In this specific case, the null hypothesis is that the two means of the perceived intensity are equal and that any observed difference in the data can be attributed to random error or natural sample variation, rather than to a real effect.

The Welch t-statistic is -13.98, where the negative sign indicates the direction of the difference, since the mean for women is higher than that for men (5.07 vs. 4.29), and the absolute value indicates how large the difference is in units of its standard error, i.e., the larger the absolute value, the more statistically significant the difference. In this case, the p-value, which indicates the likelihood that this difference occurred by chance, is extremely low (2.14 × 10^-43). The results show a highly significant difference in the perception of intensity between men and women, suggesting that the probability of observing such a difference by chance is practically zero.

[Q2] Do statistically significant differences exist among age groups in identifying misogynistic content? The core idea is to assess whether the probability of judging content as misogynistic depends on the annotator's age group. For this purpose, we estimated both a Chi-Squared statistic and a Binary Logistic Regression: the former verifies whether a relationship exists, while the latter estimates how much each age group affects the likelihood of judging content as misogynistic. In our case, the p-value equal to 7.10 related to the Chi-Squared test denotes a statistically significant relationship between age and the misogyny judgment.

As an additional observation, we report in Table 2 the results of the Binary Logistic Regression, where the dependent variable (misogynous or not) is binary. The independent variables are the age categories, compared with the 18-24 age group as the reference category. We can easily note that age is significantly associated with the likelihood of labeling content as misogynistic, with all age groups being statistically significant when compared to the baseline (18-24) (p-value < 0.01). Moreover, the Odds Ratios increase with age, particularly from age 45 and up. This indicates an increased probability of labeling content as misogynous as age increases (compared to the 18-24 group).

[Q3] Does the Subjective Social Status have a significant relationship with the intensity of the perceived misogyny? To explore the relationship between individuals' perceived social standing and their sensitivity to misogynistic content, we computed the Spearman correlation between SSS and the perceived intensity of misogyny. In particular, for each annotator, we considered their self-reported SSS score obtained from the background questionnaire and calculated the average intensity of misogyny they assigned across all memes they annotated as misogynistic. This approach allowed us to assess whether annotators with differing self-reported social positions systematically varied in how strongly they perceived misogynistic content. Spearman's rank correlation was chosen due to its suitability for capturing monotonic relationships without assuming normality in the data distributions.

The Spearman correlation analysis between the Subjective Social Status and the perceived intensity of misogynistic content yielded a statistically significant positive correlation (ρ = 0.209, p = 0.0015). While the correlation is relatively weak, it indicates that annotators with a higher SSS are slightly more likely to assign a higher intensity of perceived misogyny. This finding highlights the influence of annotator-level socio-cognitive traits on subjective annotation tasks and suggests the importance of modeling annotator variability when addressing harmful or sensitive content.
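To make the statistics behind Q1 and Q3 concrete, the sketch below computes Welch's t statistic (with the Welch-Satterthwaite degrees of freedom) and Spearman's rank correlation using only the Python standard library. It is an illustration under stated assumptions rather than the analysis code used for the paper: the function names and the toy intensity and SSS values are hypothetical, and p-values are deliberately omitted because they require a t-distribution that the standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


# Toy data (hypothetical): perceived intensity per annotator, split by gender.
men = [4.0, 5.0, 3.0, 4.0]
women = [5.0, 6.0, 5.0, 6.0]
t_stat, dof = welch_t(men, women)  # negative t: the second group has the higher mean

# Toy data (hypothetical): SSS score and mean assigned intensity per annotator.
sss_scores = [3, 5, 6, 7, 8]
mean_intensity = [4.2, 4.0, 5.1, 4.9, 5.6]
rho = spearman_rho(sss_scores, mean_intensity)
```

In practice, library routines such as `scipy.stats.ttest_ind` with `equal_var=False`, `scipy.stats.spearmanr`, and `scipy.stats.chi2_contingency` return the corresponding p-values as well; the hand-written versions are shown only to expose the formulas.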

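4.2. Multimodal Baseline Models

To assess the effectiveness of multimodal models in identifying misogynistic content and disagreement between annotators, we fine-tune two state-of-the-art architectures: mCLIP³ [28, 29] and mBLIP⁴ [30]. These models leverage both visual and textual information from memes, enabling a comprehensive understanding of their content. Both the vision encoder and text decoder are trained jointly with a classification head, allowing the models to tailor their multimodal representations to the specific task of misogyny and disagreement detection on the MAMITA dataset. To provide a simple and consistent baseline for evaluation, we fine-tune both models by adding a linear classification layer on top of their original representations, without further architectural modifications⁵. The classifier is trained using binary cross-entropy loss and the Adam optimizer. To compare the baseline models, we measure Precision (P), Recall (R), and F-Measure (F1), distinguishing between the misogynous label (+) and the non-misogynous one (-), as well as the agreement label (+) vs. the disagreement one (-). We adopt a 10-fold cross-validation approach to ensure robustness and generalizability of the evaluation.

³ https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1
⁴ https://huggingface.co/Gregor/mblip-mt0-xl
⁵ To ensure reproducibility of our results, we report the main training parameters used: batch size = 4, classification threshold = 0.5, and number of training epochs = 5.

The results reported in Table 3 highlight the performance of mCLIP and mBLIP in predicting misogynistic content, evaluated against both Crowd and Expert annotations. Overall, both models show good classification capabilities, with mBLIP consistently outperforming mCLIP across all metrics. In the Crowd setting, mBLIP achieves a higher average F1 score (0.79 vs. 0.70), demonstrating a better balance between precision and recall for both misogynous and not misogynous labels. It is interesting to note that mBLIP's F1+ (0.83) and F1- (0.76) suggest a strong ability to correctly identify both misogynistic and non-misogynistic content according to crowd judgments. Performance improves further when considering the Expert annotations. Both models exhibit higher F1 scores compared to the Crowd setting, with mBLIP again leading (Avg. F1 = 0.83 vs. 0.73 for mCLIP). This may indicate better alignment between the models' predictions and the expert labeling criteria, possibly due to more consistent or less ambiguous expert judgments. In both evaluation contexts, mBLIP proves to be the more robust of the two models, offering more reliable and accurate misogyny detection. These results suggest that state-of-the-art multimodal models, particularly mBLIP, can effectively capture harmful content signals when fine-tuned appropriately.

Table 4: Disagreement prediction performance on Crowd and Expert labels. (*) denotes models calibrated using the Youden's J statistic.

Crowd
Approach     P+    R+    F1+   P-    R-    F1-   Avg. F1
mCLIP        0.00  0.00  0.00  0.57  1.00  0.72  0.36
mBLIP        0.00  0.00  0.00  0.57  1.00  0.72  0.36
mCLIP (*)    0.44  0.52  0.48  0.58  0.49  0.53  0.50
mBLIP (*)    0.42  0.69  0.53  0.55  0.28  0.37  0.45

Expert
Approach     P+    R+    F1+   P-    R-    F1-   Avg. F1
mCLIP        0.81  1.00  0.90  0.00  0.00  0.00  0.45
mBLIP        0.81  1.00  0.90  0.00  0.00  0.00  0.45
mCLIP (*)    0.82  0.37  0.51  0.19  0.66  0.30  0.40
mBLIP (*)    0.83  0.34  0.49  0.19  0.68  0.30  0.39

Table 4 reports the performance of the considered baseline models in predicting disagreement between crowd and expert judgments, under two conditions: raw model outputs and outputs calibrated using the Youden's J statistic [31] to determine the best classification threshold on the probability distribution. When evaluating against the Crowd labels, both mCLIP and mBLIP perform poorly, assigning all instances to the negative class. However, applying the Youden correction significantly improves performance, increasing the average F1 from 0.36 to 0.50 for mCLIP and 0.45 for mBLIP. In the Expert setting, uncalibrated models exhibit an inverse pattern: perfect recall and high precision for positive labels (F1 = 0.90), but no detection of negative samples, again reflecting a strong prediction bias. The use of the Youden's threshold reduces such a bias (F1- = 0.30), at the cost of reduced precision and recall on the positive class. Overall, these results highlight a key challenge in using pretrained multimodal models for subtle content moderation tasks: while default thresholds may lead to heavily skewed predictions, simple calibration strategies can significantly rebalance model behavior, though not without trade-offs.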
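The threshold calibration mentioned for Table 4 can be sketched as follows. Youden's J statistic is J = sensitivity + specificity - 1, which equals the true positive rate minus the false positive rate, and the calibrated threshold is the one that maximizes J. The snippet is a minimal illustration and not the authors' implementation: the function name, the choice of candidate thresholds (the observed scores), and the toy scores are assumptions.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores.

    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # keep the first (lowest) threshold reaching the maximum
            best_t, best_j = t, j
    return best_t, best_j


# Toy scores (hypothetical): every probability lies below the default 0.5
# threshold, so the uncalibrated classifier predicts only the negative class.
labels = [0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]

default_pred = [int(s >= 0.5) for s in scores]
threshold, j_stat = youden_threshold(labels, scores)
calibrated_pred = [int(s >= threshold) for s in scores]
```

On these toy scores the default threshold misses every positive sample, while the calibrated one recovers both positives at the price of one false positive, which mirrors the trade-off discussed for Table 4.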

We further analyzed the models' errors to better evaluate their performance, particularly considering the instances that were mislabeled by both classification models. A first analysis focuses on the evaluation of errors in misogyny identification with respect to the different types of misogyny. Figure 2 reports four violin plots corresponding to different misogyny categories, distinguishing between Expert and Crowd annotations. Each plot displays the distribution of a specific variable as a percentage⁶ on the y-axis. The bright-colored regions represent the distributions within the whole dataset, while the darker-colored regions overlaid within each violin illustrate the distribution of the errors for each label. From the visual comparison, we can easily notice that:

• Stereotype and Objectification labels exhibit relatively symmetrical and balanced distributions with a moderate spread, indicating consistent distribution across a broad range of values. The error distributions for these labels are also centered, suggesting relatively low and uniform prediction errors.
• Shaming and Violence have sharp, narrow dataset and error distributions, denoting a lot of misogynous memes not belonging to those classes.

⁶ The percentage value has been computed with respect to the subset of data labeled as misogynous by the majority of annotators.

[Figure 2: Violin plots of the misogyny categories with the error distributions overlaid. Panels: (a) Expert, (b) Crowd.]

By analyzing the shapes of the violin plots, we can notice that the violins dedicated to Shaming and Violence assume a shape broader at the base, denoting a significant portion of misogynous memes not labeled with those types. Considering all the misogyny types, we can notice that the Expert plot is consistent in shape with the Crowd one for all the types, indicating a general ability of the crowd annotators to recognize all the misogyny types.

Subsequently, we evaluated the models' ability to detect misogynistic content with respect to disagreement between annotators. Figure 3 reports two violin plots of the agreement among annotators along with the prediction error distributions for misogyny classification, distinguishing between Expert and Crowd annotators. The y-axis represents annotator agreement as a percentage, with higher values indicating stronger consensus among annotators, both on the Misogynous and Non-Misogynous labels. Each violin, representing the Expert and the Crowd evaluation respectively, is divided into two layers: the lighter area represents the distribution of the full dataset, while the darker overlay highlights the distribution of the model's prediction errors. It is easy to notice that the Expert-dedicated violin assumes an hourglass shape, denoting a tendency for Experts to agree on both classes. The Crowd plot instead shows a more uniform distribution, denoting a greater variability in the disagreement between crowd annotators. In both cases, the error distribution appears to be consistent and unrelated to the disagreement distribution. These patterns indicate that model errors are not influenced by the degree of annotator agreement. As part of future work, we plan to conduct a more in-depth qualitative error analysis, with a specific focus on identifying the most challenging archetypes of controversial or ambiguous memes, following the approach proposed in [32], to better understand the limitations of current models and highlight open challenges in the detection of misogyny in Italian.

Figure 3: Violin plots showing the distribution of annotator agreement (y-axis, percentage) distinguishing between class label (Misogyny vs. Not-Misogyny) and annotation sources (Experts vs. Crowd). The lighter area in each violin represents the full dataset distribution, while the darker overlay indicates the distribution of model prediction errors.

5. Conclusions

In this paper, we presented a novel Italian multimodal benchmark dataset designed to support the automatic detection of misogynistic memes in online social media. The dataset emphasizes diversity in content and labeling perspectives, offering a comprehensive view of how misogyny is manifested and perceived across different annotator groups. The proposed benchmark, collected using a variety of popular platforms and focusing on a wide spectrum of misogynistic expressions, ensures a broad coverage of the phenomenon. Moreover, the dual annotation strategy, which includes both domain experts and crowd annotators, provides an opportunity to investigate the discrepancies in perceiving contents, therefore improving the robustness of future automatic detection systems that account for perspectivism.

Acknowledgments

We acknowledge the support of the PNRR ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing (CN00000013), under the NRRP MUR program funded by the NextGenerationEU. This work has also been supported by ReGAInS, Department of Excellence. The authors would also like to thank the master's students Annalisa Bachir, Gökalp Recep Boz, Gaia Campisi, Marco Cervelli, Lisa Cocchia, Francesca Frigerio, Rosa Gotti, Monica Mantovani, Matteo Parisi, and Emma Salvadori for their significant contributions; their dedicated efforts were fundamental to the development and compilation of the MAMITA dataset.

During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.

References

[1] C. Bosco, E. Ježek, M. Polignano, M. Sanguinetti, Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), 2025, pp. –.

[2] The Council of Europe, 6th general report on GREVIO's activities: Group of experts on action against violence against women and domestic violence,

2024. URL: https://rm.coe.int/6th-general-report-on-grevio-s-activities/1680b5cbe8.

[3] Hewitt, Tiropanis, Bokhove, The problem of identifying misogynist language on twitter (and [...] 8th ACM Conference on Web Science, 2016, pp. 333-335.

[4] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, Rosso, M. Sanguinetti, SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: May, Shutova, Herbelot, Zhu, M. Apidi[...] 13th International Workshop on Semantic Evalu[...] Minneapolis, Minnesota, USA, 2019, pp. 54-63.

[5] P. Zeinert, N. Inie, L. Derczynski, Annotating online misogyny, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), [...] 2021, pp. 3181-3197.

[6] Sheppard, Richter, Cohen, E. Smith, [...] An expert-annotated dataset for subtle misogyny detection and mitigation, in: L.-W. Ku, A. Martins, Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 427-452.

[7]

[10] [...] volume 2263, CEUR-WS, 2018, pp. 1-9.

[11] Fersini, Nozza, Rosso, et al., AMI @ EVALITA2020: Automatic misogyny identification, in: Proceedings of the 7th evaluation campaign of Natu[...] (EVALITA 2020), 2020.

[12] Cascione, Cerulli, M. M. Manerba, L. Passaro, Women's professions and targeted misogyny online, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 182-189.

[13] Muti, Ruggeri, Toraman, Barrón-Cedeño, C. Zapparoli, PejorativITy: Disambiguating pe[...] Italian tweets, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 12700-12711.

[14] Kiela, Firooz, Mohan, Goswami, Singh, [...] lenge: Detecting hate speech in multimodal memes, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Bal[...] Processing Systems, volume 33, Curran Associates Inc., 2020, pp. 2611-2624.

[15] Ramamoorthy, Gunti, Mishra, S. Suryavardan, A. Reganti, Patwa, A. DaS, T. Chakraborty, Sheth, Ekbal, et al., Memotion 2: Dataset on sentiment and emotion analysis of memes, in: Pro

Almanea , M. Poesio, ArMIS - the Arabic misog- ceedings of De-Factify: workshop on multimodal

disagreements, in: N. Calzolari , F.

Béchet , P.

Blache , 2022 .

Choukri ,

Cieri ,

Declerck ,

Goggi , H. Isa- [16]

Sharma ,

M. S.

Akhtar ,

Nakov , T. Chakraborty,

European Language Resources Association , Mar- Computational Linguistics: NAACL 2022 , Associa-

seille , France, 2022 , pp. 2282 - 2291 . tion for Computational Linguistics, Seattle, United [8]

Rodríguez-Sánchez ,

Carrillo-de Albornoz , States, 2022 , pp. 1572 - 1588 .

Plaza ,

Mendieta-Aragón ,

Marco-Remón , [17]

Suryawanshi ,

B. R.

Chakravarthi , M. Arcan,

Rosso , Overview of exist 2022: sexism iden- for identifying ofensive content in image and text,

Lenguaje Natural 69 ( 2022 ) 229 - 240 . S. Malmasi,

Murdock , D. Kadar (Eds.), Proceed[9]

Plaza ,

Carrillo-de Albornoz , R. Morante, ings of the Second Workshop on Trolling, Aggres-

of exist 2023: sexism identification in social net- sources Association (ELRA), Marseille , France, 2020 ,

works, in: European Conference on Information pp. 32 - 41 .

Retrieval , Springer, 2023 , pp. 593 - 599 . [18]

Jha ,

Jain ,

Mandal ,

Chadha , S. Saha, [10]

Fersini ,

Nozza ,

Rosso , et al., Overview of P. Bhattacharyya, MemeGuard: An LLM and VLM-

the evalita 2018 task on automatic misogyny iden- based framework for advancing content moderation

Srikumar (Eds.), Proceedings of the 62nd An- J . Clark,

Krueger , I. Sutskever , Learning transfer-

Linguistics (Volume 1 : Long

Papers)

, Association sion , in: Proc. of the 38th International Conference

for Computational

Linguistics

, Bangkok, Thailand, on Machine Learning (ICML) , volume 139 of Proc. of

2024 , pp. 8084 - 8104 . Machine Learning Research, PMLR, 2021 , pp. 8748 - [ 19]

Fersini ,

Gasparini ,

Rizzi ,

Saibene , 8763. URL: https://proceedings.mlr.press/v139/rad

Chulvi ,

Rosso ,

Lees ,

Sorensen , Semeval- ford21a.html.

2022 task 5: Multimedia automatic misogyny iden- [29]

Reimers , I. Gurevych , Sentence-bert: Sentence

tification, in: Proceedings of the 16th International embeddings using siamese bert-networks , in: Pro-

Workshop on Semantic

Evaluation (SemEval-2022), ceedings of the 2019 Conference on Empirical Meth-

2022 , pp. 533 - 549 . ods in Natural Language Processing , Association [20]

Singh ,

Sharma ,

V. K.

Singh , Mimic: misog- for Computational Linguistics , 2019 . URL: http:

yny identification in multimodal internet content //arxiv .org/abs/ 1908 .10084.

in hindi-english code-mixed language , ACM Trans- [30]

Geigle ,

Jain ,

Timofte , G. Glavaš, mblip:

formation Processing ( 2024 ). in: Proceedings of the 3rd Workshop on Advances [21]

Plaza ,

Carrillo-de Albornoz ,

Ruiz , A . Maeso, in Language and Vision Research (ALVR) , 2024 , pp.

Chulvi ,

Rosso ,

Amigó , J. Gonzalo, 7 - 25 .

Morante ,

Spina , Overview of exist 2024-learn- [31]

W. J.

Youden , Index for rating diagnostic tests , Can-

ing with disagreement for sexism identification and cer 3 (

1950 ) 32 - 35 .

characterization in tweets and memes , in: Inter- [32]

Rizzi ,

Gasparini ,

Saibene ,

Rosso , E. Fersini,

2024 , pp. 93 - 117 . Management 60 ( 2023 ) 103474 . [22]

Plaza ,

Carrillo-de Albornoz , I. Arcos, P. Rosso,

2025: Learning with disagreement for sexism iden-

Information

Retrieval

, Springer, 2025 , pp. 442 - 449 . [23]

B. R.

Chakravarthi ,

Ponnusamy , S. Rajiakodi,

detection: Dravidianlangtech@ naacl 2025 , in:

guages , 2025 , pp. 721 - 731 . [24]

J. L.

Fleiss , Measuring nominal scale agreement

among many raters ., Psychological bulletin 76

( 1971 ) 378 . [25]

Tarantino ,

Passerello ,

Ben-Sasson , T. Y.

naire, PloS one 19 ( 2024 ) e0309030 . [26]

N. E.

Adler ,

Boyce ,

M. A.

Chesney , S. Cohen,

American psychologist 49 ( 1994 ) 15 . [27]

A. F.

Hayes ,

Krippendorf , Answering the call

Communication methods and measures 1 ( 2007 )

77- 89 . [28]

Radford ,

J. W.

Kim ,

Hallacy , A . Ramesh,