From Explanation to Detection: Multimodal Insights into Disagreement in Misogynous Memes

Giulia Rizzi¹,²,*, Paolo Rosso², Elisabetta Fersini¹,*
¹ University of Milano-Bicocca, Milan, Italy
² Universitat Politècnica de València, Valencia, Spain

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author. g.rizzi10@campus.unimib.it (G. Rizzi); prosso@dsic.upv.es (P. Rosso); elisabetta.fersini@unimib.it (E. Fersini)
ORCID: 0000-0002-0619-0760 (G. Rizzi); 0000-0002-8922-1242 (P. Rosso); 0000-0002-8987-100X (E. Fersini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Warning: This paper contains examples of language and images that may be offensive.

This paper presents a probabilistic approach to identifying disagreement-related elements in misogynous memes by considering both modalities that compose a meme (i.e., the visual and the textual source). Several methodologies that exploit such elements to identify disagreement among annotators have been investigated and evaluated on the Multimedia Automatic Misogyny Identification (MAMI) dataset [1]. The proposed unsupervised approach reaches performance comparable to, and in some cases better than, state-of-the-art approaches, but with a reduced number of parameters to be estimated. The source code of our approaches is publicly available at https://github.com/MIND-Lab/From-Explanation-to-Detection-Multimodal-Insights-into-Disagreement-in-Misogynous-Memes.

Keywords: Disagreement, Perspectivism, Multimodal, Misogyny

1. Introduction

Hate detection has been a serious concern in recent years, as hateful content penetrates internet platforms and causes harm to individuals across various communities. In the online environment, users have found new modes of representation to express various types of hatred, including deeply rooted ideologies and beliefs with historical origins, for example towards women [2]. Detecting abusive language has therefore become an increasingly important task. The challenges introduced by these new modes of representation, which require a multimodal analysis, are further compounded by the subjectivity of the task: individuals' perceptions of what characterizes a hateful message vary widely. This diversity is reflected in the labeling phase in the form of disagreement among annotators. Identifying the elements within a sample that can lead to disagreement is of paramount importance for several reasons: for content that can lead to disagreement, specific annotation policies might be introduced, and the number of annotators might be enlarged to capture multiple perspectives [3, 4, 5].

In this work, we propose a methodology to identify the disagreement-related elements in multimodal samples by exploring both the visual and the textual elements in the Multimedia Automatic Misogyny Identification (MAMI) dataset [1]. Moreover, four different strategies that exploit the presence of such elements to identify disagreement are investigated.

2. Related Works

Many natural language tasks, such as hate speech detection, humor detection, and sentiment analysis, involve subjectivity, since they require an interpretation based on human judgment, cultural context, or personal opinion [6]. This phenomenon is reflected in datasets through multiple labels from different annotators, or through a confidence level attached to the ground-truth labels. Labels derived from different interpretations are therefore able to capture multiple perspectives and understandings [6]. Information about annotators' disagreement has primarily been exploited as a means to improve data quality by excluding controversial instances [7, 8]. Alternatively, aiming at improving model performance, different strategies have been developed to exploit disagreement information in the training phase. For instance, in [9] the authors assign weights to instances to prioritize the ones with higher confidence levels. Another commonly adopted strategy [6, 10] aims at learning directly from disagreement without considering any aggregated label.

While a considerable amount of research has been conducted to understand the reasons behind annotators' disagreement [11, 12, 8] and to leverage disagreement when training classification models [13, 14, 15, 16, 17, 18, 19], comparatively little attention has been devoted to the explanation and a priori recognition of disagreement in hateful content. A taxonomy of the possible reasons leading to annotators' disagreement has been proposed by [12]. This taxonomy articulates four macro-categories of reasons behind disagreement: sloppy annotations, ambiguity, missing information, and subjectivity. Moreover, the authors evaluate the impact of the different types on classification performance.

Only recently have works focused on the task of explaining disagreement [20, 21, 22, 23]. In [21], the authors propose exploratory text visualization techniques as a method for analyzing different perspectives in annotated data. In [22], the authors identify the textual constituents that contribute to the explanation of a hateful message by exploiting integrated gradients within a filtering strategy. A more recent approach [23] proposes a probabilistic semantic approach for the identification of disagreement-related constituents (e.g., textual elements) in hateful content. Overall, the findings indicate that, while LLMs can yield promising results, comparable outcomes can be attained with less complex strategies and fewer computational resources. While previous research has concentrated on the analysis of textual disagreement, this study represents, to the best of our knowledge, a first insight into the explanation of multimodal disagreement. In particular, we have revised the methodology proposed in [23] and extended it to the multimodal environment, in order to consider not only textual elements but also visual ones.

3. Proposed Approach
3.1. Identification of Disagreement-Related Elements

The first phase of the proposed approach aims to evaluate the relationship between the elements (both visual and textual) that compose a meme and annotators' disagreement. Preliminary preprocessing operations have been performed before identifying disagreement-related elements. Concerning the textual component, preprocessing operations (i.e., tokenization, lemmatization, lowercasing, and stop-word removal) have been performed to identify a valid set of tokens¹ that might be related to disagreement. Concerning the image component, the set of 14 human-readable concepts (tags) identified by [24] to capture specific characteristics of misogynous content has been adopted. As proposed by the authors, the tags were extracted via the Clarifai API [25]. These preprocessing steps allowed us to extract a list of visual and textual elements from each meme in the dataset.

¹ To guarantee a more robust evaluation, tokens that appear fewer than 10 times in the dataset have been removed.

In order to measure the relationship between each element in the memes and the disagreement among annotators, the approach proposed in [23] has been extended to a multimodal scenario. In particular, [23] introduces a methodology to identify disagreement-related constituents that, however, is limited to textual content. The approach includes a strategy to identify disagreement-related textual constituents and an approach for generalization towards unseen textual constituents. Both methods have been extended to a multimodal scenario in order to identify disagreement-related elements in both the textual and the visual sources that compose a meme.

Given an element e, a corresponding Element Disagreement Score (EDS(e)) has been computed according to the following equation:

EDS(e) = P(Agree | e) − P(¬Agree | e)   (1)

where P(Agree | e) represents the conditional probability that there is agreement on a meme given that the meme contains the element e. Analogously, P(¬Agree | e) denotes the conditional probability that there is no agreement on a meme given that the meme contains the element e. Since the EDS is the difference between two complementary probabilities, it is bounded within the range −1 to +1. A higher positive score indicates stronger agreement between annotators, whereas a lower negative score suggests disagreement. The score can be estimated on the training data and exploited to identify additional disagreement-related elements in unseen memes.

3.2. Disagreement Identification

Once the Element Disagreement Scores have been estimated for each visual and textual element in the training dataset, they can be exploited to qualify the level of disagreement on unseen samples. Analogously to what was carried out in [23], different aggregation strategies have been investigated, relying on the hypothesis that the identified elements can be exploited to detect disagreement thanks to their different distribution in samples with and without agreement.

For each meme in the test set, the corresponding list of elements and the corresponding Element Disagreement Scores estimated on the training data have been extracted. In particular, for each meme, the textual and visual elements have been identified and paired with the corresponding score, when available. The Multimodal Disagreement Score (MDS) has then been estimated according to the following strategies: Sum, Mean, Median, and Minimum. A threshold τ has been estimated via a grid-search approach for each strategy.

A qualitative evaluation, comprising a comparison with specific misogynistic terminology and an evaluation of the keywords included in the dataset-creation phase, has been performed to assess the quality of the EDS, while both the F1-scores for the two considered classes (agreement (+) and disagreement (−)) and a global F1-score have been computed to validate the MDS.

3.3. Generalization Towards Unseen Elements

The score estimation is strongly based on what is observed in the training data, resulting in the lack of scores for any element that does not appear in the training samples. This is particularly relevant for the textual components rather than the visual ones. In fact, while we must assume an open-world vocabulary for the textual source (where some terms in unseen data may not appear in the training set), the visual tags form a closed-world setting (only the same 14 tags can occur in both training and unseen memes). Since we need to generalize only over unseen textual constituents, for each unseen textual element ê an approximated EDS has been computed as follows:

• Embeddings of the training lexicon: the contextualized embedding representation of each textual element e has been obtained via mBERT [26]. An average embedding vector x_e is computed to jointly represent the multiple embedding representations of e derived from the different contexts where it occurs. In particular, given an element e and N sentences containing it, its vector representation is obtained by a simple average x_e = (1/N) Σ_{i=1}^{N} v_i, where v_i is the contextualized embedding vector related to the i-th occurrence of e, obtained through mBERT.

• Embeddings of the unseen term: given an unseen textual element ê within a given sentence, its contextualized embedding representation v_ê has been computed via mBERT [26].

• Most similar constituents: given an unseen textual element ê with embedding v_ê and the average embedding x_e of each training element e, the set D of most similar constituents to ê is determined according to:

D = ⋃_e { e | cos(x_e, v_ê) ≥ ψ }   (2)

where cos(x_e, v_ê) is the cosine similarity between the average contextualized embedding representation of e and the embedding of ê, and ψ is a threshold estimated via grid search.

• Unseen term score: the EDS of an unseen textual element ê is computed as the weighted average of the scores of its most similar constituents e in the training lexicon:

EDS(ê) = Σ_{e∈D} [cos(x_e, v_ê) · EDS(e)] / Σ_{e∈D} cos(x_e, v_ê)   (3)

• Multimodal Disagreement Score with unseen constituents: all the above strategies for MDS estimation have been extended to also include elements that do not belong to the training lexicon and for which the EDS has been estimated. In particular, given a multimodal sample s, the aggregation functions presented in Section 3.2 in this case consider the EDS values of both seen (EDS(e)) and unseen (EDS(ê)) elements. Such generalized aggregation functions will later be referred to with the prefix G-.

4. Results

The proposed approach has been evaluated on the Multimedia Automatic Misogyny Identification (MAMI) dataset [1], consisting of 10,000 memes for training and 1,000 memes for testing². The dataset comprises a range of memes that exemplify various forms of misogyny, including shaming, stereotyping, objectification, and violence. Each meme has been labeled for misogynistic content by three crowdsourced annotators³, with an estimated Fleiss' kappa [27] coefficient equal to 0.5767. In particular, the proposed approach has been adopted to estimate an Element Disagreement Score (EDS) for each element and, consequently, an MDS for each meme in the dataset.

² Although both a training and a test dataset are provided, only the training dataset is adopted, as the proposed work focuses on the analysis and prediction of disagreement and the test dataset is constructed to include only samples with complete agreement. The training dataset, instead, is characterized by 65% of data with complete agreement. It has therefore been divided so as to isolate 90% for token estimation and the remaining 10% for the evaluation.

³ Additionally, a boolean disagreement label has been derived to represent complete agreement among annotators. In particular, this label is set to 1 if all the annotators indicated the same label, and to 0 otherwise.

Table 1 reports the top-10 highest positive and lowest negative disagreement scores derived for the textual component.

Table 1: Terms with the highest positive and lowest negative scores.

Term        EDS     Term        EDS
flu         1.00    market     −0.64
folk        1.00    fetish     −0.60
bug         1.00    nut        −0.57
Bernie      1.00    hotel      −0.50
whale       1.00    apologize  −0.45
feeling     0.90    Miley      −0.45
gamer       0.87    lonely     −0.43
rest        0.87    award      −0.43
programmer  0.87    coke       −0.43
san         0.83    blowjob    −0.43

Table 2: Tags with the highest positive and lowest negative scores.

Tag             EDS     Tag         EDS
crockery        0.49    dishwasher  0.00
nudity          0.46    broom       0.14
cat             0.46    dog         0.20
car             0.43    child       0.23
kitchenutensil  0.41    woman       0.26

Figure 1: Visual representation of disagreement scores distinguishing between textual and visual elements. Positive and negative scores are represented in green and pink, respectively. The gray bar denotes elements for which the EDS has been estimated, while white represents elements with an EDS equal to zero.

We can notice how terms that are rarely linked with misogynous messages (e.g., flu), terms commonly used to address women in a harmful way (e.g., whale), and terms exploiting stereotypes (e.g., gamer and programmer) achieve a high positive score, indicating a strong relation with agreement. Additionally, some personal names of famous people (i.e., Bernie and Miley) appear within the ranking. In particular, such names might appear in memes as the target of a hateful message, referring to their personal life, physical appearance, or specific events that involved them. As a consequence, depending on the reasons that lead to such criticism (gender, physical appearance, and personal choices for Miley Cyrus vs. political stance and career, without the same gendered connotations, for Bernie Sanders), there might be disagreement about misogyny.

The concept of the "sexual marketplace" is often the subject of debate, particularly in relation to its intersection with misogynistic ideologies [28, 29]. Some supporters, often aligned with "manosphere" or "red pill" ideologies, argue that the sexual marketplace disproportionately empowers women, giving them more control over sexual selection and relationships, which can disadvantage men. On the other hand, critics assert that this perspective reduces human relationships to transactional exchanges and objectifies both genders, ultimately reinforcing misogynistic attitudes. This latter viewpoint asserts that framing relationships in market terms devalues emotional connection and perpetuates harmful stereotypes about women's worth being tied solely to their sexual desirability.
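The EDS of Equation (1) can be estimated from simple co-occurrence counts on the training split: for each element, the fraction of memes containing it that received a unanimous label. The following minimal sketch illustrates the idea; the function and variable names are ours, not taken from the released code.

```python
from collections import defaultdict

def element_disagreement_scores(memes):
    """Estimate EDS(e) = P(Agree|e) - P(not Agree|e) for every element.

    `memes` is an iterable of (elements, agree) pairs, where `elements`
    contains the textual tokens and visual tags extracted from a meme and
    `agree` is True when all annotators gave the same label.
    """
    agree_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for elements, agree in memes:
        for e in set(elements):       # count each element once per meme
            total_counts[e] += 1
            if agree:
                agree_counts[e] += 1
    scores = {}
    for e, n in total_counts.items():
        p_agree = agree_counts[e] / n
        scores[e] = p_agree - (1 - p_agree)   # bounded in [-1, +1]
    return scores

# Toy example: "flu" only occurs in memes with full agreement.
memes = [({"flu", "cat"}, True), ({"market", "cat"}, False), ({"flu"}, True)]
eds = element_disagreement_scores(memes)
# eds["flu"] == 1.0, eds["market"] == -1.0, eds["cat"] == 0.0
```

Because P(Agree|e) and P(¬Agree|e) are complementary, a single count ratio suffices; an element seen only in unanimously labeled memes gets +1, one seen only in contested memes gets −1.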
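Equations (2) and (3) amount to a cosine-weighted neighbourhood smoothing of the lexicon scores. A minimal sketch, assuming the mBERT embeddings have already been extracted and averaged upstream; names, the 2-d toy vectors, and the neutral 0.0 fallback for terms with no neighbour above ψ are our illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generalized_eds(unseen_vec, train_vecs, train_eds, psi=0.8):
    """Approximate the EDS of an unseen term as the cosine-weighted
    average (Eq. 3) of the EDS of the training terms whose average
    embedding is at least psi-similar to it (Eq. 2)."""
    sims = {e: cosine(v, unseen_vec) for e, v in train_vecs.items()}
    D = {e: s for e, s in sims.items() if s >= psi}  # most similar constituents
    if not D:
        return 0.0  # assumption: neutral score when no constituent qualifies
    return sum(s * train_eds[e] for e, s in D.items()) / sum(D.values())

# Toy 2-d "embeddings": the unseen term is close to "market" only.
train_vecs = {"market": np.array([1.0, 0.0]), "flu": np.array([0.0, 1.0])}
train_eds = {"market": -0.64, "flu": 1.0}
score = generalized_eds(np.array([0.9, 0.1]), train_vecs, train_eds)
# score == -0.64: only "market" passes the psi threshold
```

With a single qualifying neighbour the weighted average collapses to that neighbour's EDS, which is why the toy call returns exactly −0.64.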
The achieved results suggest the ability of the approach to detect this variety of interpretations and to reflect it in the EDS scores.

Table 2 reports the top-5 highest positive and lowest negative disagreement scores derived for the visual component. It is easy to notice how all the scores are positive and small, denoting a tendency of these tags to be only weakly related to the agreement label.

Figure 1 reports an example of a meme with disagreement, along with the visual representation of the EDS of its textual and visual elements. Moreover, as highlighted with a gray bar, some of the reported scores have been estimated: such scores correspond to constituents that are not present in the training dataset and for which the EDS could not be computed directly, so their visual representation corresponds to the score obtained through the estimation strategy. Overall, it is easy to notice the presence of elements strongly related to disagreement (i.e., sexual and market), highlighted in pink.

Figure 2 reports two memes that share the same text but differ in the image. Despite this commonality, the memes have been labeled differently: while the first meme has been labeled as misogynous by 2 annotators out of 3, the second one has been unanimously labeled as non-misogynous. Since these memes share a common textual representation, the derived textual elements and textual EDS are also equal, resulting in an indistinguishable representation that is ineffective for disagreement identification. Moreover, although the memes differ in visual content, resulting in different tags and, therefore, different visual EDS, as previously mentioned, this component alone is not sufficient for disagreement prediction. The findings demonstrate the necessity of jointly considering both the visual and the textual modality for the purpose of predicting disagreement.

Figure 2: Visual representation of disagreement scores distinguishing between textual and visual elements for two samples in the dataset. Positive and negative scores are represented in green and pink, respectively. White represents elements with an EDS equal to zero.

All the proposed aggregation strategies have been implemented, considering the modalities both individually and jointly. Table 3 and Table 4 summarise the results achieved on disagreement identification considering only the scores of elements derived from the textual component (i.e., terms) and only the scores of elements derived from the visual component (i.e., tags), respectively. Table 5 instead summarises the results achieved by aggregating the scores derived from all the elements (i.e., terms and tags). The results achieved on the textual component only highlight G-Mean as the best-performing approach. Overall, the estimation strategy yields an improvement in performance of up to 6%, confirming the ability of the proposed strategy to capture disagreement relationships for unseen terms. Furthermore, BERT [30]⁴ has been reported as a state-of-the-art baseline for unimodal textual classification. The achieved results show how BERT performs better on the majority class, struggling to predict the disagreement class; the proposed approach, instead, leads to performance that is more balanced between the two classes.

⁴ BERT has been implemented and fine-tuned using the Hugging Face framework with default hyperparameters. We adopted "bert-base-cased", available at https://huggingface.co/google-bert/bert-base-cased.

Table 3: Comparison of the different approaches for disagreement detection considering the textual component only. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach.

Approach    ψ     τ     F1+   F1−   F1 Score
Sum         -     3.1   0.61  0.39  0.50
Mean        -     0.2   0.78  0.20  0.49
Median      -     0.2   0.07  0.79  0.43
Minimum     -    -0.1   0.29  0.75  0.52
G-Sum       0.8   3.1   0.65  0.37  0.51
G-Mean      0.8   0.2   0.73  0.34  0.53
G-Median    0.8   0.2   0.77  0.21  0.49
G-Minimum   0.8  -0.1   0.75  0.30  0.52
BERT [30]   -     -     0.80  0.00  0.40

Table 4 reports the performances of the different approaches for disagreement identification considering the visual component only. While the Sum approach (i.e., the best-performing among the tag-based ones) demonstrates satisfactory performance in identifying positive instances (achieving an F1+ of 0.69), it exhibits considerable difficulty in accurately identifying negative instances.

Finally, Table 5 reports the performances of the different approaches for disagreement identification when jointly considering both modalities. Furthermore, for a better comparison of the performance achieved by the proposed approach, a state-of-the-art baseline for multimodal classification has been implemented: CLIP [31]⁵.

⁵ CLIP has been implemented and fine-tuned using the Hugging Face framework with default hyperparameters. In particular, we used the version available at https://huggingface.co/openai/clip-vit-large-patch14, to which we concatenated a linear layer for binary classification.

The inclusion of both modalities leads to a slight improvement in performance that, however, remains quite poor, highlighting the difficulty of the task. The inclusion of the unseen-constituent estimation leads to an improvement in performance (except for the sum-based method) of up to 8% for the mean-based approach. However, the best performances are achieved by the Minimum and G-Minimum approaches, for which the estimation methodology is not effective. This behavior may be attributed to the imbalance in the dataset.
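The τ values reported in Tables 3–5 are obtained by grid search over candidate thresholds. A compact sketch of the aggregation strategies of Section 3.2 and the threshold selection; we assume here that the global F1 is the macro average (F1+ + F1−)/2, which matches the reported tables, and all helper names are ours.

```python
import numpy as np

STRATEGIES = {"sum": np.sum, "mean": np.mean, "median": np.median, "minimum": np.min}

def mds(scores, strategy="mean"):
    # Aggregate a meme's element EDS values into a Multimodal Disagreement Score.
    return float(STRATEGIES[strategy](scores)) if scores else 0.0

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def grid_search_tau(samples, strategy, taus):
    """Pick the tau maximising the global (macro) F1. `samples` pairs each
    meme's list of element scores with its gold agreement label; a meme is
    predicted as 'agreement' (+) when its MDS exceeds tau."""
    def macro_f1(tau):
        tp = fp = fn = tn = 0
        for scores, agree in samples:
            pred = mds(scores, strategy) > tau
            tp += pred and agree
            fp += pred and not agree
            fn += (not pred) and agree
            tn += (not pred) and not agree
        return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2  # F1+ and F1-
    return max(taus, key=macro_f1)

# Toy split: two agreed memes with positive mean EDS, one contested meme.
samples = [([0.9, 0.5], True), ([0.4, -0.8], False), ([0.7], True)]
best = grid_search_tau(samples, "mean", taus=[-0.5, 0.0, 0.5])
# best == 0.0: it separates the toy samples perfectly
```

The same loop works unchanged for the G- variants: the only difference is that the per-meme score lists also contain the estimated EDS(ê) of unseen elements.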
Table 4: Comparison of the different approaches for disagreement detection considering the visual component only. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach.

Approach   ψ    τ     F1+   F1−   F1 Score
Sum        -    0.3   0.69  0.34  0.52
Mean       -    0.3   0.41  0.48  0.45
Median     -    0.3   0.41  0.49  0.40
Minimum    -    0.3   0.35  0.49  0.40

Table 5: Comparison of the different approaches for disagreement detection considering both the textual and the visual component. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach, and E is the set of elements.

Approach    ψ     τ     F1+   F1−   F1 Score  Param.
Sum         -     3.4   0.63  0.36  0.50      |E|
Mean        -     0.2   0.79  0.13  0.46      |E|
Median      -     0.2   0.80  0.05  0.42      |E|
Minimum     -     0     0.69  0.42  0.55      |E|
G-Sum       0.8   3.6   0.64  0.35  0.49      179M
G-Mean      0.9   0.2   0.70  0.39  0.54      179M
G-Median    0.9   0.2   0.77  0.21  0.49      179M
G-Minimum   0.1   0     0.69  0.42  0.55      179M
CLIP [31]   -     0.5   0.63  0.42  0.52      428M

The larger the number of samples with agreement, the greater the number of agreement-related terms that impact the estimation phase. Consequently, the estimation of scores for unseen elements is likely to be positive due to the aforementioned imbalance. Overall, the findings suggest that achieving a balanced performance remains challenging.

5. Conclusion and Future Works

This paper proposes a probabilistic approach to identify disagreement-related elements in multimodal content. The proposed approach allows for the identification of elements that could be used as a proxy to identify samples that might be perceived differently by annotators and that, therefore, could lead to disagreement. The achieved results highlight the difficulty of the task, denoting the need for more advanced approaches. Future work will include different strategies for image analysis, in order to provide a better description of the image itself through all the elements that compose it. Furthermore, a study of compositionality might be carried out to better represent the relationship among such elements inside the meme. The sense of a meme is often derived from the meanings of its individual parts (i.e., the image and the text) and the way they are combined. By analyzing how different elements interact and contribute to the overall message, it is possible to gain a deeper understanding of how the meaning is represented within the different modalities. This will help in identifying complex patterns and improving the accuracy of classification models.

Acknowledgments

We acknowledge the support of the PNRR ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing (CN00000013), under the NRRP MUR program funded by NextGenerationEU. The work of Paolo Rosso was carried out in the framework of the FairTransNLP-Stereotypes research project (PID2021-124361OB-C31) funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe.

References

[1] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022 task 5: Multimedia automatic misogyny identification, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 533–549.
[2] L. Fontanella, B. Chulvi, E. Ignazzi, A. Sarra, A. Tontodimamma, How do we study misogyny in the digital age? A systematic literature review using a computational linguistic approach, Humanities and Social Sciences Communications 11 (2024) 1–15.
[3] P. Kralj Novak, T. Scantamburlo, A. Pelicon, M. Cinelli, I. Mozetič, F. Zollo, Handling disagreement in hate speech modelling, in: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Springer, 2022, pp. 681–695.
[4] C. van Son, T. Caselli, A. Fokkens, I. Maks, R. Morante, L. Aroyo, P. Vossen, GRaSP: A multilayered annotation scheme for perspectives, in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 1177–1184. URL: https://aclanthology.org/L16-1187.
[5] S. Frenda, G. Abercrombie, V. Basile, A. Pedrani, R. Panizzon, A. T. Cignarella, C. Marco, D. Bernardi, Perspectivist approaches to natural language processing: a survey, Language Resources and Evaluation (2024) 1–28.
[6] A. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from disagreement: A survey, Journal of Artificial Intelligence Research 72 (2021) 1385–1470.
[7] B. Beigman Klebanov, E. Beigman, From annotator agreement to noise models, Computational Linguistics 35 (2009) 495–503.
[8] Y. Sang, J. Stanton, The origin and value of disagreement among data labelers: A case study of individual differences in hate speech annotation, in: Information for a Better World: Shaping the Global Future: 17th International Conference, iConference 2022, Virtual Event, February 28–March 4, 2022, Proceedings, Part I, Springer, 2022, pp. 425–444.
[9] A. Dumitrache, F. Mediagroep, L. Aroyo, C. Welty, A crowdsourced frame disambiguation corpus with ambiguity, in: Proceedings of NAACL-HLT, 2019, pp. 2164–2170.
[10] T. Fornaciari, A. Uma, S. Paun, B. Plank, D. Hovy, M. Poesio, et al., Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021.
[11] L. Han, E. Maddalena, A. Checco, C. Sarasua, U. Gadiraju, K. Roitero, G. Demartini, Crowd worker strategies in relevance judgment tasks, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 241–249.
[12] M. Sandri, E. Leonardelli, S. Tonelli, E. Ježek, Why don't you do it right? Analysing annotators' disagreement in subjective tasks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2428–2441.
[13] S. Shahriar, T. Solorio, SafeWebUH at SemEval-2023 task 11: Learning annotator disagreement in derogatory text: Comparison of direct training vs aggregation, arXiv preprint arXiv:2305.01050 (2023).
[14] E. Gajewska, eevvgg at SemEval-2023 task 11: Offensive language classification with rater-based information, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 171–176. URL: https://aclanthology.org/2023.semeval-1.24. doi:10.18653/v1/2023.semeval-1.24.
[15] M. Sullivan, M. Yasin, C. L. Jacobs, University at Buffalo at SemEval-2023 task 11: MASDA–modelling annotator sensibilities through disaggregation, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 978–985.
[16] A. de Paula, G. Rizzi, E. Fersini, D. Spina, et al., AI-UPV at EXIST 2023–sexism characterization using large language models under the learning with disagreements regime, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 985–999.
[17] J. Erbani, E. Egyed-Zsigmond, D. Nurbakova, P.-E. Portier, When multiple perspectives and an optimization process lead to better performance, an automatic sexism identification on social media with pretrained transformers in a soft label context, Working Notes of CLEF (2023).
[18] M. E. Vallecillo-Rodríguez, F. del Arco, L. A. Ureña-López, M. T. Martín-Valdivia, A. Montejo-Ráez, Integrating annotator information in transformer fine-tuning for sexism detection, Working Notes of CLEF (2023).
[19] G. Rizzi, M. Fontana, E. Fersini, Perspectives on hate: General vs. domain-specific models, in: Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024, 2024, pp. 78–83.
[20] M. Michele, V. Basile, F. M. Zanzotto, et al., Change my mind: How syntax-based hate speech recognizer can uncover hidden motivations based on different viewpoints, in: 1st Workshop on Perspectivist Approaches to Disagreement in NLP, NLPerspectives 2022, LREC 2022 Workshop, European Language Resources Association (ELRA), 2022, pp. 117–125.
[21] L. Havens, B. Bach, M. Terras, B. Alex, Beyond explanation: A case for exploratory text visualizations of non-aggregated, annotated datasets, in: G. Abercrombie, V. Basile, S. Tonelli, V. Rieser, A. Uma (Eds.), Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, European Language Resources Association, Marseille, France, 2022, pp. 73–82. URL: https://aclanthology.org/2022.nlperspectives-1.10.
[22] A. Astorino, G. Rizzi, E. Fersini, Integrated gradients as proxy of disagreement in hateful content, in: CEUR Workshop Proceedings, volume 3596, CEUR-WS.org, 2023.
[23] G. Rizzi, A. Astorino, P. Rosso, E. Fersini, Unraveling disagreement constituents in hateful speech, in: European Conference on Information Retrieval, Springer, 2024, pp. 21–29.
[24] G. Rizzi, F. Gasparini, A. Saibene, P. Rosso, E. Fersini, Recognizing misogynous memes: Biased models and tricky archetypes, Information Processing & Management 60 (2023) 103474.
[25] Clarifai, Clarifai guide. URL: https://docs.clarifai.com/.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[27] J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 (1971) 378.
[28] D. Ging, A. Neary, Gender, sexuality, and bullying special issue editorial, 2019.
[29] E. Ignazzi, A. Sarra, L. Fontanella, et al., Exploring misogyny through time: From historical origins to modern complexities, Philosophies of Communication (2023) 195–214.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[31] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.