From Explanation to Detection: Multimodal Insights into Disagreement in Misogynous Memes

Giulia Rizzi¹,²,*, Paolo Rosso², Elisabetta Fersini¹,*
¹ University of Milano-Bicocca, Milan, Italy
² Universitat Politècnica de València, Valencia, Spain

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author. g.rizzi10@campus.unimib.it (G. Rizzi); prosso@dsic.upv.es (P. Rosso); elisabetta.fersini@unimib.it (E. Fersini)
ORCID: 0000-0002-0619-0760 (G. Rizzi); 0000-0002-8922-1242 (P. Rosso); 0000-0002-8987-100X (E. Fersini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Warning: This paper contains examples of language and images that may be offensive.

This paper presents a probabilistic approach to identifying disagreement-related elements in misogynous memes by considering both modalities that compose a meme (i.e., the visual and the textual source). Several methodologies that exploit such elements to identify disagreement among annotators have been investigated and evaluated on the Multimedia Automatic Misogyny Identification (MAMI) dataset [1]. The proposed unsupervised approach reaches performance comparable to, and in some cases better than, state-of-the-art approaches, but with a reduced number of parameters to be estimated. The source code of our approaches is publicly available at https://github.com/MIND-Lab/From-Explanation-to-Detection-Multimodal-Insights-into-Disagreement-in-Misogynous-Memes.

Keywords: Disagreement, Perspectivism, Multimodal, Misogyny

1. Introduction

Hate detection has been a serious concern in recent years, as hateful content penetrates internet platforms and causes harm to individuals across various communities. In the online environment, users have found new modes of representation to express various types of hatred, including deeply rooted ideologies and beliefs with historical origins, for example towards women [2]. Detecting abusive language has therefore become an increasingly important task. The challenges introduced by these new modes of representation, which require a multimodal analysis, are further compounded by the subjectivity of the task: individuals' perceptions of what characterizes a hateful message vary widely. This diversity is reflected in the labeling phase in the form of disagreement among annotators. Identifying the elements within a sample that can lead to disagreement is of paramount importance for several reasons: for content that can lead to disagreement, specific annotation policies might be introduced, and the number of annotators might be enlarged to capture multiple perspectives [3, 4, 5].

In this work, we propose a methodology to identify the disagreement-related elements in multimodal samples by exploring both the visual and the textual elements in the Multimedia Automatic Misogyny Identification (MAMI) dataset [1]. Moreover, four different strategies that exploit the presence of such elements to identify disagreement are investigated.

2. Related Works

Many natural language tasks, such as hate speech detection, humor detection, and sentiment analysis, involve subjectivity, since they require an interpretation based on human judgment, cultural context, or personal opinion [6]. This phenomenon is reflected in datasets through multiple labels from different annotators, or through a confidence level attached to the ground-truth labels. Labels derived from different interpretations are therefore able to capture multiple perspectives and understandings [6]. Information about annotators' disagreement has primarily been exploited as a means to improve data quality by excluding controversial instances [7, 8]. Alternatively, aiming at improving model performance, different strategies have been developed to exploit disagreement information in the training phase. For instance, in [9] the authors assign weights to instances to prioritize the ones with higher confidence levels. Another commonly adopted strategy [6, 10] aims at learning directly from disagreement without considering any aggregated label.

While a considerable amount of research has been conducted to understand the reasons behind annotators' disagreement [11, 12, 8] and to leverage disagreement when training classification models [13, 14, 15, 16, 17, 18, 19], comparatively little attention has been devoted to the explanation and a priori recognition of disagreement in hateful content. A taxonomy of the possible reasons leading to annotators' disagreement has been proposed by [12]. This taxonomy articulates four macro-categories of reasons behind disagreement: sloppy annotations, ambiguity, missing information, and subjectivity. Moreover, the authors evaluate the impact of the different types on classification performance.

Only recently have works focused on the task of explaining disagreement [20, 21, 22, 23]. In [21], the authors propose exploratory text visualization techniques as a method for analyzing different perspectives in annotated data. In [22], the authors identify the textual constituents that contribute to the explanation of a hateful message by exploiting integrated gradients within a filtering strategy. A more recent approach [23] proposes a probabilistic semantic approach for the identification of disagreement-related constituents (e.g., textual elements) in hateful content. Overall, the findings indicate that, while LLMs can yield promising results, comparable outcomes can be attained with less complex strategies and fewer computational resources. While previous research has concentrated on the analysis of textual disagreement, this study represents, to the best of our knowledge, a first insight into the explanation of multimodal disagreement. In particular, we have revised the methodology proposed in [23] and extended it to the multimodal environment, in order to consider not only textual elements but also visual ones.

3. Proposed Approach
3.1. Identification of Disagreement-Related Elements

The first phase of the proposed approach aims to evaluate the relationship between the elements (both visual and textual) that compose a meme and annotators' disagreement. Preliminary preprocessing operations have been performed before identifying disagreement-related elements. Concerning the textual component, preprocessing operations (i.e., tokenization, lemmatization, lowercasing, and stop-word removal) have been performed to identify a valid set of tokens¹ that might be related to disagreement. Concerning the image component, the set of 14 human-readable concepts (tags) identified by [24] to capture specific characteristics of misogynous content has been adopted. As proposed by the authors, the tags were extracted via the Clarifai API [25]. These preprocessing steps allowed us to extract a list of visual and textual elements from each meme in the dataset.

¹ To guarantee a more robust evaluation, tokens that appear fewer than 10 times in the dataset have been removed.

In order to measure the relationship between each element in the memes and the disagreement among annotators, the approach proposed in [23] has been extended to a multimodal scenario. In particular, [23] introduces a methodology to identify disagreement-related constituents that, however, is limited to textual content. The approach includes a strategy to identify disagreement-related textual constituents and an approach for generalization towards unseen textual constituents. Both methods have been extended to a multimodal scenario in order to identify disagreement-related elements in both the textual and the visual sources that compose a meme.

Given an element e, a corresponding Element Disagreement Score (EDS(e)) has been computed according to the following equation:

EDS(e) = P(Agree | e) − P(¬Agree | e)   (1)

where P(Agree | e) represents the conditional probability that there is agreement on a meme given that the meme contains the element e. Analogously, P(¬Agree | e) denotes the conditional probability that there is no agreement on a meme given that the meme contains the element e. Since the EDS is the difference between two complementary probabilities, it is bounded within the range −1 to +1. A higher positive score indicates stronger agreement between annotators, whereas a lower negative score suggests disagreement. The score can be estimated on the training data and exploited to identify additional disagreement-related elements in unseen memes.

3.2. Disagreement Identification

Once the Element Disagreement Scores have been estimated for each visual and textual element in the training dataset, they can be exploited to qualify the level of disagreement on unseen samples. Analogously to what was carried out in [23], different aggregation strategies have been investigated, relying on the hypothesis that the identified elements can be exploited to detect disagreement thanks to their different distribution in samples with and without agreement.

For each meme in the test set, the corresponding list of elements and the corresponding Element Disagreement Scores estimated on the training data have been extracted. In particular, for each meme, the textual and visual elements have been identified and paired with the corresponding score, when available. The Multimodal Disagreement Score (MDS) has then been estimated according to the following strategies: Sum, Mean, Median, and Minimum. A threshold τ has been estimated via a grid-search approach for each strategy.

A qualitative evaluation, comprising a comparison with specific misogynistic terminology and an evaluation of the keywords included in the dataset-creation phase, has been performed to assess the quality of the EDS, while both the F1-scores for the two considered classes (agreement (+) and disagreement (−)) and a global F1-score have been computed to validate the MDS.

3.3. Generalization Towards Unseen Elements

The score estimation is strongly based on what is observed in the training data, resulting in the lack of scores for any element that does not appear in the training samples. This is particularly relevant for the textual components rather than the visual ones. In fact, while we must assume an open-world vocabulary for the textual source (where some terms in unseen data may not appear in the training set), the visual tags form a closed-world setting (only the same 14 tags can occur in both training and unseen memes). Since we need to generalize only over unseen textual constituents, for each unseen textual element ê an approximated EDS has been computed as follows:

• Embeddings of the training lexicon: the contextualized embedding representation of each textual element e has been obtained via mBERT [26]. An average embedding vector x_e is computed to jointly represent the multiple embedding representations of e derived from the different contexts where it occurs. In particular, given an element e and N sentences containing it, its vector representation is obtained by a simple average x_e = (1/N) Σ_{i=1}^{N} v_i, where v_i is the contextualized embedding vector related to the i-th occurrence of e, obtained through mBERT.

• Embeddings of the unseen term: given an unseen textual element ê within a given sentence, its contextualized embedding representation v_ê has been computed via mBERT [26].

• Most similar constituents: given an unseen textual element ê with embedding v_ê and the average embedding x_e of each training element e, the set D of most similar constituents to ê is determined according to:

D = ⋃_e { e | cos(x_e, v_ê) ≥ ψ }   (2)

where cos(x_e, v_ê) is the cosine similarity between the average contextualized embedding representation of e and the embedding of ê, and ψ is a threshold estimated via grid search.

• Unseen term score: the EDS of an unseen textual element ê is computed as the weighted average of the scores of its most similar constituents e in the training lexicon:

EDS(ê) = Σ_{e∈D} [cos(x_e, v_ê) · EDS(e)] / Σ_{e∈D} cos(x_e, v_ê)   (3)

• Multimodal Disagreement Score with unseen constituents: all the above strategies for MDS estimation have been extended to also include elements that do not belong to the training lexicon and for which the EDS has been estimated. In particular, given a multimodal sample s, the aggregation functions presented in Section 3.2 in this case consider the EDS values of both seen (EDS(e)) and unseen (EDS(ê)) elements. Such generalized aggregation functions will later be referred to with the prefix G-.

4. Results

The proposed approach has been evaluated on the Multimedia Automatic Misogyny Identification (MAMI) dataset [1], consisting of 10,000 memes for training and 1,000 memes for testing². The dataset comprises a range of memes that exemplify various forms of misogyny, including shaming, stereotyping, objectification, and violence. Each meme has been labeled for misogynistic content by three crowdsourced annotators³, with an estimated Fleiss' kappa [27] coefficient equal to 0.5767. In particular, the proposed approach has been adopted to estimate an Element Disagreement Score (EDS) for each element and, consequently, an MDS for each meme in the dataset.

² Although both a training and a test dataset are provided, only the training dataset is adopted, as the proposed work focuses on the analysis and prediction of disagreement and the test dataset is constructed to include only samples with complete agreement. The training dataset, instead, is characterized by 65% of data with complete agreement. It has therefore been divided so as to isolate 90% for token estimation and the remaining 10% for the evaluation.

³ Additionally, a boolean disagreement label has been derived to represent complete agreement among annotators. In particular, this label is set to 1 if all the annotators indicated the same label, and to 0 otherwise.

Table 1 reports the top-10 highest positive and lowest negative disagreement scores derived for the textual component.

Table 1: Terms with the highest positive and lowest negative scores.

Term        EDS     Term        EDS
flu         1.00    market     −0.64
folk        1.00    fetish     −0.60
bug         1.00    nut        −0.57
Bernie      1.00    hotel      −0.50
whale       1.00    apologize  −0.45
feeling     0.90    Miley      −0.45
gamer       0.87    lonely     −0.43
rest        0.87    award      −0.43
programmer  0.87    coke       −0.43
san         0.83    blowjob    −0.43

Table 2: Tags with the highest positive and lowest negative scores.

Tag             EDS     Tag         EDS
crockery        0.49    dishwasher  0.00
nudity          0.46    broom       0.14
cat             0.46    dog         0.20
car             0.43    child       0.23
kitchenutensil  0.41    woman       0.26

Figure 1: Visual representation of disagreement scores distinguishing between textual and visual elements. Positive and negative scores are represented in green and pink, respectively. The gray bar denotes elements for which the EDS has been estimated, while white represents elements with an EDS equal to zero.

We can notice how terms that are rarely linked with misogynous messages (e.g., flu), terms commonly used to address women in a harmful way (e.g., whale), and terms exploiting stereotypes (e.g., gamer and programmer) achieve a high positive score, indicating a strong relation with agreement. Additionally, some personal names of famous people (i.e., Bernie and Miley) appear within the ranking. In particular, such names might appear in memes as the target of a hateful message, referring to their personal life, physical appearance, or specific events that involved them. As a consequence, depending on the reasons that lead to such criticism (gender, physical appearance, and personal choices for Miley Cyrus vs. political stance and career, without the same gendered connotations, for Bernie Sanders), there might be disagreement about misogyny.

The concept of the "sexual marketplace" is often the subject of debate, particularly in relation to its intersection with misogynistic ideologies [28, 29]. Some supporters, often aligned with "manosphere" or "red pill" ideologies, argue that the sexual marketplace disproportionately empowers women, giving them more control over sexual selection and relationships, which can disadvantage men. On the other hand, critics assert that this perspective reduces human relationships to transactional exchanges and objectifies both genders, ultimately reinforcing misogynistic attitudes. This latter viewpoint asserts that framing relationships in market terms devalues emotional connection and perpetuates harmful stereotypes about women's worth being tied solely to their sexual desirability.
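The EDS of Equation (1) can be estimated from simple co-occurrence counts on the training split: for each element, the fraction of memes containing it that received a unanimous label. The following minimal sketch illustrates the idea; the function and variable names are ours, not taken from the released code.

```python
from collections import defaultdict

def element_disagreement_scores(memes):
    """Estimate EDS(e) = P(Agree|e) - P(not Agree|e) for every element.

    `memes` is an iterable of (elements, agree) pairs, where `elements`
    contains the textual tokens and visual tags extracted from a meme and
    `agree` is True when all annotators gave the same label.
    """
    agree_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for elements, agree in memes:
        for e in set(elements):       # count each element once per meme
            total_counts[e] += 1
            if agree:
                agree_counts[e] += 1
    scores = {}
    for e, n in total_counts.items():
        p_agree = agree_counts[e] / n
        scores[e] = p_agree - (1 - p_agree)   # bounded in [-1, +1]
    return scores

# Toy example: "flu" only occurs in memes with full agreement.
memes = [({"flu", "cat"}, True), ({"market", "cat"}, False), ({"flu"}, True)]
eds = element_disagreement_scores(memes)
# eds["flu"] == 1.0, eds["market"] == -1.0, eds["cat"] == 0.0
```

Because P(Agree|e) and P(¬Agree|e) are complementary, a single count ratio suffices; an element seen only in unanimously labeled memes gets +1, one seen only in contested memes gets −1.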
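Equations (2) and (3) amount to a cosine-weighted neighbourhood smoothing of the lexicon scores. A minimal sketch, assuming the mBERT embeddings have already been extracted and averaged upstream; names, the 2-d toy vectors, and the neutral 0.0 fallback for terms with no neighbour above ψ are our illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generalized_eds(unseen_vec, train_vecs, train_eds, psi=0.8):
    """Approximate the EDS of an unseen term as the cosine-weighted
    average (Eq. 3) of the EDS of the training terms whose average
    embedding is at least psi-similar to it (Eq. 2)."""
    sims = {e: cosine(v, unseen_vec) for e, v in train_vecs.items()}
    D = {e: s for e, s in sims.items() if s >= psi}  # most similar constituents
    if not D:
        return 0.0  # assumption: neutral score when no constituent qualifies
    return sum(s * train_eds[e] for e, s in D.items()) / sum(D.values())

# Toy 2-d "embeddings": the unseen term is close to "market" only.
train_vecs = {"market": np.array([1.0, 0.0]), "flu": np.array([0.0, 1.0])}
train_eds = {"market": -0.64, "flu": 1.0}
score = generalized_eds(np.array([0.9, 0.1]), train_vecs, train_eds)
# score == -0.64: only "market" passes the psi threshold
```

With a single qualifying neighbour the weighted average collapses to that neighbour's EDS, which is why the toy call returns exactly −0.64.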
The achieved results suggest the ability of the approach to detect this variety of interpretations and to reflect it in the EDS scores.

Table 2 reports the top-5 highest positive and lowest negative disagreement scores derived for the visual component. It is easy to notice how all the scores are positive and small, denoting a tendency of these tags to be only weakly related to the agreement label.

Figure 1 reports an example of a meme with disagreement, along with the visual representation of the EDS of its textual and visual elements. Moreover, as highlighted with a gray bar, some of the reported scores have been estimated: such scores correspond to constituents that are not present in the training dataset and for which the EDS could not be computed directly, so their visual representation corresponds to the score obtained through the estimation strategy. Overall, it is easy to notice the presence of elements strongly related to disagreement (i.e., sexual and market), highlighted in pink.

Figure 2 reports two memes that share the same text but differ in the image. Despite this commonality, the memes have been labeled differently: while the first meme has been labeled as misogynous by 2 annotators out of 3, the second one has been unanimously labeled as non-misogynous. Since these memes share a common textual representation, the derived textual elements and textual EDS are also equal, resulting in an indistinguishable representation that is ineffective for disagreement identification. Moreover, although the memes differ in visual content, resulting in different tags and, therefore, different visual EDS, as previously mentioned, this component alone is not sufficient for disagreement prediction. The findings demonstrate the necessity of jointly considering both the visual and the textual modality for the purpose of predicting disagreement.

Figure 2: Visual representation of disagreement scores distinguishing between textual and visual elements for two samples in the dataset. Positive and negative scores are represented in green and pink, respectively. White represents elements with an EDS equal to zero.

All the proposed aggregation strategies have been implemented, considering the modalities both individually and jointly. Table 3 and Table 4 summarise the results achieved on disagreement identification considering only the scores of elements derived from the textual component (i.e., terms) and only the scores of elements derived from the visual component (i.e., tags), respectively. Table 5 instead summarises the results achieved by aggregating the scores derived from all the elements (i.e., terms and tags). The results achieved on the textual component only highlight G-Mean as the best-performing approach. Overall, the estimation strategy yields an improvement in performance of up to 6%, confirming the ability of the proposed strategy to capture disagreement relationships for unseen terms. Furthermore, BERT [30]⁴ has been reported as a state-of-the-art baseline for unimodal textual classification. The achieved results show how BERT performs better on the majority class, struggling to predict the disagreement class; the proposed approach, instead, leads to performance that is more balanced between the two classes.

⁴ BERT has been implemented and fine-tuned using the Hugging Face framework with default hyperparameters. We adopted "bert-base-cased", available at https://huggingface.co/google-bert/bert-base-cased.

Table 3: Comparison of the different approaches for disagreement detection considering the textual component only. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach.

Approach    ψ     τ     F1+   F1−   F1 Score
Sum         -     3.1   0.61  0.39  0.50
Mean        -     0.2   0.78  0.20  0.49
Median      -     0.2   0.07  0.79  0.43
Minimum     -    -0.1   0.29  0.75  0.52
G-Sum       0.8   3.1   0.65  0.37  0.51
G-Mean      0.8   0.2   0.73  0.34  0.53
G-Median    0.8   0.2   0.77  0.21  0.49
G-Minimum   0.8  -0.1   0.75  0.30  0.52
BERT [30]   -     -     0.80  0.00  0.40

Table 4 reports the performances of the different approaches for disagreement identification considering the visual component only. While the Sum approach (i.e., the best-performing among the tag-based ones) demonstrates satisfactory performance in identifying positive instances (achieving an F1+ of 0.69), it exhibits considerable difficulty in accurately identifying negative instances.

Finally, Table 5 reports the performances of the different approaches for disagreement identification when jointly considering both modalities. Furthermore, for a better comparison of the performance achieved by the proposed approach, a state-of-the-art baseline for multimodal classification has been implemented: CLIP [31]⁵.

⁵ CLIP has been implemented and fine-tuned using the Hugging Face framework with default hyperparameters. In particular, we used the version available at https://huggingface.co/openai/clip-vit-large-patch14, to which we concatenated a linear layer for binary classification.

The inclusion of both modalities leads to a slight improvement in performance that, however, remains quite poor, highlighting the difficulty of the task. The inclusion of the unseen-constituent estimation leads to an improvement in performance (except for the sum-based method) of up to 8% for the mean-based approach. However, the best performances are achieved by the Minimum and G-Minimum approaches, for which the estimation methodology is not effective. This behavior may be attributed to the imbalance in the dataset.
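The τ values reported in Tables 3–5 are obtained by grid search over candidate thresholds. A compact sketch of the aggregation strategies of Section 3.2 and the threshold selection; we assume here that the global F1 is the macro average (F1+ + F1−)/2, which matches the reported tables, and all helper names are ours.

```python
import numpy as np

STRATEGIES = {"sum": np.sum, "mean": np.mean, "median": np.median, "minimum": np.min}

def mds(scores, strategy="mean"):
    # Aggregate a meme's element EDS values into a Multimodal Disagreement Score.
    return float(STRATEGIES[strategy](scores)) if scores else 0.0

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def grid_search_tau(samples, strategy, taus):
    """Pick the tau maximising the global (macro) F1. `samples` pairs each
    meme's list of element scores with its gold agreement label; a meme is
    predicted as 'agreement' (+) when its MDS exceeds tau."""
    def macro_f1(tau):
        tp = fp = fn = tn = 0
        for scores, agree in samples:
            pred = mds(scores, strategy) > tau
            tp += pred and agree
            fp += pred and not agree
            fn += (not pred) and agree
            tn += (not pred) and not agree
        return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2  # F1+ and F1-
    return max(taus, key=macro_f1)

# Toy split: two agreed memes with positive mean EDS, one contested meme.
samples = [([0.9, 0.5], True), ([0.4, -0.8], False), ([0.7], True)]
best = grid_search_tau(samples, "mean", taus=[-0.5, 0.0, 0.5])
# best == 0.0: it separates the toy samples perfectly
```

The same loop works unchanged for the G- variants: the only difference is that the per-meme score lists also contain the estimated EDS(ê) of unseen elements.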
Table 4: Comparison of the different approaches for disagreement detection considering the visual component only. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach.

Approach   ψ    τ     F1+   F1−   F1 Score
Sum        -    0.3   0.69  0.34  0.52
Mean       -    0.3   0.41  0.48  0.45
Median     -    0.3   0.41  0.49  0.40
Minimum    -    0.3   0.35  0.49  0.40

Table 5: Comparison of the different approaches for disagreement detection considering both the textual and the visual component. The agreement label (+) indicates complete annotator agreement, regardless of the misogyny value, while the disagreement label (−) denotes samples without complete agreement. Bold denotes the best approach in terms of F1-score, and underline represents the best approach according to the disagreement label. ψ and τ represent the best hyperparameters estimated via a grid-search approach, and E is the set of elements.

Approach    ψ     τ     F1+   F1−   F1 Score  Param.
Sum         -     3.4   0.63  0.36  0.50      |E|
Mean        -     0.2   0.79  0.13  0.46      |E|
Median      -     0.2   0.80  0.05  0.42      |E|
Minimum     -     0     0.69  0.42  0.55      |E|
G-Sum       0.8   3.6   0.64  0.35  0.49      179M
G-Mean      0.9   0.2   0.70  0.39  0.54      179M
G-Median    0.9   0.2   0.77  0.21  0.49      179M
G-Minimum   0.1   0     0.69  0.42  0.55      179M
CLIP [31]   -     0.5   0.63  0.42  0.52      428M

The larger the number of samples with agreement, the greater the number of agreement-related terms that impact the estimation phase. Consequently, the estimation of scores for unseen elements is likely to be positive due to the aforementioned imbalance. Overall, the findings suggest that achieving a balanced performance remains challenging.

5. Conclusion and Future Works

This paper proposes a probabilistic approach to identify disagreement-related elements in multimodal content. The proposed approach allows for the identification of elements that could be used as a proxy to identify samples that might be perceived differently by annotators and that, therefore, could lead to disagreement. The achieved results highlight the difficulty of the task, denoting the need for more advanced approaches. Future work will include different strategies for image analysis, in order to provide a better description of the image itself through all the elements that compose it. Furthermore, a study of compositionality might be carried out to better represent the relationship among such elements inside the meme. The sense of a meme is often derived from the meanings of its individual parts (i.e., the image and the text) and the way they are combined. By analyzing how different elements interact and contribute to the overall message, it is possible to gain a deeper understanding of how the meaning is represented within the different modalities. This will help in identifying complex patterns and improving the accuracy of classification models.

Acknowledgments

We acknowledge the support of the PNRR ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing (CN00000013), under the NRRP MUR program funded by NextGenerationEU. The work of Paolo Rosso was carried out in the framework of the FairTransNLP-Stereotypes research project (PID2021-124361OB-C31) funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe.

References

[1] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022 task 5: Multimedia automatic misogyny identification, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 533–549.
[2] L. Fontanella, B. Chulvi, E. Ignazzi, A. Sarra, A. Tontodimamma, How do we study misogyny in the digital age? A systematic literature review using a computational linguistic approach, Humanities and Social Sciences Communications 11 (2024) 1–15.
[3] P. Kralj Novak, T. Scantamburlo, A. Pelicon, M. Cinelli, I. Mozetič, F. Zollo, Handling disagreement in hate speech modelling, in: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Springer, 2022, pp. 681–695.
[4] C. van Son, T. Caselli, A. Fokkens, I. Maks, R. Morante, L. Aroyo, P. Vossen, GRaSP: A multilayered annotation scheme for perspectives, in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 1177–1184. URL: https://aclanthology.org/L16-1187.
[5] S. Frenda, G. Abercrombie, V. Basile, A. Pedrani, R. Panizzon, A. T. Cignarella, C. Marco, D. Bernardi, Perspectivist approaches to natural language processing: a survey, Language Resources and Evaluation (2024) 1–28.
[6] A. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from disagreement: A survey, Journal of Artificial Intelligence Research 72 (2021) 1385–1470.
[7] B. Beigman Klebanov, E. Beigman, From annotator agreement to noise models, Computational Linguistics 35 (2009) 495–503.
[8] Y. Sang, J. Stanton, The origin and value of disagreement among data labelers: A case study of individual differences in hate speech annotation, in: Information for a Better World: Shaping the Global Future: 17th International Conference, iConference 2022, Virtual Event, February 28–March 4, 2022, Proceedings, Part I, Springer, 2022, pp. 425–444.
[9] A. Dumitrache, F. Mediagroep, L. Aroyo, C. Welty, A crowdsourced frame disambiguation corpus with ambiguity, in: Proceedings of NAACL-HLT, 2019, pp. 2164–2170.
[10] T. Fornaciari, A. Uma, S. Paun, B. Plank, D. Hovy, M. Poesio, et al., Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021.
[11] L. Han, E. Maddalena, A. Checco, C. Sarasua, U. Gadiraju, K. Roitero, G. Demartini, Crowd worker strategies in relevance judgment tasks, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 241–249.
[12] M. Sandri, E. Leonardelli, S. Tonelli, E. Ježek, Why don't you do it right? Analysing annotators' disagreement in subjective tasks, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2428–2441.
[13] S. Shahriar, T. Solorio, SafeWebUH at SemEval-2023 task 11: Learning annotator disagreement in derogatory text: Comparison of direct training vs aggregation, arXiv preprint arXiv:2305.01050 (2023).
[14] E. Gajewska, eevvgg at SemEval-2023 task 11: Offensive language classification with rater-based information, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 171–176. URL: https://aclanthology.org/2023.semeval-1.24. doi:10.18653/v1/2023.semeval-1.24.
[15] M. Sullivan, M. Yasin, C. L. Jacobs, University at Buffalo at SemEval-2023 task 11: MASDA–modelling annotator sensibilities through disaggregation, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 978–985.
[16] A. de Paula, G. Rizzi, E. Fersini, D. Spina, et al., AI-UPV at EXIST 2023–sexism characterization using large language models under the learning with disagreements regime, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 985–999.
[17] J. Erbani, E. Egyed-Zsigmond, D. Nurbakova, P.-E. Portier, When multiple perspectives and an optimization process lead to better performance, an automatic sexism identification on social media with pretrained transformers in a soft label context, Working Notes of CLEF (2023).
[18] M. E. Vallecillo-Rodríguez, F. del Arco, L. A. Ureña-López, M. T. Martín-Valdivia, A. Montejo-Ráez, Integrating annotator information in transformer fine-tuning for sexism detection, Working Notes of CLEF (2023).
[19] G. Rizzi, M. Fontana, E. Fersini, Perspectives on hate: General vs. domain-specific models, in: Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024, 2024, pp. 78–83.
[20] M. Michele, V. Basile, F. M. Zanzotto, et al., Change my mind: How syntax-based hate speech recognizer can uncover hidden motivations based on different viewpoints, in: 1st Workshop on Perspectivist Approaches to Disagreement in NLP, NLPerspectives 2022, LREC 2022 Workshop, European Language Resources Association (ELRA), 2022, pp. 117–125.
[21] L. Havens, B. Bach, M. Terras, B. Alex, Beyond explanation: A case for exploratory text visualizations of non-aggregated, annotated datasets, in: G. Abercrombie, V. Basile, S. Tonelli, V. Rieser, A. Uma (Eds.), Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, European Language Resources Association, Marseille, France, 2022, pp. 73–82. URL: https://aclanthology.org/2022.nlperspectives-1.10.
[22] A. Astorino, G. Rizzi, E. Fersini, Integrated gradients as proxy of disagreement in hateful content, in: CEUR Workshop Proceedings, volume 3596, CEUR-WS.org, 2023.
[23] G. Rizzi, A. Astorino, P. Rosso, E. Fersini, Unraveling disagreement constituents in hateful speech, in: European Conference on Information Retrieval, Springer, 2024, pp. 21–29.
[24] G. Rizzi, F. Gasparini, A. Saibene, P. Rosso, E. Fersini, Recognizing misogynous memes: Biased models and tricky archetypes, Information Processing & Management 60 (2023) 103474.
[25] Clarifai, Clarifai guide. URL: https://docs.clarifai.com/.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[27] J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 (1971) 378.
[28] D. Ging, A. Neary, Gender, sexuality, and bullying special issue editorial, 2019.
[29] E. Ignazzi, A. Sarra, L. Fontanella, et al., Exploring misogyny through time: From historical origins to modern complexities, Philosophies of Communication (2023) 195–214.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[31] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.