AMI @ EVALITA2020: Automatic Misogyny Identification

Elisabetta Fersini^1, Debora Nozza^2, Paolo Rosso^3
^1 DISCo, University of Milano-Bicocca
^2 Bocconi University
^3 PRHLT Research Center, Universitat Politècnica de València
elisabetta.fersini@unimib.it, debora.nozza@unibocconi.it, prosso@dsic.upv.es

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. Automatic Misogyny Identification (AMI) is a shared task proposed at the Evalita 2020 evaluation campaign. The AMI challenge, based on Italian tweets, is organized into two subtasks: (1) Subtask A, on misogyny and aggressiveness identification, and (2) Subtask B, on the fairness of the model. At the end of the evaluation phase, we received a total of 20 runs for Subtask A and 11 runs for Subtask B, submitted by 8 teams. In this paper, we present an overview of the AMI shared task, the datasets, the evaluation methodology, the results obtained by the participants, and a discussion of the methodologies adopted by the teams. Finally, we draw some conclusions and discuss future work.

Italiano. Automatic Misogyny Identification (AMI) è uno shared task proposto nella campagna di valutazione Evalita 2020. La challenge AMI, basata su tweet italiani, si distingue in due subtask: (1) il subtask A, che ha come obiettivo l'identificazione di testi misogini e aggressivi, e (2) il subtask B, relativo alla fairness del modello. Al termine della fase di valutazione, sono state ricevute in totale 20 submission per il subtask A e 11 per il subtask B, inviate da un totale di 8 team. Presentiamo di seguito una sintesi dello shared task AMI, i dataset, la metodologia di valutazione, i risultati ottenuti dai partecipanti e una discussione sulle metodologie adottate dai diversi team. Infine, vengono discusse le conclusioni e delineati gli sviluppi futuri.

1 Introduction

People widely share thoughts, emotions, and feelings through posts in social media, and women have a strong presence in these online environments: 75% of females use social media multiple times per day, compared to 64% of males. While new opportunities have emerged for women to express themselves, systematic inequality and discrimination take place in the form of offensive content against the female gender. These manifestations of misogyny, usually directed by a man at a woman to dominate or exert power over the female gender, are a relevant social problem that has been addressed in the scientific literature during the last few years. Recent investigations studied how the misogyny phenomenon takes place, for example as unjustified slurring or as stereotyping of the role/body of a woman (e.g., the hashtag #getbacktokitchen), as described in the book by Poland (Poland, 2016). Preliminary research was conducted in (Hewitt et al., 2016) as the first attempt at manual classification of misogynous tweets, while automatic misogyny identification in social media was first investigated in (Anzovino et al., 2018). Since 2018, several initiatives have been dedicated, as a call to action, to stopping hate against women from both a machine learning and a computational linguistics point of view, such as AMI@Evalita 2018 (Fersini et al., 2018a), AMI@IberEval 2018 (Fersini et al., 2018b) and HatEval@SemEval 2019 (Basile et al., 2019). Several relevant research directions have been investigated for addressing the misogyny identification challenge, among which approaches focused on effective text representation (Bakarov, 2018; Basile and Rubagotti, 2018), machine learning models (Buscaldi, 2018; Ahluwalia et al., 2018) and domain-specific lexical resources (Pamungkas et al., 2018; Frenda et al., 2018).
During the AMI shared task organized at the Evalita 2020 evaluation campaign (Basile et al., 2020), the focus is not only on misogyny identification but also on aggressiveness recognition, as well as on the definition of models able to guarantee fair predictions.

2 Task Description

The AMI shared task, which is a re-run of a previous challenge at Evalita 2018, proposes the automatic identification of misogynous content in the Italian language on Twitter. More specifically, it is organized according to two main subtasks:

• Subtask A - Misogyny & Aggressive Behaviour Identification: a system must recognize if a text is misogynous or not and, in case of misogyny, if it expresses an aggressive attitude. In order to provide an annotated corpus for Subtask A, the following definitions have been adopted to label the collected dataset:
  – Misogynous: a text that expresses hate towards women in particular (in the form of insulting, sexual harassment, threats of violence, stereotype, objectification, and negation of male responsibility).
  – Not Misogynous: a text that does not express any form of hate towards women.
  – Aggressive: a message is considered aggressive if it (implicitly or explicitly) presents, incites, threatens, implies, suggests, or alludes to:
    * attitudes, violent actions, hostility or commission of offenses against women;
    * social isolation of women for physical or psychological characteristics;
    * justifying or legitimizing an aggressive action against women.
  – Not Aggressive: if none of the previous conditions hold.

• Subtask B - Unbiased Misogyny Identification: a system must discriminate misogynistic contents from non-misogynistic ones, while guaranteeing the fairness of the model (in terms of unintended bias) on a synthetic dataset (Nozza et al., 2019). To this purpose, Subtask B has the goal of measuring the attitude of a model to be fair when processing sentences containing specific identity terms that likely conveyed misogyny in the training data, e.g. "girlfriend" and "wife".

3 Training and Testing Data

The data provided to the participants for the AMI shared task comprise a raw dataset and a synthetic dataset for measuring bias. Each dataset is divided into a Training Set and a Test Set.

3.1 Raw dataset

The raw dataset is a balanced dataset of misogynous and non-misogynous tweets. The raw training set (6,000 tweets) is derived from the data collected for the 2018 edition of the AMI shared task, where the misogynistic posts have been enriched by labelling aggressive expressions according to the definition provided in Section 2. The raw test dataset (approximately 1,000 tweets) has been collected from Twitter using an approach similar to that of the 2018 edition of the shared task. This is intentionally done to evaluate the generalization abilities of the systems on test data collected in a different time period and therefore characterized by higher language variability with respect to the training data. Examples of tweets belonging to the raw dataset are shown in Table 1.

Table 1: Examples of misogynous and non-misogynous tweets

  Misogynous   Aggressiveness   Text
  1            1                Stupida tr**a! Meriti di morire
  1            0                Le ragazze sono tutte pu***ne
  0            0                @chiellini sei una cagna

The raw training data are provided as TSV files (tab-separated files) and report the following fields:

• id denotes a unique identifier of the tweet.
• text represents the tweet text.
• misogynous defines whether a tweet is misogynous or not; it takes values:
  – 0 if the tweet is not misogynous;
  – 1 if the tweet is misogynous.
• aggressiveness denotes whether a misogynous tweet is aggressive or not; it takes values:
  – 0 if the tweet is non-aggressive (non-misogynous tweets are labelled as 0 by default);
  – 1 if the tweet is aggressive.

The raw testing data are provided as TSV files reporting only id and text.
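To make the format concrete, the following minimal sketch (in Python, assuming pandas is available) loads the raw training data and checks the labelling constraints described above; the file name ami2020_training_raw.tsv is a hypothetical placeholder, not the official one.

```python
# Minimal loading sketch for the raw training data described above.
# The file name is hypothetical; adapt it to the actual release.
import csv
import pandas as pd

train_raw = pd.read_csv(
    "ami2020_training_raw.tsv",   # hypothetical file name
    sep="\t",
    quoting=csv.QUOTE_NONE,       # tweets may contain unescaped quotes
    keep_default_na=False,        # do not turn short tweets into NaN
    dtype={"id": str, "text": str, "misogynous": int, "aggressiveness": int},
)

# Sanity checks on the labelling scheme of Section 3.1: labels are binary
# and non-misogynous tweets are non-aggressive by default.
assert set(train_raw["misogynous"].unique()) <= {0, 1}
assert set(train_raw["aggressiveness"].unique()) <= {0, 1}
assert (train_raw.loc[train_raw["misogynous"] == 0, "aggressiveness"] == 0).all()

print(train_raw[["misogynous", "aggressiveness"]].value_counts())
```

Disabling quoting and default NA handling is a precaution for raw tweet text; the official files may already be clean in this respect.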
3.2 Synthetic dataset

The synthetic test dataset for measuring the presence of unintended bias has been created following the procedure adopted in (Dixon et al., 2018; Nozza et al., 2019): a list of identity terms has been constructed by taking into consideration some concepts related to the term "donna" (e.g. "moglie", "fidanzata"). Given the identity terms, several templates have been created including positive/negative verbs and adjectives (e.g. negative: hate, inferior; positive: love, awesome), for conveying either a misogynistic message or a non-misogynistic one. Some examples of such templates, used to create the synthetic dataset, are reported in Table 2.

Table 2: Template examples

  Template Examples              Misogynous
  dovrebbero essere protette     0
  dovrebbero essere uccise       1
  apprezziamo                    0
  picchiamo                      1
  splendida                      0
  sporca                         1

The synthetic dataset, created for measuring the presence of unintended bias, contains template-generated text labelled according to:

• Misogyny: Misogyny (1) vs. Not Misogyny (0)

The training data for the synthetic dataset are provided as TSV files (tab-separated files) and report the following fields:

• id denotes a unique identifier of the template-generated text.
• text represents the template-generated text.
• misogynous defines whether the template-generated text is misogynous or non-misogynous; it takes value 1 if the text is misogynous and 0 if it is non-misogynous.

The synthetic testing data are provided as TSV files (tab-separated files) reporting only id and text.

The statistics about the raw and synthetic datasets, both for the training and testing sets, are reported in Table 3.

Table 3: Distribution of labels on the Training and Test datasets

                    Training             Testing
                    Raw      Synthetic   Raw      Synthetic
  Misogynous        2337     1007        500      954
  Non-misogynous    2663     1007        500      954
  Aggressive        1783     -           176      -
  Non-aggressive    3217     -           824      -
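To illustrate how such a template-based corpus can be put together, the sketch below combines identity terms with misogynous and non-misogynous predicates in the spirit of Table 2; the term and template lists are illustrative examples only, not the official resources used to build the dataset.

```python
# Illustrative sketch of the template-based construction behind the
# synthetic dataset (Section 3.2, Table 2): identity terms are combined
# with non-misogynous (label 0) and misogynous (label 1) predicates.
# The lists below are examples, not the official resources.
identity_terms = ["donne", "mogli", "fidanzate", "nonne"]

templates = [
    ("{term} dovrebbero essere protette", 0),
    ("{term} dovrebbero essere uccise", 1),
    ("apprezziamo le {term}", 0),
    ("picchiamo le {term}", 1),
]

synthetic = []
for term in identity_terms:
    for template, label in templates:
        synthetic.append({
            "id": len(synthetic),
            "text": template.format(term=term),
            "misogynous": label,
        })

for row in synthetic[:4]:
    print(row)
```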
4 Evaluation Measures and Baseline

Considering the distribution of labels in the datasets, we have chosen different evaluation metrics. In particular, we distinguished as follows.

Subtask A. Each class to be predicted (i.e. "Misogyny" and "Aggressiveness") has been evaluated independently of the other using a Macro F1-score. The final ranking of the systems participating in Subtask A was based on the Average Macro F1-score, computed as follows:

  Score_A = \frac{F_1(Misogyny) + F_1(Aggressiveness)}{2}    (1)

Subtask B. The ranking for Subtask B is computed by a weighted combination of the AUC estimated on the raw test dataset (AUC_{raw}) and three per-term AUC-based bias scores computed on the synthetic dataset (AUC_{Subgroup}, AUC_{BPSN}, AUC_{BNSP}). Let s be an identity term (e.g. "girlfriend" and "wife") and N the total number of identity terms; the score of each run is estimated according to the following metric:

  Score_B = \frac{1}{2} AUC_{raw} + \frac{1}{2N} \left[ \sum_{s} AUC_{Subgroup}(s) + \sum_{s} AUC_{BPSN}(s) + \sum_{s} AUC_{BNSP}(s) \right]    (2)

Unintended bias can be uncovered by looking at differences in the score distributions between data mentioning a specific identity term (subgroup distribution) and the rest (background distribution). The three per-term AUC-based bias scores are related to specific subgroups as follows (a computational sketch of these measures is given after the list):

• AUC_{Subgroup}(s): calculates AUC only on the data within the subgroup related to a given identity term. This represents model understanding and separability within the subgroup itself. A low value in this metric means the model does a poor job of distinguishing between misogynous and non-misogynous comments that mention the identity.

• AUC_{BPSN}(s): Background Positive, Subgroup Negative (BPSN) calculates AUC on the misogynous examples from the background and the non-misogynous examples from the subgroup. A low value in this metric means that the model confuses non-misogynous examples that mention the identity term with misogynous examples that do not, likely meaning that the model predicts higher misogynous scores than it should for non-misogynous examples mentioning the identity term.

• AUC_{BNSP}(s): Background Negative, Subgroup Positive (BNSP) calculates AUC on the non-misogynous examples from the background and the misogynous examples from the subgroup. A low value here means that the model confuses misogynous examples that mention the identity with non-misogynous examples that do not, likely meaning that the model predicts lower misogynous scores than it should for misogynous examples mentioning the identity.
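The following sketch shows how the measures above can be computed with scikit-learn, assuming gold labels, predicted labels, and confidence scores are available as 1-d NumPy arrays; the helper names are illustrative, and Equation (2) is implemented with the weights as reconstructed above.

```python
# Sketch of the evaluation measures of Section 4, assuming 1-d NumPy arrays
# for gold labels, predicted labels and prediction scores.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score


def score_a(gold_mis, pred_mis, gold_agg, pred_agg):
    """Average Macro F1 over the two Subtask A classes (Equation 1)."""
    return (f1_score(gold_mis, pred_mis, average="macro")
            + f1_score(gold_agg, pred_agg, average="macro")) / 2


def bias_aucs(gold, scores, in_subgroup):
    """Per-term bias AUCs (Subgroup, BPSN, BNSP) for one identity term."""
    sub = np.asarray(in_subgroup, dtype=bool)   # texts mentioning the term
    bg = ~sub                                   # background texts
    subgroup_auc = roc_auc_score(gold[sub], scores[sub])
    # BPSN: misogynous background examples + non-misogynous subgroup examples
    bpsn = (bg & (gold == 1)) | (sub & (gold == 0))
    # BNSP: non-misogynous background examples + misogynous subgroup examples
    bnsp = (bg & (gold == 0)) | (sub & (gold == 1))
    return (subgroup_auc,
            roc_auc_score(gold[bpsn], scores[bpsn]),
            roc_auc_score(gold[bnsp], scores[bnsp]))


def score_b(raw_gold, raw_scores, syn_gold, syn_scores, term_masks):
    """Equation (2): raw-test AUC combined with the per-term bias AUCs.

    term_masks maps each identity term s to a boolean mask over the
    synthetic test set marking the texts that mention s.
    """
    auc_raw = roc_auc_score(raw_gold, raw_scores)
    bias_sum = sum(sum(bias_aucs(syn_gold, syn_scores, mask))
                   for mask in term_masks.values())
    return 0.5 * auc_raw + bias_sum / (2 * len(term_masks))
```

Note that Subtask A only needs hard labels, while the bias scores are AUC-based and therefore require a real-valued confidence for the misogynous class.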
In order to compare the submitted runs with a baseline model, we provided a benchmark (AMI-BASELINE) based on a Support Vector Machine trained on a unigram representation of tweets with a TF-IDF weighting scheme. In particular, we created one training set for each field to be predicted, i.e. "misogynous" and "aggressiveness", where each tweet has been represented as a bag-of-words (composed of 1,000 terms) coupled with the corresponding label. Once the representations have been obtained, Support Vector Machines with linear kernel have been trained and provided as AMI-BASELINE.
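A minimal sketch of a comparable baseline with scikit-learn is shown below (unigram TF-IDF capped at 1,000 features and a linear-kernel SVM, one classifier per field); the exact preprocessing and hyperparameters of the official AMI-BASELINE may differ.

```python
# Sketch of a baseline in the spirit of AMI-BASELINE: unigram TF-IDF
# representation limited to 1,000 terms and a linear-kernel SVM, trained
# separately for the "misogynous" and "aggressiveness" fields.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_baseline():
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
        ("svm", SVC(kernel="linear")),
    ])


# train_raw is the DataFrame from the loading sketch in Section 3.1.
baselines = {}
for field in ("misogynous", "aggressiveness"):
    model = build_baseline()
    model.fit(train_raw["text"], train_raw[field])
    baselines[field] = model

# Predictions on the raw test texts, one column per field:
# predictions = {f: m.predict(test_raw["text"]) for f, m in baselines.items()}
```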
5 Participants and Results

A total of 8 teams from 6 different countries participated in at least one of the two subtasks of AMI. Two teams also participated with the same approach in the HaSpeeDe shared task (Sanguinetti et al., 2020), addressing misogyny identification with generic models for detecting hate speech. Each team had the chance to submit up to three runs, which could be constrained (c), where only the provided training data and lexicons were admitted, or unconstrained (u), where additional training data were allowed. Table 4 reports an overview of the teams, illustrating their affiliation, their country, the number and type (c for constrained, u for unconstrained) of submissions, and the subtasks they addressed.

Table 4: Team overview

  Team Name                                          Affiliation               Country   Runs           Subtask
  jigsaw (Lees et al., 2020)                         Google                    US        2 (u)          A, B
  fabsam (Fabrizi, 2020)                             University of Pisa        IT        2 (c)          A, B
  YNU OXZ (Ou and Li, 2020)                          Yunnan University         CN        2 (u)          A
  NoPlaceForHateSpeech (da Silva and Roman, 2020)    University of Sao Paulo   BR        3 (c)          A
  AMI the winner (Lepri et al.)                      University of Pisa        IT        3 (c)          A
  MDD (El Abassi and Nisioi, 2020)                   University of Bucharest   HU        2 (u), 1 (c)   A, B
  PoliTeam (Attanasio and Pastor, 2020)              Politecnico di Torino     IT        2 (c)          A, B
  UniBO (Muti and Barrón-Cedeño, 2020)               University of Bologna     IT        1 (c)          A

5.1 Subtask A: Misogyny & Aggressive Behaviour Identification

Table 5 reports the results for the Misogyny & Aggressive Behaviour Identification task, which received 20 submissions from 8 teams. The highest result has been achieved by jigsaw at 0.7406 in an unconstrained setting and by fabsam at 0.7342 in a constrained run. While the best unconstrained result is based on ensembles of fine-tuned custom BERT models, the one achieved by the best constrained system is grounded on a convolutional neural network that exploits pre-trained word embeddings.

By analysing the detailed results, it emerged that, while the identification of misogynous text can be considered a fairly simple problem, the recognition of aggressiveness still needs to be properly addressed. In fact, the scores reported in Table 5 are strongly affected by the prediction capabilities on aggressive posts. This is likely due to the subjective perception of aggressiveness, captured by the variance of the data available in the ground truth.

Table 5: Results of Subtask A. Constrained runs are marked with "c", unconstrained ones with "u". An amended run, marked with **, has been submitted after the deadline.

  Rank   Run Type   Score   Team
  **     c          0.744   UniBO **
  1      u          0.741   jigsaw
  2      u          0.738   jigsaw
  3      c          0.734   fabsam
  4      u          0.731   YNU OXZ
  5      c          0.731   fabsam
  6      c          0.717   NoPlaceForHateSpeech
  7      u          0.701   YNU OXZ
  8      c          0.695   fabsam
  9      c          0.693   NoPlaceForHateSpeech
  10     c          0.687   AMI the winner
  11     u          0.684   MDD
  12     c          0.683   PoliTeam
  13     c          0.682   MDD
  14     c          0.681   PoliTeam
  15     u          0.668   MDD
  16     c          0.665   AMI the winner
  17     c          0.665   AMI-BASELINE
  18     c          0.647   PoliTeam
  19     c          0.634   UniBO
  20     c          0.626   AMI the winner
  21     c          0.490   NoPlaceForHateSpeech

After the deadline, the UniBO team submitted an amended run (**) that has not been ranked in the official results of the AMI shared task. However, we believe it is worth mentioning their achievement, showing an Average Macro F1-score equal to 0.744.

5.2 Subtask B: Unbiased Misogyny Identification

Table 6 reports the results for the Unbiased Misogyny Identification task, which received 11 submissions by 4 teams, among which 4 unconstrained and 7 constrained. The highest score has been achieved by jigsaw at 0.8825 with an unconstrained run and by PoliTeam at 0.8180 with a constrained submission.

Similarly to the previous task, most of the systems have shown better performance compared to the AMI-BASELINE. By analysing the runs, we can highlight that the two best results achieved on Subtask B have been obtained by the unconstrained run submitted by jigsaw, where a simple debiasing technique based on data augmentation has been adopted, and by the constrained run provided by PoliTeam, where the problem of biased prediction has been partially mitigated by introducing a misogynous lexicon.

Table 6: Results of Subtask B. Constrained runs are marked with "c", unconstrained ones with "u".

  Rank   Run Type   Score   Team
  1      u          0.882   jigsaw
  2      c          0.818   PoliTeam
  3      c          0.814   PoliTeam
  4      c          0.705   fabsam
  5      c          0.702   fabsam
  6      c          0.694   PoliTeam
  7      c          0.691   fabsam
  8      u          0.649   jigsaw
  9      c          0.613   MDD
  10     c          0.602   AMI-BASELINE
  11     u          0.601   MDD
  12     u          0.601   MDD
6 Discussion

The submitted systems can be compared by taking into consideration the kind of input features used to represent tweets and the machine learning model adopted for classification.

Textual Feature Representation. The systems submitted by the challenge participants consider various techniques for representing the tweet contents. Most of the teams experimented with high-level representations of the text based on deep learning solutions. While a few teams, such as fabsam and MDD, adopted a text representation based on traditional word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017), most of the systems, i.e. NoPlaceForHateSpeech, jigsaw, PoliTeam, YNU OXZ and UniBO, exploited richer sentence embeddings such as BERT (Devlin et al., 2019) or XLM-RoBERTa (Ruder et al., 2019). To enrich the feature space used to train the subsequent models for recognizing misogyny and aggressiveness, PoliTeam experimented with additional lexical resources such as a misogynous lexicon and a sentiment lexicon.

Machine Learning Models. Concerning the machine learning models, we can distinguish between approaches trained from scratch and those based on fine-tuning of existing pre-trained models. We report in the following the strategy adopted by the systems that participated in the AMI shared task, according to the type of machine learning model that has been adopted:

• Shallow models have been experimented with by MDD, where logistic regression classifiers have been trained on different hand-crafted features;

• Convolutional Neural Networks have been exploited by NoPlaceForHateSpeech, using two distinct models for misogyny detection and aggressiveness identification, by fabsam, investigating the optimal hyperparameters of the model, and by YNU OXZ, where a Capsule Network (Sabour et al., 2017) has been introduced on top of the CNN architecture to take advantage of spatial patterns available in short texts;

• Fine-Tuning of pre-trained models has been exploited by jigsaw, adapting BERT to the challenge domain and using a multilingual transfer strategy and ensemble learning, by UniBO, which accommodated the BERT model using a multi-label output neuron, and by PoliTeam, where the prediction of the fine-tuned sentence-BERT is coupled with predictions based on lexicons.

Regarding the results achieved on the two subtasks, a few considerations can be drawn by looking at both the errors made by the systems and the mitigation strategies adopted for reducing bias.

Error Analysis. When testing the developed systems on the raw test data, the majority of the errors can be summarized by the following patterns:

• Under-representation of subjective expressions: posts written with erroneous lower case and missing spaces between adjoining words lead the models based on raw text to make errors on test predictions. An example of such common errors is reported in the following tweet:

  "Odio Sakura per il semplice motivo che qualunque cosa faccia o dica Naruto lei lo prende a schiaffi o a pugniHA CHIESTO COME STA SAI DIOSANTO BRUTTA STRONZA MA CONTRALLI MADONNA SPERO CHE TI UCCIDANOscusami Sarada"

• Undefined subject, but presence of aggressive terms: for those tweets where the target is not clearly mentioned, but several aggressive terms are present, the models tend to be biased and to erroneously predict the post as misogynous and aggressive. An example of this type of misclassified posts is reported here:

  "Vero...ma c'e chi ti cerca, che ti vuole, più di ogni cosa al mondo......ma non sa se viene capito..... potrebbe esser mal interpretato e di conseguenza all'abbraccio esser denunciato per molestie sessuali e/o stupro"

• Short hate speech sentences targeting someone other than women: when the target is mentioned by using an actual account, but it refers to a man, and there are no additional indications about the gender of the target, most of the models tend to misclassify the tweet. In the following example, the target is a male football player:

  "@bonucci leo19 Cagati in mano e prenditi a schiaffi. Sti post te li infili nel c*lo!"

Concerning the errors on the synthetic test set used for estimating the bias of the models, two main error types made by the majority of the systems can be identified:

• Presence of unusual targets: in most of the submissions, sentences containing offensive expressions towards specific uncommon targets are misclassified. For instance, around 39% of the predictions related to the target nonna (i.e., grandmother) are wrong. An example involving this most misclassified target is the following:

  "nonne belle"

• Presence of unusual verbs: analogously to what has been observed for the targets, sentences containing rare aggressive verbs tend to be misclassified. For instance, around 48% of the instances related to the verbs malmenare and seviziare (i.e., to beat up and to torture) are wrongly classified. An example of a misclassified sentence is reported here:

  "femmina dovrebbe essere seviziata" (wrongly classified as non-misogynous)

Bias Mitigation Strategies. Concerning Subtask B, only one team (jigsaw) explicitly addressed the problem of unintended bias. The authors used sentences sampled from Italian Wikipedia articles containing some of the identity terms provided with the test set. These sentences, labelled as both non-misogynous and non-aggressive, have been used to further fine-tune the model and reduce the bias given by the data. The results achieved by the jigsaw team highlight that a debiasing method can obtain fair predictions even when using pre-trained models.
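As a rough illustration of this kind of counter-bias augmentation (a sketch of the general idea, not the jigsaw team's actual pipeline), neutral sentences mentioning the identity terms can be appended to the training data with negative labels before fine-tuning or re-training; the example below reuses the hypothetical train_raw DataFrame from the loading sketch in Section 3.1.

```python
# Schematic sketch of debiasing via data augmentation: neutral sentences
# that mention the identity terms are appended to the training set with
# non-misogynous / non-aggressive labels before (re-)training.
import pandas as pd


def augment_with_neutral_sentences(train_df, neutral_sentences, identity_terms):
    """Append neutral sentences mentioning an identity term, labelled 0/0."""
    rows = [
        {"id": f"aug_{i}", "text": sent, "misogynous": 0, "aggressiveness": 0}
        for i, sent in enumerate(neutral_sentences)
        if any(term in sent.lower() for term in identity_terms)
    ]
    return pd.concat([train_df, pd.DataFrame(rows)], ignore_index=True)


# Illustrative usage; real neutral sentences would be sampled from sources
# such as Wikipedia articles containing the identity terms.
neutral = ["La moglie del sindaco ha inaugurato la mostra fotografica."]
augmented = augment_with_neutral_sentences(train_raw, neutral, ["moglie", "fidanzata"])
```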
7 Conclusions and Future Work

This paper presents the AMI shared task, focused not only on identifying misogynous and aggressive expressions but also on ensuring fair predictions. By analysing the runs submitted by the participants, we can conclude that, while the problem of misogyny identification has reached satisfactory results, the recognition of aggressiveness is still in its infancy. Concerning the capabilities of the systems with respect to the unintended bias problem, we can highlight that a domain-dependent mitigation strategy is a necessary step towards fair models.

Acknowledgements

The work of the last author was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31) and by the COST Action 17124 DigForAsp, supported by the European Cooperation in Science and Technology.

References

Resham Ahluwalia, Himani Soni, Edward Callow, Anderson Nascimento, and Martine De Cock. 2018. Detecting Hate Speech Against Women in English Tweets. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic Identification and Classification of Misogynistic Language on Twitter. In Proceedings of the 23rd International Conference on Applications of Natural Language to Information Systems (NLDB), pages 57–64. Springer.

Giuseppe Attanasio and Eliana Pastor. 2020. PoliTeam @ AMI: Improving Sentence Embedding Similarity with Misogyny Lexicons for Automatic Misogyny Identification in Italian Tweets. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Amir Bakarov. 2018. Vector Space Models for Automatic Misogyny Identification. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Angelo Basile and Chiara Rubagotti. 2018. Automatic Identification of Misogyny in English and Italian Tweets at EVALITA 2018 with a Multilingual Hate Lexicon. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, Manuela Sanguinetti, et al. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Davide Buscaldi. 2018. Tweetaneuse AMI EVALITA2018: Character-based Models for the Automatic Misogyny Identification Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Adriano dos S. R. da Silva and Norton T. Roman. 2020. No Place For Hate Speech @ AMI: Convolutional Neural Network and Word Embedding for the Identification of Misogyny in Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.
Samer El Abassi and Sergiu Nisioi. 2020. MDD@AMI: Vanilla Classifiers for Misogyny Identification. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Samuel Fabrizi. 2020. fabsam @ AMI: A Convolutional Neural Network Approach. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In IberEval@SEPLN, pages 214–228.

Simona Frenda, Bilal Ghanem, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, and Luis Villaseñor-Pineda. 2018. Automatic Lexicons Expansion for Multilingual Misogyny Detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Sarah Hewitt, Thanassis Tiropanis, and Christian Bokhove. 2016. The Problem of Identifying Misogynist Language on Twitter (and Other Online Social Spaces). In Proceedings of the 8th ACM Conference on Web Science, pages 333–335. ACM.

Alyssa Lees, Jeffrey Sorensen, and Ian Kivlichan. 2020. Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Marco Lepri, Giuseppe Grieco, and Mattia Sangermano. University of Pisa, Italy.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Arianna Muti and Alberto Barrón-Cedeño. 2020. UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Debora Nozza, Claudia Volpetti, and Elisabetta Fersini. 2019. Unintended Bias in Misogyny Detection. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 149–155.

Xiaozhi Ou and Hongling Li. 2020. YNU OXZ @ HaSpeeDe 2 and AMI: XLM-RoBERTa with Ordered Neurons LSTM for Classification Task at EVALITA 2020. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.
Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. Automatic Identification of Misogyny in English and Italian Tweets at EVALITA 2018 with a Multilingual Hate Lexicon. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, pages 1532–1543.

Bailey Poland. 2016. Haters: Harassment, Abuse, and Violence Online. Potomac Books, Incorporated.

Sebastian Ruder, Anders Søgaard, and Ivan Vulić. 2019. Unsupervised Cross-Lingual Representation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 31–38. Association for Computational Linguistics.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic Routing Between Capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.