UniBO @ AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo

Arianna Muti
Department of Modern Languages, Literatures and Cultures - LILEC
Università di Bologna, Bologna, Italy
arianna.muti@studio.unibo.it

Alberto Barrón-Cedeño
DIT – Università di Bologna
Forlì, Italy
a.barron@unibo.it

Abstract

We describe our participation in the EVALITA 2020 (Basile et al., 2020) shared task on Automatic Misogyny Identification. We focus on task A (Misogyny and Aggressive Behaviour Identification), which aims at detecting whether a tweet in Italian is misogynous and, if so, whether it is aggressive. Rather than building two different models, one for misogyny and one for aggressiveness identification, we handle the problem as a single multi-class classification task with three classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. Our three-class supervised model, built on top of AlBERTo, obtains an overall F1 score of 0.7438 on the task test set (F1 = 0.8102 for the misogyny and F1 = 0.6774 for the aggressiveness task), which outperforms the top submitted model (F1 = 0.7406).[1]

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] Our official submission to the task obtained F1 = 0.6343 (F1 = 0.7263 for the misogyny and F1 = 0.5423 for the aggressiveness task). The reason behind this poor performance was the unintended use of the wrong transformer. See Appendix A for further details.

1 Introduction

In 2020, Twitter users in Italy amount to approximately 3.7 million, and their number is expected to keep growing through 2026.[2] Although Twitter is conceived for expressing personal opinions, sharing the day's biggest news, following people, or simply communicating with friends, an increasing number of users misuse the platform by engaging in trolling and cyberbullying, or by posting aggressive and misogynous content (Samghabadi et al., 2020). Due to the sheer amount of user-generated content on social media, providers struggle to control inappropriate content. Twitter relies on the community's reports to identify and remove abusive posts from the platform, while preserving the users' right to freedom of expression. However, it is a tricky task to determine where to draw the line between free expression and the production of harmful content, due to the subjective nature of what different users perceive as offensive. Twitter has committed to tackling this issue by releasing a policy containing a clear definition of abusive speech, according to which a user cannot promote violence against, or directly attack or threaten, people on the basis of race, ethnicity, national origin, caste, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease.[3]

[2] https://www.statista.com/forecasts/1146708/twitter-users-in-italy; last visit: 6 November, 2020.
[3] https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

However, two main issues exist. Since Twitter mostly relies on the community's subjective perception of hate speech, many posts are never reported, reviewed, or removed. Moreover, the amount of abusive posts significantly outnumbers the people who can manually review harmful content. Therefore, there is a need to improve the quality of algorithms that spot potential instances of hate speech, in particular towards women, since research shows that they are subjected to more bullying, abuse, hateful language, and threats than men on social media (Fallows, 2005).
AMI 2020 consists of two tasks (Fersini et al., 2020). Task A (Misogyny and Aggressive Behaviour Identification) aims at detecting whether a Twitter post is misogynous and, if so, whether it is aggressive (Anzovino et al., 2018). Task B (Unbiased Misogyny Identification) aims at discriminating misogynous contents from non-misogynous ones, while guaranteeing the fairness of the model (in terms of unintended bias) on a synthetic dataset (Nozza et al., 2019). We undertook task A and present a system to flag misogynous and women-addressed aggressive posts on Twitter in the Italian language. Even though task A involves two sub-problems, we address it as a three-class supervised problem using AlBERTo (Polignano et al., 2019), a BERT language understanding model for Italian focused on the language used in social networks, specifically on Twitter. We built a single model to identify the three possible classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. This multi-class setting has proven effective: our approach obtains an F1 score of 0.7438, outperforming the top-ranked official submission (although our own official submission obtained only F1 = 0.6343; cf. Appendix A).

The rest of the contribution is organized as follows. Section 2 provides some background and a brief overview of research in automatic misogyny identification. Section 3 describes the employed dataset. Section 4 describes our model. Section 5 summarizes the experiments performed and discusses the obtained results, including an error analysis that shows the error trends of the model. Section 6 draws some conclusions and discusses possible further research lines.

2 Background

Due to the subjective perception of misogyny and aggressiveness, a definition of what can be considered misogynous and aggressive is necessary. Misogynous content expresses hatred towards women, in the form of insults, sexual harassment, male privilege, patriarchy, gender discrimination, belittling of women, violence against women, body shaming, and sexual objectification (Srivastava et al., 2017). A misogynous content expresses an aggressive attitude when it overtly or covertly encourages or legitimizes violent actions against women.

From a computational point of view, misogyny detection is a text classification task. Text classification in Natural Language Processing has been widely explored and is typically addressed with supervised models (Mirończuk and Protasiewicz, 2018). Past research shows the effectiveness of diverse neural-network architectures for learning text representations, such as convolutional models, recurrent networks, and attention mechanisms (Sun et al., 2019). Recent work shows that pre-trained models such as BERT achieve state-of-the-art results in text classification tasks while saving time, since they avoid training models from scratch (Sun et al., 2019).

As for misogyny identification, a shared task took place at IberEval 2018, focusing on English and Spanish tweets (Fersini et al., 2018b). Whereas its task A concerned misogyny identification, its task B proposed a multi-class problem to classify misogynous sentences into seven categories: discredit, stereotype, objectification, sexual harassment, threats of violence, dominance, and derailing. The most-used supervised models were support vector machines, ensembles of classifiers, and deep-learning models. Participants mostly represented the tweets with n-grams and word embeddings.

As for misogyny identification in Italian, the first edition of the AMI shared task took place in 2018 (Anzovino et al., 2018). Task A was again misogyny identification, while task B aimed at recognizing whether a misogynous content is person-specific or generally addressed towards a group of women, and at classifying the positive instances into the aforementioned categories.
The best-performing approach obtained an F1 score of 0.844, using TF-IDF weighting combined with singular value decomposition for language representation and an ensemble of supervised models (Fersini et al., 2018a).

3 Dataset

As mentioned above, the aim of our model is to flag misogynous contents and aggressive attitudes towards women in Italian tweets. To address this task, a dataset was provided by the task organizers: 5,000 tweets, manually labelled according to two classes, misogyny and aggressiveness. The first one defines whether a tweet has been flagged as misogynous (positive class) or not (negative class). If a tweet has been flagged as misogynous, it is further determined whether it is considered aggressive (positive class) or not (negative class).

The training dataset is fairly balanced in terms of misogyny: it contains 2,337 misogynous and 2,663 non-misogynous instances. A total of 1,783 of the former are also considered aggressive, whereas only 554 are not. The test set was composed of 1,000 tweets.

Since we opted for a constrained approach, we only used the data provided by the organizers. We randomly split the supervised data into training and validation sets: 4,700 instances for the former and 300 for the latter.
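To make the three-class reformulation concrete, the following sketch shows one way to collapse the two binary annotations into a single label and to obtain the 4,700/300 split. It is a minimal illustration only: the file name and the column names (misogynous, aggressiveness) are assumptions and may differ from the identifiers in the official data release.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the official TSV may use different ones.
df = pd.read_csv("ami2020_training.tsv", sep="\t")

def to_three_classes(row):
    """Collapse the two binary annotations into one 3-class label:
    0 = non-misogynous, 1 = non-aggressive misogynous, 2 = aggressive misogynous."""
    if row["misogynous"] == 0:
        return 0
    return 2 if row["aggressiveness"] == 1 else 1

df["label"] = df.apply(to_three_classes, axis=1)

# Random split into 4,700 training and 300 validation instances.
train_df, val_df = train_test_split(df, test_size=300, random_state=42)
```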
4 Description of the System

Since the identification of aggressiveness is related to the identification of misogynous tweets, we opt for a 3-class setting based on a single model. The three classes are hence non-misogynous, aggressive misogynous, and non-aggressive misogynous. The idea is to determine how well a multi-class classifier can perform when addressing these two related problems, handling aggressiveness as a consequential class of the misogyny one.

We decided to base our model on BERT (Bidirectional Encoder Representations from Transformers), a task-independent language representation model based on the transformer architecture (Devlin et al., 2019). BERT uses a masking approach that randomly masks some input tokens within a sentence and then predicts the removed tokens based on the context. It is bidirectional because its transformer layers consider both the left and the right context of the hidden word at once when making the prediction. We decided to use AlBERTo, a variation of BERT for Italian trained on Twitter posts (Polignano et al., 2019), which include emojis, links, hashtags, and mentions. AlBERTo was trained on 200M tweets randomly sampled from the TWITA corpus (Basile et al., 2018).

As for the pre-processing, we used the pre-trained AlBERTo tokenizer for text tokenization and then encoded the data. We set the maximum length to 256 characters, since that was the length of the longest instance in the training material (even if Twitter allows up to 280 characters). We used the PyTorch instance of AlBERTo-Base, Italian Twitter lower cased,[4] and fine-tuned it on the downstream task. We used a softmax output layer with three neurons to produce the classification. In order to tune the network, we used the AdamW optimizer, which decouples weight decay from gradient computation, with a learning rate of 1e-5 (Loshchilov and Hutter, 2017).[5]

[4] https://github.com/marcopoli/AlBERTo-it
[5] The implementation is available at https://github.com/TinfFoil/unibo_ami2020/.
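The sketch below illustrates this fine-tuning setup (AlBERTo with a three-neuron output layer, AdamW, learning rate 1e-5, maximum length 256) with the HuggingFace transformers library. It is a minimal sketch rather than the released implementation (cf. footnote 5): the Hub checkpoint identifier and the placeholder training data are assumptions, and the training loop is reduced to its essentials.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hub identifier for AlBERTo-Base Italian Twitter lower cased;
# the checkpoint is also distributed via https://github.com/marcopoli/AlBERTo-it
MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"

# Placeholder tweets and 3-class labels
# (0 = non-misogynous, 1 = non-aggressive misogynous, 2 = aggressive misogynous).
train_texts = ["esempio di tweet uno", "esempio di tweet due"]
train_labels = [0, 2]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Tokenize, padding/truncating to a maximum length of 256.
enc = tokenizer(train_texts, padding="max_length", truncation=True,
                max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(8):  # 8 epochs with batch size 16: the best configuration found
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))  # cross-entropy over the 3 classes
        out.loss.backward()
        optimizer.step()
```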
5 Results

We explored different batch sizes over an increasing number of learning epochs. Table 1 shows the performance evolution on the validation set. The best combination was to train the model over 8 epochs with a batch size of 16, which leads to an F1 score of 0.8491 on the three-class problem. It is worth noting that these scores are not comparable against those for the actual task, which consists of two binary decisions: whether a tweet is considered misogynous and, if the answer is yes, whether it is aggressive.[6]

[6] Indeed, the official task score is computed as the average of the F1 measures for the misogyny and the aggressiveness classifications.

              batch size
epochs      16        32
   8      0.8491    0.8392
  10      0.8485    0.8298
  15      0.8283    0.8351
  20      0.8342    0.8087

Table 1: F1 performance of the 3-class model with different batch sizes after diverse numbers of epochs, using AlBERTo.

Given these results, we trained a new model on the full training and validation data for 8 epochs, using a batch size of 16, and predicted on the test set. This model obtains F1 = 0.7438, resulting from 0.8102 on the misogyny task and 0.6774 on the aggressiveness one.

Table 2 shows the AMI shared task leaderboard. It highlights both our official submission, UniBO run 1 (cf. Appendix A), and our post-deadline submission, UniBO run 2. Run 2 tops all the systems submitted to the shared task. Indeed, modelling the two tasks as one single multi-class problem (and using transformers for the right language) helps the algorithm significantly.

team                    run  constrained  score
UniBO (a)                2   yes          0.7438
jigsaw                   2   no           0.7406
jigsaw                   1   no           0.7380
fabsam                   1   yes          0.7343
YNU OXZ                  1   no           0.7314
fabsam                   2   yes          0.7309
NoPlaceForHateSpeech     2   yes          0.7167
YNU OXZ                  2   no           0.7015
fabsam                   3   yes          0.6948
NoPlaceForHateSpeech     1   yes          0.6934
AMI the winner           2   yes          0.6869
MDD                      3   no           0.6844
PoliTeam                 3   yes          0.6835
MDD                      1   yes          0.6820
PoliTeam                 1   yes          0.6810
MDD                      2   no           0.6679
AMI the winner           1   yes          0.6653
PoliTeam                 2   yes          0.6473
UniBO (b)                1   yes          0.6343
AMI the winner           3   yes          0.6259
NoPlaceForHateSpeech     3   yes          0.4902

(a) Run submitted after the deadline.
(b) Buggy run submitted on the deadline (cf. Appendix A).

Table 2: Full shared task leaderboard plus our unofficial top-performing submission. The score is the average of the F1 measures for the misogyny and the aggressiveness tasks.
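Because the official score averages the F1 measures of the misogyny and aggressiveness binary decisions (cf. footnote 6), the three-class predictions have to be mapped back onto the two binary sub-tasks before scoring. The sketch below illustrates this mapping; it assumes the 0/1/2 label encoding introduced in Section 3 and uses the positive-class F1 for each sub-task, which may differ in detail from the official evaluation script.

```python
from sklearn.metrics import f1_score

def ami_task_a_score(y_true, y_pred):
    """Map 3-class labels (0 = non-misogynous, 1 = non-aggressive misogynous,
    2 = aggressive misogynous) back to the two binary sub-tasks and average
    their F1 scores, mirroring the task A metric."""
    mis_true = [int(y > 0) for y in y_true]
    mis_pred = [int(y > 0) for y in y_pred]
    agg_true = [int(y == 2) for y in y_true]
    agg_pred = [int(y == 2) for y in y_pred]
    f1_mis = f1_score(mis_true, mis_pred)
    f1_agg = f1_score(agg_true, agg_pred)
    return (f1_mis + f1_agg) / 2, f1_mis, f1_agg

# Toy example: overall score, misogyny F1, and aggressiveness F1.
print(ami_task_a_score([0, 1, 2, 2], [0, 2, 2, 1]))
```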
Error Analysis. After the release of the gold labels, we performed an analysis of the classification errors. We analyzed 300 instances taken randomly from the test set (100 at the beginning, 100 in the middle, and 100 at the end). As observed from the reported performance, our model struggled mostly with the identification of aggressive instances. As a result, there are relatively few cases in which our model correctly labels non-aggressive misogynous instances: most of the time, when our model labels an instance as misogynous, it also labels it as aggressive. On the contrary, the system performs very well in identifying non-misogynous instances and aggressive misogynous instances. The most common mistakes fall into three categories:

1. The system identifies as aggressive instances that contain verbs expressing an aggressive attitude.[7]

2. The system identifies as misogynous (and most of the time also aggressive) instances that are neither misogynous nor aggressive, but contain typical misogynous sentences.

3. The system identifies as misogynous instances that are neither misogynous nor aggressive, but contain double-entendre words typically used to insult women.

[7] One potential reason behind this confusion is that we suspect there are aggressive tweets in the dataset which, not having been identified as misogynous in the first place, are labelled as non-aggressive. This hypothesis should be further explored.

Table 3 shows examples of all three kinds of errors.

     tweet                                                             misogynous     aggressive
                                                                      actual  pred   actual  pred
Type-1 errors
1.1  La cosa del filo del telefono attorno alla gola è presa da una     no     yes     no     yes
     lite di Stefanie e Brooke in Beautiful #chilhavisto
1.2  "Quanti schiaffi, #Malinconico. Li meriteresti tutti, perché       no     yes     no     yes
     anche questa volta sei riuscito a lasciare il segno. E no, non
     era facile..."
1.3  Tutti quelli che ho conosciuto mi hanno sempre detto che a         yes    no      no     no
     primo impatto sembro stronza e psicopatica ma in verità sono
     una cucciola
Type-2 errors
2.1  @Fedoc5 in ginocchio con te che comandi il ritmo del pompino e     no     yes     no     yes
     decidi quando sborrarmi in gola, perchè un pompino deve sempre
     finire con l'ingoio
2.2  @Nigagalsen @matteosalvinimi un follower dal 2016... e ti          no     yes     no     no
     permetti di aprire quella lurida fogna di bocca che hai.
2.3  Antonio ti prenderei a schiaffi, come fa a dire mi manchi,         no     no      no     no
     quando con Ilaria fai tutto. Coglione
2.4  Posso volerlo vedere cagarsi in mano e prendersi a schiaffi        no     no      no     no
     finché non diventano dispari o sono nazista?
Type-3 errors
3.1  La cena è pronta. Lo squalo balena affonda nei banchi di pesce     no     yes     no     no
     per saziare la sua immensa mole. Non è un abitudine frequente
     visto che filtra dall'acqua i microorganismi come le balene.
3.2  Comunque le pringles più buone sono quelle alla panna acida e      no     yes     no     no
     cipolla

Table 3: Instances from the test partition, including their actual class and the one predicted by our model, for both misogyny and aggressiveness.

Regarding the type-1 errors, in instance 1.1 the action of winding a telephone cable around the neck was perceived as aggressive, even though the speaker does not express a misogynous or aggressive attitude towards a woman; she is just commenting on something watched on TV. In instance 1.2, the expression "meritare gli schiaffi" (deserving slaps) denotes violence, but it is not addressed towards a woman. This kind of mistake might be overcome by implementing a model trained on the misogynous partition of the data only. Finally, instance 1.3 reflects the bias related to the subjective nature of what is perceived as misogynous. According to the annotation guidelines, a tweet should be flagged as misogynous if it expresses hatred towards women. In this case, the poster of the tweet is not expressing any misogynous attitude, but reporting what she has been told by men. Therefore, our system flagged the instance as non-misogynous, and we could agree.

As for the type-2 errors, if we look at the text only, the instances could seem misogynous. However, in instances 2.1 and 2.2 the mention tells us that the post refers to a man, and the system fails to understand that. On the contrary, the system performs well when a masculine name or a masculine pronoun is specified instead of a mention, as we can observe in instances 2.3 and 2.4. In these cases our system understands that the aggressive actions, which usually tend to be classified as aggressive misogynous, do not refer to a woman.

For the type-3 errors, in instance 3.1 the word "balena" (whale, but also an insult for a fat woman) and in instance 3.2 the word "acida" (acid/sour, but also peevish) could confuse the model, causing it to flag such instances as misogynous.

6 Conclusions and Further Work

In this paper we described our approach to the EVALITA 2020 task on misogyny and aggressiveness identification in Italian tweets (AMI). The purpose of our participation was to determine whether a multi-class classifier is a good way to address this two-step task. Although the task seems conceived to be addressed with two different models, one for the identification of misogyny and the other for aggressiveness, we decided to try a different approach and built a single model that identifies three cases: non-misogynous, non-aggressive misogynous, and aggressive misogynous tweets.

We built our model on top of AlBERTo, an Italian version of BERT, and trained it using only the dataset provided by the task organizers. We experimented with different batch sizes over an increasing number of epochs. The highest F1 score on the validation set was reached with a batch size of 16 after 8 epochs. When evaluated on the test set, our model obtained an overall F1 score of 0.7438: 0.8102 for the misogyny and 0.6774 for the aggressiveness task. We hypothesize that the model struggles to identify misogynous aggressive instances partly because it gets confused by non-misogynous aggressive tweets, which are labelled simply as non-misogynous. The implementation is publicly available for research purposes.

As for further experiments, we plan to build two separate models: one to detect misogyny and another, trained only on already-flagged misogynous tweets, to identify instances of aggressiveness. Another step would be to use an unconstrained approach and increase the number of instances in the training set, so that the model has more data to learn from.
References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, June. ACL.

Deborah Fallows. 2005. How women and men use the internet. Technical report, Pew Internet & American Life Project, December.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In EVALITA Evaluation of NLP and Speech Tools for Italian: Proceedings of the Final Workshop 12-13 December 2018, Naples, pages 59–66. Torino: Accademia University Press.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the task on automatic misogyny identification at IberEval 2018. In Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), Sevilla, Spain.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. AMI @ EVALITA2020: Automatic misogyny identification. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.

Marcin M. Mirończuk and Jarosław Protasiewicz. 2018. A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106:36–54.

Debora Nozza, Claudia Volpetti, and Elisabetta Fersini. 2019. Unintended bias in misogyny detection. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 149–155, Thessaloniki, Greece.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481, Bari, Italy. CEUR.

Niloofar S. Samghabadi, Parth Patwa, Srinivas PYKL, Prerana Mukherjee, Amitava Das, and Thamar Solorio. 2020. Aggression and misogyny detection using BERT: A multi-task approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).

Kalpana Srivastava, Suprakash Chaudhury, P. S. Bhat, and Samiksha Sahu. 2017. Misogyny, feminism, and sexual harassment. Industrial Psychiatry Journal, 26(2):111–113.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? CoRR, abs/1905.05583.

A Official English-BERT-based Submission

Our official submission used a pre-trained BERT model trained only on the English language. The experimentation and tuning were identical to those applied when using AlBERTo (cf. Section 5). Table 4 shows the tuning evolution. The best configuration of this model, derived from the English BERT, obtains an F1 score of 0.8222 on the validation set when dealing with our three-class problem. Nevertheless, the performance dropped to F1 = 0.6343 on the test set.

                  batch size
epochs       8        16        32
   5      0.8126   0.8042    0.7955
   8      0.8067   0.8222    0.8004
  10      0.8042   0.8069    0.8141
  15      0.8095   0.8037    0.8121
  20      0.7895   0.8178    0.8153

Table 4: F1 performance of the 3-class model with different batch sizes after diverse numbers of epochs, using BERT for English.