UniBO @ AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo

Arianna Muti
Department of Modern Languages, Literatures and Cultures - LILEC
Università di Bologna, Bologna, Italy
arianna.muti@studio.unibo.it

Alberto Barrón-Cedeño
DIT – Università di Bologna
Forlì, Italy
a.barron@unibo.it

Abstract

We describe our participation in the EVALITA 2020 (Basile et al., 2020) shared task on Automatic Misogyny Identification. We focus on task A (Misogyny and Aggressive Behaviour Identification), which aims at detecting whether a tweet in Italian is misogynous and, if so, whether it is aggressive. Rather than building two different models, one for misogyny and one for aggressiveness identification, we handle the problem as a single multi-class classification task with three classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. Our three-class supervised model, built on top of AlBERTo, obtains an overall F1 score of 0.7438 on the task test set (F1 = 0.8102 for the misogyny and F1 = 0.6774 for the aggressiveness task), which outperforms the top submitted model (F1 = 0.7406).[1]

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] Our official submission to the task obtained F1 = 0.6343 (F1 = 0.7263 for the misogyny and F1 = 0.5423 for the aggressiveness task). The reason behind this poor performance was the unintended use of the wrong transformer. See Appendix A for further details.

1 Introduction

In 2020, Twitter users in Italy amount to approximately 3.7 million, and their number is expected to keep growing through 2026.[2] Although Twitter is conceived for expressing personal opinions, sharing the day's biggest news, following people, or simply communicating with friends, an increasing number of users misuse the platform by engaging in trolling and cyberbullying, or by posting aggressive and misogynous content (Samghabadi et al., 2020). Due to the sheer amount of user-generated content on social media, providers struggle to control inappropriate content. Twitter relies on the community's reports to identify and remove abusive posts from the platform, while preserving the users' right to freedom of expression. However, it is a tricky task to determine where to draw the line between free expression and the production of harmful content, due to the subjective nature of what different users perceive as offensive. Twitter has committed to tackling this issue by releasing a policy containing a clear definition of abusive speech, according to which a user cannot promote violence against, or directly attack or threaten, people on the basis of race, ethnicity, national origin, caste, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease.[3]

[2] https://www.statista.com/forecasts/1146708/twitter-users-in-italy; last visit: 6 November, 2020.
[3] https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

However, two main issues exist. Since Twitter mostly relies on the community's subjective perception of hate speech, many posts are never reported, reviewed, or removed. Moreover, the amount of abusive posts significantly outnumbers the people who can manually review harmful content. Therefore, there is a need to improve the quality of algorithms that spot potential instances of hate speech, in particular towards women, since research shows that they are subjected to more bullying, abuse, hateful language, and threats than men on social media (Fallows, 2005).
AMI 2020 consists of two tasks (Fersini et al., 2020). Task A (Misogyny and Aggressive Behaviour Identification) aims at detecting whether a Twitter post is misogynous and, if so, whether it is aggressive (Anzovino et al., 2018). Task B (Unbiased Misogyny Identification) aims at discriminating misogynous contents from non-misogynous ones, while guaranteeing the fairness of the model (in terms of unintended bias) on a synthetic dataset (Nozza et al., 2019). We undertook task A and present a system to flag misogynous and women-addressed aggressive posts on Twitter in the Italian language. Even though task A involves two sub-problems, we address it as a three-class supervised problem using AlBERTo (Polignano et al., 2019), a BERT language understanding model for Italian focused on the language used in social networks, specifically on Twitter. We built a single model to identify the three possible classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. This multi-class setting has proven effective: our approach obtains an F1 score of 0.7438, outperforming the top-ranked official submission (although our own official submission obtained only F1 = 0.6343; cf. Appendix A).

The rest of the contribution is organized as follows. Section 2 provides some background and a brief overview of research in automatic misogyny identification. Section 3 describes the employed dataset. Section 4 describes our model. Section 5 summarizes the experiments performed and discusses the obtained results, including an error analysis that shows the error trends of the model. Section 6 draws some conclusions and discusses possible further research lines.

2 Background

Due to the subjective perception of misogyny and aggressiveness, a definition of what can be considered misogynous and aggressive is necessary. Misogynous content expresses hatred towards women, in the form of insults, sexual harassment, male privilege, patriarchy, gender discrimination, belittling of women, violence against women, body shaming, and sexual objectification (Srivastava et al., 2017). A misogynous content expresses an aggressive attitude when it overtly or covertly encourages or legitimizes violent actions against women.

From a computational point of view, misogyny detection is a text classification task. Text classification in Natural Language Processing has been widely explored and is typically addressed with supervised models (Mirończuk and Protasiewicz, 2018). Past research shows the effectiveness of diverse neural-network architectures for learning text representations, such as convolutional models, recurrent networks, and attention mechanisms (Sun et al., 2019). Recent work shows that pre-trained models such as BERT achieve state-of-the-art results in text classification tasks while saving time, since they avoid training models from scratch (Sun et al., 2019).

As for misogyny identification, a shared task took place at IberEval 2018, focusing on English and Spanish tweets (Fersini et al., 2018b). Whereas its task A concerned misogyny identification, its task B proposed a multi-class problem to classify misogynous sentences into seven categories: discredit, stereotype, objectification, sexual harassment, threats of violence, dominance, and derailing. The most-used supervised models were support vector machines, ensembles of classifiers, and deep-learning models. Participants mostly represented the tweets with n-grams and word embeddings.

As for misogyny identification in Italian, the first edition of the AMI shared task took place in 2018 (Anzovino et al., 2018). Task A was again misogyny identification, while task B aimed at recognizing whether a misogynous content is person-specific or generally addressed towards a group of women, and at classifying the positive instances into the aforementioned categories.
The best-performing approach obtained an F1 score of 0.844, using TF-IDF weighting combined with singular value decomposition for language representation and an ensemble of supervised models (Fersini et al., 2018a).

3 Dataset

As mentioned above, the aim of our model is to flag misogynous contents and aggressive attitudes towards women in Italian tweets. To address this task, a dataset was provided by the task organizers: 5,000 tweets, manually labelled according to two classes, misogyny and aggressiveness. The first one defines whether a tweet has been flagged as misogynous (positive class) or not (negative class). If a tweet has been flagged as misogynous, it is further determined whether it is considered aggressive (positive class) or not (negative class).

The training dataset is fairly balanced in terms of misogyny: it contains 2,337 misogynous and 2,663 non-misogynous instances. A total of 1,783 of the former are also considered aggressive, whereas only 554 are not. The test set was composed of 1,000 tweets.

Since we opted for a constrained approach, we only used the data provided by the organizers. We randomly split the supervised data into training and validation sets: 4,700 instances for the former and 300 for the latter.
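To make the three-class reformulation concrete, the following sketch shows one way to collapse the two binary annotations into a single label and to obtain the 4,700/300 split. It is a minimal illustration only: the file name and the column names (misogynous, aggressiveness) are assumptions and may differ from the identifiers in the official data release.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the official TSV may use different ones.
df = pd.read_csv("ami2020_training.tsv", sep="\t")

def to_three_classes(row):
    """Collapse the two binary annotations into one 3-class label:
    0 = non-misogynous, 1 = non-aggressive misogynous, 2 = aggressive misogynous."""
    if row["misogynous"] == 0:
        return 0
    return 2 if row["aggressiveness"] == 1 else 1

df["label"] = df.apply(to_three_classes, axis=1)

# Random split into 4,700 training and 300 validation instances.
train_df, val_df = train_test_split(df, test_size=300, random_state=42)
```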
4 Description of the System

Since the identification of aggressiveness is related to the identification of misogynous tweets, we opt for a 3-class setting based on a single model. The three classes are hence non-misogynous, aggressive misogynous, and non-aggressive misogynous. The idea is to determine how well a multi-class classifier can perform when addressing these two related problems, handling aggressiveness as a consequential class of the misogyny one.

We decided to base our model on BERT (Bidirectional Encoder Representations from Transformers), a task-independent language representation model based on the transformer architecture (Devlin et al., 2019). BERT uses a masking approach that randomly masks some input tokens within a sentence and then predicts the removed tokens based on the context. It is bidirectional because its transformer layers consider both the left and the right context of the hidden word at once when making the prediction. We decided to use AlBERTo, a variation of BERT for Italian trained on Twitter posts (Polignano et al., 2019), which include emojis, links, hashtags, and mentions. AlBERTo was trained on 200M tweets randomly sampled from the TWITA corpus (Basile et al., 2018).

As for the pre-processing, we used the pre-trained AlBERTo tokenizer for text tokenization and then encoded the data. We set the maximum length to 256 characters, since that was the length of the longest instance in the training material (even if Twitter allows up to 280 characters). We used the PyTorch instance of AlBERTo-Base, Italian Twitter lower cased,[4] and fine-tuned it on the downstream task. We used a softmax output layer with three neurons to produce the classification. In order to tune the network, we used the AdamW optimizer, which decouples weight decay from gradient computation, with a learning rate of 1e-5 (Loshchilov and Hutter, 2017).[5]

[4] https://github.com/marcopoli/AlBERTo-it
[5] The implementation is available at https://github.com/TinfFoil/unibo_ami2020/.
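The sketch below illustrates this fine-tuning setup (AlBERTo with a three-neuron output layer, AdamW, learning rate 1e-5, maximum length 256) with the HuggingFace transformers library. It is a minimal sketch rather than the released implementation (cf. footnote 5): the Hub checkpoint identifier and the placeholder training data are assumptions, and the training loop is reduced to its essentials.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hub identifier for AlBERTo-Base Italian Twitter lower cased;
# the checkpoint is also distributed via https://github.com/marcopoli/AlBERTo-it
MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"

# Placeholder tweets and 3-class labels
# (0 = non-misogynous, 1 = non-aggressive misogynous, 2 = aggressive misogynous).
train_texts = ["esempio di tweet uno", "esempio di tweet due"]
train_labels = [0, 2]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Tokenize, padding/truncating to a maximum length of 256.
enc = tokenizer(train_texts, padding="max_length", truncation=True,
                max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(8):  # 8 epochs with batch size 16: the best configuration found
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))  # cross-entropy over the 3 classes
        out.loss.backward()
        optimizer.step()
```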
5 Results

We explored different batch sizes over an increasing number of learning epochs. Table 1 shows the performance evolution on the validation set. The best combination was to train the model over 8 epochs with a batch size of 16, which leads to an F1 score of 0.8491 on the three-class problem. It is worth noting that these scores are not comparable against those for the actual task, which consists of two binary decisions: whether a tweet is considered misogynous and, if the answer is yes, whether it is aggressive.[6]

[6] Indeed, the official task score is computed as the average of the F1 measures for the misogyny and the aggressiveness classifications.

              batch size
epochs      16        32
   8      0.8491    0.8392
  10      0.8485    0.8298
  15      0.8283    0.8351
  20      0.8342    0.8087

Table 1: F1 performance of the 3-class model with different batch sizes after diverse numbers of epochs, using AlBERTo.

Given these results, we trained a new model on the full training and validation data for 8 epochs, using a batch size of 16, and predicted on the test set. This model obtains F1 = 0.7438, resulting from 0.8102 on the misogyny task and 0.6774 on the aggressiveness one.

Table 2 shows the AMI shared task leaderboard. It highlights both our official submission, UniBO run 1 (cf. Appendix A), and our post-deadline submission, UniBO run 2. Run 2 tops all the systems submitted to the shared task. Indeed, modelling the two tasks as one single multi-class problem (and using transformers for the right language) helps the algorithm significantly.

team                    run  constrained  score
UniBO (a)                2   yes          0.7438
jigsaw                   2   no           0.7406
jigsaw                   1   no           0.7380
fabsam                   1   yes          0.7343
YNU OXZ                  1   no           0.7314
fabsam                   2   yes          0.7309
NoPlaceForHateSpeech     2   yes          0.7167
YNU OXZ                  2   no           0.7015
fabsam                   3   yes          0.6948
NoPlaceForHateSpeech     1   yes          0.6934
AMI the winner           2   yes          0.6869
MDD                      3   no           0.6844
PoliTeam                 3   yes          0.6835
MDD                      1   yes          0.6820
PoliTeam                 1   yes          0.6810
MDD                      2   no           0.6679
AMI the winner           1   yes          0.6653
PoliTeam                 2   yes          0.6473
UniBO (b)                1   yes          0.6343
AMI the winner           3   yes          0.6259
NoPlaceForHateSpeech     3   yes          0.4902

(a) Run submitted after the deadline.
(b) Buggy run submitted on the deadline (cf. Appendix A).

Table 2: Full shared task leaderboard plus our unofficial top-performing submission. The score is the average of the F1 measures for the misogyny and the aggressiveness tasks.
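Because the official score averages the F1 measures of the misogyny and aggressiveness binary decisions (cf. footnote 6), the three-class predictions have to be mapped back onto the two binary sub-tasks before scoring. The sketch below illustrates this mapping; it assumes the 0/1/2 label encoding introduced in Section 3 and uses the positive-class F1 for each sub-task, which may differ in detail from the official evaluation script.

```python
from sklearn.metrics import f1_score

def ami_task_a_score(y_true, y_pred):
    """Map 3-class labels (0 = non-misogynous, 1 = non-aggressive misogynous,
    2 = aggressive misogynous) back to the two binary sub-tasks and average
    their F1 scores, mirroring the task A metric."""
    mis_true = [int(y > 0) for y in y_true]
    mis_pred = [int(y > 0) for y in y_pred]
    agg_true = [int(y == 2) for y in y_true]
    agg_pred = [int(y == 2) for y in y_pred]
    f1_mis = f1_score(mis_true, mis_pred)
    f1_agg = f1_score(agg_true, agg_pred)
    return (f1_mis + f1_agg) / 2, f1_mis, f1_agg

# Toy example: overall score, misogyny F1, and aggressiveness F1.
print(ami_task_a_score([0, 1, 2, 2], [0, 2, 2, 1]))
```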
Error Analysis. After the release of the gold labels, we performed an analysis of the classification errors. We analyzed 300 instances taken randomly from the test set (100 at the beginning, 100 in the middle, and 100 at the end). As observed from the reported performance, our model struggled mostly with the identification of aggressive instances. As a result, there are relatively few cases in which our model correctly labels non-aggressive misogynous instances: most of the time, when our model labels an instance as misogynous, it also labels it as aggressive. On the contrary, the system performs very well in identifying non-misogynous instances and aggressive misogynous instances. The most common mistakes fall into three categories:

1. The system identifies as aggressive instances that contain verbs expressing an aggressive attitude.[7]

2. The system identifies as misogynous (and most of the time also aggressive) instances that are neither misogynous nor aggressive, but contain typical misogynous sentences.

3. The system identifies as misogynous instances that are neither misogynous nor aggressive, but contain double-entendre words typically used to insult women.

[7] One potential reason behind this confusion is that we suspect there are aggressive tweets in the dataset which, not having been identified as misogynous in the first place, are labelled as non-aggressive. This hypothesis should be further explored.

Table 3 shows examples of all three kinds of errors.

     tweet                                                             misogynous     aggressive
                                                                      actual  pred   actual  pred
Type-1 errors
1.1  La cosa del filo del telefono attorno alla gola è presa da una     no     yes     no     yes
     lite di Stefanie e Brooke in Beautiful #chilhavisto
1.2  "Quanti schiaffi, #Malinconico. Li meriteresti tutti, perché       no     yes     no     yes
     anche questa volta sei riuscito a lasciare il segno. E no, non
     era facile..."
1.3  Tutti quelli che ho conosciuto mi hanno sempre detto che a         yes    no      no     no
     primo impatto sembro stronza e psicopatica ma in verità sono
     una cucciola
Type-2 errors
2.1  @Fedoc5 in ginocchio con te che comandi il ritmo del pompino e     no     yes     no     yes
     decidi quando sborrarmi in gola, perchè un pompino deve sempre
     finire con l'ingoio
2.2  @Nigagalsen @matteosalvinimi un follower dal 2016... e ti          no     yes     no     no
     permetti di aprire quella lurida fogna di bocca che hai.
2.3  Antonio ti prenderei a schiaffi, come fa a dire mi manchi,         no     no      no     no
     quando con Ilaria fai tutto. Coglione
2.4  Posso volerlo vedere cagarsi in mano e prendersi a schiaffi        no     no      no     no
     finché non diventano dispari o sono nazista?
Type-3 errors
3.1  La cena è pronta. Lo squalo balena affonda nei banchi di pesce     no     yes     no     no
     per saziare la sua immensa mole. Non è un abitudine frequente
     visto che filtra dall'acqua i microorganismi come le balene.
3.2  Comunque le pringles più buone sono quelle alla panna acida e      no     yes     no     no
     cipolla

Table 3: Instances from the test partition, including their actual class and the one predicted by our model, for both misogyny and aggressiveness.

Regarding the type-1 errors, in instance 1.1 the action of winding a telephone cable around the neck was perceived as aggressive, even though the speaker does not express a misogynous or aggressive attitude towards a woman; she is just commenting on something watched on TV. In instance 1.2, the expression "meritare gli schiaffi" (deserving slaps) denotes violence, but it is not addressed towards a woman. This kind of mistake might be overcome by implementing a model trained on the misogynous partition of the data only. Finally, instance 1.3 reflects the bias related to the subjective nature of what is perceived as misogynous. According to the annotation guidelines, a tweet should be flagged as misogynous if it expresses hatred towards women. In this case, the poster of the tweet is not expressing any misogynous attitude, but reporting what she has been told by men. Therefore, our system flagged the instance as non-misogynous, and we could agree.

As for the type-2 errors, if we look at the text only, the instances could seem misogynous. However, in instances 2.1 and 2.2 the mention tells us that the post refers to a man, and the system fails to understand that. On the contrary, the system performs well when a masculine name or a masculine pronoun is specified instead of a mention, as we can observe in instances 2.3 and 2.4. In these cases our system understands that the aggressive actions, which usually tend to be classified as aggressive misogynous, do not refer to a woman.

For the type-3 errors, in instance 3.1 the word "balena" (whale, but also an insult for a fat woman) and in instance 3.2 the word "acida" (acid/sour, but also peevish) could confuse the model, causing it to flag such instances as misogynous.

6 Conclusions and Further Work

In this paper we described our approach to the EVALITA 2020 task on misogyny and aggressiveness identification in Italian tweets (AMI). The purpose of our participation was to determine whether a multi-class classifier is a good way to address this two-step task. Although the task seems conceived to be addressed with two different models, one for the identification of misogyny and the other for aggressiveness, we decided to try a different approach and built a single model that identifies three cases: non-misogynous, non-aggressive misogynous, and aggressive misogynous tweets.

We built our model on top of AlBERTo, an Italian version of BERT, and trained it using only the dataset provided by the task organizers. We experimented with different batch sizes over an increasing number of epochs. The highest F1 score on the validation set was reached with a batch size of 16 after 8 epochs. When evaluated on the test set, our model obtained an overall F1 score of 0.7438: 0.8102 for the misogyny and 0.6774 for the aggressiveness task. We hypothesize that the model struggles to identify misogynous aggressive instances partly because it gets confused by non-misogynous aggressive tweets, which are labelled simply as non-misogynous. The implementation is publicly available for research purposes.

As for further experiments, we plan to build two separate models: one to detect misogyny and another, trained only on already-flagged misogynous tweets, to identify instances of aggressiveness. Another step would be to use an unconstrained approach and increase the number of instances in the training set, so that the model has more data to learn from.
References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, June. ACL.

Deborah Fallows. 2005. How women and men use the internet. Technical report, Pew Internet & American Life Project, December.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In EVALITA Evaluation of NLP and Speech Tools for Italian: Proceedings of the Final Workshop 12-13 December 2018, Naples, pages 59–66. Torino: Accademia University Press.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the task on automatic misogyny identification at IberEval 2018. In Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), Sevilla, Spain.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. AMI @ EVALITA2020: Automatic misogyny identification. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.

Marcin M. Mirończuk and Jarosław Protasiewicz. 2018. A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106:36–54.

Debora Nozza, Claudia Volpetti, and Elisabetta Fersini. 2019. Unintended bias in misogyny detection. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 149–155, Thessaloniki, Greece.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481, Bari, Italy. CEUR.

Niloofar S. Samghabadi, Parth Patwa, Srinivas PYKL, Prerana Mukherjee, Amitava Das, and Thamar Solorio. 2020. Aggression and misogyny detection using BERT: A multi-task approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).

Kalpana Srivastava, Suprakash Chaudhury, P. S. Bhat, and Samiksha Sahu. 2017. Misogyny, feminism, and sexual harassment. Industrial Psychiatry Journal, 26(2):111–113.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? CoRR, abs/1905.05583.

A Official English-BERT-based Submission

Our official submission used a pre-trained BERT model trained only on the English language. The experimentation and tuning were identical to those applied when using AlBERTo (cf. Section 5). Table 4 shows the tuning evolution. The best configuration of this model, derived from the English BERT, obtains an F1 score of 0.8222 on the validation set when dealing with our three-class problem. Nevertheless, the performance dropped to F1 = 0.6343 on the test set.

                  batch size
epochs       8        16        32
   5      0.8126   0.8042    0.7955
   8      0.8067   0.8222    0.8004
  10      0.8042   0.8069    0.8141
  15      0.8095   0.8037    0.8121
  20      0.7895   0.8178    0.8153

Table 4: F1 performance of the 3-class model with different batch sizes after diverse numbers of epochs, using BERT for English.