AMI @ EVALITA2020: Automatic Misogyny Identification

Elisabetta Fersini^1, Debora Nozza^2, Paolo Rosso^3
^1 DISCo, University of Milano-Bicocca
^2 Bocconi University
^3 PRHLT Research Center, Universitat Politècnica de València
elisabetta.fersini@unimib.it, debora.nozza@unibocconi.it, prosso@dsic.upv.es

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. Automatic Misogyny Identification (AMI) is a shared task proposed at the Evalita 2020 evaluation campaign. The AMI challenge, based on Italian tweets, is organized into two subtasks: (1) Subtask A, on misogyny and aggressiveness identification, and (2) Subtask B, on the fairness of the model. At the end of the evaluation phase, we received a total of 20 runs for Subtask A and 11 runs for Subtask B, submitted by 8 teams. In this paper, we present an overview of the AMI shared task, the datasets, the evaluation methodology, the results obtained by the participants, and a discussion of the methodologies adopted by the teams. Finally, we draw some conclusions and discuss future work.

Italiano. Automatic Misogyny Identification (AMI) è uno shared task proposto nella campagna di valutazione Evalita 2020. La challenge AMI, basata su tweet italiani, si distingue in due subtask: (1) il subtask A, che ha come obiettivo l'identificazione di testi misogini e aggressivi, e (2) il subtask B, relativo alla fairness del modello. Al termine della fase di valutazione, sono state ricevute in totale 20 submission per il subtask A e 11 per il subtask B, inviate da un totale di 8 team. Presentiamo di seguito una sintesi dello shared task AMI, i dataset, la metodologia di valutazione, i risultati ottenuti dai partecipanti e una discussione sulle metodologie adottate dai diversi team. Infine, vengono discusse le conclusioni e delineati gli sviluppi futuri.

1 Introduction

People widely share thoughts, emotions, and feelings through posts in social media, and women have a strong presence in these online environments: 75% of females use social media multiple times per day, compared to 64% of males. While new opportunities have emerged for women to express themselves, systematic inequality and discrimination take place in the form of offensive content against the female gender. These manifestations of misogyny, usually directed by a man at a woman to dominate or exert power over the female gender, are a relevant social problem that has been addressed in the scientific literature during the last few years. Recent investigations studied how the misogyny phenomenon takes place, for example as unjustified slurring or as stereotyping of the role/body of a woman (e.g., the hashtag #getbacktokitchen), as described in the book by Poland (Poland, 2016). Preliminary research was conducted in (Hewitt et al., 2016) as the first attempt at manual classification of misogynous tweets, while automatic misogyny identification in social media was first investigated in (Anzovino et al., 2018). Since 2018, several initiatives have been dedicated, as a call to action, to stopping hate against women from both a machine learning and a computational linguistics point of view, such as AMI@Evalita 2018 (Fersini et al., 2018a), AMI@IberEval 2018 (Fersini et al., 2018b) and HatEval@SemEval 2019 (Basile et al., 2019). Several relevant research directions have been investigated for addressing the misogyny identification challenge, among which approaches focused on effective text representation (Bakarov, 2018; Basile and Rubagotti, 2018), machine learning models (Buscaldi, 2018; Ahluwalia et al., 2018) and domain-specific lexical resources (Pamungkas et al., 2018; Frenda et al., 2018).
During the AMI shared task organized at the Evalita 2020 evaluation campaign (Basile et al., 2020), the focus is not only on misogyny identification but also on aggressiveness recognition, as well as on the definition of models able to guarantee fair predictions.

2 Task Description

The AMI shared task, which is a re-run of a previous challenge at Evalita 2018, proposes the automatic identification of misogynous content in the Italian language on Twitter. More specifically, it is organized according to two main subtasks:

• Subtask A - Misogyny & Aggressive Behaviour Identification: a system must recognize if a text is misogynous or not and, in case of misogyny, if it expresses an aggressive attitude. In order to provide an annotated corpus for Subtask A, the following definitions have been adopted to label the collected dataset:
  – Misogynous: a text that expresses hate towards women in particular (in the form of insulting, sexual harassment, threats of violence, stereotype, objectification, and negation of male responsibility).
  – Not Misogynous: a text that does not express any form of hate towards women.
  – Aggressive: a message is considered aggressive if it (implicitly or explicitly) presents, incites, threatens, implies, suggests, or alludes to:
    * attitudes, violent actions, hostility or commission of offenses against women;
    * social isolation of women for physical or psychological characteristics;
    * justifying or legitimizing an aggressive action against women.
  – Not Aggressive: if none of the previous conditions hold.

• Subtask B - Unbiased Misogyny Identification: a system must discriminate misogynistic contents from non-misogynistic ones, while guaranteeing the fairness of the model (in terms of unintended bias) on a synthetic dataset (Nozza et al., 2019). To this purpose, Subtask B has the goal of measuring the attitude of a model to be fair when processing sentences containing specific identity terms that likely conveyed misogyny in the training data, e.g. "girlfriend" and "wife".

3 Training and Testing Data

The data provided to the participants for the AMI shared task comprise a raw dataset and a synthetic dataset for measuring bias. Each dataset is divided into a Training Set and a Test Set.

3.1 Raw dataset

The raw dataset is a balanced dataset of misogynous and non-misogynous tweets. The raw training set (6,000 tweets) is derived from the data collected for the 2018 edition of the AMI shared task, where the misogynistic posts have been enriched by labelling aggressive expressions according to the definition provided in Section 2. The raw test dataset (approximately 1,000 tweets) has been collected from Twitter using an approach similar to that of the 2018 edition of the shared task. This is intentionally done to evaluate the generalization abilities of the systems on test data collected in a different time period and therefore characterized by higher language variability with respect to the training data. Examples of tweets belonging to the raw dataset are shown in Table 1.

Table 1: Examples of misogynous and non-misogynous tweets

  Misogynous   Aggressiveness   Text
  1            1                Stupida tr**a! Meriti di morire
  1            0                Le ragazze sono tutte pu***ne
  0            0                @chiellini sei una cagna

The raw training data are provided as TSV files (tab-separated files) and report the following fields:

• id denotes a unique identifier of the tweet.
• text represents the tweet text.
• misogynous defines whether a tweet is misogynous or not; it takes values:
  – 0 if the tweet is not misogynous;
  – 1 if the tweet is misogynous.
• aggressiveness denotes whether a misogynous tweet is aggressive or not; it takes values:
  – 0 if the tweet is non-aggressive (non-misogynous tweets are labelled as 0 by default);
  – 1 if the tweet is aggressive.

The raw testing data are provided as TSV files reporting only id and text.
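To make the format concrete, the following minimal sketch (in Python, assuming pandas is available) loads the raw training data and checks the labelling constraints described above; the file name ami2020_training_raw.tsv is a hypothetical placeholder, not the official one.

```python
# Minimal loading sketch for the raw training data described above.
# The file name is hypothetical; adapt it to the actual release.
import csv
import pandas as pd

train_raw = pd.read_csv(
    "ami2020_training_raw.tsv",   # hypothetical file name
    sep="\t",
    quoting=csv.QUOTE_NONE,       # tweets may contain unescaped quotes
    keep_default_na=False,        # do not turn short tweets into NaN
    dtype={"id": str, "text": str, "misogynous": int, "aggressiveness": int},
)

# Sanity checks on the labelling scheme of Section 3.1: labels are binary
# and non-misogynous tweets are non-aggressive by default.
assert set(train_raw["misogynous"].unique()) <= {0, 1}
assert set(train_raw["aggressiveness"].unique()) <= {0, 1}
assert (train_raw.loc[train_raw["misogynous"] == 0, "aggressiveness"] == 0).all()

print(train_raw[["misogynous", "aggressiveness"]].value_counts())
```

Disabling quoting and default NA handling is a precaution for raw tweet text; the official files may already be clean in this respect.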
3.2 Synthetic dataset

The synthetic test dataset for measuring the presence of unintended bias has been created following the procedure adopted in (Dixon et al., 2018; Nozza et al., 2019): a list of identity terms has been constructed by taking into consideration some concepts related to the term "donna" (e.g. "moglie", "fidanzata"). Given the identity terms, several templates have been created including positive/negative verbs and adjectives (e.g. negative: hate, inferior; positive: love, awesome), for conveying either a misogynistic message or a non-misogynistic one. Some examples of such templates, used to create the synthetic dataset, are reported in Table 2.

Table 2: Template examples

  Template Examples              Misogynous
  dovrebbero essere protette     0
  dovrebbero essere uccise       1
  apprezziamo                    0
  picchiamo                      1
  splendida                      0
  sporca                         1

The synthetic dataset, created for measuring the presence of unintended bias, contains template-generated text labelled according to:

• Misogyny: Misogyny (1) vs. Not Misogyny (0)

The training data for the synthetic dataset are provided as TSV files (tab-separated files) and report the following fields:

• id denotes a unique identifier of the template-generated text.
• text represents the template-generated text.
• misogynous defines whether the template-generated text is misogynous or non-misogynous; it takes value 1 if the text is misogynous and 0 if it is non-misogynous.

The synthetic testing data are provided as TSV files (tab-separated files) reporting only id and text.

The statistics about the raw and synthetic datasets, both for the training and testing sets, are reported in Table 3.

Table 3: Distribution of labels on the Training and Test datasets

                    Training             Testing
                    Raw      Synthetic   Raw      Synthetic
  Misogynous        2337     1007        500      954
  Non-misogynous    2663     1007        500      954
  Aggressive        1783     -           176      -
  Non-aggressive    3217     -           824      -
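To illustrate how such a template-based corpus can be put together, the sketch below combines identity terms with misogynous and non-misogynous predicates in the spirit of Table 2; the term and template lists are illustrative examples only, not the official resources used to build the dataset.

```python
# Illustrative sketch of the template-based construction behind the
# synthetic dataset (Section 3.2, Table 2): identity terms are combined
# with non-misogynous (label 0) and misogynous (label 1) predicates.
# The lists below are examples, not the official resources.
identity_terms = ["donne", "mogli", "fidanzate", "nonne"]

templates = [
    ("{term} dovrebbero essere protette", 0),
    ("{term} dovrebbero essere uccise", 1),
    ("apprezziamo le {term}", 0),
    ("picchiamo le {term}", 1),
]

synthetic = []
for term in identity_terms:
    for template, label in templates:
        synthetic.append({
            "id": len(synthetic),
            "text": template.format(term=term),
            "misogynous": label,
        })

for row in synthetic[:4]:
    print(row)
```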
4 Evaluation Measures and Baseline

Considering the distribution of labels in the datasets, we have chosen different evaluation metrics. In particular, we distinguished as follows.

Subtask A. Each class to be predicted (i.e. "Misogyny" and "Aggressiveness") has been evaluated independently of the other using a Macro F1-score. The final ranking of the systems participating in Subtask A was based on the Average Macro F1-score, computed as follows:

  Score_A = \frac{F_1(Misogyny) + F_1(Aggressiveness)}{2}    (1)

Subtask B. The ranking for Subtask B is computed by a weighted combination of the AUC estimated on the raw test dataset (AUC_{raw}) and three per-term AUC-based bias scores computed on the synthetic dataset (AUC_{Subgroup}, AUC_{BPSN}, AUC_{BNSP}). Let s be an identity term (e.g. "girlfriend" and "wife") and N the total number of identity terms; the score of each run is estimated according to the following metric:

  Score_B = \frac{1}{2} AUC_{raw} + \frac{1}{2N} \left[ \sum_{s} AUC_{Subgroup}(s) + \sum_{s} AUC_{BPSN}(s) + \sum_{s} AUC_{BNSP}(s) \right]    (2)

Unintended bias can be uncovered by looking at differences in the score distributions between data mentioning a specific identity term (subgroup distribution) and the rest (background distribution). The three per-term AUC-based bias scores are related to specific subgroups as follows (a computational sketch of these measures is given after the list):

• AUC_{Subgroup}(s): calculates AUC only on the data within the subgroup related to a given identity term. This represents model understanding and separability within the subgroup itself. A low value in this metric means the model does a poor job of distinguishing between misogynous and non-misogynous comments that mention the identity.

• AUC_{BPSN}(s): Background Positive, Subgroup Negative (BPSN) calculates AUC on the misogynous examples from the background and the non-misogynous examples from the subgroup. A low value in this metric means that the model confuses non-misogynous examples that mention the identity term with misogynous examples that do not, likely meaning that the model predicts higher misogynous scores than it should for non-misogynous examples mentioning the identity term.

• AUC_{BNSP}(s): Background Negative, Subgroup Positive (BNSP) calculates AUC on the non-misogynous examples from the background and the misogynous examples from the subgroup. A low value here means that the model confuses misogynous examples that mention the identity with non-misogynous examples that do not, likely meaning that the model predicts lower misogynous scores than it should for misogynous examples mentioning the identity.
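The following sketch shows how the measures above can be computed with scikit-learn, assuming gold labels, predicted labels, and confidence scores are available as 1-d NumPy arrays; the helper names are illustrative, and Equation (2) is implemented with the weights as reconstructed above.

```python
# Sketch of the evaluation measures of Section 4, assuming 1-d NumPy arrays
# for gold labels, predicted labels and prediction scores.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score


def score_a(gold_mis, pred_mis, gold_agg, pred_agg):
    """Average Macro F1 over the two Subtask A classes (Equation 1)."""
    return (f1_score(gold_mis, pred_mis, average="macro")
            + f1_score(gold_agg, pred_agg, average="macro")) / 2


def bias_aucs(gold, scores, in_subgroup):
    """Per-term bias AUCs (Subgroup, BPSN, BNSP) for one identity term."""
    sub = np.asarray(in_subgroup, dtype=bool)   # texts mentioning the term
    bg = ~sub                                   # background texts
    subgroup_auc = roc_auc_score(gold[sub], scores[sub])
    # BPSN: misogynous background examples + non-misogynous subgroup examples
    bpsn = (bg & (gold == 1)) | (sub & (gold == 0))
    # BNSP: non-misogynous background examples + misogynous subgroup examples
    bnsp = (bg & (gold == 0)) | (sub & (gold == 1))
    return (subgroup_auc,
            roc_auc_score(gold[bpsn], scores[bpsn]),
            roc_auc_score(gold[bnsp], scores[bnsp]))


def score_b(raw_gold, raw_scores, syn_gold, syn_scores, term_masks):
    """Equation (2): raw-test AUC combined with the per-term bias AUCs.

    term_masks maps each identity term s to a boolean mask over the
    synthetic test set marking the texts that mention s.
    """
    auc_raw = roc_auc_score(raw_gold, raw_scores)
    bias_sum = sum(sum(bias_aucs(syn_gold, syn_scores, mask))
                   for mask in term_masks.values())
    return 0.5 * auc_raw + bias_sum / (2 * len(term_masks))
```

Note that Subtask A only needs hard labels, while the bias scores are AUC-based and therefore require a real-valued confidence for the misogynous class.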
In order to compare the submitted runs with a baseline model, we provided a benchmark (AMI-BASELINE) based on a Support Vector Machine trained on a unigram representation of tweets with a TF-IDF weighting scheme. In particular, we created one training set for each field to be predicted, i.e. "misogynous" and "aggressiveness", where each tweet has been represented as a bag-of-words (composed of 1,000 terms) coupled with the corresponding label. Once the representations have been obtained, Support Vector Machines with linear kernel have been trained and provided as AMI-BASELINE.
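A minimal sketch of a comparable baseline with scikit-learn is shown below (unigram TF-IDF capped at 1,000 features and a linear-kernel SVM, one classifier per field); the exact preprocessing and hyperparameters of the official AMI-BASELINE may differ.

```python
# Sketch of a baseline in the spirit of AMI-BASELINE: unigram TF-IDF
# representation limited to 1,000 terms and a linear-kernel SVM, trained
# separately for the "misogynous" and "aggressiveness" fields.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_baseline():
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
        ("svm", SVC(kernel="linear")),
    ])


# train_raw is the DataFrame from the loading sketch in Section 3.1.
baselines = {}
for field in ("misogynous", "aggressiveness"):
    model = build_baseline()
    model.fit(train_raw["text"], train_raw[field])
    baselines[field] = model

# Predictions on the raw test texts, one column per field:
# predictions = {f: m.predict(test_raw["text"]) for f, m in baselines.items()}
```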
5 Participants and Results

A total of 8 teams from 6 different countries participated in at least one of the two subtasks of AMI. Two teams also participated with the same approach in the HaSpeeDe shared task (Sanguinetti et al., 2020), addressing misogyny identification with generic models for detecting hate speech. Each team had the chance to submit up to three runs, which could be constrained (c), where only the provided training data and lexicons were admitted, or unconstrained (u), where additional training data were allowed. Table 4 reports an overview of the teams, illustrating their affiliation, their country, the number and type (c for constrained, u for unconstrained) of submissions, and the subtasks they addressed.

Table 4: Team overview

  Team Name                                          Affiliation               Country   Runs           Subtask
  jigsaw (Lees et al., 2020)                         Google                    US        2 (u)          A, B
  fabsam (Fabrizi, 2020)                             University of Pisa        IT        2 (c)          A, B
  YNU OXZ (Ou and Li, 2020)                          Yunnan University         CN        2 (u)          A
  NoPlaceForHateSpeech (da Silva and Roman, 2020)    University of Sao Paulo   BR        3 (c)          A
  AMI the winner (Lepri et al.)                      University of Pisa        IT        3 (c)          A
  MDD (El Abassi and Nisioi, 2020)                   University of Bucharest   HU        2 (u), 1 (c)   A, B
  PoliTeam (Attanasio and Pastor, 2020)              Politecnico di Torino     IT        2 (c)          A, B
  UniBO (Muti and Barrón-Cedeño, 2020)               University of Bologna     IT        1 (c)          A

5.1 Subtask A: Misogyny & Aggressive Behaviour Identification

Table 5 reports the results for the Misogyny & Aggressive Behaviour Identification task, which received 20 submissions from 8 teams. The highest result has been achieved by jigsaw at 0.7406 in an unconstrained setting and by fabsam at 0.7342 in a constrained run. While the best unconstrained result is based on ensembles of fine-tuned custom BERT models, the one achieved by the best constrained system is grounded on a convolutional neural network that exploits pre-trained word embeddings.

By analysing the detailed results, it emerged that, while the identification of misogynous text can be considered a fairly simple problem, the recognition of aggressiveness still needs to be properly addressed. In fact, the scores reported in Table 5 are strongly affected by the prediction capabilities on aggressive posts. This is likely due to the subjective perception of aggressiveness, captured by the variance of the data available in the ground truth.

Table 5: Results of Subtask A. Constrained runs are marked with "c", unconstrained ones with "u". An amended run, marked with **, has been submitted after the deadline.

  Rank   Run Type   Score   Team
  **     c          0.744   UniBO **
  1      u          0.741   jigsaw
  2      u          0.738   jigsaw
  3      c          0.734   fabsam
  4      u          0.731   YNU OXZ
  5      c          0.731   fabsam
  6      c          0.717   NoPlaceForHateSpeech
  7      u          0.701   YNU OXZ
  8      c          0.695   fabsam
  9      c          0.693   NoPlaceForHateSpeech
  10     c          0.687   AMI the winner
  11     u          0.684   MDD
  12     c          0.683   PoliTeam
  13     c          0.682   MDD
  14     c          0.681   PoliTeam
  15     u          0.668   MDD
  16     c          0.665   AMI the winner
  17     c          0.665   AMI-BASELINE
  18     c          0.647   PoliTeam
  19     c          0.634   UniBO
  20     c          0.626   AMI the winner
  21     c          0.490   NoPlaceForHateSpeech

After the deadline, the UniBO team submitted an amended run (**) that has not been ranked in the official results of the AMI shared task. However, we believe it is worth mentioning their achievement, showing an Average Macro F1-score equal to 0.744.

5.2 Subtask B: Unbiased Misogyny Identification

Table 6 reports the results for the Unbiased Misogyny Identification task, which received 11 submissions by 4 teams, among which 4 unconstrained and 7 constrained. The highest score has been achieved by jigsaw at 0.8825 with an unconstrained run and by PoliTeam at 0.8180 with a constrained submission.

Similarly to the previous task, most of the systems have shown better performance compared to the AMI-BASELINE. By analysing the runs, we can highlight that the two best results achieved on Subtask B have been obtained by the unconstrained run submitted by jigsaw, where a simple debiasing technique based on data augmentation has been adopted, and by the constrained run provided by PoliTeam, where the problem of biased prediction has been partially mitigated by introducing a misogynous lexicon.

Table 6: Results of Subtask B. Constrained runs are marked with "c", unconstrained ones with "u".

  Rank   Run Type   Score   Team
  1      u          0.882   jigsaw
  2      c          0.818   PoliTeam
  3      c          0.814   PoliTeam
  4      c          0.705   fabsam
  5      c          0.702   fabsam
  6      c          0.694   PoliTeam
  7      c          0.691   fabsam
  8      u          0.649   jigsaw
  9      c          0.613   MDD
  10     c          0.602   AMI-BASELINE
  11     u          0.601   MDD
  12     u          0.601   MDD
6 Discussion

The submitted systems can be compared by taking into consideration the kind of input features used to represent tweets and the machine learning model adopted for classification.

Textual Feature Representation. The systems submitted by the challenge participants consider various techniques for representing the tweet contents. Most of the teams experimented with high-level representations of the text based on deep learning solutions. While a few teams, such as fabsam and MDD, adopted a text representation based on traditional word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017), most of the systems, i.e. NoPlaceForHateSpeech, jigsaw, PoliTeam, YNU OXZ and UniBO, exploited richer sentence embeddings such as BERT (Devlin et al., 2019) or XLM-RoBERTa (Ruder et al., 2019). To enrich the feature space used to train the subsequent models for recognizing misogyny and aggressiveness, PoliTeam experimented with additional lexical resources such as a misogynous lexicon and a sentiment lexicon.

Machine Learning Models. Concerning the machine learning models, we can distinguish between approaches trained from scratch and those based on fine-tuning of existing pre-trained models. We report in the following the strategy adopted by the systems that participated in the AMI shared task, according to the type of machine learning model that has been adopted:

• Shallow models have been experimented with by MDD, where logistic regression classifiers have been trained on different hand-crafted features;

• Convolutional Neural Networks have been exploited by NoPlaceForHateSpeech, using two distinct models for misogyny detection and aggressiveness identification, by fabsam, investigating the optimal hyperparameters of the model, and by YNU OXZ, where a Capsule Network (Sabour et al., 2017) has been introduced on top of the CNN architecture to take advantage of spatial patterns available in short texts;

• Fine-Tuning of pre-trained models has been exploited by jigsaw, adapting BERT to the challenge domain and using a multilingual transfer strategy and ensemble learning, by UniBO, which accommodated the BERT model using a multi-label output neuron, and by PoliTeam, where the prediction of the fine-tuned sentence-BERT is coupled with predictions based on lexicons.

Regarding the results achieved on the two subtasks, a few considerations can be drawn by looking at both the errors made by the systems and the mitigation strategies adopted for reducing bias.

Error Analysis. When testing the developed systems on the raw test data, the majority of the errors can be summarized by the following patterns:

• Under-representation of subjective expressions: posts written with erroneous lower case and missing spaces between adjoining words lead the models based on raw text to make errors on test predictions. An example of such common errors is reported in the following tweet:

  "Odio Sakura per il semplice motivo che qualunque cosa faccia o dica Naruto lei lo prende a schiaffi o a pugniHA CHIESTO COME STA SAI DIOSANTO BRUTTA STRONZA MA CONTRALLI MADONNA SPERO CHE TI UCCIDANOscusami Sarada"

• Undefined subject, but presence of aggressive terms: for those tweets where the target is not clearly mentioned, but several aggressive terms are present, the models tend to be biased and to erroneously predict the post as misogynous and aggressive. An example of this type of misclassified posts is reported here:

  "Vero...ma c'e chi ti cerca, che ti vuole, più di ogni cosa al mondo......ma non sa se viene capito..... potrebbe esser mal interpretato e di conseguenza all'abbraccio esser denunciato per molestie sessuali e/o stupro"

• Short hate speech sentences targeting someone other than women: when the target is mentioned by using an actual account, but it refers to a man, and there are no additional indications about the gender of the target, most of the models tend to misclassify the tweet. In the following example, the target is a male football player:

  "@bonucci leo19 Cagati in mano e prenditi a schiaffi. Sti post te li infili nel c*lo!"

Concerning the errors on the synthetic test set used for estimating the bias of the models, two main error types made by the majority of the systems can be identified:

• Presence of unusual targets: in most of the submissions, sentences containing offensive expressions towards specific uncommon targets are misclassified. For instance, around 39% of the predictions related to the target nonna (i.e., grandmother) are wrong. An example involving this most misclassified target is the following:

  "nonne belle"

• Presence of unusual verbs: analogously to what has been observed for the targets, sentences containing rare aggressive verbs tend to be misclassified. For instance, around 48% of the instances related to the verbs malmenare and seviziare (i.e., to beat up and to torture) are wrongly classified. An example of a misclassified sentence is reported here:

  "femmina dovrebbe essere seviziata" (wrongly classified as non-misogynous)

Bias Mitigation Strategies. Concerning Subtask B, only one team (jigsaw) explicitly addressed the problem of unintended bias. The authors used sentences sampled from Italian Wikipedia articles containing some of the identity terms provided with the test set. These sentences, labelled as both non-misogynous and non-aggressive, have been used to further fine-tune the model and reduce the bias given by the data. The results achieved by the jigsaw team highlight that a debiasing method can obtain fair predictions even when using pre-trained models.
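As a rough illustration of this kind of counter-bias augmentation (a sketch of the general idea, not the jigsaw team's actual pipeline), neutral sentences mentioning the identity terms can be appended to the training data with negative labels before fine-tuning or re-training; the example below reuses the hypothetical train_raw DataFrame from the loading sketch in Section 3.1.

```python
# Schematic sketch of debiasing via data augmentation: neutral sentences
# that mention the identity terms are appended to the training set with
# non-misogynous / non-aggressive labels before (re-)training.
import pandas as pd


def augment_with_neutral_sentences(train_df, neutral_sentences, identity_terms):
    """Append neutral sentences mentioning an identity term, labelled 0/0."""
    rows = [
        {"id": f"aug_{i}", "text": sent, "misogynous": 0, "aggressiveness": 0}
        for i, sent in enumerate(neutral_sentences)
        if any(term in sent.lower() for term in identity_terms)
    ]
    return pd.concat([train_df, pd.DataFrame(rows)], ignore_index=True)


# Illustrative usage; real neutral sentences would be sampled from sources
# such as Wikipedia articles containing the identity terms.
neutral = ["La moglie del sindaco ha inaugurato la mostra fotografica."]
augmented = augment_with_neutral_sentences(train_raw, neutral, ["moglie", "fidanzata"])
```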
7 Conclusions and Future Work

This paper presents the AMI shared task, focused not only on identifying misogynous and aggressive expressions but also on ensuring fair predictions. By analysing the runs submitted by the participants, we can conclude that, while the problem of misogyny identification has reached satisfactory results, the recognition of aggressiveness is still in its infancy. Concerning the capabilities of the systems with respect to the unintended bias problem, we can highlight that a domain-dependent mitigation strategy is a necessary step towards fair models.

Acknowledgements

The work of the last author was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31) and by the COST Action 17124 DigForAsp, supported by the European Cooperation in Science and Technology.

References

Resham Ahluwalia, Himani Soni, Edward Callow, Anderson Nascimento, and Martine De Cock. 2018. Detecting Hate Speech Against Women in English Tweets. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic Identification and Classification of Misogynistic Language on Twitter. In Proceedings of the 23rd International Conference on Applications of Natural Language to Information Systems (NLDB), pages 57–64. Springer.

Giuseppe Attanasio and Eliana Pastor. 2020. PoliTeam @ AMI: Improving Sentence Embedding Similarity with Misogyny Lexicons for Automatic Misogyny Identification in Italian Tweets. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Amir Bakarov. 2018. Vector Space Models for Automatic Misogyny Identification. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Angelo Basile and Chiara Rubagotti. 2018. Automatic Identification of Misogyny in English and Italian Tweets at EVALITA 2018 with a Multilingual Hate Lexicon. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, Manuela Sanguinetti, et al. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Davide Buscaldi. 2018. Tweetaneuse AMI EVALITA2018: Character-based Models for the Automatic Misogyny Identification Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Adriano dos S. R. da Silva and Norton T. Roman. 2020. No Place For Hate Speech @ AMI: Convolutional Neural Network and Word Embedding for the Identification of Misogyny in Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.
Samer El Abassi and Sergiu Nisioi. 2020. MDD@AMI: Vanilla Classifiers for Misogyny Identification. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Samuel Fabrizi. 2020. fabsam @ AMI: A Convolutional Neural Network Approach. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In IberEval@SEPLN, pages 214–228.

Simona Frenda, Bilal Ghanem, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, and Luis Villaseñor-Pineda. 2018. Automatic Lexicons Expansion for Multilingual Misogyny Detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Sarah Hewitt, Thanassis Tiropanis, and Christian Bokhove. 2016. The Problem of Identifying Misogynist Language on Twitter (and Other Online Social Spaces). In Proceedings of the 8th ACM Conference on Web Science, pages 333–335. ACM.

Alyssa Lees, Jeffrey Sorensen, and Ian Kivlichan. 2020. Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Marco Lepri, Giuseppe Grieco, and Mattia Sangermano. University of Pisa, Italy.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Arianna Muti and Alberto Barrón-Cedeño. 2020. UniBO@AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.

Debora Nozza, Claudia Volpetti, and Elisabetta Fersini. 2019. Unintended Bias in Misogyny Detection. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 149–155.

Xiaozhi Ou and Hongling Li. 2020. YNU OXZ @ HaSpeeDe 2 and AMI: XLM-RoBERTa with Ordered Neurons LSTM for Classification Task at EVALITA 2020. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Bologna, Italy. CEUR.org.
Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. Automatic Identification of Misogyny in English and Italian Tweets at EVALITA 2018 with a Multilingual Hate Lexicon. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, pages 1532–1543.

Bailey Poland. 2016. Haters: Harassment, Abuse, and Violence Online. Potomac Books, Incorporated.

Sebastian Ruder, Anders Søgaard, and Ivan Vulić. 2019. Unsupervised Cross-Lingual Representation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 31–38. Association for Computational Linguistics.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic Routing Between Capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.