CrotoneMilano for AMI at EVALITA 2018: A Performant, Cross-lingual Misogyny Detection System

Angelo Basile (Symanto Research, angelo.basile@symanto.net)
Chiara Rubagotti (Independent Researcher, chiara.rubagotti@gmail.com)

Abstract

We present our systems for misogyny identification on Twitter, for Italian and English. The models are based on a Support Vector Machine and use n-grams as features. Our solution is very simple, and yet we achieve top results on Italian tweets and excellent results on English tweets. Furthermore, we experiment with a single model that works across languages by leveraging abstract features, and we show that a single multilingual system yields performance comparable to two independently trained systems. We achieve accuracy results ranging from 45% to 85%. Our system is ranked first out of twelve submissions for sub-task B on Italian and second for sub-task A.

1 Introduction

With awareness of violence against women growing in public discourse, and with the spread of unfiltered and possibly anonymous communication on social media, the issue of online misogyny has become a compelling one. Violence against women has been described by the UN as a "Gender-based [..] form of discrimination that seriously inhibits women's ability to enjoy rights and freedoms on a basis of equality with men" (http://www.un.org/womenwatch/daw/cedaw/recommendations/recomm.htm). On the web this often takes the form of attacks that discriminate against women and undermine their rights to freedom of expression and participation (https://www.amnesty.org/en/latest/research/2018/03/online-violence-against-women-chapter-3/). Following the understanding of hate speech of Erjavec and Kovačič (2012), reported in (Pamungkas et al., 2018) as "any type of communication that is abusive, insulting, intimidating, harassing, and/or incites to violence or discrimination, and that disparages a person or a group on the basis of some characteristics such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics", we can define misogynistic speech as any kind of aggressive discourse which targets women because they are women. Within the larger context of hate speech, online misogyny (or cybersexism) stands out as a large and complex phenomenon which reflects other forms of offline abuse of women (Poland, 2016). This holds true for the Italian case as well, where bouts of misogynistic tweets have been linked to episodes of femicide (http://www.voxdiritti.it/wp-content/uploads//2018/06/mappa-intolleranza-3-donne.jpg).

In recent years the NLP community has addressed the automatic detection of hate speech in general (Schmidt and Wiegand, 2017) and of misogyny in particular (Anzovino et al., 2018). This effort to detect and contain verbal violence on social media (or in any kind of text) demonstrates how NLP tools can also be used for ethically beneficial purposes, and it belongs to the young but crucial discourse on ethics in NLP (Hovy and Spruit, 2016; Hovy et al., 2017; Alfano et al., 2018). We therefore take up the AMI challenge (Fersini et al., 2018) and present our contribution to the cause of stopping misogynistic speech on Twitter. In this paper we propose a simple linear model using n-grams, and we show that such a simple setup can still yield good results. We decided on a simple model for three reasons: first, it has been shown that a linear SVM can easily outperform more complex deep neural networks (Plank, 2017; Medvedeva et al., 2017); second, training and testing our model does not require expensive hardware, and a common laptop is enough to replicate our experiments; third, we experiment with a transformation of the input (i.e. we extract abstract features), and a linear model allows for an easier interpretation of the contribution of this transformation.

To summarise, the contributions of this paper are the following:

• We propose a simple yet strong misogyny detection system for English and Italian (ranked first out of twelve systems for misogynistic category detection).

• We show how a single system can be trained to work across languages.

• We release all the code and our trained systems (https://github.com/anbasile/AMI/) for reproducibility and for a quick implementation of language technology systems that can help detect and mitigate cybersexism.

Task Description. The AMI task is a combined binary and multi-label short-text classification task. Given a tweet, we have to predict whether or not it contains misogyny (Task A) and, if it does, we have to classify the misogynistic behaviour and predict who is being targeted (Task B). The space of misogynistic behaviours consists of five labels:

• Stereotype & Objectification
• Dominance
• Derailing
• Sexual Harassment & Threats of Violence
• Discredit

The target can be either Active, when the message refers to a specific person, or Passive, when the message expresses generic misogyny. The setup is the same for both Italian and English.
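To make these prediction targets concrete, the label space can be written down in a few lines of Python; this is only an illustrative sketch, and the constant names below are ours rather than part of the official task material.

    # Illustrative sketch of the AMI label space (constant names are ours).
    TASK_A_LABELS = ["misogynous", "non_misogynous"]  # binary decision

    TASK_B_BEHAVIOURS = [  # the five misogynistic behaviours
        "stereotype_objectification",
        "dominance",
        "derailing",
        "sexual_harassment_threats_of_violence",
        "discredit",
    ]

    TASK_B_TARGETS = ["active", "passive"]

    # A complete Task B prediction pairs one behaviour with one target,
    # e.g. ("discredit", "active").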
2 Data

We use only the data released by the task organisers, which consist of Italian and English tweets. The organisers report that the corpus was manually labelled by several annotators. Table 1 gives an overview of the data set: the data for Task A is more or less balanced, while the data for Task B is highly skewed.

                        ITALIAN                        ENGLISH
                  SO    DO    DE    ST    DI     SO    DO    DE    ST    DI
  Active         625    61    21   428   586     54    78    24   207   695
  Passive         40     9     3     2    43    125    70    68   145   319
  Non-Misogynous       2172                           2215

Table 1: Data set overview, showing the label distribution across the five misogynistic behaviours: Stereotype & Objectification (SO), Dominance (DO), Derailing (DE), Sexual Harassment & Threats of Violence (ST) and Discredit (DI).

3 Experiments

In this section we describe the feature extraction process and the model that we built.

3.1 Pre-processing

We decided not to pre-process the data in any way, since we have no linguistic (or non-linguistic) reason for doing so. To tokenize the text we simply split at every white space.

3.2 Model and Features

We built a sparse linear model for this task. We use n-grams extracted at the word level as well as at the character level, with 3-10 n-grams and binary tf-idf weighting. We feed these features to a Support Vector Machine (SVM) with a linear kernel, using the implementation included in scikit-learn (Pedregosa et al., 2011).
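The core of this setup fits in a few lines of scikit-learn. The sketch below is a minimal version of such a pipeline, assuming the combined word- and character-level representation; the word-level n-gram range is left at the library default and should be read as illustrative rather than as our exact submitted configuration, which is available in the released code.

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    model = Pipeline([
        ("features", FeatureUnion([
            # word-level n-grams over whitespace tokens (cf. Section 3.1)
            ("words", TfidfVectorizer(token_pattern=r"\S+", binary=True)),
            # character-level 3-10 grams with binary tf-idf
            ("chars", TfidfVectorizer(analyzer="char",
                                      ngram_range=(3, 10), binary=True)),
        ])),
        ("svm", LinearSVC()),
    ])

    # tweets: list of raw tweet strings; labels: gold labels of one sub-task
    # model.fit(tweets, labels)

The FeatureUnion simply concatenates the two sparse feature matrices, so the word and character spaces remain separately inspectable.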
Furthermore, we experiment with feature abstraction, following the bleaching approach recently proposed by van der Goot et al. (2018). First, we transform each word into a list of symbols that 1) represents the shape of its individual characters and 2) abstracts away from meaning while still approximating the vowels and consonants that compose the word; then, we compute the length of the word and its frequency (taking care to pad the former with a zero in order to avoid feature collisions); finally, we use a Boolean label to explicitly distinguish words from non-alphanumeric tokens (e.g. emojis). Table 2 shows an example of this feature abstraction process.

  WORD      SHAPE      FREQ   LEN   ALPHA
  This      Ccvc         46    04   True
  is        vc          650    02   True
  an        vc          116    02   True
  example   vcvcccv       1    07   True
  .         .            60    01   False
  (emoji)   (emoji)       1    01   False

Table 2: An illustration of the bleaching process.

van der Goot et al. (2018) proposed this bleaching approach for modelling gender across languages, leveraging the language-independent nature of these features; here, we re-use the technique for classifying misogynistic text across languages. We slightly modify their representation by merging the shape feature (e.g. Xxx) with the vowel-consonant approximation feature (e.g. CVC) into one single feature (e.g. Cvc).
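A rough sketch of this merged bleaching transformation is given below; the function name, the vowel inventory and the two-digit padding width are our own choices, picked so as to reproduce the rows of Table 2.

    def bleach(token, freq):
        """Map a token to the abstract features of Table 2: merged shape,
        corpus frequency, zero-padded length and an is-word flag."""
        shape = ""
        for ch in token:
            if ch.isalpha():
                sym = "v" if ch.lower() in "aeiou" else "c"
                shape += sym.upper() if ch.isupper() else sym
            else:
                shape += ch  # punctuation, digits and emojis are kept as-is
        return {
            "shape": shape,                   # e.g. "This" -> "Ccvc"
            "freq": str(freq),                # corpus frequency of the token
            "len": str(len(token)).zfill(2),  # zero-padded to avoid collisions
            "alpha": token.isalpha(),         # False for emojis, punctuation
        }

    # bleach("This", 46)
    # -> {'shape': 'Ccvc', 'freq': '46', 'len': '04', 'alpha': True}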
We propose three different multi-lingual experiments:

• TRAIN Italian → TEST English
• TRAIN English → TEST Italian
• TRAIN Italian & English → TEST Italian & English

For the last experiment, we use half of the data set for each language. We report scores obtained by training on the whole training set and testing on the official test set, using the gold labels released by the organisers after the evaluation period.

4 Evaluation and Results

Since the data set labels for sub-task B are not evenly distributed across the classes, we use the f1-score to evaluate our models. First we report results obtained via 10-fold cross-validation on the training set; then we report results on the official test set, whose labels have been released. The official evaluation does not take the joint prediction of the labels into account; here, however, we also report results considering the 0 (non-misogynous) label: since we train different models for the different label sets, we make sure that the models trained on Task B are able to detect whether a message is misogynistic in the first place.

4.1 Development Results

We report the development results obtained using different text representations; Table 3 presents an overview. All four representations (words, characters, a combination of the two, and the bleached representation) yield comparable results, with the combination of words and characters being the best format. Overall, the system performs better on the Italian corpus than on the English corpus.

                      ENGLISH               ITALIAN
               MIS.   CAT.   TGT.    MIS.   CAT.   TGT.
  Words (W)    0.68   0.29   0.57    0.88   0.60   0.59
  Chars (C)    0.71   0.30   0.61    0.88   0.59   0.58
  W+C          0.70   0.31   0.59    0.88   0.62   0.59
  Bleaching    0.68   0.27   0.57    0.85   0.55   0.56

Table 3: An overview of the development f1-macro scores obtained via cross-validation.

4.1.1 Cross-lingual Results

In Table 4 we present the results of our cross-lingual experiments, in which we train and test different systems using lexical and abstract features. We note that the abstract model trained on Italian outperforms the fully lexicalized model when tested on English, but the opposite is not true. The English data set seems particularly hard for both the abstract and the lexicalized model. Interestingly, the abstract model trained on both corpora shows good results.

                TEST →    IT     EN
  TRAIN  IT lex          0.85   0.51
         IT abs          0.83   0.52
         EN lex          0.47   0.62
         EN abs          0.45   0.52
         IT + EN lex     0.83   0.60
         IT + EN abs     0.81   0.58

Table 4: Pair-wise accuracy results for Task A, comparing lexicalized and abstract models. The combined IT+EN data set is built by randomly sampling 50% of the instances from both corpora.

4.2 Test Results

In Table 5 we present the official test results (Fersini et al., 2018). We submitted only one, constrained run; a run is considered constrained when only the data released by the organisers are used. We submitted the model using the combined word- and character-n-gram representation, trained once on the English corpus and once on the Italian corpus. On the Italian data set we achieve the top position for Task B and the second position for Task A; on the English data set our system is ranked 15th on Task A and 4th on Task B.

        TASK A            TASK B
                  CATEGORY   TARGET   AVG.
  IT    0.843        0.579    0.423   0.501
  EN    0.617        0.293    0.444   0.369

Table 5: Official test results. Task A is measured using accuracy and Task B using f1-score. We reach the first position on Task B for Italian.

5 Discussion and Conclusions

A warning to the reader: this section contains explicit language.

In an attempt to better understand the big difference in performance between the English and the Italian models, we show the importance of words as learned by the model: we print the ten most important words, ranked by their learned weights. The result is shown in Table 6. From the output we see that the model trained on Italian learned meaningful words for identifying a misogynistic message, such as zitta [shut up], tua [your] and muori [die!]: these words stand out from the rest of the profanity in that they directly refer to someone, while the remaining words, and almost all of the most important English words, could be used as interjections or as more generic insults.

  RANK   ITA          ENG
  1      zitta        woman
  2      bel          hoe
  3      pompinara    she
  4      puttanona    hoes
  5      tua          women
  6      muori        whore
  7      baldracca    her
  8      troie        bitches
  9      culona       womensuck
  10     tettona      bitch

Table 6: Top ten words ranked by their positive weights learned during training.
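With a linear model, such a ranking can be read directly off the learned weight vector. A minimal sketch, assuming a fitted word-level TfidfVectorizer and LinearSVC rather than the full combined pipeline of Section 3.2:

    import numpy as np

    def top_positive_features(vectorizer, svm, k=10):
        """Return the k feature names with the largest positive weights."""
        names = np.asarray(vectorizer.get_feature_names_out())
        weights = svm.coef_[0]               # binary task: one weight per feature
        top = np.argsort(weights)[::-1][:k]  # indices of the largest weights
        return list(zip(names[top], weights[top]))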
The results of the abstract system are satisfactory with a view to eventually building a light, portable model that could be adapted to different languages. In the future we will try training on English and Italian and testing on a third corpus (such as the Spanish version of the AMI data set).

In this paper we described our participation in AMI, the Automatic Misogyny Identification task for Italian and English. We proposed a very simple solution that can be implemented quickly, and we scored a state-of-the-art result for the classification of misogynistic behaviours into five classes.

Acknowledgements

The authors would like to thank the two anonymous reviewers who helped improve the quality of this paper. The first author conducted this research while he was part of the Erasmus Mundus master in Language and Communication Technology, a shared master program between the University of Groningen (NL) and the University of Malta (MT).

References

Mark Alfano, Dirk Hovy, Margaret Mitchell, and Michael Strube, editors. 2018. Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Karmen Erjavec and Melita Poler Kovačič. 2012. "You don't understand, this is a new war!": Analysis of hate speech in news web sites' comments. Mass Communication and Society, 15(6):899–920.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach, editors. 2017. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing.

Maria Medvedeva, Martin Kroon, and Barbara Plank. 2017. When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 156–163.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTO for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny in English and Spanish tweets.
In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), pages 234–241.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Barbara Plank. 2017. All-in-1 at IJCNLP-2017 Task 4: Short text classification with one model for all languages. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 143–148.

Bailey Poland. 2016. Haters: Harassment, Abuse, and Violence Online. University of Nebraska Press.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 383–389.