Automatic Identification of Misogyny in English and Italian Tweets at EVALITA 2018 with a Multilingual Hate Lexicon

Endang Wahyu Pamungkas1, Alessandra Teresa Cignarella1,2, Valerio Basile1 and Viviana Patti1
1 Dipartimento di Informatica, Università degli Studi di Torino
2 PRHLT Research Center, Universitat Politècnica de València
{pamungka | cigna | basile | patti}@di.unito.it

Abstract

In this paper we describe our submission to the shared task of Automatic Misogyny Identification in English and Italian Tweets (AMI) organized at EVALITA 2018. Our approach is based on SVM classifiers enhanced with stylistic and lexical features. Additionally, we analyze the use of the novel multilingual linguistic resource HurtLex, developed by extending, from a computational and multilingual perspective, the Italian lexicon of hate words compiled by the linguist Tullio De Mauro, in order to investigate its impact on this task.

1 Introduction

Hate Speech (HS) can be based on race, skin color, ethnicity, gender, sexual orientation, nationality, or religion; it incites violence and discrimination, and it is abusive, insulting, intimidating, and harassing. Hateful language is becoming a huge problem on social media platforms such as Twitter and Facebook (Poland, 2016). In particular, a type of cyberhate that is increasingly worrying nowadays is the use of hateful language that specifically targets women, normally referred to as MISOGYNY (Bartlett et al., 2014).

Misogyny can be linguistically manifested in numerous ways, including social exclusion, discrimination, hostility, threats of violence, and sexual objectification (Anzovino et al., 2018). Many Internet companies and micro-blogging platforms have already tried to tackle the problem by blocking this kind of online content but, unfortunately, the issue is far from being solved because of the complexity of natural language[1] (Schmidt and Wiegand, 2017). For these reasons, it has become necessary to develop targeted NLP techniques that can automatically detect hate speech and misogyny online.

The first shared task specifically aimed at Automatic Misogyny Identification (AMI) took place at IberEval 2018[2] within SEPLN 2018, considering English and Spanish tweets (Fersini et al., 2018a). The aim of the task is to encourage participating teams to propose the best automatic system, firstly to distinguish misogynous from non-misogynous tweets, and secondly to classify the type of misogynistic behaviour and to judge whether the target of the misogynistic behaviour is a specific woman or a group of women.

In this paper, we describe our submission to the second shared task of Automatic Misogyny Identification (AMI)[3], organized at EVALITA 2018 in the same manner as the IberEval task, but focusing on English and Italian tweets rather than English and Spanish.

[1] https://www.nytimes.com/2013/05/29/business/media/facebook-says-it-failed-to-stop-misogynous-pages.html
[2] https://sites.google.com/view/ibereval-2018
[3] https://amievalita2018.wordpress.com/

2 Task Description

The aim of the AMI task is to detect misogynous tweets written in English and Italian (Task A) (Fersini et al., 2018b). Furthermore, in Task B, each system should also classify each misogynous tweet into one of five misogyny behaviours (STEREOTYPE, DOMINANCE, DERAILING, SEXUAL HARASSMENT, and DISCREDIT) and into one of two target classes (ACTIVE and PASSIVE). Participants are allowed to submit up to three runs for each language. Table 1 shows the dataset label distribution for each class. Accuracy is used as the evaluation metric for Task A, while macro F-score is used for Task B (see the definition after Table 1).

The organizers provided the same amount of data for both languages: 4,000 tweets in the training set and 1,000 in the test set. The label distribution for Task A is balanced, while in Task B the distribution is highly unbalanced, for both misogyny behaviours and targets.

Task A               English       Italian
Misogynistic         1,785/460     1,828/512
Not misogynistic     2,215/540     2,172/488
Total                4,000/1,000   4,000/1,000

Task B               English       Italian
Stereotype           179/140       668/175
Dominance            148/124       71/61
Derailing            92/11         24/2
Sexual Harassment    352/44        431/170
Discredit            1,014/141     634/104
Active               1,058/401     1,721/446
Passive              727/59        96/66
No class             2,215/540     2,172/488

Table 1: Dataset label distribution (training/test).
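The macro F-score mentioned above is, in its standard formulation, the unweighted mean of the per-class F1 scores; we note that the Avg. column of the official Task B results (Table 5) is consistent, up to rounding, with the arithmetic mean of the category and target macro F-scores (e.g., (0.552 + 0.418)/2 = 0.485 for our run #3 on Italian). In LaTeX notation:

```latex
F_1(c) = \frac{2 \, P_c \, R_c}{P_c + R_c}
\qquad\qquad
F_{\mathrm{macro}} = \frac{1}{|C|} \sum_{c \in C} F_1(c)
```

where P_c and R_c denote the precision and recall of class c, and C is the set of classes.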
3 Description of the System

We used two Support Vector Machine (SVM) classifiers which exploit different kernels: a linear kernel and a radial basis function (RBF) kernel.

SVM with Linear Kernel. The linear kernel was used to find the optimal hyperplane when SVM was first introduced in 1963 by Vapnik et al., long before Cortes and Vapnik (1995) proposed to use the kernel trick. Joachims (1998) recommends the linear kernel for text classification, based on the observation that text representation features are frequently linearly separable.

SVM with RBF Kernel. Choosing the kernel is usually a challenging task, because performance is dataset dependent. Therefore, we also experimented with a Radial Basis Function (RBF) kernel, which has already proven effective in text classification problems. The drawback of RBF kernels is that they are computationally expensive and perform worse on big and sparse feature matrices.
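The paper does not tie the system to a specific implementation library; the following minimal sketch, assuming scikit-learn, shows how the two classifiers could be instantiated. The hyperparameters and the toy data are illustrative only, not the tuned values from our experiments.

```python
from sklearn.svm import SVC

# One classifier per kernel, as described above; C and gamma are
# illustrative defaults, not the tuned values from the experiments.
svm_linear = SVC(kernel="linear", C=1.0)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")

# Toy usage: X holds one feature vector per tweet, y the binary labels.
X = [[2, 0, 1], [0, 1, 0], [3, 1, 1], [0, 0, 0]]
y = [1, 0, 1, 0]
svm_rbf.fit(X, y)
print(svm_rbf.predict([[2, 0, 1]]))
```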
3.1 Features

We employed several lexical features, performing a simple preprocessing step including tokenization and stemming, using the NLTK (Natural Language Toolkit) library[4]. A detailed description of the features employed by our model follows; a sketch of how the lexicon-based features can be computed is given after Table 2.

Bag of Words (BoW). We used bags of words to build the tweet representation. Before producing the word vector, we lower-cased all characters. Our vector space consists of the counts of unigrams and bigrams as a representation of the tweet. In addition, we employed Bag of Hashtags (BoH) and Bag of Emojis (BoE) features, built with the same technique as BoW but focusing on the presence of hashtags and emojis.

Swear Words. This feature takes into account the presence of a swear word and the number of its occurrences in the tweet. For English, we took a list of swear words from www.noswearing.com, while for Italian we gathered the swear word list from several sources[5], including a translated version of the www.noswearing.com list and a list of swear words from Capuano (2007).

Sexist Slurs. Besides swear words, we also considered sexist words that specifically target women. We used a small set of sexist slurs from previous work by Fasoli et al. (2015), and translated and expanded that list manually for our Italian systems. This feature is binary: 1 when at least one sexist slur is present in the tweet, 0 otherwise.

Women Words. We manually built a small set of words containing synonyms of, and several words related to, the word "woman" in English and "donna" in Italian. Based on our previous work (Pamungkas et al., 2018), these words were effective in detecting the target of misogyny in a tweet. Similar to the sexist slur feature, this is a binary feature indicating the presence of women words in the tweet.

Surface Features. We also considered several surface-level features, including: upper-case character count, number of hashtags, number of URLs, and the length of the tweet in characters.

Hate Words Lexicon. HurtLex (Bassignana et al., 2018) is a multilingual lexicon of hate words, built starting from a manually compiled list of words (De Mauro, 2016). The lexicon is semi-automatically translated into 53 languages, and the lexical items are divided into 17 categories (see Table 2). For our system configuration, we exploited the presence of the words of each category as a single feature, thus obtaining 17 features, one for each HurtLex category.

Category  Description
PS        Ethnic Slurs
RCI       Location and Demonyms
PA        Profession and Occupation
DDP       Physical Disabilities and Diversity
DDF       Cognitive Disabilities and Diversity
DMC       Moral Behavior and Defect
IS        Words Related to Social and Economic Disadvantage
OR        Words Related to Plants
AN        Words Related to Animals
ASM       Words Related to Male Genitalia
ASF       Words Related to Female Genitalia
PR        Words Related to Prostitution
OM        Words Related to Homosexuality
QAS       Descriptive Words with Potential Negative Connotations
CDS       Derogatory Words
RE        Felonies and Words Related to Crime and Immoral Behavior
SVP       Words Related to the Seven Deadly Sins of the Christian Tradition

Table 2: HurtLex categories.

[4] https://www.nltk.org/
[5] https://www.parolacce.org/2016/12/20/dati-frequenza-turpiloquio/ and https://it.wikipedia.org/wiki/Turpiloquio_nella_lingua_italiana
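As an illustration of how the lexicon-based features above can be computed, here is a minimal sketch in Python. The word lists are hypothetical stand-ins for the actual noswearing.com, Fasoli et al. (2015), and HurtLex resources, and the function name is ours.

```python
from nltk.tokenize import TweetTokenizer

# Hypothetical stand-ins for the real lexicons described above.
SWEAR_WORDS = {"bitch", "cunt"}
SEXIST_SLURS = {"skank"}
WOMEN_WORDS = {"woman", "girl", "donna"}
HURTLEX = {"ASF": {"cunt"}, "PR": {"skank"}, "DDP": set()}  # category -> words

tokenizer = TweetTokenizer(preserve_case=False)

def lexical_features(tweet):
    tokens = tokenizer.tokenize(tweet)
    feats = {
        # Swear word occurrence count and binary presence.
        "sw_count": sum(t in SWEAR_WORDS for t in tokens),
        "sw_presence": int(any(t in SWEAR_WORDS for t in tokens)),
        # Binary presence of sexist slurs and of women words.
        "sexist_slur": int(any(t in SEXIST_SLURS for t in tokens)),
        "woman_word": int(any(t in WOMEN_WORDS for t in tokens)),
        # Surface features: upper-case characters and tweet length.
        "upper_count": sum(c.isupper() for c in tweet),
        "length": len(tweet),
    }
    # One feature per HurtLex category (17 in the full lexicon).
    for cat, words in HURTLEX.items():
        feats["hurtlex_" + cat] = sum(t in words for t in tokens)
    return feats

print(lexical_features("Bitch you aint the only one who hate me"))
```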
4 Experimental Setup

We experimented with different sets of features and kernels to find the best configuration of the two SVM classifiers (one for each language of the task). A 10-fold cross-validation was carried out to tune our systems based on accuracy (a sketch of this step follows Table 3). The configuration of our submitted systems can be seen in Table 3. Run #3 for both languages uses the same configuration as our best system at the IberEval task (Fersini et al., 2018a). In order to classify the category and target of misogyny (Task B), we adopted the same set of features as for Task A; we did not build new systems specifically for Task B.

The best result on the English training set was obtained by run #1, where we used the RBF kernel (0.765 accuracy), while for Italian the best results were obtained by runs #2 and #3 with the linear kernel (0.893 accuracy). Different sets of HurtLex categories were able to improve the classifier performance, depending on the language.

We experimented with different selections of categories from the HurtLex lexicon, and identified the most useful ones for the purpose of misogyny identification. As can be seen in Table 3, the main categories are: physical disabilities and diversity (DDP), words related to prostitution (PR), and words referring to male genitalia (ASM) and female genitalia (ASF), but also derogatory words (CDS) and words related to felonies, crime, and immoral behavior (RE).

                    English               Italian
System              run1   run2   run3    run1   run2   run3
Accuracy            0.765  0.72   0.744   0.786  0.893  0.893
Bag of Words        -      X      -       -      X      X
Bag of Hashtags     -      -      -       -      -      X
Bag of Emojis       -      -      -       -      -      X
S.W. Count          X      -      X       X      -      -
S.W. Presence       X      -      X       X      -      -
Sexist Slurs        X      X      X       X      X      -
Woman Word          X      X      X       X      X      -
Hashtag             -      -      X       -      X      -
Link Presence       X      X      X       -      -      -
Upper Case Count    X      -      -       X      X      -
Text Length         -      X      -       X      -      -
ASF Count           X      X      -       X      X      X
PR Count            -      -      -       X      X      X
OM Count            X      X      -       -      -      -
DDF Count           -      -      -       -      -      -
CDS Count           X      X      -       X      X      -
DDP Count           X      X      -       -      -      X
AN Count            X      X      -       -      -      -
ASM Count           -      -      -       X      X      -
DMC Count           -      -      -       -      -      -
IS Count            X      X      -       -      -      -
OR Count            -      -      -       -      -      -
PA Count            X      X      -       -      -      -
PS Count            -      -      -       -      -      -
QAS Count           -      -      -       -      -      -
RCI Count           -      -      -       -      -      -
RE Count            -      -      -       X      X      -
SVP Count           -      -      -       -      -      -
Kernel              RBF    Linear RBF     RBF    Linear Linear

Table 3: Feature selection for all the submitted systems.
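A sketch of the tuning step, again assuming scikit-learn; the synthetic data stands in for the extracted tweet features, and the loop over kernels is a simplified version of the configuration search summarized in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic placeholder for the tweet feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 10-fold cross-validation, scored on accuracy as described above.
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10, scoring="accuracy")
    print(kernel, round(scores.mean(), 3))
```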
5 Results

Table 4 shows our system performance on the test sets. Our best system in Task A ranked 3rd in Italian (0.839 accuracy for run #3) and 13th in English (0.621 accuracy for run #3). Interestingly, our best results for both languages were obtained by the best configuration submitted at the IberEval campaign. However, our English system performance was considerably worse than the IberEval result (accuracy = 0.814). We analyze this problem in Section 6.

ITALIAN
Rank  Team                       Accuracy
1     bakarov.c.run2             0.844
2     bakarov.c.run1             0.842
3     14-exlab.c.run3            0.839
4     bakarov.c.run3             0.836
5     14-exlab.c.run2            0.835
6     StopPropagHate.c.run1      0.835
7     AMI-BASELINE               0.830
8     StopPropagHate.u.run2      0.829
9     SB.c.run1                  0.824
10    RCLN.c.run1                0.824
11    SB.c.run3                  0.823
12    SB.c.run                   0.822

ENGLISH
Rank  Team                       Accuracy
1     hateminers.c.run1          0.704
2     hateminers.c.run3          0.681
3     hateminers.c.run2          0.673
4     resham.c.run3              0.651
5     bakarov.c.run3             0.649
6     resham.c.run1              0.648
7     resham.c.run2              0.647
8     ITT.c.run2.tsv             0.638
9     ITT.c.run1.tsv             0.636
10    ITT.c.run3.tsv             0.636
11    himani.c.run2.tsv          0.628
12    bakarov.c.run2             0.628
13    14-exlab.c.run3            0.621
14    himani.c.run1.tsv          0.619
15    himani.c.run3.tsv          0.614
16    14-exlab.c.run1            0.614
17    SB.c.run2.tsv              0.613
18    bakarov.c.run1             0.605
19    AMI-BASELINE               0.605
20    StopPropagHate.c.run1.tsv  0.593
21    SB.c.run1.tsv              0.592
22    StopPropagHate.u.run3.tsv  0.591
23    StopPropagHate.u.run2.tsv  0.590
24    RCLN.c.run1                0.586
25    SB.c.run3.tsv              0.584
26    14-exlab.c.run2            0.500

Table 4: Official results for Subtask A.

In Task B, most of the submitted systems struggled to classify the misogynous tweets into the five categories and to discriminate whether the target is active or passive. Both subtasks have very low baselines for both languages (below 0.4 for English and around 0.5 for Italian). Under-represented classes such as DERAILING and DOMINANCE are very difficult to detect in the category classification (see Table 1 for details). Similarly, the label distribution is very unbalanced for target classification, where most of the misogynous tweets attack a specific target (ACTIVE).

ITALIAN
Rank  Team               Avg.   Cat.   Targ.
1     bakarov.c.run1     0.493  0.555  0.432
2     AMI-BASELINE       0.487  0.534  0.440
3     14-exlab.c.run3    0.485  0.552  0.418
4     14-exlab.c.run2    0.482  0.550  0.415
5     bakarov.c.run3     0.478  0.536  0.421
6     bakarov.c.run2     0.463  0.499  0.426
7     SB.c.run.tsv       0.449  0.485  0.414
8     SB.c.run1.tsv      0.448  0.483  0.414
9     RCLN.c.run1        0.448  0.473  0.422
10    SB.c.run2.tsv      0.446  0.480  0.411
11    14-exlab.c.run1    0.292  0.164  0.420

ENGLISH
Rank  Team               Avg.   Cat.   Targ.
1     himani.c.run3.tsv  0.406  0.361  0.451
2     himani.c.run2.tsv  0.377  0.323  0.431
3     AMI-BASELINE       0.370  0.342  0.399
4     hateminers.c.run3  0.369  0.302  0.435
5     hateminers.c.run1  0.348  0.264  0.431
6     SB.c.run2.tsv      0.344  0.282  0.407
7     himani.c.run1.tsv  0.342  0.280  0.403
8     SB.c.run1.tsv      0.335  0.282  0.389
9     hateminers.c.run2  0.329  0.229  0.430
10    SB.c.run3.tsv      0.328  0.269  0.387
11    resham.c.run2      0.322  0.246  0.399
12    resham.c.run1      0.316  0.235  0.397
13    bakarov.c.run1     0.309  0.260  0.357
14    resham.c.run3      0.283  0.214  0.353
15    RCLN.c.run1        0.280  0.165  0.395
16    ITT.c.run2.tsv     0.276  0.173  0.379
17    bakarov.c.run2     0.275  0.176  0.374
18    14-exlab.c.run1    0.260  0.124  0.395
19    bakarov.c.run3     0.254  0.151  0.356
20    14-exlab.c.run3    0.239  0.107  0.371
21    ITT.c.run1.tsv     0.238  0.140  0.335
22    ITT.c.run3.tsv     0.237  0.138  0.335
23    14-exlab.c.run2    0.232  0.205  0.258

Table 5: Official results for Subtask B.

Several features which focus on the use of offensive words proved useful for English. For Italian, a simple tweet representation involving Bag of Words, Bag of Hashtags, and Bag of Emojis already produced a better result than the baseline. Some of the HurtLex categories that improved the system's performance during training did not help the prediction on the test set (ASF, OM, CDS, DDP, AN, IS, PA for English; CDS, ASM for Italian). However, similarly to the Spanish case, the system configuration which utilized ASF, PR, and DDP obtained the best result for Italian.
6 Discussion

We performed an error analysis on the gold standard test set, and analyzed 160 Italian tweets that our best system configuration mislabelled. The label "misogynistic" was wrongly assigned to 147 instances (false positives, 91.9% of the errors), while the contrary happened only 13 times (false negatives, 8.1% of the errors; see the sketch at the end of this section). The same situation occurred in the English dataset, but with a less striking impact: 228 false positives (60.2% of the errors) and 151 false negatives (39.8% of the errors). In this section we conduct a qualitative error analysis, identifying and discussing several factors that contribute to the misclassification.

Presence of swear words. We encountered many "bad words" in the datasets of this shared task, for both English and Italian. In an abusive context, the presence of swear words can help to spot abusive content such as misogyny. However, they can also lead to false positives when the swear word is used in a casual, non-offensive context (Malmasi and Zampieri, 2018; Van Hee et al., 2018; Nobata et al., 2016). Consider the following two examples containing the swear word "bitch" in different contexts:

1. Im such a fucking cunt bitch and i dont even mean to be goddammit

2. Bitch you aint the only one who hate me, join the club, stand in the corner, and stfu.

In Example 1, the swear word "bitch" is used just to arouse interest and show off, not to directly insult another person; this is a case of idiomatic swearing (Pinker, 2007). In Example 2, the swear word "bitch" is used to insult a specific target in an abusive context, an instance of abusive swearing (Pinker, 2007). Resolving the swearing context is still a challenging task for automatic systems, which contributes to the difficulty of this task.

Reported speech. Tweets may contain misogynistic content as an indirect quote of someone else's words, such as in the following example:

3. Quella volta che mia madre mi ha detto quella cosa le ho risposto "Mannaggia! Non sarò mai una brava donna schiava zitta e lava! E adesso?!" Potrei morire per il dispiacere.
→ That time when my mom told me that thing and I answered "Holy s**t! I will never be a good slave who shuts up and cleans! What now?" I could die for heartbreak.

According to the task guidelines this should not be labeled as a misogynistic tweet, because it is not the user himself who is misogynistic. Therefore, instances of this type tend to confuse a classifier based on lexical features.

Irony and world knowledge. In Example 3, the sentence "Potrei morire per il dispiacere." ("I could die for heartbreak.") is ironic. Humor is very hard to model for automatic systems; sometimes the presence of figurative language even baffles human annotators. Moreover, external world knowledge is often required to infer whether an utterance is ironic (Wallace et al., 2014).

Preprocessing and tokenization. In computer-mediated communication, and specifically on Twitter, users often resort to a type of language that is closer to speech than to written language. This is reflected in less-than-clean orthography, with forms and expressions that imitate verbal face-to-face conversation.

4. @XXXXXXXXX @XXXXXXXXXX @XXXXXXX @XXXXXX x me glob prox2aa colpiran tutti incluso nemicinterno.. esterno colpopiúduro saràculogrande che bevetropvodka e inoltre x questiondisoldi progetta farmezzofallirsudfinitestampe: ciò nnvàben xrchèindebolis
→ 4 me glob next2aa will hit everyone included even internalenemy.. external harderhit willbebigass who drinkstoomuchvodka and also 4 mattersofmoney isplanning tomakethesouthfailwithprintings: dis notgood causeweaken

In Example 4, preprocessing steps like tokenization and stemming are particularly hard to perform, because of the lack of spaces between words and the confused orthography. Consequently, the whole classification pipeline is compromised and error-prone.

Gender of the target. As defined in the Introduction, misogyny is a specific type of hateful language targeting women. However, detecting the gender of the target is a challenging task in itself, especially in Twitter datasets.

5. @realDonaldTrump shut the FUCK up you infected pussy fungus.

6. @TomiLahren You're a fucking skank!

Both examples use bad words to abuse their targets. However, the first example is labeled as not misogynous, since the target is Donald Trump (a man), while the second is labeled as misogynous, with Tomi Lahren (a woman) as the target.
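The false positive/false negative breakdown reported at the beginning of this section is straightforward to reproduce from system output; a minimal sketch, where gold and pred are hypothetical binary label lists (1 = misogynistic):

```python
# Hypothetical gold-standard and predicted binary labels (1 = misogynistic).
gold = [1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 0, 1]

fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # false positives
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # false negatives
errors = fp + fn
print(f"FP: {fp} ({fp / errors:.1%} of errors), FN: {fn} ({fn / errors:.1%})")
# On the Italian test set we observed 147 FP (91.9%) and 13 FN (8.1%).
```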
7 Conclusions

We draw some considerations based on the results of our participation in the EVALITA 2018 AMI shared task. In order to test the multilingual potential of our model, one of the systems we submitted for Italian at EVALITA (run #3) was based on our best model for Spanish at IberEval. Based on the official results, this system performed well for Italian; it consists of features such as BoW, BoE, BoH, and several HurtLex categories specifically related to hate against women. Concerning English, we obtained lower results at EVALITA than at IberEval with the same system configuration. It is worth mentioning that, even though the training set for the AMI EVALITA task was substantially bigger, in absolute terms all of the AMI participants at EVALITA obtained worse scores than the IberEval teams.

Acknowledgments

Valerio Basile and Viviana Patti were partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media - IhatePrejudice, S1618_L2_BOSC_01).

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic Identification and Classification of Misogynistic Language on Twitter. In Proceedings of the 23rd International Conference on Applications of Natural Language & Information Systems, pages 57–64. Springer.

Jamie Bartlett, Richard Norrie, Sofia Patel, Rebekka Rumpel, and Simon Wibberley. 2014. Misogyny on Twitter. Demos.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. HurtLex: A Multilingual Lexicon of Words to Hurt. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy. CEUR.org.

Romolo Giovanni Capuano. 2007. Turpia: sociologia del turpiloquio e della bestemmia. Riscontri (Milano, Italia). Costa & Nolan.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Tullio De Mauro. 2016. Le parole per ferire. Internazionale, 27 settembre 2016.

Fabio Fasoli, Andrea Carnaghi, and Maria Paola Paladino. 2015. Social acceptability of sexist derogatory and sexist objectifying slurs across contexts. Language Sciences, 52:98–107.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018a. Overview of the Task on Automatic Misogyny Identification at IberEval. In Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), pages 57–64. CEUR-WS.org, September.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018b. Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI). In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.
Shervin Malmasi and Marcos Zampieri. 2018. Challenges in discriminating profanity from hate speech. Journal of Experimental & Theoretical Artificial Intelligence, 30(2):187–202.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTo for AMI at IberEval2018: Exploiting Lexical Knowledge for Detecting Misogyny in English and Spanish Tweets. In Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018).

Steven Pinker. 2007. The Stuff of Thought: Language as a Window into Human Nature. Penguin.

Bailey Poland. 2016. Haters: Harassment, Abuse, and Violence Online. Potomac Press.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Cynthia Van Hee, Gilles Jacobs, Chris Emmery, Bart Desmet, Els Lefever, Ben Verhoeven, Guy De Pauw, Walter Daelemans, and Véronique Hoste. 2018. Automatic detection of cyberbullying in social media text. arXiv preprint arXiv:1801.05617.

Byron C. Wallace, Laura Kertz, Eugene Charniak, et al. 2014. Humans require context to infer ironic intent (so computers probably do, too). In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 512–516.