Automatic Expansion of Lexicons for Multilingual Misogyny Detection

Simona Frenda
Università degli Studi di Torino, Italy
Universitat Politècnica de València, Spain
simona.frenda@unito.it

Bilal Ghanem
Universitat Politècnica de València, Spain
bigha@doctor.upv.es

Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez and Luis Villaseñor-Pineda
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
{fany.guzman, mmontesg, villasen}@inaoep.mx

Abstract

The automatic misogyny identification (AMI) task proposed at IberEval and EVALITA 2018 is an example of the active involvement of scientific research in countering the online spread of hateful content against women. Considering the encouraging results obtained for Spanish and English in the previous edition of AMI, in the EVALITA framework we tested the robustness of a similar approach, based on topic and stylistic information, on a new collection of Italian and English tweets. Moreover, to deal with the dynamism of language on social platforms, we also propose an approach based on automatically enriched lexica. Although resources such as lexica prove to be useful for a specific domain like misogyny, the analysis of the results reveals the limitations of the proposed approaches.

1 Introduction

The anonymity and interactivity typical of computer-mediated communication facilitate the spread of hate messages and the persistent presence of hateful content online. As investigated by Fox et al. (2015), these factors also increase and influence social misbehavior offline. In order to foster research on solutions that could help to monitor the spread of hate speech, different tasks have been proposed in various evaluation campaigns. An example is the AMI shared task proposed at IberEval 2018¹ and later at EVALITA 2018². This task focuses on the automatic identification of misogyny in different languages. In particular, the first edition focuses on Spanish and English, and the second one on a new English corpus and on Italian. The multilingual context makes it possible to observe the analogies and differences between languages. The AMI organizers (Fersini et al., 2018a; Fersini et al., 2018b) asked participants first to detect misogynistic tweets and then to classify the misogynistic category and the kind of target (individuals or groups). In the first edition, we proposed an approach based on stylistic and topic information, captured respectively by means of character n-grams and a set of modeled lexica (Frenda et al., 2018). Considering the encouraging results obtained with the lexicon-based approach in Spanish and English, we re-proposed a similar approach for Italian and for a new collection of English tweets in order to test its performance and robustness.

¹ http://amiibereval2018.wordpress.com/
² http://amievalita2018.wordpress.com/

In this paper we propose two approaches. The first one, similar to previous work (Frenda et al., 2018), involves topic, linguistic and stylistic information. The second one focuses mainly on the automatic extension of the original lexica. To deal with the continuous variation of language on social platforms, the modeled lexica are enriched by exploiting the contextual similarity of words to each lexicon, computed with pre-trained word embeddings. This technique helps the system to also consider new terms related to the topics of the original lexica. It could serve as a methodology for automatically updating the lists of words that Internet companies use to block offensive content in real applications. A comparison between the two approaches reveals that the automatic enrichment of the lexica improves the results, especially for English. However, comparing the results obtained in both competitions and observing the error analyses, we notice that lexica are a good resource for a specific domain like misogyny, but they are not sufficient to detect misogyny online.

In the following, Section 2 describes the studies that inspired our work. Section 3 explains the approaches employed in both languages. Section 4 discusses the obtained results and draws some conclusions.

2 Related Work

A first work on misogyny detection is proposed in Anzovino et al. (2018). In this study, the authors compared the performance of different supervised approaches using word embeddings, stylistic and syntactic features. In particular, their results reveal that the best machine learning approach for the identification of misogyny is the linear Support Vector Machine (SVM) classifier. In general, machine learning techniques are the most used in hate speech detection (Escalante et al., 2017; Nobata et al., 2016), because they allow researchers to explore the issue closely by exploiting different features, such as textual (Chen et al., 2012) and syntactic aspects (Burnap and Williams, 2014) or semantic and sentiment information (Samghabadi et al., 2017; Nobata et al., 2016; Gitari et al., 2015). Finally, some recent works have also investigated the potential of deep learning techniques (Mehdad and Tetreault, 2016; Del Vigna et al., 2017). Considering the specific domain of hate against women, this work exploits stylistic, linguistic and topic information about misogynistic speech. In particular, differently from previous studies, we use specific lexica related to offensiveness and the discredit of women for English and Italian, and we extend them with new words related to the topics of these lexica. Since commercial methods currently rely on blacklists to monitor or block offensive content, the proposed approach could help to update such blacklists by automating the lexicon-building process.

3 Proposed Approaches

The AMI shared task proposed at EVALITA 2018 aims to detect misogyny in English and Italian collections of tweets. The organizers asked participants to detect misogynistic texts (Task A) and then, if the tweet is predicted as misogynistic, to distinguish the nature of the target (individuals or groups, labeled respectively "active" and "passive") and to identify the type of misogyny (Task B), according to the following classes proposed by Poland (2016): (a) stereotype and objectification, (b) dominance, (c) derailing, (d) sexual harassment and threats of violence, and (e) discredit. These classes represent the different manifestations and aspects of this social misbehavior. Table 1 shows the composition of the datasets.

Considering the promising results obtained at the IberEval campaign, in this work we use two approaches mainly based on lexica. The first one (Section 3.1) is similar to the approach used in Frenda et al. (2018), based on topic, linguistic and stylistic information captured by means of modeled lexica and n-grams of characters and words. The second one (Section 3.2) principally involves the automatically extended versions of the original lexica (Guzmán Falcón, 2018). In particular, we aim: 1) to test the robustness of lexicon-based approaches on the new collections of tweets and on a new language, and 2) to understand the impact of automatically enriched lexica in coping with language variation in multilingual computer-mediated communication.

                           Misogynistic                              Non-misogynistic
                  (a)   (b)   (c)   (d)   (e)   active  passive
Italian
  Training set    668    71    24   431   634    1721      97        2172
  Test set        175    61     2   170   104     446      66         488
English
  Training set    179   148    92   352  1014    1058     727        2215
  Test set        140   124    11    44   141     401      59         540

Table 1: Composition of AMI's datasets at EVALITA 2018.

3.1 Approach 1: using manually-modeled lexica (MML)

The first proposed approach aims to capture topic, linguistic and stylistic information by means of manually-modeled lexica and n-grams of words and characters. The features used for each language are described below.

English Features. For the detection of misogyny in English tweets, we employed the manually-modeled lexica proposed in Frenda et al. (2018). These lexica concern sexuality, profanity, femininity and the human body, as described in Table 2, and they also contain slang expressions. Moreover, we take into account the hashtags and abbreviations collected in Frenda et al. (2018): 40 misogynistic hashtags, such as #ihatefemales or #bitchesstink, and a list of 50 negative abbreviations, such as wtf or stfu. As for the most relevant word n-grams, we employ bigrams for the first task and the combination of unigrams, bigrams and trigrams (hereafter UBT) for the second task. Moreover, a bag of characters (BoC) in a range from 1 to 7 grams is employed to manage misspellings and to capture stylistic aspects of digital writing. In order to perform the experiments, each tweet is represented as a vector. The presence of words in each lexicon is weighted with Information Gain, and character and word n-grams are weighted with the Term Frequency-Inverse Document Frequency (TF-IDF) measure. In addition, since in Frenda et al. (2018) several misclassified misogynistic tweets were ironic or sarcastic, we try to analyze the impact of irony on misogyny detection in English. Indeed, Ford and Boxer (2011) reveal that sexist jokes, which are generally considered innocent, are actually experienced by women as sexual harassment. In particular, inspired by Barbieri and Saggion (2014), we calculate the imbalance between the sentiment polarities (positive and negative) in each tweet using SentiWordNet, provided by Baccianella et al. (2010). To each degree of imbalance we associate a weight that is used in the vector representation of the tweets. Although our hypothesis is well founded, we obtained lower results for the runs that include sentiment imbalance among the features (see Table 4).

Lexicon      Words  Definition
Sexuality    290    contains words related to sexual topics (orgasm, orgy, pussy) and especially to male domination over women (rape, pimp, slave)
Profanity    170    is a collection of vulgar words such as motherfucker, slut and scum
Femininity   90     is a list of terms used to identify women as a target. It contains personal pronouns or possessive adjectives (such as she, her, herself), common words used to refer to women (girl, mother) and also offensive words towards women (such as barbie, hooker or non-male)
Human body   50     is a lexicon strongly connected with sexuality, collecting words that refer especially to the female body, also with negative connotations (such as holes, throat or boobs)

Table 2: Composition of English lexica.

Italian Features. For the Italian language, we selected some specific issue groups, described in Bassignana et al. (2018), from the Italian lexicon "Le parole per ferire" provided by Tullio De Mauro³. In particular, we consider the lists of words described in Table 3. Differently from English, the experiments reveal that UBT is useful for both tasks and that the best range for BoC is from 3 to 5 grams⁴. Indeed, in a morphologically complex language like Italian, word endings (such as the extracted n-grams "tona" or "ana ") contain relevant linguistic information. In English, instead, longer sequences of characters could help to capture multi-word expressions that also contain pronouns, adjectives or prepositions, such as "ing at" or "ss bitc".

³ http://www.internazionale.it/opinione/tullio-de-mauro/2016/09/27/razzismo-parole-ferire
⁴ The experiments were carried out using grid search.

To extract the features correctly and train our models, we pre-process the data by deleting emoticons, emojis and URLs. Indeed, in our experiments emoticons and emojis did not prove to be relevant for these tasks. In order to correctly match the vocabulary of each corpus against each lexicon, we use the lemmatizer provided by the Natural Language Toolkit (NLTK⁵) for English and the Snowball Stemmer for Italian. Differently from English, the use of a lemmatizer for Italian tweets hinders the match.

⁵ http://www.nltk.org/
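The feature representation of Section 3.1 can be sketched in a few lines. The following is a minimal, pure-Python illustration and not the code used for the submitted runs: all function names are hypothetical, the TF-IDF variant (tf · log(N/df)) is an assumption because the exact formula is not stated, lexicon matches are plain counts rather than Information Gain weights, and lemmatization or stemming as well as emoji removal are omitted. The toy lexica are placeholders for the lists in Tables 2 and 3.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def preprocess(tweet):
    """Lowercase and drop URLs (emoji/emoticon removal is omitted in this sketch)."""
    return URL_RE.sub("", tweet.lower()).strip()

def char_ngrams(text, n_min, n_max):
    """Bag of characters (BoC): character n-grams with n_min <= n <= n_max."""
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )

def word_ngrams(tokens, n_min, n_max):
    """Word n-grams; (1, 3) gives the unigram+bigram+trigram (UBT) combination."""
    return Counter(
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    )

def tfidf(bags):
    """Assumed TF-IDF variant: weight = tf * log(N / df)."""
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in bag.items()}
        for bag in bags
    ]

def lexicon_counts(tokens, lexica):
    """One feature per lexicon: how many tokens of the tweet occur in it."""
    return {name: sum(tok in words for tok in tokens)
            for name, words in lexica.items()}

# Toy placeholders; the real lexica are those described in Tables 2 and 3.
lexica = {"femininity": {"she", "her", "girl"}, "profanity": {"scum"}}
tweets = ["She is scum http://t.co/x", "her girl friend"]

docs = [preprocess(t) for t in tweets]
boc = tfidf([char_ngrams(d, 3, 5) for d in docs])        # Italian setting: 3 to 5 grams
ubt = tfidf([word_ngrams(d.split(), 1, 3) for d in docs])
lex = [lexicon_counts(d.split(), lexica) for d in docs]
```

In a full pipeline the three dictionaries of each tweet would be concatenated into a single sparse vector before classification; keeping them separate here only makes the individual feature groups easier to inspect.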
Lexicon  Words  Definition
AN       111    collects words related to animals, such as sanguisuga or pecora
ASF      31     contains terms referring to female genitalia, such as fessa
ASM      76     contains terms referring to male genitalia, such as verga
CDS      298    is a list of derogatory words, such as bastardo or spazzatura
OR       17     contains words derived from plants that are used as offensive words, such as finocchio or rapa
PA       83     is a list of professions or jobs that also have negative connotations, such as portinaia or impiegato
PR       54     contains terms about prostitution, such as bagascia or zoccolona
PS       42     is a list of words related to stereotypes, such as negro or ostrogoto
QAS      82     collects words that in general have negative connotations, such as parassita or dilettante
RE       37     contains terms related to criminal acts or immoral actions, such as stupro or violento

Table 3: Composition of Italian lexica.

3.2 Approach 2: using automatically-enriched lexica (AEL)

The second approach aims to deal with the dynamism of informal language online by trying to capture new words related to the contexts defined by each lexicon. Therefore, we use enriched versions of the original lexica (described above), together with stylistic and linguistic information captured by means of n-grams of words and characters, as in the first approach. The method for expanding a given lexicon is based on the idea of identifying new words by considering their contextual similarity with known words, as defined by pre-trained word embeddings. To describe it, let us assume that $L = \{l_1, \ldots, l_m\}$ is the initial lexicon of $m$ words, and $W = \{(w_1, e(w_1)), \ldots, (w_n, e(w_n))\}$ is the set of pre-trained word embeddings, where each pair represents a word and its corresponding embedding vector. The method aims to enrich the lexicon with words that are strongly related to the context of the original lexicon without necessarily being associated with any particular word in it. The idea is to search for words whose contexts are similar to those of the entire lexicon. The method has two main steps, described below.

Dictionary modeling. Firstly, we extract the embedding $e(l_i)$ for each word $l_i \in L$; then, we compute the average of these vectors to obtain a vector describing the entire lexicon, $e(L)$. We call this vector the context embedding.

Dictionary expansion. Using cosine similarity, we compare $e(L)$ against the embedding $e(w_i)$ of each word $w_i$ in $W$; then, we extract the $k$ words most similar to $e(L)$, defining the set $E_L = \{w_1, \ldots, w_k\}$. Finally, we insert the extracted words into the original lexicon to build the new lexicon, i.e., $L_E = L \cup E_L$.

We carry out the experiments using different pre-trained word embeddings for each language: GloVe embeddings trained on 2 billion tweets (Pennington et al., 2014) for English, and word embeddings built on the TWITA corpus⁶ for Italian (Basile and Novielli, 2014). Finally, the proposed expansion method is parametric and requires a value for $k$, the number of words that are added to each lexicon. In particular, we use $k$ = 1000, 500 and 100.

⁶ http://valeriobasile.github.io/twita/about.html

3.3 Experiments and Results

To carry out the experiments, an SVM classifier is employed with the radial basis function (RBF) kernel, using the following parameters: C = 5, with γ = 0.1 for English and γ = 0.01 for Italian. Considering the complexity of the target classification for the Italian language, due to the imbalanced training set (see Table 1), for that case we used a Random Forest (RF) classifier, which aggregates the votes of different decision trees to decide the final class of the tweet.

The evaluation is performed using the test set provided by the organizers of the AMI shared task.
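As a concrete illustration of the expansion method of Section 3.2 (dictionary modeling followed by dictionary expansion), the two steps can be sketched as below. This is a minimal pure-Python sketch under stated assumptions, not the implementation used for the runs: the function names and the toy two-dimensional embeddings are hypothetical, lexicon words missing from the embedding vocabulary are simply skipped, and words already in the lexicon are excluded from the candidates; neither of these two choices is specified in the description of the method.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_lexicon(lexicon, embeddings, k):
    """Return L_E = L ∪ E_L, where E_L holds the k words closest to e(L)."""
    # Dictionary modeling: average the embeddings of the lexicon words.
    # Assumption: words without an embedding are ignored.
    known = [embeddings[w] for w in lexicon if w in embeddings]
    if not known:
        return set(lexicon)
    context = mean_vector(known)  # the context embedding e(L)

    # Dictionary expansion: rank the vocabulary by cosine similarity to e(L).
    # Assumption: words already in the lexicon are not candidates.
    candidates = [w for w in embeddings if w not in lexicon]
    candidates.sort(key=lambda w: cosine(context, embeddings[w]), reverse=True)
    return set(lexicon) | set(candidates[:k])

# Toy 2-dimensional embeddings, purely illustrative.
embeddings = {
    "girl": [1.0, 0.0],
    "woman": [0.9, 0.1],
    "lady": [0.8, 0.3],
    "car": [0.0, 1.0],
    "road": [0.1, 0.9],
}
lexicon = {"girl", "woman"}
expanded = expand_lexicon(lexicon, embeddings, k=1)
```

With real embeddings the vocabulary contains hundreds of thousands of words, so a vectorized similarity computation would be preferable to the explicit loop used here for readability.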
For the competition, the organizers use as evaluation measures the accuracy for Task A and, for Task B, the average of the F-scores of the two classifications (misogyny category and target).

English
Run        Approach  Accuracy  Rank
run 2⁷     AEL       0.613     17
baseline   AMI       0.605     19
run 1      AEL       0.592     21
run 3      MML       0.584     25

Italian
Run        Approach  Accuracy  Rank
baseline   AMI       0.830     7
run 1      AEL       0.824     9
run 3⁸     AEL       0.823     11
run 2      MML       0.822     12

Table 4: Results obtained in Task A.

⁷ This run does not involve the sentiment imbalance.
⁸ This run involves the expansions of lexica with k = 100.

Table 4 and Table 5 show the results obtained in the competition, compared with the baselines provided by the organizers for each task. Comparing the two approaches, AEL in general seems to work better than MML. However, the improvement is very slight, especially for Italian. This small variation is unexpected considering the results obtained during the experiments with 10-fold cross-validation, where AEL with lexica enriched using k = 100 reached an accuracy of 0.880. Moreover, looking at Table 4, which reports the official results of the AMI task, only run 2 exceeds the baseline for the detection of misogyny in English; for this run we used the AEL approach without the sentiment imbalance feature. For the identification of misogyny in Italian, the obtained results are lower than the provided baselines, as are the overall F-scores obtained in Task B for both languages (see Table 5). Despite the usefulness of lexica for a specific domain like misogyny, a lexicon-based approach proves to be insufficient for this task. Indeed, as the error analysis in Section 4 confirms, misogyny, like hate speech in general, involves linguistic devices such as humour, exclamations typical of spoken language, and contextual information that completes the meaning conveyed by the tweet. Moreover, the low values obtained in Task B suggest the need for a dedicated approach for each misogynistic category.

English
Run        Categories           Target               Total    Rank
           Features   F-score   Features   F-score   F-score
baseline   AMI        0.342     AMI        0.399     0.370    3
run 2      UBT        0.282     UBT+BoC    0.407     0.344    6
run 1      UBT        0.282     UBT+BoC    0.389     0.335    8
run 3      UBT        0.269     UBT+BoC    0.387     0.328    10

Italian
Run        Categories           Target               Total    Rank
           Features   F-score   Features   F-score   F-score
baseline   AMI        0.534     AMI        0.440     0.487    2
run 3      UBT+BoC    0.485     UBT+BoC    0.414     0.449    7
run 1      UBT+BoC    0.483     UBT+BoC    0.414     0.448    8
run 2      UBT+BoC    0.480     UBT+BoC    0.411     0.446    10

Table 5: Results obtained in Task B.

4 Discussion and Conclusions

This paper reports our participation in the AMI shared task. The organizers also provide the gold test set, which helps us to better understand the misclassified cases and the aspects that should be considered in future experiments. Carrying out the error analysis, we notice that in both datasets the content behind a URL affects the information conveyed by the tweet (such as Right! As they rape and butcher women and children !!!!!! https://t.co/maEhwuYQ8B). Swear words are also often used as exclamations, without the aim to offend (such as Volevo dire alla Yamamay che tettona non sinonimo di curvy dato che di vita ha una 40, quindi confidence sta minchia.). Moreover, although the current English corpus does not contain many jokes, the misclassified Italian tweets involve humorous utterances (such as @GrianneOhmsfor1 @BarbaraRaval A parte il fatto poi che "culona inchiavabile" è il miglior giudizio politico sentito sulla Merkel negli ultimi anni??). In fact, humour, irony and sarcasm in general hinder the correct classification of the texts, as we also noticed in the English and Spanish corpora provided in the IberEval framework. Participating in this shared task gave us the opportunity to analyze and compare multilingual datasets and thus to discover and infer general aspects typical of hate speech against women.

Acknowledgments

The work of Simona Frenda was partially funded by the Spanish research project SomEMBED TIN2015-71147-C2-1-P (MINECO/FEDER). We also thank CONACYT-Mexico for its support (projects FC-2410, CB-2015-01-257383).

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Francesco Barbieri and Horacio Saggion. 2014. Modelling irony in Twitter. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the ACL.

Pierpaolo Basile and Nicole Novielli. 2014. UNIBA at EVALITA 2014-SENTIPOLC task: Predicting tweet sentiment polarity combining micro-blogging, lexicon and semantic features. In Proceedings of EVALITA 2014.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In Proceedings of CLiC-it, Turin, 10-12 December 2018, CEUR.

Peter Burnap and Matthew Leighton Williams. 2014. Hate speech, machine classification and statistical modelling of information flows on Twitter: Interpretation and communication for policy decision making. Internet, Policy & Politics.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), pages 71–80. IEEE.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of ITASEC17.

Hugo Jair Escalante, Esaú Villatoro-Tello, Sara E Garza, A Pastor López-Monroy, Manuel Montes-y-Gómez, and Luis Villaseñor-Pineda. 2017. Early detection of deception and aggressiveness using profile-based representations. Expert Systems with Applications, 89:99–111.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018a. Overview of the task on automatic misogyny identification at IberEval. In Proceedings of Workshop IBEREVAL at 3rd SEPLN.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018b. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Thomas E Ford and Christie Fitzgerald Boxer. 2011. Sexist humor in the workplace: A case of subtle harassment. In Insidious Workplace Behavior, pages 203–234. Routledge.

Jesse Fox, Carlos Cruz, and Ji Young Lee. 2015. Perpetuating online sexism offline: Anonymity, interactivity, and the effects of sexist hashtags on social media. Computers in Human Behavior, 52:436–442.

Simona Frenda, Bilal Ghanem, and Manuel Montes-y-Gómez. 2018. Exploration of misogyny in Spanish and English tweets. In Proceedings of Workshop IBEREVAL at 3rd SEPLN.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230.

Estefanía Guzmán Falcón. 2018. Detección de lenguaje ofensivo en Twitter basada en expansión automática de lexicones (tesis de maestría). Instituto Nacional de Astrofísica, Óptica y Electrónica. Puebla, México.

Yashar Mehdad and Joel Tetreault. 2016. Do characters abuse more than words? In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 299–303.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on WWW.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Bailey Poland. 2016. Haters: Harassment, abuse, and violence online. U of Nebraska Press.

Niloofar Safi Samghabadi, Suraj Maharjan, Alan Sprague, Raquel Diaz-Sprague, and Thamar Solorio. 2017. Detecting nastiness in social media. In Proceedings of ALW1.