Vector Space Models for Automatic Misogyny Identification (Short Paper)

Amir Bakarov
National Research University Higher School of Economics, Moscow, Russia
amirbakarov at gmail.com

Abstract

English. The problem of hate speech and, especially, of misogynous language is one of the most crucial problems of contemporary Internet communities. Therefore, automatic detection of such language becomes one of the most pressing natural language processing tasks. The most ubiquitous tools for resolving this task are based on vector space models of texts. In this paper we describe our system that exploits such tools and has shown the best performance on the Italian AMI task of EVALITA 2018.

Italiano. The problem of hate speech, and especially of misogynous language, is one of the most crucial problems of today's Internet communities. Therefore, the automatic detection of such language becomes one of the most pressing goals of natural language processing. The most widespread systems addressing this goal exploit the distributional hypothesis. In this paper, we describe a system based on this hypothesis that has shown the best performance on the AMI task of EVALITA 2018 for the Italian language.

1 Introduction

As the Internet community and online discussions grow, the number of manifestations of hate speech on open web resources also increases. Such speech (also called abusive language or textual harassment) can take different forms depending on whether it targets a person's ethnicity, gender identity, religion, or sexual orientation. Probably one of the most destructive forms of hate speech is the one that abuses a person's gender identity. This form of hate speech is called misogynous language, since misogyny is a specific kind of hate whose targets are women. Misogyny on the Internet (cybermisogyny, or online sexual harassment) is one of the crucial problems of contemporary Internet communities, especially from the perspective of the societal impact of this phenomenon.

Thus, the problem of automatic misogyny identification can be considered one of the most important branches of the hate speech detection task. A successful solution to this problem could significantly limit the diffusion of hate speech against women. The problem of automatic misogynous language detection attracted attention from the research community only fairly recently, and the shared task on automatic misogyny identification held as part of the EVALITA 2018 campaign is one of the first efforts to deal with this problem (Fersini et al., 2018b). The aim of this task is to automatically identify misogynous content in tweets for the Italian and English languages.

This paper describes our system, which outperformed all other systems for the Italian language and also showed fairly good results for English. The system uses semantic features of tweets as the input of a supervised classifier; the semantic features are latent vectors produced by a vector space model.

Our work is organized as follows. Section 2 briefly describes related work on the proposed task. Section 3 describes the setup of our system, while Section 4 discusses the results and proposes an analysis of them. Section 5 concludes the paper.

                 Task A (Italian)   Task B (Italian)   Task A (English)   Task B (English)
Baseline         0.830              0.487              0.605              0.370
TFIDF+LR         0.842              0.443              0.649              0.241
TFIDF+XGB        0.836              0.493              0.604              0.309
TFIDF+SVD+LR     0.844              0.478              0.628              0.275
TFIDF+SVD+XGB    0.833              0.463              0.605              0.254

Table 1: Performance of each of the compared vectorizers and supervised classifiers on each of the tasks. Task A reports accuracy, Task B reports macro F1-measure.

2 Related Work
The first notable works on the task of automatic misogyny identification were described in the shared tasks proposed at the IberEval 2018 workshop (Fersini et al., 2018a) (a shared task organized jointly with the SEPLN 2018 conference for Iberian languages) and at SemEval 2019 (https://competitions.codalab.org/competitions/19935). These tasks proposed certain baselines based on ubiquitous text classification techniques (for example, SVM). The automatic misogyny identification task considered in our research is the third shared task on this topic (Anzovino et al., 2018). We are also aware of certain other attempts to computationally resolve the task of automatic misogyny identification, but most of them were published only as exploratory analyses (Hewitt et al., 2016). Most of the state-of-the-art approaches to this problem were described in system reports for the aforementioned IberEval 2018 shared task. As far as we know, there were no other scholarly works trying to resolve or to formalize this task.

In the natural language processing community, very similar tasks were also considered in other online hate speech challenges and scholarly works (Davidson et al., 2017). An extensive overview of all the research related to hate speech detection goes beyond the scope of this work; an interested reader is referred to a survey paper specialized on this topic (Schmidt and Wiegand, 2017).

Apart from computational linguistics and natural language processing, the problem of misogynous speech has also been a focus of some linguistic and social science articles (Fulper et al., 2014). Most of such scholarly works tried to understand the nature of misogynous hate speech and the patterns appearing in this type of language (Poland, 2016). We think that, from the perspective of natural language processing, such papers could be useful for systems that are strongly grounded in linguistic knowledge and manually crafted resources.

3 Experimental Setup

In the shared task we had two datasets (one for English and one for Italian) of 5000 tweets each. 4000 tweets in each dataset were considered as a training sample, and the evaluation of the system was done on the remaining 1000 tweets (their labels were hidden until the end of the competition). The classification task included both binary and multi-label classification.

In our work we used vectors from a term-document matrix with TF-IDF values. We propose text classification based on semantic features obtained from vector space models of texts. We considered the terms to be word n-grams, and used a factorization of the term-document matrix (by means of singular value decomposition, SVD) followed by a normalization of the factorized values (in the table with the results we call this setting TFIDF+SVD). From this perspective, our approach is very close to the method of Latent Semantic Analysis (Landauer et al., 1998); we have also tried to resolve this task using the non-factorized TF-IDF matrix, called TFIDF in the table. As a supervised classifier we used Logistic Regression; therefore, our system is based on TF-IDF word n-gram features and a Logistic Regression classifier (LR).

For all the vectorization methods we used a basic text pre-processing pipeline (tokenization, lemmatization, and stop-word removal based on NLTK built-in tools and resources).

We have also compared this classifier with others (for instance, a Gradient Boosting classifier, XGB in the table) and got worse results on certain tasks. All in all, we compared four different models.
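The TFIDF+SVD+LR setting described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' actual implementation: the toy tweets, the n-gram range, and the number of SVD components are arbitrary assumptions, and the NLTK pre-processing step is omitted for brevity.

```python
# Minimal sketch of a TFIDF+SVD+LR text classifier (illustrative toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy texts standing in for the AMI tweets (1 = misogynous).
train_texts = [
    "you are a terrible woman",
    "what a lovely day today",
    "women should stay silent",
    "great match last night",
]
train_labels = [1, 0, 1, 0]

# TF-IDF over word n-grams, SVD factorization of the term-document matrix,
# normalization of the factorized vectors, then Logistic Regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),      # word uni- and bigrams (assumed)
    TruncatedSVD(n_components=2, random_state=0),  # toy dimensionality
    Normalizer(copy=False),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)
predictions = model.predict(["silly woman", "nice weather"])
```

Dropping the TruncatedSVD and Normalizer steps from the pipeline gives the plain TFIDF setting compared in Table 1; swapping the final estimator gives the XGB variants.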
The exact hyperparameters of the models used in our system, and all the code for reproducing the experiments, can be found in our GitLab repository: https://gitlab.com/bakarov/ami-evalita.

4 Results and Discussion

The system was evaluated on two subtasks. The first subtask (Task A) proposed a binary classification to identify whether a text is misogynous or not. The second subtask (Task B) was to classify the misogynous tweets according to both the misogynistic behavior (multi-label classification) and the target of the message (binary classification). The results of the system for the English and Italian subtasks of the misogyny identification task are reported in Table 1. It is notable that our system outperformed the baseline set by the organizers in most of the cases, and that different combinations of vectorizers and models showed different performance on different tasks.

An error analysis of the system revealed that it fails on examples where misogyny is expressed without (or with very little) offensive lexis, or, vice versa, where such lexis is used in a non-misogynous context (for example, you pussy boy). This could be explained by the fact that the system is too focused on the lexicon and does not take into account syntactic patterns or thematic roles.

5 Conclusions

This work has described the system that showed the best results for the Italian track on all the subtasks (and also achieved fairly good results on English). Our system is based on a vector space model of word n-grams and supervised classifiers (Logistic Regression and Gradient Boosting).

The system described in this paper is one of the first attempts in the natural language processing community at the problem of detecting misogynistic language for the Italian language. We think that the description of the implementation of our system could help other researchers resolve this important and timely task. We consider this to be the main contribution of our research.

In future work we plan to give more attention to other linguistic features based on an analysis of the patterns that people tend to use in misogynous language. We would also like to try out more promising approaches to text classification based on deep learning (for example, convolutional neural networks).

References

Anzovino, M., Fersini, E., and Rosso, P. (2018). Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57-64. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Fersini, E., Anzovino, M., and Rosso, P. (2018a). Overview of the task on automatic misogyny identification at IberEval. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018). CEUR Workshop Proceedings. CEUR-WS.org, Seville, Spain.

Fersini, E., Nozza, D., and Rosso, P. (2018b). Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI). In Caselli, T., Novielli, N., Patti, V., and Rosso, P., editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Fulper, R., Ciampaglia, G. L., Ferrara, E., Ahn, Y., Flammini, A., Menczer, F., Lewis, B., and Rowe, K. (2014). Misogynistic language on Twitter and sexual violence. In Proceedings of the ACM Web Science Workshop on Computational Approaches to Social Modeling (ChASM).

Hewitt, S., Tiropanis, T., and Bokhove, C. (2016). The problem of identifying misogynist language on Twitter (and other online social spaces). In Proceedings of the 8th ACM Conference on Web Science, pages 333-335. ACM.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259-284.

Poland, B. (2016). Haters: Harassment, Abuse, and Violence Online. University of Nebraska Press.

Schmidt, A. and Wiegand, M. (2017). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10.