Detecting Hate Speech Against Women in English Tweets

Resham Ahluwalia, Himani Soni, Edward Callow, Anderson Nascimento, Martine De Cock∗
School of Engineering and Technology, University of Washington Tacoma
{resh,himanis7,ecallow,andclay,mdecock}@uw.edu
∗ Guest Professor at Dept. of Applied Mathematics, Computer Science and Statistics, Ghent University

Abstract

English. Hate speech is prevalent in social media platforms. Systems that can automatically detect offensive content are of great value to assist human curators with the removal of hateful language. In this paper, we present machine learning models developed at UW Tacoma for the detection of misogyny, i.e. hate speech against women, in English tweets, and the results obtained with these models in the shared task for Automatic Misogyny Identification (AMI) at EVALITA2018.

Italiano. Offensive comments directed at people of a different sexual orientation or social background are nowadays prevalent on social media platforms. Automatic systems that can detect content which is offensive towards certain social groups are therefore important to help the moderators of these platforms remove such comments. In this article, we present both the machine learning models developed at the University of Washington Tacoma for the detection of misogyny, i.e. offensive language used against women in English tweets, and the results obtained with these models in the automatic misogyny identification task at EVALITA2018.

1 Introduction

Inappropriate user generated content is of great concern to social media platforms. Although social media sites such as Twitter generally prohibit hate speech (https://help.twitter.com/en/rules-and-policies/twitter-rules), it thrives online due to a lack of accountability and insufficient supervision. Although social media companies hire employees to moderate content (Gershgorn and Murphy, 2017), the number of social media posts exceeds the capacity of humans to monitor without the assistance of automated detection systems.

In this paper, we focus on the automatic detection of misogyny, i.e. hate speech against women, in tweets that are written in English. We present machine learning (ML) models trained for the tasks posed in the competition for Automatic Misogyny Identification (AMI) at EVALITA2018 (Fersini et al., 2018b). Within this competition, Task A was the binary classification problem of labeling a tweet as misogynous or not. As becomes clear from Table 1, Task B consisted of two parts: the multiclass classification problem of assigning a misogynous tweet to the correct category of misogyny (e.g. sexual harassment, stereotype, ...), and the binary classification problem of determining whether a tweet is actively targeted against a specific person or not.

Interest in the use of ML for the automatic detection of online harassment and hate speech is fairly recent (Razavi et al., 2010; Nobata et al., 2016; Anzovino et al., 2018; Zhang and Luo, 2018). Most relevant to our work are approaches published in the context of a recent competition on automatic misogyny identification organized at IberEval2018 (Fersini et al., 2018a), which posed the same binary classification and multiclass classification tasks addressed in this paper. The AMI baseline system for each task in the AMI@IberEval competition was an SVM trained on a unigram representation of the tweets, where each tweet was represented as a bag of words (BOW) composed of 1000 terms.
We participated in the AMI@IberEval competition with an Ensemble of Classifiers (EoC) containing a Logistic Regression model, an SVM, a Random Forest, a Gradient Boosting model, and a Stochastic Gradient Descent model, all trained on a BOW representation of the tweets (composed of both word unigrams and word bigrams) (Ahluwalia et al., 2018). In AMI@IberEval, our team resham was the 7th best team (out of 11) for Task A, and the 3rd best team (out of 9) for Task B. The winning system for Task A in AMI@IberEval was an SVM trained on vectors with lexical features extracted from the tweets, such as the number of swear words in the tweet, whether the tweet contains any words from a lexicon with sexist words, etc. (Pamungkas et al., 2018). Very similarly, the winning system for the English tweets in Task B in AMI@IberEval was also an SVM trained on lexical features derived from the tweets, using lexicons that the authors built specifically for the competition (Frenda et al., 2018).

For the AMI@EVALITA competition, which is the focus of the current paper, we experimented with the extraction of lexical features based on dedicated lexicons as in (Pamungkas et al., 2018; Frenda et al., 2018). For Task A, we were the 2nd best team (resham.c.run3), with an EoC approach based on BOW features, lexical features, and sentiment features. For Task B, we were the winning team (himani.c.run3) with a two-step approach: in the first step, we trained an LSTM (Long Short-Term Memory) neural network to classify a tweet as misogynous or not; tweets that are labeled as misogynous in step 1 are subsequently assigned a category and target label in step 2 with an EoC approach trained on bags of words, bigrams, and trigrams. In Section 2 we provide more details about our methods for Task A and Task B. In Section 3 we present and analyze the results.

2 Description of the System

The training data consists of 4,000 labeled tweets that were made available to participants in the AMI@EVALITA competition. As Table 1 shows, the distribution of the tweets over the various labels is imbalanced; the large majority of misogynistic tweets in the training data, for instance, belong to the category "Discredit". In addition, the distribution of tweets in the test data differs from that in the training data. As the ground truth labels for the test data were only revealed after the competition, we constructed and evaluated the ML models described below using 5-fold cross-validation on the training data.

Task A: Misogyny     Train  Test
Non-misogynous        2215   540
Misogynous            1785   460

Task B: Category     Train  Test
Non-misogynous        2215   540
Discredit             1014   141
Sexual harassment      352    44
Stereotype             179   140
Dominance              148   124
Derailing               92    11

Task B: Target       Train  Test
Non-misogynous        2215   540
Active                1058   401
Passive                727    59

Table 1: Distribution of tweets in the dataset

2.1 Task A: Misogyny

Text Preprocessing. We used NLTK (https://www.nltk.org/, TweetTokenizer) to tokenize the tweets and to remove English stopwords.

Feature Extraction. We extracted three kinds of features from the tweets:

• Bag of Words Features. We turned the preprocessed tweets into BOW vectors by counting the occurrences of token unigrams in tweets, normalizing the counts and using them as weights.

• Lexical Features. Inspired by the work of (Pamungkas et al., 2018; Frenda et al., 2018), we extracted the following features from the tweets:
  – Link Presence: 1 if there is a link or URL present in the tweet; 0 otherwise.
  – Hashtag Presence: 1 if there is a hashtag present; 0 otherwise.
  – Swear Word Count: the number of swear words from the noswearing dictionary (https://www.noswearing.com/dictionary) that appear in the tweet.
  – Swear Word Presence: 1 if there is a swear word from the noswearing dictionary present in the tweet; 0 otherwise.
  – Sexist Slur Presence: 1 if there is a sexist word from the list in (Fasoli et al., 2015) present in the tweet; 0 otherwise.
  – Women Word Presence: 1 if there is a woman synonym word (https://www.thesaurus.com/browse/woman) present in the tweet; 0 otherwise.

• Sentiment Scores. We used SentiWordNet (Baccianella et al., 2010) to retrieve a positive and a negative sentiment score for each word occurring in the tweet, and computed the average of those numbers to obtain an aggregated positive score and an aggregated negative score for the tweet.
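To make the feature definitions above concrete, the snippet below shows one way to implement the preprocessing and the lexical and sentiment features with NLTK and SentiWordNet. It is a minimal sketch, not the competition code: the swear word, sexist slur, and woman synonym lists are assumed to have been loaded from the resources cited above, the required NLTK corpora are assumed to be downloaded, and the sentiment scores are averaged over the words for which SentiWordNet returns a synset.

```python
# Minimal sketch (not the competition code) of the Task A preprocessing and of
# the lexical and sentiment features described above. The swear word, sexist
# slur, and woman synonym lists are assumed to be loaded from the cited
# resources; the required NLTK corpora are assumed to be downloaded.
import re
from nltk.corpus import stopwords, sentiwordnet as swn
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    """Tokenize a tweet with TweetTokenizer and drop English stopwords."""
    return [t for t in tokenizer.tokenize(tweet.lower()) if t not in stop_words]

def lexical_features(tweet, tokens, swear_words, sexist_slurs, woman_words):
    """Binary and count features from Section 2.1 (word lists are placeholders)."""
    swear_count = sum(1 for t in tokens if t in swear_words)
    return {
        "link_presence": int(bool(re.search(r"https?://\S+", tweet))),
        "hashtag_presence": int("#" in tweet),
        "swear_word_count": swear_count,
        "swear_word_presence": int(swear_count > 0),
        "sexist_slur_presence": int(any(t in sexist_slurs for t in tokens)),
        "women_word_presence": int(any(t in woman_words for t in tokens)),
    }

def sentiment_scores(tokens):
    """Average positive/negative SentiWordNet scores over words with a synset."""
    pos = neg = n = 0
    for t in tokens:
        synsets = list(swn.senti_synsets(t))
        if synsets:
            pos += synsets[0].pos_score()
            neg += synsets[0].neg_score()
            n += 1
    return (pos / n, neg / n) if n else (0.0, 0.0)
```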
Model Training. We trained 3 EoC models for designating a tweet as misogynous or not (Task A). The EoC models differ in the kind of features they consume as well as in the kinds of classifiers that they contain internally.

• EoC with BOW (resham.c.run2): an ensemble consisting of a Random Forest classifier (RF), a Logistic Regression classifier (LR), a Stochastic Gradient Descent (SGD) classifier, and a Gradient Boosting (GB) classifier, each of them trained on the BOW features. (Here 'resham.c.run2' refers to the second run submitted by the author in connection with the competition; similar run names that follow have a corresponding meaning.)

• EoC with BOW and sentiment scores (resham.c.run1): an ensemble consisting of the same 4 kinds of classifiers as above, each of them trained on the BOW and sentiment score features.

• EoC with BOW, sentiment scores, and lexical features (resham.c.run3): an ensemble consisting of
  – RF on the BOW and sentiment score features
  – GB on the BOW and sentiment score features
  – SVM on the lexical features
  – GB on the lexical features
  – LR on the lexical features.

All the ensembles use hard voting. For training the classifiers we used scikit-learn (Pedregosa et al., 2011) with the default choices for all parameters.
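Since all classifiers were used with scikit-learn defaults and combined by hard voting, the ensembles can be sketched with scikit-learn's VotingClassifier. The snippet below shows the resham.c.run2 configuration (RF, LR, SGD, and GB on the BOW features); it is an illustrative sketch rather than the exact competition code, and the feature matrix X_bow and label vector y are assumed to have been built as described above.

```python
# Sketch of a hard-voting Ensemble of Classifiers (EoC) for Task A, here in the
# resham.c.run2 configuration: RF, LR, SGD and GB, all on the BOW features,
# with scikit-learn default parameters.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier

def build_task_a_ensemble():
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier()),
            ("lr", LogisticRegression()),
            ("sgd", SGDClassifier()),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="hard",  # majority vote over the predicted labels
    )

# X_bow: normalized unigram counts, y: 1 = misogynous, 0 = not (built elsewhere)
# eoc = build_task_a_ensemble().fit(X_bow, y)
```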
2.2 Task B: Category and Target

For Task B, our winning system himani.c.run3 consists of a pipeline of two classifiers: the first classifier (step 1) in the pipeline labels a tweet as misogynous or not, while the second classifier (step 2) assigns the tweets that were labeled misogynous to their proper category and target.

For Step 1 we trained a deep neural network that consists of a word embedding layer, followed by a bi-directional LSTM layer with 50 cells, a hidden dense layer with 50 cells with ReLU activation, and an output layer with sigmoid activation. For the embedding layer we used the pretrained Twitter embedding from the GloVe package (Pennington et al., 2014), which maps each word to a 100-dimensional numerical vector. The LSTM network is trained to classify tweets as misogynous or not. We participated with this trained network in Task A of the competition as well (himani.c.run3). The results were not as good as those obtained with the models described in Section 2.1, so we do not go into further detail.
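A sketch of this Step 1 architecture is given below. The paper does not state which deep learning framework was used; the sketch uses TensorFlow/Keras, and the vocabulary size, maximum tweet length, and the embedding matrix to be filled with the 100-dimensional GloVe Twitter vectors are placeholders, not values taken from the original system.

```python
# Sketch of the Step 1 network: pretrained GloVe Twitter embeddings, a
# bidirectional LSTM with 50 cells, a dense ReLU layer with 50 units, and a
# sigmoid output unit. Vocabulary size and embedding_matrix (to be filled with
# glove.twitter.27B.100d vectors) are placeholders.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM

VOCAB_SIZE = 20000   # placeholder: number of words kept from the tweets
EMBED_DIM = 100      # dimension of the GloVe Twitter vectors

embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))  # fill with GloVe vectors

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    Bidirectional(LSTM(50)),
    Dense(50, activation="relu"),
    Dense(1, activation="sigmoid"),  # probability that the tweet is misogynous
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```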
Next we describe how we trained the models used in Step 2 in himani.c.run3.

Text Preprocessing. We used the same text preprocessing as in Section 2.1. In addition we removed words occurring in more than 60 percent of the tweets, along with those that had a word frequency less than 4.

Feature Extraction. We turned the preprocessed tweets into Bag of N-Gram vectors by counting the occurrences of token unigrams, bigrams and trigrams in tweets, normalizing the counts and using them as weights. For simplicity, we keep referring to this as a BOW representation.
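This representation can be sketched with scikit-learn's vectorizers. In the snippet below, max_df and min_df are document-frequency thresholds that only approximate the word filtering described above, and since the exact normalization of the counts is not specified in the paper, the sketch L2-normalizes the counts per tweet (TF-IDF with the IDF term switched off); both choices are assumptions.

```python
# Sketch of the Bag of N-Gram features for Step 2: token unigrams, bigrams and
# trigrams, dropping n-grams that occur in more than 60% of the tweets or in
# fewer than 4 tweets (document-frequency approximation of the described
# filtering), with counts L2-normalized per tweet.
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = TweetTokenizer()

ngram_vectorizer = TfidfVectorizer(
    tokenizer=tokenizer.tokenize,
    ngram_range=(1, 3),  # unigrams, bigrams and trigrams
    max_df=0.6,          # drop n-grams occurring in more than 60% of tweets
    min_df=4,            # drop n-grams occurring in fewer than 4 tweets
    use_idf=False,       # keep plain (normalized) counts as weights
    norm="l2",
)
# X_ngrams = ngram_vectorizer.fit_transform(train_tweets)
```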
Model Training. For category and target identification, himani.c.run3 uses an EoC approach where all classifiers are trained on the BOW features mentioned above. The EoC models for category identification on the one hand, and target detection on the other hand, differ in the classifiers they contain internally and in the values of the hyperparameters. Below we list parameter values that differ from the default values in scikit-learn (Pedregosa et al., 2011).

• EoC for Category Identification:
  – LR: inverse of regularization strength C is 0.7; norm used in the penalization is L1; optimization algorithm is 'saga'.
  – RF: number of trees is 250; splitting attributes are chosen based on information gain.
  – SGD: loss function is 'modified huber'; constant that multiplies the regularization term is 0.01; maximum number of passes over the training data is 5.
  – Multinomial Naive Bayes: all parameters set to defaults.
  – XGBoost: maximum depth of a tree is 25; number of trees is 200.

• EoC for Target Identification:
  – LR: inverse of regularization strength C is 0.5; norm used in the penalization is L1; optimization algorithm is 'saga'.
  – RF: number of trees is 200; splitting attributes are chosen based on information gain.
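Translated into scikit-learn and XGBoost parameters, the non-default settings listed above could look roughly as follows. This is a sketch under the assumption that the Step 2 ensembles also use hard voting, as the Task A ensembles do; the XGBoost classifier comes from the separate xgboost package.

```python
# Sketch of the Step 2 ensembles with the non-default settings listed above,
# assuming hard voting as in the Task A ensembles; XGBClassifier comes from the
# separate xgboost package.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

category_eoc = VotingClassifier(estimators=[
    ("lr", LogisticRegression(C=0.7, penalty="l1", solver="saga")),
    ("rf", RandomForestClassifier(n_estimators=250, criterion="entropy")),
    ("sgd", SGDClassifier(loss="modified_huber", alpha=0.01, max_iter=5)),
    ("nb", MultinomialNB()),
    ("xgb", XGBClassifier(max_depth=25, n_estimators=200)),
], voting="hard")

target_eoc = VotingClassifier(estimators=[
    ("lr", LogisticRegression(C=0.5, penalty="l1", solver="saga")),
    ("rf", RandomForestClassifier(n_estimators=200, criterion="entropy")),
], voting="hard")

# Both ensembles are fit on the Bag of N-Gram vectors of the tweets that Step 1
# labeled as misogynous, e.g. category_eoc.fit(X_ngrams, category_labels).
```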
For completeness we mention that himani.c.run2 consisted of a two-step approach very similar to the one outlined above. In Step 1 of himani.c.run2, tweets are labeled as misogynous or not with an EoC model (RF, XGBoost) trained on the Bag of N-Gram features. In Step 2, a category and a target label are assigned with, respectively, an EoC model (LR, XGBoost) and an EoC model (LR, RF), in which all classifiers are trained on the Bag of N-Gram features as well. Since this approach is highly similar to the himani.c.run3 approach described above and did not give better results, we do not go into further detail.

3 Results and Discussion

3.1 Results for Task A

Table 2 presents accuracy results for Task A, i.e. classifying tweets as misogynous or not, evaluated with 5-fold cross-validation (CV) on the 4,000 tweets in the training data from Table 1. In addition, the last column of Table 2 contains the accuracy when the models are trained on all 4,000 tweets and subsequently applied to the test data. We include a simple majority baseline algorithm that labels all tweets as non-misogynous, which is the most common class in the training data.

Approach           5-fold CV on Train   Test
majority baseline               0.553  0.540
resham.c.run1                   0.790  0.648
resham.c.run2                   0.787  0.647
resham.c.run3                   0.795  0.651
himani.c.run3                   0.785  0.614

Table 2: Accuracy results for Task A: Misogyny detection on English tweets.

The accuracy on the test data is noticeably lower than the accuracy obtained with 5-fold CV on the training data. At first sight this is surprising, because the label distributions are very similar: 45% of the training tweets are misogynous, and 46% of the testing tweets are misogynous. Looking more carefully at the distribution across the different categories of misogyny in Table 1, one can observe that the training and test datasets do vary quite a lot in the kind (category) of misogyny. It is plausible that tweets in different misogyny categories are characterized by their own particular language, and that during training our binary classifiers have simply become good at flagging misogynous tweets from categories that occur most often in the training data, leaving them under-prepared to detect tweets from other categories.

Regardless, one can see that the ensembles benefit from having more features available. Recall that resham.c.run2 was trained on BOW features, resham.c.run1 on BOW features and sentiment scores, and resham.c.run3 on BOW features, sentiment scores, and lexical features. As is clear from Table 2, the addition of each feature set increases the accuracy. As already mentioned in Section 2.2, the accuracy of himani.c.run3, which is a bidirectional LSTM that takes tweets as strings of words as its input, is lower than that of the resham models, which involve explicit feature extraction.

3.2 Results for Task B

Table 3 contains the results of our models for Task B in terms of F1-scores. Following the approach used on the AMI@EVALITA scoreboard, both subtasks are evaluated as multiclass classification problems. For Category detection, there are 6 possible class labels, namely the label 'non-misogynous' and each of the 5 category labels. Similarly, for Target detection, there are 3 possible class labels, namely 'non-misogynous', 'Active', and 'Passive'.

When singling out a specific class c as the "positive" class, the corresponding F1-score for that class is defined as usual as the harmonic mean of the precision and recall for that class. These values are computed by treating all tweets with ground truth label c as positive examples, and all other tweets as negative examples. For example, when computing the F1-score for the label "Sexual harassment" in the task of Category detection, all tweets with ground truth label "Sexual harassment" are treated as positive examples, while the tweets from the other 4 categories of misogyny and the non-misogynous tweets are considered to be negative examples. The average of the F1-scores computed in this way for the 5 categories of misogyny is reported in the columns F1 (Category) in Table 3, while the average of the F1-scores for 'Active' and 'Passive' is reported in the columns F1 (Target) in Table 3.
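This per-class evaluation corresponds to macro-averaged F1-scores restricted to the five misogyny categories (respectively to 'Active' and 'Passive'), and can be sketched with scikit-learn as shown below; the label strings are placeholders for however the classes are encoded, and the Average F1 is taken as the mean of the Category and Target scores, consistent with the values in Table 3.

```python
# Sketch of the Task B scoring: per-class F1-scores macro-averaged over the five
# misogyny categories and over the two target classes; the 'non-misogynous'
# class is excluded from the averages. Label strings are placeholders.
from sklearn.metrics import f1_score

CATEGORIES = ["discredit", "sexual_harassment", "stereotype",
              "dominance", "derailing"]
TARGETS = ["active", "passive"]

def task_b_scores(y_true_cat, y_pred_cat, y_true_tgt, y_pred_tgt):
    f1_category = f1_score(y_true_cat, y_pred_cat,
                           labels=CATEGORIES, average="macro")
    f1_target = f1_score(y_true_tgt, y_pred_tgt,
                         labels=TARGETS, average="macro")
    return f1_category, f1_target, (f1_category + f1_target) / 2
```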
The first columns of Table 3 contain results obtained with 5-fold CV over the training data with the 4,000 tweets from Table 1, while the last columns contain results for models trained on the entire training data of 4,000 tweets and subsequently applied to the test data. The latter correspond to the results on the competition scoreboard.

                                      5-fold CV on Train                                Test
Approach                              F1 (Category)  F1 (Target)  Average F1   F1 (Category)  F1 (Target)  Average F1
majority baseline                             0.079        0.209       0.135           0.049        0.286       0.167
himani.c.run2                                 0.283        0.622       0.452           0.323        0.431       0.377
himani.c.run3                                 0.313        0.626       0.469           0.361        0.451       0.406
Step 1 from resham.c.run3 &
  Step 2 from himani.c.run3                   0.278        0.515       0.396           0.246        0.361       0.303

Table 3: F1-score results for Task B on English tweets

As a simple baseline model, we include an algorithm that labels every tweet as misogynous and subsequently assigns it to the most frequently occurring Category and Target from the training data, i.e. 'Discredit' and 'Active'. This model has a very low precision, which explains why its F1-scores are so low. The best results on the test data are obtained with himani.c.run3, which is an EoC approach using a BOW representation of extracted word unigrams, bigrams, and trigrams as features. This was the best performing model for Task B in the AMI@EVALITA competition.

Recall that himani.c.run3 uses a two-step approach where tweets are initially labeled as misogynous or not (Step 1) and then assigned to a Category and Target (Step 2). Given that for the task in Step 1 the binary classifier of himani.c.run3 was outperformed in terms of accuracy by the binary classifier of resham.c.run3 (see Table 2), an obvious question is whether higher F1-scores for Task B could be obtained by combining the binary classifier for misogyny detection from resham.c.run3 with the EoC models for Category and Target identification from himani.c.run3. As the last row in Table 3 shows, this is not the case. To give more insight into where the differences in predictive performance in the last two rows of Table 3 stem from, Table 4 contains more detailed results about the precision, recall, and F1-scores for Task B: Target Identification on the train as well as the test data, while Tables 5 and 6 contain the corresponding confusion matrices on the test data. These tables reveal that the drop in F1-scores in the last row of Table 3 is due to a substantial drop in recall.

                               5-fold CV on Train                             Test
Approach                       Pr(A) Re(A) F1(A) Pr(P) Re(P) F1(P)    Pr(A) Re(A) F1(A) Pr(P) Re(P) F1(P)
himani.c.run3                   0.61  0.79  0.69  0.53  0.56  0.54     0.61  0.75  0.67  0.14  0.61  0.23
Step 1 from resham.c.run3 &
  Step 2 from himani.c.run3     0.70  0.70  0.70  0.51  0.31  0.39     0.67  0.45  0.54  0.17  0.19  0.18

Table 4: Detailed precision (Pr), recall (Re), and F1-score (F1) results for Task B: Target Identification on English tweets; 'A' and 'P' refer to 'Active' and 'Passive' respectively.

                    Predicted value
                    N     A     P
Actual value   N   202   176   162
               A    40   301    60
               P     8    15    36

Table 5: Confusion matrix for Task B: Target Identification with himani.c.run3 on the test data; 'N', 'A', and 'P' refer to 'Non-misogynous', 'Active' and 'Passive' respectively.

                    Predicted value
                    N     A     P
Actual value   N   428    78    34
               A   201   182    18
               P    38    10    11

Table 6: Confusion matrix for Task B: Target Identification with Step 1 from resham.c.run3 and Step 2 from himani.c.run3 on the test data; 'N', 'A', and 'P' refer to 'Non-misogynous', 'Active' and 'Passive' respectively.

As can be seen in Table 4, replacing the binary classifier in Step 1 by the method from resham.c.run3 causes the recall for 'Active' tweets in the test data to drop from 0.75 to 0.45, and for 'Passive' tweets from 0.61 to 0.19. The slight increase in precision is not sufficient to compensate for the loss in recall. As can be inferred from Tables 5 and 6, the overall recall of misogynous tweets with himani.c.run3 is (301 + 60 + 15 + 36)/460 ≈ 0.896, while with resham.c.run3 in Step 1 it is only (182 + 18 + 10 + 11)/460 ≈ 0.480.

4 Conclusion

In this paper we presented machine learning models developed at UW Tacoma for the detection of hate speech against women in English language tweets, and the results obtained with these models in the shared task for Automatic Misogyny Identification (AMI) at EVALITA2018. For the binary classification task of distinguishing between misogynous and non-misogynous tweets, we obtained our best results (2nd best team) with an Ensemble of Classifiers (EoC) approach trained on 3 kinds of features: bag of words, sentiment scores, and lexical features. For the multiclass classification tasks of Category and Target Identification, we obtained our best results (winning team) with an EoC approach trained on a bag of words representation containing unigrams, bigrams, and trigrams. All EoC models contain traditional machine learning classifiers, such as logistic regression and tree ensemble models.

Thus far, the success of our deep learning models has been modest. This could be due to the limited size of the dataset and/or the limited length of tweets. Regarding the latter, an interesting direction to explore next is training neural networks that can consume the tweets at character level instead of at word level, as we did in this paper.

References

Resham Ahluwalia, Evgeniia Shcherbinina, Edward Callow, Anderson Nascimento, and Martine De Cock. 2018. Detecting misogynous tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 242–248.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Fabio Fasoli, Andrea Carnaghi, and Maria Paola Paladino. 2015. Social acceptability of sexist derogatory and sexist objectifying slurs across contexts. Language Sciences, 52:98–107.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018a. Overview of the task on automatic misogyny identification at IberEval. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 214–228.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018b. Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Simona Frenda, Bilal Ghanem, and Manuel Montes-y-Gómez. 2018. Exploration of misogyny in Spanish and English tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 260–267.

Dave Gershgorn and Mike Murphy. 2017. Facebook is hiring more people to moderate content than Twitter has at its entire company. https://qz.com/1101455/facebook-fb-is-hiring-more-people-to-moderate-content-than-twitter-twtr-has-at-its-entire-company/.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proc. of the 25th International Conference on World Wide Web, pages 145–153.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTo for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny in English and Spanish tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 234–241.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, pages 1532–1543.

Amir H. Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. 2010. Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence, pages 16–27. Springer.

Ziqi Zhang and Lei Luo. 2018. Hate speech detection: A solved problem? The challenging case of long tail on Twitter. arXiv preprint arXiv:1803.03662.