Detecting Hate Speech Against Women in English Tweets

Resham Ahluwalia, Himani Soni, Edward Callow, Anderson Nascimento, Martine De Cock∗
School of Engineering and Technology, University of Washington Tacoma
{resh,himanis7,ecallow,andclay,mdecock}@uw.edu
∗ Guest Professor at Dept. of Applied Mathematics, Computer Science and Statistics, Ghent University

Abstract

English. Hate speech is prevalent in social media platforms. Systems that can automatically detect offensive content are of great value to assist human curators with the removal of hateful language. In this paper, we present machine learning models developed at UW Tacoma for the detection of misogyny, i.e. hate speech against women, in English tweets, and the results obtained with these models in the shared task for Automatic Misogyny Identification (AMI) at EVALITA2018.

Italiano. Offensive comments directed at people of a different sexual orientation or social background are nowadays prevalent on social media platforms. Automatic systems that can detect content which is offensive towards certain social groups are therefore important to help the moderators of these platforms remove such comments. In this article, we present both the machine learning models developed at the University of Washington Tacoma for the detection of misogyny, i.e. offensive language used against women in English tweets, and the results obtained with these models in the automatic misogyny identification task at EVALITA2018.

1 Introduction

Inappropriate user generated content is of great concern to social media platforms. Although social media sites such as Twitter generally prohibit hate speech (https://help.twitter.com/en/rules-and-policies/twitter-rules), it thrives online due to a lack of accountability and insufficient supervision. Although social media companies hire employees to moderate content (Gershgorn and Murphy, 2017), the number of social media posts exceeds the capacity of humans to monitor without the assistance of automated detection systems.

In this paper, we focus on the automatic detection of misogyny, i.e. hate speech against women, in tweets that are written in English. We present machine learning (ML) models trained for the tasks posed in the competition for Automatic Misogyny Identification (AMI) at EVALITA2018 (Fersini et al., 2018b). Within this competition, Task A was the binary classification problem of labeling a tweet as misogynous or not. As becomes clear from Table 1, Task B consisted of two parts: the multiclass classification problem of assigning a misogynous tweet to the correct category of misogyny (e.g. sexual harassment, stereotype, ...), and the binary classification problem of determining whether a tweet is actively targeted against a specific person or not.

Interest in the use of ML for the automatic detection of online harassment and hate speech is fairly recent (Razavi et al., 2010; Nobata et al., 2016; Anzovino et al., 2018; Zhang and Luo, 2018). Most relevant to our work are approaches published in the context of a recent competition on automatic misogyny identification organized at IberEval2018 (Fersini et al., 2018a), which posed the same binary classification and multiclass classification tasks addressed in this paper. The AMI baseline system for each task in the AMI@IberEval competition was an SVM trained on a unigram representation of the tweets, where each tweet was represented as a bag of words (BOW) composed of 1000 terms.
We participated in the AMI@IberEval competition with an Ensemble of Classifiers (EoC) containing a Logistic Regression model, an SVM, a Random Forest, a Gradient Boosting model, and a Stochastic Gradient Descent model, all trained on a BOW representation of the tweets (composed of both word unigrams and word bigrams) (Ahluwalia et al., 2018). In AMI@IberEval, our team resham was the 7th best team (out of 11) for Task A, and the 3rd best team (out of 9) for Task B. The winning system for Task A in AMI@IberEval was an SVM trained on vectors with lexical features extracted from the tweets, such as the number of swear words in the tweet, whether the tweet contains any words from a lexicon with sexist words, etc. (Pamungkas et al., 2018). Very similarly, the winning system for the English tweets in Task B in AMI@IberEval was also an SVM trained on lexical features derived from the tweets, using lexicons that the authors built specifically for the competition (Frenda et al., 2018).

For the AMI@EVALITA competition, which is the focus of the current paper, we experimented with the extraction of lexical features based on dedicated lexicons as in (Pamungkas et al., 2018; Frenda et al., 2018). For Task A, we were the 2nd best team (resham.c.run3), with an EoC approach based on BOW features, lexical features, and sentiment features. For Task B, we were the winning team (himani.c.run3) with a two-step approach: in the first step, we trained an LSTM (Long Short-Term Memory) neural network to classify a tweet as misogynous or not; tweets that are labeled as misogynous in step 1 are subsequently assigned a category and target label in step 2 with an EoC approach trained on bags of words, bigrams, and trigrams. In Section 2 we provide more details about our methods for Task A and Task B. In Section 3 we present and analyze the results.

2 Description of the System

The training data consists of 4,000 labeled tweets that were made available to participants in the AMI@EVALITA competition. As Table 1 shows, the distribution of the tweets over the various labels is imbalanced; the large majority of misogynistic tweets in the training data, for instance, belong to the category "Discredit". In addition, the distribution of tweets in the test data differs from that in the training data. As the ground truth labels for the test data were only revealed after the competition, we constructed and evaluated the ML models described below using 5-fold cross-validation on the training data.

Task A: Misogyny     Train  Test
Non-misogynous        2215   540
Misogynous            1785   460

Task B: Category     Train  Test
Non-misogynous        2215   540
Discredit             1014   141
Sexual harassment      352    44
Stereotype             179   140
Dominance              148   124
Derailing               92    11

Task B: Target       Train  Test
Non-misogynous        2215   540
Active                1058   401
Passive                727    59

Table 1: Distribution of tweets in the dataset

2.1 Task A: Misogyny

Text Preprocessing. We used NLTK (https://www.nltk.org/, TweetTokenizer) to tokenize the tweets and to remove English stopwords.

Feature Extraction. We extracted three kinds of features from the tweets:

• Bag of Words Features. We turned the preprocessed tweets into BOW vectors by counting the occurrences of token unigrams in tweets, normalizing the counts and using them as weights.

• Lexical Features. Inspired by the work of (Pamungkas et al., 2018; Frenda et al., 2018), we extracted the following features from the tweets:
  – Link Presence: 1 if there is a link or URL present in the tweet; 0 otherwise.
  – Hashtag Presence: 1 if there is a hashtag present; 0 otherwise.
  – Swear Word Count: the number of swear words from the noswearing dictionary (https://www.noswearing.com/dictionary) that appear in the tweet.
  – Swear Word Presence: 1 if there is a swear word from the noswearing dictionary present in the tweet; 0 otherwise.
  – Sexist Slur Presence: 1 if there is a sexist word from the list in (Fasoli et al., 2015) present in the tweet; 0 otherwise.
  – Women Word Presence: 1 if there is a woman synonym word (https://www.thesaurus.com/browse/woman) present in the tweet; 0 otherwise.

• Sentiment Scores. We used SentiWordNet (Baccianella et al., 2010) to retrieve a positive and a negative sentiment score for each word occurring in the tweet, and computed the average of those numbers to obtain an aggregated positive score and an aggregated negative score for the tweet.
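To make the feature definitions above concrete, the snippet below shows one way to implement the preprocessing and the lexical and sentiment features with NLTK and SentiWordNet. It is a minimal sketch, not the competition code: the swear word, sexist slur, and woman synonym lists are assumed to have been loaded from the resources cited above, the required NLTK corpora are assumed to be downloaded, and the sentiment scores are averaged over the words for which SentiWordNet returns a synset.

```python
# Minimal sketch (not the competition code) of the Task A preprocessing and of
# the lexical and sentiment features described above. The swear word, sexist
# slur, and woman synonym lists are assumed to be loaded from the cited
# resources; the required NLTK corpora are assumed to be downloaded.
import re
from nltk.corpus import stopwords, sentiwordnet as swn
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    """Tokenize a tweet with TweetTokenizer and drop English stopwords."""
    return [t for t in tokenizer.tokenize(tweet.lower()) if t not in stop_words]

def lexical_features(tweet, tokens, swear_words, sexist_slurs, woman_words):
    """Binary and count features from Section 2.1 (word lists are placeholders)."""
    swear_count = sum(1 for t in tokens if t in swear_words)
    return {
        "link_presence": int(bool(re.search(r"https?://\S+", tweet))),
        "hashtag_presence": int("#" in tweet),
        "swear_word_count": swear_count,
        "swear_word_presence": int(swear_count > 0),
        "sexist_slur_presence": int(any(t in sexist_slurs for t in tokens)),
        "women_word_presence": int(any(t in woman_words for t in tokens)),
    }

def sentiment_scores(tokens):
    """Average positive/negative SentiWordNet scores over words with a synset."""
    pos = neg = n = 0
    for t in tokens:
        synsets = list(swn.senti_synsets(t))
        if synsets:
            pos += synsets[0].pos_score()
            neg += synsets[0].neg_score()
            n += 1
    return (pos / n, neg / n) if n else (0.0, 0.0)
```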
Model Training. We trained 3 EoC models for designating a tweet as misogynous or not (Task A). The EoC models differ in the kind of features they consume as well as in the kinds of classifiers that they contain internally.

• EoC with BOW (resham.c.run2): an ensemble consisting of a Random Forest classifier (RF), a Logistic Regression classifier (LR), a Stochastic Gradient Descent (SGD) classifier, and a Gradient Boosting (GB) classifier, each of them trained on the BOW features. (Here 'resham.c.run2' refers to the second run submitted by the author in connection with the competition; similar run names that follow have a corresponding meaning.)

• EoC with BOW and sentiment scores (resham.c.run1): an ensemble consisting of the same 4 kinds of classifiers as above, each of them trained on the BOW and sentiment score features.

• EoC with BOW, sentiment scores, and lexical features (resham.c.run3): an ensemble consisting of
  – RF on the BOW and sentiment score features
  – GB on the BOW and sentiment score features
  – SVM on the lexical features
  – GB on the lexical features
  – LR on the lexical features.

All the ensembles use hard voting. For training the classifiers we used scikit-learn (Pedregosa et al., 2011) with the default choices for all parameters.
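Since all classifiers were used with scikit-learn defaults and combined by hard voting, the ensembles can be sketched with scikit-learn's VotingClassifier. The snippet below shows the resham.c.run2 configuration (RF, LR, SGD, and GB on the BOW features); it is an illustrative sketch rather than the exact competition code, and the feature matrix X_bow and label vector y are assumed to have been built as described above.

```python
# Sketch of a hard-voting Ensemble of Classifiers (EoC) for Task A, here in the
# resham.c.run2 configuration: RF, LR, SGD and GB, all on the BOW features,
# with scikit-learn default parameters.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier

def build_task_a_ensemble():
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier()),
            ("lr", LogisticRegression()),
            ("sgd", SGDClassifier()),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="hard",  # majority vote over the predicted labels
    )

# X_bow: normalized unigram counts, y: 1 = misogynous, 0 = not (built elsewhere)
# eoc = build_task_a_ensemble().fit(X_bow, y)
```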
2.2 Task B: Category and Target

For Task B, our winning system himani.c.run3 consists of a pipeline of two classifiers: the first classifier (step 1) in the pipeline labels a tweet as misogynous or not, while the second classifier (step 2) assigns the tweets that were labeled misogynous to their proper category and target.

For Step 1 we trained a deep neural network that consists of a word embedding layer, followed by a bi-directional LSTM layer with 50 cells, a hidden dense layer with 50 cells with ReLU activation, and an output layer with sigmoid activation. For the embedding layer we used the pretrained Twitter embedding from the GloVe package (Pennington et al., 2014), which maps each word to a 100-dimensional numerical vector. The LSTM network is trained to classify tweets as misogynous or not. We participated with this trained network in Task A of the competition as well (himani.c.run3). The results were not as good as those obtained with the models described in Section 2.1, so we do not go into further detail.
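A sketch of this Step 1 architecture is given below. The paper does not state which deep learning framework was used; the sketch uses TensorFlow/Keras, and the vocabulary size, maximum tweet length, and the embedding matrix to be filled with the 100-dimensional GloVe Twitter vectors are placeholders, not values taken from the original system.

```python
# Sketch of the Step 1 network: pretrained GloVe Twitter embeddings, a
# bidirectional LSTM with 50 cells, a dense ReLU layer with 50 units, and a
# sigmoid output unit. Vocabulary size and embedding_matrix (to be filled with
# glove.twitter.27B.100d vectors) are placeholders.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM

VOCAB_SIZE = 20000   # placeholder: number of words kept from the tweets
EMBED_DIM = 100      # dimension of the GloVe Twitter vectors

embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))  # fill with GloVe vectors

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    Bidirectional(LSTM(50)),
    Dense(50, activation="relu"),
    Dense(1, activation="sigmoid"),  # probability that the tweet is misogynous
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```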
Next we describe how we trained the models used in Step 2 in himani.c.run3.

Text Preprocessing. We used the same text preprocessing as in Section 2.1. In addition we removed words occurring in more than 60 percent of the tweets, along with those that had a word frequency less than 4.

Feature Extraction. We turned the preprocessed tweets into Bag of N-Gram vectors by counting the occurrences of token unigrams, bigrams and trigrams in tweets, normalizing the counts and using them as weights. For simplicity, we keep referring to this as a BOW representation.
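This representation can be sketched with scikit-learn's vectorizers. In the snippet below, max_df and min_df are document-frequency thresholds that only approximate the word filtering described above, and since the exact normalization of the counts is not specified in the paper, the sketch L2-normalizes the counts per tweet (TF-IDF with the IDF term switched off); both choices are assumptions.

```python
# Sketch of the Bag of N-Gram features for Step 2: token unigrams, bigrams and
# trigrams, dropping n-grams that occur in more than 60% of the tweets or in
# fewer than 4 tweets (document-frequency approximation of the described
# filtering), with counts L2-normalized per tweet.
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = TweetTokenizer()

ngram_vectorizer = TfidfVectorizer(
    tokenizer=tokenizer.tokenize,
    ngram_range=(1, 3),  # unigrams, bigrams and trigrams
    max_df=0.6,          # drop n-grams occurring in more than 60% of tweets
    min_df=4,            # drop n-grams occurring in fewer than 4 tweets
    use_idf=False,       # keep plain (normalized) counts as weights
    norm="l2",
)
# X_ngrams = ngram_vectorizer.fit_transform(train_tweets)
```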
Model Training. For category and target identification, himani.c.run3 uses an EoC approach where all classifiers are trained on the BOW features mentioned above. The EoC models for category identification on the one hand, and target detection on the other hand, differ in the classifiers they contain internally and in the values of the hyperparameters. Below we list parameter values that differ from the default values in scikit-learn (Pedregosa et al., 2011).

• EoC for Category Identification:
  – LR: inverse of regularization strength C is 0.7; norm used in the penalization is L1; optimization algorithm is 'saga'.
  – RF: number of trees is 250; splitting attributes are chosen based on information gain.
  – SGD: loss function is 'modified huber'; constant that multiplies the regularization term is 0.01; maximum number of passes over the training data is 5.
  – Multinomial Naive Bayes: all parameters set to defaults.
  – XGBoost: maximum depth of a tree is 25; number of trees is 200.

• EoC for Target Identification:
  – LR: inverse of regularization strength C is 0.5; norm used in the penalization is L1; optimization algorithm is 'saga'.
  – RF: number of trees is 200; splitting attributes are chosen based on information gain.
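Translated into scikit-learn and XGBoost parameters, the non-default settings listed above could look roughly as follows. This is a sketch under the assumption that the Step 2 ensembles also use hard voting, as the Task A ensembles do; the XGBoost classifier comes from the separate xgboost package.

```python
# Sketch of the Step 2 ensembles with the non-default settings listed above,
# assuming hard voting as in the Task A ensembles; XGBClassifier comes from the
# separate xgboost package.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

category_eoc = VotingClassifier(estimators=[
    ("lr", LogisticRegression(C=0.7, penalty="l1", solver="saga")),
    ("rf", RandomForestClassifier(n_estimators=250, criterion="entropy")),
    ("sgd", SGDClassifier(loss="modified_huber", alpha=0.01, max_iter=5)),
    ("nb", MultinomialNB()),
    ("xgb", XGBClassifier(max_depth=25, n_estimators=200)),
], voting="hard")

target_eoc = VotingClassifier(estimators=[
    ("lr", LogisticRegression(C=0.5, penalty="l1", solver="saga")),
    ("rf", RandomForestClassifier(n_estimators=200, criterion="entropy")),
], voting="hard")

# Both ensembles are fit on the Bag of N-Gram vectors of the tweets that Step 1
# labeled as misogynous, e.g. category_eoc.fit(X_ngrams, category_labels).
```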
For completeness we mention that himani.c.run2 consisted of a two-step approach very similar to the one outlined above. In Step 1 of himani.c.run2, tweets are labeled as misogynous or not with an EoC model (RF, XGBoost) trained on the Bag of N-Gram features. In Step 2, a category and a target label are assigned with, respectively, an EoC model (LR, XGBoost) and an EoC model (LR, RF), in which all classifiers are trained on the Bag of N-Gram features as well. Since this approach is highly similar to the himani.c.run3 approach described above and did not give better results, we do not go into further detail.

3 Results and Discussion

3.1 Results for Task A

Table 2 presents accuracy results for Task A, i.e. classifying tweets as misogynous or not, evaluated with 5-fold cross-validation (CV) on the 4,000 tweets in the training data from Table 1. In addition, the last column of Table 2 contains the accuracy when the models are trained on all 4,000 tweets and subsequently applied to the test data. We include a simple majority baseline algorithm that labels all tweets as non-misogynous, which is the most common class in the training data.

Approach           5-fold CV on Train   Test
majority baseline               0.553  0.540
resham.c.run1                   0.790  0.648
resham.c.run2                   0.787  0.647
resham.c.run3                   0.795  0.651
himani.c.run3                   0.785  0.614

Table 2: Accuracy results for Task A: Misogyny detection on English tweets.

The accuracy on the test data is noticeably lower than the accuracy obtained with 5-fold CV on the training data. At first sight this is surprising, because the label distributions are very similar: 45% of the training tweets are misogynous, and 46% of the testing tweets are misogynous. Looking more carefully at the distribution across the different categories of misogyny in Table 1, one can observe that the training and test datasets do vary quite a lot in the kind (category) of misogyny. It is plausible that tweets in different misogyny categories are characterized by their own particular language, and that during training our binary classifiers have simply become good at flagging misogynous tweets from categories that occur most often in the training data, leaving them under-prepared to detect tweets from other categories.

Regardless, one can see that the ensembles benefit from having more features available. Recall that resham.c.run2 was trained on BOW features, resham.c.run1 on BOW features and sentiment scores, and resham.c.run3 on BOW features, sentiment scores, and lexical features. As is clear from Table 2, the addition of each feature set increases the accuracy. As already mentioned in Section 2.2, the accuracy of himani.c.run3, which is a bidirectional LSTM that takes tweets as strings of words as its input, is lower than that of the resham models, which involve explicit feature extraction.

3.2 Results for Task B

Table 3 contains the results of our models for Task B in terms of F1-scores. Following the approach used on the AMI@EVALITA scoreboard, both subtasks are evaluated as multiclass classification problems. For Category detection, there are 6 possible class labels, namely the label 'non-misogynous' and each of the 5 category labels. Similarly, for Target detection, there are 3 possible class labels, namely 'non-misogynous', 'Active', and 'Passive'.

When singling out a specific class c as the "positive" class, the corresponding F1-score for that class is defined as usual as the harmonic mean of the precision and recall for that class. These values are computed by treating all tweets with ground truth label c as positive examples, and all other tweets as negative examples. For example, when computing the F1-score for the label "Sexual harassment" in the task of Category detection, all tweets with ground truth label "Sexual harassment" are treated as positive examples, while the tweets from the other 4 categories of misogyny and the non-misogynous tweets are considered to be negative examples. The average of the F1-scores computed in this way for the 5 categories of misogyny is reported in the columns F1 (Category) in Table 3, while the average of the F1-scores for 'Active' and 'Passive' is reported in the columns F1 (Target) in Table 3.
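This per-class evaluation corresponds to macro-averaged F1-scores restricted to the five misogyny categories (respectively to 'Active' and 'Passive'), and can be sketched with scikit-learn as shown below; the label strings are placeholders for however the classes are encoded, and the Average F1 is taken as the mean of the Category and Target scores, consistent with the values in Table 3.

```python
# Sketch of the Task B scoring: per-class F1-scores macro-averaged over the five
# misogyny categories and over the two target classes; the 'non-misogynous'
# class is excluded from the averages. Label strings are placeholders.
from sklearn.metrics import f1_score

CATEGORIES = ["discredit", "sexual_harassment", "stereotype",
              "dominance", "derailing"]
TARGETS = ["active", "passive"]

def task_b_scores(y_true_cat, y_pred_cat, y_true_tgt, y_pred_tgt):
    f1_category = f1_score(y_true_cat, y_pred_cat,
                           labels=CATEGORIES, average="macro")
    f1_target = f1_score(y_true_tgt, y_pred_tgt,
                         labels=TARGETS, average="macro")
    return f1_category, f1_target, (f1_category + f1_target) / 2
```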
The first columns of Table 3 contain results obtained with 5-fold CV over the training data with the 4,000 tweets from Table 1, while the last columns contain results for models trained on the entire training data of 4,000 tweets and subsequently applied to the test data. The latter correspond to the results on the competition scoreboard.

                                      5-fold CV on Train                                Test
Approach                              F1 (Category)  F1 (Target)  Average F1   F1 (Category)  F1 (Target)  Average F1
majority baseline                             0.079        0.209       0.135           0.049        0.286       0.167
himani.c.run2                                 0.283        0.622       0.452           0.323        0.431       0.377
himani.c.run3                                 0.313        0.626       0.469           0.361        0.451       0.406
Step 1 from resham.c.run3 &
  Step 2 from himani.c.run3                   0.278        0.515       0.396           0.246        0.361       0.303

Table 3: F1-score results for Task B on English tweets

As a simple baseline model, we include an algorithm that labels every tweet as misogynous and subsequently assigns it to the most frequently occurring Category and Target from the training data, i.e. 'Discredit' and 'Active'. This model has a very low precision, which explains why its F1-scores are so low. The best results on the test data are obtained with himani.c.run3, which is an EoC approach using a BOW representation of extracted word unigrams, bigrams, and trigrams as features. This was the best performing model for Task B in the AMI@EVALITA competition.

Recall that himani.c.run3 uses a two-step approach where tweets are initially labeled as misogynous or not (Step 1) and then assigned to a Category and Target (Step 2). Given that for the task in Step 1 the binary classifier of himani.c.run3 was outperformed in terms of accuracy by the binary classifier of resham.c.run3 (see Table 2), an obvious question is whether higher F1-scores for Task B could be obtained by combining the binary classifier for misogyny detection from resham.c.run3 with the EoC models for Category and Target identification from himani.c.run3. As the last row in Table 3 shows, this is not the case. To give more insight into where the differences in predictive performance in the last two rows of Table 3 stem from, Table 4 contains more detailed results about the precision, recall, and F1-scores for Task B: Target Identification on the train as well as the test data, while Tables 5 and 6 contain the corresponding confusion matrices on the test data. These tables reveal that the drop in F1-scores in the last row of Table 3 is due to a substantial drop in recall.

                               5-fold CV on Train                             Test
Approach                       Pr(A) Re(A) F1(A) Pr(P) Re(P) F1(P)    Pr(A) Re(A) F1(A) Pr(P) Re(P) F1(P)
himani.c.run3                   0.61  0.79  0.69  0.53  0.56  0.54     0.61  0.75  0.67  0.14  0.61  0.23
Step 1 from resham.c.run3 &
  Step 2 from himani.c.run3     0.70  0.70  0.70  0.51  0.31  0.39     0.67  0.45  0.54  0.17  0.19  0.18

Table 4: Detailed precision (Pr), recall (Re), and F1-score (F1) results for Task B: Target Identification on English tweets; 'A' and 'P' refer to 'Active' and 'Passive' respectively.

                    Predicted value
                    N     A     P
Actual value   N   202   176   162
               A    40   301    60
               P     8    15    36

Table 5: Confusion matrix for Task B: Target Identification with himani.c.run3 on the test data; 'N', 'A', and 'P' refer to 'Non-misogynous', 'Active' and 'Passive' respectively.

                    Predicted value
                    N     A     P
Actual value   N   428    78    34
               A   201   182    18
               P    38    10    11

Table 6: Confusion matrix for Task B: Target Identification with Step 1 from resham.c.run3 and Step 2 from himani.c.run3 on the test data; 'N', 'A', and 'P' refer to 'Non-misogynous', 'Active' and 'Passive' respectively.

As can be seen in Table 4, replacing the binary classifier in Step 1 by the method from resham.c.run3 causes the recall for 'Active' tweets in the test data to drop from 0.75 to 0.45, and for 'Passive' tweets from 0.61 to 0.19. The slight increase in precision is not sufficient to compensate for the loss in recall. As can be inferred from Tables 5 and 6, the overall recall of misogynous tweets with himani.c.run3 is (301 + 60 + 15 + 36)/460 ≈ 0.896, while with resham.c.run3 in Step 1 it is only (182 + 18 + 10 + 11)/460 ≈ 0.480.

4 Conclusion

In this paper we presented machine learning models developed at UW Tacoma for the detection of hate speech against women in English language tweets, and the results obtained with these models in the shared task for Automatic Misogyny Identification (AMI) at EVALITA2018. For the binary classification task of distinguishing between misogynous and non-misogynous tweets, we obtained our best results (2nd best team) with an Ensemble of Classifiers (EoC) approach trained on 3 kinds of features: bag of words, sentiment scores, and lexical features. For the multiclass classification tasks of Category and Target Identification, we obtained our best results (winning team) with an EoC approach trained on a bag of words representation containing unigrams, bigrams, and trigrams. All EoC models contain traditional machine learning classifiers, such as logistic regression and tree ensemble models.

Thus far, the success of our deep learning models has been modest. This could be due to the limited size of the dataset and/or the limited length of tweets. Regarding the latter, an interesting direction to explore next is training neural networks that can consume the tweets at character level instead of at word level, as we did in this paper.

References

Resham Ahluwalia, Evgeniia Shcherbinina, Edward Callow, Anderson Nascimento, and Martine De Cock. 2018. Detecting misogynous tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 242–248.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Fabio Fasoli, Andrea Carnaghi, and Maria Paola Paladino. 2015. Social acceptability of sexist derogatory and sexist objectifying slurs across contexts. Language Sciences, 52:98–107.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018a. Overview of the task on automatic misogyny identification at IberEval. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 214–228.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018b. Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Simona Frenda, Bilal Ghanem, and Manuel Montes-y-Gómez. 2018. Exploration of misogyny in Spanish and English tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 260–267.

Dave Gershgorn and Mike Murphy. 2017. Facebook is hiring more people to moderate content than Twitter has at its entire company. https://qz.com/1101455/facebook-fb-is-hiring-more-people-to-moderate-content-than-twitter-twtr-has-at-its-entire-company/.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proc. of the 25th International Conference on World Wide Web, pages 145–153.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTo for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny in English and Spanish tweets. In Proc. of IberEval 2018, volume 2150 of CEUR-WS, pages 234–241.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, pages 1532–1543.

Amir H. Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. 2010. Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence, pages 16–27. Springer.

Ziqi Zhang and Lei Luo. 2018. Hate speech detection: A solved problem? The challenging case of long tail on Twitter. arXiv preprint arXiv:1803.03662.