CrotoneMilano for AMI at EVALITA 2018: A Performant, Cross-lingual Misogyny Detection System

Angelo Basile (Symanto Research, angelo.basile@symanto.net)
Chiara Rubagotti (Independent Researcher, chiara.rubagotti@gmail.com)

Abstract

We present our systems for misogyny identification on Twitter, for Italian and English. The models are based on a Support Vector Machine and use n-grams as features. Our solution is very simple, and yet we achieve top results on Italian tweets and excellent results on English tweets. Furthermore, we experiment with a single model that works across languages by leveraging abstract features, and we show that a single multilingual system yields performance comparable to two independently trained systems. We achieve accuracy results ranging from 45% to 85%. Our system is ranked first out of twelve submissions for sub-task B on Italian and second for sub-task A.

1 Introduction

With awareness of violence against women growing in public discourse, and with the spread of unfiltered and possibly anonymous communication on social media, the issue of online misogyny has become a compelling one. Violence against women has been described by the UN as a "Gender-based [..] form of discrimination that seriously inhibits women's ability to enjoy rights and freedoms on a basis of equality with men" (http://www.un.org/womenwatch/daw/cedaw/recommendations/recomm.htm). On the web this often takes the form of attacks that discriminate against women and undermine their rights to freedom of expression and participation (https://www.amnesty.org/en/latest/research/2018/03/online-violence-against-women-chapter-3/). Following the understanding of hate speech of Erjavec and Kovačič (2012), reported in (Pamungkas et al., 2018) as "any type of communication that is abusive, insulting, intimidating, harassing, and/or incites to violence or discrimination, and that disparages a person or a group on the basis of some characteristics such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics", we can define misogynistic speech as any kind of aggressive discourse which targets women because they are women. Within the larger context of hate speech, online misogyny (or cybersexism) stands out as a large and complex phenomenon which reflects other forms of offline abuse of women (Poland, 2016). This holds true for the Italian case as well, where bouts of misogynistic tweets have been linked to episodes of femicide (http://www.voxdiritti.it/wp-content/uploads//2018/06/mappa-intolleranza-3-donne.jpg).

In recent years the NLP community has addressed the automatic detection of hate speech in general (Schmidt and Wiegand, 2017) and of misogyny in particular (Anzovino et al., 2018). This effort to detect and contain verbal violence on social media (or in any kind of text) demonstrates how NLP tools can also be used for ethically beneficial purposes, and it belongs to the young but crucial discourse on ethics in NLP (Hovy and Spruit, 2016; Hovy et al., 2017; Alfano et al., 2018). We therefore take up the AMI challenge (Fersini et al., 2018) and present our contribution to the cause of stopping misogynistic speech on Twitter. In this paper we propose a simple linear model using n-grams, and we show that such a simple setup can still yield good results. We decided on a simple model for three reasons: first, it has been shown that a linear SVM can easily outperform more complex deep neural networks (Plank, 2017; Medvedeva et al., 2017); second, training and testing our model does not require expensive hardware, and a common laptop is enough to replicate our experiments; third, we experiment with a transformation of the input (i.e. we extract abstract features), and a linear model allows for an easier interpretation of the contribution of this transformation.

To summarise, the contributions of this paper are the following:

• We propose a simple yet strong misogyny detection system for English and Italian (ranked first out of twelve systems for misogynistic category detection).

• We show how a single system can be trained to work across languages.

• We release all the code and our trained systems (https://github.com/anbasile/AMI/) for reproducibility and for a quick implementation of language technology systems that can help detect and mitigate cybersexism.

Task Description. The AMI task is a combined binary and multi-label short-text classification task. Given a tweet, we have to predict whether or not it contains misogyny (Task A) and, if it does, we have to classify the misogynistic behaviour and predict who is being targeted (Task B). The space of misogynistic behaviours consists of five labels:

• Stereotype & Objectification
• Dominance
• Derailing
• Sexual Harassment & Threats of Violence
• Discredit

The target can be either Active, when the message refers to a specific person, or Passive, when the message expresses generic misogyny. The setup is the same for both Italian and English.
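To make these prediction targets concrete, the label space can be written down in a few lines of Python; this is only an illustrative sketch, and the constant names below are ours rather than part of the official task material.

    # Illustrative sketch of the AMI label space (constant names are ours).
    TASK_A_LABELS = ["misogynous", "non_misogynous"]  # binary decision

    TASK_B_BEHAVIOURS = [  # the five misogynistic behaviours
        "stereotype_objectification",
        "dominance",
        "derailing",
        "sexual_harassment_threats_of_violence",
        "discredit",
    ]

    TASK_B_TARGETS = ["active", "passive"]

    # A complete Task B prediction pairs one behaviour with one target,
    # e.g. ("discredit", "active").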
2 Data

We use only the data released by the task organisers, which consist of Italian and English tweets. The organisers report that the corpus was manually labelled by several annotators. Table 1 gives an overview of the data set: the data for Task A is more or less balanced, while the data for Task B is highly skewed.

                        ITALIAN                        ENGLISH
                  SO    DO    DE    ST    DI     SO    DO    DE    ST    DI
  Active         625    61    21   428   586     54    78    24   207   695
  Passive         40     9     3     2    43    125    70    68   145   319
  Non-Misogynous       2172                           2215

Table 1: Data set overview, showing the label distribution across the five misogynistic behaviours: Stereotype & Objectification (SO), Dominance (DO), Derailing (DE), Sexual Harassment & Threats of Violence (ST) and Discredit (DI).

3 Experiments

In this section we describe the feature extraction process and the model that we built.

3.1 Pre-processing

We decided not to pre-process the data in any way, since we have no linguistic (or non-linguistic) reason for doing so. To tokenize the text we simply split at every white space.

3.2 Model and Features

We built a sparse linear model for this task. We use n-grams extracted at the word level as well as at the character level, with 3-10 n-grams and binary tf-idf weighting. We feed these features to a Support Vector Machine (SVM) with a linear kernel, using the implementation included in scikit-learn (Pedregosa et al., 2011).
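The core of this setup fits in a few lines of scikit-learn. The sketch below is a minimal version of such a pipeline, assuming the combined word- and character-level representation; the word-level n-gram range is left at the library default and should be read as illustrative rather than as our exact submitted configuration, which is available in the released code.

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    model = Pipeline([
        ("features", FeatureUnion([
            # word-level n-grams over whitespace tokens (cf. Section 3.1)
            ("words", TfidfVectorizer(token_pattern=r"\S+", binary=True)),
            # character-level 3-10 grams with binary tf-idf
            ("chars", TfidfVectorizer(analyzer="char",
                                      ngram_range=(3, 10), binary=True)),
        ])),
        ("svm", LinearSVC()),
    ])

    # tweets: list of raw tweet strings; labels: gold labels of one sub-task
    # model.fit(tweets, labels)

The FeatureUnion simply concatenates the two sparse feature matrices, so the word and character spaces remain separately inspectable.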
Furthermore, we experiment with feature abstraction, following the bleaching approach recently proposed by van der Goot et al. (2018). First, we transform each word into a list of symbols that 1) represents the shape of its individual characters and 2) abstracts away from meaning while still approximating the vowels and consonants that compose the word; then, we compute the length of the word and its frequency (taking care to pad the former with a zero in order to avoid feature collisions); finally, we use a Boolean label to explicitly distinguish words from non-alphanumeric tokens (e.g. emojis). Table 2 shows an example of this feature abstraction process.

  WORD      SHAPE      FREQ   LEN   ALPHA
  This      Ccvc         46    04   True
  is        vc          650    02   True
  an        vc          116    02   True
  example   vcvcccv       1    07   True
  .         .            60    01   False
  (emoji)   (emoji)       1    01   False

Table 2: An illustration of the bleaching process.

van der Goot et al. (2018) proposed this bleaching approach for modelling gender across languages, leveraging the language-independent nature of these features; here, we re-use the technique for classifying misogynistic text across languages. We slightly modify their representation by merging the shape feature (e.g. Xxx) with the vowel-consonant approximation feature (e.g. CVC) into one single feature (e.g. Cvc).
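A rough sketch of this merged bleaching transformation is given below; the function name, the vowel inventory and the two-digit padding width are our own choices, picked so as to reproduce the rows of Table 2.

    def bleach(token, freq):
        """Map a token to the abstract features of Table 2: merged shape,
        corpus frequency, zero-padded length and an is-word flag."""
        shape = ""
        for ch in token:
            if ch.isalpha():
                sym = "v" if ch.lower() in "aeiou" else "c"
                shape += sym.upper() if ch.isupper() else sym
            else:
                shape += ch  # punctuation, digits and emojis are kept as-is
        return {
            "shape": shape,                   # e.g. "This" -> "Ccvc"
            "freq": str(freq),                # corpus frequency of the token
            "len": str(len(token)).zfill(2),  # zero-padded to avoid collisions
            "alpha": token.isalpha(),         # False for emojis, punctuation
        }

    # bleach("This", 46)
    # -> {'shape': 'Ccvc', 'freq': '46', 'len': '04', 'alpha': True}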
We propose three different multi-lingual experiments:

• TRAIN Italian → TEST English
• TRAIN English → TEST Italian
• TRAIN Italian & English → TEST Italian & English

For the last experiment, we use half of the data set for each language. We report scores obtained by training on the whole training set and testing on the official test set, using the gold labels released by the organisers after the evaluation period.

4 Evaluation and Results

Since the data set labels for sub-task B are not evenly distributed across the classes, we use the f1-score to evaluate our models. First we report results obtained via 10-fold cross-validation on the training set; then we report results on the official test set, whose labels have been released. The official evaluation does not take the joint prediction of the labels into account; here, however, we also report results considering the 0 (non-misogynous) label: since we train different models for the different label sets, we make sure that the models trained on Task B are able to detect whether a message is misogynistic in the first place.

4.1 Development Results

We report the development results obtained using different text representations; Table 3 presents an overview. All four representations (words, characters, a combination of the two, and the bleached representation) yield comparable results, with the combination of words and characters being the best format. Overall, the system performs better on the Italian corpus than on the English corpus.

                      ENGLISH               ITALIAN
               MIS.   CAT.   TGT.    MIS.   CAT.   TGT.
  Words (W)    0.68   0.29   0.57    0.88   0.60   0.59
  Chars (C)    0.71   0.30   0.61    0.88   0.59   0.58
  W+C          0.70   0.31   0.59    0.88   0.62   0.59
  Bleaching    0.68   0.27   0.57    0.85   0.55   0.56

Table 3: An overview of the development f1-macro scores obtained via cross-validation.

4.1.1 Cross-lingual Results

In Table 4 we present the results of our cross-lingual experiments, in which we train and test different systems using lexical and abstract features. We note that the abstract model trained on Italian outperforms the fully lexicalized model when tested on English, but the opposite is not true. The English data set seems particularly hard for both the abstract and the lexicalized model. Interestingly, the abstract model trained on both corpora shows good results.

                TEST →    IT     EN
  TRAIN  IT lex          0.85   0.51
         IT abs          0.83   0.52
         EN lex          0.47   0.62
         EN abs          0.45   0.52
         IT + EN lex     0.83   0.60
         IT + EN abs     0.81   0.58

Table 4: Pair-wise accuracy results for Task A, comparing lexicalized and abstract models. The combined IT+EN data set is built by randomly sampling 50% of the instances from both corpora.

4.2 Test Results

In Table 5 we present the official test results (Fersini et al., 2018). We submitted only one, constrained run; a run is considered constrained when only the data released by the organisers are used. We submitted the model using the combined word- and character-n-gram representation, trained once on the English corpus and once on the Italian corpus. On the Italian data set we achieve the top position for Task B and the second position for Task A; on the English data set our system is ranked 15th on Task A and 4th on Task B.

        TASK A            TASK B
                  CATEGORY   TARGET   AVG.
  IT    0.843        0.579    0.423   0.501
  EN    0.617        0.293    0.444   0.369

Table 5: Official test results. Task A is measured using accuracy and Task B using f1-score. We reach the first position on Task B for Italian.

5 Discussion and Conclusions

A warning to the reader: this section contains explicit language.

In an attempt to better understand the big difference in performance between the English and the Italian models, we show the importance of words as learned by the model: we print the ten most important words, ranked by their learned weights. The result is shown in Table 6. From the output we see that the model trained on Italian learned meaningful words for identifying a misogynistic message, such as zitta [shut up], tua [your] and muori [die!]: these words stand out from the rest of the profanity in that they directly refer to someone, while the remaining words, and almost all of the most important English words, could be used as interjections or as more generic insults.

  RANK   ITA          ENG
  1      zitta        woman
  2      bel          hoe
  3      pompinara    she
  4      puttanona    hoes
  5      tua          women
  6      muori        whore
  7      baldracca    her
  8      troie        bitches
  9      culona       womensuck
  10     tettona      bitch

Table 6: Top ten words ranked by their positive weights learned during training.
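With a linear model, such a ranking can be read directly off the learned weight vector. A minimal sketch, assuming a fitted word-level TfidfVectorizer and LinearSVC rather than the full combined pipeline of Section 3.2:

    import numpy as np

    def top_positive_features(vectorizer, svm, k=10):
        """Return the k feature names with the largest positive weights."""
        names = np.asarray(vectorizer.get_feature_names_out())
        weights = svm.coef_[0]               # binary task: one weight per feature
        top = np.argsort(weights)[::-1][:k]  # indices of the largest weights
        return list(zip(names[top], weights[top]))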
The results of the abstract system are satisfactory with a view to eventually building a light, portable model that could be adapted to different languages. In the future we will try training on English and Italian and testing on a third corpus (such as the Spanish version of the AMI data set).

In this paper we described our participation in AMI, the Automatic Misogyny Identification task for Italian and English. We proposed a very simple solution that can be implemented quickly, and we scored a state-of-the-art result for the classification of misogynistic behaviours into five classes.

Acknowledgements

The authors would like to thank the two anonymous reviewers who helped improve the quality of this paper. The first author conducted this research while he was part of the Erasmus Mundus master in Language and Communication Technology, a shared master program between the University of Groningen (NL) and the University of Malta (MT).

References

Mark Alfano, Dirk Hovy, Margaret Mitchell, and Michael Strube, editors. 2018. Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64. Springer.

Karmen Erjavec and Melita Poler Kovačič. 2012. "You don't understand, this is a new war!": Analysis of hate speech in news web sites' comments. Mass Communication and Society, 15(6):899–920.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598.

Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach, editors. 2017. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing.

Maria Medvedeva, Martin Kroon, and Barbara Plank. 2017. When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 156–163.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTO for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny in English and Spanish tweets.
In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), pages 234–241.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Barbara Plank. 2017. All-in-1 at IJCNLP-2017 Task 4: Short text classification with one model for all languages. In Proceedings of the IJCNLP 2017, Shared Tasks, pages 143–148.

Bailey Poland. 2016. Haters: Harassment, Abuse, and Violence Online. University of Nebraska Press.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 383–389.