Automatic Expansion of Lexicons for Multilingual Misogyny Detection
Simona Frenda
Università degli Studi di Torino, Italy
Universitat Politècnica de València, Spain
simona.frenda@unito.it

Bilal Ghanem
Universitat Politècnica de València, Spain
bigha@doctor.upv.es

Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez and Luis Villaseñor-Pineda
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
{fany.guzman, mmontesg, villasen}@inaoep.mx

Abstract

English. The automatic misogyny identification (AMI) task proposed at IberEval and EVALITA 2018 is an example of the active involvement of scientific research in countering the online spread of hateful content against women. Considering the encouraging results obtained for Spanish and English in the previous edition of AMI, in the EVALITA framework we tested the robustness of a similar approach, based on topic and stylistic information, on a new collection of Italian and English tweets. Moreover, to deal with the dynamism of language on social platforms, we also propose an approach based on automatically enriched lexica. Although resources such as lexica prove useful for a specific domain like misogyny, the analysis of the results reveals the limitations of the proposed approaches.

Italiano. Il task AMI circa l'identificazione automatica della misoginia proposto a IberEval e a EVALITA 2018 è un chiaro esempio dell'attivo coinvolgimento della Ricerca per fronteggiare la diffusione online di contenuti di odio contro le donne. Considerando i promettenti risultati ottenuti per spagnolo e inglese nella precedente edizione di AMI, nel contesto di EVALITA abbiamo testato la robustezza di un approccio simile, basato su informazioni stilistiche e di dominio, su una nuova collezione di tweet in inglese e in italiano. Tenendo conto dei repentini cambiamenti del linguaggio nei social network, proponiamo anche un approccio basato su lessici automaticamente estesi. Nonostante risorse come i lessici risultino utili per domini specifici come quello della misoginia, analizzando i risultati emergono i limiti degli approcci proposti.

1 Introduction

The anonymity and interactivity typical of computer-mediated communication facilitate the spread of hate messages and the persistent presence of hateful content online. As investigated by Fox et al. (2015), these factors also increase and influence social misbehavior offline. To foster research on solutions that could help monitor the spread of hate speech, different tasks have been proposed in various evaluation campaigns. An example is the AMI shared task proposed at IberEval 2018 (http://amiibereval2018.wordpress.com/) and later at EVALITA 2018 (http://amievalita2018.wordpress.com/). This task focuses on the automatic identification of misogyny in different languages. In particular, the first edition focuses on Spanish and English, and the second one on a new English corpus and on Italian. The multilingual context makes it possible to observe the analogies and differences between languages. The AMI organizers (Fersini et al., 2018a; Fersini et al., 2018b) asked participants first to detect misogynistic tweets and then to classify the category of misogyny and the kind of target (individuals or groups). In the first edition, we proposed an approach based on stylistic and topic information, captured respectively by means of character n-grams and a set of modeled lexica (Frenda et al., 2018). Considering the encouraging results obtained with the lexicon-based approach in Spanish and English, we re-proposed a similar approach for Italian and a new collection of English tweets in
order to test the performance and robustness of this approach. In this paper we propose two approaches. The first one, similar to previous work (Frenda et al., 2018), involves topic, linguistic and stylistic information. The second one focuses mainly on the automatic extension of the original lexica. To deal with the continuous variation of language on social platforms, the modeled lexica are enriched with words that are contextually similar to them, as measured with pre-trained word embeddings. This technique helps the system to also consider new terms related to the topics of the original lexica. It could be a good way to automatically update the word lists used to block offensive content in real applications of Internet companies. Indeed, a comparison between the two approaches reveals that the automatic enrichment of the lexica improves the results, especially for English. However, comparing the results obtained in both competitions and observing the error analyses, we notice that lexica are a good resource for a specific domain like misogyny, but they are not sufficient to detect misogyny online.

The remainder of the paper is organized as follows. Section 2 describes the studies that inspired our work. Section 3 explains the approaches employed in both languages. Section 4 discusses the obtained results and draws some conclusions.

2 Related Work

An early work on misogyny detection is that of Anzovino et al. (2018). In this study, the authors compared the performance of different supervised approaches using word embeddings, stylistic and syntactic features. In particular, their results reveal that the best machine learning approach for the identification of misogyny is the linear Support Vector Machine (SVM) classifier. In general, machine learning techniques are the most used in hate speech detection (Escalante et al., 2017; Nobata et al., 2016), because they allow researchers to explore the issue closely by exploiting different features, such as textual (Chen et al., 2012) and syntactic aspects (Burnap and Williams, 2014) or semantic and sentiment information (Samghabadi et al., 2017; Nobata et al., 2016; Gitari et al., 2015). Finally, some recent works have also investigated the potential of deep learning techniques (Mehdad and Tetreault, 2016; Del Vigna et al., 2017). Considering the specific domain of hate against women, this work exploits stylistic, linguistic and topic information about misogynistic speech. In particular, unlike previous studies, we use specific lexica concerning offensiveness towards women and the discrediting of women for English and Italian, and we extend them with new words related to the topics of these lexica. Given that commercial methods currently rely on blacklists to monitor or block offensive content, the proposed approach could help to update such blacklists by automating the lexicon-building process.

3 Proposed Approaches

The AMI shared task proposed at EVALITA 2018 aims to detect misogyny in English and Italian collections of tweets. The organizers asked participants to detect misogynistic texts (Task A) and then, if the tweet is predicted as misogynistic, to distinguish the nature of the target (individuals or groups, labeled respectively "active" and "passive") and to identify the type of misogyny (Task B), according to the following classes proposed by Poland (2016): (a) stereotype and objectification, (b) dominance, (c) derailing, (d) sexual harassment and threats of violence, and (e) discredit. These classes represent the different manifestations and aspects of this social misbehavior. Table 1 shows the composition of the datasets.

Considering the promising results obtained at the IberEval campaign, in this work we use two approaches mainly based on lexica. The first one (Section 3.1) is similar to the approach used in Frenda et al. (2018), based on topic, linguistic and stylistic information captured by means of modeled lexica and n-grams of characters and words. The second one (Section 3.2) principally involves the automatically extended versions of the original lexica (Guzmán Falcón, 2018). In particular, we aim: 1) to test the robustness of lexicon-based approaches on the new collections of tweets and in a new language, and 2) to understand the impact of automatically enriched lexica in coping with language variation in multilingual computer-mediated communication.

                                             Misogynistic                               Non-misogynistic
                           (a)   (b)   (c)       (d)         (e)    active   passive
           Italian
           Training set    668   71    24        431         634     1721      97             2172
           Test set        175   61     2        170         104      446      66              488
           English
           Training set    179   148   92        352        1014     1058      727            2215
           Test set        140   124   11         44         141     401       59              540

                          Table 1: Composition of AMI’s datasets at EVALITA 2018.
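
To make the target-class distribution in Table 1 easier to compare across languages, the following minimal Python sketch computes the share of the rarer target class in each training set. The counts are copied from Table 1; the variable and function names are hypothetical and chosen only for this illustration.

```python
# Target-class counts (active, passive) for the training sets, copied from Table 1.
TRAIN_TARGET_COUNTS = {
    "Italian": {"active": 1721, "passive": 97},
    "English": {"active": 1058, "passive": 727},
}


def minority_share(counts):
    """Return the fraction of examples that belong to the rarest class."""
    total = sum(counts.values())
    return min(counts.values()) / total


for language, counts in TRAIN_TARGET_COUNTS.items():
    print(f"{language}: passive share = {minority_share(counts):.1%}")
```

With these counts, passive targets make up about 5.3% of the Italian training tweets that carry a target label, against about 40.7% for English.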


3.1 Approach 1: using manually-modeled lexica (MML)

The first proposed approach aims to capture topic, linguistic and stylistic information by means of manually-modeled lexica and n-grams of words and characters. The features used for each language are described below.

English Features. For the detection of misogyny in English tweets, we employed the manually-modeled lexica proposed in Frenda et al. (2018). These lexica concern sexuality, profanity, femininity and the human body, as described in Table 2.

These lexica also contain slang expressions. Moreover, we take into account the hashtags and abbreviations collected in Frenda et al. (2018): 40 misogynistic hashtags, such as #ihatefemales or #bitchesstink, and a list of 50 negative abbreviations, such as wtf or stfu. Considering the most relevant word n-grams, we employ bigrams for the first task and the combination of unigrams, bigrams and trigrams (hereafter UBT) for the second task. Moreover, a bag of characters (BoC) in a range from 1 to 7 grams is employed to handle misspellings and to capture stylistic aspects of digital writing. To perform the experiments, each tweet is represented as a vector. The presence of words from each lexicon is weighted with Information Gain, and character and word n-grams are weighted with the Term Frequency-Inverse Document Frequency (TF-IDF) measure. In addition, since several misclassified misogynistic tweets in Frenda et al. (2018) were ironic or sarcastic, we try to analyze the impact of irony on misogyny detection in English. Indeed, Ford and Boxer (2011) reveal that sexist jokes, which are generally considered innocent, are in fact experienced by women as sexual harassment. In particular, inspired by Barbieri and Saggion (2014), we calculate the imbalance of the sentiment polarities (positive and negative) in each tweet using SentiWordNet, provided by Baccianella et al. (2010). Each degree of imbalance is associated with a weight used in the vector representation of the tweets. Although our hypothesis is well founded, we obtained lower results for the runs that include sentiment imbalance among the features (see Table 4).

Italian Features. For the Italian language, we selected some specific issue groups, described in Bassignana et al. (2018), from the Italian lexicon "Le parole per ferire" provided by Tullio De Mauro (http://www.internazionale.it/opinione/tullio-de-mauro/2016/09/27/razzismo-parole-ferire). In particular, we consider the lists of words described in Table 3. Unlike for English, the experiments (carried out using grid search) reveal that UBT is useful for both tasks and that the best range for BoC is from 3 to 5 grams. Indeed, in a morphologically complex language like Italian, word endings (such as the extracted n-grams "tona" or "ana ") contain relevant linguistic information. In English, by contrast, longer character sequences could help to capture multi-word expressions that also contain pronouns, adjectives or prepositions, such as "ing at" or "ss bitc".

To extract the features correctly and train our models, we pre-process the data by deleting emoticons, emojis and URLs. In our experiments, emoticons and emojis did not prove to be relevant for these tasks. To match the vocabulary of the corpora against each lexicon correctly, we use the lemmatizer provided by the Natural Language Toolkit (NLTK, http://www.nltk.org/) for English, and the Snowball Stemmer for Italian. Unlike for English, using a lemmatizer for Italian tweets hinders the match.

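
The bag-of-characters representation described in this section can be illustrated with a small, dependency-free Python sketch: it removes URLs, extracts character n-grams in a given range, and weights them with TF-IDF. The function names are hypothetical, emoji and emoticon removal is omitted, and the TF-IDF variant (raw term frequency times log(N / document frequency)) is an assumption, since the exact formula used in the experiments is not stated.

```python
import math
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(tweet):
    """Lowercase the tweet and drop URLs (emoji removal is omitted here)."""
    return URL_PATTERN.sub("", tweet).lower().strip()


def char_ngrams(text, n_min, n_max):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(tweets, n_min=1, n_max=7):
    """Map each tweet to a {n-gram: TF-IDF weight} dictionary.

    Assumed variant: raw term frequency times log(N / document frequency).
    """
    bags = [Counter(char_ngrams(preprocess(t), n_min, n_max)) for t in tweets]
    # Iterating over a Counter yields each distinct n-gram once per tweet.
    doc_freq = Counter(gram for bag in bags for gram in bag)
    n_docs = len(bags)
    return [{gram: tf * math.log(n_docs / doc_freq[gram])
             for gram, tf in bag.items()}
            for bag in bags]


# Range 3..5 mirrors the best BoC range reported for Italian; 1..7 is the default.
vectors = tfidf_vectors(["good morning", "good night"], n_min=3, n_max=5)
```

Under this weighting, an n-gram that occurs in every tweet (here "goo") receives weight zero, while n-grams specific to one tweet receive positive weights.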
 Lexicon          Words      Definition
 Sexuality        290        contains words relating to sexual subjects (orgasm, orgy, pussy) and especially to male domination
                             over women (rape, pimp, slave)
 Profanity        170        is a collection of vulgar words such as motherfucker, slut and scum
 Femininity       90         is a list of terms used to identify women as the target. It contains personal pronouns or possessive
                             adjectives (such as she, her, herself), common words used to refer to women (girl, mother) and
                             also words offensive towards women (such as barbie, hooker or non-male)
 Human body       50         is a lexicon strongly connected with sexuality, collecting words that refer especially to the female
                             body, also with negative connotations (such as holes, throat or boobs)

                                       Table 2: Composition of English lexica.

 Lexicon    Words       Definition
 AN         111         collects words relating to animals, such as sanguisuga or pecora
 ASF        31          contains terms referring to female genitalia, such as fessa
 ASM        76          contains terms referring to male genitalia, such as verga
 CDS        298         is a list of derogatory words, such as bastardo or spazzatura
 OR         17          contains words derived from plants that are used as offensive words, such as finocchio or rapa
 PA         83          is a list of professions or jobs that also have a negative connotation, such as portinaia or impiegato
 PR         54          contains terms about prostitution, such as bagascia or zoccolona
 PS         42          is a list of words relating to stereotypes, such as negro or ostrogoto
 QAS        82          collects words that in general have negative connotations, such as parassita or dilettante
 RE         37          contains terms relating to criminal acts or immoral actions, such as stupro or violento

                                        Table 3: Composition of Italian lexica.
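
The lexica in Tables 2 and 3 are closed word lists, which is why the second approach (Section 3.2) enriches them automatically. As a rough illustration of that idea, the Python sketch below averages the embeddings of the lexicon words and adds the k vocabulary words closest to that average by cosine similarity. The function names, the toy two-dimensional vectors, the skipping of lexicon words that have no embedding, and the exclusion of words already in the lexicon from the candidates are all assumptions made for this illustration; the implementation used in the experiments may differ.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_lexicon(lexicon, embeddings, k):
    """Return the lexicon plus the k vocabulary words closest to its centroid."""
    # Assumption: lexicon words without a pre-trained embedding are skipped.
    known = [embeddings[word] for word in lexicon if word in embeddings]
    if not known:
        return set(lexicon)
    context = mean_vector(known)  # the "context embedding" of the lexicon
    # Assumption: words already in the lexicon are not candidates.
    candidates = [word for word in embeddings if word not in lexicon]
    candidates.sort(key=lambda word: cosine(context, embeddings[word]),
                    reverse=True)
    return set(lexicon) | set(candidates[:k])


# Toy, hand-made vectors (hypothetical; real embeddings have many dimensions).
toy_embeddings = {
    "girl": [1.0, 0.0],
    "mother": [0.8, 0.2],
    "woman": [0.9, 0.1],
    "table": [0.0, 1.0],
    "chair": [0.1, 0.9],
}
expanded = expand_lexicon({"girl", "mother"}, toy_embeddings, k=1)
```

In this toy setting the centroid of "girl" and "mother" points in the same direction as "woman", so that word is the one added. Sorting the whole vocabulary is acceptable for a sketch; with a real embedding table a vectorized nearest-neighbour search would be the more practical choice.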


3.2 Approach 2: using automatically-enriched lexica (AEL)

The second approach aims to deal with the dynamism of informal language online by capturing new words related to the contexts defined in each lexicon. Therefore, we use enriched versions of the original lexica (described above), together with stylistic and linguistic information captured by means of n-grams of words and characters, as in the first approach. The method for expanding a given lexicon identifies new words by considering their contextual similarity to known words, as defined by pre-trained word embeddings. To describe it, let us assume that $L = \{l_1, \ldots, l_m\}$ is the initial lexicon of $m$ words, and $W = \{(w_1, e(w_1)), \ldots, (w_n, e(w_n))\}$ is the set of pre-trained word embeddings, where each pair represents a word and its corresponding embedding vector. This method aims to enrich the lexicon with words strongly related to the context of the original lexicon, without their being necessarily associated with any particular word. The idea is to search for words whose contexts are similar to those of the entire lexicon. The method has two main steps, described below.

Dictionary modeling. First, we extract the embedding $e(l_i)$ for each word $l_i \in L$; then, we compute the average of these vectors, $e(L) = \frac{1}{m} \sum_{i=1}^{m} e(l_i)$, to obtain a vector describing the entire lexicon. We call this vector the context embedding.

Dictionary expansion. Using the cosine similarity, we compare $e(L)$ against the embedding $e(w_i)$ of each word $w_i$ in $W$; then, we extract the $k$ words most similar to $e(L)$, defining the set $E_L = \{w_1, \ldots, w_k\}$. Finally, we insert the extracted words into the original lexicon to build the new lexicon, i.e., $L_E = L \cup E_L$.

We carry out the experiments using different pre-trained word embeddings for each language: GloVe embeddings trained on 2 billion tweets (Pennington et al., 2014) for English, and word embeddings built on the TWITA corpus (http://valeriobasile.github.io/twita/about.html) for Italian (Basile and Novielli, 2014). Finally, the proposed expansion method is parametric and requires a value for $k$, the number of words that extend each lexicon. In particular, we use $k$ = 1000, 500 and 100.

3.3 Experiments and Results

To carry out the experiments, an SVM classifier is employed with the radial basis function (RBF) kernel using the following parameters: C = 5 and γ = 0.1 for English and γ = 0.01 for Italian. Considering the complexity of the target classification for Italian, due to the imbalanced training set (see Table 1), we used a Random Forest (RF) classifier that aggregates the votes from different
decision trees to decide the final class of the tweet.

The evaluation is performed using the test set provided by the organizers of the AMI shared task. For the competition, the evaluation measures are Accuracy for Task A and the average of the F-scores of both classes for Task B.

        English
        Run              Approach     Accuracy    Rank
        run 2 ⁷          AEL           0.613       17
        baseline AMI                   0.605       19
        run 1            AEL           0.592       21
        run 3            MML           0.584       25
        Italian
        Run              Approach     Accuracy    Rank
        baseline AMI                   0.830        7
        run 1            AEL           0.824        9
        run 3 ⁸          AEL           0.823       11
        run 2            MML           0.822       12

          Table 4: Results obtained in Task A.
          ⁷ This run does not involve the sentiment imbalance.
          ⁸ This run involves the expansions of lexica with k = 100.

Table 4 and Table 5 show the results obtained in the competition compared with the baselines provided by the organizers for each task. Comparing the two approaches, AEL in general seems to work better than MML. However, the improvement is very slight, especially for Italian. This small difference is unexpected considering the results obtained during the experiments with 10-fold cross-validation, where AEL with lexica enriched using k = 100 reached an Accuracy of 0.880. Moreover, looking at Table 4, which reports the official results of the AMI task, only run 2 exceeds the baseline for the detection of misogyny in English; for this run we used the AEL approach, excluding sentiment imbalance as a feature. For the identification of misogyny in Italian, the obtained results are lower than the provided baselines, as are the F-scores obtained in Task B for both languages (see Table 5). Despite the usefulness of lexica for a specific domain like misogyny, a lexicon-based approach proves to be insufficient for this task. Indeed, as the error analysis confirms, misogyny, like hate speech in general, involves linguistic devices such as humour, exclamations typical of spoken language, and contextual information that completes the meaning conveyed by the tweet. Moreover, the low values obtained in Task B suggest the need for a dedicated approach for each misogynistic category.

                    English
                    Run             Categories   F-score   Target      F-score    Total    Rank
                    baseline AMI                  0.342                 0.399     0.370       3
                    run 2           UBT           0.282    UBT+BoC      0.407     0.344       6
                    run 1           UBT           0.282    UBT+BoC      0.389     0.335       8
                    run 3           UBT           0.269    UBT+BoC      0.387     0.328      10
                    Italian
                    Run             Categories   F-score   Target      F-score    Total    Rank
                    baseline AMI                  0.534                 0.440     0.487       2
                    run 3           UBT+BoC       0.485    UBT+BoC      0.414     0.449       7
                    run 1           UBT+BoC       0.483    UBT+BoC      0.414     0.448       8
                    run 2           UBT+BoC       0.480    UBT+BoC      0.411     0.446      10

                                     Table 5: Results obtained in Task B.

4 Discussion and Conclusions

This paper reports our participation in the AMI shared task. The organizers also provide the gold test set, which helps us to better understand the misclassified cases and the aspects that should be considered in future experiments. Carrying out the error analysis, we notice that in both datasets the content behind a URL affects the information conveyed by the tweet (such as "Right! As they rape and butcher women and children !!!!!! https://t.co/maEhwuYQ8B"). Swear words are also often used as exclamations without the aim to offend (such as "Volevo dire alla Yamamay che tettona non sinonimo di curvy dato che di vita ha una 40, quindi confidence sta minchia."). Moreover, although the current English corpus does not contain many jokes, the misclassified Italian tweets include humorous utterances (such as "@GrianneOhmsfor1 @BarbaraRaval A parte il fatto poi che “culona inchiavabile” è il miglior giudizio politico sentito sulla Merkel negli ultimi anni??"). In general, humour, irony and sarcasm hinder the correct classification of the texts, as we also noticed in the English and Spanish corpora provided in the IberEval framework. Participating in this shared task gave us the opportunity to analyze and compare multilingual datasets and thus to discover and infer general aspects typical of hate speech against women.

Acknowledgments

The work of Simona Frenda was partially funded by the Spanish research project SomEMBED TIN2015-71147-C2-1-P (MINECO/FEDER). We also acknowledge the support of CONACYT-Mexico (projects FC-2410, CB-2015-01-257383).

References

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic identification and classification of misogynistic language on Twitter. In International Conference on Applications of Natural Language to Information Systems, pages 57–64.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Francesco Barbieri and Horacio Saggion. 2014. Modelling irony in Twitter. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the ACL.

Pierpaolo Basile and Nicole Novielli. 2014. UNIBA at EVALITA 2014-SENTIPOLC task: Predicting tweet sentiment polarity combining micro-blogging, lexicon and semantic features. In Proceedings of EVALITA 2014.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In Proceedings of CLiC-it, Turin, 10-12 December 2018, CEUR.

Peter Burnap and Matthew Leighton Williams. 2014. Hate speech, machine classification and statistical modelling of information flows on Twitter: Interpretation and communication for policy decision making. Internet, Policy & Politics.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), pages 71–80. IEEE.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of ITASEC17.

Hugo Jair Escalante, Esaú Villatoro-Tello, Sara E Garza, A Pastor López-Monroy, Manuel Montes-y-Gómez, and Luis Villaseñor-Pineda. 2017. Early detection of deception and aggressiveness using profile-based representations. Expert Systems with Applications, 89:99–111.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018a. Overview of the task on automatic misogyny identification at IberEval. In Proceedings of Workshop IBEREVAL at 3rd SEPLN.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018b. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Thomas E Ford and Christie Fitzgerald Boxer. 2011. Sexist humor in the workplace: A case of subtle harassment. In Insidious Workplace Behavior, pages 203–234. Routledge.

Jesse Fox, Carlos Cruz, and Ji Young Lee. 2015. Perpetuating online sexism offline: Anonymity, interactivity, and the effects of sexist hashtags on social media. Computers in Human Behavior, 52:436–442.

Simona Frenda, Bilal Ghanem, and Manuel Montes-y-Gómez. 2018. Exploration of misogyny in Spanish and English tweets. In Proceedings of Workshop IBEREVAL at 3rd SEPLN.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230.

Estefanía Guzmán Falcón. 2018. Detección de lenguaje ofensivo en Twitter basada en expansión automática de lexicones (tesis de maestría). Instituto Nacional de Astrofísica, Óptica y Electrónica. Puebla, México.

Yashar Mehdad and Joel Tetreault. 2016. Do characters abuse more than words? In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 299–303.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th international conference on WWW.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Bailey Poland. 2016. Haters: Harassment, abuse, and violence online. U of Nebraska Press.

Niloofar Safi Samghabadi, Suraj Maharjan, Alan Sprague, Raquel Diaz-Sprague, and Thamar Solorio. 2017. Detecting nastiness in social media. In Proceedings of ALW1.