=Paper= {{Paper |id=Vol-2253/paper49 |storemode=property |title=Hurtlex: A Multilingual Lexicon of Words to Hurt |pdfUrl=https://ceur-ws.org/Vol-2253/paper49.pdf |volume=Vol-2253 |authors=Elisa Bassignana,Valerio Basile,Viviana Patti |dblpUrl=https://dblp.org/rec/conf/clic-it/BassignanaBP18 }} ==Hurtlex: A Multilingual Lexicon of Words to Hurt== https://ceur-ws.org/Vol-2253/paper49.pdf
             Hurtlex: A Multilingual Lexicon of Words to Hurt

                Elisa Bassignana and Valerio Basile and Viviana Patti
                              Dipartimento di Informatica
                                  University of Turin
                         {basile,patti}@di.unito.it
                       elisa.bassignana@edu.unito.it


Abstract

English. We describe the creation of HurtLex, a multilingual lexicon of hate words. The starting point is the Italian hate lexicon developed by the linguist Tullio De Mauro, organized in 17 categories. It has been expanded by linking it to available synset-based computational lexical resources such as MultiWordNet and BabelNet, and extended to multiple languages by semi-automatic translation and expert annotation. A twofold evaluation of HurtLex as a resource for hate speech detection in social media is provided: a qualitative evaluation against an Italian annotated Twitter corpus of hate against immigrants, and an extrinsic evaluation in the context of the AMI@IberEval2018 shared task, where the resource was exploited for extracting domain-specific lexicon-based features for the supervised classification of misogyny in English and Spanish tweets.

Italiano (in English translation). This article describes the development of Hurtlex, a multilingual lexicon of words to hurt. The starting point is the lexicon of Italian hate words developed by the linguist Tullio De Mauro, organized in 17 categories. The lexicon was expanded by exploiting lexical resources developed by the Computational Linguistics community, such as MultiWordNet and BabelNet, and its counterparts in other languages were generated semi-automatically through translation and manual annotation by experts. We present both a qualitative analysis of the new resource, through the analysis of a corpus of Italian tweets annotated for hate against migrants, and an extrinsic evaluation, through the use of the resource in the development of an Automatic Misogyny Identification system for tweets in Spanish and English.

1 Introduction

Communication between people is rapidly changing, in particular due to the exponential growth of the use of social media. As a privileged place for expressing opinions and feelings, social media are also used to convey expressions of hostility and hate speech, mirroring social and political tensions. Social media enable a wide and viral dissemination of hate messages. The extreme expressions of verbal violence and their proliferation in the network are increasingly taking the shape of an emergency that cannot be ignored. Therefore, the development of new linguistic resources and computational techniques for the analysis of large amounts of data becomes increasingly important, with particular emphasis on the identification of hate in language (Schmidt and Wiegand, 2017; Waseem and Hovy, 2016; Davidson et al., 2017).

The main objective of this work is the development of a lexicon of hate words that can be used as a resource to analyze and identify hate speech in social media texts in a multilingual perspective. The starting point is the lexicon ‘Le parole per ferire’ developed by the Italian linguist Tullio De Mauro for the “Jo Cox” Committee on intolerance, xenophobia, racism and hate phenomena of the Italian Chamber of Deputies. The lexicon consists of more than 1,000 Italian hate words organized along different semantic categories of hate (De Mauro, 2016).

In this work, we present a computational version of the lexicon. The hate categories and lemmas have been represented in a machine-readable format, and the lexicon has been semi-automatically extended and enriched with additional information from lexical databases and ontologies. In particular, we augmented the original Italian lexicon with translations in multiple languages.

HurtLex, the hate lexicon obtained with the method described in Section 3, has been tested with a corpus-based evaluation, through the analysis of a hate corpus of about 6,000 Italian tweets (Section 4.1), and through an extrinsic evaluation in the context of the shared task on Automatic Misogyny Identification at IberEval 2018, focusing on the identification of hate against women in Twitter in English and Spanish (Section 4.2).

The resource is available for download at http://hatespeech.di.unito.it/resources.html

2 Related Work

Lexical knowledge for the detection of hate speech, and abusive language in general, has received little attention in the literature until recently. Even for English, there are few publicly available domain-independent resources; see for instance the novel lexicon of abusive words recently proposed by Wiegand et al. (2018). Indeed, lexicons of abusive words are often manually compiled specifically for a task, so they are rarely based on deep linguistic studies and are rarely reusable in the context of new classification tasks. Moreover, the lexical knowledge exploited in this context is often limited to inherently derogative words (such as slurs, swear words, taboo words). De Mauro (2016) highlights that this can be a restriction in the compilation of a lexicon of hate words, where the accent is also on derogatory epithets aimed at hurting weak and vulnerable categories of people, targeting individuals and groups of individuals on the basis of race, nationality, religion, gender or sexual orientation (Bianchi, 2014).

Regarding Italian, apart from the lexicon of hate words developed by Tullio De Mauro described in Section 3, the literature is sparse, but it is worth mentioning at least the study by Pelosi et al. (2017) on mining offensive language on social media and the project reported in D’Errico et al. (2018) on distinguishing between pro-social and anti-social attitudes. Both works rely on corpora of Facebook posts. In particular, Pelosi et al. (2017) focus on automatically annotating hate speech in a corpus of posts from the Facebook page “Sesso Droga e Pastorizia”, by exploiting a lexicon-based method using a dataset of Italian taboo expressions.

To conclude, let us mention that a new shared task on hate speech detection has been proposed in the context of the EVALITA 2018 evaluation campaign [1], which provides a stimulating setting for discussion on the role of lexical knowledge in the detection of hate in language.

[1] http://www.di.unito.it/~tutreeb/haspeede-evalita18

3 Method

Our lexicon was created starting from preexisting lexical resources. In this section we give an overview of such resources and of the process we followed to create HurtLex.

3.1 “Parole per Ferire”

We started from the lexicon of “words to hurt” Le parole per ferire by the Italian linguist Tullio De Mauro (De Mauro, 2016). This lexicon includes more than 1,000 Italian words from 3 macro-categories: derogatory words (all those words that have a clearly offensive and negative value, e.g. slurs), words bearing stereotypes (typically hurting individuals or groups belonging to vulnerable categories) and words that are neutral, but which can be used to be derogatory in certain contexts through semantic shift (such as metaphor). The lexicon is divided into 17 finer-grained, more specific sub-categories that aim at capturing the context of each word (see also Table 1):

Negative stereotypes: ethnic slurs (PS); locations and demonyms (RCI); professions and occupations (PA); physical disabilities and diversity (DDF); cognitive disabilities and diversity (DDP); moral and behavioral defects (DMC); words related to social and economic disadvantage (IS).

Hate words and slurs beyond stereotypes: plants (OR); animals (AN); male genitalia (ASM); female genitalia (ASF); words related to prostitution (PR); words related to homosexuality (OM).

Other words and insults: descriptive words with potential negative connotations (QAS); derogatory words (CDS); felonies and words related to crime and immoral behavior (RE); words related to the seven deadly sins of the Christian tradition (SVP).

3.2 Lexical Resources

WordNet (Fellbaum, 1998) is a lexical reference system for the English language based on psycholinguistic theories of human lexical memory.
Table 1: Distribution of sub-categories in Le parole per ferire.

      Category   Percentage   Category   Percentage
      PS              3.85%   ASM             7.07%
      RCI             0.81%   ASF             2.78%
      PA              7.52%   PR              5.01%
      DDF             2.06%   OM              2.78%
      DDP             6.00%   QAS             7.34%
      DMC             6.98%   CDS            26.68%
      IS              1.52%   RE              3.31%
      OR              1.52%   SVP             4.83%
      AN              9.94%

Table 2: Distribution of the words not present in BabelNet along the 17 sub-categories of De Mauro.

      Category   Percentage   Category   Percentage
      PS              2.76%   ASM             6.21%
      RCI             0.41%   ASF             1.66%
      PA              5.38%   PR              1.66%
      DDF             1.52%   OM              2.76%
      DDP             8.55%   QAS            11.03%
      DMC             7.45%   CDS            26.07%
      IS              1.38%   RE              4.69%
      OR              2.34%   SVP             6.07%
      AN             10.07%
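
The category codes used in Tables 1 and 2 can be kept in a small lookup structure. The following is a minimal sketch, not part of the released resource: the codes, descriptions and Table 1 percentages are taken from the text, while the group keys and the names CATEGORY_GROUPS, TABLE_1_SHARE, group_of and share_by_group are illustrative assumptions.

```python
# Sub-category codes and descriptions as listed in Section 3.1.
# The group keys below are illustrative labels, not official identifiers.
CATEGORY_GROUPS = {
    "negative_stereotypes": {
        "PS": "ethnic slurs",
        "RCI": "locations and demonyms",
        "PA": "professions and occupations",
        "DDF": "physical disabilities and diversity",
        "DDP": "cognitive disabilities and diversity",
        "DMC": "moral and behavioral defects",
        "IS": "words related to social and economic disadvantage",
    },
    "hate_words_and_slurs_beyond_stereotypes": {
        "OR": "plants",
        "AN": "animals",
        "ASM": "male genitalia",
        "ASF": "female genitalia",
        "PR": "words related to prostitution",
        "OM": "words related to homosexuality",
    },
    "other_words_and_insults": {
        "QAS": "descriptive words with potential negative connotations",
        "CDS": "derogatory words",
        "RE": "felonies and words related to crime and immoral behavior",
        "SVP": "words related to the seven deadly sins of the Christian tradition",
    },
}

# Share of each sub-category in "Le parole per ferire" (Table 1), in percent.
TABLE_1_SHARE = {
    "PS": 3.85, "RCI": 0.81, "PA": 7.52, "DDF": 2.06, "DDP": 6.00,
    "DMC": 6.98, "IS": 1.52, "OR": 1.52, "AN": 9.94, "ASM": 7.07,
    "ASF": 2.78, "PR": 5.01, "OM": 2.78, "QAS": 7.34, "CDS": 26.68,
    "RE": 3.31, "SVP": 4.83,
}


def group_of(code):
    """Return the group that a sub-category code belongs to."""
    for group, members in CATEGORY_GROUPS.items():
        if code in members:
            return group
    raise KeyError(code)


def share_by_group(shares):
    """Sum per-category percentages into per-group percentages."""
    totals = {group: 0.0 for group in CATEGORY_GROUPS}
    for code, pct in shares.items():
        totals[group_of(code)] += pct
    return totals


if __name__ == "__main__":
    for group, pct in share_by_group(TABLE_1_SHARE).items():
        print(f"{group}: {pct:.2f}%")
```

Summing the Table 1 shares per group in this way gives a quick consistency check: the 17 percentages should add up to roughly 100%.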

WordNet is structured around synsets (sets of synonyms) and their 4 coarse-grained parts of speech: noun, verb, adjective and adverb.

MultiWordNet (Pianta et al., 2002) is an extension of WordNet that contains mappings between the English lexical items in WordNet and lexical items of other languages, including Italian.

BabelNet (Navigli and Ponzetto, 2012) is a combination of a multilingual encyclopedic dictionary and a semantic network that links concepts and named entities in a very wide network of semantic relationships.

3.3 A Computational Lexicon of Hate Words

The first step in the creation of our lexicon consisted in extracting every item from the lexicon Le parole per ferire. We obtained 1,138 items, of which 1,082 are unique, because several items are duplicated in multiple categories. We also removed 10 lemmas that belong to idiomatic multi-word expressions, e.g., “coccodrillo” (crocodile) in the expression “lacrime di coccodrillo” (crocodile tears), leaving us with 1,072 unique lemmas.

As a second step, we used MultiWordNet to augment the words with their part-of-speech tags. We used the Italian index of MultiWordNet, which comprises, for each lemma, four fields containing the identifiers of the synsets in which the lemma is used as a noun, an adjective, a verb and an adverb. By joining this index with our lexicon, we obtained all the possible parts of speech for 59.2% of the lemmas, bringing the total number of lemmas from 1,072 to 1,156, since lemmas with more than one part of speech are counted once per part of speech. The remaining lemmas were annotated manually.

The third step consists of linking the lemmas of the lexicon with a definition. We used the BabelNet API to retrieve the definitions, aiming for high coverage. In total, we were able to retrieve a definition for 71.1% of the lemmas. Table 2 shows the distribution of the words not present in BabelNet across the HurtLex categories. All the information about the entries of HurtLex (lemma, part of speech, definition) and the hierarchy of categories is collected in one structured XML file for distribution in machine-readable format.

3.4 Semi-automatic Multilingual Extension of the Lexicon

We leverage BabelNet to translate the lexicon into multiple languages, by querying the API [2] to retrieve all the senses of all the words in the lexicon.

[2] https://babelnet.org/guide#java

Next, we queried the BabelNet API again to retrieve all the lemmas in all the supported languages, thus creating a basis for a multilingual lexicon starting from an Italian resource.

Not surprisingly, some of the senses retrieved in the first step were unrelated to the offensive context, so their translation to other languages would generate unlikely candidates for a lexicon of hate words. For instance, BabelNet senses of named entities which are homographs of words in the input lexicon are extracted along with the other senses, but they should typically be excluded from a resource such as HurtLex.

Therefore, we performed a manual filtering of the senses prior to the automatic translation, with the aim of translating the original words only according to their offensive meaning. We manually annotated each lemma-sense pair according to one of three classes: Not offensive (used for senses that are totally unrelated to any offensive context), Neutral (senses that are not inherently offensive, but are linked to some offensive use of the word, for example by means of a semantic shift), and Offensive (senses that embody a crystallized offensive use of a word). To check the consistency
Table 3: Annotation of three senses of the Italian word “Finocchio”.

      Definition                                                   Annotation
      Finocchio is a station of Line C of the Rome Metro.          Not offensive
      Aromatic bulbous stem base eaten cooked or raw in salads.    Neutral [3]
      Offensive term for an openly homosexual man.                 Offensive

Table 4: Percentage of messages in the hate speech corpus containing words from the 17 HurtLex categories.

      Category   Occurrence   Category   Occurrence
      RE             45.10%   DDP             1.90%
      QAS            23.32%   IS              1.60%
      CDS             8.30%   SVP             0.50%
      PS              7.10%   RCI             0.30%
      ASM             2.70%   PR              0.30%
      OM              2.20%   DDF             0.30%
      AN              2.10%   OR              0.20%
      PA              2.00%   ASF             0.00%
      DMC             1.90%
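
The three-way sense annotation illustrated in Table 3 can be sketched as labelled lemma-sense records that are filtered before translation. This is only an illustration under stated assumptions: the SenseAnnotation record, the senses_to_translate helper and the lowercase label strings are hypothetical names, not the format of the released lexicon, and which labels are kept is left as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseAnnotation:
    """One manually annotated (lemma, sense) pair; field names are hypothetical."""
    lemma: str
    definition: str
    label: str  # "not offensive", "neutral" or "offensive"


# The three senses of "finocchio" shown in Table 3.
TABLE_3 = [
    SenseAnnotation(
        "finocchio",
        "Finocchio is a station of Line C of the Rome Metro.",
        "not offensive",
    ),
    SenseAnnotation(
        "finocchio",
        "Aromatic bulbous stem base eaten cooked or raw in salads.",
        "neutral",
    ),
    SenseAnnotation(
        "finocchio",
        "Offensive term for an openly homosexual man.",
        "offensive",
    ),
]


def senses_to_translate(annotations, keep_labels):
    """Keep only the senses whose label is in keep_labels."""
    return [a for a in annotations if a.label in keep_labels]


# A stricter and a looser selection; senses labelled "not offensive"
# are dropped in both cases.
conservative = senses_to_translate(TABLE_3, {"offensive"})
inclusive = senses_to_translate(TABLE_3, {"offensive", "neutral"})

if __name__ == "__main__":
    for sense in inclusive:
        print(sense.label, "|", sense.definition)
```

Filtering at the sense level rather than the lemma level is what keeps the metro station out of the translated lexicon while the slur sense of the same word is retained.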

of the annotation, a subset of 200 senses was annotated by two experts, who agreed on 87.6% of the items. Table 3 shows examples of the different annotations of senses of the same word.

[3] The derogatory use of the word “finocchio” (fennel) in Italian is thought to originate from the Middle Ages, linking the fennel plant to the execution of gay men at the stake.

After discussing the results of the pilot annotation, we decided to split the Neutral class into two classes. One of the new classes covers the cases where a sense is not literally pejorative, but is used to insult by means of a semantic shift, e.g. metaphorically. The other class is for the senses which have a clear negative connotation, but not necessarily a direct derogatory use, e.g., the main senses of “criminal”. Subsequently, the lexicon was annotated by two other experts, who agreed on 61% of the items. Most disagreement was concentrated in the distinctions Not offensive/Not literally pejorative (43% of the disagreement cases) and Negative connotation/Offensive (25% of the disagreement cases).

After the annotation, we discarded all the senses marked “not offensive”, and created two different versions of the multilingual lexicon in 53 languages: one containing only the translations of “offensive” senses (more conservative), and the other containing translations of “offensive”, “not literally pejorative” and “negative connotation” senses (more inclusive).

4 Evaluation

We evaluated the quality of the lexicon of hate words created with the method described in the previous section in two settings: by studying the occurrence of its words and their categories in a corpus of hate speech (Section 4.1), and by extracting features from HurtLex for supervised classification of misogyny in social media text (Section 4.2).

4.1 Qualitative Evaluation

In order to gain insights into the composition of the HurtLex lexicon, we evaluated it against an annotated corpus of hate speech on social media, recently published by Sanguinetti et al. (2018b). The corpus consists of 6,008 tweets selected according to keywords related to immigration and ethnic minorities. Each tweet in the corpus is annotated following a rich schema, including hate speech (yes/no), aggressiveness (strong/weak/none), offensiveness (strong/weak/none), irony (yes/no) and stereotype (yes/no).

We searched for the lemmas of HurtLex in the version of the hate speech corpus enriched with Universal Dependencies annotations [4], by matching the (lemma, POS-tag) pairs in HurtLex with the morphosyntactic annotation of the corpus, and computed several statistics on the actual usage of such words in a specific abusive context of hate against immigrants. Table 4 shows the rate of messages in the corpus featuring words from each HurtLex category.

[4] The corpus of hate speech by Sanguinetti et al. (2018b) has been annotated with a method similar to that described in Sanguinetti et al. (2018a).

For a more in-depth analysis, we also examined the relative frequency of single words in HurtLex with respect to the finer-grained annotation of the messages where they occur. Figures 1, 2, 3, 4 and 5 show examples of such analysis.

It can be noted that the relative frequency of words like “terrorismo” (terrorism), “ladro” (thief) and “rubare” (stealing) decreases drastically as the tweets become more aggressive, more offensive or show a higher level of hate speech (perhaps because, albeit negative, they are not swear words), while
words like “bastardo” (bastard) occur more often as the tweets become more offensive (possibly also because they belong to the swearing sphere). Another class of words, like “zingaro” (gypsy), shows a parabolic distribution. We hypothesize that this behavior is typical of words with an apparently neutral connotation that are sometimes used in abusive contexts with an offensive connotation. We plan to leverage this method of analysis for further studies along this line.

Figure 1: Relative frequency of the words “terrorismo” (terrorism) and “criminale” (criminal) with respect to the hate speech annotation.

Figure 2: Relative frequency of the words “ladro” (thief) and “zingaro” (gypsy) with respect to the aggressiveness annotation.

Figure 3: Relative frequency of the words “rubare” (stealing), “zingaro” (gypsy) and “bastardo” (bastard) with respect to the offensiveness annotation.

Figure 4: Relative frequency of the words “politico” (politician) and “terrone” (slur referring to southern Italians) with respect to the irony annotation.

Figure 5: Relative frequency of the words “rubare” (stealing) and “cinese” (Chinese) with respect to the stereotype annotation.

4.2 Misogyny Identification on Social Media

HurtLex was one of the resources used by the Unito team to participate in the shared task Automatic Misogyny Identification (AMI) at IberEval 2018 (Pamungkas et al., 2018). The task consists of identifying misogynous content in Twitter messages (first sub-task) and classifying their misogynist behavior (second sub-task). The Unito team employed different subsets of the 17 categories of HurtLex by extracting lexicon-based features for a supervised classifier. They identified the Prostitution, Female and Male Sexual Apparatus and Physical and Mental Diversity and Disability categories as the most informative for this task. The Unito classifier obtained the best result in the first sub-task for both languages and the best result in the second sub-task for Spanish.

5 Conclusion and Future Work

Our main contribution is a machine-readable version of the hate words lexicon by De Mauro, enriched with lexical features from available computational resources. We make HurtLex available for download as a tool for hate speech detection. A first evaluation of the lexicon against corpora featuring different targets of hate (immigrants and women) has been presented. The multilingual evaluation of HurtLex also showed promising results. Although we are aware that hate speech-related phenomena tend to follow regional and cultural patterns, our semi-automatically produced resource was able to partially fill the gap towards hate speech detection in less represented languages. To this end, we aim at investigating the potential and pitfalls of semi-automating mappings further. In particular, two possible extensions of our method involve using distributional semantic models to automatically expand the lexicon with synonyms and lemmas semantically related to the original ones, and exploiting De Mauro’s derivational rules.

Acknowledgments

Valerio Basile and Viviana Patti were partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media-IhatePrejudice, S1618 L2 BOSC 01).

References

Claudia Bianchi. 2014. The speech acts account of derogatory epithets: some critical notes. In J. Dutant, D. Fassio, and A. Meylan, editors, Liber Amicorum Pascal Engel, University of Geneva, pages 465–480.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In International AAAI Conference on Web and Social Media.

Tullio De Mauro. 2016. Le parole per ferire. Internazionale. 27 settembre 2016. Compiled for the “Jo Cox” Committee on intolerance, xenophobia, racism and hate phenomena, of the Italian Chamber of Deputies, which issued a Final Report in 2017.

Francesca D’Errico, Marinella Paciello, and Matteo Amadei. 2018. Prosocial words in social media discussions on hosting immigrants. Insights for psychological and computational field. In Symposium on Emotion Modelling and Detection in Social Media and Online Interaction, in conjunction with the 2018 Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour (AISB 2018).

Christiane Fellbaum. 1998. WordNet: an electronic lexical database. MIT Press.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 193:217–250.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@UniTo for AMI at IberEval2018: Exploiting Lexical Knowledge for Detecting Misogyny in English and Spanish Tweets. In Proc. of 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with SEPLN 2018, volume 2150 of CEUR Workshop Proceedings. CEUR-WS.org.

Serena Pelosi, Alessandro Maisto, Pierluigi Vitale, and Simonetta Vietri. 2017. Mining offensive language on social media. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, January.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018a. PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018b. An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93. ACL.

Michael Wiegand, Josef Ruppenhofer, Anna Schmidt, and Clayton Greenberg. 2018. Inducing a lexicon of abusive words – a feature-based approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1046–1056. Association for Computational Linguistics.