=Paper=
{{Paper
|id=Vol-2150/AMI_paper2
|storemode=property
|title=14-ExLab@UniTo for AMI at IberEval2018: Exploiting Lexical Knowledge for Detecting Misogyny in English and Spanish Tweets
|pdfUrl=https://ceur-ws.org/Vol-2150/AMI_paper2.pdf
|volume=Vol-2150
|authors=Endang Wahyu Pamungkas,Alessandra Teresa Cignarella,Valerio Basile,Viviana Patti
|dblpUrl=https://dblp.org/rec/conf/sepln/PamungkasCBP18
}}
==14-ExLab@UniTo for AMI at IberEval2018: Exploiting Lexical Knowledge for Detecting Misogyny in English and Spanish Tweets==
Endang Wahyu Pamungkas(1), Alessandra Teresa Cignarella(1,2), Valerio Basile(1), and Viviana Patti(1)
(1) Dipartimento di Informatica, Università degli Studi di Torino
(2) PRHLT Research Center, Universitat Politècnica de València
{pamungka,cigna,basile,patti}@di.unito.it
Abstract. We describe our participation in the Automatic Misogyny
Identification (AMI) shared task at IberEval 2018. The task focused
on the detection of misogyny in English and Spanish tweets and was
articulated in two sub-tasks addressing the identification of misogyny at
different levels of granularity. We describe the final submitted systems for
both languages and sub-tasks: Task A is a classical binary classification
task to determine whether a tweet is misogynous or not, while Task B is
a finer-grained classification task devoted to distinguishing different types
of misogyny, where systems must predict (i) one out of five categories
of misogynistic behaviours and (ii) if the abusive content was purposely
addressed to a specific target or not. We propose an SVM-based archi-
tecture and explore the use of several sets of features, including a wide
range of lexical features relying on the use of available and novel lexicons
of abusive words, with a special focus on sexist slurs and abusive words
targeting women in the two languages at issue. Our systems ranked first
in Task A for both English and Spanish (accuracy score of 0.913 for
English; 0.815 for Spanish), outperforming the baselines and the other
participating systems, and first in Task B on Spanish.
1 Introduction
In the era of mass online communication, more and more episodes of hateful
language and harassment against women occur in social media (see, e.g., https://www.amnesty.org/en/latest/research/2018/03/online-violence-against-women-chapter-3). Hate Speech
(HS) can be defined as any type of communication that is abusive, insulting,
intimidating, harassing, and/or incites to violence or discrimination, and that
disparages a person or a group on the basis of some characteristics such as
race, color, ethnicity, gender, sexual orientation, nationality, religion, or other
characteristics [1]. In particular, when HS is gender-oriented, and it specifically
targets women, we refer to it as misogyny [2].
Recently, an increasing number of scholars have been focusing on the task of
automatic detection of abusive or hateful language online [3], where hate speech is
characterized by some key aspects which distinguish it from offline, face-to-face
communication and make it potentially more dangerous and hurtful. In partic-
ular, hate speech in the form of racist and misogynist remarks is a common
occurrence on social media [4], therefore recent works on the detection of HS
focused on HS related to race, religion, and ethnic minorities [5] and on gender-
based hate, which is also the focus of the AMI shared task.
Detecting misogynist content and its author is still a difficult task for social
media platforms. For instance, the popular social network Facebook is still unable
to deal with this issue and relies on its community to report misogynistic content
(https://www.nytimes.com/2013/05/29/business/media/facebook-says-it-failed-to-stop-misogynous-pages.html).
The work of Hewitt et al. [6] is a first study that attempts to detect
misogyny on Twitter manually, in which the authors used several terms related
to slurs against women to gather the data from Twitter. However, the automatic
detection of misogynistic content is still an open problem, with few approaches
proposed only recently [7].
In this paper, we describe the systems we submitted for detecting misogyny
in the context of the Automatic Misogyny Identification (AMI) shared task at
IberEval 2018 [8], defined as a two-fold task on detecting misogyny in English
and Spanish tweets at different levels of granularity. In particular, considering
the role of lexical choice in gender stereotypes, we decided to explore the role of
lexical knowledge in detecting misogyny, by experimenting with lexical features
based on both generic lexicons of slurs and abusive words, and on specific lexicons
of sexist slurs and hate words targeting women.
2 The 14-ExLab@UniTo systems
We built two similar systems for misogyny detection, one for English and one for
Spanish. Several sets of features were considered based on a linguistically moti-
vated approach, including stylistic, structural and lexical features. In particular,
in order to explore the role of lexical knowledge in this task, we experimented with the
use of (i) generic lexicons of abusive words and slurs; (ii) specific lexicons of sex-
ist slurs and hate words reflecting specifically gender-based hate and well-known
cultural gender bias and stereotypes. In particular, we experimented for the first
time in this task with the use of a new multilingual lexicon (HurtLex) [10], built
upon an inventory of hate words compiled by the Italian linguist Tullio De Mauro [9]
and semi-automatically translated from Italian into English and Spanish
by relying on BabelNet [12].
The list of lexical features includes:
- Bag of Words (BoW): a sparse vector encoding the occurrence of unigrams, bigrams and trigrams in a tweet.
- Swear Word Count: the number of swear words contained in a tweet, according to the list of swear words from the noswearing dictionary (https://www.noswearing.com/dictionary).
- Swear Word Presence: a binary value representing the presence of at least one swear word from the same dictionary.
- Sexist Slurs Presence: a binary value based on a small set of sexist words aimed at women from prior work [11]: 0 if there is no sexist slur in the tweet, 1 if there is at least one.
- Woman-related Words Presence: a binary value representing the target of misogyny, based on a small, manually built set of English words containing synonyms or other words related to the word "woman". For the Spanish system development, we translated all the English word lists described here by using Google Translate (https://translate.google.com/).

Additionally, we extracted a set of features based on the presence of words from
the HurtLex lexicon [10]. This lexicon includes a wide inventory of about 1,000
Italian hate words, originally compiled in a manual fashion by De Mauro [9] and
organized in 17 categories grouped in three macro levels:

(a) Negative stereotypes: ethnic slurs (PS); locations and demonyms (RCI); professions and occupations (PA); physical disabilities and diversity (DDF); cognitive disabilities and diversity (DDP); moral and behavioral defects (DMC); words related to social and economic disadvantage (IS).
(b) Hate words and slurs beyond stereotypes: plants (OR); animals (AN); male genitalia (ASM); female genitalia (ASF); words related to prostitution (PR); words related to homosexuality (OM).
(c) Other words and insults: descriptive words with potential negative connotations (QAS); derogatory words (CDS); felonies and words related to crime and immoral behavior (RE); words related to the seven deadly sins of the Christian tradition (SVP).

The lexicon has been translated into English and Spanish semi-automatically by
extracting all the senses of all the words from BabelNet [12], manually discarding
the senses that were not relevant to the context of hate, and finally retrieving
all the English and Spanish lemmas for the remaining senses. Thanks to a manual
inspection, we identified five categories as specifically related to gender-based
hate: DDF and DDP among the negative stereotypes, and PR, ASM and ASF beyond
stereotypes.
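As an illustration, the following is a minimal sketch of how the lexical features above can be computed. It is not our released code: the tiny word sets are invented placeholders, and the single "category<TAB>lemma" TSV format assumed for HurtLex is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny word lists; in our systems these come from the
# noswearing dictionary, Fasoli et al. [11], and a manually built set.
SWEAR_WORDS = {"damn", "crap"}
SEXIST_SLURS = {"bitch", "slut"}
WOMAN_WORDS = {"woman", "women", "girl", "wife", "mother"}

def load_hurtlex(path):
    """Read a (hypothetical) 'category<TAB>lemma' TSV into {category: set of lemmas}."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            category, lemma = line.rstrip("\n").split("\t")
            lexicon.setdefault(category, set()).add(lemma.lower())
    return lexicon

def lexical_features(tokens, hurtlex):
    """Count- and presence-based lexical features for one tokenized tweet."""
    tokens = [t.lower() for t in tokens]
    swear_count = sum(t in SWEAR_WORDS for t in tokens)
    feats = {
        "swear_word_count": swear_count,
        "swear_word_presence": int(swear_count > 0),
        "sexist_slur_presence": int(any(t in SEXIST_SLURS for t in tokens)),
        "woman_word_presence": int(any(t in WOMAN_WORDS for t in tokens)),
    }
    # One count per HurtLex category (PS, RCI, ..., ASF), as in Table 2.
    for category, lemmas in hurtlex.items():
        feats[f"{category.lower()}_count"] = sum(t in lemmas for t in tokens)
    return feats

# Bag of Words over unigrams, bigrams and trigrams:
bow_vectorizer = CountVectorizer(ngram_range=(1, 3))
```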
The structural features employed by our systems include:
- Bag of Hashtags (BoH): similarly to BoW, a sparse vector encoding the hashtags occurring in a tweet.
- Bag of Emojis (BoE): a sparse vector encoding the emojis occurring in a tweet. We represented each emoji by its CLDR short name (https://unicode.org/emoji/charts/full-emoji-list.html), converting the emoji Unicode characters with the emoji library from PyPI (https://pypi.org/project/emoji/).
- Hashtag Presence: a binary value: 0 if there is no hashtag in the tweet, 1 if there is at least one.
- Link Presence: a binary value: 0 if there is no URL in the tweet, 1 if there is at least one.

All the features are encoded as fixed-size numerical or one-hot vector representations,
allowing us to experiment extensively with their combination.
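A companion sketch for the structural features follows, assuming the emoji package from PyPI, whose demojize() function rewrites each emoji as its CLDR short name; the regular expressions are simple placeholders rather than a full tweet tokenizer.

```python
import re
import emoji  # pip install emoji

HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")

def structural_features(tweet):
    """Binary presence features for hashtags and links."""
    return {
        "hashtag_presence": int(bool(HASHTAG_RE.search(tweet))),
        "link_presence": int(bool(URL_RE.search(tweet))),
    }

def emojis_to_short_names(tweet):
    """Rewrite emojis as CLDR short names, feeding the Bag of Emojis."""
    return emoji.demojize(tweet)

print(emojis_to_short_names("so funny 😂"))  # "so funny :face_with_tears_of_joy:"
```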
3 Experiments and Results
In this section, we report on the results of the evaluation of our systems for
misogyny detection according to the benchmark established by the AMI task.
3.1 AMI: Tasks Description and Dataset Composition
The organizers of AMI proposed an automatic detection task of misogynistic
content on Twitter, in English (EN) and Spanish (SP). Two different tasks were
proposed: Task A is a binary classification task, where every system should
determine whether a tweet is misogynous or not. Task B is composed
of two distinct classification tasks. First, participants were asked to classify
the misogynous tweets into five categories of misogynistic behavior including:
“stereotype & objectification”, “dominance”, “derailing”, “sexual harassment &
threats of violence”, and “discredit”. Secondly, they were asked to classify the
misogynous tweets based on their target, labeling whether it is active (i.e. refer-
ring to one woman in particular) or passive (i.e. referring to a group of women).
Task A is evaluated in terms of accuracy, while for Task B the evaluation
consists of the macro-average of the F1-scores on the positive classes. Each par-
ticipating team could submit a maximum of 5 runs, pertaining to two different
scenarios: constrained and unconstrained.
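Concretely, a minimal sketch of the two official metrics using scikit-learn is given below; the toy label arrays are invented for illustration only, and the restriction to the positive classes is expressed through the labels argument of f1_score.

```python
from sklearn.metrics import accuracy_score, f1_score

# Task A: plain accuracy on the binary misogyny labels (toy data).
y_true_a = [1, 0, 1, 1, 0]
y_pred_a = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_a, y_pred_a))  # 0.8

# Task B: macro-averaged F1 restricted to the positive classes,
# i.e. excluding the "no class" label of non-misogynous tweets.
y_true_b = ["discredit", "dominance", "no class", "discredit"]
y_pred_b = ["discredit", "discredit", "no class", "dominance"]
positive = ["discredit", "dominance"]  # toy subset of the five categories
print(f1_score(y_true_b, y_pred_b, labels=positive, average="macro"))
```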
Dataset As summarized in Table 1, the organizers provided 3,251 tweets for the
English training set and 3,307 tweets for the Spanish training set. Each tweet,
in both languages, was annotated at three levels: 1) presence of misogynous
content, 2) categories of misogynistic behavior, as described in Section 3.1, and 3)
target of misogyny (active or passive). The organizers provided a balanced label
distribution for Task A (misogynous vs. not misogynous), while the distribution
of data for Task B was highly unbalanced, reflecting the natural distribution of
misogynistic behaviours and targets in the corpus.

                          Task A                     Task B
                     English  Spanish                         English  Spanish
Misogynistic           1,568    1,649   Stereotype                137      151
                                        Dominance                  49      302
                                        Derailing                  29       20
                                        Sexual Harassment         410      198
                                        Discredit                 943      978
                                        Active                    942    1,455
                                        Passive                   626      194
Not misogynistic       1,683    1,658   No class                1,683    1,658
Total                  3,251    3,307

Table 1. Dataset label distribution.
3.2 Experimental Setup
We built two variants of our system and trained them on the available training
sets. We tuned the system on the basis of the results of a 10-fold cross validation,
using accuracy as an evaluation metric for Task A. The system for English is
based on SVM with Radial Basis Function (RBF) kernel, while the system built
for Spanish is based on SVM with a linear kernel. Both systems were built by
using the scikit-learn Python library (http://scikit-learn.org/). Additionally, we performed an ablation test
on our feature sets to study the impact of the different features on the system
performance. Table 2 shows the features selected for each of our submissions and
accuracy scores from cross-validation on training sets for English and Spanish.
As for the features based on HurtLex, in S4 (EN) and S3 (SP) we explored
the impact of hate words belonging to categories specifically related to
gender-based hate (see Sec. 2).
Languages English Spanish
Systems S1 S2 S3 S4 S5 S1 S2 S3 S4 S5
Accuracy 0.748 0.75 0.75 0.737 0.73 0.791 0.789 0.787 0.789 0.73
Bag of Words - - - - X X X X X -
Bag of Hashtags - - - - X X X X X -
Bag of Emojis - - - - X X X X X -
Hashtag Presence X X X X - - - - - X
Link Presence X X X X - - - - - X
Swear Word Count X X X X - - - - - X
Swear Word Presence X X X X - - - - - X
Sexist Slurs Pres. X X X X - - X X X X
Woman Word Pres. X X X X - - X X X X
ASF Count - X X X - - - X X X
PR Count - X X X - - - X X X
OM Count - X - - - - - - X X
DDF Count - - X X - - - X X -
CDS Count - - - - - - - - X -
DDP Count - - - X - - - X X -
AN Count - - - - - - - - X -
ASM Count - - - X - - - X X -
DMC Count - - - - - - - - X -
IS Count - - - - - - - - X -
OR Count - - - - - - - - X -
PA Count - - - - - - - - X -
PS Count - - - - - - - - X -
QAS Count - - - - - - - - X -
RCI Count - - - - - - - - X -
RE Count - - - - - - - - X -
SVP Count - - - - - - - - X -
Table 2. Feature Selection for all the submitted systems.
In addition, we tested the performance of the best-performing sets of features
of one language applied to the other language, to gauge the multilingual
potential of the best systems: the English submission 5 is based on the
best-performing (in cross-validation) combination of features for Spanish, and
the Spanish submission 5 is based on the best-performing combination of
features for English. For Task B, we used exactly the same features as Task A
in each submission. We only submitted constrained runs.
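As a minimal sketch of this setup (not our released code), assuming scikit-learn and placeholder names X and y for the encoded feature matrix and the Task A labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def tune_task_a(X, y, language):
    """10-fold cross-validation accuracy with the kernel used per language."""
    kernel = "rbf" if language == "en" else "linear"  # EN: RBF, ES: linear
    clf = SVC(kernel=kernel)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    return scores.mean()
```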
3.3 Official Results and Analysis
Tables 3 to 6 show the rankings of our submissions according to the official
competition results (https://amiibereval2018.wordpress.com/important-dates/results/).
The submission names follow the numbering in Table 2 (run 1 is the result of
S1, and so on). Our systems ranked first in Task A for both English (accuracy
0.913 by run 1) and Spanish (accuracy 0.815 by run 3). Meanwhile, for Task B
(Table 5 and Table 6), one of our systems achieved the best result on Spanish
(average macro F-measure 0.446 by run 2), while our best English run ranked
6th (average macro F-measure 0.369 by run 5).
Our experiment with the multilingual setting proved challenging. Not
surprisingly, both submissions 5 were the worst-performing among our
submissions. However, the English S5 shows a comparatively good performance in
absolute terms: as Table 3 shows, all of our submissions in English were above
the competition baseline. As Table 4 shows, instead, the same system applied
to the Spanish dataset obtained a very low accuracy score (ranked 24th,
accuracy 0.536). This asymmetry indicates that the combination of BoW, BoH and
BoE is a better representation of tweets in a multilingual setting than more
ad hoc, task-specific features.
rank   submissions       accuracy
1      14-exlab.c.run1   0.913
2      14-exlab.c.run2   0.902
3      14-exlab.c.run4   0.898
4      14-exlab.c.run3   0.879
...    ...               ...
10     14-exlab.c.run5   0.824
...    ...               ...
15     ami-baseline      0.784

Table 3. Task A rankings (English)

rank   submissions       accuracy
1      14-exlab.c.run3   0.815
4      14-exlab.c.run1   0.812
5      14-exlab.c.run2   0.812
6      14-exlab.c.run4   0.809
...    ...               ...
18     ami-baseline      0.767
...    ...               ...
24     14-exlab.c.run5   0.536

Table 4. Task A rankings (Spanish)
On Task B, most participants achieved relatively low results, showing the
difficulty of this task, especially in classifying misogynistic behavior categories.
We found the datasets’ unbalanced distribution of labels to be the main issue.
Based on the detailed results provided by the organizers, we note that most of
the submitted systems are not able to detect the less represented classes,
including derailing (29 instances), dominance (49), and stereotype &
objectification (137). Classifying the target of misogyny (active vs. passive)
was not easy either, as can be seen from the F1-scores in the official results.
rank   submissions       F1-score
...    ...               ...
6      14-exlab.c.run5   0.369
8      14-exlab.c.run3   0.351
10     14-exlab.c.run4   0.343
12     14-exlab.c.run2   0.342
15     14-exlab.c.run1   0.338
16     ami-baseline      0.337

Table 5. Task B rankings (English)

rank   submissions       F1-score
1      14-exlab.c.run2   0.446
2      14-exlab.c.run3   0.445
3      14-exlab.c.run4   0.444
5      14-exlab.c.run1   0.441
...    ...               ...
14     ami-baseline      0.410
...    ...               ...
20     14-exlab.c.run5   0.279

Table 6. Task B rankings (Spanish)

The feature combination including Swear Word Count, Swear Word Presence,
Hashtag Presence, Link Presence, Sexist Slurs Presence, and Woman-related
Words Presence (our English S1) outperformed all other submissions in English.
In Spanish, the use of terms from the HurtLex lexicon selected as related to
gender-based hate improved system performance in submission 3. However, not
all the lexicon categories proved useful for this task, as indicated by the
result of submission 4.
4 Discussion and Conclusion
In this paper we described the 14-ExLab@UniTo submissions for the Automatic
Misogyny Identification (AMI) shared task at IberEval 2018. Our approach based
on lexical knowledge was successful, and our systems turned out to be the
best-performing among the participating systems in Task A for both English and
Spanish. We also introduced a novel hate-specific lexical resource which
helped to improve the performance on the misogyny identification task.
As for Task B, it was hard for all systems to classify misogynous tweets into
the five proposed categories. A manual inspection of the data revealed that
there is no clear demarcation line between one category and the other, and
that the high presence of swearing in categories such as dominance and/or
discredit depends mainly on the focus (e.g. the agent, the man, vs. the
wounded party, the target woman). At the same time, stereotype &
objectification is not so conceptually distant from the sexual harassment
category, due to a strong use of language referring to sexual body parts or
vulgar sexual practices. Some examples from the English and Spanish datasets:
stereotype & objectification (EN): No girl is even capable of developing morals
until they get the slut fucked out of them. Welcome to my generation
dominance (EN): Bad girls get spankings
derailing (EN): Women want u to automatically believe women who scream rape they
don’t understand our position....
sexual harassment & threats of violence (EN): @ SynergyFinny hey bitch
wassup bitch suck my dick bitch
discredit (EN): @ Herbwilson1967 Fuck that money whore @HillaryClinton Too stupid
to know consensual touching or grabbing is not assault. Only @ChelseaClinton is dumber
stereotype & objectification (ES): Que cuza antes la calle, una mujer inteligente
o una tortuga vieja? Una tortuga vieja porque las mujeres inteligentes no existen . . .
[EN: What crosses the street first, an intelligent woman or an old turtle? An old
turtle, because intelligent women do not exist . . . ]
dominance (ES): “Voy a enseñarle a esta perra como se trata a un hombre”
LMAO IN LOVE WITH EL TITI [EN: “I am going to teach this bitch how a man
should be treated” LMAO IN LOVE WITH EL TITI]
sexual harassment & threats of violence (ES): @ genesismys1985 Me gustaría
abrirte las piernas y clavarte toda mi polla en tu culo. [EN: I would like to
spread your legs and shove my whole dick up your ass.]
discredit (ES): Porque ladra tanto mi perra? La puta madre cállate un poco
[EN: Why does my bitch bark so much? For fuck's sake, shut up a bit]
We are planning to participate in the upcoming AMI shared task at EVALITA
2018, in order to validate our approach also on the Italian language.
Acknowledgments
V. Basile and V. Patti were partially funded by Progetto di Ateneo/CSP 2016
(Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01).
References
1. Erjavec, K., Kovačič, M.P.: “You don’t understand, this is a new war!” Analysis
of hate speech in news web sites’ comments. Mass Communication and Society 15
(2012) 899–920
2. Manne, K.: Down Girl: The Logic of Misogyny. Oxford University Press (2017)
3. Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural lan-
guage processing. In: Proceedings of the Fifth International Workshop on Natural
Language Processing for Social Media. (2017) 1–10
4. Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for
hate speech detection on Twitter. In: Proceedings of the NAACL student research
workshop. (2016) 88–93
5. Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., Stranisci, M.: An Italian Twitter
Corpus of Hate Speech against Immigrants. In: Proc. of the 11th International
Conference on Language Resources and Evaluation (LREC 2018), ELRA (2018)
6. Hewitt, S., Tiropanis, T., Bokhove, C.: The problem of identifying misogynist
language on Twitter (and other online social spaces). In: Proceedings of the 8th
ACM Conference on Web Science, ACM (2016) 333–335
7. Anzovino, M., Fersini, E., Rosso, P.: Automatic Identification and Classification of
Misogynistic Language on Twitter. In: Proc. of the 23rd Int. Conf. on Applications
of Natural Language & Information Systems, Springer (2018) 57–64
8. Fersini, E., Anzovino, M., Rosso, P.: Overview of the Task on Automatic Misogyny
Identification at IberEval. In: Proc. of the 3rd Workshop on Evaluation of Human
Language Technologies for Iberian Languages (IberEval 2018), co-located with
SEPLN 2018, CEUR-WS.org (2018)
9. De Mauro, T.: Le parole per ferire. Internazionale, 27 September 2016.
10. Bassignana, E.: HurtLex: Developing a multilingual computational lexicon of words
to hurt. Bachelor's thesis (2018). Supervisor: V. Patti, co-supervisor: V. Basile.
11. Fasoli, F., Carnaghi, A., Paladino, M.P.: Social acceptability of sexist derogatory
and sexist objectifying slurs across contexts. Language Sciences 52 (2015) 98–107
12. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and
application of a wide-coverage multilingual semantic network. Artificial Intelligence
193 (2012) 217–250