Sexism Detection in English and Spanish Tweets

Ritesh Kumar1, Soumya Pal2, and Rajendra Pamula2

1 Department of Computer Science and Engineering, National Institute of Technology Jamshedpur, India
ritesh.cse@nitjsr.ac.in
2 Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, India
{soumyapal0601,rajendrapamula}@gmail.com

Abstract. Sexism is an ancient evil that society still struggles with today. The freedom of speech and ease of anonymity granted by social media have also resulted in incitement to hatred. This creates a need for the automatic detection of sexist posts or tweets on social media. In this paper, we present machine learning models that can detect sexism. Specifically, we describe the models submitted for the shared task on sEXism Identification in Social neTworks (EXIST) at IberLEF 2021. The problem concentrates on sexism detection in two languages: English and Spanish. The challenge is divided into two tasks of different granularity: (1) a binary classification problem to discover instances of sexism in a post, and (2) a fine-grained classification problem to predict which of five types of sexism is present. Although our performance leaves room for improvement, our scores are encouraging enough to pursue better results in future work.

Keywords: SVM, LSTM, Random Forest, Misogyny, Social Media

1 Introduction

The digitalization of the world also means the digitalization of sexism. Sexism can be defined as individuals' attitudes, beliefs, and behaviors, as well as organizational, institutional, and cultural practices, that either reflect negative evaluations of individuals based on their gender or support the unequal status of women and men. With the rapid advancement of the internet and social media, such beliefs can spread rapidly, so it is essential to detect such behaviour. The amount of data generated on social media sites can be estimated from the fact that, on average, around 6,000 tweets are posted every second.
Content moderation of such a huge amount of data is difficult to achieve through manpower alone, and social networking sites are struggling with it. Artificial intelligence and machine learning techniques can be exploited for sexism detection. In this paper, we present models to detect sexism, specifically the system we submitted for the EXIST 2021 shared task.3

3 IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Sexism can be expressed directly or implicitly, and it can be hidden in the form of humor. Automatically detecting subtle instances of sexism is much harder than discovering overt misogyny and violence aimed at women. We try to identify instances of both implicit and explicit sexism and to further classify the sexism into one of five categories: Ideological and Inequality; Stereotyping and Dominance; Objectification; Sexual Violence; and Misogyny and Non-sexual Violence. As required by EXIST 2021 [5], we submitted three runs for Task 1 and two runs for Task 2. We extracted different lexical and non-lexical features from the text for the classification. Our best run in Task 1 achieved an accuracy of 71%; for Task 2, our best run achieved an accuracy of 0.59 and a macro F1 score of 0.45.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the task and dataset. Section 4 presents the system description, i.e., text preprocessing, feature extraction, and machine learning models. Section 5 discusses our results. Finally, conclusions and future directions of our research are presented in Section 6.

2 Related Work

Several works have been proposed to detect misogyny and hate speech against women across social platforms. Ahluwalia et al. [1] showed how ML models can detect misogynist tweets. Pamungkas et al.
[6] proposed a cross-language study of misogyny detection in tweets. Jose et al. [2] identified cases of aggressiveness and hate speech towards women. Goenaga et al. [3] used an RNN approach based on a Bidirectional Long Short-Term Memory (Bi-LSTM) with Conditional Random Fields (CRF) and evaluated the proposed architecture on a misogyny identification (text classification) task. Shushkevich et al. [8] combined several simpler classifiers (Logistic Regression, Naive Bayes, and SVM) into one more complex blended model, which classified the data taking into account the class-membership probabilities calculated by the simpler models. Liu et al. [4] used three classifiers trained with an SVM with a linear kernel, random forests (RF), and gradient boosted trees (GBT). In the dataset we used, misogynist tweets are considered a sub-category of sexist tweets. Most of these works focus on misogyny and hate speech detection rather than on subtle instances of sexism; we try to identify such subtle sexism in tweets.

3 Task and Dataset Description

In this section, we describe the automatic sexism detection shared task and the dataset provided to the participants [7]. The EXIST 2021 shared task was divided into two subtasks (Task 1 and Task 2). Task 1 aims at finding instances of sexism, while the goal of Task 2 is to categorize the sexism into one of the following five categories (illustrated in Table 1):

1. Ideological and Inequality: Discrediting the feminist movement, rejecting the presence of inequality between men and women, or presenting men as victims of gender bias.
2. Stereotyping and Dominance: Expressing false ideas about women, e.g., that they are more suitable for certain tasks and not suitable for others.
3. Objectification: Presenting women as objects.
4. Sexual Violence: Sexual suggestions, requests for sexual favours, or threats of sexual harassment.
5.
Misogyny and Non-sexual Violence: Indicating violence and hate toward women of a non-sexual nature.

Table 1. Categorization of sexism with examples.

Category                          Example
Ideological and Inequality        I think the whole equality thing is getting out of hand.
Stereotyping and Dominance        Most women no longer have the desire.
Objectification                   First thing I see are her slut nails.
Sexual Violence                   fuck that cunt, I would with my fist.
Misogyny and Non-sexual Violence  Want to settle down now that you've gotten bored with cock carousel.

We used the dataset from the shared task at IberLEF 2021. The dataset contains around 11,000 short texts (tweets and Gab posts) in English and Spanish. It consists of 6,977 tweets for training and 3,386 tweets for testing, with a balanced distribution across the two languages. In addition, the test data contains 492 gabs in English and 490 in Spanish from the uncensored social network Gab (gab.com). The data was labelled manually by human annotators at two levels of granularity. First, each text was labelled as sexist or non-sexist. Second, the instances labelled as sexist were further divided into ideological-inequality, stereotyping-dominance, objectification, sexual-violence, and misogyny-non-sexual-violence. The EXIST dataset covers sexism in a broad sense, i.e., from explicit misogyny to implicit, subtle sexism.

4 System Description

We split the annotated data into a training set with 85% of the annotated instances and a validation set with the remaining 15%. We built similar systems for English and Spanish; the results obtained by these two systems were combined and submitted as a run. The same split was used for both Task 1 and Task 2.

4.1 Text Preprocessing

We removed punctuation symbols, links, numbers, and stop words. We used lemmatization for the English text and stemming for the Spanish text to bring together the different inflected forms of a word.
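These cleaning steps can be sketched as follows. This is a minimal regex-based illustration, not our exact implementation: it substitutes a tiny hand-picked stop-word list for a full one, and the lemmatization/stemming step is omitted for brevity.

```python
import re

# Tiny illustrative stop-word list (a stand-in for a full stop-word list;
# in our system the remaining tokens are then lemmatized or stemmed).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def clean_tweet(text: str) -> str:
    """Remove links, numbers, punctuation, and stop words from a tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # drop hyperlinks first
    text = re.sub(r"\d+", " ", text)           # drop numbers
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation symbols
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("Check this out: https://t.co/abc123!! 100% true, the worst."))
# → check this out true worst
```

Note that links must be removed before punctuation, otherwise stripping punctuation would break the URL pattern.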
In particular, we used the NLTK WordNet lemmatizer for English and the NLTK Snowball stemmer for Spanish. The same preprocessing is used for both Task 1 and Task 2.

4.2 Feature Extraction

We extracted the following features from the text:

• Upper Case Count: number of uppercase letters in the text
• Hashtag Count: number of hashtags in the text
• Link Count: number of hyperlinks in the text
• Tweet Length: total number of characters in the text
• Slang Words Count: number of English slang words in the text

These features were extracted before preprocessing, since hashtags and links are removed during preprocessing. We used the Slang Words Count feature only for the English text, for both Task 1 and Task 2; all the remaining features were used for both languages and both tasks. To convert the text into numerical features, we used the TF-IDF vectorizer from the scikit-learn library for SVM and Random Forest, and the Keras tokenizer for LSTM. The Keras tokenizer converts the text into either a sequence of integers or a vector whose per-token coefficients can be binary, count-based, TF-IDF-based, etc.

4.3 Machine Learning Models

For Task 1, we submitted three runs based on three different algorithms, namely SVM, Random Forest, and LSTM. We used the scikit-learn library for the SVM- and Random-Forest-based models and Keras for the LSTM. For the SVM, we experimented with three kernels: linear, RBF (radial basis function), and polynomial; the final run we submitted used the linear kernel. We used the following parameter values in scikit-learn:

1. For SVM, the kernel coefficient gamma is 0.01 and the penalty parameter C is 5 (other parameters at their defaults).
2. For Random Forest, the maximum depth is 150 and the number of trees is 1,000.
3. For LSTM, we used mean squared error as the loss and a batch size of 64.

For Task 2, we submitted two runs, based on SVM and Random Forest.
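The TF-IDF-plus-classifier runs described above can be sketched as follows, using the reported hyperparameter values. The toy texts and labels are placeholders, not the EXIST data, and the LSTM run is omitted.

```python
# Illustrative sketch of the Task 1 SVM and Random Forest runs: a TF-IDF
# vectorizer feeding a linear-kernel SVM (C=5; gamma is ignored by the
# linear kernel) and a Random Forest (max_depth=150, 1000 trees).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder data, not the EXIST dataset.
texts = [
    "women belong in the kitchen",
    "great football match today",
    "she is just an object to them",
    "lovely weather this morning",
]
labels = [1, 0, 1, 0]  # 1 = sexist, 0 = non-sexist

svm_run = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=5, gamma=0.01))
rf_run = make_pipeline(
    TfidfVectorizer(), RandomForestClassifier(max_depth=150, n_estimators=1000)
)

svm_run.fit(texts, labels)
rf_run.fit(texts, labels)
print(svm_run.predict(["women are just objects"]))
```

Bundling the vectorizer and classifier in a pipeline keeps the vocabulary learned on the training set consistent at prediction time.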
We treated Task 2 as a multi-class classification problem with six categories (the five categories of sexism plus non-sexist) rather than as an extension of Task 1; that is, for Task 2 we trained on the whole dataset and not just on the texts classified as sexist in Task 1. The parameter values were the same as mentioned above, and the SVM was implemented with a linear kernel.

5 Results and Discussion

The results of Task 1 are reported in terms of accuracy (Table 2), while the results of Task 2 are reported in terms of macro F1 score (Table 3). Our best accuracy in Task 1 is 0.7115; our best macro F1 score in Task 2 is 0.4504. Tables 2 and 3 show the ranking of our submissions in the shared task's official ranking: our best system was ranked 38th in Task 1 and 43rd in Task 2. The organizers also provided a baseline for both tasks. These rankings reflect the combined results for English and Spanish. For Task 1, the Random Forest system performed better than LSTM and SVM, but for Task 2, SVM performed better than Random Forest. The accuracy and F1 score obtained in Task 2 were lower than those of Task 1, a trend found in the results of all the teams. This can be attributed to the fact that classifying sexist text at finer granularity is a much more difficult task than detecting instances of sexism; moreover, since Task 2 is a multi-class classification problem, its F1 score was lower than that of the binary classification problem in Task 1. Features like the number of slang words, the number of hashtags, and the hyperlink count resulted in improved accuracy.

Table 2. Results for Task 1. The official evaluation measure is accuracy; the best score obtained by us is shown in bold.

Rank  Run                 Accuracy  F1
38    task1 Soumya 2      0.7115    0.7114
45    task1 Soumya 1      0.7047    0.7045
55    task1 Soumya 3      0.6761    0.6761
52    Baseline svm tfidf  0.6845    0.6832

Table 3. Results for Task 2. The official evaluation measure is F1.
The best score obtained by us is shown in bold.

Rank  Run                 Accuracy  F1
43    task2 Soumya 1      0.5923    0.4504
46    task2 Soumya 2      0.5950    0.4415
51    Baseline svm tfidf  0.5222    0.3950

6 Conclusion and Future Work

We evaluated the performance of different classification algorithms for sexism detection in this year's shared task. The results we achieved were average compared to the other submissions in the EXIST 2021 shared task. We found that extracted features like the number of slang words and the hyperlink count can result in better performance, and we look forward to experimenting with further features. Fine-tuning the parameters of the algorithms may also improve overall performance, and the results of more than one system could be combined to generate an overall better score. We shall be exploring these directions in future work.

Bibliography

[1] Ahluwalia R, Shcherbinina E, Callow E, Nascimento AC, De Cock M (2018) Detecting misogynous tweets. In: IberEval@SEPLN, pp 242–248
[2] Canós JS (2018) Misogyny identification through SVM at IberEval 2018. In: IberEval@SEPLN, pp 229–233
[3] Goenaga I, Atutxa A, Gojenola K, Casillas A, de Ilarraza AD, Ezeiza N, Oronoz M, Pérez A, Perez-de-Viñaspre O (2018) Automatic misogyny identification using neural networks. In: IberEval@SEPLN, pp 249–254
[4] Liu H, Chiroma F, Cocea M (2018) Identification and classification of misogynous tweets using multi-classifier fusion. In: Evaluation of Human Language Technologies for Iberian Languages: IberEval 2018, CEUR Workshop Proceedings, pp 268–273
[5] Montes M, Rosso P, Gonzalo J, Aragon E, Agerri R, Angel M, Carmona A, Mellado EA, de Albornoz JC, Chiruzzo L, Freitas L, Adorno HG, Gutierrez Y, Zafra SMJ, Lima S, de Arco FMP, et al. (eds) (2021) Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021).
CEUR Workshop Proceedings
[6] Pamungkas EW, Cignarella AT, Basile V, Patti V, et al (2018) 14-ExLab@UniTO for AMI at IberEval2018: Exploiting lexical knowledge for detecting misogyny in English and Spanish tweets. In: 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018, CEUR-WS, vol 2150, pp 234–241
[7] Rodríguez-Sánchez F, Carrillo-de-Albornoz J, Plaza L, Gonzalo J, Rosso P, Comet M, Donoso T (2021) Overview of EXIST 2021: Sexism identification in social networks. Procesamiento del Lenguaje Natural 67(0)
[8] Shushkevich E, Cardiff J (2018) Classifying misogynistic tweets using a blended model: The AMI shared task in IberEval 2018. In: IberEval@SEPLN, pp 255–259