Sexism Identification In Tweets Using Machine Learning
                         Approaches
                         Murari Sreekumar, Shreyas Karthik, Durairaj Thenmozhi, Shriram Gopalakrishnan and
                         Krithika Swaminathan
                         Sri Sivasubramaniya Nadar College Of Engineering, Rajiv Gandhi Salai (OMR), Kalavakkam 603 110, Tamil Nadu, India


                                      Abstract
                                      Sexism poses significant challenges in sentiment analysis, as it can manifest in subtle and nuanced ways, often
                                      embedded within seemingly benign language. On social media, where communications are frequently code-mixed,
                                      particularly in Dravidian languages, there is an increasing demand for identifying sexist content to ensure healthy
                                      online interactions. The EXIST 2024 shared task aims to detect sexism in Spanish and English tweets collected
                                      from social media platforms. We employ several traditional machine learning approaches to identify whether
                                      the tweets contain sexist content in Spanish and English. Using Support Vector Machines
                                      (SVM), Random Forest and Logistic Regression as classifiers, we achieve F1 scores of 0.6299, 0.6074 and 0.5518
                                      respectively on the English dataset.

                                      Keywords
                                      Sexism Identification, Traditional Machine Learning Algorithms, Natural Language Processing, Sentiment Analysis,
                                      Text Analytics


                         1. Introduction
                         Sexism is prejudice or discrimination based on one’s sex or gender. Sexism can affect anyone, but
                         primarily affects women and girls. It has been linked to gender roles and stereotypes, and may include
                         the belief that one sex or gender is intrinsically superior to another. With the advent of social media,
                         people have begun misusing the freedom of speech and expression and have engaged in a lot of hate
                         speech against women politicians, journalists, public personalities, etc. This has risen especially on social media
                         platforms such as Twitter during the pandemic [1].
                            Women who experience online abuse often alter their online behaviour, self-censor their content
                         and limit their interactions on platforms out of fear of violence and abuse. By silencing or driving
                         women away from online spaces, online violence can affect their economic outcomes, leading to
                         loss of employment and societal status. Additionally, online gender-based violence may serve as a
                         predictor of violent crimes in the physical world [2][3]. It is therefore crucial to address these aspects of sexism
                         in social networks, and Natural Language Processing research can provide insights into
                         identifying such tweets and classifying them as Sexist or Non-Sexist. Computational understanding of
                         natural language has been used in addressing issues such as sentiment analysis [4], human behaviour
                         detection, fake news detection [5], question answering, and depression and threat detection across
                         different forms of media.
                            This paper contributes to the field of sexism identification in the following ways:

                                • Annotated Datasets: We leverage a large dataset annotated by multiple annotators, which
                                  supports the robustness of the models and improves the accuracy of sexism identification.
                                • Optimized Approach: The models used in this research, namely Support Vector Machines (SVM),
                                  Logistic Regression, and Random Forest, have their hyperparameters tuned so
                                  that they identify sexist tweets effectively.
                          CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                          Email: murari2310237@ssn.edu.in (M. Sreekumar); shreyas2310140@ssn.edu.in (S. Karthik); theni_d@ssn.edu.in (D. Thenmozhi);
                          shriram2310156@ssn.edu.in (S. Gopalakrishnan); krithika2010039@ssn.edu.in (K. Swaminathan)
                          ORCID: 0000-0003-0681-6628 (D. Thenmozhi)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                          CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
    • This project can be used for real-time applications on social media platforms such as Twitter, Instagram,
      Facebook and LinkedIn in order to maintain a healthy and safe online environment.

   The task that we have performed in EXIST 2024 is Task 1, Sexism Identification in Tweets. In this task, the
systems have to decide whether the tweets are Sexist or Not Sexist.
   In this paper, we discuss the work that we have done for Task 1. The rest
of the paper is organised as follows: Section 2 presents a literature survey explaining the key theories
and concepts, research methodologies and the trends and patterns common in the field of sexism
identification. Section 3 describes the dataset used and the task performed. Section 4 describes
the methodology, including preprocessing, lemmatization, vectorization and the various models used
for our task. Section 5 presents our results and a performance analysis in comparison with other teams participating
in the task. Section 6 reflects on what we learnt from the task. Finally, Section 7 presents the conclusions and
the future prospects of the research work.


2. Related Work
Various works in the field of sexism identification were studied; they employ diverse methodologies and approaches
for sexism identification and classification. Significant efforts have
been made by researchers around the world to develop annotated datasets and apply deep learning
models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In
addition to these, various transformer-based models like BERT have been used, as they have consistently
provided excellent accuracy in identifying sexist tweets.
   Rodríguez-Sánchez et al. (2020) [6] conducted research on the automatic classification of sexism in
social networks, focusing mainly on Twitter data in Spanish. They developed the MeTwo
dataset, which labels tweets as sexist, non-sexist or doubtful. This is the first dataset in Spanish
used to identify sexism in a broad sense, ranging from hostile to subtle sexism. To classify the tweets
into the three categories, they used various traditional Machine Learning models like Support Vector
Machine (SVM), Logistic Regression, Random Forest, and Naive Bayes. Various advanced deep learning
models like Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Long Short
Term Memory (Bi-LSTM), Convolutional Neural Networks (CNN) and Recurrent Neural Networks
(RNN) were also used. Their research can be applied in fields such as misogyny detection
in tweets and various other texts.
   Davidson et al. (2017) [7] worked to distinguish hate speech from offensive language
on social media. They collected tweets and labeled them into three categories, namely hate speech,
offensive language and neither. First, they converted all the text into lowercase, stemmed the text to
obtain the root words using PorterStemmer, and created unigram, bigram and trigram features weighted by TF-IDF.
They used Penn Part-Of-Speech (POS) tagging and included count indicators for hashtags, mentions,
retweets, and URLs, as well as features for the number of characters, words, and syllables in each tweet.
They then tested various models like Logistic Regression, Naive Bayes, decision trees, random forests, and linear
SVMs. These models successfully classified racist and homophobic slurs as hate speech, while sexist
language was more frequently categorized as offensive.
   Harika Abburi et al. (2021) [8] worked on Fine-Grained Multi-label Sexism Classification Using a
Semi-Supervised Multi-level Neural Approach. They initially employed the technique of Self-training,
which is a semi supervised learning approach that helps augment the set of labeled instances by
selectively adding unlabeled samples. Then it applies the models to the unlabeled instances and
identifies a subset of them to be added to the training set, along with the predicted labels. To address
categories with scarce labeled data, they propose a multi-level training approach. The model trains
initially on a reduced set of broader categories (coarse), then refines its understanding on the full set of
fine-grained categories. To begin with, the data was tested on various Traditional Machine Learning
models like logistic regression (LR), Support Vector Machine (SVM) and Random Forests (RF) Classifiers.
These were applied on two feature sets namely TF-IDF on word unigrams and bigrams (Word ngrams)
and the average of the ELMo vectors. Then various Deep Learning techniques like BiLSTM, BERT and
CNN-based architectures were used. Thus, this approach can be used to analyze online sexism by
using unlabeled data and various Deep Learning and Neural Network models.

Table 1
Distribution of tweet samples across training, development and testing for each language
                         Split        Language     No. of samples     Percentage (%)
                        Training       English          3260               47.1
                        Training       Spanish          3660               52.9
                      Development      English           489               47.1
                      Development      Spanish           549               52.9
                        Testing        English           978               47.1
                        Testing        Spanish          1098               52.9

   S. Sharifirad et al. (2019) worked on a comprehensive classification of different online harassment
categories and explained its challenges using NLP. The tweets were classified into Indirect Harassment,
Information Threat, Sexual Threat and Non Sexist. They used various features and methods such as
bigrams, trigrams, two-character grams, Word2Vec, Doc2Vec and Long Short Term Memory (LSTM),
among others. N-gram features help identify boundaries between words or phrases in text, especially
in languages without explicit word separators. By analyzing sequences of words, n-grams can also be used
to predict the next word in a sequence, which is useful for tasks like text generation. They used
neural networks and the traditional machine learning technique Naive Bayes. The tweets were classified
correctly into their categories with accuracy ranging from 0.66 to 0.91 for LSTM.
   Thus, it is found that while significant progress has been made in identifying and mitigating various
forms of sexism on social networks, many existing studies primarily focus on explicit instances of
sexist language. However, the detection and analysis of more subtle, implicit forms of sexism remain
under-explored. Additionally, the intersection of sexism with other forms of discrimination, such as
racism or homophobia, has not been thoroughly investigated. This research aims to address these gaps
by developing more sophisticated algorithms that can identify both explicit and implicit sexist content,
considering the broader context of intersectional discrimination in social network environments. In
addition to these, we also aim to integrate these techniques in various social media platforms to ensure
safe and healthy online environments.


3. Task and Dataset
The organizers of the EXIST task at CLEF 2024 provided a dataset called EXIST2024 [9][10]. The EXIST2024 dataset
contains 6920 tweets for training, 1038 tweets for development and 2076 tweets for testing,
which adds up to 10034 tweets overall.
  From Table 1, it can be observed that the training, development and testing datasets contain
English and Spanish tweets in the same ratio.

   TASK 1: Sexism Identification in Tweets. The first task is a binary classification. The systems
have to decide whether or not a given tweet contains sexist expressions or behaviours (i.e., it is sexist
itself, describes a sexist situation or criticizes a sexist behaviour). Table 2 shows examples
of sexist and not sexist messages. The opinions of six annotators were also given for each tweet. These annotators
classified the tweets as Sexist or Non-Sexist using "YES" and "NO". The opinion given by the majority
of the annotators was taken into account for every tweet and then used for identifying whether a tweet
is sexist.
Table 2
Examples of Sexist and Non-Sexist Statements
   Sexist                                                 Non-Sexist
   "Mujer al volante, tenga cuidado!"                     "Alguien me explica que zorra hace la gente en el
                                                          cajero que se demora tanto."
   "People really try to convince women with little       "@messyworldorder it’s honestly so embarrassing
   to no ass that they should go out and buy a body.      to watch and they’ll be like ’not all white women
   Like bih, I don’t need a fat ass to get a man. Never   are like that’"
   have."
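
The majority-vote labelling described for Task 1 can be sketched as follows. This is a minimal illustration, not the organizers' aggregation code: the function name `majority_label` is hypothetical, and returning `None` for a tie (for example 3 "YES" against 3 "NO") is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, top = Counter(votes).most_common(1)[0]
    # Strict majority: the winning label needs more than half of all votes.
    if top * 2 > len(votes):
        return label
    # Assumption: ties are left unlabeled rather than resolved arbitrarily.
    return None


votes_a = ["YES", "YES", "NO", "YES", "YES", "NO"]  # 4 of 6 say YES
votes_b = ["YES", "NO", "YES", "NO", "NO", "YES"]   # 3-3 tie

print(majority_label(votes_a))
print(majority_label(votes_b))
```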


4. Methodology
We trained the traditional machine learning models such as Support Vector Machine (SVM) [11, 12],
Random Forest and Logistic Regression on the training dataset, evaluated the models on the dev dataset
and submitted our runs by applying the ML models on the test dataset.

4.1. Preprocessing
Our first step was to clean the data given in order to improve the performance of the machine learning
models:

   1. Converting the text to lowercase: This ensures consistency in the text data. It also reduces the
      vocabulary size and therefore the computational requirements.
   2. Removing punctuation marks: These do not contribute to the semantic meaning of the text.
   3. Removing http links and emoticons: Links often point to external resources that are not relevant to the
      context of the text being analyzed.
   4. Removing Twitter mentions like @username.
   5. Removing all the numbers from the tweet column: These do not contribute towards identifying sexist words.
   6. Removing stop words like "a", "an", "the", "is" and so on to improve the accuracy of the models.
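
The cleaning steps above can be sketched as a single function. This is a minimal sketch under stated assumptions rather than the exact pipeline used for the submitted runs: the stop-word set is a small illustrative subset, the emoji ranges are approximate, and the name `clean_tweet` is hypothetical. Links and mentions are removed before punctuation so that stripping ":" or "@" does not leave fragments of them behind.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list, not a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
# Assumption: rough Unicode ranges covering most emoji and pictographs.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# ASCII punctuation plus the Spanish inverted marks.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "¡¿")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, mentions, emoji, digits,
    punctuation and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("@user The 2 women are GREAT drivers!!! https://t.co/abc 😀"))
```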

4.2. Lemmatization
Lemmatization is a crucial step in preprocessing data where the words in the text are converted to
their base form. We preferred lemmatization as it follows grammatical rules better than
stemming. This process involves:

    • Identifying the part of speech: Understanding whether a word is a noun, verb, adjective, etc.,
      which helps in determining the correct lemma.
    • Morphological analysis: Analyzing the structure and form of the word to convert it to its base
      form.
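
The role of the part of speech in choosing a lemma can be illustrated with a toy lookup. This is only a sketch: the table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and hand-written for illustration, whereas a real lemmatizer relies on a full lexicon and morphological rules.

```python
from typing import Dict, Tuple

# Hypothetical hand-written lexicon mapping (word, part of speech) to a lemma.
LEMMA_TABLE: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",      # "I saw her" -> see
    ("saw", "NOUN"): "saw",      # "a rusty saw" -> saw
    ("better", "ADJ"): "good",
    ("women", "NOUN"): "woman",
    ("driving", "VERB"): "drive",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word given its part of speech.

    Unknown (word, pos) pairs fall back to the lowercased word itself.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
print(lemmatize("Women", "NOUN"))
```

The same surface form "saw" maps to different lemmas depending on its part of speech, which is the distinction a stemmer cannot make.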

4.3. Vectorization
For the models to process the text, we need to convert the data into
a numerical format, typically vectors or arrays of numbers. Among the vectorization
techniques, we found TF-IDF vectorization to give better accuracy. It weights the frequency
of each word by how commonly it appears across all documents, giving more weight to less common but
significant words.
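
The weighting idea can be made concrete with a small worked example. The sketch below uses the textbook definition, tf(t, d) = count(t, d) / |d| and idf(t) = ln(N / df(t)); this is an assumption made for clarity, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed idf and normalise each vector, so their numbers differ. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute textbook TF-IDF weights for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["women", "cannot", "drive"],
    ["women", "drive", "well"],
    ["nice", "weather", "today"],
]
weights = tf_idf(docs)
# "cannot" occurs in one document, "women" in two, so "cannot" weighs more.
print(weights[0])
```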

4.4. Model Evaluation
We have used three models in the hard-hard setting, that is, with hard labels such as Sexist and Non-Sexist. The
models, together with the decision trees that underlie Random Forest and the hyper-parameter tuning procedure, are
described below:
   1. Support Vector Machines: A supervised machine learning algorithm that can be used for classification
      and regression tasks. It operates by creating a decision boundary that separates the n-dimensional
      feature space into classes so that a new data point can be assigned to its relevant category.
   2. Logistic Regression: It is a regression model mainly used for classification problems. Logistic
      regression models the probability that a given input belongs to a particular class. It uses the
      logistic function, also known as the sigmoid function, to map any real-valued number into the
      range [0, 1].
   3. Random Forest: It is an ensemble learning method in which multiple decision trees are built
      during training and their results are merged to improve accuracy and reduce over-fitting.
   4. Decision Trees: A tree-like model of decisions and their possible consequences, including outcomes,
      resource costs, and utility.
   5. Hyper-parameter tuning: This is an essential step that helps in optimizing the performance of the
      models used for classifying the tweets. Hyper-parameters are configurations external to the model
      that cannot be learned from the data, such as learning rate, batch size, and the number of layers
      in a neural network. Since the data used in NLP is highly complex and multi-dimensional, hyper-parameter
      tuning is used to identify optimal hyper-parameter configurations in order to make
      the models more efficient and accurate. There are various methods of hyper-parameter tuning,
      such as GridSearchCV, RandomizedSearchCV, Bayesian Optimization and Gradient-based Optimization.
      We have used GridSearchCV for our research.
      For SVM, we have tuned hyper-parameters such as the regularization parameter (C) and the kernel
      parameters, such as the gamma parameter for the radial basis function (RBF) kernel. For Logistic
      Regression, we have tuned the regularization strength (often denoted as
      C). The type of regularization penalty, L1 (lasso) or L2 (ridge), is also tuned to improve model
      generalization.
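
A grid search over the SVM hyper-parameters named above can be sketched with scikit-learn as follows. This is a minimal sketch, not the configuration used for the submitted runs: the toy sentences, the grid values, the `f1` scoring choice and `cv=2` are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data (assumption): 1 = Sexist, 0 = Non-Sexist.
texts = [
    "women cannot drive",
    "girls are bad at math",
    "women should stay at home",
    "a woman cannot lead a team",
    "the weather is nice today",
    "i enjoyed the football match",
    "the new phone has a great camera",
    "traffic was slow this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# gamma only matters for the RBF kernel, so it gets its own sub-grid.
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1]},
]

# Every combination is evaluated with 2-fold stratified cross-validation.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["women cannot lead"]))
```

For Logistic Regression the same pattern applies, with the grid ranging over `C` and `penalty` instead of the kernel parameters.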


5. Results and Performance Analysis
5.1. Performance Analysis
Scikit-learn, also known as sklearn, is an open-source, machine learning and data modeling library for
Python. It features various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate
with the Python libraries, NumPy and SciPy. The sklearn metrics library also provides the classification
report for evaluation of the performance of the model. The performance is measured using the following
metrics:
   1) Precision: Precision is defined as the ratio of true positives to the sum of true positives and false positives.
   2) Recall: Recall is defined as the ratio of true positives to the sum of true positives and false negatives.
   3) F1-Score: The F1 score is the harmonic mean of precision and recall. The closer the value of F1
is to 1, the better the performance of the model.
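
The three metrics can be computed directly from the counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration of the definitions, not the scikit-learn implementation; the function name `precision_recall_f1` and the example labels are hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(
    y_true: List[str], y_pred: List[str], positive: str = "YES"
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = ["YES", "YES", "YES", "NO", "NO", "NO"]
y_pred = ["YES", "YES", "NO", "YES", "NO", "NO"]
# tp = 2, fp = 1, fn = 1, so precision = recall = F1 = 2/3.
print(precision_recall_f1(y_true, y_pred))
```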

  The results of the task are presented in Table 3. Among the three models used, SVM
had the best F1 score of 0.6299. The next best F1 score, 0.6074, came from Random Forest.
Logistic Regression had an F1 score of 0.5518. Table 3 also displays the ranking of our submissions based
on the official ranking of the shared task in the hard-hard evaluation scenario.


Table 3
Ranks and scores of our 3 runs in comparison with neighbouring runs (official hard-hard ranking)
               Run                         Rank    ICM-Hard      ICM-Hard Norm        F1
               FraunhoferSIT_1              55       0.2334             0.6191      0.6447
               The-Three_Musketeers_2       56       0.2171             0.6108      0.6299
               The-Three_Musketeers_3       57       0.2130             0.6087      0.6074
               maven_2                      58       0.1926             0.5983      0.6512
               The-Three_Musketeers_1       60       0.1184             0.5604      0.5518


6. Reflections
Through this paper we learnt about important methods in the field of natural language processing and
the steps involved in applying them. We also learnt through this task that SVMs are in general a very good model for text
classification, as they are particularly effective in cases where the number of dimensions (features) is
greater than the number of samples. This makes them suitable for applications like text classification,
where each word can be considered a feature. While SVMs work with linear hyperplanes by default,
the ‘kernel trick’ allows them to handle non-linear relationships between features. This is crucial for
text, where complex semantic relationships exist between words.
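
The kernel trick mentioned above can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel k(x, y) = (x · y)^2 on 2-dimensional vectors, chosen purely for illustration (the text does not say this kernel was used), and shows that it equals an ordinary dot product in an explicit 3-dimensional feature space without ever constructing that space when the kernel is evaluated. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, y) ** 2


def explicit_map(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit feature map phi for the kernel above (2-D input only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, y)                       # (1*3 + 2*4)^2
explicit_value = dot(explicit_map(x), explicit_map(y))  # same value via phi
print(kernel_value, explicit_value)
```

A linear separator in the mapped space corresponds to a non-linear (quadratic) boundary in the original space, which is how a kernelised SVM handles non-linear relationships.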


7. Conclusions
In this paper we have explored traditional machine learning models to classify Sexist
and Not-Sexist tweets on the data provided by EXIST in the English language. The SVM
had the best F1 score of 0.6299. This research contributes to the field of natural language processing
and provides valuable insights into addressing social issues on online platforms. Future work can focus on expanding
the model to handle multi-class classification problems, incorporating more advanced techniques such
as attention mechanisms, and exploring additional preprocessing steps to improve model performance.
Additionally, the model can be deployed in real-world applications to mitigate and monitor instances of
sexism on social media platforms. We hope these efforts will contribute towards the fight against sexism.


References
 [1] N. Dehingia, J. McAuley, L. McDougal, E. Reed, J. G. Silverman, L. Urada, A. Raj, Violence against
     women on twitter in india: Testing a taxonomy for online misogyny and measuring its prevalence
     during covid-19, PLoS one 18 (2023) e0292121.
 [2] A. Chaudhary, R. Kumar, Sexism identification in social networks, Working Notes of CLEF (2023).
 [3] R. Ouedraogo, D. Stenzel, How domestic violence is a threat to economic development, IMF Blog
     Insights & Analysis on Economics and Finance (2021).
 [4] L. Khan, A. Amjad, N. Ashraf, H.-T. Chang, A. Gelbukh, Urdu sentiment analysis with deep
     learning methods, IEEE access 9 (2021) 97803–97812.
 [5] Z. Khanam, B. Alwasel, H. Sirafi, M. Rashid, Fake news detection using machine learning approaches,
     in: IOP conference series: materials science and engineering, volume 1099, IOP Publishing,
     2021, p. 012040.
 [6] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, Automatic classification of sexism in social
     networks: An empirical study on twitter data, IEEE Access 8 (2020) 219563–219576.
 [7] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem
     of offensive language, in: Proceedings of the international AAAI conference on web and social
     media, volume 11, 2017, pp. 512–515.
 [8] P. Parikh, H. Abburi, N. Chhaya, M. Gupta, V. Varma, Categorizing sexism and misogyny through
     neural approaches, ACM Transactions on the Web (TWEB) 15 (2021) 1–31.
 [9] L. Plaza, J. Carrillo-de Albornoz, E. Amigó, J. Gonzalo, R. Morante, P. Rosso, D. Spina, B. Chulvi,
     A. Maeso, V. Ruiz, Exist 2024: sexism identification in social networks and memes, in: European
     Conference on Information Retrieval, Springer, 2024, pp. 498–504.
[10] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification
     and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli,
     N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference
     and Labs of the Evaluation Forum, 2024.
[11] S. L. Salzberg, C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers,
     Inc., 1993, 1994.
[12] T. Pranckevičius, V. Marcinkevičius, Comparison of naive bayes, random forest, decision tree,
     support vector machines, and logistic regression classifiers for text reviews classification, Baltic
     Journal of Modern Computing 5 (2017) 221.