Arabic Misogyny Identification

Fazlourrahman Balouchzahi (1), Grigori Sidorov (1) and Hosahalli Lakshmaiah Shashirekha (2)

(1) Instituto Politécnico Nacional, Centro de Investigación en Computación, CDMX, Mexico
(2) Department of Computer Science, Mangalore University, Mangalore, India

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Contact: frs_b@yahoo.com (F. Balouchzahi); sidorov@cic.ipn.mx (G. Sidorov); hlsrekha@gmail.com (H. L. Shashirekha)

Abstract

In addition to useful and relevant content, social media usually contains various forms of toxic content, such as Hate Speech (HS) and offensive or abusive language. Offensive content on social media may target a religion, a community, or an individual or group of people with specific thoughts and beliefs. A category of offensive content targeting women, termed Misogyny, is increasing day by day, and a person or group who shares such content is called a Misogynist. Misogyny detection can be seen as a sub-category of HS detection and Offensive Language Identification (OLI) in which women and issues concerning them, such as their rights, are targeted. Despite the many works on HS detection and OLI, Misogyny detection has rarely been studied, even for resource-rich languages. To promote Misogyny detection in the Arabic language, Arabic Misogyny Identification (ArMI), a shared task at the Forum for Information Retrieval Evaluation (FIRE) 2021, provides a dataset and invites researchers to develop models for detecting Misogyny in the given texts. The shared task consists of two subtasks, which can be modeled as binary and multiclass Text Classification (TC) tasks. This paper describes the models submitted by our team, MUCIC, to the ArMI shared task. The proposed methodology uses a combination of the most frequent char and word n-grams as features to train Machine Learning (ML) classifiers, and it obtained an accuracy of 0.873 for Subtask A and an F1-score of 0.497 for Subtask B.

Keywords: Social Media, Hate Speech, Offensive Language, Misogyny Detection, Machine Learning

1. Introduction

The unlimited freedom and anonymity of users on social media provide ample opportunities for users who wish to share Hate Speech (HS) and abusive content targeting different communities, religions, beliefs, etc. [1, 2]. Knowingly or unknowingly, women, children, and the younger generation are usually the victims of this hatred. Women's rights in Middle East countries have always been a concern for the world and for feminism. Comments on social media that target women and their rights are seen as acts of violence against women and are called Misogyny. Manually detecting Misogyny on social media is cumbersome and time-consuming due to the growing number of users and the increasing amount of Misogynistic content. Despite the many works exploring the automatic detection of HS and offensive language in various languages, Misogyny detection has received very little attention, even for resource-rich languages.
Hence, Misogyny detection is not only interesting but also challenging [3]. ArMI [4] (https://sites.google.com/view/armi2021/), a shared task at FIRE 2021 (http://fire.irsi.res.in/fire/2021/home), is a first step towards encouraging researchers to develop models for the detection of Misogyny in Arabic texts. With the aim of identifying Misogynistic tweets and categorizing them into different classes of Misogynistic behavior, the ArMI shared task consists of the following two subtasks:

• Subtask A - Misogyny Content Identification: a binary Text Classification (TC) task in which each tweet has to be classified as "Misogynistic (Misogyny)" if the tweet contains text against women, or "Non-misogynistic (None)" otherwise;
• Subtask B - Misogyny Behavior Identification: a multiclass TC task in which each tweet has to be classified into one of the eight categories described in Table 1.

Table 1: Description of the categories for Subtask B

Damning (Damn): the tweet contains an offensive invocation or curse against women
Derailing (Der): the tweet contains text that validates and justifies the abuse and mistreatment of women
Discredit (Disc): the tweet contains defamation and offensive language against women
Dominance (Dom): the tweet contains text that targets the equality of men's and women's rights by implying the superiority of men over women
Sexual Harassment (Harass): the tweet contains text describing sexual abuse against women
Stereotyping & Objectification (Obj): the tweet contains a description of women's physical appeal
Threat of Violence (Vio): the tweet contains a statement of an intention to take hostile action against women
None: the tweet does not contain any Misogynistic content

The effectiveness of various types of n-grams as features has been demonstrated by Balouchzahi et al. [1, 5, 6] for several TC tasks on Dravidian-language (https://en.wikipedia.org/wiki/Dravidian_languages) texts and on code-mixed texts in Dravidian languages. Continuing this line of work, and to explore the efficiency of n-gram-based feature sets for low-resource languages, we, team MUCIC, propose to utilize a combination of the 30,000 most frequent char n-grams and the 30,000 most frequent word n-grams as the feature set to tackle the Misogyny detection challenge of the ArMI shared task. The feature set, transformed into TF-IDF vectors, is used to train two ML classifiers, namely a Linear Support Vector Machine (LSVM) and Logistic Regression (LR). SVMs are popular ML classifiers that take advantage of high-dimensional feature sets such as n-grams and support various kernel functions. LR is a widely employed binary classifier; to deal with multiclass TC tasks, it utilizes the one-vs-rest (OvR) scheme [1]. A sketch of this pipeline is given below.
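The following listing is a minimal sketch of this pipeline with scikit-learn, not the team's exact script. The variable names train_texts, train_labels, and test_texts are our assumptions (tweets and labels loaded from the shared task files), and scikit-learn's max_features parameter keeps the terms with the highest corpus frequency, which corresponds to the top-frequency selection described above.

    # A minimal sketch of the described pipeline (assumed variable names).
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # train_texts/test_texts: tweets assumed to be pre-processed (see Section 3).
    # 30,000 most frequent char (1-3) and word (2-5) n-grams, TF-IDF weighted.
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=30000)
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=(2, 5), max_features=30000)

    # Combine both views of the text into a single sparse feature matrix.
    X_train = hstack([char_vec.fit_transform(train_texts), word_vec.fit_transform(train_texts)])
    X_test = hstack([char_vec.transform(test_texts), word_vec.transform(test_texts)])

    # Both classifiers are used with default parameters; for the multiclass
    # Subtask B, LR is reported to follow the one-vs-rest scheme.
    for clf in (LinearSVC(), LogisticRegression()):
        clf.fit(X_train, train_labels)
        predictions = clf.predict(X_test)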
The rest of the paper is organized as follows: Section 2 gives a summary of the recent literature on Misogyny detection and Arabic TC tasks, followed by the description of the Methodology in Section 3. The experiments and results are presented in Section 4, and the paper concludes in Section 5.

2. Related Work

A primary requirement for promoting NLP tasks in any language is the availability of annotated datasets. To promote the Misogyny detection task in Levantine Arabic, Mulki et al. [2] collected tweet replies to female journalists' tweets posted during the 2019 protests in Lebanon. The collected tweets were cleaned by removing non-textual tweets, mixed Arabic-Arabizi tweets, retweets, duplicate instances, and tweets consisting of a sequence of hashtags or a single word. In total, 77,856 tweets from the accounts of 7 female journalists were retrieved using the Twitter API (http://www.tweepy.org); non-Levantine tweets were removed manually, and only 6,603 tweets were retained for annotation. Two female annotators and one male annotator annotated the tweets into the eight categories mentioned in Table 1. The authors also experimented with various ML classifiers over Bag-of-Words features as baselines, along with LSTM and BERT models. BERT outperformed the other models, obtaining an accuracy of 0.88 on the binary Misogyny detection task and an F1-score of 0.43 on the multiclass task.

Apart from this, Misogyny detection in the Arabic language has not been studied earlier [4]. However, several HS detection and OLI tasks have been explored for the Arabic language, and some of them are briefly described below.

Farha et al. [7] explored Deep Learning (DL) and Transfer Learning (TL) approaches for OLI in Arabic using the SemEval 2020 Arabic OLI shared task dataset. This dataset consists of 7,000 training samples and 1,000 test samples for two subtasks, namely Subtask 1 (HS vs. Not-HS) and Subtask 2 (Offensive vs. Not-Offensive). They experimented with Bi-directional Long Short-Term Memory (BiLSTM) and BiLSTM-Convolutional Neural Network (CNN) models as DL approaches and ULMFiT as a TL approach. The BiLSTM-CNN model was used in a multitask learning setup in which the authors assumed that if a tweet contains HS content, it is offensive as well. Sentiment labels were also added as an objective in the methodology. Eventually, the BiLSTM-CNN model obtained the best results, with F1-scores of 0.904 and 0.737 for OLI and HS detection respectively.

Alshaalan et al. [8] developed an Arabic HS dataset consisting of 9,316 tweets distributed over five categories, namely Racist, Religious, Ideological, Tribal, and Regional. Similar to Mulki et al. [2], the Twitter API was used to scrape tweets, posted from March 2018 to August 2018, based on keywords. The obtained tweets were pre-processed by converting Emojis to text and removing hashtags, stopwords, extra white spaces, and punctuation, followed by spam filtering and the normalization and lemmatization of words. They experimented with several CNN and Recurrent Neural Network (RNN) models and the BERT transformer, as well as ML classifiers using char n-grams, for the HS detection task in the Saudi Twittersphere; the CNN models obtained the best score among the models, with an F1-score of 0.79.

Another Arabic HS dataset was developed by Albadi et al. [9] to encourage researchers to work on religious HS detection. The dataset covers tweets related to the religions common in Middle East countries, namely Islam, Christianity, Judaism, and Atheism; in addition, the authors included the Sunni and Shia branches of Islam. 6,000 tweets (1,000 per category) were collected using the Twitter API and distributed over six categories, namely Islam, Sunni, Shia, Christianity, Judaism, and Atheism. Various ML and DL models were experimented with as baselines, and a GRU-based RNN outperformed the other baselines with an F1-score of 0.79.
3. Methodology

The proposed methodology consists of the following steps: i) pre-processing the dataset, ii) extracting char and word n-grams from the given texts, iii) selecting the 30,000 most frequent features of each type and combining them to form the feature set, iv) vectorizing the feature set using TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), v) training the ML classifiers with the vectors obtained for the training set, and vi) evaluating the models using the vectors obtained for the test set. An overview of the proposed methodology is shown in Figure 1.

[Figure 1: Overview of the proposed method]

n-grams are simple and scalable features that are utilized in many NLP tasks, where the value of "n" indicates the amount of context that is captured. Despite consuming little memory and time, n-grams enhance the performance of many TC tasks [10]. The n-gram ranges and the total number of features before selecting the most frequent ones are presented in Table 2.

Table 2: Feature statistics

n-gram type    Range     Total no. of features
char           (1, 3)    222,236
word           (2, 5)    153,519

The steps to pre-process the dataset are given below (a sketch of these steps follows the list):

• Emoji-to-text conversion: all Emojis are converted to the corresponding English text using the demoji library (https://pypi.org/project/demoji/). Converting Emojis to English words is considered a better option than removing them, as removal loses important information.
• Punctuation removal: since punctuation marks are usually not informative features for TC, they are removed.
• Digit removal: since digits are usually not informative features, they are removed from the texts. This also reduces the number of features.
• Short-word removal: all words of length less than or equal to two are removed, to reduce the number of features.
• Lower-casing: lower-casing of uppercase characters applies only to the English words obtained from the Emoji-to-text conversion.
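The listing below is a minimal sketch of these pre-processing steps, assuming the demoji package cited above; the cleaning order and the regular expression are our illustration, not necessarily the team's exact script.

    # A sketch of the pre-processing described above (assumptions noted).
    import re

    import demoji  # older demoji versions require demoji.download_codes() once

    def preprocess(text: str) -> str:
        # Emoji to text: replace every Emoji with its English description.
        text = demoji.replace_with_desc(text, sep=" ")
        # Remove punctuation and digits, which are treated as uninformative
        # (Arabic letters match \w in Python 3 and are kept).
        text = re.sub(r"[^\w\s]|[\d_]", " ", text)
        # Drop words of length <= 2 to reduce the number of features.
        tokens = [tok for tok in text.split() if len(tok) > 2]
        # str.lower() leaves caseless Arabic text untouched, so it effectively
        # lower-cases only the English words produced by the Emoji conversion.
        return " ".join(tok.lower() for tok in tokens)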
The feature vectors of the training set are used to train the LSVM and LR classifiers, set with default parameters, for each subtask of the ArMI shared task, and the predictions on the test set are submitted to the organizers for evaluation.

4. Experiments and Results

The dataset for the ArMI shared task is a collection of tweets in the Gulf, Egyptian, and Levantine dialects, in which the Let-Mi [2] dataset contributes the Levantine-dialect tweets. The rest of the multi-dialect tweets were collected from Twitter based on hashtags, queries, and the timelines of Misogynists that contain Misogynistic content. Participants of the shared task were provided with a training set consisting of 7,866 tweets (posted from January 2019 to January 2021 and manually annotated by native Arabic speakers) and a test set containing 1,967 unlabeled tweets for evaluating the models. The label distribution of the training set for the two subtasks is given in Table 3.

Table 3: Statistics and label distribution of the training set

Subtask A: Misogyny 4,805 | None 3,061

Subtask B: Discredit 2,868 | Damning 669 | Stereotyping & Objectification 653 | Threat of violence 230 | Dominance 230 | Derailing 105 | Sexual harassment 61 | None 3,061

Accuracy and F1-score are used by the organizers to rank the models submitted by the participants for Subtask A and Subtask B respectively, and the results obtained are shown in Table 4. The results illustrate that the LR classifier outperformed LSVM, with an accuracy of 0.873 for Subtask A and an F1-score of 0.497 for Subtask B.

Table 4: Performance of the proposed methodology

Subtask      Classifier   Accuracy   Precision   Recall   F1-score
Subtask A    LR           0.873      0.868       0.864    0.866
             LSVM         0.866      0.860       0.857    0.858
Subtask B    LR           0.765      0.578       0.460    0.497
             LSVM         0.762      0.572       0.456    0.493

The comparison of the accuracies of the models submitted by the participating teams to Subtask A of the shared task, shown in Figure 2, illustrates very competitive results. It can be observed that the difference between the accuracy values of the models (except the models submitted by the Isey team) is less than 0.02.

[Figure 2: Comparison of the accuracies of our model with those of the top performing teams for Subtask A]

The comparison of the F1-scores of the models submitted by the participating teams to Subtask B is shown in Figure 3. It can be observed that the difference between the F1-scores of the models (except the model submitted by iCompass) is less than 0.3.

[Figure 3: Comparison of the F1-scores of our model with those of the top performing teams for Subtask B]

The proposed methodology trails the best performing team by only 0.046 in accuracy for Subtask A and 0.168 in F1-score for Subtask B. Analysis of the results also illustrates that all the teams obtained better performance for Subtask A, which is a binary TC task.
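For reference, such scores can be reproduced from gold and predicted labels with scikit-learn, as in the minimal sketch below; y_true and y_pred are assumed label lists, and macro averaging over classes is our assumption, consistent with the separate precision, recall, and F1-score columns of Table 4.

    # A sketch of the scoring (assumed variable names and macro averaging).
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"Accuracy: {accuracy:.3f}  P: {precision:.3f}  R: {recall:.3f}  F1: {f1:.3f}")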
5. Conclusion and Future Work

This paper describes the models submitted by team MUCIC to the ArMI shared task, which focuses on detecting Misogyny in the Arabic language. The ArMI shared task consists of two subtasks, namely Misogyny Content Identification and Misogyny Behavior Identification, which are modeled as binary and multiclass TC tasks respectively. The proposed methodology includes a text pre-processing step, followed by generating the most frequent char and word n-grams as features and combining and transforming them into TF-IDF vectors. These vectors are used to train two ML classifiers, namely LSVM and LR. The ML classifiers show very competitive results on the dataset provided by the shared task organizers for both subtasks, with LR outperforming LSVM and obtaining an accuracy of 0.873 for Subtask A and an F1-score of 0.497 for Subtask B. Despite its simplicity, our naïve methodology obtained promising results. The results of our models are expected to improve further by expanding the experiments in both the feature engineering and model construction steps. Exploring various features and feature selection algorithms, ensembling various ML classifiers, and exploring TL will be the future work.

Acknowledgments

Team MUCIC sincerely appreciates the organizers' efforts in conducting this shared task.

References

[1] F. Balouchzahi, B. K. Aparna, H. L. Shashirekha, MUCS@DravidianLangTech-EACL2021: COOLI - Code-Mixing Offensive Language Identification, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 323–329.
[2] H. Mulki, B. Ghanem, Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language, in: Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021, pp. 154–163.
[3] S. Frenda, B. Ghanem, M. Montes-y-Gómez, Exploration of Misogyny in Spanish and English Tweets, in: Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), volume 2150 of CEUR Workshop Proceedings, 2018, pp. 260–267.
[4] H. Mulki, B. Ghanem, ArMI at FIRE2021: Overview of the First Shared Task on Arabic Misogyny Identification, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[5] F. Balouchzahi, B. K. Aparna, H. L. Shashirekha, MUCS@LT-EDI-EACL2021: CoHope - Hope Speech Detection for Equality, Diversity, and Inclusion in Code-Mixed Texts, in: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, 2021, pp. 180–187.
[6] F. Balouchzahi, H. L. Shashirekha, MUCS@Dravidian-CodeMix-FIRE2020: SACO - Sentiments Analysis for CodeMix Text, in: FIRE (Working Notes), 2020, pp. 495–502.
[7] I. A. Farha, W. Magdy, Multitask Learning for Arabic Offensive Language and Hate-Speech Detection, in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020, pp. 86–90.
[8] R. Alshaalan, H. Al-Khalifa, Hate Speech Detection in Saudi Twittersphere: A Deep Learning Approach, in: Proceedings of the Fifth Arabic Natural Language Processing Workshop, 2020, pp. 12–23.
[9] N. Albadi, M. Kurdi, S. Mishra, Are They Our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere, in: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2018, pp. 69–76.
[10] F. Balouchzahi, M. D. Anusha, H. L. Shashirekha, MUCS@TechDOfication using Fine-Tuned Vectors and n-grams, in: Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task, NLP Association of India (NLPAI), Patna, India, 2020, pp. 1–5. URL: https://aclanthology.org/2020.icon-techdofication.1.