<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>H. Mulki); bghanem@ualberta.ca (B. Ghanem)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ArMI at FIRE 2021: Overview of the First Shared Task on Arabic Misogyny Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hala Mulki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bilal Ghanem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The ORSAM Center for Middle Eastern Studies</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Alberta</institution>
          ,
          <addr-line>Edmonton</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper provides an overview of the organization, results and main findings of the first shared task on misogyny identification in Arabic tweets. The Arabic Misogyny Identification task (ArMI) is introduced within the Hate Speech and Offensive Content detection (HASOC) track at FIRE-2021. The ArMI task combines two related classification subtasks: a main binary classification subtask for detecting the presence of misogynistic language, and a fine-grained multi-class classification subtask for identifying seven misogynistic behaviors found in misogynistic content. The data provided for this task is a Twitter dataset composed of 9,833 tweets written in Modern Standard Arabic (MSA) and several Arabic dialects including Levantine, Egyptian and Gulf. ArMI at FIRE-2021 received a total of 15 submitted runs for Sub-task A and 13 runs for Sub-task B from six different teams. The participating systems employed various methods, including feature-based classical machine learning techniques, neural networks, ensemble methods and transformers. The best performing system achieved an F-measure of 91.4% and 66.5% for Sub-task A and Sub-task B, respectively. This indicates that misogynistic language detection and misogynistic behavior identification in Arabic text can be effectively addressed using transformer-based approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Misogyny identification</kwd>
        <kwd>Arabic language</kwd>
        <kwd>Social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        Online misogyny has become a universal phenomenon spread widely across social media
platforms. Misogyny is one type of hate speech that disparages a person or a group
having the female gender identity; it is typically defined as hatred of or contempt for
women [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. According to [3], misogynistic language can be classified, based on the
misogynistic behavior it expresses, into several categories such as discredit, dominance,
derailing, sexual harassment, stereotyping and objectification, and threat of violence.
Like their peers all over the world, women in the Arab region are exposed to several forms
of online misogyny, through which gender inequality, violence against women, and
underestimation of women are, unfortunately, reinforced and justified [3]. This makes the
automatic identification of online Arabic misogyny crucial for curbing misogynistic Arabic
content and, thus, enabling Arab women to explore social media safely and express their
opinions freely [4]. However, this cannot be achieved without the annotated data needed for
training and developing automatic misogyny identification systems. While online misogyny
detection for Indo-European languages such as English, Spanish and Italian has been
addressed by multiple systems presented at several shared tasks [5, 6], Arabic misogynistic
language detection has never been tackled in similar tasks because of the lack of Arabic
datasets annotated for misogyny. Therefore, the motivation for organizing the Arabic
Misogyny Identification task (ArMI) goes beyond proposing the first shared task on the
automatic detection of Arabic misogynistic language: it also aims to enrich Arabic
linguistic resources with a novel type of Arabic toxic content. Thus, we have made the ArMI
evaluation dataset publicly available to the research community, aiming to foster novel
systems able to handle the challenging nature of the Arabic language and its dialects in
the context of identifying misogynistic language and misogynistic behaviors. Moreover,
through this task, we encourage the participation of research groups from both academia
and industry, seeking to advance the state of the art and pave the way for misogyny
detection in the Arabic language.
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 presents the shared task
description and the sub-tasks included in ArMI. Section 3 provides a detailed review of
the evaluation dataset in terms of data collection, annotation, and statistics. In Section 4,
the evaluation measures are presented while Section 5 discusses the participating systems,
lists and compares their submitted results. Finally, Section 6 concludes the overview
study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The ArMI shared task aims at identifying misogynistic language and recognizing
different misogynistic behaviors in a collection of Arabic (MSA/dialectal) tweets.
Participants can choose to participate in one or both of the following sub-tasks:</p>
      <p>• Sub-task A - Misogyny Content Identification:
This sub-task is a coarse-grained binary classification in which the
participating systems are required to classify the tweets into two classes, namely:
Misogynistic (Misogyny) and Non-misogynistic (none).</p>
      <p>• Sub-task B - Misogyny Behavior Identification:
This sub-task is a fine-grained, multi-class classification of misogynistic behaviors
in which, along with the Non-misogynistic class, the misogynistic tweets from
Sub-task A are further classified into seven categories:
1. Damning: tweets under this class contain cursing content.
2. Derailing: tweets under this class contain justification of the abuse or
mistreatment of women.
3. Discredit: tweets under this class bear slurs and offensive language against
women.
4. Dominance: tweets under this class imply the superiority of men over women.
5. Sexual Harassment: tweets under this class describe sexual advances and
abuse of a sexual nature.
6. Stereotyping &amp; Objectification: tweets under this class promote a fixed image
of women or describe women’s physical appeal.
7. Threat of Violence: tweets under this class have intimidating content with
threats of physical violence.
8. None: if no misogynistic behaviors exist.</p>
    </sec>
    <sec id="sec-3">
      <title>3. ArMI Evaluation Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Data Collection</title>
        <p>The dataset used for evaluation in the ArMI shared task contains 9,833 tweets written in
Modern Standard Arabic (MSA) and several Arabic dialects including Gulf, Egyptian
and Levantine. The Levantine tweets were derived from the Let-Mi dataset [4], which was
constructed from the direct replies posted on the timelines of several Lebanese
female journalists who were targeted by online bullying campaigns during the 17 October
protests in Lebanon. The multi-dialectal tweets, however, were collected based on
anti-women hashtags and specific queries (see Table 1). Moreover, by tracking anti-women
hashtags and queries within the “bio” section of Arab Twitter users, we spotted several
users who describe themselves as misogynists and scraped the tweets from their public
timelines<sup>1</sup>. All the tweets were collected during the period January 2019 - January
2021 using the Twitter API<sup>2</sup>.</p>
        <p>To prepare the collected tweets for annotation, they were normalized by
removing Twitter-specific symbols, digits, and URLs. Since the hashtags
within a tweet can indicate misogynistic content, hashtag
symbols were removed while the hashtag words were retained.</p>
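        <p>The normalization described above can be sketched as follows. This is a minimal illustration, not the organizers' actual script: the regular expressions, the treatment of user mentions and the retweet marker as Twitter-specific symbols, and the replacement of underscores inside hashtags by spaces are all assumptions, and normalize_tweet is a hypothetical helper name.</p>
        <preformat>
```python
import re

# Hypothetical patterns: the exact rules used for the ArMI data are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")      # assumption: mentions count as Twitter-specific symbols
RETWEET_RE = re.compile(r"\bRT\b")    # assumption: the retweet marker is removed as well
DIGIT_RE = re.compile(r"\d+")         # \d also matches Arabic-Indic digits in Python 3
SPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Remove URLs, Twitter markers and digits while keeping hashtag words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    # Drop the hashtag symbol but keep its words (assumption: underscores become spaces).
    text = text.replace("#", " ").replace("_", " ")
    text = DIGIT_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_tweet("RT @user #Red_Pill 2021 https://t.co/abc text"))
```
        </preformat>
        <p>URLs are removed before digits so that numeric fragments of a link are never left behind as stray tokens.</p>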
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Annotation Process and Evaluation</title>
        <p>The dataset used in the ArMI shared task results from merging the Let-Mi dataset [4] with
a collection of Arabic multi-dialectal tweets. Although both datasets were manually
annotated for misogynistic content and misogynistic behavior using the same annotation
guidelines, the Let-Mi dataset was annotated by three annotators while two annotators
labeled the multi-dialectal dataset. All the annotators are native Arabic
speakers and are fluent in the Egyptian, Levantine and Gulf dialects. Based on the
definition of misogynistic behaviors in [3], we designed the annotation guidelines such
that the eight label categories are identified as follows:
• Non-misogynistic (none): tweets that do not express any hatred,
insult, or verbal abuse towards women.
• Discredit: tweets that slur women with no other larger
intention.
• Derailing: tweets that justify the abuse of women
while rejecting male responsibility, in an attempt to disrupt the conversation and
refocus it.
• Dominance: tweets that express male superiority or preserve male control
over women.
• Stereotyping &amp; objectification: tweets that promote a widely held
but fixed and oversimplified image/idea of women. This label also covers tweets
that describe women’s physical appeal and/or compare women to
narrow standards.
• Threat of violence: tweets that intimidate women to silence them,
with an intent to assert power over women through threats of physical violence.
• Sexual harassment: tweets that describe actions such as sexual advances,
requests for sexual favors, and harassment of a sexual nature.
• Damning: a misogynistic behavior that is inspired by Arabic culture. It is used
to annotate tweets that contain prayers to hurt women; most of these prayers are
death/illness wishes, besides praying to God to curse women.</p>
        <p><sup>1</sup> User names were masked throughout the research study.</p>
        <p><sup>2</sup> We used the Python Tweepy library (http://www.tweepy.org).</p>
        <p>Table 1. Anti-women hashtags and queries used for collecting the multi-dialectal tweets (English translations): Girls’ mentalities; The belly dancers of media; A woman of speech defect; Short Women or Tall Women; Women are grumpy; #Red_Pill; Girls’ hormones; Go to kitchen; Wives’ discipline.</p>
        <p>Examples of annotated tweets (English translations): “We’d have answered you if you were a man”; “May God put a bullet in your heart”.</p>
        <p>Later, we evaluated the judgments of the annotators using an inter-annotator agreement
measure: Krippendorff’s α [7]. According to [4], the calculated Krippendorff’s α for Let-Mi was
82.9%, which is good. As for the multi-dialectal collection of tweets, which was annotated
by two annotators, we found that Krippendorff’s α is 66.5%, which is tentative. Thus, both
Krippendorff’s α values indicate the consistency of the annotations for each of Let-Mi
and the multi-dialectal tweets collection. Consequently, the annotations of the ArMI dataset,
which is constructed by merging Let-Mi and the multi-dialectal tweets collection and
used in the shared task, are considered reliable and consistent.</p>
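        <p>For readers unfamiliar with the agreement measure, the following is a minimal sketch of Krippendorff’s α for nominal labels, computed from the coincidence matrix. It is not the script used to obtain the values reported above; the function name, the input layout (one list of labels per tweet, with None for a missing judgment) and the toy data are assumptions made for illustration.</p>
        <preformat>
```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per item holding the labels given by the annotators
    (None marks a missing judgment). Hypothetical input layout.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m in (0, 1):
            continue  # items with fewer than two judgments carry no pairable information
        # Every ordered pair of judgments on the same item adds 1/(m-1) to the matrix.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Nominal distance: 0 for equal labels, 1 otherwise.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy data: two annotators, ten items, binary labels.
    first = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    second = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
    print(round(krippendorff_alpha_nominal(list(zip(first, second))), 3))
```
        </preformat>
        <p>The same function handles two or three annotators per tweet, which matches the two annotation settings described above.</p>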
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ArMI Dataset Statistics</title>
        <p>With the annotation process and its evaluation completed, we obtained a total of
9,833 tweets, of which 6,006 were misogynistic and 3,827 were non-misogynistic. The
tweets were distributed unevenly among the misogynistic behavior classes, as
shown in Table 4.</p>
        <p>The distribution of the tweets between the training and test sets is
given in Table 5. The general class distribution (misogynistic vs. non-misogynistic) is
quite similar, with a proportion of 61% misogynistic tweets in both the training and test sets.
Table 6 lists the number of tweets for each misogynistic
behavior class in both sets.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation and Metrics</title>
      <p>Regarding the submission and system evaluation process, each participating team was
allowed to submit a maximum of three runs. The performance of the submitted approaches
for the Misogyny Content Identification sub-task is evaluated by accuracy, whereas
the submitted runs for the Misogyny Behavior Identification sub-task are evaluated
using the macro-averaged measures: precision, recall and F1-measure. The final ranking of
the systems is based on the macro F1-measure.</p>
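      <p>
        <disp-formula><tex-math>\mathrm{Accuracy} = \frac{\text{number of correctly predicted instances}}{\text{total number of instances}}</tex-math></disp-formula>
        <disp-formula><tex-math>F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math></disp-formula>
      </p>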
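      <p>The two evaluation settings can be illustrated with a short sketch: accuracy for the binary sub-task and macro-averaged precision, recall and F1 for the multi-class sub-task. This is not the official scorer. The function names are hypothetical, and two choices are assumptions: the average runs over every label that appears in either the gold or the predicted labels, and a class with an empty denominator contributes 0.</p>
      <preformat>
```python
def accuracy(gold, pred):
    """Share of instances whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def macro_prf(gold, pred):
    """Macro-averaged precision, recall and F1 over all observed labels."""
    labels = sorted(set(gold) | set(pred))  # assumption: union of gold and predicted labels
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # Assumption: an undefined ratio counts as 0 for that class.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    # Each class weighs the same, regardless of how many tweets it has.
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


if __name__ == "__main__":
    gold = ["none", "none", "discredit", "damning"]
    pred = ["none", "discredit", "discredit", "none"]
    print(accuracy(gold, pred))
    print(macro_prf(gold, pred))
```
      </preformat>
      <p>Because macro-averaging gives every class equal weight, rare behavior classes influence the Sub-task B ranking as much as frequent ones.</p>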
    </sec>
    <sec id="sec-5">
      <title>5. Participants Methods and Results</title>
      <p>Nineteen teams registered for the shared task, six of which submitted
runs. Participants were from five different countries: Mexico, Morocco, Tunisia, Libya, and
India. All team members were from public entities (universities or research centers).</p>
      <p>Participants used either traditional machine learning approaches or transformer-based
models. Some teams preprocessed the input tweets before feeding them to their classifiers
by removing URLs, special characters, or duplicate characters within words. Other teams
went beyond simple text normalization and
converted emojis to text to obtain a fully textual representation of the tweets. In Table 7, we
summarize the participants’ approaches in terms of the models used, text representations,
preprocessing steps, and post-processing steps.</p>
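      <p>Two of the preprocessing steps reported by participants, removing duplicate characters and standardizing Arabic letters, can be sketched as below. The exact rules applied by each team are not specified here, so the repetition threshold (three or more), the letter mapping, and the function names are assumptions that follow one common convention.</p>
      <preformat>
```python
import re

# Assumption: a run of three or more identical characters is reduced to a single one.
REPEAT_RE = re.compile(r"(.)\1{2,}")

# Assumption: one common standardization of Hamza, Yaa' and Taa' Marbuta forms.
ARABIC_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alef with Hamza above -> bare Alef
    "\u0625": "\u0627",  # Alef with Hamza below -> bare Alef
    "\u0622": "\u0627",  # Alef with Madda -> bare Alef
    "\u0649": "\u064a",  # Alef Maqsura -> Yaa'
    "\u0629": "\u0647",  # Taa' Marbuta -> Haa'
})


def collapse_repeats(text: str) -> str:
    """Shrink elongated words by collapsing long runs of one character."""
    return REPEAT_RE.sub(r"\1", text)


def normalize_arabic_letters(text: str) -> str:
    """Map variant letter forms to a single canonical form."""
    return text.translate(ARABIC_MAP)


if __name__ == "__main__":
    print(collapse_repeats("heeeello"))
    print(normalize_arabic_letters("\u0623\u0649\u0629"))
```
      </preformat>
      <p>Runs of exactly two characters are left untouched so that legitimately doubled letters survive.</p>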
      <p>Table 7. Summary of the participants’ approaches. Rows: preprocessing steps (removing URLs; emojis to text; hashtags to words; mentions to words; removing usernames; removing special characters; removing repeated characters; processing Arabic letters), text representation (bag-of-words/chars; lexicons), classification models (Logistic Regression; Naive Bayes; Support Vector Machine; Neural Network; Arabic BERT; AraBERT; MARBERT), and post-processing steps (ensembling).</p>
      <p>Table 8. Results of Sub-task A. Runs in the order listed in the table: UM6P-NLP_run3, UM6P-NLP_run2, UM6P-NLP_run1, UoT_run1, SOA_NLP_run1, BERT, MUCIC_run1, SOA_NLP_run2, (Frenda et al., 2018), MUCIC_run2, SOA_NLP_run3, UoT_run3, iCompass_run1, UoT_run2, iCompass_run2, IsEy_run2, IsEy_run1.</p>
      <p>In the following, we summarize the participants’ approaches:
1. iCompass [8]: The team did not apply any pre- or post-processing steps. It used
the MARBERT pretrained transformer [9], which is an Arabic version of the BERT
model trained on Arabic corpora, to identify misogynistic tweets and to recognize
the misogynistic behaviors.
2. IsEy [10]: Similar to the iCompass approach, the IsEy team used the MARBERT model for
both sub-tasks, A and B. In addition, the team applied some preprocessing
steps to normalize the text, such as removing URLs, user mentions, special characters,
duplicate characters, etc.
3. MUCIC [11]: Although most of the participating teams used transformer-based
approaches, the MUCIC team used word and character n-gram models that utilize
a TF-IDF weighting scheme and select the top 30,000 features from the text. After that,
the new text representations are fed to Support Vector Machine (SVM) and
Logistic Regression (LR) models. As a preprocessing step, the authors converted
emojis into text, decoded hashtags and mentions into words, and removed special
characters.
4. UM6P-NLP [12]: The team used the MARBERT model as well, but added
further task-specific layers on top of it. For preprocessing, the authors extracted
emojis from tweets and added the BERT separator token ([SEP]) between them at
encoding time. As a post-processing step, the authors ensembled the predictions
from three different versions of MARBERT models.
5. UoT [13]: The team used two main approaches. For the first approach, the text was
normalized by removing special characters, URLs, commas, and user mentions;
finally, a normalization step was applied to standardize some Arabic letters (Hamza,
Yaa’, and Taa’ Marbota). Then, a word n-gram text representation was used and
fed to a Naive Bayes classifier. The authors tested both TF-IDF and word frequency
weighting schemes for the text representation. For the second approach, the authors
used the AraBERT model [14] without any auxiliary steps.
6. SOA_NLP [15]: The team used two main models: the Arabic-BERT [16] model, and a
character n-gram representation with SVM, LR, and Neural Network classifiers
(each classifier used in one of the three submitted runs). As a post-processing step,
the predictions from SVM and LR were ensembled in one of the runs.</p>
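      <p>Several of the systems above ensemble the predictions of more than one model. How each team combined its models is not described in this overview, so the sketch below swaps in one simple and widely used option, soft voting, in which the class probabilities of the models are averaged and the most probable class is selected. The function name, the input layout, and the toy probabilities are hypothetical.</p>
      <preformat>
```python
def soft_vote(probabilities_per_model, labels):
    """Average class probabilities across models and pick the top class per tweet.

    probabilities_per_model: one entry per model; each entry holds, for every tweet,
    a list of class probabilities ordered like `labels` (hypothetical layout).
    """
    n_models = len(probabilities_per_model)
    n_items = len(probabilities_per_model[0])
    predictions = []
    for i in range(n_items):
        # Mean probability of each class over all models for tweet i.
        averaged = [
            sum(model[i][c] for model in probabilities_per_model) / n_models
            for c in range(len(labels))
        ]
        best = max(range(len(labels)), key=averaged.__getitem__)
        predictions.append(labels[best])
    return predictions


if __name__ == "__main__":
    labels = ["none", "misogyny"]
    # Toy probabilities from two hypothetical classifiers for two tweets.
    model_a = [[0.6, 0.4], [0.2, 0.8]]
    model_b = [[0.3, 0.7], [0.9, 0.1]]
    print(soft_vote([model_a, model_b], labels))
```
      </preformat>
      <p>Averaging probabilities, unlike majority voting, remains well defined when only two models are combined.</p>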
      <p>In Table 8, we present the results of Sub-task A, showing the performance of each
team’s runs. We also compare the results of the two sub-tasks with two baselines: the AraBERT
model [14] and the system of Frenda et al. (2018) [17], which is one of the SOTA systems on the
misogyny identification task. In this sub-task, the system provided by
the UM6P-NLP team was the best performing one across all three of its runs. The team
used MARBERT transformers with an ensembling step. The results show that the top
performing systems use transformers in their best runs. The best word/character
n-gram run for this sub-task is MUCIC_run1, whose results were better than those obtained
by some transformer-based runs, but still lower than the first baseline.</p>
      <p>Regarding Sub-task B (see Table 9), the picture changed
slightly: transformer-based models did not perform better than the
word/character n-gram models, except for the UM6P-NLP team.
Word/character n-gram models ranked better than the other models. The same holds
for the baselines, where the BERT model performs worse than
the Frenda et al. (2018) baseline.</p>
      <p>The results obtained for both sub-tasks show that Arabic misogyny identification
SOTA results are in line with English, Spanish [5], and Italian [6] results. Regarding
Sub-task A, the results of the best performing systems are close (above 0.8
accuracy). On the other hand, the macro-averaged F1-measure results of Sub-task B are
between 0.4 and 0.5 for English, Spanish, and Italian, whereas for Arabic the best result is around
0.67.</p>
      <p>Table 9. Results of Sub-task B. Runs in the order listed in the table: UM6P-NLP_run2, UM6P-NLP_run3, UM6P-NLP_run1, SOA_NLP_run2, SOA_NLP_run3, (Frenda et al., 2018), SOA_NLP_run1, UoT_run1, MUCIC_run1, MUCIC_run2, UoT_run3, BERT, UoT_run2, iCompass_run2, iCompass_run1.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we have presented the results of the first shared task on misogyny
identification in Arabic tweets, hosted as a subtrack of HASOC at FIRE-2021. The participants had
to identify misogynistic tweets and then to detect the misogynistic behavior within them.
Nineteen teams registered for the task and six of them submitted runs. The
systems were trained on a dataset composed of misogynistic and non-misogynistic
tweets for the first sub-task; for the second sub-task, the tweets were further
annotated with seven different misogynistic behavior classes. The ArMI dataset was manually
annotated and the inter-annotator agreement was found to be “good”. The methods proposed
by the participants ranged from traditional feature-based approaches relying on word or
character n-gram features to transformer-based systems. Several transformer models
were evaluated, and many classical classifiers were used as well. Ensemble methods were
also utilized. The best performing system for Sub-task A achieved an accuracy
value of 0.919, and the best system for Sub-task B achieved an F1 score of 0.665. Both
systems used transformer-based approaches. Finally, we have made the ArMI dataset publicly
available to the research community.</p>
      <p>[3] B. Poland, Haters: Harassment, abuse, and violence online, Lincoln: University of
Nebraska Press, 2016.</p>
      <p>[4] H. Mulki, B. Ghanem, Let-Mi: An Arabic Levantine Twitter dataset for
misogynistic language, in: Proceedings of the Sixth Arabic Natural Language Processing
Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021,
pp. 154–163.</p>
      <p>[5] E. Fersini, P. Rosso, M. Anzovino, Overview of the Task on Automatic Misogyny
Identification at IberEval 2018, in: Third Workshop on Evaluation of Human
Language Technologies for Iberian Languages (IberEval 2018), volume 2150, 2018,
pp. 214–228.</p>
      <p>[6] E. Fersini, D. Nozza, P. Rosso, Overview of the EVALITA 2018 Task on Automatic
Misogyny Identification (AMI), in: EVALITA Evaluation of NLP and Speech Tools
for Italian, volume 12, 2018, p. 59.</p>
      <p>[7] K. Krippendorff, Computing Krippendorff’s alpha-reliability (2011).</p>
      <p>[8] A. Messaoudi, C. Fourati, M. Kchaou, H. Haddad, iCompass Working Notes for
Arabic Misogyny Identification, in: Working Notes of FIRE 2021 - Forum for
Information Retrieval Evaluation, CEUR, 2021.</p>
      <p>[9] M. Abdul-Mageed, A. Elmadany, E. M. B. Nagoudi, ARBERT &amp; MARBERT: Deep
Bidirectional Transformers for Arabic, arXiv preprint arXiv:2101.01785 (2020).</p>
      <p>[10] I. Abbes, E. Nakache, M. Benhajhmida, Context-aware Language Modeling for
Arabic Misogyny Identification, in: Working Notes of FIRE 2021 - Forum for
Information Retrieval Evaluation, CEUR, 2021.</p>
      <p>[11] F. Balouchzahi, G. Sidorov, H. L. Shashirekha, MUCIC at Arabic Misogyny
Identification, in: Working Notes of FIRE 2021 - Forum for Information Retrieval
Evaluation, CEUR, 2021.</p>
      <p>[12] A. El Mahdaouy, A. El Mekki, A. Oumar, H. Mousannif, I. Berrada, Deep Multi-Task
Models for Misogyny Identification and Categorization on Arabic Social Media, in:
Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR,
2021.</p>
      <p>[13] A. Nwesri, S. Wu, H. Harmain, Detecting Misogyny in Arabic Tweets, in: Working
Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.</p>
      <p>[14] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based Model for Arabic
Language Understanding, in: LREC 2020 Workshop Language Resources and
Evaluation Conference 11–16 May, 2020, pp. 9–15.</p>
      <p>[15] A. Kumar, P. Kumar Roy, J. Prakash Singh, A Deep Learning Approach for
Identification of Arabic Misogyny from Tweets, in: Working Notes of FIRE 2021 -
Forum for Information Retrieval Evaluation, CEUR, 2021.</p>
      <p>[16] A. Safaya, M. Abdullatif, D. Yuret, KUISAIL at SemEval-2020 Task 12: BERT-CNN
for Offensive Speech Identification in Social Media, in: Proceedings of the Fourteenth
Workshop on Semantic Evaluation, International Committee for Computational
Linguistics, Barcelona (online), 2020, pp. 2054–2059.</p>
      <p>[17] S. Frenda, B. Ghanem, M. Montes-y-Gómez, Exploration of Misogyny in Spanish and
English Tweets, in: 3rd Workshop on Evaluation of Human Language Technologies
for Iberian Languages (IberEval 2018), volume 2150, CEUR-WS, 2018, pp. 260–267.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Nockleby</surname>
          </string-name>
          , Hate Speech, volume
          <volume>1</volume>
          , Encyclopedia of the American Constitution (2nd ed., edited by Leonard W. Levy, Kenneth L.
          <string-name>
            <surname>Karst</surname>
          </string-name>
          et al.), New York: Macmillan,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Moloney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Love</surname>
          </string-name>
          ,
          <article-title>Assessing online misogyny: Perspectives from sociology and feminist media studies</article-title>
          ,
          <source>Sociology compass 12</source>
          (
          <year>2018</year>
          )
          <article-title>e12577</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>