<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Deep Multi-Task Models for Misogyny Identification and Categorization on Arabic Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdelkader El Mahdaouy</string-name>
          <email>abdelkader.elmahdaouy@um6p.ma</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdellah El Mekki</string-name>
          <email>abdellah.elmekki@um6p.ma</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmed Oumar</string-name>
          <email>ahmedmohamedlemine.oumar@edu.uca.ma</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hajar Mousannif</string-name>
          <email>mousannif@uca.ac.ma</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismail Berrada</string-name>
          <email>ismail.berrada@um6p.ma</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Misogyny Identification, Misogyny Categorization, Multi-Task Learning, Pre-trained Language Models</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LISI Laboratory, Computer Science Department, FSSM, Cadi Ayyad University</institution>
          ,
          <country country="MA">Morocco</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Sciences, Mohammed VI Polytechnic University</institution>
          ,
          <country country="MA">Morocco</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The prevalence of toxic content on social media platforms, such as hate speech, offensive language, and misogyny, presents serious challenges to our interconnected society. These challenging issues have attracted widespread attention in the Natural Language Processing (NLP) community. In this paper, we present our submitted systems for the first Arabic Misogyny Identification shared task. We investigate three multi-task learning models as well as their single-task counterparts. In order to encode the input text, our models rely on the pre-trained MARBERT language model. The overall obtained results show that all our submitted models achieved the best performance (the top three ranked submissions) on both the misogyny identification and categorization tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Misogyny Identification</kwd>
        <kwd>Misogyny Categorization</kwd>
        <kwd>Multi-Task Learning</kwd>
        <kwd>Pre-trained Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the popularity of the Internet and the rise of social media platforms, users around the
world have more freedom of expression. They can express their thoughts and opinions
with minimal limitations and restrictions. As a result, they can share their positive thoughts
about a specific product or service, a political decision, etc., as well as their negative
thoughts about other things. Unfortunately, many users employ these communication
channels and this freedom of expression to bully other people or groups. Misogyny is one of these
phenomena, and it is defined as hate speech towards the female gender [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Misogyny can be
classified into several categories such as sexual harassment, damning, dominance, etc. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Misogynistic behavior has prevailed on social media platforms such as Facebook and Twitter. The ease
of use and richness of these platforms have raised misogyny to new levels of violence around
the globe. Moreover, women suffer from misogyny in the first world just as they do in
the second and third world, regardless of their race, language, age, etc. In the Arabic world, women’s
rights and liberty have always been a controversial subject. Therefore, women are also exposed
to online misogyny, where people can start campaigns of intimidation and harassment against
them for one reason or another.</p>
      <p>Fighting online misogyny has become a topic of interest for several Internet players. Social
media networks such as Facebook and Twitter offer reporting systems that allow users
to report messages expressing misogynistic behavior. These reporting systems can detect such
behaviors in users’ posts and delete them automatically. For high-resource languages such
as English, Spanish, and French, these systems have been shown to perform well. However,
when it comes to languages such as Arabic, automatic reporting systems are not yet deployed,
mainly due to: 1) the lack of annotated data needed to build such systems and 2) the
complexity of the Arabic language compared to other languages.</p>
      <p>
        Fine-tuning pre-trained transformer-based language models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on downstream tasks has
shown state-of-the-art (SOTA) performance on various languages, including Arabic [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4, 5, 6,
7, 8</xref>
        ]. Although several research works based on pre-trained transformers have been introduced
for misogyny detection in Indo-European languages [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10, 11</xref>
        ], works on Arabic language
remain under explored [12].
      </p>
      <p>
        In this paper, we present our participating system and submissions to the first Arabic Misogyny
Identification (ArMI) shared task [<xref ref-type="bibr" rid="ref13">13</xref>]. We introduce three Multi-Task Learning (MTL) models
and their single-task counterparts. To embed the input texts, our models employ the pre-trained
MARBERT language model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, for Task 2, we tackle the class imbalance problem
by training our models to minimize the Focal Loss [<xref ref-type="bibr" rid="ref14">14</xref>]. The obtained results demonstrate that
our three submissions have achieved the best performance on both ArMI tasks in comparison
to the other participating systems. The results also show that the MTL models outperform their
single-task counterparts on most evaluation measures. Additionally, the Focal Loss has shown
effective performance, especially on the F1 measures.
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes the ArMI tasks and the
provided dataset. In Section 3, we introduce our participating system and the investigated deep
learning models. Section 4 presents the conducted experiments and shows the obtained results.
In Section 5, we conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks and dataset description</title>
      <p>The Arabic Misogyny Identification (ArMI) task consists of the automatic detection of misogyny
in Arabic tweets [<xref ref-type="bibr" rid="ref13">13</xref>]. This task is composed of two main sub-tasks. The first sub-task is a
binary classification task where the objective is to classify whether a tweet is misogynistic or
not. In the second sub-task, the objective is to detect the misogynistic behavior expressed in
a tweet; it is modeled as a multi-class classification problem with seven misogynistic
behaviors (labels). The organizers of this task have provided 7,866 labeled tweets to serve
both sub-tasks for model training, while 1,966 tweets have been used for model testing and
evaluation. Figure 1 presents the label distribution for both tasks. It shows that the class labels
are imbalanced for both the misogyny identification and categorization tasks.</p>
      <p>Figure 1: (a) Distribution of misogynistic tweets; (b) Distribution of misogynistic categories.</p>
      <p>The provided tweets are expressed mainly in Modern Standard Arabic (MSA), while several
tweets are expressed in Arabic dialects such as Egyptian, Gulf, and Levantine. The
Levantine tweets are taken from the Let-Mi misogyny detection dataset proposed by Mulki and
Ghanem [<xref ref-type="bibr" rid="ref12">12</xref>]. Besides, the rest of the tweets have been scraped from Twitter using hashtags
related to the misogyny phenomenon. The provided dataset is manually annotated by Arabic
native speakers.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        We propose three deep Multi-task Learning (MTL) models based on the pre-trained MARBERT
encoder [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for the ArMI shared task. We also investigate the single-task version of the proposed
MTL models. The choice of MARBERT encoder is motivated by the fact that this language
model is pre-trained on a corpus of 1B tweets, containing both dialectal Arabic and MSA. Moreover,
fine-tuning MARBERT on downstream NLP tasks has shown effective results in many Arabic
NLP applications [
        <xref ref-type="bibr" rid="ref5 ref7 ref8">5, 7, 8</xref>
        ]. In what follows, we describe each component of our submitted
system.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Preprocessing</title>
        <p>The tweet preprocessing component performs emoji extraction, user mention and URL
substitution, and hashtag normalization. Following MARBERT’s tweet preprocessing guidelines,
user mentions and URLs are replaced by the ”user” and ”url” tokens, respectively. For hashtag
normalization, we remove the ”#” symbol and replace ”_” with white space. It is worth mentioning that
diacritics have already been removed from the training and testing datasets. Based on our preliminary
experiments, emojis are not removed from the normalized text; they are also added after the [SEP] token
of the employed encoder. Finally, each tweet is represented using its normalized text and its
emojis, as follows:</p>
        <p>⋆ [CLS] normalized tweet [SEP] emojis [SEP]</p>
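        <p>To make this preprocessing step concrete, the following Python sketch illustrates it. It assumes the emoji package and the Hugging Face Transformers tokenizer; the function names, regular expressions, and extraction logic are illustrative rather than the exact implementation used in our system.</p>
        <preformat>
import re

import emoji  # assumed third-party package, used here only to detect emojis
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")

def normalize_tweet(text):
    """Illustrative preprocessing: mention/URL substitution, hashtag
    normalization, and emoji extraction."""
    # Replace user mentions and URLs by generic tokens.
    text = re.sub(r"@\w+", "user", text)
    text = re.sub(r"https?://\S+", "url", text)
    # Hashtag normalization: drop the '#' symbol and replace '_' by white space.
    text = text.replace("#", " ").replace("_", " ")
    # Emojis are kept in the normalized text and also returned separately,
    # so they can be appended after the [SEP] token.
    emojis = "".join(ch for ch in text if ch in emoji.EMOJI_DATA)
    return " ".join(text.split()), emojis

def encode_tweet(text):
    normalized, emojis = normalize_tweet(text)
    # Produces: [CLS] normalized tweet [SEP] emojis [SEP]
    return tokenizer(normalized, emojis, truncation=True, return_tensors="pt")
        </preformat>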
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Learning Models</title>
        <p>
          In this section, we describe the employed MTL models and their single-task counterparts. All
our models utilize the MARBERT encoder to represent the input tweets. The models are described
as follows:
• MT_CLS uses a classification layer for each task on top of the MARBERT encoder. It relies
on the [CLS] token embedding to predict the class label for each task. The single-task version
of this model is denoted by ST_CLS.
• MT_ATT consists of the MARBERT encoder, two task-specific attention layers, and two
classification layers. Each attention layer [<xref ref-type="bibr" rid="ref15">15</xref>, <xref ref-type="bibr" rid="ref16">16</xref>] extracts task-discriminative features by
weighting the output token embeddings of the encoder according to their contribution
to the task at hand. Each classification layer is fed with the concatenation of the
task attention output and the [CLS] token embedding. This model has shown effective
performance in many NLP tasks, including dialect identification, sentiment analysis,
and sarcasm detection for the Arabic language [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ], as well as humor detection and rating and
lexical complexity prediction in English [<xref ref-type="bibr" rid="ref17">17</xref>, <xref ref-type="bibr" rid="ref18">18</xref>]. The single-task counterpart of
MT_ATT is denoted by ST_ATT. A sketch of this architecture is given after this list.
• MT_VHATT is an extension of the MT_ATT model. In addition to the task-specific
attention layers (called horizontal attention layers), it employs vertical attention layers to
incorporate the features of the top intermediate layers of the MARBERT encoder for both
tasks. This model utilizes six attention layers to extract features from the token embeddings
of the top six layers of the encoder [<xref ref-type="bibr" rid="ref15">15</xref>, <xref ref-type="bibr" rid="ref16">16</xref>]. Then, another attention layer is employed to
aggregate the features from the six vertical attention layers. Note that we exclude the top
output layer of the encoder, as its features are already used by the horizontal attention
layers (task-specific attention). Finally, the input of the classification layers for both tasks
is the concatenation of the [CLS] token embedding of the last layer of the encoder, the
task-specific attention output, and the aggregated features of the intermediate layers. The
single-task version of MT_VHATT is denoted by ST_VHATT.
        </p>
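        <p>For concreteness, the following is a minimal PyTorch sketch of the MT_ATT variant described above: a shared MARBERT encoder, one task-specific attention layer per task, and one classification layer per task fed with the concatenation of the attention output and the [CLS] embedding. The class and attribute names are ours, hyper-parameters are omitted, and this is a sketch rather than the exact code of our system.</p>
        <preformat>
import torch
import torch.nn as nn
from transformers import AutoModel

class TaskAttention(nn.Module):
    """Additive attention that weights token embeddings for one task."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.context = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, token_embeddings, attention_mask):
        scores = self.context(torch.tanh(self.proj(token_embeddings))).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, torch.finfo(scores.dtype).min)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (weights * token_embeddings).sum(dim=1)

class MTAtt(nn.Module):
    """Multi-task model: shared MARBERT encoder and task-specific attention heads."""
    def __init__(self, num_labels_task1=1, num_labels_task2=7):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("UBC-NLP/MARBERT")
        hidden = self.encoder.config.hidden_size
        self.att1 = TaskAttention(hidden)
        self.att2 = TaskAttention(hidden)
        self.cls1 = nn.Linear(2 * hidden, num_labels_task1)  # misogyny identification
        self.cls2 = nn.Linear(2 * hidden, num_labels_task2)  # misogyny categorization

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state
        cls = tokens[:, 0]  # [CLS] token embedding
        feats1 = torch.cat([cls, self.att1(tokens, attention_mask)], dim=-1)
        feats2 = torch.cat([cls, self.att2(tokens, attention_mask)], dim=-1)
        return self.cls1(feats1), self.cls2(feats2)
        </preformat>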
        <p>For misogyny identification (Task 1), all models are trained to minimize the binary
cross-entropy loss. For misogyny categorization (Task 2), we have investigated the Cross-Entropy
(CE) loss as well as the Focal Loss (FL) [<xref ref-type="bibr" rid="ref14">14</xref>]. The latter loss is employed to handle the class
imbalance problem. It reduces the loss contribution of easy examples and assigns higher
importance weights to hard-to-classify examples. The FL is given by:
\mathrm{FL}(y, \hat{y}) = -\alpha_y (1 - \hat{y}_y)^{\gamma} \log(\hat{y}_y)   (1)
where y \in \{0, \dots, C-1\} denotes the category label, \hat{y} = (\hat{y}_0, \dots, \hat{y}_{C-1}) is the vector of
predicted probabilities over the labels, \alpha_y is the weight of label y, and \gamma controls the
contribution of high-confidence predictions to the loss. In other words, a higher value of \gamma
implies a lower loss contribution from well-classified examples [<xref ref-type="bibr" rid="ref14">14</xref>].</p>
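        <p>A minimal PyTorch sketch of the focal loss of Equation (1) is given below, assuming integer Task 2 labels and per-class weights computed as described in Section 4.1; the function is illustrative rather than the exact implementation of our system.</p>
        <preformat>
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss for multi-class classification (Equation 1).

    logits:  (batch, num_classes) raw model outputs
    targets: (batch,) integer class labels
    alpha:   (num_classes,) per-class weights
    gamma:   focusing parameter; larger values down-weight easy examples
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Predicted probability and log-probability of the true class per example.
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    weights = alpha.to(logits.device)[targets]
    loss = -weights * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
        </preformat>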
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and results</title>
      <p>In this section, we present the experiment settings as well as the obtained results for our
development set and the provided test set.</p>
      <sec id="sec-4-1">
        <title>4.1. Experiment settings</title>
        <p>All our models are implemented using the PyTorch (https://pytorch.org/) framework and the
open-source Transformers (https://huggingface.co/transformers/) library. Experiments are performed
on a PowerEdge R740 server with a 44-core Intel Xeon Gold 6152 CPU at 2.1 GHz, 384 GB of RAM,
and a single Nvidia Tesla V100 GPU with 16 GB of memory.
The provided training set is split into 90% for training and 10% for development. Based
on our preliminary results, all models are trained using the Adam optimizer. The learning rate,
the number of epochs, and the batch size are fixed to 1 × 10−5, 5, and 16, respectively. The
hyper-parameter \gamma of the Focal Loss is set to 2, while the weight of each Task 2 label y is set to
\alpha_y = (number of instances of the dominant label) / (number of instances of label y).
All models are evaluated using the Accuracy as well as
the macro-averaged Precision, Recall, and F1 measures.</p>
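        <p>As an illustration, the per-label weights used by the focal loss can be derived from the Task 2 label counts as follows; this is a sketch assuming integer labels in {0, ..., C-1}, and the variable and function names are ours.</p>
        <preformat>
from collections import Counter

import torch

def compute_alpha(labels):
    """alpha_y = (count of the dominant label) / (count of label y)."""
    counts = Counter(labels)
    dominant = max(counts.values())
    num_classes = len(counts)
    return torch.tensor([dominant / counts[y] for y in range(num_classes)],
                        dtype=torch.float)

# Example usage (hypothetical variable names):
# alpha = compute_alpha(task2_train_labels)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # 5 epochs, batch size 16
        </preformat>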
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>In order to select the best models for our official submissions, we have evaluated the three MTL
models and their single-task counterparts. For Task 2, we have investigated both the CE and FL
losses. Table 1 presents the obtained results on the development set using the three single-task
models. The overall results for Task 1 show that the ST_ATT model outperforms the
other models on most evaluation measures. It also shows the best Recall and F1 measures for
Task 2. Moreover, ST_VHATT yields slightly better performance on Task 1 and achieves far
better precision and F1 scores on Task 2 in comparison to the ST_CLS model. Furthermore, FL
outperforms the CE loss on most evaluation measures for Task 2, except for the accuracy and
the precision of the ST_CLS model. Table 2 presents the classification reports for Task 2 of the
ST_ATT model using the CE and FL loss functions. The obtained results show that the FL leads to
better F1 scores for all categories, except the ”Discredit” and ”Damning” misogynistic behaviours.
Indeed, the classification performance on rare categories improves while the overall performance
is maintained. In accordance with the results obtained using the single-task models, MT_VHATT
shows slightly better performance on Task 1 than the ST_CLS model. The overall obtained results
show that multi-task learning models surpass their single-task counterparts on Task 1. This can
be explained by the fact that MTL models leverage signals from both tasks [<xref ref-type="bibr" rid="ref19">19</xref>, <xref ref-type="bibr" rid="ref20">20</xref>].</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Official submission results</title>
        <p>Based on the results obtained on the development set, we have submitted models that are trained
using the FL for misogyny categorization (Task 2). This choice is motivated by the fact that
the FL has led to better F1 scores than the CE loss on the dev set (Tables 1 and 3). Our three
official submissions are described as follows:
• run1: corresponds to the submission of the results obtained on both tasks using the
single-task model ST_ATT.
• run2: corresponds to the results obtained on both tasks using the multi-task model
MT_ATT.
• run3: corresponds to the ensembling of the three multi-task learning models, namely the
MT_CLS, MT_ATT, and MT_VHATT models. In this submission, the logits of the three
models are averaged. Depending on the task, either the sigmoid or the softmax activation
is applied to obtain the label probabilities. A sketch of this ensembling step is given after this list.</p>
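        <p>A minimal sketch of the run3 ensembling step is shown below. It assumes that each of the three trained MTL models returns a pair of logits (Task 1, Task 2) for a batch; the names are illustrative.</p>
        <preformat>
import torch

def ensemble_predict(models, input_ids, attention_mask):
    """Average the logits of the MTL models, then apply the sigmoid (Task 1)
    or the softmax (Task 2) activation to obtain label probabilities."""
    logits1, logits2 = [], []
    with torch.no_grad():
        for model in models:
            out1, out2 = model(input_ids, attention_mask)
            logits1.append(out1)
            logits2.append(out2)
    avg1 = torch.stack(logits1).mean(dim=0)
    avg2 = torch.stack(logits2).mean(dim=0)
    probs_task1 = torch.sigmoid(avg1)          # misogyny identification
    probs_task2 = torch.softmax(avg2, dim=-1)  # misogyny categorization
    return probs_task1, probs_task2

# Usage: probs1, probs2 = ensemble_predict([mt_cls, mt_att, mt_vhatt], ids, mask)
        </preformat>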
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we have presented our participating system in the first Arabic Misogyny
Identification shared task. We have investigated three Multi-Task Learning models and their single-task
counterparts using the pre-trained MARBERT encoder. In order to deal with the class label
imbalance of Task 2, we have employed the Focal Loss. The results show that our three submitted
systems are top-ranked among the participating systems in both ArMI tasks. The overall
obtained results demonstrate that the MTL models outperform their single-task versions in most
evaluation scenarios. Besides, the Focal Loss has shown effective performance, especially on
the F1 measures.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Experiments presented in this paper were carried out using the supercomputer simlab-cluster,
supported by Mohammed VI Polytechnic University (https://www.um6p.ma), and facilities of
simlab-cluster HPC &amp; IA platform.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moloney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Love</surname>
          </string-name>
          ,
          <article-title>Assessing online misogyny: Perspectives from sociology and feminist media studies</article-title>
          ,
          <source>Sociology Compass</source>
          <volume>12</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Poland</surname>
          </string-name>
          , Haters: Harassment, abuse, and violence online,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Antoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baly</surname>
          </string-name>
          , H. Hajj,
          <article-title>AraBERT: Transformer-based model for Arabic language understanding</article-title>
          ,
          <source>in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools</source>
          ,
          <article-title>with a Shared Task on Offensive Language Detection</article-title>
          , European Language Resource Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          . URL: https://aclanthology.org/2020.osact-1.2.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdul-Mageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmadany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M. B.</given-names>
            <surname>Nagoudi</surname>
          </string-name>
          ,
          <article-title>ARBERT &amp; MARBERT: Deep bidirectional transformers for Arabic</article-title>
          , in:
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>7088</fpage>
          -
          <lpage>7105</lpage>
          . URL: https://aclanthology.org/2021.acl-long.551. doi:10.18653/v1/2021.acl-long.551.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mekki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mahdaouy</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Berrada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoumsi</surname>
          </string-name>
          ,
          <article-title>Domain adaptation for Arabic crossdomain and cross-dialect sentiment analysis from contextualized word embedding, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2824</fpage>
          -
          <lpage>2837</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.226. doi:10.18653/v1/2021.naacl-main.226.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mekki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mahdaouy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Essefar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. El</given-names>
            <surname>Mamoun</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Berrada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoumsi</surname>
          </string-name>
          ,
          <article-title>BERT-based multi-task model for country and province level MSA and dialectal Arabic identification</article-title>
          ,
          <source>in: Proceedings of the Sixth Arabic Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Kyiv,
          <source>Ukraine (Virtual)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>275</lpage>
          . URL: https://aclanthology.org/2021.wanlp-1.31.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mahdaouy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mekki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Essefar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. El</given-names>
            <surname>Mamoun</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Berrada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoumsi</surname>
          </string-name>
          ,
          <article-title>Deep multi-task model for sarcasm detection and sentiment analysis in Arabic language</article-title>
          ,
          <source>in: Proceedings of the Sixth Arabic Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Kyiv,
          <source>Ukraine (Virtual)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>339</lpage>
          . URL: https://aclanthology.org/2021.wanlp-1.42.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N. Safi</given-names>
            <surname>Samghabadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. PYKL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Solorio</surname>
          </string-name>
          ,
          <article-title>Aggression and misogyny detection using BERT: A multi-task approach</article-title>
          , in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying,
          <source>European Language Resources Association (ELRA)</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>131</lpage>
          . URL: https://aclanthology.org/2020.trac-1.20.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>AMI @ EVALITA2020: Automatic misogyny identification</article-title>
          , in: V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online event, December 17th, 2020, volume 2765 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, Automatic classification of sexism in social networks: An empirical study on twitter data, IEEE Access 8 (2020) 219563–219576. doi:10.1109/ACCESS.2020.3042604.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Mulki, B. Ghanem, Let-mi: An Arabic Levantine Twitter dataset for misogynistic language, in: Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp. 154–163. URL: https://aclanthology.org/2021.wanlp-1.16.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] H. Mulki, B. Ghanem, ArMI at FIRE2021: Overview of the First Shared Task on Arabic Misogyny Identification, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017). URL: http://arxiv.org/abs/1708.02002. arXiv:1708.02002.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.0473.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489. URL: https://www.aclweb.org/anthology/N16-1174. doi:10.18653/v1/N16-1174.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Essefar, A. El Mekki, A. El Mahdaouy, N. El Mamoun, I. Berrada, CS-UM6P at SemEval-2021 task 7: Deep multi-task learning model for detecting and rating humor and offense, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 1135–1140. URL: https://aclanthology.org/2021.semeval-1.159. doi:10.18653/v1/2021.semeval-1.159.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] N. El Mamoun, A. El Mahdaouy, A. El Mekki, K. Essefar, I. Berrada, CS-UM6P at SemEval-2021 task 1: A deep learning model-based pre-trained transformer encoder for lexical complexity, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 585–589. URL: https://aclanthology.org/2021.semeval-1.73. doi:10.18653/v1/2021.semeval-1.73.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] R. Caruana, Learning many related tasks at the same time with backpropagation, in: Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS'94, MIT Press, Cambridge, MA, USA, 1994, pp. 657–664.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, H. Wang, ERNIE 2.0: A continual pre-training framework for language understanding, arXiv preprint arXiv:1907.12412 (2019).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>