=Paper=
{{Paper
|id=Vol-3159/T5-7
|storemode=property
|title=iCompass Working Notes for Arabic Misogyny Identification
|pdfUrl=https://ceur-ws.org/Vol-3159/T5-7.pdf
|volume=Vol-3159
|authors=Abir Messaoudi,Chayma Fourati,Mayssa Kchaou,Hatem Haddad
|dblpUrl=https://dblp.org/rec/conf/fire/MessaoudiFKH21
}}
==iCompass Working Notes for Arabic Misogyny Identification==
iCompass Working Notes for Arabic Misogyny Identification Abir Messaoudi1 , Chayma Fourati1 , Mayssa Kchaou1 and Hatem Haddad1 1 iCompass, Tunisia Abstract We describe our submitted system to the first Arabic Misogyny Identification shared task. We tackled both subtasks, namely Misogyny Content Identification (Subtask 1) and Misogyny Behavior Identification (Subtask 2). We used state-of-the-art Machine Learning models and pretrained contextualized text representation models that we fine-tuned according to the downstream task in hand. As a first approach, we used Machine Learning algorithms including: Naive Bayes and Support Vector Machine for both subtasks. Then, we used Google’s multilingual BERT and then other BERT Arabic variants: AraBERT, ARBERT and MARBERT. The results found show that MARBERT outperforms all of the previously mentioned models overall, whether on Subtask 1 or Subtask 2. Keywords Misogyny, Machine Learning, BERT, Finetuning 1. Introduction Nowadays, social media presents an important role in the spread of misogynistic behaviour. Hence, misogyny identification presents a trending task, particularly in Arabic since it has different variants and dialects across the world. Even if some dialects share some vocabulary, they still differ according to countries, where each dialect has its own specifications. Because of the massive amount of such content, automatic identification of misogynistic behaviours becomes crucial. The paper is structured as follows: Section 2 provides a concise description of the used dataset. Section 3 describes the used systems and the experimental setup to build models for Misogyny Content Identification and Misogyny Behavior Identification. Section 4 presents the obtained results. Finally, section 5 concludes and points to possible directions for future work. 2. Data The provided train dataset [1] of the competition [2] consists of 7866 tweets written in Modern Standard Arabic (MSA) and several Arabic dialects including: Gulf, Egyptian and Levantine. The dataset has two label columns: misogyny and category for the first and second subtasks respectively. The first subtask consists of a binary classification problem, where the column misogyny contains two labels (Misogyny and None). The second subtask consists of a multiclass FIRE 2021: Forum for Information Retrieval Evaluation, 13th-17th December, 2021 Envelope-Open abir@icompass.digital (A. Messaoudi); chayma@icompass.digital (C. Fourati); mayssakchaou933@gmail.com (M. Kchaou); hatem@icompass.digital (H. Haddad) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) classification problem, where the column category contains eight labels as follows: none, damning, derailing, discredit, dominance, sexual harassment, stereotyping & objectification, and threat of violence. In order to validate our models, we split the provided train dataset into train, dev and test with ratios 70%, 10% and 20% respectively. Tables 1 and 2 present statistics of the splitted train dataset for Subtask 1 and Subtask 2 respectively. Table 1 Splitted train dataset statistics for Subtask 1. Class Train Validation Test None 2053 221 787 Misogyny 3609 409 787 Total 5662 630 1574 Table 2 Splitted train dataset statistics for Subtask 2. Class Train Validation Test None 2069 205 787 Damning 524 44 101 derailing 88 8 9 discredit 2131 261 476 dominance 159 26 34 Sexual harassment 47 2 12 Stereotyping & objectification 473 59 121 Threat of violence 171 25 34 Total 5662 630 1574 Train, dev and test datasets were preprocessed by removing links (https, //, etc..) emoji symbols (:p, :D, etc..) , hashtags (#), tags (@) retweets (RT) and punctuation (?!., etc..). An example is the tweet ” ”مستخدم@ انتي بطلة. After preprocessing, the latest becomes ””انتي بطلة. 2.1. Third Party Dataset for Training At iCompass, we gathered our Tunisian Misogyny dataset labelled as None (0) and Misogyny (1) collected from Tunisian sources. We added this dataset for the training of the first subtask since it has the same labels. Hence, the dataset was enhanced by 818 tweets labelled as ”None” and 642 labelled as ”Misogyny”. The same preprocessing techniques were performed. However, the new obtained dataset was not used for the experiments, but only when submitting our results. 3. System Description As a first approach, we used two Machine Learning algorithms: Naive Bayes (NB) and Support Vector Machine (SVM) chosen based on the state of the art with a variation of hyperparameters in order to find the best performing values. Pretrained contextualized text representation models have shown to perform effectively in order to make a natural language understandable by machines. Bidirectional Encoder Representations from Transformers (BERT) [3] is, nowadays, the state-of-the-art model for language understanding, outperforming previous models and opening new perspectives in the Natural Language Processing (NLP) field. Hence, as a second approach, we used multilingual cased BERT model (mBERT) [3] since it contains more than 100 languages including the Arabic one. Then, we used three BERT Arabic variants: AraBERT [4], ARBERT [5] and MARBERT [5]. After different experiments, MARBERT achieved the best results for the two subtasks: Misogyny Content Identification and Misogyny Behavior Identification. We believe this is because MARBERT was trained mostly on dialectal Arabic which was underrepresented in previous pretrained models. Since this task’s data is multi-dialectal, this model is expected to achieve the best performance. We trained our models on a Google Cloud GPU of 8 cores using Google Colaboratory. The final models that we used to make the submissions are: • For Misogyny Content Identification: a model based on MARBERT, trained for 4 epochs with a learning rate of 2e-5, a batch size of 32 and max sequence length of 128. • For Misogyny Behavior Identification: a model based on MARBERT, trained for 4 epochs with a learning rate of 2e-5, a batch size of 32 and max sequence length of 128. 4. Results and Discussion We submitted two runs to each subtask: run1 is trained on the provided train dataset, and run2 on the augmented train dataset. 4.1. Sub-task A - Misogyny Content Identification This subtask is a binary classification problem which includes labels ”None” and ”Misogyny”. Table 4 presents the results of experiments performed for this subtask where the best result was achieved by MARBERT. Table 3 Results obtained for Subtask 1. Model Accuracy F1 macro F1 micro SVM 79% 79% 79% NB 80% 79% 79% MBERT 75% 74% 74% ARABERT 81% 81% 81% ARBERT 80% 80% 80% MARBERT 89% 89% 89% 4.2. Sub-task B - Misogyny Behavior Identification This subtask is a multiclass classification problem, including eight labels. Table 4 presents the results of experiments performed for this subtask where the best result was also achieved by MARBERT. Because the dataset is not balanced, F1 macro gives low performances. Table 4 Results obtained for Subtask 2. Model Accuracy F1 macro F1 micro SVM 71% 32% 69% NB 70% 41% 70% MBERT 66% 30% 64% ARABERT 70% 32% 68% ARBERT 63% 24% 59% MARBERT 83% 52% 82% 4.3. Official Submission Results The results obtained on the final released test dataset are presented in table 5. Table 5 Results on the final test datasets. Subtask Accuracy Precision Recall F1 score Subtask 1 run 1 83.3% 82.6% 82% 82.3% Subtask 1 run 2 50.8% 50.2% 50.3% 49.9% Subtask 2 run 1 63.7% 24.2% 24.8% 24.5% Subtask 2 run 2 63.7% 24.2% 24.8% 24.5% The augmented train dataset contains comments in the Tunisian dialect, which may have led to a decrease in the results of the first Subtask. Hence, run 1 outperforms run 2 in the subtask 1. Results of Subtask 2 are the same because we did not increase the train dataset with our Tunisian Misogyny one. 5. Conclusion In this work, two Machine Learning (SVM and NB) and four language models were used to classify misogyny and to detect misogynic behaviour (mBERT, AraBERT, ARBERT and MARBERT). The best results were obtained by MARBERT for both tasks with different hyperparameters, which was selected for the final submission. Future work would involve working on bigger contextualized pretrained models and enriching the existing Misogyny Content and Misogyny Behaviour datasets. References [1] H. Mulki, B. Ghanem, Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language, in: Proceedings of the 6th Arabic Natural Language Processing Workshop (WANLP 2021), 2021. [2] H. Mulki, B. Ghanem, ArMI at FIRE2021: Overview of the First Shared Task on Arabic Misogyny Identification, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021. [3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186. [4] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language un- derstanding, in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020, pp. 9–15. [5] M. Abdul-Mageed, A. Elmadany, E. M. B. Nagoudi, Arbert & marbert: Deep bidirectional transformers for arabic, ArXiv abs/2101.01785 (2021).