=Paper=
{{Paper
|id=Vol-3159/T5-6
|storemode=property
|title=Detecting Misogyny in Arabic Tweets
|pdfUrl=https://ceur-ws.org/Vol-3159/T5-6.pdf
|volume=Vol-3159
|authors=Abdusalam Nwesri,Stephen Wu,Harmain Harmain
|dblpUrl=https://dblp.org/rec/conf/fire/NwesriWH21
}}
==Detecting Misogyny in Arabic Tweets==
Abdusalam Nwesri1, Stephen Wu2 and Harmain Harmain3
1 Faculty of Information Technology, University of Tripoli, Tripoli, Libya
2 School of Biomedical Informatics, UTHealth, Houston, TX, USA
3 Faculty of Information Technology, University of Tripoli, Tripoli, Libya

Abstract
Systems that can automatically detect offensive content are of great value, for example, to provide protective settings for users or to assist social media supervisors with the removal of odious language. In this paper, we present three machine learning models developed at the University of Tripoli, Libya, for the detection of misogyny in Arabic colloquial tweets. We present the results obtained with these models in the first Arabic Misogyny Identification shared task (ArMI'21), a sub-track of HASOC@FIRE2021. With our first model (optimized BERT-based pipelines), we placed as the second-ranked team on sub-task A (Misogyny Content Identification) and as the third-ranked team on sub-task B (Misogyny Behavior Identification).

Keywords
Arabic misogyny detection, hate speech detection

1. Introduction

Public speech that expresses hate or encourages violence toward a person or group based on race, religion, sex, or sexual orientation is defined as hate speech. Expressing hatred of women, or the belief that men are far superior to women, is termed misogyny.1 Misogyny is a growing phenomenon in virtual environments such as social media, where people can express their feelings with fewer restrictions than in face-to-face settings. For example, Facebook reported 31.5 million instances of content with hate speech in the second quarter of 2021,2 and Twitter reported a 54 percent increase in the number of accounts violating its hateful conduct policy in the six-month period after July 2019.3 Social media companies struggle with the ethical ramifications of misogynistic speech on their platforms.
Thus, some companies, such as Twitter, hire a large number of employees to moderate content. However, the huge number of social media posts generated every day makes manual moderation unscalable, so the assistance of automated misogyny detection systems is necessary to enable this type of curation.

Automatically labeling Arabic colloquial tweets as misogynous or non-misogynous is a challenging task: the language of tweets is full of syntactic and grammatical flaws, which makes the extraction of text-based features difficult, and tweets are short, often consisting of only a few words. In this paper, we present three models to detect misogyny in Arabic tweets, developed for the first Arabic Misogyny Identification (ArMI) shared task, a sub-track of Hate and Offensive Content Identification (HASOC) at the 2021 Forum for Information Retrieval Evaluation (FIRE 2021).

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
a.nwesri@uot.edu.ly (A. Nwesri); wu.stephen.t@gmail.com (S. Wu); h.harmain@uot.edu.ly (H. Harmain)

1 dictionary.cambridge.org
2 transparency.fb.com/community-standards-enforcement/hate-speech/facebook/
3 blog.twitter.com/en_us/topics/company/2020/new-transparency-center

2. Related Work

Though misogyny detection in Arabic text is a recent topic, some previous work has been done on offensive language and hate speech detection in Arabic. The first study on abusive language detection in Arabic was by Abozinadah et al. (2015). They tested three machine learning algorithms — Naïve Bayes (NB), Support Vector Machines (SVM), and Decision Tree (J48) classifiers — to detect abusive tweets on a set of 1,300,000 Arabic tweets collected using five swear words. They reported that the NB algorithm was the best performer, with an accuracy of 90%. Alakrot et al.
(2018) constructed a data set of 167,549 YouTube comments and utilized SVMs to classify comments as either positive or negative, reporting that the SVM classifier achieved 90.05% accuracy. Husain (2020) tested the impact of the pre-processing phase on the detection of offensive and hate speech in Arabic text. The author used an SVM classifier to identify offensive and hate speech in a data set before and after applying pre-processing techniques to the original text. The pre-processing techniques improved classification, with an F1 score of 89% for the offensive language detection task and 95% for the hate speech classification task. Mulki and Ghanem (2021b) built a Levantine data set of 6,603 tweets collected from the Twitter accounts of several female journalists who covered the Lebanese protests of October 2019. Tweets in the data set are annotated as misogynous or none, and misogynous tweets are further classified to differentiate between, for example, a threat of violence and a derailing comment. The authors compared several models for detecting misogyny and found that BERT was the best at classifying tweets as misogynous, with an F1 score of 0.88; for the categorical classification, the model of Frenda et al. (2018) was the best performer, with an F1 score of 0.43.

3. Experiment Description

Our experiment was part of the ArMI shared task, a sub-track of HASOC at FIRE 2021. The task aims to identify misogynistic tweets and to recognize different misogynistic categories in a collection of Arabic (MSA/dialectal) tweets.

3.1. Task Details

ArMI 2021 used a data set of 7,866 tweets written in Modern Standard Arabic (MSA) and several Arabic dialects, including Gulf, Egyptian, and Levantine (Mulki and Ghanem, 2021c). Teams could participate in two sub-tasks. In sub-task A, participants were required to identify a tweet as misogynistic (misogyny) or non-misogynistic (none).
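Sub-task A is thus a standard binary text-classification problem. As a minimal illustrative sketch (the placeholder tweets and labels below are invented English stand-ins, not ArMI data, and this is not our exact submitted pipeline), a classical bag-of-words baseline can be written with scikit-learn:

```python
# Minimal sketch of sub-task A as binary classification.
# Placeholder tweets (invented); the real task uses Arabic (MSA/dialectal) tweets.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "women cannot be trusted with anything",   # placeholder misogynistic tweet
    "great football match last night",         # placeholder neutral tweet
    "she should stay silent like all women",   # placeholder misogynistic tweet
    "the weather in tripoli is lovely today",  # placeholder neutral tweet
]
train_labels = ["misogyny", "none", "misogyny", "none"]

# Token counts -> tf-idf weighting -> multinomial Naive Bayes classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["all women should stay silent"])[0])
```

A real system would add Arabic-specific pre-processing and far more training data, but the two-label setup is exactly this shape.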
In sub-task B, participants were required to classify misogynistic tweets into one of the behavior categories (discredit, derailing, dominance, stereotyping & objectification, threat of violence, sexual harassment, or damning). Two data sets were released: a training set with its gold-standard classifications, and a test set with the gold standard withheld. The training set was used to tune detection algorithms, and the test set was used to blindly classify new, unannotated tweets. More details about the tasks are given by Mulki and Ghanem (2021a).

4. Experiments

We participated under the name University of Tripoli (UoT), with three runs in each sub-task. The runs are described below.

4.1. Sub-task A (Misogyny Content Identification)

4.1.1. UoT run1: Large BERT-based pipelines

Full pipelines with BERT models (Devlin et al., 2019) at their center were compared for performance on the training set. Words were segmented using the Farasa stemmer.4 We then used the American University of Beirut's AraBERT v2 (Antoun et al., 2020) with the BERT-large architecture, pretrained on OSCAR, Arabic Wikipedia, the 1.5B-word Arabic Corpus, the OSIAN corpus, and Assafir news articles. We fine-tuned this model on our training data set for misogyny content identification.

4.1.2. UoT run2: Statistical ML Classifiers

We also created a classical machine learning pipeline based on the sklearn library (Pedregosa et al., 2011). The pipeline consists of four stages: a pre-processor, a count vectorizer, a tf-idf transformer, and a MultinomialNB classifier. In the text pre-processing stage, special characters, links, commas, and usernames in @-mentions were removed from the tweets. The hyperparameters of the transformers and the classifier were then tuned.

4.1.3.
UoT run3: Feed-forward networks

We started by removing all non-Arabic characters from the text. We then collapsed repeated characters, leaving at most two consecutive occurrences of each of the following characters: و, ه, ا, ل, ي, ف. We then removed the dot character, normalized the different forms of Hamza to a bare Alif (ا), and split the vocative particle يا from the beginning of any word in the text. This last adjustment was made because this particle is used to address someone in Arabic, is widely used in Arabic hate speech, and is often attached to the following word. We also normalized a set of wrongly written Arabic phrases widely used in damning someone (curse formulas invoking الله, such as قبح الله and لعنت الله). We then normalized ة to ه and final ى to ي. Finally, we replaced the female addressee pronoun انتي with انت, since most tweets address females.

4 https://alt.qcri.org/farasa/

Table 1: Sub-task A results obtained using the training data set
Run        Acc.    Recall  Precision  F1
UoT_run1   0.909   0.903   0.904      0.904
UoT_run2   0.83    0.81    0.82       0.82
UoT_run3   0.841   0.832   0.845      0.838

Table 2: Sub-task B results obtained using the training data set
Run        Acc.    Recall  Precision  F1
UoT_run1   0.769   0.468   0.494      0.474
UoT_run2   0.69    0.55    0.33       0.36
UoT_run3   0.728   0.888   0.508      0.654

Using word frequency in the training data set, we removed a list of 29 tokens chosen based on their frequency. The remaining words were transformed into a matrix of numbers based on their tf-idf scores in tweets. A 2-layer feed-forward neural network was implemented in Keras and trained with a batch size of 100 for 4 epochs. The final F1 score we obtained was 0.838.

4.2.
Sub-task B (Misogyny Behavior Identification)

We treated sub-task B as a multi-class classification problem, using essentially the same strategy and system for each of the three runs, but training on the more fine-grained descriptions of misogyny behavior: damning, discredit, dominance, sexual harassment, stereotyping & objectification, and threat of violence. Based on the full-pipeline comparisons for UoT_run1, we selected the highest-performing pipeline, which included basic tokenization and a BERT-large model from Koç University. For run 3, we used the same experimental setup as in sub-task A; however, we used the "binary" mode to transform words to either 0 or 1 based on their presence in the tweets. Results on the training data set are shown in Table 2.

5. Official results

The test set was released, and the above runs were submitted for evaluation to the organizing committee. Table 3 shows the results obtained by the participating teams, including our submitted runs. Our UoT_run1 scored in the 4th position (2nd-best team), while UoT_run3 and UoT_run2 scored 12th and 14th, respectively. These results show that the BERT-based approach is the best performer among our runs. Table 4 shows the results of the participants' submitted runs for sub-task B. Our best performer is again UoT_run1, at the 8th position (3rd-best team); UoT_run3 and UoT_run2 were in the 11th and 13th positions, respectively.

Table 3: Results of participating teams in sub-task A
Run                   Acc.    Recall  Precision  F1
UM6P-NLP_run3         0.919   0.92    0.909      0.914
UM6P-NLP_run2         0.915   0.915   0.905      0.91
UM6P-NLP_run1         0.915   0.911   0.911      0.911
UoT_run1              0.95    0.901   0.899      0.9
SOA_NLP_run1          0.883   0.878   0.876      0.877
BERT                  0.88    0.87    0.88       0.87
MUCIC_run1            0.873   0.868   0.864      0.866
SOA_NLP_run2          0.873   0.868   0.865      0.866
(Frenda et al., 2018) 0.87    0.86    0.86       0.86
MUCIC_run2            0.866   0.86    0.857      0.858
SOA_NLP_run3          0.854   0.846   0.85       0.848
UoT_run3              0.842   0.835   0.831      0.833
iCompass_run1         0.833   0.826   0.82       0.83
UoT_run2              0.827   0.819   0.833      0.822
iCompass_run1         0.508   0.502   0.503      0.499
Isey_run2             0.483   0.506   0.506      0.483
Isey_run1             0.474   0.5     0.5        0.474

Table 4: Results of participating teams in sub-task B
Run                   Acc.    Recall  Precision  F1
UM6P-NLP_run2         0.827   0.697   0.647      0.665
UM6P-NLP_run3         0.833   0.717   0.636      0.653
UM6P-NLP_run1         0.816   0.692   0.652      0.651
SOA_NLP_run2          0.764   0.676   0.48       0.531
SOA_NLP_run3          0.745   0.559   0.508      0.526
(Frenda et al., 2018) 0.77    0.66    0.47       0.52
SOA_NLP_run1          0.78    0.549   0.502      0.519
UoT_run1              0.789   0.541   0.508      0.517
MUCIC_run1            0.765   0.578   0.46       0.497
MUCIC_run2            0.762   0.572   0.456      0.493
UoT_run3              0.73    0.585   0.432      0.468
BERT                  0.76    0.54    0.4        0.43
UoT_run2              0.709   0.524   0.382      0.407
iCompass_run2         0.637   0.242   0.248      0.245
iCompass_run1         0.637   0.242   0.248      0.245

6. Conclusion

We have tested three machine learning approaches for classifying Arabic tweets. AraBERT, feed-forward networks, and traditional machine learning models were tested on classifying Arabic tweets as misogynous or non-misogynous, and additionally on classifying misogynistic tweets into six predefined categories. The AraBERT model was by far the best performer, with an F1 score of 90% in the first task and 51.7% in the second. In future work, we plan to combine the preprocessing steps used with the Keras model with the AraBERT approach.

References

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Towards accurate detection of offensive language in online communication in Arabic. Procedia Computer Science, 142:315–320. Arabic Computational Linguistics.

Antoun, W., Baly, F., and Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15, Marseille, France.
European Language Resources Association.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.

Abozinadah, E. A., Mbaziira, A. V., and Jones, J. H., Jr. (2015). Detection of abusive accounts with Arabic tweets. International Journal of Knowledge Engineering, 1:113–119.

Frenda, S., Ghanem, B., and y Gómez, M. M. (2018). Exploration of misogyny in Spanish and English tweets. In IberEval@SEPLN.

Husain, F. (2020). OSACT4 shared task on offensive language detection: Intensive preprocessing-based approach.

Mulki, H. and Ghanem, B. (2021a). ArMI at FIRE2021: Overview of the First Shared Task on Arabic Misogyny Identification. In Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation. CEUR.

Mulki, H. and Ghanem, B. (2021b). Let-Mi: An Arabic Levantine Twitter dataset for misogynistic language.

Mulki, H. and Ghanem, B. (2021c). Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the 6th Arabic Natural Language Processing Workshop (WANLP 2021).

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.