Overview of the HASOC Subtrack at FIRE 2023: Hate-Speech Identification in Sinhala and Gujarati

Shrey Satapara1, Hiren Madhu2, Tharindu Ranasinghe3, Alphaeus Eric Dmonte4, Marcos Zampieri4, Pavan Pandya5, Nisarg Shah5, Sandip Modha6, Prasenjit Majumder7 and Thomas Mandl8

1 Indian Institute of Technology, Hyderabad, India
2 Indian Institute of Science, Bangalore, India
3 Aston University, United Kingdom
4 George Mason University, USA
5 Indiana Bloomington University, USA
6 LDRP-ITR, Gandhinagar, India
7 DA-IICT, Gandhinagar, India
8 University of Hildesheim, Germany

Abstract
Detecting offensive and hateful content in low-resource languages poses a significant challenge due to the limited availability of benchmark datasets. It is crucial to address this gap by creating benchmark datasets tailored to these languages. This not only enhances the accuracy of detection but also provides valuable insights into the efficacy of identifying problematic content in comparison to high-resource languages. In line with this commitment to advancing research on low-resource languages, the Hate Speech and Offensive Content Identification (HASOC) shared task introduced a dedicated subtrack for Hate Speech Identification in Sinhala and Gujarati in 2023. This paper outlines the objectives of the task, discusses the characteristics of the data involved, and presents an analysis of the participants’ submissions. For Task 1a, we utilized an existing Sinhala dataset (SOLD) consisting of 10,000 tweets. Meanwhile, for Task 1b, focused on Gujarati, we curated a new dataset comprising 1,020 tweets. A total of 16 teams submitted experiments for Sinhala, with the leading team achieving an impressive F1 score of 0.83. In the case of the Gujarati task, 17 teams participated, and the highest-performing team achieved an F1 score of 0.84. These results highlight the significance of tailored datasets in facilitating the effective detection of offensive content in low-resource languages.
Keywords
Hate Speech, Social NLP, Social Media, Language Resource, Deep Learning, Low-Resource Language, Evaluation, Benchmark

Forum for Information Retrieval Evaluation, December 15-18, 2023, Goa, India
Email: shreysatapara@gmail.com (S. Satapara); hirenmadhu16@gmail.com (H. Madhu); t.ranasinghe@aston.ac.uk (T. Ranasinghe); admonte@gmu.edu (A. Dmonte); marcos.zampieri@rit.edu (M. Zampieri); pavanpandya1311@gmail.com (P. Pandya); nisarg0606@gmail.com (N. Shah); sjmodha@gmail.com (S. Modha); p_majumder@daiict.ac.in (P. Majumder); mandl@uni-hildesheim.de (T. Mandl)
ORCID: 0000-0001-6222-1288 (S. Satapara); 0000-0002-6701-6782 (H. Madhu); 0000-0002-2346-3847 (M. Zampieri); 0000-0003-2427-2433 (S. Modha); 0000-0002-8398-9699 (T. Mandl)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073

1. Introduction

Hate speech is a global problem that plagues social media platforms in many countries [1]. Hate speech can ultimately also lead to violent hate crimes [2]. Consequently, detection and moderation are necessary to maintain a rational discourse that allows an exchange of arguments. Reduced efforts in content moderation can lead to the proliferation of hate speech, as the case of Twitter has shown recently [3]. The Hate Speech and Offensive Content Identification (HASOC) initiative has organized shared tasks since 2019 [4] and created resources for several languages. Efforts to create language resources for low-resource languages are of special importance. Research needs to analyze which resources are beneficial for such languages.
Is it better to develop language-specific resources, or is it better to use resources from high-resource languages like English and exploit this knowledge in a low-resource context (e.g., by translating content or by transfer learning between languages) [5]?

Many offensive language detection benchmarks are available for English and other high-resource languages. However, in the last few years, the NLP community has focused on creating more datasets for low-resource languages such as Marathi [6], Oromo [7], Swahili [8], Greek [9], Danish [10], and Albanian [11]. Supporting this, the last two editions of HASOC contained shared tasks on identifying offensive language in Marathi. In 2023, HASOC subtask 1 focused on identifying hate speech, offensive language, and profanity in Sinhala and Gujarati. Sinhala is a low-resource Indo-Aryan language spoken by around 16 million people, mainly in Sri Lanka. Gujarati is also a low-resource Indo-Aryan language, spoken by approximately 50 million people, mainly in North-Western India.

Task 1A deals with identifying hate and offensive content in Sinhala. The task involves classifying tweets as Hate and Offensive (HOF) or Non-Hate and Offensive (NOT). The dataset for this task is based on the Sinhala Offensive Language Detection dataset (SOLD) [12]. Task 1B focuses on identifying hate and offensive content in Gujarati. As in Task 1A, the participants had to classify tweets into the HOF or NOT categories. We created a new dataset for Gujarati with 1020 annotated tweets. More details about the datasets are available in Section 2.

Overall, both tasks were highly successful and gained the attention of the NLP community. The interest demonstrated last year continued this year, with 16 teams participating in the Sinhala task and 17 teams participating in the Gujarati task. Furthermore, it should be highlighted that this is the first-ever shared task organised for Sinhala.
We believe that this shared task will open many research avenues for low-resource languages like Sinhala and Gujarati.

2. Data

2.1. Sinhala dataset

The data used for subtask 1A are from the Sinhala Offensive Language Dataset (SOLD) [12], available at https://huggingface.co/datasets/sinhala-nlp/SOLD. The dataset consists of 10,000 annotated Twitter posts and is intended for detecting offensive text in Sinhala. It has two splits, the training and test sets, containing 7500 and 2500 tweets, respectively. The original dataset has two annotation levels: sentence-level and token-level. Its authors followed the OLID [13] Task A annotation scheme for the sentence-level annotations, which we used for subtask 1A. The original paper reports a Fleiss’ kappa inter-annotator agreement of 0.7 to 0.8 for this dataset. The class distribution of the dataset is shown in Table 1.

Class   Train   Test
HOF     3176    1015
NOT     4324    1485

Table 1
Class distribution in SOLD [12]

2.2. Gujarati dataset

We created a new Gujarati offensive language detection dataset for subtask 1B, using the Sinhala dataset from subtask 1A as a starting point. We first collected all the unique offensive tokens from the Sinhala dataset. These tokens were then automatically translated to Gujarati. From the translations, we manually selected 45 tokens. We also collected offensive tokens from various websites (e.g., https://www.youswear.com/index.asp?language=Gujarati) and manually selected the ones that were appropriate for our problem statement. We then used an in-house web scraper to collect tweets containing those keywords. We present the statistics for the Gujarati dataset in Table 2. As the table shows, we provided the participants with only 200 labelled text samples. This was done to encourage participants to develop innovative zero-shot and few-shot learning techniques that make use of high-resource datasets. For the annotations, the inter-annotator agreement was 0.7474.
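The text does not state which agreement coefficient produced the Gujarati figure or how many annotators were involved. As a minimal sketch of how such a score can be computed, the following assumes two annotators and Cohen's kappa; both are assumptions made for illustration, not the authors' stated procedure. The function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy annotations, invented for illustration (not taken from the dataset).
annotator_1 = ["HOF", "HOF", "NOT", "NOT", "HOF", "NOT", "NOT", "HOF", "NOT", "NOT"]
annotator_2 = ["HOF", "NOT", "NOT", "NOT", "HOF", "NOT", "HOF", "HOF", "NOT", "NOT"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

With more than two annotators, a coefficient such as Fleiss' kappa (the one reported for SOLD) would be the usual choice instead.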
Class   Train   Test
HOF     100     376
NOT     100     820
Total   200     1196

Table 2
Class distribution in Gujarati

3. Results

The results for Subtask 1A are presented in Table 3. A total of 52 systems were submitted by 16 teams. The best-performing system of each team is displayed in Table 3, ranked by F1 score. The performance of the top five teams was very similar. Most of the top teams used pre-trained transformer models that support Sinhala, such as XLM-R. Several teams used sentence embeddings such as SBERT and LaBSE in their experiments. Interestingly, some teams used mBERT, which was not trained on Sinhala text, and still achieved mid-table finishes. Team “FiRC-NLP” had the best-performing system, with an F1 score of 0.8382, followed by “Krispy Mango” and “AiAlchemists”, with F1 scores of 0.8371 and 0.8355, respectively. The last-ranked team had an F1 score of 0.5574.

Rank  Team Name               Number of Runs  Precision  Recall  F1
1     FiRC-NLP [14]           2               0.8368     0.8399  0.8382
2     Krispy Mango [15]       5               0.8439     0.8326  0.8371
3     AiAlchemists [16]       3               0.8339     0.8374  0.8355
4     SATLab [17]             5               0.8377     0.833   0.8351
5     Z-AGI Labs [18]         4               0.8342     0.8357  0.8349
6     NAVICK [19]             5               0.8304     0.8262  0.8281
7     XAG-TUD [20]            1               0.8178     0.8093  0.8127
8     Gradient Descenders     1               0.8087     0.8059  0.8072
9     SSN_CSE_ML_TEAM [21]    5               0.7977     0.7923  0.7946
10    IRLab@IITBHU [22]       4               0.7853     0.7849  0.7851
11    MUCS_3                  5               0.8056     0.7753  0.7832
12    CNLP-NITS-PP [23]       2               0.7716     0.7707  0.7711
13    UINSUSKA-Mandiri [24]   3               0.7393     0.7455  0.741
14    Wunderkinds             1               0.6839     0.66    0.6628
15    Hate Speech Detectives  1               0.6446     0.6425  0.6433
16    LEGEND [25]             5               0.5588     0.5572  0.5574

Table 3
Results of Subtask 1A - Sinhala. The best system for each team is reported, ordered from best- to worst-performing system

Table 4 presents the results of Subtask 1B. A total of 54 submissions were received from 17 different teams. Team “FiRC-NLP” achieved the highest F1 score (0.8488) with their submission named “no-kfold”, reaching a precision of 0.8392 and a recall of 0.8638 by fine-tuning an XLM-RoBERTa large checkpoint. Following closely, “SATLab” secured the second position with their submission “HasocT1bR4”, earning an F1 score of 0.8383 by learning character-level n-grams with classical machine learning classifiers. Team “Krispy Mango” ranked third with an F1 score of 0.7956 by fine-tuning XLM-RoBERTa. The table gives an overview of the competition results and shows the strong performance of many teams and their respective submissions.

Rank  Team                    Number of Runs  Precision  Recall  F1
1     FiRC-NLP [14]           2               0.8391     0.8637  0.8487
2     SATLab [17]             5               0.8500     0.8292  0.8382
3     Krispy Mango [15]       5               0.7896     0.8034  0.7956
4     AiAlchemists [16]       4               0.7859     0.8254  0.7926
5     XAG-TUD [20]            2               0.7717     0.7958  0.7799
6     SSN_CSE_ML_TEAM [21]    5               0.7675     0.8048  0.7731
7     Z-AGI Labs [18]         4               0.7607     0.7970  0.7660
8     LEGEND [25]             5               0.7711     0.7415  0.7526
9     Wunderkinds             2               0.7292     0.7606  0.7333
10    Sanvadita [26]          6               0.7394     0.7776  0.7324
11    MUCS                    7               0.7250     0.7572  0.7276
12    NAVICK [19]             4               0.7038     0.7364  0.6945
13    IRLab@IITBHU [22]       4               0.6915     0.7205  0.6896
14    CNLP-NITS-PP [23]       1               0.6998     0.7317  0.6873
15    UINSUSKA-Mandiri [24]   3               0.6929     0.7218  0.6675
16    Gradient Descenders     2               0.6710     0.6626  0.6661

Table 4
Results of Subtask 1B - Gujarati. The best system for each team is reported, ordered from best- to worst-performing system

4. Conclusion and Future Work

We presented the results of HASOC 2023 Task 1, which featured datasets in two low-resource Indo-Aryan languages: Sinhala and Gujarati.
A total of 16 teams submitted experiments for Sinhala and 17 teams participated for Gujarati. The wide participation in the task allowed us to compare a number of approaches. We observed that the best systems for both languages used pre-trained transformers that support Sinhala and Gujarati, such as XLM-R and mBERT. Furthermore, since the Gujarati dataset contained only a limited number of training instances, several teams utilised cross-lingual transfer learning to improve their performance. Although both are low-resource languages, the top teams produced competitive results that are comparable to those for high-resource languages.

We plan to extend the task in several ways. First, we plan to organize an offensive span detection task for these two languages, which will improve the explainability of offensive language detection models. Second, we hope to add more Indo-Aryan languages that are less researched in the NLP community. HASOC 2023 is the first-ever shared task organised for Sinhala and one of the few shared tasks organized for Gujarati. We believe that, in light of HASOC 2023, many more shared tasks will be created for these languages in the future, increasing the involvement of NLP researchers in these low-resource languages.

Acknowledgments

This work was partially supported by a grant from the Artificial Intelligence Journal (AIJ) for sponsoring AI research (28th call for sponsorship).

References

[1] B. Di Fátima, Hate Speech on Social Media: A Global Approach, LabCom Books & EdiPUCE, Covilhã, Portugal, 2023. doi:10.25768/654-916-9.
[2] K. Müller, C. Schwarz, From hashtag to hate crime: Twitter and antiminority sentiment, American Economic Journal: Applied Economics 15 (2023) 270–312. doi:10.1257/app.20210211.
[3] D. Hickey, M. Schmitz, D. Fessler, P. E. Smaldino, G. Muric, K. Burghardt, Auditing Elon Musk’s impact on hate speech and bots, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 17, 2023, pp. 1133–1137. doi:10.1609/icwsm.v17i1.22222.
[4] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandalia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE ’19: Forum for Information Retrieval Evaluation, Kolkata, India, December, 2019, ACM, 2019, pp. 14–17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
[5] M. U. Arshad, R. Ali, M. O. Beg, W. Shahzad, Uhated: hate speech detection in urdu language using transfer learning, Lang. Resour. Evaluation 57 (2023) 713–732. URL: https://doi.org/10.1007/s10579-023-09642-7. doi:10.1007/s10579-023-09642-7.
[6] S. S. Gaikwad, T. Ranasinghe, M. Zampieri, C. Homan, Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi, in: G. Angelova, M. Kunilovskaya, R. Mitkov, I. Nikolova-Koleva (Eds.), Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), Held Online, 1-3 September, 2021, INCOMA Ltd., 2021, pp. 437–443. URL: https://aclanthology.org/2021.ranlp-1.50.
[7] N. B. Defersha, J. Abawajy, K. Kekeba, Deep learning based multilabel hateful speech text comments recognition and classification model for resource scarce ethiopian language: The case of afaan oromo, in: IEEE International Conference on Current Development in Engineering and Technology (CCET), IEEE, 2022, pp. 1–11. doi:10.1109/CCET56606.2022.10080837.
[8] E. Ombui, L. Muchemi, P. Wagacha, Building and annotating a codeswitched hate speech corpora, Int. J. Inf. Technol. Comput. Sci 3 (2021) 33–52.
[9] Z. Pitenis, M. Zampieri, T. Ranasinghe, Offensive Language Identification in Greek, in: Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, European Language Resources Association, 2020, pp. 5113–5119. URL: https://aclanthology.org/2020.lrec-1.629/.
[10] G. I. Sigurbergsson, L. Derczynski, Offensive Language and Hate Speech Detection for Danish, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, European Language Resources Association, 2020, pp. 3498–3508. URL: https://aclanthology.org/2020.lrec-1.430/.
[11] E. Kaziaj, FUELLING hate: Hate speech towards women in online news websites in Albania, in: Gender and Sexuality in the European Media, Routledge, 2021, pp. 100–118.
[12] T. Ranasinghe, I. Anuradha, D. Premasiri, K. Silva, H. Hettiarachchi, L. Uyangodage, M. Zampieri, Sold: Sinhala offensive language dataset, arXiv preprint arXiv:2212.00851 (2022).
[13] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1415–1420. doi:10.18653/v1/N19-1144.
[14] M. S. Jahan, F. Hassan, W. Aransa, A. Bouchekif, Multilingual Hate Speech Detection Using Ensemble of Transformer Models, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[15] M. K. Sathya, K. Gopalakrishnan, M. PA, P. Balasundaram, Sinhala and gujarati hate speech detection, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[16] C. Muhammad Awais, J. Raj, Breaking Barriers: Multilingual Toxicity Analysis for Hate Speech and Offensive Language in Low-Resource Indo-Aryan Languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[17] Y. Bestgen, Using Only Character Ngrams for Hate Speech and Offensive Content Identification in Five Low-Ressource Languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[18] N. Narayan, M. Biswal, P. Goyal, A. Panigrahi, Hate Speech and Offensive Content Detection in Indo-Aryan Languages: A Battle of LSTM and Transformers, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[19] M. Rostamkhani, S. Eetemadi, Detecting hate speech and offensive content in english and indo-aryan texts, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[20] M. D. M. Qureshi, M. Sawant, M. A. Qureshi, W. Rashwan, A. Younus, S. Caton, Hate speech classification for sinhalese and gujarati, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[21] S. G GNANA, A. Venkatesh, K. N, O. M, B. V. A, P. Balasundaram, Enhancing hate speech detection in sinhala and gujarati: Leveraging bert models and linguistic constraints, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[22] S. Chanda, A. Dhaka, S. Pal, Crossing borders: Multilingual hate speech detection, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[23] G. Kalita, E. Halder, C. Taparia, A. Vetagiri, D. P. Pakray, Examining Hate Speech Detection Across Multiple Indo-Aryan Languages in Tasks 1 & 4, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[24] S. Agustian, Z. Idhafi, A. F. Rihardi, Improving detection of hate speech, offensive language and profanity in short texts with svm classifier, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[25] O. E. Ojo, O. O. Adebanji, H. Calvo, A. Gelbukh, A. Feldman, G. Sidorov, Hate and offensive content identification in indo-aryan languages using transformer-based models, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.
[26] A. Joshi, R. Joshi, Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.