Sinhala and Gujarati Hate Speech Detection

M. Krithik Sathya1, K.H. Gopalakrishnan1, Manickam PA1 and Prabavathy Balasundaram2
1 UG Student, Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamil Nadu, India
2 Faculty, Department of Computer Science, Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamil Nadu, India

Abstract
This study, conducted by the "Krispy Mango" research team, focuses on hate speech and offensive content detection in two low-resource Indo-Aryan languages, Sinhala and Gujarati, as part of the HASOC 2023 shared tasks. We address the difficulty of classifying tweets into Hate and Offensive (HOF) and Non-Hate and Offensive (NOT) categories by fine-tuning BERT-based models. This work reports macro F1 scores and precision metrics for both languages. Our approach aims to advance the state of the art in hate speech detection while taking into account the particular linguistic characteristics and resource restrictions of these languages.

Keywords
Hate Speech Detection, Offensive Language Identification, BERT Models, Text Classification, Multilingual NLP

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
krithik2110693@ssn.edu.in (M. K. Sathya); gopalakrishnan2110375@ssn.edu.in (K.H. Gopalakrishnan); manickam2110305@ssn.edu.in (M. PA); prabavathyb@ssn.edu.in (P. Balasundaram)
ORCID: 0000-0002-0877-7063 (M. K. Sathya); 0000-0001-7116-9338 (K.H. Gopalakrishnan); 0000-0002-9421-8566 (M. PA); 0000-0002-9421-8567 (P. Balasundaram)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The digital age has revolutionized the way we communicate and connect with one another, primarily through the widespread adoption of social media platforms. However, this unprecedented level of global interconnectivity has also brought about a concerning surge in hate speech and offensive content. Effectively addressing this challenge and developing robust methods for detecting and countering hate speech has become imperative. This research aims to contribute significantly to this effort by focusing on hate speech detection in two South Asian languages, Sinhala and Gujarati. While hate speech detection in English has received substantial attention, these languages have received comparatively little consideration in the realm of Natural Language Processing (NLP).

Hate speech is a pervasive issue that crosses linguistic boundaries, emphasizing the importance of developing models that can identify such content in non-English languages as effectively as in English. Sinhala, spoken in Sri Lanka, and Gujarati, a major Indian language, pose unique linguistic challenges due to their complex structures and distinct scripts. Detecting hate speech in these languages demands tailored approaches and models capable of understanding their intricacies. [1]

The rest of this paper is structured as follows. We first provide a thorough description of the Sinhala and Gujarati task configuration for HASOC 2023. We then present our experimental methodology, which fine-tunes pre-trained BERT models on the available training data. To make the most of the limited linguistic resources, we also investigate the transferability of models across linguistic boundaries.
Finally, we examine our findings and highlight our research's potential impact on reducing online hate speech and harmful language in different linguistic communities. Our work advances knowledge of hate speech identification in low-resource language settings by merging insights from two different languages. [2] [3]

2. Related Works

In recent research by Dhananjaya et al. [4], the effectiveness of pre-trained language models for Sinhala text classification was explored. Among these models, XLM-R emerged as the most potent choice. The study introduced RoBERTa-based monolingual Sinhala models, establishing strong baselines even in the presence of limited labelled data. Additionally, this research made significant contributions by releasing annotated datasets, providing valuable resources for future studies in Sinhala text classification.

Chiorrini et al. [5] conducted a study focusing on the applicability of Bidirectional Encoder Representations from Transformers (BERT) models for sentiment analysis and emotion recognition in Twitter data. Through the development and fine-tuning of two classifiers for each task, they achieved remarkable results, with BERT-based models reaching accuracy rates of 92 per cent for sentiment analysis and 90 per cent for emotion recognition. These findings underscored BERT's proficiency in modelling language for text classification within the realm of social media data.

Tiwari et al. [6] directed their efforts towards addressing challenges in hate speech recognition within the context of social media platforms. They conducted a comparative analysis of various machine learning algorithms, emphasizing accuracy and precision metrics. Their findings identified the combination of XGBoost and TF-IDF embeddings as the highest-performing approach, achieving an accuracy rate of 94.43 per cent. This research emphasized the critical role of hate speech detection in promoting user safety and compliance with laws addressing offensive content.

Wang [10] offered a comprehensive retrospective on the evolution of text classification, spanning traditional shallow learning techniques to deep learning models. The meticulous examination of six pivotal methods, including ReNN, MLP, RNN, CNN, Attention, and Transformer, highlighted their respective strengths and limitations. The paper underscored the dominance of deep learning models in text classification and highlighted ongoing research in attention mechanisms, Transformers, robustness, and graph neural networks, indicating the continuous evolution of text classification solutions.

Ding et al. [7] introduced an innovative approach, Hypergraph Attention Networks (HANs), for inductive text classification. With a focus on efficiency and performance enhancement, HANs harnessed hypergraph structures to capture intricate word relationships within textual data. By utilizing sparse hypergraphs, this method effectively managed computational complexity, showcasing its scalability for extensive datasets. Experimental results underscored HANs' superiority over existing techniques, demonstrating their potential for proficient inductive text classification while efficiently utilizing computational resources.

Minaee et al. [8] conducted an extensive review comparing deep learning models to classical machine learning in text classification tasks such as sentiment analysis and news categorization. They evaluated over 150 recent deep learning-based text classification models, providing insights into technical innovations and strengths.
The paper also analyzed the performance of these models on benchmark datasets, supporting their effectiveness with empirical evidence. It concluded by outlining potential avenues for future research, serving as a valuable resource for understanding the current landscape and future potential of deep learning in text classification.

3. Task and Dataset Description

3.1. Sub Task: Identifying Hate, offensive and profane content in Sinhala

The task focuses on binary classification of tweets published in Sinhala. The two classification categories are as follows:

1. Hate and Offensive (HOF): Tweets that target people or groups based on attributes like race, religion, ethnicity, gender, etc. They may also use profanity or other unpleasant language.
2. Non-Hate and Offensive (NOT): Tweets falling under this category do not contain any offensive language, profanity, or hate speech. They represent neutral or non-harmful expressions in the Sinhala language.

The train/test sets are based on the recently released SOLD: Sinhala Offensive Language Detection dataset. [9]

Table 1
Attributes of the CSV file for the Sinhala dataset

Field    Representation
post id  Unique id of the tweet
text     Content of the tweet
label    Classification of the tweet

3.2. Sub Task: Identifying Hate, offensive and profane content in Gujarati

The task focuses on binary classification of tweets published in Gujarati. The two classification categories are as follows:

1. Hate and Offensive (HOF): Tweets that target people or groups based on attributes like race, religion, ethnicity, gender, etc. They may also use profanity or other unpleasant language.
2. Non-Hate and Offensive (NOT): Tweets falling under this category do not contain any offensive language, profanity, or hate speech. They represent neutral or non-harmful expressions in the Gujarati language.

Table 2
Attributes of the CSV file for the Gujarati dataset

Field    Representation
post id  Unique id of the tweet
text     Content of the tweet
label    Classification of the tweet

4. Methodologies used

Different NLP architectures, namely xlm-roberta-base, bert-base-multilingual-cased, intfloat/multilingual-e5-base, openai/whisper-large, SinhalaBERTo, and Gujarati BERT, were employed for identifying hate, offensive, and profane content in tweets written in Gujarati and Sinhala.

4.1. Basic BERT Architecture

The BERT model, an acronym for "Bidirectional Encoder Representations from Transformers," is grounded in the transformer architecture, emphasizing attention mechanisms. Comprising a multi-layer bidirectional transformer encoder, it includes an input layer, multiple hidden layers, and an output layer. Input sequences undergo initial processing through an embedding layer before entering the transformer encoder. [10] This encoder consists of a stack of uniform layers, each housing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The self-attention mechanism enables the model to discern interrelations among input sequence positions, aiding contextual comprehension. The position-wise feed-forward network applies two linear transformations, with ReLU activations interleaved, to each sequence element, enabling the model to capture intricate patterns and interdependencies among input tokens. Importantly, the final hidden state of the initial token ([CLS]) serves as the holistic sequence representation for classification tasks.
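For concreteness, the snippet below is a minimal sketch of how a pre-trained encoder of this kind can be loaded with a sequence-classification head that reads the [CLS] representation. The checkpoint name, label mapping, and example tweet are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; any of the encoders described in Section 4 could be substituted.
checkpoint = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=2 for the binary HOF / NOT task; a fresh linear head is
# placed on top of the [CLS] hidden state.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

id2label = {0: "NOT", 1: "HOF"}  # assumed label mapping

tweet = "example tweet text"
inputs = tokenizer(tweet, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 2)
prediction = id2label[int(logits.argmax(dim=-1))]
print(prediction)
```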
BERT undergoes training through two unsupervised prediction tasks: masked language modeling and next-sentence prediction. This dual training equips BERT with deep bidirectional representations, leveraging contextual information from both preceding and subsequent contexts across all layers. Pre-trained BERT models can then be fine-tuned with an additional output layer, making them adaptable and potent tools for diverse natural language processing (NLP) tasks. [11] [12]

4.2. XLM-RoBERTa

The XLM-RoBERTa model represents a fusion of two renowned architectures: XLM (Cross-lingual Language Model) and RoBERTa. This variant excels in multilingual natural language processing tasks, with an emphasis on cross-lingual understanding. It has a large parameter count and a deep architecture comprising multiple transformer layers. XLM-RoBERTa is pre-trained on an extensive corpus encompassing a multitude of languages, allowing it to comprehend and generate text in a wide array of linguistic contexts. Its SentencePiece-based vocabulary preserves case information and covers a wide range of scripts, supporting robust performance across diverse inputs. This model's versatility and cross-lingual capabilities make it an invaluable asset for researchers and practitioners engaged in multilingual NLP tasks, ranging from machine translation to document classification. [4]

4.3. Bert-base-multilingual-cased

BERT-base-multilingual-cased is a BERT variant designed for multilingual natural language processing (NLP) applications. Unlike the original base BERT, which was trained exclusively on English text, this variant was trained on text from many languages. It follows the BERT-base configuration of 12 transformer layers, a hidden size of 768, and 12 attention heads, amounting to roughly 177M parameters once its large multilingual vocabulary is included. The "cased" element denotes that it preserves case information in its vocabulary, allowing it to differentiate between uppercase and lowercase letters. This is critical for languages where case sensitivity matters for interpreting context. BERT-base-multilingual-cased is very useful for multilingual applications since it can efficiently handle many languages, making it a versatile solution for tasks requiring NLP across different linguistic backgrounds.

4.4. intfloat/multilingual-e5-base

intfloat/multilingual-e5-base is a multilingual text encoder developed to address the demands of multilingual natural language processing. It offers a comprehensive solution for tasks involving diverse languages and linguistic characteristics. Trained on an extensive multilingual corpus, this model leverages a transformer-based architecture and deep neural networks to facilitate effective language understanding. Notably, it uses a cased vocabulary, enabling it to preserve case information, which is pivotal in languages where case sensitivity plays a significant role in semantic interpretation. This variant's adaptability and multilingual competence render it a valuable tool for cross-lingual applications such as multilingual document classification, sentiment analysis, and more.
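Because these encoders use different subword vocabularies, the same text can be segmented quite differently, which partly explains their differing behaviour on Sinhala and Gujarati input. The short sketch below shows one way to inspect this; the checkpoints are examples from this section and the sample strings are simply the language names written in their native scripts.

```python
from transformers import AutoTokenizer

# Example checkpoints from this section; any other encoder can be substituted.
checkpoints = [
    "xlm-roberta-base",
    "bert-base-multilingual-cased",
    "intfloat/multilingual-e5-base",
]

# Placeholder inputs: the words "Sinhala" and "Gujarati" in their native scripts.
samples = ["සිංහල", "ગુજરાતી"]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    for text in samples:
        pieces = tokenizer.tokenize(text)
        # Fewer, longer subword pieces usually indicate better vocabulary
        # coverage of the script in question.
        print(f"{name}: {text} -> {pieces}")
```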
4.5. OpenAI/Whisper

OpenAI/Whisper-Large is a large-scale automatic speech recognition (ASR) model designed to transcribe spoken language into text. Its capabilities are underpinned by a massive architecture, extensive pre-training on diverse audio data, and a robust transformer-based design. It excels in recognizing speech across multiple languages and dialects, making it a versatile choice for ASR tasks in various linguistic contexts. With its remarkable capacity for handling large volumes of spoken data and its ability to adapt to distinct accents and acoustic conditions, OpenAI/Whisper-Large is a valuable asset for applications such as transcription services, voice assistants, and more, where accurate speech-to-text conversion is paramount.

4.6. keshan/SinhalaBERTo

keshan/SinhalaBERTo is a specialized language model developed to address the unique challenges posed by the Sinhala language. Sinhala, being a low-resource language, has limited access to pre-trained language models, and keshan/SinhalaBERTo fills this gap as a slightly smaller but highly valuable model. It is trained on the deduplicated Sinhala portion of the OSCAR corpus, making it a relevant resource for Sinhala natural language processing tasks. Its specifications, including a vocabulary size of 52,000, max position embeddings of 514, 12 attention heads, 6 hidden layers, and a type vocabulary size of 1, create a robust foundation for Sinhala text processing. [4]

4.7. Gujarati BERT

Gujarati BERT is a variant of the BERT model built exclusively for the Gujarati language, which is widely spoken in the Indian state of Gujarat and other areas. Unlike general-purpose multilingual BERT models, which are trained on a wide variety of languages, Gujarati BERT is trained and fine-tuned on Gujarati text. This allows it to capture the specific linguistic qualities, script, and context of the Gujarati language more effectively. Owing to this specialization, Gujarati BERT is particularly useful for natural language processing tasks in Gujarati, such as text categorization, sentiment analysis, and named entity recognition. When compared to more general-purpose models, its domain expertise improves its performance and applicability in the context of the Gujarati language.

5. Result Analysis for Sinhala Dataset

5.1. Implementation

In this section, we present the results of our offensive tweet classification task, employing five diverse BERT-based models: M1 (XLM-RoBERTa), M2 (keshan/SinhalaBERTo), M3 (bert-base-multilingual-cased), M4 (bert-base-multilingual-uncased), and M5 (intfloat/multilingual-e5-base). These models have distinct linguistic characteristics and tokenization methods, which contribute to their differing performance.

Model M1, based on XLM-RoBERTa, exhibits robust performance in classifying offensive tweets. XLM-RoBERTa's multilingual competence allows it to handle a wide range of languages effectively, including Sinhala. Its tokenization strategy considers various linguistic nuances, and it demonstrates a strong ability to generalize across languages. This model's adaptability and pre-training on diverse multilingual data contribute to its high classification accuracy on the test dataset.

M2 is powered by keshan/SinhalaBERTo, which is tailored explicitly for the Sinhala language. Its tokenizer, optimized for Sinhala text, excels in capturing the language's unique characteristics. This model delivers reasonable results in classifying offensive tweets, illustrating the value of language-specific pre-training for Sinhala text, although, as Section 5.2 shows, it does not match the strongest multilingual models on this dataset. M2's fine-tuning on Sinhala data nonetheless contributes to its Sinhala text understanding and classification capabilities.
Model M3, bert-base-multilingual-cased, is designed as a versatile, multilingual BERT variant. Although not optimized exclusively for Sinhala, it manages to handle Sinhala text effectively due to its extensive multilingual vocabulary. M3's tokenization, which is akin to bert-base-cased, successfully splits Sinhala text into subword tokens, allowing it to perform well in cross-lingual offensive tweet classification.

M4, bert-base-multilingual-uncased, shares similarities with M3 but lacks case sensitivity. Despite this difference, it effectively tokenizes Sinhala text, thanks to its subword tokenization method and multilingual vocabulary. M4 shows commendable performance in the classification task, affirming its suitability for processing Sinhala and other languages without consideration for letter casing.

Model M5, intfloat/multilingual-e5-base, is geared towards multilingual natural language processing tasks. Its subword tokenization and extensive pre-training enable it to handle Sinhala text with competence. M5 exhibits competitive results in classifying offensive tweets, highlighting its adaptability and cross-lingual proficiency.

These tokenized inputs are then used to train and test the models. During the training phase, hyperparameters such as batch size, number of training epochs, and learning rate must be specified. To fine-tune the models, appropriate optimization algorithms, such as AdamW, are used in conjunction with learning rate schedulers. Following the training phase, the models are tested on a separate 1,500-row test dataset with the same column names as the training data (post id, text, label). During the testing phase, each model's capacity to generalize and generate correct predictions on new, unseen data is evaluated. [13]

5.2. Results and discussion

To categorize text data for hate speech detection in Sinhala, the models M1, M2, M3, M4, and M5 were used. To examine the performance of these models, the evaluation metrics Macro-F1, Macro-Precision, and Macro-Recall were computed. These metrics provide insight into each model's ability to reliably identify and predict instances of hate speech in Sinhala text data. After examining the findings for these assessment measures in Table 3, it is clear that M5 outperforms all other models.

Macro-F1 was used to assess these models because it combines precision and recall into a single score, offering a balanced estimate of the model's capacity to reliably categorize instances of hate speech. M5, with a Macro-F1 score of 0.8371, demonstrated its proficiency in correctly identifying hate speech within the Sinhala text data, outperforming the other models. A higher Macro-F1 score indicates superior performance in hate speech detection.

Table 3
Assessment of models on the Sinhala dataset

Model                                 Macro-F1   Macro-Precision   Macro-Recall
XLM-RoBERTa (M1)                      0.7210     0.6803            0.7604
keshan/SinhalaBERTo (M2)              0.6451     0.6532            0.6430
bert-base-multilingual-cased (M3)     0.8141     0.8125            0.8162
bert-base-multilingual-uncased (M4)   0.7728     0.7813            0.7776
intfloat/multilingual-e5-base (M5)    0.8371     0.8439            0.8326
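For reference, the macro-averaged metrics reported in this paper can be computed directly from gold labels and model predictions, for example with scikit-learn as in the sketch below; the label lists shown are placeholders, not outputs from our runs.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder gold labels and predictions for the binary HOF / NOT task.
y_true = ["HOF", "NOT", "NOT", "HOF", "NOT"]
y_pred = ["HOF", "NOT", "HOF", "HOF", "NOT"]

# average="macro" weights both classes equally, which is why it is a fairer
# summary than accuracy when the class distribution is imbalanced.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Macro-Precision: {precision:.4f}")
print(f"Macro-Recall:    {recall:.4f}")
print(f"Macro-F1:        {f1:.4f}")
```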
6. Result Analysis for Gujarati

6.1. Implementation

In this section, we present the results of our offensive tweet classification task for the Gujarati language, utilizing five distinct BERT-based models: M1 (XLM-RoBERTa), M2 (bert-base-multilingual-cased), M3 (bert-base-multilingual-uncased), M4 (OpenAI/Whisper-Large), and M5 (Gujarati BERT). These models vary in terms of their linguistic capabilities and tokenization methods, which influence their performance on the Gujarati dataset.

Model M1, based on XLM-RoBERTa, demonstrates strong performance in classifying offensive tweets in Gujarati. XLM-RoBERTa's multilingual capabilities allow it to handle a wide range of languages, including Gujarati, effectively. Its tokenization strategy considers linguistic nuances, and the model exhibits a robust ability to generalize across languages. M1's adaptability and pre-training on diverse multilingual data contribute to its high classification accuracy on the test dataset.

M2, which utilises bert-base-multilingual-cased, is a versatile, multilingual BERT variant. Although not optimized exclusively for Gujarati, it effectively handles Gujarati text due to its extensive multilingual vocabulary. M2's tokenization method successfully splits Gujarati text into subword tokens, enabling it to perform well in cross-lingual offensive tweet classification.

Model M3, bert-base-multilingual-uncased, shares similarities with M2 but lacks case sensitivity. Despite this difference, it effectively tokenizes Gujarati text, thanks to its subword tokenization method and multilingual vocabulary. M3 shows commendable performance in the classification task, affirming its suitability for processing Gujarati and other languages without consideration for letter casing.

M4 is powered by OpenAI/Whisper-Large, which is designed for large-scale automatic speech recognition (ASR). While not specifically tailored for text classification, its robust pre-trained architecture can be adapted to Gujarati input. This model achieves competitive results in the offensive tweet classification task, demonstrating its adaptability beyond ASR.

Model M5, Gujarati BERT, is a specialized variant designed explicitly for the Gujarati language. Its tokenizer is tailored to handle the unique characteristics of Gujarati text. M5 illustrates the value of language-specific pre-training, although, as Section 6.2 shows, it does not match the multilingual models on this dataset. Its fine-tuning on Gujarati data nonetheless contributes to its understanding of Gujarati text. [14]

These tokenized inputs are then used to train and test the models. During the training phase, hyperparameters such as batch size, number of training epochs, and learning rate must be specified. To fine-tune the models, appropriate optimization algorithms, such as AdamW, are used in conjunction with learning rate schedulers, as sketched below. Following the training phase, the models are tested on a separate 1,500-row test dataset with the same column names as the training data (post id, text, label). During the testing phase, each model's capacity to generalize and generate correct predictions on new, unseen data is evaluated.
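The following is a minimal sketch of such a fine-tuning loop with AdamW and a linear learning-rate schedule; the hyperparameter values, the `fine_tune` helper, and the expected dataset format are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

# `model` is a sequence-classification checkpoint (see Section 4) and
# `train_dataset` yields dicts of pre-padded input_ids, attention_mask and
# labels tensors; both are assumed to have been prepared as described above.
def fine_tune(model, train_dataset, epochs=3, batch_size=16, lr=2e-5):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )

    model.train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)   # loss is computed from the `labels` field
            outputs.loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    return model
```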
6.2. Results and discussion

To categorize text data for hate speech detection in Gujarati, the models M1, M2, M3, M4, and M5 were used. To examine the performance of these models, the evaluation metrics Macro-F1, Macro-Precision, and Macro-Recall were computed. These metrics provide insight into each model's ability to reliably identify and predict instances of hate speech in Gujarati text data. After examining the findings for these assessment measures in Table 4, it is clear that M2 outperforms all other models.

Macro-F1 was used to assess these models because it combines precision and recall into a single score, offering a balanced estimate of the model's capacity to reliably categorize instances of hate speech. M2, with a Macro-F1 score of 0.7956, demonstrated its proficiency in correctly identifying hate speech within the Gujarati text data, outperforming the other models. A higher Macro-F1 score indicates superior performance in hate speech detection.

Table 4
Assessment of models on the Gujarati dataset

Model                                 Macro-F1   Macro-Precision   Macro-Recall
XLM-RoBERTa (M1)                      0.7210     0.6803            0.7604
bert-base-multilingual-cased (M2)     0.7956     0.7897            0.8035
bert-base-multilingual-uncased (M3)   0.7739     0.7658            0.7955
OpenAI/Whisper-Large (M4)             0.7409     0.7453            0.7386
Gujarati BERT (M5)                    0.6415     0.6609            0.6861

7. Conclusion

In this study, we evaluated the efficacy of multiple BERT-based models for detecting hate speech and abusive language in Sinhala and Gujarati tweets. The intfloat/multilingual-e5-base model earned the highest Macro-F1 score of 0.8371 for detecting hateful content in Sinhala tweets. The bert-base-multilingual-cased model with preprocessing steps performed best on the Gujarati data, with a Macro-F1 score of 0.7956.

Overall, the findings suggest that multilingual models outperform low-resource, language-specific models in terms of F1 scores. This performance advantage can be attributed to their access to a larger and more diverse training corpus. Multilingual models are trained on text from a wide range of languages, which inherently provides a richer linguistic context and a broader spectrum of language patterns. This diversity allows them to capture cross-lingual insights and generalize better across various languages, including low-resource ones.

In contrast, low-resource language-specific models, with limited training data, struggle to grasp the full complexity of the language. Their effectiveness is hindered by data scarcity, limiting their ability to adapt to nuances and context. The higher F1 scores of the multilingual models emphasize the advantage of diverse training data. This highlights the significance of data availability, especially for low-resource languages, and underscores the potential for further work to enhance language-specific models in the future.

This study adds to the development of automated approaches for moderating social media in underserved languages such as Sinhala and Gujarati, while also encouraging inclusive online discourse. The models and datasets presented in this paper can also serve as valuable resources for future NLP research on these languages.

References

[1] B. Di Fátima, Hate speech on social media: A global approach (2023). doi:10.25768/654-916-9.
[2] S. Satapara, H. Madhu, T. Ranasinghe, A. E. Dmonte, M. Zampieri, P. Pandya, N. Shah, M. Sandip, P. Majumder, T. Mandl, Overview of the HASOC subtrack at FIRE 2023: Hate speech identification in Sinhala and Gujarati, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[3] B. R. Chakravarthi, B. Bharathi, C. O'Riordan, H. Murthy, T. Durairaj, T. Mandl, et al., Speech and Language Technologies for Low-Resource Languages: First International Conference, SPELLL 2022, Kalavakkam, India, November 23–25, 2022, Proceedings, Springer Nature, 2023. doi:10.1007/978-3-031-33231-9.
[4] V. Dhananjaya, P. Demotte, S. Ranathunga, S. Jayasena, BERTifying Sinhala - a comprehensive analysis of pre-trained language models for Sinhala text classification, arXiv preprint arXiv:2208.07864 (2022).
[5] A. Chiorrini, C. Diamantini, A. Mircoli, D. Potena, Emotion and sentiment analysis of tweets using BERT, in: EDBT/ICDT Workshops, volume 3, 2021.
[6] A. Tiwari, A. Agrawal, Comparative analysis of different machine learning methods for hate speech recognition in Twitter text data, in: 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), IEEE, 2022, pp. 1016–1020.
[7] K. Ding, J. Wang, J. Li, D. Li, H. Liu, Be more with less: Hypergraph attention networks for inductive text classification, arXiv preprint arXiv:2011.00387 (2020).
[8] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning-based text classification: A comprehensive review, ACM Computing Surveys (CSUR) 54 (2021) 1–40.
[9] T. Ranasinghe, I. Anuradha, D. Premasiri, K. Silva, H. Hettiarachchi, L. Uyangodage, M. Zampieri, SOLD: Sinhala offensive language dataset, arXiv preprint arXiv:2212.00851 (2022).
[10] Z. Wang, Deep learning based text classification methods, Highlights in Science, Engineering and Technology 34 (2023) 238–243.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] V. Korde, C. N. Mahender, Text classification and classifiers: A survey, International Journal of Artificial Intelligence & Applications 3 (2012) 85.
[13] W. Fernando, R. Weerasinghe, E. Bandara, Sinhala hate speech detection in social media using machine learning and deep learning, in: 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE, 2022, pp. 166–171.
[14] T. Chavan, S. Patankar, A. Kane, O. Gokhale, R. Joshi, A Twitter BERT approach for offensive language detection in Marathi, arXiv preprint arXiv:2212.10039 (2022).