Multilingual Hate Speech and Offensive Language Detection of Low Resource Languages

Abhinav Reddy Gutha1, Nidamanuri Sai Adarsh1, Ananya Alekar1 and Dinesh Reddy1
1 Indian Institute of Technology, Goa - 403401

Abstract
The last decade has seen a steep rise in society's use of and dependence on social media, and the need to detect and prevent hate and offensive speech is greater than ever. The ever-changing form of natural language, which often involves code-mixed text, makes hate speech detection challenging. The task becomes even more daunting in a country like India, where different languages and dialects are spoken across the country. This paper details the Code Fellas team's approaches in the context of HASOC 2023 - Task 4: Annihilate Hate, an initiative aimed at extending hate speech detection to the Bengali, Bodo, and Assamese languages. We describe our approaches, which broadly involve Long Short-Term Memory (LSTM) networks coupled with Convolutional Neural Networks (CNN) and pre-trained Bidirectional Encoder Representations from Transformers (BERT) based models such as IndicBERT [1] and MuRIL [2]. Our results showcase the effectiveness of these approaches, with IndicBERT achieving an F1 score of 69.726% for Assamese, MuRIL achieving 71.955% for Bengali, and a BiLSTM model enhanced with an additional Dense layer attaining 83.513% for Bodo.

Keywords
Hate Speech, Offensive Language, LSTM, Convolutional Neural Networks, Transformers

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
abhinav.reddy.21031@iitgoa.ac.in (A. R. Gutha); nidamanuri.adarsh.21031@iitgoa.ac.in (N. S. Adarsh); ananya.alekar.21031@iitgoa.ac.in (A. Alekar); dinesh.reddy.21063@iitgoa.ac.in (D. Reddy)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Social media allows users to express their opinions without disclosing their real identities. This leads to the misuse of social media platforms to spread hate among individuals and communities, often resulting in hate crimes. In the past few years, platforms like Twitter, Facebook, and Reddit have seen a rising trend in the dissemination of offensive content and the coordination of hate-related actions. This hate affects not only the targeted users but also the general public, increasing cases of depression, anxiety, and other mental health issues. Hence, effective hate speech detection is required. Until a few years ago, hate and offensive speech were identified manually, which has become impossible given the enormous amounts of data generated daily on social media platforms. Hate speech detection is challenging because filtering out certain words that express hate is not sufficient; the task also requires knowledge of the context and the background of the user. In addition, in a diverse country like India, where numerous languages are spoken, individuals often use their local languages when engaging on social media. This becomes a major hurdle for hate speech detection, since two texts with the same literal meaning can convey different things in their respective languages.
In this paper, we propose ways to address the above-mentioned issues, using approaches such as classical machine learning algorithms, LSTM and BiLSTM coupled with CNN, and pre-trained BERT-based models to identify hate speech in the Bengali, Bodo, and Assamese languages. The rest of the paper is organized as follows. Section 2 reviews related work, Section 3 describes the task and dataset, Section 4 details the preprocessing, and Section 5 describes our proposed models. Section 6 elaborates on the experimental setup used, and Section 7 reports the results obtained by our models in the evaluations and concludes the paper.

2. Related Work

The four major tasks in HASOC 2023 are Task-1 [3], Task-2 [4][5], Task-3 [6], and Task-4 [7][8]. Task-1 [3] focuses on identifying hate, offensive, and profane content in Indo-Aryan languages, with subtask 1A focusing on Sinhala and subtask 1B on Gujarati. Task-2 concerns the Identification of Conversational Hate-Speech in Code-Mixed Languages and involves binary classification of such conversational tweets organized as tree-structured data. Task-3 focuses on the Identification of Tokens Contributing to Explicit Hate in Text by Hate Span Detection. Task-4 of Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages aims to tackle the detection of hate speech within the Bengali, Bodo, and Assamese languages. The dataset used in this task is primarily derived from social media platforms, including Twitter, Facebook, and YouTube. It comprises lists of sentences, each annotated with a label indicating the presence or absence of offensive content. The task entails binary classification, with the core objective of predicting whether a given sentence contains offensive language.

Researchers have used various techniques for text classification in hate speech detection. Ghosh, Senapati et al. [9][10][11] used baseline BERT models for conversational hate speech detection in code-mixed tweets utilizing data augmentation and offensive language identification, compared monolingual and multilingual transformer models with cross-language evaluation, and applied transformer-based hate speech detection to Assamese. A primary challenge within this research domain is the scarcity of available data, particularly in languages like Assamese, Bodo, and Bengali. The limited data availability has hindered comprehensive investigations into hate speech classification within these languages. In light of this, our work endeavors to bridge this gap by developing a model capable of effectively addressing hate speech detection in low-resource language contexts. Our approach holds relevance not only for the specific languages discussed but also as a blueprint for addressing similar challenges in other low-resource languages.

GitHub: https://github.com/16AbhinavReddy/Multilingual-Hate-Speech-and-Offensive-Language-Detection-of-Low-Resource-Languages

3. Task and Dataset Information

In this paper, we have used the multilingual datasets provided by Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC). As part of Task 4, shared tasks were provided for three languages (Assamese, Bengali, and Bodo), with training and test datasets for each. Task 4 is a binary classification task that requires participants to classify the given tweets into two groups: Hate and Offensive (HOF) and Not Hate-Offensive (NOT).

1. HOF: The post includes hateful, offensive, or profane content.
2. NOT: The post contains neither hate speech nor offensive content.

Table 1 provides a detailed description of the training datasets used in this work.

Table 1
Details of Training Datasets
Language   HOF    NOT    Total
Assamese   2347   1689   4036
Bengali    766    515    1281
Bodo       998    681    1679

4. Preprocessing

Data preprocessing is a crucial step in Natural Language Processing (NLP) tasks to clean the data and make it suitable for analysis. However, it is recommended that minimal processing be applied, especially when dealing with BERT/LSTM-based approaches: excessive preprocessing can strip away some of the information these models use to make accurate predictions, since BERT and LSTM models can learn complex relationships between words and phrases. Our approach therefore follows only the steps [12] mentioned below to remove irrelevant text from the collected corpus and training datasets:

1. Removing all usernames using regular expressions.
2. Removing all URLs and numerics using regular expressions.
3. Reducing elongated words that express screaming to their normal form; for example, helloooooooooo is reduced to hello.
4. Removing all newline characters from the text.
5. Converting the text to tensors by tokenizing the tweets and padding the sequences accordingly, using tokenizers specifically designed for the BERT or LSTM models.

We experimented with various other preprocessing methods, such as replacing English emoticons and emojis with their textual meaning using the emoji library (https://pypi.org/project/emoji/). However, BERT and LSTM models are powerful enough to learn from Unicode characters such as emoticons and emojis, since BERT models are pre-trained on massive datasets of text and code and have learned to capture many important features of language. Applying excessive processing to the input text can therefore introduce noise and lead to the loss of hateful signals such as emojis, which, although they carry little meaning in ordinary embeddings, can express hate in many contexts. Removing such information can adversely affect the model's performance.

5. Methodology

As mentioned in the Preprocessing section, we first developed a pipeline to preprocess the data for model training.

5.1. Translation

The task involves detecting hate speech in several languages; hence, the most natural way to proceed is to convert the text from the different languages into a single language, preferably English [13]. This would be a good approach if the translation were able to capture the meaning of the text in its most appropriate context. We used the Googletrans library (https://pypi.org/project/googletrans/) to translate the given text into English, applied the preprocessing pipeline, and then proceeded with classification. However, this methodology has certain limitations. Primarily, the translation loses the context, the original meaning, and the intent of some offensive text, due to which the model performs poorly in classification. Moreover, translation libraries are generally not available for low-resource languages like Bodo, so we can conclude that this is not the optimal method. While experimenting on offensive tweets in Assamese and Bengali, we found that the offensive connotation of some tweets is lost because the translator cannot capture that information. Our observation is that translation carries these errors forward into classification. Hence, we needed a method that completes the task without translation, handling each language separately.
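For reference, the cleaning steps of Section 4 and the translation step we experimented with (and ultimately set aside) can be sketched as follows. This is a minimal illustration only: the helper names are our own, and it assumes the googletrans package (4.0.0rc1-style API) rather than reproducing the exact code behind our submissions.

```python
import re

from googletrans import Translator  # assumed: googletrans==4.0.0rc1


def clean_tweet(text: str) -> str:
    """Minimal preprocessing along the lines of Section 4."""
    text = re.sub(r"@\w+", " ", text)              # 1. remove usernames
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # 2. remove URLs ...
    text = re.sub(r"\d+", " ", text)               #    ... and numerics
    text = re.sub(r"(\w)\1{2,}", r"\1", text)      # 3. helloooooooooo -> hello
    text = text.replace("\n", " ")                 # 4. drop newline characters
    return re.sub(r"\s+", " ", text).strip()


def translate_to_english(texts):
    """Translation-based variant of Section 5.1 (requires network access)."""
    translator = Translator()
    return [translator.translate(t, src="auto", dest="en").text for t in texts]


tweets = ["@user এটা একটা উদাহরণ!!! https://example.com 123"]
cleaned = [clean_tweet(t) for t in tweets]
# english = translate_to_english(cleaned)  # the step we ultimately dropped
```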
5.2. Machine Learning Techniques

As the data for the given languages is sparse, Term Frequency-Inverse Document Frequency (TF-IDF) and CountVectorizer are well suited to generating vector representations of the given text data.

5.2.1. Embedding Technique

While generating vector representations with TF-IDF and CountVectorizer, we applied only basic preprocessing and removed URLs and usernames. As emojis can indicate aggression in context, we retained them to capture the intended sentiment. After preprocessing, we applied the TF-IDF and CountVectorizer techniques with N-gram ranges of (1,2) and (1,3), thus capturing semantic relationships between two or three consecutive words to some extent.

5.2.2. Model Application

After generating the vector embeddings for the text, we applied the following models for classifying hateful and offensive speech:

• Support Vector Machine
• Logistic Regression
• XGB Classifier
• Decision Tree Classifier

Apart from this, we also tried a Latent Semantic Analysis based approach using TruncatedSVD (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), where we varied the number of components between 95 and 500 and treated it as a hyperparameter in the RandomizedSearchCV method explained below. Latent Semantic Analysis is a technique used in NLP and information retrieval to understand the underlying meaning of words and documents. It relies on statistical analysis to identify patterns and relationships between words based on their co-occurrence in a large text corpus, and can thereby capture semantic similarity between words and documents even when they do not share exact terms.

During model application, hyperparameter tuning is employed to reduce computation and find the best parameters for classification. In this process, RandomizedSearchCV is used rather than GridSearchCV to obtain good results with comparatively less computation. Table 2, Table 3, and Table 4 tabulate the performance of the top models in the three languages.

Table 2
Results Obtained from Machine Learning Approaches for Assamese
Model                                                   F1-Score
Logistic Regression + CountVectorizer + ngrams=(1,2)    0.6308
Logistic Regression + CountVectorizer + ngrams=(1,3)    0.6267
XGB Classifier + CountVectorizer + ngrams=(1,3)         0.5848
Decision Tree + CountVectorizer + ngrams=(1,3)          0.5610
Decision Tree + CountVectorizer + ngrams=(1,2)          0.5941
XGB Classifier + TfidfVectorizer + ngrams=(1,2)         0.5933
XGB Classifier + TfidfVectorizer + ngrams=(1,3)         0.5908
Decision Tree + TfidfVectorizer + ngrams=(1,2)          0.5857
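As an illustration of the pipeline in Section 5.2, the sketch below combines a vectorizer, TruncatedSVD, and RandomizedSearchCV over a Logistic Regression classifier. It is a minimal example assuming scikit-learn; the parameter ranges shown are indicative rather than the exact grids we searched.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Placeholder data: cleaned tweets and binary labels (1 = HOF, 0 = NOT).
texts = ["...", "..."]
labels = [1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                # could equally be CountVectorizer
    ("lsa", TruncatedSVD()),                     # latent semantic analysis
    ("clf", LogisticRegression(max_iter=1000)),
])

param_distributions = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "lsa__n_components": list(range(95, 501, 45)),  # 95-500 components
    "clf__C": [0.1, 1.0, 10.0],                     # illustrative regularization grid
}

# RandomizedSearchCV samples a fixed number of configurations, which is cheaper
# than exhaustively evaluating every combination with GridSearchCV.
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20,
                            scoring="f1_macro", cv=5, random_state=42)
# search.fit(texts, labels)
# print(search.best_params_, search.best_score_)
```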
5.3. Deep Learning Architectures

After employing machine learning models, our research turns its attention toward deep learning models, specifically LSTM and Bidirectional Long Short-Term Memory (BiLSTM) networks.

Table 3
Results Obtained from Machine Learning Approaches for Bengali
Model                                                   F1-Score
Logistic Regression + CountVectorizer + ngrams=(1,2)    0.6243
Logistic Regression + TfidfVectorizer + ngrams=(1,2)    0.6043
Decision Tree + TfidfVectorizer + ngrams=(1,2)          0.5801
Logistic Regression + CountVectorizer + ngrams=(1,3)    0.5572
XGB Classifier + TfidfVectorizer + ngrams=(1,3)         0.5512
XGB Classifier + CountVectorizer + ngrams=(1,2)         0.5659
Logistic Regression + TfidfVectorizer + ngrams=(1,3)    0.5480
SVM + TfidfVectorizer + ngrams=(1,2)                    0.5147
SVM + CountVectorizer + ngrams=(1,2)                    0.5004

Table 4
Results Obtained from Machine Learning Approaches for Bodo
Model                                                   F1-Score
SVM + CountVectorizer + ngrams=(1,2)                    0.7018
XGB Classifier + CountVectorizer + ngrams=(1,2)         0.6902
XGB Classifier + CountVectorizer + ngrams=(1,3)         0.6901
Logistic Regression + CountVectorizer + ngrams=(1,2)    0.6901
Logistic Regression + CountVectorizer + ngrams=(1,3)    0.6469
SVM + CountVectorizer + ngrams=(1,3)                    0.63
SVM + TfidfVectorizer + ngrams=(1,2)                    0.61
Decision Tree + TfidfVectorizer + ngrams=(1,2)          0.61

5.3.1. LSTM

LSTM is a specialized variant of Recurrent Neural Networks (RNNs) designed to capture long temporal dependencies within sequential data. It uses memory cells and gating mechanisms to selectively retain and retrieve information over long sequences. The rationale behind our exploration of LSTM stems from these advantages: sequential data processing, memory cells that facilitate the storage and retrieval of information across extended sequences, and gating mechanisms. However, we encountered a notable challenge: LSTMs tend to demand a substantial volume of training data to achieve optimal performance, and the given dataset does not possess the requisite scale. To address this limitation, we introduced a 1D CNN layer (CNN-1D) to enhance performance, since combining LSTM with CNN-1D amalgamates the strengths of both architectures. Table 5 shows the performance of LSTM in all three languages.

5.3.2. LSTM with CNN-1D

LSTM with CNN-1D [14] is particularly advantageous when the input data exhibits local spatial patterns, i.e., features within a sequence that depend not on the order of elements but on their relative positions or proximity. It also excels at capturing intricate temporal relationships, modeling how elements within a sequence relate to each other over time and thus facilitating the modeling of long-range dependencies and temporal patterns. However, contrary to our initial expectations, this configuration underperformed compared to the plain LSTM. Overfitting turned out to be a prominent concern, primarily due to the increased complexity of the model, which ultimately led to diminished accuracy. Table 5 shows the performance of LSTM + CNN-1D in all three languages.
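A minimal Keras sketch of the (Bi)LSTM-with-CNN-1D family of models is shown below. The layer sizes follow the settings reported in Section 6 (128 recurrent units, a 256-unit dense layer, dropout and recurrent dropout of 0.2, and a sigmoid output); the vocabulary size, embedding dimension, filter count, and kernel size are illustrative assumptions, not the exact configuration of our submitted runs.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumed tokenizer vocabulary size


def build_model(bidirectional: bool = False, use_cnn: bool = True) -> tf.keras.Model:
    """(Bi)LSTM classifier, optionally preceded by a 1D convolution."""
    rnn = layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)
    if bidirectional:
        rnn = layers.Bidirectional(rnn)

    model = models.Sequential()
    model.add(layers.Embedding(VOCAB_SIZE, 128))
    if use_cnn:
        # The convolution extracts local, n-gram-like features before the recurrent layer.
        model.add(layers.Conv1D(filters=64, kernel_size=3, activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(rnn)                                    # temporal dependencies
    model.add(layers.Dense(256, activation="relu"))   # additional dense layer
    model.add(layers.Dense(1, activation="sigmoid"))  # binary HOF / NOT output

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model(bidirectional=True, use_cnn=False)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)])
```

With bidirectional=True the same sketch covers the BiLSTM variants discussed next, and dropping the convolutional block while keeping the extra Dense layer roughly corresponds to the configuration that performed best on Bodo.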
5.3.3. BiLSTM

Subsequently, we chose the BiLSTM approach, taking into account the issue of overfitting. BiLSTMs employ bidirectional processing, which differentiates them from traditional LSTMs that process sequences unidirectionally from left to right. BiLSTMs maintain two hidden states simultaneously, one processing the sequence from beginning to end (forward direction) and another processing it from end to beginning (backward direction). Table 5 shows the performance of BiLSTM in all three languages.

5.3.4. BiLSTM with CNN-1D

A typical architecture combining BiLSTM and CNN-1D [15] starts with CNN-1D layers at the beginning of the model to extract local features. Subsequently, one or more BiLSTM layers capture temporal dependencies. The outputs of these layers are then fed into additional fully connected layers, culminating in the final classification or regression task. Table 5 shows the performance of BiLSTM + CNN-1D in all three languages.

Table 5
F1-Score of the different models on the datasets
Language   LSTM     LSTM + 1D CNN   BiLSTM   BiLSTM + 1D CNN
Bodo       0.8041   0.8178          0.8351   0.8141
Assamese   0.6770   0.6528          0.6554   0.6630
Bengali    0.6279   0.6279          0.5850   0.6447

5.4. Transformer-based Approach

After applying preprocessing techniques such as removing URLs, hashtags, and usernames, we employed various Transformer-based approaches [16], using pre-trained models and their tokenizers.

BERT-Base-Multilingual-Uncased: This is a model [17] pre-trained with a masked language modeling objective on the 102 languages with the largest Wikipedias. Uncased means all words are lowercased, so the model does not distinguish between 'english' and 'English'.

Assamese-BERT: A BERT [18] model pre-trained on publicly available Assamese monolingual datasets, used for the Assamese language task.

Bengali-BERT: A BERT [18] model pre-trained on publicly available Bengali monolingual datasets, used for the Bengali language task.

Bengali-Abusive-MuRIL: A MuRIL [2] model trained on a Bengali abusive speech dataset with a learning rate of 2e-5. We fine-tuned it on our Bengali dataset to obtain better results.

XLM-RoBERTa: XLM-RoBERTa [19] is a model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. We fine-tuned it on the given dataset and applied hyperparameter tuning, settling on a batch size of 8, since very low and very high batch sizes adversely affect training. Further, we incorporated early stopping and exponential learning rate decay with an initial learning rate of 1e-5 to fine-tune the model for our task.

IndicBERT: IndicBERT [1] is a language model designed specifically for languages spoken in the Indian subcontinent, such as Hindi, Bengali, and Tamil. It was pre-trained on a massive corpus of 9 billion tokens and evaluated across various tasks. Notably, despite having significantly fewer parameters than models like m-BERT and XLM-RoBERTa, IndicBERT achieves top-tier performance on multiple tasks.

DistilBERT: DistilBERT [20] is a smaller, faster version of the BERT language model. It is distilled from BERT's knowledge rather than trained from scratch, so it performs nearly as well while requiring far less computing power, making it well suited to tasks such as text classification and named entity recognition.

Table 6 shows the performance of the Transformer models on Assamese, Bengali, and Bodo.
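To make the fine-tuning procedure concrete, the sketch below fine-tunes one of the pre-trained encoders with a plain PyTorch loop, using the initial learning rate of 1e-5, exponential decay with gamma 0.9, and early-stopping patience of 2 described in Section 6. It is an outline under those assumptions: the model name is one possible choice, and train_batches, val_batches, and evaluate_f1 are placeholders for data handling and evaluation code not shown here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "ai4bharat/indic-bert"  # e.g. IndicBERT; MuRIL or m-BERT are used the same way
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # exponential decay

best_f1, patience, bad_epochs = 0.0, 2, 0
for epoch in range(20):
    model.train()
    for texts, labels in train_batches:  # placeholder: batches of (list[str], list[int])
        enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                        return_tensors="pt").to(device)
        loss = model(**enc, labels=torch.tensor(labels, device=device)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()  # decay the learning rate once per epoch

    f1 = evaluate_f1(model, tokenizer, val_batches)  # placeholder validation metric
    if f1 > best_f1:
        best_f1, bad_epochs = f1, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping with patience 2
            break
```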
6. Experimental Setup

Changing parameters in deep learning is a crucial part of the model development process. Before training the models, we split the training data into two parts using stratified sampling, with 20% held out as a test set and 80% used for training. We ran our models on a CUDA GPU whenever one was available and on the CPU otherwise.

Hyperparameters are settings that are not learned during training but are instead configured before training begins. They have a significant impact on the model's performance and training process.

Table 6
Top Performances of Transformer Models
Language   Model Architecture        F1-Score
Assamese   Indic BERT                0.6972
Assamese   Assamese BERT             0.68947
Assamese   DistilBERT base uncased   0.5872
Assamese   XLM-RoBERTa large         0.3999
Bengali    Bengali MuRIL             0.7195
Bengali    m-BERT base uncased       0.6447
Bodo       m-BERT base uncased       0.8111
Bodo       XLM-RoBERTa               0.8038
Bodo       Indic BERT                0.6755

Here are some of the parameters we tuned to enhance the models' performance:

1. Learning Rate: The learning rate determines the size of the updates made to the model's parameters during training and influences how quickly or slowly a neural network learns. In this research, we used an exponential learning rate schedule (https://keras.io/api/optimizers/learning_rate_schedules/exponential_decay/) with an initial value of 1e-5 and a gamma of 0.9, where gamma is the hyperparameter controlling the rate at which the learning rate decreases over time.

2. Batch Size: The batch size is the number of training examples used in each training iteration. A large batch size can lead to faster training but may require more memory, so we adjusted the batch size based on memory usage. We used a batch size of 96 for the Assamese and Bengali languages and 64 for the Bodo corpus. It is recommended to choose a multiple of 32.

3. Number of Epochs: This is the number of times the model processes the entire training dataset. Too few epochs may result in underfitting, while too many may cause the model to overfit the given data. To handle this issue, we implemented early stopping (https://keras.io/api/callbacks/early_stopping/) while training the models. Early stopping helps find a balance between model complexity and generalization. We chose a patience value of 2 for BERT-based approaches and 3 for neural network approaches; this value determines how many epochs the model continues to train without an improvement in the chosen validation metric before early stopping is triggered. During this process, we found that early stopping was triggered at an epoch count of 6 for the LSTM model, whereas for the BERT models it triggered at different epochs for different models. Table 7 lists the models and their corresponding epoch numbers.

4. Network Depth: This is the number of layers in a neural network. Deeper networks can capture more complex patterns but may be prone to overfitting if not regularized properly. While working with neural network approaches like LSTM, BiLSTM, and BiLSTM with CNN-1D, we used a network depth of 4.

Table 7
Early Stopping Details
Language   Model               Early Stopping Triggered   Epoch with Best Parameters
Assamese   BERT-base uncased   12                         10
Assamese   XLM-RoBERTa         8                          6
Assamese   Indic BERT          6                          4
Assamese   DistilBERT          11                         9
Assamese   Assamese BERT       9                          7
Bengali    BERT-base uncased   10                         8
Bengali    Bengali MuRIL       8                          6
Bengali    Indic BERT          9                          7
Bengali    Bengali BERT        8                          6
Bodo       BERT-base uncased   6                          4
Bodo       XLM-RoBERTa         8                          6
Bodo       Indic BERT          10                         7

5. Number of Neurons: While working with the neural networks, we set the number of neurons in the first input layer to the input length.
In the layers where we applied RNN-based networks, we set the number of neurons to 128, and in the third layer, a dense layer, we set it to 256. The final layer has a single neuron, since it produces the model's output. It is recommended to choose a multiple of 32.

6. Dropout Rate: This is the probability of dropping a neuron from the neural network during training, which helps prevent overfitting. While working with recurrent neural networks, we used two dropout rates: the regular dropout, applied to the inputs and/or outputs, and the recurrent dropout, which removes connections between the recurrent units. We set both of these values to 0.2.

7. Optimizers: We used the Adam and AdamW optimizers while training the models. Adam combines the advantages of two other popular optimization approaches, stochastic gradient descent (SGD) with momentum and RMSprop, whereas AdamW incorporates weight decay directly into the optimization process, which helps prevent overfitting.

8. Activation Functions: While working with the neural networks, we used the ReLU activation function for every layer except the last, where we used the sigmoid activation function. The purpose is to deal with the vanishing and exploding gradient problems. The vanishing gradient problem occurs when the gradients of the loss function with respect to the model's parameters diminish significantly as they are propagated backwards through the layers of a deep neural network during training. The exploding gradient problem is the opposite: gradients grow significantly during backpropagation, often to the point where they overflow or cause numerical instability in the training process.

7. Results and Conclusion

The methodologies discussed in Section 5 and the comparison models are used to evaluate performance on the three languages, namely Assamese, Bengali, and Bodo. Based on our research, we observed the following:

1. Notably, for Assamese and Bengali, BERT-based models exhibit the highest F1 scores, suggesting their efficacy in these contexts. Specifically, IndicBERT [1] emerges as the top-performing model for the Assamese corpus, while Bengali MuRIL [2] demonstrates superior performance for the Bengali corpus. These results underscore the effectiveness of leveraging pre-trained BERT models that cater to languages commonly spoken in the northeastern region of India, where Assamese and Bengali are prevalent.

2. In contrast, when evaluating the Bodo language, which is characterized by limited linguistic resources, we observed that a neural network-based approach outperforms other methodologies. Among the neural network architectures tested, the BiLSTM with an additional Dense layer yields the highest F1 score for Bodo. This result highlights the adaptability of neural network models for low-resource languages like Bodo, where dedicated pre-trained models may be scarce. Table 8 displays more details.

Table 8
Leaderboard best run in All the Languages
Language   Top-Performing Model                                   F1-Score
Assamese   IndicBERT (BERT-based)                                 0.69726
Bengali    Bengali MuRIL (BERT-based)                             0.71955
Bodo       BiLSTM with extra Dense Layer (Neural Network-based)   0.83513

3. A noteworthy observation from our research is the advantage of leveraging specialized BERT models pre-trained on languages such as Assamese and Bengali.
These models are tailored to the linguistic nuances and characteristics of these languages, which is particularly relevant in the context of the northeastern region of India. Our findings demonstrate that utilizing these specialized models led to superior performance for Assamese and Bengali, showcasing the significance of language-specific pre-training in NLP tasks.

4. We encountered unique challenges when working with Bodo, a low-resource language in India. Unlike Assamese and Bengali, which benefit from pre-trained BERT models, Bodo lacks dedicated pre-trained models. Consequently, our research favors neural network-based methodologies for Bodo, as they outperform BERT models in this particular context. While it is possible to adapt existing BERT models to the Devanagari script used for Bodo, our results indicate that these adaptations may not match the performance achieved through neural network-based approaches.

5. Apart from the given models, we also experimented with large transformer models [21] within the BERT family. However, due to the relatively small size of our dataset, these models tend to overfit during training.

In summary, our research highlights the importance of tailoring NLP methodologies to the linguistic characteristics and available resources of specific languages. While BERT-based models excel in well-resourced languages, low-resource languages like Bodo may benefit more from neural network-based approaches. Additionally, utilizing specialized pre-trained models for languages like Assamese and Bengali can significantly enhance performance, but it is crucial to consider the limitations posed by dataset size, especially when working with large transformer models. These findings provide valuable insights into optimizing NLP approaches for diverse linguistic contexts. We secured the 11th, 7th, and 11th positions for the Assamese, Bengali, and Bodo languages, respectively.

8. Acknowledgments

The authors would like to convey their sincere thanks to Clint Pazhayidam George, Vigneshwaran Shankaran, and Rajesh Sharma for helping us with our research. Throughout our research journey, we have been deeply inspired by their unwavering commitment to our project and our individual development. Their knowledge and teaching methods have made a big difference in our research journey. We would also like to extend our heartfelt gratitude to Koyel Ghosh and her dedicated team for their exceptional organization of HASOC 2023. Throughout our research, they exhibited remarkable support and approachability, which greatly contributed to the success of our work.

References

[1] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages, in: Findings of EMNLP, 2020.

[2] M. Das, S. Banerjee, A. Mukherjee, Data bootstrapping approaches to improve low resource abusive language detection for Indic languages, arXiv preprint arXiv:2204.12543 (2022).

[3] S. Satapara, H. Madhu, T. Ranasinghe, A. E. Dmonte, M. Zampieri, P. Pandya, N. Shah, M. Sandip, P. Majumder, T. Mandl, Overview of the HASOC subtrack at FIRE 2023: Hate-speech identification in Sinhala and Gujarati, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.

[4] H. Madhu, S. Satapara, P. Pandya, N. Shah, T. Mandl, S.
Modha, Overview of the HASOC subtrack at FIRE 2023: Identification of conversational hate-speech, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.

[5] S. Satapara, S. Masud, H. Madhu, M. A. Khan, M. S. Akhtar, T. Chakraborty, S. Modha, T. Mandl, Overview of the HASOC subtracks at FIRE 2023: Detection of hate spans and conversational hate-speech, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM, 2023.

[6] S. Masud, M. A. Khan, M. S. Akhtar, T. Chakraborty, Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.

[7] K. Ghosh, A. Senapati, A. S. Pal, Annihilate Hates (Task 4, HASOC 2023): Hate Speech Detection in Assamese, Bengali, and Bodo languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.

[8] T. Ranasinghe, K. Ghosh, A. S. Pal, A. Senapati, A. E. Dmonte, M. Zampieri, S. Modha, S. Satapara, Overview of the HASOC subtracks at FIRE 2023: Hate speech and offensive content identification in Assamese, Bengali, Bodo, Gujarati and Sinhala, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM, 2023.

[9] K. Ghosh, A. Senapati, U. Garain, Baseline BERT models for conversational hate speech detection in code-mixed tweets utilizing data augmentation and offensive language identification in Marathi, in: FIRE, 2022. URL: https://api.semanticscholar.org/CorpusID:259123570.

[10] K. Ghosh, A. Senapati, Hate speech detection: a comparison of mono and multilingual transformer models with cross-language evaluation, in: Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, De La Salle University, Manila, Philippines, 2022, pp. 853-865. URL: https://aclanthology.org/2022.paclic-1.94.

[11] K. Ghosh, D. Sonowal, A. Basumatary, B. Gogoi, A. Senapati, Transformer-based hate speech detection in Assamese, in: 2023 IEEE Guwahati Subsection Conference (GCON), 2023, pp. 1-5. doi:10.1109/GCON58516.2023.10183497.

[12] S. Mundra, N. Mittal, CMHE-AN: Code mixed hybrid embedding based attention network for aggression identification in Hindi English code-mixed text, Multimedia Tools and Applications 82 (2023) 11337-11364. doi:10.1007/s11042-022-13668-4.

[13] I. Bhat, V. Mujadia, A. Tammewar, R. Bhat, M. Shrivastava, IIIT-H system submission for FIRE 2014 shared task on transliterated search, in: Proceedings of the [conference name], 2015. doi:10.1145/2824864.2824872.

[14] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, 2016, pp. 2482-2491. URL: https://aclanthology.org/C16-1234.

[15] P. Kapil, A. Ekbal, D. Das, Investigating deep learning approaches for hate speech detection in social media, 2020.

[16] M. Das, S. Banerjee, P. Saha, A.
Mukherjee, Hate speech and offensive language detection in Bengali, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online only, 2022, pp. 286-296. URL: https://aclanthology.org/2022.aacl-main.23.

[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[18] R. Joshi, L3Cube-HindBERT and DevBERT: pre-trained BERT transformer models for Devanagari based Hindi and Marathi languages, arXiv preprint arXiv:2211.11418 (2022).

[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.

[20] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).

[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.