Crossing Borders: Multilingual Hate Speech Detection

Supriya Chanda¹, Abhishek Dhaka² and Sukomal Pal¹

¹ Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India, 221005
² Department of Computer Science and Engineering, B.K. Birla Institute of Engineering & Technology, Pilani, India, 333031

Abstract
With the relentless growth of technology usage, particularly among younger generations, the alarming prevalence of hate speech on the internet has become an urgent global concern. This paper addresses this critical need by presenting an extensive investigation encompassing three distinct hate speech detection tasks across a diverse linguistic landscape. The first task involves hate and offensive speech classification in Gujarati and Sinhala, assessing sentence-level hatefulness. The second extends to fine-grained BIO tagging, enabling precise identification of hateful spans within sentences. The third expands the scope to hate speech classification in Bengali, Bodo, and Assamese using social media data, categorizing content as hateful or not. Employing state-of-the-art deep learning techniques tailored to each language's characteristics, this research contributes to the development of robust and culturally sensitive hate speech detection systems, imperative for nurturing safer online spaces and fostering cross-cultural understanding.

Warning: This paper may contain offensive material; reader discretion is advised.

Keywords
Hate Speech, Social Media, Gujarati, Sinhala, Assamese, Bengali, Bodo, Multilingual BERT

Forum for Information Retrieval Evaluation, December 15-18, 2023, Goa, India
supriyachanda.rs.cse18@itbhu.ac.in (S. Chanda); abhishek.dhaka7340@gmail.com (A. Dhaka); spal.cse@itbhu.ac.in (S. Pal)
ORCID: 0000-0002-6344-8772 (S. Chanda); 0000-0002-9421-8566 (S. Pal)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

In light of the burgeoning utilization of technology, accompanied by a substantial rise in users, particularly among the younger demographic, the presence of hate speech on the internet has emerged as a pressing concern. While the internet was initially envisioned as a platform for individuals to express their thoughts freely, it is equally imperative that this unbridled expression does not encroach upon the dignity and beliefs of others. Safeguarding these principles is pivotal to sustaining the free expression of individuals' thoughts.

Hate speech refers to "any form of communication, whether spoken or expressed non-verbally, that displays hostility towards specific social groups. These groups are typically targeted based on factors like race and ethnicity (encompassing racism, xenophobia, anti-Semitism, etc.), gender (including sexism and misogyny), sexual orientation (involving homophobia and transphobia), age (ageism), and disability (ableism)".

In the landscape of digital connectivity, India exhibited remarkable statistics in early 2023. Cellular mobile connections numbered 1.10 billion, equivalent to 77.0% of the total population. Simultaneously, the internet had made significant inroads, with 692.0 million users in India, representing 48.7% of the populace.
In the realm of social media, India stood out with 467.0 million users in January 2023, accounting for 32.8% of the total population. Additionally, data from top social media platforms' ad planning tools revealed that 398.0 million users aged 18 and above were actively engaged in social media usage, forming a substantial 40.2% of the adult population at the beginning of 2023. These statistics collectively underscore the pervasive presence and impact of digital technology and social media within India's diverse and expansive demographic landscape.

India, renowned for its linguistic diversity, is home to a population of 1.4 billion, comprising individuals who speak a myriad of languages and hold diverse beliefs. Among these, there are 121 officially recognized languages, each with over 10,000 speakers. Hindi boasts the largest number of speakers, at 436 million, followed by Bengali with 83 million, Assamese with 12.6 million, Gujarati with 62 million, and Bodo with 1.4 million (according to the 2011 census). Sinhala is spoken by the Sinhalese people of Sri Lanka, who make up the largest ethnic group on the island, numbering about 16 million. While considerable research has been conducted on hate speech identification in Hindi, it is equally vital to address this issue in other under-resourced languages such as Bodo, Assamese, Gujarati, Sinhala and Bengali, because many individuals prefer to communicate in their native languages, as it fosters a sense of connection and cultural grounding. Hence, it is crucial to identify hate speech in these languages to uphold cultural respect.

In this study, we approach all three tasks of hate speech detection as text classification problems and investigate various deep learning methodologies for their resolution. The datasets for all three tasks, covering Gujarati, Sinhala, English, Assamese, Bengali, and Bodo, were obtained from the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) track at FIRE 2023 (Forum for Information Retrieval Evaluation). The task descriptions are given below.

1.1. HASOC Tasks

The goal of HASOC 2023 was to establish a testbed for the automated detection of hate speech and objectionable material in social media posts. HASOC 2023 included four tasks, and our team participated in Tasks 1, 3 and 4. The tasks addressed in this study are as follows:

Task 1: Offensive Language Identification in Gujarati and Sinhala.
• Offensive (OFF): contains offensive language.
• Non Hate-Offensive (NOT): no offense or profanity is present.

Task 3: Hateful Span Detection in Texts. Task 3 focuses on identifying hateful parts within a sentence that is already considered hateful. A hateful span is a contiguous sequence of words that together express explicit hate; the input texts are in English. In this sequence labeling problem, each token is manually tagged with the start and end of a hateful span using BIO notation: 'B' marks the start of a hateful span, 'I' continues it, and 'O' indicates a non-hateful token. For example, "You all niggers are cancers" → "O O B I I". Note that 'I' cannot stand alone and must follow either 'I' or 'B', while a 'B' can be followed immediately by an 'O' for a single-word span. (A small sketch of this encoding follows the task list below.)

Figure 1: Example of Task 3.

Task 4: Offensive Language Identification in Assamese, Bengali and Bodo.
• Hate and Offensive (HOF): contains offensive language.
• Non Hate-Offensive (NOT): no offense or profanity is present.
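To make the BIO scheme concrete, the following minimal Python sketch (our own illustration, not code provided by the shared task) converts a tokenized sentence and a hateful span, given as start/end word indices, into BIO tags:

```python
def span_to_bio(tokens, spans):
    """Convert (start, end) word-index spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)           # default: outside any hateful span
    for start, end in spans:
        tags[start] = "B"                # first token of a hateful span
        for i in range(start + 1, end):  # remaining tokens continue the span
            tags[i] = "I"
    return tags

tokens = "You all niggers are cancers".split()
print(span_to_bio(tokens, [(2, 5)]))     # ['O', 'O', 'B', 'I', 'I']
```

Whether the dataset's span indices are end-inclusive or end-exclusive is an assumption here; the published 'span' and 'bio' columns should be cross-checked before reuse.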
Table 1 provides examples of posts and their associated labels.

Table 1
Example posts from the HASOC 2023 dataset (Task 4, Bodo)

  Language   Label   Sample post
  Bodo       HOF     बुरबकलै साला नों बायदि मानसिनि जे बो अबदान गै यालै जाथिआव
  Bodo       NOT     Btr आव bjp आ साव लाबोबाय ।बे सावाव बर leader फ्रा लोगो लाफाबाय।

The remaining sections of the paper are structured as follows. Section 2 gives a brief outline of some previous attempts. The dataset is described in Section 3. Computational methods, model descriptions, and the evaluation methodology are discussed in Section 4. Results and discussion are presented in Section 5, and the conclusion is provided in Section 6.

2. Related Work

The identification of hate speech and offensive content has garnered significant attention in both academic and business contexts. While a substantial body of research has concentrated primarily on English due to its global prevalence, there is a pressing need for relevant corpora in other languages to address this issue comprehensively. Several studies have delved into varied aspects of offensive content, such as abusive language [1, 2], cyber-aggression [3], cyber-bullying [4, 5], and toxic comments or hate speech [6, 7, 8]. A brief overview of some notable works in these areas follows.

• Hate speech identification: Hate speech, a pervasive challenge, has been systematically categorized into various types based on the nature of its textual content, and diverse datasets have been curated for these distinct categories. Notably, a common dataset [9] has served as a foundation for identifying hate speech and profanity, with recent work by Davidson et al. [7] making use of a dataset comprising nearly 24,000 labeled tweets.

• Offensive content and cyberbullying: The broader domain of offensive content encompasses abusive language [3], cyber-aggression, cyber-bullying, and toxic comments. Previous investigations have employed techniques such as sentiment analysis, topic modeling [4], and user-related features [5] to tackle this multifaceted problem. Efforts have extended beyond English, with work in languages including German [10, 11], Spanish, Arabic [2, 12], Greek [13], Slovene [14] and Chinese [15]. Mubarak et al. [2] introduced a collection of profane terms, known as SeedWords (SW), and applied the Log Odds Ratio (LOR) to individual word unigrams and bigrams. Saroj et al. [16] adopted a Support Vector Machine (SVM) approach with TF-IDF features, targeting hate speech and offensive language in Arabic and Greek.

In recent years, initiatives like HASOC [10] and GermEval [17] have spotlighted the importance of addressing hate speech detection in various languages and contexts. Dravidian LangTech [18], for example, focused on detecting offensive language in code-mixed datasets comprising Tamil-English, Malayalam-English, and Kannada-English. The application of multilingual models, including BERT variants and IndicBERT, has shown promise in this regard. Transfer learning has also shown potential for enhancing offensive language recognition, particularly in code-mixed contexts: researchers have leveraged transfer learning from English datasets to improve offensive language recognition in code-mixed Kannada [19], Malayalam [20], and Tamil [21]. Detecting hate speech in conversational Hindi-English code-mixed data presents additional complexities due to the conversational nature of such content.
The hierarchical structure of posts, comments, and replies necessitates a nuanced approach, with techniques ranging from unified text treatment to novel hierarchical neural network architectures. Multiple comments can be associated with each post, and each comment may have several replies. In the case of English-Hindi data, each component of the tuple can exhibit code-mixing between Hindi and English, be exclusively in English, exclusively in Hindi, in romanized Hindi, or a combination thereof. As a result, complex input patterns emerge. The labels assigned to replies or comments are significantly influenced by the contextual information provided by the parent text. To address this, Chanda et al. [22] treated the post, comments, and replies as a single unified text and applied a pre-trained multilingual BERT model, while Chanda et al. [23] concatenated each comment or reply with its parent post to maintain context. Bagora et al. [24] proposed a novel hierarchical neural network architecture, and Madhu et al. [25] employed a pipeline consisting of an LSTM classifier followed by a fine-tuned SentBERT model.

3. Dataset

In this study, we utilized the HASOC 2023 datasets provided by the organizers of FIRE 2023. The organizers furnished training data for the tasks and, for the final evaluation, made available test data for which participants were required to submit prediction files for each data sample. For Tasks 1 [26] and 4, the data files were formatted as simple CSVs, with one column for the text and another for the corresponding label. Task 3 had a distinct structure, comprising four columns in the training data: 'id', 'sentence', 'span', and 'bio'. The 'span' column gives the word indices at which hateful content begins and ends, while the 'bio' column uses these span indices to tag each word as 'B' (beginning), 'I' (inside), or 'O' (outside hateful content). The corpus sizes and class distributions are shown in Table 2.

Table 2
Statistical overview of the training and test data

Task 1
  Data    Language   #Sentences   NOT     HOF
  Train   Gujarati   200          100     100
  Test    Gujarati   1196         -       -
  Train   Sinhala    7500         4324    3176
  Test    Sinhala    2500         -       -

Task 3
  Data    Language   #Sentences
  Train   English    2421
  Test    English    606

Task 4
  Data    Language   #Sentences   NOT     HOF
  Train   Assamese   4036         1689    2347
  Test    Assamese   1009         -       -
  Train   Bengali    1281         766     515
  Test    Bengali    320          -       -
  Train   Bodo       1679         681     998
  Test    Bodo       420          -       -

4. Proposed Methodology

HASOC 2023 comprised four distinct tasks; in the following subsections, the methodologies employed for Tasks 1, 3 and 4 are outlined individually. It is essential to underscore that preprocessing, a pivotal facet of many text-related downstream tasks, is discussed first, before the specifics of each task's methodology. The code for all proposed methods can be found on GitHub (https://github.com/abhishekdhakaab/FIRE-2023).

4.1. Preprocessing

Social media data exhibits a high degree of structural informality and is susceptible to noise due to the colloquial nature of Twitter conversations. This inherent characteristic poses a potential challenge to the accuracy of processing techniques. Consequently, all data was subjected to preprocessing procedures aimed at mitigating the impact of less informative textual components. Notably, for Tasks 1 and 4, the following preprocessing was done.
Conversely, for Task 3, no such cleaning was deemed necessary. The preprocessing steps applied were:

• Cleaning: removal of usernames, punctuation, URLs, mentions and hashtag markers.
• ekphrasis: a text processing tool geared towards text from social networks such as Twitter or Facebook; it performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction (a configuration sketch is given at the end of this subsection).
• Hashtag normalization: for example, "#BlackLivesMatters" is segmented into "Black", "Lives" and "Matters".

For the binary classification tasks, 'HOF' labels were converted to the integer 1, representing hateful or offensive content, while 'NOT' labels were converted to the integer 0, indicating non-harmful content.

The preparation for Task 3 differed slightly. To maintain consistent sentence lengths for word-level sequence classification with BIO tagging, padding was applied to both the input data and the corresponding true labels. Specifically, inputs were padded to a maximum length of 128 tokens, and the label sequences were padded with 0 so that padded positions consistently predict the 'O' label.
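As a concrete illustration of these steps, the sketch below shows one plausible ekphrasis configuration; the exact options used in our experiments may differ slightly, so treat this as a minimal example rather than the submitted pipeline:

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Normalize URLs, e-mails and @mentions to placeholder tokens, segment
# hashtags using Twitter word statistics, and tokenize social-media text.
text_processor = TextPreProcessor(
    normalize=["url", "email", "user"],
    segmenter="twitter",
    corrector="twitter",
    unpack_hashtags=True,   # "#BlackLivesMatters" -> "black lives matters"
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tokens = text_processor.pre_process_doc(
    "Check this out #BlackLivesMatters @someone http://t.co/xyz"
)
# Drop the <url>/<user>/... placeholders to mimic outright removal.
tokens = [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]
print(" ".join(tokens))  # "check this out black lives matters"
```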
4.2. Methodology for Task 1

The selection of the embedding technique was contingent on vocabulary coverage, and various embeddings, including transformer-based ones such as mBERT, were explored. The empirical evaluation revealed that FastText covered the largest share of commonly used words in our data while keeping the number of out-of-vocabulary tokens low.

1. Sinhala: For the Sinhala language, four submission strategies were employed, each with a specific model methodology:

A. Fasttext-CNN: In this approach, a 300-dimensional FastText embedding was utilized. The training set comprised 90% of the available data, with the remaining 10% designated for validation, and the maximum sequence length was set to 128 tokens. Notably, the dataset vocabulary comprised 39,793 word types, of which 30,277 were covered by the FastText embedding. The model architecture included two convolutional layers, both with 300 filters and kernel sizes of 3 and 2, respectively. Their outputs were concatenated and fed into a subsequent CNN layer with 500 filters, followed by a dropout layer with a rate of 0.3. Another CNN layer with 300 filters was then applied, followed by global max pooling. A dense layer with 50 units and a ReLU activation preceded a final sigmoid output layer. Hyperparameters comprised a learning rate of 5e-5, a batch size of 32, the AdamW optimizer, and a binary cross-entropy loss. Early stopping was employed, and this configuration consistently yielded the best scores among the learning rates explored, which ranged from 0.1 down to 5e-7. (A code sketch of this architecture is given at the end of this subsection.)

B. Full-Data Fasttext-CNN: This submission retained the same model architecture and hyperparameters as submission A, except that all available data was used for training, with no validation split.

C. BiLSTM-Attention: For this strategy, a 300-dimensional embedding was employed, followed by a bidirectional LSTM layer with 300 units incorporating an attention mechanism. A subsequent dense layer with 50 units and a ReLU activation followed, along with a dropout layer (rate = 0.3). Finally, a dense layer with a single unit and a sigmoid activation concluded the model.

D. BiLSTM-Attention (2): This submission used a 300-dimensional embedding followed by a bidirectional LSTM layer with 128 units and a dropout layer (rate = 0.3). An attention layer was applied to the output of the bidirectional LSTM, followed by global max pooling and flattening. A dropout layer with a rate of 0.3 preceded a dense layer with 64 units, followed by another dropout layer with the same rate. The model culminated in a final dense layer with a single unit and a sigmoid activation.

2. Gujarati: For the Gujarati language, two submission strategies were employed:

A. FastText-CNN: This approach utilized a 300-dimensional FastText embedding. The training set comprised 90% of the available data, with the remaining 10% reserved for validation. Padding was applied to each sentence to maintain a consistent length of 128 tokens. Notably, the dataset vocabulary comprised 4,412 word types, of which 3,931 were covered by the FastText embedding. The same CNN classifier as in Sinhala submission A was used.

B. BiLSTM-Attention: In this strategy, a 300-dimensional embedding was employed, followed by a bidirectional LSTM layer with 128 units and a dropout layer (rate = 0.3). An attention layer was applied to the output of the bidirectional LSTM, followed by global max pooling and flattening. A dropout layer with a rate of 0.3 preceded a dense layer with 1024 units and an additional dropout layer with a rate of 0.3. The model further incorporated a dense layer with 256 units, followed by another dense layer with 32 units. The final layer consisted of a single unit with a sigmoid activation.

The rationale for opting for a deep neural network (DNN) over a Transformer-based model is rooted in a critical observation: with a training set of only 200 examples, a DNN tends to outperform a Transformer architecture, since Transformer models typically require a larger volume of training data to reach their optimal performance. In situations of such data scarcity, the DNN's ability to generalize and learn effectively from limited examples makes it a compelling choice.
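The FastText-CNN classifier reused across the Task 1 submissions can be summarized in Keras as below. This is a reconstruction from the prose above, not the submitted code: the kernel sizes of the 500- and 300-filter layers, the padding mode, and the embedding-matrix construction are our assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM = 128, 39_793, 300

# In practice each row holds the pre-trained FastText vector of the
# corresponding word; uncovered words keep zero vectors (an assumption).
embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM), dtype="float32")

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, weights=[embedding_matrix])(inputs)

# Two parallel convolutions (300 filters, kernel sizes 3 and 2), concatenated.
conv3 = layers.Conv1D(300, 3, padding="same", activation="relu")(emb)
conv2 = layers.Conv1D(300, 2, padding="same", activation="relu")(emb)
x = layers.Concatenate()([conv3, conv2])

x = layers.Conv1D(500, 3, padding="same", activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.Conv1D(300, 3, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(50, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary NOT/OFF decision

model = models.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-5),  # TF >= 2.11
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```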
4.3. Methodology for Task 3

For this task, the initial step was to map the true labels as follows: 'O' to 0, 'B' to 1, and 'I' to 2. No preprocessing of the data was required. The dataset was divided into 80% for training and 20% for validation. For word embeddings, GloVe embeddings trained on the Twitter 27B-token corpus (https://github.com/stanfordnlp/GloVe) were employed. The model architecture, inspired by [27], comprised several key components. Input tokens were first embedded using the GloVe embeddings, followed by a 64-unit attention layer. The output of the attention layer was passed through two BiLSTM layers, each with 512 units and a dropout rate of 0.2, and the outputs of the two BiLSTM layers were added. This was followed by a time-distributed dense layer with 50 units. The output of this time-distributed dense layer (of shape (batch_size, 128, 50)) was further processed by a simple dense layer with 3 units, yielding a shape of (batch_size, 128, 3), and then passed to a CRF layer with the number of CRF tags set to 3. The model was trained with the Adam optimizer at a learning rate of 0.001 for a total of 5 epochs.
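A condensed PyTorch rendering of this tagger is sketched below. It follows the layer sequence described above but omits the 64-unit attention layer for brevity, and the CRF comes from the third-party pytorch-crf package, our choice rather than necessarily the library used in the submitted system:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class SpanTagger(nn.Module):
    """Two stacked BiLSTMs with a CRF head for B/I/O tagging (O=0, B=1, I=2)."""

    def __init__(self, vocab_size, emb_dim=300, hidden=512, num_tags=3):
        super().__init__()
        # GloVe (Twitter, 27B tokens) vectors would be copied into this table.
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.2)
        self.proj = nn.Sequential(
            nn.Linear(2 * hidden, 50), nn.ReLU(),  # time-distributed dense (50 units)
            nn.Linear(50, num_tags),               # per-token emission scores
        )
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None, mask=None):
        h1, _ = self.lstm1(self.emb(tokens))
        h2, _ = self.lstm2(self.drop(h1))
        emissions = self.proj(h1 + h2)   # outputs of the two BiLSTMs are added
        if tags is not None:             # training: negative CRF log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: Viterbi paths
```

The CRF's learned transition scores are what enforce constraints such as 'I' never starting a span, which an independent per-token softmax over the three tags cannot guarantee.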
4.4. Methodology for Task 4

The submitted models were chosen after extensive experimentation with learning rates ranging from 0.1 to 1e-7. Different numbers of LSTM layers, from one to four, were evaluated to discern the impact of depth, and variations in the number and sizes of dense layers were explored. For almost all experiments, approximately 90% of the dataset was allocated to training, with the remaining 10% set aside for validation.

1. Assamese:

A. Multilingual BERT: In this approach, mBERT (bert-base-multilingual-cased) with BertForSequenceClassification, a state-of-the-art transformer model, was used to assess its efficacy in grasping the nuances of the Assamese language. Given the limited Assamese data available for model training, a separate test split was not held out. A learning rate of 2e-5 was chosen, with a batch size of 32, and training ran for 50 epochs. (A sketch of this setup is given at the end of this subsection.)

B. Fine-Tuned Assamese BERT: This approach entailed fine-tuning a BERT variant pre-trained on a monolingual Assamese dataset. The maximum sequence length was restricted to 128 tokens to accommodate the characteristics of Assamese text, and fine-tuning was limited to the last three layers of the BERT model. The BERT embeddings were then transformed by a Bi-LSTM (128) layer, followed by an LSTM (128) layer and an additional LSTM (64) layer. This representation was passed through a dense layer of 32 neurons with ReLU activation, dropout at a rate of 0.3, an additional linear layer, and a sigmoid activation. Training ran for 200 epochs with a batch size of 16, with the best model selected on validation accuracy. The learning rate was 1e-6, with the AdamW optimizer.

C. Fine-Tuned Assamese BERT with a Different Learning Rate: This variant closely followed approach (B) but used a learning rate of 5e-8 to explore its impact on model performance.

2. Bengali:

A. XLM-RoBERTa: Implemented XLM-RoBERTa in conjunction with a sequence-classification head. To accommodate the lengthier nature of Bengali sentences, the maximum sequence length was set to 256 tokens. Training fine-tuned XLM-RoBERTa's weights with a learning rate of 2e-5 and a batch size of 16 over 10 epochs.

B. Multilingual BERT: Used mBERT with the same hyperparameters as XLM-RoBERTa, employing the BertForSequenceClassification framework to assess its performance in Bengali text classification.

C. BengaliBERT: Leveraged BengaliBERT, a model pre-trained on monolingual Bengali data; no further fine-tuning of the BERT weights was performed for the classification task. The architecture combined LSTM and dense layers: data was propagated through a Bi-LSTM (128) layer, followed by an LSTM (128) layer, and then a dense layer of 32 neurons with ReLU activation. To mitigate overfitting, a dropout rate of 0.3 was applied, followed by another linear layer and a sigmoid activation. Training spanned 200 epochs with a batch size of 16, with model selection based on validation accuracy. The learning rate was 1e-5, with the AdamW optimizer.

3. Bodo:

A. XLM-RoBERTa: Employed a sequence-classification head for Bodo text classification. The maximum sequence length was set to 256 tokens to accommodate the language's characteristics. The learning rate was 2e-5, with a batch size of 16 for 10 training epochs.

B. HBERT: Utilized L3Cube's Hindi BERT v2, given the similarity between the Bodo and Hindi scripts. Recognizing the distinctions between the languages, the last three layers of the BERT model were fine-tuned. The same DNN classifier as in Assamese approach (B) was used. Two learning rates, 1e-6 and 1e-5, were tested, with results saved as HBERT.csv and HBERT_2.csv, respectively. The AdamW optimizer was employed.

C. BodoBERT: Employed BodoBERT, a model specifically tailored for the Bodo language, with the same DNN classifier as in Assamese approach (B). The learning rate was 1e-6, with the AdamW optimizer.

D. Ensemble: Introduced an ensemble that combines the outcomes of approach B (HBERT) and approach C (BodoBERT) into a cohesive predictive framework, exploring the synergy between different models to enhance the accuracy and robustness of hate speech classification on the Bodo dataset.

Given the substantial computational demands of transformers, all transformer-based models were trained on Colab's T4 GPU, while non-transformer models were trained on Colab's CPU with 12.7 GB of RAM. Model predictions were converted back to binary labels (HOF and NOT) using a threshold of 0.5, a common practice in binary classification tasks.
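For the plain mBERT classifiers (Assamese A, Bengali B), the setup reduces to standard Hugging Face sequence classification. A minimal single-step sketch, with the data loading and the multi-epoch loop elided, is:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # NOT = 0, HOF = 1
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["<a social media post in Assamese>"]  # placeholder example
labels = torch.tensor([1])                     # 1 = HOF

batch = tokenizer(texts, truncation=True, padding="max_length",
                  max_length=128, return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The BERT+LSTM variants (Assamese B/C, BengaliBERT, HBERT, BodoBERT) differ only in replacing the built-in classification head with the Bi-LSTM/LSTM/dense stack described above.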
5. Results and Discussion

Models were validated on the training and development sets, given the limited amount of data available for training. The prediction files were then submitted on the test data to obtain the final results [28, 29, 30, 31].

5.1. Results for Task 1

In Task 1, focused on Sinhala and Gujarati text classification, our team achieved competitive scores. Table 3 shows the best performing team and our official performance on the test data, as shared by the organizers. In the Sinhala category, we earned a respectable macro F1 score of 0.78, while in the Gujarati category our score was 0.68. For reference, the top-performing team, "FiRC-NLP", secured scores of 0.83 and 0.84 in Sinhala and Gujarati, respectively.

Table 3
Evaluation results for Task 1 on test data

  Language   Team                           Macro F1   Precision   Recall
  Sinhala    FiRC-NLP                       0.83       0.83        0.83
  Sinhala    IRLab@IITBHU (Fasttext-CNN)    0.78       0.78        0.78
  Gujarati   FiRC-NLP                       0.84       0.83        0.86
  Gujarati   IRLab@IITBHU (Fasttext-CNN)    0.68       0.69        0.72

5.2. Results for Task 3

In Task 3, which involved hateful span detection, our team faced more significant challenges. Table 4 shows the best performing team and our official performance on the test data, as shared by the organizers. On the public leaderboard, our team stood at rank 3; on the public and private test sets, we achieved scores of 0.45 and 0.51, respectively.

Table 4
Evaluation results for Task 3 on test data

  Leaderboard   Team            Score
  Public        FiRC-NLP        0.53
  Public        IRLab@IITBHU    0.45
  Private       FiRC-NLP        0.57
  Private       IRLab@IITBHU    0.51

5.3. Results for Task 4

Task 4 encompassed text classification in the Assamese, Bengali, and Bodo languages, and our team, "IRLab@IITBHU", achieved competitive scores in all three categories. Table 5 shows the best performing team and our official performance on the test data, as shared by the organizers. For Assamese, our score was 0.70, while "InclusiveTechies" led with 0.80. In Bengali, we scored 0.65, whereas "Sanvadita" achieved 0.77. In the Bodo category, our team secured 0.74, closely following "SATLab", which led with 0.86.

Table 5
Evaluation results for Task 4 on test data

  Language   Team                                        Score
  Assamese   InclusiveTechies                            0.80
  Assamese   IRLab@IITBHU (Fine-Tuned Assamese BERT)     0.70
  Bengali    Sanvadita                                   0.77
  Bengali    IRLab@IITBHU (BengaliBERT)                  0.65
  Bodo       SATLab                                      0.86
  Bodo       IRLab@IITBHU (HBERT for Bodo)               0.74

5.4. Discussion

During the analysis of Sinhala and Gujarati, it was observed that the training data for Gujarati was insufficient to train a model effectively. In the investigation of hate speech classification across Bengali, Assamese, and Bodo, a noteworthy observation emerged: validation accuracy alone does not necessarily capture a model's true ability; validation loss holds paramount importance. A model with marginally lower validation accuracy but considerably lower validation loss often outperforms a model with higher accuracy but slightly greater loss. This discrepancy underscores the significance of validation loss in gauging a model's confidence in its predictions. In practical scenarios, it is often more prudent to err on the side of caution, minimizing the risk of false positives, where benign content is mistakenly flagged as hate speech.

In almost all tasks, it was observed that about two LSTM layers proved sufficient: only marginal improvements were discerned beyond two to three LSTM layers, and such gains came at the cost of increased computational complexity.

While numerous embedding techniques are available for deep learning models, our experimentation revealed that FastText embeddings exhibited the most extensive vocabulary coverage for our dataset. This finding underscores the value of selecting embeddings tailored to the specific language and task at hand. Given the challenges posed by low-resource languages and limited embedding resources, an alternative approach is to initialize word embeddings randomly and train them with the model, which may hold potential for optimizing performance under resource constraints.

In our search for the optimal optimizer, our experiments indicated that Stochastic Gradient Descent (SGD) with a slightly higher learning rate converges more rapidly, whereas the AdamW optimizer with a higher learning rate exhibited a zig-zag convergence pattern. AdamW performed best with a lower learning rate, typically around 1e-5. SGD with a marginally higher learning rate can thus be a pragmatic choice for quick model testing, particularly with Transformer-based models, as it provides insight into a model's convergence tendencies before committing to more computationally intensive optimization methods.
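In code, the two regimes discussed above amount to nothing more than a different optimizer constructor. The sketch below, with a stand-in model and illustrative learning rates of our own choosing, shows the pattern we followed:

```python
import torch

model = torch.nn.Linear(768, 1)  # stand-in for any classifier head

# Quick sanity checks: SGD with a comparatively high learning rate converges
# fast enough to reveal whether the model learns at all.
probe_opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Final runs: AdamW at a low learning rate (around 1e-5 in our experiments)
# avoids the zig-zag behaviour it shows at higher rates.
final_opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```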
Upon completion of preprocessing, the rationale behind the transformer-based models was to employ BERT for word-level embedding, since BERT leverages the contextual information around each word to enhance the quality of its embedding. The motivation for incorporating a bidirectional LSTM (Bi-LSTM) layer was to ensure that each word's representation encompasses context from both preceding and subsequent words. Following the Bi-LSTM layer, an additional LSTM layer can further process the bidirectional output, and after approximately two to three LSTM layers, the hidden state of the last time step of the final LSTM layer is passed to a dense layer for binary classification into classes 0 and 1. In all approaches, the final layer consists of a dense layer with a single neuron and a sigmoid activation function.

6. Conclusion

In the course of our investigation encompassing three diverse hate speech detection tasks, the following insights emerged:

Task 1: For low-resourced languages, such as those examined in this study, the combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models yielded the most efficacious results. The intricacies of these languages, coupled with the paucity of data, necessitated a tailored approach.

Task 3: Conditional Random Fields (CRF) emerged as the preeminent choice for Task 3, demonstrating superior performance in hateful span detection. Its efficacy surpassed that of the alternative methods we tried, underscoring its relevance and utility in this context.

Task 4: Task 4 underscored the value of models fine-tuned on language-specific monolingual data for the classification of text in low-resourced languages. These tailored models exhibited enhanced performance in text classification, emphasizing the significance of linguistic specificity in classification endeavors.

7. Acknowledgements

We would like to express our gratitude to the HASOC organisers for organising this interesting shared task and for promptly responding to all of our questions. Our investigation benefited from well-structured data with minimal spelling errors, rendering the classification task more straightforward.

References

[1] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in online user content, in: Proceedings of the 25th International Conference on World Wide Web, WWW '16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2016, pp. 145–153. URL: https://doi.org/10.1145/2872427.2883062. doi:10.1145/2872427.2883062.

[2] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on Arabic social media, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 52–56. URL: https://www.aclweb.org/anthology/W17-3008. doi:10.18653/v1/W17-3008.

[3] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Benchmarking aggression identification in social media, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 1–11.
URL: https://www.aclweb.org/anthology/W18-4401.

[4] J.-M. Xu, K.-S. Jun, X. Zhu, A. Bellmore, Learning from bullying traces in social media, in: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, Association for Computational Linguistics, USA, 2012, pp. 656–666.

[5] M. Dadvar, D. Trieschnigg, R. Ordelman, F. de Jong, Improving cyberbullying detection with user context, in: P. Serdyukov, P. Braslavski, S. O. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, E. Yilmaz (Eds.), Advances in Information Retrieval, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 693–696.

[6] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech detection with comment embeddings, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, Association for Computing Machinery, New York, NY, USA, 2015, pp. 29–30. URL: https://doi.org/10.1145/2740908.2742760. doi:10.1145/2740908.2742760.

[7] T. Davidson, D. Warmsley, M. W. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, CoRR abs/1703.04009 (2017). URL: http://arxiv.org/abs/1703.04009.

[8] I. Kwok, Y. Wang, Locate the hate: Detecting tweets against blacks, in: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, AAAI Press, 2013, pp. 1621–1622.

[9] S. Malmasi, M. Zampieri, Detecting hate speech in social media, CoRR abs/1712.06427 (2017). URL: http://arxiv.org/abs/1712.06427.

[10] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 14–17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.

[11] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, M. Wojatzki, Measuring the reliability of hate speech annotations: The case of the European refugee crisis, CoRR abs/1701.08118 (2017). URL: http://arxiv.org/abs/1701.08118.

[12] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, A. Abdelali, Arabic offensive language on Twitter: Analysis and experiments, arXiv preprint arXiv:2004.02192 (2020).

[13] Z. Pitenis, M. Zampieri, T. Ranasinghe, Offensive language identification in Greek, in: Proceedings of the 12th Language Resources and Evaluation Conference, ELRA, 2020.

[14] D. Fišer, T. Erjavec, N. Ljubešić, Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 46–51. URL: https://www.aclweb.org/anthology/W17-3007. doi:10.18653/v1/W17-3007.

[15] H.-P. Su, Z.-J. Huang, H.-T. Chang, C.-J. Lin, Rephrasing profanity in Chinese text, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 18–24. URL: https://www.aclweb.org/anthology/W17-3003. doi:10.18653/v1/W17-3003.

[16] A. Saroj, S. Chanda, S. Pal,
IRLab@IITV at SemEval-2020 Task 12: Multilingual offensive language identification in social media using SVM, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), 2020, pp. 2012–2016. URL: https://aclanthology.org/2020.semeval-1.265. doi:10.18653/v1/2020.semeval-1.265.

[17] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the identification of offensive language, in: Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, September 21, 2018, Austrian Academy of Sciences, Vienna, Austria, 2018, pp. 1–10. URL: http://nbn-resolving.de/urn:nbn:de:bsz:mh39-84935.

[18] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan, R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 133–145. URL: https://aclanthology.org/2021.dravidianlangtech-1.17.

[19] S. Sai, Y. Sharma, Towards offensive language identification for Dravidian languages, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 18–27. URL: https://aclanthology.org/2021.dravidianlangtech-1.3.

[20] T. Ranasinghe, S. Gupte, M. Zampieri, I. Nwogu, WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive language identification in code-switched YouTube comments, CoRR abs/2011.00559 (2020). URL: https://arxiv.org/abs/2011.00559.

[21] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1415–1420. URL: https://aclanthology.org/N19-1144. doi:10.18653/v1/N19-1144.

[22] S. Chanda, S. Ujjwal, S. Das, S. Pal, Fine-tuning pre-trained transformer based model for hate speech and offensive content identification in English, Indo-Aryan and code-mixed (English-Hindi) languages, 2021.

[23] S. Chanda, S. Sheth, S. Pal, Coarse and fine-grained conversational hate speech and offensive content identification in code-mixed languages using fine-tuned multilingual embedding, 2022.

[24] A. Bagora, K. Shrestha, K. Maurya, M. S. Desarkar, Hostility detection in online Hindi-English code-mixed conversations, in: 14th ACM Web Science Conference 2022, 2022, pp. 390–400.

[25] H. Madhu, S. Satapara, S. Modha, T. Mandl, P. Majumder, Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments, Expert Systems with Applications 215 (2023) 119342. URL: https://www.sciencedirect.com/science/article/pii/S0957417422023600. doi:10.1016/j.eswa.2022.119342.

[26] T. Ranasinghe, I. Anuradha, D. Premasiri, K. Silva, H. Hettiarachchi, L. Uyangodage, M. Zampieri, SOLD: Sinhala offensive language dataset, arXiv preprint arXiv:2212.00851 (2022).

[27] S. Masud, M. Bedi, M. A. Khan, M. S. Akhtar, T. Chakraborty,
Proactively reducing the hate intensity of online posts via hate speech normalization, 2022. arXiv:2206.04007.

[28] S. Satapara, H. Madhu, T. Ranasinghe, A. E. Dmonte, M. Zampieri, P. Pandya, N. Shah, M. Sandip, P. Majumder, T. Mandl, Overview of the HASOC subtrack at FIRE 2023: Hate-speech identification in Sinhala and Gujarati, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.

[29] S. Masud, M. A. Khan, M. S. Akhtar, T. Chakraborty, Overview of the HASOC subtrack at FIRE 2023: Identification of tokens contributing to explicit hate in English by span detection, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.

[30] K. Ghosh, A. Senapati, A. S. Pal, Annihilate Hates (Task 4, HASOC 2023): Hate speech detection in Assamese, Bengali, and Bodo languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR, 2023.

[31] T. Ranasinghe, K. Ghosh, A. S. Pal, A. Senapati, A. E. Dmonte, M. Zampieri, S. Modha, S. Satapara, Overview of the HASOC subtracks at FIRE 2023: Hate speech and offensive content identification in Assamese, Bengali, Bodo, Gujarati and Sinhala, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM, 2023.