<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Hate Speech and Offensive Language Detection of Low Resource Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhinav Reddy Gutha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nidamanuri Sai Adarsh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ananya Alekar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dinesh Reddy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Goa - 403401</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The last decade has seen a steep rise in society's use of and dependence on social media. The need for detection and prevention of hate and offensive speech is greater than ever. The ever-changing form of natural language, which often involves code-mixed text, makes the detection of hate speech challenging. The task becomes even more daunting in a country like India, where different languages and dialects are spoken across the country. This paper details the Code Fellas team's approaches in the context of HASOC 2023 Task 4: Annihilate Hate, an initiative aimed at extending hate speech detection to the Bengali, Bodo, and Assamese languages. Here we describe our approaches, which broadly involve Long Short-Term Memory (LSTM) networks coupled with Convolutional Neural Networks (CNN) and pre-trained Bidirectional Encoder Representations from Transformers (BERT) based models like IndicBERT [1] and MuRIL [2]. Notably, our results showcase the effectiveness of these approaches, with IndicBERT achieving an F1 score of 69.726% for Assamese, MuRIL achieving 71.955% for Bengali, and a BiLSTM model enhanced with an additional Dense layer attaining 83.513% for Bodo.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>Offensive Language</kwd>
        <kwd>LSTM</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social media allows users to express their opinions without disclosing their real identities.
This leads to the misuse of social media platforms to generate hate among individuals and
communities, often leading to hate crimes. In the past few years, platforms like Twitter, Facebook,
and Reddit have seen a rising trend in the dissemination of offensive content and the coordination
of hate-related actions. The hate affects not only the users but also the general public, increasing
the cases of depression, anxiety, and other mental health issues. Hence, effective hate speech
detection is required.</p>
      <p>Until a few years ago, hate and offensive speech were identified manually, which is now an
impossible task due to the enormous amounts of data being generated daily on social media
platforms. Detection of hate speech is a challenging task since filtering out certain words
that express hate is not sufficient. The task also requires one to know the context and the
background of the user. In addition, in a diverse country like India, where numerous languages
are spoken, individuals often use their local languages when engaging through social media.
This becomes a major hurdle in hateful speech detection since two texts with the same
literal form can mean different things in their respective languages.</p>
      <p>In this paper, we propose approaches to address the above-mentioned issues, including classical
machine learning algorithms, LSTM and BiLSTM coupled with CNN, and pre-trained
BERT-based models, to identify hate speech in the Bengali, Bodo, and Assamese languages.</p>
      <p>The rest of the paper is organized as follows. Section 3 describes the task and dataset, Section 4
details the preprocessing steps, and Section 5 describes our proposed models. Section 6 elaborates on the
experimental setup used. In Section 7 we present the results obtained by our models in the evaluations and
conclude the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        The four major tasks in HASOC 2023 are Task-1[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Task-2[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Task-3[6], and Task-4[7][8].
Task-1[8] focuses on identifying hate, offensive, and profane content in Indo-Aryan languages, with
one sub-task focusing on Sinhala and another on Gujarati. Task-2
focuses on the identification of conversational hate speech in code-mixed languages; this task
involves the binary classification of such conversational tweets with tree-structured data.
Task-3 focuses on the identification of tokens contributing to explicit hate in text through hate span
detection. Task-4, Hate Speech and Offensive Content Identification in English and
Indo-Aryan Languages, aims to tackle the detection of hate speech within the Bengali, Bodo, and
Assamese languages. The dataset used in this task is primarily derived from social media
platforms, including Twitter, Facebook, and YouTube. It comprises lists of sentences, each
annotated with a label indicating the presence or absence of offensive content. This task entails
binary classification, with the core objective of predicting whether a given sentence contains
offensive language.
      </p>
      <p>Researchers have used various techniques for the classification of text for hate speech
detection. K. Ghosh, A. Senapati, et al. [9][10][11] used baseline BERT models for conversational
hate speech detection in code-mixed tweets utilizing data augmentation and offensive language
identification, compared mono- and multilingual transformer models with cross-language
evaluation, and performed transformer-based hate speech detection in Assamese.</p>
      <p>A primary challenge within this research domain revolves around the scarcity of available data,
particularly in languages like Assamese, Bodo, and Bengali. The limited data availability has
hindered comprehensive investigations into hate speech classification within these languages.
In light of this, our work endeavors to bridge this critical gap by developing a model capable of
effectively addressing hate speech detection in low-resource language contexts. Our approach
holds relevance not only for the specific languages discussed but also as a valuable blueprint for
addressing similar challenges in other low-resource languages.</p>
      <p>Github:
https://github.com/16AbhinavReddy/Multilingual-Hate-Speech-and-Ofensive-Language-Detectionof-Low-Resource-Languages</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Information</title>
      <p>In this paper, we have used the multilingual datasets provided by Hate Speech and Offensive
Content Identification in Indo-European Languages (HASOC). The shared tasks of HASOC
were provided for three languages (Assamese, Bengali, and Bodo) as part of Task 4. Training
and test datasets are provided for the three languages. Task 4 is a binary classification task that
requires participants to classify the given tweets into two groups: Hate and Offensive (HOF) and
Not Hate-Offensive (NOT).</p>
      <p>1. HOF: This post includes hateful, offensive, or profane content.
2. NOT: This post contains neither hate speech nor offensive content.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Preprocessing</title>
      <p>Data preprocessing is a crucial step in Natural Language Processing (NLP) tasks to clean the data and
make it suitable for analysis. However, it is recommended to apply only minimal processing,
especially when dealing with BERT/LSTM-based approaches. The application of
excessive preprocessing can strip away some of the information these models use to make
accurate predictions. BERT and LSTM are both powerful language models that can learn
complex relationships between words and phrases. However, our approach follows the steps
[12] mentioned below to remove irrelevant text from the collected corpus and training datasets.
The steps include:
1. Removing all the usernames using regular expressions.
2. Removing all the URLs and numerics using regular expressions.
3. Reducing elongated words that express screaming to their normal form. For example,
helloooooooooo is reduced to hello.
4. Removing all the newline characters from the text.
5. Converting the text to tensors by applying tokenization to the tweets and padding the
sequences accordingly. We have used the tokenizers that are specifically designed for
BERT or LSTM models.</p>
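      <p>As a brief illustration, a condensed sketch of the cleaning steps above is given below; the regular expressions and the helper name are illustrative rather than our exact implementation, and the tokenization and padding of step 5 are handled by the model-specific tokenizers shown in later sections.</p>
      <preformat><![CDATA[
# Condensed sketch of steps 1-4; the regex patterns are illustrative.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", " ", text)                   # 1. remove usernames
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 2. remove URLs
    text = re.sub(r"\d+", " ", text)                    # 2. remove numerics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # 3. collapse elongated characters (helloooo -> helloo)
    text = text.replace("\n", " ")                      # 4. remove newline characters
    return re.sub(r"\s+", " ", text).strip()
]]></preformat>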
      <p>We experimented with various other preprocessing methods, such as replacing English emoticons
and emojis1 with their textual meaning using the emoji library. However, approaches like
BERT and LSTM models are very powerful and can learn Unicode characters like emoticons,
emojis, etc. This is because BERT models are already trained on a massive dataset of text and
code, and they have learned to capture many of the important features of languages. Hence,
applying excessive processing to the input text can introduce noise and can lead to the loss of
hateful cues like emojis, which, although they do not hold much meaning in normal embeddings,
can express hate in many contexts. Removing important information like emojis can therefore
adversely affect the model's performance.</p>
      <p>1https://pypi.org/project/emoji/</p>
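      <p>For reference, the emoji-replacement experiment mentioned above can be sketched in one line with the emoji package (we ultimately kept emojis intact instead):</p>
      <preformat><![CDATA[
# Sketch of the discarded emoji-replacement experiment.
import emoji

print(emoji.demojize("This is bad 🔥"))  # -> "This is bad :fire:"
]]></preformat>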
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Translation</title>
        <p>As mentioned in the Data Preprocessing section, we first developed a pipeline to preprocess the
data for model training.</p>
        <p>The task involves detecting hate speech in various languages; hence, the most natural way to
proceed is by converting the text from different languages into a single language, preferably
English [13]. This would have been a good approach if the translation were able
to capture the meaning of the text in its most appropriate context.</p>
        <p>We used the Googletrans2 library to translate the given text into English. We then applied
our preprocessing pipeline to the translated text and proceeded with
classification. However, this methodology has certain limitations. Primarily,
the translation loses the context, the original meaning, and the intent of some offensive
text, due to which the model is unable to perform well in classification. Moreover, translation
libraries are often not available for low-resource languages like Bodo; hence, we concluded
that this is not the optimal method.</p>
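        <p>As a minimal sketch of this translation step (assuming the googletrans package; variable names are illustrative), each tweet is translated to English before entering the preprocessing pipeline:</p>
        <preformat><![CDATA[
# Minimal sketch of the translation step using googletrans.
from googletrans import Translator

translator = Translator()

def to_english(text: str) -> str:
    """Translate a tweet to English; fall back to the original text on failure."""
    try:
        return translator.translate(text, dest="en").text
    except Exception:
        return text

translated = [to_english(t) for t in tweets]  # `tweets` is the raw tweet list
]]></preformat>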
        <p>While experimenting on offensive tweets in languages like Assamese and Bengali, we found
that the offensive connotation is lost for some tweets because the translator cannot preserve that
information. Our observation is that the translation carries these errors forward into classification.
Hence, we needed to identify an optimal method to complete the task without translation, while
performing the task separately in each language.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Machine Learning Techniques</title>
        <p>As the data for the given languages is sparse, the Term Frequency-Inverse Document Frequency
(TF-IDF) and CountVectorizer methods are a straightforward way to generate vector representations
for the given text data.</p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Embedding Technique</title>
          <p>Before generating vector representations using TF-IDF and CountVectorizer, we
applied basic preprocessing steps and removed URLs and usernames. As emojis can indicate aggression
in context, we retained them to capture the intended sentiment.</p>
          <p>2https://pypi.org/project/googletrans/</p>
          <p>After preprocessing, we applied the TF-IDF and CountVectorizer techniques, considering
word n-grams with n = 2 and 3, thus capturing semantic relationships between two or three
sequential words to some extent.</p>
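          <p>A minimal sketch of this feature-extraction step with scikit-learn, assuming train_texts holds the cleaned tweets, is given below:</p>
          <preformat><![CDATA[
# Sketch of TF-IDF and CountVectorizer features with word n-grams (n = 2, 3).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(2, 3))
counts = CountVectorizer(ngram_range=(2, 3))

X_tfidf = tfidf.fit_transform(train_texts)    # sparse documents x n-gram matrix
X_counts = counts.fit_transform(train_texts)
]]></preformat>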
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Model Application</title>
          <p>After generating the vector embeddings for the text, we applied the following models for
classifying hateful and ofensive speech:
• Support Vector Machine
• Logistic Regression
• XGB classifier
• Decision Tree Classifier</p>
          <p>Apart from this, we also tried using a Latent Semantic-based approach3, where we took the
number of components between 95 and 500. We used this as a variable when applying the
RandomizedSearchCV method, which is further explained below.</p>
          <p>The Latent Semantic Approach is a technique used in NLP and information retrieval to
understand the underlying meaning of words and documents. It relies on statistical analysis to
identify patterns and relationships between words based on their co-occurrence in a large text
corpus. By doing so, it can capture the semantic similarity between words and documents, even
if they don’t share exact terms.</p>
          <p>During model application, hyperparameter tuning is employed to find the best parameters
for classification while keeping computation manageable. In this process, RandomizedSearchCV is used
instead of GridSearchCV to obtain good results with comparatively less computation.</p>
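          <p>The sketch below illustrates this setup under stated assumptions: X_tfidf and the binary labels y come from the previous step, and the classifier and search ranges are examples rather than our exact grid.</p>
          <preformat><![CDATA[
# Sketch: latent semantic analysis (TruncatedSVD) + classifier tuned with RandomizedSearchCV.
from scipy.stats import randint
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("svd", TruncatedSVD()),                 # latent semantic components
    ("clf", LogisticRegression(max_iter=1000)),
])

param_distributions = {
    "svd__n_components": randint(95, 501),   # number of components between 95 and 500
    "clf__C": [0.01, 0.1, 1, 10],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=20,
                            scoring="f1_macro", cv=3, random_state=42)
search.fit(X_tfidf, y)
print(search.best_params_, search.best_score_)
]]></preformat>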
          <p>Table 2, Table 3 and Table 4 tabulate the performance of the top eight models in the three
languages.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Deep Learning Architectures</title>
        <p>After employing machine learning models, our research turns its attention toward deep learning
models, specifically LSTM and Bidirectional Long Short-Term Memory (BiLSTM).</p>
        <p>3https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. LSTM</title>
          <p>LSTM stands out as a specialized variant of Recurrent Neural Networks (RNNs), designed to
capture and model long-range temporal dependencies within sequential data. It
harnesses memory cells and gating mechanisms to selectively retain and retrieve information
over long sequences. The rationale behind our exploration of LSTM stems from its
advantages: sequential data processing, memory cells that facilitate the storage and retrieval
of information across extended sequences, and the incorporation of gating mechanisms.</p>
        <p>However, we encountered a notable challenge with LSTM. It exhibits a tendency to demand a
substantial volume of training data to achieve optimal performance. The given dataset does not
possess the requisite scale. To address this limitation, we introduced a 1D CNN layer (CNN 1-D)
to enhance the performance, since combining LSTM with 1D CNN amalgamates the strengths
of both architectures.</p>
          <p>Table 5 shows the performance of LSTM in all three languages.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. LSTM with CNN 1-D</title>
          <p>LSTM with CNN 1-D [14] is particularly advantageous when the input data exhibits local spatial
patterns, i.e., features within a sequence that are not contingent upon the order
of elements but are related to their relative positions or proximity within the sequence. Additionally,
it excels at capturing intricate temporal relationships, i.e., how elements within a sequence
relate to each other over time, facilitating the modeling of long-range dependencies and temporal
patterns.</p>
          <p>However, in contrast to our initial expectations, this configuration underperformed
compared to the preceding scenario. Overfitting turned out to be a prominent
concern, primarily due to the heightened complexity of the model structure, which ultimately
led to diminished accuracy.</p>
          <p>Table 5 shows the performance of LSTM + CNN-1D in all three languages.</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>5.3.3. BiLSTM</title>
          <p>Subsequently, we chose the BiLSTM approach, taking into account the issue of overfitting.
BiLSTMs employ bidirectional processing, which differentiates them from traditional LSTMs
that process sequences unidirectionally from left to right. BiLSTMs simultaneously maintain two
hidden states, one for processing sequences from the beginning to the end (forward direction)
and another for processing sequences from the end to the beginning (backward direction).</p>
          <p>Table 5 shows the performance of BiLSTM in all three languages.</p>
        </sec>
        <sec id="sec-5-3-1">
          <title>5.3.4. BiLSTM with CNN-1D</title>
          <p>A typical architectural configuration featuring the combination of BiLSTM and CNN-1D [15]
commences with CNN-1D layers at the model's inception to extract local features.
Subsequently, one or more BiLSTM layers capture temporal dependencies. The resulting outputs
from these layers are then directed into additional fully connected layers, culminating in the
final classification or regression task.</p>
          <p>Table 5 shows the performance of BiLSTM + CNN-1D in all three languages.</p>
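          <p>A sketch of this architecture in Keras is shown below; the layer sizes follow the values reported in Section 6, while vocab_size and max_len are assumptions that depend on the fitted tokenizer:</p>
          <preformat><![CDATA[
# Sketch of the CNN-1D -> BiLSTM -> Dense architecture (sizes are illustrative).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Bidirectional, LSTM, Dense, Dropout)

vocab_size, max_len = 20000, 64   # assumption: taken from the fitted tokenizer

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, 128),
    Conv1D(64, kernel_size=3, activation="relu"),                   # local feature extraction
    MaxPooling1D(pool_size=2),
    Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)),   # temporal dependencies
    Dense(256, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),                                  # binary HOF / NOT output
])
model.summary()
]]></preformat>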
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Transformers based approach</title>
        <p>After applying preprocessing techniques such as removing URLs, hashtags, and usernames,
we employed various Transformer-based approaches [16], using pre-trained models and their
tokenizers.</p>
        <p>Bert-Base-Multilingual-Uncased:</p>
        <p>Bert-Base-Multilingual-Uncased is a model [17] pre-trained on the top 102 languages with
the largest Wikipedias using a masked language modeling objective. Uncased means all the
words are converted into lowercase; this model does not distinguish between ‘english’
and ‘English’.</p>
        <p>Assamese-Bert:</p>
        <p>It is a BERT [18] model pre-trained on publicly available Assamese Monolingual datasets.
This is used to classify for the Assamese Language task.</p>
        <p>Bengali-Bert:</p>
        <p>It is a BERT [18] model pre-trained on publicly available Bengali monolingual datasets. This
is used to classify for the Bengali Language task.</p>
        <p>Bengali-Abusive-Muril:</p>
        <p>
          This is a MuRIL [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] model trained on a Bengali abusive speech dataset with a learning rate of 2e-5. We fine-tuned
it on our Bengali dataset to obtain better results.
        </p>
        <p>XLM-Roberta:</p>
        <p>XLM-Roberta [19] is a model pre-trained on 2.5TB of filtered CommonCrawl data containing
100 languages.</p>
        <p>We fine-tuned the model on the given dataset and then applied hyperparameter tuning, settling on a
batch size of 8 for accuracy, as very low and very high batch sizes affect the training. Further, we
incorporated early stopping and exponential learning rate decay with an initial learning rate of
1e-5 to fine-tune the above model for our task.</p>
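        <p>A minimal PyTorch sketch of this fine-tuning setup is given below; it is not our exact training script, and the model name, epoch count, and loader construction are assumptions for illustration:</p>
        <preformat><![CDATA[
# Sketch: fine-tuning a pre-trained transformer for binary HOF/NOT classification.
import torch
from torch.optim.lr_scheduler import ExponentialLR
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "ai4bharat/indic-bert"   # e.g. IndicBERT; swap for MuRIL, XLM-R, etc.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

def make_loader(texts, labels, batch_size=8, shuffle=True):
    enc = tokenizer(list(texts), truncation=True, padding=True, max_length=128, return_tensors="pt")
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(list(labels)))
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle)

def fine_tune(train_loader, val_loader, epochs=10, patience=2):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    scheduler = ExponentialLR(optimizer, gamma=0.9)       # exponential learning-rate decay
    best_loss, bad_epochs = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))
            out.loss.backward()
            optimizer.step()
        scheduler.step()
        model.eval()                                      # validation loss drives early stopping
        val_loss = 0.0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                out = model(input_ids=input_ids.to(device),
                            attention_mask=attention_mask.to(device),
                            labels=labels.to(device))
                val_loss += out.loss.item()
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping
                break
]]></preformat>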
        <p>Indic-Bert:</p>
        <p>
          IndicBERT[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a language model designed specifically for languages spoken in the Indian
subcontinent, such as Hindi, Bengali, Tamil, and others. It underwent pre-training using a
massive corpus of 9 billion tokens and has been evaluated across various tasks. It is noteworthy
that despite having significantly fewer parameters than models like m-BERT and XLM-Roberta,
IndicBERT achieved top-tier performance in multiple tasks.
        </p>
        <p>Distil-Bert:</p>
        <p>DistilBERT [20] is a smaller, faster version of the BERT language model used
for understanding and processing human language. It still performs well but does not require
as much computing power. It learns by distilling BERT's knowledge instead of starting from scratch,
making it well suited for tasks such as text classification or named entity recognition.</p>
        <p>Table 6 shows the performance of Transformers on Assamese, Bengali and Bodo.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental Setup</title>
      <p>Changing parameters in deep learning is a crucial part of the model development process.
Before training the model, we split the training data into two parts, with 20% of the data as a test set
and 80% as a training set, using stratified sampling. We used the built-in CUDA GPU for our
models whenever it was available; otherwise, we used the CPU.</p>
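      <p>A small sketch of this split and device selection, assuming the training data is loaded into a pandas DataFrame df with text and label columns, is shown below:</p>
      <preformat><![CDATA[
# Sketch: stratified 80/20 split and GPU/CPU selection.
import torch
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

device = "cuda" if torch.cuda.is_available() else "cpu"
]]></preformat>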
      <p>Hyperparameters are settings that are not learned during training but are instead configured
before training begins. They have a significant impact on the model’s performance and training
process. Here are some of the parameters we used to enhance the model’s performance:
1. Learning Rate: The learning rate determines the size of the updates made to the
model’s parameters during training. It influences how quickly or slowly a neural network
learns. In this research, we have used an exponential learning rate schedule4, taking the initial value
as 1e-5 and gamma as 0.9. Gamma here refers to the hyperparameter that controls the
rate at which the learning rate decreases over time.
2. Batch Size: Batch size refers to the number of training examples used in each training
iteration. A large batch size can lead to faster training but may require more memory.
Hence, we changed the batch size based on memory usage. We took the batch size as 96
for Assamese and Bengali languages and as 64 for the Bodo corpus. It is recommended to
choose a number that is a multiple of 32.
3. Number of Epochs: This refers to the number of times the model processes the entire
training dataset. If the number of epochs is too small, the model may underfit;
if it is too large, the model may overfit the given
data. To handle this issue, we implemented an early stopping5 method while training
the model. Early stopping helps in finding a balance between model complexity and
generalization. We chose the patience value as 2 for BERT-based approaches and 3 for
neural network approaches. This value determines the number of training epochs the
model should continue to train without seeing an improvement in the chosen validation
metric before the early stopping mechanism is triggered. During this process, we found
that early stopping is triggered at an epoch count of 6 for the LSTM model. When we
implemented it with BERT models, early stopping triggered differently for different BERT
models. Table 7 below lists the models and their corresponding epoch numbers.
4. Network Depth: This refers to the number of layers in a neural network. Deeper
networks can capture more complex patterns but may be prone to overfitting if not
regularized properly. While working with neural network approaches like LSTM, BiLSTM,
and BiLSTM with CNN-1D, we took a network depth of 4.
5. Number of Neurons: While working with the neural networks, we set the input length
as the number of neurons in the first input layer. In the layer where we applied RNN-based
networks, we set the number of neurons as 128. In the third layer, where we applied a
dense layer, we set it as 256. For the final layer, we set it as 1 because it determines the
model’s outcome. It is recommended to choose a number that is a multiple of 32.
6. Dropout Rate: This gives us the probability of dropping a neuron from the neural
network during training, which helps prevent overfitting. While working with recurrent
neural networks, we used two dropout rates. One being the regular dropout, which we
applied to the inputs and/or the outputs. Another being the recurrent dropout, which
removes the connections between the recurrent units. During our work, we set both of
these values as 0.2.
7. Optimizers: We utilized the Adam and AdamW optimizers while working with the model.
Adam combines the advantages of two other popular optimization algorithms,
stochastic gradient descent (SGD) and RMSprop, whereas AdamW incorporates weight
decay directly into the optimization process, which helps prevent overfitting.
8. Activation Functions: While working with neural networks, we used the ReLU activation
function for every layer except the last layer, where we used the sigmoid activation function.
The purpose is to deal with the vanishing and exploding gradient problems.
The vanishing gradient problem occurs when the gradients of the loss function diminish
significantly with respect to the model’s parameters as they are propagated backwards
through the layers of a deep neural network during training. The exploding gradient
problem is the opposite of the vanishing gradient problem: it occurs when gradients grow
excessively large during backpropagation, often to the point where they overflow or cause
numerical instability in the training process.</p>
      <p>4https://keras.io/api/optimizers/learning_rate_schedules/exponential_decay/</p>
      <p>5https://keras.io/api/callbacks/early_stopping/</p>
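      <p>The sketch below ties the hyperparameters listed above into a Keras training configuration for the neural-network models (exponential decay from 1e-5 with gamma 0.9, early stopping with patience 3, batch size 64); the decay_steps value and the data variables are assumptions, and the model comes from the architecture sketch in Section 5.3.4.</p>
      <preformat><![CDATA[
# Sketch: training configuration for the neural-network models.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

lr_schedule = ExponentialDecay(initial_learning_rate=1e-5,
                               decay_steps=1000,   # assumption: steps per decay
                               decay_rate=0.9)
model.compile(optimizer=Adam(learning_rate=lr_schedule),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=30, batch_size=64, callbacks=[early_stop])
]]></preformat>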
    </sec>
    <sec id="sec-7">
      <title>7. Results and Conclusion</title>
      <p>
        The methodologies discussed in Section 5 and the comparison models are used to evaluate
the performance of the three languages, namely Assamese, Bengali, and Bodo. Based on our
research, we observed the following:
1. Notably, for Assamese and Bengali, BERT-based models exhibit the highest F1 scores,
suggesting their efficacy in these contexts. Specifically, IndicBERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] emerges as the
top-performing model for the Assamese corpus, while Bengali MuRIL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrates
superior performance for the Bengali corpus. These results underscore the effectiveness
of leveraging pre-trained BERT models that cater to languages commonly spoken in the
northeastern region of India, where Assamese and Bengali are prevalent.
2. In contrast, when evaluating the Bodo language, which is characterized by limited
linguistic resources, we observed that a Neural Network-based approach outperforms other
methodologies. Among the neural network architectures tested, the BiLSTM with an
additional Dense layer yields the highest F1 score for Bodo. This result highlights the
adaptability of neural network models for low-resource languages like Bodo, where
dedicated pre-trained models may be scarce. Table 8 displays more details.
3. A noteworthy observation from our research is the advantage of leveraging specialized
BERT models pre-trained on languages such as Assamese and Bengali. These models
are tailored to the linguistic nuances and characteristics of these languages, particularly
relevant in the context of the northeastern region of India. Our findings demonstrate that
utilizing these specialized models led to superior performance for Assamese and Bengali,
showcasing the significance of language-specific pre-training in NLP tasks.
4. We encountered unique challenges when working with Bodo, a low-resource language in
India. Unlike Assamese and Bengali, which benefit from pre-trained BERT models, Bodo
lacks dedicated pre-trained models. Consequently, our research favors neural
network-based methodologies for Bodo, as they outperform BERT models in this particular context.
While it is possible to adapt existing BERT models to the Devanagari script used in Bodo,
our results indicate that these adaptations may not match the performance achieved
through neural network-based approaches.
5. Apart from the given models, we are also experimenting with large transformer models
[21] within the BERT family. However, due to the relatively small size of our dataset,
these models tend to overfit during training.
      </p>
      <p>In summary, our research highlights the importance of tailoring NLP methodologies to the
linguistic characteristics and available resources of specific languages. While BERT-based
models excel in well-resourced languages, low-resource languages like Bodo may benefit more
from neural network-based approaches. Additionally, utilizing specialized pre-trained models
for languages like Assamese and Bengali can significantly enhance performance, but it’s crucial
to consider the limitations posed by dataset size, especially when working with large transformer
models. These findings contribute valuable insights into optimizing NLP approaches
for diverse linguistic contexts. We secured the 11th, 7th, and 11th positions in the Assamese, Bengali,
and Bodo languages, respectively.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>The authors would like to convey their sincere thanks to Clint Pazhayidam George,
Vigneshwaran Shankaran, and Rajesh Sharma for helping us with our research. Throughout our research
journey, we have been deeply inspired by their unwavering commitment to our project and our
individual development. Their knowledge and teaching methods have made a big difference in
our research journey.</p>
      <p>We would also like to extend our heartfelt gratitude to Koyel Ghosh and her dedicated team
for their exceptional organization of HASOC 2023. Throughout our research, they exhibited
remarkable support and approachability, which greatly contributed to the success of our work.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[6] S. Masud, M. A. Khan, M. S. Akhtar, T. Chakraborty, Overview of the HASOC Subtrack
at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span
Detection, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation,
CEUR, 2023.
[7] K. Ghosh, A. Senapati, A. S. Pal, Annihilate Hates (Task 4, HASOC 2023): Hate Speech
Detection in Assamese, Bengali, and Bodo languages, in: Working Notes of FIRE 2023
Forum for Information Retrieval Evaluation, CEUR, 2023.
[8] T. Ranasinghe, K. Ghosh, A. S. Pal, A. Senapati, A. E. Dmonte, M. Zampieri, S. Modha,
S. Satapara, Overview of the HASOC subtracks at FIRE 2023: Hate speech and offensive
content identification in assamese, bengali, bodo, gujarati and sinhala, in: Proceedings of
the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023,
Goa, India. December 15-18, 2023, ACM, 2023.
[9] K. Ghosh, A. Senapati, U. Garain, Baseline bert models for conversational hate speech
detection in code-mixed tweets utilizing data augmentation and offensive language
identification in marathi, in: Fire, 2022. URL: https://api.semanticscholar.org/CorpusID:259123570.
[10] K. Ghosh, D. A. Senapati, Hate speech detection: a comparison of mono and multilingual
transformer model with cross-language evaluation, in: Proceedings of the 36th Pacific Asia
Conference on Language, Information and Computation, De La Salle University, Manila,
Philippines, 2022, pp. 853–865. URL: https://aclanthology.org/2022.paclic-1.94.
[11] K. Ghosh, D. Sonowal, A. Basumatary, B. Gogoi, A. Senapati, Transformer-based hate
speech detection in assamese, in: 2023 IEEE Guwahati Subsection Conference (GCON),
2023, pp. 1–5. doi:10.1109/GCON58516.2023.10183497.
[12] S. Mundra, N. Mittal, Cmhe-an: Code mixed hybrid embedding based attention network
for aggression identification in hindi english code-mixed text, Multimedia Tools and
Applications 82 (2023) 11337–11364. doi:10.1007/s11042-022-13668-4.
[13] I. Bhat, V. Mujadia, A. Tammewar, R. Bhat, M. Shrivastava, Iiit-h system submission for
fire 2014 shared task on transliterated search, in: Proceedings of the [conference name],
2015. doi:10.1145/2824864.2824872.
[14] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for
sentiment analysis of hindi-english code mixed text, in: Proceedings of COLING 2016, the
26th International conference on computational linguistics: Technical papers, The COLING
2016 Organizing Committee, 2016, pp. 2482–2491. URL: https://aclanthology.org/C16-1234.
[15] P. Kapil, A. Ekbal, D. Das, Investigating deep learning approaches for hate speech detection
in social media, 2020.
[16] M. Das, S. Banerjee, P. Saha, A. Mukherjee, Hate speech and offensive language detection
in Bengali, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the
Association for Computational Linguistics and the 12th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), Association for Computational
Linguistics, Online only, 2022, pp. 286–296. URL: https://aclanthology.org/2022.aacl-main.
23.
[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
org/abs/1810.04805. arXiv:1810.04805.
[18] R. Joshi, L3cube-hindbert and devbert: Pre-trained bert transformer models for devanagari
based hindi and marathi languages, arXiv preprint arXiv:2211.11418 (2022).
[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation
learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116.
arXiv:1911.02116.
[20] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kakwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Golla</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. N.C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Khapra</surname>
          </string-name>
          , P. Kumar, IndicNLPSuite: Monolingual Corpora,
          <article-title>Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages</article-title>
          , in: Findings of EMNLP,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Data bootstrapping approaches to improve low resource abusive language detection for indic languages</article-title>
          ,
          <source>arXiv preprint arXiv:2204.12543</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Dmonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pandya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the hasoc subtrack at fire 2023: Hatespeech identification in sinhala and gujarati</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation, Goa, India</article-title>
          .
          <source>December 15-18</source>
          ,
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pandya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <article-title>Overview of the hasoc subtrack at fire 2023: Identification of conversational hate-speech, in:</article-title>
          K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation, Goa, India</article-title>
          .
          <source>December 15-18</source>
          ,
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Masud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the HASOC subtracks at FIRE 2023: Detection of hate spans and conversational hate-speech, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <string-name>
            <surname>FIRE</surname>
          </string-name>
          <year>2023</year>
          , Goa,
          <source>India. December 15-18</source>
          ,
          <year>2023</year>
          , ACM,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>