<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation (FIRE 2023), December 15-18, 2023, Goa, India</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Krithik Sathya</string-name>
          <email>krithik2110693@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K.H. Gopalakrishnan</string-name>
          <email>gopalakrishnan2110375@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manickam PA</string-name>
          <email>manickam2110305@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prabavathy Balasundaram</string-name>
          <email>prabavathyb@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty, Department of Computer Science, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hate Speech Detection</institution>
          ,
          <addr-line>Ofensive Language Identification, BERT Models, Text Classification, Multilingual</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>UG Student, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>This study, conducted by the “Krispy Mango” research team, focuses on hate speech and offensive content detection in two low-resource Indo-Aryan languages, Sinhala and Gujarati, as part of the HASOC 2023 shared tasks. We address the difficulty of classifying tweets into Hate and Offensive (HOF) and Non-Hate and Offensive (NOT) categories by fine-tuning BERT models. This work presents findings in the form of macro F1 scores and precision metrics for both languages. Our approach aims to advance the state of the art in detecting hate speech while taking into account the particular linguistic characteristics and resource restrictions of these languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech Detection</kwd>
        <kwd>Offensive Language Identification</kwd>
        <kwd>BERT Models</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Multilingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The digital age has revolutionized the way we communicate and connect with one another,
primarily through the widespread adoption of social media platforms. However, this
unprecedented level of global interconnectivity has also brought about a concerning surge in hate speech
and offensive content. Effectively addressing this challenge and developing robust methods for
detecting and countering hate speech have become imperative. This research aims to contribute
significantly to this effort by focusing on hate speech detection in two South Asian languages,
Sinhala and Gujarati.</p>
      <p>
        While hate speech detection in English has received substantial attention, these languages
have received comparatively less consideration in the realm of Natural Language Processing
(NLP). Hate speech is a pervasive issue that crosses linguistic boundaries, emphasizing the
importance of developing models that can identify such content in non-English languages as
effectively as in English. Sinhala, spoken in Sri Lanka, and Gujarati, a major Indian language, pose
unique linguistic challenges due to their complex structures and distinct scripts. Detecting hate
speech in these languages demands tailored approaches and models capable of understanding
their intricacies. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
      </p>
      <p>
        The following is the structure of this research paper: We start by providing a thorough
explanation of the Sinhala and Gujarati task configuration for HASOC 2023. We next dive into
our experimental methodology, which employs pre-trained BERT models fine-tuned on the
available training data. In order to make the most of the restricted linguistic resources, we
investigate the transferability of models across linguistic boundaries. Finally, we highlight our
research’s possible effects on reducing online hate speech and harmful language in different
linguistic communities as we examine our findings. Our work advances knowledge of hate
speech identification in low-resource language circumstances by merging ideas from two
different languages. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Works</title>
      <p>
        In recent research by Dhananjaya et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the effectiveness of pre-trained language models for
Sinhala text classification was explored. Among these models, XLM-R emerged as the most
potent choice. The study introduced RoBERTa-based monolingual Sinhala models, establishing
strong baselines, even in the presence of limited labelled data. Additionally, this research made
significant contributions by releasing annotated datasets, providing valuable resources for future
studies in Sinhala text classification.
      </p>
      <p>
        Chiorrini et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] conducted a study focusing on the applicability of Bidirectional Encoder
Representations from Transformers (BERT) models for sentiment analysis and emotion
recognition in Twitter data. Through the development and fine-tuning of two classifiers for each
task, they achieved remarkable results, with BERT-based models achieving accuracy rates of
92 per cent for sentiment analysis and 90 per cent for emotion recognition. These findings
underscored BERT’s proficiency in modelling language for text classification within the realm
of social media data.
      </p>
      <p>
        Tiwari et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] directed their efforts towards addressing challenges in hate speech recognition
within the context of social media platforms. They conducted a comparative analysis of various
machine learning algorithms, emphasizing accuracy and precision metrics. Their findings
identified the combination of XGBoost and TF-IDF embedding as the highest-performing
approach, achieving an accuracy rate of 94.43 per cent. This research emphasized the critical
role of hate speech detection in promoting user safety and compliance with laws addressing
offensive content.
      </p>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] ofered a comprehensive retrospective on the evolution of text classification,
spanning traditional shallow learning techniques to deep learning models. Their meticulous
examination of six pivotal methods, including ReNN, MLP, RNN, CNN, Attention, and
Transformer, highlighted their respective strengths and limitations. The paper underscored the
dominance of deep learning models in text classification and highlighted ongoing research in
attention mechanisms, Transformers, robustness, and graph neural networks, indicating the
continuous evolution of text classification solutions.
      </p>
      <p>
        Ding et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced an innovative approach, Hypergraph Attention Networks (HANs),
for inductive text classification. With a focus on efficiency and performance enhancement, HANs
harnessed hypergraph structures to capture intricate word relationships within textual data.
By utilizing sparse hypergraphs, this method effectively managed computational complexity,
showcasing its scalability for extensive datasets. Experimental results underscored HANs’
superiority over existing techniques, demonstrating their potential for proficient inductive text
classification while efficiently utilizing computational resources.
      </p>
      <p>
        Minaee et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] conducted an extensive review comparing deep learning models to classical
machine learning in text classification tasks such as sentiment analysis and news categorization.
They evaluated over 150 recent deep learning-based text classification models, providing insights
into technical innovations and strengths. The paper also analyzed the performance of these
models on benchmark datasets, supporting their effectiveness with empirical evidence. It
concluded by outlining potential avenues for future research, serving as a valuable resource for
understanding the current landscape and future potential of deep learning in text classification.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Task and Dataset Description</title>
      <sec id="sec-4-1">
        <title>3.1. Sub Task: Identifying Hate, offensive and profane content in Sinhala</title>
        <p>
          The task focuses on categorizing tweets published in Sinhala in a binary form. The two
classification categories are as follows: 1. Hate and Offensive (HOF): Tweets that target people
or groups based on attributes like race, religion, ethnicity, gender, etc. are included in this
category. They may also use profanity or other unpleasant language. 2. Non-Hate and Offensive
(NOT): Tweets falling under this category do not contain any offensive language, profanity, or
hate speech. They represent neutral or non-harmful expressions in the Sinhala language. The
train/test sets are based on the recently released SOLD: Sinhala Offensive Language
Dataset. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
        </p>
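        <p>For concreteness, the following minimal sketch (the file name is hypothetical, and the column names are assumptions based on the description in Section 5.1) shows how such a training split could be loaded and how the two class labels could be mapped to integers.</p>
        <preformat>
# Sketch: loading a training split and mapping HOF/NOT to integer labels.
# The file name and column names are assumptions, not the official release format.
import pandas as pd

train_df = pd.read_csv("sinhala_train.csv")   # assumed columns: post_id, tweets, labels
label_map = {"NOT": 0, "HOF": 1}              # binary scheme described in this section
train_df["labels"] = train_df["labels"].map(label_map)
print(train_df["labels"].value_counts())      # class distribution of the training data
        </preformat>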
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Sub Task: Identifying Hate, offensive and profane content in Gujarati</title>
        <p>The task focuses on categorizing tweets published in Gujarati in a binary form. The two
classification categories are as follows: 1. Hate and Offensive (HOF): Tweets that target people
or groups based on attributes like race, religion, ethnicity, gender, etc. are included in this
category. They may also use profanity or other unpleasant language. 2. Non-Hate and Offensive
(NOT): Tweets falling under this category do not contain any offensive language, profanity, or
hate speech. They represent neutral or non-harmful expressions in the Gujarati language.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodologies used</title>
      <p>Different NLP architectures, such as xlm-roberta-base, bert-base-multilingual-cased,
intfloat/multilingual-e5-base, openai/whisper-large, keshan/SinhalaBERTo, and Gujarati-bert, were employed for identifying hate,
offensive, and profane content in tweets in Gujarati and Sinhala.</p>
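      <p>As an illustration only (a minimal sketch, not the exact experiment code), the text-encoder checkpoints named above can be loaded by name with the Hugging Face transformers library; each one ships its own tokenizer and therefore splits the same tweet into different subword tokens. The sample string is a placeholder.</p>
      <preformat>
# Sketch: loading the tokenizers of the checkpoints listed above and comparing
# how each one segments the same (placeholder) text.
from transformers import AutoTokenizer

checkpoints = [
    "xlm-roberta-base",
    "bert-base-multilingual-cased",
    "intfloat/multilingual-e5-base",
    "keshan/SinhalaBERTo",
]

sample = "sample tweet text"
for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sample)[:10])   # first few subword tokens
      </preformat>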
      <sec id="sec-5-1">
        <title>4.1. Basic BERT Architecture</title>
        <p>
          The BERT model, an acronym for “Bidirectional Encoder Representations from Transformers,”
is grounded in the transformer architecture, emphasizing attention mechanisms. Comprising a
multi-layer bidirectional transformer encoder, it includes an input layer, multiple hidden layers,
and an output layer. Input sequences undergo initial processing through an embedding layer
before entering the transformer encoder.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
        </p>
        <p>This encoder consists of a stack of uniform layers, each housing two sub-layers: a multi-head
self-attention mechanism and a position-wise fully connected feed-forward network. The
self-attention mechanism enables the model to discern interrelations among input sequence
positions, aiding contextual comprehension.</p>
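        <p>As a hedged illustration of this mechanism (not code from this study), a minimal scaled dot-product self-attention function can be written in PyTorch as follows; the tensor shapes are assumptions chosen for the example.</p>
        <preformat>
# Sketch of scaled dot-product self-attention, the core operation inside each
# encoder layer; tensor shapes here are purely illustrative.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v have shape (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each position's attention over all positions
    return torch.matmul(weights, v)

q = k = v = torch.randn(1, 12, 8, 64)        # 12 heads over a sequence of 8 tokens
out = scaled_dot_product_attention(q, k, v)  # shape: (1, 12, 8, 64)
        </preformat>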
        <p>The position-wise feed-forward network applies two linear transformations, with a ReLU
activation in between, to each sequence element, enabling the model to capture intricate
patterns and interdependencies among input tokens. Importantly, the final hidden state of the
initial token ([CLS]) serves as the holistic sequence representation for classification tasks.</p>
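        <p>To make the use of the [CLS] representation concrete, the following minimal PyTorch sketch (an illustration under assumed settings, not the exact code used in this work) attaches a binary classification head to a pre-trained encoder from the Hugging Face transformers library.</p>
        <preformat>
# Illustrative sketch: a binary (HOF/NOT) classifier on top of the encoder's
# [CLS] representation, using Hugging Face transformers and PyTorch.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertBinaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # Linear layer maps the hidden-size [CLS] vector to two logits (HOF vs. NOT).
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # The first token of the last hidden state is the [CLS] representation.
        cls_vector = outputs.last_hidden_state[:, 0, :]
        return self.classifier(self.dropout(cls_vector))

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertBinaryClassifier()
batch = tokenizer(["sample tweet"], padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, 2)
        </preformat>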
        <p>
          BERT undergoes training through two unsupervised prediction tasks: masked language
modeling and next-sentence prediction. This dual training equips BERT with profound
bidirectional representations, leveraging contextual information from both preceding and subsequent
contexts across all layers. Pre-trained BERT models can then be fine-tuned with an additional
output layer, making them adaptable and potent tools for diverse natural language processing
(NLP) tasks.[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ][
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>4.2. XLM-RoBERTa</title>
        <p>
          The “XLM-RoBERTa” model represents a powerful fusion of two renowned architectures: XLM
(Cross-lingual Language Model) and RoBERTa. This variant excels in multilingual natural
language processing tasks, with an emphasis on cross-lingual understanding. It boasts a
vast parameter count and a deep architecture comprising multiple transformer layers.
XLM-RoBERTa is pre-trained on an extensive corpus encompassing a multitude of languages, allowing
it to comprehend and generate text in a wide array of linguistic contexts. Notably, it does not
differentiate between uppercase and lowercase letters, ensuring robust performance in both
case-sensitive and case-insensitive scenarios. This model’s versatility and cross-lingual capabilities
make it an invaluable asset for researchers and practitioners engaged in multilingual NLP tasks,
ranging from machine translation to document classification. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Bert-base-multilingual-cased</title>
        <p>“BERT-base-multilingual-cased” is a BERT model version designed for multilingual natural
language processing (NLP) applications. Unlike the original “base BERT,” which was trained
exclusively on English text, this variation was trained on a variety of languages. The model
has 12 layers, a hidden size of 768, and 12 attention heads, totalling roughly 177M parameters
owing to its large multilingual vocabulary. The “cased” element denotes that it stores case information in its lexicon,
allowing it to differentiate between uppercase and lowercase letters. This is important for languages
where case sensitivity matters for interpreting context. BERT-base-multilingual-cased is very
useful for multilingual applications since it can efficiently handle many languages, giving it a
versatile solution for tasks needing NLP across different linguistic backgrounds.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.4. intfloat/multilingual-e5-base</title>
        <p>“intfloat/multilingual-e5-base” is a specialized BERT variant developed to address the demands of
multilingual natural language processing. It offers a comprehensive solution for tasks involving
diverse languages and linguistic characteristics. Trained on an extensive multilingual corpus,
this model leverages transformer-based architecture and deep neural networks to facilitate
effective language understanding and generation. Notably, it encompasses a cased vocabulary,
enabling it to preserve case information, which is pivotal in languages where case sensitivity
plays a significant role in semantic interpretation. This variant’s adaptability and multilingual
competence render it a valuable tool for cross-lingual applications such as multilingual document
classification, sentiment analysis, and more.</p>
      </sec>
      <sec id="sec-5-6">
        <title>4.5. OpenAI/Whisper</title>
        <p>“OpenAI/Whisper-Large” is a large-scale automatic speech recognition (ASR) model designed
to transcribe spoken language into text. This model’s capabilities are underpinned by a massive
architecture, extensive pre-training on diverse audio data, and a robust transformer-based
design. It excels in recognizing speech across multiple languages and dialects, making it a
versatile choice for ASR tasks in various linguistic contexts. With its remarkable capacity for
handling large volumes of spoken data and its ability to adapt to distinct accents and acoustic
conditions, OpenAI/Whisper-Large is a valuable asset for applications such as transcription
services, voice assistants, and more, where accurate speech-to-text conversion is paramount.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.6. keshan/SinhalaBERTo</title>
        <p>
          Keshan/SinhalaBERTo is a specialized language model developed to address the unique
challenges posed by the Sinhala language. Sinhala, being a low-resource language, has limited
access to pre-trained language models. Keshan/SinhalaBERTo fills this gap as a slightly smaller
but highly valuable language model. It is trained on the OSCAR Sinhala dedup dataset, making
it a relevant resource for Sinhala natural language processing tasks. The model specifications,
including a vocabulary size of 52,000, max position embeddings of 514, 12 attention heads, 6
hidden layers, and a type vocabulary size of 1, create a robust foundation for Keshan/SinhalaBERTo,
a specialized language model for Sinhala text processing. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>4.7. Gujarati-bert</title>
        <p>“Gujarati BERT” is a modified variant of the BERT model built exclusively for the Gujarati
language, which is widely spoken in the Indian state of Gujarat and other areas. Gujarati BERT
is fine-tuned for Gujarati text, as opposed to the usual “base BERT” model, which is trained
on a wide variety of languages. This allows it to capture the specific linguistic qualities,
script, and context of the Gujarati language more efficiently. Gujarati BERT is particularly
useful for natural language processing tasks in Gujarati, such as text categorization, sentiment
analysis, and named entity recognition, due to this specialization. When compared to the more
general-purpose base BERT model, Gujarati BERT’s domain expertise improves its performance
and applicability in the context of the Gujarati language.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Result Analysis for Sinhala Dataset</title>
      <sec id="sec-6-1">
        <title>5.1. Implementation</title>
        <p>In this section, we present the results of our offensive tweet classification task, employing
five diverse BERT-based models: M1 (XLM-RoBERTa), M2 (Keshan/SinhalaBERTo), M3
(Bert-base-multilingual-cased), M4 (Bert-base-multilingual-uncased), and M5
(intfloat/multilingual-e5-base). These models have distinct linguistic characteristics and tokenization methods, which
contribute to their unique performance.</p>
        <p>Model M1, based on XLM-RoBERTa, exhibits robust performance in classifying offensive
tweets. XLM-RoBERTa’s multilingual competence allows it to handle a wide range of languages
effectively, including Sinhala. Its tokenization strategy considers various linguistic nuances,
and it demonstrates a strong ability to generalize across languages. This model’s adaptability
and pre-training on diverse multilingual data contribute to its high classification accuracy on
the test dataset.</p>
        <p>M2 is powered by Keshan/SinhalaBERTo, which is tailored explicitly for the Sinhala language.
Its tokenizer, optimized for Sinhala text, excels in capturing the language’s unique characteristics.
This model showcases impressive results in classifying offensive tweets, demonstrating the
importance of language-specific models in achieving high accuracy on Sinhala text. M2’s
fine-tuning on Sinhala data contributes to its superior Sinhala text understanding and classification
capabilities.</p>
        <p>Model M3, Bert-base-multilingual-cased, is designed as a versatile, multilingual BERT
variant. Although not optimized exclusively for Sinhala, it manages to handle Sinhala text
effectively due to its extensive multilingual vocabulary. M3’s tokenization, which is akin to
bert-base-cased, successfully translates Sinhala text into subword tokens, allowing it to perform
well in cross-lingual offensive tweet classification.</p>
        <p>M4, Bert-base-multilingual-uncased, shares similarities with M3 but lacks case
sensitivity. Despite this difference, it effectively tokenizes Sinhala text, thanks to its subword
tokenization method and multilingual vocabulary. M4 showcases commendable performance
in the classification task, affirming its suitability for processing Sinhala and other languages
without consideration for letter casing.</p>
        <p>Model M5, intfloat/multilingual-e5-base, is geared towards multilingual natural language
processing tasks. Its subword tokenization and extensive pre-training enable it to handle
Sinhala text with competence. M5 exhibits competitive results in classifying offensive tweets,
highlighting its adaptability and cross-lingual proficiency.</p>
        <p>
          These tokenized inputs are then used to train and test the models. During the training phase,
hyperparameters such as batch size, number of training epochs, and learning rate must be
specified. To fine-tune the models, appropriate optimization algorithms, such as AdamW, are
used in conjunction with learning rate schedulers. Following the training phase, the models
are tested on a separate 1500-row test dataset with the same column names as the training
data (post id, tweets, labels). During the testing phase, each model’s capacity to generalize and
generate correct predictions on new, unseen data is evaluated. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
        </p>
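        <p>A minimal sketch of such a fine-tuning setup is shown below, assuming the Hugging Face transformers and datasets libraries; the batch size, number of epochs, and learning rate are placeholder values, not the tuned hyperparameters of this study.</p>
        <preformat>
# Sketch of a typical fine-tuning run; hyperparameter values are placeholders,
# not the exact settings used in this work. Trainer uses AdamW by default.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "intfloat/multilingual-e5-base"   # any of models M1-M5
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny stand-in for the real training split (columns: post_id, tweets, labels).
raw = Dataset.from_dict({"tweets": ["sample tweet one", "sample tweet two"],
                         "labels": [0, 1]})

def tokenize(batch):
    return tokenizer(batch["tweets"], truncation=True, padding="max_length", max_length=128)

train_dataset = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="hasoc-finetune",
    per_device_train_batch_size=16,   # batch size (assumed)
    num_train_epochs=3,               # number of epochs (assumed)
    learning_rate=2e-5,               # peak AdamW learning rate (assumed)
    lr_scheduler_type="linear",       # learning-rate scheduler
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
# The separate 1500-row test set would be tokenized the same way and passed
# to trainer.predict(...) to obtain label predictions.
        </preformat>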
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Results and discussion</title>
        <p>To categorize text data for hate speech detection in Sinhala, the models M1, M2, M3, M4,
and M5 were used. To examine the performance of these models, evaluation metrics such as
Macro-F1, Macro-Precision, and Macro Recall were generated. These metrics provide insight
on the model’s ability to reliably identify and predict instances of hate speech in Sinhala text
data. After examining the findings for these assessment measures in Table 3, it is clear that M5
outperforms all other models.</p>
        <p>Macro-F1 was used to assess these models because it combines precision and recall into a
single score, offering a balanced estimate of the model’s capacity to reliably categorize instances
of hate speech. M5, with a stellar Macro-F1 score of 0.8371, demonstrated its proficiency in
correctly identifying hate speech within the Sinhala text data, outperforming the other models.
A higher Macro-F1 score suggests superior performance in hate speech detection.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Macro-F1 scores of the models on the Sinhala test set.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Model</th>
                <th>Macro-F1</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>XLM-RoBERTa (M1)</td><td>0.7210</td></tr>
              <tr><td>Keshan/SinhalaBERTo (M2)</td><td>0.6451</td></tr>
              <tr><td>Bert-base-multilingual-cased (M3)</td><td>0.8141</td></tr>
              <tr><td>Bert-base-multilingual-uncased (M4)</td><td>0.7728</td></tr>
              <tr><td>intfloat/multilingual-e5-base (M5)</td><td>0.8371</td></tr>
            </tbody>
          </table>
        </table-wrap>
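        <p>For reference, macro-averaged metrics of this kind can be computed with scikit-learn, as in the following sketch; the label vectors are placeholders, not the actual system outputs.</p>
        <preformat>
# Sketch: computing Macro-F1, Macro-Precision, and Macro-Recall with scikit-learn.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]   # placeholder gold labels (HOF=1, NOT=0)
y_pred = [1, 0, 0, 1, 0, 1]   # placeholder model predictions

macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
print(f"Macro-F1={macro_f1:.4f}  P={macro_precision:.4f}  R={macro_recall:.4f}")
        </preformat>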
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Result Analysis for Gujarati</title>
      <sec id="sec-7-1">
        <title>6.1. Implementation</title>
        <p>In this section, we present the results of our offensive tweet classification task for the
Gujarati language, utilizing five distinct BERT-based models: M1 (XLM-RoBERTa), M2
(bert-base-multilingual-cased), M3 (bert-base-multilingual-uncased), M4 (OpenAI/Whisper-Large), and
M5 (Gujarati BERT). These models vary in terms of their linguistic capabilities and tokenization
methods, which influence their performance on the Gujarati dataset.</p>
        <p>Model M1, based on XLM-RoBERTa, demonstrates strong performance in classifying
offensive tweets in Gujarati. XLM-RoBERTa’s multilingual capabilities allow it to handle a
wide range of languages, including Gujarati, effectively. Its tokenization strategy considers
linguistic nuances, and the model exhibits a robust ability to generalize across languages. M1’s
adaptability and pre-training on diverse multilingual data contribute to its high classification
accuracy on the test dataset.</p>
        <p>M2, which utilises bert-base-multilingual-cased, is a versatile, multilingual BERT variant.
Although not optimized exclusively for Gujarati, it effectively handles Gujarati text due to its
extensive multilingual vocabulary. M2’s tokenization method successfully translates Gujarati text
into subword tokens, enabling it to perform well in cross-lingual offensive tweet classification.</p>
        <p>Model M3 is bert-base-multilingual-uncased, which shares similarities with M2 but lacks case
sensitivity. Despite this difference, it effectively tokenizes Gujarati text, thanks to its subword
tokenization method and multilingual vocabulary. M3 showcases commendable performance
in the classification task, affirming its suitability for processing Gujarati and other languages
without consideration for letter casing.</p>
        <p>M4, powered by OpenAI/Whisper-Large, is designed for large-scale automatic speech
recognition (ASR). While not specifically tailored for text classification, its robust architecture
allows it to capture spoken Gujarati language effectively. This model showcases competitive
results in the offensive tweet classification task, demonstrating its adaptability beyond ASR,
especially in tasks involving Gujarati text.</p>
        <p>Model M5, Gujarati BERT, is a specialized variant designed explicitly for the Gujarati
language. Its tokenizer is tailored to handle Gujarati text’s unique characteristics effectively. M5
demonstrates impressive results in classifying offensive tweets, emphasizing the importance of
language-specific models in achieving high accuracy on Gujarati text. Its fine-tuning on Gujarati
data contributes to its superior Gujarati text understanding and classification capabilities. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]</p>
        <p>These tokenized inputs are then used to train and test the models. During the training phase,
hyperparameters such as batch size, number of training epochs, and learning rate must be
specified. To fine-tune the models, appropriate optimization algorithms, such as AdamW, are
used in conjunction with learning rate schedulers. Following the training phase, the models
are tested on a separate 1500-row test dataset with the same column names as the training
data (post id, tweets, labels). During the testing phase, each model’s capacity to generalize and
generate correct predictions on new, unseen data is evaluated.</p>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Results and discussion</title>
        <p>To categorize text data for hate speech detection in Gujarati, the models M1, M2, M3, M4,
and M5 were used. To examine the performance of these models, evaluation metrics such as
Macro-F1, Macro-Precision, and Macro Recall were generated. These metrics provide insight
on the model’s ability to reliably identify and predict instances of hate speech in Gujarati text
data. After examining the findings for these assessment measures, it is clear that M2
outperforms all other models.</p>
        <p>Macro-F1 was used to assess these models because it combines precision and recall into a
single score, offering a balanced estimate of the model’s capacity to reliably categorize instances
of hate speech. M2, with a stellar Macro-F1 score of 0.7956, demonstrated its proficiency in
correctly identifying hate speech within the Gujarati text data, outperforming the other models.
A higher Macro-F1 score suggests superior performance in hate speech detection.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>In this study, we evaluated the efficacy of multiple BERT-based models for detecting hate speech
and abusive language in Sinhala and Gujarati tweets. The intfloat/multilingual-e5-base model earned the highest Macro-F1 score of 0.8371
for detecting hateful content in Sinhala tweets. The bert-base-multilingual-cased model with
preprocessing steps performed best for the Gujarati data, with a Macro-F1 score of 0.7956.</p>
      <p>Overall, the findings suggest that multilingual models outperform low-resource,
language-specific models in terms of F1 scores. This performance advantage can be attributed to their
access to a larger and more diverse dataset. Multilingual models are trained on text from a
wide range of languages, which inherently provides a richer linguistic context and a broader
spectrum of language patterns. This diversity allows them to capture cross-lingual insights
and generalize better across various languages, including low-resource ones. In contrast,
low-resource language-specific models, with limited training data, struggle to grasp the full
language complexity. Their effectiveness is hindered by data scarcity, limiting their ability to
adapt to nuances and context. Higher F1 scores of multilingual models emphasize the advantage
of diverse training data. This highlights the significance of data availability, especially for
low-resource languages. It underscores the potential for further advancements to enhance
language-specific models in the future.</p>
      <p>This study adds to the development of automated approaches for moderating social media in
underserved languages such as Sinhala and Gujarati, while also encouraging inclusive online
debates. The models and datasets presented in this paper can also serve as valuable resources
for future NLP research on these languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Di Fátima</surname>
          </string-name>
          ,
          <article-title>Hate speech on social media: A global approach</article-title>
          (
          <year>2023</year>
          ). doi:10.25768/654-916-9.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Dmonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pandya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the hasoc subtrack at fire 2023: Hatespeech identification in sinhala and gujarati</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation, Goa, India</article-title>
          .
          <source>December 15-18</source>
          ,
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. O'Riordan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Murthy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Durairaj</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Mandl</surname>
          </string-name>
          , et al.,
          <article-title>Speech and Language Technologies for Low-Resource Languages: First International Conference</article-title>
          ,
          <source>SPELLL</source>
          <year>2022</year>
          , Kalavakkam, India,
          <source>November 23-25</source>
          ,
          <year>2022</year>
          , Proceedings, Springer Nature,
          <year>2023</year>
          . doi:10.1007/978-3-031-33231-9.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Dhananjaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Demotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranathunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jayasena</surname>
          </string-name>
          ,
          <article-title>Bertifying sinhala-a comprehensive analysis of pre-trained language models for sinhala text classification</article-title>
          ,
          <source>arXiv preprint arXiv:2208.07864</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiorrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Diamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mircoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potena</surname>
          </string-name>
          ,
          <article-title>Emotion and sentiment analysis of tweets using bert</article-title>
          ., in: EDBT/ICDT Workshops, volume
          <volume>3</volume>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of different machine learning methods for hate speech recognition in twitter text data</article-title>
          ,
          <source>in: 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1016</fpage>
          -
          <lpage>1020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Be more with less: Hypergraph attention networks for inductive text classification</article-title>
          ,
          <source>arXiv preprint arXiv:2011.00387</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chenaghlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Deep learning-based text classification: a comprehensive review</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          54 (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Anuradha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Premasiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hettiarachchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Uyangodage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Sold:
          <article-title>Sinhala offensive language dataset</article-title>
          ,
          <source>arXiv preprint arXiv:2212.00851</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep learning based text classification methods</article-title>
          ,
          <source>Highlights in Science, Engineering and Technology</source>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>238</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Korde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Mahender</surname>
          </string-name>
          ,
          <article-title>Text classification and classifiers: A survey</article-title>
          ,
          <source>International Journal of Artificial Intelligence &amp; Applications</source>
          <volume>3</volume>
          (
          <year>2012</year>
          )
          <fpage>85</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weerasinghe</surname>
          </string-name>
          , E. Bandara,
          <article-title>Sinhala hate speech detection in social media using machine learning and deep learning</article-title>
          , in: 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE,
          <year>2022</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>A twitter bert approach for offensive language detection in marathi</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10039</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>