<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Hate Speech Detection in Sinhala and Gujarati: Leveraging BERT Models and Linguistic Constraints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>G Gnana Sai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aswath Venkatesh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kishore N</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olirva M</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balaji V A</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prabavathy Balasundaram</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty, Department of Computer Science, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UG Student, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research paper, presented by the SSN_CSE_ML_TEAM, introduces a unified approach to hate speech and offensive language identification in two low-resource Indo-Aryan languages, Sinhala and Gujarati, as part of the HASOC 2023 shared tasks. Leveraging various BERT models, we address the challenge of classifying tweets into Hate and Offensive (HOF) and Non-Hate and Offensive (NOT) categories by fine-tuning the BERT models. Our approach seeks to advance the state of the art in detecting hate speech while considering the unique linguistic characteristics and resource constraints of these languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech Detection</kwd>
        <kwd>Offensive Language Identification</kwd>
        <kwd>BERT Models</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Multilingual NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Online communication platforms have become integral to modern society, enabling diverse
linguistic communities to interact and express their thoughts and opinions. However, these
platforms are not immune to the proliferation of hate speech and offensive language, which
can have severe social and psychological consequences. Addressing this issue is of utmost
importance, and it becomes particularly challenging in low-resource languages where
language-specific resources and models are limited.</p>
      <p>This research paper tackles the problem of hate speech and offensive language detection in
two low-resource Indo-Aryan languages: Sinhala and Gujarati. Sinhala, an official language of
Sri Lanka, and Gujarati, a prominent language in India, each pose unique challenges due to their
linguistic diversity and limited availability of annotated data. In this paper, these challenges are
addressed by employing cutting-edge BERT-based models.</p>
      <p>This research paper revolves around the HASOC 2023 shared task [1], which serves as the
foundation for our investigation. The shared task involves coarse-grained binary classification,
where tweets are categorized into Hate and Offensive (HOF) and Non-Hate and Offensive
(NOT) classes. For both languages, participants are provided with a relatively small training set,
challenging them to develop effective models for hate speech detection.</p>
      <p>This research paper is structured as follows: We begin by presenting a detailed description of
the HASOC 2023 task setup for both Sinhala and Gujarati. We then delve into our experimental
methodology, which leverages pre-trained BERT models fine-tuned on the provided training
data. We explore the transferability of models across languages, aiming to maximize the utility
of limited linguistic resources. Finally, we discuss our findings, highlighting the potential impact
of our research on mitigating online hate speech and offensive language in these linguistic
communities. By combining insights from two distinct languages, our work contributes to a
broader understanding of hate speech detection in low-resource language contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In research [2] conducted by Vinura et al., pre-trained language models for Sinhala text
classification were analyzed, with XLM-R being identified as the most effective model. The study
introduced high-performing RoBERTa-based monolingual Sinhala models, which offered strong
baselines even when there is insufficient labeled data for fine-tuning. It provided valuable usage
recommendations and contributed by releasing annotated datasets for future research in Sinhala
text classification.</p>
      <p>A study [3], conducted by Andrea et al., aimed to assess the suitability of Bidirectional
Encoder Representations from Transformers (BERT) models for sentiment analysis and emotion
recognition in Twitter data. Two classifiers were developed for each task, and the models were
fine-tuned using real-world tweet datasets. The research demonstrated that BERT-based models
achieved high accuracy, with 92% for sentiment analysis and 90% for emotion recognition,
showcasing BERT’s potent language modeling capabilities for text classification in social media
data.</p>
      <p>The problem of hate speech and offensive language has increased due to the internet being
widely used and technical resources being available to most people. If not moderated, it could
lead to severe riots and hate-mongering against minorities. Therefore, Apoorv et al. [4], using
one-vs-rest classification, introduced a system to classify a comment as being normal, hateful,
or offensive, and the communities targeted by it; a total of 18 labels are used, one for the
classification of the comment and the other 17 for the target communities. In addition to the
global accuracy, this research study also provided individual accuracy for each community
being targeted in the one-vs-rest model.</p>
      <p>In the article by Tiwari et al. [5], the focus was on the challenges surrounding hate speech
recognition within the realm of social media platforms, with the ultimate goal of enhancing the
accuracy of machine learning models. Leveraging Twitter datasets, the researchers conducted
a comparative analysis of various machine learning algorithms, considering metrics such
as accuracy and precision. Their findings revealed that the combination of XGBoost and
TF-IDF embedding yielded the highest accuracy at 94.43%. The article stressed the critical
importance of hate speech detection in ensuring user safety and compliance with laws addressing
discriminatory and offensive content.</p>
      <p>This study [6] was dedicated to addressing the challenge of identifying and categorizing
offensive language and hate speech using cutting-edge text classification techniques. To support
the research, the authors curated a custom dataset in the Egyptian Arabic dialect, each entry manually
categorized. They leveraged this dataset to fine-tune and assess various Arabic pre-trained
transformer models that employed different transformer architectures and pre-training
strategies, specifically tailored for the task of natural language processing and text classification. The
results were striking, with an average accuracy of around 96% achieved across all the fine-tuned
transformer models, showcasing their efficacy in combating offensive language and hate speech
on Egyptian social media platforms.</p>
      <p>Ding et al.’s paper [7] introduced an innovative approach known as Hypergraph Attention
Networks (HANs) for inductive text classification, with a strong emphasis on efficiency and
performance enhancement. HANs leverage hypergraph structures to capture intricate
higher-order word relationships within textual data, thereby enriching contextual comprehension.
By harnessing sparse hypergraphs, this method effectively curtails computational complexity,
rendering it highly scalable, especially for extensive datasets. Experimental outcomes underscore
HANs’ superiority over existing techniques, showcasing their potential for proficient inductive
text classification while efficiently utilizing computational resources.</p>
      <p>There was a notable increase in offensive language within the content generated by the
crowd across various social platforms. This type of language had the potential to bully or
harm the sentiments of individuals or communities. Hajibabaee et al. [8] had, at that time,
delved into investigating and developing various supervised methods and training datasets
aimed at automatically detecting or preventing offensive monologues or dialogues. Their
experiments yielded promising results in the detection of offensive language using the dataset
they had collected from Twitter. After hyperparameter optimization, it was found that three
methods—AdaBoost, SVM, and MLP—had achieved the highest average F1-score.</p>
      <p>Korde et al.’s paper [9] underscored text mining’s commercial potential, as about 80% of data
exists in textual form, and unstructured texts offer a rich source of information. The paper
centered on text classification, introducing its concept, processing steps, and various classifiers.
It conducted a comparative analysis based on criteria like time complexity, principal components,
and performance metrics, highlighting text classification’s significance in extracting knowledge
from unstructured data and aiding classifier selection for diverse applications.</p>
      <p>Santhoopa et al.’s paper [10] introduced a deep learning-based approach using a model
incorporating Long Short-Term Memory (LSTM) units and FastText word embeddings for hate
speech detection. Their model was trained on the Sinhala Unicode Hate Speech dataset from
Kaggle, consisting of 6345 Facebook comments, with 54% categorized as hate speech. The
study compared this deep learning model with various machine learning algorithms, and the
proposed model demonstrated superior performance. It employed pre-trained 100-dimensional
FastText word embeddings in its architecture. The model consisted of a Bi-directional LSTM
layer, a Dense layer with Rectified Linear Unit (ReLU) activation, and a Sigmoid layer for binary
classification. The research suggests the potential for retraining the model for hate speech
detection in different languages.</p>
      <p>Tanmay et al.’s paper [11] addressed the task of Offensive Language Identification in the
low-resource Indic language Marathi. They focused on a text classification task aimed at
discerning offensive from non-offensive tweets. The study evaluated various mono-lingual
and multi-lingual BERT models, with a particular focus on those pre-trained with social media
data. The performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT
was compared using the HASOC 2022 test set. Additionally, the paper explored external data
augmentation from other existing Marathi hate speech corpora, HASOC 2021 and
L3CubeMahaHate. Notably, MahaTweetBERT, a BERT model pre-trained on Marathi tweets and
fine-tuned on a combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), achieved superior
performance with an F1 score of 98.43 on the HASOC 2022 test set, establishing a new
state-of-the-art result for HASOC 2022 / MOLD v2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Description</title>
      <p>The task focuses on the binary classification of tweets written in Sinhala and Gujarati. The two
classes for classification are as follows: 1. Hate and Offensive (HOF): Tweets in this category
contain hate speech, profane language, or offensive content targeting individuals or groups based
on characteristics such as race, religion, ethnicity, gender, etc. 2. Non-Hate and Offensive
(NOT): Tweets in this category do not contain any hate speech, profanity, or offensive content.
They represent neutral or non-harmful expressions in the Sinhala and Gujarati languages. The
subsections below discuss the Sinhala and Gujarati datasets.</p>
      <sec id="sec-3-1">
        <title>3.1. Sub Task: Identifying Hate, offensive and profane content in Sinhala</title>
        <p>The training and test datasets for this task are based on the Sinhala Offensive Language
Detection dataset (SOLD) [12]. SOLD is a valuable resource comprising Sinhala tweets labeled for
hate speech and ofensive content. The dataset is designed to facilitate the development and
evaluation of hate speech detection models in the Sinhala language. Participants are encouraged
to use this dataset to train their models and subsequently evaluate them on the provided test
set.</p>
        <p>The task is to classify each tweet according to whether it contains hate speech or not. Each
entry or row in the CSV file follows the format given below in Table 1.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sub Task: Identifying Hate, offensive and profane content in Gujarati</title>
        <p>The training dataset contains approximately 200 labeled tweets in Gujarati, consisting of both
HOF and NOT categories. The evaluation will be conducted on a separate test dataset, ensuring
fair evaluation of participant systems’ performance.</p>
        <p>The task is to classify each tweet according to whether it contains hate speech or not. Each
entry or row in the CSV file follows the format given in Table 2, whose fields are:
tweet_id: the unique id of the tweet;
created_at: the date when the tweet was posted;
text: the content of the tweet;
user_screen_name: the Twitter account name of the tweet creator;
label: the classification of the tweet.</p>
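<p>For illustration, rows in this format can be parsed with Python's standard csv module; the sample below uses invented values purely to show the field layout.</p>

```python
import csv
import io

# A hypothetical two-row sample in the Table 2 format (all values invented).
sample = """tweet_id,created_at,text,user_screen_name,label
101,2023-01-15,example tweet text one,user_a,NOT
102,2023-01-16,example tweet text two,user_b,HOF
"""

# Parse the CSV and keep the two fields needed for classification:
# the tweet text as model input and the HOF/NOT label as the target.
rows = list(csv.DictReader(io.StringIO(sample)))
texts = [row["text"] for row in rows]
labels = [row["label"] for row in rows]

print(texts)
print(labels)
```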
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodologies used</title>
      <p>Different NLP architectures, namely Bert-base-uncased, Bert-base-multilingual-cased,
Sinhala-bert, Gujarati-bert, and Indic-bert, were employed for identifying hate, offensive, and profane
content in tweets written in Gujarati and Sinhala.</p>
      <sec id="sec-4-1">
        <title>4.1. Basic BERT Architecture</title>
        <p>BERT stands for Bidirectional Encoder Representations from Transformers. BERT is based on
the transformer architecture which relies primarily on attention mechanisms. The BERT model
is a multi-layer bidirectional transformer encoder. It consists of an input layer, multiple hidden
layers, and an output layer. The input to BERT is a sequence of tokens that are first passed
through an embedding layer.</p>
        <p>The embedded tokens are then passed to the transformer encoder. The transformer encoder
is made up of a stack of identical layers. Each layer consists of two sub-layers - a multi-head
self-attention mechanism and a position-wise fully connected feed-forward network. The
self-attention mechanism allows the model to learn the relationship between different positions
in the input sequence to understand the context of the words.</p>
        <p>The position-wise feed-forward network applies two linear transformations, with a ReLU
activation in between, to each element of the sequence. This helps the model learn complex patterns
and relationships between words or tokens from the input. The output of each transformer
layer is fed as input to the next layer in the stack. The last hidden state of the first token
(which corresponds to the [CLS] token) is used as the aggregate sequence representation for
classification tasks.</p>
        <p>BERT is trained on two unsupervised prediction tasks - masked language modeling and
next-sentence prediction. This allows BERT to learn deep bidirectional representations by
conditioning on both left and right contexts in all layers. The pre-trained BERT models can
then be fine-tuned with just one additional output layer to create state-of-the-art models for a
wide range of NLP tasks.</p>
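<p>The self-attention step described above can be sketched in a few lines of pure Python. This is an illustrative toy, not BERT's actual implementation: it uses a single head, identity query/key/value projections, and a tiny invented sequence of 2-dimensional embeddings.</p>

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of vectors X.

    For clarity the query/key/value projections are the identity; in BERT
    each is a learned linear layer and there are multiple heads per layer.
    """
    d = len(X[0])
    out = []
    for q in X:
        # Score this position against every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return out

# Toy 3-token sequence with the [CLS] position first.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = self_attention(X)

# The hidden state of the first ([CLS]) position serves as the
# aggregate sequence representation for classification.
cls_repr = H[0]
print(cls_repr)
```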
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Bert-Base-uncased</title>
        <p>"BERT-base-uncased" is a specific configuration of the BERT model. It shares the same
architecture as the original "base BERT" consisting of 12 transformer layers and 110 million parameters.
However, the key distinction lies in its vocabulary and tokenization. The "uncased" variant
uses an all-lowercase vocabulary, which simplifies tokenization and reduces the vocabulary
size compared to the original "base BERT" making it more memory-eficient. This modification
allows BERT-base-uncased to be well-suited for NLP tasks that do not require case
sensitivity while retaining the strong pre-training and generalization capabilities of the BERT model
architecture.
4.3. Bert-base-mutilingual-cased
"BERT-base-multilingual-cased" is a specicfi variant of the BERT model designed for multilingual
natural language processing (NLP) tasks. Unlike the original "base BERT" which is primarily
trained on English text, this variant is trained on a diverse range of languages. The "cased"
aspect indicates that it retains case information in its vocabulary, allowing it to distinguish
between uppercase and lowercase letters. This is important for languages where case sensitivity
plays a crucial role in understanding context. BERT-base-multilingual-cased is particularly
valuable for multilingual applications as it can handle multiple languages efectively, making it
a versatile choice for tasks requiring NLP across diferent linguistic backgrounds.
4.4. Sinhala-bert
"Sinhala BERT" is tailored for the Sinhala language, primarily spoken in Sri Lanka. This
specialized variant captures language-specific nuances, script, and context, enabling more efective
handling of Sinhala. It proves invaluable for various Sinhala natural language processing tasks,
enhancing performance compared to general-purpose BERT models.
4.5. Gujarati-bert
"Gujarati BERT" is designed for the Gujarati language, predominantly spoken in the Indian
state of Gujarat. It excels in capturing unique linguistic characteristics, script, and context,
making it highly valuable for Gujarati text analysis. Its specialization enhances performance
for tasks like text classification, sentiment analysis, and named entity recognition, surpassing
general-purpose BERT models.
4.6. Indic-bert
"Indic BERT" is crafted for the diverse family of Indic languages spoken in the Indian
subcontinent. Fine-tuned for languages like Hindi, Bengali, Tamil, and more, it adeptly captures linguistic
nuances, scripts, and contextual intricacies. Indic BERT proves to be a versatile resource for
various natural language processing tasks across the Indic linguistic landscape, outperforming
general-purpose BERT models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Result Analysis for Sinhala Dataset</title>
      <p>This section discusses the implementation of various NLP techniques for Sinhala and Gujarati
Datasets with the analysis of the results using evaluation metrics, namely Macro-F1,
Macro-Precision, and Macro-Recall.</p>
      <sec id="sec-5-1">
        <title>5.1. Implementation</title>
        <p>The datasets for Sinhala and Gujarati text classification share a similar structure, containing
the columns "post_id", "tweets", and "labels" for Sinhala, and "tweet_id", "created_at", "text",
"user_screen_name", and "label" for Gujarati. Data preparation involves designating the "labels"
or "label" column as the target variable, with the corresponding "tweets" or "text" column
housing the data to be categorized as offensive or not. Both datasets are then divided into an
80% training set and a 20% testing set, facilitating the assessment of model performance for hate
speech detection in these languages.</p>
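<p>The 80/20 split described above can be sketched as follows; the shuffle-then-slice logic mirrors what a library helper such as scikit-learn's train_test_split does, and the ten sample tweets are invented placeholders.</p>

```python
import random

def train_test_split(texts, labels, test_frac=0.2, seed=42):
    """Shuffle the dataset, then slice it into train and test portions."""
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    cut = int(len(idx) * (1 - test_frac))     # 80% boundary
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([texts[i] for i in train_idx], [labels[i] for i in train_idx],
            [texts[i] for i in test_idx], [labels[i] for i in test_idx])

# Ten invented tweets with alternating labels, just to show proportions.
texts = [f"tweet {i}" for i in range(10)]
labels = ["HOF" if i % 2 else "NOT" for i in range(10)]
X_train, y_train, X_test, y_test = train_test_split(texts, labels)
print(len(X_train), len(X_test))  # 8 2
```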
        <p>For the task of classifying Sinhala offensive tweets, five BERT-based models denoted as M1,
M2, M3, M4, and M5 have been selected. These models encompass various BERT architectures
fine-tuned for different languages and tasks. Specifically, M1, M2, M3, M4, and M5 correspond
to Bert-base-cased, Bert-base-multilingual-cased, Sinhala-bert, Bert-base-multilingual-uncased,
and Indic-bert, respectively. Each model’s respective tokenizer is applied to convert tweet text
into suitable numerical representations for BERT-based analysis.</p>
        <p>The models M1, M2, M3, M4, and M5 are employed for classifying Sinhala offensive text,
each with specific attributes. M1, utilizing "bert-base-cased", primarily designed for English but
capable of handling Sinhala, employs WordPiece tokenization for subword units, considering
frequently occurring Sinhala character sequences. M2, known as "bert-base-multilingual-cased",
is a multilingual model, effectively tokenizing Sinhala text for multilingual tasks with a broader
vocabulary. M3, "Sinhala-bert", is tailored specifically for Sinhala, employing a Sinhala-specific
tokenizer for subword tokenization. M4, "bert-base-multilingual-uncased", like M2, handles
subword tokenization but without case diferentiation, suitable for processing Sinhala and other
languages. Lastly, M5, "Indic-bert," designed for the Indian subcontinent, including Sinhala,
uses a subword tokenization technique customized for Indic scripts, optimizing its ability to
handle Sinhala-specific character combinations, linguistic patterns, and word segmentation.</p>
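<p>The WordPiece-style subword tokenization mentioned for these models can be illustrated with a greedy longest-match-first sketch. The tiny vocabulary and English example words below are invented; a real BERT tokenizer works the same way over a vocabulary of tens of thousands of frequently occurring character sequences.</p>

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization, WordPiece-style.

    Continuation pieces carry the '##' prefix used by BERT tokenizers.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # mark non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1                           # try a shorter piece
        if match is None:
            return [unk]                       # nothing matched at all
        pieces.append(match)
        start = end
    return pieces

# A tiny invented vocabulary for demonstration.
vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
print(wordpiece_tokenize("unplayed", vocab))  # ['un', '##play', '##ed']
```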
        <p>For offensive Gujarati tweet classification, five BERT-based models denoted as N1, N2, N3, N4,
and N5 are chosen, and fine-tuned for various languages and tasks. These models correspond
to bert-base-multilingual-uncased, Indic-bert, bert-base-multilingual-cased, Gujarati-bert, and
Gujarati-bert (with preprocessing), respectively. Each model’s specific tokenizer is applied to
convert tweet text into suitable numerical representations for BERT analysis.</p>
        <p>The models N1, N2, N3, N4, and N5 are utilized for the classification of Gujarati offensive
text in a multilingual context. N1, employing "bert-base-multilingual-uncased", is designed
to handle various languages, including Gujarati, utilizing subword tokenization for text
segmentation. N2, known as "Indic-bert", is specialized for Indian subcontinent languages, with a
tailored subword tokenization technique to accurately represent Gujarati text. N3,
"bert-base-multilingual-cased", is a multilingual model that can handle Gujarati and employs subword
tokenization similar to N1 but with a broader vocabulary. N4, "Gujarati-bert", is customized for
the Gujarati language, featuring a dedicated tokenizer and subword tokenization optimized for
Gujarati script characteristics. N5, "Gujarati-bert(with preprocessing)", is an enhanced version
of "Gujarati-bert", incorporating preprocessing steps to improve performance on Gujarati text.
These preprocessing steps encompass text cleaning, normalization, and other language-specific
enhancements.</p>
        <p>These tokenized inputs are then fed into the models for training and testing. The training
process involves specifying hyperparameters like batch size, the number of training epochs, and
learning rate. Additionally, appropriate optimization algorithms, such as AdamW, are employed,
along with learning rate schedulers to fine-tune the models.</p>
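<p>The learning-rate scheduling mentioned above is commonly a linear warmup followed by linear decay when fine-tuning BERT with AdamW. The sketch below implements that schedule in plain Python; the step counts and peak rate are illustrative assumptions, not the exact values used in our runs.</p>

```python
def linear_schedule_with_warmup(step, warmup_steps, total_steps, base_lr):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical settings: a peak rate of 2e-5 with 10% of steps as warmup.
base_lr, warmup, total = 2e-5, 360, 3600
lrs = [linear_schedule_with_warmup(s, warmup, total, base_lr)
       for s in range(total + 1)]
print(lrs[0], max(lrs), lrs[-1])  # starts at 0, peaks at base_lr, ends at 0
```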
        <p>After the training phase, the models are evaluated on separate test datasets, each specific to its
respective language. The Sinhala dataset consists of 1500 rows, maintaining column names such
as post_id, tweets, and labels. This testing phase assesses each model’s ability to generalize and
make accurate predictions on new, unseen Sinhala text. Similarly, the Gujarati dataset contains
40 rows with columns like tweet_id, text, and label. The testing phase for Gujarati evaluates
each model’s performance in making accurate predictions on new and unseen Gujarati tweets,
ensuring a comprehensive assessment of their capabilities in both language contexts.</p>
        <p>The complete implementation of these models, along with the code, can be found on our
GitHub repository at https://github.com/g-sai/HASOC2023-SSN_CSE_ML_TEAM.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results and discussion</title>
        <p>The models M1, M2, M3, M4, and M5 were applied to classify text data for hate speech
detection in Sinhala. To evaluate their performance, we employed evaluation metrics, including
Macro-F1, Macro-Precision, and Macro-Recall. These metrics allow us to assess the model’s
abilities to accurately identify and predict instances of hate speech within Sinhala text. Upon
analyzing the results in Table 3, it becomes evident that M3 achieved the best results among all
the models, with an impressive Macro-F1 score of 0.7946. This score highlights its proficiency in
correctly identifying hate speech within the Sinhala text data, outperforming the other models
and indicating superior performance in hate speech detection.</p>
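<p>For reference, the macro-averaged metrics used throughout this section are computed by scoring each class separately and taking the unweighted mean, which gives HOF and NOT equal weight regardless of class imbalance. The sketch below shows the computation on a handful of invented predictions.</p>

```python
def macro_metrics(y_true, y_pred, classes=("HOF", "NOT")):
    """Return (macro-precision, macro-recall, macro-F1) over the classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    # Macro averaging: unweighted mean of the per-class scores.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Five invented gold/predicted label pairs, purely to illustrate.
y_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]
print(macro_metrics(y_true, y_pred))
```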
        <p>Similarly, models N1, N2, N3, N4, and N5 were utilized to classify the text data for hate speech
detection in Gujarati. We computed evaluation metrics, including Macro-F1, Macro-Precision,
and Macro-Recall, to gauge their performance. After scrutinizing the results in Table 4, it’s clear
that N5 yielded the best results among all the models, achieving an impressive Macro-F1 score
of 0.7732. This score underscores its proficiency in correctly identifying hate speech within
Gujarati text data, outperforming the other models and indicating superior performance in hate
speech detection.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we evaluated the effectiveness of several BERT-based models for detecting hate
speech and offensive language in Sinhala and Gujarati tweets. Our results demonstrated that
the Sinhala-BERT model achieved the highest Macro-F1 score of 0.7946 for identifying hateful
content in Sinhala tweets. For the Gujarati data, the Gujarati-BERT model with preprocessing
steps showed the best performance with a Macro-F1 score of 0.7732. Overall, the findings
suggest that language-specific models that leverage characteristics of the target script can more
accurately classify instances of hateful and offensive content compared to general-purpose
multilingual models. This research contributes to developing automated techniques for moderating
social media in under-resourced languages like Sinhala and Gujarati while promoting inclusive
online discussions. The models and datasets introduced in this study can also serve as valuable
resources for future NLP research on these languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. References</title>
      <p>[1] Shrey Satapara, Hiren Madhu, Tharindu Ranasinghe, Alphaeus Eric Dmonte, Marcos
Zampieri, Pavan Pandya, Nisarg Shah, Sandip Modha, Prasenjit Majumder, and Thomas
Mandl. "Overview of the HASOC Subtrack at FIRE 2023: Hate-Speech Identification in Sinhala
and Gujarati." In Kripabandhu Ghosh, Thomas Mandl, Prasenjit Majumder, and Mandar
Mitra, editors, Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation,
Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[2] Dhananjaya, Vinura, Piyumal Demotte, Surangika Ranathunga, and Sanath Jayasena.
"BERTifying Sinhala – A Comprehensive Analysis of Pre-trained Language Models for
Sinhala Text Classification." arXiv preprint arXiv:2208.07864, August 2022.
[3] Chiorrini, Andrea, Claudia Diamantini, Alex Mircoli, and Domenico Potena. "Emotion
and sentiment analysis of tweets using BERT." March 2021.
[4] A. Aditya, R. Vinod, A. Kumar, I. Bhowmik, and J. Swaminathan. "Classifying Speech into
Offensive and Hate Categories along with Targeted Communities using Machine Learning."
2022 International Conference on Inventive Computation Technologies (ICICT), Nepal,
2022, pp. 291-295, doi: 10.1109/ICICT54344.2022.9850944.
[5] Tiwari, Abhay, and Anupam Agrawal. "Comparative Analysis of Different Machine
Learning Methods for Hate Speech Recognition in Twitter Text Data." In 2022 Third International
Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT),
pp. 1016-1020. IEEE, 2022.
[6] I. Ahmed, M. Abbas, R. Hatem, A. Ihab, and M. W. Fahkr. "Fine-tuning Arabic Pre-Trained
Transformer Models for Egyptian-Arabic Dialect Offensive Language and Hate Speech
Detection and Classification." 2022 20th International Conference on Language Engineering
(ESOLEC), Cairo, Egypt, 2022, pp. 170-174, doi: 10.1109/ESOLEC54569.2022.10009167.
[7] Ding, Kaize, Jianling Wang, Jundong Li, Dingcheng Li, and Huan Liu. "Be more with
less: Hypergraph attention networks for inductive text classification." arXiv preprint
arXiv:2011.00387, 2020.
[8] Parisa Hajibabaee, Masoud Malekzadeh, Mohsen Ahmadi, Maryam Heidari, Armin
Esmaeilzadeh, Reyhaneh Abdolazimi, and James H. Jones Jr. "Offensive Language Detection on
Social Media Based on Text Classification." 2022 IEEE 12th Annual Computing and
Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2022, pp. 0092-0098,
doi: 10.1109/CCWC54503.2022.9720804.
[9] Korde, Vandana, and C. Namrata Mahender. "Text classification and classifiers: A survey."
International Journal of Artificial Intelligence &amp; Applications 3, no. 2 (2012): 85.
[10] Fernando, W. S. S., Ruvan Weerasinghe, and E. R. A. D. Bandara. "Sinhala hate speech
detection in social media using machine learning and deep learning." In 2022 22nd
International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 166-171. IEEE,
2022.
[11] Chavan, Tanmay, Shantanu Patankar, Aditya Kane, Omkar Gokhale, and Raviraj Joshi.
"A Twitter BERT Approach for Offensive Language Detection in Marathi." arXiv preprint
arXiv:2212.10039, 2022.
[12] Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi
Hettiarachchi, Lasitha Uyangodage, and Marcos Zampieri. "SOLD: Sinhala Offensive
Language Dataset." arXiv preprint arXiv:2212.00851, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>