Improving Detection of Hate Speech, Offensive Language and Profanity in Short Texts with SVM Classifier

Surya Agustian, Zaky Idhafi and Agit Fadillah Rihardi
UIN Sultan Syarif Kasim, Jl. H.R. Soeberantas km 11.5 Panam, Pekanbaru, Riau, Indonesia

Abstract
Hate speech and offensive language in social media have become a global issue, affecting many nations and languages. Conflicts on social media, triggered by hate speech and offensive language, can leave victims with mental health problems, disrupt their peace, and disturb their real-world social lives. HASOC 2023 organizes shared tasks to detect hate speech and offensive language in several languages spoken on the Indian subcontinent, which are categorized as low-resource languages. Tasks 1A and 1B, in Sinhala and Gujarati, conceal underlying difficulties that require particular techniques in the classification procedure. This study proposes an SVM classifier with improvement strategies for optimization and feature selection based on FastText word embeddings. The experimental results indicate that the applied strategies significantly enhance performance compared to the baseline method, which uses bag-of-words input features: the improvement achieved is 5.37% for Sinhala and 26.08% for Gujarati.

Keywords
SVM classifier, FastText word embeddings, optimization, feature selection

1. Introduction

Social media offers individuals the freedom to express their thoughts and emotions on a wide range of daily life issues. Unfortunately, this freedom is frequently misused to propagate hate speech, offensive language and profanity, even targeting individuals, groups, and governments. Hate speech and offensive language remain a global concern on social media, regardless of the language spoken, wherever the internet and mobile phones are accessible.

Hate speech and offensive language can arise from differences in personal views or group opinions regarding religion, politics, ideology, social issues, gender, ethnicity, culture, economics, and more. Smedt et al. [1] investigated them in several domains and languages and found that, across various topics, they exhibited similar characteristics of hateful expression. Social media platforms such as Facebook, Instagram, Twitter, YouTube comment sections, and various community forums have become virtual battlegrounds where hatred and profanity are frequently unleashed. Bullying, whether by an individual or a group, also often targets other people or specific groups, which can lead to victims experiencing depression and stress, and in some cases may even result in suicide [2]. Therefore, messages containing hate speech and profane words need to be minimized, filtered, and removed from social media.

Various research efforts and shared tasks to detect hate on social media have been carried out in many languages, such as Portuguese [3], Spanish [4], Arabic [5], Vietnamese [6], Italian [7] and Bahasa Indonesia [8]. Hate speech detection has also been studied on multilingual texts [9], [10], as well as on its level and target objects [9], [10].

Forum for Information Retrieval Evaluation, December 15-18, 2023, Goa, India
EMAIL: surya.agustian@uin-suska.ac.id (S. Agustian); 11950110021@students.uin-suska.ac.id (Z. Idhafi); 11751100389@students.uin-suska.ac.id (A.F. Rihardi)
ORCID: 0000-0002-9631-8237 (S. Agustian)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In addition to widely spoken languages, the hate speech classification task has piqued researchers' interest in languages from regions with limited resources. Since 2019, HASOC has organized shared tasks aimed at detecting hate speech and abusive language in various languages, including English, German, and the Indo-Aryan language family of the Indian subcontinent [11], [12]. At HASOC 2023, the shared tasks involve detecting hate speech in low-resource languages like Sinhala, Gujarati, Bengali, Bodo, and Assamese. Additionally, there is a task focused on identifying conversational hate speech in mixed languages, a continuation from the previous year, as well as a task for detecting hate speech spans within sentences [13].

Lexicon-based and machine-learning approaches were used in previous work [14] to detect hate speech in Sinhala. The lexicon list was generated by translating prohibited English words into Sinhala. In addition, profane and offensive words were obtained from online sources, and their variations were then gathered through the dataset collection for that research. The Convolutional Neural Network (CNN) method in [15] is used to detect hate speech in Sinhala: a first CNN model is trained to detect the presence or absence of hate speech, and if it is detected, a second CNN model, trained separately, classifies the level of hate speech. Gujarati is also known as a low-resource language; hate speech detection in Gujarati in [16] still relies on an external resource, a list of sentiment words from WordNet.

There are still numerous challenges in hate speech detection research, which is why this task continues to capture researchers' interest in the search for the most optimal automatic solution. Among these challenges are the subjective nature of determining which sentences contain hate speech, where the context of the conversation often determines whether a sentence qualifies, and the limited availability of data [17], [18]. In HASOC 2023 Task 1, besides facing the constraints of low-resource languages, we identified a fundamental issue with the Sinhala dataset, namely an imbalance in the number of samples between the HOF (Hate and Offensive) and NOT classes within the coarse-grained text data. For Gujarati, on the other hand, the scarcity of training data poses a significant challenge, making it difficult to construct optimal models with the various machine learning detection methods.

Due to limited knowledge of, and references for, these two languages, we rely solely on the robustness of the proposed machine learning method. We adopt the SVM method with word embeddings as input features, following a similar approach used for hate speech detection in Indonesian tweets [19], with specific optimizations addressing these challenges. We opted not to use a transformer-based method due to our limited proficiency in these languages, which posed challenges when creating an appropriate training dataset compatible with a pre-trained BERT model.

The rest of this paper is organized as follows: Section 2 discusses the proposed method for Tasks 1A and 1B. The performance improvement over the baseline method as the result of optimization is described in Section 3. The final section summarizes the findings and offers suggestions for future work.
2. Research Methodology

Task 1 of HASOC 2023 focuses on hate speech and offensive language detection in Sinhala (Task 1A) and Gujarati (Task 1B). Sinhala, one of the Indo-Aryan languages, is considered a language with limited resources; it is spoken by more than 17 million individuals in Sri Lanka and is one of Sri Lanka's official languages. Gujarati, likewise classified as a low-resource Indo-Aryan language, has approximately 50 million native speakers and is one of the 22 official languages recognized in India. Both tasks are binary classification, with the label HOF (Hate and Offensive) if the post contains hate, offensive or profane language, and NOT for the Non Hate-Offensive class.

The statistics of the dataset provided for training and development are described in Table 1 below. For Sinhala [20], the total number of posts, 7,500, is sufficient to train a machine learning model, but for Gujarati the training dataset is very small, only 200 posts. Regarding composition, Gujarati has the same number of samples in each class, while Sinhala is imbalanced, as described in Table 1. In our development phase, we split the datasets into train and validation sets with a 90:10 proportion.

Table 1
Label distribution of the Sinhala and Gujarati datasets for training

Language                Label   Number of posts   Percentage
Sinhala (7,500 posts)   HOF     3,176             42.35%
                        NOT     4,324             57.65%
Gujarati (200 posts)    HOF     100               50%
                        NOT     100               50%

2.1. SVM Method

Support Vector Machine (SVM) is a machine learning method that can be used for classification tasks by finding a hyperplane that separates two classes of data. This hyperplane is the optimal decision boundary that maximizes the margin (distance) between the two classes, and SVM aims to find the hyperplane with maximum margin; with this technique, it performs remarkably well at categorizing data into two classes efficiently. SVM is a well-established machine learning algorithm that was initially designed for binary classification and later extended to tackle multiclass classification and regression tasks.

We propose SVM as a state-of-the-art text classifier [21] for Tasks 1A and 1B. Both are first classified with a bag-of-words feature set as the baseline method. For this baseline, we employ the tokenizer from scikit-learn (https://scikit-learn.org/) to extract both word unigrams and bigrams from each tweet, which are then transformed into a TF-IDF feature vector. In the case of Sinhala, this results in a substantial vocabulary of 47,316 n-grams, versus 2,083 n-grams for Gujarati. Due to our lack of knowledge of Sinhala and Gujarati, we use all n-grams as input features, resulting in baseline SVM input dimensionalities of 47,316 and 2,083, respectively. However, this approach produces a very high level of sparsity in each sentence vector, specifically for Sinhala, as it uses the entire vocabulary to form a vector. This condition is well handled by SVM, whose robustness to sparse input and good generalization ability have made it a popular approach for supervised learning [22].

We employ basic text preprocessing before tokenizing the tweets, i.e. removing numbers, punctuation, mentions, URLs, hashtags and retweet markers like "RT @username:", and adding spaces between emojis. Stemming is not implemented, based on the hypothesis that it might inadvertently reduce or strip away emotional nuances from written expressions in social media. Stopword removal, in contrast, is treated as optional: in some cases retaining stopwords in tweets can be advantageous, while in others their presence may lower accuracy if they are not removed. A minimal sketch of this preprocessing is given below.
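The following Python sketch illustrates this kind of preprocessing. It is illustrative rather than our released implementation: the function name, the emoji Unicode ranges, and the choice of regular expressions are our own simplifications.

```python
import re
import string

# Rough Unicode ranges covering common emojis; an approximation for illustration.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation})

def clean_tweet(text, stopwords=frozenset()):
    """Basic preprocessing: drop retweet markers, URLs, mentions,
    hashtags, numbers and punctuation, and put spaces around emojis."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)   # retweet marker "RT @username:"
    text = re.sub(r"https?://\S+", "", text)     # URLs
    text = re.sub(r"[@#]\w+", "", text)          # mentions and hashtags
    text = re.sub(r"\d+", "", text)              # numbers
    text = EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)  # isolate emojis
    text = text.translate(PUNCT_TABLE)           # ASCII punctuation only
    tokens = [t for t in text.split() if t not in stopwords]  # optional stopwords
    return " ".join(tokens)

print(clean_tweet("RT @user: I am angry 😡😡 https://t.co/x 123!"))
# -> "I am angry 😡 😡"
```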
2.2. Feature Selection

In general, many studies have reported the use of bag-of-words feature vectors as inputs for SVM text classifiers. While these methods have shown good performance among reported machine learning approaches, they can be inefficient in terms of computing and memory usage. We hypothesize that employing word embeddings as features can significantly reduce the input vector dimensionality for various machine learning methods, including SVM. Word embeddings may not only reduce computing time but also improve classification performance (accuracy, precision, recall and F1-score), because converting sentences into lower-dimensional vectors with word embeddings makes similarity measures between sentences more effective, whereas bag-of-words vectors exhibit high sparsity.

For example, consider two sentences with similar meanings but different words (synonyms), which may result in a low similarity score:

Sentence 1: "This research uses formula 2.5 to determine the direction of objects movements"
Sentence 2: "Our study employ equation 2.5 to calculate the direction of a moving object."

Assuming stopwords are removed from the text, only the word 'direction' is common to both sentences. If stemming is applied, the word 'object' can be added to the set of shared words between sentence 1 and sentence 2. When using similarity measures like Jaccard or cosine similarity with bag-of-words vectors, the similarity results tend to be quite low because they rely solely on matched words. In contrast, when using word embeddings for similarity measurement, we believe the results will be better.

Based on this hypothesis, we use word embeddings as our feature selection strategy. Among word2vec [23], GloVe [24] and FastText [25], we chose FastText word embeddings with the hypothesis that FastText can predict unseen, out-of-vocabulary (OOV) words when generating the word embedding model. FastText generates word vectors from the composition of a word's character n-grams, so it can deal with words unseen during training.

2.3. Optimization

The optimization steps undertaken in Tasks 1A and 1B were tailored to address the underlying issues that we identified. In the case of Sinhala, the primary issue was the imbalance between the HOF and NOT classes. Various approaches were available to tackle this class imbalance, including oversampling the class with fewer instances or undersampling the class with more instances. We chose the oversampling approach, as undersampling could lead to the loss of numerous word-specific features associated with the dominant class. In the Sinhala dataset, as is typical in real-world scenarios, the neutral class tends to be more prevalent in social media posts collected through keyword-based crawling. Given the smaller volume of data in the HOF class, we performed oversampling by randomly selecting existing tweets from the HOF class, where each selected tweet should contain more than 18 clean tokens; emojis, when detected, were treated as individual tokens. We also excluded stopwords from the data used for oversampling in Sinhala. A sketch of this oversampling step follows.
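This sketch assumes the data sits in a pandas DataFrame; the column names and random seed are our own illustration, not the released code.

```python
import pandas as pd

def oversample_hof(train: pd.DataFrame, min_tokens: int = 18, seed: int = 42):
    """Randomly duplicate HOF tweets that have more than `min_tokens`
    clean tokens until both classes are the same size."""
    hof = train[train["label"] == "HOF"]
    neg = train[train["label"] == "NOT"]
    # Only tweets with more than 18 clean tokens are eligible for duplication.
    eligible = hof[hof["text"].str.split().str.len() > min_tokens]
    extra = eligible.sample(n=len(neg) - len(hof), replace=True, random_state=seed)
    return pd.concat([train, extra]).sample(frac=1, random_state=seed)  # reshuffle
```

Applied to the 90% Sinhala training split, this yields the balanced data-train with 3,898 tweets per class described later in this section.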
Notably, during our analysis of the existing Sinhala tweet data, we did not find any use of emojis. In Gujarati, however, emojis are commonly employed, making it imperative to consider them due to the limited number of samples. We employ a special treatment for emoji tokenization, including instances of multiple identical emojis in a row (e.g. "I am angry 😡😡😡😡"). Instead of removing duplicates, we treat each duplicate emoji as a separate token when generating the sentence vector. This approach is based on the idea that emojis occupy distinct vectors within the word embedding vector space, and the repetition of an emoji influences the sentence's position in that space. Certain emojis can accentuate the emotional content of a tweet; for example, a sequence of angry emojis may indicate an emotional outburst by the author. By extracting duplicated emojis into single tokens, each contributes to steering the sentence vector in a particular direction, as illustrated in Figure 1. This direction will be more aligned with, and closer to, the common sentiment or nuance associated with the feelings expressed by those emojis, whether positive or negative.

Figure 1. An illustration of how duplicate emojis steer the sentence vector ("I am angry 😡😡😡😡") in a certain direction: R1 is the sentence vector with emojis removed, R2 with duplicate emojis removed, and R3 with duplicates taken into account.

Sentence embeddings are computed from the normalized word embedding vectors of the cleaned constituent tokens of a tweet, following equations (1) and (2). If $V_t = [v_1, v_2, \dots, v_d]$ is the word embedding vector of a token, with dimension $d$, then the normalized vector $V_n$ is obtained by element-wise division of the elements of $V_t$ by its norm, as defined in equation (1). The sentence vector $V_s$ of a tweet composed of $j$ tokens is the element-wise average of the normalized vectors $V_n^{(k)}$ of the tokens comprising the sentence, as in equation (2).

$$V_n = \frac{V_t}{\sqrt{\sum_i v_i^2}} \quad (1)$$

$$V_s = \frac{1}{j} \sum_{k=1}^{j} V_n^{(k)} \quad (2)$$

Another optimization step involved normalizing the sentence embedding vector, as it contained negative elements and some values exceeding 1. To achieve this, we applied normalization using the scaling functions from scikit-learn, which bring the magnitudes of the vector elements into the range between 0 and 1. In this optimization process, we explored the best model performance using various normalization techniques, including min-max scaling, robust scaling, and no scaling, as discussed by Zikri and Agustian [19].

Data balancing for the Sinhala language is applied exclusively to the training portion of the dataset, which has been initially split into a 90:10 ratio for training (data-train) and validation (data-dev). This process yields a data-train comprising 7,796 tweets, with each class containing 3,898 tweets; the validation data (data-dev) remains unchanged. This approach simplifies the selection of the optimal SVM model after the several improvement stages (baseline, feature selection, optimization). The training outcome of each method (SVM model) is assessed by predicting the data-dev, and the model with the highest F1-score (the optimal model) is chosen to predict the test data for our RUN submission.

In addition to the optimizations mentioned above, we also employ a grid search to obtain the optimal SVM parameters. This includes selecting the appropriate kernel (RBF, linear, or sigmoid) as well as determining optimal values for C and gamma, using 5-fold cross-validation. A sketch of this search is given below.
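The grid search can be expressed with scikit-learn's GridSearchCV, matching the F1-based model selection described above. The candidate values for C and gamma below are illustrative placeholders, not the exact grid we searched, and the random data stands in for the scaled sentence embeddings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder features: in our pipeline these are the scaled sentence embeddings.
X_train = np.random.rand(200, 100)
y_train = np.random.randint(0, 2, size=200)

param_grid = {
    "kernel": ["rbf", "linear", "sigmoid"],
    "C": [0.1, 1, 10, 100],            # illustrative candidates
    "gamma": ["scale", 0.01, 0.1, 1],  # illustrative candidates
}
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```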
3. Experiment and Results

We designed a two-phase hate speech detection system, similar to the approach adopted by Agustian et al. [26], with an SVM classifier as the core machine learning component. The system's workflow is illustrated in Figure 2 below. Phase 1 builds a language model using FastText word embeddings, whereas Phase 2 employs the SVM classifier to detect hate speech. The Python code for this method is available on GitHub (https://github.com/s4gustian/HASOC2023.git).

Figure 2. Two-phase hate speech detection method [26]

3.1. Experiment Setup

The tweets in the whole provided dataset flow through a set of preprocessing steps before FastText is trained. One or more steps in the preprocessing set below may be used or discarded to improve the classification results; we empirically choose the suitable steps and compare the prediction results on the validation data (data-dev).

• Remove numbers
• Remove punctuation
• Remove word repetitions
• Remove stopwords
• Replace mentions (@user, @USER, @AUTHOR) with the single token "MENTIONED"
• Remove double spaces
• Remove URLs
• Tokenize (split every emoji into a single token)
• Remove Latin words

The same preprocessing steps are applied to both the data-train and data-dev sets before converting them into sentence embeddings. These vectors are then used as input features for SVM and undergo an optimization process, with or without normalization. To address the class imbalance in the data, the oversampling techniques explained in the previous section are applied.

We conducted experiments using all combinations of these preprocessing and optimization techniques to identify the optimal model. The selected model is the one that achieved the highest F1-score on the validation data (data-dev) during training. Table 2 displays the experimental combinations conducted in our search for the optimal model, which we submitted to the HASOC 2023 system.

Table 2
Experiment Setup

Task                 RUN    Feature        Balancing   Scaling   Single emoji as token
Task 1A (Sinhala)    RUN1   Bag of Words   No          No        No
                     RUN2   FastText       No          Yes       No
                     RUN3   FastText       Yes         Yes       No
Task 1B (Gujarati)   RUN1   Bag of Words   No          No        No
                     RUN2   FastText       No          Yes       No
                     RUN3   FastText       No          Yes       Yes

An end-to-end sketch of the FastText feature extraction and SVM training is given below.
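To make the two phases concrete, the sketch below trains FastText embeddings (Phase 1) and then builds scaled sentence vectors for the SVM (Phase 2), following equations (1) and (2). The file name, embedding dimension, variable names, and SVM hyperparameters are assumptions for illustration, not our submitted configuration.

```python
import numpy as np
import fasttext                       # pip install fasttext
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Phase 1: unsupervised FastText model over the preprocessed tweets,
# one cleaned tweet per line in "corpus.txt" (an assumed file name).
ft = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

def sentence_vector(tokens):
    """Equations (1)-(2): L2-normalize each token vector, then average.
    FastText still returns a vector for OOV tokens via character n-grams."""
    vecs = []
    for t in tokens:
        v = ft.get_word_vector(t)
        norm = np.linalg.norm(v)
        if norm > 0:
            vecs.append(v / norm)     # equation (1)
    if not vecs:
        return np.zeros(ft.get_dimension())
    return np.mean(vecs, axis=0)      # equation (2)

# Phase 2: scaled sentence embeddings as SVM input features.
# `train_texts` / `train_labels` are placeholders for the cleaned data-train.
X = np.vstack([sentence_vector(t.split()) for t in train_texts])
scaler = MinMaxScaler()               # min-max scaling into [0, 1]
clf = SVC(kernel="rbf", C=10, gamma="scale")
clf.fit(scaler.fit_transform(X), train_labels)
```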
3.2. Results and Discussion

The results of the conducted experiments reveal a significant improvement in each run, as shown in Table 3 below. The selection of FastText as an input feature not only speeds up computation, due to the substantial reduction in vector dimensionality, but also yields superior performance (RUN2). Specifically, for the Sinhala data there was a notable increase of 4.96% in terms of the F1-score, the official metric used in HASOC 2023; for Gujarati, the results showed a remarkable improvement of 21.64%.

Table 3
System performance on HASOC 2023 test data (in percent)

Task/Team             RUN          Macro Precision   Macro Recall   Macro F1-score   F1 improvement over baseline (RUN1)
Task 1A (Sinhala)
Proposed method       RUN1         71.20             67.49          68.73            -
                      RUN2         75.34             73.13          73.69            4.96
                      RUN3         73.93             74.55          74.10            5.37
FiRC-NLP              First Rank   83.82             83.68          84.00
LEGEND                Last Rank    55.88             55.72          55.75
Task 1B (Gujarati)
Proposed method       RUN1         34.28             50.00          40.67            -
                      RUN2         62.22             63.69          62.31            21.64
                      RUN3         69.30             72.19          66.75            26.08
FiRC-NLP              First Rank   83.92             86.38          84.88
Gradient Descenders   Last Rank    67.12             66.26          66.62

The oversampling process applied to the Sinhala language resulted in performance improvements over the imbalanced dataset, with an increase of 0.97% compared to RUN2 and a substantial 5.37% improvement compared to RUN1. On the small, balanced Gujarati data, treating emojis as single tokens boosted the F1-score by 4.44% compared to RUN2; compared to RUN1, which uses bag-of-words features, there was a remarkable improvement of 26.08%. This enhancement can be attributed to the presence of emojis within tweet contexts, where they influence the formation of the word embedding vectors. This enriches the language model's comprehension of emotional and contextual cues within the text, thereby improving classification accuracy.

4. Conclusion and Future Work

Our participation in HASOC 2023 proposed a strategy to improve the performance of a machine learning method, the SVM classifier, by implementing optimization and feature selection. Our experiments on the HASOC datasets show that applying word embeddings as input features for SVM can improve the F1-score significantly compared to the use of bag-of-words vectors. With the dimensionality reduced, the computation becomes more efficient because the sparsity of the vectors is eliminated. The other optimizations also have significant effects in improving the classification results. Evaluation on the Gujarati test data shows that, for a very small training set, treating emojis as special tokens in the word embedding vector space can improve the F1-score by more than 4%. For the Sinhala dataset, since the training data does not contain any emojis, this optimization has no effect.

We are inspired by the problem introduced in HASOC 2023, specifically the very small amount of available training data. We are curious to apply this strategy to other languages we understand well, and want to verify that this optimization strategy works well there too. Our future work will investigate it on English and Bahasa Indonesia datasets, hoping that the results will improve significantly compared to the baseline.

5. References

[1] T. De Smedt, S. Jaki, E. Kotzé, L. Saoud, M. Gwóźdź, G. De Pauw, and W. Daelemans, Multilingual Cross-domain Perspectives on Online Hate Speech, CLiPS Technical Report Series, vol. 8, pp. 1–24, Sep. 2018, arXiv:1809.03944.
[2] D. D. Luxton, J. D. June, and J. M. Fairall, Social Media and Suicide: A Public Health Perspective, American Journal of Public Health, vol. 102, no. Suppl 2, pp. S195–S200, May 2012, doi:10.2105/AJPH.2011.300608.
[3] P. Fortuna, J. Rocha Da Silva, J. Soler-Company, L. Wanner, and S. Nunes, A Hierarchically-Labeled Portuguese Hate Speech Dataset, in: Proceedings of the Third Workshop on Abusive Language Online, Florence, Italy, 2019, pp. 94–104.
[4] E. del Valle and L. de la Fuente, Sentiment analysis methods for politics and hate speech contents in Spanish language: a systematic review, IEEE Latin America Transactions, vol. 21, no. 3, pp. 408–418, Mar. 2023, doi:10.1109/TLA.2023.10068844.
[5] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, and A. Abdelali, Arabic Offensive Language on Twitter: Analysis and Experiments, in: WANLP 2021 - 6th Arabic Natural Language Processing Workshop, Association for Computational Linguistics (ACL), 2021, pp. 126–135.
[6] X.-S. Vu, T. Vu, M.-V. Tran, T. Le-Cong, and H. T. M. Nguyen, HSD Shared Task in VLSP Campaign 2019: Hate Speech Detection for Social Good, arXiv preprint, arXiv:2007.06493 (2020).
[7] C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, and M. Tesconi, Overview of the EVALITA 2018 Hate Speech Detection Task, in: Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Turin, Italy, 2018.
[8] M. O. Ibrohim and I. Budi, Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter, in: Proceedings of the Third Workshop on Abusive Language Online, Florence, Italy, 2019, pp. 46–57.
[9] T. Mandl, S. Modha, A. Kumar M, and B. R. Chakravarthi, Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German, in: Proceedings of the 12th Annual Meeting of the Forum for Information Retrieval Evaluation, December 2020, pp. 29–32, doi:10.1145/3441501.3441517.
[10] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, and M. Sanguinetti, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Stroudsburg, PA, USA: Association for Computational Linguistics, 2019, pp. 54–63, doi:10.18653/v1/S19-2007.
[11] T. Mandl et al., Overview of the HASOC Track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation, December 2019, pp. 14–17, doi:10.1145/3368567.3368584.
[12] T. Ranasinghe, K. North, D. Premasiri, and M. Zampieri, Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi, arXiv preprint, arXiv:2211.10163 (2022).
[13] S. Satapara, H. Madhu, T. Ranasinghe, A. E. Dmonte, M. Zampieri, P. Pandya, N. Shah, M. Sandip, P. Majumder, and T. Mandl, Overview of the HASOC Subtrack at FIRE 2023: Hate-Speech Identification in Sinhala and Gujarati, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, Dec 15-18, 2023.
[14] H. M. S. T. Sandaruwan, S. A. S. Lorensuhewa, and M. A. L. Kalyani, Sinhala Hate Speech Detection in Social Media using Text Mining and Machine Learning, in: 19th International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE, Sep. 2019, pp. 1–8, doi:10.1109/ICTer48817.2019.9023655.
[15] S. W. A. M. D. Samarasinghe, R. G. N. Meegama, and M. Punchimudiyanse, Machine Learning Approach for the Detection of Hate Speech in Sinhala Unicode Text, in: 20th International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE, Nov. 2020, pp. 65–70, doi:10.1109/ICTer51097.2020.9325493.
[16] L. Gohil and D. Patel, A Sentiment Analysis of Gujarati Text using Gujarati SentiWordNet, International Journal of Innovative Technology and Exploring Engineering, vol. 8, no. 9, pp. 2290–2292, Jul. 2019, doi:10.35940/ijitee.i8443.078919.
[17] S. MacAvaney, H. R. Yao, E. Yang, K. Russell, N. Goharian, and O. Frieder, Hate Speech Detection: Challenges and Solutions, PLoS One, vol. 14, no. 8, Aug. 2019, doi:10.1371/journal.pone.0221152.
[18] G. Kovács, P. Alonso, and R. Saini, Challenges of Hate Speech Detection in Social Media: Data Scarcity, and Leveraging External Resources, SN Computer Science, vol. 2, no. 2, Apr. 2021, doi:10.1007/s42979-021-00457-3.
[19] A. Zikri and S. Agustian, Penerapan Support Vector Machine dan FastText untuk Mendeteksi Hate Speech dan Abusive pada Twitter [Application of Support Vector Machine and FastText to Detect Hate Speech and Abusive Language on Twitter], Jurnal Media Informatika Budidarma, vol. 7, no. 1, pp. 436–443, 2023, doi:10.30865/mib.v7i1.5408.
[20] T. Ranasinghe, I. Anuradha, D. Premasiri, K. Silva, H. Hettiarachchi, L. Uyangodage, and M. Zampieri, SOLD: Sinhala Offensive Language Dataset, arXiv preprint, arXiv:2212.00851 (2022).
[21] T. Joachims, Text Classification, in: Learning to Classify Text Using Support Vector Machines, Boston, MA: Springer US, 2002, pp. 7–33, doi:10.1007/978-1-4615-0907-3_2.
[22] M. Awad and R. Khanna, Support Vector Machines for Classification, in: Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers, Berkeley, CA: Apress, 2015, pp. 39–66, doi:10.1007/978-1-4302-5990-9_3.
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv preprint, arXiv:1301.3781 (2013).
[24] J. Pennington, R. Socher, and C. Manning, GloVe: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 1532–1543, doi:10.3115/v1/D14-1162.
[25] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, Bag of Tricks for Efficient Text Classification, arXiv preprint, arXiv:1607.01759 (2016).
[26] S. Agustian, R. Saputra, and A. Fadhilah, 'Feature Selection' with Pretrained-BERT for Hate Speech and Offensive Content Identification in English and Hindi Languages, in: Forum for Information Retrieval Evaluation, India, 2021.