<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning and Lexicon-Based Strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vinayak Vijay</string-name>
          <email>vinayakvijay2003@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avishikta Bhattacharjee</string-name>
          <email>avishikta.bhattacharjee@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirti Kumari</string-name>
          <email>kirti@iiitranchi.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Upkar Kumar Kedia</string-name>
          <email>upkar.2023dr101@iiitranchi.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology Ranchi</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kalinga Institute of Industrial Technology Bhubaneswar</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The proliferation of online content in regional and code-mixed languages has led to a significant increase in abusive and hate speech, necessitating the development of robust detection systems. This paper presents a study on hate speech detection in Hinglish (a mix of Hindi and English) and Bangla, focusing on the challenges these languages pose due to code-mixing, transliteration, and rich morphological variation. Our approach includes pre-processing pipelines tailored to code-mixed data and transliterated text. We employ TF-IDF vectorization and a lexicon-based hierarchical approach to capture the nuances of hate speech in these languages. The lexicon-based approach allows us to effectively identify hate speech terms and their variants, even in the presence of morphological variation and code-mixing. The models were trained and evaluated on curated datasets, showcasing their effectiveness in identifying hate speech with high precision.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech Detection</kwd>
        <kwd>Deep learning model</kwd>
        <kwd>Hybrid lexicon-based model</kwd>
        <kwd>Tf-idf word embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid proliferation of digital communication platforms has fundamentally transformed human
interaction, fostering the widespread use of diverse languages and dialects online. While this linguistic
diversity enriches global discourse, it also poses significant challenges—particularly in the detection and
mitigation of hate speech. Such harmful content threatens social harmony and can incite real-world
violence, making its identification a critical area of concern. In linguistically diverse regions like
South Asia, where hybrid languages such as Hinglish (a blend of Hindi and English) and Bangla are
widely spoken, hate speech detection becomes even more complex due to the nuances of code-mixing,
transliteration, and cultural context.</p>
      <p>This research addresses these challenges by developing a specialized hate speech detection framework
tailored to the linguistic and cultural intricacies of Hinglish and Bangla. Leveraging Natural
Language Processing (NLP) techniques and machine learning algorithms, we analyze a corpus of
user-generated content collected from various social media platforms. Our model achieves a macro
F1-score of 72 for general Hate/Offensive content detection and 45 for the specific detection of Hate
Speech, underscoring both the effectiveness of the approach and the difficulty of the task in these multilingual settings.</p>
      <p>The results reveal distinct patterns and expressions of hate speech across the two languages,
highlighting the necessity of language-specific modeling for accurate classification. Beyond the technical
contributions, this study provides insights into hate speech dynamics in code-mixed and
underrepresented languages, offering a scalable framework for the development of robust content
moderation tools. These tools are vital for maintaining respectful discourse online and fostering safer,
more inclusive digital communities.</p>
      <p>https://github.com/Vinayak164000 (V. Vijay)</p>
      <p>The remaining sections of this work are organized as follows. Section 2 offers a brief review of
relevant literature. Section 3 outlines the proposed approach and structure for dealing with Bangla text.
Section 4 summarizes the experiments and findings. The paper concludes in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Hate speech (HS) is a form of expression that disseminates negativity, often inciting violence or
discrimination based on innate characteristics such as race, ethnicity, or gender. Identifying and
addressing HS, especially online, has become a critical issue, particularly in regional languages like
Bengali and code-mixed languages such as Hinglish. One of the main hurdles in Bengali HS detection
is the lack of labeled datasets, which makes model training difficult. Das
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced the HS-BAN dataset, a benchmark collection of social media comments for hate
speech detection in Bangla. This dataset fills a critical gap by providing annotated data for training and
testing machine learning models for hate speech detection in the Bengali language. In their survey on hate
speech detection in Bengali, Maruf et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] highlighted the challenges of limited data availability
and the complexity of capturing cultural context in automated detection systems. These challenges
are compounded in code-mixed languages, where interleaving between different linguistic systems
adds further complexity. In related work on Bangla, Junaid et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] developed machine learning models to detect
hate speech in videos, combining neural networks and traditional algorithms, achieving promising
results. Kumari et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a deep learning approach based on pre-trained BERT models to
identify hate speech and offensive language in code-mixed Hindi-English social media text. Their work
demonstrated the effectiveness of fine-tuning BERT models for this task.
      </p>
      <p>
        In recent work, Ahammed et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] explored the application of machine learning techniques for
identifying hate speech in Bangla, demonstrating the efficacy of Support Vector Machines (SVM)
combined with Term Frequency-Inverse Document Frequency (TF-IDF) features in curbing online
hate speech in regional languages. Similarly, Barman et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] emphasized the difficulties of
language identification in code-mixed social media text, noting that code-switching presents unique
challenges for NLP models, particularly when languages are interleaved
in unpredictable ways. A lexicon-based approach introduced by Gitari et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for hate speech
detection demonstrated the effectiveness of such models but underscored the need for more sophisticated
techniques in future research, as lexicon-based methods might struggle with the nuances of HS in
different contexts.
      </p>
      <p>
        Furthermore, Islam et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] examined various NLP and machine learning methods for detecting
hate speech in Bangla social media texts, finding that advanced models such as Long Short-Term
Memory (LSTM) networks and Convolutional Neural Networks (CNN) significantly improved detection
accuracy. Their findings suggest that while basic machine learning models provide a strong foundation,
deep learning techniques offer greater potential for handling complex language patterns and nuances
in hate speech, particularly in under-resourced languages and dialects.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section consists of an overview of data visualization, data preprocessing, feature extraction
techniques, and the methods used to train models.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing and Data Visualization</title>
        <p>Data preprocessing and visualization are vital steps in natural language processing (NLP). Cleaning
the dataset prepares the raw data for model training and ensures that
any further analysis is based on accurate and well-organized information. Pre-processing is essential
for addressing irregularities and uncertainties in raw textual data, ultimately leading to more reliable
results. The class distribution of comments in the training and
development sets is presented in Table 1. Detailed descriptions of the tasks and datasets are given in
[9] [10].</p>
        <p>To analyze the dataset, we developed a thorough pre-processing pipeline. This step was essential to
convert unstructured social media text into a suitable format for machine learning and deep learning
models. The preprocessing steps are shown in Figure 1.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Engineering</title>
        <p>Transforming text data into a numerical format is a critical step in building effective machine learning
(ML) models, as these models cannot interpret raw characters or words. To this end, we
employed TF-IDF (Term Frequency–Inverse Document Frequency) vectorization to extract both word
n-gram features (ranging from unigrams to trigrams) and character n-gram features (ranging from
bigrams to 5-grams) from the preprocessed text. To reduce noise and improve efficiency, we restricted
the extraction to only the most frequent features in both categories, as suggested by Kumari et al. [11].</p>
        <p>The selected frequent features were then stacked to form a composite feature vector, effectively
capturing both word-level and subword-level textual patterns. By focusing on high-frequency n-grams,
we not only reduced the dimensionality of the feature space but also decreased the training
time of the classifiers. Furthermore, this approach mitigates the risk of overfitting, thereby enhancing
the generalization performance of the models.</p>
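        <p>The feature construction described above can be sketched as follows. This is a minimal illustration using Scikit-learn, not the exact configuration used in our experiments: the toy corpus, the variable names, and the cap of 5000 features per vectorizer are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for preprocessed comments (hypothetical data).
texts = [
    "tumi khub kharap lok",
    "ajke din ta bhalo",
    "tui ekta kharap manush",
    "bhalo thako bondhu",
]

# Word n-grams from unigrams to trigrams. max_features keeps only the most
# frequent terms; the value 5000 is an assumed cap, not the paper's setting.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=5000)

# Character n-grams from bigrams to 5-grams, capped in the same way.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 5), max_features=5000)

word_features = word_vec.fit_transform(texts)
char_features = char_vec.fit_transform(texts)

# Stack the two sparse matrices column-wise into one composite feature matrix.
features = hstack([word_features, char_features]).tocsr()
```
]]></preformat>
        <p>Each row of the resulting matrix represents one comment, and its columns are the word-level features followed by the character-level features.</p>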
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Construction</title>
        <p>For hate speech detection in Bangla, we implemented a hierarchical classification approach
to manage the complexity of the multilabel, multiclass problem. This approach includes a hybrid model
that combines lexicon-based analysis with machine learning (ML) algorithms. The model construction
pipeline for Label 1 in the hierarchical classification is shown in Figure ??.</p>
        <p>Our dataset contains many misspelled Bengali words, which TF-IDF struggled to handle effectively.
To address this, a lexicon-based hierarchical approach was implemented. A manually curated dictionary,
encompassing both correct and incorrect spellings of Bengali words, was created. By augmenting
the training data with these dictionary entries, the model was exposed to a wider range of linguistic
variations. This enabled the model to learn the patterns of misspellings and their correct counterparts.
The resulting improvement can be attributed to the richer feature representation, improved generalization
capabilities, and increased robustness to noise and errors in the input data.</p>
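        <p>One plausible reading of this augmentation step is sketched below: each training comment that contains a lexicon term is copied once per known misspelling, with the original label preserved. The lexicon entries, the romanized spellings, and the function name <monospace>augment</monospace> are hypothetical and serve only to illustrate the idea.</p>
        <preformat><![CDATA[
```python
# Hypothetical lexicon: canonical spelling -> observed misspellings.
LEXICON = {
    "kharap": ["kharab", "khrap"],
    "bhalo": ["valo", "bhlo"],
}


def augment(samples, lexicon):
    """Return the samples plus copies in which a lexicon term is replaced
    by one of its variant spellings; labels are carried over unchanged."""
    augmented = list(samples)
    for text, label in samples:
        tokens = text.split()
        for i, token in enumerate(tokens):
            for variant in lexicon.get(token, []):
                new_tokens = tokens[:i] + [variant] + tokens[i + 1:]
                augmented.append((" ".join(new_tokens), label))
    return augmented


# Two toy (text, label) pairs; 1 = hate, 0 = not hate.
train = [("tumi kharap lok", 1), ("din ta bhalo", 0)]
augmented_train = augment(train, LEXICON)
```
]]></preformat>
        <p>With this toy input, each comment gains two variant copies, so a vectorizer fitted on the augmented set sees both the canonical and the misspelled forms.</p>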
        <p>Logistic Regression was used in tandem with the lexicon to classify the data by examining both
the presence of hate-related terms and features extracted from the text, such as term frequencies and
patterns indicative of hate speech. Once the initial label (hate speech or not) was predicted, we
prepared the input data for the label 2 prediction. To achieve this, the existing input data, represented
using TF-IDF features, was combined with the “offensive gold label” into a single matrix. This augmented
feature set served as the input for the second stage of classification. The detailed steps of model
construction for Label 2 are shown in Figure 3.</p>
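        <p>Combining the TF-IDF features with the first label amounts to appending one extra column to the feature matrix. The sketch below illustrates this with made-up values; using the stage-one prediction in place of the gold label at prediction time is an assumption of the example.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical TF-IDF matrix for four comments (values are made up).
tfidf = csr_matrix(np.array([
    [0.0, 0.7, 0.7],
    [1.0, 0.0, 0.0],
    [0.6, 0.0, 0.8],
    [0.0, 1.0, 0.0],
]))

# Label 1 per comment: 1 = hate/offensive, 0 = not. During training this is
# the gold label; at prediction time the stage-one output would take its
# place (an assumption of this sketch).
label1 = np.array([1, 0, 1, 0])

# Append label 1 as one extra column so that stage two can condition on it.
stage2_input = hstack([tfidf, csr_matrix(label1.reshape(-1, 1))]).tocsr()
```
]]></preformat>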
        <p>In the second stage, we employed a Soft Voting Classifier using the Scikit-learn library. The Soft
Voting Classifier was composed of three base classifiers: Logistic Regression, Decision Tree Classifier,
and Extra Tree Classifier. The objective in this phase was to predict the second label, which involved
determining whether the hate speech was targeted at an individual, a group, or was untargeted. By using soft voting, we
aggregated the predictions from each individual classifier, so that the final prediction was a
weighted combination of the class probabilities from each model. This ensemble method allowed for more
robust decision-making, as it leveraged the strengths of multiple algorithms to improve classification
accuracy.</p>
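        <p>A minimal sketch of such an ensemble with Scikit-learn's <monospace>VotingClassifier</monospace> is given below. The random feature matrix, the toy labels, and all hyperparameters are assumptions made for illustration; in particular, <monospace>ExtraTreesClassifier</monospace> is used here, although the single-tree <monospace>ExtraTreeClassifier</monospace> may equally be what is meant by the Extra Tree Classifier.</p>
        <preformat><![CDATA[
```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stage-two feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.random((60, 5))
# Toy target labels: 0 = individual, 1 = group, 2 = untargeted.
y = np.arange(60) % 3

# voting="soft" averages the class probabilities of the base classifiers.
# The optional `weights` argument (not set here) turns this into a weighted
# average; uniform weights are an assumption of this sketch.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

proba = ensemble.predict_proba(X)  # averaged class probabilities
pred = ensemble.predict(X)         # class with the highest averaged probability
```
]]></preformat>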
        <p>After training the model on the labeled training set, we evaluated its performance on a development
(dev) dataset using the macro F1 score.</p>
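        <p>For reference, the macro F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. The plain-Python sketch below illustrates the computation on toy labels; the function name <monospace>macro_f1</monospace> is ours, and in practice Scikit-learn's <monospace>f1_score</monospace> with <monospace>average="macro"</monospace> computes the same quantity.</p>
        <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one hate comment (label 1) is missed by the classifier.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```
]]></preformat>
        <p>In this example the per-class F1 scores are 0.8 for class 0 and 2/3 for class 1, giving a macro F1 of about 0.73.</p>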
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>The dataset provided by the HASOC 2024 shared task organizers contains training, development, and test
sets of code-mixed Bangla text. Each text must be classified as Hate/Not Hate for the first
label and into one of the individual target categories for the second label, which is a
multiclass task; the overall problem is therefore a multilabel, multiclass classification. We
experimented with two approaches: (i) the Scikit-learn chain classifier and (ii) a hierarchical approach.</p>
      <p>Chain Classification: In this approach, we employed chain classification with different machine
learning classifiers to tackle the multilabel, multiclass problem. Chain classification involves
predicting the first label (Hate/Not Hate) and then using that prediction to inform the second label
(target type). By combining these predictions, the models can effectively capture the hierarchical
relationships between the labels. Table 2 presents the performance results for various ML classifiers
using chain classification on the development set. These results, computed using the Scikit-learn library,
demonstrate the models’ effectiveness in predicting both labels and give insight into which classifiers
perform best based on their macro F1-scores.</p>
      <p>In our analysis, the best F1 score for the first label (Hate/Not Hate classification), 0.70,
was achieved using Logistic Regression. We then sought to improve classification
further by integrating a lexicon-based hierarchical analysis with Logistic Regression. This
combination proved effective, increasing the score from 0.70 to 0.72. The integration process involved
applying sentiment analysis techniques, incorporating contextual word embeddings to capture nuanced
meanings in the text, and leveraging sentiment scores derived from a lexicon specifically designed for
hate speech detection. These combined features allowed the model to better capture the context
and emotional tone of the code-mixed Bangla text, improving its ability to detect hate speech
accurately.</p>
      <p>Additionally, the predictions obtained from the first label classification were used to enhance the
feature set for the second label classification. The second label aimed to identify whether the hate
speech was directed towards an individual, a group, or if it was untargeted. By incorporating the first
label’s predictions, we created a more comprehensive and enriched feature set, which significantly
boosted the model’s performance for this second label. This approach not only enhanced the overall
classification accuracy but also provided deeper insights into the linguistic patterns associated with
hate speech, contributing to a better understanding of the types of targets in hate speech scenarios.</p>
      <p>Table 3 presents the results for the second label prediction using various ML classifiers on the enhanced
feature set, reporting the F1 scores achieved on the validation dataset and the rank obtained on the test dataset
provided by the HASOC organizers.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we outlined the strategy devised by our team for the
HASOC 2024 shared task. Our approach involves selecting the most frequent character and
word n-grams from the texts, then consolidating and transforming them into TF-IDF vectors to train
the ML classifiers, which are complemented by a lexicon-based strategy.</p>
      <p>Our team participated in Task 2 and secured 3rd place for the first label
prediction and 4th place for the second label prediction. Our
proposed strategy surpassed most models submitted by other participants in the shared task, positioning
our team as one of the top performers. Furthermore, our work serves as an example of the efficacy
of feature reduction algorithms, even those that are relatively simple, in classification tasks. Moving
forward, our goal is to investigate statistical feature selection algorithms and diverse feature sets to
further enhance the performance of ML classifiers.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checks. After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Das</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Banerjee</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mukherjee</surname></string-name>,
          <article-title>Hate speech and offensive language detection in Bengali</article-title>,
          <source>arXiv preprint arXiv:2210.03479</source>
          (<year>2022</year>). URL: https://arxiv.org/abs/2210.03479.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>A. A.</given-names> <surname>Maruf</surname></string-name>,
          <string-name><given-names>A. J.</given-names> <surname>Abidin</surname></string-name>,
          <string-name><given-names>M. M.</given-names> <surname>Haque</surname></string-name>,
          <string-name><given-names>Z. M.</given-names> <surname>Jiyad</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Golder</surname></string-name>,
          <string-name><given-names>R. A. Z.</given-names> <surname>Aung</surname></string-name>,
          <article-title>Hate speech detection in the Bengali language: a comprehensive survey</article-title>,
          <source>Journal of Big Data</source>
          <volume>11</volume>
          (<year>2024</year>)
          <elocation-id>53</elocation-id>. doi:10.1186/s40537-024-00956-z.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>M. I. H.</given-names> <surname>Junaid</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Hossain</surname></string-name>,
          <string-name><given-names>R. M.</given-names> <surname>Rahman</surname></string-name>,
          <article-title>Bangla hate speech detection in videos using machine learning</article-title>,
          <source>in: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON)</source>,
          <year>2021</year>. doi:10.1109/UEMCON53757.2021.9666550.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>K.</given-names> <surname>Kumari</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>Singh</surname></string-name>,
          <article-title>Ai ml nit patna at hasoc 2019: Deep learning approach for identification of abusive content</article-title>,
          <source>FIRE (Working Notes)</source>
          <volume>2517</volume>
          (<year>2019</year>)
          <fpage>328</fpage>-<lpage>335</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>S.</given-names> <surname>Ahammed</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Rahman</surname></string-name>,
          <string-name><given-names>M. H.</given-names> <surname>Niloy</surname></string-name>,
          <string-name><given-names>S. M. M. H.</given-names> <surname>Chowdhury</surname></string-name>,
          <article-title>Implementation of machine learning to detect hate speech in Bangla language</article-title>,
          <source>in: 2019 8th International Conference System Modeling and Advancement in Research Trends (SMART)</source>,
          <year>2019</year>. doi:10.1109/SMART46866.2019.9117214.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>U.</given-names> <surname>Barman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Das</surname></string-name>,
          <string-name>J. F. Joachim Wagner</string-name>,
          <article-title>Code mixing: A challenge for language identification in the language of social media</article-title>,
          <source>in: Proceedings of the First Workshop on Computational Approaches to Code Switching</source>,
          <year>2014</year>. doi:10.13140/2.1.3385.6967.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>N. D.</given-names> <surname>Gitari</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zuping</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Damien</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Long</surname></string-name>,
          <article-title>A lexicon-based approach for hate speech detection</article-title>,
          <source>International Journal of Multimedia and Ubiquitous Engineering</source>
          <volume>10</volume>
          (<year>2015</year>)
          <fpage>215</fpage>-<lpage>230</lpage>. doi:10.14257/ijmue.2015.10.4.21.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>M. S.</given-names> <surname>Islam</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>M. M.</given-names> <surname>Alam</surname></string-name>,
          <string-name><given-names>N. K.</given-names> <surname>Datta</surname></string-name>,
          <string-name><given-names>M. H.</given-names> <surname>Ali</surname></string-name>,
          <string-name><given-names>M. D.</given-names> <surname>Hossain</surname></string-name>,
          <string-name><given-names>M. G.</given-names> <surname>Moazzam</surname></string-name>,
          <article-title>Natural language processing and machine learning approaches to detect Bangla hate speech on social media</article-title>,
          <source>in: 2023 26th International Conference on Computer and Information Technology (ICCIT)</source>,
          IEEE, <year>2023</year>, pp. <fpage>1</fpage>–<lpage>6</lpage>. doi:10.1109/ICCIT60459.2023.10441452.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>K.</given-names> <surname>Ghosh</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Raihan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Satapara</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Gaur</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dave</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jaki</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <article-title>Overview of the HASOC Track at FIRE 2024: Hate-Speech Identification in English and Bengali</article-title>,
          <source>in: FIRE ’24: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation. December 12-15, Gandhinagar, India</source>,
          Association for Computing Machinery (ACM), New York, NY, USA, <year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>N.</given-names> <surname>Raihan</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Ghosh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Satapara</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Gaur</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dave</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jaki</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <article-title>Overview of the HASOC Track at FIRE 2024: Hate-Speech Identification in English and Bengali</article-title>,
          <source>in: K. Ghosh, T. Mandl, P. Majumder, D. Ganguly (Eds.), Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2024) December 12-15, Gandhinagar, India</source>,
          CEUR-WS.org, <year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>K.</given-names> <surname>Kumari</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>Singh</surname></string-name>,
          <article-title>Ai_ml_nit_patna@hasoc 2020: BERT models for hate speech identification in Indo-European languages</article-title>,
          <source>in: FIRE (Working Notes)</source>,
          <year>2020</year>, pp. <fpage>319</fpage>–<lpage>324</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>