<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hate Speech Detection in Low Resource Indo-Aryan Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sougata Saha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Sullivan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rohini Srihari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>State University of New York at Buffalo</institution>
          ,
          <addr-line>NY 14260</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report outlines the problem formulation and methodology employed by team Chetona for identifying hate speech in low-resource languages from social media comments. We focus on HASOC 2023 Task 4, which involves binary classification of Twitter, Facebook, and YouTube comments for hate speech in the Bengali, Bodo, and Assamese languages. We propose ensembling IndicBERT and Naive Bayes, along with synthetic data upsampling techniques, and attain macro F1 scores of 0.73, 0.68, and 0.84 for Assamese, Bengali, and Bodo, respectively. These scores are significant improvements over existing baselines, placing us within the top 10 of the leaderboard for all languages. The code and method are available on GitHub.1</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>Low resource language</kwd>
        <kwd>Assamese</kwd>
        <kwd>Bengali</kwd>
        <kwd>Bodo</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Hateful comments are prevalent on social media platforms. Although tools for automatically detecting, flagging, and blocking such false, offensive, and harmful content online have matured lately, research in low-resource languages such as Assamese, Bengali, and Bodo is still lacking [1, 2, 3]. Most prior research pertains to detecting offensive text on social media while neglecting the subtler and broader task of identifying hateful comments. This document delineates the problem definition and the approach we incorporated to tackle the challenges presented by HASOC 2023 Task 4 [4, 5], which revolves around identifying hate speech in low-resource Indo-Aryan textual content sourced from social media platforms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Annihilate Hates (Task 4)</title>
      <sec id="sec-2-1">
        <title>2.1. Problem Statement</title>
        <p>This task aims to identify hate speech in social media (Twitter, Facebook, and YouTube) text comments spanning the low-resource languages of Bengali, Bodo, and Assamese from Eastern India. The datasets for these languages contain sentences labeled as hate/offensive (HOF) or not hate (NOT), and the goal is to develop robust machine learning models that can reliably predict the correct binary class for each comment. Models are evaluated and compared using the Macro F1 score.</p>
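        <p>Since submissions are ranked by Macro F1, it helps to be explicit about the metric: F1 is computed per class (HOF and NOT) and then averaged without weighting by class frequency, so the minority class counts as much as the majority class. The following is a minimal stand-alone sketch; the label names follow the task, and in practice one would typically use scikit-learn's f1_score with average="macro" instead.</p>
        <preformat>
```python
def macro_f1(y_true, y_pred, labels=("HOF", "NOT")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        # Count true positives, false positives, and false negatives per class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    # Average over classes, not over samples.
    return sum(scores) / len(scores)
```
        </preformat>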
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methodology</title>
        <p>Our approach combines translation-based data augmentation with an ensemble of models for hate-speech detection.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Data Augmentation</title>
          <p>
            The provided training data comprises 1281 Bengali, 4036 Assamese, and 1679 Bodo samples. We up-sample the training examples of each language by translating the examples from the other two languages into the given language. For each language, we first use the IndicTrans2 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] Indic-to-English translation model to translate examples from the other two languages into English. Next, we use the English-to-Indic translation model to translate the English text into the desired target language. This method generates an additional 5715 Bengali, 2960 Assamese, and 5317 Bodo noisy training samples, which we add to the original training data. We remove all emojis from the comments and truncate them to 50 tokens. Figure 1 illustrates our data augmentation pipeline.
          </p>
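          <p>The pipeline above can be sketched as follows. Here translate is a stand-in for the IndicTrans2 Indic-to-English and English-to-Indic models (model loading and batching are omitted), and the emoji ranges and whitespace tokenisation are simplifying assumptions rather than the exact preprocessing we used.</p>
          <preformat>
```python
import re

# Approximate emoji ranges; the real pipeline may cover more blocks.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F000-\U0001F0FF"
    "\U00002600-\U000026FF\U00002700-\U000027BF]"
)

def strip_emojis(text):
    """Drop emoji characters from a comment."""
    return EMOJI_RE.sub("", text)

def truncate_tokens(text, max_tokens=50):
    """Keep at most max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])

def augment(datasets, translate):
    """Round-trip-translate every language's examples into the other languages.

    datasets maps language -> list of (text, label) pairs; translate(texts, src, tgt)
    is a hypothetical hook standing in for the two IndicTrans2 models.
    """
    # Start from the cleaned originals.
    augmented = {
        lang: [(truncate_tokens(strip_emojis(t)), y) for t, y in rows]
        for lang, rows in datasets.items()
    }
    for src, rows in datasets.items():
        texts = [strip_emojis(t) for t, _ in rows]
        english = translate(texts, src, "english")      # Indic -> English
        for tgt in datasets:
            if tgt == src:
                continue
            back = translate(english, "english", tgt)   # English -> target Indic
            augmented[tgt] += [
                (truncate_tokens(t), y) for t, (_, y) in zip(back, rows)
            ]
    return augmented
```
          </preformat>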
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Model Architecture and Training</title>
          <p>
            We implement an ensemble approach using IndicBERT [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and Multinomal Naive Bayes.
IndicBERT is a multilingual language model with 278 million parameters, and was trained on the
IndicCorp v2 dataset [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and evaluated on the IndicXTREME [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] benchmark. This model is
versatile, supporting 23 Indic languages as well as English. We use the IndicBERT-MLM+Samanantar
variant, a BERT-style model [10] trained on IndicCorp v2 and the Samanantar Parallel Corpus
[11] focused on the MLM (Masked Language Model) objective. The final HOF/NOT predictions
are made by passing the pooler output through a dropout layer with 0.1 probability, followed
by a single linear layer. We also train a multinomal Naive Bayes (NB) model for each language
using Sklearn. The NB model inputs tf-idf representations of comment tokens and predicts the
binary HOF/NOT class. We weigh the predicted probabilities from the BERT-based model and
Naive Bayes model using a ratio of 4:1, and classify the text as HOF if the weighted sum is above
a threshold of 0.5.
          </p>
          <p>Since multi-task learning generally yields better results, we train a single model for all three languages. Furthermore, to distinguish between original and translated examples, we prepend each sample text with &lt;language&gt; + "original:" or "translated:", e.g. "bengali original:&lt;followed by the original text&gt;", "bodo translated:&lt;followed by the translated text&gt;". We fine-tuned IndicBERT's 'ai4bharat/IndicBERTv2-MLM-Sam-TLM' version from the Hugging Face library using PyTorch. The model was optimized using AdamW [12] with a learning rate of 2e-5. We trained the model for 20 epochs on a single A5000 GPU with a batch size of 16, with early stopping if the validation loss did not decrease for 2 epochs; training took approximately 1 hour. The Naive Bayes models were trained with the default scikit-learn settings.</p>
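          <p>The prefixing scheme for the multi-task setup can be illustrated with a small helper; the function name is ours, while the tag format follows the examples in the text.</p>
          <preformat>
```python
def tag_example(text, language, is_translated):
    """Prepend the language and origin tag used in the multi-task training setup."""
    origin = "translated" if is_translated else "original"
    return f"{language} {origin}:{text}"
```
          </preformat>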
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Results and Evaluation</title>
          <p>We report our results in Table 1 and compare test-set Macro F1 scores against the teams that submitted predictions for all three languages, highlighting in bold the best F1 score for each language and underlining our results. We attain the best results for Assamese and competitive results for Bengali and Bodo. Since the scripts of many Indian languages are closely related to Devanagari, we also experimented with the IndicBERT-SS variant instead of IndicBERT-MLM+Samanantar; IndicBERT-SS was trained with the MLM objective on a corpus of Indic languages transliterated into Devanagari, to encourage better lexical sharing among languages. We did not observe any improvement on the validation or test sets, so we do not report those numbers here.</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>Table 1: Test-set Macro F1 comparison</title>
          <p>[Table 1: test-set Macro F1 scores per language for the teams AI Alchemists, Avigail Stekel, Chen876, Ours, CIT TEAM, CNLP-NITS-PP, Code Fellas, FiRC-NLP, InclusiveTechies, IRLab@IITBHU, Komar99, Michal Stekel, MUCS, Ravens, SATLab, Team +1, and TeamBD; the scores are not recoverable from this version.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>This report outlines the problem formulation and the methodology employed to address HASOC 2023 Task 4, focused on hate speech detection in low-resource social media text. Our proposed method implements a weighted ensemble of IndicBERT v2 and Multinomial Naive Bayes and incorporates a translation-based data augmentation approach. Our results indicate that our implementation can robustly detect social media hate speech in low-resource Indo-Aryan languages, thus promoting a safer and more inclusive online environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senapati</surname>
          </string-name>
          , U. Garain,
          <article-title>Baseline BERT models for conversational hate speech detection in code-mixed tweets utilizing data augmentation and offensive language identification in Marathi</article-title>
          , in: Fire,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID:259123570.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Senapati</surname>
          </string-name>
          ,
          <article-title>Hate speech detection: a comparison of mono and multilingual transformer model with cross-language evaluation</article-title>
          ,
          <source>in: Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation</source>
          , De La Salle University, Manila, Philippines,
          <year>2022</year>
          , pp.
          <fpage>853</fpage>
          -
          <lpage>865</lpage>
          . URL: https://aclanthology.org/2022.paclic-1.94.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Basumatary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gogoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senapati</surname>
          </string-name>
          ,
          <article-title>Transformer-based hate speech detection in assamese</article-title>
          ,
          <source>in: 2023 IEEE Guwahati Subsection Conference (GCON)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/GCON58516.2023.10183497.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Annihilate Hates (Task 4, HASOC 2023): Hate Speech Detection in Assamese, Bengali, and Bodo languages</article-title>
          , in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Dmonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <source>Overview of the HASOC subtracks at FIRE 2023: Hate speech and offensive content identification in Assamese, Bengali, Bodo, Gujarati and Sinhala</source>
          , in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] J. Gala, P. A. Chitale, R. AK, S. Doddapaneni, V. Gumma, A. Kumar, J. Nawale, A. Sujatha, R. Puduppully, V. Raghavan, P. Kumar, M. M. Khapra, R. Dabre, A. Kunchukuttan,
          <article-title>IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages</article-title>
          ,
          <source>arXiv preprint arXiv:2305.16307</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doddapaneni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aralikatte</surname>
          </string-name>
          , G. Ramesh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Khapra</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kunchukuttan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Towards leaving no indic language behind: Building monolingual corpora, benchmark and models for indic languages</article-title>
          ,
          <source>ArXiv abs/2212.05409</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kakwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Golla</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. N.C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Khapra</surname>
          </string-name>
          , P. Kumar,
          <article-title>IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2020</year>
          ,
          Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>4948</fpage>
          -
          <lpage>4961</lpage>
          . URL: https://aclanthology.org/2020.findings-emnlp.445. doi:10.18653/v1/2020.findings-emnlp.445.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doddapaneni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aralikatte</surname>
          </string-name>
          , G. Ramesh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Khapra</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Kunchukuttan,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] G. Ramesh, S. Doddapaneni, A. Bheemaraj, M. Jobanputra, R. AK, A. Sharma, S. Sahoo, H. Diddee, M. J, D. Kakwani, N. Kumar, A. Pradeep, S. Nagaraj, K. Deepak, V. Raghavan, A. Kunchukuttan, P. Kumar, M. S. Khapra,
          <article-title>Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics 10 (2022) 145-162</source>
          . URL: https://aclanthology.org/2022.tacl-1.9. doi:10.1162/tacl_a_00452.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] I. Loshchilov, F. Hutter,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05101</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>