<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hate and Offensive content identification from Dravidian social media posts: A deep learning approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anu Priya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central University of Punjab</institution>
          ,
          <addr-line>Bathinda, Punjab</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Engineering, Siksha 'O' Anusandhan Deemed to be University</institution>
          ,
          <addr-line>Bhubaneswar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Identifying hate and offensive content in social media posts is one of the most challenging tasks in Natural Language Processing. Non-standard acronyms, misspellings, poor grammar, and multilingualism in social media posts make detecting hate and offensive language even more difficult. This work proposes a deep neural network-based model for identifying offensive social media posts among Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed messages. Combining one- to six-gram character-level Term Frequency-Inverse Document Frequency (TF-IDF) features with a four-layer deep neural network performed better than the other combinations of character-level n-gram TF-IDF features. For Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed social media posts, the proposed model attained weighted F1-scores of 0.84, 0.65, and 0.71, respectively. The code for the proposed models is available at https://github.com/Abhinavkmr/Hate-Speech-Identification-Dravidian-Language.git</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech</kwd>
        <kwd>Social media</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Offensive language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the past ten years, alongside population growth, the number of people using
social networking has increased dramatically. Sites such as Twitter and Facebook let people
express and share their opinions globally, which has resulted in a flood of textual
information on these platforms [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These platforms were developed to connect people from around the globe.
Today, social platforms serve a variety of purposes, including letting governments engage
citizens, helping consumers make informed decisions, and supporting disaster management [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Social media have a dark
side as well [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The widespread use of such platforms has led to the spread of abusive and
controversial content, which has resulted in cyberstalking. The lack of social media monitoring
norms contributes to the unhealthy use of these networks. The number of negative comments
on social media has grown rapidly over time. This has attracted considerable interest from
researchers who are trying to work out how to filter hate and offensive content from social media
[
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. Identifying hate and offensive social media content becomes even more
complex when the posted messages are code-mixed or script-mixed [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        In non-native English-speaking countries, users post a large fraction of multilingual social
media content (mainly code-mixed and script-mixed). Several works [12, 13, 14, 15] have
addressed the detection of hate and offensive content in English, Hindi, and German social
media posts. Raj et al. [14] proposed Convolutional Neural Network (CNN), Bi-directional Long
Short-Term Memory (BiLSTM), and hybrid (CNN+BiLSTM) models. Ray and Garain [15] used TF-IDF,
Word2Vec, and other textual features to train Random Forest and RoBERTa models, whereas Ou and
Li [13] proposed a model based on XLM-RoBERTa and Ordered Neurons LSTM (ON-LSTM). Recently, a
few works [16, 17, 18] have addressed the identification of hate and offensive language in
Dravidian social media posts. Sai and Sharma [17] translated and transliterated the posts and
combined several transformer-based models to identify hate and offensive content in Dravidian
social media posts. Tula et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed an ensemble-based model by combining several popular
BERT-based models. Kumar et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] explored the usability of different deep learning models,
such as attention-based Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN),
and conventional machine learning models, such as Support Vector Machine, Logistic Regression,
Random Forest, and Naive Bayes, for identifying hate content. In their experiments, they
found that character-level TF-IDF features with conventional machine learning models
achieved state-of-the-art performance. Further, Saumya et al. [18] explored several popular
models such as BERT, ULMFiT, hybrid deep learning models, and several conventional machine
learning models. They also found that character-level features with conventional
machine learning models outperformed several complex deep learning models for hate
content identification in Dravidian social media posts.
      </p>
      <p>In line with these studies, this work proposes a deep neural network-based model that uses
character n-gram TF-IDF features to classify Tamil script-mixed, Tamil code-mixed, and
Malayalam code-mixed social media posts into offensive and not-offensive classes. The proposed
model is validated on the datasets provided by the HASOC-Dravidian-CodeMix-FIRE2021
challenge [19]. The organizers set two tasks: (i) Task-1: classification of YouTube
Tamil comments into offensive and not-offensive classes, and (ii) Task-2: classification of
code-mixed Tamil and Malayalam tweets into offensive and not-offensive classes.</p>
      <p>The rest of the paper is organized as follows: Section 2 discusses the proposed methodology
in detail, Section 3 lists the findings of the proposed system, and Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The schematic diagram of the proposed dense neural network-based model is shown in
Figure 1. The proposed system is validated on the dataset provided in the
HASOC-Dravidian-CodeMix-FIRE2021 challenge [19]. The overall data statistics are given in Table 1.</p>
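      <p>As a rough illustration of the character n-gram TF-IDF features that feed the dense network, the following standard-library Python sketch extracts one- to six-character n-grams and weights them with a smoothed TF-IDF scheme. It is a minimal sketch, not the authors' implementation: the function names, the toy corpus, the smoothed IDF formula, the L2 normalisation, and the reading of "first N features" as the N most frequent n-grams are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def fit_tfidf(corpus, n_min=1, n_max=6, max_features=30000):
    """Learn a vocabulary and smoothed IDF weights from a list of posts."""
    term_freq = Counter()  # total occurrences of each n-gram in the corpus
    doc_freq = Counter()   # number of posts that contain each n-gram
    for post in corpus:
        grams = list(char_ngrams(post, n_min, n_max))
        term_freq.update(grams)
        doc_freq.update(set(grams))
    # Assumption: keep the most frequent n-grams (ties broken alphabetically).
    ranked = sorted(term_freq, key=lambda g: (-term_freq[g], g))
    kept = ranked[:max_features]
    vocab = {g: j for j, g in enumerate(sorted(kept))}
    n_docs = len(corpus)
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0 for g in vocab}
    return vocab, idf


def transform(post, vocab, idf, n_min=1, n_max=6):
    """Return an L2-normalised TF-IDF vector (dense list) for one post."""
    counts = Counter(g for g in char_ngrams(post, n_min, n_max) if g in vocab)
    vec = [0.0] * len(vocab)
    for g, c in counts.items():
        vec[vocab[g]] = c * idf[g]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Toy romanised posts, invented purely for illustration.
corpus = ["semma padam", "mokka padam", "super movie"]
vocab, idf = fit_tfidf(corpus, max_features=50)
features = [transform(p, vocab, idf) for p in corpus]
```
]]></preformat>
      <p>In this sketch, max_features plays the role of the 30,000 (Tamil) or 20,000 (Malayalam) feature limit; the toy call uses 50 only so that it runs on three short strings.</p>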
      <p>Extensive experimentation was performed to find the best-suited features for the dense
neural network. In these experiments, we found that the first 30,000 character one- to
six-gram TF-IDF features performed best for the Tamil script-mixed and Tamil code-mixed datasets,
whereas the first 20,000 character one- to six-gram TF-IDF features performed best for the
Malayalam code-mixed dataset.</p>
      <p>[Figure 1: Proposed dense neural network-based model. Social media posts (Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed) are converted to character (1-6)-gram TF-IDF features and passed through four dense layers to predict the Offensive or Not-Offensive class.]</p>
      <p>The proposed system for the Tamil script-mixed and Tamil code-mixed
datasets has four dense layers containing 4,096, 512, 64, and 2 neurons, respectively. At the
output layer, a softmax activation function is used to calculate the class probabilities that
decide the final class. Similarly, for the Malayalam code-mixed dataset, the proposed dense
neural network has four layers containing 2,048, 512, 64, and 2 neurons, respectively. A softmax
activation is then used to determine the final class for a given social media post. As deep
learning models are very sensitive to the chosen hyper-parameters, we performed a sensitivity
analysis of the model by varying the learning rate, batch size, number of epochs, optimizer, and
loss function. The best-suited hyper-parameters for the proposed system are listed in Table 2.</p>
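      <p>The layer sizes described above can be made concrete with a small standard-library Python sketch that counts the trainable parameters of each configuration and runs a softmax-terminated forward pass. This is only a sketch under stated assumptions: it assumes the full 30,000- or 20,000-dimensional TF-IDF vector is the input to the first dense layer, that every layer has a bias term, and that hidden layers use ReLU (the hidden activation is not stated in the text). The helper names and the scaled-down toy network are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
import random


def count_parameters(input_dim, layer_sizes):
    """Weights plus biases of a fully connected stack."""
    total, fan_in = 0, input_dim
    for units in layer_sizes:
        total += fan_in * units + units
        fan_in = units
    return total


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layers(input_dim, layer_sizes, seed=0):
    """Random weights and zero biases for each dense layer."""
    rng = random.Random(seed)
    layers, fan_in = [], input_dim
    for units in layer_sizes:
        scale = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(units)]
        layers.append((weights, [0.0] * units))
        fan_in = units
    return layers


def forward(x, layers):
    """ReLU on hidden layers (an assumption), softmax on the last layer."""
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        is_output = idx == len(layers) - 1
        x = softmax(z) if is_output else [max(0.0, t) for t in z]
    return x


# Parameter counts for the two configurations described in the text.
tamil_params = count_parameters(30000, [4096, 512, 64, 2])
malayalam_params = count_parameters(20000, [2048, 512, 64, 2])

# Scaled-down stand-in so the forward pass runs instantly in pure Python.
toy_layers = init_layers(20, [16, 8, 4, 2])
probs = forward([0.05] * 20, toy_layers)
```
]]></preformat>
      <p>Under these assumptions, the Tamil configuration has 125,014,722 trainable parameters and the Malayalam configuration has 42,044,098, almost all of them in the first dense layer.</p>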
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The performance of the proposed system is measured in terms of precision, recall, F1-score,
AUC-ROC curve, and confusion matrix. Along with these metrics, weighted precision, weighted
recall, and weighted F1-score are also calculated. The performance of the proposed model on the
Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed datasets is listed in Table 3.</p>
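      <p>For readers unfamiliar with the weighted variants of these metrics, the sketch below computes per-class precision, recall, and F1-score and then averages them with weights proportional to each class's support (its number of true instances). It is a minimal standard-library illustration; the label strings "OFF" and "NOT" and the toy predictions are invented, not taken from the shared-task data.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_scores(y_true, y_pred):
    """Support-weighted average of the per-class scores."""
    support = Counter(y_true)
    n = len(y_true)
    totals = [0.0, 0.0, 0.0]
    for label, count in support.items():
        scores = per_class_scores(y_true, y_pred, label)
        for k, score in enumerate(scores):
            totals[k] += score * count / n
    return tuple(totals)


# Toy labels, invented for illustration only.
y_true = ["OFF", "OFF", "OFF", "NOT", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT", "OFF", "NOT"]
w_precision, w_recall, w_f1 = weighted_scores(y_true, y_pred)
```
]]></preformat>
      <p>On this toy example all three weighted scores equal 0.75. Weighted recall always coincides with overall accuracy, which is one reason weighted precision and weighted F1-score are reported alongside it for imbalanced classes.</p>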
      <p>For the Tamil script-mixed dataset, the proposed system achieved a weighted precision, recall,
and F1-score of 0.84, 0.83, and 0.84, respectively. The confusion matrix and ROC curve for the
Tamil script-mixed dataset are shown in Figures 2 and 3, respectively. For the Tamil code-mixed
dataset, the proposed system achieved a weighted precision, recall, and F1-score of 0.65, 0.66,
and 0.65, respectively. The confusion matrix and ROC curve for the Tamil code-mixed dataset are
shown in Figures 4 and 5, respectively. For the Malayalam code-mixed dataset, the proposed dense
neural network-based model achieved a weighted precision, recall, and F1-score of 0.75, 0.70,
and 0.71, respectively. The confusion matrix and ROC curve for the Malayalam code-mixed dataset
are shown in Figures 6 and 7, respectively.</p>
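      <p>The confusion matrices and ROC curves referred to above can be derived from the model outputs roughly as follows. This is a hedged standard-library sketch: it assumes the offensive class is treated as positive and that the score is the softmax probability of that class; the labels, scores, and the 0.5 threshold are invented for illustration. The area under the ROC curve is computed as the fraction of positive-negative pairs in which the positive example receives the higher score (ties count as one half).</p>
      <preformat><![CDATA[
```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class, columns the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def roc_auc(y_true, scores, positive):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and softmax scores for the offensive class (illustrative only).
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT"]
off_scores = [0.9, 0.4, 0.6, 0.3, 0.1]
y_pred = ["OFF" if s >= 0.5 else "NOT" for s in off_scores]

matrix = confusion_matrix(y_true, y_pred, labels=["NOT", "OFF"])
auc = roc_auc(y_true, off_scores, positive="OFF")
```
]]></preformat>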
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The identification of hate and offensive content in script-mixed and code-mixed social media
posts has been an active research topic in recent times. This work proposes a dense neural
network-based model that uses character-level TF-IDF features to identify offensive messages
in Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed datasets. The proposed
model achieved weighted F1-scores of 0.84, 0.65, and 0.71 for Tamil script-mixed, Tamil
code-mixed, and Malayalam code-mixed social media posts, respectively. Because character-level
features give promising performance on Dravidian social media posts, they can be explored
further in the future. An ensemble-based model could also be built in the future to achieve
better performance.
</p>
      <ref-list>
        <ref>
          <mixed-citation>[11] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@ HASOC-FIRE2020: Fine tuned bert for the hate speech and offensive content identification from social media, in: FIRE (Working Notes), 2020, pp. 266–273.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[12] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the hasoc track at fire 2020: Hate speech and offensive language identification in tamil, malayalam, hindi, english and german, in: FIRE, 2020, pp. 29–32.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[13] X. Ou, H. Li, Ynu_oxz at hasoc 2020: Multilingual hate speech and offensive content identification based on xlm-roberta, in: FIRE (Working Notes), 2020, pp. 121–127.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[14] R. Raj, S. Srivastava, S. Saumya, NSIT &amp; IIITDWD@ HASOC 2020: Deep learning model for hate-speech identification in indo-european languages, in: FIRE (Working Notes), 2020, pp. 161–167.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[15] B. Ray, A. Garain, JU at HASOC 2020: Deep learning with RoBERTa and random forest for hate speech and offensive content identification in Indo-European languages, in: FIRE (Working Notes), 2020, pp. 168–174.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[16] B. R. Chakravarthi, A. K. M, J. P. McCrae, B. Premjith, K. Soman, T. Mandl, Overview of the track on HASOC-offensive language identification-DravidianCodeMix, in: FIRE (Working Notes), 2020, pp. 112–120.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[17] S. Sai, Y. Sharma, Siva@ HASOC-Dravidian-CodeMix-FIRE-2020: Multilingual offensive speech detection in code-mixed and romanized text, in: FIRE (Working Notes), 2020, pp. 336–343.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[18] S. Saumya, A. Kumar, J. P. Singh, Offensive language identification in Dravidian code mixed social media text, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 36–45.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[19] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <article-title>Event classification and location prediction from tweets during disasters</article-title>
          ,
          <source>Annals of Operations Research</source>
          <volume>283</volume>
          (
          <year>2019</year>
          )
          <fpage>737</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Location reference identification from tweets during emergencies: A deep learning approach</article-title>
          ,
          <source>International Journal of Disaster Risk Reduction</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>365</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <article-title>A deep multi-modal neural network for informative twitter content classification during emergencies</article-title>
          ,
          <source>Annals of Operations Research</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Disaster severity prediction from twitter images</article-title>
          ,
          <source>in: Intelligence Enabled Research</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sreelakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Premjith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Soman</surname>
          </string-name>
          ,
          <article-title>Detection of hate speech text in hindi-english code-mixed data</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>171</volume>
          (
          <year>2020</year>
          )
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Potluri</surname>
          </string-name>
          , S. Ms,
          <string-name>
            <given-names>S.</given-names>
            <surname>Doddapaneni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          , Bitions@ DravidianLangTech-EACL2021:
          <article-title>Ensemble of multilingual language models with pseudo labeling for offence detection in Dravidian languages</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>291</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , IIIT_DWD@ HASOC 2020:
          <article-title>Identifying offensive content in indo-european languages</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , A.
          <string-name>
            <surname>Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          , T. Mandl,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>H. R L</string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics</source>
          , Kyiv,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          . URL:
          <ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/2021.dravidianlangtech-1.17">https://aclanthology.org/2021.dravidianlangtech-1.17</ext-link>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Raja</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of pretrained embeddings to identify hate speech in indian code-mixed text</article-title>
          ,
          <source>in: 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>25</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICACCCN51052.2020.9362731</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>NITP-AI-NLP@ HASOC-Dravidian-CodeMix-FIRE2020: A machine learning approach to identify offensive languages from Dravidian code-mixed text</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>384</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>NITP-AI-NLP@ HASOC-FIRE2020: Fine tuned bert for</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>