<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Offensive Language Identification on Multilingual Code Mixing Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jyoti Kumari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Engineering, National Institute of Technology Patna</institution>
          ,
          <addr-line>Patna</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Engineering, Siksha 'O' Anusandhan Deemed to be University</institution>
          ,
          <addr-line>Bhubaneswar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Hate and offensive language identification on social media platforms has been an active area of research. Because user-generated social media posts contain grammatical errors, spelling mistakes, and non-standard abbreviations, identifying hate and offensive posts is a challenging task. In non-native English-speaking countries, social media texts are often code-mixed or script-mixed/switched, which makes the task considerably more difficult. This work proposes ensemble-based models for the identification of offensive language in Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed social media posts. Character n-gram TF-IDF features combined with the ensemble-based models show promising results, with weighted F1-scores of 0.83 for Tamil script-mixed, 0.67 for Tamil code-mixed, and 0.77 for Malayalam code-mixed social media posts. The code for the proposed models is available at https://github.com/Abhinavkmr/Dravidian-hate-speech.git</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech</kwd>
        <kwd>Dravidian language</kwd>
        <kwd>Code-mixed</kwd>
        <kwd>Social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Technological advances aimed at easing people's lives have attracted many users towards
digitization, especially the younger generation. Today, a person's life feels incomplete without
social media [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Online social media platforms such as Facebook and Twitter allow users to
connect with their friends, make new friends, and share their thoughts, pictures, and videos [2]. The
number of users is also increasing day by day. Along with the huge volume of generated data [3, 4], the use of offensive
language and terminology is increasing at a rapid pace. This creates a serious problem
for a sustainable society [5].
      </p>
      <p>Offensive language broadly comprises hate speech targeting race, age, sexual
orientation, disability, or religion, as well as content that promotes violence or hatred. Such
content can severely affect a user's mental health, leading to depression, sleeplessness, and even
suicide. A few countries have already adopted strict rules or policies against such activities carried out
in the name of freedom of expression or freedom to write [6].</p>
      <p>Manual identification of hate speech is infeasible for several reasons, such as the huge
volume of data, differing policies, and the many types of hate speech; it should instead be done
automatically [6, 7]. A few researchers have built such models [8, 9, 10]. Agarwal and
Sureka [11] extracted linguistic, semantic, and sentiment features and trained an ensemble
classifier to detect racist content. Kapil et al. [6] proposed LSTM- and CNN-based models to
identify hate speech in social media posts, whereas Badjatiya et al. [12] learned semantic
word embeddings to classify each tweet as racist, sexist, or neither. Kumari and Singh [13]
presented a deep learning model to detect hate speech in English text. A considerable amount
of research exists for the English language. The major challenges arise
for code-mixed and script-mixed sentences because sufficiently large datasets are unavailable.</p>
      <p>The purpose of this study is to classify Tamil script-mixed, Tamil
code-mixed, and Malayalam code-mixed social media posts into offensive and not-offensive
classes. The proposed models are validated on the datasets provided by the
HASOC-DravidianCodeMix-FIRE2021 challenge [14]. The organizers defined two tasks: (i) Task-1:
classification of YouTube Tamil comments into offensive and not-offensive classes, and (ii) Task-2:
classification of code-mixed Tamil and Malayalam tweets into offensive and not-offensive classes.
This paper explores the usability of character-level features with ensemble-based models
for classifying Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed social media posts
into offensive and not-offensive classes.</p>
      <p>The rest of the article is organized as follows: the proposed methodology is explained in
Section 2, the experimental setting and the obtained results are discussed in Section 3, and the
paper is concluded in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>This section discusses the proposed methodology for the identification of offensive social media
posts. The proposed models are validated on three datasets [14]: (i) Tamil script-mixed, (ii)
Tamil code-mixed, and (iii) Malayalam code-mixed social media posts. The overall statistics of the data
used in this study are given in Table 1. Two different ensemble-based methods are proposed:
(i) an ensemble of Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest
(RF) classifiers for the Tamil code-mixed and Malayalam code-mixed social media posts (see Figure 1), and (ii)
an ensemble of AdaBoost classifiers trained on three different validation splits for the Tamil script-mixed posts (see Figure 2).</p>
      <sec id="sec-2-1">
        <title>2.1. Ensemble-based model for Tamil and Malayalam code-mixed dataset</title>
        <p>The schematic diagram of the proposed ensemble-based model for the identification of offensive
Tamil and Malayalam code-mixed social media posts is shown in Figure 1. Character n-gram
TF-IDF (Term Frequency-Inverse Document Frequency) features are given to the SVM, LR, and RF
classifiers. The probabilities predicted by each classifier for the offensive and not-offensive
classes are then averaged to obtain the final probability of each class. The class with the higher
averaged probability becomes the final class label (see Figure 1). Experiments were
performed with different combinations of character n-gram (1-gram to 6-gram) TF-IDF features. In
these experiments, the first 30,000 one- to six-gram character TF-IDF
features performed best. The results of the proposed model are listed in Section 3.</p>
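        <p>As a rough illustration of the two ideas in this subsection, the following Python sketch extracts character 1- to 6-grams from a post and averages per-class probabilities from several classifiers. It is not the authors' released code: the function names, the frequency-based reading of "first 30,000 features", and the example probabilities are assumptions made for illustration only.</p>
        <preformat>
```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Return every character n-gram of length n_min..n_max in text."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


def build_vocabulary(posts, max_features):
    """Keep the max_features most frequent n-grams across all posts.

    Assumption: "first 30,000 features" is read here as the most
    frequent n-grams; the paper does not spell out the ranking.
    """
    counts = Counter()
    for post in posts:
        counts.update(char_ngrams(post))
    return [gram for gram, _ in counts.most_common(max_features)]


def soft_vote(per_model_probs, labels=("not-offensive", "offensive")):
    """Average class probabilities over models and pick the larger one."""
    n_models = len(per_model_probs)
    averaged = [
        sum(probs[i] for probs in per_model_probs) / n_models
        for i in range(len(labels))
    ]
    best = max(range(len(labels)), key=lambda i: averaged[i])
    return labels[best], averaged


# Hypothetical [P(not-offensive), P(offensive)] outputs of the SVM, LR,
# and RF classifiers for a single post (made-up numbers).
example_probs = [[0.40, 0.60], [0.55, 0.45], [0.30, 0.70]]
label, averaged = soft_vote(example_probs)
print(label, [round(p, 3) for p in averaged])
```
        </preformat>
        <p>In practice the probabilities would come from SVM, LR, and RF models trained on the TF-IDF weights of the selected n-grams; the averaging step itself stays the same.</p>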
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Ensemble-based model for Tamil script-mixed dataset</title>
        <p>The schematic diagram of the proposed ensemble-based model for the identification of offensive
Tamil script-mixed social media posts is shown in Figure 2. As in the previous model
(Figure 1), character n-gram TF-IDF features are the input, here to AdaBoost classifiers trained on three different
validation splits. Three different random seeds (10, 20, and 42) are used to divide the data samples
into training and validation sets. The predicted probabilities of the offensive and not-offensive
classes from the three AdaBoost models are then averaged to obtain the final classification
probability. In these experiments, the first 50,000 one- to six-gram
character TF-IDF features performed best. The results of the proposed model are listed in Section
3.</p>
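        <p>A minimal sketch of the seed-averaged AdaBoost ensemble described above, assuming scikit-learn is available. The text reports only the seeds (10, 20, 42), the character 1- to 6-gram TF-IDF features, and the 50,000-feature cap; the 80/20 stratified split, the default AdaBoost settings, the use of <monospace>max_features</monospace> (which keeps the most frequent n-grams) to apply the cap, and the toy posts and labels are all assumptions.</p>
        <preformat>
```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

SEEDS = (10, 20, 42)  # seeds reported in the text


def train_seeded_ensemble(texts, labels, max_features=50000, val_size=0.2):
    """Fit one TF-IDF + AdaBoost pipeline per seed-specific split."""
    models = []
    for seed in SEEDS:
        # Assumed 80/20 stratified split; only the seeds are given in the text.
        x_train, x_val, y_train, y_val = train_test_split(
            texts, labels, test_size=val_size, random_state=seed, stratify=labels
        )
        model = make_pipeline(
            TfidfVectorizer(
                analyzer="char", ngram_range=(1, 6), max_features=max_features
            ),
            AdaBoostClassifier(random_state=seed),
        )
        model.fit(x_train, y_train)
        models.append(model)
    return models


def predict_averaged(models, texts):
    """Average predict_proba over the models and take the arg-max class."""
    probs = np.mean([m.predict_proba(texts) for m in models], axis=0)
    classes = models[0].classes_
    return classes[np.argmax(probs, axis=1)], probs


# Toy, made-up posts; "NOT" and "OFF" are hypothetical label names.
texts = [
    "good song", "nice video", "super scene", "great movie", "lovely music",
    "bad xx song", "xx worst video", "xx awful scene", "terrible xx movie",
    "xx ugly music",
]
labels = ["NOT"] * 5 + ["OFF"] * 5
models = train_seeded_ensemble(texts, labels)
pred, probs = predict_averaged(models, ["xx poor clip"])
print(pred[0], probs.round(3))
```
        </preformat>
        <p>Each model sees a different training subset, so averaging their probabilities reduces the dependence of the final prediction on any single split.</p>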
        <p>[Figure 2. Input label: Offensive Social Media Tamil Posts (Script mixed).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The performance of the proposed models is measured in terms of precision, recall, and F1-score.
In addition, the confusion matrix and AUC-ROC curve are plotted. The results for
the Tamil script-mixed, Tamil code-mixed, and Malayalam code-mixed datasets are listed in Table
2. The proposed ensemble-based model achieved a weighted precision of 0.82, a weighted
recall of 0.84, and a weighted F1-score of 0.83 on the Tamil script-mixed dataset. The confusion
matrix and ROC curve for the Tamil script-mixed dataset are shown in Figures 3 and 4,
respectively.
[Figures 3 and 4: confusion matrix and ROC curve for the Tamil script-mixed dataset. ROC axes: False Positive Rate (x), True Positive Rate (y). Values surviving extraction: 0.95, 0.64.]</p>
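      <p>For readers unfamiliar with weighted metrics, the sketch below shows one common way to compute them: each per-class precision, recall, and F1-score is weighted by that class's share of the true labels. The function name, the label names, and the toy labels are assumptions for illustration and are not taken from the shared-task data.</p>
      <preformat>
```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all true classes."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy example: 6 not-offensive (NOT) and 4 offensive (OFF) posts.
y_true = ["NOT"] * 6 + ["OFF"] * 4
y_pred = ["NOT"] * 5 + ["OFF"] + ["OFF"] * 3 + ["NOT"]
p, r, f = weighted_scores(y_true, y_pred)
print(round(p, 2), round(r, 2), round(f, 2))
```
      </preformat>
      <p>Weighted recall computed this way equals overall accuracy, and the weighted F1-score is a support-weighted mean of the per-class F1-scores, so it need not equal the harmonic mean of the weighted precision and weighted recall.</p>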
      <p>Similarly, the proposed ensemble-based model achieved a
weighted precision, recall, and F1-score of 0.67 on the Tamil code-mixed dataset, whereas on the Malayalam code-mixed dataset it
achieved a weighted precision of 0.78, a weighted recall of 0.76, and a weighted F1-score of 0.77.
The confusion matrices and ROC curves for the Tamil code-mixed and Malayalam code-mixed
datasets are shown in Figures 5 and 6 and Figures 7 and 8, respectively.
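[Figures 5 and 6: confusion matrix and ROC curve for the Tamil code-mixed dataset. ROC axes: False Positive Rate (x), True Positive Rate (y). Class label surviving extraction: NOT. Values surviving extraction: 0.73, 0.42.]</p>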
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>[Figures 7 and 8: confusion matrix and ROC curve for the Malayalam code-mixed dataset. ROC axes: False Positive Rate (x), True Positive Rate (y). Class label surviving extraction: NOT. Values surviving extraction: 0.78, 0.28.]</p>
      <p>Hate and abusive language detection in code-mixed and script-mixed Dravidian social media
posts is one of the most challenging tasks in natural language processing. Two different
ensemble-based models have been developed: one for Tamil and Malayalam code-mixed posts and
another for Tamil script-mixed posts. The proposed models achieved
weighted F1-scores of 0.83 for Tamil script-mixed, 0.67 for Tamil code-mixed, and 0.77 for
Malayalam code-mixed social media posts. As character-level features give promising
results for code-mixed and script-mixed social media posts, they can be explored further for
developing a robust system in the future.
      </p>
      <p>[2] K. Gaurav, A. Sinha, J. P. Singh, P. Kumar, Facebook like: Past, present and future, in: Data Engineering and Intelligent Computing, Springer, 2018, pp. 617–625.</p>
      <p>[3] A. Kumar, J. P. Singh, S. Saumya, A comparative analysis of machine learning techniques for disaster-related tweet classification, in: 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC)(47129), IEEE, 2019, pp. 222–227.</p>
      <p>[4] A. Kumar, N. C. Rathore, Relationship strength based access control in online social networks, in: Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 2, Springer, 2016, pp. 197–206.</p>
      <p>[5] S. Saumya, J. P. Singh, Detection of spam reviews: A sentiment analysis approach, CSI Transactions on ICT 6 (2018) 137–148.</p>
      <p>[6] P. Kapil, A. Ekbal, D. Das, Investigating deep learning approaches for hate speech detection in social media, arXiv preprint arXiv:2005.14690 (2020).</p>
      <p>[7] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@HASOC-FIRE2020: Fine tuned BERT for the hate speech and offensive content identification from social media, in: FIRE (Working Notes), 2020, pp. 266–273.</p>
      <p>[8] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@HASOC-Dravidian-CodeMix-FIRE2020: A machine learning approach to identify offensive languages from Dravidian code-mixed text, in: FIRE (Working Notes), 2020, pp. 384–390.</p>
      <p>[9] A. K. Mishra, S. Saumya, A. Kumar, IIIT_DWD@HASOC 2020: Identifying offensive content in Indo-European languages (2020).</p>
      <p>[10] S. Saumya, A. Kumar, J. P. Singh, Offensive language identification in Dravidian code mixed social media text, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 36–45.</p>
      <p>[11] S. Agarwal, A. Sureka, Characterizing linguistic attributes for automatic classification of intent based racist/radicalized posts on Tumblr micro-blogging website, arXiv preprint arXiv:1701.04931 (2017).</p>
      <p>[12] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on WWW Companion, 2017, pp. 759–760.</p>
      <p>[13] K. Kumari, J. P. Singh, AI_ML_NIT Patna at HASOC 2019: Deep learning approach for identification of abusive content, in: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation (December 2019), 2019, pp. 328–335.</p>
      <p>[14] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dasari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <article-title>Controlling and mitigating targeted socio-economic attacks</article-title>
          , in: Conference on e-Business, e-Services and e-Society, Springer,
          <year>2016</year>
          , pp.
          <fpage>471</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>