<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Offensive Language Classification of Code-Mixed Tamil with Keras</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suchismita Tripathy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ameya Pathak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani</institution>
          ,
          <addr-line>Pilani Campus</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the method adopted for completing Task 1 of the Dravidian-CodeMix-HASOC (Hate Speech and Offensive Content Identification in English and Indo-European Languages) Shared Task proposed by the Forum for Information Retrieval Evaluation in 2021, for offensive language detection. For detecting offensive language, a custom model architecture using convolutional neural networks was created with Keras for supervised learning and trained on a dataset of YouTube comments written in code-mixed Tamil in both Roman and Tamil scripts. The 5-layer neural network was built using only Keras and required simple tokenized data, padded to an appropriate length. Recurrent neural networks and transfer learning were not used, and an F-score of 0.835 was achieved with the created CNN model.</p>
      </abstract>
      <kwd-group>
        <kwd>offensive language detection</kwd>
        <kwd>code-mixed text</kwd>
        <kwd>Tamil</kwd>
        <kwd>HASOC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Offensive language detection is a classification task that uses supervised learning to identify
offensive statements/texts in corpora [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. With the increasing usage of social media in today’s
hyper-connected world, being able to prevent the abuse of the freedom of speech by writers
of hate comments is very important, and can pave the way to less hostile social environments
that decrease the negative effects of social media usage on mental health and self-esteem [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
Offensive language detection further needs to be implemented for a variety of languages,
including different scripts, while also adapting to code-switching between two or more languages.
Most efforts in offensive language detection have been limited to corpora in only one
language and script, usually using pretrained models for classification, which do not work
well when code-switching is involved [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        This paper focuses on the tasks under Dravidian-CodeMix-HASOC Shared Task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], specifically
Task 1, which involves classifying YouTube comments in Tamil as offensive/inoffensive. Tamil
is a very widely spoken language, with its primary speakers residing in South India, Singapore,
Malaysia, Canada, and Sri Lanka [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. With the advent of social media, and the popularity of the
Roman script in interfaces for the same, Tamil speakers often resort to writing social media
posts/comments in the Roman script itself, with the influence of the English language reaching
the level of sentence structure and vocabulary as well, resulting in code-mixed sentences with
both Tamil and English words, and grammar rules [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The task thus involved dealing with this
code-mixing in the YouTube comments for classification. The approach used achieved a rank of
5 among all the submissions, with an F-score of 0.835.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>A labeled training dataset was provided for HASOC, along with an unlabeled test dataset. The
training dataset included a total of 5880 YouTube comments, each with one of three
labels, with the split shown in Table 1.</p>
      <p>The task involved labeling the 654 un-annotated comments of the test dataset appropriately,
along with an ID for each comment. A validation dataset was not provided for this particular
task, unlike for Task 2 of HASOC. Hence, the training dataset was divided as
part of pre-processing into a smaller training set and a validation set.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Past Work on Offensive Language Detection</title>
      <p>Offensive language datasets have been hard to find, given the specific nature of the problem [8].
Early approaches were based on a sentence-level or user-level scoring mechanism [9], where a score
was assigned to each word. Based on these scores, the sentence or user was evaluated and marked as offensive
or not according to a certain threshold. Liu et al. [10] used a novel augmentation scheme
to improve performance on imbalanced and low-resource data. Offensive language
detection has proved to be an important field for websites to monitor content on their platforms
and for governments to tackle abuse. Risch et al. [11] showed that more complex approaches
such as an LSTM with an attention mechanism offer better accuracy and logical explanation, while
BERT-based [12] approaches [13] were most favoured for the detection task in HASOC 2020. A
team from Norway [14] used an ensemble of RNNs to detect not only hate speech but also
racism and sexism. Ranasinghe et al. explored the effectiveness of cross-lingual embeddings,
while a Microsoft study [15] showed that transfer learning for embeddings proves to be the
most effective in offensive language detection. A study on Bahasa-based offensive language [16]
shows the effectiveness of the sigmoid activation function for detection while highlighting the
challenges posed by abbreviations. Recent studies such as Saumya et al. [17] showed that naive
Bayes, logistic regression, and vanilla neural networks were better than transfer learning models
at offensive language detection for Tamil- and Malayalam-based scripts.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>We created a neural network with a custom architecture using Keras, and trained it on a section
of the training dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing and Tokenisation</title>
        <p>Since a validation dataset was not provided for the task, the training dataset was divided into
training and validation sets. For this, the cutoff was set at 4000, i.e. 4000 data points in the
training set and 1880 in the validation set. Further, the label of each comment was
converted to a 3 by 1 vector through one-hot encoding.</p>
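        <p>The split and the one-hot encoding described above can be sketched in plain Python. This is a minimal illustration rather than the code used for the submission: the label order, the helper names <monospace>one_hot</monospace> and <monospace>split_at_cutoff</monospace>, and the assumption that the first 4000 rows form the training set are all hypothetical, and the comments are placeholders.</p>
        <preformat>
```python
# Assumption: the label order is hypothetical; the paper only names the three classes.
LABELS = ["NOT", "OFF", "not-Tamil"]


def one_hot(label, labels=LABELS):
    """Return a list with 1 at the label's index and 0 elsewhere."""
    vector = [0] * len(labels)
    vector[labels.index(label)] = 1
    return vector


def split_at_cutoff(rows, cutoff=4000):
    """Assumption: the first `cutoff` rows are used for training, the rest for validation."""
    return rows[:cutoff], rows[cutoff:]


# Toy stand-in for the 5880 labeled comments (the text is a placeholder).
rows = [("comment %d" % i, LABELS[i % 3]) for i in range(5880)]
train_rows, val_rows = split_at_cutoff(rows)
train_targets = [one_hot(label) for _, label in train_rows]

print(len(train_rows), len(val_rows))  # 4000 1880
print(one_hot("OFF"))                  # [0, 1, 0]
```
        </preformat>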
        <p>The data was then passed through the Keras [18] text pre-processing Tokenizer. A maximum of
5000 unique keys were saved by the tokenizer. Each comment was also post-padded with 0s to
achieve a length of 100 token indices. The tokenizer was applied separately to the combined
training and validation sets and, finally, to the test dataset.</p>
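        <p>To make the tokenisation step concrete, the following pure-Python sketch shows what mapping comments to at most 5000 integer keys and post-padding to 100 indices amounts to. It is a simplified stand-in, not the Keras implementation: the helpers <monospace>build_vocab</monospace> and <monospace>encode</monospace> are hypothetical, tokens are split on whitespace only, and cutting off tokens beyond the target length is an assumption. In Keras, the corresponding utilities are <monospace>Tokenizer(num_words=5000)</monospace> and <monospace>pad_sequences(..., maxlen=100, padding="post")</monospace>.</p>
        <preformat>
```python
from collections import Counter


def build_vocab(texts, max_keys=5000):
    """Map the most frequent whitespace-separated tokens to indices 1..max_keys.

    Index 0 is reserved for padding.
    """
    counts = Counter(token for text in texts for token in text.lower().split())
    ranked = counts.most_common(max_keys)
    return {token: i + 1 for i, (token, _) in enumerate(ranked)}


def encode(text, vocab, length=100):
    """Turn a comment into exactly `length` indices, post-padded with 0s."""
    indices = [vocab[t] for t in text.lower().split() if t in vocab]
    indices = indices[:length]  # assumption: tokens beyond `length` are dropped
    return indices + [0] * (length - len(indices))


# Toy comments in Roman script, used only for illustration.
texts = ["padam vera level", "vera level mass", "padam super"]
vocab = build_vocab(texts)
row = encode("padam vera level mass", vocab)

print(len(vocab), len(row))
print(row[:6])
```
        </preformat>
        <p>Unknown tokens are skipped in this sketch, so a comment made up entirely of unseen words becomes a row of 100 zeros.</p>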
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Custom Architecture</title>
        <p>A Keras Sequential model was built from scratch for the task, using a variety of layers to build a
5-layer network. An embedding layer was included first, running on the tokenized output, which
turns the indices into dense vectors of a fixed size. Recurrent layers were not used; instead, a
pooling layer and two dense layers were used. The last dense layer used the sigmoid function as its
activation function (for final classification into the above-mentioned 3 classes: NOT, OFF and
not-Tamil), while ReLU was used as the activation function of the former dense layer.</p>
        <p>A variety of dropout layers (with different rates) were tried in the model to correct
overfitting. It was observed, however, that the model performed better with just one dropout
layer before the penultimate layer of the model. This architecture gave the best performance
out of all the combinations that were tried.</p>
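        <p>One possible reading of this architecture is sketched below as a Keras Sequential model: embedding, pooling, a single dropout layer, a ReLU dense layer, and a sigmoid output layer. The embedding size, hidden width, dropout rate and the choice of global average pooling are not given in the text and are assumptions made only for illustration. TensorFlow is imported inside <monospace>build_model</monospace>, so the hand-computed parameter count runs without it.</p>
        <preformat>
```python
# Sizes that the paper does not state are assumptions chosen for illustration.
VOCAB_SIZE = 5000    # matches the tokenizer limit of 5000 keys
EMBED_DIM = 16       # assumption
HIDDEN_UNITS = 16    # assumption
DROPOUT_RATE = 0.5   # assumption
NUM_CLASSES = 3      # NOT, OFF, not-Tamil


def build_model():
    """Build a five-layer Sequential model (requires TensorFlow to be installed)."""
    from tensorflow.keras import Sequential, layers  # imported lazily

    return Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.GlobalAveragePooling1D(),   # assumption: the pooling type is not stated
        layers.Dropout(DROPOUT_RATE),      # single dropout before the penultimate layer
        layers.Dense(HIDDEN_UNITS, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="sigmoid"),
    ])


def trainable_parameters():
    """Count weights by hand: embedding table plus two dense layers with biases."""
    embedding = VOCAB_SIZE * EMBED_DIM
    hidden = EMBED_DIM * HIDDEN_UNITS + HIDDEN_UNITS
    output = HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES
    return embedding + hidden + output


print(trainable_parameters())  # 80323
```
        </preformat>
        <p>With these assumed sizes almost all of the weights sit in the embedding table, which is typical for small text classifiers of this shape.</p>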
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training</title>
        <p>Cross-entropy loss was calculated and the Adam optimizer was used for training. The model
was trained using the reduced training set, with the remaining data points used as the
validation set. A range of batch sizes and epoch counts was tried to optimise performance. A final
batch size of 512 was used over 25 epochs to correct overfitting and ensure the validation loss
kept decreasing with the corresponding increase of validation accuracy.</p>
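        <p>The training setup can be sketched as follows. The function <monospace>train</monospace> is a hypothetical helper that accepts any Keras-style model; the batch size and epoch count come from the text, while the specific loss name <monospace>categorical_crossentropy</monospace> is an assumption, since only "cross entropy" is stated. The small arithmetic helper shows how many weight updates these settings imply for 4000 training points.</p>
        <preformat>
```python
import math

BATCH_SIZE = 512
EPOCHS = 25


def steps_per_epoch(num_examples, batch_size=BATCH_SIZE):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)


def train(model, x_train, y_train, x_val, y_val):
    """Compile and fit a Keras-style model with the reported optimizer and sizes."""
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # assumption: exact cross-entropy variant not stated
        metrics=["accuracy"],
    )
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
    )


print(steps_per_epoch(4000))           # 8
print(steps_per_epoch(4000) * EPOCHS)  # 200
```
        </preformat>
        <p>The object returned by <monospace>fit</monospace> records per-epoch training and validation loss, which is what one would inspect to check that validation loss keeps decreasing.</p>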
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The submitted model achieved an overall rank of 5 among all the submitted runs based on its
F-score. The model was evaluated using standard classification metrics: macro-averaged
recall, precision and F-score. The overall results were:</p>
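        <p>For readers unfamiliar with macro averaging, the sketch below computes the three metrics named above on a handful of toy labels. The numbers it prints are illustrative only and are not the results of the submitted model. The helper <monospace>macro_scores</monospace> is hypothetical, and the macro F-score is taken here as the unweighted mean of the per-class F-scores, which is one common convention.</p>
        <preformat>
```python
def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F-score) over `labels`."""
    precisions, recalls, f_scores = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Toy labels for illustration only.
y_true = ["NOT", "NOT", "OFF", "OFF"]
y_pred = ["NOT", "OFF", "OFF", "OFF"]
scores = macro_scores(y_true, y_pred, ["NOT", "OFF"])
print([round(value, 3) for value in scores])  # [0.833, 0.75, 0.733]
```
        </preformat>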
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>We find that a deep-learning-based approach is highly effective at identifying offensive Tamil
language texts. Further modifications can be made to fine-tune the model to suit the specifics
of the Tamil language. This approach can be extended to other Indian languages which have a
similar lexical pattern, thus creating a robust solution for flagging offensive content in news
and social media websites.</p>
      <ref-list>
        <ref>
          <mixed-citation>[7] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text, arXiv preprint arXiv:2106.09460 (2021).</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] J. J. Andrew, JudithJeyafreedaAndrew@DravidianLangTech-EACL2021: offensive language detection for Dravidian code-mixed YouTube comments, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 169–174. URL: https://aclanthology.org/2021.dravidianlangtech-1.22.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>[9] Y. Chen, Y. Zhou, S. Zhu, H. Xu, Detecting offensive language in social media to protect adolescent online safety (2012) 71–80. doi:10.1109/SocialCom-PASSAT.2012.55.</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>[10] R. Liu, G. Xu, S. Vosoughi, Enhanced offensive language detection through data augmentation, CoRR abs/2012.02954 (2020). URL: https://arxiv.org/abs/2012.02954. arXiv:2012.02954.</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] J. Risch, R. Ruff, R. Krestel, Offensive language detection explained (2020) 137–143. URL: https://aclanthology.org/2020.trac-1.22.</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>[12] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] T. Mandl, S. Modha, G. K. Shahi, A. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages (2020).</mixed-citation>
        </ref>
        <ref id="ref14">
          <mixed-citation>[14] G. K. Pitsilis, H. Ramampiaro, H. Langseth, Effective hate-speech detection in Twitter data using recurrent neural networks, Applied Intelligence 48 (2018) 4730–4742. URL: https://doi.org/10.1007%2Fs10489-018-1242-y. doi:10.1007/s10489-018-1242-y.</mixed-citation>
        </ref>
        <ref id="ref15">
          <mixed-citation>[15] H. Rizwan, M. H. Shakeel, A. Karim, Hate-speech and offensive language detection in Roman Urdu (2020) 2512–2522.</mixed-citation>
        </ref>
        <ref id="ref16">
          <mixed-citation>[16] M. Susanty, Sahrul, A. F. Rahman, M. D. Normansyah, A. Irawan, Offensive language detection using artificial neural network (2019) 350–353. doi:10.1109/ICAIIT.2019.8834452.</mixed-citation>
        </ref>
        <ref id="ref17">
          <mixed-citation>[17] S. Saumya, A. Kumar, J. P. Singh, Offensive language identification in Dravidian code mixed social media text (2021) 36–45. URL: https://aclanthology.org/2021.dravidianlangtech-1.5.</mixed-citation>
        </ref>
        <ref id="ref18">
          <mixed-citation>[18] F. Chollet, et al., Keras, https://keras.io, 2015.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lahiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <article-title>Aggressive and offensive language identification in Hindi, Bangla, and English: A comparative study</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. U.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-task learning for sentiment analysis and offensive language identification in under-resourced Dravidian languages</article-title>
          ,
          <source>arXiv preprint arXiv:2108.03867</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Evaluating aggression identification in social media</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media</source>
          , Association for Computational Linguistics, Barcelona, Spain (Online),
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>53</lpage>
          . URL: https://aclanthology.org/2020.peoples-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <year>2021</year>
          . arXiv:2108.05927.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>B</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Chinnaudayar</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam</article-title>
          , in:
          <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <source>DravidianCodeMix: Sentiment Analysis and Offensive Language Identification</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>