<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A. K. Gautam); bharathib@ssn.edu.in (B. Bharathi)
~ https://www.ssn.edu.in/staf-members/dr-b-bharathi/ (B. Bharathi)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>RNNs vs. Transformers: Training language models on deficit datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhishek Kumar Gautam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B Bharathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri Siva Subramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer science, Indian Institute of Information Technology Una</institution>
          ,
          <addr-line>Himachal Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The concept of content moderation is as old as online social media itself. The goal is to prevent hate speech, offensive comments, and similar content from appearing on a platform, so as to keep the online social environment friendly and sane. With an exponentially increasing number of people on social media, content moderation is a difficult task, so in the modern era we make use of specialised tools such as AI and NLP. In countries where English is not the native language, social media texts are mostly in code-mixed form. This paper discusses the work submitted by SSNCSE_NLP to the HASOC offensive language identification on multilingual code-mixed text tasks of FIRE 2021. We present a detailed comparison of the performance of several RNN-based models with the transformer-based BERT architecture, varying the essential hyperparameters when training on a smaller dataset for tasks like sentiment analysis. Our best evaluated model achieved an F1 score of 72.47% in task 1, and 69.2% and 61.5% in task 2 for Tamil and Malayalam respectively, on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Offensive content</kwd>
        <kwd>Dravidian languages</kwd>
        <kwd>RNN</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Every brand, small or big, wants to put its product into as many hands as possible and to make it
easily accessible in users' own native languages. This, in combination with the reach of the internet, has
resulted in a massive expanse of rich, diverse user groups. Online content moderation in those
native languages, along with the mixed languages that these user groups speak, is hence necessary.
As code-mixed text combines two, three, or more languages with
symbols and emojis, it is difficult to train efficient models. With recent developments in sequence
processing models and Transformer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] based architectures, it is far easier to train models on these
mixed-language sets. Code-mixed research and the challenges involved in speech and
language processing are reviewed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Ensemble approaches for offensive language identification are
discussed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Multilingual BERT-based transformer models are used for the offensive language
identification task in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this paper we compare training LSTM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] based architectures
with the transformer-based BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] on smaller datasets of the code-mixed Dravidian
languages Malayalam and Tamil mixed with English. Machine learning based approaches for
offensive language identification are described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Tamil and Malayalam belong to the
Dravidian language family, spoken mainly in south India, Sri Lanka, and Singapore.
      </p>
      <p>The paper is organized as follows. The dataset is described in Section 2.1. Section
2.3 details the experimental setup and the various features used for this task. Section 3 provides a
subjective analysis and comparison of the performance of various models on the development
and test data. Finally, Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed work</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset analysis and task description</title>
        <p>
          The primary goal of this shared task is to detect offensive language in a code-mixed dataset of
comments/posts in Dravidian languages (Malayalam-English and Tamil-English) collected from
social media [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ][
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. A comment/post may contain more than one sentence, but the average
number of sentences per comment/post in the corpora is 1. Each comment/post is annotated with an offensive language
label at the comment/post level [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The dataset also has class imbalance problems, reflecting
real-world scenarios. The HASOC Dravidian dataset had two tasks. The first task was a
message-level label classification task: given a YouTube comment in Tamil, the
model had to classify it as offensive or not offensive. For the second task, given a tweet in
code-mixed Tamil or Malayalam, systems had to classify it as offensive or not offensive.
Example sentences for task 1 and task 2 are given in Fig. 1.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Preprocessing</title>
        <p>The datasets consisted of the Dravidian languages Tamil and Malayalam code-mixed with English
words, symbols, and emojis. The dataset was parsed to generate word-level tokens, which were then
split into characters to separate out non-UTF-8 characters; emojis were also removed from the
resulting character set. The word-level tokens were then fed directly into the LSTM-based
networks, while the cleaned text was tokenized separately to generate BERT tokens for training the
transformer architecture. This was done to create proper tokens for the Tamil-in-English (Tanglish)
and Malayalam-in-English (Manglish) datasets provided in task 2, while pre-trained embeddings
from Indic BERT [11] were used for task 1.</p>
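        <p>The cleaning step described above can be sketched as follows. This is a minimal illustration, not the code from the authors' notebooks: the function name clean_comment, the whitespace tokenizer, and the use of Unicode symbol categories as a stand-in for "emoji" are all assumptions.</p>
        <preformat><![CDATA[
```python
import unicodedata

# Assumption: emoji are approximated by these Unicode categories
# ("So" = other symbol, "Sk" = modifier symbol, "Cn" = unassigned).
DROPPED_CATEGORIES = {"So", "Sk", "Cn"}
VARIATION_SELECTOR = "\ufe0f"  # often trails an emoji code point


def clean_comment(text):
    """Return whitespace-separated word tokens with emoji-like characters removed."""
    # Round-trip through UTF-8 and drop anything that cannot be encoded,
    # such as lone surrogates left over from scraping.
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    kept = []
    for ch in text:
        if ch == VARIATION_SELECTOR:
            continue
        if unicodedata.category(ch) in DROPPED_CATEGORIES:
            continue
        kept.append(ch)
    # Word-level tokens; Tamil/Malayalam letters and combining marks are kept.
    return "".join(kept).split()
```
]]></preformat>
        <p>Under these assumptions, the returned word-level tokens would feed the LSTM-based networks, while the same cleaned text, joined back into a string, would be passed to a subword tokenizer for BERT.</p>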
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Experiments</title>
        <p>
          For natural language modelling, several LSTM architectures, including ELMo [12], were tested
along with the state-of-the-art transformer model BERT. Because existing embeddings lack vocabulary for
the Tanglish and Manglish datasets of task 2 and would map many words to unknown tokens, the models for
task 2 were trained from scratch. The LSTM-based RNN architectures were created and trained in TensorFlow,
while the transformer architecture BERT was trained in PyTorch using the Hugging Face Transformers
library. The Jupyter notebooks for both training tasks are available in the repository given in footnote 1.
        </p>
        <sec id="sec-2-3-1">
          <title>2.3.1. ELMo model</title>
          <p>ELMo is an LSTM-based architecture that leverages the bi-directionality [11] of natural language
models by running two separate LSTM layers, one left-to-right and one right-to-left, in a bidirectional
wrapper and shallowly concatenating their outputs. Since no multilingual ELMo models were available
for Indian languages, we trained it from scratch and achieved an accuracy
of 82.7% on the validation set.</p>
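          <p>The idea of running two separate recurrent passes and shallowly concatenating their outputs can be illustrated without any deep learning library. The sketch below is an assumption-laden toy: a one-unit tanh recurrent cell with fixed weights stands in for an LSTM layer, and the names run_rnn and bidirectional are hypothetical.</p>
          <preformat><![CDATA[
```python
import math


def run_rnn(inputs, w_in=0.5, w_rec=0.5):
    """Toy one-unit recurrent cell: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        outputs.append(h)
    return outputs


def bidirectional(inputs):
    """Pair each time step's left-to-right output with its right-to-left output."""
    forward = run_rnn(inputs)
    # Run over the reversed sequence, then flip back so index t lines up again.
    backward = run_rnn(inputs[::-1])[::-1]
    # "Shallow" concatenation: the two passes never interact internally;
    # their outputs are simply placed side by side.
    return [(f, b) for f, b in zip(forward, backward)]
```
]]></preformat>
          <p>Each time step thus yields a representation twice the width of a single pass, one half summarising the left context and the other the right context.</p>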
          <p>Because ELMo is built from LSTMs, it shares their hyper-parameters, namely:</p>
          <list list-type="order">
            <list-item><p>Number of units: the number of LSTM units, or output dimension.</p></list-item>
            <list-item><p>Dropout: the dropout rate (0-1) for the outputs.</p></list-item>
            <list-item><p>Recurrent dropout: the dropout rate (0-1) for the recurrent state.</p></list-item>
          </list>
          <p>Even with the fairly small dataset, the ELMo architecture could be trained from scratch for better accuracy.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. BERT model</title>
          <p>
            BERT is a transformer [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] based architecture which has proven to be fit for a wide variety of
tasks [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. It uses self-attention to relate the tokens of a sequence to one another within the network. Task-1 had plain
Tamil text code-mixed with English words, so we fine-tuned the multilingual model Indic BERT, which gave
a score of 81% on the validation set. For task 2 we trained the models from scratch. Since the
maximum sentence length was found to be 91, an embedding size of 128 was used with a mini-BERT
configuration to pre-train the model on the text. Pre-training was done on the entire set with masked
language modelling, and the model was then fine-tuned for classification on the downstream tasks.
          </p>
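          <p>The masked language modelling objective used for pre-training can be illustrated with a small, library-free sketch. The 15% masking rate and the [MASK] token follow the original BERT recipe and are assumptions here, since the values actually used are not stated; the function name mask_tokens is hypothetical. The full BERT recipe also sometimes substitutes a random or unchanged token instead of [MASK], which this sketch omits.</p>
          <preformat><![CDATA[
```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for masked language modelling.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would be computed only on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```
]]></preformat>
          <p>During pre-training the model is asked to recover the entries of labels from masked; for classification the same encoder is then fine-tuned with a classification head on the labelled comments.</p>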
          <p>The BERT models are implemented in PyTorch and use the Hugging Face Transformers API to
create and train models, while the ELMo architecture was implemented in TensorFlow and
trained from scratch. All the notebooks associated with training the models are available at
the link in footnote 1.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Performance analysis</title>
      <p>The performance of the proposed approach using the BERT model is given in Table 1. From Table 1, it can
be seen that fine-tuning the multilingual model Indic BERT gave a score of 81% on the validation
set.</p>
      <p>The performance of the proposed system using the ELMo model is given in Table 2.</p>
      <p>In Table 2, BiLSTMu refers to a BiLSTM built from side-by-side stacked uni-directional (left-to-right)
LSTMs. BiLSTMd refers to a BiLSTM with separate left-to-right and right-to-left LSTMs stacked, whose
outputs are shallowly concatenated and fed to fully connected layers for classification.</p>
      <p>1 https://github.com/Abhishek-krg/Multilingual-codemixed-language-classification</p>
      <p>From Table 2, it can be seen that the BiLSTMd model with 32 units achieves the highest accuracy of
82.7%. Considering the performance of multilingual language models, we also experimented with XLM-RoBERTa.
The results are tabulated in Table 3.</p>
      <p>The performance of the proposed system on the test data is given in Table 4.</p>
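      <p>The F1 scores reported for the test set combine precision and recall, which matters for the class-imbalanced data described in Section 2.1. The sketch below computes a per-class F1 and a support-weighted average. Which averaging the shared task used is not stated in this paper, so the weighted variant is an assumption, as are the function names.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )
```
]]></preformat>
      <p>Unlike plain accuracy, this score drops when a model ignores the minority class, which is why it suits imbalanced offensive-language data.</p>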
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we presented offensive language identification on Dravidian code-mixed text
using ELMo and BERT models. From the performance metrics above, it is clear that BERT,
despite generally being regarded as the stronger architecture, could not achieve the expected results, while the BiLSTMd
architecture gave better results on the HASOC dataset. A likely reason is that BERT is
a very dense model that requires huge datasets to train on, whereas LSTM-based RNN architectures
can achieve better results on simpler classification tasks.</p>
      <p>P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the
HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and
Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
CEUR, 2021.</p>
      <p>[11] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar,
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained
Multilingual Language Models for Indian Languages, in: Findings of EMNLP, 2020.</p>
      <p>[12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
contextualized word representations, CoRR abs/1802.05365 (2018). URL: http://arxiv.org/abs/1802.05365.
arXiv:1802.05365.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>CoRR</source>
          abs/1706.03762 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sitaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Chandu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Rallabandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <article-title>A survey of code-switched speech and language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:1904.00784.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Hatealert@DravidianLangTech-EACL2021: Ensembling strategies for transformer-based offensive language detection</article-title>
          , in:
          <source>Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          , Association for Computational Linguistics, Kyiv,
          <year>2021</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>276</lpage>
          . URL: https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jayanthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Sj_aj@dravidianlangtech-eacl2021: Task-adaptive pre-training of multilingual bert models for offensive language identification</article-title>
          ,
          <year>2021</year>
          . arXiv:2102.01051.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Koutník</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Steunebrink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>LSTM: A search space odyssey</article-title>
          ,
          <source>CoRR</source>
          abs/1503.04069 (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1503.04069. arXiv:1503.04069.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>CoRR</source>
          abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Nitin Nikamanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <article-title>Ssncse_nlp@hasoc-dravidian-codemix-fire2020: Offensive language identification on multilingual code mixing text</article-title>
          , in:
          <source>Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2020</year>
          , pp.
          <fpage>370</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>A. Kumar M</string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>H. R L</string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</article-title>
          , in:
          <source>Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          , Association for Computational Linguistics, Kyiv,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          . URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil, malayalam, and kannada</article-title>
          , in:
          <source>Forum for Information Retrieval Evaluation, FIRE 2021</source>
          , Association for Computing Machinery,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          , S. Thavareesan,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>