<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SSNCSE_NLP@HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification on Multilingual Code Mixing Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nitin Nikamanth Appiah Balaji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Bharathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri Siva Subramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>The number of social media users is increasing rapidly, and many people have started writing native languages in the Roman alphabet. It has therefore become a major concern to regulate the quality of the text content and messages shared on the internet. In this paper we study the task of offensive message identification for Tamil-English and Malayalam-English code-mixed content. Char n-gram, TFIDF, and fine-tuned BERT features are compared in combination with machine learning models such as MLP, Random Forest and Naive Bayes. This work explains the submissions made by SSNCSE_NLP in the HASOC Code-mix tasks for Hate Speech and Offensive language detection. We achieve F1 scores of 0.94 for task1-Malayalam, 0.75 for task2-Malayalam and 0.88 for task2-Tamil on the test-set.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Hate Speech Detection</kwd>
        <kwd>NLP</kwd>
        <kwd>Offensive language identification</kwd>
        <kwd>BERT embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent developments in technology have rapidly reduced the cost of devices. The
exponential growth of internet connectivity, even in the most remote parts of the world,
has encouraged people to use communication services and social media. Social media has even
proven capable of influencing a country’s elections. So it becomes necessary to automate the
process of censorship of the comments and messages being posted.</p>
      <p>
        As code-mixed comments and messages consist of native languages mixed with Roman
alphabets, it becomes difficult to model a generalized solution that works interchangeably with
different alphabet-set variations and mixes [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Recent developments in multilingual
unsupervised training have elicited models that can be fine-tuned for code-mixed classification
tasks. Also, n-gram based models can learn easily from a limited data-set,
in a short span of time.
      </p>
      <p>
        In this paper, we present automated models for the identification of hate speech or offensive content
in two different Dravidian languages, Malayalam and Tamil, mixed with English [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Tamil
and Malayalam belong to the Dravidian language family spoken mainly in southern India, Sri Lanka,
and Singapore [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We have analyzed various techniques such as n-grams and multilingual BERT
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] fine-tuned with added neural network layers. The char n-gram model produced the best
results, with comparable performance from the BERT model. We achieved an F1 score of 0.94
for task1 - YouTube comments in code-mix (a mixture of native and Roman scripts) Tamil and
Malayalam. F1 scores of 0.75 for Manglish and 0.88 for Tanglish were achieved.
      </p>
      <p>
        The paper expounds on the experiments and the submissions made for the Offensive Language
Detection task [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The remainder of the paper is organized as follows. Section 2 discusses
the data-set distribution and the techniques implemented to balance classes. Section 3 outlines the
features used for the experiments for both task 1 and task 2. Results are discussed in Section 4.
Section 5 concludes the paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data-set Analysis and Preprocessing</title>
      <p>The HASOC data-set is a collection of message-level labeled comments for offensive language
detection. It consists of Tamil-English [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Malayalam-English [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] YouTube comments. The
comments contain writing in Roman lexicons with Tamil/Malayalam grammar, or English
grammar with Tamil/Malayalam lexicons. Task 1 contains Malayalam-English comments with
corresponding Not Offensive and Offensive class labels. Task 2 contains two sub-tasks, one for
Malayalam-English comments and the other for Tamil-English comments. The train-set, dev-set
and test-set distribution, with class-wise distribution, is shown in Table 1.</p>
      <p>There is a clear imbalance in the data-set distribution. This could cause a bias towards a
particular class, and a model trained on this data-set would be inclined towards the
dominant class. The comments from the class containing fewer instances are
randomly duplicated to get an equal distribution in the training data-set. The final train-set
consists of 5,266 comments (2,633 Off and 2,633 Not-Off) for task1. A similar duplication strategy
is applied to the Task 2 data-set. For task2 Malayalam-English, the re-sampled train-set consists
of 4,094 comments (2,047 Off and 2,047 Not-Off) and for task2 Tamil-English, the re-sampled
train-set consists of 4,040 comments (2,020 Off and 2,020 Not-Off).</p>
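      <p>The class-balancing step described above can be sketched as follows; the comments, labels, and the helper name are illustrative stand-ins, not the HASOC data itself.</p>

```python
import random

def oversample(comments, labels, seed=0):
    """Randomly duplicate minority-class comments until every class
    reaches the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for c, y in zip(comments, labels):
        by_class.setdefault(y, []).append(c)
    target = max(len(v) for v in by_class.values())
    X, Y = [], []
    for y, items in by_class.items():
        # keep the originals, then pad with random duplicates up to `target`
        X.extend(items + [rng.choice(items) for _ in range(target - len(items))])
        Y.extend([y] * target)
    return X, Y

# 3 Not-Off comments vs 1 Off comment becomes 3 vs 3 after re-sampling
X, Y = oversample(["c1", "c2", "c3", "c4"], ["Not-Off", "Not-Off", "Not-Off", "Off"])
```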
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup and features</title>
      <p>For feature extraction, the n-gram model and the BERT embedding model are experimented with.
As the content of the comments is a mix of Dravidian-language grammar in Roman lexicons along
with English grammar, it becomes challenging to find pre-trained models for this context. So a
simple n-gram approach is considered. Also, the advancements made by the transformer model
for pre-training, and the availability of multilingual trained models, encourage experimenting
with BERT pre-trained embeddings.</p>
      <p>The extracted features are used to train machine learning models such as Multi-Layer
Perceptron, Random Forest, and Naive Bayes, and the performance of these models is compared. For
task1 the provided dev-set is used, and for task2 4-fold cross-validation is used. Metrics
such as accuracy and weighted-average F1 score are analyzed. The machine learning model
implementations and the comparison metrics are from Scikit-learn. The implementation
with the experimented and selected hyper-parameters is available online.</p>
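      <p>A minimal sketch of this comparison loop with Scikit-learn; random toy features stand in for the extracted ones, and the hyper-parameters here are illustrative, not the tuned values selected in this work.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Toy stand-in features: 40 samples, 20 dims, 2 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = np.array([0, 1] * 20)

models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}
scores = {}
for name, model in models.items():
    # 4-fold cross-validation with the weighted-average F1 metric, as for task 2
    scores[name] = cross_val_score(model, X, y, cv=4, scoring="f1_weighted").mean()
```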
      <sec id="sec-3-1">
        <title>3.1. Count and TFIDF n-grams</title>
        <p>As the data-set consists of mixed Roman and Dravidian lexicons, with English and native
language grammar used interchangeably, it becomes a new task which doesn’t suit any existing model’s
training corpus. So an n-gram model trained from scratch is considered for the tasks; this
strategy has shown good results with HI-EN and BN-HI-EN datasets [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Basic count-based
and Term Frequency-Inverse Document Frequency-based models are compared for feature
extraction. The char TFIDF is calculated as explained in equation (1).</p>
        <p>For each char t in a document d from the document set D, TF-IDF is calculated as:</p>
        <p>tfidf(t, d, D) = tf(t, d) · idf(t, D)   (1)</p>
        <p>tf(t, d) = 1 + log(count(t, d))   (2)</p>
        <p>idf(t, D) = log(N / |{d ∈ D : t ∈ d}|)   (3)</p>
        <p>where N is the total number of documents.</p>
        <p>Different character n-grams are constructed, with the n-gram range varying from 1 to 7. Of
these, the n-gram range 1-5 showed good results for task 1 and the range 1-3 showed
good results for task 2. The smaller n-gram range for task2 may be due to its shorter
comments. Of all the classification models compared, the Random Forest model yielded the
best result.</p>
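        <p>The char n-gram feature extraction and Random Forest classification can be sketched with Scikit-learn as follows; the comments and labels are illustrative placeholders, and the hyper-parameters shown are defaults, not the tuned values.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy code-mix-style comments (illustrative, not the HASOC data)
train_comments = ["nalla video bro", "semma waste da", "super aanu", "mosam comment"]
train_labels = ["Not-Off", "Off", "Not-Off", "Off"]

# Char n-grams with range 1-5 as selected for task 1 (1-3 for task 2),
# fed to a Random Forest classifier
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 5)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(train_comments, train_labels)
pred = clf.predict(["semma video"])
```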
      </sec>
      <sec id="sec-3-2">
        <title>3.2. BERT Embedding</title>
        <p>The comments consist of a mix of English and Tamil, or English and Malayalam. So the BERT
multilingual model, trained on a large corpus covering 104 different languages [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which includes</p>
        <p>[Figure 1: System pipeline. Code-mix comments are class-balanced by oversampling; n-gram features (the selected range per task) feed a Random Forest classifier, and fixed 512-dimension BERT embeddings feed an MLP with hidden layer size 512, each predicting Offensive or Not Offensive.]</p>
        <p>
          Malayalam, Tamil, and English, could be considered. It is noticeable that the BERT model has
shown excellent results with tasks of sentence classification with Tweet data-sets. So this study
can be interpolated to YouTube comments too. A fixed dimension embedding is generated by
sentence-transformers, with the pre-trained base-multilingual-cased model implementation as
explained in [12].
        </p>
        <p>These embeddings are then used to classify comments as Offensive or Not-offensive by training a
machine learning model. Different models such as Random Forest, Naive Bayes, and MLP are
compared, of which the MLP with a hidden layer of size 512 generated the best results. Even though
the model’s base training corpus differs slightly from the code-mix context, with native
language grammar in Roman scripts, it was able to learn from the training corpus by fine-tuning.</p>
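        <p>The classification stage can be sketched as follows. In the real pipeline the 512-dimension vectors come from sentence-transformers’ multilingual BERT model; here random vectors stand in for them so the sketch stays self-contained, and the model name in the comment is an assumption.</p>

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# In the real pipeline the vectors come from sentence-transformers, e.g.:
#   model = SentenceTransformer("distiluse-base-multilingual-cased")  # name assumed
#   X_train = model.encode(comments)
# Random stand-ins keep this sketch self-contained.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 512))   # fixed 512-dimension embeddings
y_train = np.array([0, 1] * 10)        # 0 = Not-Offensive, 1 = Offensive

# MLP with a single hidden layer of size 512, as selected in this work
clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
probs = clf.predict_proba(rng.normal(size=(3, 512)))
```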
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Observations</title>
      <p>The output of the n-gram embedding is a sparse matrix of high dimension, and as the total
number of training samples is limited, it is observed that the Random Forest model gave better
results than the MLP model. The embedding generated by the BERT model, in contrast,
is of 512 dimensions for each comment, so the MLP performed relatively well compared to the
Random Forest model.</p>
      <sec id="sec-4-2">
        <title>Model comparison</title>
        <p>[Table 2 (flattened in extraction): for each of Task1, Task2-ml, and Task2-ta, the features compared are TFIDF and count-vectorizer char n-grams (ranges 1-5 or 1-3) and BERT embeddings.]</p>
        <p>The model comparison based on dev-set evaluation for task1, and on cross-validation with k = 4
for task2, is shown in Table 2. The overall performance of the TFIDF model is slightly
better than that of the pre-trained multilingual BERT embedding. This may be due to the difference
between the base corpus the BERT model is trained on and the given code-mix
corpus. The TFIDF and count vectorization yielded similar results, because of
the short length of the comments and the varied usage of words by different users.
The performance on the test-sets, consisting of 400, 951, and 940 instances for task1, task2-Malayalam, and
task2-Tamil, is shown in Table 3. Our model performed well, with ranks of 2, 4, and 2 in
task1, task2-Malayalam, and task2-Tamil respectively.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The popularity of social media platforms is growing exponentially. Automatic and efficient
censorship of hateful and offensive comments is becoming a necessity. In this paper, we have
studied the performance of two different feature extraction techniques for offensive language
detection. A pre-trained multilingual BERT model with an MLP classifier and a TFIDF embedding
with a Random Forest classifier are compared. The code-mixed Malayalam-English and Tamil-English
YouTube comment HASOC Code-mix Dravidian data-set is evaluated, and test-set F1 scores
of 0.94 for task1-Malayalam, 0.75 for task2-Malayalam and 0.88 for task2-Tamil are shown. Our
model achieved ranks of 2, 4 and 2 for task1, task2-Malayalam, and task2-Tamil respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vegupatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Named entity recognition for code-mixed Indian corpus using meta embedding</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages</article-title>
          ,
          <source>in: 2nd Conference on Language, Data and Knowledge (LDK 2019)</source>
          , volume
          <volume>70</volume>
          of OpenAccess Series in Informatics (OASIcs),
          <source>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          , Dagstuhl, Germany,
          <year>2019</year>
          , pp. 6:1-6:14. URL: http://drops.dagstuhl.de/opus/volltexte/2019/10370. doi:10.4230/OASIcs.LDK.2019.6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stearns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zarrouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          . URL: https://www.aclweb.org/anthology/W19-6809.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , P. B,
          <string-name>
            <surname>S. KP</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the track on “HASOC-Offensive Language Identification-DravidianCodeMix”, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020)</article-title>
          . CEUR Workshop Proceedings. In: CEUR-WS. org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , P. B,
          <string-name>
            <surname>S. KP</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the track on “HASOC-Offensive Language Identification-DravidianCodeMix”</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</article-title>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Danda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhakras</surname>
          </string-name>
          ,
          <article-title>Code-mixed sentiment analysis using machine learning and neural network approaches</article-title>
          ,
          <year>2018</year>
          . arXiv:1808.03299.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Making monolingual sentence embeddings multilingual using knowledge distillation</article-title>
          , arXiv preprint arXiv:2004.09813 (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/2004.09813.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>