<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model for Sentiment Classification on Code-Mixed Data in Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S R Mithun Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nihal Reddy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aruna Malapati</string-name>
          <email>arunam@hyderabad.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lov Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BITS Pilani</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Uber Research and Development India</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Dravidian languages Tamil, Kannada, Malayalam and Telugu are spoken by over 220 million people but are vastly under-resourced for natural language processing tasks. Code-switching and code-mixing have been on the rise, with multilingual speakers expressing their opinions in their mother tongue along with English, in written text as well as in speech. Sentiment analysis of code-switched Dravidian languages is challenging because corpora are under-resourced and the languages are interspersed unpredictably. This paper applies an ensemble sentiment classification strategy based on majority voting over 13 different classification models on the Dravidian code-mixed language dataset provided in FIRE 2021. The key conclusion from our experiments is that an ensemble of multiple classifiers outperforms the individual classifiers for sentiment classification. Our approach achieves weighted F1-scores of 0.59, 0.65 and 0.60 on Kannada, Malayalam and Tamil code-switched data, respectively, using traditional machine learning algorithms through an ensemble of multiple classifiers.</p>
      </abstract>
      <kwd-group>
        <kwd>Dravidian Languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Code-mixed usage has grown from social media sites to day-to-day written communication. Negative sentiments
are more often expressed in the mother tongue, while positive sentiments are generally
expressed in English, making it necessary to model code-switched languages.</p>
      <p>While monolingual NLP tasks form the basis and are no different from code-mixed languages in
most aspects, code-mixed data poses significant challenges in language identification,
data collection and preparation strategy, optimal use of existing resources, and the
user-centric design of code-mixed NLP systems. These challenges are amplified even more when one of the
languages is under-resourced.</p>
      <p>Dravidian languages are vastly under-resourced, and when code-mixed with English they pose an even
harder NLP task. Sentiment analysis of code-switched Dravidian languages is still an area of ongoing
research; it will help analyse the emotion and attitude of users who express themselves in
code-switched languages, whose usage is rising on social media platforms like TikTok, YouTube, WhatsApp,
etc.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Computational approaches to code-switching, the related workshop series and the ACL anthology1 have seen
an increase in research papers over the last three to four years.</p>
      <p>
        Graph Convolutional Networks with multi-headed attention were explored by
        <xref ref-type="bibr" rid="ref7">Dowlagar et al. 2021</xref>
        , yielding a weighted F1-score of 0.75 on Malayalam-English code-mixed data
from the FIRE 2020 dataset published by Chakravarthi, Jose, et al. (2020).
      </p>
      <p>An ensemble of a character-trigram-based Long Short-Term Memory (LSTM) model and a word
n-gram-based Multinomial Naive Bayes (MNB) model was proposed by Jhanwar et al. (2018)
for the Hindi-English code-mixed language pair (Prabhu et al. 2016). This model combines the
strengths of LSTM and probabilistic models: the LSTM performed better on longer
sentences due to its ability to capture sequential information, whereas MNB generalised
better on rare words.</p>
      <p>All the prior research highlighted above focuses on deep learning techniques, which perform
significantly well on longer sentences. For instance, Jhanwar et al. (2018) experimented
with datasets that have an average of fifty words per sentence. However, most social media content, such as
YouTube comments, tends to be shorter. For instance, the Kannada code-mixed dataset of FIRE 2021
(Hande et al. 2020) has an average comment length of fewer than seven words. We argue that
probabilistic and deterministic classifiers, and an ensemble of traditional classifiers, will yield
the same or better results on datasets with shorter sentences.</p>
      <p>Our approach was to build a pipeline with traditional classifiers, evaluate the performance
metrics for sentiment classification, and then iterate with ensemble techniques that could serve
as a baseline for any short-length code-mixed text. This was done as part of the shared task on
sentiment detection, along the lines described by Priyadharshini et al. (2021).</p>
      <p>1https://aclanthology.org/search/?q=code+mixing</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <sec id="sec-3-1">
        <title>3.1. Data Description</title>
        <p>This section presents the detailed description of the dataset and its distribution, along with the
research framework used.</p>
        <p>
          The dataset used for the task comes from the official datasets released in Dravidian-CodeMix - FIRE
2021, which comprise labelled sentiment data of YouTube video comments for language pairs:
Kannada-English (Hande et al. 2020), Malayalam-English
          <xref ref-type="bibr" rid="ref1 ref2">(Chakravarthi, Jose, et al. 2020)</xref>
          and Tamil-English
          <xref ref-type="bibr" rid="ref1 ref2">(Chakravarthi, Muralidaran, et al. 2020)</xref>
          . The data consists of code-switched
language pairs, mostly in Roman script for both English and the Dravidian language, wherein
the latter has been transliterated from the source script to Roman script. However, a
good portion of the data remains in Dravidian script.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Distribution</title>
        <p>The data distribution is shown in Table 1. The dataset contains code-mixed sentences
labelled into five categories: Positive, Negative, Mixed Feelings, Unknown State and Not in the
Intended Language. The dataset contains inter-sentential and intra-sentential code-mixed sentences in Tamil,
Malayalam and Kannada with English. As seen in Figure 1, the data is imbalanced, with most of
the labels being available for the positive sentiment.</p>
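        <p>As a minimal illustration of inspecting this imbalance (the counts below are made up, not the dataset's actual distribution), the label frequencies can be tallied directly:</p>

```python
# Tally label frequencies to inspect class imbalance across the five
# FIRE 2021 categories (illustrative counts, not the real distribution).
from collections import Counter

labels = (["Positive"] * 6 + ["Negative"] * 2 + ["Mixed Feelings"]
          + ["Unknown State"] + ["not-intended-language"])
counts = Counter(labels)
majority_label = counts.most_common(1)[0][0]  # the dominant class
```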
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Data Preprocessing</title>
        <p>The data has been preprocessed to remove stop-words, punctuation and emoticons. The NLTK2
library has been used for stemming, lemmatisation and removing stop-words. We have used
the spaCy3 library for named entity recognition.</p>
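        <p>A minimal sketch of this preprocessing step, using only the standard library (the paper's actual pipeline uses NLTK and spaCy, whose exact calls are not reproduced here):</p>

```python
import string

# Tiny stand-in stop-word list; in practice NLTK's stopwords corpus is used.
STOPWORDS = {"the", "is", "a", "and", "of"}

def preprocess(comment):
    # Remove punctuation (this also strips most emoticons built
    # from punctuation marks, e.g. ":)").
    comment = comment.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and tokenise on whitespace.
    tokens = comment.lower().split()
    # Drop stop-words.
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The movie is a super hit!! :)")  # → ["movie", "super", "hit"]
```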
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment Setting</title>
        <p>The pipeline was set up to train the data, both on traditional as well as on ensemble techniques,
as represented in Figure 2.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Traditional classifiers</title>
          <p>In the first approach, the data was run through multiple traditional machine learning
algorithms for classification. The following feature-extraction parameters are common to all: a
CountVectorizer ('vect') with min_df=3, max_df=0.2, analyzer='word' and ngram_range=(1, 3), followed by a TfidfTransformer().</p>
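          <p>The stated parameters correspond to the following scikit-learn pipeline; the final classifier shown here is one plausible choice and would be swapped out for each model in the comparison:</p>

```python
# Feature-extraction pipeline with the parameters reported above:
# word-level 1-3 grams, min_df=3, max_df=0.2, followed by TF-IDF weighting.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("vect", CountVectorizer(min_df=3, max_df=0.2,
                             analyzer="word", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])
```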
          <p>This data was trained on traditional classifiers, including Logistic Regression (LR), Multinomial
Naive Bayes (MNB), Linear SVM (L-SVM), RBF SVM (R-SVM), Poly SVM (P-SVM), Random
Forest (RF), K-Nearest Neighbours (KNN) and Extra Tree Classifier (XTree).</p>
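          <p>In scikit-learn terms, the eight classifiers can be instantiated as below; the hyperparameters are assumptions, since the paper does not report them:</p>

```python
# The eight traditional classifiers named above, as scikit-learn estimators.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "MNB": MultinomialNB(),
    "L-SVM": LinearSVC(),
    "R-SVM": SVC(kernel="rbf"),
    "P-SVM": SVC(kernel="poly"),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "XTree": ExtraTreesClassifier(),
}
```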
          <p>2https://www.nltk.org/
3https://spacy.io/</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Ensemble of multiple classifiers</title>
          <p>The data was then run through ensemble classifiers with estimators as detailed in Table 2.
The ensemble methods experimented with were AdaBoost (AdaB), XGBoost (XGB), Hard
Ensemble of Voting Classifier (HEns), and Hard Ensembles of the Top 5, Top 3 and All Classifiers (HTop_5,
HTop_3, HTop_A).</p>
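          <p>A minimal sketch of the hard-voting strategy, using three of the base learners (the actual estimator subsets are those detailed in Table 2):</p>

```python
# Hard-voting ensemble: each fitted base learner casts one vote per sample,
# and the majority label wins.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

hard_ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier()),
    ],
    voting="hard",  # majority vote over predicted labels
)
```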
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this paper, eight different types of traditional machine learning and three different types of
ensemble methods have been used to develop sentiment prediction models for the code-mixed
language pairs Tamil-English, Malayalam-English and Kannada-English. The predictive
power of these sentiment prediction models is validated using 5-fold cross-validation and
compared using four different performance metrics: Precision, Recall, F1-Score and
Accuracy. The performance values of these models are presented in Table 3, from which we
derived the following observations:
• Ensemble classifiers generally outperformed all the single classifiers across all three
language pairs.
• The ensembles mixing both weak and strong individual learners always included a
probabilistic model such as logistic regression.
• The logistic regression model and the ensemble models performed very close to
each other.</p>
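      <p>The 5-fold validation with the weighted F1-score can be sketched as follows, on synthetic data standing in for the TF-IDF features:</p>

```python
# 5-fold cross-validation with weighted F1, as used in the evaluation
# (synthetic multi-class data stands in for the real TF-IDF features).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_weighted")
mean_f1 = scores.mean()
```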
      <sec id="sec-5-1">
        <title>5.1. Comparative Analysis: Box plot</title>
        <p>In this work, box plots of the different performance metrics, precision, recall and
F1-scores, have been used to compare the performance of the models developed using different
techniques. Figure 3 shows the box plot for each performance metric, precision, recall and
F1-score, compared across classifiers. The information in Figure 3 suggests that the
ensemble methods generally perform better than the other classifiers. It also suggests
that probabilistic models like logistic regression, in silo, perform
better than any other stand-alone classifier, and perform even better within an ensemble of
top classifiers. The performance metrics are very close to the values observed in the baseline
model using transformer-based models on the FIRE 2021 dataset published by Chakravarthi,
Priyadharshini, Muralidaran, et al. (2021), which achieved F1-scores of 0.67, 0.59 and 0.62 for the
Kannada, Malayalam and Tamil code-mixed datasets, respectively, as published by Chakravarthi,
Priyadharshini, Thavareesan, et al. (2021). Our experimentation with ensemble models shows
that best scores of 0.59, 0.65 and 0.60 can be achieved for the same set of language pairs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Comparative Analysis: T-test</title>
        <p>In this work, the T-test has also been applied to find significant differences in
the performance of the models developed using different classifiers. The T-test is used to test
our null hypothesis, i.e., ”There is no significant difference in the performance of
the developed sentiment prediction models using different techniques”. Figure 4 shows the
results for the different techniques on each of the performance metrics, precision, recall and F1-score.
The green dots in Figure 4 indicate that the null hypothesis is accepted, i.e., the
performance of the models does not depend on the technique; the red dots indicate a
difference in the performance of the models developed using different techniques.
From Figure 4, we can observe that the predictive power of the models developed using different
techniques is significantly different, and that the ensemble
methods significantly improved the performance of the models.</p>
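        <p>The per-technique comparison can be sketched with a paired t-test on per-fold scores (the fold scores below are illustrative, not the paper's actual numbers):</p>

```python
# Paired t-test on per-fold scores of two techniques, mirroring the
# significance check of Section 5.2 (fold scores here are made up).
from scipy import stats

folds_lr = [0.58, 0.60, 0.57, 0.59, 0.61]   # e.g. logistic regression F1 per fold
folds_ens = [0.62, 0.63, 0.60, 0.61, 0.65]  # e.g. hard ensemble F1 per fold

t_stat, p_value = stats.ttest_rel(folds_lr, folds_ens)
reject_null = p_value < 0.05  # a "red dot": techniques differ significantly
```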
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we applied different traditional machine learning methods as well as ensemble
methods to the code-mixed data from FIRE 2021, which contains Tamil-English, Malayalam-English and
Kannada-English language pairs, with the objective of developing sentiment prediction models. The
performance of the developed sentiment prediction models is computed using precision,
recall and F1-score. Our experimental results show that:
• The proposed ensemble classifier performs better than any stand-alone classifier.
• The models based on the ensemble technique achieved an F1-score of 0.59 and an
accuracy of 0.62 for Kannada.
• The models based on the ensemble technique achieved an F1-score of 0.65 and an
accuracy of 0.67 for Malayalam.
• The models based on the ensemble technique achieved an F1-score of 0.60 and an
accuracy of 0.64 for Tamil.</p>
      <p>The future steps would be to improve the results through a transliteration-translation task to
augment the preprocessing, and to perform sentiment analysis on a monolingual English
corpus rather than a bilingual corpus for code-switched languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgement</title>
      <p>Thanks to Dr Aravind Ranganathan, Uber R&amp;D, and the anonymous reviewers for the valuable
suggestions and thorough review comments.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Appendix</title>
      <p>Our code is available on GitHub4</p>
      <p>4https://github.com/mithunkumarsr/CodeMixingDravidianLanguage</p>
      <p>Hande, Adeep, Ruba Priyadharshini, and Bharathi Raja Chakravarthi (Dec. 2020). “KanCMD:
Kannada CodeMixed Dataset for Sentiment Analysis and Offensive Language Detection”. In:
Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality,
and Emotions in Social Media. Barcelona, Spain (Online): Association for Computational
Linguistics, pp. 54–63. url: https://www.aclweb.org/anthology/2020.peoples-1.6.</p>
      <p>Jhanwar, Madan Gopal and Arpita Das (2018). “An Ensemble Model for Sentiment Analysis
of Hindi-English Code-Mixed Data”. In: CoRR abs/1806.04450. arXiv: 1806.04450. url:
http://arxiv.org/abs/1806.04450.</p>
      <p>Prabhu, Ameya, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma (2016). Towards
Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text. arXiv:
1611.00472 [cs.CL].</p>
      <p>Priyadharshini, Ruba, Bharathi Raja Chakravarthi, Sajeetha Thavareesan, Dhivya Chinnappa,
Durairaj Thenmozhi, and Rahul Ponnusamy (2021). “Overview of the DravidianCodeMix
2021 Shared Task on Sentiment Detection in Tamil, Malayalam, and Kannada”. In: Forum for
Information Retrieval Evaluation. FIRE 2021. Online: Association for Computing Machinery.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly,
            <given-names>and John Philip McCrae</given-names>
          </string-name>
          (May
          <year>2020</year>
          ).
          <article-title>“A Sentiment Analysis Dataset for Code-Mixed Malayalam-English”</article-title>
          .
          <source>English. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)</source>
          and
          <article-title>Collaboration and Computing for Under-Resourced Languages (CCURL)</article-title>
          . Marseille, France: European Language Resources association, pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . isbn: 979-10-95546-35-1. url: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Vigneshwaran Muralidaran, Ruba Priyadharshini,
            <given-names>and John Philip McCrae</given-names>
          </string-name>
          (May
          <year>2020</year>
          ).
          <article-title>“Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text”</article-title>
          .
          <source>English. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies</source>
          for
          <article-title>Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</article-title>
          . Marseille, France: European Language Resources association, pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . isbn: 979-10-95546-35-1. url: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly,
            <given-names>and John P. McCrae</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text</article-title>
          . arXiv: 2106.09460 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Ruba Priyadharshini,
            <given-names>Sajeetha</given-names>
          </string-name>
          <string-name>
            <surname>Thavareesan</surname>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>“Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text”</article-title>
          . In: Working Notes of FIRE 2021 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          . Online: CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>Ka Long Roy</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>“Trilingual Code-switching in Hong Kong”</article-title>
          .
          <source>In: ALRJournal 3</source>
          .4, pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . doi: 10.14744/alrj.2019.22932.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Choudhury</surname>
            , Monojit,
            <given-names>Anirudh</given-names>
          </string-name>
          <string-name>
            <surname>Srinivasan</surname>
          </string-name>
          , and Sandipan Dandapat (Nov.
          <year>2019</year>
          ).
          <article-title>“Processing and Understanding Mixed Language Data”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts. Hong Kong</source>
          ,
          <article-title>China: Association for Computational Linguistics</article-title>
          . url: https://aclanthology.org/D19-2002.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dowlagar</surname>
          </string-name>
          , Suman and Radhika
          <string-name>
            <surname>Mamidi</surname>
          </string-name>
          (Apr.
          <year>2021</year>
          ).
          <article-title>“Graph Convolutional Networks with Multi-headed Attention for Code-Mixed Sentiment Analysis”</article-title>
          .
          <source>In: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. Kyiv: Association for Computational Linguistics</source>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . url: https://aclanthology.org/2021.dravidianlangtech-1.8.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>