Sentiment Analysis in Dravidian Code-Mixed
YouTube Comments and Posts
Sanjeepan Sivapiran, Charangan Vasantharajan and Uthayasanker Thayasivam
Department of Computer Science and Engineering, University of Moratuwa


                                      Abstract
                                      This paper presents the methodologies implemented while doing Sentiment Analysis on Dravidian
                                      code-mixed comments and posts collected from social media. With a dataset of code-mixed Tamil, We
                                      experimented with transformer-based models such as multilingual BERT and DistilBERT and ULMFiT.
                                      This work submitted to the track ’Sentiment Analysis for Dravidian Languages in Code-Mixed
                                      Text’ organized by the Forum for Information Retrieval Evaluation. Although it received the seventh
                                      rank for the Tamil task in the benchmark, This paper improves upon the results by a margin to attain
                                      the final weighted F1 score of 0.61 for the Tamil task.

                                      Keywords
                                      Sentiment Analysis„ Code-Mixed, Transformers, Tamil, ULMFiT


1. Introduction
In the past few years, usage of social media platforms has drastically increased. With this trend,
cyberbullying and hate speech also increased and created a need to analyze comments/posts
on social media. Sentimental Analysis is a study that uses Natural Language Processing in
identifying subjective opinions or emotional responses about a given topic.[1] There are already
multiple steps taken to make use of sentimental Analysis in monolingual texts. But there has
been an indispensable demand for sentimental Analysis in code-mixed Dravidian languages
(Tamil, Malayalam, and Kannada) [2]. Code-mixing is a prevalent phenomenon in a multilingual
community, and the code-mixed texts sometimes write in non-native scripts.[3] Systems trained
on monolingual data fail on code-mixed data due to the complexity of code-switching at different
linguistic levels in the text.
   The objective of our study is to classify YouTube comments into positive, negative, neutral,
mixed emotions or if the word is not in Tamil, which is in code-mixed form [4]. For this task,
transformer architecture models Like multilingual BERT and DistilBERT yielded good results
since they optimized for low-resourced languages like Tamil. Yet ULMFiT made the best results
compared to transformer models. Since data was in code-mixed form, models had difficulty

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17,2021, India.
Envelope-Open sanjeepan.18@cse.mrt.ac.lk (S. Sivapiran); charangan.18@cse.mrt.ac.lk (C. Vasantharajan);
rtuthaya@cse.mrt.ac.lk (U. Thayasivam)
GLOBE https://www.linkedin.com/in/sanjeepan/ (S. Sivapiran); https://chaarangan.github.io/ (C. Vasantharajan);
https://rtuthaya.lk/ (U. Thayasivam)
Orcid 0000-0001-7225-2432 (S. Sivapiran); 0000-0001-7874-3881 (C. Vasantharajan); 0000-0002-3936-8174
(U. Thayasivam)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
understanding semantic relationships and their respective contexts. We used the translation and
transliteration technique to convey a word from one writing system to another while preserving
the context and semantics to overcome this issue.
   The rest of the sections in the paper are as follows. Section 2 reviews related experiment
works in Sentiment Analysis. Section 3 describes the given dataset in the Shared Task[5]. The
fourth section(4) presents the system description and conducted experiments using different
approaches and features as well as the results reaped from the experiments of our proposed
system. Benchmark results are discussed in section 4.5 and finally, the conclusion.


2. Related Work
Cyberbullying and hateful speech are unpleasant parts of social media. To ensure the well-
being of the social media users from cyberbullying, social media companies always had to
invest/contribute in sentimental analysis research. Due to that, an adequate amount of studies
has been already done. Historically, there have been two approaches to solve sentimental anal-
ysis problems lexicon-based and machine learning approaches [6]. Even though they produce
moderately quality results, they failed against human-generated data. Due to that, new deep
learning models such as Bidirectional Recurrent Neural Network(RNN)[7] and Long Short-Term
Memory(LSTM) network [8] were introduced. On the other hand, [9] conducted experiments
in Kannada-English using the traditional learning approaches such as Logistic Regression(LR),
Support Vector Machine(SVM), Multinomial Naive Bayes, K-Nearest Neighbors(KNN), Decision
Trees(DT), and Random Forest (RF).
   To address the sentiment analysis problem using the above techniques, We need a corpus.
Since social-media comments/posts do not follow the strict grammar rules and also they are
always in non-native scripts as well as code-mixed [10]. [11] created a gold standard Tamil-
English code-switched, sentiment-annotated corpus containing 15,744 comment posts from
YouTube to overcome the above situation. Moreover, Chakravarthi et al. [12] created a standard
corpus for Malayalam-English to increase the sentiment analysis tasks in the code-mixed
contents.
   [13] explored in Tamil-English, Kannda-English, and [14]Malayalam-English by using the
transformer-based model mBERT. The model performed well but failed in some text where
code-mixed comes[15]. As an extension work of this research work, [16] conducted experiments
on different kinds of models such as Bidirectional LSTM, mBERT, DistilBERT, and ULMFiT [17]
to overcome this issue. Moreover, they developed a standard Translation and Transliteration
algorithm to convert the corpus into monolingual. From this approach, they could be able to
improve their system’s performance.
   Over the past decade, different kinds of models introduced, but contrasted to conventional
Recurrent Neural Network models (RNNs), the efficiency and performance of the transformer
models such as BERT[18], DistilBERT[19], mBERT[20] are remarkably distinguished. BERT
[21]) models designed to contextualize the text by jointly conditioning on both left and proper
contexts. Due to that, transformer models can be used to produce a state-of-the-art result by
just fine-tuning the output layer. After studying the above research studies, we decided to go
with transformer models and ULMFiT.
3. Dataset
The Tamil-English data set is provided by the Dravidian-CodeMix-FIRE 2021 organizing com-
mittee, which extracted from Tamil YouTube comments/posts that contains three parts(Train,
Validation, Test). The training, validation and testing datasets have 35,656, 3962, and 4392
comments, respectively, with annotated labels. The dataset consists of texts in five different
classes as follows:
    Text                                                                     Label
    Vijay Annaa Ur Maasssss Therrrrriii                                      Positive
    நம்ப நேட நாசாமா தான் ேபாச்ேச                                             Negative
    Thala’s hardwork + dedication in the movie next level #Thalaaaaaaaaaaa   Mixed Feelings
    மனிதனாய் வாழ்வதற்கு ேதைவ மனிதாபிமானம் மட்டுேம... ஜாதி இல்ைல...!          Unknown State
    Subtitle me traller dekhne wale like karo                                Not in Tamil
Table 1
Dataset samples for each sentiment class.


   The data set contains three code-mixed sentences: Inter-Sentential switch, Intra-Sentential
switch, and Tag switching. They wrote in either native Tamil script or English grammar with
Tamil. Some comments wrote in Tamil script with English words between them. Table 2
describes the dataset statistics and it is visualized in Figure 1. The following items show the
five different classes of comments with a definition:

    • Positive: Comments which are not offensive
      e.g: ennaya trailer Ku mudi Ellam nikkudhu... Vera level trailer..
    • Negative: Comments which are offensive
      e.g: எந்ெதந்த youtube channel காரங்க எல்லாம் இைத ஜாதி ெவறி படம்குறாங்கேளாேளா
      அவங்ெகல்லாம் அந்த ஜாதி என்றறிக
    • Mixed Feelings: Comments which are both negative and positive
      e.g:Kaagam karaindhu koodi unnum, Manidham ennum moodar koodam koodi serdhu
      pagaimai kollum... Idil yaar uyarthinai yaar agrinai
    • Unknown State: Comments which are not determined
      e.g:Vandha raja vah dhaan varuven Vera level str
    • Not in Tamil: Comments which are not in native Tamil
      e.g:Subtitle me traller dekhne wale like karo


4. System Description and Result Analysis
4.1. Preprocessing
Since the dataset collected from YouTube does not follow any grammar rules and is in code-mixed
form. The dataset undergoes the Following steps to use the dataset efficiently.
    • The first step is to stemming and lemmatization the words and lower casing the only
      romanized words as there is no such thing in Tamil script.
                                     Label            Train   Dev    Test
                                     positive         20070   2257   3190
                                     negative         4271    480    315
                                     unknown_state    5628    611    288
                                     mixed-feelings   4020    438    71
                                     Not-Tamil        1667    176    160
                                     Total            35656   3962   4392
Table 2
Number of comment for each class in train, validation and test sets.


Figure 1: Class distribution on Training set. Dataset is highly imbalanced where a number of com-
ments/posts in positive is much higher higher than other classes.


    • The next step is to remove all emojis, special characters, numbers, and punctuations as
      they do not carry any meaning to the sentence.
    • Finally, we applied the algorithm introduced by [16] to do translation and transliteration
      on the comments and posts to create a monolingual corpus.

4.2. Translation
After loading the dataset, we used an extensive corpus of English words from NLTK-corpus1 to
detect English words in a sentence; if the word is in the English dictionary, then we translated
the word into native Tamil script; otherwise, we ignored the word. For this purpose, We used
Google Translate API2 .

4.3. Transliteration
Most of the comments are in code mixed form. Comments should be in the native script to get
state-of-the-art results from transformers models. Transliteration is the process of transferring
a word from the alphabet of one language to another. All non-native Tamil words converted


    1
        https://www.nltk.org/
    2
        https://pypi.org/project/googletrans/
into the same meaning Tamil words using transliteration. To achieve this, we used AI4Bharat
Transliteration3 .

4.4. Models
Recently released transformer models such as BERT achieves a state of the art results in text
classification tasks. Considering the performance of transform models, we choose to start
with multilingual BERT and DistilBERT. All of our transformer-based models are culled from
HuggingFace4 transformers library and the models’ parameters are as stated in Table 3. Figure
2 depicts the architecture of our best-performed model(ULMFiT).

                                      Parameters             Value
                                      LSTM Units             128
                                      Dropout                0.2
                                      Activation Function    Softmax
                                      Max Len                128
                                      Learning Rate          1e-5
                                      Optimizer              AdamW
                                      Loss Function          Cross-Entropy
                                      Batch Size             64
                                      Epochs                 30
Table 3
Common parameters for the models that we used during our experiments.

   DistilBERT model is a small, fast, and light transformer-based model trained on the Wikipedia
dataset. It has 40% fewer parameters than BERT, runs 60% faster while preserving over 95%
of BERT’s performances. Since our purpose is to train a model in Tamil(non-Latin script), we
selected the distilbert-base-multilingual-cased model, which has six layers, 768 dimensions,
12 heads, and tantalizing 134M parameters.
   We also experimented with bert-base-multilingual-cased as our pre-trained multilingual
model having approximately 110M parameters with 12-layers and 768 hidden states.

4.5. Results and Analysis
Teams were ranked by the weighted average F1 score of their model, and we received 7th
rank. Even though our model got above rank, the F1-score difference between the first team is
relatively low.
   In the beginning, we start with our BERT model and it doesn’t perform well. It may have
happened due to the lack of BERT multilingual based model training in the Tamil language. In
the next step, We approached the problem with the ULMFiT model, a transfer learning technique
[22]. ULMFiT’s model architecture is different from transformer models, and it is an effective
transfer learning method that can apply to any task in NLP. The table shows that ULMFiT


   3
       https://pypi.org/project/ai4bharat-transliteration/
   4
       https://github.com/huggingface/
Figure 2: ULMFiT model’s Architecture. To recreate this image, we used a source image from [8]. After
unfreezing all the layers, we did more epochs to train the whole neural network rather than just the last
few layers. This method involves fine-tuning a pre-trained language model (LM) AWD-LSTM to a new
dataset in such a manner that it does not forget what it previously learned.


yielded an F1-Score of 0.6101, and DistilBert, mBERT yielded 0.60104 and 0.5963, respectively.
Precision and recalls of the above models showen in Table 4.
                            Models       Precision     Recall    F1-Score
                            ULMFiT       0.6075        0.6045    0.6101
                            DistilBert   0.5978        0.5984    0.6014
                            mBERT        0.5782        0.5627    0.5963
Table 4
Weighted F1-scores according to the models on the test data-set.


5. Conclusion
In this research, we have analyzed different NLP techniques to classify offensive language in
Tamil code-mixed YouTube comments[5]. We used a novel technique, transliteration, which
leverages the accuracy across all three models.Also, We experimented with transformer models
and transfer learning technique(ULMFiT) models. Even though transformer models are more
advanced, To our task, ULMFiT yields the best results. Since Tamil is a low-resourced language
[23], our research also can be applied to other low-resourced languages without much difficulty.
References
 [1] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan,
     R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive
     language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First
     Workshop on Speech and Language Technologies for Dravidian Languages, Association
     for Computational Linguistics, Kyiv, 2021, pp. 133–145. URL: https://aclanthology.org/2021.
     dravidianlangtech-1.17.
 [2] S. Banerjee, B. Raja Chakravarthi, J. P. McCrae, Comparison of pretrained embeddings to
     identify hate speech in indian code-mixed text, in: 2020 2nd International Conference on
     Advances in Computing, Communication Control and Networking (ICACCCN), 2020, pp.
     21–25. doi:1 0 . 1 1 0 9 / I C A C C C N 5 1 0 5 2 . 2 0 2 0 . 9 3 6 2 7 3 1 .
 [3] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the hasoc track at
     fire 2020: Hate speech and offensive language identification in tamil, malayalam, hindi,
     english and german, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
 [4] S. Suryawanshi, B. R. Chakravarthi, Findings of the shared task on troll meme classification
     in Tamil, in: Proceedings of the First Workshop on Speech and Language Technologies for
     Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 126–132.
     URL: https://aclanthology.org/2021.dravidianlangtech-1.16.
 [5] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the
     Sentiment Analysis of Dravidian Languages in Code-Mixed Text 2021, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [6] O. Habimana, Y. Li, R. Li, X. Gu, G. X. Yu, Sentiment analysis using deep learning
     approaches: an overview, Science China Information Sciences 63 (2019).
 [7] M. Schuster, K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on
     Signal Processing 45 (1997) 2673–2681. doi:1 0 . 1 1 0 9 / 7 8 . 6 5 0 0 9 3 .
 [8] A. Sherstinsky, Fundamentals of recurrent neural network (rnn) and long short-term
     memory (lstm) network, Physica D: Nonlinear Phenomena 404 (2020) 132306. URL:
     http://dx.doi.org/10.1016/j.physd.2019.132306. doi:1 0 . 1 0 1 6 / j . p h y s d . 2 0 1 9 . 1 3 2 3 0 6 .
 [9] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed dataset
     for sentiment analysis and offensive language detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s
     in Social Media, Association for Computational Linguistics, Barcelona, Spain (Online),
     2020, pp. 54–63. URL: https://aclanthology.org/2020.peoples-1.6.
[10] S. Banerjee, A. Jayapal, S. Thavareesan, Nuig-shubhanker@dravidian-codemix- fire2020:
     Sentiment analysis of code-mixed dravidian text using xlnet, in: FIRE, 2020.
[11] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus cre-
     ation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the
     1st Joint Workshop on Spoken Language Technologies for Under-resourced languages
     (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL),
     European Language Resources association, Marseille, France, 2020, pp. 202–210. URL:
     https://aclanthology.org/2020.sltu-1.28.
[12] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis
     dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), European Language Resources
     association, Marseille, France, 2020, pp. 177–184. URL: https://aclanthology.org/2020.sltu-1.
     25.
[13] C. Vasantharajan, U. Thayasivam, Hypers@DravidianLangTech-EACL2021: Offensive
     language identification in Dravidian code-mixed YouTube comments and posts, in: Pro-
     ceedings of the First Workshop on Speech and Language Technologies for Dravidian
     Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 195–202. URL:
     https://www.aclweb.org/anthology/2021.dravidianlangtech-1.26.
[14] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan,
     P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the
     HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and
     Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
[15] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil,
     malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
     Association for Computing Machinery, 2021.
[16] C. Vasantharajan, U. Thayasivam, Towards offensive language identification for tamil
     code-mixed youtube comments and posts, 2021. a r X i v : 2 1 0 8 . 1 0 9 3 9 .
[17] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, 2018.
     arXiv:1801.06146.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, 2019. a r X i v : 1 8 1 0 . 0 4 8 0 5 .
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, 2020. a r X i v : 1 9 1 0 . 0 1 1 0 8 .
[20] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual bert?, 2019.
     arXiv:1906.01502.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is all you need, 2017. a r X i v : 1 7 0 6 . 0 3 7 6 2 .
[22] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey
     on transfer learning, 2020. a r X i v : 1 9 1 1 . 0 2 6 8 5 .
[23] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for dravidian languages in
     code-mixed text, in: Forum for Information Retrieval Evaluation, 2020, pp. 21–24.