IndicBERT based approach for Sentiment Analysis on
Code-Mixed Tamil Tweets
R.Ramesh Kannan, Ratnavel Rajalakshmi and Lokesh Kumar
School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, TamilNadu, India


                                      Abstract
                                      Nowadays, Social media networks have made a huge impact in the lifestyle. Many people prefer to
                                      express their opinions on various topics in the social media platforms such as Facebook, Twitter etc.
                                      Even though, English is predominantly used by most of the people across the world to express their
                                      views, the technological advancements have paved a way for people to use their native language also
                                      to post their opinions. As many of the social media users are bilingual in nature, the trend of using a
                                      combination of English and native language has become a common scenario. Sentiment Analysis, the
                                      task of identifying the correct opinion from these Code-Mixed social media posts, is a challenging one,
                                      as the existing architectures and algorithms are designed to handle uni-lingual posts. The diversity and
                                      the rich linguistic nature of Indian languages demand highly sophisticated systems to address the above
                                      issues. In this work, we have conducted an experimental study to handle the challenges in Code-Mixed
                                      Tamil tweets and proposed a transformer based Indic-BERT approach. From the experimental results,
                                      we have shown that, an 𝐹1 score of 61.73% can be achieved, which is a significant improvement over the
                                      other traditional methods. This work has been submitted to the shared task on [1] Dravidian-CodeMix-
                                      FIRE 2021.

                                      Keywords
                                      Code-Mixed, Sentiment Analysis, Dravidian Language, Tanglish, Tamil,


1. Introduction
Sentiment analysis is the process of analyzing emotions or opinions of a given topic. It uses
Natural Language Processing(NLP), text analysis and statistics to monitor the people opin-
ions/reviews. In recent years, it is been an active area of research in both academia and industry.
The best sentiment analysis system reveals how people are saying, what people are trying to
mean on reviews/opinions. There is an increasing demand for sentiment analysis on social
media texts which are largely Code-Mixed for Dravidian languages. Code-Mixing is a com-
mon marvel in a multilingual community and the texts are written in non-native scripts[2, 3].
Monolingual systems fails due to the complexity of the Code-Switching at different levels.
Dravidian Code-Mixed shared task[3] contains data for sentiment analysis on Code-Mixed text
in Dravidian language Tanglish (Tamil English).
   Tamil, Telugu, Kannada and Malayalam are four of the 22 official languages of India and very
few of the Dravidian Languages of India spoken in India. In particular, Kannada, Malayalam and

FIRE’21: Forum for Information Retrieval Evaluation, December 13-17, 2021,India
" kannanrameshr@gmail.com (R.Ramesh Kannan); rajalakshmi.r@vit.ac.in (R. Rajalakshmi)
 0000-0002-6220-1217 (R.Ramesh Kannan); 0000-0002-6570-483X (R. Rajalakshmi)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
Telugu are the Dravidian Languages spoken by people from Karnataka, Kerala and Andrapradesh.
Tamil is a Dravidian Languages which is spoken by people from India, Srilanka, and Tamil
diaspora around the world. Tamil is the official language in Singapore and Srilanka. These
languages are used by people for Various purpose like administration, education and business
media. However, people often their native language with Roman script for typing because it is
easy for the user to type the contents. Hence, the majority of the under-resourced languages in
social media are Code-Mixed in nature.
   Regional languages are used to share people opinions on the social media. Many of the
resources are developed/generated in Arabic [4],English and other regional languages. In the
technological world, people have ease to access internet and share Code-Mixed texts on the
internet platform. Texts needs to be understood at linguistic level and the lack of Code-Mixed
data to train the model is the challenging part during analysis. Monolingual trained system might
not be suitable for Code-Mixed data, since the linguistic structure is different for Code-Mixed
data.
   Shared task [3] released in Tanglish (Tamil+English) language with social media comments
for sentiment analysis on Code-Mixed data. Our proposed system reveals, how sentiment is
expressed in Code-Mixed scenarios on social media by applying transformer based approach
Indic-BERT[5] and we have obtained an 𝐹 11 score of 61.73%. This paper is organized as follows:
Section 2 shows about the Related works that are carried out on the same domain, Section 3
discusses about the proposed methodology for the shared task, Section 4 deals with the results
obtained using the proposed methodology. Section 5 focuses on conclusion part of the study.


2. Related Works
Sentiment polarity analysis on online medium like YouTube comments is an important problem
in analyzing people opinion on public, product, sports or on movies etc. Analyzing the polarity
on the online medium contents is a challenging task. Various authors [6, 7] carried out their
research on under resources languages. Sentiment analysis on online media contents[8],[9]
and social media contents [10] had been studied by various authors. Sentiment analysis on
movie reviews were studied by [8, 9]. Online movie review is done by [9]combining Con-
volution Neural Network and Bidirectional Long Short Memory to identify the opinions on
movie contents as Hybrid approach. [8] shows the work implementation of feature weighting
method on online movie reviews. New Relevance Factor (NRF) weighting method [11] for text
classification using Naive Bayes classifier. [12] proposed universal dictionary method for text
classification on Uniform Resource Locator(URL)using Linear SVM. Text classification on legal
documents[13], context aware solution based on Cosine similarity approach and Term Fre-
quency - Inverse Document Frequency(TF-IDF) to obtain the similarity between the documents.
Attention mechanism is proposed with Recurrent Convolutional Neural Network(RCNN) [14]
for effective learning of text features on uniform resource locator. Deep learning architecture of
Convolutional Neural Network (CNN) is combined with Bidirectional Gated Recurrent Unit
(BGRU) [15] to extract the features for web page classifications. Sentiment movie reviews is
analysed with Long Short-term Memory (LSTM) with word embedding to extract the polarity
of the reviews with self attention based approach [16]. Sentiment analysis on Tweet contents
were analysed [10] by applying Maximum Entropy supervised approach and obtained 74% cross
validation accuracy score. A detailed survey on sentiment analysis was presented in work by
[17].
   The task on sentiment polarity identification on Code-Mixed data is challenging and recent
days works are reported on the Code-Mixed data sets. The authors in [18] proposed an ensemble
based machine learning approach on Code-Mixed data set. The authors proposed n-gram features
with machine learning to perform classification on Hindi- English and Bengali-English data set
and obtained a F1 score of 58% and 69% respectively. Ensemble classifier approach proposed
using CHI square feature selection approach[19] on Code-Mixed Hindi-German language using
Random Forest Classifier. Rajalakshmi. et al, [20] proposed BERT based approach on Code-
Mixed data set for offensive language identification by capturing linguistic features. The authors
obtained a validation F1 Score of 65% and testing F1 Score of 64%. Hate Speech analysis on Code-
Mixed Marathi, Hindi data were analysed using Ensembled approach [21] Extreme Gradient
Boosting Code-Mixed Hindi, English were analysed for Hate Offensive detection using Indic-
BERT [22] with Majority voting approach for HASOC2021. To process multi-lingual queries
Code-Mixing and Code-Borrowing were studied in recent days [23, 24, 25]. Relevance metric[26]
based approach is proposed for borrowing likeliness of Hindi-English tweets for ranking. [5]
proposed a new multilingual ALBERT model based approach for some of the Indian languages.
Indic-BERT can be applied to various downstream tasks in Natural Language Processing. In this
study, we have applied Indic-BERT on Dravidian Code-Mixed data set for sentiment polarity
identifications.


3. Data Set Description
Dravidian Code-Mixed data set is a collection of YouTube video comments, which contains
code mixed sentences and the types of Code-Mixed sentences are Inter-Sentential switch, Intra-
Sentential switch and Tag switch[27] . Almost all the comments were written in Tamil grammar
with English lexicon or English grammar with Tamil lexicon in native script and Roman scripts.
Few of the comments were in Tamil script with English expressions. Data set contains ID,text
and Label for each of the comments. Id contains unique number to identify particular row,
text contains YouTube comments and label shows the category of the text, which contains five
categories like Positive, Negative, not-Tamil, unknown state and mixed feelings.
   Example from Data set:
   Original Text : Yarayellam FDFS paga ippove ready agitinga
   Meaning : Who are all now ready for FDFS(FIrst Day First Show)- Positive category
   Original Text : Ennada viswasam mersal sarkar madhri time la likes and views create
pannalayae - Negative Category
   Meaning : Why likes and views are not created for the films like viswasam, mersal,sarkar. -
Negative Category
   The objective of the task is to identify sentiment polarity of the Code-Mixed data set of
comments or posts in Tamil+English collected from social media that contains any of the
following 5 category labels viz., Positive(Po), Negative(Ne), Mixed_feelings(Mf), not-Tamil(Nt),
unknown_state(Us). The data distribution is tabulated in Table 1. 56% of the comments are
Table 1
Data set Distribution(percentage in category)
                               Category          Training     Validation
                             Positive(Po)       20070 (56%)   2257 (57%)
                            Negative(Ne)         4271 (12%)   480 (12%)
                            Not Tamil(Nt)        1667 (5%)    176 (5%)
                          Mixed Feelings(Mf)     4020 (11%)   438 (11%)
                          Unknown State(Us)      5628 (16%)   611 (16%)
                                Total              35656      3962


positive and other remaining 44% of the comments are in other four categories. The percentage
of category values are as follows: Ne with 12%, Nt with 5%, Mf with 11% and Us with 16%. As
part of the sentiment analysis task, the training and validation set were released with 35656
and 3962 labelled social media comments. Both the training and validation set follows the same
distribution.


4. Proposed Methodology
The Code-Mixed comments contains Tamil, English and other language phrases and words in
the context. Instead of converting the text into any of the common language, a Multilingual
pretrained model[5] Indic-BERT is used, that has been pretrained on 12 indian languages.
Indic-BERT pretrained model is based on ALBERT(A Lite BERT for Self-Supervised Learning of
Language Representations) model, which is a recent derivative of BERT(Bidirectional Encoder
Representations from Transformers), which is pretrained on 12 indian languages like Assamese,
Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
The proposed BERT has less parameters than other public models like mBERT and XLM-R while
it manages to give state of the art performance on several tasks.
   Since it is a pretrained multilingual transformer model, the data needs to converted into
corresponding embeddings for classifications. As a pre processing step, Autotokenizer tokenizes
all the sentences into tokens. In tokens, Class[CLS] token is added at the beginning of the
sentence and seperation[SEP] token is added at the end of the each sentences. Padding [Pad]
token is padded with all the sentences till the maximum length of the sentence. Assign unique id
to each token for further processing. Attention mask is also generated for each input sentences
and it tells which tokens should be attempted and which should not be attempted by the model
during training. This will be useful when input is fed into transformer based Indic-BERT model.
   To determine the sentiments expressed in the Code-Mixed YouTube comments/posts, Indic-
BERT model is proposed with the fine tuned parameters [5]. Indic-BERT is a multilingual
representation model that extracts the context from different language input representations in
both the directions. To capture the semantic and linguistic features of a multilingual sentence,
Indic-BERT is applied. YouTube comments/Posts may have more than one sentences. Indic-BERT
has the ability to consider these multilingual inputs sentences into a single sequence for input
representations. Indic-BERT embeddings combines the token embedding,segment embedding
Table 2
Performance of the proposed approach on Training set
                    Epoch     Training Acc   Val Acc    Training Loss    Val Loss
                        1        0.6725      0.6012        1.1724        1.0428
                        2        0.6421      0.5959        1.0027        1.0234
                        3        0.6896      0.6327        0.8877        1.0013
                        4        0.6893      0.6297        0.7425        0.9687
                        5        0.6991      0.6329        0.6939        0.9681


Table 3
Comparison of results
                                Team          Precision    Recall   F1
                                 AIML           59.6        60.3    59.9
                            SSN_NLP_MLRG         59.7        61.3   60.3
                                 Ryzer           59.7        61.4   60.4
                               Proposed         61.27       64.54   61.73


Figure 1: Training and Validation Accuracy       Figure 2: Training and Validation Loss


and positional embeddings. Pretrained model can be fine-tuned to suit the downstream tasks
by adding classification layer at the bottom of the model. Indic-BERT can be used for this task
of how sentiment is expressed in Code-Mixed scenarios on social media.


5. Results and Discussion
To study the performance of the sentiment polarity of the system, we have conducted experiment
based on Indic-BERT approach on Code-Mixed data set. The experiment was conducted on
workstation with Intel Xeon Quad Core Processor, 32 GB RAM, NVIDIA Quadro P4000 GPU 8GB.
To capture the sentiment polarity on the Code-Mixed data set, we have tried transformer based
approach of Indic-BERT. To attain better performance of the BERT model, we have fine-tuned
the parameters and obtained learning rate=3e-5,batch size=64, epochs=5. Figure 1, shows the
accuracy graph on training data and validation data. For the 5th epoch accuracy score reached a
maximum level. Figure 2, plotted with loss values on training data and validation data. Obtained
a training accuracy of 69.91% and loss of 0.6939. For the validation set, obtained a loss of 0.9681
and accuracy of 63.29%. Here the classifier is able to classify all the categories, even the data
set is not balanced set. Even the data set contains very less number of Not Tamil categories
are classified correctly. From Table 3, Our proposed model out performs on Weighted average
F1 score of 61.73%. The model is able to classify all the categories irrespective of the specific
category.


6. Conclusion
There is an increase in social media contents in recent days. The goal of the Dravidian-CodeMix-
FIRE 2021 is to identify the subjective opinions or emotional responses of the social media
comments. In this work, we have presented the challenges involved in extracting the key
terms to identify the opinion from the Code-Mixed tweets. A detailed experimental study has
been performed using different architectures and we found that, the sentiment in the social
media contents are better captured using the Indic-BERT language model. We have obtained
a weighted 𝐹1 score of 61.73% with the proposed model. We observed that, the data set is
skewed and the lack of enough samples for every category has impacted the performance
of the classifier. In our future work, we planned to address the class imbalance problem for
Code-Mixed sentiment analysis.


Acknowledgments
The authors would like to thank the management of Vellore Institute of Technology, Chennai
for providing the support to carry out this work. We would like to thank the Department of
Science and Engineering Research Board (SERB),Government of India for their financial grant
(Award Number: ECR/2016/00484) for this research work.


References
 [1] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the
     Sentiment Analysis of Dravidian Languages in Code-Mixed Text 2021, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [2] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan,
     P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the
     HASOC-Dravidian CodeMix Shared Task on Offensive Language Detection in Tamil and
     Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
 [3] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     Overview of the Dravidiancodemix 2021 shared task on sentiment detection in tamil,
     malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
     Association for Computing Machinery, 2021.
 [4] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on Arabic social
     media, in: Proceedings of the First Workshop on Abusive Language Online, Asso-
     ciation for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 52–56. URL:
     https://aclanthology.org/W17-3008.
 [5] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar,
     IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual
     language models for Indian languages, in: Findings of the Association for Computational
     Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp.
     4948–4961. URL: https://aclanthology.org/2020.findings-emnlp.445.
 [6] S. Thavareesan, S. Mahesan, Sentiment lexicon expansion using word2vec and fasttext for
     sentiment prediction in tamil texts, 2020. doi:10.1109/MERCon50084.2020.9185369.
 [7] S. Thavareesan, S. Mahesan, Sentiment analysis in tamil texts: A study on machine
     learning techniques and feature representation, 2019 14th Conference on Industrial and
     Information Systems (ICIIS) (2019) 320–325.
 [8] S. Sivakumar, R. Rajalakshmi, Comparative evaluation of various feature weighting
     methods on movie reviews, in: H. S. Behera, J. Nayak, B. Naik, A. Abraham (Eds.),
     Computational Intelligence in Data Mining, Springer Singapore, Singapore, 2019, pp.
     721–730.
 [9] S. Soubraylu, R. Rajalakshmi,               Hybrid convolutional bidirectional recur-
     rent neural network based sentiment analysis on movie reviews,                       Compu-
     tational Intelligence 37 (2021) 735–757. URL: https://onlinelibrary.wiley.com/
     doi/abs/10.1111/coin.12400.               doi:https://doi.org/10.1111/coin.12400.
     arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/coin.12400.
[10] A. Samuels, J. Mcgonical, Sentiment analysis on social media content, CoRR abs/2007.02144
     (2020). URL: https://arxiv.org/abs/2007.02144. arXiv:2007.02144.
[11] R. R., Supervised term weighting methods for url classification, Journal of Computer
     Science 10 (2014). doi:10.3844/jcssp.2014.1969.1976.
[12] R. R., C. Aravindan, An effective and discriminative feature learning for url based web page
     classification, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics
     (SMC), 2018, pp. 1374–1379. doi:10.1109/SMC.2018.00240.
[13] R. R. Kannan, R. Rajalakshmi, Dlrg@aila 2019: Context - aware legal assistance system, in:
     P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum
     for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517
     of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 58–63. URL: http://ceur-ws.org/
     Vol-2517/T1-10.pdf.
[14] R. R., H. Tiwari, J. Patel, R. R., K. Ramamurthy, Bidirectional GRU-Based Attention Model
     for Kid-Specific URL Classification, 2020, pp. 78–90. doi:10.4018/978-1-7998-1192-3.
     ch005.
[15] R. Rajalakshmi, H. Tiwari, J. Patel, A. Kumar, R. Karthik., Design of kids-specific url classi-
     fier using recurrent convolutional neural network, Procedia Computer Science 167 (2020)
     2124–2131. URL: https://www.sciencedirect.com/science/article/pii/S1877050920307262.
     doi:https://doi.org/10.1016/j.procs.2020.03.260, international Conference
     on Computational Intelligence and Data Science.
[16] S. Soubraylu, R. Rajalakshmi, Analysis of sentiment on movie reviews using word embed-
     ding self-attentive lstm, International Journal of Ambient Computing and Intelligence 12
     (2021) 33–52. doi:10.4018/IJACI.2021040103.
[17] V. Ganganwar, R. Rajalakshmi, Implicit aspect extraction for sentiment analysis: A survey
     of recent approaches, Procedia Computer Science 165 (2019) 485–491.
[18] P. Mishra, P. Danda, P. Dhakras, Code-mixed sentiment analysis using machine learning
     and neural network approaches, CoRR abs/1808.03299 (2018). URL: http://arxiv.org/abs/
     1808.03299. arXiv:1808.03299.
[19] R. Rajalakshmi, B. Y. Reddy, Dlrg@hasoc 2019: An enhanced ensemble classifier for hate
     and offensive content identification, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
     Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India,
     December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, CEUR-WS.org, 2019,
     pp. 370–379. URL: http://ceur-ws.org/Vol-2517/T3-26.pdf.
[20] R. Rajalakshmi, Y. Reddy, L. Kumar, DLRG@DravidianLangTech-EACL2021: Trans-
     former based approachfor offensive language identification on code-mixed Tamil, in:
     Proceedings of the First Workshop on Speech and Language Technologies for Dravid-
     ian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 357–362. URL:
     https://aclanthology.org/2021.dravidianlangtech-1.53.
[21] R. Rajalakshmi, S. Srivarshan, M. L. P. R. Faerie, M. Faerie, K. E, S. Prithvi, K. M. Anand,
     Conversational hate-offensive detection in code-mixed hindi-english tweets, Association
     for Computing Machinery, 2021.
[22] R. Rajalakshmi, L. P. Reddy, M. Faerie, S. Srivarshan, K. M. Anand, Hate speech and
     offensive content identification in hindi and marathi languages using ensemble techniques,
     Association for Computing Machinery, 2021.
[23] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Improving wordnets for under-resourced lan-
     guages using machine translation, in: Proceedings of the 9th Global Wordnet Conference,
     Global Wordnet Association, Nanyang Technological University (NTU), Singapore, 2018,
     pp. 77–86. URL: https://aclanthology.org/2018.gwc-1.10.
[24] WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural
     Machine Translation, Zenodo, 2019. URL: https://doi.org/10.18653/v1/w19-7101.
[25] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for Dravidian languages in
     code-mixed text, in: Forum for Information Retrieval Evaluation, FIRE 2020, Association
     for Computing Machinery, New York, NY, USA, 2020, p. 21–24. URL: https://doi.org/10.
     1145/3441501.3441515.
[26] R. Rajalakshmi, R. Agrawal, Borrowing likeliness ranking based on relevance factor,
     in: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS ’17,
     Association for Computing Machinery, New York, NY, USA, 2017. URL: https://doi.org/10.
     1145/3041823.3067694. doi:10.1145/3041823.3067694.
[27] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
     Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
     and Collaboration and Computing for Under-Resourced Languages (CCURL), European
     Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.
     aclweb.org/anthology/2020.sltu-1.28.