BiLSTM-Sentiments Analysis in Code-Mixed
Dravidian Languages
Mudoor Devadas Anusha, Hosahalli Lakshmaiah Shashirekha
Department of Computer Science, Mangalore University, Mangalore, Karnataka, India


                                      Abstract
                                      Understanding the sentiments of a comment/post on social media is a fundamental step in numerous
                                      applications, and Sentiment Analysis (SA) of a text can be valuable for the decision-making process.
                                      Over the past few years, SA of texts has received much attention. One such application is to analyze
                                      the mainstream sentiments of videos on social media based on viewer comments. Social media text is
                                      primarily code-mixed, and research on SA of code-mixed low-resourced languages is in its infancy,
                                      and that too for very few language pairs. The non-availability of annotated code-mixed data for
                                      low-resourced languages makes the SA task much more complex. Kannada, Malayalam and Tamil,
                                      belonging to the family of Dravidian languages, are popular south Indian languages but are
                                      low-resourced. Content in each of these languages mixed with English, written either in Roman
                                      script or in a combination of the native script and Roman script, is abundantly available on social
                                      media. In this paper, we, team MUM, describe the proposed Bidirectional Long Short Term Memory
                                      (BiLSTM) model submitted to “Sentiment Analysis of Dravidian Languages in Code-Mixed Text” - a
                                      shared task at the Forum for Information Retrieval Evaluation (FIRE) 2021 - to analyze the
                                      sentiments in Kannada-English (Kn-En), Malayalam-English (Ma-En), and Tamil-English (Ta-En)
                                      code-mixed texts. In the proposed approach, the code-mixed word embeddings are constructed using
                                      the training sets of the respective code-mixed language pairs, and these embeddings are used to
                                      build a Deep Learning (DL) model based on BiLSTM. Our proposed model obtained the 13th, 14th, and
                                      14th ranks with weighted F1-scores of 0.563, 0.604, and 0.369 for the code-mixed Ma-En, Ta-En,
                                      and Kn-En language pairs respectively.

                                      Keywords
                                      Machine Learning, BiLSTM, Word Embedding, Code-mixing, Dravidian language




1. Introduction
An analysis of users’ sentiments can help in understanding users’ attitudes and moods, which
can help to draw insights for future decision-making. Rather than a simple check of notifications
or comments, sentiments describe the underlying feelings and assessments. An essential
aspect of Natural Language Processing (NLP) is SA, which involves understanding the polarity
of a given text or sentence. SA is the task of identifying subjective impressions or reactions about a
given subject, and SA of social media uncovers the opinions of users about whatever they see
or hear on social media [1]. SA has been an on-going area of research for more than a decade in both
academia and industry. However, the increasing online content in the form of code-mixed text
on social media is throwing new challenges to the SA research community.

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India.
" anugowda251@gmail.com (M. D. Anusha); hlsrekha@gmail.com (H. L. Shashirekha)
~ https://mangaloreuniversity.ac.in/dr-h-l-shashirekha (H. L. Shashirekha)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
   Code-mixing or code-switching is a common phenomenon in a multi-lingual community where
sentences, phrases, words and morphemes of two or more languages are mixed in speech or
writing according to one’s convenience. As the younger generation is quite familiar with the
English language, they tend to mix up their native language with English according to their
whims and fancies. India, being a multi-lingual country, provides much scope for code-mixed
text. Kannada, Malayalam and Tamil belong to the family of Dravidian languages and
are the official languages of Karnataka, Kerala, and Tamilnadu respectively. The scripts of these
three Dravidian languages are alpha-syllabic, belonging to the group of Abugida writing systems
that are partly alphabetic and partly syllable-based [2] [3]. Despite their popularity, these
three languages are digitally low-resourced [4]. As users tend to use a combination
of languages to post comments on social media, the majority of the data available on social
media for these languages is code-mixed. Code-mixed texts on social media are usually written in
non-native scripts, especially the Roman script [5], due to the ease of using the Roman script
and also due to the technological limitations of native-language keyboard layouts on
smartphones.
   SA systems trained on mono-lingual text are not suitable for code-mixed text due to the
complexity of mixing languages at various linguistic levels in the given text [6]. In order to
promote research in SA of code-mixed Dravidian languages, “Sentiment Analysis of Dravidian
Languages in Code-Mixed Text”1 - a shared task in FIRE 20212 - provides an opportunity for
researchers to develop and evaluate working models for SA. The organizers of the shared
task provide code-mixed datasets in Kn-En, Ma-En, and Ta-En language pairs with the
objective of identifying the sentiment polarity of a given code-mixed text in these language
pairs.
   The SA task is a typical binary (coarse-grained) or multi-class (fine-grained) Text Classification
(TC) task of assigning a sentiment polarity to a given text depending on the predefined number
of categories. Researchers have developed several models for SA of natural language mono-
lingual texts as well as code-mixed texts using conventional Machine Learning (ML) and DL
approaches based on Neural Networks (NN). DL models are gaining popularity as they provide
accurate and effective results for TC by reducing false positives [7]. The majority of DL models
use BiLSTM, which contains two Long Short Term Memory (LSTM) models: one processing the
input in the forward direction and the other in the backward direction. BiLSTMs are at the
core of many NNs that achieve cutting-edge performance in NLP tasks [7].
   Word embedding is an active research area in which researchers endeavour to find better
representations of words by capturing as much of the contextual, semantic, and syntactic
information about words as possible [8]. In this approach, the representation of words is based
on the distributional hypothesis, according to which words with similar meanings occur in
similar contexts or textual vicinity, and each word is represented by a real-valued vector in a
predefined vector space. This distributed representation of words is expected to provide a great
deal of insight for many NLP applications as it captures the syntactic and semantic information
of words in a sufficiently large corpus.
   The BiLSTM NN consists of LSTM units that integrate past and future context information

   1
       https://dravidian-codemix.github.io/2021/index.html
   2
        https://competitions.codalab.org/competitions/30642#learn_the_details-overview
because of which they show excellent performance on sequential modeling problems
as well as for TC [9]. NN models expect numeric values as input. Hence, it is necessary to
convert the text data to numeric representation by building an embedding layer before building
a BiLSTM model. In this paper, we, team MUM, describe the BiLSTM model submitted to
“Sentiment Analysis of Dravidian Languages in Code-Mixed Text” shared task in FIRE 2021 to
identify the sentiment polarity of the given code-mixed comment.
   The rest of the paper is organized as follows: a few recent works related to SA are described in
Section 2, followed by the proposed methodology in Section 3. The experimental setup and results
are described in Section 4, and Section 5 concludes the paper, throwing light on future work.


2. Related Work
Researchers have explored different algorithms for SA of monolingual texts as well as code-
mixed texts of different language pairs. However, very few works have been reported for code-mixed
Indian language texts in general and Dravidian languages in particular.
   Chakravarthi et al. [10] created a benchmark code-mixed corpus for SA in the Ma-En language
pair. Using the youtube-comment-scraper tool3, they collected 116,711 Ma-En code-mixed sentences
from YouTube comments posted for Malayalam movie trailers during 2019. The majority of the
content in these comments was written either in English or as a combination of English and
Malayalam in Malayalam script and/or Roman script. These comments were tokenized into
sentences and the sentences were annotated for SA by volunteers. In addition, the authors also
experimented with SA using the traditional ML algorithms, namely: Logistic Regression (LR), Support
Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Multinomial Naïve Bayes (MNB),
and K-Nearest Neighbour (kNN), as baselines. The ML algorithms were trained using the Term
Frequency-Inverse Document Frequency (TF-IDF) vectors obtained by vectorizing the sentences.
Among the baseline classifiers, the LR and RF classifiers achieved the higher average weighted macro
F1-scores of 0.66 and 0.61 respectively.
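   For illustration, such a TF-IDF baseline can be sketched in scikit-learn as follows; the comments and labels below are toy placeholders, not the benchmark data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy code-mixed comments and sentiment labels (placeholders).
    texts = ["padam kollam", "very bad trailer", "super movie", "waste padam"]
    labels = ["Positive", "Negative", "Positive", "Negative"]

    # TF-IDF vectorization of the sentences followed by an LR classifier.
    baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
    baseline.fit(texts, labels)
    print(baseline.predict(["super trailer"]))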
   Hande et al. [11] developed KanCMD - a multi-task Kn-En code-mixed dataset for SA and
Offensive Language Identification (OLI). This work aims to promote multi-task learning for
under-resourced languages in general and the Kannada language in particular. The dataset, which
consists of 7,671 comments annotated by at least three annotators, is benchmarked using
computational models. The same baselines and feature set as used by Chakravarthi et al. [10]
are used to evaluate both SA and OLI on the KanCMD dataset. Among all the classifiers, the LR
classifier obtained the highest weighted F1-scores of 0.70 and 0.77 for SA and OLI respectively.
   Chakravarthi et al. [12] proposed the most substantial corpus of code-mixed Ta-En (Tanglish)
text with sentiment polarity annotations. They reported a high inter-annotator agreement in
terms of Krippendorff’s 𝛼 from voluntary annotators on contributions collected using Google
forms and created gold-standard annotated data for code-mixed Ta-En SA. In addition, the authors
also evaluated ML classifiers, namely: LR, kNN (with 3, 4, 5, and 9 neighbors), DT, MNB, RF and
SVM, and DL classifiers, namely: 1D Conv-LSTM, Bidirectional Encoder Representations from
Transformers (BERT)-Multilingual, Dynamic Meta Embedding (DME) and Contextual DME, as
baselines. They trained the ML classifiers on TF-IDF vectors of word n-grams in the range (1, 3)
and used word2vec features for the DL models. Among all the baselines, the RF model exhibited
the highest performance with a macro F1-score of 0.42 and a weighted F1-score of 0.65.
    3
        https://github.com/philbot9/youtube-comment-scraper
   A total of 1,200 Hindi and 300 Marathi texts consisting of chats, Tweets, and YouTube comments
were collected by Ansari et al. [13] for SA of code-mixed transliterated Hindi and Marathi texts.
The study involves Language Identification (LI), word transliteration, sentiment scoring, and
feature extraction, along with learning methods. They conducted several experiments to
classify transliterated Hindi and Marathi text using kNN, NB, SVM and ontology-based classifiers
and obtained weighted F1-scores of 0.59 and 0.57 for NB and Linear SVM respectively.
   Choudhary et al. [14] proposed a novel approach called Sentiment Analysis of Code-Mixed
Text (SACMT) to classify sentences into positive, negative or neutral sentiments using con-
trastive learning. This work introduces a basic clustering-based pre-processing method for
capturing variations of code-mixed transliterated words and utilizes the shared parameters of
Siamese networks to map the sentences of code-mixed and standard languages to a common sen-
timent space. The proposed approach employs twin BiLSTM networks with shared parameters
to capture a sentiment-based representation of the sentences, which is used in conjunction with
a similarity metric to group sentences with similar sentiments together. SACMT obtained a
weighted F1-score of 0.759 for SA of code-mixed text and outperformed the existing
approaches by 7.6% in accuracy and 10.1% in F1-score.
   The system developed by Joshi et al. [15] introduces a sub-word LSTM architecture for learning
sub-word level representations instead of character or word level representations for SA, and
the linguistic prior in their architecture gives them the ability to learn sentiment information
about important morphemes. The authors hypothesize that encoding this linguistic prior
in the sub-word LSTM architecture leads to superior performance. For the dataset containing
3,879 code-mixed Hindi-English (Hi-En) sentences gathered from Facebook, the sub-word LSTM
and char-LSTM obtained F1-scores of 0.658 and 0.511 respectively. Additionally, morpheme-level
feature maps demonstrated that the model performed well even for heavily noised text
containing misspellings.
   Lal et al. [16] proposed a hybrid model for SA tasks in Hi-En code-mixed texts using
sub-word embeddings. They first generate sub-word level representations for the sentences
using a Convolutional Neural Network (CNN) architecture and use them as inputs to a Dual
Encoder Network consisting of two different BiLSTMs: a Collective and a Specific Encoder. The
Collective Encoder captures the overall sentiment of the sentence, while the Specific Encoder
utilizes an attention mechanism to focus on individual sentiment-bearing sub-words.
This is combined with a feature network consisting of orthographic features and specially
trained word embeddings. On the dataset consisting of 3,879 code-mixed Hi-En messages
created by [15], their proposed model achieved state-of-the-art results of 83.54% accuracy and
0.827 F1-score.


3. Methodology
The proposed methodology includes: i) Pre-processing - to clean the text data by removing
unnecessary information, ii) Feature Engineering - to represent the sentences/comments as word
vectors, and iii) Model Construction - to perform SA. Each of these steps is explained below:

3.1. Pre-processing
Text data needs to be pre-processed to remove noise so that the performance of the classifier
can be improved. Pre-processing the data is just as important as the model building itself. Text
pre-processing procedures may differ depending on the task and the dataset used. The following
pre-processing steps were applied in the proposed work:

    • Converting text to lowercase as the character case does not matter for TC task.
    • Removing numeric and punctuation information as they are not important for TC task.
    • Eliminating stop words - the frequently occurring words in a language, as they are not
      the distinguishing features for TC task.
    • Label encoding - converting the class/category labels into numeric form to make
      them machine-readable. For example, the class labels of the SA task: Mixed feelings,
      Positive, Unknown_state, Negative, and Not-Kannada, will be encoded as 0, 1, 2, 3 and 4
      respectively. A minimal sketch of these pre-processing steps is given after this list.
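
The following is a minimal Python sketch of these pre-processing steps, assuming scikit-learn for label encoding; the stop-word list is a placeholder, as the paper does not specify which list was used for the code-mixed text:

    import re
    from sklearn.preprocessing import LabelEncoder

    STOP_WORDS = {"the", "is", "a", "an", "and"}  # placeholder stop-word list (assumed)

    def preprocess(text):
        text = text.lower()                    # convert to lowercase
        text = re.sub(r"\d+", " ", text)       # remove numeric information
        text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
        return " ".join(t for t in text.split() if t not in STOP_WORDS)  # drop stop words

    print(preprocess("The movie is SUPER!!! 100%"))  # -> "movie super"

    # Label encoding: note that LabelEncoder assigns integers in sorted label
    # order, which may differ from the illustrative 0-4 mapping given above.
    labels = ["Mixed feelings", "Positive", "Unknown_state", "Negative", "Not-Kannada"]
    print(dict(zip(labels, LabelEncoder().fit_transform(labels))))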

3.2. Feature Engineering
As text data has to be represented as numeric values for any NN model, word embeddings
are used to encode the text data into numeric vectors. For learning the word embeddings, Mikolov
et al.’s [17] word2vec skip-gram model with 300 embedding dimensions and a window size of 10
is used. The advantage of word2vec is that its vectors can be used to learn word similarities
and relationships [18]. The word2vec skip-gram model is first trained on the Train set provided
by the organizers of the shared task to obtain the word embeddings. After training, the word
embedding is treated as a lookup table and the representations of the words are obtained
from this lookup table. The sentences/comments are represented as vectors by averaging the
numeric representations of all the words present in that sentence/comment.
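
A minimal sketch of this feature pipeline, assuming gensim (version 4 or later) and toy tokenized comments in place of the actual Train set, could look as follows:

    import numpy as np
    from gensim.models import Word2Vec

    # Toy tokenized Train-set comments (placeholders, not the shared task data).
    train_comments = [["movie", "super", "aagide"],
                      ["trailer", "kandittu", "kollam"],
                      ["padam", "nalla", "irukku"]]

    # Skip-gram (sg=1) model with 300-dimensional vectors and window size 10,
    # as described above.
    w2v = Word2Vec(sentences=train_comments, vector_size=300, window=10,
                   sg=1, min_count=1)

    def comment_vector(tokens, model, dim=300):
        # Average the embeddings of all in-vocabulary words; zero vector if none.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    X_train = np.vstack([comment_vector(c, w2v) for c in train_comments])
    print(X_train.shape)  # (3, 300)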

3.3. Model Construction
BiLSTM - a Recurrent Neural Network (RNN) that works with two hidden layers processing
the input in both directions simultaneously - has shown good results for NLP applications [19].
The generated embedding vectors of 300 dimensions, with the activation, optimizer and dropout
parameters set to “softmax”, “adam” and 20% respectively, are used to train a BiLSTM network
for a dynamic number of epochs until the loss value stabilizes (at most 20 epochs). The input to
the BiLSTM layer is fed through a time-distributed wrapper to a dense layer with the Rectified
Linear Unit (ReLU) activation function, followed by a dense layer with softmax activation after
flattening the output of the previous layer. The output dimensions of the model are configured
based on the number of class labels. The structure of the BiLSTM model is shown in Figure 1.
Figure 1: Structure of the BiLSTM Model
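
A minimal Keras sketch of the described architecture is given below; the sequence length, LSTM units, and dense layer width are illustrative assumptions (and each comment is assumed to be fed as a padded sequence of word vectors), while the 300-dimensional inputs, 20% dropout, “adam” optimizer, softmax output, and the cap of 20 epochs follow the description above:

    from tensorflow.keras import layers, models, callbacks

    MAX_LEN, EMBED_DIM, NUM_CLASSES = 50, 300, 5   # MAX_LEN and NUM_CLASSES assumed

    model = models.Sequential([
        layers.Input(shape=(MAX_LEN, EMBED_DIM)),            # word2vec vectors as input
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # units assumed
        layers.Dropout(0.2),                                  # 20% dropout
        layers.TimeDistributed(layers.Dense(32, activation="relu")),   # per-step ReLU layer
        layers.Flatten(),                                     # flatten before the output layer
        layers.Dense(NUM_CLASSES, activation="softmax"),      # one unit per class label
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Train until the loss stabilizes, for at most 20 epochs:
    stop = callbacks.EarlyStopping(monitor="loss", patience=3, restore_best_weights=True)
    # model.fit(X_train, y_train, epochs=20, callbacks=[stop])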


4. Experimental setup and Results
The dataset5 [20] provided by the shared task organizers for SA of code-mixed Kn-En [11],
Ma-En [21] and Ta-En [22] [23] language pairs consists of Train, Development (Dev), and Test
sets. A comment/post in the dataset may contain more than one sentence, but the average number
of sentences per comment in each language pair is one, and each comment/post is annotated with
one of the sentiment polarities: Positive_state, Negative_state, Mixed_feelings, Neutral,
Unknown_state, and Other_language (Not Kannada/Tamil/Malayalam). The given dataset has a class
imbalance issue, which is consistent with how sentiments are expressed in reality. The
distribution of labels in the dataset is shown in Table 1.
   Scikit-learn6 and Keras7 - a minimalist library for DL - are used to implement the code in
Python. The BiLSTM model with word embedding features, applied to the Test sets of all three
language pairs, obtained the 13th, 14th, and 14th ranks with weighted F1-scores of 0.563, 0.604,
and 0.369 for Ma-En9, Ta-En8, and Kn-En10 respectively. The results obtained in terms of Precision,
Recall, and F1-score are shown in Table 2. Table 3 shows the performance of the proposed
approach on the Dev sets of the Ta-En, Ma-En, and Kn-En language pairs.
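
For reference, the weighted F1-score used for ranking can be computed with scikit-learn as in the following sketch; the label arrays are illustrative placeholders, not the shared task predictions:

    from sklearn.metrics import classification_report, f1_score

    y_true = [0, 1, 2, 1, 0, 2]  # hypothetical gold labels
    y_pred = [0, 1, 1, 1, 0, 2]  # hypothetical model predictions

    # Weighted F1 averages the per-class F1-scores, weighted by class support.
    print(f1_score(y_true, y_pred, average="weighted"))
    print(classification_report(y_true, y_pred))  # per-class precision, recall, F1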


   5
      https://competitions.codalab.org/competitions/30642#participate
   6
      https://scikit-learn.org/stable
    7
      https://www.tensorflow.org/api_docs/python/tf/keras/
    8
      https://drive.google.com/file/d/14pKDC5fuRcWoAnn_HpBD50pdxGszxvPT/view
    9
      https://drive.google.com/file/d/1nZaQ4fm0h6rIHVtbYwWYVmvM8AFD71pD/view
   10
      https://drive.google.com/file/d/1TkWH9vp89p2Yzza3OS3XfWhWYwudhAdA/view
Table 1
Distribution of labels in the given dataset
    Language pair        Set      Positive_   Mixed_     Unknown_   Other_     Negative_   Total
                                  state       feelings   state      language   state
    Malayalam-English    Train    6,421       926        5,279      1,157      2,105       15,888
                         Dev      706         102        580        141        237         1,768
                         Test     780         134        643        147        258         1,962
    Tamil-English        Train    20,070      4,020      5,628      1,667      4,271       35,656
                         Dev      2,257       438        611        176        480         3,962
                         Test     2,546       470        665        244        477         4,402
    Kannada-English      Train    2,823       574        711        916        1,188       6,212
                         Dev      321         52         69         110        139         691
                         Test     374         65         62         110        157         768

Table 2
Results of the proposed BiLSTM models on the Test sets
                       Language pair          Precision   Recall   F1-score   Rank
                     Malayalam-English          0.583     0.624     0.563      13
                       Tamil-English            0.621     0.626     0.604      14
                      Kannada-English           0.407     0.487     0.369      14

Table 3
Results of the proposed BiLSTM models on the Development sets
                           Language pair           Precision   Recall   F1-score
                         Malayalam-English          0.662     0.648     0.684
                            Tamil-English           0.684     0.651     0.714
                          Kannada-English           0.591     0.546     0.601


5. Conclusion and Future Work
In this work, we, team MUM, present the description of the working model for SA of code-
mixed text in Kannada, Malayalam, and Tamil submitted to the “Sentiment Analysis of Dravidian
Languages in Code-Mixed Text” shared task in FIRE 2021. To tackle the challenge of classifying
the given YouTube comments into one of the predefined sentiment categories, we propose a BiLSTM
model with word embeddings as features. The proposed model obtained weighted F1-scores of 0.563,
0.604, and 0.369 for the Ma-En, Ta-En, and Kn-En language pairs respectively. Exploring different
features and different learning models, such as Transfer Learning, for SA of code-mixed Indian
languages is in the pipeline.


References
 [1] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the
     Sentiment Analysis of Dravidian Languages in Code-Mixed Text 2021, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [2] B. R. Chakravarthi, N. Rajasekaran, M. Arcan, K. McGuinness, N. E. O’Connor, J. P. McCrae,
     Bilingual Lexicon Induction across Orthographically-Distinct Under-Resourced Dravidian
     Languages (2020).
 [3] B. Krishnamurti, The Dravidian Languages, Cambridge University Press, 2003.
 [4] B. R. Chakravarthi, R. Priyadharshini, S. Banerjee, R. Saldanha, J. P. McCrae, P. Krish-
     namurthy, M. Johnson, et al., Findings of the Shared Task on Machine Translation in
     Dravidian Languages, in: Proceedings of the First Workshop on Speech and Language
     Technologies for Dravidian Languages, 2021, pp. 119–125.
 [5] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named Entity Recog-
     nition for Code-Mixed Indian Corpus using Meta Embedding, in: 2020 6th International
     Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020,
     pp. 68–72.
 [6] K. Bali, J. Sharma, M. Choudhury, Y. Vyas, “I am borrowing ya mixing?” An Analysis
     of English-Hindi Code Mixing in Facebook, in: Proceedings of the First Workshop on
     Computational Approaches to Code Switching, 2014, pp. 116–126.
 [7] R. Bhargava, Y. Sharma, S. Sharma, Sentiment Analysis for Mixed Script Indic Sentences,
     in: 2016 International Conference On Advances In Computing, Communications And
     Informatics (ICACCI), IEEE, 2016, pp. 524–529.
 [8] K. Ghosh, A. Senapati, Technical Domain Identification using Word2vec and BiLSTM, in:
     17th International Conference on Natural Language Processing, 2020, p. 21.
 [9] B. Jang, M. Kim, G. Harerimana, S.-u. Kang, J. W. Kim, Bi-LSTM Model to Increase Accuracy
     in Text Classification: Combining Word2vec CNN and Attention Mechanism, Applied
     Sciences 10 (2020) 5841.
[10] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A Sentiment Analysis
     Dataset for Code-Mixed Malayalam-English, arXiv preprint arXiv:2006.00210 (2020).
[11] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed Dataset
     for Sentiment Analysis and Offensive Language Detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions
     in Social Media, 2020, pp. 54–63.
[12] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus Creation for
     Sentiment Analysis in Code-Mixed Tamil-English Text, arXiv preprint arXiv:2006.00206
     (2020).
[13] M. A. Ansari, S. Govilkar, Sentiment Analysis of Mixed Code for the Transliterated Hindi
     and Marathi Texts, International Journal on Natural Language Computing (IJNLC) 7
     (2018).
[14] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment Analysis of Code-Mixed
     Languages Leveraging Resource-Rich Languages, arXiv preprint arXiv:1804.00806 (2018).
[15] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards Sub-Word Level Compositions for
     Sentiment Analysis of Hindi-English Code Mixed Text, in: Proceedings of COLING 2016,
     the 26th International Conference on Computational Linguistics: Technical Papers, 2016,
     pp. 2482–2491.
[16] Y. K. Lal, V. Kumar, M. Dhar, M. Shrivastava, P. Koehn, De-Mixing Sentiment from
     Code-Mixed Text, in: Proceedings of the 57th Annual Meeting of the Association for
     Computational Linguistics: Student Research Workshop, 2019, pp. 371–377.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed Representations
     of Words and Phrases And their Compositionality, in: Advances In Neural Information
     Processing Systems, 2013, pp. 3111–3119.
[18] A. Khatua, A. Khatua, E. Cambria, A Tale of Two Epidemics: Contextual Word2Vec for
     Classifying Twitter Streams during Outbreaks, Information Processing & Management 56
     (2019) 247–257.
[19] K. S. Tai, R. Socher, C. D. Manning, Improved Semantic Representations from Tree-
     Structured Long Short-Term Memory Networks, arXiv preprint arXiv:1503.00075 (2015).
[20] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan,
     R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of The Shared Task on Offensive
     Language Identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First
     Workshop on Speech and Language Technologies for Dravidian Languages, 2021.
[21] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A Sentiment Analysis
     Dataset for Code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), 2020.
[22] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     Overview of the DravidianCodeMix 2021 Shared Task on Sentiment Detection in Tamil,
     Malayalam, and Kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
     Association for Computing Machinery, 2021.
[23] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     Sentiment Analysis in code-mixed Tamil-English Text, in: Proceedings of the 1st Joint
     Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and
     Collaboration and Computing for Under-Resourced Languages (CCURL), 2020.