=Paper= {{Paper |id=Vol-2826/T4-9 |storemode=property |title=IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: Sentiment Analysis for Dravidian Languages in Code-Mixed Text |pdfUrl=https://ceur-ws.org/Vol-2826/T4-9.pdf |volume=Vol-2826 |authors=Supriya Chanda,Sukomal Pal |dblpUrl=https://dblp.org/rec/conf/fire/ChandaP20 }} ==IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: Sentiment Analysis for Dravidian Languages in Code-Mixed Text== https://ceur-ws.org/Vol-2826/T4-9.pdf

IRLab@IITBHU@Dravidian-CodeMix-FIRE2020:
Sentiment Analysis for Dravidian Languages in
Code-Mixed Text
Supriya Chandaa , Sukomal Palb
a
Indian Institute of Technology (BHU), Varanasi, INDIA
b
Indian Institute of Technology (BHU), Varanasi, INDIA

Abstract
This paper describes the IRlab@IITBHU system for the Dravidian-CodeMix - FIRE 2020: Sentiment Analysis for
Dravidian Languages pairs Tamil-English (TA-EN) and Malayalam-English (ML-EN) in Code-Mixed text. We
submitted three models for sentiment analysis of code-mixed TA-EN and MA-EN datasets. Run-1 was obtained
from the BERT and Logistic regression classifier, Run-2 used the DistilBERT and Logistic regression classifier,
and Run-3 used the fastText model for producing the results. Run-3 outperformed Run-1 and Run-2 for both
the datasets. We obtained an 𝐹1 -score of 0.58, rank 8/14 in TA-EN language pair and for ML-EN, an 𝐹1 -score of
0.63 with rank 11/15.

Keywords
Code Mixed, Malayalam, Tamil, BERT, fastText, Sentiment Analysis,

1. Introduction
Internet and digitization enabled people express their views, sentiments, opinions through blog posts,
online forums, product review websites, and different social media. Millions of people from different
linguistic and cultural backgrounds use social networking sites like Facebook, Twitter, LinkedIn, and
YouTube to express their emotions, opinions, and share views on different issues that matter in their
lives. As a large number of Indian users can speak multiple languages proficiently (at least two: native
languages like Malayalam, Tamil, Hindi, and English), an unplanned switching between languages of-
ten happens unconsciously. Even though many languages have their own scripts, social media users
often use non-native scripts, usually Roman script, because of socio-linguistics reasons. This phe-
nomenon is called code-mixing and is defined as “the embedding of linguistic units such as phrases,
words and morphemes of one language into an utterance of another language" (Myers-Scotton[1]).
Code-mixed data is generally observed in a place of informal communication like social media. The
data can be easily extracted from social media sources using different APIs. Sentiment analysis (SA)
on social media text has become an important research task in academia and industry in the past
two decades. SA helps understand people’s opinion from movie/product reviews, and thus help take
decision to improve customer satisfaction through advertisement and marketing.
The shared task [2, 3] here aims to identify sentiment polarity of the code-mixed data of YouTube
comments in Dravidian Language pairs (Malayalam-English [4] and Tamil-English [5]) collected from
social media. In the past few years, there have been multiple attempts to process code-mixed data, and
FIRE 2020: Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: supriyachanda.rs.cse18@itbhu.ac.in (S. Chanda); spal.cse@itbhu.ac.in (S. Pal)
url: https://cse-iitbhu.github.io/irlab/supriya.html (S. Chanda); https://cse-iitbhu.github.io/irlab/spal.html (S. Pal)
orcid: 0000-0001-8743-9830 (S. Pal)
© 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org)
Table 1
Data Distribution
Tamil - English Malayalam - English
Class Training Development Test Total Training Development Test Total
Mixed_feelings 1283 141 377 1801 289 44 70 403
Negative 1448 165 424 2037 549 51 138 738
Positive 7627 857 2075 10559 2022 224 565 2811
not-Tamil 368 29 100 497 - - - -
not-malayalam - - - - 647 60 177 884
unknown_state 609 68 173 850 1344 161 398 1903
Total 11335 1260 3149 15744 4851 540 1348 6739

a shared task on sentiment analysis of code-mixed Indian languages[6] was organized in ICON 2017.
However, the freely available data apart from Hindi-English and Bengali-English are still limited in
Indian languages, although some other languages like English-Spanish and Chinese-English datasets
are available for research.
The rest of the paper is organized as follows. Section 2 describes the dataset, pre-processing and
processing techniques. In Section 3, we report our results and analysis. Finally we conclude in Section
4.

2. System Description
2.1. Datasets
The Dravidian-CodeMix shared task1 organizers provided a dataset that consists of 15,744 Tamil-
English and 6,739 Malayalam-English YouTube video comments. The statistics of training, develop-
ment, and test data corpus collection and their class distribution are shown in Table 1. Here, each
comment is annotated by six (for ML-EN) and eleven (for TA-EN) independent annotators. An inter-
annotator agreement score of 0.6 with Krippendorff’s alpha is obtained for the Tamil-English dataset,
and score of 0.8 with Krippendorff’s alpha for the Malayalam-English dataset. Some comment ex-
amples from the training dataset (Tamil-English) are shown in Table 2. The dataset provided suffers
from general problems of social media data, particularly code-mixed data. The sentences are short
with lack of well-defined grammatical structures, and many spelling mistakes.

2.2. Data Pre-processing
The YouTube comment dataset used in this work is already labelled into five categories: Positive,
Negative, Mixed_feelings, unknown_state and not-Tamil or not-Malayalam. Our pre-processing of
comments includes the following steps:
• Removal of extended words: number of words which have one or more contiguous repeating
characters 2

• Removal of exclamations and other punctuation

• Removal of non-ASCII characters, all the emoticons, symbols, numbers, special characters.

1
https://dravidian-codemix.github.io/2020/index.html
2
https://github.com/SupriyaChanda/Dravidian-CodeMix-FIRE2020
Table 2
Example YouTube comments from the Dravidian-CodeMix dataset for all clases
Sample comments from dataset(Tamil-English) Category
Ena da bgm ithu yuvannnnnnnnnn rocksssssss Positive
Kola gaadula iruka... Thalaivaaaaaaaa waiting layea veri aaguthey Negative
Wow wow wow... Thalaivaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa....... proud to be every Indian...
<3 thanks to shankar sir and holl team... Mixed_feelings
Nenu ee movie chusanu super movie not-Tamil
Super. 1 like is equivalent to 100 likes. unknown_state

2.3. Word Embedding
Word embedding is arguably the most widely known technology in the recent history of NLP. It cap-
tures the semantic property of a word. We use bert-base-uncased and distilbert-base-uncased
pre-trained models3 to get a vector as an embedding for the sentence that we can use for classifica-
tion. Apart from these two pre-trained models, we experiment with other pre-trained models like
bert-base-multilingual-uncased, bert-base-multilingual-cased.

• BERT: Bidirectional Encoder Representations from Transformers (BERT)[7] is a technique for
NLP pre-training developed by Google. BERT is pre-trained on a large corpus of unlabelled
text, including the entire Wikipedia (that is 2,500 million words!) and the Book Corpus (800
million words). BERT-Base uncased has 12 layers (transformer blocks), 12 attention heads, and
110 million parameters.

• DistilBERT: DistilBERT[8] is a smaller version of BERT developed and open-sourced by the
team at HuggingFace. It is a lighter and faster version of BERT that roughly matches its perfor-
mance. DistilBERT also compares surprisingly well to BERT on downstream tasks while having
about half and one third the number of parameters.

• fastText: fastText, developed by Facebook, combines certain concepts introduced by the NLP
and ML communities, representing sentences with a bag-of-words and n-grams using subword
information and sharing them across classes through a hidden representation. fastText[9] can
learn vector representations of out-of-vocabulary words, which is useful for our dataset that
contains Malayalam and Tamil words in Roman script.

After pre-processing our data and transforming all the comments into vector, we implement our
classification algorithms and construct our training models. We used the multinomial logistic regres-
sion 4 with the fastText embeddings for unigrams, bigrams, and trigrams present along with different
learning rates and epochs. we got the maximum 𝐹1 score on fastText text classification model with
-wordNgrams= 1, learning rate = 0.1 and epochs = 10.

3. Results and Analysis
We use scikit-learn5 machine learning package for the implementation. A Macro 𝐹1 score was
used to evaluate every system. Macro 𝐹1 score of the overall system was the average of 𝐹1 scores of
3
https://huggingface.co/transformers/pretrained_models.html
4
https://fasttext.cc/docs/en/supervised-tutorial.html
5
http://scikit-learn.org
Table 3
Evaluation results on test data and rank list
Tamil - English Malayalam - English
Team Name Precision Recall 𝐹1 score Rank Precision Recall 𝐹1 score Rank
SRJ 0.64 0.67 0.65 1/14 0.74 0.75 0.74 1/15
IRLab@IITBHU 0.57 0.61 0.58 8/14 0.63 0.64 0.63 11/15

Table 4
Precision, recall, 𝐹1 -score, and support for all experiment on Tamil-English test data
BERT DistilBERT FastText
Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score support
Mixed_feelings 0.17 0.02 0.04 0.12 0.01 0.01 0.23 0.07 0.11 377
Negative 0.40 0.12 0.18 0.42 0.08 0.13 0.34 0.28 0.31 424
Positive 0.69 0.95 0.80 0.68 0.97 0.80 0.72 0.82 0.76 2075
not-Tamil 0.66 0.57 0.61 0.67 0.53 0.59 0.27 0.61 0.37 100
unknown_state 0.19 0.04 0.07 0.29 0.02 0.04 0.23 0.13 0.17 173
macro avg 0.42 0.34 0.34 0.44 0.32 0.32 0.36 0.38 0.34 3149
weighted avg 0.56 0.66 0.58 0.56 0.67 0.57 0.57 0.61 0.58 3149
Accuracy 0.66 0.67 0.61

Table 5
Precision, recall, 𝐹1 -scores, and support for all experiment on Malayalam-English test data
BERT DistilBERT FastText
Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score support
Mixed_feelings 0.31 0.14 0.20 0.50 0.09 0.15 0.39 0.27 0.32 70
Negative 0.47 0.35 0.40 0.52 0.41 0.46 0.50 0.46 0.48 138
Positive 0.63 0.75 0.68 0.64 0.77 0.70 0.73 0.70 0.72 565
not-malayalam 0.67 0.71 0.69 0.68 0.72 0.70 0.61 0.64 0.62 177
unknown_state 0.63 0.57 0.59 0.60 0.54 0.57 0.60 0.66 0.63 398
macro avg 0.54 0.50 0.51 0.59 0.51 0.52 0.56 0.55 0.55 1348
weighted avg 0.60 0.62 0.60 0.61 0.62 0.61 0.63 0.64 0.63 1348
Accuracy 0.62 0.62 0.64

the individual classes. Table 3 shows our official performances as shared by the organizers vis-a-vis
the best performing team. Table 4 and Table 5 report our results on Tamil-English and Malayalam-
English dataset respectively. We select three models that performed well during the validation phase
and submit them for final prediction of the test dataset. We observe that fastText gives better 𝐹1 scores
over others which was also in the official results (shown in Table 3).
In the training data, there are some ambiguous samples. Some examples are given below.

• The Tamil-English sentence Srk fan plz dislike tha video is labeled as Positive, when the sentence
has negative sentiment word like dislike.
• The Tamil-English sentence Wow wow wow... Thalaivaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.......
proud to be every Indian... <3 thanks to shankar sir and holl team... is labeled as mixed_feelings,
when there is many positive words like wow, proud, thanks.

Our models were trained on this ambiguous data, and we could not verify the correctness of la-
belling as we do not have knowledge of Tamil or Malayalam languages. Inconsistency of the labellings,
if any, might have worsened the results on test data. Another aspect is very small sentence length.
That might also be the reason why fastText unigram gave better results than n-grams, word n-grams
were not able to capture the sentiment of a sentence.
Table 6
Error analysis
Language pair Sample comments from dataset Given Predicted
TA-EN Just amazing thalaivaaaaaaaaaa ARR sir u r that Best ( BGM ) Negative Positive
TA-EN Next year national award Competition DHANUSH ( asuran) Karthi( kaithi) @rya (makamuni) Negative positive
TA-EN Petta paraak.rajin sir is still young Negative Positive

Some of the examples that were marked as incorrect predictions by our best model are shown in
Table 6. The ‘Given’ column in the table denotes the expected sentiment, as available in the gold
standard dataset against the ones predicted by our system. It seems that our predicted sentiment was
correct.

4. Conclusion
This study reports performance of our system for the shared task on Sentiment Analysis for Dravid-
ian Languages in Code-Mixed Text in Dravidian-CodeMix - FIRE 2020. We conducted a number of
experiments on a real-world code-mixed YouTube comments dataset involving a few embedding tech-
niques: fastText, BERT, and DistilBERT. We find that fastText outperforms other techniques on this
task. However, there are room for improvement. In the future, we plan to use other pre-trained models
with necessary fine-tuning. We also plan to explore multilingual embeddings for the languages.

References
[1] C. Myers-Scotton, Common and Uncommon Ground: Social and Structural Factors in
Codeswitching, Language in Society 22 (1993) 475–503. URL: http://www.jstor.org/stable/
4168471.
[2] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, J. P. Sherly, Eliz-
abeth McCrae, Overview of the track on Sentiment Analysis for Davidian Languages in Code-
Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE ’20,
2020.
[3] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, J. P. Sherly, Eliz-
abeth McCrae, Overview of the track on Sentiment Analysis for Davidian Languages in Code-
Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020).
CEUR Workshop Proceedings. In: CEUR-WS. org, Hyderabad, India, 2020.
[4] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset
for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Lan-
guage Technologies for Under-resourced languages (SLTU) and Collaboration and Computing
for Under-Resourced Languages (CCURL), European Language Resources association, Marseille,
France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment
analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken
Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing
for Under-Resourced Languages (CCURL), European Language Resources association, Marseille,
France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[6] B. G. Patra, D. Das, A. Das, Sentiment Analysis of Code-Mixed Indian Languages: An Overview
of SAIL Code-Mixed Shared Task @ICON-2017, 2018. arXiv:1803.06745.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Trans-
formers for Language Understanding, Proceedings of the 2019 Conference of the North (2019).
doi:10.18653/v1/n19-1423.
[8] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster,
cheaper and lighter, 2019. arXiv:1910.01108.
[9] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in Pre-Training Distributed
Word Representations, in: Proceedings of the International Conference on Language Resources
and Evaluation (LREC 2018), 2018.