IRLab@IITBHU@Dravidian-CodeMix-FIRE2020:
Sentiment Analysis for Dravidian Languages in
Code-Mixed Text
Supriya Chanda, Sukomal Pal
Indian Institute of Technology (BHU), Varanasi, India


                                          Abstract
This paper describes the IRLab@IITBHU system for Dravidian-CodeMix - FIRE 2020: Sentiment Analysis for Dravidian Languages, covering the language pairs Tamil-English (TA-EN) and Malayalam-English (ML-EN) in code-mixed text. We submitted three runs for sentiment analysis of the code-mixed TA-EN and ML-EN datasets. Run-1 combined BERT embeddings with a logistic regression classifier, Run-2 combined DistilBERT embeddings with a logistic regression classifier, and Run-3 used the fastText model. Run-3 outperformed Run-1 and Run-2 on both datasets. We obtained an 𝐹1 -score of 0.58 (rank 8/14) for the TA-EN language pair, and an 𝐹1 -score of 0.63 (rank 11/15) for ML-EN.

                                          Keywords
Code-Mixed, Malayalam, Tamil, BERT, fastText, Sentiment Analysis




1. Introduction
The Internet and digitization have enabled people to express their views, sentiments, and opinions through blog posts,
online forums, product review websites, and different social media. Millions of people from different
linguistic and cultural backgrounds use social networking sites like Facebook, Twitter, LinkedIn, and
YouTube to express their emotions, opinions, and share views on different issues that matter in their
lives. As a large number of Indian users speak multiple languages proficiently (at least two: a native
language such as Malayalam, Tamil, or Hindi, plus English), unplanned switching between languages
often happens unconsciously. Even though many languages have their own scripts, social media users
often use non-native scripts, usually the Roman script, for socio-linguistic reasons. This phenomenon
is called code-mixing and is defined as “the embedding of linguistic units such as phrases,
words and morphemes of one language into an utterance of another language” (Myers-Scotton [1]).
Code-mixed data is generally observed in informal communication settings such as social media, and
it can be easily extracted from social media sources using different APIs. Sentiment analysis (SA)
on social media text has become an important research task in academia and industry over the past
two decades. SA helps understand people’s opinions from movie/product reviews, and thus helps
businesses make decisions that improve customer satisfaction through advertisement and marketing.
   The shared task [2, 3] here aims to identify sentiment polarity of the code-mixed data of YouTube
comments in Dravidian Language pairs (Malayalam-English [4] and Tamil-English [5]) collected from
social media. In the past few years, there have been multiple attempts to process code-mixed data, and
FIRE 2020: Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: supriyachanda.rs.cse18@itbhu.ac.in (S. Chanda); spal.cse@itbhu.ac.in (S. Pal)
url: https://cse-iitbhu.github.io/irlab/supriya.html (S. Chanda); https://cse-iitbhu.github.io/irlab/spal.html (S. Pal)
orcid: 0000-0001-8743-9830 (S. Pal)
                                       © 2020 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Table 1
Data Distribution
                                   Tamil - English                              Malayalam - English
      Class            Training    Development Test        Total     Training    Development Test     Total
  Mixed_feelings         1283          141        377      1801        289            44         70    403
     Negative            1448          165        424      2037        549            51        138    738
     Positive            7627          857       2075      10559       2022           224       565   2811
    not-Tamil            368            29        100       497          -             -          -     -
  not-malayalam            -             -         -         -         647            60        177    884
  unknown_state          609            68        173       850        1344           161       398   1903
      Total             11335         1260       3149      15744       4851           540      1348   6739


a shared task on sentiment analysis of code-mixed Indian languages [6] was organized at ICON 2017.
However, freely available data for Indian language pairs other than Hindi-English and Bengali-English
are still limited, although datasets for some other pairs, such as English-Spanish and Chinese-English,
are available for research.
   The rest of the paper is organized as follows. Section 2 describes the dataset and the pre-processing
and processing techniques. In Section 3, we report our results and analysis. Finally, we conclude in
Section 4.


2. System Description
2.1. Datasets
The Dravidian-CodeMix shared task1 organizers provided a dataset that consists of 15,744 Tamil-
English and 6,739 Malayalam-English YouTube video comments. The statistics of training, develop-
ment, and test data corpus collection and their class distribution are shown in Table 1. Here, each
comment is annotated by six (for ML-EN) or eleven (for TA-EN) independent annotators. Inter-annotator
agreement, measured with Krippendorff’s alpha, is 0.6 for the Tamil-English dataset and 0.8 for the
Malayalam-English dataset. Some example comments from the training dataset (Tamil-English) are
shown in Table 2. The provided dataset suffers from the general problems of social media data,
particularly code-mixed data: the sentences are short, lack well-defined grammatical structure, and
contain many spelling mistakes.

2.2. Data Pre-processing
The YouTube comment dataset used in this work is already labelled into five categories: Positive,
Negative, Mixed_feelings, unknown_state, and not-Tamil (or not-Malayalam). Our pre-processing of
comments includes the following steps:
    • Removal of extended words: words that contain one or more contiguously repeating
      characters 2

    • Removal of exclamations and other punctuation

    • Removal of non-ASCII characters, including all emoticons, symbols, numbers, and special characters.
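The steps above can be sketched in Python as follows; the exact details (for example, collapsing three or more repeated characters to a single one rather than dropping the word) are our illustrative assumptions, not a prescribed recipe.

```python
import re

def preprocess(comment):
    """Illustrative pre-processing for code-mixed comments (assumed details)."""
    # Collapse elongated words: 3+ contiguous repeats of a character -> one
    text = re.sub(r"(.)\1{2,}", r"\1", comment)
    # Remove exclamations and other punctuation
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop non-ASCII characters (emoticons, symbols) and numbers
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\d+", " ", text)
    # Normalize whitespace and case
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess("Thalaivaaaaaaaa waiting!!! <3"))  # → "thalaiva waiting"
```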

   1 https://dravidian-codemix.github.io/2020/index.html
   2 https://github.com/SupriyaChanda/Dravidian-CodeMix-FIRE2020
Table 2
Example YouTube comments from the Dravidian-CodeMix dataset for all classes
 Sample comments from dataset(Tamil-English)                                                 Category
 Ena da bgm ithu yuvannnnnnnnnn rocksssssss                                                  Positive
 Kola gaadula iruka... Thalaivaaaaaaaa waiting layea veri aaguthey                           Negative
 Wow wow wow... Thalaivaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa....... proud to be every Indian...
 <3 thanks to shankar sir and holl team...                                                   Mixed_feelings
 Nenu ee movie chusanu super movie                                                           not-Tamil
 Super. 1 like is equivalent to 100 likes.                                                   unknown_state


2.3. Word Embedding
Word embeddings are among the most widely used techniques in recent NLP; they capture the semantic
properties of words. We use the bert-base-uncased and distilbert-base-uncased
pre-trained models3 to obtain a vector embedding for each sentence, which we then use for
classification. Apart from these two pre-trained models, we also experimented with other pre-trained
models such as bert-base-multilingual-uncased and bert-base-multilingual-cased.
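As a sketch, such sentence vectors can be obtained with the HuggingFace transformers library; using the [CLS] token's final hidden state as the sentence embedding is one common pooling choice and an assumption here, not necessarily the exact pipeline used for our runs.

```python
# Sketch: obtaining a fixed-size sentence vector from a pre-trained BERT model
# with the HuggingFace transformers library. Pooling via the [CLS] token's
# final hidden state is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def sentence_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] is the first token; its hidden state is a 768-dim vector
    return outputs.last_hidden_state[0, 0]

vec = sentence_vector("Super. 1 like is equivalent to 100 likes.")
print(vec.shape)  # torch.Size([768])
```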

    • BERT: Bidirectional Encoder Representations from Transformers (BERT)[7] is a technique for
      NLP pre-training developed by Google. BERT is pre-trained on a large corpus of unlabelled
      text, including the entire Wikipedia (that is 2,500 million words!) and the Book Corpus (800
      million words). BERT-Base uncased has 12 layers (transformer blocks), 12 attention heads, and
      110 million parameters.

    • DistilBERT: DistilBERT [8] is a smaller version of BERT developed and open-sourced by the
      team at HuggingFace. It is a lighter and faster version of BERT that roughly matches its
      performance on downstream tasks while having about 40% fewer parameters.

    • fastText: fastText [9], developed by Facebook, combines several ideas from the NLP and ML
      communities: it represents sentences as bags of words and n-grams, uses subword information,
      and shares representations across classes through a hidden layer. fastText can learn vector
      representations for out-of-vocabulary words, which is useful for our dataset, which contains
      Malayalam and Tamil words written in Roman script.
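fastText's handling of out-of-vocabulary words rests on character n-grams. The toy function below illustrates the subword units fastText extracts (the boundary markers and n-gram range follow fastText's scheme in spirit, but the function itself is only for illustration): a romanized Tamil spelling unseen at training time still shares subwords with seen spellings, so it receives a meaningful vector.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText-style subword units: wrap the word in boundary markers and
    # enumerate every character n-gram with length in [n_min, n_max].
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

print(char_ngrams("veri", 3, 4))
# → ['<ve', 'ver', 'eri', 'ri>', '<ver', 'veri', 'eri>']
```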

   After pre-processing our data and transforming all comments into vectors, we implemented our
classification algorithms and trained our models. We used multinomial logistic regression4 with the
fastText embeddings for unigrams, bigrams, and trigrams, along with different learning rates and
numbers of epochs. We obtained the maximum 𝐹1 score with the fastText text classification model
using -wordNgrams = 1, learning rate = 0.1, and epochs = 10.
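The classification stage for the BERT/DistilBERT runs can be sketched as below; the random feature matrix stands in for the real 768-dimensional sentence embeddings, and the class count and hyperparameters are illustrative only.

```python
# Sketch: multinomial logistic regression over sentence embeddings.
# Random features stand in for 768-dim BERT/DistilBERT vectors; five
# classes mirror the sentiment labels of the shared task.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))     # stand-in sentence embeddings
y_train = rng.integers(0, 5, size=200)    # five sentiment classes

clf = LogisticRegression(max_iter=1000)   # lbfgs handles the multinomial case
clf.fit(X_train, y_train)
print(clf.predict(X_train[:3]).shape)     # (3,)
```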


3. Results and Analysis
We use the scikit-learn5 machine learning package for the implementation. The macro 𝐹1 score was
used to evaluate every system. The macro 𝐹1 score of the overall system was the average of the 𝐹1 scores of
   3 https://huggingface.co/transformers/pretrained_models.html
   4 https://fasttext.cc/docs/en/supervised-tutorial.html
   5 http://scikit-learn.org
Table 3
Evaluation results on test data and rank list
                                       Tamil - English                           Malayalam - English
    Team Name          Precision        Recall 𝐹1 score         Rank     Precision Recall 𝐹1 score Rank
       SRJ               0.64            0.67     0.65          1/14       0.74      0.75     0.74   1/15
  IRLab@IITBHU           0.57            0.61     0.58          8/14       0.63      0.64     0.63   11/15

Table 4
Precision, recall, 𝐹1 -score, and support for all experiment on Tamil-English test data
                              BERT                          DistilBERT                       FastText
                  Precision   Recall    𝐹1 -score   Precision Recall 𝐹1 -score   Precision     Recall 𝐹1 -score   support
 Mixed_feelings     0.17       0.02        0.04       0.12      0.01    0.01       0.23         0.07     0.11       377
    Negative        0.40       0.12        0.18       0.42      0.08    0.13       0.34         0.28     0.31       424
    Positive        0.69       0.95        0.80       0.68      0.97    0.80       0.72         0.82     0.76      2075
   not-Tamil        0.66       0.57        0.61       0.67      0.53    0.59       0.27         0.61     0.37       100
 unknown_state      0.19       0.04        0.07       0.29      0.02    0.04       0.23         0.13     0.17       173
   macro avg        0.42       0.34        0.34       0.44      0.32    0.32       0.36         0.38     0.34      3149
  weighted avg      0.56       0.66       0.58        0.56      0.67    0.57       0.57         0.61    0.58       3149
   Accuracy                    0.66                             0.67                           0.61


Table 5
Precision, recall, 𝐹1 -scores, and support for all experiment on Malayalam-English test data
                              BERT                          DistilBERT                       FastText
                  Precision   Recall    𝐹1 -score   Precision Recall 𝐹1 -score   Precision     Recall 𝐹1 -score   support
 Mixed_feelings     0.31       0.14        0.20       0.50      0.09    0.15       0.39         0.27     0.32       70
    Negative        0.47       0.35        0.40       0.52      0.41    0.46       0.50         0.46     0.48       138
    Positive        0.63       0.75        0.68       0.64      0.77    0.70       0.73         0.70     0.72       565
 not-malayalam      0.67       0.71        0.69       0.68      0.72    0.70       0.61         0.64     0.62       177
 unknown_state      0.63       0.57        0.59       0.60      0.54    0.57       0.60         0.66     0.63       398
   macro avg        0.54       0.50        0.51       0.59      0.51    0.52       0.56         0.55     0.55      1348
  weighted avg      0.60       0.62        0.60       0.61      0.62    0.61       0.63         0.64    0.63       1348
    Accuracy                   0.62                             0.62                           0.64


the individual classes. Table 3 shows our official performance, as shared by the organizers, vis-à-vis
the best performing team. Table 4 and Table 5 report our results on the Tamil-English and Malayalam-
English datasets, respectively. We selected the three models that performed best during the validation
phase and submitted them for the final prediction on the test dataset. We observe that fastText gives
better 𝐹1 scores than the others, which is also reflected in the official results (shown in Table 3).
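The macro-averaged 𝐹1 used in the evaluation is simply the unweighted mean of the per-class 𝐹1 scores; the labels in the toy example below are invented purely to illustrate the computation.

```python
# Sketch: macro-F1 is the unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = ["Positive", "Negative", "Positive", "unknown_state"]
y_pred = ["Positive", "Positive", "Positive", "unknown_state"]

# Per-class F1: Negative 0.0 (never predicted), Positive 0.8, unknown_state 1.0
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ≈ 0.6
```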
   In the training data, there are some ambiguous samples. Some examples are given below.

    • The Tamil-English sentence Srk fan plz dislike tha video is labeled as Positive, even though the
      sentence contains a negative sentiment word like dislike.
    • The Tamil-English sentence Wow wow wow... Thalaivaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.......
      proud to be every Indian... <3 thanks to shankar sir and holl team... is labeled as Mixed_feelings,
      even though it contains many positive words like wow, proud, and thanks.

   Our models were trained on this ambiguous data, and we could not verify the correctness of the
labelling as we do not know the Tamil or Malayalam languages. Any inconsistency in the labelling
might have worsened the results on test data. Another aspect is the very short sentence length, which
might also explain why fastText unigrams gave better results than higher-order n-grams: word n-grams
were not able to capture the sentiment of such short sentences.
Table 6
Error analysis
 Language pair   Sample comments from dataset                                                             Given      Predicted
 TA-EN           Just amazing thalaivaaaaaaaaaa ARR sir u r that Best ( BGM )                             Negative   Positive
 TA-EN           Next year national award Competition DHANUSH ( asuran) Karthi( kaithi) @rya (makamuni)   Negative   Positive
 TA-EN           Petta paraak.rajin sir is still young                                                    Negative   Positive



   Some of the examples that were marked as incorrect predictions by our best model are shown in
Table 6. The ‘Given’ column in the table denotes the expected sentiment as available in the gold-
standard dataset, against the one predicted by our system. For these examples, our predicted sentiment
appears to be correct.


4. Conclusion
This study reports the performance of our system for the shared task on Sentiment Analysis for
Dravidian Languages in Code-Mixed Text at Dravidian-CodeMix - FIRE 2020. We conducted a number
of experiments on a real-world code-mixed YouTube comments dataset involving several embedding
techniques: fastText, BERT, and DistilBERT. We find that fastText outperforms the other techniques on
this task. However, there is room for improvement. In the future, we plan to use other pre-trained
models with necessary fine-tuning. We also plan to explore multilingual embeddings for these languages.


References
[1] C. Myers-Scotton,       Common and Uncommon Ground: Social and Structural Factors in
    Codeswitching, Language in Society 22 (1993) 475–503. URL: http://www.jstor.org/stable/
    4168471.
[2] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P.
    McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed
    Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE ’20, 2020.
[3] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P.
    McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed
    Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR
    Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.
[4] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset
    for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Lan-
    guage Technologies for Under-resourced languages (SLTU) and Collaboration and Computing
    for Under-Resourced Languages (CCURL), European Language Resources association, Marseille,
    France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment
    analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken
    Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing
    for Under-Resourced Languages (CCURL), European Language Resources association, Marseille,
    France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[6] B. G. Patra, D. Das, A. Das, Sentiment Analysis of Code-Mixed Indian Languages: An Overview
    of SAIL Code-Mixed Shared Task @ICON-2017, 2018. arXiv:1803.06745.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Trans-
    formers for Language Understanding, Proceedings of the 2019 Conference of the North (2019).
    doi:10.18653/v1/n19-1423.
[8] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster,
    cheaper and lighter, 2019. arXiv:1910.01108.
[9] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in Pre-Training Distributed
    Word Representations, in: Proceedings of the International Conference on Language Resources
    and Evaluation (LREC 2018), 2018.