Sentiment Analysis on Code-mixed Dravidian
Languages, A Non-linguistic Approach
Prasad A. Joshi1 , Varsha M. Pathak2
1
    JET’s Zulal Bhilajirao Patil College, Dhule
2
    Institute of Management and Research, Jalgaon


                                         Abstract
                                         Identification of sentiment in social media content has received increasing attention over the past
                                         decade. Such content is often code-mixed in nature, and people find it easier to express themselves in
                                         this format because they can blend their mother tongue with English. This task deals with identifying
                                         sentiment in code-mixed Dravidian languages. The datasets provided by the organisers are in Tamil-
                                         English, Kannada-English and Malayalam-English. Our system uses three different approaches: machine
                                         learning (MNB and DTC), neural networks (ANN and CNN) and transfer learning (BERT and mBERT).
                                         For Malayalam-English, MNB trained on TF-IDF features performed best; for Tamil-English, ANN and
                                         for Kannada-English, CNN performed better.

                                         Keywords
                                         Sentiment analysis, Language detection, Code-mixing, MNB, DTC, ANN, CNN, BERT




1. Introduction
In the twentieth century, people did not have an easy and comfortable medium for expressing
themselves publicly. The twenty-first century has begun with a revolutionary shift in information
technology, with obvious good and bad impacts of its use. In recent years, social media
platforms have become an integral part of the lives of a majority of people. Popular social media
platforms such as YouTube, Facebook, Instagram, Twitter and many more provide a medium
for public expression while keeping the required level of privacy. These expressions/comments are
usually informal in nature. In such informal comments, many people use a blend of two or more
languages, which is referred to as code-mixed or code-switched language.
   As per the Indian Constitution, there are 22 scheduled languages in India. Among these,
Dravidian languages such as Malayalam, Tamil and Kannada have large numbers of speakers and rank
within the top 10 as per the 2011 census of India. According to linguists, Malayalam, Tamil and
Kannada belong to the South Dravidian language family. Tamil is one of the oldest spoken
languages and is the language of the Tamil Nadu state of India; it is also spoken by many Tamil
people in Sri Lanka and by the Tamil diaspora around the world, whereas Kannada and Malayalam
are Dravidian languages spoken in the states of Karnataka and Kerala respectively [1]. In India,
approximately 400 million people out of 1.38 billion use social media, but only 0.02% of Indians
speak English as their first language.

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India
sayprasadajoshi@gmail.com (P. A. Joshi); varsha.pathak@imr.ac.in (V. M. Pathak)
ORCID: 0000-0001-5522-6187 (P. A. Joshi)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Most of the people in India therefore rely on code-mixed or code-switched languages, which leads to
different sentiments such as hate speech [2], insult [3], offensive comments [4] and trolling
[5]. If these kinds of negative sentiments are not addressed in time, they can harm communal health
and turn into devastating events [6]. In particular, many code-mixed and under-resourced
Indian languages [7] need serious attention. With this motivation, researchers
have initiated work on identifying different sentiments occurring on social media. In this
context, the shared task on sentiment analysis for Dravidian languages in code-mixed text has
been organized [1], in which the authors have participated. The details of this task and its purpose are
given in Section 3.


2. Related Work
From the above discussion we can understand the different types of sentiments occurring in code-
mixed text content on social media. Though a significant amount of sentiment-bearing text in
code-mixed Dravidian languages can be found on social media, very little sentiment analysis
work has been done to date. Bharathi Raja et al. [8] attracted the attention of researchers by
organizing a special track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text
in 2020. Similarly, Bharathi Raja et al. [9] organized a shared task on identifying
offensive language in the Dravidian languages Tamil, Malayalam and Kannada in 2021.
In both these tasks, the organizers provided training, validation and test datasets with
appropriate annotations, and many participating researchers applied different machine learning
and transfer learning techniques.
   Our study shows that, prior to the initiatives for Dravidian languages, Patra et al.
[10] addressed sentiment analysis for code-mixed Hindi-English and code-mixed Bengali-English.
They organized a shared task at ICON 2017 and prepared the dataset
using the Twitter API. In 2018, Aditya Bohra et al. [11] contributed to this research area
by developing a code-mixed Hindi-English dataset using the Twitter Python API.
Similarly, Anita Saroj and Sukomal Pal [12] created a dataset of English and code-mixed
Hindi in 2020 using Facebook and Twitter. This data was collected
around the 2019 parliamentary election of India (PEI-2019) and they applied different machine
learning classifiers.


3. About HASOC Shared Task
The goal of this shared task is to categorize the posts/comments of the Dravidian code-mixed
dataset, collected from YouTube comments, into different sentiment polarities. The datasets
provided by the organizers are in Malayalam-English, Tamil-English and Kannada-English,
containing code-mixed sentences of three types: inter-sentential switching, intra-sentential switching
and tag switching [13]. The dataset consists of nine TSV files: for each language there are three
files, for training, validation and testing respectively. Both the training and validation sets have
three columns, namely id, comment/post and sentiment polarity. The polarity of each
comment/post is annotated with one of five classes: Mixed_feelings, Negative, Positive,
not-language and unknown_state. The test dataset has a single text column that contains YouTube
Table 1
Statistics of the dataset
                          Malayalam                  Tamil                    Kannada
   Class           Train.   Dev.   Test      Train.   Dev.   Test      Train.   Dev.   Test
 Mixed_feelings       926    102    n/k        4020    438    n/k         574     52    n/k
 Negative            2105    237    n/k        4271    480    n/k        1188    139    n/k
 Positive            6421    706    n/k       20070   2257    n/k        2823    321    n/k
 not-language        1157    141    n/k        1667    176    n/k         916    110    n/k
 unknown_state       5279    580    n/k        5628    611    n/k         711     69    n/k
 Total              15888   1766   1962      35656   3962   4402        6212    691    768
(n/k: the class-wise distribution of the test set is not known.)


comments/posts. The participants were asked to develop a system that can identify the
appropriate class of each comment and annotate the test data accordingly [14]. The class-wise
statistics of the training and development datasets given by the organisers, i.e. the total number of
comments in each of the five classes, are shown in Table 1. From this table, we can
see that the classes are highly imbalanced.


4. Methodology
As the datasets are collected from social media, they are noisy in nature, so pre-processing
is required and is done at the initial stage of this experimental work. Features are then
extracted from the training and validation datasets using TF-IDF and the Keras Tokenizer API 1 . On
these extracted features, two different approaches, machine learning and neural networks, are
applied, and their results were submitted to the HASOC 2021 shared task. To improve
the system further, we also applied transfer learning, whose results were not submitted
to the task organisers. Our system is thus based on the three approaches mentioned below. The code
is available on GitHub 2 .

    • Machine learning.
    • Neural network.
    • Transfer learning.

  The performance of all these models is evaluated on the development dataset using the weighted
average F1-score. The same methodology is followed for all three Dravidian languages. The
working of each approach is presented in detail in this article.
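
For reference, this weighted average F1-score can be computed with scikit-learn as in the short sketch below; the label lists are a toy example with illustrative names.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy example: y_dev holds gold development labels, y_pred the system predictions.
y_dev = ["Positive", "Negative", "Positive", "unknown_state"]
y_pred = ["Positive", "Positive", "Positive", "unknown_state"]

p, r, f1, _ = precision_recall_fscore_support(y_dev, y_pred,
                                              average="weighted", zero_division=0)
print(f"Weighted avg  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```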

4.1. Data Pre-processing
We have removed white spaces, digits, special characters, extra spaces, emojis, etc.
English stop-words are also removed. For Malayalam, we have used the ml2en algorithm3 ,
which transliterates Malayalam script to Roman script (’Manglish’).
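
A minimal sketch of this cleaning step is given below. The regular expressions and the small stop-word list are illustrative (in practice a fuller list is used), and the ml2en transliteration is indicated only as a placeholder comment, since its API is not reproduced here.

```python
import re

# Small illustrative English stop-word list; a fuller list (e.g. NLTK's) is used in practice.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in", "it"}

# Rough emoji character ranges; illustrative, not exhaustive.
EMOJI_PATTERN = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF]", flags=re.UNICODE)

def clean_comment(text: str) -> str:
    # For Malayalam comments in native script, the ml2en transliteration to
    # 'Manglish' would be applied before this cleaning step.
    text = text.lower()
    text = EMOJI_PATTERN.sub(" ", text)        # remove emojis
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = re.sub(r"[^\w\s]", " ", text)       # remove special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra white space
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(clean_comment("This movie is 100% superb!!!"))
```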
    1
      https://keras.io/
    2
      https://github.com/sayprasad1/KBCNMUJAL-HASOC-2021
    3
      https://nadh.in/code/ml2en/
Figure 1: HASOC-2021:kbcnmujal system: An architectural view


4.2. Machine learning based approach
We have applied different n-gram ranges of TF-IDF word, TF-IDF character and TF-IDF combined
word and character features. The ranges were selected by testing at which range the classifier
produces the highest F1-score. For every language, we tested different n-gram ranges and
finally used the following:

    • For Malayalam, a word n-gram range of (1,1) and a character n-gram range of (5,5) were applied.
Table 2
Results for the Machine learning approach with different word n-gram range of TF-IDF feature
                                       Malayalam              Tamil                 Kannada
 Classifier      Class             n-gram range (1,1)   n-gram range (5,6)    n-gram range (4,5)
                                   Preci. Rec. F1       Preci. Rec. F1        Preci. Rec. F1
                 Mixed_feelings     0.70   0.64 0.67     0.20   0.95 0.33      0.20    0.97 0.33
                 Negative           0.70   0.72 0.71     0.47   0.01 0.02      0.73    0.02 0.05
                 Positive           0.65   0.75 0.70     0.45   0.06 0.11      0.58    0.04 0.08
   MNB
                 not-language       0.89   0.93 0.91     0.96   0.05 0.09      1.00    0.02 0.05
                 unknown_state      0.70   0.60 0.95     0.58   0.05 0.08      0.00    0.00 0.00
                 Weighted avg       0.73   0.73 0.73     0.53   0.22 0.13      0.50    0.21 0.10
                 Mixed_feelings     0.66   0.31 0.42     0.44   0.01 0.02      0.92    0.03 0.07
                 Negative           0.61   0.47 0.53     0.09   0.00 0.00      0.67    0.02 0.04
                 Positive           0.39   0.64 0.49     0.65   0.04 0.08      0.70    0.02 0.04
    DTC
                 not-language       0.71   0.75 0.73     0.20   0.99 0.34      0.20    1.00 0.34
                 unknown_state      0.49   0.53 0.51     0.69   0.03 0.06      0.00    0.00 0.00
                 Weighted avg       0.57   0.54 0.54     0.42   0.21 0.10      0.50    0.21 0.10


    • For Tamil, a word n-gram range of (5,6) and a character n-gram range of (5,6) were applied.
    • For Kannada, a word n-gram range of (4,5) and a character n-gram range of (4,5) were applied.

After extracting features using the above TF-IDF n-gram ranges, Multinomial Naive Bayes (MNB)
and Decision Tree Classifier (DTC) models are trained. For MNB, the ’alpha’ parameter is varied
in the range 0.5 to 2.0. For DTC, the ’criterion’ parameter is set to ’gini’ or ’entropy’.
The remaining hyperparameters of both classifiers are kept at their default values.
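
The following sketch illustrates this feature extraction and classification step with scikit-learn; the n-gram ranges shown are those used for Malayalam, and the variables train_texts, dev_texts and train_labels are assumed to hold the cleaned comments and their classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier

# Combined word + character TF-IDF features, with the Malayalam n-gram ranges.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(5, 5))),
])
X_train = features.fit_transform(train_texts)
X_dev = features.transform(dev_texts)

mnb = MultinomialNB(alpha=0.5)                  # alpha tuned in the range 0.5 to 2.0
dtc = DecisionTreeClassifier(criterion="gini")  # or criterion="entropy"
mnb.fit(X_train, train_labels)
dtc.fit(X_train, train_labels)
mnb_pred, dtc_pred = mnb.predict(X_dev), dtc.predict(X_dev)
```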
   As the class distribution is severely skewed, both classifiers initially failed to categorize the
comments into the five classes for all the Dravidian languages. Hence the SMOTE4 technique is
applied on the TF-IDF features. SMOTE is the most popular oversampling method; we keep its
sampling strategy set to ’auto’, so the minority classes are re-sampled up to the size of the
majority class. MNB and DTC are then applied on the re-sampled features. While applying DTC
for Tamil, we selected 10,000 features because the Tamil training and development datasets are
larger than those of Malayalam and Kannada. The class-wise performance of both classifiers on
these re-sampled features is given in Tables 2, 3 and 4.
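
A minimal sketch of this resampling step with imbalanced-learn, reusing the illustrative variables from the previous sketch:

```python
from imblearn.over_sampling import SMOTE

# X_train / train_labels: TF-IDF features and labels from the previous sketch.
# sampling_strategy='auto' oversamples every minority class up to the majority class size.
smote = SMOTE(sampling_strategy="auto", random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, train_labels)

mnb.fit(X_resampled, y_resampled)
dtc.fit(X_resampled, y_resampled)
```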

4.3. Neural network based approach
We have implemented an Artificial Neural Network (ANN) and a Convolutional Neural Network
(CNN) in our work. For the ANN, the datasets are tokenized into a one-hot encoded matrix using
the Keras texts_to_matrix method. For the CNN, the Keras texts_to_sequences method is used for
the tokenization. The extracted tokens are used to construct the vocabulary of the respective
language.
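
A sketch of both tokenization routes with the Keras Tokenizer; train_texts is the cleaned comment list as before, and max_length is the per-language padding length described in the following paragraphs.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# One-hot encoded matrix for the ANN, with the vocabulary capped at 10,000 words.
ann_tokenizer = Tokenizer(num_words=10000)
ann_tokenizer.fit_on_texts(train_texts)
X_train_ann = ann_tokenizer.texts_to_matrix(train_texts, mode="binary")

# Padded integer sequences for the CNN.
cnn_tokenizer = Tokenizer()
cnn_tokenizer.fit_on_texts(train_texts)
vocab_size = len(cnn_tokenizer.word_index) + 1
X_train_cnn = pad_sequences(cnn_tokenizer.texts_to_sequences(train_texts),
                            maxlen=max_length, padding="post")
```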


    4
        imblearn.over_sampling.SMOTE
Table 3
Results for the Machine learning approach with different character n-gram range of TF-IDF feature
                                    Malayalam                  Tamil                   Kannada
 Classifier   Class             n-gram range (5,5)      n-gram range (5,6)       n-gram range (4,5)
                                Preci. Rec.   F1        Preci. Rec.   F1         Preci. Rec.   F1
              Mixed_feelings     0.77  0.73 0.75         0.40   0.38 0.39         0.51   0.21 0.30
              Negative           0.70  0.76 0.73         0.51   0.62 0.56         0.48   0.87 0.62
              Positive           0.66  0.79 0.72         0.51   0.59 0.54         0.53   0.61 0.57
   MNB
              not-language       0.93  0.89 0.91         0.84   0.80 0.82         0.66   0.82 0.73
              unknown_state      0.74  0.60 0.66         0.61   0.44 0.51         0.72   0.28 0.40
              Weighted avg       0.76  0.75 0.75         0.57   0.57 0.56         0.58   0.56 0.52
              Mixed_feelings     0.40  0.31 0.35         0.37   0.30 0.33         0.40   0.30 0.34
              Negative           0.35  0.34 0.35         0.40   0.38 0.39         0.45   0.56 0.50
              Positive           0.31  0.41 0.36         0.44   0.63 0.52         0.43   0.56 0.49
    DTC
              not-language       0.65  0.49 0.56         0.72   0.59 0.65         0.64   0.64 0.64
              unknown_state      0.31  0.37 0.34         0.38   0.39 0.38         0.57   0.40 0.47
              Weighted avg       0.41  0.38 0.39         0.46   0.45 0.45         0.50   0.49 0.49


Table 4
Results for the Machine learning approach with combined word and character n-gram range of TF-IDF
feature
                                    Malayalam                    Tamil                 Kannada
 Classifier   Class
                                Preci. Rec.   F1        Preci.    Rec.     F1    Preci. Rec. F1
              Mixed_feelings     0.79   0.75 0.77        0.42      0.30   0.35    0.48    0.22 0.31
              Negative           0.75   0.75 0.75        0.53      0.58   0.55    0.56    0.84 0.67
              Positive           0.69   0.80 0.74        0.45      0.68   0.54    0.47    0.66 0.55
   MNB
               not-language       0.93   0.92 0.93        0.85      0.75   0.80    0.68    0.83 0.75
              unknown_state      0.74   0.66 0.70        0.55      0.44   0.49    0.74    0.31 0.43
              Weighted avg       0.78   0.78 0.78        0.56      0.55   0.55    0.59    0.57 0.54
              Mixed_feelings     0.56   0.39 0.46        0.37      0.31   0.34    0.30    0.22 0.26
              Negative           0.48   0.52 0.50        0.40      0.35   0.38    0.48    0.53 0.51
              Positive           0.40   0.52 0.45        0.43      0.63   0.51    0.43    0.63 0.51
    DTC
              not-language       0.78   0.65 0.71        0.72      0.58   0.64    0.57    0.58 0.57
              unknown_state      0.46   0.50 0.48        0.38      0.38   0.38    0.61    0.40 0.49
              Weighted avg       0.54   0.52 0.52        0.46      0.45   0.45    0.48    0.47 0.47


   After tokenizing the text, the ANN is built using three dense layers. The first (input) dense layer
has 512 nodes and receives the input vocabulary of size 10,000; it is followed by a relu activation
layer and a dropout layer with rate 0.3. The second dense layer also has 512 nodes, again followed
by a relu activation layer and a dropout layer with rate 0.3. The final (output) dense layer has
5 nodes, since the comments/posts have to be categorized into five classes, and is followed by an
activation layer using the sigmoid function. The ANN is trained with the categorical_crossentropy
loss function and the Adam optimizer, with a batch size of 32 for 10 epochs. The same ANN
architecture is used for all three Dravidian
Table 5
Results of the ANN and CNN approach
                                    Malayalam                Tamil               Kannada
  Classifier   Class
                                Preci. Rec. F1      Preci.    Rec.   F1     Preci. Rec. F1
               Mixed_feelings   0.51   0.40 0.45    0.21      0.17   0.19   0.22   0.19 0.21
               Negative         0.54   0.55 0.55    0.39      0.36   0.38   0.59   0.55 0.57
               Positive         0.76   0.76 0.76    0.72      0.78   0.75   0.69   0.74 0.72
  ANN
               not-language     0.67   0.69 0.68    0.55      0.51   0.53   0.62   0.59 0.60
               unknown_state    0.68   0.69 0.68    0.41      0.36   0.38   0.49   0.46 0.48
               Weighted avg     0.68   0.68 0.68    0.57      0.59   0.58   0.60   0.61 0.61
               Mixed_feelings   0.53   0.31 0.40    0.22      0.13   0.16   0.62   0.10 0.17
               Negative         0.69   0.51 0.59    0.44      0.32   0.37   0.64   0.56 0.60
               Positive         0.72   0.83 0.77    0.70      0.82   0.75   0.67   0.83 0.74
  CNN
               not-language     0.73   0.76 0.75    0.67      0.53   0.59   0.60   0.62 0.61
               unknown_state    0.72   0.71 0.71    0.38      0.35   0.37   0.46   0.32 0.38
               Weighted avg     0.71   0.71 0.70    0.56      0.60   0.57   0.63   0.64 0.61

Table 6
Results of the ANN and CNN approach after SMOTE
                                    Malayalam                Tamil                Kannada
  Classifier   Class
                                Preci. Rec.   F1    Preci.    Rec.    F1    Preci. Rec.    F1
               Mixed_feelings    0.81   0.44 0.57    0.24     0.18   0.21    0.67    0.50 0.58
               Negative          0.44   0.53 0.48    0.38     0.37   0.38    0.56    0.58 0.57
               Positive          0.61   0.80 0.69    0.62     0.80   0.70    0.58    0.66 0.62
  ANN
               not-language      0.75   0.64 0.69    0.95     0.74   0.84    0.50    0.62 0.55
               unknown_state     0.58   0.65 0.61    0.37     0.35   0.36    0.33    0.33 0.33
               Weighted avg      0.65   0.62 0.62    0.67     0.65   0.65    0.58    0.57 0.57
               Mixed_feelings    0.53   0.59 0.56    0.18     0.11   0.14    0.51    0.59 0.55
               Negative          0.58   0.44 0.50    0.40     0.28   0.33    0.50    0.27 0.35
               Positive          0.64   0.65 0.65    0.59     0.65   0.62    0.57    0.60 0.58
  CNN
               not-language      0.73   0.70 0.71    0.63     0.70   0.60    0.56    0.57 0.57
               unknown_state     0.61   0.58 0.60    0.36     0.26   0.30    0.46    0.41 0.43
               Weighted avg      0.60   0.60 0.60    0.54     0.56   0.54    0.53    0.53 0.52


languages.
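A minimal Keras sketch of this ANN, assuming the one-hot matrix input of the tokenization sketch and illustrative names for the one-hot encoded label arrays:

```python
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.models import Sequential

ann = Sequential([
    Dense(512, input_shape=(10000,)),   # input layer over the 10,000-word one-hot matrix
    Activation("relu"),
    Dropout(0.3),
    Dense(512),
    Activation("relu"),
    Dropout(0.3),
    Dense(5),                           # one output node per sentiment class
    Activation("sigmoid"),
])
ann.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# X_train_ann: one-hot matrix from the tokenization sketch; y_train_cat: one-hot class labels.
ann.fit(X_train_ann, y_train_cat, batch_size=32, epochs=10,
        validation_data=(X_dev_ann, y_dev_cat))
```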
   For the CNN, different parameters are used for each Dravidian language. We extracted the
total number of unique words (vocab_size) and the length of the longest sentence (max_length)
from each dataset, finding 40230, 69675 and 15800 unique words for Malayalam, Tamil and
Kannada respectively; the maximum comment/post lengths are 195, 124 and 92 respectively.
Every input comment is padded to the maximum length (max_length) of its language. For every
language, an embedding dimension of 100 is used with the vocabulary size (vocab_size), giving
embedding matrices of dimension (40230 x 100), (69675 x 100) and (15800 x 100) for Malayalam,
Tamil and Kannada respectively. Thus the embedding layer differs per language, but the subsequent
layers (Conv1D, GlobalMaxPooling1D and dense) are the same. The embedding layer feeds into
the Conv1D layer, which is configured with
Table 7
Results of the Transfer Learning
                                       Malayalam                Tamil                 Kannada
  Classifier    Class
                                   Preci. Rec.   F1    Preci.    Rec.     F1    Preci. Rec.    F1
                Mixed_feelings      0.50   0.38 0.43    0.28     0.21    0.24    0.22    0.21 0.21
                Negative            0.61   0.57 0.59    0.42     0.36    0.39    0.73    0.53 0.62
                Positive            0.76   0.82 0.79    0.72     0.82    0.77    0.70    0.80 0.75
  BERT
                not-language        0.87   0.84 0.85    0.56     0.52    0.54    0.65    0.66 0.66
                unknown_state       0.74   0.72 0.73    0.47     0.37    0.41    0.52    0.46 0.49
                Weighted avg        0.73   0.73 0.73    0.59     0.62    0.60    0.65    0.64 0.64
                Mixed_feelings      0.44   0.39 0.42    0.48     0.14    0.21    0.29    0.19 0.23
                Negative            0.68   0.61 0.64    0.45     0.34    0.39    0.62    0.68 0.65
                Positive            0.79   0.79 0.79    0.66     0.91    0.77    0.73    0.73 0.73
  mBERT
                not-language        0.85   0.84 0.84    0.70     0.44    0.54    0.65    0.70 0.67
                unknown_state       0.73   0.78 0.75    0.52     0.23    0.32    0.46    0.43 0.45
                Weighted avg        0.74   0.74 0.74    0.60      0.63   0.61    0.64    0.65 0.64

Table 8
Results of the Transfer Learning using Class Weights
                                       Malayalam                Tamil                 Kannada
  Classifier    Class
                                   Preci. Rec.   F1    Preci.    Rec.     F1    Preci. Rec.    F1
                Mixed_feelings      0.44   0.40 0.42    0.26     0.35    0.30    0.21    0.27 0.24
                Negative            0.62   0.60 0.61    0.38     0.36    0.37    0.66    0.52 0.62
                Positive            0.79   0.79 0.79    0.76     0.69    0.73    0.76    0.75 0.75
  BERT
                not-language        0.85   0.83 0.84    0.49     0.54    0.51    0.65    0.72 0.68
                unknown_state       0.74   0.76 0.75    0.40     0.44    0.42    0.56    0.51 0.53
                Weighted avg        0.73   0.73 0.73    0.59     0.57    0.58    0.66    0.65 0.65
                Mixed_feelings      0.42   0.46 0.44    0.24     0.36    0.29    0.14    0.21 0.17
                Negative            0.60   0.56 0.52    0.41     0.42    0.42    0.62    0.61 0.61
                Positive            0.79   0.78 0.78    0.80     0.67    0.73    0.81    0.72 0.76
  mBERT
                not-language        0.79   0.87 0.83    0.54     0.60    0.56    0.68    0.79 0.73
                unknown_state       0.73   0.72 0.73    0.39     0.47    0.43    0.52    0.49 0.50
                Weighted avg        0.72   0.72 0.72    0.62     0.57    0.59    0.67    0.65 0.66


64 output filters and a kernel size of 3 with the relu activation function. The Conv1D layer
extracts local features from the token sequences and is followed by a GlobalMaxPooling1D
layer and a dropout layer with rate 0.5. The dense output layer has 5 nodes with the softmax
activation function. The CNN is trained with the categorical_crossentropy loss function and the
Adam optimizer, with a batch size of 128 for 5 epochs.
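A corresponding Keras sketch of the CNN, showing the Malayalam vocab_size and max_length values given above; X_train_cnn and the one-hot label arrays are illustrative names carried over from the earlier sketches.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential

vocab_size, max_length = 40230, 195   # Malayalam; 69675/124 for Tamil, 15800/92 for Kannada

cnn = Sequential([
    Embedding(vocab_size, 100, input_length=max_length),   # (vocab_size x 100) embedding
    Conv1D(filters=64, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dropout(0.5),
    Dense(5, activation="softmax"),
])
cnn.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# X_train_cnn: padded sequences from the tokenization sketch; y_train_cat: one-hot labels.
cnn.fit(X_train_cnn, y_train_cat, batch_size=128, epochs=5,
        validation_data=(X_dev_cnn, y_dev_cat))
```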
   Though the classes are highly imbalanced, the neural network approach (ANN and CNN)
successfully categorized the posts into the five classes for all the Dravidian languages. The
class-wise results of ANN and CNN are given in Table 5.
   Further, to analyse the performance of ANN and CNN on a resampled dataset, we applied the
SMOTE technique on the tokenized features. Here the sampling strategy parameter is set to
’minority’, which oversamples only the minority class, so we obtain features resampled with respect
to the minority class. The same ANN and CNN are implemented on the resampled features,
which enabled the models to predict the comments among the desired classes.
The results are presented in Table 6.
   Comparing the results of the neural network approaches on tokenized features and on
resampled features (Tables 5 and 6), we observe that for Malayalam and Kannada the results
are considerably lower with resampled features, whereas for Tamil the ANN performed
better on resampled features.

4.4. Transfer learning
For transfer learning, we used the Simple Transformers [15] library, which is based on the Trans-
formers library by HuggingFace and permits easy fine-tuning of Transformer models.
Transformers offers a large number of pretrained models for tasks such as text classification,
multi-label text classification, text generation, question answering, summarization, translation
and information extraction in over 100 languages. Our approach used two variations of
BERT [16] (Bidirectional Encoder Representations from Transformers) for categorizing the
comments. The two variations are:
    • Pretrained BERT base model (bert-base-cased): 12 layers, 768 hidden units, 12 attention
      heads, 110M parameters. It is trained on cased English text with a masked language
      modelling objective.
    • Pretrained BERT multilingual model (bert-base-multilingual-cased): 12 layers, 768 hidden
      units, 12 attention heads, 168M parameters. It is trained on cased text in 104 languages
      with masked language modelling.
To implement the above models, we create an instance of ClassificationModel with appropriate
parameter values. For each language we create two different models: one that handles class
imbalance and one that does not. The model handling class imbalance is given the weight
argument, which takes a list of weights, one per label; to calculate the class weights we use
compute_class_weight5 from sklearn. For both models the hyperparameter ’fp16’ is set to ’false’,
while the remaining parameters are kept at their default values. Both models are trained on the
training dataset for 4 epochs and evaluated on the development dataset. Table 7 shows the results
of the BERT models without class weights, while the results of the BERT models with class
weights are shown in Table 8.
   Examining Tables 7 and 8, we notice that class weights reduced the performance for Tamil,
whereas for Kannada there is a slight improvement for both BERT models. For Malayalam,
the BERT model produced the same result with and without class weights, while
BERT-multilingual without class weights performed better than BERT-multilingual with class
weights.


5. Results and discussion
The system performance is evaluated in terms of precision, recall and F1-score for all five
sentiment classes, and the weighted average over the classes is also given. Among the three
    5
        sklearn.utils.class_weight.compute_class_weight
approaches, the one with the highest weighted average precision, recall and F1-score is
considered the best, and the results of the best-performing model were submitted to the task
organisers. As mentioned earlier, the results of the transfer learning approach and of the neural
network approach on resampled features were not submitted.
   For Malayalam, among all the classifiers, MNB using the combined word n-gram range (1,1)
and character n-gram range (5,5) performed best, with precision, recall and F1-score of 0.78.
MNB with the word n-gram range (1,1) and MNB with the character n-gram range (5,5) also
performed well, with F1-scores of 0.73 and 0.75 respectively. Within the same machine learning
approach, the results of DTC are very poor: with word n-grams and with combined word and
character n-grams it achieves F1-scores of 0.54 and 0.52 respectively, and with the character
n-gram range (5,5) it does not even reach a weighted average of 0.50 for precision, recall and
F1-score. The neural network approach performed much better on tokenized features than on
resampled features, as the weighted average F1-score dropped from 0.68 to 0.62 for ANN and
from 0.70 to 0.60 for CNN. The BERT model has the same weighted average precision, recall and
F1-score of 0.73 with and without class weights, whereas BERT-multilingual has a weighted
average F1-score of 0.74 without class weights and 0.73 with class weights.
   In the case of Tamil, ANN on resampled features performed better, with an F1-score of 0.65
compared to 0.58 on tokenized features. CNN has an F1-score of 0.57 on tokenized features,
degrading to 0.54 on resampled features. MNB using the character n-gram range (5,6) and using
the combined word n-gram range (5,6) and character n-gram range (5,6) have nearly the same
F1-score of about 0.55. The performance of MNB and DTC with the word n-gram range (5,6) is
very low, with F1-scores of 0.13 and 0.10 respectively. DTC with the character n-gram range (5,6)
and DTC with the combined word and character n-gram range (5,6) also do not show promising
results, both with an F1-score of 0.45. Both BERT models performed marginally better without
class weights, and BERT-multilingual without class weights overshadowed BERT-base with
precision, recall and F1-score of 0.60, 0.63 and 0.61 respectively.
   In the case of Kannada, CNN and ANN show better results on tokenized features than on
resampled features; CNN on tokenized features has weighted average precision and recall of 0.63
and 0.64 respectively. Similarly to Tamil, MNB and DTC with the word n-gram range (4,5)
obtain a very low F1-score of 0.10. DTC with character n-grams and DTC with combined word
and character n-grams show good F1-scores compared to Malayalam and Tamil. BERT-base and
BERT-multilingual without class weights show equal results with an F1-score of 0.64, and both
transfer learning models perform better with class weights, with the BERT-multilingual model
achieving the best weighted average F1-score of 0.66.
   From all the above observations, we conclude that for Malayalam the machine learning model
MNB with combined word and character n-grams scores the highest, for Tamil the neural
network model ANN on resampled features scores best, and for Kannada BERT-multilingual
with class weights performs best.
6. Conclusion
From these results, we can report that Roman/Latin script helps in improving the system
performance for all three approaches. The machine learning approach with word n-gram features
categorized the sentiments only for Roman/Latin script (e.g., Manglish); for code-mixed sentences
(e.g., Tanglish and Tamil) it failed badly. On the contrary, the same machine learning approach
with character n-gram features and with combined word and character n-gram features
successfully categorized the sentiments irrespective of whether the text is in Roman/Latin script
or is code-mixed. Our experiments show that class imbalance is handled successfully by the
neural network and transfer learning approaches, whereas for the machine learning approach
resampling of the features is a must.


References
 [1] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil,
     malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
     Association for Computing Machinery, 2021.
 [2] T. Davidson, D. Warmsley, M. W. Macy, I. Weber, Automated hate speech detection and
     the problem of offensive language, CoRR abs/1703.04009 (2017). URL: http://arxiv.org/abs/
     1703.04009. arXiv:1703.04009.
 [3] Z. Waseem, Are you a racist or am I seeing things? annotator influence on hate speech
     detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational
     Social Science, Association for Computational Linguistics, Austin, Texas, 2016, pp. 138–142.
     URL: https://aclanthology.org/W16-5618. doi:10.18653/v1/W16-5618.
 [4] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019
     task 6: Identifying and categorizing offensive language in social media (OffensEval), in:
     Proceedings of the 13th International Workshop on Semantic Evaluation, Association for
     Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL:
     https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
 [5] A. Koufakou, V. Basile, V. Patti, FlorUniTo@TRAC-2: Retrofitting word embeddings on an
     abusive lexicon for aggressive language detection, in: Proceedings of the Second Workshop
     on Trolling, Aggression and Cyberbullying, European Language Resources Association
     (ELRA), Marseille, France, 2020, pp. 106–112. URL: https://aclanthology.org/2020.trac-1.17.
 [6] V. M. Pathak, M. Joshi, P. Joshi, M. Mundada, T. Joshi, Kbcnmujal@hasoc-dravidian-
     codemix-fire2020: Using machine learning for detection of hate speech and offensive code-
     mixed social media text, CoRR abs/2102.09866 (2021). URL: https://arxiv.org/abs/2102.09866.
     arXiv:2102.09866.
 [7] S. Suryawanshi, B. R. Chakravarthi, Findings of the shared task on troll meme classification
     in Tamil, in: Proceedings of the First Workshop on Speech and Language Technologies for
     Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 126–132.
     URL: https://aclanthology.org/2021.dravidianlangtech-1.16.
 [8] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for dravidian languages in
     code-mixed text, in: Forum for Information Retrieval Evaluation, 2020, pp. 21–24.
 [9] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan,
     R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive
     language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First
     Workshop on Speech and Language Technologies for Dravidian Languages, Association
     for Computational Linguistics, Kyiv, 2021, pp. 133–145. URL: https://aclanthology.org/2021.
     dravidianlangtech-1.17.
[10] B. G. Patra, D. Das, A. Das, Sentiment analysis of code-mixed indian languages: An
     overview of sail_code-mixed shared task @icon-2017, ArXiv abs/1803.06745 (2018).
[11] A. Bohra, D. Vijay, V. Singh, S. S. Akhtar, M. Shrivastava, A dataset of Hindi-English
     code-mixed social media text for hate speech detection, in: Proceedings of the Second
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in
     Social Media, Association for Computational Linguistics, New Orleans, Louisiana, USA,
     2018, pp. 36–41. URL: https://aclanthology.org/W18-1105. doi:10.18653/v1/W18-1105.
[12] A. Saroj, S. Pal, An Indian language social media collection for hate and offensive speech, in:
     Proceedings of the Workshop on Resources and Techniques for User and Author Profiling
     in Abusive Language, European Language Resources Association (ELRA), Marseille, France,
     2020, pp. 2–8. URL: https://aclanthology.org/2020.restup-1.2.
[13] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan,
     P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the
     HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and
     Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
[14] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi,
     E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings
     of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[15] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural lan-
     guage processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[16] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
     org/abs/1810.04805. arXiv:1810.04805.