=Paper=
{{Paper
|id=Vol-3159/T6-9
|storemode=property
|title=A Deep Neural Network-based Model for the Sentiment Analysis of Dravidian Code-mixed Social Media Posts
|pdfUrl=https://ceur-ws.org/Vol-3159/T6-9.pdf
|volume=Vol-3159
|authors=Jyoti Kumari,Abhinav Kumar
|dblpUrl=https://dblp.org/rec/conf/fire/KumariK21a
}}
==A Deep Neural Network-based Model for the Sentiment Analysis of Dravidian Code-mixed Social Media Posts==
<pdf width="1500px">https://ceur-ws.org/Vol-3159/T6-9.pdf</pdf>
<pre>
A Deep Neural Network-based Model for the
Sentiment Analysis of Dravidian Code-mixed Social
Media Posts
Jyoti Kumari1 , Abhinav Kumar2
1
    Department of Computer Science & Engineering, National Institute of Technology Patna, Patna, India
2
    Department of Computer Science & Engineering, Siksha ’O’ Anusandhan Deemed to be University, Bhubaneswar, India


                                         Abstract
                                         Sentiment analysis is one of the most essential jobs in natural language processing. The research
                                         community has recently presented a slew of papers aimed at detecting sentiment from English social
                                         media posts. Despite this, research on recognising feelings in Dravidian Kannada-English, Malayalam-
                                         English, and Tamil-English postings has been limited. This study offers a dense neural network-based
                                         model for categorising postings in Kannada-English, Malayalam-English, and Tamil-English into five
                                         different sentiment classes. When character-level TF-IDF characteristics are combined with a dense
                                         neural network, encouraging results are obtained. The recommended model received weighted 𝐹 1-
                                         scores of 0.61, 0.72, and 0.60 for Kannada-English, Malayalam-English, and Tamil-English social media
                                         postings, respectively. The code for the proposed models is available at https://github.com/Abhinavkmr/
                                         Deep-Neural-Network-based-Model-for-the-Sentiment-Analysis-of-Dravidian-Social-Media-Posts.git

                                         Keywords
                                         Sentiment analysis, Code-mixed, Tamil, Malayalam, Kannada, Machine learning, Deep learning


1. Introduction
Sentiment analysis is the process of finding polarity of a sentence. It can be negative or
positive of neutral depending on the context of the text. It is most helpful in recognizing
opinions/recommendations/queries/answers on a specific subject/product. It is gaining much
attention these days due to its significant impact on businesses like e-commerce, spam detection
[1, 2], recommendation system, social media monitoring, hate speech [3, 4], and disaster
management [5, 6]. English is a common and acceptable language worldwide specially in
the digital world. However, in a multilingual country like India, with more than 400 million
internet users speaking more than one language for communications produces a new code-
mixed language [7]. The presence of multiple script and language constructs in a text makes
it more challenging . Most of the existing models are trained for single language’s sentiment
analysis and thus fails to capture a code-mixed language semantics. Extracting sentiments from
code mixed user-generated texts becomes more difficult due to its multilingual nature [8].
   Recently, the sentiment analysis of code-mixed language [9, 10] has drawn attention of the
research community. Joshi et al. [11] have presented a model for Hinglish (Hindi-English)

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Envelope-Open j2kumari@gmail.com (J. Kumari); abhinavanand05@gmail.com (A. Kumar)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
dataset with subword representation of code-mix data and long short term memory (subword-
LSTM). In [7], Patra et al. reported a model based on support vector machine using character
n-grams features for Bengali-English code mixed data. Advani et al. [12] have used logistic
regression with handcrafted lexical and semantic features to extract sentiments from Hinglish
and Spanglish (Spanish + English) data. A morphological attention model has been proposed
by Goswami et al. [13] for sentiment analysis on Hinglish data.
   The Malayalam and Kannada languages are spoken in the Indian state of Kerala and Karnataka.
There are around 38 million Malayalam speakers over the globe. Tamil is another Dravidian
language spoken by Tamil people in India, Singapore, and Sri Lanka. The scripts of both the
Dravidian languages are alpha-syllabic i.e., partially alphabetic and partially syllable-based.
However, Roman script is frequently used for posting on social media because it is easy [6].
Thus, the majority of the available data on social media are code-mixed.
   In this paper, we proposed a dense neural network-based model that utilizes character-level n-
gram TF-IDF features to identify sentiments from the Kannada-English [14], Malayalam-English
[15], and Tamil-English [16] social media posts. The proposed dense neural network-based
model is validated by the dataset published in the DravidianCodeMix FIRE 2021 [17, 18] track.
The dataset provided by the organizer contains five different sentiment labels such as “positive,”
“negative,” “mixed feelings,” “unknown state,” and “if the post is not in the mentioned Dravidian
languages.”
   The rest of the paper is summarized as: related work is listed in Section 2, the dataset
description, the proposed methodology is explained in Section 3. Various experiments and their
finding is presented in Section 4. Finally, Section 5 concludes the discussion by highlighting the
main findings of this study.


2. Related work
Recently, a number of works [19, 11, 7, 12, 9] have been made to extract sentiment from code-
mixed social media posts. Mahata et al. [19] proposed a bi-directional LSTM-based model
for detecting sentiment in social media posts containing both English and Tamil language.
Joshi et al. [11] have presented a model for Hinglish (Hindi-English) dataset with subword
representation of code-mix data and long short term memory (subword-LSTM). Mandalam et al.
[20] proposed an LSTM-based model that uses sub-word level features and word embedding
vectors to identify sentiment from code-mixed Tamil and Malayalam social media posts. In [7],
Patra et al. reported a model based on support vector machine using character n-grams features
for Bengali-English code mixed data.
   Dowlagar et al. [21] proposed graph convolutional networks (GCN) for detecting sentiments
in Dravidian social media posts. Balouchzahi et al. [22] proposed three different models to
identify sentiments from Dravidian social media posts: SACo-Ensemble, SACo-Keras, and SACo-
ULMFiT, using machine learning, deep learning, and transfer learning, respectively. Advani et
al. [12] have used logistic regression with handcrafted lexical and semantic features to extract
sentiments from Hinglish and Spanglish (Spanish + English) data. A morphological attention
model has been proposed by Goswami et al. [13] for sentiment analysis on Hinglish data.
   In line with the current literature, since social media code-mixed posts contain several
grammatical errors and non-standard abbreviations, the use of character-level features may
perform well; thus, this paper investigates the usability of character-level features with a simple
four-layered dense neural network for sentiment identification from Kannada, Malayalam, and
Tamil social media posts.


3. Methodology
The systematic diagram for the proposed dense neural network-based model can be seen in
Figure 1. The proposed model is validated against the Kannada-English, Malayalam-English,
and Tamil-English datasets [18]. The overall data statistic for each language can be seen in
Table 1.


                                                                                                                                                                                               Positive
           English, & Tamil-English Posts
           Kannada-English, Malayalam-


                                                                                  Dense Layer-1 (4,096)
                                                    (1-6)-Gram Character


                                                                                                                 Dense Layer-2 (512)

                                                                                                                                                                                              Negative
                                                                                                                                       Dense Layer-3 (64)


                                                                                                                                                                Dense Layer-4 (5)
                                                      TF-IDF Features


                                                                                                                                                                                            Mixed feelings

                                                                                                                                                                                             Not-
                                                                                                                                                                                      Kannada/Malayalam/
                                                                                                                                                                                            Tamil

                                                                                                                                                                                            Unknown state


Figure 1: Proposed deep neural network-based model for sentiment classification from Kannada-English,
Malayalam-English, and Tamil-English datasets


Table 1
Overall data statistic for Kannada, Malayalam, and Tamil dataset
Class                                       Kannada                                                               Malayalam                                                                   Tamil
                                            Train                    Validation                           Test    Train                                     Validation              Test      Train       Validation   Test
Mixed-feelings                              574                      52                                   65      926                                       102                     134       4,020       438          470
Negative                                    1,188                    139                                  157     2,105                                     237                     258       4,271       480          477
Positive                                    2,823                    321                                  374     6,421                                     706                     780       20,070      2,257        2,546
Unknown state                               711                      69                                   62      5,279                                     580                     643       5,628       611          665
Not-Kannada                                 916                      110                                  110     -                                         -                       -         -           -            -
Not-Malayalam                               -                        -                                    -       1,157                                     141                     147       -           -            -
Not-Tamil                                   -                        -                                    -       -                                         -                       -         1,667       176          244
Total                                       6,212                    691                                  768     15,888                                    1,766                   1,962     35,156      3,962        4,402


  In our experiments, we found that using character-level features with a dense neural network
performed better than some complex models like convolutional neural network (CNN) and
Table 2
Best-suited hyper-parameters for the proposed model in case of Kannada-English, Malayalam-English,
and Tamil-English
 Hyper-parameters                  Kannada-English             Malayalam-English           Tamil-English
 Dense layes                       4                           4                           4
 Number of neurons at each layer   4,096, 512, 64, 5           4,096, 512, 64, 5           4,096, 512, 64, 5
 Dropout                           0.2                         0.2                         0.2
 Activation function               ReLU, Softmax               ReLU, Softmax               ReLU, Softmax
 Optimizer                         Adam                        Adam                        Adam
 Loss                              Categorical cross-entropy   Categorical cross-entropy   Categorical cross-entropy
 Learning rate                     0.001                       0.001                       0.001
 Batch size                        20                          20                          20
 Epochs                            50                          50                          50


Table 3
Performance of the proposed model for Kannada, Malayalam, and Tamil social media posts
 Class             Kannada-English                   Malayalam-English               Tamil-English
                   Precision Recall      𝐹1 -score   Precision Recall 𝐹1 -score      Precision Recall      𝐹1 -score
 Mixed-feelings    0.26      0.14        0.18        0.49      0.44    0.46          0.21        0.19      0.20
 Negative          0.63      0.64        0.63        0.68      0.53    0.60          0.39        0.40      0.39
 Positive          0.73      0.70        0.72        0.80      0.78    0.79          0.76        0.78      0.77
 Unknown state     0.24      0.44        0.31        0.68      0.78    0.73          0.39        0.41      0.40
 Not-Kannada       0.62      0.60        0.61        -         -       -             -           -         -
 Not-Malayalam     -         -           -           0.78      0.74    0.76          -           -         -
 Not-Tamil         -         -           -           -         -       -             0.64        0.49      0.55
 Weighted Avg.     0.62      0.60        0.61        0.72      0.72    0.72          0.60        0.60      0.60


long-short-term memory (LSTM) when word-level features were used. As a result, we chose
a four-layered dense neural network with character-level features to identify sentiments in
Dravidian social media posts. To provide input to the dense neural network, experiment with the
various combinations of character n-gram TF-IDF features is done. In this extensive experiment,
we found that the first 50,000 one to six-gram character-level TF-IDF performs better for the
Kannada-English dataset in comparison to the other combinations of n-gram. Similarly, for
Malayalam-English and Tamil-English, the first 30,000 one to six-gram character-level TF-IDF
features performed better. The extracted features are then passed through a four-layered dense
neural network containing 4,096, 512, 64, and 5-neurons at consecutive layers, respectively. To
train the proposed model, categorical cross-entropy and Adam is used as the loss function and
optimizer, respectively. As the deep learning-based models are very sensitive to the chosen
hyper-parameters, we performed a sensitivity analysis by varying the learning rate, batch
size, dropout rate, and epochs. The best-suited hyper-parameters for the proposed dense
neural network are listed in Table 2. The proposed system was implemented using Keras with
Tensorflow as a backend.


4. Results
The performance of the proposed model is measured in terms of precision, recall, and 𝐹1 -score.
Along with these metrics, we have plotted confusion matrix to show the prediction in each of
                                                Confusion matrix
                                                                         250
                              Mixed feelings 0.14 0.20 0.32 0.03 0.31
                                                                         200
                                  Negative 0.03 0.64 0.24 0.00 0.09

                 True label
                                                                         150
                                   Positive 0.04 0.10 0.70 0.08 0.09
                                                                         100
                               not-Kannada 0.02 0.04 0.17 0.60 0.17
                                                                         50
                              unknown state 0.08 0.08 0.27 0.13 0.44
                                                                         0


                                                         ive
                                                 Ne ngs


                                                          da
                                                            e


                                                           e
                                                        tiv


                                                       tat
                                                      na
                                                     sit
                                                      li
                                                    ga


                                                   ns
                                                  fee


                                                   an
                                                  Po


                                                ow
                                               t-K
                                        ed


                                            kn
                                            no
                                      Mix


                                          un
                                                  Predicted label

Figure 2: Confusion matrix (Kannada-English)


                                                Confusion matrix
                                                                         600
                              Mixed_feelings 0.44 0.12 0.18 0.01 0.25
                                                                         500
                                  Negative 0.09 0.53 0.13 0.00 0.24
                                                                         400
                 True label


                                   Positive 0.02 0.03 0.78 0.01 0.16     300

                              not-malayalam 0.02 0.01 0.10 0.74 0.13     200
                                                                         100
                              unknown_state 0.03 0.04 0.12 0.03 0.78
                                                            ive
                                                    Ne ngs
                                                               e


                                                  ow m
                                                              e
                                                           tiv


                                                          tat
                                             un yala
                                                        sit
                                                         li
                                                       ga


                                                      n_s
                                                      ee


                                                     Po

                                                    ala
                                           _f
                                        ed


                                                t-m

                                               kn
                                       Mix


                                             no


                                                  Predicted label

Figure 3: Confusion matrix (Malayalam-English)


the classes. The performance of the proposed model for Kannada-English, Malayalam-English,
and Tamil-English datasets is listed in Table3.
   In case of Kannada-English dataset, the proposed model has achieved weighted precision of
0.62, recall of 0.60, and 𝐹1 -score of 0.61. The confusion matrix for Kannada-English dataset can
be seen in Figure 2. From the confusion matrix, it can be seen that the proposed dense neural
                                                Confusion matrix
                              Mixed_feelings 0.19 0.16 0.45 0.02 0.19      1750
                                                                           1500
                                  Negative 0.15 0.40 0.29 0.02 0.15
                                                                           1250
                 True label        Positive 0.07 0.05 0.78 0.01 0.09       1000
                                                                           750
                                  not-Tamil 0.04 0.06 0.27 0.49 0.14
                                                                           500
                              unknown_state 0.12 0.11 0.35 0.03 0.41       250


                                                            il
                                                         ive
                                                Ne ngs
                                                           e


                                                           e
                                                       am
                                                       tiv


                                                      tat
                                                     sit
                                                    eli
                                                   ga


                                                   t-T

                                                  n_s
                                                 Po
                                                _fe


                                                no

                                               ow
                                        ed


                                             kn
                                       Mix


                                          un
                                                  Predicted label

Figure 4: Confusion matrix (Tamil-English)


network perform significantly good for Positive class with the recall of 0.70, whereas it not good
for Mixed-feelings class. The possible reason can be the mixed sentiment text, and the proposed
model is not able to clearly distinguishe it. For Malayalam-English dataset, the proposed model
is able to achieve a weighted precision, recall, and 𝐹1 -score of 0.72 (as can be seen in Table 3.
The confusion matrix for Malayalam-English dataset can be seen in Figure 3. Similarly, for
Tamil-English dataset, the proposed dense neural network has achieved a weighted precision,
recall, and 𝐹1 -score of 0.60. The confusion matrix for Tamil-English dataset can be seen in
Figure 4.


5. Conclusion
Sentiment analysis of textual content offers a wide range of applications in natural language
processing. In this work, we have utilized character-level TF-IDF features with dense neu-
ral network to classify social media posts into five different classes. For Kannada-English,
Malayalam-English, and Tamil-English social media postings, the suggested model has obtained
weighted 𝐹 1-scores of 0.61, 0.72, and 0.60, respectively. Since the usage of character-level TF-IDF
features yields promising results, this feature may be further investigated in the future with
various deep learning models to build a more robust system.


References
 [1] S. Mani, S. Kumari, A. Jain, P. Kumar, Spam review detection using ensemble machine
     learning, in: International Conference on Machine Learning and Data Mining in Pattern
     Recognition, Springer, 2018, pp. 198–209.
 [2] K. Gaurav, P. Kumar, Consumer satisfaction rating system using sentiment analysis, in:
     Conference on e-Business, e-Services and e-Society, Springer, 2017, pp. 400–411.
 [3] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@ HASOC-Dravidian-CodeMix-FIRE2020:
     A machine learning approach to identify offensive languages from Dravidian code-mixed
     text., in: FIRE (Working Notes), 2020, pp. 384–390.
 [4] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@ HASOC-FIRE2020: Fine tuned BERT for
     the hate speech and offensive content identification from social media., in: FIRE (Working
     Notes), 2020, pp. 266–273.
 [5] A. Kumar, J. P. Singh, Location reference identification from tweets during emergencies: A
     deep learning approach, International journal of disaster risk reduction 33 (2019) 365–375.
 [6] A. Kumar, J. P. Singh, S. Saumya, A comparative analysis of machine learning techniques
     for disaster-related tweet classification, in: 2019 IEEE R10 Humanitarian Technology
     Conference (R10-HTC)(47129), IEEE, 2019, pp. 222–227.
 [7] B. G. Patra, D. Das, A. Das, Sentiment analysis of code-mixed indian languages: An
     overview of SAIL_Code-Mixed Shared Task@ ICON-2017, arXiv preprint arXiv:1803.06745
     (2018).
 [8] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for Dravidian languages in
     code-mixed text, in: Forum for Information Retrieval Evaluation, 2020, pp. 21–24.
 [9] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, J. P. Sherly,
     Elizabeth McCrae, Overview of the track on Sentiment Analysis for Davidian Languages
     in Code-Mixed Text, in: Working Notes of the FIRE 2020. CEUR Workshop Proceedings.,
     2020.
[10] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@ Dravidian-CodeMix-FIRE2020: A hybrid
     Cnn and Bi-LSTM network for sentiment analysis of Dravidian code-mixed social media
     posts., in: FIRE (Working Notes), 2020, pp. 582–590.
[11] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for
     sentiment analysis of hindi-english code mixed text, in: Proceedings of COLING 2016, the
     26th International Conference on Computational Linguistics: Technical Papers, 2016, pp.
     2482–2491.
[12] L. Advani, C. Lu, S. Maharjan, C1 at SemEval-2020 Task 9: Sentimix: Sentiment analysis for
     code-mixed social media text using feature engineering, arXiv preprint arXiv:2008.13549
     (2020).
[13] K. Goswami, P. Rani, B. R. Chakravarthi, T. Fransen, J. P. McCrae, ULD@ NUIG at SemEval-
     2020 Task 9: Generative morphemes with an attention model for sentiment analysis in
     code-mixed text, arXiv preprint arXiv:2008.01545 (2020).
[14] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed dataset
     for sentiment analysis and offensive language detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s
     in Social Media, Association for Computational Linguistics, Barcelona, Spain (Online),
     2020, pp. 54–63. URL: https://www.aclweb.org/anthology/2020.peoples-1.6.
[15] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis
     dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), European Language Resources
     association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/
     2020.sltu-1.25.
[16] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
     Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
     and Collaboration and Computing for Under-Resourced Languages (CCURL), European
     Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.
     aclweb.org/anthology/2020.sltu-1.28.
[17] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi,
     E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings
     of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[18] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, D. Thenmozhi,
     E. Sherly, Overview of the DravidianCodeMix 2021 shared task on sentiment detection
     in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE
     2021, Association for Computing Machinery, 2021.
[19] S. Mahata, D. Das, S. Bandyopadhyay, Sentiment classification of code-mixed tweets using
     bi-directional RNN and language tags, in: Proceedings of the First Workshop on Speech
     and Language Technologies for Dravidian Languages, Association for Computational
     Linguistics, Kyiv, 2021, pp. 28–35. URL: https://aclanthology.org/2021.dravidianlangtech-1.
     4.
[20] A. V. Mandalam, Y. Sharma, Sentiment analysis of Dravidian code mixed data, in:
     Proceedings of the First Workshop on Speech and Language Technologies for Dravid-
     ian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 46–54. URL:
     https://aclanthology.org/2021.dravidianlangtech-1.6.
[21] S. Dowlagar, R. Mamidi, Graph convolutional networks with multi-headed attention
     for code-mixed sentiment analysis, in: Proceedings of the First Workshop on Speech
     and Language Technologies for Dravidian Languages, Association for Computational
     Linguistics, Kyiv, 2021, pp. 65–72. URL: https://aclanthology.org/2021.dravidianlangtech-1.
     8.
[22] F. Balouchzahi, H. L. Shashirekha, LA-SACo: A study of learning approaches for sentiments
     analysis inCode-mixing texts, in: Proceedings of the First Workshop on Speech and Lan-
     guage Technologies for Dravidian Languages, Association for Computational Linguistics,
     Kyiv, 2021, pp. 109–118. URL: https://aclanthology.org/2021.dravidianlangtech-1.14.

</pre>