Sentiment Analysis Model For Code-Mixed Tamil Language
N Sripriya 1 and S Divya 2
1,2
      Department of Information Technology, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India

                Abstract
                Social Media is a vital source for communicating information and retrieval. To legitimize the
                contents in social media, sentiment analysis is vital and has become a most focused research
                area. Sentiment analysis is a Natural Language Processing (NLP) task and has been well
                analyzed for application in monolingual text. Sentiment analysis tasks become complex when
                applied to Code-mix data. Since the comments produced by viewers in social media
                incorporate emoticons and maybe in mixed language, sentimental analysis of such data is
                challenging. This paper describes a model that codes the input data by looking at the
                frequency of terms and is then categorized using a multiclass classification algorithm. This
                model is straightforward and produces better results in classifying the data based on the terms
                available in the input sequence. Evaluation of this model yields an average weighted F1 score
                of 0.35 is achieved when applied to the Dravidian Code-mix dataset produced for the
                Sentiment Analysis task in FIRE-2021.

                Keywords 1
                Sentiment, emoticons, Code-mix, Natural Language Processing.

1. Introduction
    Sentiment analysis is a taxonomy task that is used to extract sentiments from text data. This task
has its benefits in numerous applications like customer feedback, reputation management and
legalizing content in social media [1], [2], [3]. This is widely used in generating a summary of human
ideas or interests extracted from the comments posted by the users or viewers [4]. Many online
forums allow users to share their experiences as product or content reviews. To facilitate the user, the
online platforms ensure the mother tongue communication or Code-mix language to share the user's
view in a realistic way. Since most of the MLP tasks are trained over well-organized data with proper
grammar, it becomes challenging when being applied to user-generated comments [5].
    Code-mixing or Code-switching alternates two or greater numbers of languages at various levels
of the content. It may be done at a document level, paragraph level, comments level, sentence level,
phrase level, word level, or at even morpheme level. This represents a unique way of conversing in a
bilingual or multilingual society [6].
    This paper elaborates a model that generates embedding representation for the text data available
in the dataset issued for the sentiment analysis task by Dravidian Code-mix FIRE 2021. This is a
multiclass classification problem that generates five different labels for the data collected from
YouTube comments. The developed model extracts functionality from the given input data and based
on those features the input data is classified into several classes. This classification is done using a
Machine Learning algorithm, which learns from the features extracted and the labels given to each
training data during the training stage. Based on the learning, it tries to classify the data into distinct
groups and labels each data by the group it belongs to. Since the classification task tries to classify the
data into multiple classes, multiclass classifiers are used for categorizing the given data.


FIRE 2021, Forum for Information Retrival Evaluation, December 13-17, 2021, India
E.MAIL: sripriyan@ssn.edu.in(N.Sripriya); divyas@ssn.edu.in (S.Divya)
                2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

             CEUR Workshop Proceedings (CEUR-WS.org)
2. Related Work
    Sentiment analysis supports during the analysis of customer polarity on a particular product,
information, or event. This task helps in understanding the attitude of the public, which helps in
collecting reasonable information for future decisions on numerous comments. Sentiment analysis
that was initially applied to political campaigns and news articles was then expanded to social media
content. Recently this mission is used to capture feelings from Code-mix information available on
social media.
    Social media forum permits users to post content in informal settings. Also to enhance user
experience, these forums allow the user to communicate their opinions in their native language or by
switching between one or more languages according to their comfort. High resource language has
formal settings that hold proper grammar rules. Earlier sentiment analysis model had grammatical
rules and lexicons for extracting features from the input data. This rule-based feature extraction is
complex and time-consuming.
    To provide meaningful feature extraction and to make this a domain-independent task,
enhancements are made in embeddings based on prominent features as an alternative to a rule-based
system [11]. Those functions that convey the importance of the content are then fed into machine
learning algorithms for performing multiclass classification. This classification model assigns various
labels that help in understanding the sentiments of that data.
    These systems do not work better for informal settings in the user-generated comments. Code-
mixing and Code-switching alternate between two or more languages at document, phrase, sentence,
token, lexeme and even at morpheme level [12]. In this enlarged usage of social media users, there
arises a need for a model being trained with the Code-mixed language that functions on the user-
generated comments [15]. This lead to the realization of the unavailability of a large dataset for Code-
mixed language. This inspired the corpus collection of Code-mixed data from YouTube.
    This annotated dataset is transformed to Term Frequency Inverse Document Frequency (TF/IDF)
[22] representation and is applied to traditional Machine Learning algorithms for training. The
traditional ML algorithms include Logistic Regression (LR), Support Vector Machine (SVM),
Decision Tree (DT), Random Forest (RF), Multinomial Naive Bayes (MNB) and, k-Nearest
Neighbour (kNN). All these algorithms have performed well in classification tasks by classifying all
the classes [13] [14], whereas SVM does not suit multiple class classification problems.
    A multiple-task Kannada-English Code-Mixed dataset [17] for Sentiment Analysis and Offensive
Language [18] detection has been collected, which consists of 7,671 comments that are annotated and
are benchmarked using computational models. To promote multi-task learning for low-resourced
languages, this dataset is used for training various classification models [16].
    An enhanced technique in the processing of Code-Mixed language is by generating representations
[9] of each sentence in the dataset. This representation gives the ability to learn the task-related
features [10] from the input to facilitate classification. This representation is also generated by certain
pre-trained models [19] [20] to understand the context of the input sentence. These features help in
semantically classify the sentences based on the generated representation [21].
    In this work, Sentiment Analysis on Code-Mixed Tamil language is performed by extracting the
features in each sentence and classifying it based on the extracted features. The technique used for
feature extraction and classification is explained in the subsequent topics.

3. Feature Extraction
    Since the Machine learning model can't work with the raw data, some feature extraction techniques
are applied to go on the raw data to convert it into vectors. Additionally, these models have been
trained with certain training data to perform the task on the test data. Analyzing the similarity between
the test data and the training data facilitates the classification process. To explore the similarity
between the data, Term Frequency - Inverse Document Frequency (TF-IDF) [22] is applied. TF-IDF
value is dependent on two factors.
       Term Frequency (TF) = No of times a specific word takes place in a document.
       Inverse Document Frequency (IDF) = Frequency of a term between the documents in the
        entire corpus.
   The value of TF-IDF increases when a term often appears in the document and decreases when
there are more documents in the entire corpus of that term. Thus, a high value is achieved when a
term more often takes place in a document and the document appears less frequently in the corpus.
   For each term, the TF-IDF score is computed and therefore the functional vector is framed for each
sentence which is further fed as input for classification.

4. Classification Algorithm
    The decision tree [7] is an effective and well-known classification algorithm. This algorithm
generates a tree structure with the specified conditions to decide. Each node in the tree represents the
state of an attribute and the result of this condition is represented using branches that connect each
node. The labels are the judgments present in the leaf node. The decision tree may be error-free while
handling classification problems with many labels but fewer samples. To overcome this disadvantage,
a Random Decision Forest [8] classifier is developed.
    Random Forest classifier builds a set of decision trees from the randomly selected subgroup of
training data. The decisions taken by these trees are then collected and voting is carried out to make a
final determination. This will be accomplished in the following steps.

       Choose random sample subgroups from the given dataset.
       Construct a decision tree for each subgroup sample and get a decision from each decision tree.
       The vote is done on each predicted decision.
       The decision with majority votes is made as to the final prediction.

    This classifier is more accurate and vigorous in making decisions due to the numerous decision
trees involved in the process. Even when the training is done with minimal samples, overfitting is
ignored since the final decision is based on the average of numerous predictions that cancel the biases.

5. Proposed model
   A system is proposed to perform multiclass classification on Code-mix data to detect the sentiment
of that data. The input data must be classified into 5 groups for example in the positive, Negative,
Mixed-feelings, Not-Tamil and unknown states. Initially, the input data must be pre-processed to
remove symbols, special characters, hashtags and characters that do not hold any information.
Preprocessed data is now represented as vectors using TF-IDF functional extraction technique. These
vectors are assigned 5 labels using a random forestry classifier. Figure 1 illustrates the architecture of
the proposed system.

                                           Code-Mixed Input

                                          Text Pre-Processing

                                        Encoding of input tokens

                                       Classification of input into
                                             distinct labels


                        Figure 1: Architecture diagram for the proposed model
6. Performance Evaluation
   The proposed model is applied to the Code-Mixed Tamil dataset collected from YouTube video
comments. This dataset comprises 35,657 training sets, 3,962 validation sets and 4,403 test sets. The
proposed model is trained using the training set and is evaluated using the validation set. The labels
generated using the proposed model are assessed using the average weighted score for classification.
The classification report for the proposed system is given below.

Table 1
Classification Report
                          Precision              Recall              F1-Score            Support
     Positive                0.56                 0.50                 0.53                2257
    Negative                 0.11                 0.11                 0.11                480
  Mixed-feelings             0.13                 0.04                 0.07                438
    Not-Tamil                0.08                 0.29                 0.13                176
  Unknown state              0.16                 0.18                 0.17                611
    Accuracy                   -                                       0.35                3962
    Macro avg                0.21                 0.23                 0.20                3962
  Weighted avg               0.37                 0.35                 0.35                3962

   The count of test data given for evaluating the system is mentioned as support. Weighted average
F1 score is considered to assess the system developed for assigning labels. This is calculated as the
average of precision and recall.

7. Conclusion and Future work
   Identification of sentiments in Code-mix Tamil data is done using a machine learning classifier
and the evaluation of the proposed system is accomplished. Applying profound learning techniques
will further enhance the learning of the model and will enhance classification performance.

8. Acknowledgements
   We sincerely thank the management of SSN Institutions for the infrastructure and lab facilities to
carry out this research work.

9. References
[1] Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. "Recognizing contextual polarity in phrase-
    level sentiment analysis." In Proceedings of human language technology conference and
    conference on empirical methods in natural language processing, pp. 347-354. 2005.
[2] Agarwal, Apoorv, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca J. Passonneau.
    "Sentiment analysis of Twitter data." In Proceedings of the workshop on language in social
    media (LSM 2011), pp. 30-38. 2011.
[3] Thavareesan, Sajeetha, and Sinnathamby Mahesan. "Sentiment analysis in Tamil texts: A study
    on machine learning techniques and feature representation." In 2019 14th Conference on
    Industrial and Information Systems (ICIIS), pp. 320-325. IEEE, 2019.
[4] Pang, Bo, and Lillian Lee. "A sentimental education: Sentiment analysis using subjectivity
    summarization based on minimum cuts." arXiv preprint cs/0409058 (2004).
[5] Pratapa, Adithya, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and
    Kalika Bali. "Language modeling for code-mixing: The role of linguistic theory based synthetic
     data." In Proceedings of the 56th Annual Meeting of the Association for Computational
     Linguistics (Volume 1: Long Papers), pp. 1543-1553. 2018.
[6] Barman, Utsab, Amitava Das, Joachim Wagner, and Jennifer Foster. "Code mixing: A challenge
     for language identification in the language of social media." In Proceedings of the first workshop
     on computational approaches to code-switching, pp. 13-23. 2014.
[7] Safavian, S. Rasoul, and David Landgrebe. "A survey of decision tree classifier
     methodology." IEEE transactions on systems, man, and cybernetics 21, no. 3 (1991): 660-674.
[8] Breiman, Leo. "Random forests." Machine learning 45, no. 1 (2001): 5-32.
[9] Banerjee, Shubhanker, Bharathi Raja Chakravarthi, and John P. McCrae. "Comparison of pre-
     trained embeddings to identify hate speech in Indian code-mixed text." In 2020 2nd International
     Conference on Advances in Computing, Communication Control and Networking (ICACCCN),
     pp. 21-25. IEEE, 2020.
[10] Chakravarthi, Bharathi Raja, Ruba Priyadharshini, Shubhanker Banerjee, Richard Saldanha, John
     Philip McCrae, Parameswari Krishnamurthy, and Melvin Johnson. "Findings of the Shared Task
     on Machine Translation in Dravidian languages." In Proceedings of the First Workshop on
     Speech and Language Technologies for Dravidian Languages, pp. 119-125. 2021.
[11] Chakravarthi, Bharathi Raja, Ruba Priyadharshini, Navya Jose, Thomas Mandl, Prasanna Kumar
     Kumaresan, Rahul Ponnusamy, R. L. Hariharan, John Philip McCrae, and Elizabeth Sherly.
     "Findings of the shared task on offensive language identification in Tamil, Malayalam, and
     Kannada." In Proceedings of the First Workshop on Speech and Language Technologies for
     Dravidian Languages, pp. 133-145. 2021.
[12] Hande, A deep, Siddhanth U. Hegde, Ruba Priyadharshini, Rahul Ponnusamy, Prasanna Kumar
     Kumaresan, Sajeetha Thavareesan, and Bharathi Raja Chakravarthi. "Benchmarking multi-task
     learning for sentiment analysis and offensive language identification in under-resourced
     Dravidian languages." arXiv preprint arXiv:2108.03867 (2021).
[13] Chakravarthi, Bharathi Raja, Ruba Priyadharshini, Vigneshwaran Muralidaran, Shardul
     Suryawanshi, Navya Jose, Elizabeth Sherly, and John P. McCrae. "Overview of the track on
     sentiment analysis for Dravidian languages in code-mixed text." In Forum for Information
     Retrieval Evaluation, pp. 21-24. 2020.
[14] Chakravarthi, Bharathi Raja, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P.
     McCrae. "A sentiment analysis dataset for code-mixed Malayalam-English." arXiv preprint
     arXiv:2006.00210 (2020).
[15] Chakravarthi, Bharathi Raja, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John P.
     McCrae. "Corpus creation for sentiment analysis in code-mixed Tamil-English text." arXiv
     preprint arXiv:2006.00206 (2020).
[16] Mandl, Thomas, Sandip Modha, Anand Kumar M, and Bharathi Raja Chakravarthi. "Overview
     of the hasoc track at fire 2020: Hate speech and offensive language identification in Tamil,
     Malayalam, Hindi, English and German." In Forum for Information Retrieval Evaluation, pp. 29-
     32. 2020..
[17] Hande, A deep, Ruba Priyadharshini, and Bharathi Raja Chakravarthi. "KanCMD: Kannada
     CodeMixed dataset for sentiment analysis and offensive language detection." In Proceedings of
     the Third Workshop on Computational Modeling of People's Opinions, Personality, and
     Emotion's in Social Media, pp. 54-63. 2020.
[18] Mandl, Thomas, Sandip Modha, Anand Kumar M, and Bharathi Raja Chakravarthi. "Overview
     of the hasoc track at fire 2020: Hate speech and offensive language identification in tamil,
     malayalam, hindi, english and german." In Forum for Information Retrieval Evaluation, pp. 29-
     32. 2020.
[19] Ghanghor, Nikhil, Parameswari Krishnamurthy, Sajeetha Thavareesan, Ruba Priyadharshini, and
     Bharathi Raja Chakravarthi. "IIITK@ DravidianLangTech-EACL2021: Offensive Language
     Identification and Meme Classification in Tamil, Malayalam and Kannada." In Proceedings of
     the First Workshop on Speech and Language Technologies for Dravidian Languages, pp. 222-
     229. 2021.
[20] Banerjee, Shubhanker, Arun Jayapal, and Sajeetha Thavareesan. "NUIG-Shubhanker@
     Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using
     XLNet." arXiv preprint arXiv:2010.07773 (2020).
[21] Suryawanshi, Shardul, and Bharathi Raja Chakravarthi. "Findings of the shared task on Troll
     Meme Classification in Tamil." In Proceedings of the First Workshop on Speech and Language
     Technologies for Dravidian Languages, pp. 126-132. 2021.
[22] Aizawa, Akiko. "An information-theoretic perspective of tf–idf measures." Information
     Processing & Management 39, no. 1 (2003): 45-65.