Sentiment Analysis on Tamil Code-Mixed Text using Bi-LSTM

Pradeep Kumar Roy1, Abhinav Kumar2
1 Indian Institute of Information Technology, Surat, Gujarat, India
2 Siksha ‘O’ Anushandhan University, Bhubaneswar, Odisha, India

Abstract

Sentiment analysis is one of the most researched topics in computer science. Wherever opinions are expressed, sentiment analysis is relevant. Many businesses grow by analyzing users' opinions about their products. E-commerce portals such as Amazon and Flipkart let users express their opinions by posting reviews of purchased products. Later buyers of the same product then use those reviews to decide whether to purchase it. Existing sentiment analysis models mostly target comments written in English. However, users now post comments and reviews in mixed languages such as Hindi-English and Malayalam-English; such text is called code-mixed. To identify user sentiment in code-mixed text, this research proposes a deep learning-based framework. The proposed framework automatically extracts features from input sentences and predicts their sentiment with an F1-score of 0.552 in the best case.

Keywords
Sentiment Analysis, Code-Mixed, Tamil, Deep Learning, LSTM, Machine Learning

1. Introduction

People express their opinions in natural language on different platforms, including YouTube, Facebook, Twitter and others [1, 2]. Analyzing users' posts and identifying the opinions they carry plays a vital role in decision-making systems, and those opinions can raise or lower the standing of whatever is being discussed. For example, an e-commerce portal such as Flipkart lets users express their opinion about a product in the form of a review. Such reviews help buyers decide whether the product is good or not [3].
Similarly, whether a newly released movie is good or bad can be predicted from user opinions available on online portals such as IMDB. The Internet now reaches almost every individual, and hence user comments are available in high volume. To process these comments and reviews, many frameworks have been developed using various machine learning and deep learning techniques [4]. Most previous research processed comments or reviews written in English to develop sentiment analysis frameworks. However, a high volume of comments is now posted in mixed languages, for example Kannada-English, Malayalam-English, Hindi-English and many more. Hence, the models developed so far may not be capable of handling recent code-mixed comments [5].

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Email: pkroynitp@gmail.com (P. K. Roy); abhinavkumar@soa.ac.in (A. Kumar)
ORCID: 0000-0001-5513-2834 (P. K. Roy); 0000-0001-9367-7069 (A. Kumar)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

The research community has recently become interested in sentiment analysis of code-mixed language. Kumar et al. [6] suggested a hybrid CNN-Bi-LSTM model for categorizing social media posts into distinct sentiment groups. To categorize Tamil-English and Malayalam-English code-mixed social media posts into distinct sentiment classes, Mahata et al. [7] suggested a bi-directional LSTM with language tagging. Sharma and Mandalam [8] used sub-word level representations to capture text sentiment and an LSTM network to categorize Tamil-English and Malayalam-English social media posts into distinct polarity classes. Goswami et al. [9] proposed a morphological attention model for sentiment analysis on Hinglish data. Banerjee et al.
[10] reported findings on machine translation for Dravidian languages, such as English to Tamil, English to Malayalam, and similar pairs.

In line with prior work on sentiment analysis of code-mixed social media posts, we developed a deep neural model using a bi-directional LSTM [11, 12]. The data used for this research was collected by scraping YouTube comments and labelled with five sentiment categories: "positive, negative, neutral, mixed feelings or not in the intended languages" [13, 14]. Traditional machine learning classifiers and a deep neural network-based LSTM are used to classify the Tamil code-mixed dataset. The experimental outcomes confirmed that the proposed model outperforms the traditional machine learning-based models by achieving a higher weighted F1-score. The rest of the paper is organized as follows: Section 2 discusses the proposed methodology, Section 3 discusses the experimental outcomes, and Section 4 concludes the work.

2. Methodology

This research proposes a framework to predict the sentiment of code-mixed data using a bi-directional LSTM neural network [11, 12]. The working steps of the proposed model are shown in Figure 1. The dataset used in this research is available from FIRE-2021 (see footnote 1) and was developed by Chakravarthi et al. [15, 16]. The statistics of the dataset used for model training and validation, with the number of instances in each sentiment category, are shown in Table 1. The majority of the samples belong to the positive sentiment class, whereas the remaining samples are distributed among the four other categories. The original dataset contains many unhelpful characters, which need to be filtered out before the data is passed to the model. The data cleaning step removes emojis, special characters and non-ASCII characters. Numbers are also removed, and all text is converted to lower case.
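The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `clean_comment`, the regular expressions, the choice to replace removed characters with a space, and the sample comment are all assumptions. Note that dropping non-ASCII characters removes emojis and any native-script text alike, so only the romanized part of a comment survives.

```python
import re

def clean_comment(text: str) -> str:
    """Lower-case a comment and keep only ASCII letters and single spaces."""
    # Drop every non-ASCII character (covers emojis and native-script text).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove digits and special characters; using a space as the
    # replacement is an assumption made for this sketch.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse repeated whitespace and convert to lower case.
    return re.sub(r"\s+", " ", text).strip().lower()

# Hypothetical comment; \U0001F60D is an emoji code point.
sample = "Semma MASS trailer!!! \U0001F60D 100% hit #Thalaiva"
print(clean_comment(sample))  # semma mass trailer hit thalaiva
```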
Further, the cleaned data is padded with zeros to make all messages of equal length. The maximum message length is fixed at 30 for word-level processing and 70 for character-level processing. The padded text passes to the embedding layer, where the vector for each word is taken from a pre-trained word embedding called GloVe [17]. The GloVe embedding has a dimension of 100, meaning each input word is mapped to a 100-dimensional vector. Thus, for each message of n tokens, the embedding layer creates an n × 100 matrix.

[Figure 1: Proposed Framework for Sentiment Analysis with Code-Mixed Dataset. Pipeline: Code-Mixed Dataset → Preprocessing → Tokenization → one char-level branch (Padding (70), Random Embedding (100), Bi-LSTM (64), Dropout) and two word-level branches (Padding (30); Random Embedding (100) and Pre-trained Word Embedding (100); Bi-LSTM (32) each; Dropout) → Concatenate → Dense (128) → Output Layer, Dense (5)]

Footnote 1: https://dravidian-codemix.github.io/2021/datasets.html

A random embedding technique is also used for the word-level and character-level inputs, as shown in Figure 1, with the same output dimension of 100. The embedded data then passes to a Bi-directional Long Short-Term Memory (Bi-LSTM) model for further processing. For the character embedding, 64 Bi-LSTM units were used, whereas 32 Bi-LSTM units were used for each word-level input. A dropout layer is added in all three branches, and the outputs are then concatenated. The concatenated outputs of the Bi-LSTM models are passed to a fully connected dense layer with 128 neurons, followed by an output layer consisting of five neurons. The ReLU activation function is used in the internal layer of the network, whereas Softmax is used at the output layer.
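The padding and embedding lookup described in this section can be sketched in plain Python. This is a sketch under stated assumptions: the toy message, the vocabularies, the helper names `pad_ids` and `embed`, appending zeros at the end, and truncating longer messages are illustrative choices, and random vectors stand in for both the random and the GloVe embedding tables.

```python
import random

MAX_WORDS, MAX_CHARS, EMBED_DIM = 30, 70, 100  # sizes stated in the text

def pad_ids(ids, max_len):
    """Truncate to max_len, then append zeros (index 0 is reserved for padding)."""
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

def embed(ids, table):
    """Look up one EMBED_DIM-dimensional vector per position."""
    return [table[i] for i in ids]

# Hypothetical toy message and vocabularies; real indices start at 1.
message = "padam vera level"
words = message.split()
word_vocab = {w: i + 1 for i, w in enumerate(sorted(set(words)))}
char_vocab = {c: i + 1 for i, c in enumerate(sorted(set(message)))}

word_ids = pad_ids([word_vocab[w] for w in words], MAX_WORDS)
char_ids = pad_ids([char_vocab[c] for c in message], MAX_CHARS)

# Random vectors stand in for a learned or pre-trained embedding table;
# row 0 (padding) is kept as an all-zero vector in this sketch.
rng = random.Random(0)
word_table = [[0.0] * EMBED_DIM] + [
    [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in word_vocab
]
word_matrix = embed(word_ids, word_table)

print(len(word_ids), len(char_ids))           # 30 70
print(len(word_matrix), len(word_matrix[0]))  # 30 100
```

Each padded word-level message therefore becomes a 30 × 100 matrix before it reaches its Bi-LSTM branch; the character-level branch works the same way with 70 positions.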
Table 1
Data Statistics for Code-Mixed Tamil Dataset

| Class | Training | Validation |
|---|---|---|
| Positive | 20069 | 2257 |
| Unknown State | 5628 | 611 |
| Negative | 4271 | 480 |
| Mixed feelings | 4020 | 438 |
| not-Tamil | 1667 | 176 |

Table 2
Results with bi-gram features with ML classifiers

| Class | NB P | NB R | NB F1 | LR P | LR R | LR F1 | RF P | RF R | RF F1 |
|---|---|---|---|---|---|---|---|---|---|
| Positive | 0.58 | 0.99 | 0.73 | 0.61 | 0.94 | 0.74 | 0.62 | 0.93 | 0.74 |
| Unknown State | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.10 | 0.01 | 0.02 |
| Negative | 1.00 | 0.00 | 0.00 | 0.44 | 0.05 | 0.08 | 0.46 | 0.12 | 0.19 |
| Mixed feelings | 1.00 | 0.00 | 0.00 | 0.38 | 0.06 | 0.10 | 0.27 | 0.03 | 0.05 |
| not-Tamil | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Weighted Avg | 0.56 | 0.57 | 0.42 | 0.45 | 0.55 | 0.44 | 0.45 | 0.55 | 0.46 |

3. Results

This research developed a model to classify a code-mixed input sentence into one of the predefined sentiment categories. To evaluate model performance, the classification metrics precision (P), recall (R), and F1-score (F1) are used. Precision is the number of correctly predicted instances of a sentiment category divided by the number of instances predicted as that category. Recall is the number of correctly predicted instances of a sentiment category divided by the total number of instances of that category. The F1-score is the harmonic mean of precision and recall [11, 18].

A number of experiments were conducted by extracting various n-gram features from the text using the tf-idf vectorization technique and passing them to traditional machine learning classifiers: Random Forest (RF), Logistic Regression (LR), and Naive Bayes (NB). The best outcomes of these classifiers are shown in Table 2. Most instances of the non-positive classes are misclassified into another sentiment category. The positive sentiment category is predicted best by all three classifiers, NB, LR, and RF, with F1-scores of 0.73 to 0.74. In contrast, the same classifiers fail to detect the not-Tamil category.
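As a worked illustration of the metric definitions, the sketch below computes precision, recall and F1 from raw counts, followed by a support-weighted average. The helper names and the toy counts are hypothetical. The per-class F1 values and validation supports come from Tables 2 and 1, under the assumption that the "Weighted Avg" row weights each class by its number of validation samples.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its sample count."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Hypothetical counts for one class: 8 correct, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.5 0.62

# Validation supports (Table 1) and Naive Bayes F1 per class (Table 2).
supports = [2257, 611, 480, 438, 176]
nb_f1 = [0.73, 0.00, 0.00, 0.00, 0.00]
print(round(weighted_average(nb_f1, supports), 2))  # 0.42
```

Under that assumption the computed value agrees with the 0.42 weighted F1 reported for NB, which shows how a classifier that essentially predicts only the majority class can still obtain a moderate weighted score.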
None of the classifiers performed satisfactorily on the negative, mixed-feelings, unknown state and not-Tamil categories with bi-gram features. To improve performance, we used a deep learning-based bi-directional LSTM. The outcomes of the proposed Bi-LSTM model on the validation dataset are shown in Table 4. The best performance is achieved for the Positive sentiment class, with precision, recall and F1-score values of 0.68, 0.80, and 0.74, respectively, whereas the lowest values, 0.20, 0.11, and 0.14, respectively, are obtained for the mixed-feelings class. The proposed deep learning model outperforms the traditional machine learning models (Table 2), achieving an equal or higher F1-score for every class. The not-Tamil class is not recognized at all by any of the traditional ML models, whereas the proposed deep learning model predicts it with an F1-score of 0.47. The weighted average precision, recall and F1-score values are 0.54, 0.57, and 0.55, respectively, on the validation dataset. On the test dataset, the weighted precision, recall and F1-score values are 0.544, 0.566, and 0.552, respectively.

Table 3
Hyper-parameter details for the proposed model

| Hyper-parameter | Bi-LSTM model |
|---|---|
| Number of Bi-LSTM units | 64, 32, 32 |
| Dropout rate | 0.5 |
| Activation functions | ReLU, Softmax |
| Epochs | 50 |
| Loss | Categorical cross-entropy |
| Optimizer | Adam |

One possible reason for the model's uneven performance across sentiment classes is the imbalanced distribution of samples across classes in the training and testing sets (Table 1). The Positive category has the most samples, whereas the not-Tamil category has the fewest. The effect of this distribution is visible in the model's outcomes (Table 4).
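To make the imbalance concrete, the sketch below computes each class's share of the training set from Table 1, together with inverse-frequency class weights. The weighting scheme is an illustrative assumption and not something the paper reports using; it is simply one common way to quantify how under-represented a class is.

```python
# Training-set counts from Table 1.
train_counts = {
    "Positive": 20069,
    "Unknown State": 5628,
    "Negative": 4271,
    "Mixed feelings": 4020,
    "not-Tamil": 1667,
}

total = sum(train_counts.values())
n_classes = len(train_counts)

# Share of the training set held by each class.
shares = {c: n / total for c, n in train_counts.items()}

# Inverse-frequency ("balanced") weights: total / (n_classes * count).
# Assumption: shown for illustration only, not used in the paper.
class_weights = {c: total / (n_classes * n) for c, n in train_counts.items()}

for c in train_counts:
    print(f"{c:15s} share={shares[c]:.3f} weight={class_weights[c]:.2f}")
```

The Positive class alone accounts for more than half of the training samples, so a rare class such as not-Tamil receives a weight more than ten times larger under this scheme.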
Hence, to get better outcomes, data oversampling techniques such as SMOTE or ADASYN may help [19]. Another possible reason for the model's low performance is the high number of code-mixed samples in the training and testing datasets. Normalizing the dataset into English may help to achieve better prediction accuracy.

Table 4
Results on Validation data with Bi-LSTM

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Positive | 0.68 | 0.80 | 0.74 |
| Unknown State | 0.34 | 0.31 | 0.32 |
| Negative | 0.41 | 0.30 | 0.35 |
| Mixed feelings | 0.20 | 0.11 | 0.14 |
| not-Tamil | 0.50 | 0.45 | 0.47 |
| Weighted Avg | 0.54 | 0.57 | 0.55 |

4. Conclusion

Sentiment analysis is one of the major research areas in computer science, in which opinions are extracted from input text. The opinion may be positive, negative or neutral. Users now commonly post comments and reviews in mixed languages. Hence, extracting the opinion from such posts is a challenging task. This research proposed a deep learning-based automated system to predict the sentiment of users' posts written in Tamil code-mixed text. The proposed framework utilised a pre-trained word embedding technique and achieved a weighted F1-score of 0.552 in the best case on the test sample.

References

[1] B. Liu, et al., Sentiment analysis and subjectivity, Handbook of natural language processing 2 (2010) 627–666.
[2] R. Feldman, Techniques and applications for sentiment analysis, Communications of the ACM 56 (2013) 82–89.
[3] S. Saumya, J. P. Singh, Detection of spam reviews: a sentiment analysis approach, CSI Transactions on ICT 6 (2018) 137–148.
[4] Q. T. Ain, M. Ali, A. Riaz, A. Noureen, M. Kamran, B. Hayat, A. Rehman, Sentiment analysis using deep learning techniques: a review, Int J Adv Comput Sci Appl 8 (2017) 424.
[5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P.
McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://aclanthology.org/2020.sltu-1.28.
[6] A. Kumar, S. Saumya, J. P. Singh, Nitp-ai-nlp@ dravidian-codemix-fire2020: A hybrid cnn and bi-lstm network for sentiment analysis of dravidian code-mixed social media posts, in: FIRE (Working Notes), 2020, pp. 582–590.
[7] S. Mahata, D. Das, S. Bandyopadhyay, Sentiment classification of code-mixed tweets using bi-directional rnn and language tags, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 28–35.
[8] Y. Sharma, A. V. Mandalam, Bits2020@ dravidian-codemix-fire2020: Sub-word level sentiment analysis of dravidian code mixed data, in: FIRE (Working Notes), 2020, pp. 503–509.
[9] K. Goswami, P. Rani, B. R. Chakravarthi, T. Fransen, J. P. McCrae, Uld@ nuig at semeval-2020 task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text, arXiv preprint arXiv:2008.01545 (2020).
[10] S. Banerjee, A. Jayapal, S. Thavareesan, Nuig-shubhanker@dravidian-codemix-fire2020: Sentiment analysis of code-mixed dravidian text using xlnet, in: FIRE, 2020.
[11] P. K. Roy, J. P. Singh, S. Banerjee, Deep learning to filter sms spam, Future Generation Computer Systems 102 (2020) 524–533.
[12] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[13] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly, Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, 2021.
[14] B. R.
Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi, E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[15] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed tamil-english text, arXiv preprint arXiv:2006.00206 (2020).
[16] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly, Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, 2021.
[17] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
[18] P. K. Roy, Multilayer convolutional neural network to filter low quality content from quora, Neural Processing Letters 52 (2020) 805–821.
[19] P. K. Roy, Z. Ahmad, J. P. Singh, M. A. A. Alryalat, N. P. Rana, Y. K. Dwivedi, Finding and ranking high-quality answers in community question answering sites, Global Journal of Flexible Systems Management 19 (2018) 53–68.