<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>India. pkroynitp@gmail.com (P. K. Roy); abhinavkumar@soa.ac.in (A. Kumar)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis on Tamil Code-Mixed Text using Bi-LSTM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pradeep Kumar Roy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology</institution>
          ,
          <addr-line>Surat, Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siksha 'O' Anusandhan University</institution>
          ,
          <addr-line>Bhubaneswar, Odisha</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Sentiment analysis is one of the most researched topics in computer science and is needed wherever opinions are expressed. Many business sectors grow by analyzing users' opinions about their products. E-commerce portals such as Amazon and Flipkart let users express their opinions by posting reviews of purchased products, and later buyers of the same product use these reviews to decide whether or not to purchase it. Existing sentiment analysis models mostly target comments written in English. However, users now post comments and reviews in mixed languages such as Hindi-English and Malayalam-English; such text is called code-mixed. To identify user sentiment in code-mixed text, this research proposes a deep learning-based framework. The proposed framework automatically extracts features from input sentences and predicts their sentiment, achieving an F1-score of 0.552 in the best case.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Code-Mixed</kwd>
        <kwd>Tamil</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>LSTM</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        People express their opinions in natural language on different platforms,
including YouTube, Facebook, Twitter and others [1, 2]. Analyzing users' posts and identifying
the opinions they carry plays a vital role in decision-making systems, since such opinions can raise or lower
the reputation of whatever is being discussed. For example, an e-commerce portal such as Flipkart lets users express their
opinion about a product in the form of a review. Such reviews help buyers decide
whether the product is good or not [
        <xref ref-type="bibr" rid="ref1">3</xref>
        ]. Similarly, whether a newly released movie is good
or bad can be predicted from user opinions available on online portals such as IMDB. Because
the Internet now reaches almost every individual, user comments are available in
high volume.
      </p>
      <p>
        To process user comments and reviews, many frameworks have been developed using various
machine learning and deep learning techniques [
        <xref ref-type="bibr" rid="ref2">4</xref>
        ]. Most previous research
built sentiment analysis frameworks on comments or reviews written in English.
However, a high volume of comments is now posted in
mixed languages, for example Kannada-English, Malayalam-English, Hindi-English and many
more. Hence, the models developed so far may not be capable of handling recent code-mixed
comments [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ].
      </p>
      <p>
        The research community has recently taken an interest in sentiment analysis of code-mixed
language. Kumar et al. [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ] suggested a hybrid CNN-Bi-LSTM model for categorizing social
media posts into distinct sentiment groups. To categorize Tamil-English and
Malayalam-English code-mixed social media posts into distinct sentiment classes, Mahata et al. [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]
suggested a Bi-directional LSTM with language tagging. Sharma and Mandalam
[
        <xref ref-type="bibr" rid="ref6">8</xref>
        ] used sub-word level representations to capture text sentiment and an LSTM network to
categorize Tamil-English and Malayalam-English social media posts into distinct polarity
classes. Goswami et al. [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ] proposed a morphological attention model for sentiment analysis on
Hinglish data. Banerjee et al. [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ] reported findings on machine translation for Dravidian
languages, such as English to Tamil and English to Malayalam.
      </p>
      <p>
        In line with earlier work on sentiment analysis of code-mixed social media posts,
we developed a deep neural model using a Bi-directional LSTM [
        <xref ref-type="bibr" rid="ref10 ref9">11, 12</xref>
        ]. The data used for this
research was collected by scraping YouTube comments and labelled with one of five sentiment
categories: "positive, negative, neutral, mixed feelings or not in the intended languages"
[
        <xref ref-type="bibr" rid="ref11 ref12">13, 14</xref>
        ]. Traditional machine learning classifiers and a deep neural network-based LSTM are
used to classify the Tamil code-mixed dataset. The experimental outcomes confirm that
the proposed model outperforms the traditional machine learning-based models by achieving
higher prediction accuracy.
      </p>
      <p>The rest of the paper is organized as follows: Section 2 discusses the proposed methodology.
In Section 3, we discuss the experimental outcomes, and finally, Section 4 concludes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        This research proposes a framework to predict the sentiment of code-mixed data using a
Bi-directional LSTM neural network [
        <xref ref-type="bibr" rid="ref10 ref9">11, 12</xref>
        ]. The working steps of the proposed model
are shown in Figure 1. The dataset used in this research is available from FIRE-2021<sup>1</sup> and was
developed by Chakravarthi et al. [15, 16]. Table 1 shows the statistics of the dataset used for model training
and testing, with the number of instances available in each sentiment category.
The majority of the samples belong to the positive sentiment class,
whereas the remaining samples are distributed across the four other categories.
      </p>
      <p>The original dataset contains many unhelpful characters, which need to be filtered out
before the text is passed to the model. The data cleaning step removes
emojis, special characters, and non-ASCII characters. Numbers are also removed, and
all text is converted to lower case. The cleaned messages are then padded with zeros to
make them all the same length. The maximum message length is fixed at 30 for
word-level processing and 70 for character-level processing. The padded text is passed to the embedding
layer, where the vector for each word is looked up in a pre-trained word
embedding called GloVe [17]. The GloVe embedding has a dimension of 100, meaning that each
input word is mapped to a 100-dimensional vector. Thus, for each message of size n, the embedding layer
creates an n × 100 matrix. Random embedding is also used
for the word-level and character-level inputs, as shown in Figure 1, with the same output dimension of 100.</p>
      <p><sup>1</sup> https://dravidian-codemix.github.io/2021/datasets.html</p>
      <p>[Figure 1: Working steps of the proposed model. The code-mixed dataset passes through preprocessing and tokenization into three branches: (a) char level, padding (70), random embedding (100), Bi-LSTM (64), dropout; (b) word level, padding (30), random embedding (100), Bi-LSTM (32), dropout; (c) word level, padding (30), pre-trained word embedding (100), Bi-LSTM (32), dropout. The branch outputs are concatenated and passed to Dense (128) and then to the output layer, Dense (5).]</p>
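      <p>The cleaning and padding steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the sample messages, the choice to pad on the right, and the mapping of unknown tokens to 0 are all assumptions; only the limits of 30 (word level) and 70 (char level) come from the text.</p>
      <preformat><![CDATA[
```python
import re

MAX_WORDS = 30  # word-level length stated in the text
MAX_CHARS = 70  # char-level length stated in the text


def clean(text):
    """Drop non-ASCII, digits and special characters, then lower-case."""
    text = text.encode("ascii", "ignore").decode("ascii")  # removes emojis
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # removes digits and symbols
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(tokens_per_message):
    """Map each token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in tokens_per_message:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, right-pad with zeros.

    Assumption: unknown tokens share id 0 with padding, and padding is
    appended at the end; the paper does not say which side is padded.
    """
    ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical code-mixed comments, used only for illustration.
messages = [
    "Semma mass trailer!!! \U0001F525 100% hit",
    "Padam vera level \U0001F44C",
]
cleaned = [clean(m) for m in messages]
word_tokens = [c.split() for c in cleaned]
char_tokens = [list(c) for c in cleaned]
word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)
word_ids = [encode_and_pad(t, word_vocab, MAX_WORDS) for t in word_tokens]
char_ids = [encode_and_pad(t, char_vocab, MAX_CHARS) for t in char_tokens]

print(cleaned[0])
print(word_ids[0][:6])
```
]]></preformat>
      <p>Every message ends up as one list of 30 word ids and one list of 70 character ids, which is the fixed-size input an embedding layer expects.</p>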
      <p>The embedded inputs are then passed to Bi-directional Long Short-Term Memory (Bi-LSTM)
layers. For the character embedding, a Bi-LSTM with 64 units is used, whereas
each word-level input is processed by a Bi-LSTM with 32 units. A dropout layer is added in all three branches,
and the branch outputs are then concatenated. The concatenated output
is passed to a fully connected dense layer with 128 neurons, followed by an output
layer of five neurons. The ReLU activation function is used in the internal layers of
the network, while Softmax is used at the output layer.</p>
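      <p>To make the layer sizes concrete, the following sketch traces the feature widths and trainable weight counts implied by this description. It rests on assumptions the text does not state: each Bi-LSTM returns only its final state, the unit counts (64 and 32) are per direction, and each LSTM uses the standard four-gate formulation with biases. The branch names are hypothetical, and embedding weights are left out because the vocabulary sizes are not given.</p>
      <preformat><![CDATA[
```python
EMBED_DIM = 100  # embedding size stated in the text


def bilstm_params(input_dim, units):
    """Trainable weights of a bidirectional LSTM layer.

    One direction has 4 gates, each with an input kernel (input_dim x units),
    a recurrent kernel (units x units) and a bias vector (units).
    """
    one_direction = 4 * (input_dim * units + units * units + units)
    return 2 * one_direction


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


# (hypothetical branch name, Bi-LSTM units per direction)
branches = [("char", 64), ("word_random", 32), ("word_pretrained", 32)]

# Forward and backward final states are joined, giving 2 * units features.
branch_widths = {name: 2 * units for name, units in branches}
concat_width = sum(branch_widths.values())

params = {name: bilstm_params(EMBED_DIM, units) for name, units in branches}
params["dense_128"] = dense_params(concat_width, 128)
params["output_5"] = dense_params(128, 5)

for name, count in params.items():
    print(f"{name:16s} {count:7d}")
print("concatenated width:", concat_width)
```
]]></preformat>
      <p>Under these assumptions the concatenated vector has 128 + 64 + 64 = 256 features, and the character branch holds more recurrent weights than the two word branches combined.</p>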
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        This research developed a model to classify a code-mixed input sentence into one of the
predefined sentiment categories. To evaluate model performance, the classification metrics
precision (P), recall (R), and F1-score (F1) are used. For a given sentiment category, precision is the fraction
of instances predicted as that category that truly belong to it,
and recall is the fraction of
all instances of that category that are predicted correctly. The
F1-score is the harmonic mean of precision and recall [
        <xref ref-type="bibr" rid="ref9">11, 18</xref>
        ].
      </p>
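      <p>These definitions can be written out directly. The sketch below computes per-class precision, recall and F1, plus a support-weighted F1; the helper names and the toy labels are assumptions made for illustration and are not data from this study.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def harmonic_mean(p, r):
    """F1-score: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    tp, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            tp[t] += 1
    out = {}
    for label in actual:
        # A class that is never predicted gets precision 0 by convention.
        precision = tp[label] / predicted[label] if predicted[label] else 0.0
        recall = tp[label] / actual[label]
        out[label] = (precision, recall, harmonic_mean(precision, recall),
                      actual[label])
    return out


def weighted_f1(metrics):
    """Average of per-class F1, weighted by the number of true instances."""
    total = sum(m[3] for m in metrics.values())
    return sum(m[2] * m[3] for m in metrics.values()) / total


# Toy labels, hypothetical and for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "mixed"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "pos"]
metrics = per_class_metrics(y_true, y_pred)

for label, (p, r, f1, n) in metrics.items():
    print(f"{label:6s} P={p:.2f} R={r:.2f} F1={f1:.2f} support={n}")
print(f"weighted F1 = {weighted_f1(metrics):.3f}")
```
]]></preformat>
      <p>As a quick check of the harmonic-mean formula, a precision of 0.68 and a recall of 0.80 give an F1 of about 0.735.</p>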
      <p>Several experiments were carried out by extracting various n-gram features from
the text using the tf-idf vectorization technique and passing them to traditional machine learning
classifiers: Random Forest (RF), Logistic Regression (LR), and Naive Bayes (NB). The
best outcomes of these classifiers are shown in Table 2. Most instances are misclassified
into another sentiment category. The positive sentiment category is predicted with the highest
accuracy by all three classifiers (NB, LR, and RF). In contrast, the same classifiers
fail to detect the not-Tamil category. With bi-gram features, none of the classifiers performed
satisfactorily on the negative, mixed-feelings, unknown-state and
not-Tamil categories.</p>
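      <p>For readers unfamiliar with tf-idf over n-grams, a minimal sketch is given here. It assumes the plain, unsmoothed variant (term frequency times log of inverse document frequency); library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ. The function names and sample documents are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-token windows joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document.

    tf  = count of the n-gram / number of n-grams in the document
    idf = log(number of documents / documents containing the n-gram)
    """
    grams = [ngrams(d.split(), n) for d in docs]
    df = Counter(g for doc in grams for g in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in grams:
        counts = Counter(doc)
        vectors.append({g: (c / len(doc)) * math.log(n_docs / df[g])
                        for g, c in counts.items()})
    return vectors


# Hypothetical cleaned comments, for illustration only.
docs = ["padam vera level", "trailer vera level", "padam semma mass"]
vectors = tfidf(docs, n=2)

for doc, vec in zip(docs, vectors):
    print(doc, {g: round(w, 3) for g, w in vec.items()})
```
]]></preformat>
      <p>A bi-gram shared by several documents, such as "vera level" in this toy input, receives a lower weight than one that appears in a single document, which is what makes the features discriminative for a classifier.</p>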
      <p>To improve performance, we used a deep learning-based Bi-directional LSTM.
The outcomes of the proposed Bi-LSTM model on the validation dataset are shown in Table 4. The
best performance is achieved for the positive sentiment class, with precision, recall and F1-score
values of 0.68, 0.80, and 0.74, respectively, whereas the lowest precision, recall and F1-score
values, 0.20, 0.11, and 0.14, respectively, are obtained for the mixed-feelings class. The
proposed deep learning model outperforms the traditional machine learning models (Table
2) by achieving better performance for all classes. The not-Tamil class is not recognized
at all by any of the traditional ML models, whereas the proposed deep learning
model provides satisfactory prediction accuracy. The weighted average precision, recall and
F1-score values are 0.54, 0.57, and 0.55, respectively, on the validation dataset, and
0.544, 0.566, and 0.552,
respectively, on the test dataset.</p>
      <p>One possible reason for the model's uneven performance across
sentiment classes is the imbalanced distribution of samples across the classes
of the training and testing sets (Table 1). The positive sentiment
category has the most samples, whereas the not-Tamil category has the fewest.
The effect of this distribution is visible in the model's outcomes (Table 4). Hence,
data oversampling techniques such as SMOTE or ADASYN may help to obtain better outcomes [19].
Another possible reason for the low performance is the high number
of code-mixed samples in the training and testing datasets. Normalizing the dataset into English
may help to achieve better prediction accuracy.</p>
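      <p>The idea behind oversampling can be illustrated with a short sketch. Note that this is plain random oversampling (duplicating minority-class samples), used here as a simpler stand-in: SMOTE and ADASYN instead synthesize new points by interpolating between neighbouring minority samples in feature space. The function name and toy data are assumptions, and in practice resampling would be applied to the training split only.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Draw extra copies at random from the existing samples of this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Toy imbalanced data, hypothetical and for illustration only.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
labels = ["positive", "positive", "positive", "positive",
          "negative", "not-Tamil"]
new_samples, new_labels = random_oversample(samples, labels)

print(Counter(labels))
print(Counter(new_labels))
```
]]></preformat>
      <p>After resampling, every class has as many training samples as the largest one, so the loss is no longer dominated by the majority class.</p>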
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Sentiment analysis is one of the major research areas in computer science, in which opinions
are extracted from input text. The opinion may be positive, negative or neutral. Users now
commonly post comments and reviews in mixed languages, so
extracting opinions from such posts is a challenging task. This research proposed a deep
learning-based automated system to predict the sentiment of user posts written in
code-mixed Tamil. The proposed framework utilised a pre-trained word embedding and
achieved a weighted F1-score of 0.552 in the best case on the test samples.</p>
      <p>[1] B. Liu, et al., Sentiment analysis and subjectivity, Handbook of Natural Language Processing
2 (2010) 627–666.</p>
      <p>[2] R. Feldman, Techniques and applications for sentiment analysis, Communications of the
ACM 56 (2013) 82–89.</p>
      <p>[14] (continued) of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes
of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.</p>
      <p>[15] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
sentiment analysis in code-mixed Tamil-English text, arXiv preprint arXiv:2006.00206
(2020).</p>
      <p>[16] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
Overview of the DravidianCodeMix 2021 shared task on sentiment detection in Tamil,
Malayalam, and Kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
Association for Computing Machinery, 2021.</p>
      <p>[17] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2014, pp. 1532–1543.</p>
      <p>[18] P. K. Roy, Multilayer convolutional neural network to filter low quality content from
Quora, Neural Processing Letters 52 (2020) 805–821.</p>
      <p>[19] P. K. Roy, Z. Ahmad, J. P. Singh, M. A. A. Alryalat, N. P. Rana, Y. K. Dwivedi, Finding and
ranking high-quality answers in community question answering sites, Global Journal of
Flexible Systems Management 19 (2018) 53–68.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Detection of spam reviews: a sentiment analysis approach</article-title>
          ,
          <source>CSI Transactions on ICT</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q. T.</given-names>
            <surname>Ain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Riaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noureen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kamran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hayat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rehman</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis using deep learning techniques: a review</article-title>
          ,
          <source>Int J Adv Comput Sci Appl</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>424</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://aclanthology.org/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Nitp-ai-nlp@ dravidian-codemix-fire2020: A hybrid cnn and bi-lstm network for sentiment analysis of dravidian code-mixed social media posts</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>582</fpage>
          -
          <lpage>590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <article-title>Sentiment classification of code-mixed tweets using bi-directional rnn and language tags</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Mandalam</surname>
          </string-name>
          ,
          <article-title>Bits2020@ dravidian-codemix-fire2020: Sub-word level sentiment analysis of dravidian code mixed data</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fransen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Uld@ nuig at semeval2020 task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text</article-title>
          , arXiv preprint arXiv:2008.01545 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <article-title>Nuig-shubhanker@dravidian-codemix-fire2020: Sentiment analysis of code-mixed dravidian text using xlnet</article-title>
          ,
          <source>in: FIRE</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <article-title>Deep learning to filter sms spam</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>102</volume>
          (
          <year>2020</year>
          )
          <fpage>524</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Overview of the dravidiancodemix 2021 shared task on sentiment detection in tamil, malayalam, and kannada</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation, FIRE 2021</source>
          , Association for Computing Machinery,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vasantharajan</surname>
          </string-name>
          , Findings
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>