<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>NITP-AI-NLP@Dravidian-CodeMix-FIRE2020: A Hybrid CNN and Bi-LSTM Network for Sentiment Analysis of Dravidian Code-Mixed Social Media Posts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <email>abhinavanand05@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Saumya</string-name>
          <email>sunil.saumya@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyoti Prakash Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology Dharwad</institution>
          ,
          <addr-line>Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Technology Patna</institution>
          ,
          <addr-line>Patna</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sentiment analysis is one of the important tasks in natural language processing. The research community has recently proposed many approaches for finding the sentiment of English social media posts. Nevertheless, very little work has addressed sentiment in Dravidian code-mixed Malayalam and Tamil social media comments. In this work, we propose two hybrid neural network models based on a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (Bi-LSTM) network. We use both character and word embeddings of the YouTube comments to learn robust features from the text. The proposed hybrid CNN-CNN network achieved a promising weighted F1-score of 0.69 for Malayalam code-mixed text, whereas the CNN-Bi-LSTM network achieved a promising weighted F1-score of 0.61 for Tamil code-mixed text.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment analysis</kwd>
        <kwd>Code-mixed</kwd>
        <kwd>Tamil</kwd>
        <kwd>Malayalam</kwd>
        <kwd>YouTube</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sentiment analysis helps to recognize opinions or responses on a specific subject. It is one of the
most researched topics in natural language processing because of its significant impact on
applications such as e-commerce, spam detection [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], recommendation systems, and social media
monitoring [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to name a few. English is the most widely preferred and accepted language worldwide and
is very prevalent in the digital world. However, in a country like India, which has over 400 million
internet users, people often use more than one language to communicate their thoughts or emotions,
producing a new code-mixed language [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The issue with code-mixed language is that it
contains more than one script and the constructs of more than one language. Most existing models, trained
to extract sentiment from a single language, fail to capture the semantics of code-mixed language.
Extracting sentiment from code-mixed user-generated text is therefore more difficult because of its
multilingual nature.
      </p>
      <p>
        Recently, the sentiment analysis of code-mixed language [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] has drawn attention from
the research community. Joshi et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] presented a model with subword representation of
code-mixed data and long short-term memory (subword-LSTM) for sentiment analysis of a Hinglish
(Hindi-English) dataset. Priyadharshini et al. [9] used subword representation for named entity
recognition in code-mixed Hindi-English text. A model with a support vector machine that uses
character n-gram features for Bengali-English code-mixed data was reported by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Advani
et al. [10] used logistic regression with handcrafted lexical and semantic features to extract
sentiments from Hinglish and Spanglish (Spanish + English) data. Goswami et al. [11] proposed
a morphological attention model for sentiment analysis on Hinglish data.
      </p>
      <p>Malayalam is one of the Dravidian languages and is spoken in the Indian state of
Kerala. There are almost 38 million Malayalam speakers across the globe. Another well-known
Dravidian language of India’s southern region is Tamil, which is spoken by Tamil people in
India, Singapore, and Sri Lanka [12]. The scripts of both Dravidian languages are alpha-syllabic,
that is, partially alphabetic and partially syllable-based [13]. However, people on social
media frequently use the Roman script because it is easy to type with the keyboards
available on their devices [14]. Thus, for these under-resourced languages, the majority of the
data available on social media is code-mixed.</p>
      <p>The objective of the current study is to extract sentiment from the code-mixed Dravidian
languages Tanglish (Tamil-English) and Manglish (Malayalam-English). The data of the
Dravidian-CodeMix-FIRE2020 challenge [15, 16] was collected from the social media platform YouTube.
Each instance or post in the data typically has one sentence, and in a few cases more than one.
Every instance is labeled with one of five sentiment polarities: positive, negative, mixed emotion,
unknown state, or not in the said Dravidian language. The current paper develops two different
hybrid neural networks based on Convolutional Neural Network (CNN) and Bidirectional
Long Short-Term Memory (Bi-LSTM) networks. In the proposed hybrid models, both character and
word embedding vectors of the text are used to learn robust textual features.</p>
      <p>In the rest of the paper, the dataset description and the proposed methodology are explained in
Section 2. The various experiments and their findings are presented in Section 3. Finally, Section
4 concludes the discussion by highlighting the main findings of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>This section describes the proposed hybrid Convolutional Neural Network (CNN) and
Bidirectional Long Short-Term Memory (Bi-LSTM) networks in detail. We
propose two different hybrid deep neural network models. (i) CNN (c) + CNN (w): in
this model, two parallel CNN networks extract the character-level (c) and word-level (w)
features from the text. The character embedding of the text is the input to the first CNN network,
whereas the word embedding of the text is the input to the second CNN network.
The model diagram for the hybrid CNN (c) + CNN (w) is shown
in Figure 1. (ii) CNN (c) + Bi-LSTM (w): in this model, as in the previous CNN
(c) + CNN (w) model, the character embedding is the input to a CNN, whereas the word embedding
is the input to a Bi-LSTM network. The model diagram for the hybrid CNN (c) + Bi-LSTM
(w) is shown in Figure ??. The number of layers,
parameters, and type of embedding are described in detail in Sections 2.2 and 2.3.</p>
      <p>For data pre-processing, we collapsed multiple spaces between words into a single space
and replaced the symbols &amp; and @ with their English words ‘and’ and ‘at’, respectively. We also
replaced numeric values with their corresponding English words (e.g., ‘1’ is replaced by ‘one’, ‘2’
by ‘two’, and so on). The data statistics for the given task are presented in Table 1.</p>
      <p>[Figure: model architecture diagram. Recoverable labels: CNN, CNN, Dense (128), Concatenated layer (640); output classes: Positive, Negative, Not_Tamil, Unknown.]</p>
      <p>2.1. Character and Word embedding vectors</p>
      <p>Each character of the YouTube comments is encoded as a one-hot vector to obtain the character
embedding. We fixed the length of each post at 200 characters. Our character vocabulary contains
seventy different characters, including letters, numbers, and special symbols. Therefore, each
social media post is converted into a 200 × 70 dimensional character embedding matrix, which
the CNN network then uses for its convolution process. For word embedding,
we trained a FastText model on the Tamil code-mixed and Malayalam code-mixed corpora
separately. Each word of the corpus is converted into a 100-dimensional vector. In our case, we
fixed the length of each YouTube comment at 30 words. Therefore, each post is converted into a 30 ×
100 dimensional word embedding matrix. These character and word embedding
matrices are then used in our proposed hybrid models.</p>
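      <p>The pre-processing and encoding steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names, the digit-by-digit spelling of numbers, the empty padding token, and building the character vocabulary from the comments themselves are all assumptions. The FastText lookup that turns each of the 30 tokens into a 100-dimensional vector is left out.</p>
      <preformat>
```python
import re

MAX_CHARS = 200  # characters kept per comment (from the text)
MAX_WORDS = 30   # words kept per comment (from the text)
AMPERSAND = chr(38)  # the ampersand, built with chr() so the listing stays XML-safe
# Assumption: numbers are spelled out digit by digit; the text only gives
# the single-digit examples '1' to 'one' and '2' to 'two'.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def preprocess(comment):
    """Replace symbols and digits with English words, then collapse spaces."""
    text = comment.replace(AMPERSAND, " and ").replace("@", " at ")
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    # Trimming the ends is an extra tidy-up step, not stated in the text.
    return re.sub(r" {2,}", " ", text).strip()


def build_char_vocab(comments):
    """Map every distinct character to a column index (the paper found 70)."""
    chars = sorted({ch for comment in comments for ch in comment})
    return {ch: index for index, ch in enumerate(chars)}


def one_hot_chars(comment, vocab, max_chars=MAX_CHARS):
    """Return a max_chars x len(vocab) matrix of 0/1 values."""
    matrix = [[0] * len(vocab) for _ in range(max_chars)]
    for row, ch in enumerate(comment[:max_chars]):
        col = vocab.get(ch)
        if col is not None:  # unknown characters stay all-zero (assumption)
            matrix[row][col] = 1
    return matrix


def pad_words(comment, max_words=MAX_WORDS, pad_token=""):
    """Truncate or pad the token list to exactly max_words entries."""
    tokens = comment.split()[:max_words]
    return tokens + [pad_token] * (max_words - len(tokens))


# Made-up comments, used only to show the shapes.
comments = [preprocess(c) for c in ["Super   movie @ 2 pm", "Trailer songs mass"]]
vocab = build_char_vocab(comments)
matrix = one_hot_chars(comments[0], vocab)
print(len(matrix), len(matrix[0]))  # 200 rows, one column per vocabulary entry
print(pad_words(comments[0])[:6])
```
      </preformat>
      <p>In the setting of the paper, each 200 × 70 character matrix would feed the CNN branch, and each 30-token list would be mapped to a 30 × 100 matrix by the trained FastText model.</p>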
      <p>[Figure and table residue: confusion-matrix axis “Predicted label” with classes not-malayalam and unknown_state; table columns CNN (c) + CNN (w) and CNN (c) + Bi-LSTM (w).]</p>
      <p>2.2. CNN (c) + CNN (w) model</p>
      <p>The overall diagram of the hybrid CNN (c) + CNN (w) model is shown in Figure 1. Two
parallel CNN networks are used: one processes the character embedding matrix and the other
processes the word embedding matrix. The character embedding matrix is processed by two CNN
layers. In the first CNN layer, 128 filters of 2-gram, 3-gram, and 4-gram sizes are used, whereas
in the second CNN layer, 128 filters of 1-gram size are used. Similarly, the word embedding
matrix is processed by two CNN layers. In the first CNN layer, 1024 filters of 2-gram, 3-gram, and
4-gram sizes are used, whereas in the next CNN layer, 512 filters of 1-gram size are used.
The flattened vectors from both parallel CNN networks are then concatenated and passed to a
dense layer having 256 neurons. Finally, the output of the dense layer is passed to a softmax layer
to obtain the class probabilities. As the performance of deep neural networks is very sensitive
to the selected hyper-parameters, we experimented by varying the batch size, dropout rate,
pooling window, and number of epochs. The best-suited hyper-parameters of the proposed CNN (c) + CNN
(w) model are listed in Table 2.</p>
      <p>2.3. CNN (c) + Bi-LSTM (w) model</p>
      <p>The overall diagram of the hybrid CNN (c) + Bi-LSTM (w) model is shown in Figure ??.
The CNN network of the proposed hybrid CNN (c) + Bi-LSTM (w) model takes the character
embedding matrix as input. In the first CNN layer, 128 filters of 2-gram, 3-gram, and 4-gram sizes
are used, whereas in the second CNN layer, 128 filters of 1-gram size are used.
The features extracted by the two consecutive CNN layers are then passed through a dense layer
having 128 neurons. Similarly, the word embedding is given as input to two Bi-LSTM layers with
a 512-dimensional output vector at the first layer and a 256-dimensional output vector at the
second layer. The second Bi-LSTM layer’s output vector is then concatenated with the
128-dimensional output vector of the CNN (followed by dense) model, as can be seen in Figure ??.
Finally, the concatenated vector is passed through a softmax layer to obtain the class probabilities.
The best-suited hyper-parameters for the proposed CNN (c) + Bi-LSTM (w) model are listed in Table
2. A detailed description of the CNN and Bi-LSTM networks can be found in [17, 18, 19, 20, 21].</p>
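      <p>The n-gram filters used by both hybrid models are one-dimensional convolutions over the rows of an embedding matrix. The sketch below shows a single filter and the final softmax step in plain Python. Convolution without padding and the ReLU activation are assumptions, since the text does not state them, and the function names and toy values are purely illustrative.</p>
      <preformat>
```python
import math


def conv1d_valid(matrix, kernel):
    """Slide one n-gram filter over the rows of matrix, without padding.

    matrix: sequence_length x embedding_dim (list of rows)
    kernel: n x embedding_dim weights of a single filter
    Returns sequence_length - n + 1 activations.
    """
    n = len(kernel)
    outputs = []
    for start in range(len(matrix) - n + 1):
        total = 0.0
        for offset in range(n):
            row, weights = matrix[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(row, weights))
        outputs.append(max(0.0, total))  # ReLU activation (assumption)
    return outputs


def softmax(scores):
    """Turn the final-layer scores into class probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 5 positions with 3-dimensional one-hot rows. The paper uses
# 200 x 70 character matrices and 30 x 100 word matrices instead.
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
for n in (2, 3, 4):
    kernel = [[1.0, 1.0, 1.0] for _ in range(n)]
    print(n, len(conv1d_valid(toy, kernel)))

# Five made-up scores, one per sentiment class.
print(softmax([2.0, 1.0, 0.5, 0.1, 0.1]))
```
      </preformat>
      <p>Under the no-padding assumption, an n-gram filter over a 200-row character matrix produces 200 - n + 1 activations, that is, 199, 198, and 197 values for the 2-gram, 3-gram, and 4-gram filters, respectively.</p>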
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>In the Dravidian-CodeMix-FIRE2020 task, participants had to classify
code-mixed Tamil and Malayalam social media posts written in Roman script into five different
sentiment classes: (i) Mixed feelings, (ii) Positive, (iii) Negative, (iv) Not related to that language
(Not-Tamil/Not-Malayalam), and (v) Unknown state. The results on code-mixed Malayalam
posts for both the CNN (c) + CNN (w) and CNN (c) + Bi-LSTM (w) models are listed in Table
3. Comparing the two proposed models, we found that CNN (c) + CNN
(w) performed better on code-mixed Malayalam posts, with a precision, recall, and F1-score of
0.69. The confusion matrix and ROC curve for the code-mixed Malayalam posts are shown in
Figures 3 and 4, respectively.</p>
      <p>The results on code-mixed Tamil posts for both the CNN (c) + CNN (w) and CNN (c) +
Bi-LSTM (w) models are also listed in Table 3. The proposed CNN (c) + Bi-LSTM (w) model performed
better than the other model and achieved a precision of 0.59, a recall of 0.64, and an
F1-score of 0.61. The confusion matrix and ROC curve for the code-mixed Tamil posts are
shown in Figures 5 and 6, respectively.</p>
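      <p>Weighted scores such as those reported above are commonly computed by averaging the per-class precision, recall, and F1-score, with each class weighted by its number of true instances. The text does not spell out the exact procedure, so the following is a sketch of that usual definition. The function name and the toy labels are illustrative and are not taken from the task data.</p>
      <preformat>
```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        p_c = tp / predicted if predicted else 0.0  # class never predicted
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / total  # share of true instances in this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy labels, made up for illustration only.
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
print(weighted_scores(y_true, y_pred))
```
      </preformat>
      <p>Because each class is weighted by its support, frequent classes dominate the averages, which matters when the class distribution is uneven.</p>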
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Sentiment analysis of textual content has significant uses in various natural language
processing tasks. In this work, we proposed two hybrid deep neural networks based on CNN
and Bi-LSTM networks. We used both character and word embedding vectors in the proposed
hybrid CNN (c) + CNN (w) and CNN (c) + Bi-LSTM (w) models, which achieved promising
performance in the classification of code-mixed Malayalam and Tamil YouTube comments. The
proposed CNN (c) + CNN (w) network achieved a weighted F1-score of 0.69 for Malayalam
code-mixed text, whereas the CNN (c) + Bi-LSTM (w) network achieved a weighted F1-score
of 0.61 for Tamil code-mixed text.</p>
      <ref-list>
        <ref id="ref9">
          <mixed-citation>[9] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named entity recognition for code-mixed Indian corpus using meta embedding, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020.</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>[10] L. Advani, C. Lu, S. Maharjan, C1 at semeval-2020 task 9: Sentimix: Sentiment analysis for code-mixed social media text using feature engineering, arXiv preprint arXiv:2008.13549 (2020).</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] K. Goswami, P. Rani, B. R. Chakravarthi, T. Fransen, J. P. McCrae, Uld@ nuig at semeval-2020 task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text, arXiv preprint arXiv:2008.01545 (2020).</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>[12] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages, in: 2nd Conference on Language, Data and Knowledge (LDK 2019), volume 70 of OpenAccess Series in Informatics (OASIcs), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2019, pp. 6:1–6:14. URL: http://drops.dagstuhl.de/opus/volltexte/2019/10370. doi:10.4230/OASIcs.LDK.2019.6.</mixed-citation>
        </ref>
        <ref id="ref14">
          <mixed-citation>[14] A. Kumar, J. P. Singh, S. Saumya, A comparative analysis of machine learning techniques for disaster-related tweet classification, in: 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC)(47129), IEEE, 2019, pp. 222–227.</mixed-citation>
        </ref>
        <ref id="ref15">
          <mixed-citation>[15] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.</mixed-citation>
        </ref>
        <ref id="ref16">
          <mixed-citation>[16] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.</mixed-citation>
        </ref>
        <ref id="ref17">
          <mixed-citation>[17] A. Kumar, J. P. Singh, Location reference identification from tweets during emergencies: A deep learning approach, International journal of disaster risk reduction 33 (2019) 365–375.</mixed-citation>
        </ref>
        <ref id="ref18">
          <mixed-citation>[18] S. Saumya, J. P. Singh, Y. K. Dwivedi, Predicting the helpfulness score of online reviews using convolutional neural network, Soft Computing 24 (2020) 10989–11005, https://doi.org/10.1007/s00500-019-03851-5.</mixed-citation>
        </ref>
        <ref id="ref19">
          <mixed-citation>[19] A. Kumar, J. P. Singh, Y. K. Dwivedi, N. P. Rana, A deep multi-modal neural network for informative twitter content classification during emergencies, Annals of Operations Research (2020) 1–32.</mixed-citation>
        </ref>
        <ref id="ref20">
          <mixed-citation>[20] J. P. Singh, A. Kumar, N. P. Rana, Y. K. Dwivedi, Attention-based lstm network for rumor veracity estimation of tweets, Information Systems Frontiers (2020) 1–16.</mixed-citation>
        </ref>
        <ref id="ref21">
          <mixed-citation>[21] Z. Huang, W. Xu, K. Yu, Bidirectional lstm-crf models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Spam review detection using lstm autoencoder: an unsupervised approach</article-title>
          ,
          <source>Electronic Commerce Research</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          , doi.org/10.1007/s10660-020-09413-4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Detection of spam reviews: A sentiment analysis approach</article-title>
          ,
          <source>CSI Transactions on ICT 6</source>
          (
          <year>2018</year>
          )
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting stock movements using social network</article-title>
          , in: Conference on e-Business, e-Services and e-Society, Springer,
          <year>2016</year>
          , pp.
          <fpage>567</fpage>
          -
          <lpage>572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Patra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of code-mixed indian languages: An overview of sail_code-mixed shared task@ icon-2017</article-title>
          , arXiv preprint arXiv:1803.06745
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Working Notes of FIRE 2020, CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th FIRE, FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text</article-title>
          ,
          <source>in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2482</fpage>
          -
          <lpage>2491</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>