<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A. S); varadhaganapathy@gmail.com (V. S)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SA-SVG@Dravidian-CodeMix-FIRE2020: Deep Learning Based Sentiment Analysis in Code-mixed Tamil-English Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anbukkarasi S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Varadhaganapathy S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Erode, Tamilnadu</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Velalar College of Engineering and Technology</institution>
          ,
          <addr-line>Erode, Tamilnadu</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Sentiment Analysis (SA) is the process of identifying the opinions and thoughts expressed about a particular context. People now tend to share their feelings, emotions and ideas through social media applications such as Facebook, Twitter and Instagram. One advantage of these applications is that users can interact with other people in code-mixed text. Analysing sentiment in code-mixed data is more difficult than in monolingual text because code-switching increases the complexity. For the experiments, we use the dataset provided by Dravidian-CodeMix-FIRE2020, which contains code-mixed YouTube comments. A deep learning based Bi-LSTM model is used for classification. F1-score, precision and recall are used as evaluation metrics. Our code is published on GitHub at https://github.com/AnbukkarasiS/Dravadian-Codemix.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Bi-LSTM</kwd>
        <kwd>Code-Mixed</kwd>
        <kwd>Tamil</kwd>
        <kwd>Dravidian languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sentiment analysis is an emerging topic in the Natural Language Processing domain. It is
the process of identifying emotions such as happiness, sadness and depression in a given text.
This analysis is useful for product reviews, movie reviews and similar material, and it helps
people decide on tasks such as watching a movie, purchasing a product or choosing a book. In
countries like India, people speak different languages in different regions, and even within a
single state. The emergence of social media applications such as Facebook, Twitter and
YouTube makes the sentiment analysis task harder, as many users write
code-mixed text to express their views. Code-mixed text combines more than one language and is
often not written in the native script, which makes it difficult to assign such text to
a particular class [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. A sentence might be written entirely in the
Roman script, or it might be a mixture of the languages.
      </p>
      <p>
        Hence the SA task on code-mixed data has to be handled carefully. In this paper we propose
a deep learning based model for sentiment analysis in code-mixed data. The given text is
classified as positive, negative, neutral, mixed-feeling or non-Tamil. The shared task covers the
Tamil and Malayalam languages, which are closely related members of the Dravidian language family
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The dataset is provided by the shared task Dravidian-CodeMix-FIRE2020. For code-mixed
Tamil-English text it consists of 15,744 YouTube comments: 11,335 for training, 1,260 for validation
and 3,149 for testing [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ]. Sample data with a rough translation in
English is given below:
• Positive: Sema thalaiva endrum ungalukku visvasamana fans thaan (Super hero, always your loyal fans)
• Negative: nanum vjs fan than ..enaku pudikala.. trailer (I am a vjs fan too, but I don’t like the trailer)
• Neutral: Kaithiye ippadi irukuna,thalapathy 64 eppadi irukuna.VERITHANAMA IRUKUM (Even kaidhi looks good, so thalapathy 64 would be more rocking)
• Mixed-Feelings: Rajinikula expiry date mudinju over expiry akiruju (expiry date is over for Rajini) Title track ah vida trailer bgm sema.. yuvan (Trailer bgm is superb when compared to title track ..Yuvan)
• Non-Tamil: Seems like remake movie, amithab tapsee etc
      </p>
      <p>Our aim in this paper is to classify the given YouTube comments as positive, negative,
neutral, mixed-feeling or non-Tamil. The details of the dataset are given in Table 1.</p>
      <p>This paper is organised as follows. Section 2 describes the related work. Section 3 specifies
the proposed methodology. Section 4 presents the results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Very few works have addressed sentiment analysis in code-mixed data.</p>
      <p>LSTM based sentiment analysis of tweets is performed in [8]. The authors use
combined character based splitting to feed the model, and they report an accuracy of 86.2%
with a Bi-LSTM model and 77.2% with an LSTM. RNN
based sentiment analysis of micro texts is performed in [9], working in the native script
of Tamil, Hindi and Bengali. This system achieved accuracies of 88.23%, 72.01% and 65.16%
for Tamil, Hindi and Bengali respectively. The authors note that unsupervised
data could be included in the model as future work.</p>
      <p>Vijay et al. created a corpus and performed sentiment analysis on Hindi-English [10, 11]
code-mixed social media text using a Support Vector Machine (SVM) [12, 13]. Without including any
features, they achieved an accuracy of 58.2%. They claimed that character based classification
increases the accuracy of the model significantly. The corpus lacks part-of-speech
annotation, and the dataset is small.</p>
      <p>
        Shallow morphological parsers are used for analysing sentiment in online documents [14].
Based on the parsers, a recursive network over binary parse trees is used. For long phrase sentences
an accuracy of 71.1% was achieved. A corpus for Tamil-English code-mixed data has been created
by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This is the gold standard corpus for Tamil-English code-mixed data. For this work,
several annotators labelled the class of each item. The authors created baseline models
with logistic regression, K-nearest neighbours, a 1-dimensional convolutional network and other methods.
      </p>
      <p>
        A sentiment analysis dataset has been created for code-mixed Malayalam-English text
by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The corpus contains around 7,743 sentences of code-mixed
Malayalam-English text. For the baseline models, the authors used various machine learning approaches such as
logistic regression (LR), support vector machine (SVM), decision tree (DT), random forest
(RF), multinomial naive Bayes (MNB) and K-nearest neighbours (KNN).
      </p>
      <p>In the proposed work, a Bi-LSTM model is used for sentiment analysis in code-mixed
Tamil-English text.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>This section describes the proposed methodology for classifying the given text as positive,
negative, neutral, mixed-feeling or non-Tamil. Figure 1 depicts the basic structure of the overall
system. In our approach, a deep learning based Bi-LSTM model is used for
classification, and the implementation uses the TensorFlow package. The methodology consists of three steps:
• Pre-processing
• Feature Extraction
• Classification</p>
      <p>Pre-Processing</p>
      <p>The input data contains special symbols such as ‘@’ and ‘...’, so the first step is to
clean the data. Symbols and special characters are removed from the given text before it is fed to
the system.</p>
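      <p>The cleaning step can be illustrated with a minimal sketch. The exact cleaning rules are not listed in this paper, so the sketch assumes that every character other than a letter, digit, underscore or whitespace counts as a special symbol. The function name clean_comment and the sample input are hypothetical.</p>
      <preformat>
```python
import re

# Assumption: anything that is not a word character or whitespace is a "special symbol".
_NON_WORD = re.compile(r"[^\w\s]", flags=re.UNICODE)
_SPACES = re.compile(r"\s+")


def clean_comment(text):
    """Replace symbols with spaces, then collapse runs of whitespace."""
    without_symbols = _NON_WORD.sub(" ", text)
    return _SPACES.sub(" ", without_symbols).strip()


# Hypothetical input built from the Negative sample comment shown in Section 1.
cleaned = clean_comment("@user nanum vjs fan than ..enaku pudikala.. trailer")
print(cleaned)
```
      </preformat>
      <p>Replacing symbols with a space rather than deleting them keeps words that were separated only by punctuation, such as "than ..enaku", from being merged into one token.</p>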
      <p>Feature Extraction</p>
      <p>Deep learning models accept only numerical input, so text cannot
be fed to them directly. Hence, the given code-mixed text is first converted into numbers. The
input sentence is broken into tokens, a step known as tokenization, and each token is then
mapped to a number. As the resulting sequences differ in length, they
are made equal by padding: zeros are appended at the right end of each input
sequence so that all inputs have the same size. Finally, this input is fed to the Bi-LSTM model.
The step is implemented with the TensorFlow package, and the code is available in [15].</p>
      <p>Classification</p>
      <p>In this phase, the given input sequences are classified into the corresponding output class. For
classification, an RNN based bidirectional Long Short Term Memory (Bi-LSTM) model is used with the TensorFlow package.
The model labels each code-mixed YouTube comment as Positive, Negative,
Mixed-Feeling, Neutral or non-Tamil.</p>
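      <p>Before classification, every comment passes through the tokenization, number mapping and right-padding steps described under Feature Extraction. The following plain-Python sketch shows those steps under stated assumptions: tokens are obtained by whitespace splitting, each token receives an integer id in order of first appearance, and id 0 is reserved for padding. The helper names build_vocab, encode and pad_right are hypothetical and are not taken from the released code.</p>
      <preformat>
```python
def build_vocab(token_lists):
    """Assign each distinct token an integer id, reserving 0 for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(tokens, vocab):
    """Map a list of tokens to their integer ids."""
    return [vocab[token] for token in tokens]


def pad_right(sequences, pad_id=0):
    """Append pad_id at the right end so every sequence has the same length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical, already cleaned comments.
comments = ["sema thalaiva", "nanum vjs fan than", "trailer sema"]
token_lists = [comment.split() for comment in comments]  # whitespace tokenization
vocab = build_vocab(token_lists)
padded = pad_right([encode(tokens, vocab) for tokens in token_lists])
print(padded)
```
      </preformat>
      <p>TensorFlow's Keras utilities (Tokenizer, and pad_sequences with padding="post") provide comparable functionality; which utilities the released code uses is not stated in this paper.</p>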
      <p>The various parameters used in the Bi-LSTM model are given in Table 2.</p>
      <p>The details of the Bi-LSTM model are shown in Figure 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>To study performance on the Tamil-English code-mixed data, the Bi-LSTM
classifier is used with the parameters given in Table 2. We achieved a weighted average
F-score of 0.10 and ranked 14th in Tamil-English. The detailed results are given in Table 3.</p>
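      <p>For reference, a weighted average F-score averages the per-class F1 scores, weighting each class by its number of true instances. The sketch below illustrates that standard definition in plain Python. It is not the shared task's evaluation script, and the labels in the example are invented for illustration only.</p>
      <preformat>
```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        # True positives: items of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Invented labels, for illustration only.
y_true = ["Positive", "Positive", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Positive"]
print(round(weighted_f1(y_true, y_pred), 4))
```
      </preformat>
      <p>Because each class is weighted by its support, frequent classes dominate this score when the class distribution is imbalanced.</p>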
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The proposed system classifies the given YouTube comments as Positive,
Negative, Mixed-Feeling, Neutral or non-Tamil on the code-mixed data provided by the
Dravidian-CodeMix-FIRE2020 task. The system uses a Bi-LSTM model for classification. In
future, the model could be improved by parameter tuning, and other neural network models
such as GRU with an attention mechanism could also be used.
</p>
      <p>[7] B. R. Chakravarthi, R. Priyadharshini, B. Stearns, A. Jayapal, S. S, M. Arcan, M. Zarrouk, J. P. McCrae, Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation, Dublin, Ireland, 2019, pp. 56–63. URL: https://www.aclweb.org/anthology/W19-6809.</p>
      <p>[8] S. S. Anbukkarasi, Analyzing sentiment in tamil tweets using deep neural network, IEEE Xplore (2020).</p>
      <p>[9] S. K. P. Shriya Seshadri, Anand Kumar Madasamy, Analyzing sentiment in indian languages micro text using recurrent neural network, IIOABJ (2016).</p>
      <p>[10] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current datasets for code-switching research, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 136–141.</p>
      <p>[11] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named entity recognition for code-mixed indian corpus using meta embedding, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 68–72.</p>
      <p>[12] D. Vijay, A. Bohra, V. Singh, S. S. Akhtar, M. Shrivastava, Corpus creation and emotion prediction for hindi-english code-mixed social media text, in: NAACL-HLT 2018, NAACL, 2018, pp. 128–135.</p>
      <p>[13] P. Ranjan, B. Raja, R. Priyadharshini, R. C. Balabantaray, A comparative study on code-mixed data of Indian social media vs formal text, in: 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), 2016, pp. 608–611. doi:10.1109/IC3I.2016.7918035.</p>
      <p>[14] R. Padmamala, V. Prema, Sentiment analysis of online tamil contents using recursive neural network models approach for tamil language, in: 2017 IEEE International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), IEEE, 2018, pp. 28–31.</p>
      <p>[15] A. S, Github Code, 2020 (accessed October 15, 2020). URL: https://github.com/AnbukkarasiS/Dravadian-Codemix.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Wordnet gloss translation for under-resourced languages using multilingual neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of different orthographies for machine translation of under-resourced Dravidian languages</article-title>
          ,
          <source>in: 2nd Conference on Language, Data and Knowledge (LDK</source>
          <year>2019</year>
          ),
          <source>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of orthographic information in machine translation</article-title>
          , arXiv preprint arXiv:2008.01391 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Bilingual lexicon induction across orthographically-distinct under-resourced Dravidian languages</article-title>
          ,
          <source>in: Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          , Barcelona, Spain,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>WordNet gloss translation for under-resourced languages using multilingual neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://www.aclweb.org/anthology/W19-7101.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>