<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>bits2020@Dravidian-CodeMix-FIRE2020: Sub-Word Level Sentiment Analysis of Dravidian Code Mixed Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>YashvardhanSharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asrita VenkataMandalam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani</institution>
          ,
          <addr-line>Pilani Campus</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity in the evaluation of the track 'Sentiment Analysis for Davidian Languages in Code-Mixed Text' proposed by Forum of Information Retrieval Evaluation in 2020. The implemented method used a sub-word level representation to capture the sentiment of the text. Using a Long Short Term Memory (LSTM) network along with language-specific preprocessing, the model classified the text according to its polarity. With F1-scores of 0.61 and 0.60, the model achieved an overall rank of 5 and 12 in the Tamil and Malayalam tasks respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Sentiment Analysis</kwd>
        <kwd>Recurrent Neural Networks</kwd>
        <kwd>Sub-word Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>websites for both Tamil and Malayalam. Each team had to submit a set of predicted sentiments
for the Tamil-English and Malayalam-English mixed test se3t]s.[</p>
      <p>Along with language specific preprocessing techniques, the implemented model makes use
of sub-word level representations to incorporate features at the morpheme level, the smallest
meaningful unit of any language. Evaluated by F1-score, the presented approach achieved the
5th highest score in the Tamil task and the 12th rank in the Malayalam task. The implemented
model is available on Githu1b.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Analysing the sentiment of code-mixed data is important as traditional methods fail when given
such data. Barman et al.4[] concluded that n-grams proved to be useful in their experiments
that involved multiple languages with Roman script.</p>
      <p>Qurat Tul Ain et al5.[] wrote a review paper that highlighted studies regarding the
implementation of deep learning models in sentiment analysis models. Due to more hidden layers, deep
learning models extracted data that had heavier weights and used those features to proceed.</p>
      <p>Bojanowski et al. 6[] used character n-grams in their skip-gram model. The lack of
preprocessing resulted in a shorter training time and outperformed baselines that did not consider
sub-word information. Joshi et al.7[] outperformed existing systems as well by using a sub-word
based LSTM architecture. Their dataset consisted of 15% negative, 50% neutral and 35% positive
comments. As their dataset was imbalanced like the one used in this paper, the submitted
approach involved morpheme extraction as it would help in identifying the polarity of the dataset.
In more recent work, Jose et al.8[] surveyed publicly available code-mixed datasets. They noted
statistics about each dataset such as vocabulary size and sentence length. Priyadharshini et
al. [9] used embeddings of closely related languages of the code-mixed corpus to predict Named
Entities of the same corpus.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The model has been trained, validated and tested using the Tam1il0][ and Malayalam1[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
datasets provided by the organizers of the Dravidian Code-Mix FIRE 2020 task. The Tamil
code-mix dataset consists of 11,335 comments for the train set, 1,260 for the validation set and
3,149 comments for testing the model. In the Malayalam code-mix dataset, there are 4,851
1https://github.com/avmand/SA_Dravidian
comments for training, 541 for validating and 1,348 for testing the model. Tab1lgeives the
distribution of each sentiment in each dataset.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Technique</title>
      <p>Word-level models such as Word2Vec12[] and GloVe [13] are popularly used in a variety of NLP
tasks. However, they did not seem to be suited for a code-mixed dataset. It is not possible to use
a word-level model due to the sparsity of words in the dataset. The implemented approach uses
a sub-word level model as it accounts for words that have a similar morpheme. For example, in
the Tamil dataset,Ivan, Ivanga and Ivana have similar meanings due to their root worIvdan.</p>
      <p>First, the dataset is preprocessed to replace all emojis with their corresponding description in
English. As the dataset contains both Roman and Tamil (or Malayalam) characters, the latter is
replaced with its corresponding Roman script representation.</p>
      <p>From the preprocessed data, a set of characters was obtained . The input to the model is a
set of character embeddings. The sub-word level representation is generated through a 1-D
convolution layer with activation as ReLU, size of convolutional window as 5 and number of
output filters as 128. After getting a morpheme-level feature map, a 1-D maximum pooling layer
is used to obtain its most prominent features. To obtain the connections between each of these
features, LSTMs are used due to their ability to process sequences and retain information. The
ifrst and second LSTM layers have a dropout of 0.4 and 0.2 respectively. Finally, it is passed to a
fully connected layer. Batch normalization has been used in the model to prevent overfitting.
While training the model, early stopping has been utilized to stop training when the validation
loss shows no improvement after 4 epochs. The training data was shufled before each epoch.
Figure1 gives a representation of the discussed methodology.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Result</title>
      <p>The submitted run achieved a rank of 5 and 12 for the Tamil and Malayalam tasks respectively.
The final rank was evaluated based on the weighted average F1-score. The classification report
is shown in Table2. The Tamil task received Precision, Recall and an F1-score of 0.62, 0.66 and
0.61 respectively. For the Malayalam task, the submission received scores of 0.67, 0.59 and 0.60
respectively.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Error Analysis</title>
      <sec id="sec-6-1">
        <title>6.1. Tamil Task</title>
        <p>From Table2, one can see that the F1-score of the positive comments is the highest with a
value of 0.80. The next highest score is only at 0.46, attained by the class of comments that are
not in Tamil. The order of classes from the highest to the lowest F1-scores are Positive, Not
Tamil, Negative, Mixed Feelings and Unknown State. The weighted F1-score is lower than the
Precision and Recall as the weighted score takes into account the proportion of each class in
the dataset.</p>
        <p>Due to the higher number of positive comments in the overall dataset, it is not surprising
that the model trains well and produces the best results for that class. Non-Tamil comments
get the next highest score due to the diferent morphemes used in them. These comments are
usually in a diferent Indian language like Hindi or Telugu and are written using the English
alphabet. Some comments are written in the script of their respective language. This class does
not achieve a higher score due to words that they have in common with the Tamil-English
codemixed tweets such asRajinikanth and Thalaiva. The same can be concluded for the negative
label as well as it had many words that were common with those of the positive comments.
Comments from the mixed feelings class were misclassified as either positive or negative. They
were not misclassified as comments from the unknown state class possibly due to the relatively
lower ratio of unknown comments as compared to the positive and negative classes. As these
comments contained both positive and negative sentiments, there was a much higher chance of
them being classified into one of those classes. The unknown state class receives the lowest
F1-score. Its precision is 0.67 but its recall is very low at 0.01. This implies that there is a high
false negative rate and is because all of the comments use words from the Tamil vocabulary.
Most of those words are common with those of the positive class. Figur2egives a representation
of the misclassified Tamil comments.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Malayalam Task</title>
        <p>The classification report of the Malayalam task can be seen in Tab2l.eThe F1-score of the
positive comments is the highest at 0.73. The next highest is at 0.63, for the class of comments
that are not in Malayalam. The order of classes from the highest to the lowest F1-scores are
Positive, Not Malayalam, Unknown State, Negative and Mixed Feelings.</p>
        <p>The Malayalam dataset was more balanced as compared to the Tamil dataset with the second
largest class less than 1000 comments behind the largest one. Similar to the Tamil dataset, the
positive class has the highest number of comments. This led to the relatively higher F1-score
and an equally low false positive and false negative rate. For the class of comments that were not
in Malayalam, the classifier identified all of the comments that were not written in the Roman
or Malayalam script. However, words that were commonly found in positive comments, such
as names of Malayalam actors, and were used with English words were classified incorrectly.
For the unknown state class, it is noted that the misclassified comments were majorly assigned
a positive, negative or mixed feelings tag. Although the overall sentiment of the sentence was
unknown, a portion of that sentence had similarities with one of the other classes. For the
mixed feelings class, the same was deduced and most of the wrongly classified comments were
assigned either a positive or negative tag. Most of the misclassified comments from the negative
class were labelled as comments with mixed feelings. Sarcastic comments that used positive
words but implied negative sentiments were not accounted for by the model. The distribution
of misclassified comments can be seen in Figure 3.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presents the submitted approach for the Sentiment Analysis for Dravidian Languages
in Code-Mixed Text track of Forum for Information Retrieval Evaluation (FIRE) 2020. The model
implemented sub-word level representations along with LSTM networks. Morpheme level
representations have proven to be useful in code-mixed data as they club words with similar
root words and meanings together. The results show that the positive class in each dataset
receives the highest F1-scores. This is possibly due to the higher ratio of the same as compared
to the rest of the classes. Comments that were not in the language of their dataset received
the next highest score as their vocabulary included sub-words that were not a part of the
rest of the datasets. For future work, a sarcasm detection feature could be included to avoid
misclassification.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The authors would like to convey their sincere thanks to the Department of Science and
Technology (ICPS Division), New Delhi, India, for providing financial assistance under the
Data Science (DS) Research of Interdisciplinary Cyber Physical Systems (ICPS) Programme
[DST/ICPS/CLUSTER/Data Science/2018/Proposal-16:(T-856)] at the department of computer
science, Birla Institute of Technology and Science, Pilani, India.
26th International Conference on Computational Linguistics: Technical Papers, 2016, pp.
2482–2491.
[8] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current
datasets for code-switching research, in: 2020 6th International Conference on Advanced
Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 136–141.
[9] R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, J. P. McCrae, Named entity recognition
for code-mixed indian corpus using meta embedding, in: 2020 6th International Conference
on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 68–72.
[10] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
and Collaboration and Computing for Under-Resourced Languages (CCURL), European
Language Resources association, Marseille, France, 2020, pp. 202–210. UhRtLtp:s://www.
aclweb.org/anthology/2020.sltu-1.2.8
[11] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis
dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
and Computing for Under-Resourced Languages (CCURL), European Language Resources
association, Marseille, France, 2020, pp. 177–184. URhLt:tps://www.aclweb.org/anthology/
2020.sltu-1.25.
[12] T. Mikolov, K. Chen, G. Corrado, J. Dean, Eficient estimation of word representations in
vector space, arXiv preprint arXiv:1301.3781 (2013).
[13] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in:
Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), 2014, pp. 1532–1543.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020)</article-title>
          . CEUR Workshop Proceedings. In: CEUR-WS. org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>U.</given-names>
            <surname>Barman</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Foster</surname>
          </string-name>
          ,
          <article-title>Code mixing: A challenge for language identification in the language of social media</article-title>
          ,
          <source>in: Proceedings of the first workshop on computational approaches to code switching</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q. T.</given-names>
            <surname>Ain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Riaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noureen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kamran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hayat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rehman</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis using deep learning techniques: a review</article-title>
          ,
          <source>Int J Adv Comput Sci Appl</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>424</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (</article-title>
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text</article-title>
          ,
          <source>in: Proceedings of COLING</source>
          <year>2016</year>
          , the
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>