<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TADS@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on CodeMix Dravidian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deepesh Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIIT Kottayam</institution>
          ,
          <addr-line>Kerala</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Sentiment analysis of social media has received much attention in recent research. Social media will be the biggest source of big data in the upcoming years, so sentiment analysis of social media content is very important for regulating it. The FIRE 2020 organizers provided participants with annotated data-sets containing comments on YouTube videos in Malayalam and Tamil (including code-mixing). We approached the problem using classic machine learning algorithms for classification, namely SVM, Perceptron, and logistic regression.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Dravidian language</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Data-set</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] Malayalam is one of the Dravidian languages spoken in the southern region of India,
with nearly 38 million speakers in India and other countries.
      </p>
      <p>Tamil is a Dravidian language natively spoken by the Tamil people of India and Sri Lanka.
Tamil is the official language of the South Indian state of Tamil Nadu, as well as of two sovereign
states, Sri Lanka and Singapore. For this shared task, the organizers provided a new
gold standard corpus for sentiment analysis of code-mixed text in Dravidian
languages (Malayalam-English and Tamil-English). The data-set consists of YouTube comments,
each labelled as one of the following: ’Positive’, ’Negative’, ’Mixed feelings’, ’unknown
state’, or ’not-Tamil’.</p>
      <p>The distribution of the dataset is below.</p>
      <p>As the data show, the class distribution is very skewed, so we expect a simpler machine learning
approach to generalize better.</p>
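      <p>The degree of skew can be checked with a short sketch like the one below. The label list is invented for illustration and does not reproduce the shared-task counts, and the helper name class_distribution is our own.</p>
      <preformat>
```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the data, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Toy labels, invented for illustration only (not the shared-task data).
toy_labels = ["Positive"] * 6 + ["Negative"] * 2 + ["Mixed feelings", "unknown state"]

for label, share in class_distribution(toy_labels):
    print(f"{label}: {share:.0%}")
```
      </preformat>
      <p>When one label dominates in this way, accuracy alone can look good even for a model that mostly predicts the majority class.</p>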
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>
        This is a message-level polarity classification task [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Given a YouTube comment, systems
have to classify it into positive, negative, neutral, mixed emotions, or not in the intended
languages.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Setup</title>
      <p>We experimented with three kinds of classic systems: an SVM classifier, a logistic
regression classifier, and a Perceptron, all using the scikit-learn implementations.
Support Vector Machines are among the most successful classic machine
learning models for various kinds of text classification tasks. We used logistic regression with
the multi-class option set to ’ovr’ (one-vs-rest) for multi-class classification. A Perceptron is a single-layer neural
network; stacking several such layers gives a multi-layer perceptron, a basic neural network. We used a grid search to
find the best parameters for the SVM.</p>
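      <p>A minimal sketch of this setup is given below. It is not our exact configuration: the toy comments, the parameter grid, and the two-fold cross-validation are assumptions made for illustration, and the one-vs-rest scheme is expressed with scikit-learn's OneVsRestClassifier wrapper. Each model is fed bag-of-words counts from CountVectorizer, which is described in the next paragraph.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy code-mixed-style comments, invented for illustration only.
texts = [
    "padam super hit", "super movie semma", "nalla padam super",
    "trailer mass super", "mokka padam waste", "waste movie bore",
    "bore trailer mokka", "padam waste bore",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Assumed search space; the paper does not list the grid that was searched.
svm_search = GridSearchCV(
    make_pipeline(CountVectorizer(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    cv=2,
)

models = {
    "SVM": svm_search,
    # One-vs-rest: one binary logistic regression model per class.
    "Logistic Reg": make_pipeline(
        CountVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000))
    ),
    "Perceptron": make_pipeline(CountVectorizer(), Perceptron(random_state=0)),
}

for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(["super padam"]))

print("best SVM parameters:", svm_search.best_params_)
```
      </preformat>
      <p>Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted only on the training folds during the grid search.</p>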
      <p>For text-to-vector conversion, we used the scikit-learn CountVectorizer. The CountVectorizer provides
a simple way to tokenize a collection of text documents, build a vocabulary of known
words, and encode new documents using that vocabulary. We trained models separately
for the Tamil dataset and the Malayalam dataset.</p>
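      <p>The following sketch illustrates the three roles of CountVectorizer on a few toy comments. The comments are invented for illustration, and the default tokenizer settings are assumed.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy comments, invented for illustration only.
train_comments = ["padam super", "super super trailer", "padam waste"]

vectorizer = CountVectorizer()
# fit_transform tokenizes the comments, builds the vocabulary, and encodes them.
train_matrix = vectorizer.fit_transform(train_comments)

# The vocabulary maps each token to a column; columns follow sorted token order.
print(sorted(vectorizer.vocabulary_))

# New documents are encoded with the learned vocabulary; unseen words are ignored.
new_matrix = vectorizer.transform(["super padam semma"])
print(new_matrix.toarray())
```
      </preformat>
      <p>Because words outside the training vocabulary are dropped, a separate vectorizer per language keeps the Tamil and Malayalam vocabularies apart.</p>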
    </sec>
    <sec id="sec-5">
      <title>5. Results Analysis</title>
      <p>This section presents the results of the evaluation of the three architectures. We compare
the performance of the above machine learning architectures to select submissions for each
language. Classification Accuracy is what we usually mean when we use the term accuracy. It
is the ratio of the number of correct predictions to the total number of input samples.</p>
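      <p>As a small worked example of this definition, the sketch below computes accuracy by hand. The function name accuracy and the label lists are our own illustrative choices, not shared-task results.</p>
      <preformat>
```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["Positive", "Negative", "Positive", "Mixed feelings"]
y_pred = ["Positive", "Positive", "Positive", "Negative"]

print(accuracy(y_true, y_pred))  # 2 correct predictions out of 4 samples
```
      </preformat>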
      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>Accuracy Score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>SVM</td>
              <td>0.63</td>
            </tr>
            <tr>
              <td>Logistic Reg</td>
              <td>0.677</td>
            </tr>
            <tr>
              <td>Perceptron</td>
              <td>0.614</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>For further analysis, we used the confusion matrix. A confusion matrix is a table that is often
used to describe the performance of a classification model (or "classifier") on a set of test data
for which the true values are known. It is useful for computing recall, precision,
specificity, and accuracy; the ROC curve and its AUC additionally require confusion matrices at varying decision thresholds.</p>
      <p>In the confusion matrices in figure 3 and figure 4, we can see that, because the data-set is unbalanced, many
test cases were classified as negative 1.</p>
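      <p>To illustrate how a confusion matrix exposes the effect of class imbalance, the sketch below builds one by hand and derives per-class recall from it. The labels and predictions are invented for illustration and are not our test results; the helper names are our own.</p>
      <preformat>
```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels):
    """Recall of a class is its diagonal cell divided by its row total."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else None
    return recalls


# Toy predictions, invented for illustration only: a classifier that
# over-predicts the larger class of an unbalanced test set.
labels = ["Positive", "Negative"]
y_true = ["Positive"] * 6 + ["Negative"] * 2
y_pred = ["Positive"] * 6 + ["Positive", "Negative"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(per_class_recall(matrix, labels))
```
      </preformat>
      <p>In this toy case seven of eight predictions are correct, yet only half of the smaller class is recovered, which is the kind of pattern that accuracy alone hides.</p>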
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we have described how we trained classic machine learning algorithms for sentiment classification.
These simple algorithms were fast to train and set a baseline for further research. In
future work, we could train more complex deep learning models, but these will need a more balanced
dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vegupatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Named entity recognition for code-mixed indian corpus using meta embedding</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stearns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zarrouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          . URL: https://www.aclweb.org/anthology/W19-6809.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>WordNet gloss translation for under-resourced languages using multilingual neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://www.aclweb.org/anthology/W19-7101.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of different orthographies for machine translation of under-resourced dravidian languages</article-title>
          ,
          <source>in: 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of orthographic information in machine translation</article-title>
          , arXiv preprint arXiv:2008.01391 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>