<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MUCS@Dravidian-CodeMix-FIRE2020: SACO-Sentiments Analysis for CodeMix Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fazlourrahman Balouchzahi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H L Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore - 574199</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The increasing use of social media and online shopping generates a lot of text data consisting of sentiments or opinions about anything and everything available on these platforms. Due to the technological limitations of using their native scripts, users usually pen their sentiments in their own language in Roman script, in addition to using English words. Sentiment Analysis (SA), an automatic way of analyzing these sentiments, is gaining popularity, as analyzing them manually is challenging due to the huge size of the texts and the language used in them. In this paper, we, team MUCS, propose an SA model submitted to the 'Sentiment analysis of Dravidian languages in CodeMixed Text' shared task at FIRE 2020 to analyze Tamil-English and Malayalam-English code-mixed texts. The proposed approach uses a Hybrid Voting Classifier (HVC) that combines Machine Learning (ML) models using word embeddings and n-gram features extracted from sentences with a Deep Learning (DL) model based on BiLSTM using sub-word embedding features. Our team obtained 4th rank in Tamil-English and 6th rank in Malayalam-English code-mixed SA.</p>
      </abstract>
      <kwd-group>
<kwd>Sentiment Analysis</kwd>
        <kwd>Code-Mixing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Sentiments in the online era comprise the feelings or opinions of users on social media and
customers' reviews of products available in online shops. The increasing use of social media
such as YouTube, Facebook, WhatsApp, Instagram, Twitter, etc., and online shopping generates
a lot of text data consisting of sentiments/opinions about anything and everything available
on the internet. Sentiments or reviews may be positive, negative, neutral, or mixed. Sentiment
Analysis (SA), which deals with the automatic analysis of sentiments or opinions, is becoming
important, as these texts can raise the popularity of a product or a person, or sink a product
such as a video on social media or a laptop in an online shop. SA is gaining popularity in
recommender systems, as many people tend to read reviews about products available on the
internet, such as reviews of a movie or a digital camera, before deciding to watch the movie
or buy the camera. As there is no restriction on the language, content, or rules used to express
opinions, comments, or reviews on social media, users are at ease to express their feelings in
any language without hesitation [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. However, due to technological limitations, users usually use Roman script
to pen their sentiments/opinions in their language, in addition to using English words, rather
than their native or local language script. One reason for this is the availability of Roman
letters, which can be keyed in directly, as opposed to a combination of keys per character for
most Indian languages. This combination of more than one language using the same script
in a text is called code-mixing, and code-mixed texts are increasing with the popularity
of social media and online shopping [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. As words of different languages are used to write
sentiments or reviews, the complexity of these texts increases, and hence it becomes difficult to
analyze them. SA is challenging due to the huge size of these texts and also the code-mixing at
various levels. Code-mixing includes mixing of languages at various linguistic levels such as
words, phrases, and sentences [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
]. Code-mixed text combines features such as the vocabulary
and grammar of different languages and builds new structures from the attributes of these
languages. This is challenging for SA models, as conventional semantic analysis models
do not capture the meaning of such sentences [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. In this work, we propose a Hybrid Voting
Classifier (HVC) using ML and DL approaches. While the ML approaches include Multi-Layer
Perceptron (MLP) and Multinomial Naïve Bayes (MNB) classifiers using n-grams and word
vectors as features respectively, the DL approach uses a Bidirectional Long Short Term Memory
(BiLSTM) classifier with sub-word embeddings as features. The implementation of this paper is
available in our GitHub repository1. The rest of the paper is organized as follows: an overview
of the literature in the related area is given in Section 2. Feature extraction for the proposed
model is discussed in Section 3, followed by the proposed methodology in Section 4.
Section 5 presents the experiments and results, and Section 6 concludes the paper with future
work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
Several studies have been carried out by various researchers on collecting code-mixed data of
different language pairs and analyzing it for various applications including SA, language
identification, POS tagging, NER, etc. A few of the important ones are given below. Chakravarthi et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
have built two benchmark code-mixed corpora, namely TamilMixSentiment and
MalayalamMixSentiment [
        <xref ref-type="bibr" rid="ref8">8</xref>
] for SA by collecting YouTube comments and annotating them with the help of
voluntary annotators. Among the several baseline models applied to the created datasets, Random
Forest, which randomly generates trees without defining rules, gives a weighted average f-score
of 0.65 on the TamilMixSentiment corpus. Further, a BERT model, using a Transformer encoder
along with a mechanism to read a sequence in both directions (left to right and vice versa),
performed better than the other baselines with a weighted average f-score of 0.75 on the
MalayalamMixSentiment corpus. An overview of the shared task on SA of Bengali-English (BN-EN)
and Hindi-English (HI-EN) code-mixed data at ICON-2017 is presented by Patra et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
].
Datasets were collected from Twitter using the Twitter4j2 API and provided to the shared task
participants. The IIIT-NBP team used several features such as GloVe word2vec, TF-IDF
of word n-grams (n = 1, 2, 3), and character n-grams (n = 2 to 6) and achieved the highest
scores for both datasets, with macro average f-scores of 0.569 for the HI-EN dataset and 0.526 for
      </p>
      <sec id="sec-2-1">
        <title>1https://github.com/fazlfrs/SACO-SentimentsAnalysis-for-CodeMix-Text 2http://twitter4j.org/en/</title>
        <p>
BN-EN. A code-mixed SA system using ML and neural network approaches has been proposed by
Mishra et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
] for the BN-EN and HI-EN code-mixed datasets. They built the classifiers using
the dataset provided by the Sentiment Analysis for Indian Languages (SAIL)-Code Mixed task at
ICON-20173. The first classifier is a voting classifier consisting of three classifiers, namely SVM,
Logistic Regression, and Random Forest, using TF-IDF of 2 to 6 char n-grams. They further
experimented with word-level n-grams as features for both SVM and Multi-Layer Perceptron
(MLP) classifiers. The mean of the GloVe vectors in a sentence (averaged GloVe) as features for
SVM and MLP, and a Bi-LSTM with GloVe, were also explored. The best results, f-scores of 0.58
and 0.69 for the Hindi-English and Bengali-English datasets respectively, were obtained using
SVM with 2 to 6 char n-grams. Ansari et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
] collected 1200 Hindi and 300 Marathi documents
from social media comments and designed a model using three classification algorithms, namely
Naïve Bayes, SVM with an RBF kernel, and linear SVM. The results show accuracy
of up to 90% with consistency for the Marathi language, but for the Hindi language it is in the
range of 70% to 80%. A model based on contrastive learning proposed by Choudhary et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
]
uses twin Bidirectional LSTM networks and a clustering-based method to capture alterations
of code-mixed transliterated words. Based on different configurations of language pairs, their
models obtained accuracies in the range of 71.30% to 79.80%. A hybrid model proposed by Lal
et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
] for the SA task on HI-EN code-mixed text includes sub-word embeddings. The main
components of this model are a CNN for generating sub-words and a dual-encoder BiLSTM
network that captures the entire sentiment information and selects the more informative
sentiment-bearing parts of a sentence. The system proposed by Joshi et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
] achieved
an accuracy of 83.54% and an f-score of 0.827 on the released dataset, which contains 3,879
code-mixed English-Hindi sentences collected from Facebook.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Feature Extraction</title>
      <p>
        Extracting features from text is the basis for all text processing applications [
        <xref ref-type="bibr" rid="ref12">12</xref>
]. The proposed
approach uses three different feature extraction techniques, namely n-grams, word vectors, and
sub-word vectors, to extract the relevant features from code-mixed data.
      </p>
      <p>
• N-grams: An n-gram is a contiguous sequence of n items from a given sequence, where an
item can be a letter, phoneme, word, etc. [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. For example, a word 2-gram (or bigram) is a two-word
sequence such as “please turn”, “turn off”, or “off TV”, and a word 3-gram (or trigram)
is a three-word sequence such as “please turn off” or “turn off TV”. Word n-gram
models assign probabilities to all the words in a sentence and estimate the probability
of the last word of an n-gram given the previous words. N-gram models are integrated into
most text classification tasks and are expected to boost the accuracy of classification.
• Word Vectors: Word2Vec transforms a text into rows of numbers, called embeddings, such
that words with similar meanings have similar vector representations [
        <xref ref-type="bibr" rid="ref14">14</xref>
]. Skipgram
and Continuous Bag of Words (CBOW) are the two common methods used to obtain
Word2Vec. The CBOW method predicts the target word from a given input
3https://brajagopalcse.github.io/SAIL  −  − 2017
      </p>
      <p>
context, and Skipgram predicts the most probable context for a given word. As word
embeddings are not available for code-mixed data, they need to be trained from the raw
code-mixed data.
• Sub-word Vectors: NLP systems generally discard words that are rarely seen in the
training corpus, as learning efficient representations for these words is a difficult challenge
[
        <xref ref-type="bibr" rid="ref15">15</xref>
]. The problem gets further complicated by code-mixing, as a code-mixed word
such as “Indiayile” is not present in any single language, being a combination of two languages.
However, a part of this word, “India”, which indicates a location, is a valid word.
This problem can be handled conveniently by considering the substrings of a word, called
sub-words. For example, the word “Indiayile” (meaning ‘in India’), commonly used in
Malayalam, can be considered as two sub-words, “India” and “yile”, where “India” is a name
in any language and the suffix “yile” is a Malayalam morpheme meaning ‘in’. Sub-words
allow finding parts of words that would otherwise be unknown or out-of-vocabulary.
Word embeddings are generally trained for one language and are not ideal for code-mixed
data, which, in general, may contain more than one language. Further, as
word embeddings cannot handle the spelling variations in social media data [
        <xref ref-type="bibr" rid="ref3">3</xref>
], using
sub-word embeddings is the right choice to represent code-mixed data in addition
to word embeddings. However, as pre-trained sub-word embeddings are not readily
available for code-mixed data, they have to be trained from the raw code-mixed data.
Byte-Pair Encoding embeddings (BPEmb) [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18 ref19">15, 16, 17, 18, 19</xref>
] is a collection of pre-trained sub-word
embeddings for 275 languages, each trained on the respective language's Wikipedia4. It
can be used to extract all the sub-words from a given sentence. In the proposed method,
tools from the BPEmb5 library are used to extract all the sub-words from a given sentence.
Figure 1 gives a snapshot of using BPEmb tools to convert an input sentence into sub-words.
The extracted sub-words are used to train sub-word embeddings using Word2Vec, referred to
as Sub-Word2Vec. Sub-word embeddings can be more effective than the usual word embeddings
as they capture more context.
      </p>
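<p>As an illustration of the n-gram and sub-word features described above, here is a minimal sketch in plain Python. This is not the code used in the experiments, and the tiny sub-word vocabulary is hypothetical; a real system would use the learned BPEmb vocabulary instead.</p>

```python
def word_ngrams(text, n):
    """All contiguous word n-grams of a sentence."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def subword_split(word, vocab):
    """Greedy longest-match segmentation into known sub-words;
    falls back to single characters (a simplified BPE-style scheme)."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                parts.append(word[i:j])
                i = j
                break
    return parts

print(word_ngrams("please turn off TV", 2))           # ['please turn', 'turn off', 'off TV']
print(char_ngrams("India", 3))                        # ['Ind', 'ndi', 'dia']
print(subword_split("Indiayile", {"India", "yile"}))  # ['India', 'yile']
```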
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
<p>As sentiments or reviews are usually written in code-mixed language, in most cases using
Roman script, code-mixed text does not adhere to the grammar of any one language, which
increases the complexity of designing the SA system. Therefore, an architecture that helps</p>
      <sec id="sec-4-1">
        <title>4https://nlp.h-its.org/bpemb/ 5https://github.com/bheinzerling/bpemb</title>
<p>to classify code-mixed text and overcome the challenges of analyzing such text has to be used.
Hence, word and sub-word vectors and also word/char n-grams are chosen to train different
learning models. For generating word and sub-word vectors, the Skipgram method is used, as it
estimates the probability of the contexts in which a word can appear and hence predicts the
most probable context of a given word. N-grams have already proved their efficiency in many
natural language processing applications, and both char and word n-grams allow the model to
utilize all the features of a text. A voting classifier is an ensemble classification model that
works on majority voting: the label with the highest number of votes becomes the final
prediction. It is usually built from more than two base classifiers. The proposed HVC model
includes training three base models, namely BiLSTM, MNB, and MLP. The architecture of the
proposed approach is shown in Figure 2.</p>
<p>The proposed model includes a feature engineering phase after cleaning and tokenization,
followed by model construction. Details of the model construction are as follows:
• BiLSTM: Building the DL model using BiLSTM includes two main steps: i) training
100-dimensional sub-word embeddings (Sub-Word2Vec) using the Word2Vec Skipgram model,
and ii) using these sub-word embeddings to train BiLSTM networks for classifying the
sentiments. The model has been trained with various batch sizes (128, 64, 32), each for
10 epochs.
• MNB: Raw code-mixed text is used to train a Skipgram Word2Vec model; the generated
vectors are transformed using CountVectorizer from the sklearn library and used as
features for building the MNB classifier.
• MLP: Building the MLP includes extracting char (n = 1, 2, 3, 4, 5) and word
n-grams (n = 1, 2, 3) as features and transforming them using CountVectorizer before
feeding them to an MLP classifier.</p>
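<p>The majority-voting step that combines the three base models can be sketched as follows. This is a simplified stand-in for the HVC, with the base models represented only by hypothetical predicted-label lists rather than trained classifiers.</p>

```python
from collections import Counter

def hard_vote(*model_predictions):
    """Majority vote across base models, per sample; among equally
    common labels, Counter keeps the first model's choice."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*model_predictions)]

# Hypothetical per-sample predictions of the three base models.
bilstm = ["pos", "neg", "neu", "pos"]
mnb    = ["pos", "pos", "neu", "neg"]
mlp    = ["neg", "pos", "neu", "neg"]

print(hard_vote(bilstm, mnb, mlp))  # ['pos', 'pos', 'neu', 'neg']
```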
        <p>
After training, all the base models were evaluated on the test set provided by the organizers,
and the predicted labels were submitted to the Dravidian-CodeMix task at FIRE 2020 as the
MUCS team. An overview of the Dravidian-CodeMix task is given in the overview papers [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
<p>Two code-mixed SA datasets, namely Malayalam-English and Tamil-English, were provided by
the Dravidian-CodeMix organizing team. Details of the datasets are given in Table 1. As
mentioned on the shared task website, the performance of the systems is measured by weighted
averaged precision, weighted averaged recall, and weighted averaged f-score across all the
classes. As per the results announced by the task organizers, the MUCS team obtained 4th rank
in Tamil-English and 6th rank in Malayalam-English code-mixed SA. Results of the proposed
approach are shown in Table 2. The MUCS team obtained an average weighted f-score of 0.62 for
Tamil-English code-mixed SA, which is only 0.03 less than the first-ranked model. Likewise, the
average weighted f-score of 0.68 for Malayalam-English code-mixed SA is only 0.06 less than
the first rank.</p>
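<p>The weighted averaged f-score used for ranking weights each class's f-score by its support, i.e., the number of true instances of that class. A minimal sketch of the metric, on illustrative labels rather than the shared-task data:</p>

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support, then averaged."""
    support = Counter(y_true)
    total = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / count
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += count * f1
    return total / len(y_true)

print(round(weighted_f1(["pos", "pos", "neg", "neu"],
                        ["pos", "neg", "neg", "neu"]), 3))  # 0.75
```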
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future work</title>
<p>'Sentiment analysis of Dravidian languages in CodeMixed Text' is a shared task at FIRE 2020.
We, the MUCS team, submitted an HVC consisting of ML approaches, namely MLP and MNB classifiers
using n-grams and word vectors as features respectively, and a DL model, namely a BiLSTM
classifier with sub-word embeddings as features, for sentiment analysis of Dravidian languages
in code-mixed text. Our team obtained 4th and 6th rank in the Tamil-English and
Malayalam-English shared tasks respectively. Future work is to explore code-mixed language
models for low-resource Indian languages and the Persian language.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
<source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
. URL: https://www.aclweb.org/anthology/2020.sltu-1.28
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ansari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Govilkar</surname>
          </string-name>
          ,
<article-title>Sentiment analysis of mixed code for the transliterated Hindi and Marathi texts</article-title>
,
          <source>International Journal on Natural Language Computing (IJNLC)</source>
          Vol
          <volume>7</volume>
          (
          <year>2018</year>
          ) , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Lal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
,
<string-name>
<given-names>P.</given-names>
<surname>Koehn</surname>
</string-name>
,
          <article-title>De-mixing sentiment from code-mixed text</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</source>
          , (
          <year>2019</year>
          ), pp.
          <fpage>371</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vegupatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Named entity recognition for code-mixed Indian corpus using meta embedding</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
<string-name>
<given-names>I.</given-names>
<surname>Bindlish</surname>
</string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of code-mixed languages leveraging resource rich languages</article-title>
, arXiv preprint arXiv:1804.00806 (
<year>2018</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
,
<string-name>
<given-names>N.</given-names>
<surname>Jose</surname>
</string-name>
,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
<article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
,
<source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Patra</surname>
          </string-name>
          ,
<string-name>
<given-names>D.</given-names>
<surname>Das</surname>
</string-name>
,
<string-name>
<given-names>A.</given-names>
<surname>Das</surname>
</string-name>
          ,
<article-title>Sentiment analysis of code-mixed Indian languages: An overview of SAIL_Code-Mixed Shared Task @ICON-2017</article-title>
, arXiv preprint arXiv:1803.06745 (
<year>2018</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Danda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhakras</surname>
          </string-name>
          ,
          <article-title>Code-mixed sentiment analysis using machine learning and neural network approaches</article-title>
, arXiv preprint arXiv:1808.03299 (
<year>2018</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
<article-title>Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text</article-title>
          ,
          <source>in: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          , (
          <year>2016</year>
          ), pp.
          <fpage>2482</fpage>
          -
          <lpage>2491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Text feature extraction based on deep learning: a review</article-title>
          ,
<source>EURASIP Journal on Wireless Communications and Networking</source>
          <year>2017</year>
          (
          <year>2017</year>
          ) , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stefanovič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Štrimaitis</surname>
          </string-name>
          ,
          <article-title>The n-grams based text similarity detection approach using self-organizing maps and similarity measures</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
)
1870.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zimmerman</surname>
          </string-name>
          ,
<string-name>
<given-names>C.</given-names>
<surname>Hudson</surname>
</string-name>
,
<article-title>Packet2vec: Utilizing word2vec for feature extraction in packet data</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>14477</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          ,
          <article-title>BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages</article-title>
          , arXiv preprint arXiv:1710.02187 (
          <year>2017</year>
          ), pp.
          <fpage>2989</fpage>
          -
          <lpage>2993</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Improving Wordnets for Under-Resourced Languages Using Machine Translation</article-title>
          ,
          <source>in: Proceedings of the 9th Global WordNet Conference, The Global WordNet Conference 2018 Committee</source>
          ,
          <year>2018</year>
          . URL: http://compling.hss.ntu.edu.sg/events/2018-gwc/pdfs/GWC2018_paper_16.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages</article-title>
          ,
          <source>in: 2nd Conference on Language, Data and Knowledge (LDK 2019)</source>
          , volume
          <volume>70</volume>
          of OpenAccess Series in Informatics (OASIcs),
          <source>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          , Dagstuhl, Germany,
          <year>2019</year>
          , pp.
          <fpage>6:1</fpage>
          -
          <lpage>6:14</lpage>
          . URL: http://drops.dagstuhl.de/opus/volltexte/2019/10370. doi:10.4230/OASIcs.LDK.2019.6.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>WordNet gloss translation for under-resourced languages using multilingual neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://www.aclweb.org/anthology/W19-7101.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stearns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zarrouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          . URL: https://www.aclweb.org/anthology/W19-6809.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          , FIRE '20, (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org</source>
          , Hyderabad, India, (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>