<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>YUN111@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yueying Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kunjie Dong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The use of social media has grown rapidly during the past few years. It provides a convenient platform on which users communicate by mixing their native language with English, which results in a large amount of code mixed text. This way of communicating is not only convenient but also reduces the cognitive burden of language use, so people can express their opinions simply and easily. However, it is quite difficult for non-native speakers to understand these code mixed texts. Therefore, it is important to analyze the sentiment expressed in these texts. This paper reports on our work in Dravidian-CodeMix-FIRE 2020. We propose an mBERT-based sentiment analysis model and use self-attention to assign weights to the outputs of the BiLSTM, which further improves the accuracy of the model. We test our model on the data sets released by the organizers, and the performance of our system is very close to that of the best system in the competition. We achieve weighted average F1-scores of 0.73 and 0.64 for Malayalam and Tamil, respectively, and rank second for both languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Code mixed</kwd>
        <kwd>Dravidian language</kwd>
        <kwd>Sentiment analysis</kwd>
        <kwd>Malayalam</kwd>
        <kwd>Tamil</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>An increasing number of people express their opinions on social media,
including comments on a new movie, expectations of and suggestions for a new product, or
responses to a new government policy. People often mix their most familiar native
language with English to form code mixed text, because this relaxed way of communicating
reduces the cognitive burden of language use. This is also why code mixed text
is becoming more and more common on social media. For those who are not native
speakers, however, fully understanding these code mixed texts can be difficult. Therefore, it is important
to provide a systematic approach to extracting the sentiment information carried by code mixed
text.</p>
      <p>
        Sentiment analysis is the task of determining subjective opinions or responses about a given
topic. For the past two decades, it has been an active area of academic and industrial research.
There is a growing need for sentiment analysis of code mixed text on social media [<xref ref-type="bibr" rid="ref1">1</xref>]. Code
mixing is a common phenomenon in multilingual communities, and code mixed text is
sometimes written in non-native scripts. Systems trained on monolingual data fail on code
mixed data because of the complexity of code switching at different linguistic levels in the
text [<xref ref-type="bibr" rid="ref2">2</xref>]. The complexities of this type of language include the presence of multilingual words,
transliteration, spelling variations, and so on.
      </p>
      <p>This shared task provides a new gold standard corpus for sentiment analysis of code mixed
text in Dravidian languages (Malayalam-English and Tamil-English). It is a message-level polarity
classification task: given a YouTube comment, the system must categorize it as negative,
not-Malayalam (or not-Tamil), positive, unknown-state or mixed-feeling. We introduce an mBERT-based
sentiment analysis model (mBERT means the BERT multilingual model) and use self-attention
to assign weights to the outputs of the BiLSTM, which further improves the accuracy
of the model. We test our model on the data sets released by the organizers, and the
performance of our system is very close to that of the best system in the competition. We achieve
weighted average F1-scores of 0.73 and 0.64 for Malayalam and Tamil, respectively,
and rank second for both languages. Our code is available on GitHub<sup>1</sup>.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Seminal work on sentiment analysis of code mixed text was done by Joshi et al. [<xref ref-type="bibr" rid="ref3">3</xref>] on Hindi text.
Prabhu et al. [<xref ref-type="bibr" rid="ref4">4</xref>] proposed sub-word level representations in an LSTM (Subword-LSTM) architecture
for Hindi code mixed text. Shalini et al. [<xref ref-type="bibr" rid="ref5">5</xref>] created a Kannada-English code
mixed corpus and provided distributed representation methods. Jhanwar et al. [<xref ref-type="bibr" rid="ref6">6</xref>] combined a
character n-gram based LSTM and a word n-gram based MNB (multinomial naive Bayes) model in an ensemble to identify the sentiment of
Hindi-English code mixed data. Ghosh et al. [<xref ref-type="bibr" rid="ref7">7</xref>] used a multilayer perceptron model to identify the
polarity of Bengali-English tweets. Vijay et al. [<xref ref-type="bibr" rid="ref8">8</xref>] proposed a supervised
classification system that used a variety of machine learning techniques to detect sentiment.
Lal et al. [<xref ref-type="bibr" rid="ref9">9</xref>] proposed a hybrid architecture for sentiment analysis of English-Hindi
code mixed data. In [<xref ref-type="bibr" rid="ref10">10</xref>], the authors proposed a POS tagging method for code mixed social media
text based on a recurrent neural network language model (RNN-LM) architecture.
      </p>
      <p>
        As mentioned above, there has been a lot of research on sentiment analysis for many different
types of code mixed languages. [<xref ref-type="bibr" rid="ref11">11</xref>] is the first task on the sentiment analysis of code mixed
text in Dravidian languages (Malayalam-English and Tamil-English).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Model Architecture</title>
      <p>
        In this section, we describe our model. After many experiments and comparisons, we chose a
pre-trained approach: an mBERT-based model (mBERT means the BERT<sup>2</sup> multilingual model) [<xref ref-type="bibr" rid="ref12">12</xref>].
mBERT is pre-trained on multiple languages, and because its pre-training includes next sentence
prediction, it can also help process the logical relationship between two
sentences. In addition, the higher hidden layers of mBERT can learn rich semantic
features. Therefore, to obtain richer semantic features [<xref ref-type="bibr" rid="ref13">13</xref>], we take
the top hidden layer state output of mBERT and feed it to a BiLSTM. We then assign
a weight to every hidden state output of the BiLSTM, which helps to classify the
polarity of sentiment across multiple sentiment-bearing units; the output of the attention model
is a weighted sum of the hidden representations at each hidden state. Finally, we concatenate the
original output of mBERT with the weighted representation vector.
        <fn><label>1</label><p>https://github.com/TroubleGilr/codemixed</p></fn>
        <fn><label>2</label><p>https://github.com/google-research/bert</p></fn>
      </p>
      <p>The principle of weight attention</p>
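      <p>Formally, let <italic>C</italic> be the set of characters and <italic>T</italic> be the set of input statements. A sentence
<italic>s</italic> ∈ <italic>T</italic> is made of characters [<italic>c</italic><sub>1</sub>, ..., <italic>c</italic><sub><italic>l</italic></sub>], where <italic>l</italic> is the length of the input, and <italic>l</italic> = 60 in our
model. The <italic>i</italic>-th forward and backward hidden states of the BiLSTM are represented by
<inline-formula><tex-math>\overrightarrow{h_i}</tex-math></inline-formula> and <inline-formula><tex-math>\overleftarrow{h_i}</tex-math></inline-formula>,
respectively. Concatenating <inline-formula><tex-math>\overrightarrow{h_i}</tex-math></inline-formula> and <inline-formula><tex-math>\overleftarrow{h_i}</tex-math></inline-formula> gives the annotation <inline-formula><tex-math>h_i</tex-math></inline-formula>:
        <disp-formula><label>(1)</label><tex-math>h_i = [\overrightarrow{h_i} ; \overleftarrow{h_i}], \quad (i = 1, 2, \ldots, l)</tex-math></disp-formula>
      </p>
      <p>The attention weight <inline-formula><tex-math>\alpha_i</tex-math></inline-formula> of <inline-formula><tex-math>h_i</tex-math></inline-formula> in the sentence <italic>s</italic> can be calculated by the following formula, where <inline-formula><tex-math>e_i</tex-math></inline-formula> denotes the unnormalized attention score of <inline-formula><tex-math>h_i</tex-math></inline-formula>:
        <disp-formula><label>(2)</label><tex-math>\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{l} \exp(e_j)}</tex-math></disp-formula>
      </p>
      <p>We then obtain the representation vector <italic>h</italic> by combining all the weighted outputs:
        <disp-formula><label>(3)</label><tex-math>h = \sum_{i=1}^{l} \alpha_i h_i</tex-math></disp-formula>
      </p>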
      <p>Next, we reshape <italic>h</italic> so that it has the same dimensions as the output of
mBERT. Finally, we concatenate the two and feed the result into the classifier. A schematic overview of the model
architecture is shown in panel (a) of Figure 1.</p>
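      <p>The attention pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name <monospace>attention_pool</monospace>, the toy hidden states and the way the scores are supplied are assumptions made for illustration. It only shows that the weights are a softmax over per-position scores and that the pooled vector is the weighted sum of the BiLSTM annotations.</p>
      <preformat>
```python
import math


def attention_pool(hidden_states, scores):
    """Softmax-normalise the scores and return (weights, weighted sum of states).

    hidden_states: list of l annotation vectors (each a list of floats).
    scores: list of l unnormalised attention scores (hypothetical inputs here).
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: three 2-dimensional annotations with equal scores,
# so every position receives the same weight.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attention_pool(states, [0.0, 0.0, 0.0])
```
      </preformat>
      <p>In a real model the scores would be produced by learned parameters, and the pooled vector would be reshaped and concatenated with the mBERT output before classification.</p>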
    </sec>
    <sec id="sec-4">
      <title>4. Data Description</title>
      <p>The aim of this task is to identify the sentiment polarity of a code mixed dataset of
comments/posts in Dravidian (Malayalam and Tamil) collected from social media. Comments/posts
may contain more than one sentence, but the average sentence length of the corpus is 1. The
datasets consist of YouTube comments labeled into one of the five classes:</p>
      <p>Negative (Neg): Comments that contain obvious negative emotions such as sadness, dissatisfaction or loss, or
offensive language, or that express disgust or criticize some people.</p>
      <p>Positive (Pos): Comments that express happiness, satisfaction, or praise for a person, group or
country.</p>
      <p>Not-Malayalam or Not-Tamil (Not): Comments in the Malayalam or Tamil data set that
do not contain Malayalam or Tamil, respectively.</p>
      <p>Unknown-state (Unknown): Comments that state facts, provide news, or are
advertisements, with no obvious emotional expression.</p>
      <p>Mixed-feeling (Mixed): Comments that explicitly or implicitly contain both positive and negative emotions of the user.</p>
      <p>
        Most of the data are written in Roman script and mix the following forms of
code-mixed sentences: inter-sentential switching, intra-sentential switching and tag switching. The
details of the data sets are given in Table 1. As the table shows, the classes are unbalanced.
More details about the dataset can be found in [<xref ref-type="bibr" rid="ref14">14</xref>] and [<xref ref-type="bibr" rid="ref15">15</xref>], and some of the
processing of code mixed text can be seen in [16].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>In this work, we use mBERT as our pre-trained model. Before training, we randomly
shuffle the data and remove unwanted characters and emoticons (emoticons generally express
specific sentiment, and we will consider introducing an emoticon system in our future work).
We encode the categorical sentiment labels negative,
not-Tamil (not-Malayalam), positive, unknown-state and mixed-feeling as 0, 1, 2, 3 and 4, respectively, which gives a
numeric representation of the categorical sentiment data. Finally, we input the processed data
into our model. mBERT uses the WordPiece<sup>3</sup> tool [17] for word segmentation and
inserts the special classification token ([CLS], which marks the beginning of each sample) and the separator token ([SEP], which
separates different sentences in the sample) [18].</p>
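      <p>The preprocessing steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: the exact label strings in the released files, the emoticon pattern and the helper names <monospace>clean</monospace> and <monospace>prepare</monospace> are all assumptions. Only the mapping of the five classes to 0 to 4, the removal of emoticons and the random shuffle come from the text.</p>
      <preformat>
```python
import random
import re

# Label encoding described in the text; the label spellings are assumed.
LABEL_TO_ID = {
    "negative": 0,
    "not-Tamil": 1,
    "not-Malayalam": 1,
    "positive": 2,
    "unknown-state": 3,
    "mixed-feeling": 4,
}

# Hypothetical cleaning rule: drop common emoji code points and a few
# ASCII emoticons such as ":)" or ";-D". A real system may need more.
_EMOTICONS = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|[:;=][-']?[)(DPp]")


def clean(text):
    """Remove emoticons and collapse repeated whitespace."""
    without_emoticons = _EMOTICONS.sub(" ", text)
    return re.sub(r"\s+", " ", without_emoticons).strip()


def prepare(samples, seed=0):
    """Clean each (text, label) pair, encode the label, and shuffle."""
    rows = [(clean(text), LABEL_TO_ID[label]) for text, label in samples]
    random.Random(seed).shuffle(rows)
    return rows


rows = prepare([("nalla padam :)", "positive"), ("not good", "negative")])
```
      </preformat>
      <p>The cleaned text would then be passed to the WordPiece tokenizer, which adds the [CLS] and [SEP] tokens.</p>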
      <p>Our model is implemented in PyTorch. We use the Adam optimizer with a learning
rate of 1e-5 and cross-entropy loss. The batch size is set to 8 and the number of gradient accumulation
steps is set to 4. The number of epochs and the maximum sentence length are 5 and 60, respectively.</p>
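      <p>With a batch size of 8 and 4 gradient accumulation steps, each parameter update is based on 8 × 4 = 32 samples. The pure-Python sketch below illustrates this on a toy one-parameter regression problem. It assumes that each micro-batch gradient is scaled by the number of accumulation steps, which is a common convention but is not stated in the text; the toy data and function names are hypothetical.</p>
      <preformat>
```python
BATCH_SIZE = 8
ACCUMULATION_STEPS = 4
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * ACCUMULATION_STEPS  # 32 samples per update


def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data):
    """Accumulate scaled micro-batch gradients before a single update."""
    grad = 0.0
    for step in range(ACCUMULATION_STEPS):
        micro_batch = data[step * BATCH_SIZE:(step + 1) * BATCH_SIZE]
        # Assumed convention: divide by the number of accumulation steps so
        # the sum equals the gradient of one large batch.
        grad += grad_mse(w, micro_batch) / ACCUMULATION_STEPS
    return grad


# Toy data generated from y = 3x.
data = [(float(i), 3.0 * i) for i in range(EFFECTIVE_BATCH_SIZE)]
full_batch_grad = grad_mse(0.0, data)
accumulated = accumulated_grad(0.0, data)
```
      </preformat>
      <p>Under this convention the accumulated gradient matches the gradient of a single batch of 32 samples, which is why accumulation is a common way to emulate a larger batch when memory is limited.</p>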
      <p>The participating systems are evaluated by the weighted average F1-score.
Table 2 shows the precision, recall and weighted average F1-score of our system. The final
ranking is based on the weighted average F1-score; our submitted system obtains weighted
average F1-scores of 0.73 and 0.64 for Malayalam and Tamil, respectively.</p>
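      <p>For reference, the weighted average F1-score averages the per-class F1-scores using the number of true instances of each class as weights. The sketch below is a minimal pure-Python version under that standard definition (the same one used by <monospace>f1_score(average="weighted")</monospace> in scikit-learn); the toy labels are hypothetical and are not taken from the task data.</p>
      <preformat>
```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class support (true counts) as weights."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += count * f1
    return total / len(y_true)


# Toy example with three of the five class ids.
score = weighted_f1([0, 0, 1, 2], [0, 1, 1, 2])
```
      </preformat>
      <p>Because the weights follow class frequency, frequent classes dominate this metric on unbalanced data.</p>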
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>As social media text grows in popularity and influence, it is increasingly important
to analyze the sentiment attached to the text. In this paper, we perform sentiment analysis on
code mixed text in Dravidian languages (Malayalam-English and Tamil-English). We propose an mBERT-based
multilingual processing model. We assign weights to the hidden state outputs of the
BiLSTM and then concatenate the weighted representation with the original output of mBERT. Our system achieves
weighted average F1-scores of 0.73 and 0.64 for Malayalam and Tamil, respectively, and ranks second for both languages.</p>
      <p>On the whole, the article uses mBERT to represent the code-mixed Dravidian text, which is
then fed to the BiLSTM, whose outputs are combined into an attention-weighted representation vector. In the
end, this attention-weighted BiLSTM output and the output of mBERT are concatenated for classification.
The framework thus combines high-level and low-level features. Through this
model, rich sentiment information is extracted to improve the classification accuracy. In future
work, we will consider incorporating emotional information, such as emoticons, into the classification system. We
will also consider increasing the depth of the model as far as the data allow.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>I would like to thank the organizers for their hard work and the teachers for their help. Finally,
I would like to thank the school for its support to my research and the future reviewers for their
patient work.</p>
      <p>3 https://github.com/lovit/WordPieceModel</p>
      <p>Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
and Computing for Under-Resourced Languages (CCURL), European Language Resources
association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.</p>
      <p>[16] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation
of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.</p>
      <p>[17] B. C. Xu, L. Ma, L. Zhang, H. H. Li, M. C. Zhou, An adaptive wordpiece language model
for learning Chinese word embeddings, in: 2019 IEEE 15th International Conference on
Automation Science and Engineering (CASE), 2019.</p>
      <p>[18] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding (2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020)</article-title>
          . CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Balamurali</surname>
          </string-name>
          ,
          <article-title>A fall-back strategy for sentiment analysis in hindi: a case study</article-title>
          ,
          <source>in: Icon</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text (</article-title>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. R,</surname>
          </string-name>
          <article-title>A fall-back strategy for sentiment analysis in hindi: a case study</article-title>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Jhanwar</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
          </string-name>
          ,
          <article-title>An ensemble model for sentiment analysis of hindi-english codemixed data (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Das</surname>
          </string-name>
          ,
          <article-title>Sentiment identification in code-mixed social media text (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vijay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bohra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <article-title>Corpus creation and emotion prediction for hindi-english code-mixed social media text</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Lal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          , P. Koehn,
          <article-title>De-mixing sentiment from code-mixed text</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Pimpale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sasikumar</surname>
          </string-name>
          ,
          <article-title>Recurrent neural network based part-of-speech tagger for code-mixed social media text (</article-title>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Jayan</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. R. R</surname>
          </string-name>
          , E. Sherly,
          <article-title>Sentima - sentiment extraction for malayalam</article-title>
          ,
          <source>in: International Conference on Advances in Computing</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1719</fpage>
          -
          <lpage>1723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <article-title>What does bert learn about the structure of language?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          , Kungfupanda at semeval-2020 task 12:
          <article-title>Bert-based multi-task learning for offensive language detection (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies</source>
          for
          <article-title>Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</article-title>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>