<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), Hyderabad, India</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SRJ @ Dravidian-CodeMix-FIRE2020: Automatic Classification and Identification of Sentiment in Code-Mixed Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruijie Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Xiaobing Zhou</string-name>
          <email>zhouxb@ynu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R.China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Sentiment analysis of Code-Mixed text has received increasing research attention. To facilitate research on Code-Mixed text, the Sentiment Analysis of Dravidian Languages in Code-Mixed Text track is organized at FIRE 2020. This paper introduces the system submitted by the SRJ team. We participate in the Malayalam-English and Tamil-English tasks. We use XLM-Roberta as our model, and abundant semantic information is obtained by extracting XLM-Roberta's hidden states. Our approach achieves the best results in both tasks, with weighted F-scores of 0.74 and 0.65, respectively. Our code is available on GitHub (https://github.com/lonelyjie323/HASOC).</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Abundant Semantic Information</kwd>
        <kwd>Code-Mixed Text</kwd>
        <kwd>XLM-Roberta</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Dravidian-CodeMix-FIRE2020 track provides comments written in Malayalam-English,
Tamil-English, and English codes. Our goal is to categorize the comments obtained from YouTube
into positive, negative, neutral, mixed emotions, or not in the intended language [4].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In recent years, sentiment analysis has attracted the attention of a large number of industrial
and academic researchers. With the acceleration of information dissemination, code-mixing has
gradually become a common phenomenon in multilingual communities.</p>
      <p>Chakravarthi [5] proposed improving machine translation of languages with insufficient
resources, such as Dravidian languages, by leveraging orthographic information. A word in a
language may have more than one meaning; for this reason, R. Padmamala [6] also proposed a
word-level translation method based on knowledge engineering for Tamil-English. There are two
traditional approaches to sentiment analysis: lexicon-based and machine learning approaches [7].
In general, it is difficult to understand and analyze texts written in multiple languages.
Veena P V et al. [8] developed a word-level language identification system for Code-Mixed social
media text, which was applied to Tamil-English and Malayalam-English Code-Mixed Facebook
comments. Shashi Shekhar et al. [9] proposed a novel architecture combining a multichannel
neural network (MNN) and quantum Bi-directional Long Short-Term Memory (QBLSTM).</p>
      <p>For this task, there is very little prior work on sentiment analysis for Malayalam-English and
Tamil-English [10], because machine translation between Indian languages and English presents
certain difficulties [11], so the data is not easy to obtain, let alone to carry out sentiment
analysis on.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>In the Malayalam-English task, the official organizers provide a training set (4,851 comments /
posts) and a validation set (540 comments / posts). In our experiment, we combine the training set
and the validation set, and the label distribution is { positive: 2,246, negative: 600, neutral: 1,505,
mixed emotions: 333, not-Malayalam: 707 }, which is an imbalanced dataset.</p>
        <p>In the Tamil-English task, the official organizers provide a training set (11,335 comments /
posts) and a validation set (1,260 comments / posts). In our experiment, we combine the training set
and the validation set, and the label distribution is { positive: 8,124, negative: 1,613, neutral: 677,
mixed emotions: 1,424, not-Tamil: 397 }, which is an imbalanced dataset.</p>
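        <p>The merged data and its class imbalance can be checked with a short script such as the sketch below; the file paths and the "category" column name are hypothetical assumptions, not the organizers' actual distribution format.</p>
        <preformat>
# Sketch: merge the official training and validation sets and inspect the
# label distribution (Section 3.1). File paths and column names are assumed.
from collections import Counter

import pandas as pd

train = pd.read_csv("tamil_train.tsv", sep="\t")   # hypothetical path
dev = pd.read_csv("tamil_dev.tsv", sep="\t")       # hypothetical path
data = pd.concat([train, dev], ignore_index=True)

print(Counter(data["category"]))
# Expected counts from Section 3.1 for Tamil-English: positive 8,124,
# negative 1,613, neutral 677, mixed emotions 1,424, not-Tamil 397.
</preformat>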
      </sec>
      <sec id="sec-3-2">
        <title>3.2. XLM-Roberta with hidden state</title>
        <p>Early work in the field of cross-lingual understanding has proved the effectiveness of the
multilingual masked language model (MLM) in cross-lingual understanding, but models such as
XLM [12] and Multilingual BERT [13] (pre-trained on Wikipedia) are still limited in learning
useful representations of low-resource languages. XLM-Roberta [14] shows that the
performance of cross-lingual transfer tasks can be significantly improved by using a large-scale
multi-language pre-training model. It can be understood as a combination of XLM and Roberta,
and it is trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. The
model in this task must make full use of the whole sentence content to extract useful semantic
features, which may help to deepen the understanding of the sentence and reduce the impact
of noise in the data. Therefore, we use XLM-Roberta in this work.</p>
        <p>In the classification task, the original output of XLM-Roberta is obtained through the last
hidden state of the model. However, this output usually does not summarize the semantic
content of the input. Recent studies have shown that abundant semantic information features
are learned by the top hidden layers of BERT [15], which we call the semantic layers. In our
opinion, the same is true of XLM-Roberta. Therefore, in order to make the model obtain more
abundant semantic information features, we propose the following model, as shown in Figure 1.
Firstly, we get the pooler output P_O. Secondly, we extract the hidden states of the last three
layers of XLM-Roberta and input them into a CNN to get L12, L11, and L10. Finally, we
concatenate P_O, L10, L11, and L12 and feed the result into the classifier.</p>
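        <p>A minimal sketch of this architecture is given below, assuming the Hugging Face transformers and PyTorch APIs; the class and variable names are ours, and the kernel sizes and filter count are taken from Section 4.1, so this is our reading of Figure 1 rather than the authors' released code.</p>
        <preformat>
# Sketch of the model in Figure 1: XLM-Roberta with hidden states exposed,
# a TextCNN branch over each of the last three hidden layers (L10, L11, L12),
# and a classifier over the concatenation of P_O and the three CNN features.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel


class XLMRWithHiddenStates(nn.Module):
    def __init__(self, num_labels=5, num_filters=256, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(
            "xlm-roberta-base", output_hidden_states=True)
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        # One Conv2d per kernel height, applied to a (batch, 1, seq, hidden) map.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (k, hidden)) for k in kernel_sizes)
        self.dropout = nn.Dropout(0.2)
        # P_O plus one pooled CNN feature vector per semantic layer.
        feat_dim = hidden + 3 * num_filters * len(kernel_sizes)
        self.classifier = nn.Linear(feat_dim, num_labels)

    def cnn_features(self, layer):
        # layer: (batch, seq, hidden) -> (batch, num_filters * len(kernel_sizes))
        x = layer.unsqueeze(1)  # add a channel dimension for Conv2d
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x)).squeeze(3)          # (batch, filters, steps)
            pooled.append(torch.max(c, dim=2).values)   # max pooling over time
        return torch.cat(pooled, dim=1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        p_o = out.pooler_output   # P_O
        hs = out.hidden_states    # embeddings plus one entry per layer
        feats = [self.cnn_features(hs[i]) for i in (-3, -2, -1)]  # L10, L11, L12
        x = torch.cat([p_o] + feats, dim=1)
        return self.classifier(self.dropout(x))
</preformat>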
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing and Experiments Setup</title>
        <p>In the experiment, we try to clean the text, but it does not achieve the desired results, so the
text is not cleaned. In the Malayalam-English and Tamil-English tasks, the hyper-parameters
are set to the same values, and the best weights of the model are saved during training. In this
work, the official organizers provide training sets and validation sets. We combine the official
training set and the validation set to get a new dataset, which is split into a new training set
and a new validation set using Stratified 5-Fold Cross-validation1. Due to the imbalance of the
datasets, Stratified 5-Fold Cross-validation ensures that the proportion of samples in each
category in each fold remains unchanged.</p>
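        <p>A minimal sketch of this split, assuming scikit-learn's StratifiedKFold (footnote 1); the placeholder data stands in for the merged official training and validation sets.</p>
        <preformat>
# Sketch: Stratified 5-Fold split that preserves per-class proportions in
# every fold, which matters for the imbalanced label distributions here.
from sklearn.model_selection import StratifiedKFold

# Placeholder data so the snippet runs on its own; in practice these are
# the merged official training + validation comments and their labels.
texts = ["comment %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    print("fold", fold, "train size", len(train_idx), "val size", len(val_idx))
</preformat>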
        <p>For XLM-Roberta, we use the XLM-Roberta-base2 pre-trained model, which contains 12
layers. We use the Adam optimizer with a learning rate of 5e-5. The batch size is set to 32 and
the max sequence length is set to 60. We extract the hidden layer states of XLM-Roberta by
setting output_hidden_states to true. The model is trained for 10 epochs with a dropout rate of 0.2.</p>
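        <p>This setup might look like the following sketch; it reuses the hypothetical XLMRWithHiddenStates module from the Section 3.2 sketch, and the example comment is invented.</p>
        <preformat>
# Sketch of the training configuration: xlm-roberta-base tokenizer,
# max sequence length 60, Adam with learning rate 5e-5 (Section 4.1).
import torch
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
batch = tokenizer(["example code-mixed comment"], padding="max_length",
                  truncation=True, max_length=60, return_tensors="pt")

model = XLMRWithHiddenStates(num_labels=5)  # sketch from Section 3.2
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
logits = model(batch["input_ids"], batch["attention_mask"])
</preformat>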
        <p>For the convolution layer, we use 2D convolution (nn.Conv2d3). The sizes of the convolution
kernels are set to (3, 4, 5) and the number of convolution kernels is set to 256.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Analysis</title>
        <p>
          In this work, we find the limitations of P_O for sentiment analysis of Code-Mixed text in
Dravidian languages. In the classification task, the original output of BERT is P_O. Chakravarthi
et al. [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] pointed out that Multilingual BERT fails to identify the “Mixed feeling” class in the
Malayalam-English task and fails to identify the “Negative”, “Neutral”, and “Mixed feeling”
classes in the Tamil-English task. In the same way, we just put P_O as the output of
XLM-Roberta. The results are shown in Table 1. We can see that the results are not good when
P_O is used as the output of BERT and XLM-Roberta. We think that just using P_O as the output
loses some effective semantic information, so deep and abundant semantic features are
effective for this work. We extract the hidden states of XLM-Roberta and also discover that the
performance of the model improves as more semantic layers are added. Table 2 shows the
performance of our model with different semantic layers.
        </p>
        <p>Table 3 shows our results on the test set. For the two tasks, we only use the official training set
and validation set and do not use any external data. The hyper-parameters of the model are set
to the same values. In the Malayalam-English and Tamil-English tasks, our models all achieve the
best performance.</p>
        <p>1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
2. https://huggingface.co/xlm-roberta-base
3. https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we provide a baseline for sentiment analysis of Code-Mixed text in Dravidian
languages (Malayalam-English and Tamil-English). We find the limitation of only using the pooler
output as the output of BERT. To obtain deeper and more abundant semantic features, we extract
the hidden layer states of XLM-Roberta, which are input into convolution and max pooling layers.
The results show that obtaining more abundant semantic information features by extracting the
hidden states of XLM-Roberta helps to improve the performance of XLM-Roberta.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Haridas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gutjahr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nedungadi</surname>
          </string-name>
          ,
          <article-title>Spelling errors by normal and poor readers in a bilingual malayalam-english dyslexia screening test</article-title>
          ,
          <source>in: 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</article-title>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Padmamala, Word level translation (Tamil-English) with word sense disambiguation in Tamil using ontnet, in: 2015 International Conference on Computing and Communications Technologies (ICCCT), IEEE, 2015, pp. 191-198.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] O. Habimana, Y. Li, R. Li, X. Gu, G. Yu, Sentiment analysis using deep learning approaches: an overview, Science China Information Sciences 63 (2020) 1-36.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Veena, M. A. Kumar, K. Soman, An effective way of word-level language identification for code-mixed facebook comments using word-embedding via character-embedding, in: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2017, pp. 1552-1556.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Shekhar, D. K. Sharma, M. S. Beg, Language identification framework in code-mixed social media text based on quantum LSTM: the word belongs to which language?, Modern Physics Letters B 34 (2020) 2050086.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '20, 2020.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. A. Kumar, B. Premjith, S. Singh, S. Rajendran, K. Soman, An overview of the shared task on machine translation in Indian languages (MTIL)-2017, Journal of Intelligent Systems 28 (2019) 455-464.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291 (2019).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>