<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>YNU@Dravidian-CodeMix-FIRE2020: XLM-RoBERTa for Multi-language Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaozhi Ou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongling Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Kunming, 650500, Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This article describes the system that our team submitted to the Dravidian-CodeMix-FIRE 2020 shared task. The purpose of the task is to identify the sentiment polarity of code-mixed Dravidian (Malayalam-English and Tamil-English) comments and posts collected from social media. Our system is based on the pre-trained multilingual model XLM-RoBERTa and uses a K-fold ensemble method, aiming to address sentiment analysis of multilingual code-mixed text with a cross-lingual language model. We participated in both code-mixed language tracks (Malayalam-English and Tamil-English): our system achieved the best F-Score of 0.74 on Malayalam-English (rank 1 of 28) and ranked third on Tamil-English with an F-Score of 0.63.</p>
      </abstract>
      <kwd-group>
        <kwd>code-mixed</kwd>
        <kwd>multi-language</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>K-folding</kwd>
        <kwd>Dravidian-CodeMix-FIRE 2020</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>approach for ensemble [7]. Our model is available on GitHub: https://github.com/Ouxiaozhi/YNU_TEAM-IN-dravidian-codemix-.</p>
      <p>The rest of this article is organized as follows. Section 2 introduces related work. Section 3 describes
the data and approaches. Section 4 presents the experimental results. Finally, Section 5 presents our
conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>For the past two decades, sentiment analysis has been an active area of research in academia and
industry. In recent years, most of this work has focused on a single language. Related shared tasks
include the detection of offensive language in German<sup>1</sup> [8], the detection of hate speech in
Italian<sup>2</sup> [9], and SemEval-2019 Task 6, OffensEval<sup>3</sup>: Identifying and Categorizing
Offensive Language in Social Media [10]. Multilingual analysis is now attracting growing interest,
and several recently organized tasks have drawn a large number of researchers. Examples include
SemEval-2019 Task 5, HatEval<sup>4</sup>: Multilingual Detection of Hate Speech Against Immigrants and
Women in Twitter [11]; SemEval-2020 Task 12, OffensEval 2<sup>5</sup>: Multilingual Offensive Language
Identification in Social Media [12]; and the HASOC 2019 and HASOC 2020 shared tasks<sup>6</sup> on Hate
Speech and Offensive Content Identification in Indo-European Languages [13]. Code-mixing is a common
phenomenon in multilingual communities, and there is a growing demand for sentiment analysis of the
large volume of code-mixed social media text. In recent years, several shared tasks on code-mixed text
have been launched, such as SemEval-2020 Task 9<sup>7</sup>, Sentiment Analysis for Code-Mixed Social
Media Text [14]. Four editions of the Mixed Script Information Retrieval task were organized at the Forum
for Information Retrieval Evaluation (FIRE) [15]. Three workshops on Computational Approaches to
Linguistic Code-Switching (CALCS) were also held [16].</p>
      <p>Several researchers have analyzed sentiment in code-mixed text. Chittaranjan et al. [17] applied
word-level recognition to code-mixed data to classify emotions. Sharma et al. [18] performed a shallow
analysis of code-mixed data obtained from online social media. Bojanowski et al. [19] proposed
a word representation model based on skip-gram to classify the emotion of tweets. Giatsoglou et
al. [20] trained a hybrid system based on dictionary-based document vectors, word embeddings, and
word polarity to classify the sentiment of tweets.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data and Approaches</title>
      <sec id="sec-3-1">
        <title>3.1. Data description</title>
        <p>
          To run the experiment, we use the datasets in two languages (Malayalam-English [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Tamil-English [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) provided by the organizer. The data mainly comes from YouTube video comments.
The dataset contains all three types of code-mixed sentences: inter-sentential switching, intra-sentential
switching, and tag switching. The Malayalam-English dataset contains 6,739 comments,
and the Tamil-English dataset contains 15,744 comments. Table 1 shows the distribution of the
training set, validation set, and test set for the two languages.
        </p>
        <p><sup>1</sup> https://projects.fzai.h-da.de/iggsa/</p>
        <p><sup>2</sup> http://www.di.unito.it/~tutreeb/haspeede-evalita20/index.html</p>
        <p><sup>3</sup> https://competitions.codalab.org/competitions/20011</p>
        <p><sup>4</sup> https://competitions.codalab.org/competitions/19935</p>
        <p><sup>5</sup> http://alt.qcri.org/semeval2020/index.php?id=tasks</p>
        <p><sup>6</sup> https://hasocfire.github.io/hasoc/2019/index.html</p>
        <p><sup>7</sup> https://competitions.codalab.org/competitions/20654</p>
        <p>In our experiment, we merge the training set and validation set released by the organizer and
randomly shuffle their order. We process the data for both languages in this way. This yields
a new Malayalam-English training dataset with 5,391 comments and a new Tamil-English training
dataset with 12,595 comments. In the experimental run, we employ a cross-validation idea,
the K-fold ensemble method, to improve the overall classification performance of the model.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Approaches</title>
        <p>Compared with previous multilingual models, XLM-RoBERTa greatly expands the amount
of multilingual training data used in unsupervised masked language model (MLM) pre-training
and achieves state-of-the-art results on both monolingual and cross-lingual benchmarks [6]. As shown in
Figure 1, our model is built on the XLM-RoBERTa multilingual model. First, we take the
pooler output (P_O) and the sequence of hidden states output by the last layer of
XLM-RoBERTa. Then, we obtain H_AVG by average pooling and H_MAX by max pooling over these hidden
states. Finally, we concatenate P_O, H_AVG and H_MAX and feed the result into the classifier.</p>
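        <p>The pooling and concatenation step can be sketched in plain Python as follows. This is a minimal illustration rather than the submitted implementation: the function names are hypothetical, the hidden states are toy lists instead of XLM-RoBERTa tensors, and padding tokens are assumed to have been removed already.</p>
        <preformat><![CDATA[
```python
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average each hidden dimension over the token axis (H_AVG)."""
    n_tokens = len(hidden_states)
    return [sum(column) / n_tokens for column in zip(*hidden_states)]


def max_pool(hidden_states: List[Vector]) -> Vector:
    """Take the maximum of each hidden dimension over the token axis (H_MAX)."""
    return [max(column) for column in zip(*hidden_states)]


def build_classifier_input(pooler_output: Vector, hidden_states: List[Vector]) -> Vector:
    """Concatenate P_O, H_AVG and H_MAX into one feature vector."""
    return pooler_output + mean_pool(hidden_states) + max_pool(hidden_states)


# Toy example: 3 tokens with hidden size 2 (XLM-RoBERTa base uses hidden size 768).
last_hidden = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
p_o = [0.5, -0.5]
features = build_classifier_input(p_o, last_hidden)
print(features)  # [0.5, -0.5, 2.0, 2.0, 3.0, 4.0]
```
]]></preformat>
        <p>With a hidden size of 768, the concatenated vector passed to the classifier would have 3 x 768 = 2,304 dimensions.</p>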
      </sec>
      <sec id="sec-3-3">
        <title>3.3. K-folding ensemble</title>
        <p>To improve the overall classification performance of the model, we employ
a K-fold ensemble method. The design of this method comes from K-fold cross-validation. The
source data is randomly divided into K parts; K-1 subsets are used for training and the remaining subset
is used as the validation set, and this process is repeated K times so that each subset serves once as the
validation set. Finally, the K results are accumulated and
averaged to obtain the final output. The purpose of the K-fold ensemble is to train on a different subset of
the data in each fold, so that the model extracts different features in each training run, and thereby
to further improve the generalization ability of the model. The K-fold ensemble method is shown in
Figure 2.</p>
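        <p>The procedure above can be sketched as follows. This is an illustrative outline under stated assumptions, not the submitted code: the helper names, the (text, label) data layout, and the toy class-prior "model" that stands in for fine-tuning XLM-RoBERTa are all hypothetical.</p>
        <preformat><![CDATA[
```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]            # (text, label); a hypothetical data layout
Probs = List[float]                  # one probability per class
Predictor = Callable[[str], Probs]


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def k_fold_ensemble(
    data: Sequence[Example],
    test_texts: Sequence[str],
    train_fn: Callable[[List[Example], List[Example]], Predictor],
    k: int,
) -> List[Probs]:
    """Train k models, each validated on a different fold, and average their test outputs."""
    folds = k_fold_indices(len(data), k)
    totals: List[Probs] = []
    for held_out in range(k):
        val = [data[i] for i in folds[held_out]]
        train = [data[i] for j, fold in enumerate(folds) if j != held_out for i in fold]
        predict = train_fn(train, val)
        fold_probs = [predict(text) for text in test_texts]
        if not totals:
            totals = fold_probs
        else:
            totals = [[a + b for a, b in zip(t, p)] for t, p in zip(totals, fold_probs)]
    return [[value / k for value in row] for row in totals]


def class_prior_trainer(train: List[Example], val: List[Example]) -> Predictor:
    """Toy stand-in for fine-tuning: always predicts the label frequencies of `train`."""
    n_classes = 2
    counts = [0] * n_classes
    for _, label in train:
        counts[label] += 1
    prior = [c / len(train) for c in counts]
    return lambda text: list(prior)


toy_data = [(f"comment {i}", i % 2) for i in range(8)]
ensemble_probs = k_fold_ensemble(toy_data, ["new comment"], class_prior_trainer, k=4)
print(ensemble_probs)
```
]]></preformat>
        <p>In a real run, the training function would fine-tune one model per fold and could use the validation fold for early stopping or model selection; the final label would then be the class with the highest averaged probability.</p>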
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment results</title>
      <sec id="sec-4-1">
        <title>4.1. Experiment setting</title>
        <p>In our experiment, we did not preprocess the data. Our model is implemented in PyTorch. We
use XLM-RoBERTa base, which contains 12 layers, as our pre-trained model. We use 8-fold
cross-validation, and the maximum sentence length is 160. We use a learning rate of 3e-5, cross-entropy
loss, and Adam as the optimizer. To save GPU memory, the batch size is set to 4 and the number of gradient
accumulation steps is set to 4, so that gradients are accumulated over 4 mini-batches (an effective
batch size of 16) before each parameter update.</p>
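        <p>Gradient accumulation with these settings can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it replaces XLM-RoBERTa with a one-parameter linear model and Adam with plain gradient descent, and all names are hypothetical. It only shows how four mini-batches of size 4 contribute to a single parameter update.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

BATCH_SIZE = 4            # mini-batch size reported in the paper
ACCUMULATION_STEPS = 4    # mini-batches accumulated per parameter update
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * ACCUMULATION_STEPS  # 16

Sample = Tuple[float, float]  # (input x, target y) for a toy model y = w * x


def batch_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 over the batch, w.r.t. w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train_one_epoch(w: float, data: List[Sample], lr: float) -> Tuple[float, int]:
    """Run one pass over `data`, updating w once every ACCUMULATION_STEPS mini-batches."""
    accumulated = 0.0
    n_updates = 0
    for step, start in enumerate(range(0, len(data), BATCH_SIZE), start=1):
        batch = data[start:start + BATCH_SIZE]
        # Scale each mini-batch gradient so the accumulated sum matches the
        # mean gradient over all 16 samples (for equally sized mini-batches).
        accumulated += batch_gradient(w, batch) / ACCUMULATION_STEPS
        if step % ACCUMULATION_STEPS == 0:
            w -= lr * accumulated   # one update per 16 samples
            accumulated = 0.0
            n_updates += 1
    # Gradients from a trailing, incomplete group of mini-batches are discarded here.
    return w, n_updates


toy_data = [(float(i), 2.0 * i) for i in range(1, 33)]   # 32 samples of y = 2x
w_after, n_updates = train_one_epoch(0.0, toy_data, lr=1e-3)
print(n_updates)  # 2
```
]]></preformat>
        <p>Only one mini-batch of 4 samples needs to be held in memory at a time, which is the reason for accumulating gradients instead of using a batch of 16 directly.</p>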
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results analysis</title>
        <p>The primary evaluation metric for the task is the F-Score toward the positive class, a trade-off between
Precision (P) and Recall (R) [21]. Table 2 shows the 8-fold cross-validation results of our experiment
on our validation set (the best regression head for each model and language is in bold).</p>
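        <p>For reference, the standard F-Score for a single class can be computed from P and R as in the sketch below. The function name and the beta parameter are illustrative; beta = 1 gives the usual harmonic mean. Scores that are averaged over several classes need not satisfy this identity exactly.</p>
        <preformat><![CDATA[
```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1 gives the harmonic mean of P and R."""
    if precision + recall == 0.0:
        return 0.0  # convention when both precision and recall are zero
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


print(round(f_score(0.74, 0.74), 2))  # 0.74
```
]]></preformat>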
        <p>First, Table 2 shows that the model performs better on the Malayalam-English
task than on Tamil-English. Second, choosing the right regression head does help to
obtain better performance. For the Malayalam-English final submission, we select the XLM-RoBERTa
model with the P_O &amp; H_MAX &amp; H_AVG regression head. On the test data, it achieves an F-Score
of 0.74 (P = 0.74, R = 0.74), which is 0.064 higher than our 8-fold cross-validation result and ranks first
on the competition leaderboard. For the Tamil-English final submission, we select the XLM-RoBERTa
model with the P_O &amp; H_MAX &amp; H_AVG regression head. On the test data, it achieves an F-Score
of 0.63 (P = 0.61, R = 0.67), which is 0.016 higher than our 8-fold cross-validation result and ranks third
on the competition leaderboard.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This article introduces our overall idea and specific plan for participating in the
Dravidian-CodeMix-FIRE 2020 shared task, which aims to identify the sentiment polarity of code-mixed datasets
in Dravidian languages (Malayalam-English and Tamil-English) collected from social media. We use
a multilingual pre-trained model based on XLM-RoBERTa for classification and a K-fold method
for ensembling. Our results demonstrate that multilingual models perform well on code-mixed datasets,
and we suggest that code-mixed NLP practitioners consider at least one of the XLM-RoBERTa variants
when selecting language models for their NLP systems. The importance of moving beyond
English-centric NLP research has been widely discussed, and we believe that research on
non-English languages will increase. We believe that the best models in the future will learn not only from
different fields but also from different languages.
      </p>
      <p>[6] K. Pant, T. Dadu, Cross-lingual inductive transfer to detect offensive language, arXiv preprint arXiv:2007.03771 (2020).</p>
      <p>[7] B. Wang, Y. Ding, S. Liu, X. Zhou, Ynu_wb at hasoc 2019: Ordered neurons lstm with attention for identifying hate speech and offensive language, in: FIRE (Working Notes), 2019, pp. 191–198.</p>
      <p>[8] J. M. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, et al., Overview of germeval task 2, 2019 shared task on the identification of offensive language (2019).</p>
      <p>[9] C. Bosco, D. Felice, F. Poletto, M. Sanguinetti, T. Maurizio, Overview of the evalita 2018 hate speech detection task, in: EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, CEUR, 2018, pp. 1–9.</p>
      <p>[10] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval), arXiv preprint arXiv:1903.08983 (2019).</p>
      <p>[11] V. Basile, C. Bosco, E. Fersini, N. Debora, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, et al., Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter, in: 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2019, pp. 54–63.</p>
      <p>[12] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020), arXiv preprint arXiv:2006.07235 (2020).</p>
      <p>[13] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the hasoc track at fire 2019: Hate speech and offensive content identification in indo-european languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14–17.</p>
      <p>[14] P. Patwa, G. Aguilar, S. Kar, S. Pandey, S. PYKL, B. Gambäck, T. Chakraborty, T. Solorio, A. Das, Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets, arXiv preprint arXiv:2008.04277 (2020).</p>
      <p>[15] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, M. Choudhury, Overview of the mixed script information retrieval (msir) at fire-2016, in: Forum for Information Retrieval Evaluation, Springer, 2016, pp. 39–49.</p>
      <p>[16] G. Molina, F. AlGhamdi, M. Ghoneim, A. Hawwari, N. Rey-Villamizar, M. Diab, T. Solorio, Overview for the second shared task on language identification in code-switched data, arXiv preprint arXiv:1909.13016 (2019).</p>
      <p>[17] G. Chittaranjan, Y. Vyas, K. Bali, M. Choudhury, Word-level language identification using crf: Code-switching shared task report of msr india system, in: Proceedings of The First Workshop on Computational Approaches to Code Switching, 2014, pp. 73–79.</p>
      <p>[18] S. Sharma, P. Srinivas, R. C. Balabantaray, Text normalization of code mix and sentiment analysis, in: 2015 international conference on advances in computing, communications and informatics (ICACCI), IEEE, 2015, pp. 1468–1473.</p>
      <p>[19] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.</p>
      <p>[20] M. Giatsoglou, M. G. Vozalis, K. Diamantaras, A. Vakali, G. Sarigiannidis, K. C. Chatzisavvas, Sentiment analysis leveraging emotions and word embeddings, Expert Systems with Applications 69 (2017) 214–224.</p>
      <p>[21] C. Goutte, E. Gaussier, A probabilistic interpretation of precision, recall and f-score, with implication for evaluation, International Journal of Radiation Biology Related Studies in Physics Chemistry Medicine 51 (2005) 952–952.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>