<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CIA_NITT@Dravidian-CodeMix-FIRE2020: Malayalam-English Code-Mixed Sentiment Analysis Using Sentence BERT and Sentiment Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yandrapati Prakash Babu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajagopal Eswari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K Nimmi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Applications, National Institute of Technology</institution>
          ,
          <addr-line>Trichy</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Code-mixing is the mixing of languages while writing text. The biggest challenge in Malayalam-English code-mixing is that people switch between the languages (e.g., Malayalam and English) and use English phonetic typing instead of writing Malayalam words in Dravidian script. Traditional NLP models are trained with extensive monolingual resources (e.g., Malayalam or English), so code-mixing is challenging for them because they cannot handle code-mixed data. The Dravidian-CodeMix FIRE 2020 task is to classify comments into positive, negative, unknown_state, mixed_feelings and not-malayalam categories based on message-level polarity, using a Malayalam-English dataset. The classification model used for this challenging task is sentence-level BERT. We achieved an F1-score of 0.71 (ranked 4th) on the Malayalam-English (Manglish) comments dataset, submitted under the username CIA_NITT.</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>Code Mixed</kwd>
        <kwd>Manglish</kwd>
        <kwd>Sentiment Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The code-mixing considered here is Malayalam sentences written using the English alphabet [10, 11].
Code-mixing is the word-level alternation of languages that often occurs by combining words
from one language with another language’s rules, according to Solorio et al. [12]. Mixing
languages while writing text is also known as code-mixing. Natural language processing (NLP)
provides computers with the knowledge needed to understand the languages humans speak;
it involves syntax analysis (grammatical rules) and semantic analysis (meaning). Sentiment
analysis is a classification technique that identifies the feelings expressed about a subject and is
commonly used to describe a human’s emotional states. It may be carried out at the sentence,
document, aspect or phrase level. To the best of our knowledge, we did not find prior research
on sentiment analysis over Manglish corpora. The task organizers created the datasets for
Malayalam-English and Tamil-English, and they clearly explained how they gathered and labeled
the comments in the datasets [
        <xref ref-type="bibr" rid="ref4">4, 13</xref>
        ]. This paper proposes sentiment analysis using
sentence BERT [14] for comments in Malayalam-English (Manglish).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Research Work</title>
      <p>Related studies on computational models for code-mixing are few, since there is a scarcity of
conventional text corpora. Joshi et al. carried out pioneering work on sentiment analysis of the
Hindi language, developing a three-step fallback model [15] based on machine translation,
sentiment lexicons and classification; the framework performed best with unigram features.
Vyas et al. applied Parts of Speech (POS) tagging [16] to Hindi sentences embedded with English
words and found that Hindi language recognition and transliteration are two significant
challenges that affect the accuracy of POS tagging. Prabhu et al. proposed a sub-word LSTM
(Subword-LSTM) [17] that functions well on extremely noisy text and text containing
misspellings; they obtained 4-5% higher accuracy than traditional approaches. Word-level cleaning
of noisy code-mixed content based on orthographic information was explored in [18]. Deep
learning models have been used for text classification to identify Manglish and Tanglish
sentiments [19, 20]. Kumar et al. used an ensemble model [21] for code-mixed tweet analysis and
achieved an F1-score of 0.70 on the Hindi-English (Hinglish) and 0.725 on the Spanish-English
(Spanglish) datasets.</p>
      <p>
        For code-mixed social media text analysis, Singh et al. used unsupervised cross-lingual
embeddings [22] and obtained an F1-score of 0.6355. Sharma et al. developed a shallow parser
for Hindi-English code-mixed social media text [23]; this shallow parser was modelled
as three separate sequence labeling problems. Advani et al. built a classifier that uses
hand-engineered lexical, sentiment, and metadata features [24] to distinguish between "positive",
"neutral" and "negative" feelings. Thomas et al. performed a sentiment study of
transliterated text using an RNN-LSTM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] technique to extract the sentiments of transliterated text. Das et
al. [25] suggested a computational technique to produce a SentiWordNet equivalent for Bengali
from publicly accessible English sentiment lexicons and a bilingual English-Bengali dictionary.
Bhargava et al. combined language recognition [26] with a sentiment mining method over
four languages. Akhtar et al. proposed a novel hybrid deep learning architecture [27] that is
highly successful at analysing emotions in resource-poor languages, evaluated on four Hindi
datasets covering various domains. A rule-based, n-gram, multivariate
feature selection system was developed by Abbasi et al. [28].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data and Pre-Processing</title>
        <p>The task organizers provided a dataset collected from YouTube1. The dataset has multiple
polarities: positive, negative, not-malayalam, unknown_state and mixed_feelings. The
comments in the dataset are noisy; to clean them, we applied pre-processing steps
such as converting comments to lower case, removing special characters, replacing emojis with
related words, and collapsing characters that repeat more than two
times within a word. The dataset has a class imbalance problem: the labels not-malayalam, unknown_state
and mixed_feelings have fewer samples than the positive and negative labels. The
statistics of each class are tabulated in Table 1.</p>
        <sec id="sec-3-1-1">
          <title>Labels</title>
          <p>Positive
Negative
Not-malayalam
Unknown_state
Mixed_feelings</p>
        </sec>
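        <p>The pre-processing steps above can be sketched as follows. This is a minimal illustration: the emoji-to-word mapping is a hypothetical two-entry example, not the authors' actual table.</p>

```python
import re

# Hypothetical emoji-to-word mapping; the actual mapping used by the
# authors is not given in the paper.
EMOJI_WORDS = {"\U0001F600": "happy", "\U0001F622": "sad"}

def preprocess(comment):
    text = comment.lower()                       # 1. convert to lower case
    for emoji, word in EMOJI_WORDS.items():      # 2. replace emojis with related words
        text = text.replace(emoji, " " + word + " ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # 3. remove special characters
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # 4. cap repeated characters at two
    return re.sub(r"\s+", " ", text).strip()
```

        <p>For example, a comment such as "Superrrr!!!" would be normalised to "superr" under these steps.</p>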
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Description</title>
        <p>Model-1 is based on Sentence BERT (SBERT) [14] and Manglish features. In this work, 252
Manglish features (Manglish sentiment words) were manually gathered from the YouTube comments.
The whole SBERT model is fine-tuned on the training dataset. The original comment
is wrapped with the [CLS] and [SEP] special tokens, and the word-piece tokenizer is applied
for tokenization. The embedding of each token is obtained by summing its
word-piece, position, and segment embeddings. A stack of 12 transformer encoder layers is applied
to these token embeddings to obtain the final hidden state vectors. We treat
the final hidden vector of the [CLS] token, v1 ∈ R^768, as the representation
of the comment. For the feature vector representation v2 ∈ {0, 1}^N, where N = 252 is the number of features,
one vector of length 252 is generated for every comment (if a Manglish
sentiment word appears in the comment, a 1 is appended to the vector, otherwise a 0).
Then v1 and v2 are concatenated, v = v1 ⊕ v2, where ⊕ denotes concatenation, and passed through a
fully connected softmax layer to obtain the predicted label ŷ ∈ R^C, where C is the number of classes.
Figure 1(a) shows the methodology of Model-1.</p>
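        <p>As an illustration, the construction of the binary feature vector v2 and the concatenation v = v1 ⊕ v2 can be sketched as follows. The three-word sentiment list and the plain Python lists are hypothetical stand-ins for the 252 Manglish sentiment words and the 768-dimensional [CLS] vector.</p>

```python
def manglish_feature_vector(comment, sentiment_words):
    """Build v2: one 0/1 entry per Manglish sentiment word."""
    tokens = set(comment.lower().split())
    return [1 if word in tokens else 0 for word in sentiment_words]

def fuse(cls_vector, feature_vector):
    """v = v1 concatenated with v2, fed to the softmax classifier."""
    return list(cls_vector) + list(feature_vector)

# Hypothetical 3-word sentiment list standing in for the 252 real ones.
sentiment_words = ["adipoli", "kidu", "mosham"]
v2 = manglish_feature_vector("Adipoli padam", sentiment_words)
v1 = [0.0] * 768                 # stands in for the SBERT [CLS] vector
v = fuse(v1, v2)                 # length 768 + 3
```

        <p>In the actual model the fused vector has length 768 + 252 = 1020 before the fully connected softmax layer.</p>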
        <p>ŷ = softmax(W·v + b)    (1)</p>
        <p>1https://dravidian-codemix.github.io/2020/datasets.html</p>
        <p>Figure 1: (a) SBERT with Feature Vector; (b) SBERT with CBL.</p>
        <p>Model-2 is also based on SBERT, but instead of cross-entropy loss, Class Balanced Loss (CBL)
[29] is used to handle the imbalance problem in the dataset. Here each class y is weighted by
(1 − β)/(1 − β^{n_y}), where n_y is the number of samples with label y, γ is
the balanced focal loss factor, C denotes the total number of classes, y denotes labels from 1, 2, ..., C, p_i
denotes probabilities varying between 0 and 1, and β and γ are the hyperparameters. Figure 1(b) shows
the methodology of Model-2.</p>
        <p>CBL = −((1 − β)/(1 − β^{n_y})) Σ_{i=1}^{C} (1 − p_i)^γ log(p_i)    (2)</p>
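        <p>A minimal sketch of the class-balanced weighting behind Equation (2), following Cui et al. [29]: rarer classes receive larger weights. The simplified loss below applies the focal term to the true-class probability only, and the sample counts are hypothetical.</p>

```python
import math

def cb_weights(samples_per_class, beta=0.9999):
    """Per-class weight (1 - beta) / (1 - beta**n_y)."""
    return [(1 - beta) / (1 - beta ** n) for n in samples_per_class]

def cb_focal_loss(probs, label, samples_per_class, beta=0.9999, gamma=2.0):
    """Simplified class-balanced focal loss for one prediction:
    -w_y * (1 - p_y)**gamma * log(p_y), where y is the true label."""
    w_y = cb_weights(samples_per_class, beta)[label]
    p_y = probs[label]
    return -w_y * ((1 - p_y) ** gamma) * math.log(p_y)
```

        <p>With β = 0.9999 and γ = 2.0 (the values used in Section 4), a minority class such as mixed_feelings receives a much larger weight than the majority positive class.</p>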
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation Details</title>
      <p>Model-1 uses SBERT with a Feature Vector. Its SBERT hyperparameters
are set to epochs = 3, batch size = 32, learning rate = 3e-5 and dropout = 0.4. Model-2 uses
SBERT with Class Balanced Loss, for which we set the additional hyperparameters
β = 0.9999 and γ = 2.0. Both models are implemented using the Transformers library in PyTorch [30].
The implementation code is available on GitHub2.</p>
      <p>2https://github.com/prakashbabuy/manglish/</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>[Tables 2–4: per-class precision, recall and F1-score for the labels Mixed_feelings, Negative, Positive, not-malayalam and unknown_state.]</p>
      <p>To classify code-mixed Malayalam and English comments, we experimented with three models,
namely SBERT, SBERT with Feature Vector, and SBERT with CBL. The class-wise SBERT
performance is shown in Table 2; due to class imbalance, the SBERT model's performance is limited.
The performances of the SBERT with Feature Vector and SBERT with CBL models are reported
in Tables 3 and 4. The SBERT with Feature Vector model achieved an F1-score of 70.29% and the
SBERT with CBL model achieved an F1-score of 71%. From Table 5, it is clear that SBERT with
CBL achieved slightly better results than SBERT with Feature Vector. Both our
proposed models outperformed plain SBERT.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents two models, based on Sentence BERT, to classify code-mixed Malayalam
and English comments. The task is treated as a multiclass text classification problem.
The model based on SBERT with CBL achieved an F1-score of 71%. In the future, we will improve
the model to detect sarcastic Manglish comments.</p>
      <p>[8, cont.] phonetic transcription, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation, Dublin, Ireland, 2019, pp. 56–63. URL: https://www.aclweb.org/anthology/W19-6809.
[9] S. Suryawanshi, B. R. Chakravarthi, P. Verma, M. Arcan, J. P. McCrae, P. Buitelaar, A dataset for troll classification of Tamil memes, in: Proceedings of the 5th Workshop on Indian Language Data Resource and Evaluation (WILDRE-5), European Language Resources Association (ELRA), Marseille, France, 2020.
[10] B. R. Chakravarthi, P. Rani, M. Arcan, J. P. McCrae, A survey of orthographic information in machine translation, arXiv e-prints (2020) arXiv–2008.
[11] P. Rani, S. Suryawanshi, K. Goswami, B. R. Chakravarthi, T. Fransen, J. P. McCrae, A comparative study of different state-of-the-art hate speech detection methods for Hindi-English code-mixed data, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020.
[12] T. Solorio, Y. Liu, Learning to predict code-switching points, in: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008, pp. 973–981.
[13] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[14] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410. doi:10.18653/v1/D19-1410.
[15] A. Joshi, A. Balamurali, P. Bhattacharyya, et al., A fall-back strategy for sentiment analysis in Hindi: a case study, Proceedings of the 8th ICON (2010).
[16] Y. Vyas, S. Gella, J. Sharma, K. Bali, M. Choudhury, POS tagging of English-Hindi code-mixed social media content, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 974–979.
[17] A. Prabhu, A. Joshi, M. Shrivastava, V. Varma, Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text, arXiv preprint arXiv:1611.00472 (2016).
[18] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020. URL: http://hdl.handle.net/10379/16100.
[19] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '20, 2020.
[20] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.
[21] A. Kumar, H. Agarwal, K. Bansal, A. Modi, BAKSA at SemEval-2020 task 9: Bolstering CNN with self-attention for sentiment analysis of code mixed text, arXiv preprint arXiv:2007.10819 (2020).
[22] P. Singh, E. Lefever, Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings, in: Proceedings of the 4th Workshop on Computational Approaches to Code Switching, 2020, pp. 45–51.
[23] A. Sharma, S. Gupta, R. Motlani, P. Bansal, M. Srivastava, R. Mamidi, D. M. Sharma, Shallow parsing pipeline for Hindi-English code-mixed social media text, arXiv preprint arXiv:1604.03136 (2016).
[24] L. Advani, C. Lu, S. Maharjan, C1 at SemEval-2020 task 9: SentiMix: Sentiment analysis for code-mixed social media text using feature engineering, arXiv preprint arXiv:2008.13549 (2020).
[25] A. Das, S. Bandyopadhyay, Subjectivity detection in English and Bengali: A CRF-based approach, Proceedings of ICON (2009).
[26] R. Bhargava, Y. Sharma, S. Sharma, Sentiment analysis for mixed script Indic sentences, in: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2016, pp. 524–529.
[27] M. S. Akhtar, A. Kumar, A. Ekbal, P. Bhattacharyya, A hybrid deep learning architecture for sentiment analysis, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 482–493.
[28] A. Abbasi, S. France, Z. Zhang, H. Chen, Selecting attributes for sentiment classification using feature relation networks, IEEE Transactions on Knowledge and Data Engineering 23 (2010) 447–462.
[29] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, S. Belongie, Class-balanced loss based on effective number of samples, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9260–9269.
[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Improving Wordnets for Under-Resourced Languages Using Machine Translation</article-title>
          ,
          <source>in: Proceedings of the 9th Global WordNet Conference, The Global WordNet Conference 2018 Committee</source>
          ,
          <year>2018</year>
          . URL: http://compling.hss.ntu.edu.sg/events/2018-gwc/pdfs/GWC2018_paper_
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages</article-title>
          ,
          <source>in: 2nd Conference on Language, Data and Knowledge (LDK</source>
          <year>2019</year>
          ), volume
          <volume>70</volume>
          of OpenAccess Series in Informatics (OASIcs),
          <source>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          , Dagstuhl, Germany,
          <year>2019</year>
          , pp.
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          :
          <fpage>14</fpage>
          . URL: http://drops.dagstuhl.de/opus/volltexte/2019/10370. doi:
          <volume>10</volume>
          .4230/OASIcs. LDK.
          <year>2019</year>
          .
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>WordNet gloss translation for under-resourced languages using multilingual neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation</source>
          , Dublin, Ireland,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://www.aclweb.org/anthology/W19-7101.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</article-title>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/ 2020.sltu-
          <volume>1</volume>
          .
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Myers-Scotton</surname>
          </string-name>
          ,
          <article-title>Duelling languages: Grammatical structure in codeswitching</article-title>
          , Oxford University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vegupatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Named entity recognition for code-mixed Indian corpus using meta embedding</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stearns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zarrouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Multilingual multimodal machine translation for Dravidian languages utilizing</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>