<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CoSaD: Code-Mixed Sentiments Analysis for Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fazlourrahman Balouchzahi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosahalli Lakshmaiah Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Analyzing sentiments or opinions in code-mixed languages is gaining importance due to the increase in the use of social media and online platforms, especially during the Covid-19 pandemic. In a multilingual society like India, code-mixing and script mixing are quite common, as people, especially the younger generation, are quite familiar with using more than one language. In view of this, the current paper describes the models submitted by our team MUCIC for the shared task 'Sentiment Analysis (SA) for Dravidian Languages in Code-Mixed Text'. The objective of this shared task is to develop and evaluate models for code-mixed datasets in three Dravidian languages, namely: Kannada, Malayalam, and Tamil, mixed with the English language, resulting in Kannada-English (Ka-En), Malayalam-English (Ma-En), and Tamil-English (Ta-En) language pairs. N-grams of char, char-sequence, and syllable features are transformed into feature vectors and used to train three Machine Learning (ML) classifiers combined with majority voting. The predictions on the Test set obtained average weighted F1-scores of 0.628, 0.726, and 0.619, securing 2nd, 4th, and 5th ranks for the Ka-En, Ma-En, and Ta-En language pairs respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-Mixing</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Dravidian Languages</kwd>
        <kwd>n-grams</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The task of analyzing the opinions, feelings, and reviews posted on social media or online marketplaces to identify the sentiments of users about a given topic, movie, song, product, etc. is called Sentiment Analysis (SA). For example, a video on Instagram or a product in an e-market can become viral and popular based on the reviews and sentiments posted by customers/users
        [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ]. Lately, the demand for SA of social media data has increased both in academia and industry, especially for code-mixed data [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]. Code-mixed data are common in multilingual communities such as India, where people use words, grammar, and phrases from more than one language in their communication, posts, and comments on social media, or in reviews on online shopping websites [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ].
      </p>
      <p>Code-mixed content in Dravidian languages is usually a combination of a native language such as Kannada, Tamil, or Malayalam and the English language at different linguistic units such as sentence, phrase, word, morpheme, and sub-word. The code-mixed text will either be in a single script, usually the Roman script, or in multiple scripts, i.e., a combination of the Roman and native scripts, possibly with a few words of the native language in the Roman script. Table 1 presents some examples of single- and multi-script code-mixed content in the Ta-En, Ma-En, and Ka-En language pairs from the datasets used in the shared task.</p>
      <p>
        Dravidian languages in general are under-resourced languages, and code-mixing adds a further dimension, mainly due to the problems with collecting and annotating code-mixed data for various applications. 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' is a shared task in Dravidian-CodeMix-FIRE20211 with the aim of promoting SA of code-mixed texts in the Ka-En, Ma-En, and Ta-En language pairs [
        <xref ref-type="bibr" rid="ref4 ref5">5, 6</xref>
        ]. This shared task is an extension of the previous shared task on SA in Ta-En and Ma-En in FIRE 2020 [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ] with the addition of the Ka-En language pair [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ].
      </p>
      <p>
        The objective of the shared task is to identify the opinion/sentiment of the comments posted by users on a given topic and to classify them into one of the following categories:
• Positive: comments contain positive content or indicate that the speaker is in a positive state
• Negative: comments contain negative content or indicate that the speaker is in a negative state
• Mixed_Feelings: comments contain positive as well as negative content and hence cannot be explicitly categorized into one of the two classes mentioned earlier
• Unknown_state: the emotional state of the speaker is not clear, or comments do not contain explicitly positive or negative content
• Not in intended language: comments are not written in the intended language
In earlier works, i) Balouchzahi et al. [1] experimented with various features such as Skipgram word embeddings, BPEmb2 sub-word embeddings, and a combination of word and char n-grams to train ML classifiers for SA, and ii) Balouchzahi et al. [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ] also explored and compared different learning approaches such as ML, Deep Learning (DL), and Transfer Learning (TL) for SA. In continuation of these works on SA in Dravidian languages, this paper describes the models
      </p>
      <sec id="sec-1-1">
        <title>1: https://dravidian-codemix.github.io/2021/index.html 2: https://nlp.h-its.org/bpemb/</title>
        <p>submitted by our team MUCIC to the Dravidian-CodeMix-FIRE2021 shared task. Three different feature sets, namely: char, char sequences, and syllables, are explored to check the effectiveness of char-level (characters) and sub-word-level (char sequences and syllables) n-grams for the code-mixed SA task. Each feature set is individually used to train three ML classifiers, namely: Linear Support Vector Machine (LSVM), Logistic Regression (LR), and Multi-Layer Perceptron (MLP), and the majority voting of the predictions of all the classifiers is used to classify the given sentiment. The code of the proposed methodology is available in our GitHub repository3.</p>
        <p>The rest of the paper is organized as follows: Section 2 gives a summary of the best models submitted to the 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' shared task in Dravidian-CodeMix-FIRE20204, and the methodology is described in Section 3. Section 4 describes the results obtained, and the paper concludes in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Researchers submitted several models to the 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' shared task in Dravidian-CodeMix-FIRE2020 organized by Chakravarthi et al.
[
        <xref ref-type="bibr" rid="ref2 ref6">3, 7</xref>
        ]. The shared task consisted of similar sentiment categories (as mentioned in Section 1) in two language pairs, namely: Ta-En and Ma-En. The organizers collected YouTube comments to develop datasets consisting of 15,744 and 6,739 comments in the Ta-En and Ma-En language pairs respectively, and provided them to the participants of the shared task as Train, Dev, and Test sets. The label distribution of the comments in the dataset shown in Table 2 (borrowed from
[
        <xref ref-type="bibr" rid="ref1">2</xref>
        ]) illustrates that the dataset is imbalanced for both language pairs.
      </p>
      <p>Participants were supposed to train and evaluate their models locally on the Train and Dev sets respectively, and then predict the class labels of the Test set. These predictions were submitted to the shared task organizers for final evaluation and ranking, which was based on average weighted F1-scores. Brief descriptions of the models which exhibited good performance in this shared task are given below:</p>
      <p>
        Most of the successful teams utilized Multilingual BERT (mBERT5) [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ] and XLM-Roberta
[
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] - multilingual transformer-based models - for SA, similar to code-mixed Offensive Language Identification (OLI) in Dravidian languages [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. With the Masked Language Modeling (MLM) objective, mBERT was trained on the top 104 languages with the largest Wikipedias, including Kannada, Malayalam, and Tamil. Pires et al. [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] describe that mBERT can be employed for cross-lingual generalization. Moreover, based on the authors' experiments,
      </p>
      <sec id="sec-2-1">
        <title>3https://github.com/fazlfrs/CoSaD</title>
        <p>
          4https://dravidian-codemix.github.io/2020/index.html
5https://github.com/google-research/bert/blob/master/multilingual.md
even with low lexical overlap among different languages, mBERT is capable of transferring between languages with different scripts by capturing multilingual representations. XLM-Roberta also relies on the MLM objective and cross-lingual transfer. By using large-scale multilingual pre-training on 2.5 TB of clean CommonCrawl data in 100 languages
[
          <xref ref-type="bibr" rid="ref11">12</xref>
          ], XLM-Roberta overcomes the limitations of XLM [13] and mBERT in learning useful representations for under-resourced languages.
        </p>
        <p>
          Sun et al. [
          <xref ref-type="bibr" rid="ref11">12</xref>
          ] proposed an XLM-Roberta based model that extracts rich semantic information from the hidden-layer states of XLM-Roberta, which is then fed into convolution and max-pooling layers. Further, they concatenated the top hidden states and the pooler output to improve performance, and reported that the proposed model obtained better results without any pre-processing. The proposed model outperformed all other models submitted to the shared task, securing 1st rank for both language pairs with average weighted F1-scores of 0.74 and 0.65 for the Ma-En and Ta-En language pairs respectively.
        </p>
        <p>
          Ou et al. [14] developed an XLM-Roberta based model similar to the work of Sun et al. [
          <xref ref-type="bibr" rid="ref11">12</xref>
          ].
Here, the authors obtained the pooler output and the sequence of hidden states of the last layer of XLM-Roberta, and fed the concatenation of the pooler output with the average-pooling and max-pooling of the hidden states into a classifier. They merged and shuffled the Train and Dev sets and used k-fold cross-validation to enhance the performance of the system. They obtained 1st rank for the Ma-En language pair with average weighted F1-scores of 0.74 and 0.63 for the Ma-En and Ta-En language pairs respectively. In a simpler approach, Sun et al. [15] demonstrated the efficiency of multilingual transformers by fine-tuning the pre-trained multilingual BERT model multi_cased_L-12_H-768_A-126. They secured 2nd and 4th ranks with average weighted F1-scores of 0.73 and 0.62 for the Ma-En and Ta-En language pairs respectively.
        </p>
        <p>Huang et al. [16] proposed a multi-step integration of fine-tuned XLM-Roberta and mBERT transformers for the shared task and obtained average weighted F1-scores of 0.73 and 0.63 for the Ma-En and Ta-En language pairs respectively. They used mBERT as a binary classifier and XLM-Roberta as a quaternary classifier and intertwined both models' predictions for the final decision. Zhu et al. [17] experimented with an mBERT-based model along with a BiLSTM, feeding the hidden states of the last layer of the mBERT model to the BiLSTM. Further, they set weights for each hidden-state layer in the BiLSTM, and the weighted sum of hidden states is concatenated with the original output of mBERT. The results reported on the leaderboard show 2nd rank for both language pairs with average weighted F1-scores of 0.73 and 0.64 for the Ma-En and Ta-En language pairs respectively.</p>
        <p>
          In addition to transformers, several models based on ML classifiers have also obtained promising results in the shared task. Kanwar et al. [18] adopted the Tomek under-sampling technique [19] to train several ML classifiers with various syntax-based n-gram features. The best performance was obtained using the LR classifier with word and char n-gram features for the Ma-En and Ta-En language pairs, with average weighted F1-scores of 0.71 and 0.62 respectively.
Balouchzahi et al. [
          <xref ref-type="bibr" rid="ref1">2</xref>
          ] submitted a majority voting of ML classifiers (Multinomial Naïve Bayes trained on Skipgram word embeddings and a Multi-Layer Perceptron (MLP) trained on a combination of word and char n-grams) and a BiLSTM model (with a sub-word embedding trained using the BPEmb library and later used as weights in the BiLSTM) for the shared task. The proposed model obtained average weighted F1-scores of 0.68 and 0.62 for the Ma-En and Ta-En language pairs
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>6https://github.com/google-research/bert</title>
        <p>respectively.</p>
        <p>Researchers have explored several models based on ML and DL approaches with a combination of different embeddings and feature sets. Balouchzahi et al. [1] explored ML, DL, and TL approaches by proposing (i) an ML-based voting classifier trained on a feature set of char sequences along with BPEmb sub-word n-grams and syntactic n-grams [20, 21], with three estimators, namely: LR, MLP, and eXtreme Gradient Boosting (XGB); (ii) a Keras sequential classifier trained on the same feature set; and (iii) a Universal Language Model Fine-Tuning (ULMFiT) model for SA. They used the Dakshina7 dataset as raw text to train a tokenizer and a universal Language Model (LM) for fine-tuning, and the fast.ai8 library for training the LM and the SA classification model. Using the ML-based voting classifier, they obtained average weighted F1-scores of 0.72 and 0.62 for the Ma-En and Ta-En language pairs respectively.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed methodology consists of the following steps:
• pre-processing the texts
• extracting char, char sequences, and syllables as features from the texts
• obtaining the corresponding n-grams from the n-gram generator
• vectorizing the n-grams using the TFIDF vectorizer
• training the ML classifiers
• predicting the labels of the Test set</p>
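      <p>The steps above can be sketched end-to-end with scikit-learn. This is a minimal illustration under our own assumptions (toy comments and labels, char n-grams only, hard voting over all three classifiers inside one pipeline); the actual submitted code is in the GitHub repository referenced above.</p>
```python
# Minimal sketch of the described pipeline: TF-IDF char n-grams feeding a
# majority-voting ensemble of LSVM, LR, and MLP. Toy data; illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical code-mixed comments and sentiment labels.
train_texts = ["idhu semma padam", "movie nalla illa", "super song"]
train_labels = ["Positive", "Negative", "Positive"]

# Character n-grams stand in for the char feature set described above.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))

voter = VotingClassifier(
    estimators=[
        ("lsvm", LinearSVC()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=300,
                              activation="relu", solver="adam", random_state=1)),
    ],
    voting="hard",  # majority voting over the three classifiers' predictions
)

model = make_pipeline(vectorizer, voter)
model.fit(train_texts, train_labels)
print(model.predict(["semma padam"]))
```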
      <p>
        The pre-processing module adopted from Balouchzahi et al. [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ] includes converting emojis to text, removing punctuation, numbers, unnecessary characters, and words of length less than 2, and lowercasing the words written in the Roman script. Words are split into char features with a simple strategy using the attributes of the string data type in a 'for' loop. Char sequences are extracted as sub-word level features using the everygrams9 function from NLTK. A syllable, which comprises vowels and consonants [22], is the smallest unit used to organize sequences of sounds, and syllables are considered the building blocks in Text-To-Speech (TTS) tasks. Sidorov [23] proposed using syllables as features for Text Classification (TC) tasks. Syllable features are extracted using the Syllablizer10 library. Though the library works better for native scripts, the results for code-mixed texts are also encouraging.
      </p>
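      <p>As an illustration of the char and char-sequence features (the word and the n-gram lengths here are our own examples, not values from the paper), NLTK's everygrams can enumerate the contiguous character sequences of a word:</p>
```python
# Illustrative extraction of char and char-sequence features for one word.
from nltk import everygrams

word = "padam"

# Char features: simply the list of characters, as with a plain loop.
chars = [c for c in word]

# Char-sequence (sub-word) features: contiguous character n-grams
# obtained with nltk.everygrams (here lengths 2 to 3).
char_seqs = ["".join(g) for g in everygrams(word, min_len=2, max_len=3)]

print(chars)
print(char_seqs)
```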
      <p>The n-gram generator accepts a list of chars/char sequences/syllables of a word as input and generates the corresponding n-grams, which are vectorized using the TfidfVectorizer11 to train the ML classifiers.</p>
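      <p>One way to vectorize externally generated n-gram lists (a sketch under our own assumptions; the repository may organize this step differently) is to hand TfidfVectorizer a pass-through analyzer, so it treats each input as already tokenized:</p>
```python
# Sketch: vectorizing pre-generated n-grams with TfidfVectorizer.
# A callable `analyzer` bypasses the built-in tokenization, so custom
# char/char-sequence/syllable n-grams can be fed in directly.
from sklearn.feature_extraction.text import TfidfVectorizer

docs_ngrams = [
    ["pa", "ad", "da", "am"],  # n-grams of one comment (hypothetical)
    ["na", "al", "ll", "la"],  # n-grams of another comment (hypothetical)
]

vec = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vec.fit_transform(docs_ngrams)

print(sorted(vec.vocabulary_))  # the 8 distinct n-grams
print(X.shape)                  # (2 documents, 8 features)
```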
      <p>The overview of feature engineering, which includes the procedures to pre-process the texts, extract features, generate n-grams, and obtain TF-IDF vectors, is shown in Figure 1, and the range of n-grams for each feature type is given in Table 3.</p>
      <sec id="sec-3-1">
        <title>7https://github.com/google-research-datasets/dakshina</title>
        <p>8https://nlp.fast.ai
9https://tedboy.github.io/nlps/generated/generated/nltk.everygrams.html
10https://github.com/libindic/syllabalizer.git
11https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html</p>
        <p>The parameters of the LSVM and LR classifiers are left at their defaults, and those of the MLP classifier are set as: hidden_layer_sizes = (150, 100, 50), max_iter = 300, activation = 'relu', solver = 'adam', random_state = 1. Each classifier is trained separately with the three feature sets mentioned earlier. The best performing feature and classifier pairs are selected manually based on their performance on the Dev set, and the majority voting of their predictions on the Test set was submitted for final evaluation to the shared task organizers. Figure 2 presents the steps for training an individual classifier for each feature set.</p>
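        <p>Because each selected classifier is paired with its own feature set, the final vote can be taken directly over the three prediction arrays. A minimal sketch, assuming hypothetical per-classifier predictions (the labels and variable names here are ours):</p>
```python
# Sketch: majority voting over predictions from classifiers trained on
# different feature sets (char, char-sequence, and syllable n-grams).
from collections import Counter

# Hypothetical per-classifier predictions on four Test-set comments.
preds_char     = ["Positive", "Negative", "Positive", "Unknown_state"]
preds_char_seq = ["Positive", "Negative", "Negative", "Positive"]
preds_syllable = ["Positive", "Positive", "Negative", "Positive"]

def majority_vote(*prediction_lists):
    """Return the most common label per sample across all classifiers."""
    final = []
    for labels in zip(*prediction_lists):
        final.append(Counter(labels).most_common(1)[0][0])
    return final

print(majority_vote(preds_char, preds_char_seq, preds_syllable))
# -> ['Positive', 'Negative', 'Negative', 'Positive']
```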
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>
        A post/comment in each language pair should be classified into one of the five categories described in Section 1. The dataset provided by the shared task organizers [
        <xref ref-type="bibr" rid="ref4 ref5">5, 6</xref>
        ] includes a collection of code-mixed text from social media in three language pairs, namely: Ma-En, Ta-En, and Ka-En. These datasets were split into Train, Dev, and Test sets and provided to the participants of the shared task to train and evaluate the models. The statistics of the datasets are given in Table 4. As in the Dravidian-CodeMix-FIRE2020 shared task, the label distribution over the datasets illustrates that the datasets are highly imbalanced. Table 4 also shows that, for each class, the Ta-En language pair has the most samples and the Ka-En language pair the fewest, which could affect the performance of the classifiers for the Ka-En language pair.
      </p>
      <p>The predictions on the Test set submitted by the participants were evaluated based on the average weighted F1-scores. The organizers encouraged the teams to evaluate their models locally on the Dev set and then submit predictions on the Test set. Table 5 gives the performance of the proposed methodology on the Dev set for all three feature sets using the three classifiers for all three language pairs. Observation of the results on the Dev set shows that LR and LSVM outperform each other for various feature sets and language pairs, while MLP always obtained the lowest results for all feature sets and all language pairs. The highlighted content in Table 5 corresponds to the best performing classifier, and the starred (*) score indicates the best feature set and classifier pair for the language pair. It can be seen that most of the high scores are obtained with char n-grams followed by syllable n-grams. However, the results using char sequences are interesting as well.</p>
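      <p>The evaluation metric, the average weighted F1-score, can be computed with scikit-learn; the labels below are our own toy example, not task data:</p>
```python
# Sketch: computing the average weighted F1-score used for ranking.
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for four comments.
y_true = ["Positive", "Negative", "Positive", "Mixed_feelings"]
y_pred = ["Positive", "Negative", "Negative", "Mixed_feelings"]

# "weighted" averages the per-class F1-scores weighted by class support,
# which is the score the shared-task leaderboard reports.
score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 3))  # -> 0.75
```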
      <p>The good performance of syllable n-grams reveals that they can be effectively used as features in TC tasks as well, and it is expected that they perform much better for native scripts compared to code-mixed texts. Due to hardware resource constraints, only the LR classifier was trained with char-sequence and syllable n-grams for the Ta-En language pair.</p>
      <p>Based on the performance of the models on the Dev set (highlighted scores in Table 5), the best feature set and classifier pairs are selected and applied to the Test sets. The results of the best individual classifier and feature set pairs and their majority voting on the Test sets are given in Table 6. It can be observed that the majority voting of the predictions outperformed the individual classifiers. The results released by the shared task organizers in the leaderboard12 reveal that our proposed methodology using majority voting of the predictions obtained 2nd, 4th, and 5th ranks with average weighted F1-scores of 0.628, 0.726, and 0.619 for the Ka-En, Ma-En, and Ta-En language pairs respectively.</p>
      <p>The confusion matrices for each language pair, based on the best performances mentioned in Table 6, are presented in Figure 3. For both the Ka-En (Figure 3a) and Ta-En (Figure 3c) language pairs, the weakest performance is for predicting "Mixed_feelings" comments and the best performance is for predicting "Positive" comments. Similarly, for the Ma-En (Figure 3b) language pair, the weakest performance is for predicting "Mixed_feelings" comments and the best performance is for predicting "not-Malayalam" comments along with "Positive" comments. Though predicting "Mixed_feelings" comments exhibits the weakest performance in all three language pairs, the results for the Ma-En language pair are higher compared to those of the other two language pairs.
12https://dravidian-codemix.github.io/2021/proceedings.html</p>
      <p>The comparison of the performance of the proposed methodology with that of the top performing models in the shared task, shown in Figure 4, illustrates that the performance is quite competitive for all language pairs. The Ka-En language pair, which has a smaller dataset compared to the other language pairs, also yielded good performance.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This paper describes the participation of our team MUCIC in the SA shared task at Dravidian-CodeMix-FIRE2021. Three types of features, namely: char, char sequences, and syllables, are extracted from the given texts. These features are used to generate the corresponding n-grams, which are then transformed into TF-IDF vectors for training the classifiers. Based on the performance of the models on the Dev set, the best feature set and classifier pairs are selected and applied to the Test sets, and the majority voting of their predictions was submitted to the shared task organizers for evaluation. The results on the leaderboard reveal that our proposed strategy obtained promising results, securing 2nd, 4th, and 5th ranks with average weighted F1-scores of 0.628, 0.726, and 0.619 for the Ka-En, Ma-En, and Ta-En language pairs respectively. Other features and feature selection algorithms will be explored further for code-mixed low-resource Dravidian languages.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Team MUCIC sincerely appreciates the organizers for their efforts in conducting this shared task.</p>
      <title>References</title>
      <p>[1] F. Balouchzahi, H. Shashirekha, LA-SACo: A Study of Learning Approaches for Sentiments Analysis in Code-Mixing Texts, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 109–118.
Identification Sentiment in Code-mixed Text, in: FIRE (Working Notes), 2020, pp. 548–553.
[13] A. Conneau, G. Lample, Cross-lingual Language Model Pretraining, Advances in Neural Information Processing Systems 32 (2019) 7059–7069.
[14] X. Ou, H. Li, YNU@Dravidian-CodeMix-FIRE2020: XLM-RoBERTa for Multi-language Sentiment Analysis, in: FIRE (Working Notes), 2020, pp. 560–565.
[15] H. Sun, J. Gao, F. Sun, HIT_SUN@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on Multilingual Code-Mixing Text Based on BERT, in: FIRE (Working Notes), 2020, pp. 517–521.
[16] B. Huang, Y. Bai, LucasHub@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on Multilingual Code Mixing Text with M-BERT and XLM-RoBERTa, in: FIRE (Working Notes), 2020, pp. 574–581.
[17] Y. Zhu, K. Dong, YUN111@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text, in: FIRE (Working Notes), 2020, pp. 628–634.
[18] N. Kanwar, M. Agarwal, R. K. Mundotiya, PITS@Dravidian-CodeMix-FIRE2020: Traditional Approach to Noisy Code-Mixed Sentiment Analysis, in: FIRE (Working Notes), 2020, pp. 541–547.
[19] I. Tomek, Two Modifications of CNN, IEEE Transactions on Systems, Man and Cybernetics 6 (1976) 769–772.
[20] J. P. Posadas-Durán, I. Markov, H. Gómez-Adorno, G. Sidorov, I. Batyrshin, A. Gelbukh, O. Pichardo-Lagunas, Syntactic N-grams as Features for the Author Profiling Task, in: CEUR Workshop Proceedings, 2015.
[21] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic Dependency-based n-grams: More Evidence of Usefulness in Classification, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2013, pp. 13–24.
[22] K. De Jong, Temporal Constraints and Characterising Syllable Structuring, Phonetic Interpretation: Papers in Laboratory Phonology VI (2003) 253–268.
[23] G. Sidorov, Automatic Authorship Attribution using Syllables as Classification Features, Rhema (2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          , MUCS@
          <string-name>
            <surname>Dravidian-CodeMix-FIRE2020</surname>
          </string-name>
          :
          <article-title>SACOSentiments Analysis for CodeMix Text</article-title>
          , in: P. Mehta,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2020 -
          <article-title>Forum for Information Retrieval Evaluation, Hyderabad</article-title>
          , India,
          <source>December 16-20</source>
          ,
          <year>2020</year>
          , volume
          <volume>2826</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>495</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the Track on Sentiment Analysis for Dravidian Languages in Code-mixed Text</article-title>
          , in: Working Notes of FIRE 2020 -
          <article-title>Forum for Information Retrieval Evaluation, Hyderabad</article-title>
          , India,
          <source>December 16-20</source>
          ,
          <year>2020</year>
          , volume
          <volume>2826</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A Survey of Current Datasets for Code-Switching Research</article-title>
          , in: 2020
          <source>Sixth International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vasantharajan</surname>
          </string-name>
          ,
          <article-title>Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text</article-title>
          , in:
          <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Overview of the Dravidian CodeMix 2021 Shared Task on Sentiment Detection in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation, FIRE 2021</source>
          , Association for Computing Machinery,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A Sentiment Analysis Dataset for Code-Mixed Malayalam-English</article-title>
          ,
          <source>in: Proceedings of the First Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://aclanthology.org/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised Cross-lingual Representation Learning at Scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          , How Multilingual is Multilingual BERT?,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>SRJ@Dravidian-CodeMix-FIRE2020: Automatic Classification and</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>