Sentiment Analysis using Cross Lingual Word
Embedding Model
C Jerin Mahibha¹, Sampath Kayalvizhi² and Durairaj Thenmozhi³
¹ Meenakshi Sundararajan Engineering College, Chennai
² Sri Sivasubramania Nadar College of Engineering, Kalavakkam
³ Sri Sivasubramania Nadar College of Engineering, Kalavakkam


Abstract
Sentiment analysis deals with analysing a given text and classifying it as positive, negative
or neutral. It forms the basis for applications in which public opinion needs to be known. This
paper shows how a multi-label classification of a given text can be implemented by considering
the sentiment associated with the text. Models that are applied for monolingual sentiment
analysis may not provide good results when they are extended to code-mixed data. When a Cross
Lingual model was applied to the training data-set provided by Dravidian Code-Mix FIRE 2021 for
Task A, which uses Tamil-English code-mixed data, it was able to classify the test data-set
with an average F1 score of 0.514.

Keywords
Sentiment analysis, Code-mixed data, Transformer model, Imbalanced data-set, Sampling


1. Introduction
  India is a land of large bilingual communities. Most of the Indian community is well
versed in English [1] in addition to a native language. So when Indians express their
opinions on various situations as tweets or comments, the text is mostly a mix of
English and a regional language, which is referred to as code-mixed data. Hence the analysis of
code-mixed data has become an important part of any analysis concerning Indian society.

   Sentiment analysis is a research area within Natural Language Processing in which the
sentiment associated with a text has to be identified [2]. The sentiment analysis process is
usually expressed as a multi-label classification problem [3] with a minimum of three sentiments
associated with a text: positive, negative or neutral. Sentiment expresses the inner
feeling of a person, which can also be represented as an emotion [4], towards a particular
subject such as a product, idea, concept or event.

  When a corpus is created for a specific purpose such as sentiment analysis, the data in the
corpus may not be uniformly distributed. One class may contain far more data than the
other classes, or one class may be much smaller than the others [5]. The
sentiments associated with such imbalanced code-mixed data can be analysed using sampling
techniques [6]. Various machine learning approaches, namely the Random Forest Classifier, Logistic
Regression, the XGBoost classifier, the Support Vector Machine and the Naïve Bayes Classifier, can
be applied to the sampled data to classify the text by its sentiment.
Lexicon-based approaches can also be used for the classification process [7]. These
rely mainly on lexical resources such as WordNet, which represent words and their associated
sentiment, in order to perform the classification.

FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India
jerinmahibha@gmail.com (C. J. Mahibha); kayalvizhis@ssn.edu.in (S. Kayalvizhi); d_theni@ssn.edu.in (D. Thenmozhi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   During the pandemic period, exposure to social media has increased widely. With the
growing use of online resources and online media, expressing views and posting comments on
social media sites has become very common. Indians prefer to express themselves
using code-mixed languages. Such text is difficult to analyse because it may not be grammatically
well formed, it contains spelling variations and abbreviations, and the available resources
for it are limited [8]. Deep learning models can be used effectively to perform the multi-label
classification involved in sentiment analysis of code-mixed data. Cross-lingual pretrained
transformer models are expected to provide better performance on such tasks when low-resource
languages are involved [9].


2. Related works
  In a code-mixed corpus, both code-mixed and non-code-mixed data commonly occur. In [6],
these were identified by creating a lexicon dictionary for the code-mixed corpus, which was
also used to handle the problems associated with spelling variations and abbreviation usage.
Machine learning techniques were then applied to classify the text based on
its sentiment. Problems arising from the imbalanced nature of the
corpus were handled by applying various sampling techniques to the corpus.

   Sentiment analysis of a Kannada-English code-mixed corpus, created by crawling
Facebook comments, was demonstrated in [10], together with the performance of various machine
learning and deep learning techniques over the corpus using a distributed representation.
Sentiment analysis of a Hindi data-set was implemented using cross-lingual
contextual word embeddings and zero-shot transfer learning to project predictions from
resource-rich English to resource-poor Hindi [11]. Classification of code-mixed Hindi
text based on sentiment was implemented using TF-IDF feature vectors of character
n-grams, where n ranged from 2 to 6, with an ensemble voting classifier and a linear SVM
classifier [12].

  The polarity of Dravidian code-mixed comments was identified in [13] using a sub-word level
model and a word embedding based model, both of which made use of a Long Short Term
Memory (LSTM) network, and a machine learning based architecture that used Term
Frequency-Inverse Document Frequency (TF-IDF) vectorization along with a Logistic Regression
model for the classification task. The Decision Tree algorithm was used to computationally
identify and categorize the opinions expressed as text in the Kannada language [14].

Table 1
Data set
                            Category         Training Set   Validation Set
                            Positive              4271            480
                            Negative              5628            611
                            Mixed Feeling         4021            438
                            Not Known            20069           2257
                            Not Tamil             1667            176

   To analyse code-mixed data, bilingual embedding techniques have to be replaced with
multilingual word embedding schemes [15] to improve performance in any application
associated with code-mixed data. For sentiment analysis of code-mixed data,
different kinds of multilingual and cross-lingual embeddings allow knowledge to be transferred
efficiently from monolingual text to code-mixed text [16]. In [17], variations of code-mixed
words were captured by a cluster-based preprocessing approach, and sentences in the code-mixed
and standard languages were then mapped to a common sentiment space for performing
sentiment analysis.


3. Data set Description
   The data-set provided for the shared task Dravidian Code-Mix FIRE 2021 [2],[3] was a new
gold standard corpus for sentiment analysis of code-mixed data. Each text is a comment or
post, and the average sentence length of the corpus is 1. Each text is annotated with sentiment
polarity at the comment / post level. The data-set had three parts: one for training, a second
for validation and a third for testing, on which the results of the shared
task were announced. The complete data-set had five classes,
namely Not-Known, Positive, Negative, Not-Tamil and Mixed Feelings.

  Table 1 shows the number of instances under each category in both the training and the
validation data set. The distribution of the data reflects the imbalanced
nature of a real-world scenario, with about 56% of the data falling under the Not-Known category
in both the training and validation sets. Each instance of the test data-set has to be placed
under exactly one of the above categories, which was done using the predictions of the proposed
model. Dravidian Code-Mix FIRE 2021 computed the average F1 score of the predicted classes of
the test data, and this score was used to evaluate the proposed model and to rank the
submitted models.
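
The class shares quoted above follow directly from the counts in Table 1. The short Python sketch below recomputes them; the dictionary and function names are illustrative and not part of the shared-task material.

```python
# Instance counts copied from Table 1 (training and validation sets).
TRAIN_COUNTS = {
    "Positive": 4271,
    "Negative": 5628,
    "Mixed Feeling": 4021,
    "Not Known": 20069,
    "Not Tamil": 1667,
}
VALID_COUNTS = {
    "Positive": 480,
    "Negative": 611,
    "Mixed Feeling": 438,
    "Not Known": 2257,
    "Not Tamil": 176,
}


def class_shares(counts):
    """Return the percentage of instances that fall under each category."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())


for name, counts in (("training", TRAIN_COUNTS), ("validation", VALID_COUNTS)):
    shares = class_shares(counts)
    print(name, "total:", sum(counts.values()))
    for label, share in shares.items():
        print(f"  {label:<14}{share:5.1f}%")
    print(f"  largest/smallest class ratio: {imbalance_ratio(counts):.1f}")
```

With these counts the Not Known category accounts for roughly 56% of the training set and 57% of the validation set, and the largest class is about twelve times the size of the smallest one.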

  Figure 1 shows the distribution of the data in both the training and validation data-sets
provided by Dravidian Code-Mix FIRE 2021 for Task 1. It is evident that the data-set is an
imbalanced one, with more instances under the category Not-Known.

Figure 1: Distribution of data in the data set

   As separate data-sets were provided for training and validation, they were used for
training and validating the proposed model respectively. For the separately provided test
data-set, the sentiments were predicted using the trained model, and these predictions were
used to compute the F1 score of the model.


4. Proposed methodology
   The shared task Dravidian Code-Mix FIRE 2021 was a multi-label classification task for
sentiment analysis of Dravidian code-mixed data based on the provided gold standard corpus [?].
Our team participated in the task on sentiment analysis of Tamil-English
code-mixed text, which consisted of comments from YouTube. We implemented the classification
using a Cross Lingual pre-trained model.

   The given data-set was imbalanced in nature, as a result of which the model was not able to
produce good results. The training data-set was therefore down-sampled to balance the classes.
With the Cross Lingual model, a substantial improvement in performance is expected over both
low-resource and high-resource languages. The Cross Lingual Model is trained with a
Translation Language Modeling objective, which helps the model to learn similar representations
for different languages¹. The vocabulary size used by the model was the default value of 30145;
the encoder and pooler layers have a dimensionality of 2048, and the model has 12 hidden layers
with a dropout value of 0.1. Language embeddings were enabled with two languages, as the
considered data set has text in code-mixed Tamil and English. The model was trained using the
training data set and validated using the validation data set provided for the shared task by
Dravidian Code-Mix FIRE 2021. The evaluation of the model was based on a separate test data set.

    ¹ https://huggingface.co/transformers/model_doc/xlmroberta.html

Table 2
Submission score
                                       Parameters    Score
                                       Precision     0.615
                                       Recall        0.485
                                       F1 Score      0.514

   Submissions to Dravidian Code-Mix FIRE 2021 were evaluated on the performance
of the proposed system, measured in terms of weighted average precision,
weighted average recall and weighted average F-score across all the classes. Our model
achieved an average F1 score of 0.514. Table 2 shows the values of the performance
measures for the sentiment predictions made for the text in the test data-set using the
proposed model.
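
As an illustration of the weighted averaging used for the ranking, the sketch below computes weighted precision, recall and F1 in plain Python. It is not the organisers' evaluation script; the function name and the toy labels are made up for the example, and each class is weighted by its share of the true labels.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Weighted average precision, recall and F1 over all classes.

    Per-class scores are weighted by the class support, i.e. the number
    of true instances of that class.
    """
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # True positives: instances of this class that were predicted correctly.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = tp / n_pred if n_pred else 0.0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy example with invented labels, only to show the call.
toy_true = ["Positive", "Positive", "Negative", "Not-Known", "Not-Known", "Not-Known"]
toy_pred = ["Positive", "Negative", "Negative", "Not-Known", "Not-Known", "Positive"]
print(weighted_scores(toy_true, toy_pred))
```

Because the F1 score is averaged per class, the weighted F1 is in general not the harmonic mean of the weighted precision and the weighted recall, which is consistent with the three values in Table 2.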


5. Error Analysis
   The performance measures of the model show that there is considerable scope to improve
the accuracy of sentiment analysis of code-mixed Tamil-English text. The imbalanced nature of
the data-set can be considered one possible reason for this result; it could be balanced
by using up-sampling mechanisms instead of down-sampling the data. Down-sampling
may have caused the model to miss a few key features, which would have led to
misclassification of text from the test data-set. Pre-processing of the data is as important
as up-sampling, since posts retrieved from YouTube contain symbols and
abbreviations.
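
The difference between the two balancing strategies can be sketched as follows. The paper does not state how the down-sampling was carried out, so this example assumes random sampling, with the smallest class as the down-sampling target and the largest class as the up-sampling target; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def down_sample(samples, seed=0):
    """Randomly drop instances so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(items) for items in groups.values())
    balanced = []
    for items in groups.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


def up_sample(samples, seed=0):
    """Randomly repeat instances so every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(items) for items in groups.values())
    balanced = []
    for items in groups.values():
        balanced.extend(items)  # keep every original instance
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced


# Toy data: 6 Not-Known, 3 Negative and 2 Positive comments.
toy = (
    [(f"comment {i}", "Not-Known") for i in range(6)]
    + [(f"comment {i}", "Negative") for i in range(6, 9)]
    + [(f"comment {i}", "Positive") for i in range(9, 11)]
)
print(Counter(label for _, label in down_sample(toy)))
print(Counter(label for _, label in up_sample(toy)))
```

Down-sampling discards instances of the larger classes, which is the loss of information discussed above, whereas up-sampling keeps every original instance at the cost of duplicating instances of the smaller classes.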

   Figure 2 shows a few example sentences that were not correctly predicted by the
proposed model. The first three sentences were assigned to the Negative category,
but analysing the words in the sentences shows that they express positive sentiments. Sentence
4 was classified as Positive because of the presence of the word 'level', but considering
the sentence as a whole it should be classified as Not-Tamil. Fine-tuning the proposed model
and avoiding the loss of information would help to reduce the misclassifications made
by the model.
Figure 2: Sample Sentences


6. Conclusion
  The results show that the performance achieved by our model on the new
gold standard corpus provided by Dravidian Code-Mix FIRE 2021 for code-mixed Tamil-English text
could be improved. As numerous multilingual transformer models and approaches are
available, there is ample scope for further research in this field to identify a better
model for sentiment analysis of code-mixed text.


References
 [1] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     sentiment analysis in code-mixed tamil-english text, arXiv preprint arXiv:2006.00206
     (2020).
 [2] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi,
     E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings
     of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [3] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, D. Thenmozhi,
     E. Sherly, Overview of the dravidiancodemix 2021 shared task on sentiment detection
     in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE
     2021, Association for Computing Machinery, 2021.
 [4] A. Kalaivani, D. Thenmozhi, SSN_NLP_MLRG@ dravidian-codemix-FIRE2020: Sentiment
     code-mixed text classification in tamil and malayalam using ulmfit., in: FIRE (Working
     Notes), 2020, pp. 528–534.
 [5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation
     for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the
     1st Joint Workshop on Spoken Language Technologies for Under-resourced languages
     (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL),
     European Language Resources association, Marseille, France, 2020, pp. 202–210. URL:
     https://aclanthology.org/2020.sltu-1.28.
 [6] R. Srinivasan, C. Subalalitha, Sentimental analysis from imbalanced code-mixed data using
     machine learning approaches, Distributed and Parallel Databases (2021) 1–16.
 [7] N. H. Mahadzir, et al., Sentiment analysis of code-mixed text: A review, Turkish Journal
     of Computer and Mathematics Education (TURCOMAT) 12 (2021) 2469–2478.
 [8] S. R. Shah, A. Kaushik, Sentiment analysis on indian indigenous languages: a review on
     multilingual opinion mining, arXiv preprint arXiv:1911.12848 (2019).
 [9] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint
     arXiv:1901.07291 (2019).
[10] K. Shalini, H. B. Ganesh, M. A. Kumar, K. P. Soman, Sentiment analysis for code-mixed
     indian social media text with distributed representation, in: 2018 International
     Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp.
     1126–1131. doi:10.1109/ICACCI.2018.8554835.
[11] A. Kumar, V. H. C. Albuquerque, Sentiment analysis using xlm-r transformer and zero-
     shot transfer learning on resource-poor indian language, Transactions on Asian and
     Low-Resource Language Information Processing 20 (2021) 1–13.
[12] P. Mishra, P. Danda, P. Dhakras, Code-mixed sentiment analysis using machine learning
     and neural network approaches, arXiv preprint arXiv:1808.03299 (2018).
[13] A. V. Mandalam, Y. Sharma, Sentiment analysis of Dravidian code mixed data, in:
     Proceedings of the First Workshop on Speech and Language Technologies for Dravidian
     Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 46–54. URL:
     https://aclanthology.org/2021.dravidianlangtech-1.6.
[14] P. Ranjitha, K. Bhanu, Improved sentiment analysis for dravidian language-kannada using
     dicision tree algorithm with efficient data dictionary, in: IOP Conference Series: Materials
     Science and Engineering, volume 1123, IOP Publishing, 2021, p. 012039.
[15] A. Pratapa, M. Choudhury, S. Sitaram, Word embeddings for code-mixed language
     processing, in: Proceedings of the 2018 conference on empirical methods in natural
     language processing, 2018, pp. 3067–3072.
[16] S. Yadav, T. Chakraborty, Unsupervised sentiment analysis for code-mixed data, arXiv
     preprint arXiv:2001.11384 (2020).
[17] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment analysis of code-mixed
     languages leveraging resource rich languages, arXiv preprint arXiv:1804.00806 (2018).