CalBERT – Code-Mixed Adaptive Language
Representations Using BERT
Aditeya Baral1 , Ansh Sarkar1 , Aronya Baksy1 , Deeksha D1 and Ashwini M Joshi1
1 PES University, Bengaluru, India


Abstract
A code-mixed language is a type of language that involves the combination of two or more language varieties in its script or speech. Analysis of code-mixed text is difficult because the language is inconsistent and is not handled well by existing monolingual approaches. We propose a novel approach to improve Transformer performance by introducing an additional step called "Siamese Pre-Training", which allows pre-trained monolingual Transformers to adapt language representations for code-mixed languages with only a few examples of code-mixed data. The proposed architectures beat the state-of-the-art F1-score on the Sentiment Analysis for Indian Languages (SAIL) dataset, with the highest improvement being 5.1 points, and also achieve state-of-the-art accuracy on the IndicGLUE Product Reviews dataset, beating the benchmark by 0.4 points.

Keywords
code-mixed languages, Transformer, BERT, SAIL, IndicGLUE, sentiment analysis




1. Introduction
Code-mixed language is a form of language wherein syntactic elements from one language
are inserted into another language in such a way that the semantics of the resultant language
remains the same. Code-mixing is especially prevalent in a multilingual society like India, where
most of the population is at least bilingual, and it is most visible on social media platforms
such as Facebook and Twitter. Given the interest of enterprises in deriving business insights
from social media posts, building a mechanism for the proper analysis and understanding of
code-mixed language becomes even more important.
   Language representations are used in Natural Language Understanding tasks such as sentiment
analysis and human-like conversation systems. The current state-of-the-art methods for
learning representations use Transformer architectures pre-trained on vast amounts of natural
language data. However, almost all of these learnt representations are monolingual, created
using a single language only or pre-trained on multiple languages individually. These
representations struggle when a language is code-mixed and hence display low

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI
2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE
2022), Stanford University, Palo Alto, California, USA, March 21–23, 2022.
" aditeya.baral@gmail.com (A. Baral); anshsarkar1@gmail.com (A. Sarkar); abaksy@gmail.com (A. Baksy);
deekshad132@gmail.com (D. D); ashwinimjoshi@pes.edu (A. M. Joshi)
~ https://aditeyabaral.github.io/ (A. Baral); https://github.com/anshsarkar (A. Sarkar); https://github.com/abaksy
(A. Baksy); https://deeksha-d.github.io/ (D. D)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
performance on code-mixed natural language tasks due to inherent inconsistencies and large
variability in data.
   Our novel approach generates Code-mixed Adaptive Language representations using Bidirectional
Encoder Representations from Transformers (CalBERT) by adding an additional
pre-training step called "Siamese Pre-Training", in which a pre-trained monolingual Transformer
[1] is provided with different representations of the same sentence or phrase and learns to
minimise the distance between their embeddings. We evaluate CalBERT on sentiment analysis
of the Hinglish language using the benchmark SAIL 2017 dataset [2] and obtain an F1-score of
62%, an improvement of 5.1 points or 8.9%. We also evaluate CalBERT on the sentiment
analysis task released by IIT Patna using the IndicGLUE Product Reviews dataset [3]
and obtain an accuracy of 79.37%, a 0.5% increase over the existing benchmark.


2. Background
BERT [4] and similar Transformer architectures derived from BERT (such as RoBERTa [5],
DistilBERT [6], XLM-RoBERTa [7] and others) are used to extract language representations
from text corpora. These language representations are learned using a bidirectional attention
mechanism [8], incorporate contextual information about the individual tokens in the corpus,
and can be fine-tuned for a variety of tasks. The large majority of existing models are trained
on monolingual corpora and therefore produce representations attuned to tasks involving a
single language. As such, these representations perform poorly when applied to tasks involving
code-mixed language, which contains multiple language varieties within a single script.
   The methodology proposed in this work adapts existing representations for monolingual
text into representations that can be fine-tuned for code-mixed tasks. This allows
representations of code-mixed language to be generated without having to pre-train a
Transformer model from scratch on large quantities of code-mixed data, which is a time-consuming
and computationally intensive process.


3. Previous Work
Mikolov, in the paper titled “Efficient Estimation of Word Representations in Vector Space” [9],
proposes a novel approach to compute word embeddings without the added complexity of a
hidden layer while performing better than the neural network language model. These word
vectors outperform the SemEval 2012 Task 2 benchmark; however, they perform poorly on
out-of-vocabulary words and morphologically rich languages.
   Reimers et al., in their work titled “Sentence-BERT: Sentence Embeddings using Siamese
BERT-Networks” [10], modify the existing BERT architecture to use a Siamese network with
shared weights and cosine distance as the loss function to predict semantic textual similarity.
The model trains in linear time and yields a score of 84.88% in sentiment prediction of
movie reviews.
   “A Passage to India: Pre-trained Word Embeddings for Indian Languages” [11] by Kumar et
al. shows that models trained on subword representations perform better because Indian languages
are morphologically rich. Their multilingual embeddings, when evaluated on next sentence
prediction using pre-trained BERT, give 67.9% accuracy. Another paper titled “Towards Sub-Word
Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text” [12] by
Joshi et al. discusses the advantage of using sub-word representations over character-level
tokens to deal with inconsistencies in code-mixed text. This method was evaluated on
a custom dataset and yields an F1-score of 65.8%.
   Choudhary et al. [13], in their paper “Sentiment Analysis of Code-Mixed Languages leveraging
Resource Rich Languages”, propose a novel method that uses contrastive learning. They use a
Siamese model to map code-mixed and monolingual language to the same space by clustering
based on the similarity between their skip-gram vectors. This works on the hypothesis that
similar words have similar contexts. The model achieves a 75.9% F-score on the HECM dataset,
an improvement over existing approaches.


4. Proposed Approach
We propose a novel approach to adapt Transformer representations for a code-mixed language
from representations that already exist for another language, by introducing an additional
pre-training step that uses a Siamese network architecture [14]. We attempt to minimise the
distance between the semantic spaces of the two languages and obtain a joint, shared
representation for equivalent sentences in both languages. This enables the generation of
code-mixed language representations from an existing language’s representations,
without the need to pre-train a Transformer from scratch.
   Our approach is implemented by placing a shared pre-trained Transformer layer in each
branch of the Siamese network and using an appropriate loss function to bring the embeddings
of each pair of sentences closer together. To Siamese pre-train CalBERT, the same sentence needs
to be obtained both in the language the Transformer was pre-trained in and in the
language for which the Transformer is adapting representations, as sketched below.
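
To make this mechanism concrete, the following sketch (our illustrative example, assuming a generic HuggingFace encoder such as roberta-base and mean pooling; the exact setup used for CalBERT is described in Section 6.5) shows a single shared encoder serving both branches, with the cosine distance between the two sentence embeddings as the quantity being minimised:

    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    # One encoder shared by both branches of the Siamese network
    # ("roberta-base" is an illustrative placeholder checkpoint).
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    encoder = AutoModel.from_pretrained("roberta-base")

    def embed(sentences):
        """Mean-pooled sentence embeddings from the shared encoder."""
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state      # (batch, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)     # (batch, tokens, 1)
        return (hidden * mask).sum(1) / mask.sum(1)

    # A semantically equivalent pair: transliterated target vs. translated base sentence
    target = ["logom ki bhida jama hone lagi"]
    base = ["people started gathering"]

    # Cosine distance between the branches; gradients flow into the single shared encoder
    loss = (1 - F.cosine_similarity(embed(target), embed(base))).mean()
    loss.backward()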


5. Methodology
5.1. Terminologies
    • Base Language: The single language that was used to pre-train the Transformer architecture
      using any task like Masked Language Modelling [4], Next Sentence Prediction and
      so on. For example, in the case of the Hinglish language, the base language is English.
    • Target Language: The code-mixed language for which the Transformer architecture is
      trying to adapt representations. This language is a super-set of the base language, since it
      may contain sentences entirely in the base language. For example, Hinglish.
    • Siamese Pre-training: The novel additional pre-training step proposed to adapt
      representations.
Figure 1: SBERT Architecture used for Siamese Pre-training


5.2. Siamese Pre-training
A Siamese network is a network consisting of two or more branches, wherein each branch
receives an input. The network learns to distinguish between the examples provided through
each branch. Siamese networks have been used extensively in computer vision for image
classification.
   A pre-trained monolingual Transformer, having been trained on only a single language, can
generate language representations for that language alone, and thus performs poorly when the
language is code-mixed. Pre-training a Transformer from scratch on code-mixed data is difficult
since it is computationally expensive and time-consuming. Since the Transformer has already
learnt language representations for one of the languages, it is far more efficient to adapt these
existing representations to the new and similar words belonging to the other language.
   A pre-trained monolingual Transformer is provided with different representations of the
same sentence or phrase and learns to minimize the distance between their embeddings (Fig. 1).
The two representations, in this case, are the transliterated version of the code-mixed sentence
in the target language and the translated version in the base language. Since the two versions
have the same semantic meaning, their similarity should ideally be as high as possible and thus
the distance between their representations should be minimum. Since the Transformer already
knows representations for the base language, it only needs to adapt representations to map the
target language to the same space.
   The Siamese pre-training is named after its use of the Siamese network architecture,
wherein the two arms of the network are fed two semantically equivalent input sentences,
in the base and the target language respectively. The network learns the embeddings for both
sentences by comparing their similarity (normally this involves a labelled dataset
with sentence pairs and corresponding similarities, but here every sentence pair has
maximum similarity). The loss function (Eqn. 1) brings the representations
from the two branches closer. In our work, we use the cosine distance as the loss function,
although similar loss functions such as contrastive loss can also be used. Minimizing the cosine
distance maximizes the similarity between the sentence embeddings, which is
the desired outcome.
   The Siamese pre-training is performed as the last pre-training step (after the usual pre-
training strategies used in a Transformer such as masked language modelling or next sentence
prediction) since it needs existing base language representations to learn the target language
effectively. Additionally, our work shows that a significant amount of data is not required to
effectively perform Siamese pre-training and that the model can learn with a fraction of the
data size that was used for pre-training, thus showing that our approach does not require vast
computational resources to improve performance.
   A variety of BERT-based architectures were pre-trained and fine-tuned for this work. The
models that were evaluated were BERT, RoBERTa, DistilBERT and XLM-RoBERTa. These
architectures are used either pre-trained on an English corpus (as available publicly on the
HuggingFace Hub) or pre-trained on the corpus of code-mixed data that was collected as part
of this experiment.


6. Workflow
We focus our efforts1 on Transformer architectures based on the Bidirectional Encoder Representations
from Transformers (BERT) architecture, since they are bidirectional models capable
of learning representations from a given sequence using both forward and backward
contexts. We demonstrate our novel approach on the Hindi-English (Hinglish) code-mixed
language; data for the same was obtained from social media and news articles to maintain a
good balance of well-structured as well as informal code-mixed language usage.

6.1. Equations
The objective is to minimize the distance between the representation in the base language
and the representation in the target language. The loss function, hence, needs to reflect this
minimization of the distance between the two representations.

  The cosine distance loss function in (1) is based on the angle between two vectors. Minimizing
the cosine distance implies that the two vectors are highly similar, since a smaller angle between
the two vectors yields a smaller cosine distance.
\[
l(\vec{a}, \vec{b}) = 1 - \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} \qquad (1)
\]
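
As a quick numerical illustration of Eqn. (1) (the vectors below are arbitrary stand-ins for sentence embeddings):

    import numpy as np

    def cosine_distance(a, b):
        """Eqn. (1): one minus the cosine similarity of two embedding vectors."""
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([0.20, 0.90, 0.10])   # stand-in embedding of the base-language sentence
    b = np.array([0.25, 0.85, 0.05])   # stand-in embedding of the target-language sentence
    print(cosine_distance(a, b))       # close to 0 for nearly parallel (semantically close) vectors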




   1
       https://github.com/aditeyabaral/calbert
Table 1
CalBERT Dataset Metrics
                                 Source         No. of examples
                                 Social Media           147731
                                 IndicNLP              8466307


6.2. Data Collection
Since code-mixed Hinglish data is abundantly present on platforms like social media, we choose
to use social media as our source of data. There are several online archives [15] available that
host code-mixed data from platforms like Twitter and Facebook for several tasks. Additionally,
Hinglish code-mixed data is already available on IndicCorp, which is one of the largest publicly-
available corpora for Indian languages.
   After compiling the data from all sources, it needs to be converted into a suitable
pairwise format to be used with CalBERT.

6.3. Data Pre-Processing
Although a large amount of data is obtained (Table 1), most of it is not in a format that can be
directly used to train CalBERT. Since Hinglish code-mixed language can exist in both the Hindi
(Devanagari) script as well as the Roman script, we need to transliterate all of them to the same
single script language (this is achieved using the indic-transliteration pip package [16]).
For our work, we choose the script language to be Roman since most popular Transformer
architectures have been pre-trained in this script language, and hence have representations for
the base language, in our case English.
   For code-mixed data that already exists in the Roman script, we need to obtain its translation
in the base language. This is performed using software automation tools such as Selenium
combined with Google Translate. For code-mixed data that exists in the Devanagari script, we
instead use a transliteration library and convert the text into the ITRANS ASCII transliteration
scheme, as in the example below.
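
A minimal sketch of the Devanagari-to-Roman transliteration step with the indic-transliteration package (the Devanagari sentence is our own rendering of one Table 2 example, and the final lowercasing is our assumption about the normalisation applied):

    from indic_transliteration import sanscript
    from indic_transliteration.sanscript import transliterate

    # Devanagari-script sentence to be mapped into the Roman-script target language
    devanagari = "यह हमारी सांस्कृतिक विरासत है"

    # Convert to the ITRANS ASCII scheme and lowercase the result
    roman = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.ITRANS).lower()
    print(roman)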
   The input to CalBERT consists of sentence pairs (Table 2). Each pair consists of the translit-
eration of the code-mixed sentence in the target language and the translation of the same
code-mixed sentence in the base language. Both sentences are represented in the Roman script
in our work. Since both sentences have the same semantic meaning, the objective is to reduce
the distance between the corresponding sentence representations. The shared Transformer
layer in CalBERT thus allows it to learn joint representations for both languages and adapts
base language representations to the target language.
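
As a sketch, the pairwise data could be stored as tab-separated target/base sentences and loaded as follows (the file name and column order are hypothetical):

    import csv

    # Hypothetical TSV holding one (target, base) sentence pair per line, as in Table 2
    pairs = []
    with open("calbert_pairs.tsv", encoding="utf-8") as f:
        for target_sentence, base_sentence in csv.reader(f, delimiter="\t"):
            pairs.append((target_sentence, base_sentence))

    print(pairs[0])  # e.g. ("logom ki bhida jama hone lagi", "people started gathering")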
   Due to computational limitations, we did not Siamese pre-train our models on the data
obtained from IndicNLP [3]. However, our experiments show that the small portion of data
obtained via scraping was able to boost performance significantly.
Table 2
Base and target language pairs used to train CalBERT
   Base Language                                                      Target Language
   in reply, pakistan got off to a solid start                        jisake javaba mem pakistan ne achchi shuruata ki thi.
   by this only, we can build our country and make it great again     isake jariye hi hama desha ka nirmana karemge aura use mahana bana payemge
   people started gathering                                           logom ki bhida jama hone lagi
   obtain loans from a bank or an individual                          kisi bank ya vyakti se rrina prapta kara sakati hai
   he was later taken to a hospital and treated                       bada mem use aspatala le jaya gaya aura ilaja kiya gaya
   it’s our cultural heritage                                         yaha hamari samskrritika virasata hai


6.4. Evaluation Metric
Since CalBERT is trained in a task-agnostic manner, it can be fine-tuned on any downstream
natural language understanding task like sentiment analysis and question answering. We
evaluate CalBERT on the Sentiment Analysis for Indian Languages (SAIL) 2017 dataset [2],
which consists of Hinglish code-mixed data in the Roman script.
   We also evaluate CalBERT on the IndicGLUE Product Reviews dataset released by IIT Patna,
which serves as another benchmark for sentiment analysis of Hinglish text, but in the
Hindi (Devanagari) script. Models evaluated on this dataset have so far been trained on either
English-script or Devanagari-script text. We, however, evaluate CalBERT on
a code-mixed version of the dataset obtained by transliterating the text from Devanagari to Roman.
   The dataset is popularly used for evaluating the performance of Transformers built for Indian
languages, such as IndicBERT [3], and consists of reviews for products across various categories.
The highest score obtained on this dataset by IndicBERT is an accuracy of 71.32%.

6.5. Siamese Pre-training CalBERT
To Siamese pre-train a Transformer (Table 3), we first initialise a Siamese network with a shared
layer containing the Transformer whose representations we intend to adapt (BERT, RoBERTa
or DistilBERT). This can be done effectively using a sentence-transformer architecture. We
add a single pooling layer to each branch, and then finally combine the pieces by adding a
suitable loss function to reduce the distance between the representations. In our preliminary
experiments, we observe that the contrastive and cosine distance loss functions perform nearly
the same and hence use the cosine distance loss function for all the experiments performed.
This shared Transformer layer can then be extracted and fine-tuned on other downstream tasks.
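
A sketch of this setup with the sentence-transformers library, using the hyperparameters later listed in Table 3 (the base checkpoint, the pair list and the batch size are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses, models

    # Shared Transformer layer whose representations are adapted ("roberta-base" is a placeholder)
    word_embedding = models.Transformer("roberta-base")
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # Every (target, base) pair is semantically equivalent, so each pair gets maximum similarity
    pairs = [("logom ki bhida jama hone lagi", "people started gathering")]
    train_examples = [InputExample(texts=[t, b], label=1.0) for t, b in pairs]
    train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)  # batch size not given in Table 3

    # Cosine-distance objective that pulls the two branches' embeddings together
    train_loss = losses.CosineSimilarityLoss(model)

    model.fit(
        train_objectives=[(train_loader, train_loss)],
        epochs=10,                        # Table 3
        warmup_steps=100,                 # Table 3
        optimizer_params={"lr": 2.5e-5},  # Table 3
        weight_decay=0.01,                # Table 3
    )

    # The shared Transformer layer can then be extracted and fine-tuned on downstream tasks
    model.save("calbert-siamese-pretrained")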


7. Evaluating CalBERT
CalBERT is meant to be fine-tuned for downstream tasks involving code-mixed data. Given the
abundance of code-mixed Hinglish data available on social media and the pressing need for
code-mixed Transformer architectures, we evaluate performance on the popular downstream
task of sentiment analysis.
Table 3
Hyperparameters for Siamese Pre-training
                             Hyperparameter                   Value
                             Number of epochs                     10
                             Number of warm-up steps            100
                             Learning Rate               2.5 × 10−5
                             Weight decay                       0.01



7.1. SAIL 2017
The Sentiment Analysis of Indian Languages (SAIL) 2017 dataset is a collection of sentences in
two popular Indian code-mixed languages – Hindi-English and Bengali-English. The dataset
is composed of sentences from various sources, such as news articles and social media, and is
in the Roman script. It consists of 9945 train examples, 1238 test examples and 1240 validation
examples. There is also a high degree of variability in the data, with multiple forms existing for
the same word and different styles of writing. The sentences are classified into 3 polarities –
positive, neutral and negative. The SAIL 2017 task is challenging and is widely regarded as one
of the benchmark datasets for sentiment analysis of the Hinglish language. The highest
documented F1-score on the benchmark is 56.9% (achieved by IIIT Hyderabad).
   The SAIL 2017 dataset is already partitioned into train, test and validation splits. All
comparisons are made using the F1-score on the validation split. For our experiments, we fine-tune
existing pre-trained models with and without CalBERT’s additional Siamese pre-training
step and compare the scores obtained by each model. The experiments are performed multiple
times using the same set of hyperparameters (Table 4), and the highest F1-score over 10 runs is recorded.

Table 4
Hyperparameters for SAIL 2017 Fine-Tuning
                                Hyperparameter             Value
                                Learning Rate           2 × 10−6
                                Training Batch Size             4
                                Evaluation Batch Size           4
                                Number of epochs               15
                                Weight Decay                 0.08
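
A minimal fine-tuning sketch with the HuggingFace Trainer, using the hyperparameters in Table 4 (the checkpoint path and the toy sentences are placeholders; the real splits come from SAIL 2017):

    import torch
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Siamese pre-trained CalBERT checkpoint (path is a placeholder)
    checkpoint = "calbert-siamese-pretrained"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

    class SentimentDataset(torch.utils.data.Dataset):
        """Wraps (sentence, label) pairs; labels are 0 = negative, 1 = neutral, 2 = positive."""
        def __init__(self, texts, labels):
            self.encodings = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    # Toy stand-ins for the SAIL 2017 train and validation splits
    train_ds = SentimentDataset(["sab ka Bhai meri Jan Salman khan"], [2])
    eval_ds = SentimentDataset(["lagta hai aaj bhi bating nahi milega #indvssa"], [0])

    args = TrainingArguments(
        output_dir="calbert-sail2017",
        learning_rate=2e-6,              # Table 4
        per_device_train_batch_size=4,   # Table 4
        per_device_eval_batch_size=4,    # Table 4
        num_train_epochs=15,             # Table 4
        weight_decay=0.08,               # Table 4
    )
    Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds).train()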

   CalBERT outperforms the SAIL 2017 benchmark F1-score (Table 5) by 5.1 points, or 8.9%,
with the CalBERT-XLM-RoBERTa model (XLM-RoBERTa with CalBERT’s Siamese pre-training).
We also improve upon the benchmark precision by 3.5% and the recall by 10.3%. Additionally,
all other CalBERT architectures except CalBERT-IndicBERT also outperform the benchmark
F1-score, with the minimum improvement, obtained by CalBERT-DistilBERT, being 3%.
Table 5
Comparison of 𝐹1 -scores on SAIL 2017 benchmark (State-of-the-art results indicated in bold)
                    Model Type                   F1-Score   Precision   Recall
                    CalBERT-XLM-RoBERTa          0.620      0.618       0.618
                    CalBERT-RoBERTa              0.612      0.615       0.614
                    CalBERT-BERT                 0.588      0.581       0.583
                    CalBERT-DistilBERT           0.586      0.587       0.586
                    CalBERT-IndicBERT            0.530      0.529       0.531
                    SAIL-2017 Benchmark          0.569      0.597       0.560


   We observe that the native (non-Siamese-pre-trained) Transformer architectures also obtain
improved performance on the SAIL 2017 benchmark. For this reason, we evaluate CalBERT’s
Siamese pre-training against each native Transformer that was used to Siamese pre-train
CalBERT (Table 6). Our experiments show that CalBERT improves the F1-score over every
native Transformer architecture except IndicBERT, showing that the additional pre-training
can improve performance on the given code-mixed task.

Table 6
Influence of CalBERT on model 𝐹1 -scores
                            Model Type            CalBERT    F1-Score
                            RoBERTa               Y          0.613
                                                  N          0.608
                            BERT                  Y          0.588
                                                  N          0.585
                            DistilBERT            Y          0.584
                                                  N          0.580
                            XLM-RoBERTa           Y          0.620
                                                  N          0.608
                            IndicBERT             Y          0.530
                                                  N          0.544
                            SAIL-2017 Benchmark   N/A        0.569


   Pre-training a Transformer from scratch is computationally expensive as well as time-consuming.
Additionally, it requires a vast amount of data to effectively pre-train a Transformer
to learn usable language representations. Since no code-mixed Transformer architectures
exist at the time of writing this paper, we experiment by pre-training some popular
Transformers (Table 7) using Masked Language Modelling on a subset of our code-mixed data.
The subset is the same size as that used for the CalBERT experiments, to allow a fair comparison
between the two approaches. Our experiments show that the models pre-trained from scratch do
not outperform the benchmark and score significantly lower than all CalBERT architectures,
indicating that Transformers do need enormous amounts of data to provide good results.
Additionally, it reinforces our hypothesis that it is far more efficient as well as effective to
apply CalBERT’s Siamese pre-training to adapt a pre-trained Transformer’s representations for
another code-mixed language.

Table 7
Comparison of 𝐹1 -scores on code-mixed pre-trained Transformers with limited data
                   Model Type                   𝐹1 -Score   Precision   Recall
                   SAIL-2017 Benchmark          0.569       0.597       0.56
                   DistilBERT-big               0.553       0.551       0.557
                   DistilBERT-small             0.551       0.552       0.556
                   BERT-small                   0.551       0.550       0.554
                   BERT-big                     0.543       0.542       0.547
                   RoBERTa-small                0.533       0.531       0.537
                   RoBERTa-big                  0.524       0.523       0.529



Table 8
CalBERT Results for SAIL 2017 Dataset
        Sentence                                        True Label    CalBERT-XLM-RoBERTa
        sab ka Bhai meri Jan Salman khan                POSITIVE      POSITIVE
        hahaha ! gazab imagination hai teri !           POSITIVE      POSITIVE
        lagta hai aaj bhi bating nahi milega #indvssa   NEGATIVE      NEGATIVE
        Or mastery kaisi chal ri hai apki               NEUTRAL       NEUTRAL

  Table 8 showcases some predictions made by the CalBERT-XLM-RoBERTa model. The raw outputs
are integers, where a value of 0 indicates negative sentiment, 1 indicates neutral
sentiment and 2 indicates positive sentiment, as illustrated in the sketch below.
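
A small inference sketch of this mapping (model and tokenizer as in the fine-tuning sketch above; the label dictionary is ours):

    import torch

    LABELS = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}

    def predict(sentence, model, tokenizer):
        """Return the predicted sentiment label for one code-mixed sentence."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]

    # e.g. predict("Or mastery kaisi chal ri hai apki", model, tokenizer) -> "NEUTRAL" (Table 8)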

7.2. IndicGLUE Product Reviews
The IIT Patna product review dataset was released by the IndicNLP organization as part of
the IndicGLUE collection of datasets that are meant to be used for the evaluation of models
trained for NLU tasks on Indian languages. The dataset consists of product reviews in Hindi
taken from a popular e-commerce website. It comprises 4182 train examples and 523 test and
validation examples respectively. Like the SAIL dataset, there is a high degree of variability in
this data too, with multiple forms existing for the same word and different styles of writing.
The sentences are classified into 3 polarities – positive, neutral and negative.
   We observe that CalBERT beats the state-of-the-art accuracy on the dataset by 0.4 points,
or 0.5%, with the CalBERT-XLM-RoBERTa model. We also improve on the score set by the
IndicBERT model, achieving an improvement of 8.05 points, or 11.2%. However, we observe
that the other Transformer architectures do not perform well on this dataset, which is
consistent with previous attempts made by other authors [3].
Table 9
Hyperparameters for IndicGLUE Product Reviews Fine-Tuning
                                Hyperparameter            Value
                                Learning Rate           2 × 10−6
                                Training Batch Size            16
                                Evaluation Batch Size          16
                                Number of epochs               50
                                Weight Decay                 0.02


Table 10
Comparison of accuracy on IndicGLUE Product Review Dataset (State-of-the-art results indicated in
bold)
                  Model Type                                Accuracy
                  CalBERT-XLM-RoBERTa                       0.794
                  IndicGLUE Benchmark                       0.789
                  CalBERT-RoBERTa                           0.639
                  CalBERT-BERT                              0.612
                  CalBERT-IndicBERT                         0.602
                  CalBERT-DistilBERT                        0.564


   We also experiment on this dataset with the custom pre-trained Transformers used in the
SAIL 2017 experiment (Table 11). As seen previously, these models again do not outperform
the benchmark and perform very poorly. However, we observe that in certain cases CalBERT
outperforms the comparable from-scratch pre-trained Transformer model, showing the
effectiveness of the Siamese pre-training used in CalBERT for learning language representations
of code-mixed language.

Table 11
Comparison of accuracy on code-mixed pre-trained Transformers with limited data
                  Model Type                                Accuracy
                  IndicGLUE Benchmark                       0.789
                  IndicBERT                                 0.713
                  DistilBERT-big                            0.671
                  DistilBERT-small                          0.625
                  BERT-small                                0.659
                  BERT-big                                  0.656
                  RoBERTa-small                             0.627
                  RoBERTa-big                               0.627
8. Conclusion
We demonstrate the use of BERT and BERT-based architectures in learning code-mixed language
representations for Hindi-English code-mixed language and evaluate the performance of the
learned embeddings on a benchmark sentiment analysis task. We present a task and language-
agnostic approach to generating cross-language representations for sentences, which can further
be fine-tuned on any specific downstream task.
   We show an 8.9% improvement in the F1-score achieved by the novel Siamese pre-training
method over the existing benchmark score. We also show that CalBERT outperforms the
native Transformer architectures that were used to Siamese pre-train it, demonstrating
that Siamese pre-training can help existing Transformers adapt to a code-mixed language.
   Owing to computational limitations at our end, we have yet to determine the extent of possible
improvement that CalBERT can show on the benchmark dataset, but we postulate that training
on more examples may result in a more significant increase in the performance of the model.


References
 [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in neural information processing systems,
     2017, pp. 5998–6008.
 [2] A. Das, B. Gambäck, Identifying languages at the word level in code-mixed Indian
     social media text, in: Proceedings of the 11th International Conference on Natural
     Language Processing, NLP Association of India, Goa, India, 2014, pp. 378–387. URL:
     https://aclanthology.org/W14-5152.
 [3] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar,
     IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilin-
     gual Language Models for Indian Languages, in: Findings of EMNLP, 2020.
 [4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
 [6] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
 [7] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale, 2020. arXiv:1911.02116.
 [8] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align
     and translate, arXiv preprint arXiv:1409.0473 (2014).
 [9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
     vector space, arXiv preprint arXiv:1301.3781 (2013).
[10] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     arXiv preprint arXiv:1908.10084 (2019).
[11] S. Kumar, S. Kumar, D. Kanojia, P. Bhattacharyya, A passage to india: Pre-trained word
     embeddings for indian languages, in: Proceedings of the 1st Joint Workshop on Spoken
     Language Technologies for Under-resourced languages (SLTU) and Collaboration and
     Computing for Under-Resourced Languages (CCURL), 2020, pp. 352–357.
[12] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for
     sentiment analysis of hindi-english code mixed text, in: Proceedings of COLING 2016, the
     26th International Conference on Computational Linguistics: Technical Papers, 2016, pp.
     2482–2491.
[13] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment analysis of code-mixed
     languages leveraging resource rich languages, CoRR abs/1804.00806 (2018). URL: http:
     //arxiv.org/abs/1804.00806. arXiv:1804.00806.
[14] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah,
     Signature verification using a “siamese” time delay neural network, International Journal
     of Pattern Recognition and Artificial Intelligence 7 (1993) 669–688.
[15] Code-mixed Indian social media text, https://amitavadas.com/Code-Mixing.html, 2016.
     Accessed: 2021-11-05.
[16] Indic-transliteration, python package index, https://github.com/indic-transliteration/indic_
     transliteration_py, 2020.