=Paper=
{{Paper
|id=Vol-3395/T2-8
|storemode=property
|title=Leveraging Dynamic Meta Embedding for Sentiment Analysis and Detection of Homophobic/Transphobic Content in Code-mixed Dravidian Languages
|pdfUrl=https://ceur-ws.org/Vol-3395/T2-8.pdf
|volume=Vol-3395
|authors=Asha Hegde,Shashirekha Hosahalli Lakshmaiah
|dblpUrl=https://dblp.org/rec/conf/fire/HegdeS22
}}
==Leveraging Dynamic Meta Embedding for Sentiment Analysis and Detection of Homophobic/Transphobic Content in Code-mixed Dravidian Languages==
<pdf width="1500px">https://ceur-ws.org/Vol-3395/T2-8.pdf</pdf>
<pre>
Leveraging Dynamic Meta Embedding for Sentiment
Analysis and Detection of Homophobic/Transphobic
Content in Code-mixed Dravidian Languages
Asha Hegde, Hosahalli Lakshmaiah Shashirekha
Department of Computer Science, Mangalore University, Mangalore, India


Abstract
Sentiment Analysis (SA) examines people’s feelings, opinions, sentiments, views, and attitudes towards
entities such as products, movies, services, and organizations, whereas Homophobic/Transphobic
(H/T) content identification aims to detect abusive behaviors, such as hate speech, sexism, and racism,
directed specifically toward Lesbian, Gay, Bisexual, and Transgender (LGBT) people in any text. In parallel
with the growth of social media, the volume of code-mixed content is also increasing, creating
a demand for tools that analyze such content efficiently. However, SA and H/T content detection
in social media text are challenging due to the complex nature of code-mixed text. To tackle
this issue, we, team MUCS, describe in this paper the learning model submitted to the “Sentiment Analysis
and Homophobia Detection of YouTube Comments in Code-Mixed Dravidian Languages” shared task
at the Forum for Information Retrieval Evaluation (FIRE) 2022. The proposed methodology makes use of
Dynamic Meta Embedding (DME) to train a Deep Learning (DL) based Long Short Term Memory
(LSTM) model to perform SA and detect H/T content in code-mixed Dravidian languages, viz. Kannada,
Malayalam, and Tamil. The submitted models obtained 6th, 4th, and 9th rank for Tamil,
Malayalam, and Kannada in Task A and 1st, 4th, 1st, and 5th rank for Tamil, English, Tamil-English, and
Malayalam in Task B, respectively.

Keywords
Dravidian Languages, Code-mixed, Sentiment Analysis, Homophobia, Transphobia, Dynamic Meta
Embedding


1. Introduction
The increasing number of social media platforms and the anonymity of users on these platforms
have enabled more people than ever before to exercise their freedom of expression. This is
increasing the volume of user-generated content, such as opinions, sentiments, reviews of
products and movies, likes and dislikes about an event or news, objectionable content such as
threats and remarks directed at individuals, groups, or organizations, fake news, abusive
language, hope and motivational words, and so on. SA aims to identify the sentiments of the
given text and categorize them into predefined classes such as positive, negative, neutral, etc.,
and has received considerable attention in industry as a means of determining customer
satisfaction with services and products [1]. H/T content identification deals with detecting
abusive speech directed at LGBT people solely because of who they love, how they appear, or who
they are. Across the globe, LGBT people are subjected to violence, inequity, torture, and even
execution. As a result, LGBT people who seek online support are targeted, threatened, and
abused, resulting in severe mental health problems. Hence, automatic identification and removal
of such content from social media is needed to promote equality, diversity, and inclusion in
society [2].

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
hegdekasha@gmail.com (A. Hegde); hlsrekha@gmail.com (H. L. Shashirekha)
https://mangaloreuniversity.ac.in/dr-h-l-shashirekha (H. L. Shashirekha)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   SA and H/T content identification in social media text are challenging because of the complex
nature of the code-mixed text found on social media platforms. Usually, social media text is
written by mixing one or more local or regional languages, for instance, Kannada, Malayalam,
Tamil, etc., with English, at the word and/or sentence level [3] [4]. Additionally, the use
of short forms for words (e.g. ‘g8’ for ‘good night’), internet slang (e.g. ‘plz’ for ‘please’),
words/phrases from other languages, emojis, hashtags, text containing repeated characters
(e.g. ‘soooooo sad’ for ‘so sad’), etc., adds to the complexity of processing code-mixed text
[5]. The rapid growth in the number of social media users intensifies the problem, necessitating
efficient tools or learning models for SA and H/T content identification. Sample text from
the dataset provided by the organizers of the shared task is given in Table 1.


Table 1
Sample text from the given dataset for SA and H/T content detection

   To address the challenges of processing social media text, particularly in code-mixed Dravidian
languages, for SA and H/T content identification, we, team MUCS, describe in this paper the
models submitted to the “Sentiment Analysis and Homophobia Detection of YouTube Comments
in Code-Mixed Dravidian Languages” shared task¹ at FIRE 2022. The shared task consists of
two subtasks: i) Task A is a message-level polarity classification task for SA in code-mixed
Dravidian languages, viz. Kannada, Tamil, and Malayalam, and ii) Task B is to identify H/T
content in code-mixed Tamil, Malayalam, and English texts written in their native script and
Tamil-English text written in Latin script [6]. The proposed methodology makes use of DME to
train DL based LSTM models to perform SA and detect H/T content in code-mixed text.
   The rest of the paper is structured as follows: Section 2 presents related work and Section 3
explains the methodology. Section 4 describes the experiments and their outcomes, and
the paper concludes in Section 5 with future work.

¹ https://codalab.lisn.upsaclay.fr/competitions/5310#learn_the_details


2. Related work
Several researchers have explored SA in Dravidian languages, and a few of the relevant works are
described below:
Chakravarthi et al. [7] created a Tamil-English code-mixed corpus of 15,744 YouTube comments
for sentiment classification. Their study uses Machine Learning (ML) models (Random Forest
(RF), Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), and k-Nearest
Neighbor (kNN)), a DL based 1D Convolutional-Long Short Term Memory (1D-convLSTM)
classifier, and a transformer-based classifier with multilingual Bidirectional Encoder
Representations from Transformers (mBERT) to classify YouTube comments. Term Frequency-Inverse
Document Frequency (TF-IDF) of n-grams in the range n = (1, 3) is used to train the ML classifiers
and Keras embeddings are used to train the 1D-convLSTM classifier. Among all the models, the RF
classifier obtained the maximum weighted F1 score of 0.65. Kusampudi et al. [8] present a
code-mixed Telugu-English corpus extracted from Twitter and blogs, of size 9,657 and 24,404
sentences respectively, to perform SA. The authors developed ML models (SVM, NB, LR, kNN, and RF)
for SA with TF-IDF of character and word n-grams, both in the range n = (1, 3), as features.
They also implemented a DL based Bidirectional LSTM (BiLSTM) and a hybrid model combining
BiLSTM and Conditional Random Field (BiLSTM+CRF) to perform SA with Keras embeddings as
features. The BiLSTM model obtained an accuracy of 0.98 on the blog dataset and the BiLSTM+CRF
model an accuracy of 0.99 on the Twitter dataset. Chakravarthi et al. [9] created a
Malayalam-English code-mixed dataset of 6,738 sentences extracted from YouTube comments
using a YouTube comment scraper² for SA. The authors implemented ML models (LR, SVM, Decision
Tree (DT), RF, Multinomial Naive Bayes (MNB), and kNN), DL models (1DConvLSTM and LSTM), and a
transformer-based classifier with mBERT to perform SA. They used TF-IDF of word tri-grams and
Keras embeddings as features to train the ML and DL models respectively. Among all the models,
mBERT performed best with an F1 score of 0.75.
   Several workshops and shared tasks focus on H/T content identification in social
media text. Prominent among them is the Homophobia/Transphobia Detection shared
task at Language Technology for Equality, Diversity and Inclusion (LT-EDI) - Association for
Computational Linguistics (ACL) 2022, which focuses on detecting H/T content in English and
in code-mixed Dravidian languages, viz. Tamil text in the native script and Tamil text in Latin
script³ [10]. The following are a few of the recent works related to the detection of H/T content
in Dravidian languages:
Swaminathan et al. [11] developed two SVM classifiers, with TF-IDF and GloVe embeddings as
features, and a transformer-based classifier with mBERT to detect H/T content. The
transformer-based classifier with mBERT outperformed the SVM classifiers with weighted F1 scores of
0.93, 0.75, and 0.87, securing 11th, 9th, and 9th rank for English, Tamil, and Tamil-English
respectively. The transformer-based classifiers proposed by Bhandari and Goyal [12] to detect H/T
content make use of IndicBERT, the cross-lingual language model with Robustly Optimized BERT
(XLM-RoBERTa), and mBERT. Among all the
models, the transformer-based classifier with mBERT exhibited maximum weighted F1 scores
of 0.42, 0.64, and 0.58, placing 9th, 6th, and 3rd in the shared task for English, Tamil, and
Tamil-English respectively.
   From the literature, it is clear that though several works have been carried out on SA and
H/T content identification in Dravidian languages, there is still scope for developing tools and
models in this direction, as the reported results leave room for improvement.

² https://github.com/philbot9/
³ https://competitions.codalab.org/competitions/36394


Figure 1: Framework of the proposed method


3. Methodology
The proposed methodology for SA and detection of H/T content in code-mixed Dravidian languages
includes three major steps: Preprocessing, Text vectorization, and Classifier construction.
The framework of the proposed methodology is shown in Figure 1 and the steps are explained
below:
   Preprocessing is the process of cleaning text data with the aim of improving the performance
of the classifier. The text is preprocessed by converting emojis into text and removing
digits, punctuation, URLs, and stopwords. The English stopword list available in the Natural Language
Toolkit (NLTK)⁴ library, a Kannada stopword list available on GitHub⁵, and a Tamil stopword list
available on GitHub⁶ are used to remove the stopwords of the respective languages.
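
The preprocessing steps listed above can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the tiny STOPWORDS set is a hypothetical stand-in for the NLTK and GitHub lists, and emojis are converted to text with the standard-library Unicode names because the paper does not say which emoji converter was used.

```python
import re
import string
import unicodedata

# Hypothetical stand-in for the English/Kannada/Tamil stopword lists.
STOPWORDS = {"this", "is", "the", "a", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep "_" so that converted emoji names survive punctuation removal.
PUNCT = string.punctuation.replace("_", "")
PUNCT_TABLE = str.maketrans(PUNCT, " " * len(PUNCT))


def emoji_to_text(text):
    """Replace symbol characters (Unicode category 'So') by their Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "").lower()
            pieces.append(" " + name.replace(" ", "_").replace("-", "_") + " ")
        else:
            pieces.append(ch)
    return "".join(pieces)


def preprocess(text):
    """Remove URLs, digits, punctuation, and stopwords; convert emojis to text."""
    text = URL_RE.sub(" ", text)
    text = re.sub(r"\d+", " ", text)
    text = emoji_to_text(text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.lower().split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("This movie is soooo good \U0001F600 100% https://example.com/x !!"))
```

URLs are removed first so that their punctuation and digits do not leave stray fragments behind.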
   Text vectorization aims to transform the text into numeric vectors, which are in turn used
to train the learning models. Distributed representation of words, also known as word
embeddings, is a popular word representation technique in which each word is represented by a
low-dimensional vector such that words with similar meanings have similar
representations [13]. Word2Vec⁷, fastText⁸, GloVe⁹, etc., are some popular word embedding models
with very large vocabularies, available in various dimensions such as 50, 100, 300, etc. However,
selecting the right embeddings for a specific task from the available embedding techniques is
always challenging. Further, the usefulness of word embeddings for downstream tasks, such as
text classification, machine translation, text summarization, natural language understanding,
etc., tends to be hard to predict. Therefore, instead of relying on any single embedding, it is
beneficial to combine the strengths of different word embeddings. This also increases the lexical
coverage by allowing systems to take the union of the vocabularies of the different embeddings.
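
The lexical-coverage argument can be illustrated with a toy example. The two embedding tables, their words, and their 2-dimensional vectors below are invented for illustration and are not taken from the paper.

```python
# Two toy embedding tables (hypothetical words and vectors).
emb_a = {"good": [0.1, 0.2], "movie": [0.3, 0.1]}
emb_b = {"movie": [0.5, 0.4], "padam": [0.2, 0.9]}


def coverage(tokens, vocab):
    """Fraction of tokens that have an entry in the given vocabulary."""
    return sum(tok in vocab for tok in tokens) / len(tokens)


def lookup(token, tables, dim):
    """One vector per table; a zero vector stands in for a missing word."""
    return [table.get(token, [0.0] * dim) for table in tables]


sentence = ["good", "padam", "movie", "semma"]
union_vocab = set(emb_a) | set(emb_b)

print(coverage(sentence, emb_a))        # coverage of the first table alone
print(coverage(sentence, emb_b))        # coverage of the second table alone
print(coverage(sentence, union_vocab))  # coverage of the combined vocabulary
print(lookup("padam", [emb_a, emb_b], dim=2))
```

Each table alone covers half of the sentence, while the union covers three of the four tokens; the zero-vector fallback is one simple way to keep the per-table lookups aligned.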
   DME is a supervised approach to learning embedding ensembles in which the Neural Network (NN)
decides which embeddings to use. This is achieved by adding an ensembled embedding
layer that allows the network to learn which embeddings it prefers by predicting a weight for
each embedding type. Instead of using a single word embedding, the proposed work utilizes
DME, in which the primary word embeddings are ensembled with additional learnable weights
through an LSTM encoder. In this work, Word2Vec¹⁰ and fastText¹¹ embeddings are built using the
gensim¹² library on the training dataset provided by the shared task organizers, and
these embeddings are then ensembled to create the DME. Both embedding models are trained with a
latent dimension of 100, a window size of 3, a random seed of 33, and 10 epochs. In
the proposed method, the maximum sequence length is set to 200 and two LSTM layers are
stacked with a dropout of 0.3. Finally, softmax attention is used as the final layer, and the
model is trained with the Adam optimizer.
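
The weighting idea behind DME can be sketched for a single word as below. This is a simplified, non-contextual sketch under stated assumptions: both embeddings already share the same dimension (as with the two 100-dimensional models here), and the attention parameters attn_weights and attn_bias are fixed numbers standing in for weights that the network would learn during training. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dme_combine(vectors, attn_weights, attn_bias=0.0):
    """Combine one word's embeddings into a single meta-embedding.

    vectors: one embedding per embedding type, all of the same length.
    Returns the weighted sum and the per-embedding attention weights.
    """
    # One scalar score per embedding type: a . w + b
    scores = [
        sum(a * x for a, x in zip(attn_weights, vec)) + attn_bias
        for vec in vectors
    ]
    alphas = softmax(scores)
    dim = len(vectors[0])
    meta = [sum(alpha * vec[k] for alpha, vec in zip(alphas, vectors)) for k in range(dim)]
    return meta, alphas


# Toy 2-dimensional vectors standing in for the Word2Vec and fastText vectors of one word.
word2vec_vec = [1.0, 0.0]
fasttext_vec = [0.0, 1.0]
meta, alphas = dme_combine([word2vec_vec, fasttext_vec], attn_weights=[0.0, 0.0])
print(meta, alphas)
```

With all-zero attention parameters both embeddings receive equal weight; training would move these weights toward whichever embedding helps the task more for a given word.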

⁴ https://www.nltk.org/nltk_data/
⁵ https://gist.github.com/MSDarshan91
⁶ https://gist.github.com/arulrajnet/
⁷ https://code.google.com/archive/p/word2vec/
⁸ https://fasttext.cc/docs/en/pretrained-vectors.html
⁹ https://nlp.stanford.edu/projects/glove/
¹⁰ https://radimrehurek.com/gensim/models/word2vec.html
¹¹ https://radimrehurek.com/gensim/models/fasttext.html
¹² https://radimrehurek.com/gensim/

Language     Positive   Negative   Mixed feelings   Unknown state   not Kannada   not Tamil   not Malayalam
Train set
Kannada         2,823      1,188              574             711           916           -               -
Tamil          20,069      4,271            4,020           5,628             -       1,667               -
Malayalam       6,421      2,105              926           5,279             -           -           1,157
Development set
Kannada           321        139               69              52           119           -               -
Tamil           2,257        480              611             438             -         176               -
Malayalam         786        237              102             580             -           -             141

Table 2
Classwise distribution of the dataset for Task A

Tag                      English   Tamil   Malayalam   Tamil-English
Train set
Non-anti-LGBT+ content     3,001   2,022       2,434           3,438
Homophobic                   157     485         491             311
Transphobic                    6     155         189             112
Development set
Non-anti-LGBT+ content       732     526         692             862
Homophobic                    58     103         133              66
Transphobic                    2      37          41              38

Table 3
Classwise distribution of the dataset for Task B


3.1. Model Construction
The goal of the shared task is to perform SA and detect H/T content in code-mixed Dravidian
languages. To address these tasks, a DL based LSTM model is implemented using DME features.
Though DL based models such as the Recurrent Neural Network and the Convolutional Neural
Network produce considerable results, these models suffer from a short-term memory issue
when handling longer sentences, which leads to the vanishing gradient problem: during
backpropagation, the gradient grows so small that it approaches zero, rendering the neuron
useless for further processing. LSTM, which memorizes the important information in the data by
assigning weights to it, can be used to mitigate the vanishing gradient problem. Hence, LSTM is
helpful when dealing with longer sentences. With appropriate embedding layers and an LSTM encoder,
the model is expected to produce good results.
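
For reference, the way an LSTM cell assigns weights to information can be written with the standard gate equations. This is the textbook formulation, not one given in the paper; the symbols (input x_t, hidden state h_t, cell state c_t, parameters W, U, b, sigmoid σ, elementwise product ⊙) are notation introduced here.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Because the cell state c_t is updated additively, the gradient along it is scaled by the forget gate f_t rather than repeatedly squashed, which is why the vanishing gradient problem is less severe when f_t stays close to 1.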


4. Experiments and Results
The statistics of the datasets provided by the shared task organizers for Task A [14] and Task B
[15] are given in Tables 2 and 3 respectively. Both datasets are clearly imbalanced,
and this may affect the performance of the learning models. The proposed models were used to
predict the class labels of the unlabeled Test sets provided by the organizers and the predictions
were submitted to the organizers for evaluation. The predictions were evaluated and ranked by
the organizers based on the F1 score. As per the leaderboard of the shared task,
the proposed DL based LSTM model with DME obtained F1 scores between 0.16 and 0.74 across the
two tasks. The performance of the proposed method for Task A and B, along with the ranks obtained
in the shared task, is given in Table 4. In Task A, the proposed method exhibited its lowest F1
score of 0.16 for the Tamil language, where 56% of the comments in the Tamil dataset belong to the
‘positive’ class, reflecting the imbalance in the classwise distribution of the dataset. The
proposed method obtained a better F1 score of 0.61 for Malayalam, as the Malayalam dataset has a
more even distribution of classes than the Tamil dataset. Similarly, in Task B, the Malayalam
dataset has comments more evenly distributed over all the classes than the other datasets. Hence,
the proposed method obtained its best F1 score of 0.74 for the Malayalam dataset.

Language        F1 score   Rank
Task A
Tamil               0.16      6
Malayalam           0.61      4
Kannada             0.44      9
Task B
Tamil               0.36      1
Malayalam           0.74      5
English             0.37      4
Tamil-English       0.58      1

Table 4
Performance measure of the proposed method for Task A and B
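
The class imbalance discussed in this section can be checked directly from the Task A training counts in Table 2. The short sketch below only recomputes shares from those published counts; the variable and function names are illustrative.

```python
# Train-set counts for Task A, copied from Table 2.
TRAIN_COUNTS = {
    "Kannada": {"Positive": 2823, "Negative": 1188, "Mixed feelings": 574,
                "Unknown state": 711, "not Kannada": 916},
    "Tamil": {"Positive": 20069, "Negative": 4271, "Mixed feelings": 4020,
              "Unknown state": 5628, "not Tamil": 1667},
    "Malayalam": {"Positive": 6421, "Negative": 2105, "Mixed feelings": 926,
                  "Unknown state": 5279, "not Malayalam": 1157},
}


def class_shares(counts):
    """Return each class's share of the total number of comments."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for language, counts in TRAIN_COUNTS.items():
    shares = class_shares(counts)
    majority = max(shares, key=shares.get)
    print(f"{language}: majority class {majority} = {shares[majority]:.1%}")
```

The 'Positive' class is the majority class in all three languages, with the largest share in Tamil (about 56%) and the smallest in Malayalam (about 40%).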


Figure 2: Comparison of F1 scores of the participating teams for Task A


  The proposed method exhibited F1 scores of 0.16, 0.61, and 0.44, securing 6th,
4th, and 9th rank for Tamil, Malayalam, and Kannada respectively in Task A. For Task B, the
models exhibited F1 scores of 0.36, 0.37, 0.58, and 0.74, securing 1st, 4th, 1st, and 5th rank for
Tamil, English, Tamil-English, and Malayalam respectively. Figures 2 and 3 show the comparison
of F1 scores of all the participating teams for Task A and B respectively, illustrating how
the proposed DL based LSTM model with DME compares with the other submissions.
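
The paper does not state which averaging of the F1 score the organizers used for ranking, so the sketch below simply shows how the macro and weighted variants differ on imbalanced data. The labels and predictions are invented toy data, and f1_scores is a hypothetical helper written for this illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return per-class F1, macro-averaged F1, and support-weighted F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[label] * support[label] for label in labels) / len(y_true)
    return per_class, macro, weighted


# Toy data: a classifier that always predicts the majority class.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 10
per_class, macro, weighted = f1_scores(y_true, y_pred)
print(per_class, round(macro, 3), round(weighted, 3))
```

On this toy data the macro F1 is much lower than the weighted F1, because the macro average counts the ignored minority class as heavily as the majority class. This is one reason scores on imbalanced datasets depend strongly on the averaging used.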


Figure 3: Comparison of F1 scores of the participating teams for Task B


5. Conclusion and Future work
This paper describes the models proposed by team MUCS for SA and identification of H/T
content in social media text, particularly in code-mixed Dravidian languages, submitted to the
“Sentiment Analysis and Homophobia Detection of YouTube Comments in Code-Mixed Dravidian
Languages” shared task at FIRE 2022. In the proposed strategy, DME features are used to
train a DL based LSTM model for SA and identification of H/T content in code-mixed Dravidian
languages, viz. Kannada, Malayalam, and Tamil. The proposed models exhibited F1
scores of 0.16, 0.61, and 0.44 for Tamil, Malayalam, and Kannada respectively in Task A and F1
scores of 0.36, 0.37, 0.58, and 0.74 for Tamil, English, Tamil-English, and Malayalam respectively
in Task B. These models secured 6th, 4th, and 9th rank for Tamil, Malayalam, and Kannada
respectively in Task A and 1st, 4th, 1st, and 5th rank for Tamil, English, Tamil-English, and
Malayalam respectively in Task B. Efficient resampling techniques to handle
imbalanced classes, together with effective feature extraction, will be explored in future work.


References
 [1] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed Dataset
     for Sentiment Analysis and Offensive Language Detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s
     in Social Media, 2020, pp. 54–63.
 [2] E. A. McConnell, A. Clifford, A. K. Korpak, G. Phillips II, M. Birkett, Identity,
     Victimization, and Support: Facebook Experiences and Mental Health among LGBTQ Youth, in:
     Computers in Human Behavior, Elsevier, 2017, pp. 237–244.
 [3] B. B. Kachru, Toward Structuring Code-Mixing: An Indian Perspective, in: Walter de
     Gruyter, 1978, pp. 27–46.
 [4] A. Hegde, M. D. Anusha, S. Coelho, H. L. Shashirekha, B. R. Chakravarthi, Corpus
     Creation for Sentiment Analysis in Code-Mixed Tulu Text, in: Proceedings of SIGUL 2022
     @LREC2022, 2022, pp. 33–40.
 [5] A. Hegde, M. D. Anusha, H. L. Shashirekha, Ensemble Based Machine Learning Models
     for Hate Speech and Offensive Content Identification, in: Forum for Information Retrieval
     Evaluation (Working Notes) (FIRE), 2021, pp. 43–49.
 [6] K. Shumugavadivel, M. Subramanian, P. K. Kumaresan, B. R. Chakravarthi, B. B,
     S. Chinnaudayar Navaneethakrishnan, L. S.K, T. Mandl, R. Ponnusamy, V. Palanikumar, M. Balaji J,
     Overview of the Shared Task on Sentiment Analysis and Homophobia Detection of YouTube
     Comments in Code-Mixed Dravidian Languages, in: Working Notes of FIRE 2022 - Forum
     for Information Retrieval Evaluation, CEUR, 2022.
 [7] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus Creation for
     Sentiment Analysis in Code-Mixed Tamil-English Text, in: arXiv preprint arXiv:2006.00206,
     2020.
 [8] S. S. V. Kusampudi, A. Chaluvadi, R. Mamidi, Corpus Creation and Language Identification
     in Low-Resource Code-Mixed Telugu-English Text, in: Proceedings of the International
     Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, pp.
     744–752.
 [9] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A Sentiment Analysis
     Dataset for Code-Mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), 2020, pp. 177–184.
[10] B. R. Chakravarthi, R. Priyadharshini, T. Durairaj, J. P. McCrae, P. Buitelaar, P. Kumaresan,
     R. Ponnusamy, Overview of The Shared Task on Homophobia and Transphobia Detection
     in Social Media Comments, in: Proceedings of the Second Workshop on Language
     Technology for Equality, Diversity and Inclusion, 2022, pp. 369–377.
[11] K. Swaminathan, B. Bharathi, G. Gayathri, H. Sampath, Ssncse_NLP@ LT-EDI-ACL2022:
     Homophobia/Transphobia Detection in Multiple Languages using SVM Classifiers and Bert-
     based Transformers, in: Proceedings of the Second Workshop on Language Technology
     for Equality, Diversity and Inclusion, 2022, pp. 239–244.
[12] V. Bhandari, P. Goyal, bitsa_nlp@LT-EDI-ACL2022: Leveraging Pretrained Language
     Models for Detecting Homophobia and Transphobia in Social Media Comments, in:
     Proceedings of the Second Workshop on Language Technology for Equality, Diversity and
     Inclusion, Association for Computational Linguistics, 2022, pp. 149–154.
[13] D. J. Chalmers, Syntactic Transformations on Distributed Representations, in:
     Connectionist natural language processing, Springer, 1992, pp. 46–55.
[14] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly,
     J. P. McCrae, Dravidiancodemix: Sentiment analysis and offensive language identification
     dataset for Dravidian languages in code-mixed text, in: Language Resources and Evaluation,
     Springer, 2022, pp. 1–42.
[15] B. R. Chakravarthi, R. Priyadharshini, R. Ponnusamy, P. K. Kumaresan, K. Sampath,
     D. Thenmozhi, S. Thangasamy, R. Nallathambi, J. P. McCrae, Dataset for Identification of
     Homophobia and Transophobia in Multilingual YouTube Comments, in: arXiv preprint
     arXiv:2109.00227, 2021.

</pre>