Applying TF-IDF and BERT-based Variants under
Multilabel Classification for Emotion Detection in
Urdu Language
Sakshi Kalra¹, Saransh Goel¹, Kushank Maheshwari¹, Yashvardhan Sharma¹ and
Shresht Bhowmick²
¹ Department of CSIS, BITS Pilani, 333031, Rajasthan, INDIA
² Greenwood High International School, Gunjur Village, Varthur, Karnataka 560087


Abstract
Emojis are now a common way to express an emotion with a single image instead of a long
sentence. Each emoji conveys a particular emotion, such as anger, disgust, fear, sadness,
surprise, or happiness. Identifying the emotions in a text therefore amounts to tagging the text
with multiple emojis, each pointing to a different emotion. This paper aims to detect multiple
emotions in an Urdu text, which is a multi-label classification problem. We use pre-trained BERT
models to supply basic knowledge of the language (Urdu in our case) and add a classification
layer on top of the pre-trained model using PyTorch. The output layer has seven nodes: six for
the six emotions and a seventh for neutral. The Urdu tweet dataset used here was provided by
FIRE 2022 as part of the subtask "Multi-label emotion classification in Urdu" of the main task
"EmoThreat: Emotion and Threat detection in Urdu."

Keywords
Social media, UrduHack, BERT, Distil-BERT, Multi-label classification, Negative weight, Positive weight,
One vs Rest, Transformers model, Text classification, Tokenizer


1. Introduction
With its large-scale expansion, social media now shapes the narrative of a whole country
or even the whole world, as evidenced by the country-wide and worldwide campaigns that
started from social media accounts and spread into the population.
The messages and tweets posted by users online drive these effects, so they must be
analysed to understand the mindset of users about different topics in the public domain.
Rather than assigning a single emotion to a text, it is better to categorise it into
multiple emotions such as anger, fear, and sadness.
Emotion classification helps in identifying the mood of the population about a topic. This
type of task comes under ”affective computing,” as it was defined in [1] in 1995, which is
”computing that relates to, arises from, or influences emotions,” or it can be said that ”affective

FIRE 2022: Forum for Information Retrieval Evaluation, December 9-13, 2022, India
p20180437@pilani.bits-pilani.ac.in (S. Kalra); f20190988@pilani.bits-pilani.ac.in (S. Goel);
f20180679@pilani.bits-pilani.ac.in (K. Maheshwari); yash@pilani.bits-pilani.ac.in (Y. Sharma);
bhowmickshresht@gmail.com (S. Bhowmick)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
computing” is computing that has to do with emotions [2]. This was once a niche
research field, but today hundreds of companies and researchers work on
it. Humans express their emotions through a variety of means, including facial expressions,
text, audio, body gestures, and movements. Although the body also responds physically to
emotions through changes in heart rate, breathing, and so on, the proposed model performs
multi-label emotion detection on text data only, as required by the given dataset. Detecting
emotion in text is not simply a matter of spotting keywords for each type of emotion; sometimes
emotion is conveyed through the meaning of concepts, their context in a sentence, and the
interaction between them [3]. The proposed model follows this view of categorising text among
different emotions. A text can also contain more than one emotion; for example, someone could
be sad and angry at the same time, so the task becomes a multi-label classification.
The rest of the paper is organised as follows: Section 2 describes the related work; Section
3 describes the dataset, the challenges that come with it, and their solutions; Section 4
describes our model design and techniques; Section 5 presents the evaluation and results of
our models on the data; and Section 6 concludes.


2. Related Work
Several authors have addressed related emotion, threat, and sentiment detection tasks, such as
[4], [5], [6], [7], [8] and [9]. Many methods for multi-label classification are used in machine
learning, as explained in [10]. The first is the ranking method: the classes are ranked for each
data point, and the higher ranked classes are chosen as its labels. This is an older method in
machine learning. Another family is the problem transformation method, in which the multi-label
classification is transformed into one or more single-label classifications. It includes the
following alternative transformations: (1) randomly select one of the labels for each multi-label
instance, (2) discard those instances having multiple labels, (3) consider each different set of
labels as a new label, and (4) transform the dataset so that an instance with three labels appears
as three copies of the same data point, one per label, making it a multi-class classification
problem. A further family is the algorithm adaptation method, which modifies the learning
algorithm itself, for example through a custom entropy loss function for multi-label data. This
is the approach implemented in this paper, and the proposed custom loss function also accounts
for the class imbalance in multi-label classification.
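
The four problem transformations enumerated above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy dataset are assumptions made for this example and are not taken from [10].

```python
import random

# Toy multi-label dataset of (instance id, set of labels); illustrative only.
data = [
    ("t1", {"anger", "sadness"}),
    ("t2", {"happiness"}),
    ("t3", {"fear", "sadness", "surprise"}),
]

def select_random_label(dataset, seed=0):
    """(1) Keep one randomly chosen label per instance."""
    rng = random.Random(seed)
    return [(x, rng.choice(sorted(labels))) for x, labels in dataset]

def discard_multilabel(dataset):
    """(2) Drop every instance that has more than one label."""
    return [(x, next(iter(labels))) for x, labels in dataset if len(labels) == 1]

def label_powerset(dataset):
    """(3) Treat each distinct label set as one new class."""
    return [(x, "+".join(sorted(labels))) for x, labels in dataset]

def copy_per_label(dataset):
    """(4) Emit one single-label copy of the instance per label."""
    return [(x, label) for x, labels in dataset for label in sorted(labels)]
```

Transformations (1) and (2) discard label information, while (3) can create many rare classes, which is one reason a loss-based adaptation may be preferred when labels co-occur often.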
[11] describes how the BERT model can be used for text classification. It describes
the structure of the BERT model, which takes tokenized text as input along with an attention
mask and is pre-trained over a large corpus of multilingual data. [2] gives a comprehensive
survey on emotion detection and finds multi-modal systems to be best suited to emotion detection
tasks. This mirrors how humans read each other, which is also a multi-modal process: humans
analyse each other's face, voice, body posture, etc. [3] focuses on emotion detection in text data.
The method it describes, the "keyword spotting technique", involves finding particular keywords
as sub-strings of a sentence; each keyword is associated with one or more emotions and can help
identify the emotion in a sentence. The shortcoming explained in [3] is that the meaning of a
keyword changes with context. The word "accident", for example, generally carries a negative
sense, but in the sentence "I found my life partner by accident" it is used in a positive sense,
so the keyword spotting technique fails in this type of case. [12] experiments
with different machine learning based techniques for abusive language detection in Urdu text
and achieves an accuracy of 93.6% by using soft voting over three BERT variants
(urduhack, BERT and XLM-RoBERTa). The authors of [13] proposed a model for detecting
threatening posts using transformer-based deep learning models; they employed a pretrained
BERT variant (RoBERTa) to classify text as threatening or non-threatening
and obtained an F1 score of 53.46% and a ROC AUC of 81.99%.
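
To make the limitation of the keyword spotting technique from [3] concrete, here is a minimal sketch. The lexicon entries and the function name are hypothetical and chosen only for this example; a real system would rely on a curated emotion lexicon.

```python
# Hypothetical keyword-to-emotion lexicon (an assumption for illustration).
LEXICON = {
    "accident": {"fear", "sadness"},
    "furious": {"anger"},
    "delighted": {"happiness"},
}

def spot_emotions(sentence, lexicon=LEXICON):
    """Return the union of emotions whose keywords occur as sub-strings."""
    text = sentence.lower()
    found = set()
    for keyword, emotions in lexicon.items():
        if keyword in text:
            found |= emotions
    return found

# Plausible match: negative keywords in a negative sentence.
negative_case = spot_emotions("I was furious after the accident")

# Context failure: "accident" is used in a positive sense here, yet the
# sentence is still tagged with the negative emotions tied to the keyword.
positive_case = spot_emotions("I found my life partner by accident")
```

Because the lookup ignores context, both sentences receive the emotions attached to "accident", which is exactly the failure mode described above.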
Another work [14] fine-tuned monolingual and multilingual transformers over Urdu text and
used ensembling techniques to combine the results of RoBERTa-urdu-small, XLM-RoBERTa,
bert-based-multilingual-case and Alberta-urdu-large, obtaining an accuracy of 0.596 and an F1
score of 0.449. The authors of [15] obtained their highest F1 score of 0.7993 by fine-tuning a
classification layer over pre-trained BERT models. They also used data augmentation to make the
models generalise better and used both machine learning and deep learning techniques for
the task of recognising hate and offensive speech. The effectiveness of several pre-trained
multilingual BERT models in the detection of threats and hate speech, which are also types
of emotions, is discussed in [14] and [15]. [16] surveys emotion detection by
exploring various methods of categorising emotions. One is Direct Emotion Detection, which
considers six or more basic emotions, holds that all other emotions are combinations of these
basic emotions, and treats each basic emotion as independent. The other is Dimensional
Emotion Detection, which does not consider emotions to be independent and instead defines a
2-D or 3-D space for emotion categorisation. In the 2-D space, the X-axis represents valence
and the Y-axis represents arousal, and each region corresponds to a certain kind of emotion.
A Z-axis can be added to represent the person's control over that emotion.


3. Dataset
The dataset is provided by FIRE 2022 under sub-task A (Multi-label emotion classification in
Urdu) of the main task EmoThreat: Emotion and Threat detection in Urdu. The training dataset
contains sentences in Urdu and seven binary labels (0 or 1) for each sentence, so a sentence
may carry several labels at once. The code for this task is available in a GitHub repository¹.
The distribution statistics of each label in the training data are shown in Table 1. Each label
corresponds to a particular emotion: anger, disgust, fear, sadness, surprise, happiness, or neutral.
Figure 1 depicts the training data and reveals a problem of unbalanced data, which could bias
the trained model parameters towards labels that have a high number of sentences. For example,
the label "neutral" has a higher count than all others. This imbalance is handled in Section 4.2
by calculating positive and negative weights for each label.


    ¹ https://github.com/saransh-goel/emotion_detection.git
Table 1
Dataset Statistics
                                    Category      Training Data
                                      Anger             811
                                     Disgust            761
                                       Fear             609
                                     Sadness           2190
                                     Surprise          1550
                                    Happiness          1046
                                     Neutral           3014
                                      Total            9981


                       Figure 1: Training set distribution in the Urdu Dataset


4. Proposed Techniques and Algorithms
4.1. Multi-label classification
Following [17], let the input data be d-dimensional, $X \subseteq \mathbb{R}^d$, and let
$Q = \{1, 2, \ldots, q\}$ be the set of labels, where $q$ is the number of labels. Each instance
$x \in X$ can be associated with a subset of labels $L \in 2^Q$, called the relevant labels for
$x$; the labels in the complement of $L$, i.e. $\bar{L} = Q \setminus L$, are called the
irrelevant labels for $x$. A training dataset of size $l$ for multi-label classification is a
set of elements of $X \times 2^Q$, i.e.,

$$\{(x_1, L_1), \ldots, (x_i, L_i), \ldots, (x_l, L_l)\}$$

Multi-label classification then means learning a function $f(x) : X \to 2^Q$. There are two main
methods for multi-label classification:
1. Data decomposition method
2. Algorithm extension method
4.1.1. Data decomposition method
This method includes binary classifiers, the One vs Rest method, the One vs One method, etc. A
widely used trick in this method is to define a function for each class, i.e.
$f_i(x) : X \to \mathbb{R}$, $i = 1, 2, \ldots, q$, such that $f_k(x) > f_i(x)$, $i \neq k$, if
$x$ belongs to class $k$, i.e.,

$$f_k(x) > f_i(x), \quad k \in L,\ i \in \bar{L} \tag{1}$$

which means that relevant labels should be ranked higher than irrelevant labels. A threshold $t$
can further be set to select the relevant labels:

$$f(x) = \{k \mid f_k(x) \geq t,\ k = 1, 2, \ldots, q\} \tag{2}$$

Methods such as One vs Rest, which the proposed model uses, come into the picture to set this
threshold with the help of binary classifiers like Naive Bayes, SVC and logistic regression.
Following [17], the One vs Rest method divides a q-class multi-label dataset into q binary
subsets, where the $i$-th subset consists of positive instances with the $i$-th label and
negative instances with all other labels. This method helps in identifying the threshold
$t$ in Eq. (2).
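
The One vs Rest decomposition and the thresholding rule of Eq. (2) can be sketched as follows. The function names, the toy instances, and the threshold value of 0.5 are assumptions made for this illustration; the binary classifiers that would produce the per-label scores are left out.

```python
LABELS = ["anger", "disgust", "fear", "sadness", "surprise", "happiness", "neutral"]

def one_vs_rest_subsets(dataset, labels=LABELS):
    """Split a multi-label dataset into one binary dataset per label.

    In the subset for label i, an instance is positive (1) if label i is
    among its relevant labels and negative (0) otherwise.
    """
    return {
        label: [(x, int(label in relevant)) for x, relevant in dataset]
        for label in labels
    }

def predict_labels(scores, threshold=0.5):
    """Eq. (2): keep every label k whose score f_k(x) reaches the threshold t."""
    return {label for label, score in scores.items() if score >= threshold}

# Toy data: (instance id, set of relevant labels).
toy = [("x1", {"anger", "sadness"}), ("x2", {"neutral"})]
subsets = one_vs_rest_subsets(toy)
```

Each of the q binary subsets can then be given to an ordinary binary classifier, and the per-label scores are combined with `predict_labels` to recover a label set.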

4.1.2. Algorithm extension method
This method uses multi-class classifiers and handles multi-label classification within a single
function, as in the proposed model, which uses a pre-trained BERT model with a
classification head on top for multi-label classification.
The BERT variants used as base models are UrduHack and distil-BERT, which can work with
multilingual data, as the dataset contains sentences in Urdu. For multi-label classification, as
shown in Figure 2, the proposed model has seven parallel feed-forward dense networks, one for
each class. Each gives a two-node output and works as a binary classifier for its
own class. The training process begins with tokenizing the data, adding the required
[CLS] and [SEP] tokens, and padding. The tokenized text is then passed as input to the BERT
model, and the BERT output corresponding to the [CLS] token is used as input to the
classification layers, as described in [11]. Table 2 lists the hyperparameters used while
training the proposed model.
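
A minimal PyTorch sketch of the seven parallel two-node heads is given below. It is not the authors' implementation: the class name, the hidden size of 768, the head width of 128, and the ReLU activation are assumptions, and a random tensor stands in for the [CLS] output of the pre-trained BERT encoder.

```python
import torch
from torch import nn

class ParallelBinaryHeads(nn.Module):
    """Seven independent feed-forward heads, each emitting two logits."""

    def __init__(self, hidden_size=768, num_labels=7, head_dim=128):
        super().__init__()
        # One small dense network per emotion label (sizes are assumptions).
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, head_dim),
                nn.ReLU(),
                nn.Linear(head_dim, 2),
            )
            for _ in range(num_labels)
        ])

    def forward(self, cls_embedding):
        # cls_embedding: (batch, hidden_size) -> logits: (batch, num_labels, 2)
        return torch.stack([head(cls_embedding) for head in self.heads], dim=1)

def predict_labels(logits):
    """Pick class 0 or 1 for every label: (batch, num_labels, 2) -> (batch, num_labels)."""
    return logits.argmax(dim=-1)

# Stand-in for the encoder's [CLS] vectors; batch size 2 as in Table 2.
fake_cls = torch.randn(2, 768)
logits = ParallelBinaryHeads()(fake_cls)
```

Keeping one head per label lets every emotion be decided independently, which is what allows a single tweet to receive several labels at once.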

Table 2
Hyperparameters and their values
                                  Hyperparameter         Value
                                   Learning Rate         1e-05
                                  Number of Epochs       4
                                     Batch Size          2

Figure 2: The Proposed Architecture for BERT

Figure 3: The Proposed Model


4.2. Handling the class imbalance issue
As Figure 1 makes clear, the data is imbalanced, so we first calculate the positive and
negative weights of each class as follows:

$$\text{Positive Weight} = \frac{1}{\text{no. of 1's for that class}} \times \frac{\text{total number of data points}}{\text{total number of classes}} \tag{3}$$

$$\text{Negative Weight} = \frac{1}{\text{no. of 0's for that class}} \times \frac{\text{total number of data points}}{\text{total number of classes}} \tag{4}$$

We then use these weights in our custom loss function, which calculates the cross entropy
loss for each class $c$ separately and multiplies each term by the corresponding weight:

$$\text{loss}(c) = -\left[(\text{positive weight}) \cdot y_{true} \log(y_{pred}) + (\text{negative weight}) \cdot (1 - y_{true}) \log(1 - y_{pred})\right] \tag{5}$$

The losses of all seven classes are then added to form the final loss for a data point:

$$\text{Total Loss} = \sum_{c=1}^{7} \text{loss}(c) \tag{6}$$
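
The weighting scheme of Eqs. (3) to (6) can be sketched in plain Python as follows. The function names and the clamping constant `eps` are assumptions made for this example, and the loss is written with the usual leading minus sign of cross entropy so that lower values are better.

```python
import math

def class_weights(label_matrix):
    """Eqs. (3)-(4): per-class positive and negative weights.

    label_matrix is a list of 0/1 rows, one row per data point.
    Assumes every class has at least one 1 and at least one 0.
    """
    n_points = len(label_matrix)
    n_classes = len(label_matrix[0])
    pos_w, neg_w = [], []
    for c in range(n_classes):
        ones = sum(row[c] for row in label_matrix)
        zeros = n_points - ones
        pos_w.append(n_points / (ones * n_classes))
        neg_w.append(n_points / (zeros * n_classes))
    return pos_w, neg_w

def weighted_loss(y_true, y_pred, pos_w, neg_w, eps=1e-12):
    """Eqs. (5)-(6): weighted binary cross entropy summed over all classes."""
    total = 0.0
    for t, p, wp, wn in zip(y_true, y_pred, pos_w, neg_w):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(wp * t * math.log(p) + wn * (1 - t) * math.log(1 - p))
    return total

# Toy example with two classes: class 0 is rare, class 1 is frequent.
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
pos_w, neg_w = class_weights(labels)
```

In this toy example the rare class receives the larger positive weight, so missing one of its few positive examples costs more than missing a positive of the frequent class.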
5. Evaluation and Results
Because this is multi-label classification, the formulas for accuracy, precision and
recall change; the modified formulas from [10] are as follows. Let $D$ be the dataset, $H$ the
model, $Y$ the real labels, and $Z = H(D)$ the predicted labels, with $Y_i$ and $Z_i$ the label
sets of the $i$-th example:

$$\text{Accuracy}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \tag{7}$$

$$\text{Precision}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Z_i|} \tag{8}$$

$$\text{Recall}(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i|} \tag{9}$$

$$F1(H, D) = \frac{2 \cdot \text{Precision}(H, D) \cdot \text{Recall}(H, D)}{\text{Precision}(H, D) + \text{Recall}(H, D)} \tag{10}$$
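
As a worked illustration of Eqs. (7) to (10), the following sketch computes the example-based metrics on label sets. The function name and the toy label sets are assumptions; a ratio with an empty denominator is counted as 0 here, a case the formulas themselves leave undefined.

```python
def example_based_metrics(true_sets, pred_sets):
    """Example-based accuracy, precision, recall and F1 (Eqs. 7-10)."""
    def ratio(num, den):
        # Assumption: an empty denominator contributes 0 to the average.
        return num / den if den else 0.0

    pairs = list(zip(true_sets, pred_sets))
    n = len(pairs)
    acc = sum(ratio(len(y & z), len(y | z)) for y, z in pairs) / n
    prec = sum(ratio(len(y & z), len(z)) for y, z in pairs) / n
    rec = sum(ratio(len(y & z), len(y)) for y, z in pairs) / n
    f1 = ratio(2 * prec * rec, prec + rec)
    return acc, prec, rec, f1

# Two toy examples: true label sets Y_i and predicted label sets Z_i.
Y = [{"anger", "sadness"}, {"neutral"}]
Z = [{"anger"}, {"neutral", "fear"}]
acc, prec, rec, f1 = example_based_metrics(Y, Z)
```

For the first example the intersection has one label and the union two, so it contributes 1/2 to accuracy, 1/1 to precision and 1/2 to recall; the second contributes 1/2, 1/2 and 1/1 respectively.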


The performance of each model is evaluated using the metrics above. Table 3 lists
the accuracy, precision, recall, and F1-measure of the BERT variants, and Table 4 lists the
same metrics for the classifiers trained on TF-IDF features. With TF-IDF feature extraction,
linear SVC performed best for emotion detection (F1 of 0.87), followed by logistic regression
(0.86), with Naive Bayes last (0.80). Among the BERT variants, UrduHack outperforms multilingual
BERT on accuracy, precision, and F1-measure, although multilingual BERT attains the higher recall.

Table 3
Performance Evaluation using BERT Variants
                BERT Variants        Accuracy      Precision     Recall    F1-Measure
                  Urdu-hack              0.394        0.453        0.758         0.567
               Multilingual-BERT         0.340        0.358       0.8559        0.5048


Table 4
Performance Evaluation using TF-IDF feature extraction
                  Classifiers        Accuracy        Precision    Recall   F1-Measure
                 Naive-Bayes              0.68         0.98        0.68          0.80
                  LinearSVC               0.79         0.91        0.84          0.87
              Logistic Regression         0.77         0.96        0.79          0.86
6. Conclusion and Future Work
The results demonstrate that the TF-IDF feature extraction models work better than
the BERT models on this task. A likely reason is that, in emotion detection, keywords matter
more than context, as each emotion has its own set of keywords that help a lot with classification.
This paper deals only with emotion detection in text data but, as explained in Section 2,
a multi-modal approach to emotion detection is very effective, because features other
than text, such as audio pitch and facial expression, convey an emotion more clearly. For
image data, an expression detection model could help identify different emotions, since
each emotion has its own set of expressions. Sometimes the same sentence has different
meanings with different expressions, such as "he is very intelligent." Said with a pleasant
expression, this sentence falls in the category of happiness and praise, but with an expression
of sarcasm it falls under the category of jealousy. The same is true for audio, where pitch can
help distinguish between anger, excitement, and a lazy tone.


References
 [1] J. Oliver, B. García-Zapirain, Affective computing and education, in: INTED2017
     Proceedings, IATED, 2017, pp. 1334–1338.
 [2] J. M. Garcia-Garcia, V. M. Penichet, M. D. Lozano, Emotion detection: a technology review,
     in: Proceedings of the XVIII international conference on human computer interaction,
     2017, pp. 1–8.
 [3] S. N. Shivhare, S. Khethawat, Emotion detection from text, arXiv preprint arXiv:1205.4944
     (2012).
 [4] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh, Overview
     of EmoThreat: Emotions and Threat Detection in Urdu at FIRE 2022, in: CEUR Workshop
     Proceedings, 2022.
 [5] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh,
     EmoThreat@FIRE2022: Shared Track on Emotions and Threat Detection in Urdu, in: Forum for
     Information Retrieval Evaluation, FIRE 2022, Association for Computing Machinery, New
     York, NY, USA, 2022.
 [6] N. Ashraf, L. Khan, S. Butt, H.-T. Chang, G. Sidorov, A. Gelbukh, Multi-label emotion
     classification of urdu tweets, PeerJ Computer Science 8 (2022) e896.
 [7] L. Khan, A. Amjad, N. Ashraf, H.-T. Chang, A. Gelbukh, Urdu sentiment analysis with
     deep learning methods, IEEE Access 9 (2021) 97803–97812.
 [8] I. Ameer, N. Ashraf, G. Sidorov, H. Gómez Adorno, Multi-label emotion classification using
     content-based features in twitter, Computación y Sistemas 24 (2020) 1159–1164.
 [9] L. Khan, A. Amjad, N. Ashraf, H.-T. Chang, Multi-class sentiment analysis of urdu text
     using multilingual bert, Scientific Reports 12 (2022) 1–17.
[10] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, International Journal of
     Data Warehousing and Mining (IJDWM) 3 (2007) 1–13.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] S. Kalra, Y. Bansal, Y. Sharma, Detection of abusive records by analyzing the tweets in
     urdu language exploring transformer based models (2021).
[13] S. Kalra, M. Agrawal, Y. Sharma, Detection of threat records by analyzing the tweets in
     urdu language exploring deep learning transformer-based models (2021).
[14] S. Kalra, P. Verma, Y. Sharma, G. S. Chauhan, Ensembling of various transformer
     based models for the fake news detection task in the urdu language (2021).
[15] S. Kalra, K. N. Inani, Y. Sharma, G. S. Chauhan, Applying transfer learning using
     bert-based models for hate speech detection (2020).
[16] F. A. Acheampong, C. Wenyu, H. Nunoo-Mensah, Text-based emotion detection: Advances,
     challenges, and opportunities, Engineering Reports 2 (2020) e12189.
[17] J. Xu, An extended one-versus-rest support vector machine for multi-label classification,
     Neurocomputing 74 (2011) 3114–3124.