MINDS: A Multi-label Emotion and Sentiment
Classification Dataset Related to COVID-19
Anjali Bhardwaj, Muhammad Abulaish
Department of Computer Science, South Asian University, New Delhi, India


                                      Abstract
                                      During times of crisis, such as the COVID-19 pandemic, there is a sudden increase in information exchange on social media, where people gather to share their feelings, experiences, knowledge, and ideas with one another. Twitter has emerged as the most authentic, admired, and widely used social media platform for users to express their sentiments, opinions, or emotions, and understanding the emotions expressed in text facilitates the development of empathetic machines. In this paper, we collect and annotate a large corpus of textual data that can be used to train classification models for detecting emotions and sentiments in user-generated content. We analyze user sentiments and emotions expressed in tweets during the third wave of the pandemic, driven by the Omicron sub-variant. Our curated dataset, MINDS, consists of 227,229 tweet instances annotated using a multi-model setup in order to quantify all aspects of model uncertainty. Each instance in the dataset is classified according to three sentiment classes (positive, negative, and neutral) and five emotion classes (sadness, joy, fear, disgust, and anger). The classification threshold was determined using the benchmark dataset of the SemEval-2018 E-c subtask. The MINDS dataset is publicly available at http://www.abulaish.com/ldsa/dataset for research purposes.

                                      Keywords
                                      Emotion dataset, Sentiment dataset, Multi-label classification, NLU, COVID-19, IBM Watson




1. Introduction
Human emotion is one of the fundamental components of cognitive processes. Without emotions, humans would be lifeless stones; emotions are what keep us alive and define who we are. Emotions are frequently conveyed through physiological changes, as they correspond to psychological states that arise spontaneously and without conscious effort. An emotion is a complex feeling caused by internal or external events concerning an object, such as a person, a topic, an event, or an item, and results from thought processing, e.g., “my family thinks it’s a good idea for me to continue my education overseas, though they’ll miss me.” Emotions are the most significant aspect of human understanding and have a positive impact on our physical health, work, learning, and economic and social behaviors. In addition, whenever a choice must be made, humans seek the opinions of others.
   During events such as pandemics, unrest, etc., there is a torrent of emotions. As people faced
the unprecedented challenge of COVID-19, their emotional responses became overwhelming

WNLPe-Health 2022: The First Workshop on Context-aware NLP in eHealth, December 15-18, 2022, New Delhi, India
bhardwaj.anjali200594@gmail.com (A. Bhardwaj); abulaish@sau.ac.in (M. Abulaish)
ORCID: 0009-0008-4063-4661 (A. Bhardwaj); 0000-0003-3387-4743 (M. Abulaish)
                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
and erratic. Since November 2019, the global community has been grappling with this crisis, which has been disastrous for humanity. Due to social media, individuals were suddenly able to share their experiences. In order to be together safely in the future, one had to stay apart in the interim.
   The classification of emotions from textual data is more complex than sentiment classification.
Although emotions and sentiments are often used synonymously, in sentiment analysis they represent two distinct concepts [1]. Emotion and sentiment classification is a hot topic in AI with numerous applications in the current technological era. Emotion analysis is used in a variety of AI-based applications, including human-machine interfaces, cognitive psychology, intelligent devices, and automated identification. Thus, research on this topic enables the development of empathetic machines [2].
   Numerous social media platforms, including Instagram, Facebook, etc., have become indispensable for communicating with friends and family. In online social media, emotions are typically expressed through emoticons and texts that are unstructured, informal, and massive in size. Due to this immense size and unstructured nature, it is difficult to extract meaningful emotional information from such a repository of data. Moreover, different types of emotions, such as fear, anger, sadness, disgust, joy, and surprise, are not mutually exclusive; rather, they are interconnected, such that one emotion causes or triggers other emotion(s) [3]. Therefore, a document can be tagged with multiple emotions, making emotion detection a multi-label classification problem. Deep learning algorithms can predict multiple mutually non-exclusive classes or labels [4], but they require an enormous quantity of labeled data. In order to train and evaluate algorithms for multi-label classification, we curate and label a large dataset.
   Our contributions are summarized as follows:
    • Curation of a large-scale multi-labeled dataset, MINDS (Multi-label emotIoN and sentiment classification DataSet), of textual data containing 227,229 instances. The tweets were collected using the top-six trending hashtags – #Omicron, #OmicronVariant, #OmicronVarient, #OmicronInIndia, #OmicronVirus, and #Omicronvirus india – from December 17, 2021 to February 4, 2022 using the Tweepy library for the Twitter API.
    • Classification of each tweet into the most appropriate emotion categories (annotation labels), drawn from Ekman’s universal emotions [5] (joy, sadness, fear, anger, surprise, and disgust), excluding surprise. The surprise emotion is dropped due to its ambiguous appearance, indicating both positive and negative sentiments. The sentiment categories comprise positive, negative, and neutral, with numeric scores.
    • Annotation of the dataset using a multi-model setup, namely the IBM Watson NLU API, the Komprehend Text Analysis API, and the Text2emotion Python package, to quantify all aspects of model uncertainty.
    • Determination of a classification threshold based on the SemEval-2018 E-c subtask dataset. After determining the threshold value, the micro-average and macro-average F1-scores for each emotion were analyzed. We obtained a Jaccard index of 0.52 and micro-average and macro-average F1-score values of 63% and 61%, respectively.
   The remaining sections are organized as follows. Section 2 provides a brief overview of related datasets for multi-label emotion and sentiment classification on textual data. Section 3 provides background information on our work. Section 4 describes the functional steps involved in the creation of our dataset. Finally, Section 5 concludes the paper.


2. Related Datasets
Multi-label emotion and sentiment classification has increasingly become an active research topic, and many researchers have investigated multi-label emotion classification in textual data [3, 6, 7, 4]. The authors of [3] used multiple Convolutional Neural Networks (CNNs) along with self-attention and performed multi-label emotion classification on Twitter data. Similarly, the authors of [7] used transfer learning to enhance multi-label emotion classification performance on Twitter data.
   Most datasets are hand-annotated; some of them are SemEval-2018 Task 1: Affect in Tweets (AIT) [8], GoEmotions [9], EMOBANK [10], and Emotion Intensities (EmoInt) [11]. The EmoInt dataset [11] was created for detecting the emotional intensities of four emotions in tweets.
   SemEval-2018 [8] is a multilingual dataset used to train and test supervised machine learning algorithms. It contains 10,690 instances annotated with 11 emotion categories. The shared task evaluates automatic systems for E-c (multi-label Emotion classification), EI-oc (Emotion Intensity ordinal classification), EI-reg (Emotion Intensity regression), V-oc (Valence ordinal classification), and V-reg (Valence regression) in three languages – English, Spanish, and Arabic. Moreover, the authors detected ordinal classes of emotion intensity (slightly sad, very angry, etc.) as well as ordinal classes of valence (or sentiment).
   GoEmotions [9] consists of 58k instances collected from English Reddit comments and categorized into 27 emotion categories plus a neutral class. The authors used principal preserved component analysis and conducted transfer learning experiments with existing benchmarks to show that their dataset generalizes well to other domains. EMOBANK [10] is another dataset containing over 10,000 sentences labeled according to the VAD (Valence-Arousal-Dominance) emotion representation model. The authors collected data from various sources, including essays, blogs, fiction, travel guides, letters, newspapers, and news headlines, covering both reader and writer perspectives. In addition, a subset of the dataset was also categorically labeled using Ekman’s emotion model, making it desirable for dual representational designs.


3. Background
This section briefly discusses social media platforms, mainly Twitter, and their content. Moreover, we discuss well-known APIs that process textual data to deliver the evoked emotions and sentiments.

3.1. Social Media
Social media platforms have become modern communication tools with a large user base worldwide. Among these platforms, Twitter is the most authentic, admired, and extensively used site for users to share their information, thoughts, ideas, and opinions/emotions regarding social events, products, services, and political and marketing campaigns. Due to its constantly updated repository of opinions, banter, facts, and other minutiae, Twitter has garnered much interest from decision-makers, business leaders, and politicians, driven by the inherent desire to know people’s perspectives and opinions regarding specific topics. Therefore, we chose Twitter data to analyze the emotions expressed in tweets. Collecting data using an API is the most popular and recommended practice, and almost every social media service provider offers an API that, along with several libraries or packages, assists users in various data-extraction activities.

3.2. Well-known APIs
Over the last few years, the number of APIs (Application Programming Interfaces) has grown, removing the barriers to using third-party text analytics functionality rather than building one’s own model. APIs play a significant role in today’s digital age as they encourage innovation, offer flexible experiences, save costs, and are easily available. There are dozens of APIs that are essential for digital transformation and for the creation and development of innovative models. The majority of existing datasets are annotated manually and tend to be small. Moreover, no API-annotated multi-label emotion and sentiment classification dataset is available. Therefore, we annotated our dataset MINDS using a multi-model setup (i.e., APIs and a Python package), which is discussed below:

3.2.1. IBM Watson NLU
IBM developed the Watson NLU (Natural Language Understanding)1 API, which analyzes data using text analytics to extract categories, classifications, concepts, keywords, relations, entities, emotion, sentiment, semantic roles, and syntax. It uses deep learning to extract meaning and metadata from unstructured textual data. Moreover, NLU supports 25 languages, depending on which features one analyzes. The sentiment feature analyzes the sentiment towards specific target phrases as well as the overall document sentiment, while the emotion feature analyzes the emotion expressed by specific target phrases or by the document itself.
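A minimal sketch of calling the service using the ibm-watson Python SDK is given below; the API key, service URL, version date, and sample text are placeholders.

import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, EmotionOptions, SentimentOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials obtained from an IBM Cloud service instance
authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2022-04-07", authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

# Request both document-level sentiment and emotion scores for one tweet
response = nlu.analyze(
    text="so proud of the scientists who continue to keep norwich safe #omicron",
    features=Features(sentiment=SentimentOptions(), emotion=EmotionOptions()),
    language="en",
).get_result()

print(json.dumps(response["sentiment"]["document"], indent=2))           # label and score in [-1, 1]
print(json.dumps(response["emotion"]["document"]["emotion"], indent=2))  # five emotion scores in [0, 1]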

3.2.2. Komprehend Text Analysis
This is the most comprehensive document classification and NLP API2 for software developers. Its NLP models are trained on more than a billion documents and provide state-of-the-art performance on the most common NLP use cases, such as sentiment analysis and emotion detection. Moreover, the API supports 15 languages. Its main advantages are that it works on diverse data, is accurate, supports flexible deployment, and maintains privacy. The sentiment classifier uses Long Short-Term Memory (LSTM) networks, which divide a text blob’s sentiment into positive and negative; LSTMs represent sentences as a series of context-based forget-remember decisions. The model was trained separately on social media and news data to handle informal and formal language, and it is additionally trained on specific datasets for different clients. Sometimes the sentiment classes (positive, negative, and neutral) are not enough to understand the aspects associated with the underlying tone of a sentence. As a result, the emotion analysis classifier of this API is trained on a proprietary dataset. It can determine whether the emotion conveyed through the text is sadness, happiness, fear, anger, boredom, or excitement. Deep learning-based algorithms are used to capture features from the text data, and these features are employed to categorize the emotions conveyed by the data. The emotion classifier was trained using CNNs on a tagged dataset.

   1
       https://cloud.ibm.com/apidocs/natural-language-understanding
   2
       https://komprehend.io/
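Komprehend is accessed as a REST service; a minimal sketch using the requests library is shown below. The endpoint paths, parameter names, and response layout are assumptions for illustration only and should be checked against the Komprehend documentation.

import requests

API_KEY = "YOUR_KOMPREHEND_API_KEY"       # placeholder credential
BASE_URL = "https://apis.komprehend.io"   # assumed base URL

def komprehend_scores(text):
    """Request sentiment and emotion scores for a single tweet (hypothetical endpoints)."""
    payload = {"api_key": API_KEY, "text": text}
    sentiment = requests.post(f"{BASE_URL}/sentiment", data=payload, timeout=30).json()
    emotion = requests.post(f"{BASE_URL}/emotion", data=payload, timeout=30).json()
    return {"sentiment": sentiment, "emotion": emotion}

print(komprehend_scores("our #omicron wave hasnt even hit yet"))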

3.2.3. Text2emotion
This is a Python package3 that extracts emotions from text and supports five emotion categories: sad, happy, angry, fear, and surprise. Details of the package are as follows (a minimal usage sketch is given after the list):

    • Text pre-processing: The primary goal is to clean the data and make the content more suitable for emotion analysis. The unwanted textual parts are removed, and NLP techniques are used to obtain well-pre-processed text.
    • Emotion investigation: The pre-processed text is analyzed to find words that express emotions or feelings, together with their emotion categories, and the counts of emotions relevant to these words are stored.
    • Emotion analysis: The output is a dictionary in which the emotion categories and scores are represented as keys and values, respectively. A larger score for a particular emotion category indicates that the text belongs to that category.
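A minimal usage sketch is shown below; the sample sentence and the printed scores are illustrative.

import text2emotion as te

# Returns a dictionary of scores in [0, 1] for the categories
# Happy, Angry, Surprise, Sad, and Fear.
scores = te.get_emotion("so proud of the scientists who continue to keep norwich safe")
print(scores)  # e.g., {'Happy': 0.5, 'Angry': 0.0, 'Surprise': 0.25, 'Sad': 0.0, 'Fear': 0.25}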


4. Dataset Creation
This section describes the functional steps of our dataset creation and analysis, including data extraction, data pre-processing, dataset annotation, evaluation, and observations for emotion and sentiment classification. Fig. 1 shows the framework of our dataset annotation approach.

[Figure 1 depicts the pipeline: tweets are crawled from Twitter via the REST API, pre-processed, and passed to a multi-model setup (IBM Watson NLU API, Komprehend API, and the Text2Emotion Python package) that produces sentiment labels (positive, neutral, negative) and emotion labels.]
Figure 1: A framework of our dataset annotation approach.


   3
       https://pypi.org/project/Text2emotion/
Table 1
Details of the curated corpus of tweets
                                Hashtag               # of instances
                                #Omicron                  194,942
                                #OmicronVariant            28,315
                                #OmicronVarient             8,079
                                #OmicronInIndia             5,562
                                #OmicronVirus               4,161
                                #Omicronvirus india           360
                                Total tweets              241,419


4.1. Data Extraction
We created a data crawler using Python 3.6 and the Tweepy 4.6.0 library for the Twitter API to collect COVID-19-related posts, mainly those about Omicron (the sub-variant of the COVID-19 virus responsible for the third wave of infections), in the English language, and stored them in a local repository. To find valuable insights from public reactions and posts shared on Twitter and to model public emotions, we collected tweets from December 17, 2021 to February 4, 2022. The tweets were collected using the six top-trending hashtags #Omicron, #OmicronVariant, #OmicronVarient, #OmicronInIndia, #OmicronVirus, and #Omicronvirus india. Table 1 shows the details of the Twitter Omicron corpus, containing 241,419 instances of raw tweets collected when the Omicron wave had just begun. Through the Twitter API, we obtained various tweet-related attributes, such as tweet id, text, author screen name, author id, created at, source, user verified, count, language, favorite count, username, user id, location, etc.
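A minimal sketch of such a crawler is given below; the credentials are placeholders, the per-hashtag limit is illustrative, and only a subset of the tweet attributes listed above is stored.

import tweepy

# Placeholder credentials from the Twitter developer portal
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtags = ["#Omicron", "#OmicronVariant", "#OmicronVarient",
            "#OmicronInIndia", "#OmicronVirus", "#Omicronvirus india"]

collected = []
for tag in hashtags:
    # Standard search endpoint; '-filter:retweets' keeps only original tweets
    for tweet in tweepy.Cursor(api.search_tweets,
                               q=f"{tag} -filter:retweets",
                               lang="en",
                               tweet_mode="extended").items(1000):
        collected.append({"tweet_id": tweet.id,
                          "text": tweet.full_text,
                          "created_at": tweet.created_at,
                          "author": tweet.user.screen_name})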

4.2. Data Pre-processing
Pre-processing is an essential step before analyzing any dataset, as it deals with outliers and noise and yields a proper representation of the data. Corpora curated from Twitter, such as the one presented in Table 1, are often noisy and contain unwanted or irrelevant constituents; eliminating such undesirable information is vital for better performance and efficiency of the system. Since tweets are generated in an uncontrolled manner and are generally unstructured and informal, basic pre-processing steps are applied to transform the input data into a ready-to-analyze format. The following steps were performed as part of the pre-processing stage (a minimal code sketch is given after the list):
   1. The uppercase letters were converted into lower-case letters to normalize the input text
      data.
   2. HTML tags, URLs, extra white spaces, hexa-characters (UTF-8), double quotes, and
      duplicate tweets were removed to promote ease of further processing.
   3. The stop-words, punctuation, and emoticons are not removed because they play a signifi-
      cant role in understanding the context of tweets.
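A minimal sketch of these cleaning steps is shown below; the exact rules used in the MINDS pipeline are not specified here, so the regular expressions are illustrative.

import re

def preprocess(tweets):
    """Apply steps 1 and 2 above; stop-words, punctuation, and emoticons are kept (step 3)."""
    cleaned, seen = [], set()
    for text in tweets:
        text = text.lower()                                   # 1. lower-casing
        text = re.sub(r"<[^>]+>", " ", text)                  # 2. HTML tags
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)    #    URLs
        text = re.sub(r"\\x[0-9a-fA-F]{2}", " ", text)        #    stray hex escape sequences
        text = text.replace('"', " ")                         #    double quotes
        text = re.sub(r"\s+", " ", text).strip()              #    extra white space
        if text not in seen:                                  #    duplicate tweets
            seen.add(text)
            cleaned.append(text)
    return cleaned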
Table 2
Sample tweets from our proposed dataset
                                                    Tweets
  𝑡1     not enough people through the doors to cover wages, much less rent and food costs and our
         #omicron wave hasnt even hit yet.
  𝑡2     #omicron is not mildno one against the businessmen/women how bravely the government stands
         against them? or they reciprocally the same entity to another? “a” is not “all the time, sometime”
         could be “a” in one point, and “a” acted on behalf of.
  𝑡3     so proud of the scientists and ex-colleagues who continue to keep norwich safe i miss working
         with you all - you’re so fantastic at what you do #norwich #omicron


4.3. Dataset Annotation
As presented in Table 1, 241,419 tweets were scraped from Twitter, after which pre-processing resulted in 227,229 instances of filtered tweets. Since the Twitter data is highly emotion-rich, a multi-model setup is employed to label the pre-processed tweets and analyze the different user emotions within them. The main advantage of such a setup is that it diminishes the effect of model errors and of the bias of each individual model. Additionally, a multi-model setup promotes reliability, consistency, and better labeling by performing automatic text annotation, which allows large quantities of data to be annotated in less time.
   Each model independently predicts scores that identify the evoked emotions and sentiments. Based on these scores, each tweet is classified into the most appropriate emotion categories, drawn from Ekman’s universal emotions [5] (joy, sadness, anger, fear, disgust, and surprise), excluding surprise. The surprise emotion is dropped due to its ambiguous appearance, indicating both positive and negative sentiments. The sentiment categories comprise positive, negative, and neutral, with scores. The Watson API, the Komprehend API, and the Text2emotion Python package are employed in the multi-model setup for annotation. The required scores are obtained in the following manner:

       • Watson predicts the sentiment label (positive, negative, or neutral) and the emotion labels (anger, fear, joy, disgust, and sadness), with corresponding scores for both the sentiment (ranging from -1 to 1) and the emotions (ranging from 0 to 1), as shown in Table 3 (based on the sample tweets given in Table 2).
       • Komprehend labels data with the same three sentiments, i.e., positive, negative, or neutral, and with emotions amongst happy, sad, angry, fearful, excited, or bored; both sets of scores range from 0 to 1.
       • Text2emotion, on the other hand, scores the input only for the emotions happy, angry, sad, surprise, and fear, with scores ranging from 0 to 1.

   One difference between the score assignment of Watson and Komprehend is that the former presents a single score labeling the sentiment of each tweet, whereas Komprehend returns individual scores corresponding to each sentiment for the input data. However, since the ultimate labels are the same for both, those from Watson are considered as the final values for sentiment classification. Note that the four emotions sadness, joy, fear, and anger are covered by each of the above models, whereas the scores for the emotion disgust are taken directly from
Table 3
Sample annotations obtained using the IBM Watson NLU
      Tweets         Sentiment                                Emotion
                  Label    Score       Sadness        Joy       Fear     Disgust     Anger
         𝑡1      negative −0.97301      0.24589    0.093889   0.210093   0.100609   0.186584
         𝑡2      neutral      0        0.309616    0.080319   0.047491   0.094984    0.0778
         𝑡3      positive 0.962837     0.083365     0.7824    0.013969   0.007002   0.007283


Watson. After obtaining the real-valued scores for each tweet using the aforementioned models, these scores are converted into binary labels using the method discussed in Section 4.3.2. The method of arriving at the threshold for the final annotation is discussed below.

4.3.1. Classification Threshold
In order to obtain the classification threshold, the SemEval-2018 [8] E-c subtask, a benchmark dataset, is employed. SemEval-2018 contains 10,983 instances split into three subsets: a training set, a validation set, and a testing set, with 6,838, 886, and 3,259 instances, respectively. SemEval-2018 is labeled using the three models mentioned above, which provide real-valued scores for each tweet. These real-valued scores are transformed into binary labels using a threshold value. This threshold value, taken to be the InterQuartile Range (IQR), helps in making the classification for proper inference of tweets, while being a statistic that is robust to outliers and represents the amount of spread in the data better than the range.
   Thus, the real-valued scores of the aforementioned multi-model setup corresponding to the ground truth are separated for the two classes (i.e., 0 and 1), and the IQR is calculated separately for each. The IQR corresponding to the ground-truth value 1 is chosen as the threshold. Finally, these values for the emotions sadness, joy, fear, disgust, and anger are 0.26, 0.20, 0.25, 0.06, and 0.12, respectively.
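A minimal sketch of this threshold computation is given below; the variable names and data layout are assumptions.

import numpy as np

def iqr_threshold(scores, gold_labels):
    """Compute the IQR of model scores over instances whose ground-truth label is 1;
    this value is used as the per-emotion classification threshold."""
    scores = np.asarray(scores, dtype=float)
    gold_labels = np.asarray(gold_labels)
    positive_scores = scores[gold_labels == 1]
    q1, q3 = np.percentile(positive_scores, [25, 75])
    return q3 - q1

# Hypothetical usage over the SemEval-2018 E-c scores:
# thresholds = {emo: iqr_threshold(model_scores[emo], gold[emo])
#               for emo in ["sadness", "joy", "fear", "disgust", "anger"]}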

4.3.2. Binary Annotation
The SemEval-2018 dataset is then classified based on the IQR threshold into the presence (score > threshold) and absence (score ≤ threshold) of each emotion label.
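A minimal sketch of this binarization step is shown below, using the threshold values from Section 4.3.1; the example scores are the Watson scores of tweet t3 from Table 3.

# Per-emotion thresholds determined in Section 4.3.1
THRESHOLDS = {"sadness": 0.26, "joy": 0.20, "fear": 0.25, "disgust": 0.06, "anger": 0.12}

def binarize(emotion_scores):
    """Map real-valued emotion scores to multi-label binary annotations."""
    return {emo: int(score > THRESHOLDS[emo]) for emo, score in emotion_scores.items()}

print(binarize({"sadness": 0.083365, "joy": 0.7824, "fear": 0.013969,
                "disgust": 0.007002, "anger": 0.007283}))
# -> {'sadness': 0, 'joy': 1, 'fear': 0, 'disgust': 0, 'anger': 0}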