A Machine Learning Model for Automatic Emotion Detection
from Speech
Nataliia Kholodna, Victoria Vysotska, Solomiia Albota
Lviv Polytechnic National University, S. Bandera street, 12, Lviv, 79013, Ukraine

                Abstract
                The paper aims to create a machine-learning model for automatic emotion detection from
                speech. The developed model is to be used in the system of monitoring public emotions. A
                short analysis of other research papers on the process of designing machine-learning models
                for automatic emotion detection from speech has been provided in the article. Classical and
                deep machine learning methods and algorithms and some features of the initial dataset have
                been considered. The DailyDialog dataset and its applicability for training the classifier have
                been examined. Moreover, the development and selection of the optimal model for automatic emotion
                detection from speech has been described. The research results on the influence of factors such
                as the number of records of each category in the training data set, text pre-processing, methods
                of vectorisation or word embedding, the choice of machine learning method for text
                classification, parameters and architecture of the model have been given. The examples of
                using the machine-learning model to analyse the collected real-world data have been
                demonstrated in the last section. The process of correlation of the change in the number of
                records that belong to different emotional categories with specific events in the life of society,
                the population of a geographical region or a community has been shown. Finally, the
                limitations of the created machine-learning model and some possible steps to refine the system
                for detecting emotions have been considered.

                Keywords
                Machine learning, emotion detection, deep learning, machine learning model, text pre-
                processing, neural network, learning model, deep learning model, word embedding, created
                machine learning model, logistic regression, classical machine learning method, social
                network, temporal convolutional neural network, social network Twitter, naive Bayesian
                classifier, pre-processing, information resource, machine learning method, convolutional
                neural network, classification method, learning method, vectorisation method, text
                classification, stop word, artificial neural network, adaptive boosting algorithm

1. Introduction

    Emotions play an essential role in our daily lives and affect our social interaction, behaviour,
relationships with other people, and even how we make decisions that are important to us.
    Text is an essential source for identifying emotions. Such content may contain information about
the person’s psychological state and reflect the feelings experienced by society at a particular time.
Analysis of text that is rich in emotions can be used in many areas, such as:
        Early warning systems for government, health or emergency services;
        Assessment of a person’s psychological state from their activity in a social network;
        Support for critical business decisions through analysis of feedback on a product
or service;
        Analysis of news and technical literature.

MoMLeT+DS 2021: 3rd International Workshop on Modern Machine Learning Technologies and Data Science, June 5, 2021, Lviv-Shatsk,
Ukraine
EMAIL nataliia.kholodna.sa.2018@lpnu.ua (N. Kholodna); Victoria.A.Vysotska@lpnu.ua (V. Vysotska); Solomiia.M.Albota@lpnu.ua (S.
Albota)
ORCID: 0000-0002-6908-2900 (N. Kholodna); 0000-0001-6417-3689 (V. Vysotska); 0000-0002-0845-5304 (S. Albota)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
   The research object is the problem of creating an adequate model of emotion recognition in the text
and the study of the influence of various factors and parameters on the quality of classification.
   The work aims to create a machine-learning model for automatic emotion detection from text. The
developed model can be further integrated into the system of monitoring the emotional state of society,
the population of a geographical region or a community.

2. Related works
    Maryam Hasan, Elke Rundensteiner, Emmanuel Agu [1] developed a system of automatic emotion
detection in the flow of text publications on the social network Twitter – EmotexStream. Experiments
show that the model correctly classifies 90% of records. The authors also solve the problem of fuzzy
boundaries of emotional categories, using a dimensional model of emotions and fuzzy classification,
which indicates the probability that the record belongs to a specific emotional category.
    EmotexStream can be considered the primary analogue of the system in which the created
machine-learning model is supposed to be used. However, Emotex uses a different emotion model:
dimensional rather than discrete. In addition, Emotex developers created and used their own dataset,
automatically annotated using hashtags from the social network Twitter. Emotex, unlike the designed
system, does not identify trending words or hashtags for a given period to simplify the correlation of
changes in emotions with important events in the life of society. EmotexStream is also proposed to be
used for real-time analysis of message flows.
    Fabio Calefato, Filippo Lanubile, Nicole Novielli [2] developed EmoTxt, an open-source toolkit
for recognising emotions in text. EmoTxt supports both the recognition of emotions in text and the
training of other emotion classification models using manually annotated data.
    Rahul Venkatesh Kumar, Shahram Rahmanian, Hessa Albalooshi [3] developed an experimental
model for emotion detection in short messages and posts on social networks. Several standard
algorithms were used in the study: logistic regression, naive Bayesian classifier, CNN-BiLSTM and
CNN-LSTM with fastText vectorisation. The model using logistic regression had the best accuracy
(91.83%), followed by the CNN-LSTM model.
    Srinivasu Badugu and Matla Suhasini [4] chose a rule-based approach in their work. As a result,
the accuracy of the model was 85%. However, a disadvantage of the developed model is that it does
not reveal individual emotions: it only indicates to which group (Happy-Active, Happy-Inactive,
Unhappy-Active, Unhappy-Inactive) a specific record belongs.
    Nafis Irtiza Tripto and Mohammed Eunus Ali [5] suggested the use of deep learning to determine
sentiment and emotion in comments written in Bengali. The authors point out that LSTM has better
accuracy, but CNN is much faster. Among the word2vec variants, Skip-Gram has the best accuracy
estimates.
    Luyao Ma, Long Zhang, Wei Ye, Wenhui Hu [6] proposed a deep learning architecture with a BiLSTM
neural network that uses an emotion-oriented attention network.
    Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro [7] developed a model
that combines BiLSTM layers, a self-attention layer and convolutional neural networks (CNN).
    Lu Chen, Amit Sheth, Krishnaprasad Thirunarayan, Wenbo Wang [8] created an
automatically annotated dataset from posts on the social network Twitter for their research. To classify
records according to their emotion, the researchers chose a logistic regression classifier from the
open-source library Liblinear and a naive Bayesian classifier implemented in the Weka program.

3. Material and methods
    Dataset selection. The manually annotated DailyDialog dataset [9], presented in 2017, was chosen
as the initial dataset for the study. It contains seven categories of emotions that correspond to the discrete
model of Paul Ekman [10] (anger, disgust, fear, happiness, sadness, surprise and a “no emotion”
category) and consists of 13 118 dialogues, each containing a different number of lines.
    The DailyDialog data does not contain unnecessary characters, emojis, links, etc. Its only drawback
is its imbalance: the number of records that belong to the “anger” category is 1022, “disgust” – 353,
“fear” – 174, “happiness” – 12885, “sadness” – 1150, “surprise” – 1823, “no emotion” – 85572. To
avoid bias of the model towards the largest class, it is necessary to investigate the
behaviour of the classifier after removing different numbers of records or adding records that belong to
the same categories from other datasets.
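
    The scale of the imbalance can be checked with a few lines of arithmetic. The following is a
minimal sketch that uses only the record counts quoted above; the variable names are illustrative.

```python
# Record counts per category as reported above for DailyDialog.
counts = {
    "no emotion": 85572,
    "happiness": 12885,
    "surprise": 1823,
    "sadness": 1150,
    "anger": 1022,
    "disgust": 353,
    "fear": 174,
}

total = sum(counts.values())

# Share of each category in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the largest class reaches this accuracy,
# so plain accuracy alone says little about the minority classes.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<11}{counts[label]:>7}{share:>8.1%}")
print(f"total: {total}, majority baseline accuracy: {majority_baseline:.1%}")
```

    With these counts, the “no emotion” category alone accounts for roughly 83% of all records, which
is why per-class precision and recall matter more than overall accuracy in the experiments.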
    Data representation in vector format. Machine-learning algorithms operate on numbers, whereas the
information in the selected dataset is presented as text. Therefore, to classify these records using
machine-learning classifiers, they must be written as numerical vectors. This process is called
vectorisation or word embedding. The most popular methods of representing data in vector format are
TF-IDF, Bag-of-Words, fastText, GloVe, Word2Vec (CBOW, Skip-Gram).
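
    The two count-based representations can be illustrated with Scikit-Learn. This is a sketch on a
toy corpus (the sentences are invented and not taken from DailyDialog), using the default settings of
CountVectorizer and TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus; the sentences are illustrative only.
corpus = [
    "i am so happy today",
    "i am so angry today",
    "what a surprise",
]

# Bag-of-Words: each column counts how often a vocabulary word occurs.
# With default settings, single-character tokens ("i", "a") are dropped.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: the same counts, re-weighted so that words occurring in many
# records contribute less than words specific to a few records.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(sorted(bow.vocabulary_))
print(X_bow.toarray())
print(X_tfidf.toarray().round(2))
```

    In the first record, the word “happy” receives a larger TF-IDF weight than “am”, because “am”
also occurs in the second record.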
    Classical machine learning methods. The key task for the system of automatic emotion detection
from text is classification: predicting which of the known classes of emotions a record or
publication belongs to. The classical supervised machine learning methods include the following
algorithms: logistic regression, decision tree, naive Bayesian classifier, support vector machine,
k-nearest neighbours. In addition, the random forest algorithm, which belongs to the family of ensemble
algorithms, was used to classify the records in the study of Marco Polignano, Pierpaolo Basile, Marco
de Gemmis, and Giovanni Semeraro [7]. The AdaBoost method also belongs to this family. This method
was not used in any of the reviewed studies, but several works [3, 4] used the XGBoost method,
which is also a boosting (ensemble) algorithm. Thus, for comparison, the classical and ensemble methods
of machine learning that are most often used in similar studies to create a model for automatic
emotion detection from text were chosen. To select the best classification method for its further use in
real-world data analysis, it is necessary to compare the accuracy, precision, recall, and F-measure
for all the algorithms mentioned above.
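
    The four indicators can be computed with Scikit-Learn as in the following sketch. The labels are
hypothetical, and the weighted averaging shown here is the variant reported later in the experiments.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for a three-class toy problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" computes each score per class and averages the results,
# weighting every class by its number of true records (its support).
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F-measure={f_measure:.3f}")
```

    Note that the support-weighted recall always equals the accuracy, so on an imbalanced dataset
both are dominated by the largest class; per-class scores remain necessary.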
    Deep learning methods. Convolutional and recurrent neural networks and the multilayer perceptron are
used to detect emotions in textual data.
         Convolutional Neural Network, CNN. A CNN consists of input and output layers, as well as
several hidden layers. Hidden CNN layers typically consist of convolutional layers, pooling layers,
fully connected layers, and normalisation layers. Convolutional layers apply a convolution operation to the
input, passing the result to the next layer. Although convolutional networks are mainly used for image
analysis, in some studies [3, 5, 7], their use achieved good accuracy in comparison with other deep
learning methods.
         Recurrent Neural Networks, RNN. Recurrent neural networks contain
recurrent connections and can therefore store information. The recurrent connections transfer data from
one step of the network to the next. The LSTM (Long Short-Term Memory) network is a type of RNN
capable of learning long-term dependencies. LSTM networks consist of
repeating elements. Each element contains four layers, and one key difference of this type of neural
network is its long short-term memory cell. The ability of LSTM networks to successfully
learn data with long-term dependencies makes them a good choice for solving problems in which both
input and output information is represented in the form of sequences of some elements (e.g., letters,
words, sentences) [11].
    Language and libraries. The most popular languages for machine learning are Python, R, Java, C++.
The advantage of Python, among other languages, for creating an automatic emotion detection from
text machine-learning model is its support for a large number of libraries for:
         Classical machine learning methods: Scikit-Learn;
         Creating artificial neural networks, using deep learning: TensorFlow, Keras, PyTorch;
         Natural language data pre-processing: NLTK, spaCy, WordNet;
         Operations with arrays and matrices: NumPy;
         Tables: pandas;
         Data visualisation: Matplotlib, seaborn, Plotly.
    The Scikit-Learn library supports data pre-processing, data dimension reduction, and machine
learning models for regression, classification, or cluster analysis. However, Scikit-Learn does not have
comprehensive support for creating deep learning models. The Keras library, which serves as a
high-level API for TensorFlow 2, was chosen to create artificial neural networks. Keras allows building
sequential models in the form of a graph, the vertices of which are layers of a specific type with a given
number of nodes. In addition, Keras allows combining the results from several separate parts
of the neural network for their further processing. Such a structure is not linear.
   TensorFlow allows importing a pre-trained machine-learning model for later use in other programs.
TensorFlow also supports low-level tensor operations using CPUs, GPUs, and tensor processing units.
   The NLTK library in this study is used for text pre-processing: tokenisation, removal of stop words,
stemming, and lemmatisation. In addition, with the help of functions from this library, one can find the
most popular n-grams and parts of speech of individual tokens, recognise named entities etc.
   Additional libraries that simplify work with natural language include Regex and emoji, which are
used to apply regular expressions and to replace emojis with their meanings, respectively.

4. Experiments

    According to the results of the analysed works, each text classification dataset has its own
combination of factors that results in the best possible classification accuracy: text formatting
(removal of punctuation characters, replacement of emoji with appropriate words, etc.), the number of
records of each category in the training dataset, text pre-processing, vectorisation or word embedding,
the machine learning model for text classification, the model’s parameters and architecture, etc.
    Therefore, when developing such systems for detecting text features, the most critical step is to
choose the best combination of the factors mentioned above.
    The optimal number of records for each category. To avoid bias of the future machine-learning
model towards the category with the largest number of records, it is essential to make sure that the
training dataset is balanced before training the model or studying the influence of other factors.
However, the initial dataset, DailyDialog, contains a different number of records of each category.
    To investigate how many records of each category is the optimal choice, three
experiments were conducted with the following initial conditions: text pre-processing: removal of stop
words and lemmatisation; vectorisation method: Bag-of-Words; classifier: logistic regression; binary
classification approach used for multi-class classification: One-vs-Rest.
    In the first experiment, with the original number of records in each category, the scores are high
(precision - 86%, recall - 96%, F-measure - 91%) for the category with the largest number of records,
while recall and F-measure are low for the other classes (e.g., category 3: precision - 67%, recall - 12%,
F-measure - 20%). In addition, the model did not correctly detect any record that belongs to category 2.
    In the third experiment (deletion of 70.5 thousand records of category 0 and 10 thousand records of
category 4), the scores mentioned above increased for certain classes. Still, the model’s accuracy was too
low (57%), so the option of deleting 60 thousand records of category 0 (accuracy - 67%) was chosen as
the base distribution of records for the following experiments.
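
    Removing records of the largest category can be sketched as random undersampling. The helper
below is hypothetical: the study does not state how the deleted records were selected, and the mapping
0 = “no emotion”, 4 = “happiness” is assumed from the DailyDialog label order.

```python
import numpy as np

def undersample(labels, n_remove, seed=0):
    """Return the indices kept after randomly dropping n_remove[c]
    records of every class c listed in n_remove."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = np.ones(len(labels), dtype=bool)
    for cls, n in n_remove.items():
        idx = np.flatnonzero(labels == cls)
        drop = rng.choice(idx, size=min(n, len(idx)), replace=False)
        keep[drop] = False
    return np.flatnonzero(keep)

# Synthetic label array with the category sizes reported for DailyDialog
# (assumed mapping: 0 = "no emotion", 4 = "happiness").
sizes = {0: 85572, 1: 1022, 2: 353, 3: 174, 4: 12885, 5: 1150, 6: 1823}
labels = np.concatenate([np.full(n, c) for c, n in sizes.items()])

# Chosen base distribution: delete 60 thousand records of category 0.
kept = undersample(labels, {0: 60_000})
remaining = np.bincount(labels[kept], minlength=7)
print(remaining)
```

    After this step, category 0 still contains 25 572 records and remains the largest class, which is
consistent with the residual bias reported in the following experiments.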
    Text pre-processing. First, it should be noted that the records in the DailyDialog dataset do not
require spell checking (because the dialogues were collected and annotated manually), lowercasing is
applied by default, and punctuation is removed.
    Methods used: logistic regression with the One-vs-Rest approach and TF-IDF as the vectorisation
method; precision, recall and F-measure are weighted averages.


Figure 1: Table of comparison of text pre-processing methods

   Thus, the removal of stop words reduces the accuracy of the model. Lemmatisation does not affect
the performance of the model.
   Vectorisation and classification. In this step, results were obtained for all possible combinations of
vectorisation methods (Bag-of-Words, TF-IDF) and classical machine learning methods for classification
(logistic regression, random forest, decision tree, multilayer perceptron, Adaptive Boosting algorithm,
naive Bayesian classifier, support vector machine, k-nearest neighbours).
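
    Such a comparison can be organised as a loop over all vectoriser and classifier pairs. The sketch
below is an illustration under stated assumptions: the toy data, the default hyperparameters and the
helper name compare are not taken from the study.

```python
from itertools import product

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

VECTORISERS = {"Bag-of-Words": CountVectorizer, "TF-IDF": TfidfVectorizer}

# Factories, so that every combination gets a fresh, unfitted estimator.
CLASSIFIERS = {
    "logistic regression": lambda: LogisticRegression(max_iter=1000),
    "random forest": lambda: RandomForestClassifier(random_state=0),
    "decision tree": lambda: DecisionTreeClassifier(random_state=0),
    "multilayer perceptron": lambda: MLPClassifier(max_iter=500, random_state=0),
    "AdaBoost": lambda: AdaBoostClassifier(random_state=0),
    "naive Bayes": MultinomialNB,
    "support vector machine": LinearSVC,
    "k-nearest neighbours": KNeighborsClassifier,
}

def compare(texts, labels, seed=0):
    """Return weighted (precision, recall, F-measure) per combination."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    scores = {}
    for (v_name, make_vec), (c_name, make_clf) in product(
        VECTORISERS.items(), CLASSIFIERS.items()
    ):
        model = make_pipeline(make_vec(), make_clf())
        model.fit(X_train, y_train)
        p, r, f, _ = precision_recall_fscore_support(
            y_test, model.predict(X_test), average="weighted", zero_division=0
        )
        scores[(v_name, c_name)] = (p, r, f)
    return scores

# Invented two-class toy data, repeated to give the split enough records.
texts = ["i am happy", "so glad and happy", "what a great day", "i feel great",
         "i am angry", "this makes me furious", "so angry right now",
         "i hate this"] * 5
labels = ([0] * 4 + [1] * 4) * 5

scores = compare(texts, labels)
for (vectoriser, classifier), (p, r, f) in sorted(scores.items()):
    print(f"{vectoriser:<13}{classifier:<24}P={p:.2f} R={r:.2f} F={f:.2f}")
```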


Figure 2: Table of comparison text vectorisation and classification methods without deleting stop
words


Figure 3: Table of comparison text vectorisation and classification methods with deleting stop words

    Therefore, the following conclusions can be drawn from the obtained tables: removal of stop words
reduces the precision, recall and F-measure by 2-3% for almost all of the classifiers (except for the
Adaptive Boosting algorithm and the method of k-nearest neighbours); TF-IDF increases precision,
recall and F-measure by 1-3% compared to the Bag-of-Words vectorisation method for almost all of
the classifiers (except the Adaptive Boosting algorithm and the k-nearest neighbour method);
classification methods with the best indicators: random forest (77% of weighted precision), logistic
regression (74%), naive Bayesian classifier (73.5%).
    Parameters of the classifiers. To study the influence of the parameters of classification methods on
the quality scores of the model, logistic regression was chosen as the classifier and Bag-of-Words as the
method of vectorisation. The main logistic regression parameters in the sklearn library are the algorithm
used for optimisation (“solver”) and the parameter C, the inverse of the regularisation strength. Changing
parameter C increased the weighted F-measure by at most 2%, but the model’s accuracy remained relatively
low. In addition, changing the optimisation algorithm hardly changed the accuracy of
classification. The effect of the class_weight parameter on the accuracy of the random forest method
was also investigated. This parameter sets the weights of the classes and is used to classify unbalanced
datasets. The value class_weight = ‘balanced_subsample’ slightly increases recall for
classes with low support (1-5%) and decreases precision (1-7%) for most classes.
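
    The parameters discussed above correspond to the following Scikit-Learn arguments. The grids of
solvers and C values are illustrative assumptions, since the exact values tried in the study are not
listed; the toy label vector only demonstrates how balanced class weights are derived.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Candidate logistic regression settings (illustrative grid).
# C is the inverse of the regularisation strength: smaller C, stronger penalty.
log_reg_grid = [
    LogisticRegression(solver=solver, C=c, max_iter=1000)
    for solver in ("lbfgs", "newton-cg", "saga")
    for c in (0.1, 1.0, 10.0)
]

# "balanced_subsample" recomputes balanced class weights on the bootstrap
# sample of every tree instead of once on the whole training set.
forest = RandomForestClassifier(class_weight="balanced_subsample", random_state=0)

# Balanced weight of a class: n_samples / (n_classes * records in the class).
y = np.array([0] * 8 + [1] * 2)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)
```

    In the toy example, the rare class receives a weight four times larger than the frequent one, which
is the mechanism behind the higher recall and lower precision observed for classes with low support.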
    Binary classification strategies for multi-class classification. Classifiers such as logistic regression,
perceptron, and support vector machine were designed for binary classification and do not natively
support classification problems with more than two classes. The One-vs-Rest and One-vs-One strategies
make it possible to apply these methods to multi-class classification. To compare the binary classification
strategies for multi-class classification, logistic regression was chosen as the classification method,
TF-IDF as the vectorisation method, and lemmatisation as the text pre-processing step.
    Analysis of the obtained results: both approaches have high recall and low precision for classes with
low support; there are no correctly identified records for the classes with the smallest support; the
One-vs-One approach slightly increases some scores (1-2%).
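
    The difference between the two strategies is easiest to see in the number of binary classifiers they
train. The sketch below uses synthetic two-dimensional points instead of text features; only the seven
classes mirror the setting of the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
n_classes = 7

# Synthetic data: 20 points per class, scattered around distinct centres.
y = np.repeat(np.arange(n_classes), 20)
X = rng.normal(size=(len(y), 2)) + 5.0 * y[:, None]

# One-vs-Rest: one binary classifier per class (class k against all others).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-One: one binary classifier per pair of classes, k * (k - 1) / 2 in total.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))
```

    For seven emotion categories, One-vs-Rest trains 7 binary models and One-vs-One trains 21, each
of the latter on the records of two classes only.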
    Conclusions on the first part of choosing the best model (classical methods of machine learning):
        The DailyDialog dataset is not suitable for classical machine learning methods and further
research, as its imbalance negatively affects the quality of the classification model;
        The final classifier trained on the DailyDialog records is biased
towards the classes with the largest support and poorly classifies the minority classes;
        The best results were obtained for the following combination: lemmatisation + TF-IDF +
random forest;
        Changing the parameters of the classifiers does not significantly change the precision, recall
and F-measure of the model;
        The accuracy of the obtained classification model is too low for its further use.
    Comparison of deep learning with classical machine learning methods. Since traditional machine
learning methods provided unsatisfactory results, a temporal convolutional
network [12-14] was chosen as the baseline method for further research. According to Diardano Raihan [12],
it is an alternative to recurrent architectures that can accept long sequences and does not suffer from
“forgetting” important information. For word embedding, a pre-trained GloVe model [15] was used, with a
dictionary of 400 thousand words and a vector dimension of 100 for each word. For training
and classification, the initial number of records in the DailyDialog dataset was kept; for
pre-processing, only lemmatisation was used. As a result, the model has the following accuracy:
training set - 86.9%, validation set - 89.6%, test set - 84.2%. The resulting model has better accuracy,
but it remains biased towards the classes with the most support.
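
    Pre-trained GloVe vectors are distributed as text files in which every line holds a word followed
by its vector components. A minimal sketch of turning such a file into an embedding matrix is given
below; the helper names and the three-dimensional toy vectors are assumptions made for illustration.

```python
import io

import numpy as np

def load_glove(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i. Row 0 is reserved
    for padding, and words missing from the pre-trained vocabulary stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# Toy three-dimensional vectors stand in for the real 100-dimensional file.
sample = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
glove = load_glove(sample)

# Hypothetical tokenizer output: word -> integer index, starting from 1.
word_index = {"happy": 1, "sad": 2, "unseenword": 3}
embedding_matrix = build_embedding_matrix(word_index, glove, dim=3)
print(embedding_matrix)
```

    The resulting matrix can then be used as the initial weights of the embedding layer of a neural
network.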
    Creating a new data set. Because popular datasets used to build machine-learning models for
automatic emotion detection from text use an incomplete list of basic emotions according to Paul
Ekman [10], records from other datasets were added in a way that preserves the original emotion
categories: Emotions dataset for NLP, Kaggle [16] and Semeval-2018 Task 1, E-c [17]. In addition, every
three consecutive entries from the categories “no emotion” and “happiness” were combined. The resulting
distribution of records by category is as follows:


Figure 4: The number of records of each category in the new dataset
   Now the temporal convolutional network shows satisfactory results for all classes:


Figure 5: Precision, recall, F-score for new balanced dataset

   The deep learning model shows 87% accuracy on the validation set, while the random forest
method shows 67% (20 percentage points lower).
   Text pre-processing. The DailyDialog and Emotions dataset for NLP, Kaggle datasets use standard
formatting: punctuation removal and lowercasing. Since the dataset Semeval-2018 Task 1, E-c
contains posts from the social network Twitter, its records are additionally cleared of mentions of other
users, links and numbers. Emoji are replaced with the appropriate words. Hashtags are not deleted.
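
    The cleaning of Twitter records described above can be sketched with regular expressions. The
patterns and the function name clean_tweet are assumptions; emoji replacement is omitted here and would
have to run before the punctuation step, which otherwise discards emoji characters.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
# Standalone numbers only, so that digits inside hashtags or words survive.
NUMBER = re.compile(r"\b\d+\b")
# Everything except word characters, whitespace and '#' counts as punctuation.
PUNCTUATION = re.compile(r"[^\w\s#]")

def clean_tweet(text):
    """Remove links, mentions, numbers and punctuation; keep hashtags."""
    text = URL.sub(" ", text)        # links first: they contain punctuation
    text = MENTION.sub(" ", text)
    text = NUMBER.sub(" ", text)
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.lower().split())

print(clean_tweet("@user Check https://t.co/abc 100 times!!! #Happy day."))
# check times #happy day
```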

Table 1
Comparison of text pre-processing methods (1 - applied, 0 - not applied)
   Stop-words   Stemming   Lemmatisation   Precision   Recall   F-score   Accuracy on the   Accuracy on the
    removal                                                               validation set       test set
       1           1             0           0.87       0.86     0.86          0.67              0.85
       1           0             1           0.88       0.85     0.86          0.85              0.85
       1           0             0           0.87       0.85     0.86          0.85              0.85
       0           1             0           0.7        0.69     0.69          0.68              0.69
       0           0             1           0.87       0.85     0.86          0.85              0.85
       0           0             0           0.89       0.87     0.87          0.87              0.86

    Therefore, in this case, the basic pre-processing stages reduce the accuracy of the model.
    Word embedding. For deep learning, either pre-trained word embedding models or models trained
on the corpus at hand can be used. In this study, the number of words in the combined dataset is not
enough to obtain satisfactory accuracy of the deep learning model with embeddings trained on it.
However, reducing the dimension of the vector (from 100 to 50) and adding to each word its sentiment
score from the AFINN dictionary and indicators such as valence, arousal and dominance from the ANEW
dictionary can significantly increase the accuracy of the model.
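
    Adding lexicon scores to a word vector amounts to concatenation. The sketch below assumes one
AFINN score and three ANEW scores per word and a zero default for words missing from a lexicon; the
numeric values are invented and are not taken from either dictionary.

```python
import numpy as np

def augment(vector, word, afinn, anew):
    """Append one AFINN score and three ANEW scores to a word vector.
    Words missing from a lexicon get zeros (an assumed default)."""
    sentiment = afinn.get(word, 0.0)
    valence, arousal, dominance = anew.get(word, (0.0, 0.0, 0.0))
    extra = np.array([sentiment, valence, arousal, dominance], dtype=vector.dtype)
    return np.concatenate([vector, extra])

# Hypothetical lexicon entries, for illustration only.
afinn = {"happy": 3.0}
anew = {"happy": (8.0, 6.5, 6.5)}

word_vector = np.ones(50, dtype="float32")   # stand-in for a 50-dimensional embedding
augmented = augment(word_vector, "happy", afinn, anew)
unknown = augment(word_vector, "unseenword", afinn, anew)
print(augmented.shape, augmented[-4:], unknown[-4:])
```

    A 50-dimensional embedding thus becomes a 54-dimensional one, and the embedding layer of the
network has to be sized accordingly.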

Table 2
Comparison of vector embedding models trained on a combined data set
   Word embedding model                             Precision   Recall   F-score   Accuracy on a    Accuracy on a
                                                                                   validation set     test set
   Word2Vec, without text pre-processing              0.55       0.59     0.53         0.61            0.62
   Word2Vec, stop-words removal and lemmatisation     0.48       0.52     0.48         0.52            0.56
   fastText, without text pre-processing              0.49       0.56     0.5          0.56            0.7
   Word2Vec + AFINN + ANEW, without text              0.69       0.69     0.68         0.69            0.67
   pre-processing

   Pre-trained models have almost the same good accuracy. The addition of semantic scores does not
affect the result.

Table 3
Comparison of pre-trained word embedding models
   Word embedding model        Precision   Recall   F-score   Accuracy on a    Accuracy on a
                                                              validation set     test set
   Stanford’s Glove              0.88       0.87     0.87         0.87            0.86
   Google’s Word2Vec             0.88       0.86     0.87         0.86            0.87
   fastText on Wikipedia         0.88       0.88     0.89         0.88            0.86
   fastText + AFINN + ANEW       0.88       0.87     0.87         0.87            0.86
   Glove + AFINN + ANEW          0.88       0.86     0.87         0.87            0.85

   Dictionary size. Methods used: fastText for word embedding, temporal convolutional
network as a classifier. According to the results in Table 4, the size of the dictionary does not
significantly affect the quality of the model.

Table 4
Comparison of the influence of the number of words in the dictionary on the quality of the model
   Number of words in      Precision   Recall   F-score   Accuracy on a    Accuracy on a
   the dictionary                                         validation set     test set
   3 000                     0.88       0.86     0.87         0.86            0.87
   5 000 – baseline          0.89       0.87     0.88         0.87            0.87
   10 000                    0.89       0.87     0.88         0.88            0.86
   15 000                    0.88       0.88     0.88         0.88            0.87
   20 000                    0.88       0.86     0.87         0.87            0.85

   Architecture of the deep learning model. The deep learning models presented in Table 5 are graphs,
the vertices of which are layers. All models, except the CNN + BiLSTM architecture, are sequential.
   The layers from which a model is built can have different numbers of nodes, node types, and activation
functions. Parameters such as the number of filters and the filter size are used for convolutional
layers. Layers such as pooling, flattening (transformation of a matrix into a single array of
values) and dropout (exclusion of neurons) are also used.

Table 5
Comparison of deep learning models
   NN architecture                 Precision   Recall   F-score   Accuracy on a    Accuracy on a
                                                                  validation set     test set
   ML perceptron (Dense layers)      0.56       0.53     0.54         0.54            0.56
   CNN 1                             0.71       0.69     0.7          0.7             0.71
   CNN 2                             0.74       0.73     0.73         0.72            0.72
   BiLSTM 2                          0.88       0.87     0.87         0.88            0.85
   BiLSTM 1                          0.89       0.89     0.89         0.88            0.86
   CNN + BiLSTM                      0.87       0.85     0.85         0.85            0.84
   RNN                               0.87       0.85     0.85         0.85            0.845
   TCN – baseline                    0.89       0.87     0.88         0.87            0.87

    The best results were obtained for deep learning models with long short-term memory layers and for
the temporal convolutional network. For further research, the artificial neural network “BiLSTM
1” was chosen because its precision, recall, F-measure and accuracy scores are, overall, higher than
those of the other models. Its architecture is given in Table 6.

Table 6
Architecture of the model of deep learning "BiLSTM 1"
                       Layer         Number of cells Activation function Dropout
                Bidirectional LSTM         128         tanh, sigmoid       0.2
                Bidirectional LSTM         256         tanh, sigmoid       0.2
                Bidirectional LSTM         128         tanh, sigmoid
                Dense (perceptron)          7             softmax
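
    The architecture in Table 6 can be sketched in Keras as follows. Several details are assumptions,
because the table does not specify them: the embedding layer and its size, the interpretation of the
“number of cells” as units per direction, dropout applied through the dropout argument of the LSTM
layers, and the optimiser and loss function.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 5000    # baseline dictionary size from Table 4
EMBED_DIM = 100      # assumed embedding dimension
NUM_CLASSES = 7      # emotion categories

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),   # sequences of word indices
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Keras LSTM layers use tanh and sigmoid activations by default.
    layers.Bidirectional(layers.LSTM(128, dropout=0.2, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(256, dropout=0.2, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Optimiser and loss are assumptions, suitable for integer class labels.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Two random sequences of 12 word indices, only to check the output shape.
batch = np.random.randint(0, VOCAB_SIZE, size=(2, 12))
probabilities = model.predict(batch, verbose=0)
print(probabilities.shape)
```

    The model returns one probability per emotion category for every input record; the category with
the highest probability is taken as the prediction.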

   Conclusions on the second part of choosing the best model (deep learning):
        Combining records from multiple datasets eliminated the imbalance of the original dataset and
the bias of the model towards the largest class;
        Best results were obtained for the following combination: pre-trained fastText word embedding
model + deep learning model with long short-term memory layers, without text pre-processing;
        The obtained model shows good results, so it can be used for further experiments on the
analysis of real-world data collected from information resources.

5. Results and discussions

   The system of monitoring public sentiment, of which the created machine-learning model is to be a
part, may use data from several information resources depending on the purpose and scope of its
application. For experiments with the collected real-world data, the social network Twitter was chosen
as the information source because the entries in this social network are short and often contain
hashtags that simplify the search by keywords.
   In addition, according to Wakamiya, S. et al. [18], social networks contain a large amount of public
data, available in real time and rich in emotional content. Therefore, such data sources are well suited
for behavioural research [19-27], both for studying the emotions of a particular person and of specific
groups. Data can be collected according to the following parameters: start and end dates (required
parameters), keywords, geographical location (specified by the author of the publication), and the
minimum number of likes. The user must also specify the maximum number of records that will be
collected for each day of the interval between the start and end dates. To get more relevant data, one
should increase the minimum number of likes. For the first experiment, the following parameters were
specified: start date - 1 May 2021; end date - 11 May 2021 (inclusive); the maximum number of records -
500; the minimum number of likes for each record - 200; keyword - India.
   The result of the analysis of records is the following plot:
Figure 6: The results of the first analysis of the collected data
    A quick analysis of the obtained graph shows that the vast majority of records belong to the
categories “disgust” and “no emotions”. The level of other emotions is very low.
    Hovering the cursor over the point of the plot that corresponds to the peak of negative emotions,
6 May, displays a list of the five most popular hashtags on this day (Fig. 7).
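    The two quantities behind such a plot, the number of records per emotion for each day and the most
popular hashtags of a given day, could be computed roughly as follows. This is a sketch under
assumptions: the record layout (day, text, predicted emotion) and the function names are hypothetical,
and the sample records are invented for illustration only.

```python
import re
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, List, Sequence, Tuple

Record = Tuple[date, str, str]  # (day, text, predicted emotion): assumed layout
HASHTAG_RE = re.compile(r"#(\w+)")


def daily_emotion_counts(records: Sequence[Record]) -> Dict[date, Counter]:
    """Count the records of every predicted emotion for each day."""
    counts: Dict[date, Counter] = defaultdict(Counter)
    for day, _text, emotion in records:
        counts[day][emotion] += 1
    return counts


def top_hashtags(records: Sequence[Record], day: date, n: int = 5) -> List[str]:
    """Return the n most frequent hashtags (case-insensitive) of one day."""
    tags: Counter = Counter()
    for rec_day, text, _emotion in records:
        if rec_day == day:
            tags.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return [tag for tag, _count in tags.most_common(n)]


# Invented sample records, not taken from the collected data.
records = [
    (date(2021, 5, 6), "Hospitals are full #COVID19 #India", "sadness"),
    (date(2021, 5, 6), "Stay home, stay safe #covid19", "no emotions"),
    (date(2021, 5, 6), "This is unacceptable #COVID19 #oxygen #India", "anger"),
    (date(2021, 5, 7), "Good news today #India", "happiness"),
]

counts = daily_emotion_counts(records)
popular = top_hashtags(records, date(2021, 5, 6), n=2)
print(dict(counts[date(2021, 5, 6)]))
print(popular)
```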


Figure 7: The results of the first analysis of the collected data
    The most popular words and hashtags on 6 May are related to the COVID-19 pandemic. According
to google.com, 6 May was the peak in the number of new COVID-19 cases in India over the two weeks
from 29 April to 11 May. It is worth noting that the predominant emotion in such a situation should be
not disgust but sadness or anger. This result may be explained by the fact that the category “disgust”
was added from the SemEval-2018 Task 1 E-c data set, which is intended for multi-label classification.
Thus, “disgust” may be a secondary emotion for records that, in single-label classification, would fall
into the categories “sadness” or “anger”. The obtained results can be interpreted as strong negative
emotions prevailing in the population. If we remove the two largest categories and enlarge the scale, we
can see two small peaks of emotions on 9 and 11 May. The records of 9 May primarily relate to foreign
and domestic political issues.
Figure 8: Peak of emotions on May 11
    The slight increase in the number of records in the “happiness” category on 11 May can be
explained by the fact that India celebrates National Technology Day on this date (the second most
popular hashtag). Thus, the first analysis of the collected data showed that the model incorrectly
classifies negative emotions and marks such records as belonging to the category “disgust”. Therefore,
to correct this error, this category was deleted (the data set no longer contains records from the
SemEval dataset), and the model and tokenizer were re-trained.
    The dictionary size was also increased to 10,000 words, and semantic scores from the AFINN and
ANEW dictionaries were added, so the dimension of the embedding vector for each word rose from 300
to 304. According to Table 4, adding such scores does not increase the accuracy of a model that
processes high-dimensional vectors. However, such semantic scores may improve the accuracy of
classifying real-world data collected from the information resource. The model now has the following
scores: precision - 90%, recall - 89%, F-score - 89%, accuracy - 89% (on validation data), accuracy on
test data - 88%. The following experiment concerns the reaction of social network users to the blocking
of the Suez Canal from 23 to 29 March. The data were selected from 20 to 31 March.
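    The extension of the word vectors from 300 to 304 dimensions can be illustrated with the sketch
below. The text does not state which four scores were appended; the split into one AFINN score and three
ANEW scores (valence, arousal, dominance) is an assumption, as are the function name and the toy
dictionary values, which are not taken from the actual lexicons.

```python
from typing import Dict, List, Sequence

EMBEDDING_DIM = 300  # dimension of the pre-trained word vectors
N_SCORES = 4         # assumed: 1 AFINN score + 3 ANEW scores


def build_embedding_matrix(
    word_index: Dict[str, int],
    word_vectors: Dict[str, Sequence[float]],
    afinn: Dict[str, float],
    anew: Dict[str, Sequence[float]],
) -> List[List[float]]:
    """Build a (vocabulary size + 1) x 304 matrix; row 0 stays all zeros.

    word_index is assumed to map words to indices 1..N, with index 0
    reserved for words outside the dictionary.
    """
    dim = EMBEDDING_DIM + N_SCORES
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = list(word_vectors.get(word, [0.0] * EMBEDDING_DIM))
        if len(vector) != EMBEDDING_DIM:
            raise ValueError(f"unexpected vector size for {word!r}")
        # Words missing from a lexicon receive neutral zero scores.
        scores = [afinn.get(word, 0.0), *anew.get(word, (0.0, 0.0, 0.0))]
        matrix[idx] = vector + scores
    return matrix


# Toy inputs for illustration only.
word_index = {"good": 1, "bad": 2}
word_vectors = {"good": [0.1] * 300, "bad": [-0.1] * 300}
afinn = {"good": 3.0, "bad": -3.0}
anew = {"good": (7.5, 5.4, 6.4)}

matrix = build_embedding_matrix(word_index, word_vectors, afinn, anew)
print(len(matrix), len(matrix[1]))  # 3 rows, 304 columns
```

    Such a matrix could then serve as the initial weights of the embedding layer of the network.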


Figure 9: Reaction to Suez Canal blockage
   The obtained plot shows an increase in the number of “non-emotional” records after 23 March and
bursts of anger and surprise on 24 and 29 March; the days on which the number of “non-emotional”
records peaks do not coincide with the days on which the number of published “emotional” records
increases. It is worth noting that the number of collected records is not enough for a comprehensive
analysis of the reaction of social network users to this event.
   The following analysis concerns the reaction of users to the international song contest “Eurovision”,
which took place in 2019. The graph does not show the records of the category “without emotions”
because, in this analysis, the change in the number of such publications is not informative.


Figure 10: Reaction to the song contest “Eurovision-2019”
   In 2019, the first semi-final took place on 14 May, the second semi-final on 16 May, and the final
on 18 May. On the same days, there are bursts of emotions of all categories. However, the most
prominent peak of emotions falls on the final of the competition. The next analysis concerns the
reaction to the New Year 2021 holiday. For almost the entire time interval (from 25 December 2020 to
4 January 2021), the predominant emotion in the publications collected under the key phrase “new year”
is happiness. The peak of this emotion falls on 31 December.
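    The comparison of emotion peaks with known event dates, as done above for the semi-finals and the
final, could be automated along the following lines. This is a simple sketch: a peak is defined here as a
day whose count exceeds both neighbouring days, which is an assumption rather than the criterion used
in the analysis, and the daily counts are invented toy values, not the data behind the plot.

```python
from datetime import date
from typing import Dict, List


def peak_days(daily_counts: Dict[date, int]) -> List[date]:
    """Return the days whose count is strictly greater than both neighbours."""
    days = sorted(daily_counts)
    peaks = []
    for prev, cur, nxt in zip(days, days[1:], days[2:]):
        if daily_counts[cur] > daily_counts[prev] and daily_counts[cur] > daily_counts[nxt]:
            peaks.append(cur)
    return peaks


# Invented daily counts of "emotional" records for 13-19 May 2019.
counts = {
    date(2019, 5, 13): 10,
    date(2019, 5, 14): 40,
    date(2019, 5, 15): 15,
    date(2019, 5, 16): 45,
    date(2019, 5, 17): 20,
    date(2019, 5, 18): 90,
    date(2019, 5, 19): 30,
}
# Event dates stated in the text: two semi-finals and the final.
events = [date(2019, 5, 14), date(2019, 5, 16), date(2019, 5, 18)]

peaks = peak_days(counts)
matched = sorted(set(peaks) & set(events))       # peaks that fall on event days
largest = max(peaks, key=counts.__getitem__)     # most prominent peak
print(matched, largest)
```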


Figure 11: Reaction to the New Year holiday
   Fig. 12 shows the reaction of social network users to the 93rd Academy Awards ceremony, which
took place on 25 April 2021. In this case, there is a slight peak of all emotions on 26 April, but most of
the publications did not contain emotions.


Figure 12: Reaction to the Oscar 2021

    Examining the classified records shows that the model classifies well those sentences that have
correct spelling, a clearly articulated thought, and a small number of emojis and hashtags. In contrast,
the classification of “noisy” records is difficult for the model even after formatting (lowercasing,
deleting unnecessary characters, replacing emojis with their names). It should also be considered that
words “unfamiliar” to the model have an index of 0, and the vector corresponding to this index consists
of zeros. This indexing results from limiting the dictionary to the 10,000 most frequent words in the
data set, so “unfamiliar” words are skipped. As a result, the model receives a sequence of words that
have no logical connection. The model often classifies such publications as “anger” category entries,
which explains the large number of tweets in this category on some plots.
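    The effect of a limited dictionary on the input sequence can be reproduced with a few lines of code.
The sketch below does not use the tokenizer of the described model; the function names, the tiny emoji
table and the sample corpus are hypothetical, and the dictionary is limited to 4 words instead of 10,000
to keep the example small.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

EMOJI_NAMES = {"🎉": "tada"}  # tiny illustrative table


def normalise(text: str) -> str:
    """Lowercase, replace emojis with their names, drop other special characters."""
    text = text.lower()
    for emoji, name in EMOJI_NAMES.items():
        text = text.replace(emoji, f" {name} ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_word_index(texts: Iterable[str], max_words: int) -> Dict[str, int]:
    """Index the max_words most frequent words from 1; index 0 means 'unfamiliar'."""
    freq = Counter(word for text in texts for word in normalise(text).split())
    return {word: i for i, (word, _n) in enumerate(freq.most_common(max_words), start=1)}


def to_sequence(text: str, word_index: Dict[str, int], skip_unknown: bool = True) -> List[int]:
    """Convert a text to word indices, optionally skipping unfamiliar words."""
    ids = [word_index.get(word, 0) for word in normalise(text).split()]
    return [i for i in ids if i != 0] if skip_unknown else ids


corpus = ["I am so happy today", "I am sad", "so happy 🎉"]
word_index = build_word_index(corpus, max_words=4)

kept = to_sequence("I am zxqv happy!!!", word_index)
raw = to_sequence("I am zxqv happy!!!", word_index, skip_unknown=False)
print(kept, raw)
```

    In `raw`, the unfamiliar token is marked with 0; in `kept`, it is dropped, so the remaining words
become neighbours even though they were not adjacent in the original record.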
    The examples above illustrate the application of the system of automatic emotion detection from
text for such purposes as analysing the emotional state of the population of a particular geographical
region and studying the reaction of users of information resources to socially popular events. The
system may also be used by moderators of information resources and social networks to maintain the
quality of publications and follow trends; by business owners to research the market and customer
satisfaction with various products or services; and by government, health and emergency services as an
early warning system. The limitations of the created machine-learning model of automatic emotion
detection from text are:
        Data collection from one information source;
        Lack of graphical interface to simplify the interaction of the user and the system;
        Incorrect classification of “noisy” records;
        No connection to a separate server to collect, process and store large amounts of data.
    Therefore, the above model for determining emotions in text is not suitable for a full-fledged
analysis of the reactions of the authors of publications posted on information resources. However, the
model can be improved for further use in detecting the emotions that prevail in a certain audience
during a certain period and in correlating the results with events that occurred during this period. For
further refinement of the system, it is proposed to:
        Collect training and test data of the “disgust” category so that the list of possible types
corresponds to the list of basic emotions of the discrete model proposed by Paul Ekman;
        Conduct further research on the dataset used for training the model to ensure that the data is of
appropriate quality and properly annotated;
        Explore other possible architectures of the deep learning model: GRU layers, the BERT model,
a convolutional neural network working with character-level data, and models with different
combinations of CNN and/or LSTM layers;
        Select hyper-parameters using a grid or random search;
        Introduce methods for detecting “noisy” data for its subsequent special processing or deletion;
        Introduce the replacement of emoji not by their name but by a popular semantic meaning, e.g.,
🎉 - not “tada”, but “excitement”, “happiness”, “holiday”;
        Implement automatic data collection from one or several sources, at the user’s choice;
        Create a graphical interface of the program and organise the system’s infrastructure with
sufficiently powerful servers for processing and storing information.
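    One of the proposed steps, detecting “noisy” records, could start from a very simple heuristic: the
share of tokens that are absent from the model dictionary. The sketch below is only an illustration of
this idea; the function names and the threshold of 0.5 are assumptions and have not been evaluated on
the collected data.

```python
from typing import Set


def unknown_ratio(text: str, vocabulary: Set[str]) -> float:
    """Share of whitespace-separated tokens that are not in the vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0  # an empty record carries no usable signal
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


def is_noisy(text: str, vocabulary: Set[str], threshold: float = 0.5) -> bool:
    """Flag a record when more than `threshold` of its tokens are unfamiliar."""
    return unknown_ratio(text, vocabulary) > threshold


# Toy vocabulary standing in for the 10,000-word dictionary.
vocabulary = {"i", "am", "happy", "today"}

clean_flag = is_noisy("I am happy today", vocabulary)
noisy_flag = is_noisy("omg #lol xD happy", vocabulary)
print(clean_flag, noisy_flag)
```

    Records flagged in this way could be routed to special processing or excluded before classification.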

6. Conclusion

    The result of the research is a machine-learning model for automatic emotion detection from text.
For experiments with real-world data collected from the information resource, a deep learning model
was chosen whose architecture includes bidirectional long short-term memory (Bidirectional LSTM)
layers in combination with the pre-trained fastText word embedding model.
    Created and then improved through experiments with real-world data, the machine-learning model
showed satisfactory results: precision - 90%, recall - 89%, F-measure - 89%, accuracy - 89% (on
validation data), accuracy on test data - 88%. However, the analysis results obtained using the created
model cannot be considered reliable because the model has problems with the correct classification of
“noisy” data. This problem requires the introduction of an algorithm for detecting and further
processing such records. The other main proposed stages of system refinement are collecting data for
the missing category “disgust” and re-training the model; researching different architectures of the deep
learning model; selecting the model’s hyper-parameters; and replacing emoji not by their name but by a
popular semantic meaning. The developed model can be further improved and used in monitoring the
emotional state of society, the population of a geographical region or a specific community.

7. References
[1] M. Hasan, E. Rundensteiner, E. Agu, Automatic emotion detection in text streams by analysing
    Twitter data, in: Int J Data Sci Anal 7, pp. 35–51 (2019), doi: 10.1007/s41060-018-0096-z.
[2] F. Calefato, F. Lanubile, N. Novielli, EmoTxt: A Toolkit for Emotion Recognition from Text, in:
    Proceedings of the Seventh International Conference on Affective Computing and Intelligent
    Interaction Workshops and Demos, ACIIW 2017, doi: 10.1109/ACIIW.2017.8272591.
[3] R. V. Kumar, Sh. Rahmanian, H. Albalooshi, EmotionX-SmartDubai_NLP: Detecting User
    Emotions in Social Media Text, in: SocialNLP@ACL (2018), doi: 10.18653/v1/W18-3508.
[4] S. Badugu, M. Suhasini, Emotion detection on twitter data using knowledge base approach, in:
    International Journal of Computer Applications, 162(10), 2017, pp. 28‐33.
[5] N. Tripto, M. Ali, Detecting multilabel sentiment and emotions from bangla youtube comments,
    in: Proceedings of the 2nd International Conference on Communication, Computing and
    Networking, 2018, pp. 1‐6, doi: 10.1109/ICBSLP.2018.8554875.
[6] L. Ma, L. Zhang, W. Ye, W. Hu, PKUSE at SemEval‐2019 Task 3: emotion detection with
    emotion‐oriented neural attention network, in: Proceedings of the 13th International Workshop on
    Semantic Evaluation, 2019, pp. 287‐291, doi: 10.18653/v1/S19-2049.
[7] M. Polignano, P. Basile, M. Gemmis, G. Semeraro, A comparison of word‐embeddings in emotion
    detection from text using bilstm, cnn and self‐attention, in: Proceedings of the Adjunct Publication
    of the 27th Conference on User Modeling, Adaptation and Personalisation, 2019, pp. 63‐68, doi:
    10.1145/3314183.3324983.
[8] L. Chen, A. Sheth, K. Thirunarayan, W. Wang, Harnessing Twitter ‘Big Data’ for Automatic
    Emotion Identification, in: International Conference on Privacy, Security, Risk and Trust and 2012
    International Conference on Social Computing, pp. 587-592,
    doi: 10.1109/SocialCom-PASSAT.2012.119.
[9] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, Sh. Niu, DailyDialog: A Manually Labelled Multi-turn
     Dialogue Dataset, in: Proceedings of the Eighth International Joint Conference on Natural
     Language Processing, 2017, Vol. 1: Long Papers.
[10] P. Ekman, Basic emotions. Handbook Cognit Emot (1999).
[11] O. I. Sheremet, V. S. Zaporozhets, Application of recurrent neural networks to perform machine
     rewrite, in: Scientific Bulletin of the DSEA, № 1 (25Е), 2018, pp. 62 – 68.
[12] D. Raihan, Deep Learning Techniques for Text Classification. Towards Data Science, Medium,
     URL:              https://towardsdatascience.com/deep-learning-techniques-for-text-classification-
     78d9dc40bf7c.
[13] S. Bai, J. Kolter, V. Koltun, An Empirical Evaluation of Generic Convolutional and Recurrent
     Networks for Sequence Modeling, in: arXiv (2018).
[14] Ph. Rémy, Keras TCN, 2021. URL: https://github.com/philipperemy/keras-tcn.
[15] J. Pennington, R. Socher, Ch. D. Manning, GloVe: Global Vectors for Word Representation, 2014.
     URL: https://nlp.stanford.edu/projects/glove/
[16] Emotions dataset for NLP. Kaggle, URL: https://www.kaggle.com/praveengovi/emotions-dataset-
     for-nlp
[17] S. M. Mohammad, F. Bravo-Marquez, M. Salameh, S. Kiritchenko, Semeval-2018 Task 1: Affect
     in Tweets, in: Proceedings of International Workshop on Semantic Evaluation, (SemEval-2018),
     pp. 1-17, doi: 10.18653/v1/S18-1001.
[18] S. Wakamiya, L. Belouaer, D. Brosset, R. Lee, Y. Kawai, K. Sumiya, C. Claramunt, Measuring
     crowd mood in city space through twitter, in: International Symposium on Web and
     Wireless Geographical Information Systems, 2015, pp. 37-49, doi: 10.1007/978-3-319-18251-3_3.
[19] S. Albota, Resolving conflict situations in reddit community driven discussion platform, in:
     Proceedings of the 4th International conference on computational linguistics and intelligent
     systems, (COLINS 2020), Vol. 2604, pp. 215–226.
[20] S. Albota, Contradictory statement as a basis for conflict resolution strategies, in: Proceedings of
     the international workshop on conflict management in global information networks (CMiGIN
     2019) co-located with 1st International conference on cyber hygiene and conflict management in
     global information networks, (CyberConf 2019), Vol. 2588, pp. 336-345.
[21] D. Nazarenko, I. Afanasieva, N. Golian, V. Golian, Investigation of the Deep Learning Approaches
     to Classify Emotions in Texts, volume Vol-2870 of CEUR Workshop Proceedings, 2021, pp. 206-
     224.
[22] I. Bekhta, N. Hrytsiv, Computational Linguistics Tools in Mapping Emotional Dislocation of
     Translated Fiction, volume Vol-2870 of CEUR Workshop Proceedings, 2021, pp. 685-699.
[23] I. Spivak, S. Krepych, O. Fedorov, S. Spivak, Approach to Recognizing of Visualized Human
     Emotions for Marketing Decision Making Systems, volume Vol-2870 of CEUR Workshop
     Proceedings, 2021, pp. 1292-1301.
[24] Z. Kochuieva, N. Borysova, K. Melnyk, D. Huliieva, Usage of Sentiment Analysis to Tracking
     Public Opinion, volume Vol-2870 of CEUR Workshop Proceedings, 2021, pp. 272-285.
[25] O. Artemenko, V. Pasichnyk, N. Kunanets, K. Shunevych, Using sentiment text analysis of user
     reviews in social media for e-tourism mobile recommender systems, volume Vol-2604 of CEUR
     Workshop Proceedings, 2020, pp. 259-271.
[26] V. Bobicev, O. Kanishcheva, O. Cherednichenko, Sentiment Analysis in the Ukrainian and
     Russian News, in: First Ukraine Conference on Electrical and Computer Engineering (UKRCON),
     2017, pp. 1050-1055.
[27] S. Bhatia, M. Sharma, K. Bhatia, P. Das, Opinion Target Extraction with Sentiment Analysis,
     volume 17(3) of International Journal of Computing, 2018, pp. 136-142.