Multimodal Emotion Recognition from Voice and Video Signals

Paola Barra1,∗,†, Zied Mnasri2,† and Danilo Greco3,†
1 DIST, University of Naples Parthenope, Naples, Italy
2 DAAM, University of Naples L’Orientale, Naples, Italy
3 DiSEGIM, University of Naples Parthenope, Naples, Italy

Abstract
A promising area of research and development that can significantly increase the efficacy and accuracy of mental health assessments is the use of artificial intelligence (AI) and machine learning algorithms to analyse voice and facial expressions simultaneously in a video stream. More studies are required to fully understand the capabilities and limitations of these technologies and to guarantee their ethical and effective usage in clinical settings. Collaborative robots (cobots) have the potential to completely change how mental evaluations of autistic children are approached. ChatGPT is an effective language model that can understand and produce human-like text. When used in conjunction with a cobot, this technology enables children with autism to interact and communicate in a way that is natural to them. In this article, we introduce a novel method for emotion detection using voice analysis and facial recognition, tested on the IEMOCAP database. A results section, which illustrates the tool’s potential use in healthcare, concludes the paper.

Keywords
emotion recognition, spoken language, deep learning, convolutional neural network, web-shaped model (WSM)

ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
paola.barra@uniparthenope.it (P. Barra); zmnasri@unior.it (Z. Mnasri); danilo.greco@uniparthenope.it (D. Greco)
ORCID: 0000-0002-7692-0626 (P. Barra); 0000-0002-8929-3609 (Z. Mnasri); 0000-0002-0011-7001 (D. Greco)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction
In the field of digital health, there has been extensive research on emotion recognition utilising spoken language and facial expressions [1, 2]. Deep learning techniques are among the strategies used to solve this problem [3]: convolutional neural networks (CNN) can be trained on large collections of photos labelled with the corresponding emotions and can subsequently recognize image patterns associated with emotions [4]. Furthermore, recurrent neural networks (RNN) can be used to analyze the audio data of spoken words [5]. A collection of audio recordings, each labelled with a different emotion, can be used to train the RNN; the network can then learn to identify audio patterns that represent various emotional states. The CNN and RNN can both be trained independently before being integrated into a single model that can identify emotions through both spoken and facial expressions. To distinguish emotions from spoken language and facial expressions, other deep learning model types, such as long short-term memory (LSTM) networks or transformers, can also be utilised [6]. The choice of model depends on the specific characteristics of the data. In this paper, we present a novel approach [7, 8] to recognize emotions from speech and images coming from a video dataset.
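As a schematic illustration of this late-fusion idea (not the system evaluated later in this paper), the following sketch combines a small image CNN branch and an audio LSTM branch by concatenating their embeddings before a shared classification layer; the use of PyTorch, the layer sizes and the input shapes are all illustrative assumptions.

```python
# Minimal late-fusion sketch: an image CNN branch and an audio RNN (LSTM) branch are
# combined through a shared classification head. All sizes are hypothetical.
import torch
import torch.nn as nn

class AudioVisualEmotionNet(nn.Module):
    def __init__(self, n_classes=6, n_audio_feats=40):
        super().__init__()
        # CNN branch for a 48x48 grayscale face crop (assumed input size)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 12 * 12, 64), nn.ReLU())
        # RNN branch for a sequence of frame-level audio features (e.g. MFCC vectors)
        self.rnn = nn.LSTM(input_size=n_audio_feats, hidden_size=64, batch_first=True)
        # Fusion head: concatenate both embeddings and map to emotion classes
        self.head = nn.Linear(64 + 64, n_classes)

    def forward(self, face, audio):
        # face: (batch, 1, 48, 48); audio: (batch, time, n_audio_feats)
        v = self.cnn(face)
        _, (h, _) = self.rnn(audio)      # last hidden state: (1, batch, 64)
        return self.head(torch.cat([v, h.squeeze(0)], dim=1))

# Forward pass with random tensors, just to show the expected shapes
model = AudioVisualEmotionNet()
print(model(torch.randn(2, 1, 48, 48), torch.randn(2, 120, 40)).shape)  # torch.Size([2, 6])
```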
The simultaneous analysis of voice and facial expressions in a video stream is a promising field for improving the efficacy and accuracy of mental health assessments. The experiments were performed on the IEMOCAP dataset (https://sail.usc.edu/iemocap/), which provides video and audio labelled by time interval with the related emotions. More research is necessary to fully understand the potential and constraints of these technologies [9], as well as to ensure their effective and ethical application in therapeutic contexts. Compared with more conventional robotic systems, cobots, or collaborative robots, have a number of advantages: they can be configured to communicate with children in a fun and relational way; they can be trained to interact with children in interesting and reassuring ways; and they can do things such as point to objects or make certain movements, which can help engage children and provide a more thorough assessment of their abilities. The way mental assessments of autistic children are performed could change completely through the interaction between ChatGPT and cobots. The use of these cutting-edge technologies could offer a more effective and efficient way to assess and understand the needs of these children. To study autistic pathophysiology, a system could be developed that integrates language and image analysis with robot-mediated patient interaction in order to evaluate social and communication skills. The system might engage with patients (children) via voice- and facial-recognition-equipped cobots while gathering data on their feelings and behaviours. In order to spot any irregularities in communication and emotional expression, the system could also evaluate the voices and facial expressions of the patients. The severity of the condition might be determined using such data, and more individualized and focused therapy interventions could be created.

2. Related works
Audio-visual emotion recognition has recently been reviewed in several survey publications, e.g. [10], which examine experimental methods for multimodal emotion recognition. Methods used for audio-visual emotion recognition can be categorized into three main families: a) generative methods, b) supervised learning and c) unsupervised learning.

2.1. Generative modeling
These methods mainly use Bayesian models, where each emotion category is modelled by a GMM/HMM. For instance, in order to fuse multiple signals in an Error-Weighted Classifier (EWC), [11] adopted a Bayesian framework to blend empirical evidence with prior assumptions. In the voice modality, an HMM was trained for each emotional phonetic category. In the face modality, GMMs trained for each emotion model the upper-face visual information, while HMMs trained for each emotion model the lower-face visual information. In order to properly integrate the different modalities, the individual decisions were merged through a weighted sum, after investigating their contributions, which were determined using the confusion matrices of each classifier.

2.2. Supervised learning
Supervised learning has for several years been the most widely used approach for emotion recognition in conversational speech, either from audio or video. For instance, [10] reported that:
• By taking into account the speaker information of the target utterance and additionally modelling self- and inter-speaker emotional influence with a hierarchical multi-stage RNN with an attention mechanism, DialogueRNN [12] seeks to address this task.
• By utilizing quantum theory and an RNN-LSTM, the need to model inter-speaker interdependence for emotion recognition in conversational speech is addressed in [13].
• The Interaction-Aware Attention Network (IANN), an emotion recognition technique for conversational speech based on inter-speaker connection modelling, was recently proposed by [14], where each speaker is modelled by a unique memory.

2.3. Unsupervised learning
Unsupervised learning can be useful either for emotion clustering or for feature extraction. Emotion clustering consists in discovering emotion groups which do not necessarily match unique labels. For example, a cluster grouping ‘excitement’ and ‘anger’ signals may indicate ‘high arousal’, whereas ‘calm’ and ‘sadness’, when clustered together, indicate ‘low arousal’. Such clustering would therefore be useful for proposing a novel classification of emotions. This track has been investigated in [15], where fuzzy clustering helped construct a membership matrix showing that different emotion categories may share similar valence or arousal characteristics. On the other hand, unsupervised autoencoders were used by [16] to extract latent features for speech emotion recognition. An autoencoder is a neural network that approximates the identity function; its value for this application lies in collecting latent features at the code layer, which may be more useful than hand-crafted features for characterizing speech in the particular task of comparing the emotional content of utterances.

3. Materials and methods
3.1. Emotion recognition datasets
Emotion recognition datasets can be classified into three main categories: a) speech emotion datasets, b) facial expression datasets and c) video and/or multimodal emotion recognition datasets. Most of these datasets use either spontaneous or acted speech or scenes. The corresponding labels are generally provided either in a categorical or in a dimensional scheme. Categorical labels usually follow either the basic emotion model proposed by Ekman or Plutchik’s compound model, also known as the wheel of emotions. The dimensional model, on the other hand, represents emotions in a 3-D coordinate system where each emotion is characterized by a score of i) valence, i.e. positive or negative, ii) activation, i.e. high or low, and iii) dominance [15], as illustrated in Fig. 1.

Figure 1: Emotion categories and dimensional attributes.

Therefore, most emotion recognition datasets are built following either the basic, the compound or the dimensional model. For instance, speech emotion recognition datasets like EMO-DB and EMOVO contain the six basic emotions plus neutrality [10]. However, multimodal datasets like IEMOCAP [17], CMU-MOSEI, MELD and SEMAINE [10] use a more complex labelling system, including both basic-emotion categorical labels and scores for the dimensional axes, for either improvised or acted scenes. For example, in the IEMOCAP dataset, categorical labels and dimensional scores are given by several human evaluators, and the final label is obtained by majority voting. More details about the state-of-the-art datasets used in speech and multimodal emotion recognition can be found in [10].
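As an illustration of this labelling scheme, the following minimal sketch aggregates the annotations of several hypothetical evaluators: the categorical label is decided by majority voting, while the dimensional scores are averaged (the annotation values are made up for the example and are not taken from IEMOCAP).

```python
# Aggregating multiple evaluators' annotations for one utterance: majority vote for the
# categorical label, mean for the (valence, activation, dominance) scores.
from collections import Counter
import numpy as np

def aggregate_annotations(categorical_votes, vad_scores):
    """categorical_votes: one emotion label per evaluator.
    vad_scores: one (valence, activation, dominance) tuple per evaluator."""
    label = Counter(categorical_votes).most_common(1)[0][0]   # majority-voted label
    vad = np.mean(np.asarray(vad_scores), axis=0)             # averaged dimensional scores
    return label, vad

# Example with made-up annotations from three evaluators
label, vad = aggregate_annotations(
    ["happiness", "excitement", "happiness"],
    [(4.0, 3.5, 3.0), (4.5, 4.0, 3.5), (4.0, 3.0, 3.0)])
print(label, vad)   # happiness, approximately [4.17 3.5 3.17]
```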
3.2. Speech emotion recognition
In recent years, a substantial body of research has been conducted on the use of convolutional neural networks (CNNs) for voice emotion recognition. The complexity and dynamic nature of speech signals, which are influenced by several factors such as speaking style, pronunciation and accent, make it difficult to recognize emotions in speech. However, it is also a vital area of research, since emotions are fundamental to communication and their understanding can have a big impact on a variety of applications, including speech-based human-computer interaction, affective computing and mental health diagnostics.

3.2.1. Low-level descriptors
Low-level descriptors are a set of features intentionally designed for speech emotion recognition. Such features have been selected to form the well-known GeMAPS feature set [18]. This feature set includes several types of prosodic, acoustic and spectral descriptors, computed at the raw signal level. Several variants of the GeMAPS features are available in the OpenSmile toolbox [19], each for a particular task in affective speech computing. These features were proven to give high affective speech recognition rates in several challenges, such as the Interspeech 2009 Emotion Challenge [20] and the Computational Paralinguistics Challenge (ComParE) [21]. However, they belong to the paradigm of explicit feature extraction, whereas novel feature extraction methods are moving towards the concept of an end-to-end learning process.

3.2.2. Feature extraction and classification methods
A more recent feature extraction technique in speech emotion recognition consists in using spectrogram images, together with the Mel-Frequency Cepstral Coefficients (MFCCs) derived from them, as an implicit representation of the signal [22]. A spectrogram is a time-frequency representation of the speech signal that can be used to identify its spectral and temporal aspects, which are significant markers of emotional state. MFCCs, on the other hand, summarize the spectral envelope of the speech signal and are relatively resilient to changes in speaking style, pronunciation and accent. The spectrogram can be utilized as input for speech emotion classification in several ways: a) by extracting feature vectors composed of MFCC coefficients and their Δ and Δ-Δ, i.e. their first and second derivatives; b) by feature embedding, usually through learning latent features collected at the code layer of an autoencoder [15]; or c) by presenting the spectrogram image as raw input to the classifier. When spectrogram images are used as input, convolutional neural networks (CNN) are preferred as classifiers, since they excel at classifying images and can accurately identify patterns in the input data. For recognizing emotions in speech, a CNN uses the spectrogram or the MFCCs as input and is trained on a sizable dataset of speech signals to learn how to categorize the emotions. The combination of feature extraction, CNNs and numerous other techniques can considerably enhance the precision, robustness and generalization of emotion recognition in speech. The absence of large and diverse datasets, the difficulty of defining a consistent set of emotions, and the impact of numerous confounding variables on speech signals are just a few of the difficulties the area is currently experiencing.
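As an illustration of the inputs described in a) and c) above, the following sketch extracts a log-magnitude spectrogram and MFCCs with their Δ and Δ-Δ coefficients from short overlapping frames. The librosa library, the 100 ms frame length with 75% overlap (the setting used later in Section 4) and the STFT parameters are assumptions made for the example, not prescriptions of this paper.

```python
# Frame-level feature extraction sketch: log-magnitude spectrogram plus MFCCs with their
# first and second derivatives. Library choice and all numeric parameters are illustrative.
import librosa
import numpy as np

def frame_level_features(wav_path, sr=16000, frame_s=0.1, overlap=0.75):
    y, _ = librosa.load(wav_path, sr=sr)
    frame_len = int(frame_s * sr)                    # 100 ms frames
    hop = max(1, int(frame_len * (1.0 - overlap)))   # 75% overlap between frames
    features = []
    for start in range(0, max(1, len(y) - frame_len + 1), hop):
        frame = y[start:start + frame_len]
        # c) spectrogram image of the frame (log-magnitude STFT)
        spec = np.abs(librosa.stft(frame, n_fft=400, hop_length=160))
        log_spec = librosa.amplitude_to_db(spec, ref=np.max)
        # a) MFCC vectors with their Δ and Δ-Δ coefficients
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
        mfcc_full = np.vstack([mfcc,
                               librosa.feature.delta(mfcc),
                               librosa.feature.delta(mfcc, order=2)])
        features.append({"spectrogram": log_spec, "mfcc": mfcc_full})
    return features
```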
3.3. Facial emotion recognition
From learning platforms to human-computer and human-robot interaction, facial emotion recognition (FER) is widely used in a variety of activities. It can also be used as an intrinsic technique in face recognition problems, to produce an expression-free face classifier. Most approaches focus on building ever-deeper neural networks that regard an expression as a still image or, in certain cases, as a sequence of successive frames that reflects the expression’s temporal component [23].

3.3.1. Feature extraction and classification methods
FER usually starts with the extraction of facial features, followed by an emotion classification step. This section presents recent works that experiment with different techniques. Methods for feature extraction are mainly divided into geometric methods and appearance-based methods. In particular, geometric methods focus on the shape, scale, orientation and location of the various parts of the face, such as the nose, mouth, eyes and eyebrows [24]. In this context, the Active Shape Model is a feature-matching solution that focuses on point features and measures several shape variations with a range of adaptive models on the face [25]. The authors in [26] propose a technique based on statistical parameters extracted from the wavelet transform, after which an SVM classifier performs the emotion classification. In [27], histograms of oriented gradients (HOG), Gabor filters and local binary patterns (LBP) were used as feature extractors; subsequently, a simple k-nearest neighbour (kNN) classifier was used for the classification. The experiments carried out aimed to demonstrate the effectiveness of these three feature extractors for FER. The authors in [28] propose a FER system that jointly estimates the gender and age of the subject, to understand how much these factors affect facial expressions; they used the classic kNN and SVM classifiers as well as deep learning models such as CNN and VGG16.

3.3.2. Web Shaped Model
A common drawback of the above approaches is the computational cost of the training phase, which can take hours or days to complete. This work uses the Web Shaped Model (WSM), a geometric method for extracting features that can distinguish between various facial emotions. It uses a virtual geometric pattern resembling a spider’s web drawn on the face. The Web Shaped Model was introduced in [7] to detect the pose of the face; later, it was also used for FER [8]. The method: (1) locates face landmarks with the Kazemi-Sullivan method [29]; (2) draws the web centred on the nose landmark; (3) stores the web together with the associated emotion tag. The resulting coding is an array counting how many landmarks fall into each sector of the web, thus locating the facial regions that contain landmarks. Fig. 2 shows an example of this process for an image labelled with the ‘neutral’ emotion.

Figure 2: Process of creating emotion coding using WSM.

The resulting encoding varies depending on the chosen number of concentric circles and rays: the final array changes its size and content accordingly. The web configuration used here has 60 rays and 8 concentric circles, resulting in 480 elements in the final coding.
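The sketch below illustrates this coding scheme. It assumes 68 dlib-style landmarks supplied as an (N, 2) array, the nose tip as the web centre, and circle radii equally spaced out to the farthest landmark; these are assumptions made for the example, and the exact geometry of the original WSM may differ.

```python
# Illustrative WSM-style coding: count how many facial landmarks fall into each of the
# 60 x 8 = 480 web sectors centred on the nose landmark. Geometry details are assumed.
import numpy as np

def wsm_coding(landmarks, nose_idx=30, n_rays=60, n_circles=8):
    """landmarks: (N, 2) array of (x, y) points; returns a 480-element count vector."""
    rel = landmarks - landmarks[nose_idx]                       # centre the web on the nose
    radius = np.hypot(rel[:, 0], rel[:, 1])
    angle = np.mod(np.arctan2(rel[:, 1], rel[:, 0]), 2 * np.pi)
    max_r = radius.max() + 1e-9                                 # outermost circle encloses all points
    ray_bin = np.minimum((angle / (2 * np.pi) * n_rays).astype(int), n_rays - 1)
    circ_bin = np.minimum((radius / max_r * n_circles).astype(int), n_circles - 1)
    coding = np.zeros((n_rays, n_circles), dtype=int)
    for r, c in zip(ray_bin, circ_bin):
        coding[r, c] += 1                                       # one count per landmark per sector
    return coding.ravel()

# Example with random stand-in landmark coordinates (a real detector would provide these)
rng = np.random.default_rng(0)
print(wsm_coding(rng.uniform(0, 200, size=(68, 2))).sum())      # 68: every landmark is binned
```

Each frame's coding is then used as the feature vector for the classifiers discussed in Section 4.3.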
4. Experiments and results
4.1. Experimental protocol
An experimental setup has been established to carry out both tasks simultaneously and jointly, in order to evaluate the efficacy of merging spoken language and facial expressions for emotion recognition. The following tests were carried out using conversational speech videos, either improvised or scripted, included in the IEMOCAP dataset [17]:
• Conversational video sessions were segmented into chunks, each containing only one sentence, uttered by a single speaker.
• For each frame, a categorical label has been defined among the following basic emotions: happiness, anger, sadness, fear, neutral, other. It should be noted that some class labels provided in the dataset were merged with others due to their reduced occurrence; e.g. excitement and surprise are labelled as happiness, and frustration as sadness.
• Each audio chunk is segmented into short frames (audio frames) of 100 ms each, with 75% overlap.
• For each audio frame, a series of spectrograms is extracted to be used as input features.
• The video of each sentence is segmented into successive images.
• For each audio frame or image, the corresponding label is that of the sentence.
• For both the audio frames and the images taken from each video, feature extraction and training are performed separately, using different types of features and classifiers for each task.
• After classifying the single audio frames and images, majority voting is applied to predict the label at the sentence level.
• The performance of each classifier is evaluated using typical metrics for supervised learning, i.e. overall accuracy and class-wise precision, recall and F1-score.
The combined process is illustrated in Fig. 3, both for facial images and for speech emotion classification.

Figure 3: Workflow of the combined emotion recognition process from facial images and speech signals.

4.2. Audio classification for speech emotion recognition
To implement the audio signal classifier, we opted to use spectrogram images as input and CNNs as classifiers. Thus, no explicit features, such as the OpenSmile LLDs [19], were used. The rationale behind this choice is purely empirical, based on the performance of each type of input. Feature extraction and classification are performed at the frame level, using the following process:
• Each audio signal, generally corresponding to one uttered sentence, is segmented into overlapping short frames of 100 ms each; this duration is considered the minimum that can carry emotion.
• For each frame, a series of spectrogram images is extracted by applying a short-time Fourier transform over overlapping windows.
• The spectrogram images are fed into a CNN composed of several 2-D convolutional layers, each followed by a max-pooling layer. The last layer is a dense layer with a softmax activation that returns the classification probabilities.
• Once all frames belonging to a given sentence are classified, majority voting is applied to decide the sentence’s label.
The results of this workflow, obtained on a separate test set randomly extracted from the same session of the IEMOCAP database used for training, are reported in Table 1.

Table 1
Emotion classification results from voice signals and facial images

                          Voice signals                 Facial images
Emotion                   Precision  Recall  F1         Precision  Recall  F1
Anger                     0.76       0.30    0.43       0.15       0.47    0.23
Fear                      1.00       0.17    0.29       0.33       0.50    0.40
Happiness                 0.77       0.47    0.58       0.81       0.60    0.69
Sadness                   0.58       0.95    0.72       0.65       0.68    0.66
Neutral                   1.00       0.21    0.35       0.63       0.33    0.44
Other                     1.00       0.25    0.40       0.00       0.00    0.00
All (overall accuracy)    0.63                          0.60

4.3. Image classification for face emotion recognition
For each time interval, the WSM coding of each face image was extracted and labelled with the corresponding emotion. The classifier was then trained and tested on the same fragments used for the audio experiments. In previous articles, the spider web had been applied only to single face images; in this work, emotions are classified over video intervals. Each image in the video interval is classified individually, frame by frame; the class predicted for the whole fragment is then determined by majority voting over all the frame-level predictions.
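The fragment-level decision just described can be summarised by the sketch below; the random training data, the feature dimensionality and the small forest are toy stand-ins (the actual classifier configuration used in this work is given next).

```python
# Classify every frame of a fragment, then take a majority vote over the frame predictions.
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def predict_fragment(classifier, frame_features):
    """frame_features: one feature vector per frame (e.g. 480-element WSM codings)."""
    frame_labels = classifier.predict(frame_features)
    return Counter(frame_labels).most_common(1)[0][0]   # majority-voted fragment label

# Toy usage with random stand-in data: 200 training frames, 6 emotion classes
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(200, 480))
y_train = rng.choice(["anger", "fear", "happiness", "sadness", "neutral", "other"], size=200)
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)  # small forest for the toy example
print(predict_fragment(clf, rng.integers(0, 5, size=(30, 480))))
```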
The classification was performed with several well-known state-of-the-art classifiers, in order to compare them and determine which yields the best performance. The best-performing classifier is the Random Forest classifier with the following parameters:
• The whole dataset is used to build each tree.
• Tree splitting criterion = entropy; this criterion is equivalent to minimizing the cross-entropy (multinomial deviance) between the true labels and the probabilistic prediction.
• Number of estimators = 1000.

4.4. Discussion
The results show differences for each type of input, i.e. voice signals and facial images, and for each category of emotion. Whereas some emotions, like happiness and sadness, are fairly well recognized by both audio and image classification, others are recognized much less reliably. For instance, audio-based classification presents low recall rates for most emotions, i.e. many of their instances are misclassified into other classes. This may be due to the inability of the spectrogram to decorrelate emotion from voice identity. Once the CNN is trained on a certain spectrogram pattern, it learns the characteristic spectral coefficients of the voice and hence becomes more prone to recognizing the voice than the emotion. In other words, if a voice is frequently labelled as happy, it will tend to be labelled as happy even when the target emotion is different. FER results are highly dependent on the type of images acquired. The emotions in the IEMOCAP dataset are realistic and include many microexpressions that are unlikely to be captured by cameras with a frame rate lower than 30 fps [30]. This technical limitation also explains the poor result for the ‘other’ class, which is a collection of unidentified emotions. Otherwise, the general classification trend follows that of speech emotion recognition, in which classic emotions such as sadness and happiness are quite well classified.

5. Conclusions
Multimodal emotion recognition from audio and video channels can open new avenues in mental health monitoring. Information gathered from a conversation with a cobot and ChatGPT can be reviewed in real time, providing the evaluator with immediate feedback. Since children with autism often have difficulty with typical assessments that rely on lengthy questionnaires or standardized tests, this can be extremely useful for them. To explore this avenue, we presented preliminary results on emotion recognition on the IEMOCAP database, which provides voice and video data. The experiments demonstrate that this information can be detected separately from each modality. Future experiments may consist in: a) improving the emotion recognition results of each type of classifier; b) integrating both classifiers into a single multimodal framework obtained from the conjunction of both extracted feature sets, using data integration with fuzzy methods [31]; and c) applying visual transformers to introduce recurrent learning.

References
[1] P. Harár, R. Burget, M. K. Dutta, Speech emotion recognition with deep learning, 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN) (2017) 137–140. doi:10.1109/SPIN.2017.8049931.
[2] H. Aouani, Y. B. Ayed, Speech emotion recognition with deep learning, Procedia Computer Science 176 (2020) 251–260.
[3] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, T. Alhussain, Speech emotion recognition using deep learning techniques: A review, IEEE Access 7 (2019) 117327–117345. doi:10.1109/ACCESS.2019.2936124.
[4] B. Zhang, C. Quan, F. Ren, Study on cnn in the recognition of emotion in audio and images, in: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), 2016, pp. 1–5. doi:10.1109/ICIS.2016.7550778.
[5] A. Graves, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[6] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
[7] P. Barra, S. Barra, C. Bisogni, M. C. D. Mársico, M. Nappi, Web-shaped model for head pose estimation: An approach for best exemplar selection, IEEE Transactions on Image Processing 29 (2020) 5457–5468.
[8] P. Barra, L. De Maio, S. Barra, Emotion recognition by web-shaped model, 82 (8) (2022). doi:10.1007/s11042-022-13361-6. URL https://doi.org/10.1007/s11042-022-13361-6
[9] Z. Tariq, S. K. Shah, Y. Lee, Speech emotion detection using iot based deep learning for health care, 2019 IEEE International Conference on Big Data (Big Data) (2019) 4191–4196.
[10] S. Poria, N. Majumder, R. Mihalcea, E. Hovy, Emotion recognition in conversation: Research challenges, datasets, and recent advances, IEEE Access 7 (2019) 100943–100953. doi:10.1109/ACCESS.2019.2929050.
[11] A. Metallinou, S. Lee, S. Narayanan, Decision level combination of multiple modalities for recognition and analysis of emotional expression, in: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 2462–2465. doi:10.1109/ICASSP.2010.5494890.
[12] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, E. Cambria, Dialoguernn: An attentive rnn for emotion detection in conversations, Proceedings of the AAAI Conference on Artificial Intelligence 33 (01) (2019) 6818–6825. doi:10.1609/aaai.v33i01.33016818.
[13] Y. Zhang, Q. Li, D. Song, P. Zhang, P. Wang, Quantum-inspired interactive networks for conversational sentiment analysis, Proc. 28th Int. Joint Conf. Artif. Intell. (IJCAI) (2019) 1–8.
[14] S.-L. Yeh, Y.-S. Lin, C.-C. Lee, An interaction-aware attention network for speech emotion recognition in spoken dialogs, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.
[15] S. Rovetta, G. Mauri, G. Cinini, R. Tomazzoli, S. Annoni, I. D. Feis, B. Torre, V. Caruso, A. Villa, L. Cattaneo, A. M. Pugliese, P. Accortanzo, G. M. D. Pietro, M. Sampietro, A. Verri, M. Cinque, F. Bruni, M. Pini, C. Cristalli, G. Altoé, Emotion recognition from speech: An unsupervised learning approach, Int. J. Comput. Intell. Syst. 14 (1) (2021) 23–35.
[16] S. Rovetta, Z. Mnasri, F. Masulli, Emotional content comparison in speech signal using feature embedding, Progresses in Artificial Intelligence and Neural Systems (2021) 45–55.
[17] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (4) (2008) 335–359.
[18] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, K. P. Truong, The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing, IEEE Transactions on Affective Computing 7 (2) (2015) 190–202.
[19] F. Eyben, B. Schuller, opensmile:) the munich open-source large-scale multimedia feature extractor, ACM SIGMultimedia Records 6 (4) (2015) 4–13.
[20] B. Schuller, S. Steidl, A. Batliner, The interspeech 2009 emotion challenge, in: Interspeech 2009, 2009, p. 312.
[21] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R. Orozco-Arroyave, E. Nöth, Y. Zhang, F. Weninger, The interspeech 2015 computational paralinguistics challenge: nativeness, parkinson’s & eating condition, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[22] R. M. Stern, A. Schafer, Robust modeling of speech signals, IEEE Transactions on Speech and Audio Processing 7 (5) (1999) 525–532. doi:10.1109/89.784104.
[23] S. Lee, A brief review of deep learning for facial expression recognition, Available at SSRN 4318896 (2023). URL http://dx.doi.org/10.2139/ssrn.4318896
[24] N. Rathour, R. Singh, A. Gehlot, S. V. Akram, A. K. Thakur, A. Kumar, The decadal perspective of facial emotion processing and recognition: A survey, Displays 75 (2022) 102330. doi:10.1016/j.displa.2022.102330.
[25] T. Cootes, C. Taylor, D. Cooper, J. Graham, Active shape models – their training and application, Computer Vision and Image Understanding 61 (1) (1995) 38–59. doi:10.1006/cviu.1995.1004.
[26] J. R. Kumar, R., M. Sundaram, N. Arumugam, et al., Face feature extraction for emotion recognition using statistical parameters from subband selective multilevel stationary biorthogonal wavelet transform, Soft Computing 25 (2021) 5483–5501. doi:10.1007/s00500-020-05550-y.
[27] S. Subudhiray, H. K. Palo, N. Das, K-nearest neighbor based facial emotion recognition using effective features, IAES International Journal of Artificial Intelligence 12 (1) (2023) 57.
[28] S. T. Chavali, C. T. Kandavalli, T. M. Sugash, R. Subramani, Smart facial emotion recognition with gender and age factor estimation, Procedia Computer Science 218 (2023) 113–123. doi:10.1016/j.procs.2022.12.407.
[29] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[30] D. Borza, R. Danescu, R. Itu, A. Darabant, High-speed video system for micro-expression detection and recognition, Sensors 17 (12) (2017) 2913. doi:10.3390/s17122913.
[31] A. Ciaramella, D. Nardone, A. Staiano, Data integration by fuzzy similarity-based hierarchical clustering, BMC Bioinformatics 21 (Suppl 10) (2020) 350. doi:10.1186/s12859-020-03567-6.