Hybrid Intelligence System of Emotional Facial and Speech State
Estimation
Viktor Sineglazov1, Andriy Rjabokonev2
1,2 National Aviation University, 1 Lubomir Husar Ave., Kyiv, 03058, Ukraine

                   Abstract
                   It is shown that estimating a person's emotional state from facial expression alone or from speech alone is not sufficient: a hybrid intelligence system for joint facial and speech emotional state estimation is required. To solve this problem it is proposed to use hybrid convolutional neural networks. The data supplied to the network inputs are mel-spectrograms and images of the face recorded during the conversation. A mel-spectrogram can be interpreted as a two-dimensional image in which one axis represents frequency and the other time, or rather the sequential frames of the spectrogram. The following characteristics are often extracted for this purpose: local, global, prosodic and qualitative characteristics. It is shown that the change of emotions on the face or in speech is connected with the internal reaction of the person to the questions posed. For emotional state estimation from facial and speech data, convolutional neural networks are used at the stage of identifying micro emotions and changes of voice characteristics. The decision on potential threats based on the estimated emotional state is made by an ensemble of classifiers.

                   Keywords
                   Hybrid intelligence, emotional state estimation, hybrid convolutional neural networks, mel-spectrogram, facial and speech features, decision making.


ISIT 2021: II International Scientific and Practical Conference «Intellectual Systems and Information Technologies», September 13–19, 2021, Odesa, Ukraine
EMAIL: svm@nau.edu.ua (A. 1); ryabokonev.andrey@gmail.com (A. 2)
ORCID: 0000-0002-3297-9060 (A. 1)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

    Nowadays, real importance is given to increasing aircraft safety, in particular during passenger control. Commonly, the number of people per security officer is too high to deal with them in the restricted period of time. The employee of the airline therefore faces a hard task: to ask a number of special questions in order to understand the emotional state of a passenger before admitting them to the flight. The main features that allow this problem to be solved are the emotional changes of the passenger during the control conversation [1].
    In article [1] an intelligent system of micro emotion analysis is considered, which consists of two levels: at the first level a convolutional neural network performs micro emotion recognition, and at the second level a fuzzy classifier solves the problem of deciding on a potential threat based on the estimated emotional state.
    In article [2] an intelligent system for the analysis of musical works is considered, where mel-spectrograms were used as inputs to a convolutional neural network.
    Recent research has shown that it is not enough to take into account only particular features of appearance, because sometimes they can be formed artificially. So, in addition, it is necessary to consider speech state estimation.

2. Review of Existing Solutions

    In a number of studies, the facial emotion of an individual has been recognized by means of computer vision (CV).
During a personal conversation, facial expressions have greater expressive magnitude than the words themselves. Various methods have been used for automatic facial expression recognition (FER or AFER) tasks. Early papers used geometric representations, for example, vector descriptors for the motion of the face [3], active contours for mouth and eye shape retrieval [4], and 2D deformable mesh models [5]. Others used appearance-based representations such as Gabor filters [6] or local binary patterns (LBP) [7]. These feature extraction methods were usually combined with one of several regressors to translate the feature vectors into emotion classes or action unit detections. The most popular regressors used in this context were support vector machines (SVM) and random forests. Many descriptive approaches to the interaction forms of emotions are included in the classification of the input data, and the CNN is an effective deep learning algorithm for this task.
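    As a rough illustration of the classical appearance-based pipeline described above (LBP features fed to a conventional classifier such as an SVM), the following Python sketch is a generic example rather than the method of any cited paper; the dataset-loading helper load_face_dataset() and its labels are hypothetical.

# Sketch of a classical LBP + SVM facial expression pipeline (illustrative only).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def lbp_histogram(gray_face, P=8, R=1):
    """Uniform LBP histogram of one grayscale face crop."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                      # number of distinct "uniform" codes
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# load_face_dataset() is a hypothetical helper returning grayscale crops and emotion labels.
faces, labels = load_face_dataset()
X = np.array([lbp_histogram(face) for face in faces])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
classifier = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", classifier.score(X_test, y_test))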
    Current research in the field of classifying the user's emotional state based on voice focuses mainly on experiments with different classifiers and characteristics and on finding their best combination. The relatively small number of available recordings of emotions (databases) that could be used to create a classifier has proven problematic, as has the fact that people in real situations tend to suppress their emotions and not express them fully. Another obstacle to a universal solution is the human voice itself, which can be influenced by many factors, e.g. gender, age, state of health, etc.
    An important step in designing an emotion recognition system is to recognize the facial micro changes that effectively characterize the various emotions and to extract useful properties from the voice. For these purposes the following characteristics are extracted: facial movements (unitary movements performed by a group of muscles: tightening the cheeks, stretching the eyelids, raising the wings of the nose, raising the upper lip, deepening the nasolabial fold, raising the corners of the lips, dimpling the lips, lowering the corners of the mouth, lowering the lower lip, pulling off the lips) [8] and speech characteristics (local, global, prosodic, qualitative and spectral characteristics).

2.1. Facial Movement Characteristics

    Each manifestation of a person's facial emotions can be described by a set of descriptors. Besides the apparent facial changes, micro emotions also occur; they can be taken into account in more complicated recognition approaches. Table 1 describes the main facial changes for the six standard types of emotions [9].

Table 1
Relations of emotional facial feature changes

  Emotion       Eyebrows                  Mouth
  Surprise      Rise                      Opens
  Fear          Rise and wrinkle          Opens and stretches
  Disgust       Lower                     Rises, ends lower
  Anger         Lower and wrinkle         Opens, ends lower
  Happiness     Bend down                 Ends rise
  Sadness       Outer ends lower          Ends lower

    Motion units of the person can conditionally be divided into three groups; a sketch of the dynamic case is given after the list.
    1. Static – recognition is possible using only a photo.
    2. Dynamic – continuous frame changing, key point initialization or obtaining the average value of the distances between motion units is required.
    3. Empty – they actively participate in the manifestation of emotions but are not registered by search algorithms (e.g. dimples on the cheeks).
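    As a small illustration of the dynamic group, the Python fragment below assumes facial landmarks have already been detected for every frame (a 68-point layout is used here purely as an assumption) and tracks the distance between two key points over time.

# Sketch: tracking a distance between facial key points across frames (dynamic motion units).
# `landmarks` is an assumed array of shape (n_frames, 68, 2) from any face landmark detector.
import numpy as np

LEFT_MOUTH_CORNER, RIGHT_MOUTH_CORNER = 48, 54   # indices in the common 68-point scheme

def mouth_corner_distance(landmarks):
    """Per-frame Euclidean distance between the two mouth corners."""
    left = landmarks[:, LEFT_MOUTH_CORNER, :]
    right = landmarks[:, RIGHT_MOUTH_CORNER, :]
    return np.linalg.norm(left - right, axis=1)

def dynamic_feature(landmarks):
    """Average distance and average frame-to-frame change, as used for dynamic units."""
    d = mouth_corner_distance(landmarks)
    return d.mean(), np.mean(np.abs(np.diff(d)))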
    Now it is possible to review recognition methods for the human emotional state that use methods of calculating the shapes of objects and methods of calculating the dynamics of objects (Table 2) [10].
    Face detection algorithms can be divided into four categories [11]: the empirical method; the method of invariant signs; recognition using a template implemented by the developer; detection based on external signs (training systems).
    The main stages of the algorithms of the empirical approach are: locating the parts of the face (eyes, nose, mouth) in the image; detecting the borders of the face, its shape, brightness, texture and color; combining all found invariant signs and verifying them.
Table 2
Methods for facial emotional state recognition of a human face

  Methods                  Holistic methods                  Local methods
  Methods for shape        Classifiers: artificial neural    Classifiers: artificial neural
  calculation              network, Random Forest,           network, Bayes classifier,
                           AdaBoost; Gabor filters;          AdaBoost; geometric face
                           2D face models: AAM, ASM,         models; eigenvectors: PCA;
                           EBGM                              local histograms: HoG, LBP
  Methods for dynamics     Optical flow, dynamic             3D dynamic models;
  calculation              models                            statistical models: HMM, DBN

    A shortcoming of this approach is that it is very sensitive to the degree of inclination and turn of the head.
    These approaches were implemented in the following software for processing video images of a human face subject to emotions [10]: Face Reader, Emotion Software, the GladOrs application and Face Analysis System.

2.2. Voice Characteristics

    Consider the speech characteristics. Local characteristics are defined as the energy or frequency of the separate frames that form the speech signal. Global characteristics (maximum, minimum, variance, mean, standard deviation, sharpness, skew and other similar values) are calculated statistically from the local characteristics. These values are then combined into a single global characteristics vector [12]. Global characteristics are effective only in distinguishing between energetic and low-energy emotions (e.g. anger and sadness), but fail to distinguish emotions that manifest themselves with similar energy (e.g. anger and joy) [13].
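    To illustrate how such a global characteristics vector can be assembled from local (frame-level) values, here is a sketch using Librosa for the frame features and SciPy for the statistics; the file name, frame sizes and the particular statistics retained are illustrative assumptions.

# Sketch: frame-level (local) features summarized into a global feature vector.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical recording of one answer

# Local characteristics: short-time energy (RMS) and a per-frame fundamental-frequency estimate.
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256)[0]
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, frame_length=1024)

def stats(values):
    """Global statistics computed over a sequence of local values."""
    return [np.max(values), np.min(values), np.var(values), np.mean(values),
            np.std(values), skew(values), kurtosis(values)]

# Global characteristics vector: statistics of each local feature concatenated together.
global_vector = np.array(stats(rms) + stats(f0))
print(global_vector.shape)                          # (14,)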
    Prosodic characteristics are based on the concept of prosody. Prosody (ancient Greek προσῳδία – stress, chorus; also prosodics) is a section of phonetics that considers such features of pronunciation as pitch, strength/intensity, duration, aspiration, glottalization, palatalization, the type of concordance of a consonant to a vowel and other features that are additional to the main articulation of sound [14]. Within the framework of prosody, both the subjective level of perception of the characteristics of suprasegmental units (pitch, strength/loudness, duration) and their physical aspect (frequency, intensity, time) are studied [15].
    These characteristics are thought to carry useful information for recognizing emotions [16], because longer sound units are characterized by rhythm, intonation, emphasis and pauses in speech [17], or by the tempo of speech, relative duration and intensity [18]. The intensity is often measured as the sound pressure level [19].
    The usage of qualitative characteristics is based on the assumption that the emotional content of speech is related to the quality of the voice [13]. By changing the qualitative characteristics of one's voice it is possible to reveal important information, e.g. intentions, emotions and attitudes [18]. Qualitative characteristics are closely related to prosodic characteristics. They include jitter, shimmer and other microprosodic phenomena that reflect properties of the voice such as shortness of breath and hoarseness [20]. Jitter refers to fluctuations of the fundamental frequency. There are several methods for calculating this perturbation; the simplest is the average jitter, defined as the average absolute difference in the lengths of consecutive periods. Jitter is usually expressed as a percentage. Amplitude perturbation (shimmer) is defined as fluctuations in the amplitudes of adjacent periods. As with jitter, there are many calculation methods for shimmer; the most common is the average shimmer – the average absolute difference in the amplitudes of consecutive periods [21].
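    A minimal numerical sketch of the average jitter and average shimmer defined above, assuming the pitch-period lengths and per-period amplitudes have already been extracted (for example with Praat or Parselmouth); the relative (percentage) normalization and the placeholder data are assumptions, not values from the paper.

# Sketch: average jitter and average shimmer from already-extracted pitch periods.
import numpy as np

def average_jitter(period_lengths):
    """Mean absolute difference of consecutive period lengths, expressed as a
    percentage of the mean period length (one common convention)."""
    p = np.asarray(period_lengths, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def average_shimmer(period_amplitudes):
    """Mean absolute difference of consecutive period amplitudes, expressed as a
    percentage of the mean amplitude."""
    a = np.asarray(period_amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

# Placeholder data: roughly a 125 Hz voice with small random perturbations.
periods = 0.008 + 0.0001 * np.random.randn(200)
amplitudes = 1.0 + 0.03 * np.random.randn(200)
print(average_jitter(periods), average_shimmer(amplitudes))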
    Spectral characteristics describe the spectrum of speech above the fundamental frequency – for example, harmonic and formant frequencies. Harmonic frequencies are integer multiples of the fundamental frequency: the second harmonic frequency is 2·F0, the third harmonic frequency is 3·F0, etc. Formant frequencies are amplifications of certain frequencies in the spectrum. A formant is a phonetic term that denotes an acoustic characteristic of speech sounds (primarily vowels), associated with the level of the frequency of the voice tone and forming the timbre of the sound.
    The spectrogram can be obtained by using a short-term Fourier transform. For the extraction of these five basic types of voice characteristics different software is used: openSMILE, PortAudio, Praat, Parselmouth, Librosa, pyAudioAnalysis.
    A mel-spectrogram can be used as a spectral characteristic (mel is a psychophysical unit for measuring the pitch of sound, a quantitative assessment of pitch based on the statistical processing of a large amount of data on the subjective perception of the pitch of sound tones). The mel-spectrogram is obtained by applying a set of overlapping triangular windows to the frequency spectrogram obtained by the discrete Fourier transform – X_k, k = 1, ..., N, where N is the number of signals of different frequencies that form the spectrogram [2]. The sound recording of the speech is first divided into short frames of equal length. By applying the Fourier transform, a spectrum (the frequencies present in the frame) is obtained from each frame. The spectrogram is then created by visualizing the changes of the spectrum over time. In article [2] a mel-spectrogram, represented by a two-dimensional matrix of real numbers, was used as the input to a convolutional neural network.
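    Since Librosa is among the tools listed above, the computation just described can be sketched with it as follows; the file name, sampling rate, FFT size and number of mel bands are illustrative assumptions.

# Sketch: mel-spectrogram as a two-dimensional input for a convolutional network.
import numpy as np
import librosa

y, sr = librosa.load("answer.wav", sr=16000)        # hypothetical recording of one answer

# Short-term Fourier transform -> power spectrogram -> overlapping triangular mel filters.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)       # log scale, as usually fed to a CNN

print(mel_db.shape)   # (n_mels, n_frames): a two-dimensional matrix of real numbers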
3. Hybrid Intelligence System of Emotional Facial and Speech State Estimation

    Section 2 of this work pointed out the use of convolutional neural networks for emotional facial and speech state estimation. However, as indicated in a number of studies, the use of convolutional networks of standard topology does not always lead to a correct assessment of emotions when processing video and speech signals. This leads to the need to develop new topologies of convolutional neural networks (CNN), in particular hybrid convolutional neural networks (HCNN).
    A characteristic feature of modern CNNs is the presence of unique blocks that determine their essential features, for example: the squeeze-and-excitation block, the convolutional attention module, the channel attention module, the spatial attention module, the residual block, the inception module, the ResNeXt block [22]. Thus, to build an HCNN, one can use the various unique blocks inherent in the CNNs of the same name.
    As a result, we have the problem of structural-parametric synthesis of the HCNN, the solution of which is to determine the types of unique blocks, their locations in the structure of the HCNN, their connections with other blocks, the types of activation functions, the values of the weight coefficients, etc.
    In the general case [23], an HCNN consists of S stages, and the s-th stage, s = 1, 2, ..., S, contains K_s nodes, denoted v_{s,k_s}, k_s = 1, 2, ..., K_s. The nodes within each stage are ordered, and we only allow connections from a lower-numbered node to a higher-numbered node. Each node corresponds to a unique block. It is assumed that the geometric dimensions (width, height and depth) of the stage cube remain unchanged within each stage. Neighboring stages are connected via a spatial pooling operation, which may change the spatial resolution. The structure of the HCNN represents the alternation of two unique blocks followed by a pooling layer. All convolution layers in one stage have the same number of filters or channels. To solve the problem of structural-parametric synthesis, a genetic algorithm is used, or a multicriteria genetic algorithm if, in addition to the accuracy criterion, a criterion of minimal complexity is used during the training of the HCNN. We do not encode the fully connected part of the network. In each stage, K_s(K_s − 1)/2 bits are used to encode the inter-node connections. The first bit represents the connection between (v_{s,1}, v_{s,2}), the following two bits represent the connections (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}), etc. This process continues until the last K_s − 1 bits are used to represent the connections between v_{s,1}, v_{s,2}, ..., v_{s,K_s−1} and v_{s,K_s}. For 1 ≤ i < j ≤ K_s, if the code corresponding to (v_{s,i}, v_{s,j}) is 1, there is an edge connecting v_{s,i} and v_{s,j}, i.e. v_{s,j} takes the output of v_{s,i} as a part of the element-wise summation, and vice versa.
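    This inter-node connection encoding, borrowed from [23], can be made concrete with a short decoder that follows the bit order described above; the example code string is arbitrary.

# Sketch: decode the K_s(K_s-1)/2-bit string of one stage into node connections.
def decode_stage(bits, k_s):
    """Return a dict mapping node j to the lower-numbered nodes i (i < j) connected to it.
    Bit order follows the text: (1,2), then (1,3),(2,3), then (1,4),(2,4),(3,4), ..."""
    assert len(bits) == k_s * (k_s - 1) // 2
    connections = {j: [] for j in range(1, k_s + 1)}
    idx = 0
    for j in range(2, k_s + 1):          # target (higher-numbered) node
        for i in range(1, j):            # all lower-numbered candidate sources
            if bits[idx] == "1":
                connections[j].append(i)
            idx += 1
    return connections

# Example: a stage with K_s = 4 nodes and the code "101001".
print(decode_stage("101001", 4))         # {1: [], 2: [1], 3: [2], 4: [3]}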
    Additional training of the HCNN was performed using the Adam optimizer with a learning rate of 0.00005.
    Because the Hybrid Intelligence System of Emotional Facial and Speech State Estimation contains two channels of information, micro changes in facial expression and voice, it is necessary to have two HCNNs, each of which decides on the expressed emotions, for example, when answering questions.
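    To make the two-channel arrangement concrete, here is a minimal sketch in PyTorch (the paper does not name a framework, so this choice is an assumption) with one convolutional branch for face images and one for mel-spectrograms, each fine-tuned with Adam at the quoted learning rate of 0.00005; the layer sizes and the seven-class output are placeholders.

# Sketch: two HCNN channels (face images and mel-spectrograms), each trained with Adam.
# The branch architectures are placeholders; only the optimizer and learning rate follow the text.
import torch
import torch.nn as nn

def small_cnn(in_channels, n_classes=7):
    """Toy convolutional branch standing in for one HCNN channel."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(32, n_classes),
    )

face_net = small_cnn(in_channels=3)      # channel 1: facial micro-expression images
voice_net = small_cnn(in_channels=1)     # channel 2: mel-spectrograms

# Additional training of each channel with the Adam optimizer, learning rate 0.00005.
face_optimizer = torch.optim.Adam(face_net.parameters(), lr=5e-5)
voice_optimizer = torch.optim.Adam(voice_net.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

def train_step(net, optimizer, batch, labels):
    """One optimization step for a single channel."""
    optimizer.zero_grad()
    loss = loss_fn(net(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()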
4. Results

    The results of estimating a person's emotional state from facial and speech state estimation depend strongly on the quality of the training sample and differ between emotions. For example, each of the 7 emotional states was correctly identified in more than 65% of cases. Facial state estimation alone gave good results only for separate states (Fig. 1).

Figure 1: Facial expression recognition example obtained using HCNN

    These results require additional experiments.

5. Conclusions

    In this work an effective approach for recognizing the emotional state from images of the human face and from mel-spectrograms using digital image analysis is proposed. Ways of applying hybrid convolutional neural networks to the assigned task are developed, and algorithms of digital image processing are applied. Because the Hybrid Intelligence System of Emotional Facial and Speech State Estimation contains two channels of information, micro changes in facial expression and voice, it is necessary to have two HCNNs. The given approach has an acceptable recognition level and good enough accuracy. The system can be successfully applied for security purposes in airports and is able to increase the security level.

6. References

[1] V. Sineglazov, R. Panteev, I. Boryndo, Intelligence system for emotional facial state estimation during inspection control, in: International Scientific-Practical Conference 2019, 19–24 August, Odessa, Ukraine, 2019, pp. 202–206.
[2] V. Sineglazov, O. Chumachenko, V. Patsera, Intellectual system of analysis of musical works, in: Proceedings of the International Scientific Conference 2020, 20–25 May, Ivano-Frankivsk, 2020, pp. 44–47.
[3] I. Cohen, N. Sebe, A. Garg, L. S. Chen, T. S. Huang, Facial expression recognition from video sequences: temporal and static modeling, Computer Vision and Image Understanding 91 (1) (2003) 160–187.
[4] P. S. Aleksic, A. K. Katsaggelos, Automatic facial expression recognition using facial animation parameters and multistream HMMs, IEEE Transactions on Information Forensics and Security 1 (1) (2006) 3–11.
[5] I. Kotsia, I. Pitas, Facial expression recognition in image sequences using geometric deformation features and support vector machines, IEEE Transactions on Image Processing 16 (1) (2007) 172–187.
[6] G. Littlewort, M. S. Bartlett, I. Fasel, J. Susskind, J. Movellan, Dynamics of facial expression extracted automatically from video, Image and Vision Computing 24 (6) (2006) 615–625.
[7] C. Shan, S. Gong, P. W. McOwan, Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[8] A. Woubie, J. Luque, J. Hernando, Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System, in: ODYSSEY 2016 – The Speaker and Language Recognition Workshop, 2016, pp. 400–406.
[9] P. Ekman, W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto, 1978.
[10] D. Stutz, Introduction to Neural Networks, Seminar on Selected Topics in Human Language Technology and Pattern Recognition, 2014.
[11] D. A. Tatarenkov, Analysis of face recognition methods on images, 2015, p. 270.
[12] Y. Gao, B. Li, N. Wang, T. Zhu, Speech Emotion Recognition Using Local and Global Features, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag, 2017, pp. 3–13. https://doi.org/10.1007/978-3-319-70772-3_1.
[13] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44 (2011) 572–587. https://doi.org/10.1016/j.patcog.2010.09.020.
[14] N. D. Svetozarova, Prosody, in: Great Russian Encyclopedia [in 35 vols.], vol. 27, ed. Yu. S. Osipov, Bolshaya Rossiyskaya Entsiklopediya, Moscow, 2015, p. 614. ISBN 978-5-85270-364-4. (Accessed 10 April 2020).
[15] A. M. Antipova, Prosody, in: Linguistic Encyclopedic Dictionary, ed. V. N. Yartseva, Sovetskaya Entsiklopediya, Moscow, 1990, 685 p. ISBN 5-85270-031-2. (Accessed 10 April 2020).
[16] N. Sato, Y. Obuchi, Emotion Recognition using Mel-Frequency Cepstral Coefficients, Journal of Natural Language Processing 14 (2007) 83–96. https://doi.org/10.5715/jnlp.14.4_83.
[17] A. Woubie, J. Luque, J. Hernando, Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System, in: ODYSSEY 2016 – The Speaker and Language Recognition Workshop, 2016, pp. 400–406.
[18] P. Gangamohan, S. R. Kadiri, B. Yegnanarayana, Analysis of Emotional Speech – A Review, in: Intelligent Systems Reference Library, Springer Science and Business Media Deutschland GmbH, 2016, pp. 205–238. https://doi.org/10.1007/978-3-319-31056-5_11.
[19] L. L. Beranek, T. J. Mellow, Acoustics: Sound Fields, Transducers and Vibration, Elsevier, 2019. https://doi.org/10.1016/C2017-0-01630-0.
[20] A. Batliner, B. Schuller, D. Seppi, S. Steidl, L. Devillers, L. Vidrascu, T. Vogt, V. Aharonson, N. Amir, The Automatic Recognition of Emotions in Speech, in: Cognitive Technologies, Springer Verlag, 2011, pp. 71–99. https://doi.org/10.1007/978-3-642-15184-2_6.
[21] J. M. Hillenbrand, Acoustic Analysis of Voice: A Tutorial, Perspectives on Speech Science and Orofacial Disorders 21 (2011) 31–43. https://doi.org/10.1044/ssod21.2.31.
[22] V. Sineglazov, A. Kot, Design of hybrid neural networks of the ensemble structure, Eastern-European Journal of Enterprise Technologies 1 (4 (109)) (2021): Mathematics and Cybernetics – applied aspects. https://doi.org/10.15587/1729-4061.2021.225301.
[23] L. Xie, A. Yuille, Genetic CNN, arXiv:1703.01513 [cs.CV], 2017.