Hybrid Intelligence System of Emotional Facial and Speech State
Estimation
Viktor Sineglazov1, Andriy Rjabokonev2
1,2 National Aviation University, 1 Lubomir Husar Ave., Kyiv, 03058, Ukraine

                   Abstract
                   It is shown that estimating a person's emotional state from facial expression alone or from speech alone is not sufficient: a hybrid intelligence system for joint facial and speech emotional state estimation is required. To solve this problem it is proposed to use hybrid convolutional neural networks. The data supplied to the network inputs are mel-spectrograms and images of the face recorded during the conversation. A mel-spectrogram can be interpreted as a two-dimensional image in which one axis represents frequency and the other time, or rather the sequential frames of the spectrogram. The following characteristics are often extracted for this purpose: local, global, prosodic and qualitative characteristics. It is shown that the change of emotions on the face or in speech is connected with the internal reaction of the person to the questions posed. For emotional state estimation from facial and speech data, convolutional neural networks are used at the stage of identifying micro emotions and changes of voice characteristics. The decision on potential threats based on the estimated emotional state is made by an ensemble of classifiers.

                   Keywords
                   Hybrid intelligence, emotional state estimation, hybrid convolutional neural networks, mel-spectrogram, facial and speech features, decision making.


ISIT 2021: II International Scientific and Practical Conference «Intellectual Systems and Information Technologies», September 13–19, 2021, Odesa, Ukraine
EMAIL: svm@nau.edu.ua (A. 1); ryabokonev.andrey@gmail.com (A. 2)
ORCID: 0000-0002-3297-9060 (A. 1)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

    Nowadays, real importance is given to increasing aircraft safety, in particular during passenger control. Commonly, the number of people per security officer is too high to deal with them in the restricted period of time. The employee of the airline therefore faces a hard task: to ask a number of special questions in order to understand the emotional state of a passenger before admitting them to the flight. The main features that allow this problem to be solved are the emotional changes of the passenger during the control conversation [1].
    In article [1] an intelligent system of micro emotion analysis is considered, which consists of two levels: at the first level a convolutional neural network performs micro emotion recognition, and at the second level a fuzzy classifier solves the problem of deciding on a potential threat based on the estimated emotional state.
    In article [2] an intelligent system for the analysis of musical works is considered, where mel-spectrograms were used as inputs to a convolutional neural network.
    Recent research has shown that it is not enough to take into account only particular features of appearance, because sometimes they can be formed artificially. So, in addition, it is necessary to consider speech state estimation.

2. Review of Existing Solutions

    In a number of studies, the facial emotion of an individual has been recognized by means of computer vision (CV).
During a personal conversation, facial expressions have greater expressive magnitude than the words themselves. Various methods have been used for automatic facial expression recognition (FER or AFER) tasks. Early papers used geometric representations, for example, vector descriptors for the motion of the face [3], active contours for mouth and eye shape retrieval [4], and 2D deformable mesh models [5]. Others used appearance-based representations such as Gabor filters [6] or local binary patterns (LBP) [7]. These feature extraction methods were usually combined with one of several regressors to translate the feature vectors into emotion classes or action unit detections. The most popular regressors used in this context were support vector machines (SVM) and random forests. Many descriptive approaches to the interaction forms of emotions are included in the classification of the input data, and the CNN is an effective deep learning algorithm for this task.
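    As a rough illustration of the classical appearance-based pipeline described above (LBP features fed to a conventional classifier such as an SVM), the following Python sketch is a generic example rather than the method of any cited paper; the dataset-loading helper load_face_dataset() and its labels are hypothetical.

# Sketch of a classical LBP + SVM facial expression pipeline (illustrative only).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def lbp_histogram(gray_face, P=8, R=1):
    """Uniform LBP histogram of one grayscale face crop."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                      # number of distinct "uniform" codes
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# load_face_dataset() is a hypothetical helper returning grayscale crops and emotion labels.
faces, labels = load_face_dataset()
X = np.array([lbp_histogram(face) for face in faces])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
classifier = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", classifier.score(X_test, y_test))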
    Current research in the field of classifying the user's emotional state based on voice focuses mainly on experiments with different classifiers and characteristics and on finding their best combination. The relatively small number of available recordings of emotions (databases) that could be used to create a classifier has proven problematic, as has the fact that people in real situations tend to suppress their emotions and not express them fully. Another obstacle to a universal solution is the human voice itself, which can be influenced by many factors, e.g. gender, age, state of health, etc.
    An important step in designing an emotion recognition system is to recognize the facial micro changes that effectively characterize the various emotions and to extract useful properties from the voice. For these purposes the following characteristics are extracted: facial movements (unitary movements performed by a group of muscles: tightening the cheeks, stretching the eyelids, raising the wings of the nose, raising the upper lip, deepening the nasolabial fold, raising the corners of the lips, dimpling the lips, lowering the corners of the mouth, lowering the lower lip, pulling off the lips) [8] and speech characteristics (local, global, prosodic, qualitative and spectral characteristics).

2.1. Facial Movement Characteristics

    Each manifestation of a person's facial emotions can be described by a set of descriptors. Besides the apparent facial changes, micro emotions also occur; they can be taken into account in more complicated recognition approaches. Table 1 describes the main facial changes for the six standard types of emotions [9].

Table 1
Relations of emotional facial feature changes

  Emotion       Eyebrows                  Mouth
  Surprise      Rise                      Opens
  Fear          Rise and wrinkle          Opens and stretches
  Disgust       Lower                     Rises, ends lower
  Anger         Lower and wrinkle         Opens, ends lower
  Happiness     Bend down                 Ends rise
  Sadness       Outer ends lower          Ends lower

    Motion units of the person can conditionally be divided into three groups; a sketch of the dynamic case is given after the list.
    1. Static – recognition is possible using only a photo.
    2. Dynamic – continuous frame changing, key point initialization or obtaining the average value of the distances between motion units is required.
    3. Empty – they actively participate in the manifestation of emotions but are not registered by search algorithms (e.g. dimples on the cheeks).
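    As a small illustration of the dynamic group, the Python fragment below assumes facial landmarks have already been detected for every frame (a 68-point layout is used here purely as an assumption) and tracks the distance between two key points over time.

# Sketch: tracking a distance between facial key points across frames (dynamic motion units).
# `landmarks` is an assumed array of shape (n_frames, 68, 2) from any face landmark detector.
import numpy as np

LEFT_MOUTH_CORNER, RIGHT_MOUTH_CORNER = 48, 54   # indices in the common 68-point scheme

def mouth_corner_distance(landmarks):
    """Per-frame Euclidean distance between the two mouth corners."""
    left = landmarks[:, LEFT_MOUTH_CORNER, :]
    right = landmarks[:, RIGHT_MOUTH_CORNER, :]
    return np.linalg.norm(left - right, axis=1)

def dynamic_feature(landmarks):
    """Average distance and average frame-to-frame change, as used for dynamic units."""
    d = mouth_corner_distance(landmarks)
    return d.mean(), np.mean(np.abs(np.diff(d)))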
    Now it is possible to review recognition methods for the human emotional state that use methods of calculating the shapes of objects and methods of calculating the dynamics of objects (Table 2) [10].
    Face detection algorithms can be divided into four categories [11]: the empirical method; the method of invariant signs; recognition using a template implemented by the developer; detection based on external signs (training systems).
    The main stages of the algorithms of the empirical approach are: locating the parts of the face (eyes, nose, mouth) in the image; detecting the borders of the face, its shape, brightness, texture and color; combining all found invariant signs and verifying them.
Table 2
Methods for facial emotional state recognition of a human face

  Methods                  Holistic methods                  Local methods
  Methods for shape        Classifiers: artificial neural    Classifiers: artificial neural
  calculation              network, Random Forest,           network, Bayes classifier,
                           AdaBoost; Gabor filters;          AdaBoost; geometric face
                           2D face models: AAM, ASM,         models; eigenvectors: PCA;
                           EBGM                              local histograms: HoG, LBP
  Methods for dynamics     Optical flow, dynamic             3D dynamic models;
  calculation              models                            statistical models: HMM, DBN

    A shortcoming of this approach is that it is very sensitive to the degree of inclination and turn of the head.
    These approaches were implemented in the following software for processing video images of a human face subject to emotions [10]: Face Reader, Emotion Software, the GladOrs application and Face Analysis System.

2.2. Voice Characteristics

    Consider the speech characteristics. Local characteristics are defined as the energy or frequency of the separate frames that form the speech signal. Global characteristics (maximum, minimum, variance, mean, standard deviation, sharpness, skew and other similar values) are calculated statistically from the local characteristics. These values are then combined into a single global characteristics vector [12]. Global characteristics are effective only in distinguishing between energetic and low-energy emotions (e.g. anger and sadness), but fail to distinguish emotions that manifest themselves with similar energy (e.g. anger and joy) [13].
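    To illustrate how such a global characteristics vector can be assembled from local (frame-level) values, here is a sketch using Librosa for the frame features and SciPy for the statistics; the file name, frame sizes and the particular statistics retained are illustrative assumptions.

# Sketch: frame-level (local) features summarized into a global feature vector.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical recording of one answer

# Local characteristics: short-time energy (RMS) and a per-frame fundamental-frequency estimate.
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256)[0]
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, frame_length=1024)

def stats(values):
    """Global statistics computed over a sequence of local values."""
    return [np.max(values), np.min(values), np.var(values), np.mean(values),
            np.std(values), skew(values), kurtosis(values)]

# Global characteristics vector: statistics of each local feature concatenated together.
global_vector = np.array(stats(rms) + stats(f0))
print(global_vector.shape)                          # (14,)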
    Prosodic characteristics are based on the concept of prosody. Prosody (ancient Greek προσῳδία – stress, chorus; also prosodics) is a section of phonetics that considers such features of pronunciation as pitch, strength/intensity, duration, aspiration, glottalization, palatalization, the type of concordance of a consonant to a vowel and other features that are additional to the main articulation of sound [14]. Within the framework of prosody, both the subjective level of perception of the characteristics of suprasegmental units (pitch, strength/loudness, duration) and their physical aspect (frequency, intensity, time) are studied [15].
    These characteristics are thought to carry useful information for recognizing emotions [16], because longer sound units are characterized by rhythm, intonation, emphasis and pauses in speech [17], or by the tempo of speech, relative duration and intensity [18]. The intensity is often measured as the sound pressure level [19].
    The usage of qualitative characteristics is based on the assumption that the emotional content of speech is related to the quality of the voice [13]. By changing the qualitative characteristics of one's voice it is possible to reveal important information, e.g. intentions, emotions and attitudes [18]. Qualitative characteristics are closely related to prosodic characteristics. They include jitter, shimmer and other microprosodic phenomena that reflect properties of the voice such as shortness of breath and hoarseness [20]. Jitter refers to fluctuations of the fundamental frequency. There are several methods for calculating this perturbation; the simplest is the average jitter, defined as the average absolute difference in the lengths of consecutive periods. Jitter is usually expressed as a percentage. Amplitude perturbation (shimmer) is defined as fluctuations in the amplitudes of adjacent periods. As with jitter, there are many calculation methods for shimmer; the most common is the average shimmer – the average absolute difference in the amplitudes of consecutive periods [21].
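    A minimal numerical sketch of the average jitter and average shimmer defined above, assuming the pitch-period lengths and per-period amplitudes have already been extracted (for example with Praat or Parselmouth); the relative (percentage) normalization and the placeholder data are assumptions, not values from the paper.

# Sketch: average jitter and average shimmer from already-extracted pitch periods.
import numpy as np

def average_jitter(period_lengths):
    """Mean absolute difference of consecutive period lengths, expressed as a
    percentage of the mean period length (one common convention)."""
    p = np.asarray(period_lengths, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def average_shimmer(period_amplitudes):
    """Mean absolute difference of consecutive period amplitudes, expressed as a
    percentage of the mean amplitude."""
    a = np.asarray(period_amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

# Placeholder data: roughly a 125 Hz voice with small random perturbations.
periods = 0.008 + 0.0001 * np.random.randn(200)
amplitudes = 1.0 + 0.03 * np.random.randn(200)
print(average_jitter(periods), average_shimmer(amplitudes))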
    Spectral characteristics describe the spectrum of speech above the fundamental frequency – for example, harmonic and formant frequencies. Harmonic frequencies are integer multiples of the fundamental frequency: the second harmonic frequency is 2·F0, the third harmonic frequency is 3·F0, etc. Formant frequencies are amplifications of certain frequencies in the spectrum. A formant is a phonetic term that denotes an acoustic characteristic of speech sounds (primarily vowels), associated with the level of the frequency of the voice tone and forming the timbre of the sound.
    The spectrogram can be obtained by using a short-term Fourier transform. For the extraction of these five basic types of voice characteristics different software is used: openSMILE, PortAudio, Praat, Parselmouth, Librosa, pyAudioAnalysis.
    A mel-spectrogram can be used as a spectral characteristic (mel is a psychophysical unit for measuring the pitch of sound, a quantitative assessment of pitch based on the statistical processing of a large amount of data on the subjective perception of the pitch of sound tones). The mel-spectrogram is obtained by applying a set of overlapping triangular windows to the frequency spectrogram obtained by the discrete Fourier transform – X_k, k = 1, ..., N, where N is the number of signals of different frequencies that form the spectrogram [2]. The sound recording of the speech is first divided into short frames of equal length. By applying the Fourier transform, a spectrum (the frequencies present in the frame) is obtained from each frame. The spectrogram is then created by visualizing the changes of the spectrum over time. In article [2] a mel-spectrogram, represented by a two-dimensional matrix of real numbers, was used as the input to a convolutional neural network.
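    Since Librosa is among the tools listed above, the computation just described can be sketched with it as follows; the file name, sampling rate, FFT size and number of mel bands are illustrative assumptions.

# Sketch: mel-spectrogram as a two-dimensional input for a convolutional network.
import numpy as np
import librosa

y, sr = librosa.load("answer.wav", sr=16000)        # hypothetical recording of one answer

# Short-term Fourier transform -> power spectrogram -> overlapping triangular mel filters.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)       # log scale, as usually fed to a CNN

print(mel_db.shape)   # (n_mels, n_frames): a two-dimensional matrix of real numbers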
3. Hybrid Intelligence System of Emotional Facial and Speech State Estimation

    Section 2 of this work pointed out the use of convolutional neural networks for emotional facial and speech state estimation. However, as indicated in a number of studies, the use of convolutional networks of standard topology does not always lead to a correct assessment of emotions when processing video and speech signals. This leads to the need to develop new topologies of convolutional neural networks (CNN), in particular hybrid convolutional neural networks (HCNN).
    A characteristic feature of modern CNNs is the presence of unique blocks that determine their essential features, for example: the squeeze-and-excitation block, the convolutional attention module, the channel attention module, the spatial attention module, the residual block, the inception module, the ResNeXt block [22]. Thus, to build an HCNN, one can use the various unique blocks inherent in the CNNs of the same name.
    As a result, we have the problem of structural-parametric synthesis of the HCNN, the solution of which is to determine the types of unique blocks, their locations in the structure of the HCNN, their connections with other blocks, the types of activation functions, the values of the weight coefficients, etc.
    In the general case [23], an HCNN consists of S stages, and the s-th stage, s = 1, 2, ..., S, contains K_s nodes, denoted v_{s,k_s}, k_s = 1, 2, ..., K_s. The nodes within each stage are ordered, and we only allow connections from a lower-numbered node to a higher-numbered node. Each node corresponds to a unique block. It is assumed that the geometric dimensions (width, height and depth) of the stage cube remain unchanged within each stage. Neighboring stages are connected via a spatial pooling operation, which may change the spatial resolution. The structure of the HCNN represents the alternation of two unique blocks followed by a pooling layer. All convolution layers in one stage have the same number of filters or channels. To solve the problem of structural-parametric synthesis, a genetic algorithm is used, or a multicriteria genetic algorithm if, in addition to the accuracy criterion, a criterion of minimal complexity is used during the training of the HCNN. We do not encode the fully connected part of the network. In each stage, K_s(K_s − 1)/2 bits are used to encode the inter-node connections. The first bit represents the connection between (v_{s,1}, v_{s,2}), the following two bits represent the connections (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}), etc. This process continues until the last K_s − 1 bits are used to represent the connections between v_{s,1}, v_{s,2}, ..., v_{s,K_s−1} and v_{s,K_s}. For 1 ≤ i < j ≤ K_s, if the code corresponding to (v_{s,i}, v_{s,j}) is 1, there is an edge connecting v_{s,i} and v_{s,j}, i.e. v_{s,j} takes the output of v_{s,i} as a part of the element-wise summation, and vice versa.
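    This inter-node connection encoding, borrowed from [23], can be made concrete with a short decoder that follows the bit order described above; the example code string is arbitrary.

# Sketch: decode the K_s(K_s-1)/2-bit string of one stage into node connections.
def decode_stage(bits, k_s):
    """Return a dict mapping node j to the lower-numbered nodes i (i < j) connected to it.
    Bit order follows the text: (1,2), then (1,3),(2,3), then (1,4),(2,4),(3,4), ..."""
    assert len(bits) == k_s * (k_s - 1) // 2
    connections = {j: [] for j in range(1, k_s + 1)}
    idx = 0
    for j in range(2, k_s + 1):          # target (higher-numbered) node
        for i in range(1, j):            # all lower-numbered candidate sources
            if bits[idx] == "1":
                connections[j].append(i)
            idx += 1
    return connections

# Example: a stage with K_s = 4 nodes and the code "101001".
print(decode_stage("101001", 4))         # {1: [], 2: [1], 3: [2], 4: [3]}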
    Additional training of the HCNN was performed using the Adam optimizer with a learning rate of 0.00005.
    Because the Hybrid Intelligence System of Emotional Facial and Speech State Estimation contains two channels of information, micro changes in facial expression and voice, it is necessary to have two HCNNs, each of which decides on the expressed emotions, for example, when answering questions.
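    To make the two-channel arrangement concrete, here is a minimal sketch in PyTorch (the paper does not name a framework, so this choice is an assumption) with one convolutional branch for face images and one for mel-spectrograms, each fine-tuned with Adam at the quoted learning rate of 0.00005; the layer sizes and the seven-class output are placeholders.

# Sketch: two HCNN channels (face images and mel-spectrograms), each trained with Adam.
# The branch architectures are placeholders; only the optimizer and learning rate follow the text.
import torch
import torch.nn as nn

def small_cnn(in_channels, n_classes=7):
    """Toy convolutional branch standing in for one HCNN channel."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(32, n_classes),
    )

face_net = small_cnn(in_channels=3)      # channel 1: facial micro-expression images
voice_net = small_cnn(in_channels=1)     # channel 2: mel-spectrograms

# Additional training of each channel with the Adam optimizer, learning rate 0.00005.
face_optimizer = torch.optim.Adam(face_net.parameters(), lr=5e-5)
voice_optimizer = torch.optim.Adam(voice_net.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

def train_step(net, optimizer, batch, labels):
    """One optimization step for a single channel."""
    optimizer.zero_grad()
    loss = loss_fn(net(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()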
4. Results

    The results of estimating a person's emotional state from facial and speech state estimation depend strongly on the quality of the training sample and differ between emotions. For example, each of the 7 emotional states was correctly identified in more than 65% of cases. Facial state estimation alone gave good results only for separate states (Fig. 1).

Figure 1: Facial expression recognition example obtained using HCNN

    These results require additional experiments.

5. Conclusions

    In this work an effective approach for recognizing the emotional state from images of the human face and from mel-spectrograms using digital image analysis is proposed. Ways of applying hybrid convolutional neural networks to the assigned task are developed, and algorithms of digital image processing are applied. Because the Hybrid Intelligence System of Emotional Facial and Speech State Estimation contains two channels of information, micro changes in facial expression and voice, it is necessary to have two HCNNs. The given approach has an acceptable recognition level and good enough accuracy. The system can be successfully applied for security purposes in airports and is able to increase the security level.

6. References

[1] V. Sineglazov, R. Panteev, I. Boryndo, Intelligence system for emotional facial state estimation during inspection control, in: International Scientific-Practical Conference 2019, 19–24 August, Odessa, Ukraine, 2019, pp. 202–206.
[2] V. Sineglazov, O. Chumachenko, V. Patsera, Intellectual system of analysis of musical works, in: Proceedings of the International Scientific Conference 2020, 20–25 May, Ivano-Frankivsk, 2020, pp. 44–47.
[3] I. Cohen, N. Sebe, A. Garg, L. S. Chen, T. S. Huang, Facial expression recognition from video sequences: temporal and static modeling, Computer Vision and Image Understanding 91 (1) (2003) 160–187.
[4] P. S. Aleksic, A. K. Katsaggelos, Automatic facial expression recognition using facial animation parameters and multistream HMMs, IEEE Transactions on Information Forensics and Security 1 (1) (2006) 3–11.
[5] I. Kotsia, I. Pitas, Facial expression recognition in image sequences using geometric deformation features and support vector machines, IEEE Transactions on Image Processing 16 (1) (2007) 172–187.
[6] G. Littlewort, M. S. Bartlett, I. Fasel, J. Susskind, J. Movellan, Dynamics of facial expression extracted automatically from video, Image and Vision Computing 24 (6) (2006) 615–625.
[7] C. Shan, S. Gong, P. W. McOwan, Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[8] A. Woubie, J. Luque, J. Hernando, Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System, in: ODYSSEY 2016 – The Speaker and Language Recognition Workshop, 2016, pp. 400–406.
[9] P. Ekman, W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto, 1978.
[10] D. Stutz, Introduction to Neural Networks, Seminar on Selected Topics in Human Language Technology and Pattern Recognition, 2014.
[11] D. A. Tatarenkov, Analysis of face recognition methods on images, 2015, p. 270.
[12] Y. Gao, B. Li, N. Wang, T. Zhu, Speech Emotion Recognition Using Local and Global Features, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag, 2017, pp. 3–13. https://doi.org/10.1007/978-3-319-70772-3_1.
[13] M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44 (2011) 572–587. https://doi.org/10.1016/j.patcog.2010.09.020.
[14] N. D. Svetozarova, Prosody, in: Great Russian Encyclopedia [in 35 vols.], vol. 27, ed. Yu. S. Osipov, Bolshaya Rossiyskaya Entsiklopediya, Moscow, 2015, p. 614. ISBN 978-5-85270-364-4. (Accessed 10 April 2020).
[15] A. M. Antipova, Prosody, in: Linguistic Encyclopedic Dictionary, ed. V. N. Yartseva, Sovetskaya Entsiklopediya, Moscow, 1990, 685 p. ISBN 5-85270-031-2. (Accessed 10 April 2020).
[16] N. Sato, Y. Obuchi, Emotion Recognition using Mel-Frequency Cepstral Coefficients, Journal of Natural Language Processing 14 (2007) 83–96. https://doi.org/10.5715/jnlp.14.4_83.
[17] A. Woubie, J. Luque, J. Hernando, Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System, in: ODYSSEY 2016 – The Speaker and Language Recognition Workshop, 2016, pp. 400–406.
[18] P. Gangamohan, S. R. Kadiri, B. Yegnanarayana, Analysis of Emotional Speech – A Review, in: Intelligent Systems Reference Library, Springer Science and Business Media Deutschland GmbH, 2016, pp. 205–238. https://doi.org/10.1007/978-3-319-31056-5_11.
[19] L. L. Beranek, T. J. Mellow, Acoustics: Sound Fields, Transducers and Vibration, Elsevier, 2019. https://doi.org/10.1016/C2017-0-01630-0.
[20] A. Batliner, B. Schuller, D. Seppi, S. Steidl, L. Devillers, L. Vidrascu, T. Vogt, V. Aharonson, N. Amir, The Automatic Recognition of Emotions in Speech, in: Cognitive Technologies, Springer Verlag, 2011, pp. 71–99. https://doi.org/10.1007/978-3-642-15184-2_6.
[21] J. M. Hillenbrand, Acoustic Analysis of Voice: A Tutorial, Perspectives on Speech Science and Orofacial Disorders 21 (2011) 31–43. https://doi.org/10.1044/ssod21.2.31.
[22] V. Sineglazov, A. Kot, Design of hybrid neural networks of the ensemble structure, Eastern-European Journal of Enterprise Technologies 1 (4 (109)) (2021): Mathematics and Cybernetics – applied aspects. https://doi.org/10.15587/1729-4061.2021.225301.
[23] L. Xie, A. Yuille, Genetic CNN, arXiv:1703.01513 [cs.CV], 2017.