Extreme Learning Machines For Efficient Speech Emotion
Estimation In Julia
Georgios Drakopoulos∗, Phivos Mylonas
Department of Informatics, Ionian University, Tsirigoti Sq. 7, Kerkyra 49100, Hellas


Abstract
Speech is a mainstay of communication across virtually all human activities. Besides facts and statements, speech carries substantial information regarding experiences, thoughts, and emotions, therefore adding significant context. Moreover, non-linguistic elements such as pauses add more to the message. The field of speech emotion recognition (SER) has emerged precisely to develop algorithms and tools performing what humans learn to do from early on. One promising line of research comes from applying deep learning techniques trained on numerous audio attributes to discern between various emotions as dictated by a given model of fundamental human emotions. Extreme learning machines (ELMs) are neural network architectures achieving efficiency through simplicity, and they can potentially operate akin to a sparse coder. When trained on a plethora of audio attributes, such as cepstral coefficients, zero crossing rate, and autocorrelation, an ELM can classify emotions in speech based on the established emotion wheel model. The evaluation, done with the Toronto emotional speech set (TESS) on an ELM implemented in Julia, is quite encouraging.

                                          Keywords
                                          extreme learning machine, speech emotion recognition, emotion classification, Plutchik model, higher order patterns,
                                          spectrogram, cepstral coefficients, zero crossing rate, TESS dataset, Julia



CIKM'22: 31st ACM International Conference on Information and Knowledge Management (companion volume), October 17–21, 2022, Atlanta, GA
∗ Corresponding author.
Email: c16drak@ionio.gr (G. Drakopoulos); fmylonas@ionio.gr (P. Mylonas)
ORCID: 0000-0002-0975-1877 (G. Drakopoulos); 0000-0002-6916-3129 (P. Mylonas)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Language, whether oral or written, is among the major sources of human emotion and perhaps a mainstay of civilization itself. Almost since its formulation, the field of speech emotion recognition (SER) has been a demanding field systematically garnering intense interdisciplinary interest, since it aims to answer fundamental questions regarding human speech, which includes major elements such as intonation and pitch as well as latent and non-linguistic elements such as pauses and the length of sentences. Because of the complexity and volatility of human speech, SER relies heavily on machine learning (ML) and recently on deep learning (DL) techniques for performing its tasks.

Human emotion models such as the emotion wheel by Plutchik [1] and the universal emotion models [2] have been developed to explain not only which emotions are fundamental, with interpretations ranging from social conditioning to brain functionality and evolutionary goals, but also how they are composed, which may well entail non-linear operations. In any case, such models can serve well as training guides to ML models for speech emotion classification, as is the case here.

Among the various models proposed for the various SER tasks, extreme learning machines (ELMs) have shown considerable potential. The latter can be at least partially attributed to the ELM structure, which has only a single but very long hidden layer. In turn, this allows for straightforward and easy to interpret training schemes, all of which eventually stem from a synaptic weight regularization property. This is aligned with the intuition that a certain optimality condition should hold in order for the weights to be uniquely derived.

The primary research contribution of this conference paper is the development of an ELM implemented in Julia and operating like a sparse encoder for the emotion classification of sentences coming from the ubiquitous Toronto emotional speech set (TESS) collection, a benchmark for training ML and DL models for SER tasks.

The remainder of this work is structured as follows. The recent scientific literature regarding ELMs, SER, and graph mining is briefly reviewed in section 2. In section 3 the proposed methodology is described, whereas the results obtained using the TESS dataset are analysed in section 4. Possible future research directions are given in section 5. Bold capital letters denote matrices, bold small letters vectors, and small letters scalars. Acronyms are explained the first time they are encountered in the text. Finally, the notation of this work is summarized in table 1.

Table 1
Notation Summary

    Symbol      Meaning                        First in
    ≜           Equality by definition         Eq. (1)
    ‖⋅‖         Vector or matrix norm          Eq. (6)
    φ(⋅)        Activation function            Eq. (3)
    tanh(⋅)     Hyperbolic tangent function    Eq. (3)
    tr(⋅)       Matrix trace                   Eq. (7)


2. Related Work

Because of its interdisciplinary nature, SER has been the focus of attention of a number of fields [3].
To address the inherent complexity of the SER tasks, ML approaches such as ensemble learning [4], deep convolutional neural networks [5], domain invariant feature learning [6], two-dimensional convolutional neural networks [7], and multimodal deep learning [8] have been proposed in the scientific literature. Human emotion models such as the emotion wheel by Plutchik [1] or the universal emotion theory by Ekman [2] typically describe a fundamental set of emotions [9, 10, 11] along with composition rules and possible evolutionary explanations for them [12]. More recently, personality taxonomies go beyond single emotional reactions and treat personality as a whole, such as the Myers-Briggs type indicator (MBTI). A reasoning based framework for emotion classification is presented in [13].

ELMs have been used in ML because of the simplicity of their architecture [14]. They have been used as part of ML pipelines for wavelet transforms [15], in conjunction with an autoencoder for predicting the concentration of greenhouse gases emitted from boilers [16], and in optimizing a Kalman filter for determining the aging factors of supercapacitors [17]. Further applications include estimating soil thermal conductivity [18] and an evolving kernel assisted ELM for medical diagnosis [19], whereas an extensive list of applications is given in [20].

Graph mining is a field relying heavily on ML [21] and graph signal processing techniques [22]. Regarding the use of ML, self organizing maps (SOMs) for recommending cultural content are presented in [23], exploiting natural language attributes for finding linked requirements between software artefacts is the focus of [24], decompressing a sequence of Twitter graphs compressed with the two-dimensional discrete cosine transform using a tensor stack network (TSN) is described in [25], combining graph mining with transformers is shown in [26], advanced graph clustering techniques for classifying variations of cancer genomes are developed in [27], message passing graph neural networks for fuzzy [28] and ordinary Twitter graphs [29] are described, a GPU-based system for efficient graph mining is shown in [30], partitioning the user base of a portal for cultural content recommendation is explained in [31], visualizing massive graphs for human feedback is described in [32], approximating directed graphs with undirected ones under optimality conditions is shown in [33], classification of noisy graphs is given in [34], sequential graph collaborative filtering is the topic of [35], mining hot spots in trajectories with graph based methodologies is developed in [36], and fMRI image classification with tensor distance metrics is the focus of [37].


3. Methodology

3.1. Emotion Model

Emotion models have been developed in order to explain how emotions work, their intensity and eliciting conditions, how they may be composed in the case of emotion levels, and possibly their evolutionary purpose. In this set of models the one proposed by Plutchik has been among the earliest and one of the most commonly used in engineering applications. Additionally, it has an easy to understand and intuitive visual interpretation, which is shown in figure 1. Notice that this figure depicts a two dimensional projection of a cone.

Figure 1: Plutchik model (From Wikipedia).

According to this model each emotion corresponds to a location in a circle which is primarily a function of its valence as well as of its direction. The latter is related to the nature of the emotion under consideration, which also determines at least in part its polarity. Specifically, there are in total eight directions with three scales each. Moreover, there are some emotions which are combinations of others from two directions. Furthermore, the set of emotions is categorized as basic, primary, secondary, and tertiary. Primary emotions are archetypes which the remaining ones are patterned after or derived from. They are characterized by especially high survival value.
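Purely as an illustration of this structure, the following minimal Julia sketch encodes the primary emotions together with their polarity and bipolar opposite, mirroring table 2 below; the symbol names and representation are editorial assumptions, not part of the paper's implementation.

    # Illustrative encoding of the primary emotions of Plutchik's model
    # with their polarity and bipolar opposite (cf. table 2); the names
    # and layout are assumptions for exposition only.
    const PRIMARY_EMOTIONS = Dict(
        :neutral      => (polarity = :neutral,    opposite = :neutral),
        :surprise     => (polarity = :contextual, opposite = :anticipation),
        :anticipation => (polarity = :contextual, opposite = :surprise),
        :joy          => (polarity = :positive,   opposite = :sadness),
        :trust        => (polarity = :positive,   opposite = :disgust),
        :anger        => (polarity = :negative,   opposite = :fear),
        :sadness      => (polarity = :negative,   opposite = :joy),
        :disgust      => (polarity = :negative,   opposite = :trust),
        :fear         => (polarity = :negative,   opposite = :anger),
    )

    # Bipolarity is symmetric: each emotion is the opposite of its opposite.
    @assert all(PRIMARY_EMOTIONS[v.opposite].opposite == k
                for (k, v) in PRIMARY_EMOTIONS)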






Table 2
Primary Emotions In Plutchik's Model

    Emotion         Polarity                Opposite
    Neutral         Neutral                 Neutral
    Surprise        Positive or negative    Anticipation
    Anticipation    Positive or negative    Surprise
    Joy             Positive                Sadness
    Trust           Positive                Disgust
    Anger           Negative                Fear
    Sadness         Negative                Joy
    Disgust         Negative                Trust
    Fear            Negative                Anger

As stated earlier, each of the above emotions has an associated emotional polarity. For most emotions this polarity is clear, although in some cases, such as surprise, it has to be determined by the context. The intensity of each emotion essentially determines its location on a given affective axis. The higher the intensity, the more emotional a person is at a given time.

The primary emotions in the model of Plutchik are shown in table 2. Their location on the intensity scale and their relationship to other emotions are shown in figure 1. Moreover, the primary emotions come in four bipolar opposite pairs in the sense that the two emotions of a pair accomplish opposite objectives and their physical manifestations are considerably different. Bipolarity does not necessarily mean that in every pair there is one feeling with positive polarity, though this may be the case, as in the pair of joy and sadness. Instead, both emotions in the anger and fear pair are perceived as negative, yet they are diametric opposites in the context of fight or flight.

3.2. Training

ELM training is simpler than that of other neural network architectures, since an ELM has only one hidden layer with a large number p of processing neurons. With proper training, each neuron can be specialized in a particular subset of the training set, which comprises n data vectors. In this case the ELM output matrix H has the elementwise structure of equation (1). From its structure, the i-th row of H contains the output of each neuron for the i-th data point, 0 ≤ i ≤ n − 1, while the j-th column consists of the output of the j-th neuron, 0 ≤ j ≤ p − 1, across all the available data points, preserving the order in which they were given to the ELM.

        ⎡ h₀(x₀)       h₁(x₀)       …   h_{p−1}(x₀)     ⎤
    H ≜ ⎢ h₀(x₁)       h₁(x₁)       …   h_{p−1}(x₁)     ⎥ ∈ ℝ^{n×p}        (1)
        ⎢    ⋮             ⋮        ⋱        ⋮          ⎥
        ⎣ h₀(x_{n−1})  h₁(x_{n−1})  …   h_{p−1}(x_{n−1})⎦

The individual output of the j-th neuron can be computed from the nonlinear combination of equation (2). Therein q is the number of input neurons, which is much smaller than the number p of hidden neurons, namely q ≪ p, and equal to the dimensionality of each data point.

    h_j(x_i) ≜ φ_j( Σ_{k=1}^{q} w_{j,k} x_i[k] ) = φ_j( w_jᵀ x_i )        (2)

The nonlinear activation function φ_j(⋅) may take a number of forms, such as the logistic function or polynomial kernels. In this case it is the hyperbolic tangent function of (3). It has the advantage of being differentiable and of being the Bayes estimator of a bipolar source under additive white Gaussian noise (AWGN).

    φ_j(x; β₀) ≜ tanh(β₀ x)        (3)

The first derivative ψ of φ can be expressed as a second order polynomial of the latter, as shown in (4). This expression is also that of the Malthus population models.

    ψ(x; β₀) ≜ ∂φ(x; β₀)/∂x = β₀ (1 − φ²(x; β₀))        (4)

The column synaptic weight vector w_j is formed by stacking the q weights w_{j,k} connecting the j-th hidden neuron with the k-th input one. Moreover, this is also the j-th column of the synaptic weight matrix W. If the data points are stacked on top of each other, then the input matrix X is formed. Thus H of (1) can be rewritten as in (5), where the function of (3) is applied elementwise.

    H ≜ φ(W Xᵀ)        (5)

In general, ELMs, depending on their training formulation, can perform regularized least squares fitting in order to determine the optimal weights as in (6), where η₀ is a hyperparameter. Therein the regularization term adds robustness to the algorithmic minimization process. To this end, the nonlinear least squares problem of (6) was formulated, where the Frobenius matrix norm is used since it is differentiable. Also, Y is the ground truth matrix containing the one hot encoding of the eight primary emotions and W∗ is the solution.

    W∗ ≜ argmin_W [J] = argmin_W [ η₀ ‖W‖²_F + ‖φ(W Xᵀ) − Y‖²_F ]        (6)

Expanding (6) and taking into consideration the expansion of the Frobenius norm, the objective function J to be minimized can be recast as in (7). Because of the form of the Frobenius norm and that of the nonlinear activation function, J is not only differentiable but also has a single global minimum. Additionally, the regularization term ensures that synaptic weight sparsity is also taken into consideration. Thus, minimizing J translates into finding the weight set achieving a tradeoff between fitting the ELM response to the target response and doing so with the least possible energy. The latter can be considered as the explanation closest to that dictated by Occam's razor.

    J = η₀ tr(Wᵀ W) + tr( (φ(W Xᵀ) − Y)ᵀ (φ(W Xᵀ) − Y) )        (7)

The minimization problem of (7) is a regularized nonlinear least squares problem. The hyperparameter η₀ determines the relative weight of the synaptic weight matrix sparsity compared to how well the ELM response matches the target response. The problem of (7) can be solved by a plethora of methodologies, including iterative ones such as fixed point methods. However, they should take into consideration the nonlinear term introduced by the activation function. This can be accomplished by utilizing methods such as the Gauss-Newton one or a regularized version thereof. In this work the steepest descent iterative method was selected because of its simplicity and because of the single global minimum of J, since the latter is essentially a sum of squares.

Furthermore, it can be argued that the proposed ELM, if properly trained, operates like a sparse coder with each activation neuron corresponding to a single emotion. This approach can clearly be extended to an arbitrary number of emotions, provided of course that the appropriate attributes are available. However, the ELM proposed here can in fact discover the emotional direction and not the valence itself.
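To make the training scheme concrete, the following minimal Julia sketch implements the forward map of equations (1)–(5) and one steepest descent step on the objective J of (6)–(7). It is an editorial illustration under stated assumptions (an n×p orientation for H, a p×q weight matrix, and illustrative values for β₀, η₀, and the step size), not the paper's actual code.

    using LinearAlgebra

    # Minimal ELM state: W is a p×q synaptic weight matrix (cf. section 3.2)
    # and β0 parametrizes the hyperbolic tangent activation of equation (3).
    struct ELM
        W::Matrix{Float64}
        β0::Float64
    end

    # Hidden responses of equations (1), (2), and (5): each row of X is one
    # q-dimensional data point and H = φ(XWᵀ) ∈ ℝ^{n×p}, applied elementwise.
    hidden(m::ELM, X) = tanh.(m.β0 .* (X * m.W'))

    # Regularized objective J of equations (6) and (7); norm is Frobenius here.
    objective(m::ELM, X, Y, η0) = η0 * norm(m.W)^2 + norm(hidden(m, X) - Y)^2

    # One steepest descent step on J. By the chain rule and equation (4),
    # ∇J = 2η0·W + 2·((H − Y) ⊙ β0(1 − H²))ᵀ X, where ⊙ is elementwise.
    function descend!(m::ELM, X, Y; η0 = 1e-2, μ = 1e-3)
        H = hidden(m, X)
        G = 2 .* η0 .* m.W .+ 2 .* ((H .- Y) .* (m.β0 .* (1 .- H .^ 2)))' * X
        m.W .-= μ .* G    # μ is the (illustrative) step size
        return m
    end

Iterating descend! until the objective stalls approximates W∗; with the one hot targets Y, a test sentence would then be assigned the emotion whose neuron attains the maximum response.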







3.3. Attributes

In this subsection the various features used to train the ELM described here, their primary properties, and their respective meaning are explained. Said attributes are also shown in table 3 along with a brief explanation.

The cepstral coefficients c[k] express a modified short term power spectrum of a signal s[k] consisting of speech samples. They are derived by an algorithmic process which involves the following steps:

    • The sequence is pre-emphasized such that higher frequencies receive an energy boost.
    • The spectrum is smoothed with a window, usually a Hamming window of odd length.
    • The power spectrum is translated to the nonlinear Mel scale, where resolution is not constant.
    • The logarithm of said power spectrum, which is always real, is computed.
    • The coefficients of the inverse Fourier spectrum are the cepstral coefficients.

The natural meaning of the cepstral coefficients is that they represent a power spectrum where each frequency band has a resolution roughly inversely proportional to its central frequency. This allows the details of a speech signal to be more discernible.

The spectrogram of a signal is a function of time and frequency and shows how its frequency content evolves in small time steps. Typically it can be obtained by the wavelet transform, by the short time Fourier transform (STFT), or by a bank of bandpass filters such as Gabor and shifted Chebyshev filters. In any case, the resulting heatmap is transformed to a long column vector, which incurs some information loss as the spatial structure is lost. This is attributed to the fact that the proposed ELM is trained with data points which are real vectors. An architecture natively handling matrices may be more adept in this scenario.

The k-th autocorrelation coefficient a[k] of any real-valued stationary sequence s[i] is defined as the expected value of the sequence multiplied by a version of itself shifted by k positions. In practice these stochastic coefficients are often approximated by the sample mean of equation (8) under the assumption of ergodicity. Autocorrelation coefficients are a measure of the self-similarity of the sequence under consideration and play a central role in discovering higher order patterns through the Wiener filter. It should be noted that the higher k is, the less reliable the estimation of a[k] becomes, as fewer term pairs are available. Therefore, k in most engineering applications is small compared to the total length n of the speech sample sequence. As a direct consequence of the Cauchy-Schwarz inequality, the maximum autocorrelation coefficient is the first one, a[0].

    a[k] ≈ (1/(n−k)) Σ_{i=0}^{n−k−1} s[i] s[i+k],   0 ≤ k ≤ n−1        (8)

Finally, the zero crossing rate (ZCR) is an important feature which assumes that the mean value of the speech signal has been subtracted from it during a preprocessing phase. ZCR is closely tied with the primary mode of the Hilbert-Huang spectrum (HHS), which is built on fundamental signals inherent in the sequence. Thus, intuitively speaking, the HHS is very similar to the Fourier spectrum, but it is composed of basis signals progressively extracted from the original signal itself and hence having irregular shapes instead of weighted complex exponentials. In this context ZCR plays a role analogous to that of the fundamental frequency in Fourier analysis.

The audio attributes used in this work are also shown in table 3 along with their interpretation.
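The following Julia fragment sketches how these attributes could be computed. It is a simplified editorial illustration: the Mel-scale warping of the full cepstral pipeline is omitted for brevity, and all function names and defaults are assumptions rather than the paper's code.

    using FFTW   # fft and ifft; everything else is base Julia

    # Simplified real cepstrum of one speech frame: pre-emphasis, Hamming
    # window, log power spectrum, inverse transform. The translation of the
    # spectrum to the Mel scale is omitted in this sketch.
    function cepstrum(frame::Vector{Float64}; α = 0.97)
        n = length(frame)
        s = [frame[1]; frame[2:end] .- α .* frame[1:end-1]]   # pre-emphasis
        w = 0.54 .- 0.46 .* cos.(2π .* (0:n-1) ./ (n-1))      # Hamming window
        P = abs2.(fft(s .* w))                                # power spectrum
        real.(ifft(log.(P .+ eps())))                         # cepstral coefficients
    end

    # Sample estimate of the autocorrelation coefficient a[k] of
    # equation (8), written here with 1-based indexing.
    autocorr(s::Vector{Float64}, k::Int) =
        sum(s[i] * s[i + k] for i in 1:length(s) - k) / (length(s) - k)

    # Zero crossing rate of a zero-mean sequence: the fraction of adjacent
    # sample pairs exhibiting a sign change.
    zcr(s::Vector{Float64}) =
        count(i -> s[i] * s[i + 1] < 0, 1:length(s) - 1) / (length(s) - 1)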






Table 3
Audio Attributes

    Attribute                Meaning
    Cepstral coefficients    Short term windowed power spectrum
    Spectrogram              Frequency content evolution over short time steps
    Autocorrelation          Self-similarity patterns in the speech sequence
    Zero crossing rate       Tied to primary mode of Hilbert-Huang spectrum
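As a usage sketch, the helpers above could be concatenated into a single training vector per recording, mirroring the vectorization described in section 3.3; the number of cepstral coefficients and autocorrelation lags below are arbitrary illustrative choices, and the flattened spectrogram column is again omitted.

    # Illustrative assembly of one ELM input vector from the attributes of
    # table 3; nceps and nlags are arbitrary editorial choices.
    function attribute_vector(s::Vector{Float64}; nceps = 13, nlags = 8)
        c = cepstrum(s)[1:nceps]                   # leading cepstral coefficients
        a = [autocorr(s, k) for k in 0:nlags - 1]  # low-lag autocorrelations
        vcat(c, a, zcr(s))                         # final attribute vector
    end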



Figure 2: Proposed architecture. The TESS attributes feed the input layer, which fans out to hidden layer segments 1 through n.



4. Results

4.1. ELM Architecture

The architecture of the proposed ELM is shown in figure 2. Notice that all hidden neurons belong to the same ELM layer and they are conceptually, but not physically, segmented to show that they are an integer multiple of the neurons of the input layer. Hence the hidden layer can be thought of as comprising segments, although in practice all hidden neurons are trained simultaneously.

The implementation language of choice was Julia. It is a rapidly emerging multiparadigm high level language aiming at computation-heavy tasks such as those frequently encountered in DL and ML scenarios, large database clustering, extensive and fine grained simulations, and graph signal processing.

4.2. Emotion Recognition

The TESS dataset contains 200 target words spoken in the context of a carrier phrase by two actresses, a younger and an older one, aged 26 and 64 respectively. Each recording contains 2000 data points, which are sufficient for processing, and the recordings represent the neutral state plus six of the primary emotions according to Plutchik's model, namely those of anger, disgust, fear, happiness, pleasant surprise, and sadness. Therefore, from the emotions listed in table 2, anticipation and trust are absent. Consequently, from the four pairs of primary bipolar emotions only two are fully present in TESS.

In figure 3 is shown the heatmap resulting from the analysis of the ELM training. From it the following can be immediately inferred:

    • The neutral emotional state is the only one which can be accurately discovered in the context of this work. This can be attributed to the fact that, compared to the other states, there is no valence. In turn this allows its isolation from the rest of the states in the attribute space with a margin sufficient for the ELM to discern it.
    • On the contrary, anger is the most difficult to discover. A possible explanation is that its bipolar opposite emotion is also present in TESS and, thus, certain instances have been misattributed to it. Moreover, anger is also confused with surprise and sadness. The former is possibly due to valence, whereas the latter is because of polarity.
    • Concerning the other bipolar pair of sadness and happiness, they are clearly distinguished from each other, but nevertheless there is a small probability that they will be misclassified respectively as disgust and as pleasant surprise. This can be attributed to their valence as well as to the semantics of each emotion under consideration.






    • The remaining emotions can also be distinguished relatively easily from the others in the dataset. Still, the negative emotions tend to be classified with a lower level of accuracy compared to the positive ones, with the single exception of sadness. This can be explained by their prevalence in TESS.

Figure 3: ELM heatmap.

In summary, the heatmap reveals a performance level which may be satisfactory for certain applications. Still, as negative emotions, with the sole exception of sadness, tend to be less accurately identified compared to the positive ones, there is room for improvement.


5. Conclusions

The focus of this conference paper is the development of an extreme learning machine (ELM) for speech emotion recognition (SER) based on the primary emotions identified in Plutchik's model. Based on a wide array of audio attributes, an ELM is trained to act like a sparse coder with the nine fundamental emotions one-hot encoded in an output vector. The proposed approach is flexible, as the training phase of an ELM is much simpler compared to that of other neural network architectures, especially the fundamental multilayer perceptron. The results obtained with waveforms taken from the established Toronto emotional speech set (TESS) are very encouraging in terms of accuracy.

Regarding future research directions, the proposed neural network architecture and the associated encoding can be tested with other publicly available speech datasets such as Emo-Soundscape or SUSAS. Moreover, the ELM can be adapted to other human emotion models such as the big five or the universal emotion theory. Finally, attribute vectorization can be avoided with architectures capable of natively handling two-dimensional attributes, such as the class of graph neural networks.


Acknowledgments

This conference paper is part of Project 451, a long term research initiative with a primary objective of developing novel, scalable, numerically stable, and interpretable higher order analytics.


References

[1] A. Semeraro, S. Vilella, G. Ruffo, PyPlutchik: Visualising and comparing emotion-annotated corpora, PLoS ONE 16 (2021).
[2] A. Talipu, A. Generosi, M. Mengoni, L. Giraldi, Evaluation of deep convolutional neural network architectures for emotion recognition in the wild, in: ISCT, IEEE, 2019, pp. 25–27.
[3] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive review of speech emotion recognition systems, IEEE Access 9 (2021) 47795–47814.
[4] W. Zehra, A. R. Javed, Z. Jalil, H. U. Khan, T. R. Gadekallu, Cross corpus multi-lingual speech emotion recognition using ensemble learning, Complex & Intelligent Systems 7 (2021) 1845–1854.
[5] S. Kwon, Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network, International Journal of Intelligent Systems 36 (2021) 5116–5135.
[6] C. Lu, Y. Zong, W. Zheng, Y. Li, C. Tang, B. W. Schuller, Domain invariant feature learning for speaker-independent speech emotion recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022) 2217–2230.
[7] Z. Zhao, Q. Li, Z. Zhang, N. Cummins, H. Wang, J. Tao, B. W. Schuller, Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition, Neural Networks 141 (2021) 52–60.
[8] S. Zhang, X. Tao, Y. Chuang, X. Zhao, Learning deep multimodal affective features for spontaneous speech emotion recognition, Speech Communication 127 (2021) 73–81.
[9] P. Sreeja, G. Mahalakshmi, Emotion models: A review, International Journal of Control Theory and Applications 10 (2017) 651–657.
[10] K. R. Scherer, et al., Psychological models of emotion, The Neuropsychology of Emotion 137 (2000) 137–162.







[11] S. Marsella, J. Gratch, P. Petta, et al., Computational models of emotion, A Blueprint for Affective Computing - A Sourcebook and Manual 11 (2010) 21–46.
[12] R. M. Nesse, Evolutionary explanations of emotions, Human Nature 1 (1990) 261–289.
[13] A. Lieto, G. L. Pozzato, S. Zoia, V. Patti, R. Damiano, A commonsense reasoning framework for explanatory emotion attribution, generation and re-classification, Knowledge-Based Systems 227 (2021).
[14] Q.-Y. Zhu, A. K. Qin, P. N. Suganthan, G.-B. Huang, Evolutionary extreme learning machine, Pattern Recognition 38 (2005) 1759–1763.
[15] S. Yahia, S. Said, M. Zaied, Wavelet extreme learning machine and deep learning for data classification, Neurocomputing 470 (2022) 280–289.
[16] Z. Tang, S. Wang, X. Chai, S. Cao, T. Ouyang, Y. Li, Auto-encoder-extreme learning machine model for boiler NOx emission concentration prediction, Energy 256 (2022).
[17] D. Li, S. Li, S. Zhang, J. Sun, L. Wang, K. Wang, Aging state prediction for supercapacitors based on heuristic Kalman filter optimization extreme learning machine, Energy 250 (2022).
[18] N. Kardani, A. Bardhan, P. Samui, M. Nazem, A. Zhou, D. J. Armaghani, A novel technique based on the improved firefly algorithm coupled with extreme learning machine (ELM-IFF) for predicting the thermal conductivity of soil, Engineering with Computers 38 (2022) 3321–3340.
[19] J. Xia, D. Yang, H. Zhou, Y. Chen, H. Zhang, T. Liu, A. A. Heidari, H. Chen, Z. Pan, Evolving kernel extreme learning machine for medical diagnosis via a disperse foraging sine cosine algorithm, Computers in Biology and Medicine 141 (2022).
[20] S. Ding, X. Xu, R. Nie, Extreme learning machine and its applications, NCAA 25 (2014) 549–556.
[21] M. A. Thafar, S. Albaradie, R. S. Olayan, H. Ashoor, M. Essack, V. B. Bajic, Computational drug-target interaction prediction based on graph embedding and graph mining, in: Proceedings of the 2020 10th International Conference on Bioscience, Biochemistry and Bioinformatics, 2020, pp. 14–21.
[22] K. Yamada, Y. Tanaka, Temporal multiresolution graph learning, IEEE Access 9 (2021) 143734–143745.
[23] G. Drakopoulos, I. Giannoukou, S. Sioutas, P. Mylonas, Self organizing maps for cultural content delivery, NCAA (2022). doi:10.1007/s00521-022-07376-1.
[24] M. Singh, Using natural language processing and graph mining to explore inter-related requirements in software artefacts, ACM SIGSOFT Software Engineering Notes 44 (2022) 37–42.
[25] G. Drakopoulos, E. Kafeza, P. Mylonas, L. Iliadis, Transform-based graph topology similarity metrics, NCAA 33 (2021) 16363–16375. doi:10.1007/s00521-021-06235-9.
[26] I. Tyagin, A. Kulshrestha, J. Sybrandt, K. Matta, M. Shtutman, I. Safro, Accelerating COVID-19 research with graph mining and transformer-based learning, in: Conference on Artificial Intelligence, volume 36, AAAI, 2022, pp. 12673–12679.
[27] G. Gomez-Sanchez, L. Delgado-Serrano, D. Carrera, D. Torrents, J. L. Berral, Author correction: Clustering and graph mining techniques for classification of complex structural variations in cancer genomes, Scientific Reports 12 (2022).
[28] G. Drakopoulos, E. Kafeza, P. Mylonas, S. Sioutas, A graph neural network for fuzzy Twitter graphs, in: G. Cong, M. Ramanath (Eds.), CIKM companion volume, volume 3052, CEUR-WS.org, 2021.
[29] G. Drakopoulos, I. Giannoukou, P. Mylonas, S. Sioutas, A graph neural network for assessing the affective coherence of Twitter graphs, in: IEEE Big Data, IEEE, 2020, pp. 3618–3627. doi:10.1109/BigData50022.2020.9378492.
[30] L. Hu, L. Zou, A GPU-based graph pattern mining system, in: CIKM, 2022, pp. 4867–4871.
[31] G. Drakopoulos, Y. Voutos, P. Mylonas, S. Sioutas, Motivating item annotations in cultural portals with UI/UX based on behavioral economics, in: IISA, IEEE, 2021. doi:10.1109/IISA52424.2021.9555569.
[32] S. A. Bhavsar, V. H. Patil, A. H. Patil, Graph partitioning and visualization in graph mining: A survey, Multimedia Tools and Applications (2022) 1–42.
[33] G. Drakopoulos, I. Giannoukou, P. Mylonas, S. Sioutas, On tensor distances for self organizing maps: Clustering cognitive tasks, in: DEXA, volume 12392 of Lecture Notes in Computer Science, Springer, 2020, pp. 195–210. doi:10.1007/978-3-030-59051-2_13.
[34] Z. Xu, B. Du, H. Tong, Graph sanitation with application to node classification, in: Web Conference, ACM, 2022, pp. 1136–1147.
[35] Z. Sun, B. Wu, Y. Wang, Y. Ye, Sequential graph collaborative filtering, Information Sciences 592 (2022) 244–260.
[36] S. Wang, X. Niu, P. Fournier-Viger, D. Zhou, F. Min, A graph based approach for mining significant places in trajectory data, Information Sciences 609 (2022) 172–194.
[37] G. Drakopoulos, E. Kafeza, P. Mylonas, S. Sioutas, Approximate high dimensional graph mining with matrix polar factorization: A Twitter application, in: IEEE Big Data, IEEE, 2021, pp. 4441–4449. doi:10.1109/BigData52589.2021.9671926.