<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shahana Yasmin Chowdhury</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bithi Banik</string-name>
          <email>bithi.banik@kristiania.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Tamjidul Hoque</string-name>
          <email>thoque@uno.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shreya Banerjee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kristiania University of Applied Sciences</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of New Orleans</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Recent advancements in SER research have gained growing importance across various application areas, including healthcare, affective computing, personalized services, enhanced security, and AI-driven behavioral analysis. Existing machine learning (e.g., SVM, HMM) and deep learning (e.g., CNNs, LSTMs, Transformers) approaches have advanced SER significantly, proposing various algorithms for the task. However, these approaches often struggle to fully capture the sequential and contextual dependencies in speech signals, leading to suboptimal emotion classification, particularly in low-resource and noisy environments. To address this issue, we propose a novel sequence-based structured prediction framework that integrates Deep Conditional Random Fields (DeepCRF) with Bidirectional LSTM for detecting emotions in speech, combining the benefits of deep feature learning with structured sequence prediction. Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, and is trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and Crema-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and a perfect 100% on both TESS and EMO-DB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech Emotion Recognition</kwd>
        <kwd>DeepCRF</kwd>
        <kwd>Bidirectional LSTM</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Speech</kwd>
        <kwd>MFCC</kwd>
        <kwd>Spectrogram</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With technological advancements, research in human-computer interaction (HCI) and artificial emotional
intelligence (AEI) is evolving rapidly [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. HCI [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] focuses on how people communicate with computers
and how effectively systems respond. Researchers aim to make interactions between humans and
machines as natural as possible. Speech, being our primary form of communication, plays a key role in
HCI, with microphones as auditory sensors [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Through speech, we express emotions, making speech
emotion recognition (SER) essential for enhancing HCI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Speech emotion detection (SED) or SER is a sub-domain of natural language processing (NLP) [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]
and affective computing [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. There has been a significant amount of published work in recent years
[
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. A conventional SER system uses various methods to extract and analyze speech signals to
identify emotions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. SER has many practical applications, especially in enhancing human-computer
interactions by adding emotional awareness [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        One practical example of SER includes assessing call center agents’ performance by detecting
customer emotions such as anger or happiness. This feedback helps companies improve service quality,
offer targeted training, and increase customer satisfaction and efficiency [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Beyond this, many
applications—including healthcare, smart home devices, audio surveillance, criminal investigations,
recommendation systems, dialogue systems, and intelligent robots—benefit from detecting users’ emotions
through speech.
      </p>
      <p>
        Despite advances, challenges in SER remain due to limited technology, the complex nature of emotions,
and variations in language, accents, gender, and age, which all impact how emotions are expressed
in speech [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. To address these limitations, researchers have turned to cross-dataset integration
techniques [15, 16]. By combining diverse datasets, models can learn more generalized patterns and
improve recognition accuracy across multiple contexts [17].
      </p>
      <p>To address the challenges discussed above, our contributions are outlined below:
• We used five datasets together and applied data augmentation before model training. This
approach helps mitigate the risk of overfitting and improves the generalizability of emotion
recognition models.
• To the best of our knowledge, our proposed framework, DCRF-BiLSTM, outperforms other
methods across most of the datasets in the literature.
• Our contribution lies in striking a balance between high feature diversity and
computational efficiency, ensuring our model captures nuanced emotional characteristics without
overfitting.</p>
      <p>The structure of this paper is as follows: Section 2 reviews existing literature on SER. Section 3
provides an overview of our system, including discussions on datasets, data preprocessing, data
augmentation techniques, feature extraction, feature selection, and our proposed architecture. In
Section 4, we present a detailed comparative analysis of the experimental results using five datasets.
Finally, Section 5 concludes with a discussion on current limitations and future research directions in SED.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>SER identifies speakers’ emotional states from speech and plays a vital role in enhancing HCI across
domains like call centers, mental health diagnostics, and driver monitoring systems. By interpreting
emotional cues, SER improves user interaction, enables emotionally adaptive interfaces, and supports
stress or anxiety detection [18, 19].</p>
      <p>
        The traditional approach to SER has primarily emphasized extracting prosodic features—such as
pitch, intensity, and energy—and spectral features like Mel-frequency cepstral coefficients (MFCC) and
Linear Predictive Coefficients (LPC), which are essential for representing emotional cues in speech
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These features have typically been evaluated using machine learning classifiers such as Gaussian
Mixture Models and Support Vector Machines (SVM) [20, 19]. While earlier studies focused on a limited
set of handcrafted features for model development [19], there remains limited research exploring
a broader range of acoustic feature types. In contrast, our work leverages a more comprehensive
and diverse set of features—including MFCC, Chroma, Spectral Contrast, RMSE, ZCR, and Log-Mel
Spectrograms—resulting in a high-dimensional feature space. This extensive variation not only captures
more nuanced emotional characteristics but may also contribute to improved classification performance
compared to prior works that relied on fewer features.
      </p>
      <p>
        Deep learning has revolutionized various fields, including SER, by providing advanced methods for
processing and analyzing complex data. Techniques such as Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and attention
mechanisms have significantly enhanced the accuracy and efficiency of SER systems. In SER, CNNs
have been used to extract high-level features from spectrograms, which are then used for emotion
classification [19]. RNNs are designed to process sequential data, making them suitable for time-series
analysis, such as speech signals [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. LSTMs, a type of RNN, address the vanishing gradient problem
and are capable of learning long-term dependencies, which is beneficial for SER [17]. Combining CNNs
with LSTMs allows for the extraction of both spatial and temporal features, enhancing the performance
of SER systems [
        <xref ref-type="bibr" rid="ref10">21, 10</xref>
        ]. Deep learning has significantly improved the accuracy and robustness of SER
systems by enabling the automatic extraction and processing of complex features from raw data [19].
      </p>
      <p>The study of emotion recognition using speech datasets is a key area in AI and machine learning.
Datasets like RAVDESS, TESS, SAVEE, EMO-DB, and CREMA-D provide diverse audio samples for
training and evaluation. RAVDESS offers high-quality recordings and wide emotion coverage, achieving
97.16% accuracy with ConvLSTM [22], though its limited size may cause overfitting [23]. TESS has
2,800 samples and shows perfect accuracy in some studies [22], but lacks speaker diversity [24]. SAVEE,
focused on male British speech, reached 97.45% accuracy [24, 22], yet its small size is a limitation
[23, 22]. EMO-DB supports cross-linguistic studies [22] with 535 German samples but faces overfitting
risks [23, 22]. CREMA-D includes varied demographics [24], though its complexity impacts consistent
training despite achieving 83.28% in some works [22]. While valuable, these datasets highlight the need
for more diverse and extensive data to improve SER model generalization.</p>
      <p>Despite notable progress in deep learning-based SER, key challenges persist in achieving high accuracy
and robust generalization across diverse datasets. Variations in emotional expression across individuals,
languages, and cultural backgrounds make it difficult for models to generalize effectively [22]. The
inherent subjectivity of emotions also introduces ambiguity in labeling, leading to inconsistencies in
classification outcomes [20]. Moreover, many existing SER datasets are scripted and limited in speaker
and emotional diversity, which reduces their applicability to real-world scenarios and increases the risk
of overfitting [24]. Issues such as class imbalance and small dataset sizes further hinder deep learning
model performance and stability [24]. While deep models like CNNs and LSTMs can automatically
learn features, selecting relevant and informative features remains a challenge [20]. Additionally,
the computational demands of these architectures can limit their use in real-time systems. Although
techniques like data augmentation and cross-dataset integration offer potential improvements by
enhancing training diversity [22, 24], relatively few studies explore the impact of high-dimensional
and varied acoustic features in conjunction with structured prediction models. These limitations
motivate our work, which addresses the need for better feature diversity, improved generalization, and
context-aware modeling in SER.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed framework begins with the preparation of datasets. After collecting five publicly available
and widely used speech emotion datasets, we applied two key preprocessing steps: data preprocessing
and data augmentation. Once the data was prepared, we proceeded with feature extraction and feature
selection to identify the most informative attributes for emotion recognition. Following this, we detail
the architecture of our proposed model. This section also outlines the processes involved in model
training and evaluation.</p>
      <p>Our proposed framework, DCRF-BiLSTM, is shown in Figure 1. At first, we fetch all the audio
or speech signals from the datasets, then we preprocess each audio signal with silence removal and
resampling, followed by three types of augmentation techniques, i.e., Gaussian noise, time stretch, and
pitch shift. At the feature extraction step, we extracted six different types of features with variations, and
a total of 190 features were selected. For the model training, we applied a Deep Conditional Random
Fields (DeepCRF) and Bidirectional LSTM framework to detect emotions in speech, combining deep
feature learning with structured sequence prediction. Our proposed model is designed to recognize
seven emotions — neutral, happy, sad, angry, fear, disgust, and surprise — and is trained on five datasets:
RAVDESS, TESS, SAVEE, EmoDB, and Crema-D.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>In this study, we used five publicly available datasets, namely the Ryerson Audio-Visual Database of
Emotional Speech and Song (RAVDESS), Toronto Emotional Speech Set (TESS), Surrey Audio-Visual
Expressed Emotion (SAVEE), EmoDB, and CREMA-D, for speech emotion detection. Table 1 shows the
class-wise emotion counts of each of the five datasets and the combined dataset [24]. All datasets used in
this study are publicly available and widely used for academic research. We did not collect any new
human data. Dataset diversity, spanning gender, age, and language, was considered to enhance fairness
and generalizability.
The descriptions of the datasets are as follows:
• Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): This dataset [25]
consists of 1440 audio wav files recorded by 12 males and 12 females. It includes emotions such
as happy, calm, sad, fearful, angry, surprise, and disgust. The emotions are recorded at normal
and strong intensity levels. It is a balanced dataset, although the "neutral" class has fewer records
than the other classes. We exclude the calm emotion class from RAVDESS to match
the combined dataset.
• Toronto Emotional Speech Set (TESS): A female-only dataset [26] consisting of 2800 audio files
recorded by two actors aged 26 and 64 years. The recordings portray seven emotions: happiness,
anger, fear, disgust, pleasant surprise, sadness, and neutral.
• Surrey Audio-Visual Expressed Emotion (SAVEE): This dataset [27] consists of 480 audio
clips (120 per speaker) recorded by four male speakers aged 27 to 31 years. The emotions included are
neutral, happy, sad, angry, disgust, fear, and surprise. However, this dataset has a class imbalance
issue, with the "neutral" class containing almost twice as many samples as each of the other classes.
• EmoDB: This dataset [28] comprises 535 audio recordings in the German language categorized into seven
emotion classes: "anger," "fear," "sadness," "happiness," "disgust," "boredom," and "neutral". The
utterances are sampled at a rate of 16 kHz with a resolution of 16 bits. However, this dataset has a
class imbalance issue, with the "anger" class containing considerably more utterances than the other classes.</p>
        <p>We exclude the boredom emotion class from EmoDB to match the combined dataset.
• CREMA-D: The CREMA-D [29] dataset contains 7,442 audio files recorded by 91 actors (48 males
and 43 females) aged 20 to 74 years. It consists of six emotions, namely
happy, angry, neutral, sad, fear, and disgust, at four different intensity levels (low, medium, high,
and unspecified).</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Data Preprocessing</title>
          <p>Data Preprocessing is an important technique because it prepares the raw data for analysis and modeling.
It also helps to extract relevant features, eliminate unnecessary information, and enhance model accuracy.
As a part of data preprocessing, we removed unnecessary silence and resampled the data. Specifically,
we removed 70% of the silence from the beginning and end of each audio clip when the silence length is
more than 200 ms. Additionally, we resampled each audio clip to a 22050 Hz sampling rate. The spectrogram
and waveform of a happy-emotion sample before and after preprocessing, shown in Figures 2-3, highlight the value of
cleaning the data by removing silence, making the emotional characteristics of speech more prominent.
These figures show how preprocessing enhances signal quality and prepares the data for more accurate emotion
classification.</p>
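          <p>A minimal Python sketch of this preprocessing step, using librosa, is shown below. It uses librosa.effects.trim as a simplified stand-in for our silence-removal rule (the exact 70%/200 ms heuristic is not reproduced here), and the top_db threshold is an assumed value.</p>
          <preformat>
import librosa

def preprocess(path, target_sr=22050, top_db=30):
    """Load an audio clip, resample it to 22050 Hz, and trim
    leading/trailing silence (simplified stand-in for the 70%/200 ms rule)."""
    # librosa.load resamples to target_sr and converts to mono by default
    y, sr = librosa.load(path, sr=target_sr)
    # Drop leading/trailing segments quieter than top_db dB below the peak
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    return y_trimmed, sr
          </preformat>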
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Data Augmentation</title>
          <p>To create variations in audio data, we applied a data augmentation process [30]. Data augmentation
is a widely used technique in SER models to enlarge the dataset and improve robustness and generalization.
It also increases the data size for proper training of the models. In this study, we noticed that the dataset is
imbalanced over the emotion classes, so we injected Gaussian noise at three levels. Our noise injection
technique was applied with noise rates of (0.035, 0.025, 0.015). We also applied two more augmentation
techniques to our dataset, pitch shift and time stretch, by using librosa. The pitch shift technique shifts
the pitch of an audio signal without changing its duration, applied in three steps with rate parameters of (0.70,
0.80, 0.70). We applied the time stretch technique to change the speed of an audio signal without
altering its pitch, with speed factor rates of (0.8, 0.9, 0.7). These techniques enhance the model's ability
to handle different acoustic environments, improving the overall performance of the SER model. The
impact of these augmentation techniques is visually shown in Figure 4, with spectrogram and waveform
visualizations illustrating the effects of the different data augmentation techniques applied to a speech
sample: time-stretching changes the playback speed while preserving pitch, pitch shifting alters
the frequency content to simulate different speaker tones, and Gaussian noise injection simulates
environmental noise to enhance model robustness.</p>
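          <p>As an illustration, the sketch below applies the three augmentation operations with librosa and NumPy. The noise rates, pitch steps, and speed factors mirror the values listed above, while the mapping of the stated rate parameters onto the n_steps and rate arguments is our assumption, and sample.wav is a hypothetical input file.</p>
          <preformat>
import numpy as np
import librosa

def add_gaussian_noise(y, noise_rate=0.035):
    # Scale white noise relative to the signal's maximum amplitude
    return y + noise_rate * np.max(np.abs(y)) * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=0.70):
    # Shift pitch (fractional semitones) without changing duration
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def stretch_time(y, rate=0.8):
    # Change playback speed without altering pitch
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("sample.wav", sr=22050)   # hypothetical input file
augmented = []
for r in (0.035, 0.025, 0.015):
    augmented.append(add_gaussian_noise(y, r))
for s in (0.70, 0.80, 0.70):
    augmented.append(shift_pitch(y, sr, s))
for f in (0.8, 0.9, 0.7):
    augmented.append(stretch_time(y, f))
          </preformat>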
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Extraction and Selection</title>
        <p>Feature extraction converts the raw data into a meaningful representation. It is an essential step in audio
emotion detection, because raw data contains a vast amount of information. So, it helps to reduce the
complexity of the data by capturing the most relevant information, which represents the emotional state
in the speech. In this study, features have been extracted from WAV format audio files by exploiting
Librosa, a Python package for music and audio analysis. In particular, we extracted and then selected the
following 190 (= 80 + 36 + 64 + 6 + 3 + 1) features for our SER model. Selected features after extraction
are shown in Table 2.</p>
        <p>• MFCCs (Mel-Frequency Cepstral Coefficients): The sounds produced by humans are shaped
by the unique configuration of the vocal tract, including features such as the tongue and teeth.
These structures influence each individual’s distinct voice characteristics. Accurately capturing
vocal tract shape is important, as it is reflected in the short-time power spectrum, typically
represented by Mel-Frequency Cepstral Coefficients (MFCCs) [21]. MFCCs are widely used in SER
research [31, 24, 32]. To extract MFCCs, the speech signal is segmented into overlapping frames
of 20–30 ms with a 10 ms shift to preserve temporal dynamics. Each frame is windowed and
transformed using the Discrete Fourier Transform (DFT) to obtain the magnitude spectrum [18, 33].
A set of 26 Mel-scaled filters is applied to mimic the human auditory system, producing 26 energy
values per frame. These are converted into log filter bank energies. The relationship between
Mel and physical frequency is expressed by Equation 1.</p>
        <p>\mathrm{Mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \dfrac{f}{700}\right) \qquad (1)</p>
        <p>Here, f denotes the physical frequency in Hertz (Hz), and Mel(f) represents the perceived frequency
by the human ear. Finally, the Discrete Cosine Transform (DCT) is applied to the log filter-bank
energies to obtain the MFCCs.</p>
        <p>
          For this study, 20 lower-dimensional MFCCs, 20 delta MFCCs, 20 delta-delta (delta2) MFCCs,
and 20 MFCC standard deviations were extracted from each audio file. The delta and delta-delta
coeficients represent the first and second temporal derivatives of the MFCCs. These features
effectively capture speech dynamics. MFCC envelopes are sufficient to reflect phoneme differences,
enabling speech emotion recognition.
• Chroma Features: The Chroma feature captures the tonal content of audio by mapping frequency
components to the 12 pitch classes of the chromatic scale [34]. It reflects energy distribution
across pitch classes as a 12-dimensional vector. Chroma features are extracted using Short-Time
Fourier Transform (STFT) on audio waveforms [
          <xref ref-type="bibr" rid="ref15">35</xref>
          ]. This study used three Chroma types:
ChromaSTFT, Chroma-CQT (Constant-Q Transform), and Chroma-CENS (Chroma Energy Normalized
Statistics). Each captures pitch information differently. From each audio file, 12 features were
extracted per type, totaling 36 Chroma-related features. These represent tonal and harmonic
speech content, aiding emotion recognition.
• Log Mel Spectrogram (LMS): The Log Mel spectrogram improves audio spectrum representation
by considering temporal changes and the human ear’s frequency response. It includes signal
segmentation, windowing, the Fourier transform, Mel filtering, and log compression to generate a
logarithmic spectral view. The frequency spectrum is mapped onto Mel scale frequencies, forming
a Mel spectrogram per window. Magnitude components of these frequencies were extracted using
the Librosa library. The spectrogram visualizes signal intensity in the time-frequency domain
using the Fast Fourier Transform (FFT) [
          <xref ref-type="bibr" rid="ref16">19, 36</xref>
          ]. It is essential for speech classification tasks and
is effective with BiLSTM or DeepCRF. In this study, 64 LMS features were extracted from each
audio file.
• Spectral Contrast: Spectral contrast serves as an indicator of voicing and signal quality [
          <xref ref-type="bibr" rid="ref17">37</xref>
          ]. It is
often utilized to assess the ’speechiness’ of an audio signal and is widely applied in distinguishing
speech from background noise.
• Root Mean Square Energy (RMSE): The Root Mean Square Energy gives the average signal
amplitude over a frame, regardless of whether the values are positive or negative. It is useful for
analyzing signal intensity in speech [20]. In this study, RMSE values were extracted using the
Librosa library. For a signal x = x_1, x_2, \ldots, x_N, the RMSE is calculated using Equation 2. In
this study, we extracted 3 features for RMSE.
        </p>
        <p>x_{\mathrm{RMS}} = \sqrt{\dfrac{1}{N} \sum_{n=1}^{N} x_n^{2}} \qquad (2)</p>
        <p>• Zero Crossing Rate (ZCR): Zero Crossing Rate is a commonly used feature in SER. It counts the
number of times a signal crosses the zero value in a given time frame. This helps to differentiate
between voiced and unvoiced parts of speech [24]. In this work, we used the Librosa library to
extract ZCR values from the audio datasets. The ZCR is defined mathematically in Equation 3.</p>
        <p>\mathrm{ZCR} = \dfrac{1}{N-1} \sum_{n=1}^{N-1} \mathbb{1}_{\mathbb{R}_{&lt;0}}(x_n \cdot x_{n+1}) \qquad (3)</p>
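        <p>A condensed sketch of this feature-extraction step is given below. It pools each librosa feature over time (mean per coefficient), which is one plausible way to obtain a fixed-length vector; the exact pooling and the per-feature counts used in our pipeline (e.g., 3 RMSE values and 1 ZCR value) may differ from this simplified version.</p>
        <preformat>
import numpy as np
import librosa

def extract_features(y, sr=22050):
    """Extract MFCC, Chroma, log-Mel, spectral contrast, RMSE, and ZCR
    descriptors and pool them over time into a single feature vector."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    feats = [
        mfcc.mean(axis=1),                                           # 20 MFCCs
        librosa.feature.delta(mfcc).mean(axis=1),                    # 20 delta MFCCs
        librosa.feature.delta(mfcc, order=2).mean(axis=1),           # 20 delta-delta MFCCs
        mfcc.std(axis=1),                                            # 20 MFCC std devs
        librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),        # 12 Chroma-STFT
        librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1),         # 12 Chroma-CQT
        librosa.feature.chroma_cens(y=y, sr=sr).mean(axis=1),        # 12 Chroma-CENS
        librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        ).mean(axis=1),                                              # 64 log-Mel bands
        librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1),  # spectral contrast bands
        librosa.feature.rms(y=y).mean(axis=1),                       # RMSE (pooled)
        librosa.feature.zero_crossing_rate(y).mean(axis=1),          # ZCR (pooled)
    ]
    return np.concatenate(feats)
        </preformat>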
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Proposed Architecture</title>
        <p>In this study, we utilize the CRF (Conditional Random Field) layer, a specialized layer used for sequence
tagging and prediction tasks, particularly in NLP, which is available in the TensorFlow Addons (TFA)
module. We combined DeepCRF with Bi-LSTM to train our model. Table 3 shows our model
architecture.</p>
        <p>The model is built as a sequential stack of Bidirectional LSTM
layers that learn hierarchical and contextualized temporal features of the audio or speech data from both
forward and backward directions. The first layer uses 512 LSTM units, with L2 regularization to reduce overfitting,
and Batch Normalization to accelerate training. A dropout rate of 0.3 is applied to prevent overfitting. In
the second and third layers, the same Bi-LSTM with 512 units is added for deeper feature extraction, followed by
similar Batch Normalization and Dropout layers.</p>
        <p>The next two layers are Dense layers with 512 units; the first uses Swish activation, which improves
performance by addressing the vanishing gradient problem. LeakyReLU is applied afterwards for additional
non-linearity. After that, we include the DeepCRF Layer with LeakyReLU activation to model the
sequential structure of speech data. CRF performs best for sequence labeling tasks because it considers
the context of neighboring predictions.</p>
        <p>The final layer is the output layer. We use a TimeDistributed layer that applies a Dense layer with
the number of classes and a Softmax activation to generate class probabilities for each time step. A
reshape layer is also used to ensure the output matches the number of classes, representing
the predicted emotion. This model is designed for speech emotion recognition by extracting high-level
features using LSTMs and leveraging the CRF layer to enhance prediction accuracy through temporal
relationships in speech signals.</p>
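        <p>A minimal Keras sketch of this architecture is given below. The 512-unit stacked BiLSTM blocks, Batch Normalization, Dropout, Swish dense block, LeakyReLU, and TimeDistributed softmax follow the description above; the L2 strength is an assumed value, and the TensorFlow Addons CRF layer is indicated only by a comment so that the sketch runs with plain Keras.</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_dcrf_bilstm(n_features=190, n_classes=7, l2=1e-4):
    """Sketch of the stacked BiLSTM front-end described in Section 3.3."""
    model = models.Sequential([layers.Input(shape=(1, n_features))])
    for _ in range(3):                                   # three BiLSTM blocks
        model.add(layers.Bidirectional(layers.LSTM(
            512, return_sequences=True,
            kernel_regularizer=regularizers.l2(l2))))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(512, activation="swish"))     # dense block
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(512))
    model.add(layers.Dropout(0.3))
    model.add(layers.LeakyReLU())
    # The CRF layer (e.g., tfa.layers.CRF from TensorFlow Addons) would be
    # inserted here in the full DCRF-BiLSTM; omitted to keep the sketch plain Keras.
    model.add(layers.TimeDistributed(
        layers.Dense(n_classes, activation="softmax")))
    return model
        </preformat>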
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model Training</title>
        <p>To ensure an effective evaluation and comparison with previous studies, the target datasets were divided
into two parts each time. After preparing the feature set, we performed data normalization and split
the feature data into 80% training data and 20% testing data. The model was trained on the augmented
dataset for 500 epochs. For hyperparameters, we used a learning rate of 0.0001, a batch size of 256,
the ‘Adam’ optimizer, and the categorical cross-entropy loss function. The whole pipeline was developed using
TensorFlow and Keras to build and train a deep learning model for our SER framework. Experiments
were conducted on a CPU (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz) with 16 GB of RAM;
training each fold took approximately 7 hours.</p>
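        <p>The corresponding training setup can be sketched as follows, assuming a feature matrix of shape (n_samples, 190) and one-hot emotion labels; the StandardScaler normalization and the random placeholder data are our assumptions for illustration, and build_dcrf_bilstm refers to the architecture sketch above.</p>
        <preformat>
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix (n_samples, 190) and one-hot labels for 7 emotions
X = np.random.rand(1000, 190).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 7, 1000), 7)

X = StandardScaler().fit_transform(X)          # data normalization
X = X[:, np.newaxis, :]                        # (n, 1, 190) single time step for the BiLSTM stack
y = y[:, np.newaxis, :]                        # match the TimeDistributed output shape

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = build_dcrf_bilstm()                    # architecture sketch from Section 3.3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, validation_data=(X_te, y_te), epochs=500, batch_size=256)
        </preformat>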
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>The experiment involved training the SED model on five individual datasets as well as the combined
datasets (R+T+S) and (R+T+S+E+C) to assess its generalization abilities. The results were compared and
analyzed using performance evaluation metrics such as accuracy, loss rate, and confusion matrix.</p>
      <sec id="sec-4-1">
        <title>4.1. Performance Metrics and Evaluation</title>
        <p>The performance of our model, DeepCRF combined with Bi-LSTM, is given in Table 4. Every
time, each dataset was partitioned into train and test sets employing an 80%-20% split to assess the
model’s performance on unseen data. The accuracy metric determines the percentage of samples
predicted correctly by the model.</p>
        <sec id="sec-4-1-1">
          <title>Model: "DCRF-BiLSTM"</title>
          <p>Layer (Type) | Output Shape
bidirectional (Bidirectional) | (None, 1, 1024)
batch_normalization (BatchNormalization) | (None, 1, 1024)
dropout (Dropout) | (None, 1, 1024)
bidirectional_1 (Bidirectional) | (None, 1, 1024)
batch_normalization_1 (BatchNormalization) | (None, 1, 1024)
dropout_1 (Dropout) | (None, 1, 1024)
bidirectional_2 (Bidirectional) | (None, 1, 1024)
batch_normalization_2 (BatchNormalization) | (None, 1, 1024)
dropout_2 (Dropout) | (None, 1, 1024)
dense (Dense) | (None, 1, 512)
dropout_3 (Dropout) | (None, 1, 512)
dense_1 (Dense) | (None, 1, 512)
dropout_4 (Dropout) | (None, 1, 512)
leaky_re_lu (LeakyReLU) | (None, 1, 512)
crf_layer (CRFLayer) | (None, 1, 512)
time_distributed (TimeDistributed) | (None, 1, 7)</p>
          <p>Our proposed model achieves an accuracy of 100.00% for TESS, 100.00% for
EMO-DB, 97.83% for RAVDESS, 97.02% for SAVEE, and 95.10% for CREMA-D. Furthermore,
it achieves 98.82% for the combined RAVDESS, TESS, and SAVEE (R+T+S) dataset and 93.76% for the combination of all five datasets (R+T+S+E+C).</p>
          <p>
            We also performed 5-fold cross-validation with PCA-reduced features (principal component analysis)
to evaluate model consistency [
            <xref ref-type="bibr" rid="ref18 ref19">38, 39</xref>
            ]. As seen in Table 4, the results remained highly stable across
datasets, confirming the robustness of DeepCRF. The close agreement between the 80-20 split and
cross-validation accuracies highlights the model's generalization ability, while PCA successfully reduced
the feature dimensionality without loss of accuracy.
          </p>
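          <p>A hedged sketch of this evaluation protocol is shown below; the number of retained principal components is an assumed value, and build_and_score stands for a caller-supplied routine that trains the DCRF-BiLSTM on a fold's training split and returns its test-split accuracy.</p>
          <preformat>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cross_validate(X, labels, build_and_score, n_components=100, n_splits=5):
    """5-fold cross-validation on PCA-reduced features (component count assumed)."""
    X = StandardScaler().fit_transform(X)
    X = PCA(n_components=n_components).fit_transform(X)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = [build_and_score(X[tr], labels[tr], X[te], labels[te])
              for tr, te in skf.split(X, labels)]
    return float(np.mean(scores))
          </preformat>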
          <p>We analyze the performance of five individual datasets and the combined dataset using weighted
average (WA) and macro average (MA), due to class imbalance. We also report the precision, recall, and
F1-score of the proposed model on individual datasets for the same reason. Tables 5-7 show class-wise
emotion recognition performance for individual and combined datasets.</p>
          <p>High accuracy in recognizing speech emotion is the model’s primary goal, so this work records only
the best-fit model. Validation accuracy is the key indicator of model generalization. When validation
accuracy reaches its peak, prediction is optimal with this model. Figure 5 shows results predicted by the
proposed model for the two combined datasets. Appendix A (Figures 7-11) includes validation accuracy
and loss curves for the five individual datasets.</p>
          <p>A confusion matrix shows the number of correct and incorrect predictions for each class. The
confusion matrix for all the emotions (classes) is given in Figure 6 The emotion ‘surprise’ has the highest
no. of correctly predicted samples for Ravdess, ‘disgust’ has the highest no. of correctly predicted
samples for Tess, ‘neutral’ has the highest no. of correctly predicted samples for Savee, ‘angry’ has the
highest no. of correctly predicted samples for EmoDB and Crema-D.</p>
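          <p>The class-wise metrics reported in Tables 5-7 and the confusion matrices in Figure 6 can be reproduced with scikit-learn; in the sketch below, the label arrays are random placeholders standing in for the test-split targets and model predictions.</p>
          <preformat>
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

emotions = ["neutral", "happy", "sad", "angry", "fear", "disgust", "surprise"]
# Placeholder labels; in practice these are the test targets and model predictions
y_true = np.random.randint(0, 7, 200)
y_pred = np.random.randint(0, 7, 200)

# Per-class precision/recall/F1 plus macro (MA) and weighted (WA) averages
print(classification_report(y_true, y_pred, labels=list(range(7)),
                            target_names=emotions, digits=4))
# Correct vs. incorrect predictions per emotion class (as in Figure 6)
print(confusion_matrix(y_true, y_pred, labels=list(range(7))))
          </preformat>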
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparison with the State of the Art</title>
        <p>Table 8 presents a comparison between the proposed DCRF-BiLSTM model and other related studies
using the RAVDESS, TESS, SAVEE, EMO-DB, CREMA-D, and combined datasets. The evaluation is based
on the average accuracy achieved in different experiments. The proposed model achieves the highest
accuracy across all datasets and significantly outperforms previous methods. In particular, for the
combined RAVDESS, TESS, and SAVEE (R+T+S) dataset, the model achieves an accuracy of 98.82%, and
for the combined RAVDESS, TESS, SAVEE, EMO-DB, and CREMA-D (R+T+S+E+C) dataset, the model
achieves an accuracy of 93.76%. Using an 80%-20% train-test split, the model demonstrates superior
performance compared to existing approaches. The comparison is based on the average classification
accuracy achieved by each model using various architectures and experimental setups.</p>
        <p>The proposed DCRF-BiLSTM model consistently outperforms other methods across most of the datasets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Our proposed DCRF-BiLSTM model achieves the highest accuracy on RAVDESS (97.83%), TESS (100.00%), EMO-DB (100.00%),
and SAVEE (97.02%), as well as on the combined R+T+S dataset (98.82%), surpassing previous benchmarks.
Furthermore, it achieves an accuracy of 99.18% on the combined
RAVDESS, TESS, and SAVEE (R+T+S) datasets, and 94.03% on the combined RAVDESS, TESS, SAVEE,
EMO-DB, and CREMA-D (R+T+S+E+C) datasets using 5-fold cross-validation, further confirming its
robustness and generalization across diverse emotional speech corpora.</p>
      <p>Although [22] achieved slightly higher performance on the SAVEE dataset (97.11%) and the
same performance on the combined dataset (99.18%), our proposed model remains highly competitive
while demonstrating robustness across all datasets. Furthermore, no other model in the
literature has worked with the combined five datasets, on which our model achieved an accuracy of 93.76%.</p>
      <p>Moreover, in comparison to ensemble-based architectures such as the 1D CNNs-LSTM-GRU model
by [24], and ConvLSTM models [22], the proposed DCRF-BiLSTM architecture—combining deep
contextualized BiLSTM layers with a CRF-based sequence labeling mechanism—demonstrates superior
generalization for audio signals. This confirms the model’s ability to capture temporal dependencies
and contextual emotional insights more effectively. The proposed model shows consistent performance
across all datasets, ensuring robustness and adaptability to different emotional speech corpora.</p>
      <p>This study utilized a data preprocessing technique and three types of data augmentation techniques to
expand the training and test datasets. We combined five diverse datasets to enhance generalizability
and reduce potential bias from any single corpus. Still, some cross-dataset variation remains, which
we aim to address in future work through domain adaptation techniques. Our SER framework also represented
different kinds of important audio features, MFCC, Chroma, LMS, Spectral Contrast, RMSE, and ZCR,
with a total of 190 features, to reduce the complexity of the data and capture the most relevant and
meaningful information. Our proposed DCRF-BiLSTM hybrid framework, which uses
DeepCRF combined with stacked Bidirectional LSTMs, effectively captures the spatial and temporal
information of audio signals. Evaluation of our model was performed on five publicly available datasets:
RAVDESS, TESS, SAVEE, EMO-DB, and CREMA-D, as well as the combined datasets (R+T+S) and
(R+T+S+E+C). It shows the highest accuracy of 100.00% on TESS and EmoDB, accurately recognizing
and classifying speech emotions.</p>
      <p>Although the proposed DCRF-BiLSTM architecture performed well, there are still areas for future
work. Evaluating the model across different languages could provide insight into its adaptability and
robustness. Incorporating contextual information, such as speaker identity, along with attention-based
mechanisms, may help capture more complex emotional patterns. Since Bi-LSTM relies on future
context, it is not well-suited for real-time audio sequences. Therefore, exploring alternative models
that are more efficient for real-time speech processing would be beneficial. In future work, we plan
to incorporate structured sentiment knowledge, such as SenticNet or emotion ontologies, to enhance
reasoning over nuanced emotional expressions. We also aim to extend this framework to ASD-related
emotion recognition and the early detection of emotional cues in children with ASD.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The machine learning code in this research was implemented by the author(s), with limited assistance
from ChatGPT for error checking and improvement. In addition, AI tools such as ChatGPT and
Grammarly were used for grammar and spelling checks, paraphrasing, and rewording during the
preparation of this manuscript. After using these tools, the author(s) reviewed and edited all content as
needed and assume(s) full responsibility for the content of the publication.</p>
      <p>[15] L. T. C. Ottoni, A. L. C. Ottoni, J. d. J. F. Cerqueira, A deep learning approach for speech emotion recognition optimization using meta-learning, Electronics 12 (2023) 4859.
[16] R. Milner, M. A. Jalal, R. W. Ng, T. Hain, A cross-corpus study on speech emotion recognition, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 304–311.
[17] S. Jothimani, K. Premalatha, MF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network, Chaos, Solitons &amp; Fractals 162 (2022) 112512.
[18] P. Nantasri, E. Phaisangittisagul, J. Karnjana, S. Boonkla, S. Keerativittayanun, A. Rugchatjaroen, S. Usanavasin, T. Shinozaki, A light-weight artificial neural network for speech emotion recognition using average values of MFCCs and their derivatives, in: 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), IEEE, 2020, pp. 41–44.
[19] A. Mukhamediya, S. Fazli, A. Zollanvari, On the effect of log-Mel spectrogram parameter tuning for deep learning-based speech emotion recognition, IEEE Access 11 (2023) 61950–61957.
[20] M. B. Er, A novel approach for classification of speech emotions based on deep and acoustic features, IEEE Access 8 (2020) 221640–221653.
[21] M. Gupta, T. Patel, S. H. Mankad, T. Vyas, Detecting emotions from human speech: role of gender information, in: 2022 IEEE Region 10 Symposium (TENSYMP), IEEE, 2022, pp. 1–6.
[22] R. M. Ben-Sauod, R. S. Alshwehdi, W. I. Eltarhouni, Enhancing speech emotion recognition through a cross-dataset analysis: Exploring improved models, in: 2024 IEEE 4th International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA), IEEE, 2024, pp. 706–711.
[23] D. Issa, M. F. Demirci, A. Yazici, Speech emotion recognition with deep convolutional neural networks, Biomedical Signal Processing and Control 59 (2020) 101894.
[24] M. R. Ahmed, S. Islam, A. M. Islam, S. Shatabda, An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition, Expert Systems with Applications 218 (2023) 119633.
[25] S. R. Livingstone, F. A. Russo, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLoS ONE 13 (2018) e0196391.
[26] M. K. Pichora-Fuller, K. Dupuis, Toronto Emotional Speech Set (TESS), Scholars Portal Dataverse 1 (2020).
[27] P. Jackson, S. Haq, Surrey Audio-Visual Expressed Emotion (SAVEE) database, University of Surrey: Guildford, UK (2014).
[28] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, et al., A database of German emotional speech, in: Interspeech, volume 5, 2005, pp. 1517–1520.
[29] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, R. Verma, CREMA-D: Crowd-sourced emotional multimodal actors dataset, IEEE Transactions on Affective Computing 5 (2014) 377–390.
[30] L. Ferreira-Paiva, E. Alfaro-Espinoza, V. M. Almeida, L. B. Felix, R. V. Neves, A survey of data augmentation for audio classification, in: Congresso Brasileiro de Automática-CBA, volume 3, 2022.
[31] J. de Lope, M. Graña, An ongoing review of speech emotion recognition, Neurocomputing 528 (2023) 1–11.
[32] J. Singh, L. B. Saheer, O. Faust, Speech emotion recognition using attention model, International Journal of Environmental Research and Public Health 20 (2023) 5140.
[33] N. Hajarolasvadi, H. Demirel, 3D CNN-based speech emotion recognition using k-means clustering and spectrograms, Entropy 21 (2019) 479.
[34] G. K. Birajdar, M. D. Patil, Speech/music classification using visual and spectral chromagram features, Journal of Ambient Intelligence and Humanized Computing 11 (2020) 329–347.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krakovsky</surname>
          </string-name>
          ,
          <source>Artificial (emotional) intelligence, Communications of the ACM</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>18</fpage>
          -
          <lpage>19</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3185521. doi:
          <volume>10</volume>
          .1145/3185521.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <source>The age of artificial emotional intelligence</source>
          ,
          <source>Computer</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>38</fpage>
          -
          <lpage>46</lpage>
          . URL: https://ieeexplore.ieee.org/document/8481266. doi:
          <volume>10</volume>
          .1109/
          <string-name>
            <surname>MC</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <volume>3620963</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hudlicka</surname>
          </string-name>
          ,
          <article-title>To feel or not to feel: The role of affect in HCI</article-title>
          ,
          <source>International Journal of HumanComputer Studies</source>
          <volume>59</volume>
          (
          <year>2003</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          . URL: https://www.sciencedirect.com/science/article/abs/pii/ S1071581903000478. doi:
          <volume>10</volume>
          .1016/S1071-
          <volume>5819</volume>
          (
          <issue>03</issue>
          )
          <fpage>00047</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] n. Mustaqeem,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>A cnn-assisted enhanced audio signal processing for speech emotion recognition</article-title>
          ,
          <source>Sensors</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>183</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mukesh</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-H. Hsu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Vyas</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using crosscorrelation and acoustic features</article-title>
          ,
          <source>in: 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress</source>
          (DASC/PiCom/- DataCom/CyberSciTech), IEEE,
          <year>2018</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kusal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choudrie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kotecha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vora</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Pappas,</surname>
          </string-name>
          <article-title>A systematic review of applications of natural language processing and future challenges with special emphasis in text-based emotion detection</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>15129</fpage>
          -
          <lpage>15215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Picard</surname>
          </string-name>
          , Affective computing, MIT press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Madanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Adeleye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Templeton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Poellabauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using machine learning-a systematic review</article-title>
          ,
          <source>Intelligent systems with applications 20</source>
          (
          <year>2023</year>
          )
          <fpage>200266</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>M. B. Akçay</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</article-title>
          ,
          <source>Speech Communication</source>
          <volume>116</volume>
          (
          <year>2020</year>
          )
          <fpage>56</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>M. B. Mustafa</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Yusoof</surname>
            ,
            <given-names>Z. M.</given-names>
          </string-name>
          <string-name>
            <surname>Don</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Malekzadeh</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition research: an analysis of research focus</article-title>
          ,
          <source>International Journal of Speech Technology</source>
          <volume>21</volume>
          (
          <year>2018</year>
          )
          <fpage>137</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Gadhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Babasaheb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deshmukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Babasaheb</surname>
          </string-name>
          ,
          <article-title>Emotion recognition from isolated marathi speech using energy and formants</article-title>
          ,
          <source>International Journal of Computer Applications</source>
          <volume>125</volume>
          (
          <year>2015</year>
          )
          <fpage>22</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Radhika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prasanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Sowndarya</surname>
          </string-name>
          ,
          <article-title>A reliable speech emotion recognition framework for multi-regional languages using optimized light gradient boosting machine classifier</article-title>
          ,
          <source>Biomedical Signal Processing and Control</source>
          <volume>105</volume>
          (
          <year>2025</year>
          )
          <fpage>107636</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kowalczyk</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. N. van der Wal</surname>
          </string-name>
          ,
          <article-title>Detecting changing emotions in natural speech</article-title>
          ,
          <source>in: Advanced Research in Applied Artificial Intelligence: 25th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2012, Dalian, China, June 9-12, 2012, Proceedings 25</source>
          , Springer, 2012, pp. 491-500.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Cnn based approach for speech emotion recognition using mfcc, croma and stft hand-crafted features</article-title>
          ,
          <source>in: 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>981</fpage>
          -
          <lpage>985</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition from 3d log-mel spectrograms with deep learning network</article-title>
          ,
          <source>IEEE access 7</source>
          (
          <year>2019</year>
          )
          <fpage>125868</fpage>
          -
          <lpage>125881</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thiruvenkadam</surname>
          </string-name>
          ,
          <article-title>An analysis of the impact of spectral contrast feature in speech emotion recognition</article-title>
          ,
          <source>Int. J. Recent Contributions Eng. Sci. IT</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>87</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Principal component analysis</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>433</fpage>
          -
          <lpage>459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Benba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jilbab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hammouch</surname>
          </string-name>
          ,
          <article-title>Voice assessments for detecting patients with neurological diseases using pca and npca</article-title>
          ,
          <source>International Journal of Speech Technology</source>
          <volume>20</volume>
          (
          <year>2017</year>
          )
          <fpage>673</fpage>
          -
          <lpage>683</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mekruksavanich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jitpattanakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hnoohom</surname>
          </string-name>
          ,
          <article-title>Negative emotion recognition using deep learning for Thai language</article-title>
          ,
          <source>in: 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT &amp; NCON)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>U.</given-names>
            <surname>Asiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kiran</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition-a deep learning approach</article-title>
          ,
          <source>in: 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>867</fpage>
          -
          <lpage>871</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Pichora-Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dupuis</surname>
          </string-name>
          ,
          <source>Toronto emotional speech set (TESS)</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <source>The Theory of Parsing, Translation and Compiling</source>
          , volume
          <volume>1</volume>
          ,
          <publisher-name>Prentice-Hall</publisher-name>
          , Englewood Cliffs, NJ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [44]
          <collab>American Psychological Association</collab>
          ,
          <source>Publications Manual</source>
          , American Psychological Association, Washington, DC,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Kozen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Stockmeyer</surname>
          </string-name>
          ,
          <article-title>Alternation</article-title>
          ,
          <source>Journal of the Association for Computing Machinery</source>
          <volume>28</volume>
          (
          <year>1981</year>
          )
          <fpage>114</fpage>
          -
          <lpage>133</lpage>
          . doi:10.1145/322234.322243.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>G.</given-names>
            <surname>Andrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Scalable training of L1-regularized log-linear models</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on Machine Learning</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gusfield</surname>
          </string-name>
          ,
          <source>Algorithms on Strings, Trees and Sequences</source>
          , Cambridge University Press, Cambridge, UK,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rasooli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <article-title>Yara parser: A fast and accurate dependency parser</article-title>
          ,
          <source>Computing Research Repository arXiv:1503.06733</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1503.06733, version 2.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Ando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A framework for learning predictive structures from multiple tasks and unlabeled data</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>6</volume>
          (
          <year>2005</year>
          )
          <fpage>1817</fpage>
          -
          <lpage>1853</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>