=Paper=
{{Paper
|id=Vol-3922/paper6
|storemode=property
|title=Detection and classification of Emotion Recognition System for TESS and Crema-d Audio Datasets Using Hybrid Deep Learning Architecture
|pdfUrl=https://ceur-ws.org/Vol-3922/paper6.pdf
|volume=Vol-3922
|authors=Rafik Djalal Hammou
|dblpUrl=https://dblp.org/rec/conf/iam/Hammou24
}}
==Detection and classification of Emotion Recognition System for TESS and Crema-d Audio Datasets Using Hybrid Deep Learning Architecture==
Hammou Djalal Rafik 1,*
1 Djillali Liabes University, Faculty of Exact Sciences, Department of Computer Science, BP 89, 22000 Sidi Bel Abbes, Algeria
Proceedings of IAM'24: International Conference on Informatics And Applied Mathematics, December 4-5, 2024, Guelma, Algeria
* Corresponding author.
Email: r_hammou@esi.dz (H. D. Rafik)
ORCID: 0000-0002-0038-0424 (H. D. Rafik)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract
Humans communicate their desires through spoken language, which expresses various emotions. This has
led to the development of speech recognition systems in which machine learning enables computers to recognize
and analyze vocal cues to interpret emotions, leading to applications focused on human-machine interaction.
Advances in technology, the evolution of artificial intelligence, and the influence of deep learning through CNN
architectures have propelled research on emotion recognition systems forward. In this paper, we evaluate our
method for detecting and classifying emotions with two architectural models (Model-A and Model-B) that use
Mel-frequency cepstral coefficients to extract features from audio files. The experiments were conducted on
the TESS and Crema-d audio databases. The outcomes are promising, showing an accuracy of 54,07% for
Model-B with the Crema-d dataset and 98,92% for Model-A with the TESS dataset.
Keywords
Speech, Architecture, Emotion, Accuracy, Recognition
1. Introduction
Speech is a means of communication with the outside world. Each human being has his or her own speech, and it
is thanks to natural language that individuals can discuss and understand each other. Speech is unique
and expressed through a well-defined language addressed to an interlocutor (or to oneself). It allows us to
express needs such as feelings, suffering, aspirations, observations, and the formulation of requests. It
also gives rise to the different natural languages and dialects.
Speech is the most popular tool of expression because it is easier to speak than to write or draw
a diagram. Nonetheless, the process of producing speech, from the brain to articulation by the
mouth using the vocal cords, is intricate. This difficulty makes the automation of speech by machine
complicated [1].
The phonatory apparatus used for speech is an acoustic mechanism that differs from the other
sensory systems. It comprises elements such as the vocal cords, the oral and nasal cavities, the air, the
nervous system, and the tongue and lips [3].
These elements are described in detail (see Fig.1):
• The throat (pharynx), which occupies a large region of the head and neck, is made up of three
primary parts: the hypopharynx, the nasopharynx, and the oropharynx.
Figure 1: The phonatory anatomy system of the human speech mechanism [2].
• The hypopharynx is the lower section of the pharynx, located behind the larynx. Above it sits a flap-like
structure (the epiglottis) that acts as a lid for the larynx and closes when swallowing to prevent food and liquid
from entering the trachea.
• The nasopharynx is located behind the nose; it carries the air we breathe down through the voice
box of the throat and into the lungs.
• The oropharynx is situated posterior to the mouth and is responsible for conveying food and
liquid to the digestive tract and stomach.
• The larynx is the voice box, and it is efficient because it protects the lungs from food, drink, and
foreign bodies and acts as a corridor for air from the nasopharynx.
• The epiglottis protects the lungs and only lets air pass through.
• The vocal tract extends from the vocal cords to the lips and measures approximately 17.5 cm in
length. It consists of two types of ducts: a pharyngeal cavity and an oral cavity.
• The vocal cords open when breathing and close when swallowing or producing voice sounds.
Understanding the vocal process is straightforward. It starts with the movement of air passing
through the two vocal cords (see Fig.2), which are soft and vibrate as air flows through them,
producing 100 to 1 000 vibrations per second.
• Articulators are the tongue, lips, jaws, mouth, etc. These articulators allow for the modification
of the shape of the vocal tract.
• The nasal cavity forms a part of the vocal tract and is situated beneath the velum.
Speech production system:
The human speech production system involves multiple stages before the voice is produced. It
operates rapidly and is complex, relying on the respiratory system, the phonatory system, and the
articulatory system working together; collectively, these systems create speech and sound. The
respiratory system manages the intake and release of air in the lungs. During inhalation, the diaphragm
lowers and the intercostal muscles create a vacuum in the lungs for air intake. During exhalation, the
diaphragm relaxes, allowing air to escape and produce sounds. Air then passes through the larynx,
which contains muscles and cartilages surrounding the vocal cords, with the space between them known
as the glottis. The vocal cords are capable of rapid opening and closing, reaching up to 400 movements
per second in children. The arytenoid cartilages, to which the vocal cords are attached, enable them to
move. The articulation process starts with air leaving the larynx and passing through the pharynx before
being modulated by resonators such as the lips, tongue, mandible, and velum, which impart characteristics
to the sound (air flowing freely results in a vowel, while encountering an obstacle yields a consonant).
Figure 2: The different positions of the vocal cords in humans: A) vocal cords in open position, B)
vocal cords in closing position, C) vocal cords in closed position (semi-paralysis) [4].
The analysis of a person's emotions is an important area of research in Natural Language Processing
(NLP). For example, it allows us to recognize an individual's anxiety, and in certain cases emotion
analysis is used to help diagnose diseases. NLP research increasingly relies on deep learning, which
has revolutionized the world of artificial intelligence, for example in medical research for the early
diagnosis of diseases such as COVID-19 [5] or in the field of biometric recognition [6].
The strategy of our approach consists of:
1. Conducting a consistent bibliographic study in the speech emotion recognition field.
2. Extracting adequate scientific knowledge to apply in our approach.
3. Using quality datasets made available to the scientific community.
4. Applying our approach with deep neural networks of the LSTM type using the Mel-frequency
cepstral coefficients (MFCC).
5. Testing our experiments on two datasets, TESS and CREMA-D.
6. Evaluating the results obtained with the following evaluation parameters: number of trained and untrained
parameters, maximum validation accuracy, test accuracy, score, loss, Precision, Recall, F1-score, and Support.
7. Establishing a comparative table between our results and those of the state of the art.
The rest of our paper is structured as follows: Section 2 is devoted to a literature review of speech
recognition systems with the individual analysis of emotions and the datasets used. Section 3 is
dedicated to the implementation and methodology of the application of our approach with the use of
the architecture of the deep neural network of the LSTM type by focusing on the use of Mel-frequency
cepstral coefficients (MFCC). Section 4 concerns the experimental results obtained on the datasets with
the evaluation metrics used to validate our approach. We conclude with a section discussing the
primary obstacles and future research directions for speech emotion recognition (SER) systems.
2. Related work
In May 2020, De Pinto G. et al. [7] developed a deep neural network for speech recognition with eight
emotions. The developed model is based on convolutional neural networks (CNN). The tests were
carried out on the RAVDESS database; the results include an F1 score of 91,00% overall, 95,00% for the
emotion class Angry, and 87,00% for the class Sad. In February
2022, Puri, T. et al. [8] built a hybrid architecture (LSTM + CNN) using the Hidden Markov Model
and Deep Neural Networks (DNN) for a speech recognition system based on eight emotions. They
applied their approach to the RAVDESS dataset with a three-branch division strategy (for males and
females); the first branch concerns emotions in two positive classes (male and female); the second
branch is divided into three emotion classes (positive, negative, and neutral); and finally, the last branch
is divided into eight different emotion classes. They obtained an accuracy of 98,00%. In September
2022, Gupta, M. V. et al. [9] developed a computer system to detect stress, which has behavioral,
emotional, and physical effects. The authors proposed a cascaded RNN-LSTM architectural system
and applied their approach to the RAVDESS dataset and obtained an accuracy of 91,00%. In October
2022, Ullah, S. et al. [10] developed an architecture model based on speech recognition of emotions
with human-machine interaction. The researchers proposed a one-dimensional CNN (convolutional
neural network). They tested it on a combined emotional dataset (Crema-D, Ravdess, Savee, and Tess)
with a feature set and classifier ZCR+energy+entropy of energy+RMS+MFCC. The proposed model
obtained an accuracy of 92,62%. In November 2022, Vijayan, D. M. et al. [11] proposed two architecture
models for speech recognition based on deep learning. The aim is to analyze the emotions of speech
and classify them while extracting the spatial and temporal technical characteristics. The first model is
a combined CNN-LSTM architecture, and the second is a CNN-Transformer encoder architecture. The
RAVDESS database was used for the experimentation. The first model obtained an accuracy of 74,00%,
while the second achieved a higher accuracy of 82,00%. In January 2023, Ahmed, R. et al. [12]
contributed to the implementation of a speech recognition system based on emotion analysis and
feature extraction motivation employed on Convolutional Neural Networks (CNN), Long Short-Term
Memory (LSTM), and Gated Recurrent Unit (GRU). The authors deployed three different architectures:
the first architecture uses 1D CNN followed by FCN networks (Model-A), the second architecture uses
1D CNN followed by LSTM-FCN networks (Model-B), and the third architecture uses 1D CNN followed
by GRU-FCN networks (Model-C). They also used data augmentation by adding Gaussian noise. The
experiments were carried out on five databases, and they obtained very high accuracies: 95,62% for
RAVDESS, 99,46% for TESS, 90,47% for CREMA-D, 95,42% for EMO-DB, and 93,22% for SAVEE. In
March 2023, Shah, N. et al. [13] contributed to creating a powerful computational model based on
the Mel frequency cepstral coefficients by combining three datasets: RAVDESS, TESS, and SAVEE.
The model uses two classifiers, Random Forest and Boosting Ensemble, and the prediction accuracy
results are 86,30% for the first classifier and 85,80% for the second classifier. Another learning model
was experimented with on the dataset and obtained an accuracy of 75,00%. In July 2023, Bhawesh,
K. et al. [14] developed four neural networks for speech emotion recognition using features such as
MFCC, chroma, spectral roll-off, etc. The architecture models used are LSTM, CNN, MLP, and
Random Forest. The experiments were carried out on a combined dataset (SAVEE, RAVDESS,
CREMA-D, and TESS), and the accuracies obtained are 57,50% for the first model, 75,80% for the
second, 59,60% for the third, and 67,80% for the last. In December
2023, Tyagi, S. et al. [15] developed a computer application based on vocal emotions that makes it
possible to understand and identify human emotions from speech. The proposed prototype uses the LSTM
architecture with Grey Wolf Optimization (GWO) on a combined dataset comprising the SAVEE,
TESS, EMO-DB, and RAVDESS databases, and obtained good accuracies of 65,47% (SAVEE), 99,93% (TESS), 78,00%
(EMO-DB), and 87,00% (RAVDESS), respectively. In February 2024, Lata, S. et al. [16] experimented with
two neural network architectures. The first model is a hybrid architecture of a convolutional neural
network (CNN) and long short-term memory (LSTM). The second model consists of an architecture
composed of MFCC+LSTM. The aim is to exploit a stack of depth layers in linear form to improve
the accuracy of the speech sentiment recognition system. The tests were carried out on the TESS
database, and the results obtained are encouraging, with an accuracy of 98,00% for the first model and
96,00% for the second model. In March 2024, Yuan, Z. et al. [17] developed a computer module for a
speech emotion recognition system, focusing on speaker identity information (whose presence
harms model generalization). The authors proposed a DTNet-type neural network to dissociate
acoustic features from emotional features. The experiments were tested on two databases. They
obtained an accuracy of 74,80% for IEMOCAP and 95,00% for Emo-DB. In April 2024, Islam A. et al. [18]
built a consistent computing system for speech emotion detection and enhancement. In this context,
the authors experimented with their approaches by merging three databases: RAVDESS, TESS, and
CREMA-D. They also proposed a hybrid architectural model using CNN and BiLSTM for the eight
emotions. The proposed model is based on root mean square energy (RMSE), zero crossing rate (ZCR),
and Mel frequency cepstral coefficient (MFCC). The proposed model achieved an accuracy of 97,80%. In
June 2024, Akinpelu, S. et al. [19] proposed a computer application of speech recognition based on
machine learning to detect emotions. This system is based on the principle of the Vision Transformer
(ViT) model. It allows the capture of the characteristics in the images that are adequate indicators of
emotional states from the input data of the mel spectrogram introduced into the model. The TESS
(Toronto English Speech Set) and EMODB (Berlin Emotional Database) were used for the experiments.
The results are satisfactory, as they obtained an accuracy of 98,00% (TESS), 91,00% (EMO-DB), and
93,00% (TESS+EMO-DB). In August 2024, Hossain, I. et al. [20] implemented a hybrid model to
extract information and improve prediction accuracy with calculated probabilities. The model uses the
convolutional neural network (CNN) architecture to extract features from the speech spectrogram.
After that, a long short-term memory (LSTM) network processes the features. The authors used a KNN
classifier to classify emotions and make predictions. They conducted experiments using the TESS
database and achieved an accuracy of 98,21%.
The proposed approach is based on two architectural models, and one of them has given excellent
results and is even better than some methods in the literature. The contributions of our model are
defined as follows:
• The development of a new neural network architecture based on the long short-term memory
(LSTM) network, which is dedicated to the classification and analysis of human emotions.
• The proposed LSTM structure features nine layers, beginning with an input layer for the data and
ending with an output layer that uses the softmax activation function to classify seven distinct
emotions. It is composed of seven hidden layers combining dense layers and dropout regularization.
• The suggested LSTM model achieved an accuracy of 98,92% for classifying the seven emotions of the
TESS database: pleasant (ps), anger, happiness, disgust, fear, sadness, and neutral.
• The LSTM model is lightweight and contains about 307 655 parameters. It allows for accelerated
learning and reduces the computation time for emotion class prediction.
3. Methodology and Implementation
3.1. Data collection
The basis of a speech-emotion recognition system is the quality of the audio file database because a
good-quality dataset implies an efficient and robust system. If the audio file database contains noise, it
must go through refining preprocessing to clean it, and this is done through the process of filtering,
encapsulation, and integration [21]. An SER system goes through the phases of data collection, preprocessing,
feature extraction, feature selection, classification, and recognition [22]. The data collection stage is the
most sensitive phase of the system: the data must be collected thoroughly and meet high-quality
standards. According to the literature, the University of Maribor in Slovenia was the birthplace of the first
database for an SER system [23]. It consists of six types of emotions, the audio files are in MPEG-4
format [24], and the dataset contains 186 utterances for each emotional category.
3.2. Data preprocessing
Before feeding the data into the neural network for learning, it is necessary to apply preprocessing that
extracts features from the audio files and transforms them into mathematical coefficients for the
classification and recognition of emotions. For this, we rely on the following notions:
• Mel scale: a perceptual (acoustic) scale of pitch, whose unit of measurement is the Mel, characterizing
how high or low a sound is perceived.
• Frequency: the number of oscillations (vibrations) of the sound per second, measured in Hertz.
• Chromagram: represents the intensity of the audio signal at a given moment; it is made up of a
chroma vector (12 dimensions corresponding to the 12 semitones of the chromatic scale).
• Pitch: the vibration frequency corresponding to the sounds.
• Fourier transform: decomposes periodic signals and links the temporal signal to its frequency
representation.
Subsequently, the Mel spectrogram will be used for individual identification with speech recognition
and emotional states. It is represented by an image whose x-axis represents time and whose y-axis
represents frequency, with a logarithmic scale applied to the magnitudes [25].
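As an illustration, the following minimal sketch computes and displays such a log-scaled Mel spectrogram. It assumes the librosa library (not mentioned in the paper); the file name and the number of Mel bands are placeholders.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a speech recording (hypothetical file name) and compute its Mel spectrogram.
y, sr = librosa.load("speech_sample.wav")
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# Convert power values to decibels: the logarithmic scale mentioned above.
S_db = librosa.power_to_db(S, ref=np.max)

# Time on the x-axis, Mel frequency on the y-axis.
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-scaled Mel spectrogram")
plt.show()
```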
The Mel-frequency cepstral coefficients (MFCC) are used for speech recognition, and they are obtained
through the following steps:
1. Compute the Fourier transform of the signal.
2. Map the powers of the resulting spectrum onto the Mel scale using overlapping triangular windows.
3. Take the logarithm of the power at each Mel frequency.
4. Compute the discrete cosine transform (DCT) of the list of Mel log powers as if it were a signal.
5. The amplitudes of the resulting spectrum are the MFCC.
6. For each audio recording, 40 MFCC values [26] are used as input data for the neural network.
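A minimal sketch of this feature extraction is given below. It assumes the librosa library (not named in the paper) and assumes that the 40 coefficients are averaged over time to obtain one fixed-length vector per recording; the file name is hypothetical.

```python
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    """Return a single 40-value MFCC vector for one audio recording."""
    y, sr = librosa.load(path)                              # read the audio file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (40, n_frames)
    return np.mean(mfcc, axis=1)                            # average over time frames

features = extract_mfcc("speech_sample.wav")  # hypothetical file name
print(features.shape)                         # (40,) -> input vector for the network
```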
3.3. Architectural model
The proposed neural network Model-A is of the LSTM (long short-term memory) type [27]; its input is a
vector of the 40 MFCC coefficient values of an audio recording. The network comprises an LSTM input layer
of 256 neurons and an output layer with the softmax activation function, whose seven neurons correspond
to the emotions (selection probabilities). The network has seven hidden layers: four dropout layers with a
rate of 0,2, and three dense layers with the relu activation function of 128, 64, and 32 neurons,
respectively (see Fig.3).
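The Keras sketch below is one way to assemble Model-A that is consistent with the description above; the exact placement of the dropout layers is our assumption, and the optimizer settings are taken from Table 1. With a 40-value MFCC vector treated as a length-40 sequence of single features, this layout yields 307 655 trainable parameters, matching the count reported in Table 3.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model_a = Sequential([
    LSTM(256, input_shape=(40, 1)),   # input layer: 40 MFCC values as a sequence
    Dropout(0.2),
    Dense(128, activation="relu"),
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(7, activation="softmax"),   # seven TESS emotion classes
])

# Hyper-parameters taken from Table 1.
model_a.compile(
    optimizer=Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-06),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model_a.summary()  # reports 307 655 trainable parameters
```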
Model-B is a hybrid CNN-LSTM architecture [28]: a combined neural network with a total of 18 layers.
The input layer consists of 32 neurons with a relu activation function and takes as input a vector
containing the 40 MFCC coefficient values. The output layer uses the softmax activation function, with
six neurons corresponding to the emotions of speech recognition. The network is composed of 16 hidden
layers: five dropout layers (two with a rate of 30% and three with a rate of 50%), two convolution layers
of 32 and 64 neurons with the relu activation function, two max-pooling layers of two dimensions, a
flatten layer, two LSTM layers of 128 neurons, a dense layer of 128 neurons with the relu activation
function, and finally a batch normalization layer (see Fig.4).
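For comparison, a sketch of a CNN-LSTM hybrid in the spirit of Model-B is shown below. The kernel sizes, pooling configuration, and layer ordering are our assumptions (the paper does not specify them, and the flatten layer is omitted here so that the LSTM layers receive a sequence), so this sketch does not reproduce the exact parameter count of Table 3.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout, LSTM,
                                     Dense, BatchNormalization)

model_b = Sequential([
    Conv1D(32, kernel_size=3, padding="same", activation="relu",
           input_shape=(40, 1)),          # input: 40 MFCC values
    MaxPooling1D(pool_size=2),
    Dropout(0.3),
    Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    MaxPooling1D(pool_size=2),
    Dropout(0.3),
    LSTM(128, return_sequences=True),
    Dropout(0.5),
    LSTM(128),
    Dropout(0.5),
    Dense(128, activation="relu"),
    BatchNormalization(),
    Dropout(0.5),
    Dense(6, activation="softmax"),       # six CREMA-D emotion classes
])
model_b.compile(optimizer="adam", loss="categorical_crossentropy",
                metrics=["accuracy"])
```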
3.4. Hardware and Software
During the experimentation phase, we used the following hyper-parameters in both neural network
architecture models (see Table 1).
We also used hardware and software to execute our approach (see Table 2).
Figure 3: Architecture of the LSTM neural network for the TESS emotions dataset.
Figure 4: Architecture of the CNN-LSTM neural network for the Crema-D dataset.
4. Results and discussion
Applying our approach requires the use of two datasets.
Dataset 1: The TESS dataset contains audio files for seven emotions (pleasant (ps), anger, happiness,
disgust, fear, sadness, and neutral) [29]. The recordings were made by two English-speaking actresses,
aged 26 and 64, recruited in the Toronto area, both with university education and musical training. The
dataset contains 2 800 audio files (the recordings were based on the Northwestern University Hearing
Test Number 6) covering a set of 200 target words embedded in the carrier sentence "Say the word ........".
Table 1
The values and properties of hyper-parameters.
N° Property Value
1 Number of classes 6 and 7
2 Batch size 64
3 Epochs 100
4 Learning rate 0,0001
5 Optimizer Adam
6 Beta 1 0,9
7 Beta 2 0,999
8 Epsilon 1e-06
9 Loss Categorical cross-entropy
Table 2
Hardware and software characteristics.
Hardware and software used Characteristic details
Programming language Python 3.7.10
Mathematical library NumPy 1.19.5
Memory (RAM) 32 GB
Operating system x86-64 GNU/Linux
Deep learning framework Keras 2.4.3
Graphics card (GPU) NVIDIA Tesla T4, 16 GB GDDR6, 2560 NVIDIA CUDA cores, PCIe Gen 3.0 x16
Architecture TensorFlow 2.4.1
Numerical software library Pandas 1.1.5
Notebook Jupyter
Processor (CPU) Intel(R) Xeon(R) CPU @ 2.00GHz
Visualization library Matplotlib 3.4.3
Dataset 2: The CREMA-D dataset consists of 7 442 original recordings featuring 91 actors (48 male and
43 female) aged between 20 and 74, from diverse racial and ethnic backgrounds such as Hispanic,
African American, Caucasian, Asian, and Unspecified [30]. The actors delivered 12 specific sentences
while expressing six different emotions (Happy, Neutral, Sad, Fear, Anger, and Disgust) at four intensity
levels (Low, Medium, High, and Unspecified).
To evaluate the results obtained from the experiments of our approach on the two datasets, we
used the following criteria: number of trained parameters (train param), number of untrained
parameters (untrain param), total number of parameters (total param), top loss (validation), top
accuracy (validation), score (test), and accuracy (test).
On the other hand, we have adopted specific evaluation criteria to measure the performance of our
approach on the audio corpus. The F1-score, Recall, Precision, and Support parameters are calculated
from the TP (true positive), TN (true negative), FP (false positive), and FN (false negative) counts [31]. These
parameters are defined as follows:
Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (3)

Support = Σ (instances in each class)    (4)

Figure 5: Results of the accuracy and loss experiments with the LSTM architecture (TESS dataset).
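A minimal sketch (assuming scikit-learn, which the paper does not list among its tools) of how these per-class metrics and a confusion matrix such as the one in Figure 6 can be produced from the test predictions; the label arrays below are small dummy values standing in for the real test split.

```python
from sklearn.metrics import classification_report, confusion_matrix

emotions = ["fear", "angry", "disgust", "neutral", "sad", "ps", "happy"]

# In practice: y_true from the test split, y_pred = model_a.predict(x_test).argmax(axis=1).
y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1]   # dummy ground-truth class indices
y_pred = [0, 1, 2, 3, 4, 5, 6, 1, 1]   # dummy predicted class indices

# Precision, Recall, F1-score and Support per emotion class (Equations 1-4).
print(classification_report(y_true, y_pred, target_names=emotions))

# Raw counts behind the confusion matrix shown in Figure 6.
print(confusion_matrix(y_true, y_pred))
```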
Table 3
Results of experiments with the two architectural models.
Architecture Total param Train param Non-train param Accuracy (test) (%) Score (test) Top Accuracy (val) (%) Top loss (val)
LSTM 307 655 307 655 0 98,92 0,0561 99,21 0,0573
CNN+LSTM 275 142 274 758 384 54,07 1,4708 55,02 1,2211
Figure 5 displays the outcomes of the accuracy and loss experiments conducted on the TESS database
using the LSTM architectural model. The diagram shows that the training and validation accuracy
curves closely align, and the same pattern is observed for the loss, indicating that the neural network
has effectively learned with minimal risk of overfitting.
Figure 6 illustrates the outcomes of the confusion matrix from experiments conducted on the
TESS database for identifying the emotions: anger, happiness, pleasant, disgust, fear, sadness, and
neutral. From this matrix, we observe that the accuracy for the emotion categories fear, sad, angry,
disgust, and happy is notably higher than for the other categories, while the neutral and pleasant
emotions exhibit slightly lower accuracy. Overall, however, the model maintains an almost uniform
distribution of accuracy across all emotion categories, with no sharp disparities. Furthermore, the overall
accuracy of the emotion recognition system stands at 98,92 %.
Table 3 shows the results of the experiments of the two proposed architectural models. The LSTM
model gave excellent results, with an accuracy of 98,92% and a score of 0,0561, while the CNN-LSTM
model gave average results, with an accuracy of 54,07% and a score of 1,4708.
Table 4 displays the results from the tests conducted on the TESS emotion database, including
the evaluation metrics of recall, precision, F1 score, and support. It is noteworthy that the emotion
categories happy, disgust, neutral, pleasant, and sad exhibit high precision, as illustrated in Figure 6,
while the emotion classes angry and fear show slightly lower precision. Additionally, both the micro
average and macro average precision are 0,99.
Table 5 showcases a comparison of our method’s outcomes on both databases against those found in
the existing literature.
Table 4
Results and predictions emotion classes from the database TESS with the model LSTM.
Table Evaluation parameters
Emotion Precision Recall F1-score Support
Fear 0,95 0,98 0,96 41
Angry 0,98 1,00 0,99 42
Disgust 1,00 1,00 1,00 43
Neutral 1,00 1,00 1,00 42
Sad 1,00 1,00 1,00 36
Pleasant (ps) 1,00 0,94 0,97 32
Happy 1,00 1,00 1,00 44
Micro avg 0,99 0,99 0,99 280
Macro avg 0,99 0,99 0,99 280
Weighted avg 0,99 0,99 0,99 280
Sample avg 0,99 0,99 0,99 280
Figure 6: Confusion matrix for the TESS emotions dataset (Model-A).
5. Conclusion
The experiments and tests carried out with our approach on the two emotion databases, Crema-d and
TESS, using the two architectural models LSTM and CNN + LSTM, yielded an accuracy of 54,07% for the
CNN + LSTM model on the Crema-d dataset and 98,92% for the LSTM model on the TESS dataset. This
also implies that the proposed LSTM model is effective across all emotion classes, with an almost uniform
distribution of accuracy, which makes it well suited to an emotion recognition system.
The introduction of deep learning has significantly enhanced research in areas such as medicine [33],
biometrics [32], and, even more so, speech recognition systems, especially through the use of convolutional
neural network (CNN) architectures. We aim to enhance our model by utilizing a more extensive
dataset of emotions, and additionally, we will analyze speech to determine whether the patient is
experiencing a psychological disorder.
Table 5
A comparison table between our results and those of the state of the art.
Author Year Nbr of classes Nbr of files Dataset Architecture Accuracy
De Pinto, G. et al. [7] May 2020 8 7356 RAVDESS CNN-MFCC 91,00 %
Puri, T. et al. [8] February 2022 8 7356 RAVDESS CNN-LSTM+DNN 98,00 %
Gupta, M. V. et al. [9] September 2022 8 7356 RAVDESS RNN-LSTM 91,00 %
Ullah, S. et al. [10] October 2022 7 18078 Crema-D+Ravdess+Savee+Tess CNN 92,62 %
Vijayan, D. M. et al. [11] November 2022 8 7356 RAVDESS CNN-LSTM 74,00 %
Vijayan, D. M. et al. [11] November 2022 8 7356 RAVDESS CNN-Transformer 82,00 %
Ahmed, R. et al. [12] January 2023 8 7356 RAVDESS Model-A+Model-B+Model-C 95,62 %
Ahmed, R. et al. [12] January 2023 7 2800 TESS Model-A+Model-B+Model-C 99,46 %
Ahmed, R. et al. [12] January 2023 6 7442 CREMA-D Model-A+Model-B+Model-C 90,47 %
Ahmed, R. et al. [12] January 2023 7 535 EMO-DB Model-A+Model-B+Model-C 95,42 %
Ahmed, R. et al. [12] January 2023 7 480 SAVEE Model-A+Model-B+Model-C 93,22 %
Shah, N. et al. [13] March 2023 7 10636 RAVDESS+TESS+SAVEE Boosting (KNN+MLP+RF) 86,30 %
Shah, N. et al. [13] March 2023 7 10636 RAVDESS+TESS+SAVEE Random Forest 85,80 %
Shah, N. et al. [13] March 2023 7 10636 RAVDESS+TESS+SAVEE CNN-LSTM 75,00 %
Bhawesh, K. et al. [14] July 2023 7 18078 SAVEE+RAVDESS+CREMA-D+TESS LSTM 57,50 %
Bhawesh, K. et al. [14] July 2023 7 18078 SAVEE+RAVDESS+CREMA-D+TESS CNN 75,80 %
Bhawesh, K. et al. [14] July 2023 7 18078 SAVEE+RAVDESS+CREMA-D+TESS MLP 59,60 %
Bhawesh, K. et al. [14] July 2023 7 18078 SAVEE+RAVDESS+CREMA-D+TESS Random Forest 67,80 %
Tyagi, S. et al. [15] December 2023 8 7356 RAVDESS CNN-LSTM-GWO 87,00 %
Tyagi, S. et al. [15] December 2023 7 2800 TESS CNN-LSTM-GWO 99,93 %
Tyagi, S. et al. [15] December 2023 7 480 SAVEE CNN-LSTM-GWO 65,47 %
Tyagi, S. et al. [15] December 2023 7 535 EMO-DB CNN-LSTM-GWO 78,00 %
Lata, S. et al. [16] February 2024 7 2800 TESS CNN-LSTM 98,00 %
Lata, S. et al. [16] February 2024 7 2800 TESS MFCC-LSTM 96,00 %
Yuan, Z. et al. [17] March 2024 10 IEMOCAP DTNet 74,80 %
Yuan, Z. et al. [17] March 2024 7 535 EMO-DB DTNet 95,00 %
Islam, A. et al. [18] April 2024 8 17589 RAVDESS+TESS+CREMA-D CNN+BiLSTM 97,80 %
Akinpelu, S. et al. [19] June 2024 7 2800 TESS ViT 98,00 %
Akinpelu, S. et al. [19] June 2024 7 535 EMO-DB ViT 91,00 %
Akinpelu, S. et al. [19] June 2024 7 3335 TESS+EMO-DB ViT 93,00 %
Hossain, I. et al. [20] August 2024 7 2800 TESS CNN+LSTM+KNN 98,21 %
Proposed approach October 2024 6 7442 CREMA-D CNN-LSTM 54,07 %
Proposed approach October 2024 7 2800 TESS LSTM 98,92 %
Declaration on Generative AI
The author(s) have not employed any Generative AI tools.
References
[1] Children’s Health Queensland, Vocal cord palsy, https://www.childrens.health.qld.gov.au/
health-a-to-z/vocal-cord-palsy#section__signs-and-symptoms. Accessed: 2024-11-19.
[2] Anatomy Corner, Epiglottis, https://anatomycorner.com/main/2016/09/13/epiglottis/. Accessed:
2024-11-12.
[3] J. A. Seikel, D. J. Hudock, D. G. Drumright, Anatomy & Physiology for Speech, Language, and
Hearing, Sixth Edition, Plural Publishing, 2021, 912 pages, full color, hardcover, 8.5" x 11",
ISBN-13: 978-1-63550-279-4.
[4] THROAT - Anatomy, respiration, voice swallowing, A brief introduction to the throat’s anatomy,
https://www.aarontrinidade.com/throat. Accessed: 2024-11-12.
[5] D. R. Hammou, Classification and detection of covid-19 in human respiratory lungs using convolu-
tional neural network architectures, in: 2021 International Conference on Artificial Intelligence for
Cyber Security Systems and Privacy (AI-CSP), IEEE, 2021. doi:10.1109/ai-csp52968.2021.9671158.
[6] D. R. Hammou, S. A. Mahmoudi, R. Adjoudj, Multi-Biometric Iris Recognition System Using
Consensus Between Convolutional Neural Network Architectures, Int. J. Organ. Collect. Intell.
12.1 (2022) 1–30. doi:10.4018/ijoci.305210.
[7] M. G. de Pinto, M. Polignano, P. Lops, G. Semeraro, Emotions Understanding Model from Spoken
Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients, in: 2020 IEEE
Conference on Evolving and Adaptive Intelligent Systems (EAIS), IEEE, 2020. doi:10.1109/eais48028.
2020.9122698.
[8] T. Puri, M. Soni, G. Dhiman, O. Ibrahim Khalaf, M. Alazzam, I. Raza Khan, Detection of Emotion of
Speech for RAVDESS Audio Using Hybrid Convolution Neural Network, J. Healthc. Eng. (2022)
1–9. doi:10.1155/2022/8472947.
[9] M. V. Gupta, S. Vaikole, A. D. Oza, A. Patel, D. P. Burduhos-Nergis, D. D. Burduhos-Nergis, Audio-
Visual Stress Classification Using Cascaded RNN-LSTM Networks, Bioengineering 9.10 (2022) 510.
doi:10.3390/bioengineering9100510.
[10] S. Ullah, Q. A. Sahib, Faizullah, S. Ullahh, I. U. Haq, I. Ullah, Speech Emotion Recognition Using
Deep Neural Networks, in: 2022 International Conference on IT and Industrial Technologies (ICIT),
IEEE, 2022. doi:10.1109/icit56493.2022.9989197.
[11] D. M. Vijayan, A. A. V, G. R, A. N. S. A, R. C. Roy, Development and Analysis of Convolutional
Neural Network based Accurate Speech Emotion Recognition Models, in: 2022 IEEE 19th India
Council International Conference (INDICON), IEEE, 2022. doi:10.1109/indicon56171.2022.10040174.
[12] M. Rayhan Ahmed, S. Islam, A. K. M. Muzahidul Islam, S. Shatabda, An Ensemble 1D-CNN-LSTM-
GRU Model with Data Augmentation for Speech Emotion Recognition, Expert Syst. With Appl.
(2023) 119633. doi:10.1016/j.eswa.2023.119633.
[13] N. Shah, K. Sood, J. Arora, Speech emotion recognition for psychotherapy: an analysis of traditional
machine learning and deep learning techniques, in: 2023 IEEE 13th Annual Computing and
Communication Workshop and Conference (CCWC), IEEE, 2023. doi:10.1109/ccwc57344.2023.
10099344.
[14] K. Bhawesh, D. Mustafi, A Comparison of Deep Learning and Machine Learning Models for
Speech Emotion Recognition Using Multiple Features, in: 2023 14th International Conference on
Computing Communication and Networking Technologies (ICCCNT), IEEE, 2023. doi:10.1109/
icccnt56998.2023.10307514.
[15] S. Tyagi, S. Szénási, Optimizing Speech Emotion Recognition with Deep Learning and Grey Wolf
Optimization: A Multi-Dataset Approach, Algorithms 17.3 (2024) 90. doi:10.3390/a17030090.
[16] S. Lata, N. Kishore, P. Sangwan, A Comparative Analysis of CNN-LSTM and MFCC-LSTM for
Sentiment Recognition from Speech Signals, Int. J. Intell. Syst. Appl. Eng. 12.21s (2024) 4392–4402.
Accessed: Nov. 22, 2024. [Online]. Available: https://ijisae.org/index.php/IJISAE/article/view/6295.
[17] Z. Yuan, C. L. Philip Chen, S. Li, T. Zhang, Disentanglement Network: Disentangle the Emo-
tional Features from Acoustic Features for Speech Emotion Recognition, in: ICASSP 2024 -
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024.
doi:10.1109/icassp48485.2024.10448044.
[18] A. Islam, M. Foysal, M. I. Ahmed, Emotion Recognition from Speech Audio Signals using CNN-
BiLSTM Hybrid Model, in: 2024 3rd International Conference on Advancement in Electrical and
Electronic Engineering (ICAEEE), IEEE, 2024. doi:10.1109/icaeee62219.2024.10561755.
[19] S. Akinpelu, S. Viriri, A. Adegun, An enhanced speech emotion recognition using vision trans-
former, Sci. Rep. 14.1 (2024). doi:10.1038/s41598-024-63776-4.
[20] I. Hossain, M. Islam, T. Nahrin, M. Rashed, M. Rahman, Improving Speech Emotion Recognition
and Classification Accuracy Using Hybrid CNN-LSTM-KNN Model, International Journal of Research
Publication and Reviews 5.8 (2024). URL: https://ijrpr.com/uploads/V5ISSUE8/IJRPR32597.pdf.
[21] T. Vamsikrishna, P. Naga Vyshnavi, Efficient Speech Emotion Recognition Using SVM and
Decision Trees, International Research Journal of Engineering and Technology, 2017.
URL: https://www.irjet.net/archives/V4/i7/IRJET-V4I7663.pdf.
[22] T. Pfister, P. Robinson, Real-Time Recognition of Affective States from Nonverbal Features of
Speech and Its Application for Public Speaking Skill Analysis. (2011) IEEE Transactions on Affective
Computing, 2(2), 66–78. URL:https://doi.org/10.1109/t-affc.2011.8.
[23] D. C. Ambrus, Collecting and recording of an emotional speech database. (2000), Maribor, Slovenia:
University of Maribor.
[24] J. Ostermann, Face Animation in MPEG-4. MPEG-4 Facial Animation: The Standard, Implementa-
tion and Applications, 17-55, (2002).
[25] H. Meng, T. Yan, F. Yuan, H. Wei, Speech Emotion Recognition From 3D Log-Mel Spectrograms With
Deep Learning Network, IEEE Access 7 (2019) 125868–125881. doi:10.1109/access.2019.2938007.
[26] Wei Han, Cheong-Fat Chan, Chiu-Sing Choy, Kong-Pang Pun, An efficient MFCC extraction
method in speech recognition, in: 2006 IEEE International Symposium on Circuits and Systems,
IEEE, 2006. doi:10.1109/iscas.2006.1692543.
[27] Y. Xie, R. Liang, Z. Liang, C. Huang, C. Zou, B. Schuller, Speech Emotion Classification Using
Attention-Based LSTM, IEEE/ACM Trans. Audio, Speech, Lang. Process. 27.11 (2019) 1675–1685.
doi:10.1109/taslp.2019.2925934.
[28] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D 2D CNN LSTM networks,
Biomed. Signal Process. Control 47 (2019) 312–323. doi:10.1016/j.bspc.2018.08.035.
[29] M. K. Pichora-Fuller, K. Dupuis, Toronto emotional speech set (TESS), Scholars Portal Dataverse, V1,
University of Toronto, Toronto, ON, Canada, 2020. URL: https://doi.org/10.5683/SP2/E8H2MF.
[30] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, R. Verma, CREMA-D: Crowd-sourced
Emotional Multimodal Actors Dataset. IEEE Trans Affect Comput. (2014) Oct-Dec;5(4):377-390.
PMID: 25653738; PMCID: PMC4313618. doi:10.1109/TAFFC.2014.2336244.
[31] X. Tang, Y. Lin, T. Dang, Y. Zhang, J. Cheng, Speech Emotion Recognition Via CNN-Transformer
and Multidimensional Attention Mechanism, ArXiv (Cornell University), 2024.
URL: https://doi.org/10.48550/arxiv.2403.04743.
[32] D. R. Hammou, S. A. Mahmoudi, R. Adjoudj, B. Mechab, A Model Of A Biometric Recognition
System Based On The Hough Transform Of Libor Masek and 1D LogGabor Filter. 2020 5th
International Conference on Cloud Computing and Artificial Intelligence: Technologies and
Applications (CloudTech), 1–9. URL:https://doi.org/10.1109/CloudTech49835.2020.9365917.
[33] D. R. Hammou, Y. Z. Feddag, S. Benadane, A New Architecture For Diagnosing Pulmonary
Thorax Diseases (Covid19, Pneumonology, Normal) Using Deep Learning Technology. 2023 6th
International Conference on Advanced Communication Technologies and Networking (CommNet),
1–10. URL:https://doi.org/10.1109/CommNet60167.2023.10365251.