<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cascade Multi-Modal Emotion Recognition Leveraging Audio-Video and EEG Signals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Renato Esposito</string-name>
          <email>renato.esposito001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Mele</string-name>
          <email>vincenzo.mele001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Verrilli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Minopoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo D'Errico</string-name>
          <email>lorenzo.derrico@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura De Santis</string-name>
          <email>ldesantis@unisa.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariacarla Stafa</string-name>
          <email>mariacarla.stafa@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples "Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples "Parthenope"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Salerno</institution>
          ,
          <addr-line>Salerno</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Emotion recognition is essential for improving human-computer interaction, but single-modality approaches often face challenges in accurately capturing the complexity of human emotions. To overcome these limitations, we introduce a novel multimodal system that combines audio, video, and electroencephalogram (EEG) data. The system employs two deep learning models: an audio-video classifier utilizing hybrid fusion for analyzing speech and facial expressions, and a Feature-Based Convolutional Neural Network (FBCCNN) designed to process EEG signals. These models are integrated through a meta-model that uses logistic regression to combine their predictions. The system is capable of classifying four emotions (happiness, sadness, anger, and neutral) and outperforms single-modality methods, particularly in handling more complex emotional states.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion recognition</kwd>
        <kwd>multimodal classification</kwd>
        <kwd>EEG signal processing</kwd>
        <kwd>fusion strategies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Emotion recognition has become a crucial area of study in human-computer interaction [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], with
significant implications for interpersonal relationships, perception, and decision-making [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As
automated systems become increasingly integral to daily life, the precise detection and classification of
emotions have grown in significance, thereby underscoring the importance of research in this domain.
      </p>
      <p>
        Traditional single-modality approaches, such as analyzing facial expressions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], voice patterns [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ],
or EEG signals [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], often struggle to capture the complexity of human emotions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Emotions are
inherently multimodal, manifesting through various physiological and behavioral channels
simultaneously. Relying on one modality can lead to incomplete or inaccurate assessments [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], particularly
in real-world contexts where environmental factors may degrade signal quality or introduce noise in
individual channels.
      </p>
      <p>
        Recent research has demonstrated the potential of multimodal approaches to improve recognition
accuracy [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. By integrating information from multiple sources, these systems can leverage
complementary features and compensate for weaknesses in individual modalities. However, significant
challenges persist in feature extraction, synchronization, and handling noisy or incomplete data [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
      <p>
        Deep learning has shown promise in addressing these challenges [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], with Convolutional Neural
Networks (CNNs) and other architectures achieving success in multimodal feature extraction and
classification [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ]. These approaches can automatically learn relevant features from raw data,
potentially capturing subtle emotional cues that traditional methods might miss. However, effectively
integrating diverse modalities remains an active research area requiring innovative solutions.
      </p>
      <p>
        To address these challenges, we propose a novel cascade multimodal system that integrates audio,
video, and electroencephalogram (EEG) inputs for emotion recognition. Our primary contribution lies
in the development of an effective meta-model integration approach that combines existing specialized
models for different modalities. Specifically, for audio-video processing, we adopt the established model
from Zhang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which has demonstrated strong performance in multimodal emotion recognition
through hybrid fusion strategies. For EEG signal processing, we leverage the Feature-Based Convolutional
Neural Network (FBCCNN) approach described by Pan and Zheng [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Our approach differs from previous work in several key aspects:
1. We implement a two-stage cascade architecture that first processes individual modalities through
specialized models before combining their outputs at a meta-level
2. We leverage state-of-the-art deep learning architectures already tailored to each modality's unique
characteristics
3. We utilize the existing hybrid fusion strategy for audio-visual processing from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that captures
both low-level interactions and high-level semantic relationships
4. We adopt the FBCCNN [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] specifically designed to effectively process EEG signals
5. Our main innovation is the development of a meta-model integration approach using logistic
regression that intelligently weighs predictions from each modality to produce superior
classification results
      </p>
      <p>The proposed system offers key advantages: robust synchronization of modalities, effective
feature extraction through specialized architectures, and an interpretable yet powerful meta-model that
combines predictions. Our experimental results show significant improvements over single-modality
systems, particularly in classifying complex emotional states that have traditionally been difficult to
recognize.</p>
      <p>The system is capable of classifying four primary emotions: happiness, sadness, anger, and neutral,
with high accuracy across varied conditions. This approach represents an important step toward
more naturalistic and robust emotion recognition systems that can function effectively in real-world
human-computer interaction scenarios.</p>
      <p>The remainder of this paper is organized as follows: Section 2 details the materials and methods,
including the architecture of both the audio-video and EEG models and their integration through our
meta-model; Section 3 outlines the experimental procedure and the datasets used; Section 4 presents the
results and provides a detailed discussion of our findings; and Section 5 concludes with a summary of
contributions and directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <p>This section presents the foundational components of our multimodal emotion recognition system.
We first provide an overview of the system architecture (Section 2.1), followed by detailed descriptions of the
specialized models for each modality (Section 2.2) and their integration through the meta-model approach (Section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. System Architecture Overview</title>
        <p>Our proposed system follows a cascade architecture comprising three main components:
1. Audio-Video Emotion Recognition Model: A specialized deep learning model that processes both
audio and video data, extracting complementary features from speech and facial expressions;
2. EEG-based Emotion Recognition Model: A Feature-Based Convolutional Neural Network
(FBCCNN) designed to analyze EEG signals and extract emotion-relevant patterns from brain activity;
3. Meta-Model Integration: A logistic regression-based model that takes predictions from the two
specialized models as input and produces the final emotion classification.</p>
        <p>This architecture allows each modality to be processed by models specifically designed for their
unique characteristics, before combining their outputs at a higher level. The complete system is capable
of classifying four primary emotions: happiness, sadness, anger, and neutral states.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Audio-Video and EEG architectures</title>
        <p>
          Audio-Video Model The audio-video emotion recognition component is based on the work of Zhang
et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which processes both audio and video streams through dedicated neural networks before
combining them through a fusion strategy. For our implementation, we adopt a late fusion approach as
described in [21], which has demonstrated effective performance in multimodal emotion recognition
tasks.
        </p>
        <p>As illustrated in Figure 1, the video stream is processed through EfficientFace [22], a lightweight
yet powerful convolutional neural network designed specifically for facial expression recognition.
Simultaneously, the audio stream undergoes processing through a series of one-dimensional
convolutional blocks that extract spectral and temporal features from speech signals. The outputs from
these specialized networks are then passed through transformer blocks that implement a cross-modal
attention mechanism. This allows each modality to benefit from complementary information in the
other stream:
• The Audio-Video transformer uses audio features as queries to attend to video features
• The Video-Audio transformer uses video features as queries to attend to audio features
The attended feature maps undergo pooling operations before being concatenated and fed into a
classification head that produces emotion predictions based on the combined audiovisual information.</p>
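        <p>To make this fusion step concrete, the following is a minimal PyTorch sketch of the cross-modal attention described above, in which audio features act as queries over video features and vice versa; the feature dimension, number of heads, sequence lengths, and mean pooling are illustrative assumptions rather than the configuration of the original model.</p>
        <preformat>
# Minimal sketch of the A-to-V / V-to-A cross-modal attention (illustrative dimensions).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # One attention module per direction: queries from one modality, keys/values from the other.
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, T_a, dim); video_feats: (batch, T_v, dim)
        av, _ = self.a2v(query=audio_feats, key=video_feats, value=video_feats)
        va, _ = self.v2a(query=video_feats, key=audio_feats, value=audio_feats)
        # Pool over time and concatenate, as in the fusion stage described above.
        return torch.cat([av.mean(dim=1), va.mean(dim=1)], dim=-1)   # (batch, 2 * dim)

# Example: a classification head on top of the fused representation (four emotion classes).
fusion = CrossModalBlock(dim=128, heads=4)
head = nn.Linear(256, 4)
logits = head(fusion(torch.randn(8, 50, 128), torch.randn(8, 30, 128)))
        </preformat>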
        <p>[Figure 1: Architecture of the audio-video branch. Video frames are processed by EfficientFace and the audio stream by stacked Conv1D blocks; transformer blocks A→V and V→A apply cross-modal attention, followed by pooling, concatenation, and a classification head.]</p>
        <p>
          EEG Model For the EEG modality, we implement the FBCCNN (Feature-Based Convolutional Neural
Network) approach described in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. This architecture is specifically designed to process the unique
characteristics of EEG signals in the context of emotion recognition.
        </p>
        <p>As shown in Figure 2, the FBCNN model first divides the EEG signals into multiple frequency bands
corresponding to established brain wave patterns: alpha (8-14 Hz), beta (14-31 Hz), gamma (31-49 Hz),
and theta (4-8 Hz). Each frequency band is processed separately through dedicated convolutional layers
that extract spatial-temporal features from the corresponding brain activity patterns.</p>
        <p>The architecture employs multiple convolutional layers with varying filter configurations to capture
features at different scales and abstraction levels. The extracted features from all frequency bands
are then concatenated and processed through a series of fully connected layers, ultimately producing
emotion classification outputs.</p>
        <p>[Figure 2: FBCCNN architecture. Dedicated convolutional layers process each frequency band; the extracted features are concatenated and passed through two fully connected layers to produce the emotion classification.]</p>
        <p>This band-specific processing is particularly advantageous for emotion recognition, as different
emotional states have been shown to manifest themselves in distinct frequency bands of brain activity.
For example, gamma activity (31-49 Hz) has been demonstrated to increase during the processing of
emotionally salient stimuli, particularly in the prefrontal and temporal regions of the brain [23, 24].</p>
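        <p>As an illustration of this band-specific design, the sketch below shows one possible PyTorch implementation of per-band convolutional branches followed by concatenation and fully connected layers; the number of layers, channel widths, and grid dimensions are our assumptions and do not reproduce the exact FBCCNN configuration of [20].</p>
        <preformat>
# Illustrative band-specific CNN: one convolutional branch per frequency band, then concat + FC.
import torch
import torch.nn as nn

class BandBranch(nn.Module):
    """Convolutional branch applied to one band (theta, alpha, beta, or gamma)."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):                    # x: (batch, 1, grid_h, grid_w)
        return self.net(x).flatten(1)        # (batch, channels)

class BandCNN(nn.Module):
    def __init__(self, n_bands=4, n_classes=4):
        super().__init__()
        self.branches = nn.ModuleList([BandBranch() for _ in range(n_bands)])
        self.fc = nn.Sequential(nn.Linear(32 * n_bands, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, bands):                # bands: (batch, n_bands, grid_h, grid_w)
        feats = [branch(bands[:, i:i + 1]) for i, branch in enumerate(self.branches)]
        return self.fc(torch.cat(feats, dim=1))
        </preformat>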
        <p>For both models, we employ a reduced set of four emotion categories: neutral, happy, angry, and sad.
These categories provide a balanced representation of the primary emotional dimensions according to
the Russell Circumplex Model of Affect [25], covering both positive and negative valence as well as
high and low arousal states.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Meta-model integration</title>
        <p>The meta-model serves as the integrative component of our cascade architecture, combining the
predictions from the audio-video and EEG models to produce a final emotion classification. This
approach follows the stacking ensemble method [26], where the outputs of base classifiers become
input features for a higher-level model.</p>
        <p>As illustrated in Figure 3, the meta-model receives predictions from both specialized models and
applies a logistic regression function to determine the final classification. This process involves several
key steps:
Data synchronization A critical challenge in multimodal emotion recognition is ensuring temporal
alignment between different data streams. We employ a batch-based synchronization strategy that
creates label-matched data pairs between the audio-video and EEG modalities. This procedure ensures
that both models are exposed to consistent emotional content despite coming from different datasets.</p>
        <p>The synchronization process creates emotional bins based on the class labels, allowing samples
from different modalities to be paired according to their emotional content rather than requiring strict
temporal alignment (a detailed description of the synchronization strategy is reported in Sec. 3.2 and
Alg. 1). This approach provides semantic consistency between modalities while accommodating the
reality that our training data comes from separate specialized datasets.</p>
        <p>Meta-Feature Creation The predictions from the audio-video and EEG models serve as meta-features
for the final classification stage. These predictions, represented as logits (pre-softmax outputs) for each
emotion category, capture the confidence levels of each specialized model regarding the emotional
content of the input.</p>
        <p>By using these prediction vectors as features, the meta-model can learn which modality tends to be
more reliable for specific emotional states and weigh their contributions accordingly. This approach
is more sophisticated than simple averaging or voting schemes, as it can adapt to the strengths and
weaknesses of each modality.</p>
        <p>Logistic Regression for Final Classification The final element of our system is a logistic regression
classifier, which receives the meta-features as input and outputs the final emotion prediction. We chose
logistic regression for this task because of its interpretability, computational efficiency, and effectiveness
in combining predictive signals from multiple sources. The logistic regression model learns the optimal
coefficients for the predictions from each modality, effectively determining their relative contributions to
the final decision. This approach balances the strengths of deep learning models for modality-specific
feature extraction with the interpretability and reliability of traditional machine learning techniques for
the final integration step. The entire process results in a robust multimodal emotion recognition system
that takes advantage of the complementary nature of audiovisual and neuro-physiological signals,
providing more accurate and reliable emotion classification than any single modality could achieve
independently.</p>
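        <p>A minimal sketch of this stacking step is shown below, assuming scikit-learn's LogisticRegression and four-dimensional logit vectors pre-computed by each frozen base model; the random placeholder arrays stand in for the synchronized prediction pairs described in Section 3.2.</p>
        <preformat>
# Sketch of the stacking step: logits from the two frozen base models become meta-features
# for a logistic regression classifier (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_meta_features(av_logits, eeg_logits):
    """av_logits, eeg_logits: arrays of shape (n_samples, 4), one row per synchronized pair."""
    return np.concatenate([av_logits, eeg_logits], axis=1)   # (n_samples, 8)

def train_meta_model(av_logits, eeg_logits, labels):
    meta = LogisticRegression(max_iter=1000)
    meta.fit(build_meta_features(av_logits, eeg_logits), labels)
    return meta

# Toy usage with random placeholders (in practice the logits come from the trained networks).
rng = np.random.default_rng(0)
av, eeg, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 4)), rng.integers(0, 4, size=32)
meta_model = train_meta_model(av, eeg, y)
predictions = meta_model.predict(build_meta_features(av, eeg))
        </preformat>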
        <sec id="sec-2-3-1">
          <title>Video</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Audio</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Procedure</title>
      <p>This section presents our experimental methodology and results. Following the architectures described
in Figure 3, we first train individual models for audio, video, and EEG data using the datasets detailed
in Section 3.1. These models’ outputs serve as inputs for a meta-model that produces the final emotion
prediction. Section 3.2 details the data preprocessing methodology, with particular emphasis on EEG
signal processing. All experiments were conducted on a high-performance computing cluster running
Linux, comprising eight interconnected computational nodes, with four nodes each equipped with four
NVIDIA V100 GPUs.</p>
      <sec id="sec-3-1">
        <title>3.1. The dataset</title>
        <p>A key challenge in multimodal emotion recognition research is the scarcity of datasets that
simultaneously capture audio, video, and EEG signals during emotional experiences. To address this limitation,
we adopted a two-stage training approach using specialized datasets for each modality, followed by a
synchronized integration procedure for the meta-model. This independent training phase ensures that
each model learns modality-specific features optimally.</p>
        <p>Audio-Video Dataset For training and evaluating the audio-video emotion recognition model, we
utilized the RAVDESS dataset (Ryerson Audio-Visual Database of Emotional Speech and Song) [27].
This dataset includes recordings of 24 professional actors (12 female, 12 male) vocalizing two
lexically matched statements in a neutral North American accent. The actors express various emotions including
neutral, calm, happy, sad, angry, fearful, surprise, and disgust, with each expression captured in three formats:
• Audio-only (16bit, 48kHz .wav)
• Audio-Video (720p H.264, AAC 48kHz, .mp4)
• Video-only (no sound)</p>
        <p>For our experiments, we focused exclusively on the Audio-Video format to capture both facial
expressions and vocal characteristics simultaneously. The dataset consists of 2,880 recordings, which
we partitioned following a 70:15:15 split for training, validation, and testing, respectively.</p>
        <p>To align with our four-category emotion classification scheme, we mapped the original eight emotional
expressions in RAVDESS to our target categories (neutral, happy, angry, and sad) based on the Russell
Circumplex Model of Affect [25]. This model arranges emotions in a two-dimensional space defined by
valence (positive/negative) and arousal (high/low), providing a theoretical foundation for our emotion
grouping. The specific mappings are detailed in Table 1.</p>
        <p>EEG dataset For the EEG modality, we employed the SEED-IV dataset [28], which contains EEG
recordings from 15 subjects during emotion-elicitation experiments. The dataset includes simultaneous
recordings from EEG and eye-tracking devices, providing comprehensive neurophysiological data
during emotional experiences.</p>
        <p>The original dataset comprises 1,080 EEG signals, which we further segmented into sequences of
800 samples each, resulting in a total of 37,575 samples. This segmentation approach ensures uniform
sample sizes and eliminates the need for padding operations. We divided the dataset using an 80:10:10
scheme for training, validation, and testing.</p>
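        <p>A simple NumPy sketch of this windowing step is given below; the channel-first layout and the decision to drop any incomplete tail are our assumptions, chosen so that no padding is required.</p>
        <preformat>
# Sketch: segment each EEG recording into non-overlapping windows of 800 samples.
import numpy as np

def segment(recording, window=800):
    """recording: array of shape (n_channels, n_samples); returns (n_windows, n_channels, window)."""
    n_windows = recording.shape[1] // window            # drop the incomplete tail: no padding needed
    trimmed = recording[:, : n_windows * window]
    return trimmed.reshape(recording.shape[0], n_windows, window).transpose(1, 0, 2)

# Example: a 14-channel recording of 10,000 samples yields 12 windows of 800 samples each.
assert segment(np.zeros((14, 10_000))).shape == (12, 14, 800)
        </preformat>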
        <p>Similar to the approach taken with the audio-video dataset, we mapped the original emotion categories
in SEED-IV (happy, sad, neutral, and fear) to match our four-category classification scheme. Specifically,
we consolidated the "fear" category into the "sad" emotional bin, as detailed in Table 1. This mapping
was guided by the valence-arousal coordinates in the circumplex model, where fear and sadness share
negative valence characteristics.</p>
        <p>Table 1: Mapping of input emotions to the final four categories.
Neutral, Calm → Neutral;
Happy, Surprised → Happy;
Angry, (Fear)ful, Disgusted → Angry;
Sad → Sad.</p>
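        <p>For illustration, one reading of this mapping can be expressed as two small dictionaries, one per dataset; the label strings are ours and do not correspond to the numeric label codes used in RAVDESS or SEED-IV.</p>
        <preformat>
# Illustrative label-mapping dictionaries (label strings are ours, not the datasets' codes).
RAVDESS_TO_TARGET = {
    "neutral": "neutral", "calm": "neutral",
    "happy": "happy", "surprised": "happy",
    "angry": "angry", "disgust": "angry",
    "fearful": "angry",   # listed (parenthesized) in the Angry row of Table 1
    "sad": "sad",
}

# SEED-IV provides four labels; as described above, its "fear" class is consolidated into "sad".
SEEDIV_TO_TARGET = {"neutral": "neutral", "happy": "happy", "sad": "sad", "fear": "sad"}
        </preformat>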
        <p>Meta-Model Dataset The dataset for training and evaluating the meta-model consists of the
prediction outputs from both pre-trained models (audio-video and EEG). These models were individually
trained on their respective datasets to extract modality-specific information, producing prediction
vectors that reflect the emotional content of the inputs.</p>
        <p>A critical aspect of our approach is the synchronization between data from separate datasets. For
the meta-model, each sample is constructed by ensuring that the predictions from both modalities are
based on inputs associated with the same emotional label. This process, detailed in Section 3.2 and
Algorithm 1, guarantees semantic consistency between the features extracted from diferent modalities
and preserves the integrity of the emotional information.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Extraction</title>
        <p>Effective feature extraction is crucial for capturing emotion-relevant information from heterogeneous
data sources. We implemented modality-specific preprocessing pipelines optimized for audio, video,
and EEG signals.</p>
        <p>
          Audio and Video Data Preprocessing The audio-video model processes both audio and visual data
following established methods from the literature [21]. For the audio stream, we extract Mel-frequency
cepstral coefficients (MFCC) as the primary feature representation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This choice was informed by
comparative studies showing no significant advantages in using alternative features such as chroma or
spectrograms for emotion recognition tasks in speech.
        </p>
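        <p>A minimal sketch of this MFCC extraction step is shown below, assuming the librosa library; the number of coefficients is an illustrative choice rather than the value used in [6].</p>
        <preformat>
# Sketch of MFCC extraction for the audio stream (librosa assumed; n_mfcc is illustrative).
import librosa

def extract_mfcc(wav_path, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=48000)                 # RAVDESS audio is sampled at 48 kHz
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
        </preformat>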
        <p>For the visual stream, we implemented a preprocessing pipeline consisting of:
1. Frame sampling at regular intervals
2. Image scaling to a standard resolution
3. Region of interest detection using multi-task cascaded convolutional networks (MTCNN) [29] to
localize and extract facial regions
4. Feature extraction using EfficientFace [22], a lightweight yet powerful model specifically designed
for facial expression recognition</p>
        <p>This preprocessing strategy ensures that the visual model receives consistent and relevant facial
expression data while filtering out irrelevant background information.</p>
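        <p>The sketch below illustrates one possible implementation of steps 1-3 of this pipeline, assuming OpenCV for frame sampling and the facenet-pytorch implementation of MTCNN for face detection; the sampling interval and crop size are illustrative.</p>
        <preformat>
# Sketch of the visual preprocessing: frame sampling plus face detection/cropping with MTCNN.
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224)              # returns an aligned face crop as a tensor, or None

def sample_faces(video_path, every_n_frames=15):
    faces, cap, idx = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:      # frame sampling at regular intervals
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            face = mtcnn(rgb)              # region-of-interest extraction
            if face is not None:
                faces.append(face)         # each crop is later fed to EfficientFace
        idx += 1
    cap.release()
    return faces
        </preformat>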
        <p>EEG Data Preprocessing The preprocessing of EEG signals involved several specialized steps to
enhance signal quality and extract emotion-relevant features. First, we conducted channel selection to
match our laboratory equipment constraints. From the 62 channels available in the SEED-IV dataset, we
selected a subset of 14 channels: ’AF3’, ’AF4’, ’F3’, ’F4’, ’F7’, ’F8’, ’T7’, ’T8’, ’P7’, ’P8’, ’O1’, ’O2’, ’FC5’, and
’FC6’. This selection corresponds to the channels available in the Emotiv EPOC+ EEG headset used in our
laboratory, ensuring compatibility with our experimental setup for future multimodal data collection.</p>
        <p>EEG signals were pre-processed by filtering out artifacts through a baseline noise removal procedure,
eliminating parts of the signal unrelated to the emotional stimulus. The signals were then transformed
into the frequency domain and divided into four bands corresponding to alpha (8–14 Hz), beta (14–31 Hz),
gamma (31–49 Hz), and theta (4–8 Hz) bins. Each frequency band is known to uniquely contribute to
the understanding of emotional and cognitive states. The gamma band (31–49 Hz), in particular, has
been shown to play a critical role in emotion recognition tasks. Studies have highlighted that gamma
activity increases during the processing of emotionally salient stimuli, particularly in the prefrontal
and temporal regions of the brain [23, 24].</p>
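        <p>A minimal SciPy sketch of this band separation is given below; the Butterworth filter, its order, and the 200 Hz sampling rate are assumptions made for illustration.</p>
        <preformat>
# Sketch: split one EEG channel into the four frequency bands with Butterworth band-pass filters.
from scipy.signal import butter, filtfilt

BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 49)}

def split_bands(signal, fs=200, order=4):
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, signal)
    return out                             # one band-limited copy of the signal per band
        </preformat>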
        <p>Subsequently, the separated channels are processed through a feature extraction mechanism
employing the differential entropy technique to extract meaningful information from the provided data.</p>
        <p>The resulting EEG signals are then transformed into a spatial grid format by mapping the electrode
signals onto a 2D matrix based on the spatial arrangement of the electrodes. This transformation enables
spatially aware processing, such as with convolutional neural networks, and yields a 3D EEG frame of size
[number of data points, width of the grid, height of the grid], arranged according to the electrode positions.</p>
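        <p>The following sketch illustrates the differential entropy computation (under a Gaussian assumption) and the projection of the 14 selected channels onto a small 2D grid; the grid coordinates are a hypothetical layout, not the exact electrode arrangement used in our implementation.</p>
        <preformat>
# Sketch: per-channel differential entropy and mapping of the 14 channels onto a 2D grid.
import numpy as np

def differential_entropy(x):
    # For a Gaussian-distributed signal, DE = 0.5 * ln(2 * pi * e * variance).
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

# Hypothetical (row, col) positions of the 14 selected channels on a 6 x 5 grid.
GRID_POS = {"AF3": (0, 1), "AF4": (0, 3), "F7": (1, 0), "F3": (1, 1), "F4": (1, 3), "F8": (1, 4),
            "FC5": (2, 1), "FC6": (2, 3), "T7": (3, 0), "T8": (3, 4),
            "P7": (4, 0), "P8": (4, 4), "O1": (5, 1), "O2": (5, 3)}

def to_grid(band_signals, shape=(6, 5)):
    """band_signals: dict mapping channel name to a 1D band-limited signal; returns a 2D DE frame."""
    frame = np.zeros(shape)
    for ch, sig in band_signals.items():
        row, col = GRID_POS[ch]
        frame[row, col] = differential_entropy(sig)
    return frame    # stacking one such frame per band (or per time step) gives the 3D representation
        </preformat>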
        <p>Label alignment A fundamental challenge in our multimodal approach is the lack of simultaneously
recorded data across all three modalities. To address this, we developed a label-based synchronization
strategy that creates emotionally consistent batches across modalities. The synchronization procedure,
outlined in Algorithm 1, involves the following steps:
• Organizing the EEG dataset samples into emotional bins corresponding to our four emotion
categories
• Using samples from the audio-video dataset as a guide for selecting corresponding EEG samples
• For each audio-video sample with a specific emotion label, randomly selecting an EEG sample
from the matching emotional bin
• Creating artificial batches containing paired audio-video and EEG samples that share the same
emotional label</p>
        <p>This approach ensures semantic consistency between modalities despite the absence of temporally
synchronized recordings. While not capturing the exact same emotional instances across modalities, it
provides a valid basis for training the meta-model to recognize patterns in how each modality responds
to similar emotional states.</p>
        <p>The complete synchronization algorithm is presented in Algorithm 1, which describes the creation of
emotional bins and the batch-based selection process that aligns data points across modalities. These
synchronized batches then serve as input for generating the meta-features used in the final classification
model.</p>
        <p>Algorithm 1 Dataset Synchronization</p>
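        <p>A minimal Python sketch of this binning-and-pairing procedure is given below; function and variable names are ours, and the random draw within each emotional bin follows the description above.</p>
        <preformat>
# Sketch of the label-based synchronization (Algorithm 1): pair each audio-video sample with a
# randomly drawn EEG sample carrying the same emotion label.
import random
from collections import defaultdict

def synchronize(av_samples, eeg_samples):
    """av_samples / eeg_samples: lists of (features, label) pairs from the two datasets."""
    # 1. Organize EEG samples into emotional bins keyed by label.
    bins = defaultdict(list)
    for feats, label in eeg_samples:
        bins[label].append(feats)
    # 2. For each audio-video sample, randomly select an EEG sample from the matching bin.
    pairs = []
    for av_feats, label in av_samples:
        eeg_feats = random.choice(bins[label])
        pairs.append((av_feats, eeg_feats, label))     # label-matched pair for the meta-model
    return pairs
        </preformat>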
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The integrated system’s performance is expected to outperform single-modality models by combining
the complementary strengths of audio, video, and EEG data (model-dataset pairing: Audio-Video model
on RAVDESS, FBCCNN on SEED-IV, and the meta-model on the resulting meta-features).</p>
      <p>The Audio-Video Emotion Classification Model enhances the system’s ability to capture expressive cues
from both speech and facial expressions, while the FBCCNN model contributes insights from the brain
activity captured in EEG data.</p>
      <p>The meta-model further improves performance by combining the predictions from both models,
ensuring that the final emotion classification is robust and accurate. Preliminary results will be analyzed
in terms of accuracy and classification error rates.</p>
      <p>We expect that the multi-modal approach will significantly improve recognition rates, particularly in
distinguishing emotions such as anger and sadness, which are often more difficult to classify based on a
single modality alone.</p>
      <sec id="sec-4-1">
        <title>4.1. Audio-Video and EEG results</title>
        <p>The audio-video branch of the proposed model, which integrates a Convolutional Neural Network
(CNN) and a Transformer, is implemented following the architecture outlined in the corresponding
reference paper. The performance metrics considered for this model are accuracy and loss, evaluated
during both the training and validation processes.</p>
        <p>Similarly, for the EEG-based model, which utilizes the FBCCNN architecture, the same performance
metrics are assessed. As presented in Table 2, the audio-video model, trained on the RAVDESS dataset,
achieves a loss of 0.8860 and an accuracy of 77.08%.</p>
        <p>In contrast, the EEG model, evaluated on the SEED-IV dataset, achieves a lower loss of 0.7075 and a
higher accuracy of 80.67%. These results show the better performance of the EEG-based model in terms
of both accuracy and loss.</p>
        <p>While the audio-video model demonstrates competitive results, its relatively higher loss and lower
accuracy may be attributed to the inherent complexity of processing multimodal data from the RAVDESS
dataset. Conversely, the EEG model benefits from the frequency-band-specific learning capabilities
of the FBCNN architecture, which prove to be well-suited for the emotion recognition tasks in the
SEED-IV dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Meta model predictions</title>
        <p>The integration of multimodal data through logistic regression represents the final stage of our emotion
recognition pipeline. While logistic regression does involve a training phase, this last step follows a far
more straightforward optimization path than the complex training dynamics observed in the CNN and
FBCCNN networks. The simpler nature of this model implies that traditional training visualizations,
such as loss curves or learning-rate analysis, are less informative and arguably unnecessary for
understanding the model’s performance. In the described scenario, the logistic regression serves as a
meta-learner, weighing and combining the features already extracted by our networks for audio-video
and EEG data.</p>
        <p>When evaluated independently, the single-modal approaches showed varying degrees of success. The
FBCCNN model, which uses the DEAP dataset for EEG analysis and was tested on multiple subjects, achieved
an average accuracy of 55.87% as reported in Table 7 in the “Experiment” section of the original paper.</p>
        <p>The audio-video model, which leverages both audio and video signals via a 1-head dropout transformer
architecture and was trained on the RAVDESS dataset, demonstrated better performance with an accuracy of
79.08%, as reported in Table III in the “Results and Discussion” section of the original paper.</p>
        <p>However, the true potential of emotion recognition emerges through our multimodal integration
approach, which achieves a remarkable accuracy of 91% when combining all three modalities (EEG,
audio, and video signals) as shown in Table 2, demonstrating the effectiveness of multimodal emotion
recognition. This is further illustrated in Figure 4, where the confusion matrix
shows a model capable of learning and synthesizing information from multiple
modalities.</p>
        <p>Particularly robust performance is highlighted in distinguishing emotional states that are typically
challenging to differentiate. The strongest performance is observed in the happy emotion category,
where the model achieves high accuracy with minimal false positives. This suggests that the combination
of physiological signals from EEG with audiovisual cues provides particularly strong indicators for this
emotional state.</p>
        <p>[Figure 4: Confusion matrix of the meta-model.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study presents a novel cascade multimodal system for emotion recognition that effectively integrates
audio, video, and EEG data to achieve superior classification performance. Our approach addresses the
inherent limitations of single-modality systems by leveraging the complementary strengths of each
data stream through a two-stage architecture. The key contributions and findings of this work can be
summarized as follows:
• First, our cascade architecture demonstrates the effectiveness of specialized modality-specific
processing before high-level integration. The audio-video component, implementing the hybrid
fusion approach from Zhang et al., achieved 77.08% accuracy, while the EEG component using the
FBCCNN architecture from Pan and Zheng reached 80.67% accuracy. When combined through
our meta-model approach, the system achieved a remarkable 91.45% accuracy, representing an
improvement of approximately 11% over the best single-modality model;
• Second, the logistic regression-based meta-model proved to be an effective and interpretable
integration strategy. The confusion matrix results revealed particularly strong performance in
distinguishing "happy" emotions (95% accuracy) and consistently high performance for "angry"
and "sad" categories (92% accuracy each). The pattern of misclassifications primarily occurring
between psychologically adjacent emotions in Russell’s circumplex model suggests that our
system successfully captures the underlying continuous nature of emotional expressions;
• Third, our approach offers practical advantages in terms of implementation and extensibility.</p>
      <p>By leveraging pre-trained specialized models for each modality, our system can benefit from
advancements in modality-specific architectures without requiring complete redesign. The
meta-model integration strategy also provides flexibility in weighting the contributions of each modality
based on their reliability for specific emotional states.</p>
      <p>We acknowledge that a significant limitation of the current study is the lack of simultaneously
recorded multimodal data, necessitating our bin-based synchronization strategy. To address this
limitation, we are currently conducting experiments to collect a comprehensive dataset with synchronized
audio, video, and EEG recordings during emotional experiences. This dataset will enable more
direct evaluation of temporal dynamics across modalities and will be made publicly available following
validation.</p>
      <p>Future research directions include expanding the range of recognizable emotions beyond the four
primary categories (neutral, happy, angry, sad) to include more nuanced states such as surprise, fear,
and disgust. Additionally, we plan to explore more sophisticated meta-model architectures that can
adapt their weighting strategies dynamically based on signal quality and context. The integration of
additional physiological signals such as heart rate variability or galvanic skin response also presents
promising avenues for further improving recognition accuracy, particularly for subtle emotional states.</p>
      <p>In conclusion, our multimodal approach represents a significant step toward developing more
naturalistic and robust emotion recognition systems that can function effectively in real-world human-computer
interaction scenarios. By demonstrating substantial improvements over single-modality approaches,
particularly for complex emotional states, this work contributes to the advancement of empathetic
computing systems capable of more nuanced understanding of human emotional experiences.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work was supported by the research project SPECTRA (Supporting schizophrenia PatiEnts Care
wiTh aRtificiAl intelligence - CUP: I53D23006050001), funded by the MIUR with DD no. 861 under the
PNRR and by the European Union - Next Generation EU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Spezialetti</surname>
          </string-name>
          , G. Placidi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Emotion recognition for human-robot interaction: Recent advances and future perspectives</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>532279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barra</surname>
          </string-name>
          , L.
          <string-name>
            <surname>D'Errico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Stafa</surname>
          </string-name>
          ,
          <article-title>Multimodal interfaces for emotion recognition: Models, challenges and opportunities</article-title>
          ,
          <source>in: Int. Conf. on Human-Computer Interaction</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>The emotion recognition in psychology of human-robot interaction</article-title>
          ,
          <source>Psychomachina</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>A facial expression emotion recognition based human-robot interaction system</article-title>
          .,
          <source>IEEE CAA J. Autom. Sinica</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>668</fpage>
          -
          <lpage>676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tsiourti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vincze</surname>
          </string-name>
          ,
          <article-title>Multimodal integration of emotional signals from voice, body, and context: Effects of (in)congruence on emotion recognition and attitudes towards robots</article-title>
          ,
          <source>Int. Journal of Social Robotics</source>
          <volume>11</volume>
          (
          <year>2019</year>
          )
          <fpage>555</fpage>
          -
          <lpage>573</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Zhang, T. Huang,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Learning affective features with a hybrid deep model for audio-visual emotion recognition</article-title>
          ,
          <source>IEEE transactions on circuits and systems for video technology 28</source>
          (
          <year>2017</year>
          )
          <fpage>3030</fpage>
          -
          <lpage>3043</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stafa</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <article-title>D'Errico, Eeg-based machine learning models for emotion recognition in hri</article-title>
          ,
          <source>in: Int. Conf. on Human-Computer Interaction</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Galluccio</surname>
          </string-name>
          , L.
          <string-name>
            <surname>D'Errico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Giordano</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Stafa</surname>
          </string-name>
          ,
          <article-title>Advancing eeg-based emotion recognition: Unleashing the power of graph neural networks for dynamic and topology-aware models</article-title>
          ,
          <source>in: 2024 Int. Joint Conf. on Neural Networks (IJCNN)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. Al</given-names>
            <surname>Aghbari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Girija</surname>
          </string-name>
          ,
          <article-title>A systematic survey on multimodal emotion recognition using learning algorithms</article-title>
          ,
          <source>Intelligent Systems with Applications</source>
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>200171</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrušaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.-P. Morency,</surname>
          </string-name>
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cimtay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ekmekcioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Caglar-Ozhan</surname>
          </string-name>
          ,
          <article-title>Cross-subject multimodal emotion recognition based on hybrid fusion</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>168865</fpage>
          -
          <lpage>168878</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. M. S. A.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y. A.</given-names>
            <surname>Ameen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sadeeq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zeebaree</surname>
          </string-name>
          ,
          <article-title>Multimodal emotion recognition using deep learning</article-title>
          ,
          <source>Journal of Applied Science and Technology Trends</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>73</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caccavale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Leone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lucignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Finzi</surname>
          </string-name>
          ,
          <article-title>Attentional regulations in a situated human-robot dialogue</article-title>
          ,
          <year>2014</year>
          , pp.
          <fpage>844</fpage>
          -
          <lpage>849</lpage>
          . doi:10.1109/ROMAN.2014.6926358.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Middya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Deep learning based multimodal emotion recognition using modellevel fusion of audio-visual modalities</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>244</volume>
          (
          <year>2022</year>
          )
          <fpage>108580</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>M.-I. Georgescu</surname>
          </string-name>
          , R. T. Ionescu,
          <article-title>Recognizing facial expressions of occluded faces using convolutional neural networks</article-title>
          ,
          <source>in: Neural Information Processing</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>645</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods</article-title>
          ,
          <source>Neurocomputing</source>
          (
          <year>2023</year>
          )
          <fpage>126866</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>L. D'Errico</surname>
            ,
            <given-names>E. Di</given-names>
          </string-name>
          <string-name>
            <surname>Nardo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ciaramella</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Stafa</surname>
          </string-name>
          ,
          <article-title>A 3d-cnns approach to classify users' emotion through eeg-based topographical maps in hri</article-title>
          ,
          <source>in: Companion of the 2024 ACM/IEEE Int. Conf. on Human-Robot Interaction</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akhand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. S.</given-names>
            <surname>Kamal</surname>
          </string-name>
          , T. Shimamura,
          <article-title>Facial emotion recognition using transfer learning in the deep cnn</article-title>
          ,
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>1036</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal emotion recognition in deep learning: a survey</article-title>
          ,
          <source>in: 2021 Int. Conf. on Culture-oriented Science and Technology (ICCST)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          , W. Zheng,
          <article-title>Emotion recognition based on eeg using generative adversarial nets and</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>