<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Deep Physiological Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fatemeh Rahimi</string-name>
          <email>f.rahimi@studenti.unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Tamantini</string-name>
          <email>christian.tamantini@cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Orlandini</string-name>
          <email>andrea.orlandini@cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fracasso</string-name>
          <email>francesca.fracasso@cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberta Siciliano</string-name>
          <email>roberta@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federico II University of Naples</institution>
          ,
          <addr-line>80138 Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Cognitive Sciences and Technologies, National Research Council of Italy</institution>
          ,
          <addr-line>00196 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recognizing affective states from physiological signals is essential for enabling emotion-aware systems, particularly in human-robot interaction. This paper presents a hybrid deep learning framework for multimodal emotion recognition that integrates deep feature extraction with handcrafted physiological descriptors. The system processes electrocardiogram, photoplethysmogram, and galvanic skin response signals to predict arousal and valence in a continuous regression setting. To this end, we evaluate two fusion strategies - feature-level and decision-level fusion - using two public affective datasets (AMIGOS and DEAP). Features extracted from each modality via a shared one-dimensional convolutional neural network and signal-specific physiological metrics are either concatenated (feature-level fusion) or separately modeled and combined at the prediction level (decision-level fusion). A broad set of machine learning regressors, including boosting methods and tree ensembles, is explored. Experiments were conducted with a leave-one-subject-out cross-validation protocol to assess generalization across users. Results show that feature-level fusion generally outperforms decision-level fusion, achieving the best root mean square error of 0.089 for arousal and 0.053 for valence. Statistical analyses confirm the significance of these differences, particularly favoring adaptive boosting and random forest under feature fusion. The proposed architecture offers a robust and interpretable solution for physiological emotion recognition and provides a solid foundation for real-time applications in emotion-aware social robotics and human-centered adaptive systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion Recognition</kwd>
        <kwd>Affective Computing</kwd>
        <kwd>Deep Feature Extraction</kwd>
        <kwd>Physiological Signals</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Among key areas of modern human–computer interaction, affective computing aims to enable machines
to recognize, interpret, and respond to human emotions. A central task in this field is emotion
recognition, which involves automatically analyzing emotional cues from multiple sources such as
facial expressions, speech, and physiological signals, mimicking human perception of emotions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Foundational scientific models guide emotion classification: Ekman’s model defines six basic universal
emotions—joy, sadness, anger, fear, surprise, and disgust—plus a neutral state [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], while Russell’s
circumplex model represents emotions along continuous valence and arousal dimensions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These
frameworks underpin emotion recognition system design, which has been extensively studied across
diverse modalities.
      </p>
      <p>
        Recent advances have broadened affective computing applications into mental health monitoring,
personalized interfaces and intelligent decision-making systems, highlighting the need for robust
Multimodal Emotion Recognition (MER) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A particularly promising application is in socially assistive
robots, which leverage emotion recognition to enable adaptive, user-centered interactions in, e.g.,
caregiving, rehabilitation, and education [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Integrating continuous valence-arousal estimation into
robot control architectures supports these applications by allowing real-time perception of user affective
states.
      </p>
      <p>
        A modular design, where an independent emotion recognition component interacts with a robot’s
deliberative and reactive layers, is effective [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. This component can process physiological signals
through feature extraction and fusion pipelines to produce continuous affective state estimates. These
estimates can inform behavior planners, enabling dynamic adjustment of robot responses, e.g., detecting
distress in eldercare and modulating speech or assistance accordingly [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Embedding this framework
fosters socio-emotional intelligence in assistive robots, improving user acceptance, care adherence, and
natural interaction [
        <xref ref-type="bibr" rid="ref10">10, 11</xref>
        ]. However, real-world deployment introduces practical challenges such as
signal degradation from motion artifacts, variable sensor placement, and latency constraints that can
impact real-time responsiveness. Addressing these challenges requires efficient, noise-tolerant models
and lightweight architectures suitable for on-device inference, as highlighted in applications involving
robot-assisted rehabilitation systems [12].
      </p>
      <p>A critical component of effective MER systems lies in the feature extraction process, which transforms
raw multimodal data into informative and discriminative representations for emotion classification [13].
Traditional methods [14] often rely on time-domain features, including statistical and signal-based
measures such as peak count, peak amplitude, variability, signal power, mean, standard deviation,
minimum, maximum, and mean differences. Additional frequency-domain and time-frequency features
have also been widely explored to capture complex temporal and spectral patterns [15]. However, with
the rise of deep learning, there has been a paradigm shift toward automated feature learning, which is
especially beneficial for processing complex, heterogeneous data sources [16]. Convolutional Neural
Networks (CNNs) have become essential for extracting deep features from physiological and non-physiological
modalities due to their ability to learn hierarchical, discriminative representations.</p>
      <p>CNNs capture complex spatiotemporal patterns in Electroencephalogram (EEG) [17], with models
like ScalingNet achieving strong results on DEAP and AMIGOS [18, 19, 20]. Extensions incorporate
global-local receptive fields [21] and hierarchical fusion [22]. For Electrocardiogram (ECG) and
Photoplethysmogram (PPG), CNN autoencoders and multimodal CNNs effectively classify emotions [23, 24].
Galvanic Skin Response (GSR) signals, indicative of arousal, benefit from CNN-long short-term memory
models for end-to-end learning [25]. Our work applies CNN-based extraction on GSR (DEAP, AMIGOS),
ECG (AMIGOS), and PPG (DEAP), leveraging their proven efficacy [26]. In vision, CNNs extract facial
expression features from images and videos [27]. For audio, CNNs analyze spectrograms and are often
paired with recurrent layers to capture temporal dynamics [28]. Textual emotion recognition also
uses CNNs to capture semantic and syntactic cues [29, 30]. CNNs additionally facilitate cross-modal
fusion, such as EEG-text [21] and audio-visual [31] integrations. Fusion strategies are critical in MER to
integrate signals from physiological, visual, and audio modalities, enhancing accuracy and robustness.
Three main strategies exist:</p>
      <p>Feature-Level Fusion combines features from multiple modalities before classification. For
physiological data, Hassan et al. [32] extracted features from Electrodermal Activity (EDA), Electromyography
(EMG), and PPG using deep belief networks, achieving high accuracy on DEAP. Zhang et al. [33] fused
EEG, EMG, GSR, and respiration signals with a deep regularized framework, improving valence and
arousal prediction. Similarly, CNN-based fusion of facial and vocal features showed superior
performance [34]. Decision-Level Fusion combines outputs from modality-specific classifiers. Zhao et al. [35]
fused CNN-based EEG, electrooculography, and GSR decisions, outperforming unimodal models. Xu
et al. [36] blended manual and deep features from audio-visual inputs using ensemble classifiers. This
preserves modality specificity but may miss inter-modality correlations. Hybrid Fusion integrates both
feature and decision levels. Yan et al. [37] combined facial, texture, and audio features early, then fused
decisions later, improving recognition in unconstrained environments. These studies collectively
suggest that while decision-level fusion preserves the uniqueness of each modality, feature-level and hybrid
strategies can better exploit inter-modal relationships. This is especially relevant for physiological signals,
which offer involuntary and robust indicators for affective state recognition. Therefore, identifying the
optimal fusion strategy is crucial for developing personalized emotion-aware robotic systems.</p>
      <p>To this end, our work proposes a hybrid deep learning framework that combines deep features
with handcrafted physiological descriptors to enhance affect recognition from ECG, PPG, and GSR
signals. This work builds on our recently proposed DeepPhysioNet, a deep physiological feature
extraction method for afective state recognition from wearable sensing. Here, we focus on evaluating
its effectiveness under different fusion strategies, highlighting its potential for real-world affective
computing. By evaluating both feature-level and decision-level fusion strategies on the AMIGOS and
DEAP datasets, we demonstrate that integrating modalities at feature level consistently leads to superior
performance. Through rigorous experimentation using Leave-One-Subject-Out Cross-Validation
(LOSO-CV) and a diverse set of regression models, we achieve state-of-the-art results, particularly in predicting
valence and arousal. Unlike existing works that focus either on handcrafted descriptors or fixed fusion
architectures, our contribution lies in validating the discriminative power of DeepPhysioNet’s features
and providing a systematic comparison of fusion strategies tailored for physiological signals. These
findings not only validate the effectiveness of feature-level fusion but also underscore the potential of
our architecture as a foundation for real-time, personalized, and emotion-aware robotic systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Framework</title>
      <p>To effectively capture both low-level temporal patterns and high-level physiological descriptors from
multimodal biosignals, we propose a hybrid deep learning framework that integrates data-driven feature
learning with domain-specific physiological knowledge. As shown in Figure 1, the system is designed
to process signals such as ECG and GSR, which are acquired independently per subject and trial, and to
flexibly support multimodal affective state estimation.</p>
      <p>At its core, the architecture relies on a shared 1D CNN, composed of stacked convolutional layers with
increasing filter widths, each followed by batch normalization, max pooling, and dropout. This sequence
enables the extraction of deep, hierarchical features from raw signals. A global average pooling layer
compresses the temporal dimension, followed by fully connected layers that generate compact deep
embeddings of each modality. To improve both interpretability and physiological robustness, these
deep features are concatenated with handcrafted features computed per modality. These can include
classic time- and frequency-domain metrics, such as Heart Rate Variability (HRV) parameters and skin
conductance indices, depending on the physiological signal given in input to the network. The result is
a joint feature vector that integrates learned representations and expert-designed descriptors. Moreover,
the proposed deep feature extraction framework is particularly suitable for managing deep multimodal
learning. It supports two complementary fusion strategies:
• Feature-Level Fusion, where the feature vectors from multiple modalities are concatenated
into a single representation and passed to a regression layer. This approach enables end-to-end
learning across modalities and supports direct exploitation of multimodal dependencies.
• Decision-Level Fusion, in which independent regressors are trained for each modality. Their
outputs, i.e., predicted arousal or valence values, are later combined through a late fusion ensemble,
introducing robustness to sensor-specific noise and missing data. This strategy preserves the
individual modality characteristics and has been shown to outperform feature-level fusion in
scenarios with degraded or noisy signals. For instance, in [38], decision-level fusion achieved a
significantly higher accuracy by separately learning from audio and visual features and combining
results via an ensemble method, while [39] highlights its advantage in low-quality multimodal
data environments.</p>
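      <p>As a rough illustration of the pipeline described above, the following numpy-only sketch mimics the block structure of the shared 1D CNN (convolution, ReLU, max pooling, global pooling) on a synthetic single-channel signal. All kernel sizes, weights, and the final summary vector are illustrative assumptions; the actual model uses learned filters, batch normalization, dropout, and fully connected layers in a deep learning framework.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1D convolution of signal x (shape (T,)) with kernel w (shape (K,))."""
    windows = np.lib.stride_tricks.sliding_window_view(x, len(w))
    return windows @ w

def block(x, w):
    """One convolutional block: conv, ReLU, then max pooling with stride 2."""
    y = np.maximum(conv1d(x, w), 0.0)
    T = (len(y) // 2) * 2
    return y[:T].reshape(-1, 2).max(axis=1)

def extract_deep_features(x, kernels):
    """Stacked blocks followed by a global-pooling-style summary vector."""
    for w in kernels:
        x = block(x, w)
    return np.array([x.mean(), x.std(), x.max()])

signal = rng.standard_normal(1280)                             # e.g. 10 s at 128 Hz
kernels = [rng.standard_normal(k) * 0.1 for k in (7, 11, 15)]  # increasing widths
feats = extract_deep_features(signal, kernels)
print(feats.shape)  # (3,)
```

      <p>In the full framework, such a deep embedding would then be concatenated with the handcrafted descriptors of Section 3.1 before regression.</p>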
      <p>Thanks to its modular and modality-agnostic design, the proposed framework can be easily adapted
to different physiological channels and experimental settings. Its ability to combine physiological
insight with deep learning makes it particularly suited for affective computing applications in
real-world human–robot interaction scenarios, such as emotion-aware social robots deployed in domestic
environments for personalized monitoring and adaptive interaction. Moreover, its low computational
footprint and flexible signal handling make it appropriate for embedded use, where resources are limited
and robustness against sensor noise or dropout is essential for sustained user engagement.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In order to validate the proposed framework, we present experiments carried out to cope with the
problem of affective state recognition based on multimodal physiological signals. The task is formulated
as a continuous regression problem, targeting the prediction of arousal and valence dimensions. Both
feature-level and decision-level fusion strategies are evaluated, and their comparative performance is
analyzed in the following section. To this end, two widely adopted affective computing datasets were
used to develop and evaluate the proposed framework. Both datasets provide multimodal
physiological recordings collected during exposure to emotionally evocative stimuli and include subjectively
annotated arousal and valence values for each trial. The AMIGOS dataset [20] contains recordings
from 40 participants in both individual and group settings. In this study, only the individual sessions
were considered, focusing on ECG and GSR signals acquired while participants watched short video
clips. After each clip, participants reported their perceived arousal and valence using continuous
self-assessment scales, and external annotations of arousal and valence were provided by independent
observers, enabling evaluation against both subjective self-assessments and externally judged emotional
states. The DEAP dataset [19] includes data from 32 participants who each watched 40 one-minute
music videos in a controlled laboratory environment. For our experiments, PPG and GSR signals were
used. Participants rated their affective responses on a 9-point Likert scale for both arousal and valence
dimensions.</p>
      <sec id="sec-3-1">
        <title>3.1. Feature Extraction</title>
        <p>Each physiological signal (ECG, PPG, GSR) was processed using the hybrid pipeline introduced in this
work, which combines deep feature extraction through a shared 1D CNN architecture with
domain-informed handcrafted features. This design enables the generation of compact, modality-independent
representations that capture both latent signal dynamics and physiologically meaningful descriptors.
All handcrafted features computed in this study are listed in the tables below: Table 1 gives the description
and mathematical formulation of the HRV-related features extracted from ECG and PPG signals, while
Table 2 does the same for the GSR-based features computed per trial, GSR being a well-established
indicator of sympathetic nervous system activity.</p>
        <p>Skin Conductance Responses (SCR): number of SCRs detected during the trial using a threshold-based
peak detection algorithm. Mean SCR is the average amplitude of the detected responses, and Mean GSR
is the average conductance level over the trial, Mean GSR = (1/N) ∑ᵢ GSRᵢ.</p>
        <p>Standard Deviation of NN Intervals (SDNN): standard deviation of all inter-beat intervals (IBI),
reflecting overall HRV, SDNN = √( (1/N) ∑ᵢ (IBIᵢ − μ)² ), where μ is the mean IBI.</p>
        <p>Frequency Signal Quality Index (FSQI): spectral-based quality index indicating the proportion of
total power contained in the LF (0.04–0.15 Hz) and HF (0.15–0.4 Hz) bands, computed using Welch’s
periodogram.</p>
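        <p>A hedged sketch of how such descriptors could be computed is shown below on synthetic data; the sampling rate, peak-detection threshold, and Welch settings are assumptions for illustration, not the paper's exact configuration.</p>

```python
import numpy as np
from scipy.signal import find_peaks, welch

rng = np.random.default_rng(1)
fs = 128.0                                             # assumed sampling rate (Hz)

# HRV: SDNN from a synthetic inter-beat-interval series (seconds)
ibi = 0.8 + 0.05 * rng.standard_normal(60)
sdnn = np.sqrt(np.mean((ibi - ibi.mean()) ** 2))

# GSR: mean level and SCR count via threshold-based peak detection
gsr = 2.0 + 0.1 * rng.standard_normal(int(30 * fs))    # synthetic 30 s trial
mean_gsr = gsr.mean()
peaks, props = find_peaks(gsr, height=gsr.mean() + 2.0 * gsr.std())
n_scr = len(peaks)

# FSQI: share of spectral power in the LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) bands
f, pxx = welch(ibi - ibi.mean(), fs=1.0 / ibi.mean(), nperseg=32)
band = np.logical_and(f >= 0.04, 0.4 >= f)
fsqi = pxx[band].sum() / pxx.sum()

print(round(float(sdnn), 4), n_scr, round(float(fsqi), 4))
```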
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Supervised Model for Emotion Recognition</title>
        <p>A supervised regression pipeline was implemented to predict arousal and valence from the extracted
multimodal features, exploring two complementary fusion strategies: feature-level fusion and
decision-level fusion. In both configurations, model hyperparameters were optimized via grid search within
each fold of the LOSO-CV protocol, ensuring consistent and unbiased evaluation.</p>
        <p>In the feature-level fusion approach, the deep and handcrafted features extracted from different
modalities (ECG and GSR) were concatenated into a single feature vector per trial. This unified
representation allows regression models to capture inter-modality correlations and learn joint patterns
across signals [42]. In contrast, the decision-level fusion strategy involves training separate models for
each modality and subsequently combining their predictions using ensemble methods [43], a technique
shown to enhance robustness in multimodal affective computing scenarios [44].</p>
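        <p>The two strategies can be contrasted in a few lines; the sketch below uses synthetic stand-ins for the per-modality feature vectors and a random forest regressor, all of which are illustrative assumptions rather than the exact experimental configuration.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_trials = 200
ecg_feats = rng.standard_normal((n_trials, 20))    # deep + handcrafted ECG features
gsr_feats = rng.standard_normal((n_trials, 12))    # deep + handcrafted GSR features
arousal = rng.uniform(0.0, 1.0, n_trials)          # continuous target

# Feature-level fusion: concatenate modalities, train one regressor.
X_fused = np.hstack([ecg_feats, gsr_feats])
early = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fused, arousal)
pred_early = early.predict(X_fused)

# Decision-level fusion: one regressor per modality, combine the predictions.
m_ecg = RandomForestRegressor(n_estimators=50, random_state=0).fit(ecg_feats, arousal)
m_gsr = RandomForestRegressor(n_estimators=50, random_state=0).fit(gsr_feats, arousal)
pred_late = 0.5 * (m_ecg.predict(ecg_feats) + m_gsr.predict(gsr_feats))

print(pred_early.shape, pred_late.shape)  # (200,) (200,)
```

        <p>In the decision-level variant, the simple average could be replaced by the meta-regressor ensemble used in the experiments.</p>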
        <p>To evaluate the proposed framework across both fusion strategies, a diverse set of machine learning
regressors was employed, encompassing both simple and advanced models. This variety ensures broad
coverage of learning paradigms and robustness to overfitting, nonlinearity, and noise. Specifically, we
included:
• Linear Regression (LR), for its interpretability and as a baseline linear model [45];
• Support Vector Regressor (SVR), effective for capturing nonlinear relations with good generalization [46];
• Random Forest (RF), an ensemble of decision trees that reduces variance via bagging [47];
• Gradient Boosting (GB) and Adaptive Boosting (AdaBoost), which sequentially build additive
models to minimize error, offering strong performance in many real-world regression tasks [48], [49];
• XGBoost (XGB) and LightGBM (LGBM), highly efficient GB implementations that support
regularization and scalability, particularly suited for tabular data [50], [51];
• CatBoost, which natively handles categorical features and stabilizes training through ordered
boosting [52].
All models were applied both in the feature-fusion pipeline and in the decision-fusion meta-regressor
framework, enabling a comprehensive comparison of their capacity to model multimodal physiological
data in the context of affective state estimation.</p>
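        <p>The per-fold model selection described above can be sketched as follows for a subset of the regressors; the parameter grids and synthetic data are assumptions for illustration.</p>

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 16))
y = rng.uniform(0.0, 1.0, 120)

# Candidate regressors with illustrative hyperparameter grids.
candidates = {
    "LR": (LinearRegression(), {}),
    "RF": (RandomForestRegressor(random_state=0), {"n_estimators": [25, 50]}),
    "AdaBoost": (AdaBoostRegressor(random_state=0), {"n_estimators": [25, 50]}),
}
best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    best[name] = -search.best_score_    # RMSE of the best configuration

print(sorted(best))  # ['AdaBoost', 'LR', 'RF']
```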
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Evaluation.</title>
        <p>The proposed framework was validated through a LOSO-CV scheme. In each iteration, data from
one subject in the AMIGOS dataset were excluded from training and used solely for testing, while
the remaining AMIGOS subjects, together with all participants from the DEAP dataset, were used
for training. This evaluation protocol reflects a realistic deployment scenario in which models must
generalize to new users whose physiological patterns are not seen during training.</p>
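        <p>A minimal sketch of this protocol, assuming scikit-learn's LeaveOneGroupOut with subject identifiers as groups and synthetic features in place of the real biosignal descriptors:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(4)
n_subjects, trials_per_subject = 5, 16
X = rng.standard_normal((n_subjects * trials_per_subject, 10))
y = rng.uniform(0.0, 1.0, len(X))
groups = np.repeat(np.arange(n_subjects), trials_per_subject)  # subject IDs

rmse_per_subject = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # Each fold holds out every trial of exactly one subject.
    model = RandomForestRegressor(n_estimators=30, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmse_per_subject.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

print(len(rmse_per_subject))  # 5, one held-out subject per fold
```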
        <p>Model performance was quantified using the Root Mean Square Error (RMSE), a commonly adopted
metric in affective computing tasks involving continuous affect prediction [14]. RMSE penalizes larger
deviations more heavily and is defined as:</p>
        <p>RMSE = √( (1/N) ∑ᵢ (ŷᵢ − yᵢ)² )  (1)</p>
        <p>where ŷᵢ and yᵢ represent the predicted and true values for the i-th trial, respectively, and N is the
total number of predictions. Lower RMSE values indicate more accurate predictions of the arousal and
valence dimensions.</p>
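        <p>Equation (1) translates directly into code; the helper below is a straightforward numpy transcription.</p>

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predicted and true affect ratings."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse([0.1, 0.5, 0.9], [0.0, 0.5, 1.0]))  # ≈ 0.0816
```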
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Statistical Analysis.</title>
        <p>To evaluate the impact of fusion strategies on predictive performance, we first assessed, for each
regression model individually, whether feature-level fusion or decision-level fusion yielded significantly
lower RMSE values. This comparison was conducted separately for arousal and valence using the
Wilcoxon signed-rank test, applied to paired RMSE scores computed across subjects. This approach
allowed us to determine which fusion strategy was more effective on a per-model basis, accounting for
subject-level variability.</p>
        <p>Subsequently, we investigated the relative performance of all models within each fusion strategy and
emotion dimension. A Friedman test [53] was employed to assess whether statistically significant
differences existed in model performance, considering the repeated-measures design. The null hypothesis
(H0) stated that all models performed equally (i.e., no difference in median RMSE), while the alternative
hypothesis (H1) assumed that at least one model differed significantly from the others.</p>
        <p>To identify which specific model pairs contributed to any significant effects found by the Friedman test,
we performed pairwise comparisons using the Wilcoxon signed-rank test with Bonferroni correction to
control for multiple comparisons. This procedure was repeated for both arousal and valence, and for
both fusion types.</p>
        <p>Only results with adjusted p-values ≤ 0.05 were considered statistically significant and annotated
in the corresponding visualizations. This two-level statistical analysis framework enabled us to draw
robust conclusions on the effectiveness of fusion strategies and the relative merits of different models
in multimodal affective state prediction.</p>
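        <p>The two-level procedure can be sketched with scipy on synthetic per-subject RMSE scores (the data and the three hypothetical models are illustrative assumptions):</p>

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(5)
n_subjects = 30
rmse_feature = rng.uniform(0.05, 0.15, n_subjects)                 # feature-level fusion
rmse_decision = rmse_feature + rng.uniform(0.0, 0.05, n_subjects)  # decision-level fusion

# Level 1: paired comparison of fusion strategies for one model.
stat, p_fusion = wilcoxon(rmse_feature, rmse_decision)

# Level 2: Friedman test across three hypothetical models (same subjects).
rmse_m1, rmse_m2 = rmse_feature, rmse_feature * 1.1
rmse_m3 = rmse_feature + 0.02
chi2, p_friedman = friedmanchisquare(rmse_m1, rmse_m2, rmse_m3)

# Bonferroni correction for the three pairwise follow-up tests.
raw_p = [wilcoxon(a, b).pvalue for a, b in
         [(rmse_m1, rmse_m2), (rmse_m1, rmse_m3), (rmse_m2, rmse_m3)]]
adj_p = [min(1.0, p * len(raw_p)) for p in raw_p]
print(len(adj_p))  # 3 adjusted p-values
```

        <p>In the actual analysis, this is repeated for every model pair, fusion strategy, and emotion dimension.</p>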
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>To provide a comparative overview of the predictive performance across models and fusion strategies,
Figure 2 reports the RMSE distributions obtained for each model under both feature-level and
decision-level fusion. Results are shown separately for the arousal and valence prediction tasks. For each pair
(model, task), statistical significance was assessed using Wilcoxon signed-rank tests, comparing feature-level
vs. decision-level fusion. Statistically significant differences are annotated with standard asterisk
notation, while non-significant results are labeled as “ns”.</p>
      <p>The results shown in Figure 2 confirm that, in general, feature-level fusion leads to better performance
than decision-level fusion across most of the tested models. This outcome aligns with established
findings in multimodal learning, where early fusion strategies often benefit from the ability to capture joint
dependencies and correlations across modalities at the feature level [42]. By integrating the
complementary characteristics of physiological signals (e.g., GSR and ECG) into a shared representation before
model training, feature fusion allows the regressors to exploit richer contextual information. However,
this trend is not consistent across all models. For instance, in the case of LR and AdaBoost during
arousal prediction, decision-level fusion shows slightly better or equivalent performance. This inversion
may be attributed to the limited capacity of linear or shallow models to exploit the high-dimensional
fused feature space effectively. In such cases, learning separate unimodal models and aggregating their
outputs can act as a form of regularization, reducing the risk of overfitting and improving robustness
to noisy features or modality-specific artifacts. Moreover, ensemble-based methods like RF and GB
generally benefit more from feature-level fusion, likely due to their capacity to handle heterogeneous
features and non-linear interactions. In contrast, models with more rigid assumptions or sensitivity
to feature scaling (e.g., LR) may struggle when exposed to the increased complexity introduced by
early fusion. These findings underscore the importance of selecting appropriate fusion strategies in
relation to the model architecture and the characteristics of the input modalities. While feature-level
fusion appears generally preferable, decision-level fusion can still provide competitive performance in
scenarios where model simplicity or modularity is required.</p>
      <p>To further investigate the comparative performance of the different regression models across fusion
strategies and emotion dimensions, we conducted a non-parametric Friedman test. This test assesses
whether there are statistically significant differences in model performance when evaluated on the
same subjects, based on RMSE rankings. The results of this analysis are reported in Table 3. For both
arousal and valence prediction tasks, and under both feature-level and decision-level fusion strategies,
the Friedman test returned extremely low p-values (all &lt; 10⁻¹⁹), clearly rejecting the null hypothesis
that all models perform equally. This confirms that the choice of model has a significant impact on
performance, regardless of the fusion strategy adopted. Interestingly, the highest Friedman test statistic
was observed in the valence prediction task under feature-level fusion (186.86), indicating particularly
large differences in model performance in this configuration. This may be due to the nature of valence
representation in physiological signals, which could benefit more from richer multimodal embeddings
learned during feature fusion. Overall, the analysis highlights that both the fusion strategy and the
emotion dimension being predicted play a crucial role in shaping the relative effectiveness of the
regression models.</p>
      <p>A natural continuation of the statistical analysis following the Friedman test is provided by the
pairwise Wilcoxon comparisons summarized in Figure 3. This set of heatmaps displays the adjusted
p-values resulting from multiple pairwise tests between models within each combination of fusion strategy
and emotion dimension. The Bonferroni correction was applied to account for multiple comparisons,
and significance levels are indicated using a standard asterisk notation.</p>
      <p>The results confirm and extend the Friedman test findings, revealing several significant pairwise
differences in model performance. In particular, models such as AdaBoost and XGB consistently
outperform others under feature-level fusion, especially for the valence dimension. Conversely, the
performance gaps under decision-level fusion appear slightly narrower, although significant differences
still emerge.</p>
      <sec id="sec-4-1">
        <title>4.1. Comparison with Related Studies.</title>
        <p>To further contextualize the effectiveness of the proposed framework, Table 4 presents a comparison with
the benchmark results reported by [14]. That study employed traditional hand-crafted features—such
as Hjorth parameters, spectral entropy, wavelet-based energy and entropy, and empirical mode
decomposition descriptors—combined with conventional classifiers like k-Nearest Neighbors (KNN) and RF
for affective state prediction using the AMIGOS dataset.</p>
        <p>In contrast, our approach integrates deep feature learning with handcrafted physiological metrics
and explores both feature-level and decision-level fusion strategies. The results indicate a consistent
improvement in RMSE performance for both arousal and valence dimensions. Specifically, decision-level
fusion using AdaBoost achieves RMSE scores of 0.116 for arousal and 0.119 for valence, outperforming
the baseline models in [14]. Even greater performance gains are observed with feature-level fusion,
where RF and AdaBoost models attain RMSE values as low as 0.089 and 0.053, respectively.</p>
        <p>These findings confirm the advantages of combining deep representations with early fusion
mechanisms, especially when dealing with complex, multimodal physiological data. They also demonstrate
the superiority of the proposed pipeline in comparison to existing handcrafted approaches, supporting
its suitability for real-world affective computing applications.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study presented a neural framework for multimodal affective state recognition based on
physiological signals. The proposed architecture integrates deep feature extraction via a shared 1D CNN
with signal-specific handcrafted physiological metrics, enabling a robust representation of autonomic
responses related to emotional arousal and valence. In addition to designing an effective feature
extraction pipeline, we systematically investigated two widely adopted fusion strategies, i.e., feature-level
and decision-level fusion, within a supervised regression setting. The framework was validated on two
benchmark datasets, AMIGOS and DEAP, using a LOSO-CV protocol to simulate real-world
generalization to unseen subjects. A wide range of machine learning regressors was tested to assess the flexibility
and robustness of the extracted features under different fusion paradigms. Our results demonstrate that
feature-level fusion consistently outperforms decision-level fusion in most scenarios, particularly when
coupled with ensemble-based models such as RF and AdaBoost. Statistical analyses using Wilcoxon
and Friedman tests confirmed the significance of these findings, highlighting the impact of both model
selection and fusion strategy on performance. When compared with previous work based on traditional
feature engineering and classical classifiers, our approach achieved lower RMSE values for both arousal
and valence prediction tasks, confirming the value of combining deep representations with physiological
insights. In future work, we plan to systematically evaluate fusion strategies under non-optimal
conditions (e.g., simulated noise or missing modalities), to better understand their robustness and suitability
for real-world settings. This analysis will help determine whether feature-level fusion remains efective
or if decision-level fusion ofers greater resilience in such scenarios. In addition, we plan to conduct
ablation studies by removing specific modalities to assess the contribution of each physiological signal.
We will also compare our approach with classical machine learning models trained on handcrafted
features tailored to each modality. These insights will inform the deployment of our proposed
framework in real-time applications, using biosignals collected during human–robot interaction to support
emotionally adaptive behavior. In particular, we aim to integrate the model into social robotic platforms,
enabling them to continuously estimate users’ afective states and adapt their communicative strategies
accordingly, paving the way toward more empathic and responsive assistive technologies.</p>
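<p>The two fusion strategies compared above can be illustrated with a minimal sketch under a leave-one-subject-out protocol, using scikit-learn with synthetic stand-ins for the extracted features; all variable names and data below are hypothetical, not the paper’s actual pipeline.</p>

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n, subjects = 120, 6
# Hypothetical per-modality feature blocks (e.g., ECG and EDA deep features).
X_ecg = rng.normal(size=(n, 8))
X_eda = rng.normal(size=(n, 8))
y = 0.5 * X_ecg[:, 0] + 0.3 * X_eda[:, 0] + rng.normal(scale=0.1, size=n)
groups = np.repeat(np.arange(subjects), n // subjects)  # one id per subject

logo = LeaveOneGroupOut()  # LOSO-CV: each fold holds out one subject
rmse_feat, rmse_dec = [], []
for tr, te in logo.split(X_ecg, y, groups):
    # Feature-level (early) fusion: concatenate modalities, train one model.
    X_cat = np.hstack([X_ecg, X_eda])
    model = AdaBoostRegressor(random_state=0).fit(X_cat[tr], y[tr])
    rmse_feat.append(np.sqrt(mean_squared_error(y[te], model.predict(X_cat[te]))))
    # Decision-level (late) fusion: one model per modality, average predictions.
    preds = np.mean([
        AdaBoostRegressor(random_state=0).fit(Xm[tr], y[tr]).predict(Xm[te])
        for Xm in (X_ecg, X_eda)
    ], axis=0)
    rmse_dec.append(np.sqrt(mean_squared_error(y[te], preds)))

print(f"feature-level fusion RMSE: {np.mean(rmse_feat):.3f}")
print(f"decision-level fusion RMSE: {np.mean(rmse_dec):.3f}")
```

<p>Per-fold RMSE is averaged across held-out subjects, mirroring how the LOSO-CV scores reported above would be aggregated.</p>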
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Italian Ministry of Research, under the complementary actions
to the NRRP “Fit4MedRob - Fit for Medical Robotics” Grant PNC0000007, (CUP: B53C22006990001) and
partially by Next Generation EU – “Age-It – Ageing Well in an Ageing Society” project (PE0000015),
National Recovery and Resilience Plan (NRRP).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used generative AI tools (specifically, OpenAI’s GPT-4)
to assist with grammar and spelling checks.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[11] S. Rossi, F. Ferland, A. Tapus, User profiling and behavioral adaptation for HRI: A survey, Pattern Recognition Letters 99 (2017) 3–12.</p>
      <p>[12] Zhang et al., Emotion recognition using EEG and physiological data for robot-assisted rehabilitation systems, in: Companion Publication of the International Conference on Multimodal Interaction (ICMI), 2020.</p>
      <p>[13] N. Kim, S. Cho, B. Bae, SMaTE: A segment-level feature mixing and temporal encoding framework for facial expression recognition, Sensors 22 (2022) 5753.</p>
      <p>[14] F. Galvão, S. M. Alarcão, M. J. Fonseca, Predicting exact valence and arousal values from EEG, Sensors 21 (2021) 3414.</p>
      <p>[15] Q. Li, A. Zhang, Z. Li, Y. Wu, Improvement of EMG pattern recognition model performance in repeated uses by combining feature selection and incremental transfer learning, Frontiers in Neurorobotics 15 (2021) 699174.</p>
      <p>[16] N. Jia, C. Zheng, W. Sun, A multimodal emotion recognition model integrating speech, video and mocap, Multimedia Tools and Applications 81 (2022) 32265–32286.</p>
      <p>[17] S. K. Khare, V. Bajaj, Time–frequency representation and convolutional neural network-based emotion recognition, IEEE Transactions on Neural Networks and Learning Systems 32 (2020) 2901–2909.</p>
      <p>[18] J. Hu, C. Wang, Q. Jia, Q. Bu, R. Sutcliffe, J. Feng, ScalingNet: Extracting features from raw EEG data for emotion recognition, Neurocomputing 463 (2021) 177–184.</p>
      <p>[19] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras, DEAP: A database for emotion analysis; using physiological signals, IEEE Transactions on Affective Computing 3 (2011) 18–31.</p>
      <p>[20] J. A. Miranda-Correa, M. K. Abadi, N. Sebe, I. Patras, AMIGOS: A dataset for affect, personality and mood research on individuals and groups, IEEE Transactions on Affective Computing 12 (2018) 479–493.</p>
      <p>[21] S. Wang, J. Qu, Y. Zhang, Y. Zhang, Multimodal emotion recognition from EEG signals and facial expressions, IEEE Access 11 (2023) 33061–33068.</p>
      <p>[22] Y. Zhang, C. Cheng, Y. Zhang, Multimodal emotion recognition using a hierarchical fusion convolutional neural network, IEEE Access 9 (2021) 7943–7951.</p>
      <p>[23] N. Hajarolasvadi, E. Bashirov, H. Demirel, Video-based person-dependent and person-independent facial emotion recognition, Signal, Image and Video Processing 15 (2021) 1049–1056.</p>
      <p>[24] G. Yin, S. Sun, D. Yu, D. Li, K. Zhang, A multimodal framework for large-scale emotion recognition by fusing music and electrodermal activity signals, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18 (2022) 1–23.</p>
      <p>[25] T.-P. Jung, T. J. Sejnowski, et al., Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing, IEEE Transactions on Affective Computing 13 (2019) 96–107.</p>
      <p>[26] F. J. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sensors 16 (2016) 115.</p>
      <p>[27] Q. Wei, X. Huang, Y. Zhang, FV2ES: A fully end2end multimodal system for fast yet effective video emotion recognition inference, IEEE Transactions on Broadcasting 69 (2022) 10–20.</p>
      <p>[28] M. Sharafi, M. Yazdchi, R. Rasti, F. Nasimi, A novel spatio-temporal convolutional neural framework for multimodal emotion recognition, Biomedical Signal Processing and Control 78 (2022) 103970.</p>
      <p>[29] F. Chen, J. Shao, A. Zhu, D. Ouyang, X. Liu, H. T. Shen, Modeling hierarchical uncertainty for multimodal emotion recognition in conversation, IEEE Transactions on Cybernetics 54 (2022) 187–198.</p>
      <p>[30] S. Liu, P. Gao, Y. Li, W. Fu, W. Ding, Multi-modal fusion network with complementarity and importance for emotion recognition, Information Sciences 619 (2023) 679–694.</p>
      <p>[31] Z. Farhoudi, S. Setayeshi, Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition, Speech Communication 127 (2021) 92–103.</p>
      <p>[32] M. M. Hassan, M. G. R. Alam, M. Z. Uddin, S. Huda, A. Almogren, G. Fortino, Human emotion recognition using deep belief network architecture, Information Fusion 51 (2019) 10–18.</p>
      <p>[33] J. Zhang, Z. Yin, P. Chen, S. Nichele, Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review, Information Fusion 59 (2020) 103–126.</p>
      <p>[34] J. Xu, H. Li, Y. Wang, CNN-based multimodal emotion recognition using facial and vocal features, IEEE Access 8 (2020) 36774–36785.</p>
      <p>[35] Y. Zhao, X. Cao, J. Lin, D. Yu, X. Cao, Multimodal affective states recognition based on multiscale CNNs and biologically inspired decision fusion model, IEEE Transactions on Affective Computing (2021).</p>
      <p>[36] J. Xu, L. Wang, Y. Zhang, H. Li, An ensemble learning framework for multimodal emotion recognition using audio and visual features, Neurocomputing 412 (2020) 251–259.</p>
      <p>[37] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, Y. Zong, Multi-cue fusion for emotion recognition in the wild, Neurocomputing 309 (2018) 27–35.</p>
      <p>[38] M. Hao, W.-H. Cao, Z.-T. Liu, M. Wu, P. Xiao, Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features, Neurocomputing 391 (2020) 42–51.</p>
      <p>[39] Q. Zhang, Y. Wei, Z. Han, H. Fu, X. Peng, C. Deng, Q. Hu, C. Xu, J. Wen, D. Hu, et al., Multimodal fusion on low-quality data: A comprehensive survey, arXiv preprint arXiv:2404.18947 (2024).</p>
      <p>[40] C. Tamantini, M. L. Cristofanelli, F. Fracasso, A. Umbrico, G. Cortellessa, A. Orlandini, F. Cordella, Physiological sensor technologies in workload estimation: A review, IEEE Sensors Journal (2025).</p>
      <p>[41] M. Benedek, C. Kaernbach, A continuous measure of phasic electrodermal activity, Journal of Neuroscience Methods 190 (2010) 80–91.</p>
      <p>[42] P. K. Atrey, M. A. Hossain, A. El Saddik, M. S. Kankanhalli, Multimodal fusion for multimedia analysis: A survey, Multimedia Systems 16 (2010) 345–379.</p>
      <p>[43] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239.</p>
      <p>[44] R. A. Calvo, S. D’Mello, J. M. Gratch, A. Kappas, The Oxford Handbook of Affective Computing, Oxford University Press, 2015.</p>
      <p>[45] G. A. Seber, A. J. Lee, Linear Regression Analysis, John Wiley &amp; Sons, 2003.</p>
      <p>[46] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, Advances in Neural Information Processing Systems 9 (1996).</p>
      <p>[47] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.</p>
      <p>[48] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.</p>
      <p>[49] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1997) 119–139.</p>
      <p>[50] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.</p>
      <p>[51] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[52] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin, CatBoost: Unbiased boosting with categorical features, Advances in Neural Information Processing Systems 31 (2018).</p>
      <p>[53] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937) 675–701.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>561</volume>
          (
          <year>2023</year>
          )
          <fpage>126866</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>A systematic review on affective computing: Emotion models, databases, and recent advances</article-title>
          ,
          <source>Information Fusion</source>
          <volume>83</volume>
          (
          <year>2022</year>
          )
          <fpage>19</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cittadini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scotto di Luzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lauretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cordella</surname>
          </string-name>
          ,
          <article-title>Affective state estimation based on Russell's model and physiological measurements</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>9786</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Geetha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Priyanka</surname>
          </string-name>
          , E. Uma,
          <article-title>Multimodal emotion recognition with deep learning: advancements, challenges, and future directions</article-title>
          ,
          <source>Information Fusion</source>
          <volume>105</volume>
          (
          <year>2024</year>
          )
          <fpage>102218</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <article-title>Automated planning and scheduling in robot-aided rehabilitation: a review</article-title>
          ,
          <source>Journal of NeuroEngineering and Rehabilitation</source>
          <volume>22</volume>
          (
          <year>2025</year>
          )
          <fpage>180</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Beraldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <article-title>Fostering behavior change through cognitive social robotics</article-title>
          ,
          <source>in: International Conference on Social Robotics</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Benedictis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fracasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cortellessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cesta</surname>
          </string-name>
          ,
          <article-title>A dichotomic approach to adaptive interaction for socially assistive robots</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>33</volume>
          (
          <year>2023</year>
          )
          <fpage>293</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <article-title>Repair platform: Robot-aided personalized rehabilitation</article-title>
          ,
          <source>in: International Conference of the Italian Association for Artificial Intelligence</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          , R. De Benedictis,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fracasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cortellessa</surname>
          </string-name>
          ,
          <article-title>A mind-inspired architecture for adaptive hri</article-title>
          ,
          <source>International Journal of Social Robotics</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>371</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Spezialetti</surname>
          </string-name>
          , G. Placidi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Emotion recognition for human-robot interaction: Recent advances and future perspectives</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>7</volume>
          (
          <year>2020</year>
          ). doi:10.3389/frobt.2020.532279.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>