<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Mamba Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiuze Jia</string-name>
          <email>xiuxejia@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xuenan Liu</string-name>
          <email>xuenanliu@mail.hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siyi Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lizhong Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Ding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>School of Computer Science and Information Engineering, Hefei University of Technology</institution>
          ,
          <addr-line>Hefei, 230601, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>School of Software, Hefei University of Technology</institution>
          ,
          <addr-line>Hefei, 230601, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Remote heart rate estimation from video still faces three main challenges in real-world scenarios: (1) the absence of adaptive, frequency-selective modeling allows low-frequency physiological rhythms to be overwhelmed by noise; (2) single-modality inputs suffer from instability under varying illumination, occlusions, or device changes; and (3) conventional temporal encoders are computationally expensive and lack long-sequence generalization. To address these limitations, we propose FSMamba, a frequency-selective multimodal perception system built upon the Mamba state-space framework. FSMamba employs a dual-branch feature extractor for RGB and NIR streams and a Joint Cross Attention (JCA) module to enable bidirectional, multi-head cross-modal interaction. In its encoder, we combine the standard MambaBlock with a parallel Frequency-Selective Filter (FSFilter) that uses a learnable time step (derived from trainable heart-rate bounds) and an SSMKernel-based causal recurrence to implicitly generate a band-pass convolution kernel. Channel-wise gating further refines the heart-rate-focused features. The decoder fuses raw temporal and frequency-enhanced representations via joint classification and class-wise regression to predict the final heart rate. Experiments on VIPL-HR demonstrate that FSMamba achieves competitive RMSE performance across the majority of diverse conditions, and ablation studies confirm the effectiveness of each module.</p>
      </abstract>
      <kwd-group>
        <kwd>state-space modeling</kwd>
        <kwd>SSMKernel</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>From a spectral perspective, rPPG signals are primarily concentrated in the 0.7–2.5 Hz frequency band. Traditional bandpass filters are fixed and cannot adapt to individual or contextual variations. Learnable filter approaches such as SincNet [14] have shown great potential in related domains like speech processing and physiological signal estimation.</p>
<p>Transformer-based models [11] offer strong global modeling capabilities but suffer from quadratic complexity, making them less suitable for long sequences and edge deployment. In contrast, state-space models (SSMs), including S4 and Mamba [15, 16], provide a more efficient and interpretable way to model long-term dependencies in temporal signals.</p>
<p>On the training side, recent works have explored multi-objective loss formulations that combine regression, interval classification, and distributional alignment to mitigate label noise and account for sample uncertainty, thereby improving model robustness and generalization [17].</p>
<p>In summary, the current development of rPPG systems is oriented toward three key directions: multimodal fusion, frequency-aware modeling, and efficient sequential modeling. A major challenge remains in balancing model complexity, accuracy, and deployability, particularly for real-world applications.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Modifications</title>
      <sec id="sec-2-1">
        <title>2.1. System Introduction</title>
<p>Traditional rPPG systems face significant challenges, such as sensitivity to lighting variations, motion artifacts, and limited modeling of long-term temporal dependencies. To address these issues, we propose a modular end-to-end frequency-selective multimodal framework with four key modules: (1) a dual-stream feature extractor that encodes spatiotemporal dynamics from RGB and NIR inputs; (2) a Joint Cross Attention (JCA) module for cross-modal interaction and feature alignment; (3) an FSMamba encoder that combines state-space modeling with frequency-aware filtering to capture global and heart-rate-focused representations; and (4) a spectrum-aware decoder that fuses temporal and frequency features for robust heart rate prediction through joint classification and regression.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature Extraction Module</title>
<p>This module employs a dual-branch design based on inter-frame differencing and lightweight convolutional encoding to extract spatiotemporal features from RGB and NIR sequences.</p>
        <p>(1) Temporal Difference Construction (STMap). Following PhysFormer [10], we compute inter-frame differences to construct spatiotemporal maps (STMaps), which highlight subtle pulse-induced variations between frames:
X_t = I_{t+1} − I_t,  t = 1, …, T − 1
where I_t denotes the t-th frame of the input video sequence, and X_t is the resulting difference map that emphasizes temporal color fluctuations caused by blood flow.</p>
        <p>(2) Spatial Encoder. Each temporal difference map X_t is processed through a 5-layer CNN, where each layer consists of a 3 × 3 convolution, ReLU activation, and Batch Normalization:
F_l = BN(ReLU(Conv_{3×3}(F_{l−1})))
where F_l denotes the intermediate feature map at the l-th layer. The channel dimension doubles progressively across layers. After temporal stacking, the final output F_final ∈ ℝ^{T×C} is obtained by applying global average pooling (GAP) across spatial dimensions, where T is the number of frames and C is the feature dimension.</p>
        <p>(3) Dual-Modality Processing. Both RGB and NIR video streams undergo independent STMap generation and spatial encoding. The process for each modality is defined as:
X_rgb = SpatialEncoder(STMap(I_rgb)),  X_nir = SpatialEncoder(STMap(I_nir))
where I_rgb and I_nir are the original RGB and NIR frame sequences, respectively. The outputs X_rgb, X_nir ∈ ℝ^{T×C} represent temporally encoded features for each modality, with T as the sequence length and C the feature dimension.</p>
        <p>This dual-stream structure ensures that both modalities preserve their complementary spectral
information while enabling robust downstream multimodal fusion.</p>
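        <p>For concreteness, the following is a minimal PyTorch sketch of this front end. The base channel width and exact pooling placement are assumptions; the text above only specifies a 5-layer 3 × 3 Conv–ReLU–BN stack with doubling channels and spatial GAP:</p>
        <preformat>
import torch
import torch.nn as nn

def stmap(frames: torch.Tensor) -> torch.Tensor:
    """Inter-frame difference X_t = I_{t+1} - I_t; frames: (T, C, H, W)."""
    return frames[1:] - frames[:-1]

class SpatialEncoder(nn.Module):
    """5-layer Conv-ReLU-BN stack with doubling channels, then spatial GAP."""
    def __init__(self, in_ch: int = 3, base: int = 16):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(5):
            out = base * 2 ** i          # channel dimension doubles per layer
            layers += [nn.Conv2d(ch, out, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(out)]
            ch = out
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T-1, C, H, W) difference maps; returns (T-1, C') pooled features
        return self.net(x).mean(dim=(2, 3))
</preformat>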
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multimodal Fusion Mechanism: Bidirectional Cross-Modal Attention</title>
<p>To model the interaction between RGB and NIR modalities, we introduce the Joint Cross Attention (JCA) module based on the Transformer architecture [11, 13], operating on the sequence features X_rgb, X_nir ∈ ℝ^{B×T×C}.</p>
<p>For each attention head h, we compute the attention flow as follows. The mechanism is scaled dot-product attention, where each head uses different queries (Q), keys (K), and values (V).</p>
<p>F^h_{rgb→nir} = Attention(Q^h_rgb, K^h_nir, V^h_nir),  F^h_{nir→rgb} = Attention(Q^h_nir, K^h_rgb, V^h_rgb)
where Q^h_rgb, K^h_nir, and V^h_nir are the query, key, and value matrices for the h-th attention head, derived from the RGB and NIR features, and similarly Q^h_nir, K^h_rgb, and V^h_rgb are the corresponding matrices for the reverse direction.</p>
        <p>The attention function is defined as:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
where d_k is the dimensionality of the query and key vectors.</p>
<p>After computing each attention head for both directions, we concatenate the outputs of all heads. Let H be the number of heads; the concatenations are:</p>
        <p>F_{rgb→nir} = Concat(F^1_{rgb→nir}, F^2_{rgb→nir}, …, F^H_{rgb→nir}),  F_{nir→rgb} = Concat(F^1_{nir→rgb}, F^2_{nir→rgb}, …, F^H_{nir→rgb})</p>
        <p>Then, we update each stream via a residual connection:
X′_rgb = X_rgb + F_{nir→rgb},  X′_nir = X_nir + F_{rgb→nir}
Here, X′_rgb and X′_nir represent the attended features after cross-modal integration.</p>
        <p>Finally, we concatenate and project the updated features from both modalities:
X_fused = MLP(X′_rgb ‖ X′_nir)</p>
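        <p>A minimal sketch of the JCA module follows, using PyTorch's nn.MultiheadAttention for the per-head projections. The head count and feature width match Section 3.2 (4 heads, 256-dimensional features); the depth of the fusion MLP is an assumption:</p>
        <preformat>
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Bidirectional cross-modal attention with residual updates and MLP fusion."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.rgb_to_nir = nn.MultiheadAttention(d, heads, batch_first=True)
        self.nir_to_rgb = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x_rgb: torch.Tensor, x_nir: torch.Tensor) -> torch.Tensor:
        # F_{rgb->nir}: queries from RGB, keys/values from NIR (and vice versa)
        f_rgb2nir, _ = self.rgb_to_nir(x_rgb, x_nir, x_nir)
        f_nir2rgb, _ = self.nir_to_rgb(x_nir, x_rgb, x_rgb)
        x_rgb = x_rgb + f_nir2rgb        # X'_rgb = X_rgb + F_{nir->rgb}
        x_nir = x_nir + f_rgb2nir        # X'_nir = X_nir + F_{rgb->nir}
        return self.fuse(torch.cat([x_rgb, x_nir], dim=-1))  # X_fused, (B, T, d)
</preformat>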
      </sec>
      <sec id="sec-2-4">
        <title>2.4. FSMamba Encoder: Frequency-Selective State-Space Modeling</title>
<p>To overcome the limitations of conventional encoders in frequency selectivity and long-range modeling, we propose the FSMamba encoder, which combines the Mamba state-space framework [16] with a frequency-guided path that focuses on the heart rate band (0.7–2.5 Hz).</p>
        <p>(1) Overall Architecture. Given an input sequence X_0 ∈ ℝ^{B×T×C}, FSMamba employs a layered encoder structure in which the first layer is a standard MambaBlock, followed by L − 1 Frequency-Enhanced MambaBlocks.</p>
<p>Each enhanced layer processes its input as:</p>
        <p>Y^ℓ_ssm = MambaBlock_ℓ(LN(X_ℓ)),  Y^ℓ_hr = FSFilter_ℓ(LN(X_ℓ))
where X_ℓ is the input to layer ℓ, and Y^ℓ_ssm and Y^ℓ_hr are the outputs of the MambaBlock and FSFilter, respectively.</p>
        <p>X_{ℓ+1} = LN(X_ℓ + MLP(Y^ℓ_ssm ‖ Y^ℓ_hr))
where X_{ℓ+1} is the residual-updated output of the current layer, with feature fusion performed via an MLP.</p>
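        <p>A sketch of one frequency-enhanced layer is given below. It assumes the official mamba_ssm package for the state-space path and the FSFilter module sketched in part (3) below; the fusion MLP shape is an assumption:</p>
        <preformat>
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # official Mamba block, assumed installed

class FSMambaLayer(nn.Module):
    """One frequency-enhanced layer: parallel MambaBlock / FSFilter, MLP fusion."""
    def __init__(self, d: int = 256, d_state: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mamba = Mamba(d_model=d, d_state=d_state)   # state-space path
        self.fsfilter = FSFilter(d)                      # frequency path, part (3)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor):
        z = self.norm(x)                                 # LN(X_l)
        y_ssm, y_hr = self.mamba(z), self.fsfilter(z)    # Y_ssm, Y_hr
        x = self.out_norm(x + self.fuse(torch.cat([y_ssm, y_hr], dim=-1)))
        return x, y_hr                                   # y_hr collected for H_hr
</preformat>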
<p>H_output = LN(X_L),  H_hr = (1/(L−1)) ∑_{ℓ=1}^{L−1} Y^ℓ_hr
where X_L is the output after the L encoder layers of MambaBlock and Frequency-Enhanced MambaBlocks, and H_output = LN(X_L) is the final encoder representation after layer normalization. Additionally, H_hr aggregates heart-rate-enhanced features from all FSFilter layers by averaging them across layers.</p>
        <p>(2) State-Space Path in MambaBlock: Local Convolution and Global Temporal Modeling. The state-space path follows the standard MambaBlock [16], which models both short-term and long-range temporal dependencies through a combination of local convolution and global state recurrence. Given input X_ℓ ∈ ℝ^{T×C}, the block first applies a depthwise 1D convolution to capture local motion patterns across time. The result is then passed through a dynamic gating unit and projected into a state-space model (SSM) for global modeling.</p>
<p>The SSM core operates via a linear recurrence:
s_{t+1} = A s_t + B x_t,  y_t = C s_t + D x_t
where x_t is the input at time t, s_t is the latent state, and y_t is the output. The matrices A, B, C, D are learnable and define a causal filter with a global temporal receptive field.</p>
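        <p>This recurrence can be written as a naive causal scan, shown below for reference (Mamba itself uses input-dependent parameters and a hardware-efficient parallel scan rather than this Python loop):</p>
        <preformat>
import torch

def ssm_scan(x, A, B, C, D):
    """Naive causal SSM scan: s_{t+1} = A s_t + B x_t, y_t = C s_t + D x_t.
    x: (T, d_in); A: (n, n); B: (n, d_in); C: (d_out, n); D: (d_out, d_in)."""
    s = x.new_zeros(A.shape[0])
    ys = []
    for x_t in x:                        # sequential over time (causal)
        ys.append(C @ s + D @ x_t)       # emit output for step t
        s = A @ s + B @ x_t              # state transition
    return torch.stack(ys)               # (T, d_out)
</preformat>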
<p>The output is fused with the input via a residual connection and normalized. This design enables MambaBlock to efficiently learn hierarchical temporal features by combining local convolution, global recurrence, and data-driven gating in a lightweight and scalable architecture.</p>
        <p>(3) Frequency-Enhanced MambaBlock: Learning Frequency-Focused Representations. To enhance sensitivity to periodic physiological dynamics such as heart rate oscillations, we augment the MambaBlock with a parallel Frequency-Selective Filter (FSFilter). This module is implemented via a modified state-space kernel, denoted SSM_hr_band, which introduces frequency-domain awareness by dynamically modulating its temporal resolution.</p>
<p>We first define a learnable time step:
Δ = 2 / (f^hr_min + f^hr_max)
where f^hr_min, f^hr_max &gt; 0 are trainable scalars initialized to 0.7 and 2.5 Hz, respectively; at initialization, for example, Δ = 2 / (0.7 + 2.5) = 0.625. During training, Δ adapts to shift the effective center frequency of the filter.</p>
        <p>The FSFilter applies the following causal state-space recurrence at each time step t:
s_{t+1} = A_Δ s_t + B_Δ x_t,  y_t = C s_t
where A_Δ, B_Δ are transition matrices modulated by Δ, C is the output projection, x_t ∈ ℝ is the input at time t, and s_t is the latent state.</p>
        <p>By unrolling the recurrence, this is equivalent to a causal convolution:
y_t = ∑_{k=0}^{t} h_k x_{t−k},  h_k = C (A_Δ)^k B_Δ
where {h_k} is the implicit filter kernel whose effective bandwidth is controlled by Δ.</p>
<p>Finally, we apply a channel-wise gating to the filter output:</p>
        <p>Y^ℓ_hr = SSM_hr_band(X_ℓ; Δ) ⊙ w_hr,  w_hr ∈ ℝ^C
where w_hr is a trainable vector and ⊙ denotes element-wise multiplication. This gating further emphasizes or suppresses specific channels according to their relevance to the heart-rate frequency band.</p>
<p>Through (i) learnable frequency bounds via Δ, (ii) the state-space-derived kernel h_k, and (iii) channel-wise gating w_hr, the FSFilter functions as a data-driven band-pass filter centered on the heart rate range, providing explicit frequency-domain selectivity in the FSMamba encoder.</p>
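        <p>The following sketch illustrates the FSFilter idea under explicit assumptions not fixed by the text above: a diagonal state matrix with stable poles, zero-order-hold-style discretization, and a truncated kernel length:</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSFilter(nn.Module):
    """Frequency-selective filter sketch: learnable Delta from HR bounds,
    implicit kernel h_k = C (A_Delta)^k B_Delta, causal conv, channel gating."""
    def __init__(self, d: int = 256, n: int = 16, k_len: int = 64):
        super().__init__()
        self.f_min = nn.Parameter(torch.tensor(0.7))   # trainable HR bounds (Hz)
        self.f_max = nn.Parameter(torch.tensor(2.5))
        self.log_a = nn.Parameter(torch.rand(d, n))    # diagonal poles (log scale)
        self.B = nn.Parameter(torch.randn(d, n))
        self.C = nn.Parameter(torch.randn(d, n))
        self.w_hr = nn.Parameter(torch.ones(d))        # channel-wise gate
        self.k_len = k_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d)
        delta = 2.0 / (self.f_min + self.f_max)        # learnable time step
        a = -torch.exp(self.log_a)                     # stable poles (negative real part)
        a_d = torch.exp(a * delta)                     # discretized A_Delta (diagonal)
        b_d = (a_d - 1.0) / a * self.B                 # discretized B_Delta (ZOH-style)
        k = torch.arange(self.k_len, dtype=x.dtype, device=x.device)
        powers = a_d.unsqueeze(1) ** k.view(1, -1, 1)  # (d, k_len, n): (A_Delta)^k
        kernel = torch.einsum('dn,dkn->dk', self.C, powers * b_d.unsqueeze(1))
        x_pad = F.pad(x.transpose(1, 2), (self.k_len - 1, 0))   # causal left pad
        y = F.conv1d(x_pad, kernel.flip(-1).unsqueeze(1), groups=x.shape[-1])
        return y.transpose(1, 2) * self.w_hr           # gate channels by relevance
</preformat>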
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Heart Rate Decoder Module</title>
<p>To address both interval discrimination and frequency sensitivity, the decoder integrates temporal and frequency-aware features with joint classification and regression objectives [14, 17].</p>
        <p>(1) Feature Extraction and Spectral Enhancement. The features from the FSMamba encoder are processed in two branches:</p>
        <p>- Raw Feature Processing: The temporal features are passed through an MLP to extract raw features:
F_raw = MLP(H_output)
where H_output is the global representation from the FSMamba encoder.</p>
<p>- Frequency Feature Enhancement: The heart-rate-focused features H_hr are processed through an MLP followed by a Sinc convolutional layer to enhance frequency-specific features:
F_freq = SincConv1d(MLP(H_hr))
where H_hr is the heart-rate-focused representation from the FSMamba encoder. The SincConv1d layer is a learnable filter designed to enhance the relevant frequency components for heart rate estimation.</p>
<p>(2) Feature Fusion and Regression. After extracting the raw and frequency-enhanced features, they are fused as:
F_fused = MLP(F_raw ‖ F_freq)
Here, the raw temporal features F_raw and the frequency-enhanced features F_freq are concatenated and fused using an MLP.</p>
<p>Finally, the heart rate prediction ŷ_final is computed using a softmax over interval classes combined with a class-wise regression head:
ŷ_final = ∑_{c=1}^{C} softmax(F_fused)_c ⋅ Reg_c(F_fused)
where C is the number of interval classes and Reg_c denotes the regression output for class c.</p>
        <p>This design enhances both frequency focus and interval-level adaptation, yielding robust heart rate
predictions even under challenging conditions.</p>
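        <p>A compact sketch of the decoder follows. The number of interval classes and the MLP shapes are assumptions, and a plain Conv1d stands in for the learnable SincConv1d of [14]:</p>
        <preformat>
import torch
import torch.nn as nn

class HRDecoder(nn.Module):
    """Decoder sketch: raw/frequency branches, fusion, class-weighted regression."""
    def __init__(self, d: int = 256, n_classes: int = 40):
        super().__init__()
        self.raw_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.freq_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.sinc = nn.Conv1d(d, d, kernel_size=33, padding=16)  # SincConv1d stand-in
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.cls_head = nn.Linear(d, n_classes)   # interval-class logits
        self.reg_head = nn.Linear(d, n_classes)   # per-class regression outputs

    def forward(self, h_out: torch.Tensor, h_hr: torch.Tensor) -> torch.Tensor:
        f_raw = self.raw_mlp(h_out).mean(dim=1)                     # F_raw, (B, d)
        f_freq = self.sinc(self.freq_mlp(h_hr).transpose(1, 2)).mean(dim=-1)
        f = self.fuse(torch.cat([f_raw, f_freq], dim=-1))           # F_fused
        probs = self.cls_head(f).softmax(dim=-1)                    # interval probabilities
        return (probs * self.reg_head(f)).sum(dim=-1)               # y_hat = sum_c p_c Reg_c
</preformat>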
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Loss Function Design</title>
<p>We adopt a multi-objective loss function to jointly optimize prediction accuracy, classification sensitivity, and distributional stability [17]:
ℒ_total = λ_reg ⋅ ℒ_reg + λ_cls ⋅ ℒ_cls + λ_dist ⋅ ℒ_dist
where λ_reg = 1.0, λ_cls = 1.0, and λ_dist = 0.5.</p>
<p>The distributional term is defined as:
ℒ_dist = (μ̂ − μ)² + (σ̂ − σ)²</p>
        <p>Here, ŷ_final denotes the predicted heart rate, y is the ground truth, p_{t,c} is the probability assigned to the true interval class (used by ℒ_cls), and μ, σ are the empirical mean and standard deviation, with μ̂, σ̂ their predicted counterparts; ℒ_reg penalizes the error between ŷ_final and y. This formulation improves both accuracy and robustness under uncertainty [17].</p>
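        <p>A sketch of the total loss is shown below; the exact forms of ℒ_reg (taken here as L1) and ℒ_cls (taken as cross-entropy on the true interval class) are assumptions consistent with the notation above:</p>
        <preformat>
import torch
import torch.nn.functional as F

def total_loss(y_hat, y, logits, cls_target, weights=(1.0, 1.0, 0.5)):
    """Multi-objective loss: regression + interval classification + distribution."""
    l_reg = F.l1_loss(y_hat, y)                     # L_reg (L1 form assumed)
    l_cls = F.cross_entropy(logits, cls_target)     # L_cls on the true interval
    l_dist = (y_hat.mean() - y.mean()) ** 2 + (y_hat.std() - y.std()) ** 2
    return weights[0] * l_reg + weights[1] * l_cls + weights[2] * l_dist
</preformat>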
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Design and Result Analysis</title>
<p>We evaluate the proposed FSMamba system through comprehensive comparative and ablation experiments on the VIPL-HR dataset [12], with detailed analysis to validate the effectiveness and contribution of each module.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
<p>VIPL-HR contains 2,378 RGB and 752 NIR facial video sequences from 107 subjects under varied conditions (rest, motion, illumination), with synchronized PPG, heart rate, and SpO₂ annotations [12].</p>
        <p>Oulu Bio Face Database (OBF) comprises videos from 100 healthy volunteers and 6 AF patients
(two 5-minute sessions per subject in RGB + NIR), along with simultaneous ECG, PPG, and respiration
signals, for heart rate, respiratory rate, and AF detection benchmarking [18].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Settings and Evaluation</title>
<p>We adopt the Root Mean Square Error (RMSE) as the primary evaluation metric to assess model robustness under outlier predictions, defined as:</p>
        <p>RMSE = √((1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²)
where y_i and ŷ_i are the ground-truth and predicted heart rates of the i-th sample, and N is the number of test samples.</p>
        <p>All experiments are conducted on VIPL-HR using the subject-independent protocol [12], with 684/68/198 samples for training, validation, and testing. The model is trained for 15 epochs with the Adam optimizer (lr = 1 × 10⁻⁴, batch size = 2) on 180-frame segments.</p>
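        <p>Equivalently, in PyTorch:</p>
        <preformat>
import torch

def rmse(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    # Root mean square error over the N test samples (in BPM)
    return torch.sqrt(torch.mean((y - y_hat) ** 2))
</preformat>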
<p>The architecture includes four modules: a feature extractor, a cross-modal fusion block (4-head attention, 64 dims/head), a 6-layer FSMamba encoder (d_model = 256, d_state = 128, 30 Hz sampling), and a heart rate decoder (16 SincConv layers, kernel size = 33, band = 0.7–2.5 Hz, regression head dim = 128) [16, 14]. Inputs are resized to 128 × 128, and all features are 256-dimensional. This setup ensures a good trade-off between temporal modeling and frequency selectivity for accurate rPPG estimation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
<p>As shown in Table 1, the proposed FSMamba system achieves consistent performance across diverse VIPL-HR scenarios. In typical settings (e.g., sitting, talking, bright lighting), RMSE stays within 11–13 BPM. Even under low light, long distance, or mobile capture (v4, v6, v8/v9), it remains below 15 BPM, demonstrating robustness to noise and input variation. The highest RMSE (24.17 BPM) appears in the post-exercise recovery scenario (v7), suggesting that modeling rapid physiological changes needs further work. The solid performance in mobile cases further indicates strong potential for real-world deployment.</p>
<p>Ablation results (Table 2) demonstrate the importance of each loss component and system module. Removing ℒ_reg, cross-modal attention, or the NIR modality notably degrades RMSE, confirming their complementary roles. Despite limited NIR data, its inclusion enhances low-light robustness. Overall, FSMamba's design effectively balances accuracy, generalization, and real-world applicability through frequency-aware modeling and multimodal fusion.</p>
<p>As shown in Table 3, our proposed system (Team: xiuxejia, Hefei University of Technology) ranked 6th on the official RE-PSS leaderboard, evaluated on the VIPL-HR and OBF test sets, with an RMSE of 16.25 BPM.</p>
      <p>Due to local resource constraints, we submitted a lightweight version without pre-trained weights or extensive hyperparameter tuning. Even so, the system performed well, demonstrating its robustness and potential for deployment.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>We propose FSMamba, a frequency-selective multimodal framework for heart rate estimation based on the Mamba architecture. It combines RGB-NIR fusion, frequency-aware encoding, and a multi-branch loss to enhance robustness. Experiments on VIPL-HR show strong generalization, with future work focusing on adaptability and deployment efficiency.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for translation between
Chinese and English of author-written text and grammar/spelling/style polishing of author-written paragraphs.
No Generative AI tools were used to generate ideas, design the study, conduct analyses/experiments,
produce results, or create figures/tables. After using this tool, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] Ming-Zher Poh, Daniel J. McDuff, Rosalind W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (2010) 10762–10774. URL: https://opg.optica.org/oe/abstract.cfm?URI=oe-18-10-10762. doi:10.1364/OE.18.010762.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, William Freeman, Eulerian video magnification for revealing subtle changes in the world, ACM Trans. Graph. 31 (2012). URL: https://doi.org/10.1145/2185520.2185561. doi:10.1145/2185520.2185561.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] Guha Balakrishnan, Frédo Durand, John V. Guttag, Detecting pulse from head motions in video, 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013) 3430–3437. URL: https://api.semanticscholar.org/CorpusID:17407827.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] Gerard de Haan, Vincent Jeanne, Robust pulse rate from chrominance-based rPPG, IEEE Transactions on Biomedical Engineering 60 (2013) 2878–2886. doi:10.1109/TBME.2013.2266196.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] Wenjin Wang, Albertus C. den Brinker, Sander Stuijk, Gerard de Haan, Algorithmic principles of remote PPG, IEEE Transactions on Biomedical Engineering 64 (2017) 1479–1491. doi:10.1109/TBME.2016.2609282.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] Xiaobai Li, Jie Chen, Guoying Zhao, Matti Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[7] Weixuan Chen, Daniel McDuff, DeepPhys: Video-based physiological measurement using convolutional attention networks, in: Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, Yair Weiss (Eds.), Computer Vision – ECCV 2018, Springer International Publishing, Cham, 2018, pp. 356–373.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>[8] Zitong Yu, Xiaobai Li, Guoying Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, 2019. URL: https://arxiv.org/abs/1905.02419. arXiv:1905.02419.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>[9] Xin Liu, Josh Fromm, Shwetak Patel, Daniel McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, in: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 19400–19411. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/e1228be46de6a0234ac22ded31417bc7-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>[10] Zitong Yu, Yuming Shen, Jingang Shi, Hengshuang Zhao, Philip H. S. Torr, Guoying Zhao, PhysFormer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4186–4196.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, Attention is all you need, in: I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>[12] Xuesong Niu, Shiguang Shan, Hu Han, Xilin Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Trans. Image Process. 29 (2020) 2409–2423. URL: https://doi.org/10.1109/TIP.2019.2947204. doi:10.1109/TIP.2019.2947204.</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>[13] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov, Multimodal transformer for unaligned multimodal language sequences, in: Anna Korhonen, David Traum, Lluís Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6558–6569. URL: https://aclanthology.org/P19-1656/. doi:10.18653/v1/P19-1656.</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>[14] Mirco Ravanelli, Yoshua Bengio, Speaker recognition from raw waveform with SincNet, in: 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 1021–1028. doi:10.1109/SLT.2018.8639585.</mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>[15] Albert Gu, Karan Goel, Christopher Ré, Efficiently modeling long sequences with structured state spaces, in: Proceedings of the International Conference on Learning Representations (ICLR), 2022. URL: https://openreview.net/forum?id=uYLFoz1vlAC.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Albert Gu, Tri Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: Proceedings of the International Conference on Learning Representations (ICLR), 2024. URL: https://openreview.net/forum?id=AL1fq05o7H.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Alex Kendall, Yarin Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/2650d6089a6d640c5e85b2b88265dc2b-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Xuesong Li, Kun Peng, Xiaobai Li, Hu Han, Shiguang Shan, Xilin Chen, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018), 2018, pp. 242–249. doi:10.1109/FG.2018.00043.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>