<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Mamba Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiuze Jia</string-name>
          <email>xiuxejia@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xuenan Liu</string-name>
          <email>xuenanliu@mail.hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siyi Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lizhong Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Ding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>School of Computer Science and Information Engineering, Hefei University of Technology</institution>
          ,
          <addr-line>Hefei, 230601, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>School of Software, Hefei University of Technology</institution>
          ,
          <addr-line>Hefei, 230601, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Remote heart rate estimation from video still faces three main challenges in real-world scenarios: (1) the absence of adaptive, frequency-selective modeling allows low-frequency physiological rhythms to be overwhelmed by noise; (2) single-modality inputs suffer from instability under varying illumination, occlusions, or device changes; and (3) conventional temporal encoders are computationally expensive and lack long-sequence generalization. To address these limitations, we propose FSMamba, a frequency-selective multimodal perception system built upon the Mamba state-space framework. FSMamba employs a dual-branch feature extractor for RGB and NIR streams and a Joint Cross Attention (JCA) module to enable bidirectional, multi-head cross-modal interaction. In its encoder, we combine the standard MambaBlock with a parallel Frequency-Selective Filter (FSFilter) that uses a learnable time step (derived from trainable heart-rate bounds) and an SSMKernel-based causal recurrence to implicitly generate a band-pass convolution kernel. Channel-wise gating further refines the heart-rate-focused features. The decoder fuses raw temporal and frequency-enhanced representations via joint classification and class-wise regression to predict the final heart rate. Experiments on VIPL-HR demonstrate that FSMamba achieves competitive RMSE performance across the majority of diverse conditions, and ablation studies confirm the effectiveness of each module.</p>
      </abstract>
      <kwd-group>
        <kwd>state-space modeling</kwd>
        <kwd>SSMKernel</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>From a spectral perspective, rPPG signals are primarily concentrated in the 0.7–2.5 Hz frequency band. Traditional bandpass filters are fixed and cannot adapt to individual or contextual variations. Learnable filter approaches such as SincNet [14] have shown great potential in related domains like speech processing and physiological signal estimation.</p>
<p>Transformer-based models [11] offer strong global modeling capabilities but suffer from quadratic complexity, making them less suitable for long sequences and edge deployment. In contrast, state-space models (SSMs), including S4 and Mamba [15, 16], provide a more efficient and interpretable way to model long-term dependencies in temporal signals.</p>
<p>On the training side, recent works have explored multi-objective loss formulations that combine regression, interval classification, and distributional alignment to mitigate label noise and account for sample uncertainty, thereby improving model robustness and generalization [17].</p>
<p>In summary, the current development of rPPG systems is oriented toward three key directions: multimodal fusion, frequency-aware modeling, and efficient sequential modeling. A major challenge remains in balancing model complexity, accuracy, and deployability, particularly for real-world applications.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Modifications</title>
      <sec id="sec-2-1">
        <title>2.1. System Introduction</title>
<p>Traditional rPPG systems face significant challenges, such as sensitivity to lighting variations, motion artifacts, and limited modeling of long-term temporal dependencies. To address these issues, we propose a modular end-to-end frequency-selective multimodal framework with four key modules: (1) a dual-stream feature extractor that encodes spatiotemporal dynamics from RGB and NIR inputs; (2) a Joint Cross Attention (JCA) module for cross-modal interaction and feature alignment; (3) an FSMamba encoder that combines state-space modeling with frequency-aware filtering to capture global and heart-rate-focused representations; and (4) a spectrum-aware decoder that fuses temporal and frequency features for robust heart rate prediction through joint classification and regression.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature Extraction Module</title>
<p>This module employs a dual-branch design based on inter-frame differencing and lightweight convolutional encoding to extract spatiotemporal features from RGB and NIR sequences.</p>
        <p>(1) Temporal Difference Construction (STMap). Following PhysFormer [10], we compute inter-frame differences to construct spatiotemporal maps (STMaps), which highlight subtle pulse-induced variations between frames:
X_t = I_{t+1} − I_t,  t = 1, …, T − 1
where I_t denotes the t-th frame of the input video sequence, and X_t is the resulting difference map that emphasizes temporal color fluctuations caused by blood flow.</p>
        <p>(2) Spatial Encoder. Each temporal difference map X_t is processed through a 5-layer CNN, where each layer consists of a 3 × 3 convolution, ReLU activation, and Batch Normalization:
F_l = BN(ReLU(Conv_{3×3}(F_{l−1})))
where F_l denotes the intermediate feature map at the l-th layer. The channel dimension doubles progressively across layers. After temporal stacking, the final output F_final ∈ ℝ^{T×C} is obtained by applying global average pooling (GAP) across spatial dimensions, where T is the number of frames and C is the feature dimension.</p>
        <p>(3) Dual-Modality Processing. Both RGB and NIR video streams undergo independent STMap generation and spatial encoding. The process for each modality is defined as:
X_rgb = SpatialEncoder(STMap(I_rgb)),  X_nir = SpatialEncoder(STMap(I_nir))
where I_rgb and I_nir are the original RGB and NIR frame sequences, respectively. The outputs X_rgb, X_nir ∈ ℝ^{T×C} represent temporally encoded features for each modality, with T as the sequence length and C the feature dimension.</p>
        <p>This dual-stream structure ensures that both modalities preserve their complementary spectral
information while enabling robust downstream multimodal fusion.</p>
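        <p>For concreteness, the following is a minimal PyTorch sketch of this front end. The base channel width and exact pooling placement are assumptions; the text above only specifies a 5-layer 3 × 3 Conv–ReLU–BN stack with doubling channels and spatial GAP:</p>
        <preformat>
import torch
import torch.nn as nn

def stmap(frames: torch.Tensor) -> torch.Tensor:
    """Inter-frame difference X_t = I_{t+1} - I_t; frames: (T, C, H, W)."""
    return frames[1:] - frames[:-1]

class SpatialEncoder(nn.Module):
    """5-layer Conv-ReLU-BN stack with doubling channels, then spatial GAP."""
    def __init__(self, in_ch: int = 3, base: int = 16):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(5):
            out = base * 2 ** i          # channel dimension doubles per layer
            layers += [nn.Conv2d(ch, out, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(out)]
            ch = out
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T-1, C, H, W) difference maps; returns (T-1, C') pooled features
        return self.net(x).mean(dim=(2, 3))
</preformat>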
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multimodal Fusion Mechanism: Bidirectional Cross-Modal Attention</title>
<p>To model the interaction between RGB and NIR modalities, we introduce the Joint Cross Attention (JCA) module based on the Transformer architecture [11, 13], operating on the sequence features X_rgb, X_nir ∈ ℝ^{B×T×C}.</p>
<p>For each attention head h, we compute the attention flow as follows. The mechanism is scaled dot-product attention, where each head uses different queries (Q), keys (K), and values (V).</p>
<p>F^h_{rgb→nir} = Attention(Q^h_rgb, K^h_nir, V^h_nir),  F^h_{nir→rgb} = Attention(Q^h_nir, K^h_rgb, V^h_rgb)
where Q^h_rgb, K^h_nir, and V^h_nir are the query, key, and value matrices for the h-th attention head, derived from the RGB and NIR features, and similarly Q^h_nir, K^h_rgb, and V^h_rgb are the corresponding matrices for the reverse direction.</p>
        <p>The attention function is defined as:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
where d_k is the dimensionality of the query and key vectors.</p>
<p>After computing each attention head for both directions, we concatenate the outputs of all heads. Let H be the number of heads; the concatenations are:</p>
        <p>F_{rgb→nir} = Concat(F^1_{rgb→nir}, F^2_{rgb→nir}, …, F^H_{rgb→nir}),  F_{nir→rgb} = Concat(F^1_{nir→rgb}, F^2_{nir→rgb}, …, F^H_{nir→rgb})</p>
        <p>Then, we update each stream via a residual connection:
X′_rgb = X_rgb + F_{nir→rgb},  X′_nir = X_nir + F_{rgb→nir}
Here, X′_rgb and X′_nir represent the attended features after cross-modal integration.</p>
        <p>Finally, we concatenate and project the updated features from both modalities:
X_fused = MLP(X′_rgb ‖ X′_nir)</p>
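        <p>A minimal sketch of the JCA module follows, using PyTorch's nn.MultiheadAttention for the per-head projections. The head count and feature width match Section 3.2 (4 heads, 256-dimensional features); the depth of the fusion MLP is an assumption:</p>
        <preformat>
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Bidirectional cross-modal attention with residual updates and MLP fusion."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.rgb_to_nir = nn.MultiheadAttention(d, heads, batch_first=True)
        self.nir_to_rgb = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x_rgb: torch.Tensor, x_nir: torch.Tensor) -> torch.Tensor:
        # F_{rgb->nir}: queries from RGB, keys/values from NIR (and vice versa)
        f_rgb2nir, _ = self.rgb_to_nir(x_rgb, x_nir, x_nir)
        f_nir2rgb, _ = self.nir_to_rgb(x_nir, x_rgb, x_rgb)
        x_rgb = x_rgb + f_nir2rgb        # X'_rgb = X_rgb + F_{nir->rgb}
        x_nir = x_nir + f_rgb2nir        # X'_nir = X_nir + F_{rgb->nir}
        return self.fuse(torch.cat([x_rgb, x_nir], dim=-1))  # X_fused, (B, T, d)
</preformat>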
      </sec>
      <sec id="sec-2-4">
        <title>2.4. FSMamba Encoder: Frequency-Selective State-Space Modeling</title>
<p>To overcome the limitations of conventional encoders in frequency selectivity and long-range modeling, we propose the FSMamba encoder, which combines the Mamba state-space framework [16] with a frequency-guided path that focuses on the heart rate band (0.7–2.5 Hz).</p>
        <p>(1) Overall Architecture. Given an input sequence X_0 ∈ ℝ^{B×T×C}, FSMamba employs a layered encoder structure in which the first layer is a standard MambaBlock, followed by L − 1 Frequency-Enhanced MambaBlocks.</p>
<p>Each enhanced layer processes its input as:</p>
        <p>Y^ℓ_ssm = MambaBlock_ℓ(LN(X_ℓ)),  Y^ℓ_hr = FSFilter_ℓ(LN(X_ℓ))
where X_ℓ is the input to layer ℓ, and Y^ℓ_ssm and Y^ℓ_hr are the outputs of the MambaBlock and FSFilter, respectively.</p>
        <p>X_{ℓ+1} = LN(X_ℓ + MLP(Y^ℓ_ssm ‖ Y^ℓ_hr))
where X_{ℓ+1} is the residual-updated output of the current layer, with feature fusion performed via an MLP.</p>
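        <p>A sketch of one frequency-enhanced layer is given below. It assumes the official mamba_ssm package for the state-space path and the FSFilter module sketched in part (3) below; the fusion MLP shape is an assumption:</p>
        <preformat>
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # official Mamba block, assumed installed

class FSMambaLayer(nn.Module):
    """One frequency-enhanced layer: parallel MambaBlock / FSFilter, MLP fusion."""
    def __init__(self, d: int = 256, d_state: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mamba = Mamba(d_model=d, d_state=d_state)   # state-space path
        self.fsfilter = FSFilter(d)                      # frequency path, part (3)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor):
        z = self.norm(x)                                 # LN(X_l)
        y_ssm, y_hr = self.mamba(z), self.fsfilter(z)    # Y_ssm, Y_hr
        x = self.out_norm(x + self.fuse(torch.cat([y_ssm, y_hr], dim=-1)))
        return x, y_hr                                   # y_hr collected for H_hr
</preformat>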
<p>H_output = LN(X_L),  H_hr = (1/(L−1)) ∑_{ℓ=1}^{L−1} Y^ℓ_hr
where X_L is the output after the L encoder layers of MambaBlock and Frequency-Enhanced MambaBlocks, and H_output = LN(X_L) is the final encoder representation after layer normalization. Additionally, H_hr aggregates heart-rate-enhanced features from all FSFilter layers by averaging them across layers.</p>
        <p>(2) State-Space Path in MambaBlock: Local Convolution and Global Temporal Modeling. The state-space path follows the standard MambaBlock [16], which models both short-term and long-range temporal dependencies through a combination of local convolution and global state recurrence. Given input X_ℓ ∈ ℝ^{T×C}, the block first applies a depthwise 1D convolution to capture local motion patterns across time. The result is then passed through a dynamic gating unit and projected into a state-space model (SSM) for global modeling.</p>
<p>The SSM core operates via a linear recurrence:
s_{t+1} = A s_t + B x_t,  y_t = C s_t + D x_t
where x_t is the input at time t, s_t is the latent state, and y_t is the output. The matrices A, B, C, D are learnable and define a causal filter with a global temporal receptive field.</p>
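        <p>This recurrence can be written as a naive causal scan, shown below for reference (Mamba itself uses input-dependent parameters and a hardware-efficient parallel scan rather than this Python loop):</p>
        <preformat>
import torch

def ssm_scan(x, A, B, C, D):
    """Naive causal SSM scan: s_{t+1} = A s_t + B x_t, y_t = C s_t + D x_t.
    x: (T, d_in); A: (n, n); B: (n, d_in); C: (d_out, n); D: (d_out, d_in)."""
    s = x.new_zeros(A.shape[0])
    ys = []
    for x_t in x:                        # sequential over time (causal)
        ys.append(C @ s + D @ x_t)       # emit output for step t
        s = A @ s + B @ x_t              # state transition
    return torch.stack(ys)               # (T, d_out)
</preformat>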
<p>The output is fused with the input via a residual connection and normalized. This design enables MambaBlock to efficiently learn hierarchical temporal features by combining local convolution, global recurrence, and data-driven gating in a lightweight and scalable architecture.</p>
        <p>(3) Frequency-Enhanced MambaBlock: Learning Frequency-Focused Representations. To enhance sensitivity to periodic physiological dynamics such as heart rate oscillations, we augment the MambaBlock with a parallel Frequency-Selective Filter (FSFilter). This module is implemented via a modified state-space kernel, denoted SSM_hr_band, which introduces frequency-domain awareness by dynamically modulating its temporal resolution.</p>
<p>We first define a learnable time step:
Δ = 2 / (f^hr_min + f^hr_max)
where f^hr_min, f^hr_max &gt; 0 are trainable scalars initialized to 0.7 and 2.5 Hz, respectively; at initialization, for example, Δ = 2 / (0.7 + 2.5) = 0.625. During training, Δ adapts to shift the effective center frequency of the filter.</p>
        <p>The FSFilter applies the following causal state-space recurrence at each time step t:
s_{t+1} = A_Δ s_t + B_Δ x_t,  y_t = C s_t
where A_Δ, B_Δ are transition matrices modulated by Δ, C is the output projection, x_t ∈ ℝ is the input at time t, and s_t is the latent state.</p>
        <p>By unrolling the recurrence, this is equivalent to a causal convolution:
y_t = ∑_{k=0}^{t} h_k x_{t−k},  h_k = C (A_Δ)^k B_Δ
where {h_k} is the implicit filter kernel whose effective bandwidth is controlled by Δ.</p>
<p>Finally, we apply a channel-wise gating to the filter output:</p>
        <p>Y^ℓ_hr = SSM_hr_band(X_ℓ; Δ) ⊙ w_hr,  w_hr ∈ ℝ^C
where w_hr is a trainable vector and ⊙ denotes element-wise multiplication. This gating further emphasizes or suppresses specific channels according to their relevance to the heart-rate frequency band.</p>
<p>Through (i) learnable frequency bounds via Δ, (ii) the state-space-derived kernel h_k, and (iii) channel-wise gating w_hr, the FSFilter functions as a data-driven band-pass filter centered on the heart rate range, providing explicit frequency-domain selectivity in the FSMamba encoder.</p>
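        <p>The following sketch illustrates the FSFilter idea under explicit assumptions not fixed by the text above: a diagonal state matrix with stable poles, zero-order-hold-style discretization, and a truncated kernel length:</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSFilter(nn.Module):
    """Frequency-selective filter sketch: learnable Delta from HR bounds,
    implicit kernel h_k = C (A_Delta)^k B_Delta, causal conv, channel gating."""
    def __init__(self, d: int = 256, n: int = 16, k_len: int = 64):
        super().__init__()
        self.f_min = nn.Parameter(torch.tensor(0.7))   # trainable HR bounds (Hz)
        self.f_max = nn.Parameter(torch.tensor(2.5))
        self.log_a = nn.Parameter(torch.rand(d, n))    # diagonal poles (log scale)
        self.B = nn.Parameter(torch.randn(d, n))
        self.C = nn.Parameter(torch.randn(d, n))
        self.w_hr = nn.Parameter(torch.ones(d))        # channel-wise gate
        self.k_len = k_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d)
        delta = 2.0 / (self.f_min + self.f_max)        # learnable time step
        a = -torch.exp(self.log_a)                     # stable poles (negative real part)
        a_d = torch.exp(a * delta)                     # discretized A_Delta (diagonal)
        b_d = (a_d - 1.0) / a * self.B                 # discretized B_Delta (ZOH-style)
        k = torch.arange(self.k_len, dtype=x.dtype, device=x.device)
        powers = a_d.unsqueeze(1) ** k.view(1, -1, 1)  # (d, k_len, n): (A_Delta)^k
        kernel = torch.einsum('dn,dkn->dk', self.C, powers * b_d.unsqueeze(1))
        x_pad = F.pad(x.transpose(1, 2), (self.k_len - 1, 0))   # causal left pad
        y = F.conv1d(x_pad, kernel.flip(-1).unsqueeze(1), groups=x.shape[-1])
        return y.transpose(1, 2) * self.w_hr           # gate channels by relevance
</preformat>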
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Heart Rate Decoder Module</title>
<p>To address both interval discrimination and frequency sensitivity, the decoder integrates temporal and frequency-aware features with joint classification and regression objectives [14, 17].</p>
        <p>(1) Feature Extraction and Spectral Enhancement. The features from the FSMamba encoder are processed in two branches:</p>
        <p>- Raw Feature Processing: The temporal features are passed through an MLP to extract raw features:
F_raw = MLP(H_output)
where H_output is the global representation from the FSMamba encoder.</p>
<p>- Frequency Feature Enhancement: The heart-rate-focused features H_hr are processed through an MLP followed by a Sinc convolutional layer to enhance frequency-specific features:
F_freq = SincConv1d(MLP(H_hr))
where H_hr is the heart-rate-focused representation from the FSMamba encoder. The SincConv1d layer is a learnable filter designed to enhance the relevant frequency components for heart rate estimation.</p>
<p>(2) Feature Fusion and Regression. After extracting the raw and frequency-enhanced features, they are fused as:
F_fused = MLP(F_raw ‖ F_freq)
Here, the raw temporal features F_raw and the frequency-enhanced features F_freq are concatenated and fused using an MLP.</p>
<p>Finally, the heart rate prediction ŷ_final is computed using a softmax over interval classes combined with a class-wise regression head:
ŷ_final = ∑_{c=1}^{C} softmax(F_fused)_c ⋅ Reg_c(F_fused)
where C is the number of interval classes and Reg_c denotes the regression output for class c.</p>
        <p>This design enhances both frequency focus and interval-level adaptation, yielding robust heart rate
predictions even under challenging conditions.</p>
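        <p>A compact sketch of the decoder follows. The number of interval classes and the MLP shapes are assumptions, and a plain Conv1d stands in for the learnable SincConv1d of [14]:</p>
        <preformat>
import torch
import torch.nn as nn

class HRDecoder(nn.Module):
    """Decoder sketch: raw/frequency branches, fusion, class-weighted regression."""
    def __init__(self, d: int = 256, n_classes: int = 40):
        super().__init__()
        self.raw_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.freq_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.sinc = nn.Conv1d(d, d, kernel_size=33, padding=16)  # SincConv1d stand-in
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.cls_head = nn.Linear(d, n_classes)   # interval-class logits
        self.reg_head = nn.Linear(d, n_classes)   # per-class regression outputs

    def forward(self, h_out: torch.Tensor, h_hr: torch.Tensor) -> torch.Tensor:
        f_raw = self.raw_mlp(h_out).mean(dim=1)                     # F_raw, (B, d)
        f_freq = self.sinc(self.freq_mlp(h_hr).transpose(1, 2)).mean(dim=-1)
        f = self.fuse(torch.cat([f_raw, f_freq], dim=-1))           # F_fused
        probs = self.cls_head(f).softmax(dim=-1)                    # interval probabilities
        return (probs * self.reg_head(f)).sum(dim=-1)               # y_hat = sum_c p_c Reg_c
</preformat>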
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Loss Function Design</title>
<p>We adopt a multi-objective loss function to jointly optimize prediction accuracy, classification sensitivity, and distributional stability [17]:
ℒ_total = λ_reg ⋅ ℒ_reg + λ_cls ⋅ ℒ_cls + λ_dist ⋅ ℒ_dist
where λ_reg = 1.0, λ_cls = 1.0, and λ_dist = 0.5.</p>
<p>The distributional term is defined as:
ℒ_dist = (μ̂ − μ)² + (σ̂ − σ)²</p>
        <p>Here, ŷ_final denotes the predicted heart rate, y is the ground truth, p_{t,c} is the probability assigned to the true interval class (used by ℒ_cls), and μ, σ are the empirical mean and standard deviation, with μ̂, σ̂ their predicted counterparts; ℒ_reg penalizes the error between ŷ_final and y. This formulation improves both accuracy and robustness under uncertainty [17].</p>
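        <p>A sketch of the total loss is shown below; the exact forms of ℒ_reg (taken here as L1) and ℒ_cls (taken as cross-entropy on the true interval class) are assumptions consistent with the notation above:</p>
        <preformat>
import torch
import torch.nn.functional as F

def total_loss(y_hat, y, logits, cls_target, weights=(1.0, 1.0, 0.5)):
    """Multi-objective loss: regression + interval classification + distribution."""
    l_reg = F.l1_loss(y_hat, y)                     # L_reg (L1 form assumed)
    l_cls = F.cross_entropy(logits, cls_target)     # L_cls on the true interval
    l_dist = (y_hat.mean() - y.mean()) ** 2 + (y_hat.std() - y.std()) ** 2
    return weights[0] * l_reg + weights[1] * l_cls + weights[2] * l_dist
</preformat>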
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Design and Result Analysis</title>
<p>We evaluate the proposed FSMamba system through comprehensive comparative and ablation experiments on the VIPL-HR dataset [12], with detailed analysis to validate the effectiveness and contribution of each module.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
<p>VIPL-HR contains 2,378 RGB and 752 NIR facial video sequences from 107 subjects under varied conditions (rest, motion, illumination), with synchronized PPG, heart rate, and SpO₂ annotations [12].</p>
        <p>Oulu Bio Face Database (OBF) comprises videos from 100 healthy volunteers and 6 AF patients
(two 5-minute sessions per subject in RGB + NIR), along with simultaneous ECG, PPG, and respiration
signals, for heart rate, respiratory rate, and AF detection benchmarking [18].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Settings and Evaluation</title>
<p>We adopt the Root Mean Square Error (RMSE) as the primary evaluation metric to assess model robustness under outlier predictions, defined as:</p>
        <p>RMSE = √((1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²)
where y_i and ŷ_i are the ground-truth and predicted heart rates of the i-th sample, and N is the number of test samples.</p>
        <p>All experiments are conducted on VIPL-HR using the subject-independent protocol [12], with 684/68/198 samples for training, validation, and testing. The model is trained for 15 epochs with the Adam optimizer (lr = 1 × 10⁻⁴, batch size = 2) on 180-frame segments.</p>
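        <p>Equivalently, in PyTorch:</p>
        <preformat>
import torch

def rmse(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    # Root mean square error over the N test samples (in BPM)
    return torch.sqrt(torch.mean((y - y_hat) ** 2))
</preformat>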
<p>The architecture includes four modules: a feature extractor, a cross-modal fusion block (4-head attention, 64 dims/head), a 6-layer FSMamba encoder (d_model = 256, d_state = 128, 30 Hz sampling), and a heart rate decoder (16 SincConv layers, kernel size = 33, band = 0.7–2.5 Hz, regression head dim = 128) [16, 14]. Inputs are resized to 128 × 128, and all features are 256-dimensional. This setup ensures a good trade-off between temporal modeling and frequency selectivity for accurate rPPG estimation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
<p>As shown in Table 1, the proposed FSMamba system achieves consistent performance across diverse VIPL-HR scenarios. In typical settings (e.g., sitting, talking, bright lighting), RMSE stays within 11–13 BPM. Even under low light, long distance, or mobile capture (v4, v6, v8/v9), it remains below 15 BPM, demonstrating robustness to noise and input variation. The highest RMSE (24.17 BPM) appears in the post-exercise recovery scenario (v7), suggesting that modeling rapid physiological changes needs further work. The solid performance in mobile cases further indicates strong potential for real-world deployment.</p>
<p>Ablation results (Table 2) demonstrate the importance of each loss component and system module. Removing ℒ_reg, cross-modal attention, or the NIR modality notably degrades RMSE, confirming their complementary roles. Despite limited NIR data, its inclusion enhances low-light robustness. Overall, FSMamba's design effectively balances accuracy, generalization, and real-world applicability through frequency-aware modeling and multimodal fusion.</p>
<p>As shown in Table 3, our proposed system (Team: xiuxejia, Hefei University of Technology) ranked 6th on the official RE-PSS leaderboard, evaluated on the VIPL-HR and OBF test sets, with an RMSE of 16.25 BPM.</p>
      <p>Due to local resource constraints, we submitted a lightweight version without pre-trained weights or extensive hyperparameter tuning. Even so, the system performed well, demonstrating its robustness and potential for deployment.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>We propose FSMamba, a frequency-selective multimodal framework for heart rate estimation based on the Mamba architecture. It combines RGB-NIR fusion, frequency-aware encoding, and a multi-branch loss to enhance robustness. Experiments on VIPL-HR show strong generalization, with future work focusing on adaptability and deployment efficiency.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) for translation between
Chinese and English of author-written text and grammar/spelling/style polishing of author-written paragraphs.
No Generative AI tools were used to generate ideas, design the study, conduct analyses/experiments,
produce results, or create figures/tables. After using this tool, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] Ming-Zher Poh, Daniel J. McDuff, Rosalind W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (2010) 10762–10774. URL: https://opg.optica.org/oe/abstract.cfm?URI=oe-18-10-10762. doi:10.1364/OE.18.010762.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, William Freeman, Eulerian video magnification for revealing subtle changes in the world, ACM Trans. Graph. 31 (2012). URL: https://doi.org/10.1145/2185520.2185561. doi:10.1145/2185520.2185561.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] Guha Balakrishnan, Frédo Durand, John V. Guttag, Detecting pulse from head motions in video, 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013) 3430–3437. URL: https://api.semanticscholar.org/CorpusID:17407827.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] Gerard de Haan, Vincent Jeanne, Robust pulse rate from chrominance-based rPPG, IEEE Transactions on Biomedical Engineering 60 (2013) 2878–2886. doi:10.1109/TBME.2013.2266196.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] Wenjin Wang, Albertus C. den Brinker, Sander Stuijk, Gerard de Haan, Algorithmic principles of remote PPG, IEEE Transactions on Biomedical Engineering 64 (2017) 1479–1491. doi:10.1109/TBME.2016.2609282.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] Xiaobai Li, Jie Chen, Guoying Zhao, Matti Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[7] Weixuan Chen, Daniel McDuff, DeepPhys: Video-based physiological measurement using convolutional attention networks, in: Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, Yair Weiss (Eds.), Computer Vision – ECCV 2018, Springer International Publishing, Cham, 2018, pp. 356–373.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>[8] Zitong Yu, Xiaobai Li, Guoying Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, 2019. URL: https://arxiv.org/abs/1905.02419. arXiv:1905.02419.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>[9] Xin Liu, Josh Fromm, Shwetak Patel, Daniel McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, in: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 19400–19411. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/e1228be46de6a0234ac22ded31417bc7-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>[10] Zitong Yu, Yuming Shen, Jingang Shi, Hengshuang Zhao, Philip H. S. Torr, Guoying Zhao, PhysFormer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4186–4196.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, Attention is all you need, in: I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>[12] Xuesong Niu, Shiguang Shan, Hu Han, Xilin Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Trans. Image Process. 29 (2020) 2409–2423. URL: https://doi.org/10.1109/TIP.2019.2947204. doi:10.1109/TIP.2019.2947204.</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>[13] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov, Multimodal transformer for unaligned multimodal language sequences, in: Anna Korhonen, David Traum, Lluís Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6558–6569. URL: https://aclanthology.org/P19-1656/. doi:10.18653/v1/P19-1656.</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>[14] Mirco Ravanelli, Yoshua Bengio, Speaker recognition from raw waveform with SincNet, in: 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 1021–1028. doi:10.1109/SLT.2018.8639585.</mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>[15] Albert Gu, Karan Goel, Christopher Ré, Efficiently modeling long sequences with structured state spaces, in: Proceedings of the International Conference on Learning Representations (ICLR), 2022. URL: https://openreview.net/forum?id=uYLFoz1vlAC.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Albert Gu, Tri Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: Proceedings of the International Conference on Learning Representations (ICLR), 2024. URL: https://openreview.net/forum?id=AL1fq05o7H.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Alex Kendall, Yarin Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/2650d6089a6d640c5e85b2b88265dc2b-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Xuesong Li, Kun Peng, Xiaobai Li, Hu Han, Shiguang Shan, Xilin Chen, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018), 2018, pp. 242–249. doi:10.1109/FG.2018.00043.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>