<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A two-stage remote blood pressure estimation method based on selective state space model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zizheng Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bochao Zou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huimin Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Science and Technology Beijing</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Blood pressure (BP) estimation based on remote photoplethysmography is an innovative and challenging task that enables non-contact BP monitoring from facial videos captured by ordinary cameras, facilitating long-term health monitoring. Most existing methods first extract blood volume pulse waves from facial videos and then estimate BP from the pulse wave characteristics, which means the quality of pulse wave extraction significantly constrains BP estimation. However, these methods generally overlook long-range context modeling, resulting in suboptimal precision and granularity in pulse wave extraction. This paper introduces our solution for the third Remote Physiological Signal Sensing (RePSS) challenge at IJCAI'24. Our approach is an end-to-end method comprising two stages: blood volume pulse estimation and blood pressure estimation. The selective state space model captures long-range dependencies while preserving linear complexity. We conducted intra-dataset experiments, cross-dataset experiments, and ablation studies for evaluation. The proposed method achieved an RMSE of 13.59 mmHg and ranked third in the challenge.</p>
      </abstract>
      <kwd-group>
<kwd>Blood Pressure Estimation</kwd>
        <kwd>Remote Photoplethysmography</kwd>
        <kwd>Video Understanding</kwd>
        <kwd>State Space Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Blood Volume Pulse (BVP) is a crucial physiological signal from which vital signs such as blood
pressure (BP), heart rate (HR), and heart rate variability (HRV) can be derived. The principle
behind extracting BVP using photoplethysmography (PPG) is based on the changes in light
absorption and scattering caused by variations in blood volume in the subcutaneous vessels
during cardiac cycles. This results in minute periodic changes in color signals, invisible to the
naked eye, which can be captured by imaging sensors. Traditionally, BVP is obtained using
contact-based sensors, which can lead to various inconveniences and limitations. Recently,
non-contact methods for extracting BVP, such as remote photoplethysmography (rPPG), have
gained increasing attention [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>
        Among these physiological signals, estimating blood pressure is particularly challenging,
with no current theoretical framework adequately explaining the underlying mechanism for BP
estimation from BVP wave characteristics. Most existing rPPG-based BP estimation methods
have evolved from PPG-based techniques and primarily include two types of approaches: the
first measures BP using pulse waves and their features obtained from facial videos [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]; the
second calculates BP using pulse transit time (PTT) derived from BVP waves extracted between
two body parts [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. These methods require the initial estimation of BVP waves from videos,
and the quality of pulse wave estimation significantly affects the accuracy of BP estimation.
However, the existing methods are predominantly based on traditional signal processing or
convolutional neural networks. These approaches overlook long-range context modeling,
resulting in suboptimal precision and granularity in BVP wave extraction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Recently, Mamba
has emerged with the selective state space model, striking a balance between maintaining linear
complexity and facilitating long-term dynamic modeling [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. This provides a novel
option for rPPG, which is suitable for practical deployment.
      </p>
      <p>In this paper, we further explore remote BP estimation methods and propose a two-stage BP
estimation approach based on the selective state space model, comprising BVP estimation and BP
estimation stages. By imposing constraints on BVP estimation across multiple temporal scales
in both temporal and frequency domains, we aim to refine the granularity of BVP estimation
and improve the accuracy of BP estimation. We evaluate the precision of BVP recovery by
assessing the accuracy of heart rate calculated from the BVP waves and conduct intra-dataset
experiments, cross-dataset experiments, and ablation studies to validate the proposed method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. The overall framework</title>
        <p>Our proposed method is an end-to-end framework that takes video as input to predict blood
pressure values as output. Directly predicting blood pressure from facial video may not yield
optimal results (see Section 3.5). Therefore, we divide the blood pressure estimation process
into two stages within the model: 1) estimating the corresponding BVP waves from the left and
right halves of the face, and 2) estimating the BP values from these two BVP waves.</p>
        <p>As depicted in Figure 1, the overall framework of the proposed method mainly consists
of three parts: Tokenization Stem, BVP Network, and BP Network. For the input, spatial
information from the left and right faces is initially extracted into token channels separately
via the Tokenization stem, forming two temporal token sequences. Specifically, given an RGB
video input X ∈ R^(3×T×W×H), the Tokenization Stem outputs X_1, X_2 ∈ R^(C×T/2), where
C, T, W, and H indicate channel, sequence length, width, and height, respectively.</p>
        <p>
          The two token sequences obtained from the Tokenization Stem are each processed by the
BVP Network. The BVP Network is based on RhythmMamba [
          <xref ref-type="bibr" rid="ref15">15</xref>
], employing multi-temporal
Mamba and frequency-domain feed-forward, constraining the state space model at multiple
temporal scales in both the temporal and frequency domains. The input token sequences are
partitioned into sub-sequences of different temporal lengths, which undergo extraction of
implicit information by selective state space models and are then summed according to their
temporal correspondence. Subsequently, multi-scale information interacts across channels in
the frequency domain. The dimensions of the feed-forward output and the Mamba output are
identical to those of the stem output X ∈ R^(C×T/2). Finally, the rPPG features are upsampled
and projected into BVP waves through the BVP predictor.
        </p>
        <p>The BP Network estimates blood pressure based on the two BVP waves obtained from the left
and right faces by the BVP Network. The BVP waves predicted from the left and right faces are
first concatenated along the channel dimension and then processed through three convolution
layers. The first two convolution layers are followed by batch normalization and ELU activation
layers. Finally, the BP values are obtained through average pooling.</p>
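        <p>The BP Network just described maps the two BVP waves to SBP and DBP. A minimal PyTorch sketch, assuming illustrative channel widths and kernel sizes (the paper does not give the exact configuration):

```python
import torch
import torch.nn as nn

class BPNetwork(nn.Module):
    """Sketch of the BP head: concatenate the two BVP waves channel-wise,
    apply three 1-D convolutions (the first two followed by batch
    normalization and ELU), then average over time to get [SBP, DBP].
    The hidden width and kernel size are assumptions for illustration."""

    def __init__(self, hidden=32, kernel=5):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv1d(2, hidden, kernel, padding=pad)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=pad)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.conv3 = nn.Conv1d(hidden, 2, kernel, padding=pad)
        self.act = nn.ELU()

    def forward(self, bvp_left, bvp_right):
        x = torch.stack([bvp_left, bvp_right], dim=1)  # (B, 2, T)
        x = self.act(self.bn1(self.conv1(x)))
        x = self.act(self.bn2(self.conv2(x)))
        x = self.conv3(x)                              # (B, 2, T)
        return x.mean(dim=-1)                          # (B, 2): SBP, DBP
```

A batch of two 160-sample BVP waves per face thus yields a (batch, 2) tensor of pressure values.</p>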
        <p>Tokenization Stem. The Tokenization Stem comprises three components: diff-fusion,
self-attention, and frame average pooling. The diff-fusion module incorporates inter-frame
differences into the raw frames, allowing the raw frames to be aware of BVP wave variations,
thereby enhancing the rPPG representation. Specifically, given the input video F, temporal shift
is initially applied to obtain F_(t−2), F_(t−1), F_t, F_(t+1) and F_(t+2). Subsequently, frame
differences between consecutive frames are computed in reverse chronological order, yielding
D_(−2), D_(−1), D_1 and D_2. The frame differences and raw frames are processed through Stem1
to reduce resolution and perform primary feature extraction.</p>
        <p>D_s = Stem1(Concat(D_(−2), D_(−1), D_1, D_2)),
F_s = Stem1(F).    (1)</p>
        <p>Next, the raw frames and frame differences are combined and processed through Stem2 to
enhance feature representation:</p>
        <p>X_stem = α · Stem2(F_s) + β · Stem2(α · F_s + β · D_s).    (2)</p>
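        <p>The diff-fusion computation above can be sketched as follows; the convolutional stems Stem1/Stem2 are replaced by identity placeholders, a circular temporal shift stands in for boundary padding, and the fusion weights alpha and beta are illustrative:

```python
import numpy as np

def diff_fusion(frames, alpha=0.5, beta=0.5):
    """Sketch of diff-fusion: frame differences are built from temporally
    shifted copies of the clip, passed through stand-in stems, and fused
    with the raw frames. frames: (T, H, W, 3). The real Stem1/Stem2 are
    strided convolutions; identity placeholders keep the sketch minimal."""
    stem1 = lambda x: x
    stem2 = lambda x: x

    # temporal shifts F_{t-2} ... F_{t+2} (circular shift as a simple
    # stand-in for temporal padding at the clip boundaries)
    fm2, fm1, f0, fp1, fp2 = (np.roll(frames, s, axis=0) for s in (2, 1, 0, -1, -2))
    # consecutive frame differences D_-2, D_-1, D_1, D_2
    diffs = np.stack([fm1 - fm2, f0 - fm1, fp1 - f0, fp2 - fp1])

    d_s = stem1(diffs).mean(axis=0)  # the real Stem1 convolves the concatenation
    f_s = stem1(frames)
    # fuse raw frames and differences with weights alpha, beta
    return alpha * stem2(f_s) + beta * stem2(alpha * f_s + beta * d_s)
```

With alpha = 1 and beta = 0 the module degenerates to passing the raw frames through, which is a useful sanity check.</p>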
        <p>Subsequently, the self-attention module focuses on regions rich in rPPG information, aiming
to enhance the weight of rPPG information in the subsequent frame-level average pooling. The
self-attention mask can be represented as:
Attn = ( (W/8)(H/8) · σ(Conv3(X_stem)) ) / ( 2 · ‖σ(Conv3(X_stem))‖_1 ),    (3)
where σ denotes the sigmoid function.</p>
        <p>After self-attention, the left and right halves of the face are split and the rPPG information
from each half is extracted through subsequent networks. Frame-level average pooling is applied
to tokenize the single-frame images of the left and right halves of the face.</p>
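        <p>The mask normalization above can be sketched as follows, with the convolutional projection omitted and the input assumed to be a single-channel feature map at 1/8 resolution:

```python
import numpy as np

def attention_mask(feat):
    """Sketch of the self-attention mask: a sigmoid over a conv projection
    (the projection itself is omitted here), L1-normalized and rescaled by
    the spatial size of the 1/8-resolution feature map. feat: (H/8, W/8)."""
    h, w = feat.shape
    sig = 1.0 / (1.0 + np.exp(-feat))      # sigmoid of the projected features
    return (h * w) * sig / (2.0 * np.abs(sig).sum())
```

This normalization fixes the mask's mean at exactly 0.5, so it redistributes weight toward rPPG-rich regions rather than rescaling the overall feature magnitude.</p>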
        <p>Multi-temporal Mamba. Multi-temporal Mamba leverages multiple temporal scales,
enabling a selective state space model to simultaneously learn the short-term trends and periodic
patterns of BVP waves. Specifically, the input token sequence X is projected and then
processed through four separate paths. For the i-th path, the sequence is divided into 2^(i−1)
sub-sequences, each of which undergoes sequential processing through a convolution layer,
activation layer, and selective state space model. Subsequently, they are recombined into a
sequence of the original length, forming the output of the i-th path, denoted as x_i^path. The
output before projection can be represented as follows:</p>
        <p>X_out = Σ_{i=1}^{4} x_i^path × SiLU(Linear(X)).    (4)</p>
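        <p>The multi-temporal partitioning above can be sketched as follows; `ssm` is an identity placeholder for the per-chunk convolution, activation, and selective state space model, and the SiLU gating of the summed paths is omitted:

```python
import numpy as np

def multi_temporal(x, n_paths=4, ssm=lambda s: s):
    """Sketch of multi-temporal processing: for path i (1-based), split the
    length-T sequence into 2**(i-1) sub-sequences, process each (placeholder
    identity here), recombine to the original length, and sum the paths by
    temporal position. x: (T, C)."""
    out = np.zeros_like(x)
    for i in range(1, n_paths + 1):
        chunks = np.array_split(x, 2 ** (i - 1), axis=0)   # split along time
        out += np.concatenate([ssm(c) for c in chunks], axis=0)
    return out
```

With the identity placeholder each path reproduces the input, so the sum over four paths is exactly 4x, which makes the chunk/recombine bookkeeping easy to verify.</p>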
        <p>Frequency domain feed-forward. The frequency domain feed-forward facilitates channel
interaction in the frequency domain. It consists of three components: domain conversion,
channel interaction, and domain inversion. There is a linear layer before and after each
component for projection. Frequency domain conversion and inversion are carried out using FFT
and iFFT, respectively. After frequency domain conversion, the signal becomes complex, so
channel interaction involves complex arithmetic operations. Specifically, for complex input
X ∈ C^(T/2×C), given a complex weight matrix W ∈ C^(C×C) and a complex bias B ∈ C^C, according
to the rules of complex multiplication, it is expressed as follows:
X'_r = X_r W_r − X_i W_i + B_r,
X'_i = X_r W_i + X_i W_r + B_i,
X' = X'_r + j · X'_i.    (5)</p>
        <p>The subscripts r and i denote the real and imaginary components of the corresponding
complex data, respectively.</p>
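        <p>The complex channel interaction above follows directly from the real/imaginary multiplication rule and can be checked against a native complex matrix product:

```python
import numpy as np

def complex_linear(x, w, b):
    """Frequency-domain channel interaction written out via real and
    imaginary parts. x: complex (F, C); w: complex (C, C); b: complex (C,)."""
    yr = x.real @ w.real - x.imag @ w.imag + b.real  # X'_r
    yi = x.real @ w.imag + x.imag @ w.real + b.imag  # X'_i
    return yr + 1j * yi                              # X' = X'_r + j X'_i
```

By construction the result equals `x @ w + b` computed in complex arithmetic; in the model the input would be the FFT of the token sequence and the output would be inverted with an iFFT.</p>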
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Training Procedure</title>
        <p>The training procedure is divided into two stages.</p>
        <p>BVP estimation stage. During the BVP estimation stage, the parameters of the BP Network
are not updated. The loss is calculated based on the distance between the output of the BVP
predictor and the ground truth BVP waveforms. This stage’s loss is a hybrid loss that constrains
the BVP waveforms in both temporal and frequency domains. The temporal-domain loss is
computed using the negative Pearson correlation coefficient between the predicted and ground
truth BVP waveforms:
L_time = 1 − ( Σ_{i=1}^{T} (x_i − x̄)(y_i − ȳ) ) / ( sqrt(Σ_{i=1}^{T} (x_i − x̄)^2) · sqrt(Σ_{i=1}^{T} (y_i − ȳ)^2) ),    (6)
where x and y denote the predicted and ground truth BVP waveforms and x̄, ȳ their means.</p>
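        <p>The temporal-domain loss above can be sketched as:

```python
import numpy as np

def neg_pearson_loss(pred, gt):
    """One minus the Pearson correlation between the predicted and
    ground-truth BVP waveforms: 0 for perfect correlation, 2 for
    perfect anti-correlation."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    return 1.0 - (p * g).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((g ** 2).sum()))
```

Because the loss is scale- and offset-invariant, it constrains only the waveform's shape, not its amplitude.</p>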
        <p>The frequency domain loss is calculated using the cross-entropy between the heart rates
derived from the predicted BVP waves and the ground truth BVP waves’ frequency domain.
The frequency loss L_freq is then expressed by:</p>
        <p>L_freq = CE(maxIndex(PSD(Y_gt)), PSD(Y_pred)),    (7)</p>
        <p>where CE denotes cross-entropy, maxIndex represents the index of the maximum value, and
PSD stands for power spectral density analysis, with Y_pred and Y_gt the predicted and ground
truth BVP waves. The first-stage loss is then expressed by:
L_stage1 = α · L_time + β · L_freq.    (8)</p>
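        <p>A sketch of the frequency-domain loss above, under the common reading that the predicted PSD is treated as logits over frequency bins and the target is the bin holding the ground-truth PSD peak:

```python
import numpy as np

def freq_ce_loss(psd_pred, psd_gt):
    """Cross-entropy between the predicted PSD (interpreted as logits
    over frequency bins) and the index of the ground-truth PSD peak,
    i.e. the true heart-rate bin."""
    target = int(np.argmax(psd_gt))          # maxIndex(PSD(Y_gt))
    logits = psd_pred - psd_pred.max()       # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]
```

The loss is small when the predicted spectrum concentrates its energy on the ground-truth heart-rate bin and grows as the predicted peak drifts to other bins.</p>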
        <p>The predicted BVP waveform requires post-processing. In the post-processing phase, a
second-order Butterworth filter (with cutoff frequencies of 0.75 and 2.5 Hz) was applied to the
predicted BVP waveform. Subsequently, the Welch algorithm was used to compute the power
spectral density for further heart rate estimation.</p>
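        <p>The post-processing steps above map directly onto SciPy; the Welch segment length is an assumption:

```python
import numpy as np
from scipy import signal

def estimate_hr(bvp, fs=30.0):
    """Post-processing of a predicted BVP wave: second-order Butterworth
    band-pass at 0.75-2.5 Hz (i.e. 45-150 bpm), Welch power spectral
    density, and conversion of the peak frequency to beats per minute."""
    b, a = signal.butter(2, [0.75, 2.5], btype="bandpass", fs=fs)
    filtered = signal.filtfilt(b, a, bvp)   # zero-phase filtering
    freqs, psd = signal.welch(filtered, fs=fs, nperseg=min(256, len(bvp)))
    return freqs[np.argmax(psd)] * 60.0
```

Note that the 0.75-2.5 Hz pass-band encodes the physiological prior that resting heart rate lies between 45 and 150 bpm.</p>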
        <p>BP estimation stage. During the BP estimation stage, the loss is calculated based on the
distance between the predicted diastolic and systolic blood pressure values and the ground
truth values. The specific loss function is defined as follows:
L_BP = 0.5 · (SBP_pred − SBP_gt)^2 + 0.5 · (DBP_pred − DBP_gt)^2.    (9)</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Performance Metric</title>
        <p>
          The Vital Videos dataset [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] was used for training, and the challenge provided 200 videos from
100 subjects in the OBF dataset [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] for testing. The Vital Videos dataset was recorded at a
resolution of 1920 x 1200 pixels and 30 fps. Each subject was recorded in two 30-second video
clips. Participants were recorded under various indoor lighting conditions, including artificial
and natural light sources, in multiple indoor environments such as libraries and shopping
malls across ten locations in Western Europe, covering six skin tones and age groups from
5 to 95 years. The Vital Videos dataset includes three versions: VV-small, VV-Medium, and
VV-Large, of which only VV-small (a subset of 100 subjects exhibiting the greatest demographic
diversity) could be used. The OBF dataset contains 200 five-minute videos recorded at a
frame rate of 60 fps, documenting 100 healthy adults under consistent environmental and setting
conditions. Five commonly used metrics were employed for evaluation: MAE (Mean Absolute
Error), RMSE (Root Mean Square Error), MAPE (Mean Absolute Percentage Error), Pearson
correlation coefficient ρ, and SNR (Signal-to-Noise Ratio).
        </p>
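        <p>The first four metrics listed above can be computed as follows (SNR is omitted, since its definition depends on the ground-truth heart-rate band):

```python
import numpy as np

def bp_metrics(pred, gt):
    """MAE, RMSE, MAPE (in percent), and Pearson's rho over a set of
    predictions. pred, gt: 1-D arrays of the same length."""
    err = pred - gt
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = (np.abs(err) / np.abs(gt)).mean() * 100.0
    rho = np.corrcoef(pred, gt)[0, 1]
    return mae, rmse, mape, rho
```

For BP, the challenge metric concatenates the SBP and DBP sequences before computing these quantities, as noted in Section 3.4.</p>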
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation Details</title>
        <p>We utilized the rPPG toolbox [18], a PyTorch-based open-source toolbox, to implement our
proposed method. For video segments, facial regions were cropped and resized to 128×128 pixels
in the initial frame, remaining fixed in subsequent frames. During the first stage, we adhered to
the outlined protocol [19], incorporating random upsampling, downsampling, and horizontal
flipping for data augmentation. Additionally, we continued training based on a pre-trained
model from the PURE dataset. Parameters for both phases remained consistent: a learning rate
of 3e-4, batch size of 16, and 30 epochs for training. Network training was conducted on a single
NVIDIA RTX 3090.</p>
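        <p>The augmentations mentioned above can be sketched as follows; the resampling factor is fixed here for clarity (it would be randomized during training), and nearest-frame indexing stands in for proper video interpolation:

```python
import numpy as np

def augment_clip(frames, bvp, speed=1.5, flip=True):
    """Sketch of temporal up/down-sampling applied jointly to the frames
    and the BVP label, plus optional horizontal flipping.
    frames: (T, H, W, 3); bvp: (T,)."""
    t = frames.shape[0]
    src = np.arange(t) / speed                        # resampled time grid
    idx = np.clip(np.round(src).astype(int), 0, t - 1)
    frames_aug = frames[idx]                          # nearest-frame pick
    bvp_aug = np.interp(src, np.arange(t), bvp)       # linear for the label
    if flip:
        frames_aug = frames_aug[:, :, ::-1]           # mirror the width axis
    return frames_aug, bvp_aug
```

Resampling frames and labels with the same index grid is what keeps the heart-rate label consistent with the sped-up or slowed-down video.</p>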
      </sec>
      <sec id="sec-3-3">
        <title>3.3. HR Evaluation</title>
        <p>We used HR estimation results as an indicator for BVP estimation. The VV-small dataset was
sequentially partitioned into training, validation, and test sets in a 7:1:2 ratio. As shown in the
first two rows of Table 1, RhythmMamba significantly outperformed POS, which was commonly
employed in the past. Moreover, the BVP estimation demonstrated relatively favorable outcomes,
with 2.90 bpm MAE for heart rate. We conducted a simple cross-validation by exchanging the
validation and test sets, as indicated in the last two rows of Table 1, revealing a nearly tenfold
difference in results. This discrepancy may stem from the considerable distribution variation
within the VV-small dataset, which comprises only 200 videos.</p>
        <p>Intra-dataset experiments on VV-small. The BP estimation still adheres to the dataset
partitioning criteria established during HR estimation. As indicated in the first row of Table 2,
the results are less than satisfactory. This may be attributed to severe limitations imposed by
the quantity and distribution of the dataset. While the performance is acceptable for HR tasks
with lower precision requirements, it demonstrates unsatisfactory outcomes for BP tasks with
higher precision demands.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. BP Evaluation</title>
        <p>Blood pressure consists of systolic blood pressure (SBP) and diastolic blood pressure (DBP).
PTT is linearly correlated with SBP [21], while its correlation with DBP is relatively low. We
separately present the test results for SBP and DBP. As shown in the last two rows of Table
2, DBP demonstrates higher accuracy compared to SBP. However, this may also be due to the
larger fluctuation range of SBP compared to DBP. Additionally, it can be observed that the ρ of
BP estimation is significantly higher compared to SBP and DBP. This is because the metrics for
BP are calculated by concatenating the SBP and DBP sequences.</p>
        <p>Cross-dataset experiments on OBF. We conducted testing on the videos of 100 subjects
from the OBF dataset provided in the challenge, which can be regarded as a form of cross-dataset
testing. The training dataset was the VV-small dataset, partitioned sequentially into training and
validation sets in an 8:2 ratio. The RMSE of the test results was 13.59. The cross-dataset results
were nearly identical to those from testing within the VV-small dataset, yet neither set of results
was particularly satisfactory. This aligns with the analysis from the HR evaluation, where the
small volume and significant distribution differences in the dataset were identified as key issues.
The distribution gap between the training and test sets within VV-small is comparable to the
cross-dataset gap.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Ablation Study</title>
        <p>We conducted ablation studies using the same partitioning strategy as in the intra-dataset
evaluation.</p>
        <p>Impact of BVP stage. We attempted to estimate BP directly from facial videos by removing
the BVP stage and the BVP predictor in the network. As shown in the first and last rows of
Table 3, directly estimating BP from facial videos does not yield satisfactory results.</p>
        <p>Impact of BVP extraction algorithm. The BVP extraction algorithm was changed from POS
to RhythmMamba. The results are shown in the second and last rows of Table 3. Comparing
these results with the HR estimation results in Table 1, it appears that the quality of PPG
extraction only slightly affects BP outcomes.</p>
        <p>Impact of face separate estimation. Theoretically, extracting BVP separately from the
left and right halves of the face and merging them into two channels for input into the neural
network should provide not only BVP feature information but also PTT information. This
approach is expected to outperform extracting a single-channel BVP from the whole face. To
investigate this, we conducted an ablation study. As shown in the third and last rows of Table 3,
the performance improvement from separate extraction of the left and right facial BVPs was not
significant. Further visualization of the BVP waveforms extracted from the left and right sides
of the face revealed that the waveforms were extremely similar, providing almost no additional
PTT information.</p>
        <p>Impact of BP separate estimation. Since SBP is highly correlated with PTT and DBP is
less correlated with PTT [21], we attempted to estimate SBP and DBP separately using two BP
networks. The results, as shown in the fourth and last rows of Table 3, indicate that separate
estimation has little impact.</p>
        <p>Impact of face detection. The default face detection algorithm is the Haar Cascade. The
videos are split into 160-frame segments, with detection performed on the first frame of each
segment and that position maintained for subsequent frames. To investigate the impact of the
face detection algorithm, we used the more recent RetinaFace [22], splitting the video into
300-frame segments and performing face detection every 4 frames. The last two rows of Table 3
compare these settings; any difference may partly result from the longer video segments.</p>
        <p>[Table 3: ablation results for the Default and Customized settings; MAE 15.26 and 10.72.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In the challenge, we proposed a two-stage method for BP estimation based on the selective
state space model, which achieves an RMSE of 13.59 mmHg. However, the model's performance
is still not adequate for practical application. In future work, we hope to integrate BP priors
into the model design and explore the feasibility of pre-training with PPG datasets for better
performance on the rPPG dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the National Natural Science Foundation of China (62206015,
62227801, U20B2062), the Fundamental Research Funds for the Central Universities
(FRF-TP22-043A1), and the Young Scientist Program of The National New Energy Vehicle Technology
Innovation Center (Xiamen Branch).</p>
      <p>[18] X. Liu, G. Narayanswamy, A. Paruchuri, X. Zhang, J. Tang, Y. Zhang, S. Sengupta, S. Patel,
Y. Wang, D. McDuff, rPPG-toolbox: Deep remote PPG toolbox, in: Thirty-seventh
Conference on Neural Information Processing Systems Datasets and Benchmarks Track,
2023. URL: https://openreview.net/forum?id=q4XNX15kSe.
[19] Z. Yu, X. Li, X. Niu, J. Shi, G. Zhao, AutoHR: A strong end-to-end baseline for remote
heart rate measurement with neural searching, IEEE Signal Processing Letters 27 (2020)
1245–1249.
[20] W. Wang, A. C. Den Brinker, S. Stuijk, G. De Haan, Algorithmic principles of remote PPG,
IEEE Transactions on Biomedical Engineering 64 (2016) 1479–1491.
[21] Y. Zhang, M. Berthelot, B. Lo, Wireless wearable photoplethysmography sensors for
continuous blood pressure monitoring, in: 2016 IEEE Wireless Health (WH), IEEE, 2016,
pp. 1–8.
[22] J. Deng, J. Guo, E. Ververas, I. Kotsia, S. Zafeiriou, RetinaFace: Single-shot multi-level face
localisation in the wild, in: Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 2020, pp. 5203–5212.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>McDuff</surname>
          </string-name>
          ,
          <article-title>Camera measurement of physiological vital signs</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Challenges and prospects of visual contactless physiological monitoring in clinical study</article-title>
          ,
          <source>NPJ Digital Medicine</source>
          <volume>6</volume>
          (
          <year>2023</year>
          )
          <fpage>231</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Contrast-phys+: Unsupervised and weakly-supervised video-based remote physiological measurement via spatiotemporal contrast</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barszczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vempala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-P.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Smartphone-based blood pressure measurement using transdermal optical imaging technology</article-title>
          ,
          <source>Circulation: Cardiovascular Imaging</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <article-title>e008857</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A blood pressure prediction method based on imaging photoplethysmography in combination with machine learning</article-title>
          ,
          <source>Biomedical Signal Processing and Control</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>102328</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bousefsaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Desquins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Djeldjli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ouzar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pruski</surname>
          </string-name>
          ,
          <article-title>Estimation of blood pressure waveform from facial video using a deep u-shaped network and the wavelet representation of imaging photoplethysmographic signals</article-title>
          ,
          <source>Biomedical Signal Processing and Control</source>
          <volume>78</volume>
          (
          <year>2022</year>
          )
          <fpage>103895</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schrumpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Frenzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Osterhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <article-title>Assessment of non-invasive blood pressure prediction from ppg and rppg signals using deep learning</article-title>
          ,
          <source>Sensors</source>
          <volume>21</volume>
          (
          <year>2021</year>
          )
          <fpage>6022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I. C.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkelstein</surname>
          </string-name>
          ,
          <article-title>Introducing contactless blood pressure assessment using a high speed video camera</article-title>
          ,
          <source>Journal of Medical Systems</source>
          <volume>40</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tjahjadi</surname>
          </string-name>
          ,
          <article-title>Robust contactless pulse transit time estimation based on signal quality metric</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>137</volume>
          (
          <year>2020</year>
          )
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-S.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <article-title>Robust blood pressure measurement from facial videos in diverse environments</article-title>
          ,
          <source>Heliyon</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>RhythmFormer: Extracting rPPG signals based on hierarchical temporal periodic transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2402.12788</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <article-title>Mamba: Linear-time sequence modeling with selective state spaces</article-title>
          ,
          <source>arXiv preprint arXiv:2312.00752</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>VMamba: Visual state space model</article-title>
          ,
          <source>arXiv preprint arXiv:2401.10166</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>VideoMamba: State space model for efficient video understanding</article-title>
          ,
          <source>arXiv preprint arXiv:2403.06977</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>RhythmMamba: Fast remote physiological measurement with arbitrary length videos</article-title>
          ,
          <source>arXiv preprint arXiv:2404.06483</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Toye</surname>
          </string-name>
          ,
          <article-title>Vital Videos: A dataset of videos with PPG and blood pressure ground truths</article-title>
          ,
          <source>arXiv preprint arXiv:2306.11891</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Junttila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Majamaa-Voltti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tulppo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection</article-title>
          ,
          <source>2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>