<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Remote Physiological Signal Sensing (RePSS) Challenge &amp; Workshop</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huiyu Yang</string-name>
          <email>Huiyu.Yang@oulu.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunchi Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chenhang Ying</string-name>
          <email>chying@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youchen Luo</string-name>
          <email>youchen.luo@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieyi Ge</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antitza Dantcheva</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shiguang Shan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoying Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hu Han</string-name>
          <email>hanhu@ict.ac.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaobai Li</string-name>
          <email>xiaobai.li@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>STARS team, INRIA</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Machine Vision and Signal Analysis, University of Oulu</institution>
          ,
          <addr-line>Oulu</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Laboratory of Blockchain and Data Security, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Remote photoplethysmography (rPPG) is a non-contact technique for estimating physiological signals, such as heart rate, from subtle color changes in facial videos. While recent advances in rPPG have predominantly relied on RGB videos, these signals are highly susceptible to variations in environmental lighting and individual skin tones, which limits their robustness in real-world scenarios. In contrast, near-infrared (NIR) videos are less affected by illumination changes and darker skin tones. To improve the accuracy and reliability of rPPG measurements, the 4th RePSS Challenge promotes novel multimodal fusion strategies that combine RGB and NIR videos for heart rate (HR) prediction. It is held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI 2025). This paper provides an overview of the challenge, covering the track setting, the data and protocols, the proposed approaches, and the results and discussion. The top-performing solutions are analyzed to provide insights and guide future research directions in the field.</p>
      </abstract>
      <kwd-group>
        <kwd>rPPG</kwd>
        <kwd>remote physiological signal measurement</kwd>
        <kwd>multimodal fusion</kwd>
        <kwd>facial video</kwd>
        <kwd>heart rate</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>Early rPPG methods relied on hand-crafted signal processing techniques, such as ICA [10], CHROM [11] and POS [12],
which apply color space transformations and bandpass filtering to extract pulse signals from RGB
video streams. However, these traditional approaches often struggle under uncontrolled conditions
due to sensitivity to motion, lighting variations, and differences in skin tone. To overcome these
challenges, recent research has increasingly adopted machine learning and deep learning techniques.
Models based on 2D-CNN [13, 14, 15], 3D-CNN [16, 17, 18], and transformer [19, 20, 21] architectures
have been explored to extract visual cues for accurate rPPG estimation. Many data-driven models
aim to automatically learn complex spatial-temporal patterns to extract physiological features directly
from video data [22, 23, 24], demonstrating improved generalizability and robustness. Despite this
progress, significant challenges remain, particularly in improving accuracy and robustness across
varied environments and in extending the applicability of rPPG systems to diverse populations.</p>
      <p>To bring together multidisciplinary researchers and further advance rPPG technologies, the Remote
Physiological Signal Sensing (RePSS) Challenges have been held annually as a platform to promote
the measurement of remote physiological signals, such as heart rate (HR), heart rate variability (HRV),
and blood pressure (BP), from facial videos. Since 2020, RePSS challenges have been organized in
conjunction with top-tier conferences such as CVPR 2020 [25]<sup>1</sup>, ICCV 2021 [26]<sup>2</sup>, and IJCAI 2024
[27]<sup>3,4</sup>. Through workshops, challenges, and invited talks, the initiative aims to foster technological
innovation, resource sharing, and fair competition in the remote physiological sensing field.</p>
      <p>This year, the 4th RePSS Challenge and Workshop is held in conjunction with the International
Joint Conference on Artificial Intelligence (IJCAI 2025) in August 2025. The challenge is hosted on
Kaggle<sup>5</sup>. The theme centers on “RGB-NIR Fusion for Robust rPPG Measurement”, which invites
participants to develop innovative data fusion techniques that integrate RGB and Near-Infrared (NIR)
facial videos to enhance the accuracy and robustness of rPPG estimation. This focus is motivated
by the limitations of conventional RGB-based rPPG methods, which are often sensitive to motion
artifacts, lighting changes, and skin tone variations. In contrast, NIR imaging is more robust under such
challenging conditions while still retaining meaningful physiological information, which could benefit
RGB-based rPPG methods. However, despite the promising potential, multimodal fusion of RGB-NIR
videos for rPPG measurement has been rarely explored in existing literature [28, 29]. This gap highlights
an important but under-investigated research direction, particularly for real-world applications where
single-modality solutions often fall short. By incorporating NIR data, this challenge aims to stimulate
the development of advanced multimodal fusion strategies that can overcome current limitations and
advance the field of remote physiological monitoring toward greater reliability and broader applicability.</p>
      <p>The paper is structured as follows: Section 2 presents an overview of the 4th RePSS Challenge,
including details on the track setting, dataset and protocols, and evaluation metrics. Section 3 outlines
the top-ranked solutions submitted by participating teams. Section 4 discusses the results and provides
analysis and insights. Finally, Section 5 concludes the paper and highlights potential future research
directions in this field.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenge Overview</title>
      <sec id="sec-2-1">
        <title>2.1. Track setting</title>
        <p>The 4th RePSS Challenge focuses on the multimodal fusion of RGB and NIR videos for robust heart
rate (HR) measurement. Participants are provided with a portion of the VIPL-HR dataset [22, 30] as the
training set, which consists of 10-second RGB-NIR video clips from 107 subjects. During the evaluation
phase, RGB-NIR video pairs from 100 subjects in the VIPL-HR-v2 dataset and another 100 subjects
from the OBF [31] dataset are each segmented into three 10-second clips to form the test set. This task
aims to highlight the potential of innovative RGB-NIR fusion strategies to improve the accuracy and
reliability of rPPG-based HR estimation, particularly in challenging real-world conditions.</p>
        <p>[Figure 1: (a) recording conditions of VIPL-HR, grouped by head movement, illumination, and acquisition devices; (b) an example RGB-NIR image pair from the test data.]</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data and protocol</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Training data</title>
          <p>To support the development of multimodal rPPG algorithms, the 4th RePSS Challenge utilizes part of
the VIPL-HR database as its training data. VIPL-HR is a large-scale, multimodal database designed for
non-contact heart rate estimation from facial videos captured under diverse and less-constrained
real-world conditions. The entire dataset comprises 3,130 face videos from 107 subjects, including 2,378 RGB
videos and 752 NIR videos. Each video is accompanied by synchronized physiological measurements
such as heart rate, blood oxygen saturation (SpO2), and blood volume pulse (BVP) signals.</p>
          <p>As shown in Fig. 1(a), VIPL-HR simulates multiple real-world rPPG challenges by recording with
different devices (smartphones, RGB-D cameras, and webcams) under varying illumination
conditions (only the ceiling lamp on, both the ceiling and filament lamps on, and both lamps off)
and substantial head movements (large rotations, talking). As a publicly available dataset with
multiple modalities and rich condition diversity, VIPL-HR facilitates the development of practical remote HR
detection approaches that generalize well to real-world challenges.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Test data</title>
          <p>Data samples from the VIPL-HR-v2 and OBF [31] datasets are used as the test set, which includes paired
RGB-NIR videos of 100 subjects from the reserved portion of the VIPL-HR-v2 dataset (different from
the training set) and another 100 subjects from the OBF dataset. An example pair of sample images is
shown in Fig. 1(b). For each RGB-NIR pair, three 10-second video clips are randomly selected, and the
corresponding ground-truth heart rate values are withheld for the final evaluation.</p>
          <p>It is worth noting that the test data include challenging scenarios: the VIPL-HR-v2 dataset contains
head motions and varying lighting conditions, while the OBF dataset includes subjects with diverse
skin tones. Due to protocol restrictions, the OBF videos have been anonymized by applying mosaics to
sensitive facial regions to safeguard individual privacy. To compensate for this anonymization, facial
landmarks of the OBF videos generated by OpenFace are provided to the participants. These landmarks
enable precise localization and analysis of relevant facial regions, ensuring effective model training and
evaluation while adhering to privacy protection requirements.</p>
          <p><sup>1</sup> https://competitions.codalab.org/competitions/22287
<sup>2</sup> https://competitions.codalab.org/competitions/30855
<sup>3</sup> https://www.kaggle.com/competitions/the-3rd-repss-t1/data
<sup>4</sup> https://www.kaggle.com/competitions/the-3rd-repss-t2/data
<sup>5</sup> https://www.kaggle.com/competitions/the-4th-repss-t1/data</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluation metric</title>
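        <p>The root mean squared error (RMSE) is used as the evaluation metric. The RMSE between the
ground-truth heart rates <inline-formula><tex-math>\mathrm{HR}_i</tex-math></inline-formula> and the submitted heart rate predictions <inline-formula><tex-math>\mathrm{HR}'_i</tex-math></inline-formula> is calculated as follows, where <inline-formula><tex-math>N</tex-math></inline-formula>
represents the number of test samples:
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{HR}_i - \mathrm{HR}'_i\right)^2}</tex-math>
          </disp-formula>
        </p>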
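        <p>As a minimal sketch of how the RMSE metric described above could be computed for a submission, the following Python snippet may help. The function name <monospace>rmse</monospace> and the sample values are hypothetical and serve only as an illustration; they are not taken from the challenge data or its scoring code.</p>
        <preformat>
```python
import math

def rmse(ground_truth, predictions):
    """Root mean squared error between two equal-length HR sequences (bpm)."""
    if len(ground_truth) != len(predictions):
        raise ValueError("inputs must have the same length")
    if not ground_truth:
        raise ValueError("inputs must be non-empty")
    squared = [(g - p) ** 2 for g, p in zip(ground_truth, predictions)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical heart rates in bpm, for illustration only
gt = [70.0, 80.0, 90.0]
pred = [73.0, 76.0, 90.0]
score = rmse(gt, pred)  # sqrt((9 + 16 + 0) / 3)
```
        </preformat>
        <p>Because the errors are squared before averaging, a few large misses raise the score more than many small ones.</p>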
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed approaches</title>
      <p>To maintain consistency and fairness in the evaluation process, the final evaluation and ranking only
include registered teams. The leaderboard is shown in Table 1. To give a better understanding of the
top-ranked methods, brief methodological summaries were gathered from the top three teams and are
presented in the following subsections.</p>
      <sec id="sec-3-1">
        <title>3.1. Team ‘HFUT-VUT’ (Hefei University of Technology)</title>
        <p>The HFUT-VUT team introduces DiffRePSS, a diffusion-based framework for remote physiological signal
sensing. Their method aims to measure remote photoplethysmography (rPPG) signals for heart rate
(HR) estimation from facial videos, especially under real-world challenges including head movements
and varying illumination. The overall framework of their proposed method is illustrated in Fig. 2.</p>
        <p>Their approach is based on a conditional denoising diffusion model (DDIM), which progressively
reconstructs clean rPPG signals from Gaussian noise in a step-wise reverse process. Instead of direct
signal regression, the model learns to denoise latent representations iteratively, guided by visual
physiological cues from the input video. To extract these cues, the team proposes a multi-scale
spatial-temporal map (MSTmap) that captures chrominance dynamics across facial regions, together with a
temporal difference representation to emphasize fine-grained pulsatile variations. These features are
combined and fed into an alternating spatial-temporal Transformer denoiser, which models spatial
and temporal dependencies effectively through the attention-based module. Additionally, to address
the common issue of heart rate distribution imbalance in physiological datasets, a data augmentation
strategy is introduced. It involves upsampling or downsampling MSTmaps along the temporal axis to
simulate higher or lower HRs, which effectively expands the HR distribution during training.</p>
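        <p>The temporal resampling augmentation described above can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the array layout (ROIs, frames, channels), the use of linear interpolation, the function names <monospace>resample_mstmap</monospace> and <monospace>augment</monospace>, and the rate value are all hypothetical.</p>
        <preformat>
```python
import numpy as np

def resample_mstmap(mstmap, rate):
    """Resample a map of shape (rois, frames, channels) along the time axis.

    A rate above 1 compresses the content in time, so the pulse appears
    faster at the original frame rate; a rate below 1 stretches it.
    """
    rois, frames, channels = mstmap.shape
    new_frames = max(2, int(round(frames / rate)))
    old_t = np.linspace(0.0, 1.0, frames)
    new_t = np.linspace(0.0, 1.0, new_frames)
    out = np.empty((rois, new_frames, channels), dtype=float)
    for r in range(rois):
        for c in range(channels):
            # Linear interpolation is an assumption made for this sketch
            out[r, :, c] = np.interp(new_t, old_t, mstmap[r, :, c])
    return out

def augment(mstmap, hr_bpm, rate):
    """Return the resampled map and the HR label scaled by the same rate."""
    return resample_mstmap(mstmap, rate), hr_bpm * rate

# Hypothetical input: 4 ROIs, 300 frames, 3 channels, labelled 72 bpm
rng = np.random.default_rng(0)
demo = rng.random((4, 300, 3))
aug_map, aug_hr = augment(demo, hr_bpm=72.0, rate=1.25)
```
        </preformat>
        <p>In this sketch a rate of 1.25 shortens the 300-frame map to 240 frames and raises the label from 72 to 90 bpm; in practice the resampled map would then be cropped or padded to the model's input length.</p>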
        <p>In the evaluation phase, DiffRePSS demonstrates strong performance on both the VIPL-HR-v2 and
OBF test sets, which include diverse subjects and challenging scenarios. The method achieves the best
RMSE and ranks 1st in the competition, highlighting the potential of diffusion-based modeling for
robust and accurate rPPG signal estimation.</p>
        <p>[Figure 3: Framework of the IST method. Recoverable block labels: Conv3D, BN, ReLU and 3D max-pooling (× N); Concat; Patch Embed; Layer Norm; Temporal Difference Transformer (× M) with temporal difference multi-head self-attention and spatio-temporal feed-forward; Predictor; KL Divergence Loss; Cross-Entropy Loss; PSD; Negative Pearson Correlation Loss; Reference PPG.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Team ‘IST’ (Nanjing University)</title>
        <p>The IST team develops a robust multimodal framework for HR estimation based on the temporal
difference transformer (TDT) [19], which includes spatio-temporal encoders for feature extraction, a
TDT module with a temporal difference multi-head self-attention (TD-MHSA) mechanism for multimodal
fusion, and a final prediction module. The framework of the proposed method is shown in Fig. 3.</p>
        <p>The method starts by processing the RGB and NIR video sequences separately through two
spatio-temporal encoders. Each encoder consists of multiple residual 3D convolutional blocks, each of which contains
a Conv3D layer, a batch normalization layer, a ReLU activation layer, and a 3D max-pooling layer. These
blocks capture both spatial and temporal patterns from the video. After encoding, the features from the two
modalities are aligned in shape but maintain modality-specific semantics.</p>
        <p>Subsequently, the TDT module is employed to enhance the quasi-periodic nature of rPPG signals
and to align and fuse the RGB and NIR representations adaptively. The TDT integrates a temporal
difference multi-head self-attention (TD-MHSA) mechanism, which enhances standard self-attention
with a temporal difference convolution (TDC) [19]. Local temporal changes are explicitly encoded by
TDC, making the model more sensitive to periodic variations. For each attention head, query, key,
and value matrices are derived from normalized input using TDC (for queries and keys) and standard
convolution (for values). The outputs of all attention heads are aggregated and passed through a
learnable projection layer, then reshaped back into spatio-temporal form. To further enhance local
modeling, the transformer replaces traditional feed-forward layers with a spatio-temporal feed-forward
(ST-FF) module. This module uses pointwise and depthwise 3D convolutions to capture fine-grained
spatial and temporal patterns, improving representation ability for downstream tasks.</p>
        <p>After fusion, a two-stage temporal upsampling module is introduced to increase the temporal
resolution while reducing feature dimensionality. Then, a spatial average pooling operation reduces
spatial dimensions, and a final regression head outputs a predicted rPPG signal. The HR is obtained by
applying Fast Fourier Transform (FFT) to this signal and identifying the dominant frequency peak in
the physiological range.</p>
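        <p>The final step, reading the HR off the dominant spectral peak, can be sketched as below. This is a minimal illustration rather than the team's code: the band limits of 0.7 to 3.0 Hz, the Hann window, the zero-padding factor, the 30 fps frame rate, and the function name <monospace>estimate_hr_fft</monospace> are assumptions chosen for the example.</p>
        <preformat>
```python
import numpy as np

def estimate_hr_fft(signal, fps, low_hz=0.7, high_hz=3.0):
    """Estimate heart rate (bpm) as the dominant spectral peak in a band."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the DC component
    windowed = x * np.hanning(len(x))     # reduce spectral leakage
    n_fft = 8 * len(x)                    # zero-padding for a finer grid
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fps)
    # Keep only frequencies inside the assumed physiological band
    band = np.logical_and(freqs >= low_hz, high_hz >= freqs)
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz

# Synthetic 10-second signal at an assumed 30 fps with a 1.25 Hz pulse
fps = 30.0
t = np.arange(300) / fps
synthetic = np.sin(2 * np.pi * 1.25 * t)
hr = estimate_hr_fft(synthetic, fps)
```
        </preformat>
        <p>Restricting the search to a physiological band keeps low-frequency drift and high-frequency noise from being mistaken for the pulse.</p>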
        <p>To train the model effectively, a composite loss function combining three objectives is utilized:
(1) a negative Pearson correlation loss aligning the predicted rPPG waveform with the ground
truth [17], (2) a KL divergence loss supervising frequency-domain distributions [32], and (3) a
cross-entropy loss enhancing frequency-bin classification [33]. This composite loss encourages the model to learn
accurate and physiologically consistent predictions.</p>
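        <p>To make the first loss term concrete, one common form of the negative Pearson correlation loss is 1 minus the correlation coefficient between the predicted and reference waveforms. The NumPy sketch below assumes that form; the function name and the synthetic signals are hypothetical, and the weighting between the three loss terms is not specified in the source and is not shown.</p>
        <preformat>
```python
import numpy as np

def negative_pearson_loss(pred, target):
    """Return 1 - r, where r is the Pearson correlation of two waveforms."""
    p = np.asarray(pred, dtype=float)
    g = np.asarray(target, dtype=float)
    p = p - p.mean()
    g = g - g.mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(g ** 2))
    if denom == 0.0:
        raise ValueError("constant input has undefined correlation")
    return 1.0 - float(np.sum(p * g) / denom)

# Synthetic reference waveform, for illustration only
t = np.linspace(0.0, 10.0, 300)
reference = np.sin(2 * np.pi * 1.2 * t)
loss_same = negative_pearson_loss(3.0 * reference + 5.0, reference)
loss_flipped = negative_pearson_loss(-reference, reference)
```
        </preformat>
        <p>The loss is invariant to the amplitude and offset of the prediction, so it rewards matching waveform shape and phase: it approaches 0 for a perfectly correlated prediction and 2 for an inverted one.</p>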
        <p>During evaluation, their proposed method achieves strong performance, ranking 2nd on the
leaderboard.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Team ‘xjgroupscu’ (Sichuan University)</title>
        <p>The algorithm proposed by the xjgroupscu team consists of four steps, as shown in Fig. 4: (1) ROI
segmentation and raw rPPG signal acquisition, (2) color space transformation, (3) signal decomposition
and motion noise removal, and (4) pulse reconstruction and heart rate refinement.</p>
        <p>At the beginning of the proposed pipeline, facial landmarks are detected using the MediaPipe
Face Mesh [34]. Based on these landmarks, regions of interest (ROIs) are defined, as illustrated in
Fig. 4. Spatial averaging is then applied to the pixel values within each ROI to generate temporal
signals. While RGB videos provide three-channel data, near-infrared (NIR) videos contain only a single
channel. Subsequently, the signals are transformed into multiple channel representations, including:
the combined RGB-NIR channel, CHROM-projected channel [11], POS-projected channel [12], the
green channel, and the NIR channel. Next, each channel signal is decomposed using Variational Mode
Decomposition (VMD) [35] to separate motion noise from pulse-related components. Motion-related
noise is estimated based on the positional changes of a key facial point located at the center of the
upper lip. The time delay between each decomposed signal component and the motion reference signal
is computed. If the delay is below a predefined threshold, the component is considered motion-induced
noise and is therefore discarded. Finally, Principal Component Analysis (PCA) [36] is applied to the
remaining (cleaned) signals to reconstruct the pulse waveform, and heart rate is estimated from this
signal using the Fast Fourier Transform (FFT).</p>
        <p>In addition, for each sample, three signal segments spaced 0.2 seconds apart are selected. A constraint
is imposed to ensure that the heart rate (HR) difference among these segments does not exceed 10
bpm, which enforces the temporal consistency of rPPG signals. Furthermore, a heart rate probability
constraint is applied to handle outliers: if the estimated HR falls outside the plausible physiological
range of 35.6 to 122.6 bpm, the signal window is shifted forward frame by frame. This process continues
until the updated HR estimate deviates by less than 5 bpm from the initial value. The final heart rate
estimation is then obtained from this refined signal window.</p>
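        <p>The two checks described above can be expressed compactly, as in the sketch below. The thresholds (10 bpm and the range of 35.6 to 122.6 bpm) come from the text, while the function names and sample values are hypothetical; the frame-by-frame window shifting is not shown because its exact stopping rule depends on details not given here.</p>
        <preformat>
```python
def segments_consistent(hr_values, max_spread_bpm=10.0):
    """True if the HR estimates of all segments lie within max_spread_bpm."""
    return max_spread_bpm >= max(hr_values) - min(hr_values)

def is_plausible(hr_bpm, low_bpm=35.6, high_bpm=122.6):
    """True if the HR estimate lies inside the plausible range."""
    return high_bpm >= hr_bpm >= low_bpm

# Hypothetical HR estimates (bpm) from three neighbouring segments
stable = segments_consistent([72.0, 75.5, 78.0])    # spread of 6 bpm
unstable = segments_consistent([60.0, 75.0, 72.0])  # spread of 15 bpm
outlier_ok = is_plausible(130.0)
```
        </preformat>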
        <p>In the evaluation phase, their proposed work achieves 3rd place on the final leaderboard,
demonstrating its accuracy and robustness under noisy conditions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Challenge results and discussion</title>
      <p>The final leaderboard of the 4th RePSS Challenge is shown in Table 1. In this section, we provide a
detailed analysis and discussion of the final results.</p>
      <p>Performance across two datasets. The Root Mean Square Error (RMSE) results for all teams
are presented in Fig. 5. The results are divided into three categories: overall performance on the
entire test set (blue bars), performance on the OBF test partition (orange bars), and performance on
the VIPL-HR-v2 test partition (grey bars). This division facilitates a detailed evaluation across diverse
datasets, enabling a comprehensive understanding of the robustness and cross-dataset generalizability
of the proposed methods.</p>
      <p>The proposed methods from the top three teams demonstrate relatively low RMSE values across all
three settings, including the overall test set, the OBF partition, and the VIPL-HR-v2 partition, indicating
their strong performance in video-based remote heart rate estimation. Interestingly, although all
models were trained on the VIPL-HR training set, most teams achieved better performance on the
OBF test partition than on the VIPL-HR-v2 test partition. This discrepancy may be caused by the
substantial gap between the training set from VIPL-HR and the test set from VIPL-HR-v2: VIPL-HR
mainly includes videos recorded in a controlled meeting-room environment with participants around
twenty years old, whereas VIPL-HR-v2 involves participants across a broader range of ages and was collected
in less controlled environments across different regions. This inconsistency may pose challenges to
the proposed methods. Moreover, varying lighting conditions and head motions in the VIPL-HR-v2
test data add further difficulty to video-based remote physiological signal measurement, as varying
lighting conditions can hinder the accurate capture of facial color changes, and head movements may
cause key facial regions to become partially or fully invisible.</p>
      <p>Performance on different skin-tone groups of the OBF test partition. We further divide the
OBF test data into three groups according to the participants’ skin tones. The sample numbers of each
sub-set are: 31 samples for light skin-tone, 41 samples for medium skin-tone, and 28 samples for dark
skin-tone. Previous rPPG studies have observed that most RGB-based rPPG approaches work
better on lighter skin tones, while darker skin is more challenging. Fusing NIR data is expected
to compensate for this difficulty. The RMSE results on different skin-tone groups of each participating
team are shown in Fig. 6. As expected, the worst performance is observed on the OBF-Dark set across
all participating teams. Unexpectedly, the lowest RMSE values were achieved on the OBF-Medium set
rather than the OBF-Light set, which contradicts the common assumption that heart rate estimation
is easier for lighter skin tones. We believe this may result from a domain mismatch: the VIPL-HR
training set mainly contains medium skin-tone samples, while the OBF test set is more diverse in terms
of skin-tone distribution. Another interesting observation from Fig. 6 is that the performance gaps (i.e.,
RMSE differences between Medium and Dark sets) of the winning teams (left three groups of bars) are
noticeably smaller than those of the rest (right three groups of bars). This may indicate that balanced skin-tone
performance (a small gap between different skin groups) is an indicator of a well-performing RGB-NIR
fusion model.</p>
      <p>Performance on different motion levels of the VIPL-HR-v2 test partition. All samples in the
VIPL-HR-v2 test partition include head motions. We divided the samples into three sub-sets based on
the range and the speed of head rotations to investigate their impact on rPPG models. For rotation
range, each sub-set contains 100 samples. For rotation speed, the sample numbers of each sub-set are:
111 samples for low speed, 85 samples for medium speed, and 104 samples for high speed. Fig. 7 shows
that a larger rotation range generally results in higher RMSE values. Similarly, Fig. 8 demonstrates that
faster head rotation speeds degrade model performance. These findings suggest that both large and
rapid head movements interfere with rPPG measurements by making it difficult to consistently track
critical facial areas such as the cheeks and forehead. These areas may become occluded or distorted
during head movement, and color fluctuations are harder to detect during fast rotations. This highlights
important future directions for rPPG research, especially in improving motion-robust signal extraction.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future directions</title>
      <p>As a continuation of the RePSS series, the 4th RePSS Challenge maintains its focus on remote
physiological signal measurement. However, unlike previous editions, which employed only the RGB modality, this
challenge invited participants to explore innovative data fusion techniques by integrating RGB and
Near-Infrared (NIR) facial videos to enhance the accuracy and robustness of rPPG estimation.</p>
      <p>This challenge highlights an important yet underexplored direction in rPPG research—multimodal
fusion for real-world applications where single-modality approaches often fall short. By incorporating
NIR data, the challenge encourages the development of advanced fusion strategies that can overcome
current limitations and advance the field toward more reliable and generalizable remote physiological
monitoring systems.</p>
      <p>For evaluation, test data from both the OBF and VIPL-HR-v2 datasets were used. The OBF dataset
contains videos of participants from different skin-tone groups, while the VIPL-HR-v2 dataset includes
samples with head motions and varying lighting conditions. Although the top-performing teams
achieved relatively low RMSE values, all methods experienced performance degradation in challenging
conditions such as dark skin tones, large head movements, and fast head rotations. These issues pose
challenges to existing rPPG methods and point to future directions for developing more accurate and
robust rPPG systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the Finnish Doctoral Program Network in Artificial Intelligence, AI-DOC
(decision number VN/3137/2024-OKM-6), and the National Natural Science Foundation of China under
Grants U2336213 and 62176249.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-5 for grammar and spelling checks. After
using the tool, the authors reviewed and edited the content as needed and take full responsibility for
the publication’s content.</p>
      <p>(2019) 2781–2795.
[2] Z. Sun, J. Junttila, M. Tulppo, T. Seppänen, X. Li, Non-contact atrial fibrillation detection from face
videos by learning systolic peaks, IEEE Journal of Biomedical and Health Informatics 26 (2022)
4587–4598.
[3] F. Ding, Y. Qin, L. Zhang, H. Lyu, Driver drowsiness detection based on facial video non-contact
heart rate measurement, Journal of Advanced Computational Intelligence and Intelligent
Informatics 29 (2025) 306–315.
[4] W. Yu, S. Ding, Z. Yue, S. Yang, Emotion recognition from facial expressions and contactless heart
rate using knowledge graph, in: 2020 IEEE International Conference on Knowledge Graph (ICKG),
IEEE, 2020, pp. 64–69.
[5] S. Ziaratnia, T. Laohakangvalvit, M. Sugaya, P. Sripian, Multimodal deep learning for remote stress
estimation using cct-lstm, in: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, 2024, pp. 8336–8344.
[6] P. Kumar, S. Misra, Z. Shao, B. Zhu, B. Raman, X. Li, Multimodal interpretable depression analysis
using visual, physiological, audio and textual data, in: 2025 IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV), IEEE, 2025, pp. 5305–5315.
[7] L. Zhao, X. Zhang, X. Niu, J. Sun, R. Geng, Q. Li, X. Zhu, Z. Dai, Remote photoplethysmography
(rppg) based learning fatigue detection, Applied Intelligence 53 (2023) 27951–27965.
[8] K. Wang, Y. Wei, J. Tang, Y. Wang, Z. Li, M. Tong, J. Gao, Y. Ma, Z. Zhao, Camera-based hrv
prediction for remote learning environments, in: 2024 IEEE Smart World Congress (SWC), IEEE,
2024, pp. 1165–1173.
[9] W. Verkruysse, L. O. Svaasand, J. S. Nelson, Remote plethysmographic imaging using ambient
light, Opt. Express 16 (2008) 21434–21445.
[10] M.-Z. Poh, D. J. McDuff, R. W. Picard, Advancements in noncontact, multiparameter physiological
measurements using a webcam, IEEE Transactions on Biomedical Engineering 58 (2010) 7–11.
[11] G. De Haan, V. Jeanne, Robust pulse rate from chrominance-based rppg, IEEE Trans. Biomed. Eng.
60 (2013) 2878–2886.
[12] W. Wang, A. C. Den Brinker, S. Stuijk, G. De Haan, Algorithmic principles of remote ppg, IEEE
Transactions on Biomedical Engineering 64 (2016) 1479–1491.</p>
      <p>
[13] W. Chen, D. McDuff, Deepphys: Video-based physiological measurement using convolutional
attention networks, Proc. ECCV (2018) 356–373.
[14] X. Liu, J. Fromm, S. Patel, D. McDuff, Multi-task temporal shift attention networks for on-device
contactless vitals measurement, Advances in Neural Information Processing Systems 33 (2020)
19400–19411.
[15] E. M. Nowara, D. McDuff, A. Veeraraghavan, The benefit of distraction: Denoising camera-based
physiological measurements using inverse attention, in: Proceedings of the IEEE/CVF international
conference on computer vision, 2021, pp. 4955–4964.
[16] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote heart rate measurement from highly compressed
facial videos: An end-to-end deep learning solution with video enhancement, in: Proc. IEEE ICCV,
2019.
[17] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using
spatio-temporal networks, Proc. BMVC (2019).
[18] J. Gideon, S. Stent, The way to my heart is through contrastive learning: Remote
photoplethysmography from unlabelled video, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2021, pp. 3995–4004.
[19] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, G. Zhao, Physformer: Facial video-based physiological
measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2022, pp. 4186–4196.
[20] Z. Yu, Y. Shen, J. Shi, H. Zhao, Y. Cui, J. Zhang, P. Torr, G. Zhao, Physformer++: Facial video-based
physiological measurement with slowfast temporal difference transformer, International Journal
of Computer Vision 131 (2023) 1307–1330.
[21] X. Liu, B. Hill, Z. Jiang, S. Patel, D. McDuff, Efficientphys: Enabling simple, fast and accurate
camera-based cardiac measurement, in: Proceedings of the IEEE/CVF winter conference on
applications of computer vision, 2023, pp. 5008–5017.
[22] X. Niu, S. Shan, H. Han, X. Chen, Rhythmnet: End-to-end heart rate estimation from face via
spatial-temporal representation, IEEE Trans. Image Processing (2019).
[23] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, G. Zhao, Video-based remote physiological measurement via
cross-verified feature disentangling, in: Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, 2020, pp. 295–310.
[24] H. Lu, H. Han, S. K. Zhou, Dual-gan: Joint bvp and noise modeling for remote physiological
measurement, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2021, pp. 12404–12413.
[25] X. Li, H. Han, H. Lu, X. Niu, Z. Yu, A. Dantcheva, G. Zhao, S. Shan, The 1st challenge on remote
physiological signal sensing (repss), in: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition workshops, 2020, pp. 314–315.
[26] X. Li, H. Sun, Z. Sun, H. Han, A. Dantcheva, S. Shan, G. Zhao, The 2nd challenge on remote
physiological signal sensing (repss), in: Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021, pp. 2404–2413.
[27] Z. Sun, X. Li, H. Han, J. Tang, C. Ying, J. Ge, A. Dantcheva, S. Shan, G. Zhao, The 3rd
vision-based remote physiological signal sensing (repss) challenge &amp; workshop, in: CEUR Workshop
Proceedings, R. Piskac c/o Redaktion Sun SITE, Informatik V, RWTH Aachen, 2024.
[28] K. Kurihara, D. Sugimura, T. Hamamoto, Non-contact heart rate estimation via adaptive rgb/nir
signal fusion, IEEE Transactions on Image Processing 30 (2021) 6528–6543.
[29] S. Park, B.-K. Kim, S.-Y. Dong, Self-supervised rgb-nir fusion video vision transformer framework
for rppg estimation, IEEE Transactions on Instrumentation and Measurement 71 (2022) 1–10.
[30] X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: A multi-modal database for pulse estimation from
less-constrained face video, in: Proc. ACCV, 2018, pp. 562–576.
[31] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G. Zhao, The OBF
database: A large face video database for remote physiological signal measurement and atrial
fibrillation detection, in: Proc. IEEE FG, 2018, pp. 1–6.
[32] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, X. Geng, Deep label distribution learning with label ambiguity,
IEEE Transactions on Image Processing 26 (2017) 2825–2838.</p>
      <p>
[33] Z. Yu, X. Li, X. Niu, J. Shi, G. Zhao, Autohr: A strong end-to-end baseline for remote heart rate
measurement with neural searching, IEEE Signal Processing Letters 27 (2020) 1245–1249.
[34] Y. Kartynnik, A. Ablavatski, I. Grishchenko, M. Grundmann, Real-time facial surface geometry
from monocular video on mobile gpus, arXiv preprint arXiv:1907.06724 (2019).
[35] K. Dragomiretskiy, D. Zosso, Variational mode decomposition, IEEE transactions on signal
processing 62 (2013) 531–544.
[36] H. Abdi, L. J. Williams, Principal component analysis, Wiley interdisciplinary reviews:
computational statistics 2 (2010) 433–459.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alikhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppänen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Atrial fibrillation detection from face videos by fusing subtle variations</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>30</volume>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>