The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge & Workshop

Zhaodong Sun2, Xiaobai Li1,2,∗, Hu Han3, Jiyang Tang3, Chenhang Ying1, Jieyi Ge1, Antitza Dantcheva4, Shiguang Shan3 and Guoying Zhao2

1 State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
2 Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
3 Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), China
4 STARS team, INRIA, France

The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge & Workshop, August 5, 2024, Jeju, South Korea
∗ Corresponding author.
Email: zhaodong.sun@oulu.fi (Z. Sun); xiaobai.li@zju.edu.cn (X. Li); hanhu@ict.ac.cn (H. Han); tangjiyang22s@ict.ac.cn (J. Tang); chying@zju.edu.cn (C. Ying); jyge@zju.edu.cn (J. Ge); antitza.dantcheva@inria.fr (A. Dantcheva); sgshan@ict.ac.cn (S. Shan); guoying.zhao@oulu.fi (G. Zhao)
ORCID: 0000-0002-0597-0765 (Z. Sun); 0000-0003-4519-7823 (X. Li); 0000-0001-6010-1792 (H. Han); 0000-0003-0107-7029 (A. Dantcheva); 0000-0002-8348-392X (S. Shan); 0000-0003-3694-206X (G. Zhao)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The remote measurement of physiological signals from video recordings is a topic of growing interest. Despite its potential, progress in this field is being impeded by the absence of publicly available benchmark databases and a standardized validation platform. To address these issues, the RePSS Challenge is held annually. The 3rd RePSS Challenge is being conducted alongside IJCAI 2024 and features two competition tracks. Track 1 focuses on self-supervised learning for heart rate measurement using unlabeled facial videos, while Track 2 tackles the more complex task of measuring blood pressure from facial videos. This paper provides an overview of the challenge, detailing the data, protocols, analysis of results, and discussions. We highlight the top-performing solutions to offer insights for researchers and outline future directions for this field and the challenge itself.

Keywords
rPPG, physiological signal, facial video, heart rate, blood pressure

1. Introduction
Physiological signals, including heart rate (HR), respiration rate (RR), heart rate variability (HRV), and blood pressure (BP), are crucial indicators of human health. Traditionally, these signals are measured using specialized medical instruments such as electrocardiography (ECG), photoplethysmography (PPG) oximeters, and breathing belts. However, using contact medical sensors is expensive and inconvenient for long-term monitoring. Later, researchers discovered that PPG signals can be captured remotely from human faces under ambient light conditions. For instance, Verkruysse et al. [1] demonstrated the measurement of PPG signals from the forehead. Subsequently, numerous studies have proposed various remote PPG (rPPG) measurement techniques. Early methods relied on empirically designed filters and lacked a training process. Some approaches [2, 3, 4, 5, 6, 7] utilized subtle color changes in facial pixels for rPPG measurement, while others [8, 9, 10] focused on tracking vertical head motions. Most researchers have adopted supervised approaches for rPPG measurement, such as [11], [12], [13], and [14]. Recently, more researchers have been developing unsupervised/self-supervised rPPG methods [15, 16, 17, 18, 19, 20] to train rPPG measurement models with only facial videos.
Despite significant research interest, the development of this field is hindered by the lack of publicly available benchmark databases and a standardized validation platform. To address these issues, we organized the 1st RePSS challenge [21] 1 in conjunction with CVPR 2020, followed by the 2nd RePSS challenge [22] 2 with ICCV 2021, aiming to provide benchmark datasets and a fair comparison platform for researchers.
1 https://competitions.codalab.org/competitions/22287
2 https://competitions.codalab.org/competitions/30855
The RePSS challenge series is intended to be an annual event with a continuous and evolving theme. The inaugural 1st RePSS challenge focused on the fundamental task of measuring average HR from color facial videos. The 2nd RePSS challenge, held alongside ICCV 2021, introduced two tracks: inter-beat interval (IBI) and respiration measurement. This year, the 3rd RePSS, held in conjunction with IJCAI 2024, introduces two new tracks: self-supervised facial video-based heart rate measurement and blood pressure measurement.
The paper is structured as follows: Section 2 provides an overview of the 3rd RePSS challenge, detailing the tasks, datasets, challenge protocol, and evaluation metrics. Section 3 discusses the approaches proposed by the top-performing teams in the challenge. Section 4 presents the challenge results and discussions, and Section 5 explores future directions in this research area.

2. Challenge Overview
2.1. Challenge tracks
There are two tracks in the 3rd RePSS challenge, both held on Kaggle. Eighteen teams registered for Track 1 and 15 for Track 2. By the final submission date, valid results were submitted by 13 teams in Track 1 and six teams in Track 2. In total, there were 313 result submissions from 58 participants in Track 1, and 148 result submissions from 23 participants in Track 2.
Track 1 is self-supervised learning for heart rate measurement using unlabeled facial videos. Since there are only a few facial videos with HR labels, Track 1 mainly focuses on developing self-supervised training methods on large-scale unlabeled facial videos. Track 1 was organized on the Kaggle website 3.
Track 2 is facial video-based blood pressure measurement, which is an emerging and more challenging topic. Blood pressure measurement requires high-quality physiological signals from facial videos, so each participant in this track should design both an accurate remote physiological signal measurement algorithm and a blood pressure estimation algorithm. Track 2 was organized on the Kaggle website 4.

2.2. Data and protocol
Track 1. Since Track 1 is about self-supervised training, there are three stages for this track: the pre-training stage, the model fine-tuning stage, and the test stage. For the pre-training stage, which focuses on unsupervised pre-training, we have compiled a list of open-source, large-scale facial video datasets including (a) VFHQ [23] 5, (b) FaceForensics++ [24] 6, (c) DeeperForensics [25] 7, (d) CelebV-HQ [26] 8, (e) DISFA [27] 9, and (f) the MAHNOB Laughter Database [28] 10. We have checked each of these datasets to confirm that the video quality is suitable for the task, no ground truth is available, and the data can be easily accessed online. Participants can also use other face video data without ground truth for pre-training. For the model fine-tuning stage, we provide the VIPL-HR-V2 dataset [21, 29, 30] built by the organizers' team.
The dataset contains facial videos with ground-truth physiological signals from 400 persons. For the test stage, we provide a subset of 200 persons' data from the VIPL-HR-V2 and OBF datasets as the testing data. The ground-truth signals of the test set have never been released in previous challenges. Participants should submit their HR predictions to the Kaggle website to get the evaluation results. Each team has a maximum of 5 submissions per day. The ranking is based on the RMSE on the test data.
3 https://www.kaggle.com/competitions/the-3rd-repss-t1
4 https://www.kaggle.com/competitions/the-3rd-repss-t2
5 https://liangbinxie.github.io/projects/vfhq/
6 https://github.com/ondyari/FaceForensics
7 https://github.com/EndlessSora/DeeperForensics-1.0
8 https://celebv-hq.github.io/
9 http://mohammadmahoor.com/disfa/
10 https://mahnob-db.eu/laughter/
Track 2. Track 2 is facial video-based blood pressure measurement, which contains the training and test stages. For the training stage, there is a large-scale rPPG dataset called Vital Videos [31] 11 with facial videos and blood pressure labels from around 880 subjects. The videos and labels in the dataset are of good quality. We have made an agreement with the dataset owner that the dataset can be used for this challenge track. Participants can use this labeled dataset to train models for rPPG-based blood pressure measurement, and they can split part of the training data off as a validation set. For the test stage, we use the OBF dataset [32], which includes 100 subjects and 200 facial videos with blood pressure labels. Only the facial videos are released; the blood pressure labels have never been released. Participants should submit their systolic and diastolic BP predictions to the Kaggle website to get the evaluation results. Each team has a maximum of 5 submissions per day. The ranking is based on the RMSE on the test data.
11 https://vitalvideos.org/

2.3. Evaluation metrics
We use the root mean squared error (RMSE) as the evaluation metric. For Track 1, the RMSE between the ground-truth heart rates y and the submitted heart rates y' is calculated as

RMSE_1 = \sqrt{\frac{\sum_{i=1}^{N} (y_i - y'_i)^2}{N}}.   (1)

For Track 2, the systolic RMSE between the ground-truth systolic blood pressure s and the submitted systolic blood pressure s' is calculated first, and then the diastolic RMSE between the ground-truth diastolic blood pressure d and the submitted diastolic blood pressure d' is calculated. The final RMSE is the mean of the systolic and diastolic RMSE, as shown below:

RMSE_2 = 0.5 \sqrt{\frac{\sum_{i=1}^{N} (s_i - s'_i)^2}{N}} + 0.5 \sqrt{\frac{\sum_{i=1}^{N} (d_i - d'_i)^2}{N}}.   (2)
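For concreteness, the two ranking metrics can be reproduced with a few lines of NumPy. The sketch below simply mirrors Eqs. (1) and (2); the arrays holding ground-truth and submitted values are hypothetical placeholders.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between ground-truth and submitted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ground-truth and submitted values, for illustration only.
hr_true, hr_pred = [72.0, 80.0, 65.0], [70.0, 84.0, 66.0]        # heart rate, bpm
sbp_true, sbp_pred = [118.0, 135.0], [121.0, 128.0]              # systolic BP, mmHg
dbp_true, dbp_pred = [76.0, 88.0], [79.0, 83.0]                  # diastolic BP, mmHg

rmse1 = rmse(hr_true, hr_pred)                                            # Track 1 score (Eq. 1)
rmse2 = 0.5 * rmse(sbp_true, sbp_pred) + 0.5 * rmse(dbp_true, dbp_pred)   # Track 2 score (Eq. 2)
print(rmse1, rmse2)
```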
3. Proposed approaches
To ensure fair competition, only pre-registered teams with authorized IDs are included in the final performance evaluation and ranking. The leaderboards of both tracks are displayed in Table 1. We reached out to the top three teams in both tracks, requesting brief descriptions of their methods for inclusion in this review paper. These methods are detailed below.

3.1. Track 1
3.1.1. Team 'Face AI' (Agency for Science, Technology and Research)
The proposed solution includes two stages. In the pre-training stage, they propose a contrastive deep learning method called RankContrast to extract rPPG-related features. In the fine-tuning stage, a supervised method with data augmentation and an ensemble technique is utilized to train the model on a limited number of labeled facial videos. The overall framework is depicted in Fig. 1.
Figure 1: Overall framework for Team Face AI in Track 1.
They utilize an end-to-end framework based on the PhysNet-large 3D-CNN model, where a sequence of face frames is fed directly into the deep learning model. They use multiple datasets with highly complex backgrounds to train the model during the pre-training stage. To minimize noise, only the face area reflecting the rPPG signal is cropped for training. The human faces are detected by MTCNN [33] on the first frame, and then the whole video is cropped by a larger bounding box based on the detected face with a scale factor of 1.3. The cropped image frames are resized to 128 x 128.
A RankContrast self-supervised learning method that integrates a ranking loss and a contrastive learning loss is proposed in this work, as shown in Fig. 2. Since the rPPG signal is periodic, the apparent heart rate changes when the video clips are resampled: upsampling a clip reduces the heart rate, and downsampling it increases the heart rate [34]. Based on these characteristics, a ranking loss function is designed to extract features from upsampled and downsampled versions of the video clips. The contrastive learning loss compares similar (positive) clips and dissimilar (negative) clips with the anchor clips through an attracting and repelling strategy. As the heart rate of an individual is relatively stable over a short time, the positive pairs are constructed by shifting the training clip by some frames within the same video. The resampled versions of the anchor sample are considered as negative pairs.
Figure 2: Proposed RankContrast method for Team Face AI in Track 1.
The pre-trained model is then fine-tuned on the VIPL-HR-V2 dataset, which consists of 400 subjects, in a supervised learning manner. The ground truth of the blood volume pulse (BVP) wave and the heart rate are provided in the VIPL-HR-V2 dataset. They adopt two supervised loss functions, a classification loss and a Pearson loss, to guide the learning process.
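To illustrate the resampling intuition behind the ranking constraint (this is not the team's actual implementation), the sketch below temporally upsamples a clip, estimates a differentiable spectral-centroid proxy for the dominant heart-rate frequency, and applies a margin ranking loss. The stand-in rppg_model, the resampling factor, and the margin are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def spectral_centroid(signal, fs):
    """Differentiable spectral centroid (Hz), used here as a proxy for the dominant HR frequency."""
    spec = torch.fft.rfft(signal, dim=-1).abs()
    freqs = torch.fft.rfftfreq(signal.shape[-1], d=1.0 / fs).to(spec)
    return (torch.softmax(spec, dim=-1) * freqs).sum(dim=-1)

def resample_clip(clip, factor):
    """Temporally resample a clip (B, C, T, H, W); factor > 1 adds frames, lowering the apparent HR."""
    b, c, t, h, w = clip.shape
    return F.interpolate(clip, size=(max(int(t * factor), 8), h, w),
                         mode="trilinear", align_corners=False)

# Stand-in for a PhysNet-like rPPG backbone: any module mapping (B, C, T, H, W) -> (B, T) works here.
rppg_model = lambda v: v.mean(dim=(1, 3, 4))

clip = torch.randn(2, 3, 150, 64, 64)   # toy 5-second clips at 30 fps
fs = 30.0
f_anchor = spectral_centroid(rppg_model(clip), fs)
f_slow = spectral_centroid(rppg_model(resample_clip(clip, 1.5)), fs)

# Ranking constraint: the upsampled (slowed-down) clip should yield a lower dominant frequency.
rank_loss = F.margin_ranking_loss(f_anchor, f_slow, torch.ones_like(f_anchor), margin=0.1)
```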
3.1.2. Team 'HFUT-VUT' (Hefei University of Technology)
The team HFUT-VUT participated in Track 1 and presented two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. They first propose a non-end-to-end self-supervised HR measurement framework (solution 1) based on spatial-temporal modeling. In addition, they employ a complementary end-to-end solution based on contrastive learning (solution 2). Finally, they combine the strengths of the two solutions through an ensemble strategy to generate the final predictions.
Figure 3: Overview of the proposed solution 1 of team HFUT-VUT in Track 1.
Figure 4: Overview of solution 2 of team HFUT-VUT in Track 1.
Solution 1. This solution is a non-end-to-end self-supervised HR measurement framework based on a spatial-temporal Transformer that captures subtle rPPG clues. The overview of this solution is illustrated in Figure 3. The method contains three steps. 1) Data pre-processing: The raw facial video is first transformed into an MSTmap to suppress irrelevant background and noise features while retaining most of the temporal characteristics of the periodic physiological signals. 2) Spatial-Temporal Transformer: Inspired by Dual-TL [35], a spatial-temporal Transformer is proposed to perceive the temporal and spatial correlations. It includes two encoders (a spatial encoder and a temporal encoder) to refine the ROI representation containing rPPG clues by capturing long-term spatiotemporal contextual information. 3) Self-supervised Loss: In this solution, they employ four self-supervised loss functions that incorporate priors on rPPG bandwidth and periodicity [18], combined as L_total = L_band + L_sparse + L_var + L_perio. A bandwidth loss L_band is first adopted to penalize the model for producing signals that exceed the healthy HR bandwidth limits. Then, a sparsity loss L_sparse is adopted to emphasize the periodic heartbeats by suppressing non-heartbeat frequencies. To avoid the model collapsing to a specific frequency, they use a variance loss L_var to spread the variance of the power spectral density into a uniform distribution over the desired frequency band. In addition, a periodicity loss L_perio is proposed to avoid abnormal periodic fluctuations of the predicted signal, thereby ensuring temporal periodicity consistency.
Solution 2. This solution adopts the end-to-end self-supervised HR measurement framework Contrast-Phys+ [20]. The framework is depicted in Figure 4 and consists of three steps. 1) Data pre-processing: First, face detection is performed using MTCNN [36] to obtain the face bounding box. The face video is then cropped to 1.5 times the size of the bounding box and resized to 128×128. Subsequently, each video is segmented into clips, and frame differencing is applied to generate normalized difference frames as input to the model. 2) Pre-training: Following the setup of [20], the 3DCNN-based PhysNet is used to obtain a spatiotemporal rPPG (ST-rPPG) block representation. Based on the rPPG spatial and temporal similarity observed in [20], a contrastive loss is adopted to pull together the rPPG signals from the same ST-rPPG block and push away signals from ST-rPPG blocks extracted from different videos. 3) Fine-tuning: Starting from the pre-trained 3DCNN-based PhysNet model, the model is then fine-tuned in a supervised manner. Specifically, given the predicted rPPG signal y_pred and the ground-truth PPG signal y_gt, the popular time-domain Negative Pearson correlation (Pear) loss and the frequency-domain Negative maximum cross-correlation (MCC) [16] loss are used for supervised training. The MCC loss is robust to temporal offsets in the ground truth, which compensates for the Pear loss.
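A minimal sketch of these two objectives is given below; details such as band-pass filtering and the exact normalization used in [16, 20] are omitted, so it is an illustrative approximation rather than the team's code, and the example signals are hypothetical.

```python
import torch

def neg_pearson_loss(pred, gt):
    """Time-domain loss: 1 minus the Pearson correlation, averaged over a batch of (B, T) signals."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    gt = gt - gt.mean(dim=-1, keepdim=True)
    corr = (pred * gt).sum(-1) / (pred.norm(dim=-1) * gt.norm(dim=-1) + 1e-8)
    return (1.0 - corr).mean()

def neg_mcc_loss(pred, gt):
    """Frequency-domain loss: negative maximum normalized (circular) cross-correlation over all lags."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    gt = gt - gt.mean(dim=-1, keepdim=True)
    n = pred.shape[-1]
    cc = torch.fft.irfft(torch.fft.rfft(pred) * torch.conj(torch.fft.rfft(gt)), n=n)
    mcc = cc.max(dim=-1).values / (pred.norm(dim=-1) * gt.norm(dim=-1) + 1e-8)
    return (-mcc).mean()

# Hypothetical predicted rPPG and ground-truth PPG: 10-second signals at 30 fps.
y_pred, y_gt = torch.randn(4, 300), torch.randn(4, 300)
loss = neg_pearson_loss(y_pred, y_gt) + neg_mcc_loss(y_pred, y_gt)
```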
3.1.3. Team 'PCA_Vital' (Nanjing University of Science and Technology)
The team PCA_Vital participated in Track 1 on self-supervised heart rate sensing, using a method based on contrastive learning and spatiotemporal reconstruction to learn heart rate from unlabeled facial videos. The framework of the proposed method is shown in Fig. 5.
Figure 5: Method diagram of the PCA_Vital team in Track 1.
First, to overcome redundant skin information, they designed a novel region-of-interest extraction method that focuses on facial muscles and capillary-rich areas while ignoring the interference of explicit edges, corners, and textures. They converted the video segments into spatiotemporal maps, independently performed temporal normalization on each sub-block feature dimension, and then applied a YUV color space conversion to mine the subtle color changes reflecting blood volume pulsation in unlabeled facial videos. This process captures certain rhythm and color variation characteristics in the preprocessing and enhancement stages without relying on a learned model.
Second, after converting the input video clips into spatiotemporal maps, they guided inter-instance and intra-instance contrastive learning by enriching positive and negative sample pairs during the pre-training stage. They constructed positive and negative sets between different individual instances, and randomly reorganized and reselected these samples at the feature-point level to increase diversity. Then, they constructed an encoder to extract features from the input samples, obtained waveform and frequency features, and computed the contrast correlation and power spectral density to bring representations of the same instance closer together and push representations of different instances farther apart.
Finally, they replaced the traditional remote photoplethysmography regression with spatiotemporal reconstruction, and further improved the robustness of the model by focusing on the interaction of temporal features between different sub-regions of the face in the fine-tuning stage. The fine-tuning stage uses a U-shaped network as the backbone to constrain waveform reconstruction, and extracts intermediate-layer features to construct a map of the same scale as the target pulse label. In addition, they embedded a series of temporal attention modules at the skip connections of the U-shaped structure, calculated global self-attention scores within the encoder features, and concatenated them with the main-path features fed to the decoder.

3.2. Track 2
3.2.1. Team 'Face AI (BP)' (Agency for Science, Technology and Research)
The overall framework of their ensemble deep learning method is illustrated in Fig. 6, from which we can see that there are multiple regression models. To introduce diversity, multiple models are trained using different input feature vectors, backbones, or random seeds. The outputs of the individual models are then fused with an aggregator.
Figure 6: Overall framework for Team Face AI (BP) in Track 2.
Data Preprocessing: A short clip is extracted from the original full video and then partitioned into frames. They select the clip closest to the time when blood pressure (BP) was measured to mitigate the impact of BP fluctuation during video recording. If the video was recorded before the BP measurement, the last part of the video is selected, and vice versa for videos taken after the BP measurement. The face region of each frame is then cropped and resized to 128×128. To improve model performance under different lighting conditions, data augmentation techniques are applied during the training process. As it has been demonstrated in [29, 37] that alternative color spaces derived from RGB videos are beneficial for a better representation of the HR signal, they also explored using the YUV color space for BP estimation in addition to the original RGB space.
Network Structure: They utilize two state-of-the-art models as the backbone for their BP estimation model: a 3D CNN model named PhysNet [11] and a transformer-based model named PhysFormer [38]. They keep all layers of the backbones so that the output of the backbone remains the PPG signal. Then, they stack a regression head with one hidden layer on top of the backbone; the regression head has two output nodes corresponding to systolic BP (SBP) and diastolic BP (DBP), respectively. The average RMSE of SBP and DBP is used as the loss function to train their models.
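A minimal sketch of such a head and loss is shown below, assuming the backbone emits a fixed-length rPPG sequence; the layer sizes, sequence length, and variable names are illustrative assumptions, not the team's configuration.

```python
import torch
import torch.nn as nn

class BPRegressionHead(nn.Module):
    """One-hidden-layer head mapping a backbone's rPPG output (B, T) to two values (SBP, DBP)."""
    def __init__(self, seq_len=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, rppg):
        return self.net(rppg)  # (B, 2): column 0 = SBP, column 1 = DBP

def mean_rmse_loss(pred, target):
    """Average of the systolic and diastolic RMSE over a batch, used as the training loss in this sketch."""
    per_output_rmse = torch.sqrt(((pred - target) ** 2).mean(dim=0) + 1e-8)
    return per_output_rmse.mean()

head = BPRegressionHead()
rppg = torch.randn(8, 300)                            # stand-in for the backbone's PPG output
target = torch.tensor([[120.0, 80.0]]).repeat(8, 1)   # hypothetical (SBP, DBP) labels in mmHg
loss = mean_rmse_loss(head(rppg), target)
```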
3.2.2. Team 'PCA_Vital' (Nanjing University of Science and Technology)
The team PCA_Vital participated in Track 2 on facial video-based remote blood pressure measurement, for which they used a method based on convolutional neural network and random forest feature fusion. The framework of the proposed method is shown in Fig. 7.
Figure 7: Method diagram of the PCA_Vital team in Track 2.
First, they extract the blood volume pulse signal, which varies with optical reflectance, from the input visible-light facial video clips based on pixel-level chromaticity transformation information. Then, they combined residual convolution with local and global attention mechanisms to design a convolutional neural network for remote blood pressure measurement, named RBP-CNN, to learn the blood pressure information implicit in the blood volume pulse along the spatial and temporal dimensions. At the same time, they also captured prior information on the participants' body mass index and age from the facial video frames, and calculated the corresponding heart rate value from the blood volume pulse. In this process, they found a strong correlation between diastolic and systolic blood pressure and utilized the diastolic blood pressure to predict the systolic blood pressure. Finally, they used an ensemble learning strategy with a random forest to fuse multiple features for blood pressure measurement, and employed the feature importance of the random forest to verify the rationality of the proposed remote detection approach.

3.2.3. Team 'Rhythm' (University of Science and Technology Beijing)
Their proposed method is an end-to-end framework that takes video as input and predicts blood pressure values as output. Directly predicting blood pressure from facial video may not yield optimal results. Therefore, they divide the blood pressure estimation process into two stages within the model: 1) estimating the corresponding BVP waves from the left and right halves of the face, and 2) estimating the BP values from these two BVP waves. As depicted in Fig. 8, the overall framework of the proposed method mainly consists of three components: a Tokenization Stem, a BVP Network, and a BP Network. The process begins with the video input, from which the Tokenization Stem extracts temporal token sequences from both the left and right facial regions. Subsequently, the BVP Network reconstructs BVP waveforms separately from the two temporal token sequences. The BVP Network is based on RhythmMamba, which constrains a state space model across multiple temporal scales in both the temporal and frequency domains. This approach maintains linear computational complexity while retaining the capability for long-range dependency modeling. They aim to refine the granularity of pulse wave reconstruction through long-range dependency modeling, thereby improving the accuracy of blood pressure estimation. Finally, the BP Network estimates the BP values from the two BVP waves, primarily using convolutional neural network layers.
Figure 8: Method diagram of the Rhythm team in Track 2.
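The two-stage decomposition can be summarized by the skeleton below. The module bodies are deliberately simple placeholders (for instance, a GRU stands in for the RhythmMamba-based BVP Network), so it only illustrates the data flow, not the team's architecture; all layer choices and sizes are assumptions.

```python
import torch
import torch.nn as nn

class TwoStageBPNet(nn.Module):
    """Video -> left/right token sequences -> two BVP waves -> (SBP, DBP).
    All modules are placeholders for the Tokenization Stem, BVP Network and BP Network."""
    def __init__(self, dim=32):
        super().__init__()
        self.stem = nn.Conv3d(3, dim, kernel_size=(1, 8, 8), stride=(1, 8, 8))  # crude spatial tokenization
        self.bvp_net = nn.GRU(dim, dim, batch_first=True)                       # stand-in for RhythmMamba
        self.bvp_head = nn.Linear(dim, 1)
        self.bp_net = nn.Sequential(nn.Conv1d(2, 16, 7, padding=3), nn.ReLU(),
                                    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 2))

    def forward(self, video):                                  # video: (B, 3, T, H, W)
        w = video.shape[-1]
        bvps = []
        for half in (video[..., : w // 2], video[..., w // 2:]):          # left / right face halves
            tokens = self.stem(half).mean(dim=(3, 4)).transpose(1, 2)     # (B, T, dim) temporal tokens
            feats, _ = self.bvp_net(tokens)
            bvps.append(self.bvp_head(feats).squeeze(-1))                 # (B, T) BVP wave
        return self.bp_net(torch.stack(bvps, dim=1))                      # (B, 2): SBP and DBP

sbp_dbp = TwoStageBPNet()(torch.randn(1, 3, 150, 64, 64))
```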
4. Challenge results and discussion
The main results and rankings of the two competition tracks are summarized in Table 1. In this section we also provide more detailed statistics of the results for both tracks.

4.1. Track 1 result analysis
Fig. 9 showcases the Root Mean Square Error (RMSE) results for the teams in Track 1. The results are divided into three categories: overall performance (blue bars), performance on the OBF dataset (orange bars), and performance on the VIPL-HR-V2 dataset (gray bars). This structured approach allows for a detailed analysis of how well different teams performed across diverse datasets.
Figure 9: RMSE results on the overall test set, the OBF test partition, and the VIPL-HR-V2 test partition for all teams in Track 1.
Examining the overall performance, it is evident that highly ranked teams like "Face AI" demonstrated consistently low RMSE values, indicating their strong overall performance. On the other hand, lower-ranked teams such as "Rhythm" and "CAS-MAIS" exhibited higher RMSE values, suggesting that their models were less accurate in heart rate measurement compared to others in the competition.

Table 1
The final leaderboard of the 3rd RePSS challenge.

(1) Track 1
Ranking | Team Name | Organization | Score
1 | Face AI | Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR) | 8.50693
2 | HFUT-VUT | Hefei University of Technology | 8.85277
3 | PCA_Vital | Nanjing University of Science and Technology | 8.96941
4 | Hash Brown | Beijing University of Posts and Telecommunications | 9.26198
5 | AIIA | Harbin Institute of Technology | 9.28902
6 | SHDMIC | Ruijin Hospital | 10.74201
7 | HFUT-BCDH | Hefei University of Technology | 11.77657
8 | NeuroAI_KW | Kwangwoon University | 14.4793
9 | NUIST | Nanjing University of Information Science and Technology | 15.7968
10 | SCUT_rPPG | South China University of Technology | 15.88228
11 | b7 | University of Science and Technology of China | 19.06485
12 | CAS-MAIS | Institute of Automation, Chinese Academy of Sciences | 21.48006
13 | Rhythm | University of Science and Technology Beijing | 24.0241

(2) Track 2
Ranking | Team Name | Organization | Score
1 | Face AI (BP) | Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR) | 12.95258
2 | PCA_Vital | Nanjing University of Science and Technology | 13.48281
3 | Rhythm | University of Science and Technology Beijing | 13.59307
4 | SCUT_rPPG | South China University of Technology | 15.06056
5 | IAI-USTC | University of Science and Technology of China | 16.01179
6 | NeuroAI | Kwangwoon University | 16.56091

When focusing on the OBF dataset specifically, teams such as "PCA_Vital" and "SHDMIC" performed particularly well, with notably low RMSE values. This indicates that their models were highly effective at processing the data characteristics inherent to the OBF dataset. Conversely, other teams had higher RMSE values on the OBF dataset, reflecting challenges in adapting their models to the OBF dataset when fine-tuned on VIPL-HR-V2. This variability points to the importance of dataset-specific tuning and the potential difficulty in developing models that can handle a wide range of input variations.
In terms of performance on the VIPL-HR-V2 dataset, teams like "HFUT-BCDH" and "Face AI" excelled, achieving lower RMSE values. Their success suggests effective utilization of the VIPL-HR-V2 dataset's characteristics for fine-tuning.
In contrast, teams like "CAS-MAIS" and "Rhythm" exhibited the highest RMSE values in this category, indicating potential difficulties in leveraging the challenging VIPL-HR-V2 dataset for precise heart rate measurement.
The competition results underscore the importance of consistency and robustness in model performance. Teams with consistently low RMSE across both datasets, such as "Face AI" and "HFUT-VUT," likely developed more robust models capable of generalizing well across different facial video data. This indicates that their pre-training and fine-tuning stages effectively captured the underlying features necessary for accurate heart rate measurement.
Certain teams exhibited strong performance on one dataset but not the other. For example, "PCA_Vital" showed excellent results on the OBF dataset but struggled significantly with the VIPL-HR-V2 dataset. This disparity could be due to differences in video quality, lighting conditions, or variations in facial expressions and movements between the datasets. Such differences highlight the importance of diverse and comprehensive pre-training data to ensure models can handle various real-world conditions.

Figure 10: The overall BP RMSE results along with systolic and diastolic BP RMSE for all teams in Track 2.

Table 2
The cumulative percentage of errors (CPE) of diastolic and systolic BP and the corresponding British Hypertension Society (BHS) grade for all teams. The BHS grade standard is shown in Table 3.

Teams | Diastolic BP (CPE5 / CPE10 / CPE15 / BHS Grade) | Systolic BP (CPE5 / CPE10 / CPE15 / BHS Grade)
Face AI (BP) | 38% / 75.50% / 90.50% / D | 22.50% / 44% / 64.50% / D
PCA_Vital | 42.50% / 73.50% / 90.50% / C | 21.50% / 42.50% / 63% / D
Rhythm | 41.50% / 75.50% / 92% / C | 23.50% / 44.50% / 66% / D
SCUT_rPPG | 38% / 63.50% / 84.50% / D | 25.50% / 44.50% / 68.50% / D
IAI-USTC | 30.50% / 61% / 78.50% / D | 25% / 44% / 58.50% / D
NeuroAI | 42.50% / 74% / 90.50% / C | 13% / 27.50% / 44% / D

Table 3
The British Hypertension Society (BHS) grade standard.

 | Grade A | Grade B | Grade C | Grade D
CPE5 | >=60% | >=50% | >=40% | <40%
CPE10 | >=85% | >=75% | >=65% | <65%
CPE15 | >=95% | >=90% | >=85% | <85%

4.2. Track 2 result analysis
The competition results for facial video-based blood pressure (BP) measurement reveal variations in performance among the participating teams, as shown by their root mean square error (RMSE) and cumulative percentage of errors (CPE) for diastolic and systolic BP presented in Fig. 10 and Table 2.
In terms of the overall RMSE for BP measurement, the team Face AI (BP) exhibited the lowest overall RMSE, indicating the most accurate performance among the teams. Focusing on the diastolic BP RMSE, Face AI (BP) again achieved the lowest error, underscoring their strong performance on this specific metric, while other teams, such as NeuroAI and IAI-USTC, had relatively higher RMSE values. For the systolic BP RMSE, NeuroAI showed the highest error, indicating less accurate performance on systolic BP. Face AI (BP) again performed well, followed by PCA_Vital and Rhythm. When comparing the RMSE between diastolic and systolic BP, the diastolic BP RMSE is always lower than the systolic BP RMSE, which has also been observed in contact PPG-based BP research [39, 40].
The cumulative percentage of errors (CPE) and the corresponding British Hypertension Society (BHS) grades in Table 2 provide further insight into the teams' performance. The CPE5, CPE10, and CPE15 values reflect the percentage of errors within 5, 10, and 15 mmHg, respectively, and the BHS grade standard is given in Table 3. For diastolic BP, PCA_Vital, Rhythm, and NeuroAI achieve BHS grade C, while the other teams fall into the lowest grade D. For systolic BP, all teams fall into the lowest BHS grade D.
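The CPE statistics and BHS grades in Tables 2 and 3 follow directly from the per-subject error distribution; the short sketch below shows the computation on hypothetical errors (the error values are made up for illustration).

```python
import numpy as np

def cpe_and_bhs_grade(errors_mmhg):
    """Cumulative percentage of absolute errors within 5/10/15 mmHg and the resulting BHS grade."""
    abs_err = np.abs(np.asarray(errors_mmhg, dtype=float))
    cpe = {thr: 100.0 * np.mean(abs_err <= thr) for thr in (5, 10, 15)}
    # Grade thresholds from Table 3: (CPE5, CPE10, CPE15) minima for grades A, B and C.
    for grade, (t5, t10, t15) in {"A": (60, 85, 95), "B": (50, 75, 90), "C": (40, 65, 85)}.items():
        if cpe[5] >= t5 and cpe[10] >= t10 and cpe[15] >= t15:
            return cpe, grade
    return cpe, "D"

# Hypothetical per-subject diastolic BP errors (prediction minus reference), in mmHg.
errors = [3.0, -6.5, 4.2, 12.0, -2.1, 8.8, -16.0, 1.5]
cpe, grade = cpe_and_bhs_grade(errors)
print(cpe, grade)
```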
The results suggest that while there is notable variation in the performance of different teams, all exhibit relatively high errors, as evidenced by the BHS grades. Since Grades A and B are recommended for clinical use, the rPPG-based BP estimation methods from the Track 2 teams still need substantial performance improvement before clinical use. The results across all teams, especially for systolic BP measurements, highlight the complexity of accurately estimating BP from facial videos. Enhancements in video preprocessing, feature extraction, and model training could help improve performance. Additionally, incorporating more diverse training datasets could help models generalize better to the test set.

5. Conclusion and future directions
As a continuous event, the 3rd RePSS challenge advanced beyond the 2nd and 1st RePSS in two key ways: 1) In Track 1, participants utilized self-supervised methods to pre-train models on unlabeled facial videos, unlike previous challenges that relied on supervised methods requiring labeled facial videos. 2) Track 2 introduced a new competition on rPPG-based blood pressure estimation, which necessitates high-quality rPPG signals for accurate blood pressure estimation. The 3rd RePSS challenge attracted more specialized research groups and led to the proposal of interesting approaches from the participating teams, potentially offering valuable insights for future research.
For Track 1, the competition results highlight both the potential and the challenges of self-supervised learning for heart rate measurement. While some teams demonstrated impressive accuracy, there remains significant room for improvement, particularly in ensuring models generalize well across diverse datasets. The findings suggest that a focus on dataset diversity, advanced pre-training methods, and the exploration of multi-modal data could drive further advancements in this field. To further improve performance, future work could explore the integration of multi-modal data, such as combining facial video with other modalities like radar and infrared bands. Additionally, enhancing the diversity and quality of pre-training datasets could improve the pre-trained models.
For Track 2, the blood pressure results underscore the need for further research and development in rPPG-based blood pressure measurement. While the competition showcases promising advancements in facial video-based BP measurement, the results indicate substantial room for improvement before these methods can be considered reliable for clinical or real-world applications. Future competitions could also focus on rPPG signal waveform evaluation, which is fundamental to BP estimation.

Acknowledgments
This work was supported by the Research Council of Finland (former Academy of Finland) Academy Professor project EmotionAI (grants 336116, 345122), ICT 2023 project TrustFace (grant 345948), the University of Oulu & Research Council of Finland Profi 7 (grant 352788), Infotech Oulu, and the National Natural Science Foundation of China (grant 62176249). The authors would like to acknowledge Pieter-Jan Toye for providing data for Track 2 of the RePSS challenge. The authors also acknowledge CSC-IT Center for Science, Finland, for providing computational resources.

References
[1] W. Verkruysse, L. O. Svaasand, J. S. Nelson, Remote plethysmographic imaging using ambient light, Opt. Express 16 (2008) 21434–21445.
[2] M.-Z. Poh, D. J. McDuff, R. W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (2010) 10762–10774.
[3] M.-Z. Poh, D. J. McDuff, R. W. Picard, Advancements in noncontact, multiparameter physiological measurements using a webcam, IEEE Trans. Biomed. Eng. 58 (2011) 7–11.
[4] G. De Haan, V. Jeanne, Robust pulse rate from chrominance-based rppg, IEEE Trans. Biomed. Eng. 60 (2013) 2878–2886.
[5] X. Li, J. Chen, G. Zhao, M. Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proc. IEEE CVPR, 2014, pp. 4264–4271.
[6] D. McDuff, S. Gontarek, R. W. Picard, Improvements in remote cardiopulmonary measurement using a five band digital camera, IEEE Trans. Biomed. Eng. 61 (2014) 2593–2601.
[7] W. Wang, A. C. den Brinker, S. Stuijk, G. de Haan, Algorithmic principles of remote ppg, IEEE Trans. Biomed. Eng. 64 (2017) 1479–1491.
[8] G. Balakrishnan, F. Durand, J. Guttag, Detecting pulse from head motions in video, in: Proc. IEEE CVPR, 2013, pp. 3430–3437.
[9] A. V. Moco, S. Sander, G. de Haan, Ballistocardiographic artifacts in ppg imaging, IEEE Trans. Biomed. Eng. 63 (2016).
[10] C. Yang, G. Cheung, V. Stankovic, Estimating heart rate and rhythm via 3D motion tracking in depth video, IEEE Trans. Multimedia 19 (2017) 1625–1636.
[11] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, Proc. BMVC (2019).
[12] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote heart rate measurement from highly compressed facial videos: An end-to-end deep learning solution with video enhancement, in: Proc. IEEE ICCV, 2019.
[13] W. Chen, D. Mcduff, Deepphys: Video-based physiological measurement using convolutional attention networks, Proc. ECCV (2018) 356–373.
[14] R. Song, H. Chen, J. Cheng, C. Li, Y. Liu, X. Chen, Pulsegan: Learning to generate realistic pulse waveforms in remote photoplethysmography, IEEE Journal of Biomedical and Health Informatics 25 (2021) 1373–1384. doi:10.1109/JBHI.2021.3051176.
[15] H. Wang, E. Ahn, J. Kim, Self-supervised representation learning framework for remote physiological measurement using spatiotemporal augmentation loss, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 2431–2439.
[16] J. Gideon, S. Stent, The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3995–4004.
[17] Z. Sun, X. Li, Contrast-phys: Unsupervised video-based remote physiological measurement via spatiotemporal contrast, in: European Conference on Computer Vision, Springer, 2022, pp. 492–510.
[18] J. Speth, N. Vance, P. Flynn, A. Czajka, Non-contrastive unsupervised learning of physiological signals from video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14464–14474.
[19] Y. Yang, X. Liu, J. Wu, S. Borac, D. Katabi, M.-Z. Poh, D. McDuff, Simper: Simple self-supervised learning of periodic targets, in: The Eleventh International Conference on Learning Representations, 2023.
[20] Z. Sun, X. Li, Contrast-phys+: Unsupervised and weakly-supervised video-based remote physiological measurement via spatiotemporal contrast, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[21] X. Li, H. Han, H. Lu, X. Niu, Z. Yu, A. Dantcheva, G. Zhao, S. Shan, The 1st challenge on remote physiological signal sensing (repss), in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 314–315.
[22] X. Li, H. Sun, Z. Sun, H. Han, A. Dantcheva, S. Shan, G. Zhao, The 2nd challenge on remote physiological signal sensing (repss), in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2404–2413.
[23] L. Xie, X. Wang, H. Zhang, C. Dong, Y. Shan, Vfhq: A high-quality dataset and benchmark for video face super-resolution, in: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
[24] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, FaceForensics++: Learning to detect manipulated facial images, in: International Conference on Computer Vision (ICCV), 2019.
[25] L. Jiang, R. Li, W. Wu, C. Qian, C. C. Loy, DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection, in: CVPR, 2020.
[26] H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, C. C. Loy, CelebV-HQ: A large-scale video facial attributes dataset, in: ECCV, 2022.
[27] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, J. F. Cohn, Disfa: A spontaneous facial action intensity database, IEEE Transactions on Affective Computing 4 (2013) 151–160.
[28] S. Petridis, B. Martinez, M. Pantic, The mahnob laughter database, Image and Vision Computing 31 (2013) 186–202.
[29] X. Niu, S. Shan, H. Han, X. Chen, Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Trans. Image Processing (2019).
[30] X. Niu, H. Han, S. Shan, X. Chen, Vipl-hr: A multi-modal database for pulse estimation from less-constrained face video, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 562–576.
[31] P.-J. Toye, Vital videos: A dataset of videos with ppg and blood pressure ground truths, arXiv preprint arXiv:2306.11891 (2023).
[32] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G. Zhao, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: Proc. IEEE FG, 2018, pp. 1–6.
[33] J. Xiang, G. Zhu, Joint face detection and facial expression recognition with mtcnn, in: 2017 4th international conference on information science and control engineering (ICISCE), IEEE, 2017, pp. 424–427.
[34] Z. Li, L. Yin, Contactless pulse estimation leveraging pseudo labels and self-supervision, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20588–20597.
[35] W. Qian, D. Guo, K. Li, X. Zhang, X. Tian, X. Yang, M. Wang, Dual-path tokenlearner for remote photoplethysmography-based physiological measurement with facial videos, IEEE Transactions on Computational Social Systems (2024).
[36] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE signal processing letters 23 (2016) 1499–1503.
[37] H. Shao, L. Luo, J. Qian, S. Chen, C. Hu, J. Yang, Tranphys: Spatiotemporal masked transformer steered remote photoplethysmography estimation, IEEE Transactions on Circuits and Systems for Video Technology (2023).
[38] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, G. Zhao, Physformer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4186–4196.
[39] M. Kachuee, M. M. Kiani, H. Mohammadzade, M. Shabany, Cuffless blood pressure estimation algorithms for continuous health-care monitoring, IEEE Transactions on Biomedical Engineering 64 (2016) 859–869.
[40] Z.-D. Liu, Y. Li, Y.-T. Zhang, J. Zeng, Z.-X. Chen, Z.-W. Cui, J.-K. Liu, F. Miao, Cuffless blood pressure measurement using smartwatches: a large-scale validation study, IEEE Journal of Biomedical and Health Informatics 27 (2023) 4216–4227.