Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

Wei Qian1,†, Qi Li3,4,†, Kun Li5,∗, Xinke Wang4,3, Xiao Sun1,2,3, Meng Wang1,2,3 and Dan Guo1,2,3,6,∗

1 School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology (HFUT)
2 Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education
3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China
4 Anhui University, China
5 Zhejiang University, China
6 Anhui Zhonghuitong Technology Co., Ltd.

Abstract
This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 of self-supervised heart rate measurement in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which can effectively capture subtle rPPG clues and leverage the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. Meanwhile, we employ an excellent end-to-end solution based on contrastive learning, aiming to generalize across different scenarios from complementary perspectives. Finally, we combine the strengths of the above solutions through an ensemble strategy to generate the final predictions, leading to a more accurate HR estimation. As a result, our solutions achieved a remarkable RMSE score of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.

Keywords
Self-supervised, heart rate, rPPG, spatial-temporal modeling, contrastive learning

The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge & Workshop, Aug 3–9, 2024, Jeju, South Korea
∗ Corresponding authors.
† These authors contributed equally.
qianwei.hfut@gmail.com (W. Qian); liqi@stu.ahu.edu.cn (Q. Li); kunli.hfut@gmail.com (K. Li); xinkewang689@gmail.com (X. Wang); sunx@hfut.edu.cn (X. Sun); eric.mengwang@gmail.com (M. Wang); guodan@hfut.edu.cn (D. Guo)
ORCID: 0009-0007-9467-6296 (W. Qian); 0000-0002-8655-5781 (Q. Li); 0000-0001-5083-2145 (K. Li); 0009-0002-8399-8322 (X. Wang); 0000-0001-9750-7032 (X. Sun); 0000-0002-3094-7735 (M. Wang); 0000-0003-2594-254X (D. Guo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Remote physiological measurement [1, 2, 3, 4, 5] has emerged as a promising field with significant applications in healthcare, wellness monitoring, and human-computer interaction. Traditional methods for physiological measurement, such as electrocardiograms (ECG) and photoplethysmograms (PPG), require direct contact with the skin, which can be cumbersome and inconvenient for continuous monitoring. With the great success of deep learning in computer vision [6, 7, 8, 9, 10], recent advancements [11, 12] have paved the way for non-contact, video-based techniques to estimate physiological signals such as heart rate (HR) and respiratory rate (RR) from facial videos, providing a more comfortable and accessible approach for users.
Despite the promising potential of video-based physiological measurement, most existing methods [13, 5, 3] rely heavily on supervised learning, necessitating large amounts of labeled data for training. Acquiring such labeled data is often labor-intensive and time-consuming, posing a significant bottleneck for developing robust and generalizable models. Moreover, supervised methods may not generalize well across different environments and lighting conditions, limiting their practical applicability. Therefore, the development of label-free rPPG estimation methods is becoming increasingly urgent.

To address these challenges, the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge at IJCAI 2024 was launched. This challenge aims to develop self-supervised training methods for HR measurement using unlabeled facial videos, thereby reducing the dependency on extensive labeled datasets. For this challenge, we present two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. Inspired by Dual-TL [3] and SiNC [14], we propose a non-end-to-end self-supervised HR measurement framework based on a spatial-temporal Transformer to capture subtle rPPG clues. Meanwhile, we adopt a complementary end-to-end contrastive learning solution based on Contrast-Phys+ [11] to enhance the model accuracy. Finally, we combine the strengths of both solutions through an ensemble strategy to generate the final predictions, securing second place with an RMSE of 8.85277. In conclusion, the main contributions can be summarized as follows:

• We propose a non-end-to-end self-supervised solution based on spatial-temporal modeling. By considering the priors of periodicity consistency and bandwidth limitation of the rPPG signal, we introduce four loss functions to supervise the model effectively.
• We present an end-to-end solution based on contrastive learning, which utilizes a 3DCNN to extract features and employs a contrastive loss to learn discriminative representations for periodic rPPG signal modeling.
• Our solution achieved second place with an RMSE of 8.85277 on the test dataset in Track 1 of the 3rd Vision-based Remote Physiological Signal Sensing Challenge. The experimental results demonstrate the effectiveness and robustness of our proposed solutions.

2. Methodology

2.1. Solution 1: Self-supervised HR Measurement with Spatial-Temporal Transformer

Inspired by the great success of the Transformer in computer vision [15], we present a non-end-to-end self-supervised HR measurement framework based on a Spatial-Temporal Transformer to mitigate the need for labeled video data. The overview of this solution is illustrated in Figure 1.

Figure 1: Overview of the proposed Solution 1. Given an input facial video with $T$ frames, we obtain $N$ facial ROIs for each frame and extract the MSTmap representation $M \in \mathbb{R}^{T \times N \times C}$ for the video, where $N$ is the number of facial ROI combinations. A feature embedding layer is used to project the MSTmap to the high-dimensional feature $X \in \mathbb{R}^{T \times N \times D}$. Then, we stack the spatial-temporal Transformer for $L$ loops to capture subtle rPPG clues. Next, an rPPG regression head is used to output the rPPG signal $y_{pred} \in \mathbb{R}^{T \times 1}$. Finally, we apply four self-supervised losses ($\mathcal{L}_{band}$, $\mathcal{L}_{sparse}$, $\mathcal{L}_{var}$, and $\mathcal{L}_{perio}$) to constrain the model.
Specifically, we first transform the input facial video into a multi-scale spatial-temporal map (MSTmap) in Section 2.1.1. Then, we introduce our spatial-temporal Transformer module in Section 2.1.2. Next, in Section 2.1.3, with the constraints of periodicity consistency and bandwidth finiteness, our model directly discovers blood volume pulses from unlabeled videos to predict HR.

2.1.1. Data Pre-processing

The quasi-periodic pulse signal originates from subtle light reflections of blood vessels under the skin. Therefore, non-skin pixels and facial geometric features can be considered as rPPG-independent noise. We transform the raw facial video into an MSTmap to highlight the spatiotemporal information of the human face, which is a common practice in rPPG measurement [16, 17]. Concretely, the MSTmap divides the facial area into 6 meta-ROI blocks, which can generate $N = 2^6 - 1 = 63$ ROI combination blocks, and the pixels of each block are averaged separately for $C$ color channels. All the frames of the video are concatenated along the time dimension to generate a spatial-temporal map of size $\mathbb{R}^{T \times N \times C}$, where $C = 6$ represents the {R, G, B, Y, U, V} channels. Next, we embed the MSTmap $M$ into the high-dimensional feature $X \in \mathbb{R}^{T \times N \times D}$ with feature dimension $D$ by using a fully-connected layer.

2.1.2. Spatial-Temporal Transformer

Our spatial-temporal Transformer, tailored for remote physiological measurement, is carefully designed to perceive temporal and spatial correlations. It includes two encoders (a spatial encoder and a temporal encoder) that refine the ROI representation containing rPPG clues by capturing long-term spatiotemporal contextual information. We now explain the proposed model in detail. Specifically, given the input features $X \in \mathbb{R}^{T \times N \times D}$, the process of embedding spatial context for the $t$-th frame can be formulated as:

$$
\begin{aligned}
Q^{(t)} &= X^{(t)} W_{tq}, \quad K^{(t)} = X^{(t)} W_{tk}, \quad V^{(t)} = X^{(t)} W_{tv}, \\
Z^{(t)} &= \mathrm{softmax}\!\left(\frac{Q^{(t)} {K^{(t)}}^{\top}}{\sqrt{D}}\right) V^{(t)} + X^{(t)}, \\
Z'^{(t)} &= \mathrm{MLP}(\mathrm{LN}(Z^{(t)})) + Z^{(t)},
\end{aligned}
\tag{1}
$$

where $W_{tq}$, $W_{tk}$, $W_{tv}$ are learnable parameters shaped as $D \times D$, and $X^{(t)}$ denotes the feature of the $t$-th frame. MLP is the multi-layer perceptron and LN is the layer normalization operation. The feature maps of all frames $\{Z'^{(t)} \mid t = 1, \dots, T\}$ are concatenated together into $Z_s \in \mathbb{R}^{T \times N \times D}$.

The other, complementary module is applied to enhance the input rPPG features with temporal dynamical transition clues and to enrich the temporal context by highlighting the informative features along the time dimension for each facial ROI. Our temporal encoder follows Eq. (1); the difference is that attention is computed along the temporal dimension for each spatial unit ($n \in [1, N]$). We output the temporally correlated feature for the $n$-th facial ROI as $Z'^{(n)} \in \mathbb{R}^{T \times D}$ and stack the features $\{Z'^{(n)} \mid n = 1, 2, \dots, N\}$ together, represented by $Z_t \in \mathbb{R}^{N \times T \times D}$. The spatial and temporal encoders are stacked for $L$ loops in an alternating manner, taking the spatial and temporal complementary contextual information into account integrally. Moreover, spatial and temporal position embeddings are applied only to the first encoder to retain the two kinds of position information. Finally, we use an rPPG regression head to project the feature to a 1D rPPG signal $y_{pred} \in \mathbb{R}^{T \times 1}$.
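To make the alternating spatial-temporal encoding of Eq. (1) concrete, a minimal PyTorch sketch of one spatial-then-temporal loop is given below. It assumes single-head self-attention, batch-first tensors of shape $(B, T, N, D)$, and an ROI-averaging regression head; all module and variable names are illustrative rather than the exact implementation.

```python
# A minimal sketch of one spatial-temporal encoding loop (Eq. 1), assuming single-head
# self-attention and batch-first tensors of shape (B, T, N, D).
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Eq. (1): Z = softmax(QK^T / sqrt(D)) V + X, followed by Z' = MLP(LN(Z)) + Z."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # W_q, W_k, W_v stacked
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B', L, D), where L is the attended axis (N for spatial, T for temporal)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        z = attn @ v + x
        return self.mlp(self.norm(z)) + z


class SpatialTemporalBlock(nn.Module):
    """One loop: spatial attention within each frame, then temporal attention within each ROI."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = AttentionBlock(dim)
        self.temporal = AttentionBlock(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d)).reshape(b, t, n, d)   # attend over the N ROIs
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, d)                 # regroup tokens by ROI
        x = self.temporal(x).reshape(b, n, t, d).permute(0, 2, 1, 3)   # attend over the T frames
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 300, 63, 128)  # (B, T, N, D): 10 s clips, 63 ROI combinations, D = 128
    encoder = nn.Sequential(*[SpatialTemporalBlock(128) for _ in range(6)])  # L = 6 loops
    rppg_head = nn.Linear(128, 1)
    # Average over ROIs, then project to 1-D: one possible form of the regression head.
    y_pred = rppg_head(encoder(feats).mean(dim=2)).squeeze(-1)
    print(y_pred.shape)  # torch.Size([2, 300])
```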
2.1.3. Self-supervised Loss

As highlighted in previous studies [18, 14], the rPPG signal possesses inherent theoretical priors, including a specific bandwidth in the frequency domain. By incorporating this prior knowledge, we employ three self-supervised loss functions from [14] in this work. Additionally, to train the model more effectively, we also propose a new periodicity loss based on the periodic characteristics of the rPPG signal. Notably, all predicted rPPG signals are transformed into the power spectral density (PSD) with the Fast Fourier Transform (FFT) before computing all losses in our method, denoted as $F = \mathrm{FFT}(y)$.

Bandwidth Loss. A healthy HR falls within a specific frequency range. Following [14], we penalize the model for producing signals that exceed the healthy HR bandwidth limits. Consequently, the bandwidth loss can be formalized as follows:

$$
\mathcal{L}_{band} = \frac{1}{\sum_{i=-\infty}^{\infty} F_i} \left[ \sum_{i=-\infty}^{a} F_i + \sum_{i=b}^{\infty} F_i \right],
\tag{2}
$$

where $a$ and $b$ denote the lower and upper band limits, respectively, and $F_i$ is the power in the $i$-th frequency bin of the predicted signal. In our experiments, we specify the limits as $a = 0.66$ Hz and $b = 3$ Hz, which corresponds to a common pulse rate range from 40 bpm to 180 bpm. This range effectively captures the typical variations in a healthy HR, ensuring that our model focuses on the relevant frequency components while minimizing the influence of noise. By incorporating this bandwidth loss, our model is better equipped to distinguish between meaningful rPPG signals and disturbances, ultimately leading to more accurate HR estimation.

Sparsity Loss. Since we are primarily interested in the heartbeat frequency, we emphasize the periodic heartbeats by suppressing non-heartbeat frequencies. Following [14], we penalize the energy in the bandwidth regions far away from the spectral peak, which ensures that the model focuses on the relevant heartbeat frequencies. It can be formulated as:

$$
\mathcal{L}_{sparse} = \frac{1}{\sum_{i=a}^{b} F_i} \left[ \sum_{i=a}^{\arg\max(F) - \Delta F} F_i + \sum_{i=\arg\max(F) + \Delta F}^{b} F_i \right],
\tag{3}
$$

where $\arg\max(F)$ is the frequency of the spectral peak, and $\Delta F = 6$ is the frequency padding around the peak. This loss enhances the model's ability to accurately estimate HR by ensuring that the spectral energy is concentrated around the true HR frequencies, thus minimizing the influence of noise and other non-relevant frequency components.

Variance Loss. To avoid the model collapsing to a specific frequency, we also use a variance loss [14, 19] to spread the variance of the power spectral density into a uniform distribution over the desired frequency band. First, we define a uniform prior distribution $P$ over $d$ frequencies. Then, we consider a batch of $n$ spectral densities, represented as $F = [v_1, \dots, v_n]$, where each $v_i$ is a $d$-dimensional frequency decomposition of a predicted waveform. To aggregate these spectral densities, we compute the normalized sum across the batch, denoted as $Q$. Therefore, the variance loss $\mathcal{L}_{var}$ can be formulated as:

$$
\mathcal{L}_{var} = \frac{1}{d} \sum_{i=1}^{d} \left( \mathrm{CDF}_i(Q) - \mathrm{CDF}_i(P) \right)^2,
\tag{4}
$$

where $\mathrm{CDF}_i$ represents the cumulative distribution function at the $i$-th frequency.

Periodicity Loss. In addition to the intrinsic properties of the rPPG signal itself, we have observed that adjacent rPPG signals do not change rapidly over short periods. This is typically manifested as similar periodicity in neighboring rPPG signals, meaning they share a dominant peak in the PSD. Specifically, we uniformly sample $S$ non-overlapping temporal segments from a short rPPG signal (e.g., 10 s). The PSDs of these segments should be similar. Thus, our proposed periodicity loss can be formulated as:

$$
\mathcal{L}_{perio} = \sum_{j=1}^{S-1} \sum_{i=-\infty}^{\infty} \left( F_i^{j} - F_i^{j+1} \right)^2,
\tag{5}
$$

where $S = 3$ denotes the number of segments. In summary, the overall loss function of our self-supervised learning strategy is:

$$
\mathcal{L}_{total} = \mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var} + \mathcal{L}_{perio}.
\tag{6}
$$
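A compact sketch of how the losses in Eqs. (2)–(6) can be computed from PSDs is shown below. It assumes the PSDs are obtained with torch.fft.rfft over 30 fps clips; the function names and the exact frequency-bin handling are illustrative assumptions, not the exact training code.

```python
# Sketch of the self-supervised losses in Eqs. (2)-(6), assuming predicted rPPG signals
# of shape (B, T) sampled at 30 fps. Names and bin handling are illustrative.
import torch


def psd(signal: torch.Tensor, fps: float = 30.0):
    """PSD of a batch of signals (B, T) -> (B, K), plus the frequency grid in Hz."""
    spec = torch.fft.rfft(signal - signal.mean(dim=-1, keepdim=True), dim=-1)
    freqs = torch.fft.rfftfreq(signal.shape[-1], d=1.0 / fps)
    return spec.abs() ** 2, freqs


def bandwidth_loss(F, freqs, low=0.66, high=3.0):
    """Eq. (2): penalize spectral energy outside the 40-180 bpm band."""
    outside = (freqs < low) | (freqs > high)
    return (F[:, outside].sum(dim=-1) / F.sum(dim=-1)).mean()


def sparsity_loss(F, freqs, low=0.66, high=3.0, delta=6):
    """Eq. (3): penalize in-band energy farther than `delta` bins from the spectral peak."""
    band = (freqs >= low) & (freqs <= high)
    Fb = F[:, band]
    peak = Fb.argmax(dim=-1)                                # argmax(F) per sample
    bins = torch.arange(Fb.shape[-1], device=F.device)
    far = (bins[None, :] - peak[:, None]).abs() > delta     # bins away from the peak
    return ((Fb * far).sum(dim=-1) / Fb.sum(dim=-1)).mean()


def variance_loss(F, freqs, low=0.66, high=3.0):
    """Eq. (4): match the normalized batch spectrum to a uniform prior via CDFs."""
    band = (freqs >= low) & (freqs <= high)
    Q = F[:, band].sum(dim=0)
    Q = Q / Q.sum()                                         # normalized sum across the batch
    P = torch.full_like(Q, 1.0 / Q.numel())                 # uniform prior over d frequencies
    return ((Q.cumsum(0) - P.cumsum(0)) ** 2).mean()


def periodicity_loss(signal: torch.Tensor, fps: float = 30.0, segments: int = 3):
    """Eq. (5): PSDs of S non-overlapping segments of one clip should agree."""
    psds = [psd(chunk, fps)[0] for chunk in signal.chunk(segments, dim=-1)]
    return sum(((psds[j] - psds[j + 1]) ** 2).sum(dim=-1).mean() for j in range(segments - 1))


if __name__ == "__main__":
    y = torch.randn(4, 300, requires_grad=True)             # batch of predicted 10 s signals
    F, freqs = psd(y)
    total = (bandwidth_loss(F, freqs) + sparsity_loss(F, freqs)
             + variance_loss(F, freqs) + periodicity_loss(y))   # Eq. (6)
    total.backward()
```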
Figure 2: Overview of Solution 2. In the pre-train stage, the model is trained in a contrastive learning-based self-supervised manner. After that, the pre-trained model is fine-tuned with supervised losses (Pearson loss and MCC loss).

2.2. Solution 2: Self-supervised HR Measurement with Contrastive Learning

Here we provide the end-to-end self-supervised HR measurement framework based on the contrastive learning strategy. The framework is depicted in Figure 2. Specifically, we first perform data pre-processing in Section 2.2.1. Then we pre-train the proposed model in an unsupervised setting based on Contrast-Phys+ [11] in Section 2.2.2. Finally, we fine-tune the Contrast-Phys+ model in a supervised setting and obtain the final rPPG predictor in Section 2.2.3.

2.2.1. Data Pre-processing

In this self-supervised manner, we input the facial video into our model to estimate the final rPPG signal. For an original video, we first perform face detection by MTCNN [20] to get the four coordinates of the face bounding box from the first frame. Then, we enlarge the length and width of the bounding box by 1.5 times and crop the face region for each frame of the video. The cropped faces are resized to 128 × 128. Next, we segment each video into clips to feed into the model. Note that we also perform frame difference operations on the clip to generate normalized difference frames as an alternative model input. The difference between two consecutive frames can be formulated as:

$$
\Delta V_t = V_{t+1} - V_t,
\tag{7}
$$

where $V_t$ denotes the $t$-th frame. To keep the length of the difference video equal to that of the raw video, we simply repeat the last difference frame. Then, $\Delta V$ is normalized.

2.2.2. Pre-training

In this stage, following the setting of [11], we modify the 3DCNN-based PhysNet to obtain the spatiotemporal rPPG (ST-rPPG) block representation. The model outputs spatiotemporal rPPG features with shape $T \times S \times S$, where $T$ is the temporal length and $S$ is the spatial dimension. The ST-rPPG block can be regarded as a collection of rPPG signals from different facial regions. Therefore, for each input, we can sample $S^2$ rPPG signals of length $T$. According to the observations on rPPG spatial similarity and temporal similarity in [11], the ST-rPPG block can sample multiple rPPG signals with short time intervals and different spatial positions, and those signals should be similar. Contrastive learning can then be formulated by pulling together the rPPG signals from the same ST-rPPG block and pushing away the signals from the ST-rPPG block of the other video. The contrastive loss can be formulated as:

$$
\mathcal{L}_{pos} = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left( \| f_i - f_j \|^2 + \| f_i' - f_j' \|^2 \right) / \left( 2N(N-1) \right),
\tag{8}
$$

$$
\mathcal{L}_{neg} = - \sum_{i=1}^{N} \sum_{j=1}^{N} \| f_i - f_j' \|^2 / N^2,
\tag{9}
$$

$$
\mathcal{L}_{ctr} = \mathcal{L}_{pos} + \mathcal{L}_{neg},
\tag{10}
$$

where $f_i$ denotes the power spectral density (PSD) of the rPPG signal at position $i$ of one video, $f_i'$ is the corresponding PSD from the other video, and $N$ is the number of sampled rPPG pairs. The contrastive loss function minimizes the MSE distance between positive samples and maximizes the distance between negative samples, forcing the model to learn discriminative representations of the underlying signals from different videos.
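The contrastive objective of Eqs. (8)–(10) can be sketched as follows, assuming the $N$ rPPG signals sampled from the ST-rPPG blocks of the two videos have already been converted to PSDs. Shapes and names are illustrative, not the Contrast-Phys+ reference code.

```python
# Sketch of the contrastive objective in Eqs. (8)-(10): f and f_prime hold the PSDs of
# N rPPG signals sampled from the ST-rPPG blocks of two different videos.
import torch


def contrastive_loss(f: torch.Tensor, f_prime: torch.Tensor) -> torch.Tensor:
    """f, f_prime: (N, K) PSDs from video 1 and video 2, respectively."""
    n = f.shape[0]

    def pairwise_sq(a, b):
        # (N, N) matrix of squared L2 distances ||a_i - b_j||^2
        return ((a[:, None, :] - b[None, :, :]) ** 2).sum(dim=-1)

    off_diag = ~torch.eye(n, dtype=torch.bool, device=f.device)
    # Eq. (8): pull together PSDs sampled from the same video
    l_pos = (pairwise_sq(f, f)[off_diag].sum()
             + pairwise_sq(f_prime, f_prime)[off_diag].sum()) / (2 * n * (n - 1))
    # Eq. (9): push apart PSDs sampled from different videos
    l_neg = -pairwise_sq(f, f_prime).sum() / n ** 2
    return l_pos + l_neg                                    # Eq. (10)


if __name__ == "__main__":
    f = torch.rand(8, 64, requires_grad=True)               # 8 sampled rPPG PSDs from video 1
    f2 = torch.rand(8, 64, requires_grad=True)              # 8 sampled rPPG PSDs from video 2
    contrastive_loss(f, f2).backward()
```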
2.2.3. Fine-tuning

With the pre-trained 3DCNN-based PhysNet model, we use the officially designated dataset to fine-tune it in a supervised manner. Specifically, in this stage, we modify the output of the model by averaging over the spatial dimension to obtain a predicted rPPG signal. Given the predicted rPPG signal $y_{pred}$ and the ground-truth PPG signal $y_{gt}$, the popular negative Pearson correlation (Pear) loss and negative maximum cross-correlation (MCC) loss are selected to perform supervised training. It is worth noting that the Pear loss operates in the time domain, while the MCC loss operates in the frequency domain. The MCC loss is robust to temporal offsets in the ground truth, which compensates for the Pear loss. The MCC loss is formulated as:

$$
\mathcal{L}_{mcc} = - \mathrm{Max} \left( \frac{\mathrm{FFT}^{-1} \{ \mathrm{BPass} ( \mathrm{FFT} \{ y_{pred} \} \cdot \mathrm{FFT} \{ y_{gt} \} ) \}}{\sigma_{y_{pred}} \times \sigma_{y_{gt}}} \right),
\tag{11}
$$

where $\mathrm{FFT}^{-1}$ is the inverse fast Fourier transform (FFT), $\mathrm{BPass}$ denotes band-pass filtering, and $\sigma$ is the standard deviation. Besides, as the ground-truth signals are the reference for the predicted rPPG signals, $y_{pred}$ should be similar to $y_{gt}$. Therefore, we also use a contrastive loss as follows:

$$
\mathcal{L}_{pos}^{gt} = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left( \| f_i - g_j \|^2 + \| f_i' - g_j' \|^2 \right) / \left( 2N(N-1) \right),
\tag{12}
$$

$$
\mathcal{L}_{neg}^{gt} = - \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \| f_i - g_j' \|^2 + \| f_i' - g_j \|^2 \right) / N^2,
\tag{13}
$$

where $g$ denotes the PSDs of the ground-truth signal. Finally, the overall loss for fine-tuning is the combination of the Pear loss, the MCC loss, and the contrastive loss, which can resist noise interference in the ground-truth signal:

$$
\mathcal{L}_{s} = \mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha \mathcal{L}_{pear} + \beta \mathcal{L}_{mcc},
\tag{14}
$$

where $\mathcal{L}_{pear}$ is the negative Pearson correlation loss. In our experiments, we set $\alpha$ to 0.1 and $\beta$ to 0.2 for the VIPL-V2 dataset.
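A sketch of the two supervised terms is given below. The negative Pearson loss follows its standard definition, and the MCC loss of Eq. (11) is implemented here via the cross-correlation theorem (multiplying the prediction's spectrum by the conjugate of the ground truth's), with BPass approximated by a hard 0.66–3 Hz frequency mask and a 1/T normalization; these implementation details are assumptions for illustration, not the exact training code.

```python
# Sketch of the supervised fine-tuning terms: negative Pearson loss and the negative
# maximum cross-correlation (MCC) loss of Eq. (11). BPass is approximated by a hard
# 0.66-3 Hz mask, which is an assumption; names are illustrative.
import torch


def neg_pearson_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation, averaged over the batch. pred, gt: (B, T)."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    gt = gt - gt.mean(dim=-1, keepdim=True)
    corr = (pred * gt).sum(-1) / (pred.norm(dim=-1) * gt.norm(dim=-1) + 1e-8)
    return (1 - corr).mean()


def neg_mcc_loss(pred: torch.Tensor, gt: torch.Tensor, fps: float = 30.0,
                 low: float = 0.66, high: float = 3.0) -> torch.Tensor:
    """Eq. (11): band-passed cross-correlation via FFT, normalized by the std's."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    gt = gt - gt.mean(dim=-1, keepdim=True)
    spec = torch.fft.rfft(pred, dim=-1) * torch.conj(torch.fft.rfft(gt, dim=-1))
    freqs = torch.fft.rfftfreq(pred.shape[-1], d=1.0 / fps).to(pred.device)
    spec = spec * ((freqs >= low) & (freqs <= high)).float()   # keep only the pulse band
    xcorr = torch.fft.irfft(spec, n=pred.shape[-1], dim=-1)    # cross-correlation over lags
    # The extra 1/T factor makes the maximum comparable to a correlation coefficient.
    mcc = xcorr.max(dim=-1).values / (pred.std(dim=-1) * gt.std(dim=-1) * pred.shape[-1] + 1e-8)
    return -mcc.mean()


if __name__ == "__main__":
    y_pred, y_gt = torch.randn(4, 300, requires_grad=True), torch.randn(4, 300)
    # alpha = 0.1 and beta = 0.2, as used for the VIPL-V2 dataset in Eq. (14).
    loss = 0.1 * neg_pearson_loss(y_pred, y_gt) + 0.2 * neg_mcc_loss(y_pred, y_gt)
    loss.backward()
```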
3. Experiments

3.1. Datasets

UBFC-rPPG [21] is a commonly used clean dataset for physiological estimation. It records 42 facial videos from 42 subjects in a stable lab environment. PURE [22] contains 60 facial videos of 10 participants under 6 modes (steady, small rotation, medium rotation, talking, slow translation, and fast translation). MMSE-HR [23] contains 102 facial videos captured from 40 subjects under six task modes; this dataset contains various facial expression changes. DISFA [24] is a non-posed facial expression dataset. It records 27 facial videos from 27 subjects with different ethnicities [25]. VIPL-V2 [26] is the second version of the VIPL-HR [26] dataset for remote HR estimation from face videos under less-constrained situations, and it contains the 2,000 RGB videos provided in this challenge [16, 17]. The OBF [2] dataset contains 100 healthy subjects and 6 patients with atrial fibrillation, totaling 10,600 minutes of recordings [13]; in this challenge, some OBF data are included in the test set. Following the rules of this challenge, we pre-train the model on the datasets other than VIPL-V2 and OBF without using labels, and fine-tune the model on the VIPL-V2 dataset.

3.2. Evaluation Metrics and Implementation Details

In this challenge, the root mean squared error (RMSE) between the predicted HR $y_{pred}$ and the ground-truth HR $y_{gt}$ is selected as the evaluation metric:

$$
\mathrm{RMSE}(y_{pred}, y_{gt}) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{pred}^{i} - y_{gt}^{i} \right)^2},
\tag{15}
$$

where $N$ denotes the number of video samples.

For Solution 1 introduced in Section 2.1, we begin by extracting the facial ROI regions using the landmark detection tool of OpenFace during the data pre-processing step. We then follow the setting described in [17], applying a sliding window of 300 frames (10 s) with a step size of 15 frames (0.5 s) to generate MSTmaps from the facial videos. For the spatial-temporal Transformer module, we set the dimensionality $D$ to 128 and the number of layers $L$ to 6. During pre-training, we use the AdamW optimizer with a learning rate of 1e-4 and a batch size of 4. Data augmentation techniques, including random horizontal and vertical flipping as well as frequency up-/down-sampling, are used. In the fine-tuning step with data labels, in addition to the self-supervised loss, we also add the negative Pearson loss to further optimize the model. Besides, we use a smaller learning rate, i.e., 1e-5, to fine-tune the model. For the VIPL-V2 dataset, we split the training and validation subsets in a ratio of 8:2. For the HR estimation inference step, following previous work [3, 4], we apply a 1st-order Butterworth filter with a cutoff frequency range of [0.66 Hz, 3.0 Hz], corresponding to [40, 180] beats per minute, to the predicted rPPG signal. Subsequently, we compute the PSD [27] to estimate the HR for each video clip.

Table 1
The ablation study results of our Solution 1 on the test dataset.

| Pre-training | Fine-tuning | Loss | RMSE↓ (bpm) |
|---|---|---|---|
| UBFC-rPPG | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var}$ | 13.88440 |
| UBFC-rPPG | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var} + \mathcal{L}_{perio}$ | 12.30601 |
| UBFC-rPPG + PURE | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var}$ | 11.52003 |
| UBFC-rPPG + PURE | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var} + \mathcal{L}_{perio}$ | 10.67180 |
| UBFC-rPPG + PURE + MMSE-HR | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var}$ | 10.36720 |
| UBFC-rPPG + PURE + MMSE-HR | VIPL-V2 | $\mathcal{L}_{band} + \mathcal{L}_{sparse} + \mathcal{L}_{var} + \mathcal{L}_{perio}$ | 9.93125 |

For Solution 2 elaborated in Section 2.2, we resample the videos to a frame rate of 30 fps and then perform face detection and cropping. We set the length of each video clip to 300 frames without overlapping. Following the setting in [11], the spatial resolution $S$ is set to 2, and the sampled time interval $\Delta t$ of each rPPG signal is set to 150 frames. Other settings are the same as in Solution 1. For the ensemble strategy, we take the multiple best prediction results obtained under different settings of both Solution 1 and Solution 2, and then average the different predicted heart rates of each sample as the final result.
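The inference and evaluation steps described above can be sketched as follows, assuming SciPy's Butterworth filter and Welch's PSD estimate [27]; parameter choices such as nperseg and the synthetic test signal are illustrative.

```python
# Sketch of the HR inference step: band-pass the predicted rPPG signal with a 1st-order
# Butterworth filter (0.66-3 Hz) and take the dominant peak of the Welch PSD [27],
# followed by the RMSE metric of Eq. (15). Assumes NumPy/SciPy; names are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt, welch


def estimate_hr(rppg: np.ndarray, fps: float = 30.0, low: float = 0.66, high: float = 3.0) -> float:
    """Return the heart rate in bpm for one predicted rPPG clip."""
    b, a = butter(N=1, Wn=[low, high], btype="bandpass", fs=fps)   # 1st-order Butterworth
    filtered = filtfilt(b, a, rppg - rppg.mean())
    freqs, psd = welch(filtered, fs=fps, nperseg=len(filtered))
    band = (freqs >= low) & (freqs <= high)
    return 60.0 * freqs[band][np.argmax(psd[band])]                # dominant frequency -> bpm


def rmse(hr_pred: np.ndarray, hr_gt: np.ndarray) -> float:
    """Eq. (15): root mean squared error over all test videos."""
    return float(np.sqrt(np.mean((hr_pred - hr_gt) ** 2)))


if __name__ == "__main__":
    t = np.arange(300) / 30.0                                       # a synthetic 10 s clip at 30 fps
    clip = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(300) # 1.2 Hz pulse (72 bpm) + noise
    print(estimate_hr(clip))                                        # approximately 72
```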
3.3. Experimental Results

Results for Solution 1. As shown in Table 1, we investigate the impact of different pre-training datasets and loss functions for Solution 1. The results indicate that as the amount of pre-training data increases, the performance of the model improves accordingly. In our solution, we ultimately select the UBFC-rPPG [21], PURE [22], and MMSE-HR [23] datasets for pre-training. Additionally, we also investigate the impact of the proposed periodicity loss $\mathcal{L}_{perio}$. We can see that incorporating the periodicity loss consistently and significantly improves the performance of the model across different settings. For instance, when the model is pre-trained on the UBFC-rPPG, PURE, and MMSE-HR datasets, introducing the periodicity loss reduces the RMSE from 10.36720 to 9.93125. This improvement underscores the effectiveness of the periodicity loss in mitigating abnormal periodic fluctuations in the predicted signal and maintaining temporal periodicity consistency.

Results for Solution 2. As shown in Table 2, we evaluate different pre-training datasets, loss functions, and model inputs to find the best setting for this task. Note that the DISFA dataset is a non-posed facial expression database; however, the results show that using it for pre-training can still achieve comparable performance. Apart from that, we reach the same conclusion as for Solution 1: increasing the amount of pre-training data is beneficial to performance. In this solution, we choose DISFA, UBFC-rPPG, MMSE-HR, and PURE for pre-training. Additionally, we also evaluate different combinations of the supervised loss $\mathcal{L}_s$. The results show that both the time-domain and frequency-domain losses are helpful for model fine-tuning. Moreover, we evaluate the performance of the normalized frame-difference input, and it shows a result comparable to the normal input. In the model ensemble phase, we therefore included the frame-difference-based variant as an additional feature form.

Table 2
The ablation study results of our Solution 2 on the test dataset. * denotes the normalized difference frames as model input.

| Pre-training | Fine-tuning | Loss | RMSE↓ (bpm) |
|---|---|---|---|
| DISFA | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt}$ | 11.81139 |
| DISFA | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha\mathcal{L}_{pear}$ | 12.01150 |
| DISFA | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \beta\mathcal{L}_{mcc}$ | 11.29330 |
| DISFA + MMSE-HR | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt}$ | 11.35523 |
| DISFA + MMSE-HR | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha\mathcal{L}_{pear} + \beta\mathcal{L}_{mcc}$ | 10.72491 |
| DISFA + UBFC-rPPG + MMSE-HR | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt}$ | 10.37686 |
| DISFA + UBFC-rPPG + MMSE-HR | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \beta\mathcal{L}_{mcc}$ | 11.03058 |
| DISFA + UBFC-rPPG + MMSE-HR | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha\mathcal{L}_{pear} + \beta\mathcal{L}_{mcc}$ | 10.75880 |
| DISFA + UBFC-rPPG + MMSE-HR + PURE | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt}$ | 10.62485 |
| DISFA + UBFC-rPPG + MMSE-HR + PURE | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \beta\mathcal{L}_{mcc}$ | 10.19808 |
| DISFA + UBFC-rPPG + MMSE-HR + PURE | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha\mathcal{L}_{pear} + \beta\mathcal{L}_{mcc}$ | 11.01228 |
| DISFA + UBFC-rPPG + MMSE-HR + PURE * | VIPL-V2 | $\mathcal{L}_{pos}^{gt} + \mathcal{L}_{neg}^{gt} + \alpha\mathcal{L}_{pear} + \beta\mathcal{L}_{mcc}$ | 10.36316 |

Table 3
The results of the top-3 leaderboards on the test dataset in each challenge of RePSS. The best result is highlighted in bold, and the second-best result is in italics. The results of the 1st and 2nd RePSS are provided by the reports [28, 29], and the 3rd RePSS results are taken from the Kaggle competition page¹.

| Team Name | Venue | Rank | Method Type | RMSE↓ (bpm) |
|---|---|---|---|---|
| Mixanik | 1st RePSS | 1 | Supervised | 10.68021 |
| PoWeiHuang | 1st RePSS | 2 | Supervised | 14.16263 |
| AWoyczyk | 1st RePSS | 3 | Supervised | 14.37509 |
| Dr.L | 2nd RePSS | 1 | Supervised | 11.05 |
| TIME | 2nd RePSS | 2 | Supervised | 11.44 |
| The Anti-Spoofers | 2nd RePSS | 3 | Supervised | 14.51 |
| Face AI | 3rd RePSS | 1 | Self-supervised | **8.50693** |
| HFUT-VUT (Ours) | 3rd RePSS | 2 | Self-supervised | *8.85277* |
| PCA_Vital | 3rd RePSS | 3 | Self-supervised | 8.96941 |

¹ https://www.kaggle.com/competitions/the-3rd-repss-t1/leaderboard

Model Ensemble. In order to combine the advantages of Solution 1 and Solution 2, we use an ensemble strategy to integrate the best prediction results of these two solutions. Specifically, we ensemble the models by taking the average of the prediction results of Solution 1 and Solution 2 to obtain the final predictions. As shown in Table 3, we report the top-3 results on the test dataset for each RePSS challenge. Compared to the other teams, our team achieves 2nd place, outperforming the 3rd-place result by 1.2% in RMSE. This demonstrates that our two proposed self-supervised solutions are complementary and together achieve more accurate and robust heart rate estimation. Compared to the results of the supervised methods in the previous challenges, we find that the self-supervised methods improve performance by a large margin. This indicates that self-supervised methods can capture rPPG-related signals from facial videos during the pre-training phase without requiring any real physiological signals.
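The ensemble step itself is a simple per-sample average over the selected prediction sets; a minimal sketch is shown below, where the dictionary-based prediction format and the video IDs are hypothetical.

```python
# Minimal sketch of the ensemble strategy: average the per-sample HR predictions of the
# best runs from Solution 1 and Solution 2. The list-of-dicts format is an assumption.
import numpy as np


def ensemble(predictions: list[dict[str, float]]) -> dict[str, float]:
    """predictions: one {video_id: hr_bpm} dict per selected run."""
    video_ids = predictions[0].keys()
    return {vid: float(np.mean([run[vid] for run in predictions])) for vid in video_ids}


if __name__ == "__main__":
    run_sol1 = {"video_001": 71.2, "video_002": 88.5}   # hypothetical per-video HRs
    run_sol2 = {"video_001": 73.0, "video_002": 86.1}
    print(ensemble([run_sol1, run_sol2]))                # {'video_001': 72.1, 'video_002': 87.3}
```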
4. Conclusion

In this paper, we present our solutions developed for self-supervised remote heart rate measurement in the 3rd RePSS challenge hosted at IJCAI 2024. Specifically, we propose two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. By leveraging an ensemble strategy, our final submission takes second place with an RMSE of 8.85277 bpm. In the future, we plan to address the issues in this challenge from other perspectives, e.g., using video motion magnification algorithms [30] to capture the subtle changes reflected in faces by heartbeats.

Acknowledgments

This work was supported by the National Key R&D Program of China (NO. 2022YFB4500601), the National Natural Science Foundation of China (72188101, 62272144, 62020106007, and U20A20183), the Major Project of Anhui Province (202203a05020011), and the Fundamental Research Funds for the Central Universities.

References

[1] X. Li, J. Chen, G. Zhao, M. Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4264–4271.
[2] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G. Zhao, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 242–249.
[3] W. Qian, D. Guo, K. Li, X. Zhang, X. Tian, X. Yang, M. Wang, Dual-path TokenLearner for remote photoplethysmography-based physiological measurement with facial videos, IEEE Transactions on Computational Social Systems (2024).
[4] Q. Li, D. Guo, W. Qian, X. Tian, X. Sun, H. Zhao, M. Wang, Channel-wise interactive learning for remote heart rate estimation from facial video, IEEE Transactions on Circuits and Systems for Video Technology (2023).
[5] X. Liu, B. Hill, Z. Jiang, S. Patel, D. McDuff, EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5008–5017.
[6] S. Tang, R. Hong, D. Guo, M. Wang, Gloss semantic-enhanced network with online back-translation for sign language production, in: Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5630–5638.
[7] J. Zhou, D. Guo, M. Wang, Contrastive positive sample propagation along the audio-visual event line, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[8] K. Li, D. Guo, M. Wang, ViGT: Proposal-free video grounding with a learnable token in the transformer, Science China Information Sciences 66 (2023) 202102.
[9] D. Guo, K. Li, B. Hu, Y. Zhang, M. Wang, Benchmarking micro-action recognition: Dataset, methods, and applications, IEEE Transactions on Circuits and Systems for Video Technology (2024).
[10] Y. Wei, Z. Zhang, Y. Wang, M. Xu, Y. Yang, S. Yan, M. Wang, DerainCycleGAN: Rain attentive CycleGAN for single image deraining and rainmaking, IEEE Transactions on Image Processing 30 (2021) 4788–4801.
[11] Z. Sun, X. Li, Contrast-Phys+: Unsupervised and weakly-supervised video-based remote physiological measurement via spatiotemporal contrast, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 1–18.
[12] H. Lu, H. Han, S. K. Zhou, Dual-GAN: Joint BVP and noise modeling for remote physiological measurement, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12404–12413.
[13] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 151–160.
[14] J. Speth, N. Vance, P. Flynn, A. Czajka, Non-contrastive unsupervised learning of physiological signals from video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14464–14474.
[15] K. Li, J. Li, D. Guo, X. Yang, M. Wang, Transformer-based visual grounding with cross-modality interaction, ACM Transactions on Multimedia Computing, Communications and Applications 19 (2023) 1–19.
[16] X. Niu, S. Shan, H. Han, X. Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Transactions on Image Processing 29 (2019) 2409–2423.
[17] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, G. Zhao, Video-based remote physiological measurement via cross-verified feature disentangling, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, 2020, pp. 295–310.
[18] J. Gideon, S. Stent, The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3995–4004.
[19] A. Bardes, J. Ponce, Y. Lecun, VICReg: Variance-invariance-covariance regularization for self-supervised learning, in: International Conference on Learning Representations, 2022.
[20] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (2016) 1499–1503.
[21] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, J. Dubois, Unsupervised skin tissue segmentation for remote photoplethysmography, Pattern Recognition Letters 124 (2019) 82–90.
[22] R. Stricker, S. Müller, H.-M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, 2014, pp. 1056–1062.
[23] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, N. Sebe, Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2396–2404.
[24] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, J. F. Cohn, DISFA: A spontaneous facial action intensity database, IEEE Transactions on Affective Computing 4 (2013) 151–160.
[25] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, Automatic detection of non-posed facial action units, in: 2012 19th IEEE International Conference on Image Processing, 2012, pp. 1817–1820.
[26] X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: A multi-modal database for pulse estimation from less-constrained face video, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, 2019, pp. 562–576.
[27] P. Welch, The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms, IEEE Transactions on Audio and Electroacoustics 15 (1967) 70–73.
[28] X. Li, H. Han, H. Lu, X. Niu, Z. Yu, A. Dantcheva, G. Zhao, S. Shan, The 1st challenge on remote physiological signal sensing (RePSS), in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 314–315.
[29] X. Li, H. Sun, Z. Sun, H. Han, A. Dantcheva, S. Shan, G. Zhao, The 2nd challenge on remote physiological signal sensing (RePSS), in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2404–2413.
[30] F. Wang, D. Guo, K. Li, M. Wang, EulerMormer: Robust Eulerian motion magnification via dynamic filtering within transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 5345–5353.