DINO-rPPG: Remote Photoplethysmography Measurement using Facial Representation from DINO Guidance

Jiho Choi1, Sang Jun Lee1,∗
1 Jeonbuk National University, Republic of Korea

Abstract
Remote photoplethysmography (rPPG) is a camera-based technique that enables non-invasive monitoring of physiological signals such as heart rate (HR) and respiration rate (RR). In light of this advantage, many researchers have proposed deep learning-based methods to measure physiological signals from video data. The 3D convolutional neural network (3D CNN) has been widely applied to capture spatio-temporal features of subtle rPPG changes in facial video. However, the limited receptive field of the convolution operation leaves room for improvement in obtaining the global facial features that are crucial for accurate rPPG estimation. Recently, vision transformers (ViTs) trained with self-supervised learning have emerged as powerful tools that extract higher-level features than CNNs and supervised ViTs. In this study, we propose DINO-rPPG, a method that utilizes a pre-trained DINO to obtain face-relevant features without additional training. The DINO representation is extracted by a DINO-based semantic extractor (DSE), which effectively captures high-level semantic features of the face region. Because spatio-temporal features are important for accurate rPPG estimation, we enhance the spatial DINO representation by combining it with features from a spatio-temporal extractor (STE). We conducted experiments on the V4V dataset for HR estimation, and the results demonstrate that DINO representation guidance is effective for rPPG estimation.

Keywords
Remote photoplethysmography, Self-supervised vision transformer, Heart rate estimation

1. Introduction
Remote photoplethysmography (rPPG) measurement is a non-contact physiological signal monitoring method based on a camera sensor.
It operates through subtle movements of blood flow and changes in pixel values resulting from blood supply in facial images. Consequently, various physiological indicators such as heart rate (HR), respiration rate, and heart rate variability (HRV) can be extracted from video data without additional contact sensors. The rPPG technique addresses the limitations of conventional contact sensors that require physical attachment to the body. In particular, advances in rPPG research have been achieved by the application of deep learning models, expanding applicability to fields such as telemedicine [1, 2], affective computing [3, 4, 5], and deepfake detection [6, 7, 8].

The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge & Workshop, August 3–9, 2024, Jeju, South Korea
∗ Corresponding author.
Email: jihochoi@jbnu.ac.kr (J. Choi); sj.lee@jbnu.ac.kr (S. J. Lee)
ORCID: 0000-0003-0771-149X (J. Choi); 0000-0002-9312-6299 (S. J. Lee)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Figure 1: Visualization results of the features from pre-trained DINO. We extracted features from randomly selected frames of videos in the V4V dataset and performed principal component analysis on these features.

The extraction of physiological signals from video data necessitates capturing subtle changes in blood flow in facial areas. Early studies employed conventional signal processing methods [9, 10, 11] to calculate intensity changes in RGB images or applied color space transformations [12, 13]. Moreover, a comprehensive understanding of spatial and temporal information and the corresponding signals is essential for a deep learning model to achieve accurate rPPG measurement.
Recent studies preprocess spatio-temporal (ST) maps of the region of interest (ROI) to allow a model to estimate rPPG from the ST map [14, 15]. Additionally, various deep learning architectures, such as the 3D CNN and the vision transformer (ViT), have been explored to effectively capture spatio-temporal features. Challenges in measuring rPPG include subject movement and varying lighting conditions. Attention mechanisms have been used to mitigate the impact of motion artifacts and improve ROI representation [16, 17]. These mechanisms enhance the robustness of deep learning models to noise and illumination changes by suppressing irrelevant information and focusing on important features. rPPGNet [16] introduced a skin-based attention module, and SAM-rPPGNet [17] proposed a spatial-temporal attention mechanism to reduce noise components and improve measurement accuracy. By incorporating attention mechanisms, a model can focus on the face-relevant regions and spatio-temporal features that are crucial for estimating rPPG, allowing rPPG signals to be estimated accurately even in challenging environments.

Recently, a vision transformer trained in a self-supervised manner, called DINO [18], was proposed. DINO is trained on large datasets without ground-truth labels and can extract meaningful representations from input images. By directly accessing the self-attention layers, it is possible to obtain features that capture high-level semantic information. Owing to these properties, DINO has been utilized in downstream tasks and as a feature extractor. We propose a method to estimate rPPG by leveraging the representation of the face region obtained from pre-trained DINO.

In this study, we propose a framework that guides an rPPG measurement network using representations of a ViT trained with self-supervised learning.
We found that pre-trained DINO can successfully extract features for ROI regions without additional training; the feature visualization results for the V4V dataset are shown in Figure 1. Therefore, we construct a network that effectively combines visual and spatio-temporal features. Our contributions are as follows:

• We propose a framework for rPPG estimation, namely DINO-rPPG, leveraging the guidance of DINO representations.
• To the best of our knowledge, this is the first attempt to utilize DINO features in rPPG measurement.
• We demonstrate the effectiveness of the proposed method using the V4V database [19].

2. Related Works

2.1. Remote photoplethysmography measurement
Conventional rPPG methods using hand-crafted processes extract physiological signals from various facial regions [20] and specific color channels [21]. Additionally, signal decomposition methods such as independent component analysis [10, 11] and principal component analysis [22] have been employed to improve rPPG measurement. The CHROM algorithm [12] uses chrominance to suppress specular (mirror-like) distortions in the image and is still widely used in recent research. However, these approaches tend to perform poorly in noisy or challenging environments, including those with motion artifacts. Recently, deep learning methods such as convolutional neural networks (CNNs) have been applied to rPPG measurement. However, 2D CNN-based approaches [23, 24] have limitations in capturing temporal features, which are important for physiological signal analysis. To overcome these shortcomings, 3D CNN models have been utilized to extract spatio-temporal features [25, 16, 26]. In particular, 3D CNNs have enabled promising rPPG measurement accuracy by leveraging spatio-temporal features and effectively dealing with motion artifacts in video data. Yu et al.
proposed PhysNet [25], which is based on a 3D CNN with an encoder-decoder architecture and includes upsampling along the time axis to enhance temporal resolution. However, CNN-based models are limited by their local receptive fields; therefore, ViT architectures, which aggregate global and local information, have been employed. Yu et al. proposed PhysFormer [27], which aims to capture long-range spatio-temporal and local-global features using a video transformer architecture.

Additionally, ViTs trained in a self-supervised manner can extract meaningful representations, and there have been attempts to utilize ViTs without ground-truth data to obtain enhanced features. rPPG-MAE [28] was proposed as one of the rPPG methods utilizing representations from self-supervised learning. It utilized the masked autoencoder (MAE) to improve the network representation and robustness to noise, and demonstrated promising estimation accuracy on the VIPL-HR [29], PURE [30], and UBFC-rPPG [31] datasets. This suggests that deep learning models are capable of accurate signal estimation based on representations enriched with relevant feature information. Therefore, in this paper, we propose a method to extract and utilize meaningful information about facial regions from the ViT model DINO [18], trained by self-supervised learning.

2.2. Self-supervised vision transformer
The features of the vision transformer provide powerful visual representations that can be used in downstream vision tasks. In particular, DINO is emerging due to its ability to capture high-level semantic information. DINO can obtain semantic information from images more effectively than CNN-based models and ViT models trained with labels. This property is achieved by training teacher and student networks with the same structure in a self-distilling manner. Specifically, augmentations are applied to the input images to generate different views, which are then passed through the networks. The student model is optimized using a cross-entropy loss, and the exponential moving average technique is applied to update the teacher parameters. With self-supervised learning, ViTs can serve as powerful feature extractors and have been applied in various vision tasks.

Another powerful ViT method, CLIP [32], provides meaningful features by estimating the similarity between images and text. However, DoesFS [33] demonstrated that the DINO representation is superior to CLIP for facial regions, using principal component analysis on tokens and keys in intermediate layers. Therefore, we leverage DINO as a feature extractor to improve the representation of the rPPG estimation model. The visualization of the features obtained from DINO is shown in Figure 1.

Figure 2: Overview of the proposed method. We input 𝑇 frames of face video and extract features from the DSE and STE. The extractors output feature maps 𝑓𝐷𝑆 and 𝑓𝑆𝑇, respectively. We then integrate them to obtain an enhanced spatio-temporal representation, denoted 𝑓𝐷𝑆−𝑆𝑇. We use the tokenized feature map 𝑓𝐷𝑆−𝑆𝑇 as input to the video transformer. Finally, the HR value is estimated from the measured rPPG signal.

3. Method
In this section, we introduce DINO-rPPG, a DINO-guided physiological measurement method. The overall framework is shown in Figure 2. We describe the DINO-based semantic extractor (DSE), the spatio-temporal extractor (STE), and the feature aggregation process in Section 3.1. We then present the video transformer architecture in Section 3.2 and the loss functions for estimating rPPG and HR in Section 3.3.

3.1. DINO based spatio-temporal feature extraction
We leverage the pre-trained ViT-B model from DINOv2, which follows the transformer architecture and employs a self-supervised method trained on large-scale datasets. During training, we first input a video 𝑣 ∈ ℝ^(𝐶×𝑇×𝐻×𝑊) into the feature extractors, the DSE and STE.
The DINO-based semantic extractor is designed to capture high-level semantic information 𝑓𝐷𝑆 = 𝐷𝑆𝐸(𝑣) about the facial region in the input 𝑣 that is important for rPPG signal estimation. The DSE features guide the model to learn an enhanced representation of the facial regions, which is indicative of physiological signals and enables fast convergence. However, the DINO feature 𝑓𝐷𝑆 ∈ ℝ^(𝐷×𝑇×𝐻/8×𝑊/8) only provides semantic information without considering the temporal aspects. Therefore, we leverage the spatio-temporal extractor to obtain spatial features 𝑓𝑆𝑇 = 𝑆𝑇𝐸(𝑣) over time in a local region, capturing spatio-temporal features 𝑓𝑆𝑇 ∈ ℝ^(3×𝑇×𝐻/8×𝑊/8). Inspired by [27], we construct the STE with 3D convolution layers with kernel sizes of 1×5×5, 1×3×3, and 1×3×3. The feature map 𝑓𝑆𝑇 is restricted in capturing global features due to its limited receptive field. Therefore, we aggregate low-dimensional local features from the STE and high-dimensional semantic features from the DSE to obtain an enhanced feature map with a DINO-based representation. The combined feature map 𝑓𝐷𝑆−𝑆𝑇 is obtained by concatenating 𝑓𝑆𝑇 and 𝑓𝐷𝑆, formulated as follows:

𝑓𝐷𝑆−𝑆𝑇 = 𝑓𝑆𝑇 ⊕ 𝑓𝐷𝑆, (1)

where ⊕ indicates the concatenation operation. To embed the combined feature map 𝑓𝐷𝑆−𝑆𝑇 from the input video, we generate spatio-temporal tubes by considering the temporal dimension. This process generates non-overlapping tubelet tokens from 𝑓𝐷𝑆−𝑆𝑇.

3.2. Video transformer
We adopt the temporal difference transformer from PhysFormer [27], which is based on a video transformer architecture. We first embed the feature map 𝑓𝐷𝑆−𝑆𝑇 using the tubelet embedding method [34], which linearly embeds non-overlapping tubelets. The embedded spatio-temporal tubelet (ST tubelet) token sizes are defined as 𝑇′ = ⌊𝑇/𝑡⌋, 𝐻′ = ⌊(𝐻/8)/ℎ⌋, 𝑊′ = ⌊(𝑊/8)/𝑤⌋, with the tube size parameters 𝑡, ℎ, 𝑤 set to (4, 4, 4) as in [27].
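To make the tensor shapes concrete, the channel-wise aggregation of Eq. (1) and the resulting tubelet token grid can be checked with a small NumPy sketch. The sizes D = 96, T = 160, and H = W = 128 are illustrative assumptions, not values the paper fixes jointly:

```python
import numpy as np

# Illustrative sizes (assumptions): D channels for the DSE output,
# T frames, and an H x W input resolution.
D, T, H, W = 96, 160, 128, 128
f_ST = np.zeros((3, T, H // 8, W // 8))   # STE output: 3 x T x H/8 x W/8
f_DS = np.zeros((D, T, H // 8, W // 8))   # DSE output: D x T x H/8 x W/8

# Eq. (1): channel-wise concatenation f_DS-ST = f_ST (+) f_DS
f_DS_ST = np.concatenate([f_ST, f_DS], axis=0)   # (3 + D) x T x H/8 x W/8

# Non-overlapping tubelet token grid with tube sizes (t, h, w) = (4, 4, 4)
t, h, w = 4, 4, 4
T_p = T // t            # T' = floor(T / t)
H_p = (H // 8) // h     # H' = floor((H/8) / h)
W_p = (W // 8) // w     # W' = floor((W/8) / w)
print(f_DS_ST.shape, (T_p, H_p, W_p))  # -> (99, 160, 16, 16) (40, 4, 4)
```

Under these assumed sizes, the video is reduced to a 40×4×4 grid of ST tubelet tokens, each carrying 3 + D channels before the linear tubelet embedding.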
ST tubelet tokens are then input to the 𝑁 transformer blocks, which extract local and global spatio-temporal rPPG features. The temporal difference multi-head self-attention (TD-MHSA) block captures local temporal difference features within the video frames, with its self-attention map representing the attention between key and query tube tokens. The self-attention map helps the network focus on the ROI in the spatial domain and locate the peak values of the estimated signal in the time domain. Specifically, the query and key are obtained by learning spatial differences between neighboring frames using temporal difference convolution [35], which captures the derivatives of the PPG signal crucial for rPPG estimation. The spatio-temporal feed-forward (ST-FF) block refines local inconsistencies in the features extracted by the TD-MHSA block. The latent vector of the transformer blocks undergoes temporal upsampling to match the original video sequence length 𝑇, ensuring that the estimated rPPG signal has the same temporal resolution as the input frames. Finally, we extract the physiological signal using an rPPG estimator, which consists of a 1D convolution.

3.3. Loss functions
To optimize the proposed network, we define loss functions in the time domain, in the frequency domain, and by direct comparison of HR values. To extract accurate rPPG from facial video, we utilize a loss function in the time domain, 𝐿𝑡𝑖𝑚𝑒. An accurate rPPG signal should exhibit high trend similarity with the ground-truth PPG signal, and its peak values on the time axis should be close to the ground-truth peaks. Therefore, we define a loss function using the negative Pearson correlation, formulated as follows:

𝐿𝑡𝑖𝑚𝑒 = 1 − (𝑇 ∑𝑥𝑥′ − ∑𝑥 ∑𝑥′) / √((𝑇 ∑𝑥² − (∑𝑥)²)(𝑇 ∑𝑥′² − (∑𝑥′)²)), (2)

where all sums run from 1 to 𝑇, and 𝑥 and 𝑥′ denote the estimated rPPG and ground-truth PPG signals, respectively.
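As a minimal sketch of the negative-Pearson term in Eq. (2) (the function name and synthetic test signals are illustrative, not the authors' code):

```python
import numpy as np

def neg_pearson_loss(x, x_true):
    """L_time of Eq. (2): 1 minus the Pearson correlation of the two signals."""
    x = np.asarray(x, dtype=float)
    x_true = np.asarray(x_true, dtype=float)
    xc = x - x.mean()        # centering is equivalent to the T*sum - sum*sum form
    yc = x_true - x_true.mean()
    return 1.0 - (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

t = np.linspace(0, 6.4, 160)           # 160 frames at 25 fps = 6.4 s
ppg = np.sin(2 * np.pi * 1.2 * t)      # synthetic PPG at 1.2 Hz (~72 bpm)
loss_same = neg_pearson_loss(ppg, ppg)       # identical signals  -> ~0
loss_flip = neg_pearson_loss(-ppg, ppg)      # anti-correlated    -> ~2
```

Because the correlation is scale-invariant, this term enforces trend and peak-location similarity rather than absolute amplitude agreement, which matches its role described in the text.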
The loss term 𝐿𝑡𝑖𝑚𝑒 ensures that the predicted signal has a trend and peak values similar to those of the ground-truth signal on the time axis.

The PPG is recorded during heartbeats, and frequency analysis can reveal the periodicity of the signal. We obtain the power spectral density (PSD) by applying a fast Fourier transform to both the actual and predicted signals. Inspired by [26], we define 𝐿𝑓𝑟𝑒𝑞 as the root mean square error between the PSDs of the signals:

𝐿𝑓𝑟𝑒𝑞 = ‖𝑃(𝑥) − 𝑃(𝑥′)‖₂, (3)

where 𝑃(⋅) represents the PSD. Components such as motion noise have small amplitudes in the frequency domain and lack strong periodicity. Using 𝐿𝑓𝑟𝑒𝑞 enhances robustness to noise, aligns the rPPG signal more closely with the ground truth, and minimizes errors in the frequency components. From the rPPG signal, we can calculate the HR value and directly compare the estimated HR with the ground-truth HR. 𝐿ℎ𝑟 is defined in the time domain as follows:

𝐿ℎ𝑟 = ‖ℎ − ℎ′‖₂, (4)

where ℎ and ℎ′ are the estimated and actual heart rate values, respectively. The overall loss function is denoted as 𝐿𝑜𝑣𝑒𝑟𝑎𝑙𝑙 = 𝜆1𝐿𝑡𝑖𝑚𝑒 + 𝜆2𝐿𝑓𝑟𝑒𝑞 + 𝜆3𝐿ℎ𝑟, where the hyper-parameters 𝜆1, 𝜆2, 𝜆3 are set to 1, 100, and 0.0001, respectively.

4. Experiments

4.1. Implementation details
We introduce preprocessing to prepare the input videos and signals for training our proposed model. Using the MTCNN algorithm, we find the bounding box of the face region and crop an area approximately 1.6 times the size of the box. The bounding box is determined in the first frame and applied to subsequent frames. The collected video is segmented into 160-frame clips, which are used as input to the network. The frame rate of the video is maintained at 25 fps, and the signal is resampled to 25 Hz for synchronization. Our proposed network is trained with the Adam optimizer and a learning rate of 1e-4. We set the batch size to 8 and train the model for 20 epochs on an NVIDIA RTX 4090 GPU. The transformer parameters [27] 𝑁 and 𝐷 are set to 12 and 96, respectively.
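The face-crop expansion and clip segmentation described above can be sketched as follows. The helper names, the example box coordinates, and the drop-the-remainder policy for leftover frames are illustrative assumptions, not details from the paper:

```python
def expand_box(x1, y1, x2, y2, frame_w, frame_h, scale=1.6):
    """Enlarge a face bounding box about its center by `scale` (1.6 as in the
    paper), clipped to the frame boundaries. Helper names are illustrative."""
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    nx1 = max(0, int(round(cx - w / 2)))
    ny1 = max(0, int(round(cy - h / 2)))
    nx2 = min(frame_w, int(round(cx + w / 2)))
    ny2 = min(frame_h, int(round(cy + h / 2)))
    return nx1, ny1, nx2, ny2

def segment_clips(n_frames, clip_len=160):
    """Split a video into non-overlapping 160-frame clips; the remainder is
    dropped here (an assumption -- the paper does not state its policy)."""
    return [(s, s + clip_len) for s in range(0, n_frames - clip_len + 1, clip_len)]

# Example: a face box detected in the first frame of a 1280x720 V4V video.
box = expand_box(500, 200, 700, 450, frame_w=1280, frame_h=720)
clips = segment_clips(500)   # e.g. a 20 s video at 25 fps
```

The same expanded box is then reused for all subsequent frames of the clip, as described in the text, which keeps the crop stable across the sequence.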
We adopted the mean absolute error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (𝑟) for evaluating the HR estimation task.

Table 1
Heart rate estimation results on the V4V dataset. The best results are in bold.

Methods             MAE ↓   RMSE ↓   r ↑
DeepPhys [36]       10.2    13.25    0.45
PhysNet [16]        13.15   19.23    0.75
APNET [37]           4.89    7.83    0.74
DRPNET [26]          3.83    9.59    0.75
DINO-rPPG (Ours)     3.09    7.05    0.83

4.2. Datasets
Vision for Vitals (V4V) is a dataset released as part of the ICCV 2021 Vision for Vitals challenge. The dataset was collected from 179 subjects and consists of videos of them performing 10 tasks designed to induce different emotions. It contains a total of 1,358 video recordings along with accompanying physiological information such as heart rate, blood pressure, and PPG signals. The videos were collected at a resolution of 1280×720 and a frame rate of 25 fps, with varying lengths for each video.

4.3. Experimental results on the V4V dataset
In this section, we evaluate the HR estimation performance of the proposed DINO-rPPG compared to previous methods on the V4V dataset. As shown in Table 1, our method consistently outperforms previous approaches across all metrics. DeepPhys [36] and PhysNet [16] perform poorly, with MAE and RMSE values exceeding 10. The recent state-of-the-art method, DRPNET [26], achieved an MAE of 3.83, an RMSE of 9.59, and an 𝑟 of 0.75. However, our DINO-rPPG outperforms DRPNET with an MAE of 3.09, demonstrating a significant reduction in error. We also achieved an 𝑟 value of 0.83, which is higher than that of DRPNET, indicating a stronger linear relationship between the estimated HR and the ground-truth HR. Notably, the RMSE of the proposed method decreased from 7.83 in APNET [37] to 7.05. These results suggest that the proposed framework is effective in measuring rPPG and accurately estimating HR values.

4.4. Ablation study
We also present ablation study results for the HR estimation task on the V4V dataset, as shown in Table 2.
The representations of the self-supervised ViT model provide high-level semantic information, allowing the model to be trained effectively based on abundant facial features and to achieve fast convergence. However, the DINO representation provides spatial features without considering temporal information, which is important for sequence data. We report an MAE of 10.67 without using the STE, indicating the need for spatio-temporal features. Moreover, when utilizing only the STE, the network relies on spatio-temporal features of local regions, which reduces HR estimation performance, with MAE and RMSE values of 3.18 and 8.34, respectively. This highlights the limitations of relying solely on the local features of the 3D CNN. Therefore, aggregating features from both the DSE and STE is crucial for estimating the rPPG signals, and we demonstrate this by achieving significant improvements in all metrics.

Table 2
Ablation study of DSE and STE on the V4V dataset. The best results are in bold.

Method          MAE ↓   RMSE ↓   r ↑
Ours             3.09    7.05    0.83
Ours w/o DSE     3.18    8.34    0.80
Ours w/o STE    10.67   13.97    0.13

Table 3
Ablation study of loss functions on the V4V dataset. The best results are in bold.

Method            MAE ↓   RMSE ↓   r ↑
Ours               3.09    7.05    0.83
Ours w/o 𝐿𝑓𝑟𝑒𝑞     3.16    8.30    0.81
Ours w/o 𝐿ℎ𝑟       3.30    8.74    0.79
Ours w/o 𝐿𝑡𝑖𝑚𝑒     4.63   10.71    0.69

Including pre-trained DINO features enhances the ability of the network to capture global semantic information, resulting in more accurate prediction of rPPG signals and improved HR estimation. We also investigated the impact of the loss functions used in this study. The loss function 𝐿𝑓𝑟𝑒𝑞 is based on frequency domain analysis and measures the PSD difference of the signals. Without 𝐿𝑓𝑟𝑒𝑞, the model reports a slightly increased MAE of 3.16, indicating that this loss function effectively handles noise components with small amplitudes and poor periodicity in the frequency domain. Moreover, we conducted an ablation study on the time domain loss functions 𝐿ℎ𝑟 and 𝐿𝑡𝑖𝑚𝑒.
Comparing HR values directly enables accurate HR estimation, as evidenced by the observed increase in error without 𝐿ℎ𝑟. In addition, 𝐿𝑡𝑖𝑚𝑒 applies a penalty to the distance between the peak values of the predicted and ground-truth rPPG to obtain a similar trend on the time axis. Omitting 𝐿𝑡𝑖𝑚𝑒 degraded estimation performance, with an MAE of 4.63 and an 𝑟 value of 0.69. We demonstrated the efficiency of the loss functions used in this study, highlighting their effectiveness in improving the accuracy and robustness of the model.

5. Conclusion
In this paper, we proposed a deep learning model for rPPG estimation called DINO-rPPG, which utilizes a ViT model trained in a self-supervised manner. We found that DINO provides face-relevant representations without additional training, effectively capturing essential features for accurate rPPG estimation. The DSE uses pre-trained DINO to extract semantic representations of ROI regions, and these are further enhanced by incorporating temporal information through the spatio-temporal feature maps obtained from the STE. Experimental results demonstrate that DINO-rPPG outperforms previous works in the HR estimation task, highlighting the important role of DINO guidance in improving model performance on the V4V dataset.

Acknowledgments
This research was supported by the “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2023RIS-008).

References
[1] B. P. Yan, W. H. Lai, C. K. Chan, S. C.-H. Chan, L.-H. Chan, K.-M. Lam, H.-W. Lau, C.-M. Ng, L.-Y. Tai, K.-W. Yip, et al., Contact-free screening of atrial fibrillation by a smartphone using facial pulsatile photoplethysmographic signals, Journal of the American Heart Association 7 (2018) e008585.
[2] Z. Sun, J. Junttila, M. Tulppo, T. Seppänen, X. Li, Non-contact atrial fibrillation detection from face videos by learning systolic peaks, IEEE Journal of Biomedical and Health Informatics 26 (2022) 4587–4598.
[3] R. M.
Sabour, Y. Benezeth, P. De Oliveira, J. Chappe, F. Yang, UBFC-Phys: A multimodal database for psychophysiological studies of social stress, IEEE Transactions on Affective Computing 14 (2021) 622–636.
[4] W. Yu, S. Ding, Z. Yue, S. Yang, Emotion recognition from facial expressions and contactless heart rate using knowledge graph, in: 2020 IEEE International Conference on Knowledge Graph (ICKG), IEEE, 2020, pp. 64–69.
[5] Z. Yu, X. Li, G. Zhao, Facial-video-based physiological signal measurement: Recent advances and affective applications, IEEE Signal Processing Magazine 38 (2021) 50–58.
[6] J. Hernandez-Ortega, R. Tolosana, J. Fierrez, A. Morales, DeepFakesON-Phys: Deepfakes detection based on heart rate estimation, arXiv preprint arXiv:2010.00400 (2020).
[7] Y. Xu, R. Zhang, C. Yang, Y. Zhang, Z. Yang, J. Liu, New advances in remote heart rate estimation and its application to deepfake detection, in: 2021 International Conference on Culture-oriented Science & Technology (ICCST), IEEE, 2021, pp. 387–392.
[8] G. Boccignone, S. Bursic, V. Cuculo, A. D’Amelio, G. Grossi, R. Lanzarotti, S. Patania, Deepfakes have no heart: A simple rPPG-based method to reveal fake videos, in: International Conference on Image Analysis and Processing, Springer, 2022, pp. 186–195.
[9] X. Li, J. Chen, G. Zhao, M. Pietikainen, Remote heart rate measurement from face videos under realistic situations, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 4264–4271.
[10] M.-Z. Poh, D. J. McDuff, R. W. Picard, Advancements in noncontact, multiparameter physiological measurements using a webcam, IEEE Transactions on Biomedical Engineering 58 (2010) 7–11.
[11] M.-Z. Poh, D. J. McDuff, R. W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Optics Express 18 (2010) 10762–10774.
[12] G. De Haan, V.
Jeanne, Robust pulse rate from chrominance-based rPPG, IEEE Transactions on Biomedical Engineering 60 (2013) 2878–2886.
[13] W. Wang, A. C. Den Brinker, S. Stuijk, G. De Haan, Algorithmic principles of remote PPG, IEEE Transactions on Biomedical Engineering 64 (2016) 1479–1491.
[14] X. Niu, S. Shan, H. Han, X. Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Transactions on Image Processing 29 (2019) 2409–2423.
[15] R. Song, S. Zhang, C. Li, Y. Zhang, J. Cheng, X. Chen, Heart rate estimation from facial videos using a spatiotemporal representation with convolutional neural networks, IEEE Transactions on Instrumentation and Measurement 69 (2020) 7411–7421.
[16] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, arXiv preprint arXiv:1905.02419 (2019).
[17] M. Hu, F. Qian, X. Wang, L. He, D. Guo, F. Ren, Robust heart rate estimation with spatial-temporal attention network from facial videos, IEEE Transactions on Cognitive and Developmental Systems 14 (2021) 639–647.
[18] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., DINOv2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).
[19] A. Revanur, Z. Li, U. A. Ciftci, L. Yin, L. A. Jeni, The first vision for vitals (V4V) challenge for non-contact video-based physiological estimation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2760–2767.
[20] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, N. Sebe, Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2396–2404.
[21] W. Verkruysse, L. O. Svaasand, J. S.
Nelson, Remote plethysmographic imaging using ambient light, Optics Express 16 (2008) 21434–21445.
[22] G. Balakrishnan, F. Durand, J. Guttag, Detecting pulse from head motions in video, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3430–3437.
[23] X. Liu, J. Fromm, S. Patel, D. McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, Advances in Neural Information Processing Systems 33 (2020) 19400–19411.
[24] E. M. Nowara, D. McDuff, A. Veeraraghavan, The benefit of distraction: Denoising camera-based physiological measurements using inverse attention, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4955–4964.
[25] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, arXiv preprint arXiv:1905.02419 (2019).
[26] G. Hwang, S. J. Lee, Phase-shifted remote photoplethysmography for estimating heart rate and blood pressure from facial video, arXiv preprint arXiv:2401.04560 (2024).
[27] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, G. Zhao, PhysFormer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4186–4196.
[28] X. Liu, Y. Zhang, Z. Yu, H. Lu, H. Yue, J. Yang, rPPG-MAE: Self-supervised pretraining with masked autoencoders for remote physiological measurements, IEEE Transactions on Multimedia (2024).
[29] X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: A multi-modal database for pulse estimation from less-constrained face video, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 562–576.
[30] R. Stricker, S. Müller, H.-M.
Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, IEEE, 2014, pp. 1056–1062.
[31] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, J. Dubois, Unsupervised skin tissue segmentation for remote photoplethysmography, Pattern Recognition Letters 124 (2019) 82–90.
[32] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[33] Y. Zhou, Z. Chen, H. Huang, Deformable one-shot face stylization via DINO semantic guidance, arXiv preprint arXiv:2403.00459 (2024).
[34] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846.
[35] Z. Yu, X. Li, X. Niu, J. Shi, G. Zhao, AutoHR: A strong end-to-end baseline for remote heart rate measurement with neural searching, IEEE Signal Processing Letters 27 (2020) 1245–1249.
[36] W. Chen, D. McDuff, DeepPhys: Video-based physiological measurement using convolutional attention networks, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 349–365.
[37] D.-Y. Kim, S.-Y. Cho, K. Lee, C.-B. Sohn, A study of projection-based attentive spatial-temporal map for remote photoplethysmography measurement, Bioengineering 9 (2022) 638.