<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DINO-rPPG: Remote photoplethysmography Measurement using Facial Representation from DINO Guidance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiho Choi</string-name>
          <email>jihochoi@jbnu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang Jun Lee</string-name>
          <email>sj.lee@jbnu.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jeonbuk National University</institution>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge &amp; Workshop</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Remote photoplethysmography (rPPG) is a camera-based technique that enables non-invasive monitoring of physiological signals such as heart rate (HR) and respiration rate (RR). In light of this advantage, many researchers have proposed deep learning-based methods to measure physiological signals from video data. The 3D convolutional neural network (3D CNN) has been widely applied to capture spatio-temporal features of subtle rPPG changes from facial video. However, the limited receptive fields of the convolution operation leave room for improvement in obtaining the global facial features that are crucial for accurate rPPG estimation. Recently, vision transformers (ViTs) trained with self-supervised learning have emerged as powerful tools that extract higher-level features than CNNs and supervised ViTs. In this study, we propose DINO-rPPG, a method that utilizes a pre-trained DINO model to obtain features relevant to the face without additional training. The DINO representation is extracted by a DINO-based semantic extractor (DSE), which effectively captures the high-level semantic features of the face region. Because spatio-temporal features are important for estimating accurate rPPG, we enhance the spatial DINO representation by incorporating it with features from a spatio-temporal extractor (STE). We conducted experiments using the V4V dataset for estimating HR values, and the results demonstrate that DINO representation guidance is effective for rPPG estimation.</p>
      </abstract>
      <kwd-group>
        <kwd>Remote photoplethysmography</kwd>
        <kwd>Self-supervised vision transformer</kwd>
        <kwd>Heart rate estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The extraction of physiological signals from video data necessitates the capture of subtle
changes in blood flow in facial areas. Early studies employed conventional signal processing
methods [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ] to calculate intensity changes in RGB images or applied color space
transformations [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Moreover, a comprehensive understanding of the spatial and temporal
information and the corresponding signals is essential for a deep learning model to achieve
accurate rPPG measurements. Recent studies preprocess spatio-temporal (ST) maps of the
region of interest (ROI) to allow a model to estimate rPPG from the ST map [14, 15]. Additionally,
various deep learning architectures, such as 3D CNNs and vision transformers (ViT), have been
explored to effectively capture spatio-temporal features.
      </p>
      <p>Challenges in measuring rPPG include subject movements and varying lighting conditions.
Attention mechanisms have been used to mitigate the impact of motion artifacts and improve ROI
representation [16, 17]. These mechanisms enhance the robustness of deep learning models to
noise and illumination changes by suppressing irrelevant information and focusing on important
features. rPPGNet [16] introduced a skin-based attention module, and SAM-rPPGNet [17]
proposed a spatial-temporal attention mechanism to reduce noise components and improve
measurement accuracy. By incorporating attention mechanisms, a model can focus on
face-relevant regions and the spatio-temporal features that are crucial for estimating rPPG, allowing
rPPG signals to be estimated accurately even in challenging environments.</p>
      <p>Recently, a vision transformer trained in a self-supervised manner, called DINO [18], was
proposed. DINO is trained on large datasets without ground-truth labels and can extract meaningful
representations from input images. By directly accessing the self-attention layers, it is possible to
obtain features that capture high-level semantic information. Due to these properties, DINO has
been utilized in downstream tasks and as a feature extractor. We propose a method to estimate
rPPG by leveraging the representation of the face region obtained from pre-trained DINO.</p>
      <p>In this study, we propose a framework to guide an rPPG measurement network using
representations from a ViT trained with self-supervised learning. We found that pre-trained DINO can
successfully extract features for ROI regions without additional training; the feature
visualization results for the V4V dataset can be seen in Figure 1. Therefore, we construct a network
that effectively combines visual and spatio-temporal features. Our contributions are as follows:</p>
      <p>• We propose a framework for rPPG estimation, namely DINO-rPPG, leveraging the guidance of DINO representations.</p>
      <p>• To the best of our knowledge, this is the first attempt to utilize DINO features in rPPG measurement.</p>
      <p>• We demonstrate the effectiveness of the proposed method using the V4V database [19].</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Works</title>
      <sec id="sec-3-1">
        <title>2.1. Remote photoplethysmography measurement</title>
        <p>
          Conventional rPPG methods using hand-crafted processes extract physiological signals using
various facial regions [20] and specific color channels [21]. Additionally, signal decomposition
methods such as independent component analysis [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ] and principal component
analysis [22] have been employed to improve rPPG measurement. The CHROM algorithm [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] uses
chrominance signals to suppress specular reflections in the image and is still widely used in recent research.
However, these approaches tend to perform poorly in noisy or challenging
environments, including those with motion artifacts.
        </p>
        <p>Recently, deep learning methods, such as convolutional neural networks (CNNs), have been
applied to rPPG measurement. However, 2D CNN-based approaches [23, 24] have limitations
in capturing temporal features, which are important for physiological signal analysis. To address
these shortcomings, 3D CNN models have been utilized to extract spatio-temporal features [25,
16, 26]. In particular, 3D CNNs have enabled promising rPPG measurement accuracy by
leveraging spatio-temporal features and effectively dealing with motion artifacts in video
data. Yu et al. proposed PhysNet [25], which is based on a 3D CNN with an encoder-decoder
architecture and includes upsampling along the time axis to enhance temporal resolution. However,
CNN-based models are limited by their local receptive fields; therefore, ViT architectures that
aggregate global and local information have been employed. Yu et al. proposed PhysFormer [27],
which aims to capture long-range spatio-temporal and local-global features using a video
transformer architecture.</p>
        <p>Additionally, ViTs trained in a self-supervised manner can extract meaningful
representations, and there have been attempts to utilize ViTs without ground-truth data to obtain
enhanced features. rPPG-MAE [28] was proposed as one of the rPPG methods utilizing the
representations of self-supervised learning. It utilized the masked autoencoder (MAE) to
improve the network representation and robustness to noise, and demonstrated promising
estimation accuracy on the VIPL-HR [29], PURE [30], and UBFC-rPPG [31] datasets. This suggests
that deep learning models are capable of accurate signal estimation based on representations
enriched with relevant feature information. Therefore, in this paper, we propose a method to
extract and utilize meaningful information about facial regions from the ViT model DINO [18],
trained by self-supervised learning.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Self-supervised vision transformer</title>
        <p>The features of the vision transformer provide powerful visual representations that can be used
in downstream vision tasks. In particular, DINO is emerging due to its ability to capture
high-level semantic information. DINO can obtain semantic information from images more effectively
than CNN-based models and ViT models trained with labels. This property is achieved by training
teacher and student networks with the same structure in a self-distilling manner. Specifically,
augmentations are applied to the input images to generate different views, which are then
passed through the networks. The student model is optimized using a cross-entropy loss, and the
exponential moving average technique is applied to update the teacher parameters. With self-supervised
learning, ViTs can serve as powerful feature extractors and have been applied in various vision
tasks.</p>
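The teacher-parameter update described above can be sketched as a plain exponential moving average. This is our minimal illustration of the general EMA technique, not the authors' or DINO's actual training code, and the momentum value is illustrative.

```python
# Sketch of the teacher update in DINO-style self-distillation: the teacher's
# parameters track an exponential moving average (EMA) of the student's.
# The momentum m is an illustrative value, not DINO's schedule.
def ema_update(teacher, student, m=0.996):
    # new_teacher = m * teacher + (1 - m) * student, element-wise
    return [m * t + (1 - m) * s for t, s in zip(teacher, student)]

# Example: with m = 0.9, a teacher weight of 1.0 paired with a student
# weight of 0.0 moves to 0.9.
updated = ema_update([1.0], [0.0], m=0.9)
```

In practice the momentum follows a schedule during training; the point here is only that the teacher never receives gradients and changes slowly relative to the student.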
        <p>Other powerful ViT methods, such as CLIP [32], provide meaningful features by
estimating the similarity between images and text. However, DoesFS [33] demonstrated that the DINO
representation is superior to CLIP for facial regions, using principal component analysis on
tokens and keys in intermediate layers. Therefore, we leverage DINO as a feature extractor to
improve the representation of the rPPG estimation model. The visualization of the features
obtained from DINO is shown in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Method</title>
      <p>In this section, we introduce DINO-rPPG, a DINO-guided physiological measurement method.
The overall framework is shown in Figure 2. We describe the DINO-based semantic extractor
(DSE), the spatio-temporal extractor (STE), and the feature aggregation process in Section 3.1. We then
describe the video transformer architecture in Section 3.2 and the loss functions for estimating rPPG
and HR in Section 3.3.</p>
      <sec id="sec-4-1">
        <title>3.1. DINO based spatio-temporal feature extraction</title>
        <p>We leverage the pre-trained ViT-B model from DINOv2, which follows the transformer
architecture and is trained with a self-supervised method on large-scale datasets. During training,
we first input a video X ∈ ℝ^(3×T×H×W) into the feature extractors, DSE and STE. The DINO-based
semantic extractor is designed to capture high-level semantic information F_D = DSE(X) about
the facial region in the input X that is important for rPPG signal estimation. The DSE features
guide the model to learn an enhanced representation of the facial regions, which is indicative of
physiological signals and enables fast convergence.</p>
        <p>However, the DINO feature F_D ∈ ℝ^(C×T×H/8×W/8) only provides semantic information
without considering the temporal aspects. Therefore, we leverage the spatio-temporal extractor
to capture changes over time in local regions, obtaining spatio-temporal features
F_S = STE(X) ∈ ℝ^(3×T×H/8×W/8). Inspired by [27], we construct the STE with 3D convolution layers
with kernel sizes of 1 × 5 × 5, 1 × 3 × 3, and 1 × 3 × 3. The feature map F_S is restricted in capturing
global features due to its limited receptive field. Therefore, we aggregate low-dimensional
local features from the STE and high-dimensional semantic features from the DSE to obtain an
enhanced feature map with a DINO-based representation. The combined feature map F_(D-S),
obtained by concatenating F_D and F_S, is formulated as follows:</p>
        <p>F_(D-S) = F_D ⊕ F_S, (1)</p>
        <p>where ⊕ indicates the concatenation process. To embed the combined feature map F_(D-S) from the
input video, we generate spatio-temporal tubes by considering the temporal dimension. This
process generates non-overlapping tubelet tokens from F_(D-S).</p>
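The aggregation in Eq. (1) amounts to a channel-wise concatenation of the two feature maps on a shared grid. The following sketch uses toy sizes; the DSE channel count C is our assumption (set to the ViT-B embedding width), and the STE is taken to output 3 channels as in the shape above.

```python
import numpy as np

# Toy sizes for illustration (the paper uses T = 160 frames; we shrink here).
T, H, W = 8, 64, 64
C = 768  # assumed DSE channel count (ViT-B embedding width)

f_dino = np.zeros((C, T, H // 8, W // 8))  # F_D: high-level semantic features
f_ste = np.zeros((3, T, H // 8, W // 8))   # F_S: local spatio-temporal features

# Eq. (1): F_(D-S) = F_D concatenated with F_S along the channel axis.
f_ds = np.concatenate([f_dino, f_ste], axis=0)
print(f_ds.shape)  # (771, 8, 8, 8)
```

The only requirement for the concatenation is that both extractors share the same T × H/8 × W/8 grid, which is why the STE downsamples its output to match the DSE.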
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Video transformer</title>
        <p>We adopt the temporal difference transformer from PhysFormer [27], which is based on a video
transformer architecture. We first embed the feature map F_(D-S) using the tubelet embedding
method [34], which linearly embeds non-overlapping tubelets. The embedded spatio-temporal
tubelet (ST tubelet) token sizes are defined as T′ = ⌊T/t⌋, H′ = ⌊(H/8)/h⌋, W′ = ⌊(W/8)/w⌋, with the
tube size parameters (t, h, w) set to (4, 4, 4) as in [27].</p>
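Tubelet embedding carves the feature map into non-overlapping t × h × w tubes and linearly projects each flattened tube to a token. A minimal sketch with toy sizes follows; the random matrix stands in for the learned projection, and the channel count is illustrative.

```python
import numpy as np

# Toy sizes: C channels, T frames, H/8 x W/8 spatial grid.
C, T, H8, W8 = 6, 16, 16, 16
t, h, w, D = 4, 4, 4, 96  # tube sizes (4, 4, 4) and token dim as in [27]

rng = np.random.default_rng(0)
fmap = rng.standard_normal((C, T, H8, W8))
# Stand-in for the learned linear embedding of each flattened tube.
proj = rng.standard_normal((C * t * h * w, D)) * 0.02

T2, H2, W2 = T // t, H8 // h, W8 // w  # T', H', W' token grid sizes
# Carve into non-overlapping tubes: (T', H', W', C, t, h, w), then flatten
# each tube together with its channels and project to D dimensions.
tubes = fmap.reshape(C, T2, t, H2, h, W2, w).transpose(1, 3, 5, 0, 2, 4, 6)
tokens = tubes.reshape(T2 * H2 * W2, C * t * h * w) @ proj
print(tokens.shape)  # (64, 96): T' * H' * W' tokens of dimension D
```

Each token therefore summarizes a small space-time volume, which is what lets the subsequent transformer blocks attend across both spatial position and time.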
        <p>ST tubelet tokens are then input to the N transformer blocks, which extract local and global
spatio-temporal rPPG features. The temporal difference multi-head self-attention block
(TD-MHSA) captures local temporal difference features within the video frames, with its self-attention
map representing the attention between key and query tube tokens. The self-attention map
helps the network focus on the ROI in the spatial domain and locate the peak values of the
estimated signal in the time domain. Specifically, the query and key are obtained by learning
spatial differences between neighboring frames using temporal difference convolution [35],
which captures the derivatives of the PPG signal crucial for rPPG estimation.</p>
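The temporal difference idea can be illustrated with a plain finite difference along the time axis. This is our simplification for intuition only, not the full temporal difference convolution of [35], which mixes the difference term into a learned convolution.

```python
import numpy as np

# Simplified stand-in for the temporal-difference idea behind TDC [35]:
# frame-to-frame finite differences approximate the derivative of the
# pulse signal that the query/key projections are meant to emphasize.
def temporal_diff(x):
    # x: (T, C) per-frame features -> finite differences along time,
    # with the first frame's difference defined as zero.
    return np.diff(x, axis=0, prepend=x[:1])
```

In the actual TD-MHSA block this derivative cue is combined with an ordinary convolution before the query/key projections, so static appearance information is preserved alongside the temporal change.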
        <p>The spatio-temporal feed-forward (ST-FF) block refines local inconsistencies in the features
extracted by the TD-MHSA block. The latent vector of the transformer blocks undergoes
temporal upsampling to match the original video sequence length T, ensuring that the estimated
rPPG signal has the same temporal resolution as the input frames. Finally, we extract the
physiological signal using an rPPG estimator, which consists of a 1D convolution.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Loss functions</title>
        <p>To optimize the proposed network, we define loss functions in the time domain, the frequency
domain, and by direct comparison of HR values. To extract accurate rPPG from facial video,
we utilize a loss function in the time domain, L_time. The accurate rPPG should exhibit a high
trend similarity with the ground-truth PPG signal, and its peak values on the time axis should
be close to the ground-truth peaks. Therefore, we define a loss function using the negative
Pearson correlation, formulated as follows:</p>
        <p>L_time = 1 − (T Σ y_i y′_i − Σ y_i Σ y′_i) / √((T Σ y_i² − (Σ y_i)²)(T Σ y′_i² − (Σ y′_i)²)), (2)</p>
        <p>where y and y′ denote the estimated rPPG and ground-truth PPG signals, respectively, and the sums
run over i = 1, …, T. The loss term ensures that the predicted signal has a trend and peak values
similar to the ground-truth signal on the time axis.</p>
        <p>The PPG is recorded during heartbeats, and frequency analysis can reveal the periodicity of
the signal. We obtain the power spectral density (PSD) by applying a fast Fourier transform to
both the actual and predicted signals. Inspired by [26], we define L_freq as the root mean square
error between the PSDs of the signals:</p>
        <p>L_freq = ‖PSD(y) − PSD(y′)‖₂, (3)</p>
        <p>where PSD(⋅) represents the power spectral density. Components such as motion noise have small
amplitudes in the frequency domain and lack strong periodicity. Using L_freq enhances robustness
to noise, aligns the rPPG signal more closely with the ground truth, and minimizes errors in the
frequency components.</p>
        <p>From the rPPG signal, we can calculate the HR value and directly compare the estimated HR
with the ground-truth HR. L_hr is defined as follows:</p>
        <p>L_hr = ‖h − h′‖₂, (4)</p>
        <p>where h and h′ are the estimated and actual heart rate values. The overall loss function is defined
as L = λ₁ L_time + λ₂ L_freq + λ₃ L_hr, where the hyper-parameters λ₁, λ₂, λ₃ are set to 1, 100, and
0.0001, respectively.</p>
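The three loss terms can be sketched as follows. This is our illustrative implementation under stated assumptions: a plain FFT periodogram stands in for the PSD, the signals are synthetic sinusoids, and the normalized-correlation form of the negative Pearson loss is used, which is equivalent to Eq. (2) after mean-centering.

```python
import numpy as np

def neg_pearson(y, y_true):
    # Time-domain loss: 1 - Pearson correlation (mean-centered form of Eq. 2).
    y = y - y.mean()
    y_true = y_true - y_true.mean()
    return 1.0 - (y @ y_true) / (np.linalg.norm(y) * np.linalg.norm(y_true) + 1e-8)

def psd_loss(y, y_true):
    # Frequency-domain loss (Eq. 3): L2 distance between periodogram PSDs.
    psd = lambda s: np.abs(np.fft.rfft(s)) ** 2 / len(s)
    return np.linalg.norm(psd(y) - psd(y_true))

def hr_loss(h, h_true):
    # HR loss (Eq. 4): distance between scalar heart-rate values.
    return abs(h - h_true)

fs, T = 25, 160                    # 25 Hz signals, 160-frame clips (Sec. 4.1)
ts = np.arange(T) / fs
y_true = np.sin(2 * np.pi * 1.2 * ts)        # ~72 bpm synthetic ground truth
y_pred = np.sin(2 * np.pi * 1.2 * ts + 0.1)  # slightly phase-shifted estimate

# Overall loss with lambda = (1, 100, 0.0001) as in the paper.
total = (1 * neg_pearson(y_pred, y_true)
         + 100 * psd_loss(y_pred, y_true)
         + 1e-4 * hr_loss(72.0, 71.0))
```

Note that the Pearson term is invariant to the signal's scale and offset, while the PSD term pins down the dominant frequency, which is why the two are complementary.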
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Implementation details</title>
        <p>We introduce preprocessing to prepare the input videos and signals for training our proposed
model. Using the MTCNN algorithm, we find the bounding box of the face region and crop an
area approximately 1.6 times the size of the box. The bounding box is determined in the first
frame and applied to subsequent frames. The collected video is segmented into 160 frames and
is used as input to the network. The frame rate of the video is maintained at 25 fps, and the
signal is resampled to 25 Hz for synchronization.</p>
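The cropping and segmentation steps above can be sketched as follows. The box format (x, y, width, height) and the helper names are our assumptions for illustration, not the paper's code; MTCNN itself is not reimplemented here.

```python
# Sketch of the preprocessing in Sec. 4.1 (assumed box format: x, y, w, h).
def expand_box(x, y, w, h, scale=1.6):
    # Enlarge the detected face box ~1.6x around its center, as described.
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * scale, h * scale
    return cx - nw / 2.0, cy - nh / 2.0, nw, nh

def segment_clips(n_frames, clip_len=160):
    # Start indices of non-overlapping 160-frame clips; tail frames dropped.
    return list(range(0, n_frames - clip_len + 1, clip_len))
```

Because the box is fixed from the first frame, the crop is stable across a clip; the trade-off is that large head motion can drift the face toward the crop boundary.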
        <p>Our proposed network is trained with an Adam optimizer and a learning rate of 1e-4. We set
the batch size to 8 and train the model for 20 epochs on an NVIDIA RTX 4090. The transformer
parameters from [27], N and D, are set to 12 and 96, respectively. We adopted the mean absolute
error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (r) for evaluating
the HR estimation task.</p>
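The three metrics have their standard definitions; a small sketch with made-up HR values (ours, for illustration only) makes them concrete.

```python
import numpy as np

# Standard definitions of the evaluation metrics from Sec. 4.1.
def mae(pred, gt):
    return np.mean(np.abs(pred - gt))

def rmse(pred, gt):
    return np.sqrt(np.mean((pred - gt) ** 2))

def pearson_r(pred, gt):
    return np.corrcoef(pred, gt)[0, 1]

# Made-up HR values (bpm) purely to exercise the metrics.
hr_pred = np.array([72.0, 80.0, 65.0, 90.0])
hr_gt = np.array([70.0, 83.0, 64.0, 88.0])
print(mae(hr_pred, hr_gt))  # 2.0
```

RMSE penalizes large single-clip errors more heavily than MAE, while r measures only the linear association between estimated and ground-truth HR, which is why the paper reports all three.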
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Datasets</title>
        <p>Vision for Vitals (V4V) is a dataset released as part of the ICCV 2021 Vision for Vitals challenge.
The dataset was collected from 179 subjects and consists of videos of them performing 10 tasks
designed to induce different emotions. It contains a total of 1,358 video recordings along with
accompanying physiological information such as heart rate, blood pressure, and PPG signals.
The videos were collected at a resolution of 1280×720 and a frame rate of 25 fps, with varying
lengths for each video.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Experimental results on V4V dataset</title>
        <p>In this section, we evaluate the HR estimation performance of the proposed DINO-rPPG
against previous methods on the V4V dataset. As shown in Table 1, our method consistently
outperforms previous approaches across all metrics. DeepPhys [36] and PhysNet [16] perform
poorly, with MAE and RMSE values exceeding 10. The recent state-of-the-art method,
DRPNET [26], achieved an MAE of 3.83, an RMSE of 9.59, and an r of 0.75. However, our DINO-rPPG
outperforms DRPNET with an MAE of 3.09, demonstrating a significant reduction in error rate.
We also achieved an r value of 0.83, which is higher than DRPNET's, indicating a stronger linear
relationship between the estimated HR and the ground-truth HR. Notably, the RMSE of the
proposed method decreased from 7.83 in APNET [37] to 7.05. These results suggest that the
proposed framework is effective in measuring rPPG and accurately estimating HR values.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Ablation study</title>
        <p>We also present ablation study results for the HR estimation task on the V4V dataset, as
shown in Table 2. The representations of the self-supervised ViT model provide high-level
semantic information, allowing the model to be trained effectively on abundant facial
features and achieve fast convergence. However, the DINO representation provides spatial
features without considering temporal information, which is important for sequence data. We
report an MAE of 10.67 without using the STE, indicating the need for spatio-temporal features.
Moreover, when utilizing only the STE, the network relies on spatio-temporal features of local
regions, which reduces HR estimation performance, with MAE and RMSE values of 3.18
and 8.34, respectively. This highlights the limitations of relying solely on the local features of a
3D CNN. Therefore, aggregating features from both the DSE and STE is crucial for estimating
the rPPG signals, and we demonstrate this by achieving significant improvements in all metrics.
Including pre-trained DINO features enhances the ability of the network to capture global semantic
information, resulting in more accurate prediction of rPPG signals and improved HR estimation.</p>
        <p>We also investigated the impact of the loss functions used in this study. The loss function L_freq
is based on frequency domain analysis and measures the PSD difference between the signals. Without
L_freq, the model reports a slightly increased MAE of 3.16, indicating that the loss function
effectively handles noise components with small amplitudes and poor periodicity in the frequency
domain. Moreover, we conducted an ablation study on the time domain loss functions L_hr and L_time.
Comparing HR values directly enables accurate HR estimation, as evidenced by the observed
increase in error without L_hr. In addition, L_time penalizes the distance between
the peak values of the predicted and ground-truth rPPG to obtain a similar trend on the time
axis. Omitting L_time degraded estimation performance, with an MAE of 4.63 and an r value of
0.69. We demonstrated the effectiveness of the loss functions used in this study, highlighting their
role in improving the accuracy and robustness of the model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a deep learning model for rPPG estimation called DINO-rPPG, which
utilizes a ViT model trained in a self-supervised manner. We found that DINO provides
face-relevant representations without additional training, effectively capturing essential features for
accurate rPPG estimation. The DSE uses pre-trained DINO to extract semantic representations
of ROI regions, which are further enhanced by considering temporal information through
spatio-temporal feature maps obtained from the STE. Experimental results demonstrate that DINO-rPPG
outperforms previous works in the HR estimation task, highlighting the important role of DINO
guidance in improving model performance on the V4V dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was supported by the "Regional Innovation Strategy (RIS)" through the National
Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2023RIS-008).</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[14] X. Niu, S. Shan, H. Han, X. Chen, Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Transactions on Image Processing 29 (2019) 2409–2423.</p>
      <p>[15] R. Song, S. Zhang, C. Li, Y. Zhang, J. Cheng, X. Chen, Heart rate estimation from facial videos using a spatiotemporal representation with convolutional neural networks, IEEE Transactions on Instrumentation and Measurement 69 (2020) 7411–7421.</p>
      <p>[16] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, arXiv preprint arXiv:1905.02419 (2019).</p>
      <p>[17] M. Hu, F. Qian, X. Wang, L. He, D. Guo, F. Ren, Robust heart rate estimation with spatial-temporal attention network from facial videos, IEEE Transactions on Cognitive and Developmental Systems 14 (2021) 639–647.</p>
      <p>[18] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).</p>
      <p>[19] A. Revanur, Z. Li, U. A. Ciftci, L. Yin, L. A. Jeni, The first vision for vitals (v4v) challenge for non-contact video-based physiological estimation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2760–2767.</p>
      <p>[20] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, N. Sebe, Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2396–2404.</p>
      <p>[21] W. Verkruysse, L. O. Svaasand, J. S. Nelson, Remote plethysmographic imaging using ambient light, Optics Express 16 (2008) 21434–21445.</p>
      <p>[22] G. Balakrishnan, F. Durand, J. Guttag, Detecting pulse from head motions in video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3430–3437.</p>
      <p>[23] X. Liu, J. Fromm, S. Patel, D. McDuff, Multi-task temporal shift attention networks for on-device contactless vitals measurement, Advances in Neural Information Processing Systems 33 (2020) 19400–19411.</p>
      <p>[24] E. M. Nowara, D. McDuff, A. Veeraraghavan, The benefit of distraction: Denoising camera-based physiological measurements using inverse attention, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4955–4964.</p>
      <p>[25] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, arXiv preprint arXiv:1905.02419 (2019).</p>
      <p>[26] G. Hwang, S. J. Lee, Phase-shifted remote photoplethysmography for estimating heart rate and blood pressure from facial video, arXiv preprint arXiv:2401.04560 (2024).</p>
      <p>[27] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, G. Zhao, Physformer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4186–4196.</p>
      <p>[28] X. Liu, Y. Zhang, Z. Yu, H. Lu, H. Yue, J. Yang, rPPG-MAE: Self-supervised pretraining with masked autoencoders for remote physiological measurements, IEEE Transactions on Multimedia (2024).</p>
      <p>[29] X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: A multi-modal database for pulse estimation from less-constrained face video, in: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, Springer, 2019, pp. 562–576.</p>
      <p>[30] R. Stricker, S. Müller, H.-M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, IEEE, 2014, pp. 1056–1062.</p>
      <p>[31] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, J. Dubois, Unsupervised skin tissue segmentation for remote photoplethysmography, Pattern Recognition Letters 124 (2019) 82–90.</p>
      <p>[32] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.</p>
      <p>[33] Y. Zhou, Z. Chen, H. Huang, Deformable one-shot face stylization via dino semantic guidance, arXiv preprint arXiv:2403.00459 (2024).</p>
      <p>[34] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, Vivit: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846.</p>
      <p>[35] Z. Yu, X. Li, X. Niu, J. Shi, G. Zhao, Autohr: A strong end-to-end baseline for remote heart rate measurement with neural searching, IEEE Signal Processing Letters 27 (2020) 1245–1249.</p>
      <p>[36] W. Chen, D. McDuff, Deepphys: Video-based physiological measurement using convolutional attention networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 349–365.</p>
      <p>[37] D.-Y. Kim, S.-Y. Cho, K. Lee, C.-B. Sohn, A study of projection-based attentive spatial-temporal map for remote photoplethysmography measurement, Bioengineering 9 (2022) 638.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.-H.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-H.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-M.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-Y.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Yip</surname>
          </string-name>
          , et al.,
          <article-title>Contact-free screening of atrial fibrillation by a smartphone using facial pulsatile photoplethysmographic signals</article-title>
          ,
          <source>Journal of the American Heart Association</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>e008585</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Junttila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tulppo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppänen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Non-contact atrial fibrillation detection from face videos by learning systolic peaks</article-title>
          ,
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          <volume>26</volume>
          (
          <year>2022</year>
          )
          <fpage>4587</fpage>
          -
          <lpage>4598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Sabour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benezeth</surname>
          </string-name>
          , P. De Oliveira,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chappe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Ubfc-phys: A multimodal database for psychophysiological studies of social stress</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>622</fpage>
          -
          <lpage>636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Emotion recognition from facial expressions and contactless heart rate using knowledge graph</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Knowledge Graph (ICKG)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Facial-video-based physiological signal measurement: Recent advances and affective applications</article-title>
          ,
          <source>IEEE Signal Processing Magazine</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>50</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernandez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tolosana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <article-title>Deepfakeson-phys: Deepfakes detection based on heart rate estimation</article-title>
          ,
          <source>arXiv preprint arXiv:2010.00400</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>New advances in remote heart rate estimation and its application to deepfake detection</article-title>
          ,
          <source>in: 2021 International Conference on Culture-oriented Science &amp; Technology (ICCST)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>387</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Boccignone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bursic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cuculo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>D'Amelio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lanzarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patania</surname>
          </string-name>
          ,
          <article-title>Deepfakes have no heart: A simple rppg-based method to reveal fake videos</article-title>
          ,
          <source>in: International Conference on Image Analysis and Processing</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikäinen</surname>
          </string-name>
          ,
          <article-title>Remote heart rate measurement from face videos under realistic situations</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>4264</fpage>
          -
          <lpage>4271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.-Z.</given-names>
            <surname>Poh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>McDuff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <article-title>Advancements in noncontact, multiparameter physiological measurements using a webcam</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>58</volume>
          (
          <year>2010</year>
          )
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.-Z.</given-names>
            <surname>Poh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>McDuff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <article-title>Non-contact, automated cardiac pulse measurements using video imaging and blind source separation</article-title>
          ,
          <source>Optics Express</source>
          <volume>18</volume>
          (
          <year>2010</year>
          )
          <fpage>10762</fpage>
          -
          <lpage>10774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Haan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jeanne</surname>
          </string-name>
          ,
          <article-title>Robust pulse rate from chrominance-based rppg</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>60</volume>
          (
          <year>2013</year>
          )
          <fpage>2878</fpage>
          -
          <lpage>2886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Den Brinker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stuijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Haan</surname>
          </string-name>
          ,
          <article-title>Algorithmic principles of remote ppg</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>64</volume>
          (
          <year>2016</year>
          )
          <fpage>1479</fpage>
          -
          <lpage>1491</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>