<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Video-Based Heart Rate Estimation with Temporal Difference Transformer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhiqin Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Songping Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caifeng Shan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>State Key Laboratory for Novel Software Technology and School of Intelligence Science and Technology, Nanjing University</institution>
          ,
          <addr-line>Suzhou, 215123</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the advancement of remote photoplethysmography (rPPG), video-based physiological signal measurement has emerged as a convenient and contactless method for heart rate (HR) estimation. Most existing video-based rPPG methods rely on either RGB or NIR imaging, each with distinct advantages and limitations. RGB-based methods demonstrate higher accuracy but are sensitive to lighting and skin tone, while NIR-based methods are more robust to such variations but less responsive to blood volume changes, leading to lower accuracy. Therefore, we propose a temporal difference transformer (TDT)-based multimodal fusion framework for robust HR estimation. We employ separate 3D convolutional encoders to independently extract modality-specific spatio-temporal representations from RGB and NIR input streams. Subsequently, the TDT module enhances quasi-periodic features while adaptively aligning and fusing RGB and NIR representations. Additionally, a composite loss function is introduced to provide simultaneous supervision across temporal and spectral domains. In the 4th RePSS Challenge, the proposed method achieved second place with a root mean square error (RMSE) of 12.32 bpm on the official test set, demonstrating strong performance in multimodal HR estimation.</p>
      </abstract>
      <kwd-group>
        <kwd>Remote photoplethysmography</kwd>
        <kwd>heart rate estimation</kwd>
        <kwd>temporal difference transformer</kwd>
        <kwd>multimodal fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Heart rate (HR) is a crucial physiological indicator for assessing health and emotional states [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While
traditional contact-based methods can cause discomfort, remote photoplethysmography (rPPG) has
emerged as a promising non-contact alternative. By analyzing subtle color variations in facial videos
caused by blood volume changes with each cardiac pulse, rPPG enables HR estimation without physical
contact. With the rapid advancement of deep learning techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], video-based rPPG methods
have made significant progress [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Numerous studies have adopted convolutional neural networks
(CNNs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and transformer-based architectures [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] to model spatio-temporal features of facial blood
flow. These learning frameworks have demonstrated promising performance across various public
datasets.
      </p>
      <p>
        Despite the success of RGB-based deep learning methods for HR estimation, they still face considerable
challenges in real-world scenarios. In low-light conditions, limited visible light reduces the
signal-to-noise ratio of rPPG signals [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Illumination changes introduce color shifts that can obscure true
physiological rhythms. Skin tone differences also affect reflectance: darker skin absorbs more light,
leading to weaker signals and lower rPPG amplitude. In contrast, NIR cameras capture only the infrared
portion of the incident light, which is less sensitive to ambient illumination changes and penetrates
deeper into the skin. This enables more stable signal acquisition in low-light conditions and reduces
reflectance disparities across different skin tones. HR estimation methods based on NIR cameras and
infrared illumination exhibit advantages in such settings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, due to the limited sensitivity of
NIR light to blood volume changes, NIR-based methods generally result in less accurate HR estimation
than RGB-based methods [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>[Figure 1: Overview of the proposed framework. Each modality passes through N stacked Conv3D + BN + ReLU blocks with 3D max pooling; the two feature streams are concatenated, patch-embedded, and layer-normalized before M temporal difference transformer blocks (temporal difference multi-head self-attention followed by a spatio-temporal feed-forward module). The predicted rPPG is supervised by KL divergence and cross-entropy losses on its PSD and by a negative Pearson correlation loss against the reference PPG.]</p>
      <p>To address these limitations, integrating RGB and NIR modalities shows promise for achieving a
balance between illumination robustness and sensitivity to blood flow signals. By leveraging the
complementary strengths of both modalities, multimodal fusion has the potential to improve the accuracy
and generalizability of rPPG estimation in diverse and complex environments. To explore the potential
of multimodal data fusion in rPPG, the 4th Remote Physiological Signal Sensing (RePSS) Challenge was
launched in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI 2025).</p>
      <p>
        In this challenge, we propose an end-to-end multimodal fusion method based on the temporal
difference transformer (TDT) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. First, separate 3D convolutional encoders extract modality-specific
spatio-temporal representations from the RGB and NIR input streams. The TDT module then enhances
quasi-periodic features while adaptively aligning and fusing cross-modal RGB and NIR representations.
Furthermore, a composite loss function enforces simultaneous supervision in both temporal and spectral
domains to optimize feature learning. By effectively integrating features from both modalities and
domains, the proposed method enhances the robustness of HR estimation under challenging scenarios.
On the official challenge test set, the proposed method achieved a root mean square error (RMSE)
of 12.32 bpm, ranking second among all participating teams. These results demonstrate the effectiveness and
competitiveness of the proposed approach in multimodal HR estimation tasks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>As shown in Fig. 1, our method consists of three main components: modality-specific spatio-temporal
encoding, multimodal feature fusion, and rPPG prediction. Furthermore, a composite loss function is
designed to provide supervision in both the time and frequency domains. In the following sections, we
provide a detailed description of each component.</p>
      <sec id="sec-2-1">
        <title>2.1. Modality-specific Spatio-temporal Encoding</title>
        <p>Given a sequence of facial video frames from two modalities (RGB denoted as X_RGB and NIR as X_NIR),
each modality is independently processed through a 3D convolutional encoder to extract spatio-temporal
features specific to that modality. The RGB input has three channels, while the NIR input has one
channel, with both sharing the same temporal and spatial dimensions.</p>
        <p>Each encoder is composed of multiple stacked blocks, where each block includes a 3D convolutional
layer followed by batch normalization, ReLU activation, and 3D max pooling. To improve training
stability and maintain feature quality, residual connections are introduced within each block by adding
the input features to the output of the transformation.</p>
        <p>Through this encoding process, we obtain modality-specific feature representations F_RGB ∈
ℝ^(C₁×T₁×H₁×W₁) and F_NIR ∈ ℝ^(C₁×T₁×H₁×W₁), both sharing the same temporal and
spatial resolution after downsampling, but carrying distinct semantic information from their respective
inputs.</p>
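        <p>As an illustrative sketch (not the authors' released code), one encoder block can be written in PyTorch as follows. The kernel sizes, channel widths, spatial-only pooling, and the 1 × 1 × 1 shortcut projection are assumptions, since the paper does not state them:</p>

```python
import torch
import torch.nn as nn

class EncoderBlock3D(nn.Module):
    """One Conv3D + BN + ReLU + 3D max-pool block with a residual connection.

    Hypothetical configuration: 3x3x3 kernels and spatial-only pooling are
    assumed; the paper does not specify these hyperparameters.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # 1x1x1 projection so the residual addition matches channel counts
        self.proj = nn.Conv3d(c_in, c_out, kernel_size=1) if c_in != c_out else nn.Identity()
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # halve H and W, keep T

    def forward(self, x):
        out = self.relu(self.bn(self.conv(x))) + self.proj(x)  # residual add
        return self.pool(out)

# Separate encoders per modality: RGB has 3 input channels, NIR has 1
rgb_encoder = nn.Sequential(EncoderBlock3D(3, 16), EncoderBlock3D(16, 32))
nir_encoder = nn.Sequential(EncoderBlock3D(1, 16), EncoderBlock3D(16, 32))
```

        <p>With this sketch, a (B, 3, T, 128, 128) RGB clip and a (B, 1, T, 128, 128) NIR clip yield features of identical shape, as required for the subsequent fusion stage.</p>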
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal Feature Fusion</title>
        <p>
          In temporal difference multi-head self-attention (TD-MHSA) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we utilize a temporal difference
convolution (TDC) operator to explicitly encode temporal dynamics into attention computations. Given
an input feature X ∈ ℝ^(C×T×H×W), the TDC operation is formally defined as:
        </p>
        <p>TDC(x) = ∑_{p_n ∈ ℛ} w(p_n) · x(p_0 + p_n) + θ · (−x(p_0) · ∑_{p_n ∈ ℛ′} w(p_n)), (1)
where p_0 denotes the current spatio-temporal location, ℛ is the local 3 × 3 × 3 receptive field, ℛ′
represents the adjacent temporal neighborhood, w denotes the convolution weights, and θ trades off the
vanilla convolution and the temporal difference term.</p>
        <p>For each attention head h, we compute the query/key/value matrices by:</p>
        <p>Q_h = 𝒯(TDC(LN(X))), K_h = 𝒯(TDC(LN(X))), V_h = 𝒯(Conv3D(LN(X))), (2)
where LN(·) is layer normalization, 𝒯(·) denotes the reshaping operation from video to sequence
format, and Q_h, K_h, V_h ∈ ℝ^((T·H·W)×d_h). This design explicitly injects temporal change patterns into
the attention mechanism through differential operators.</p>
        <p>The self-attention for each head is then computed using the scaled dot-product:</p>
        <p>SA_h = softmax(Q_h K_h^⊤ / √d_h) · V_h, (3)
where d_h = C/N_h is the dimensionality of each attention head and N_h represents the number of attention
heads.</p>
        <p>The outputs from all N_h heads are concatenated and linearly projected, then reshaped back to the
original video format via 𝒯⁻¹(·), yielding the final TD-MHSA output:</p>
        <p>TD-MHSA(X) = 𝒯⁻¹(Concat(SA_1, SA_2, . . . , SA_{N_h}) · W_O), (4)
where W_O is a learnable output projection matrix, and 𝒯⁻¹(·) is the inverse reshape operator (sequence to
video).</p>
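        <p>To make the TDC idea concrete, here is a simplified one-dimensional NumPy sketch (an illustration of the operator's key property, not the actual 3D implementation). The θ-weighted term subtracts the center value scaled by the neighbor weights, so the operator responds to temporal change rather than absolute intensity:</p>

```python
import numpy as np

def tdc_1d(x, w, theta):
    """Simplified 1-D temporal difference convolution.

    x: 1-D signal; w: length-3 kernel over (t-1, t, t+1);
    theta: blends vanilla convolution with the temporal-difference term.
    The neighborhood R' here is {t-1, t+1} (the temporal neighbors).
    """
    pad = np.pad(np.asarray(x, dtype=float), 1, mode="edge")
    out = np.empty(len(x))
    for t in range(len(x)):
        vanilla = np.dot(w, pad[t:t + 3])   # standard convolution over R
        diff = -x[t] * (w[0] + w[2])        # difference term over R'
        out[t] = vanilla + theta * diff
    return out

# A constant signal carries no temporal change: with theta = 1 the neighbor
# contributions cancel against the difference term, leaving only w[1] * x.
const = np.full(5, 2.0)
w = np.array([0.25, 0.5, 0.25])
print(tdc_1d(const, w, theta=1.0))  # → [1. 1. 1. 1. 1.]
```

        <p>With theta = 0 the operator reduces to a plain convolution; increasing theta emphasizes frame-to-frame differences, which is what makes the attention queries and keys sensitive to quasi-periodic pulse dynamics.</p>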
        <p>To enhance local feature consistency and positional awareness, a spatio-temporal feed-forward
(ST-FF) module is employed in place of standard linear layers. The module consists of a 1 × 1 × 1
pointwise convolution for channel expansion, a 3 × 3 × 3 depthwise-separable 3D convolution to model
spatio-temporal interactions, and a second 1 × 1 × 1 pointwise convolution for dimensionality reduction.
Each stage is followed by batch normalization and ELU activation to promote training stability and
non-linearity.</p>
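        <p>The ST-FF structure described above can be sketched in PyTorch as follows (the expansion ratio of 4 is an assumption; the paper does not state it):</p>

```python
import torch
import torch.nn as nn

def st_ff(channels, expansion=4):
    """Spatio-temporal feed-forward: pointwise expand -> depthwise 3D -> pointwise reduce.

    Illustrative sketch; the expansion ratio (4) is a hypothetical choice.
    """
    hidden = channels * expansion
    return nn.Sequential(
        nn.Conv3d(channels, hidden, kernel_size=1),        # 1x1x1 channel expansion
        nn.BatchNorm3d(hidden), nn.ELU(inplace=True),
        nn.Conv3d(hidden, hidden, kernel_size=3,
                  padding=1, groups=hidden),               # depthwise 3x3x3 spatio-temporal mixing
        nn.BatchNorm3d(hidden), nn.ELU(inplace=True),
        nn.Conv3d(hidden, channels, kernel_size=1),        # 1x1x1 dimensionality reduction
        nn.BatchNorm3d(channels), nn.ELU(inplace=True),
    )

ff = st_ff(32)
```

        <p>Because the module preserves the (B, C, T, H, W) shape, it can stand in for the linear feed-forward of a standard transformer block while adding local positional awareness through the depthwise convolution.</p>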
      </sec>
      <sec id="sec-2-3">
        <title>2.3. rPPG Prediction</title>
        <p>After the TDT module completes modality fusion and temporal modeling, the model employs a two-stage
upsampling process to progressively restore temporal resolution while reducing channel dimensionality
to extract key sequential features. Spatial average pooling is then applied to integrate local information
and produce a more compact temporal representation. Finally, a regression module maps the temporal
features into an rPPG signal sequence.</p>
        <p>To obtain the final HR, the fast Fourier transform (FFT) is applied to the predicted rPPG signal to
identify the dominant frequency component within the physiological range (0.7–4 Hz). The frequency
with the highest amplitude is selected as the pulse frequency peak.</p>
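        <p>The FFT-based peak selection can be sketched in a few lines of NumPy (the sampling rate and segment length below are illustrative, not values mandated by the method):</p>

```python
import numpy as np

def estimate_hr(rppg, fs, f_lo=0.7, f_hi=4.0):
    """Return HR in bpm from the dominant PSD peak within [f_lo, f_hi] Hz."""
    rppg = np.asarray(rppg, dtype=float)
    rppg = rppg - rppg.mean()                      # remove the DC component
    freqs = np.fft.rfftfreq(len(rppg), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(rppg)) ** 2           # power spectral density
    band = (freqs >= f_lo) & (freqs <= f_hi)       # physiological range
    peak_hz = freqs[band][np.argmax(psd[band])]
    return peak_hz * 60.0                          # Hz -> beats per minute

# A 10 s clip at 30 fps containing a clean 1.2 Hz pulse maps to ~72 bpm.
t = np.arange(300) / 30.0
print(estimate_hr(np.sin(2 * np.pi * 1.2 * t), fs=30))  # ≈ 72 bpm
```

        <p>Restricting the search to 0.7–4 Hz (42–240 bpm) discards low-frequency motion and high-frequency noise before the peak is picked.</p>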
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Loss Function</title>
        <p>To enable the model to effectively reconstruct the rPPG waveform and accurately estimate HR, we
design a composite loss function consisting of three components, providing supervision from both the
time and frequency domains.</p>
        <p>
          KL Divergence Loss [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: This loss provides soft supervision in the frequency domain. Given the
ground truth HR_gt, a Gaussian distribution centered at HR_gt with standard deviation σ is used as the
target distribution q over the discretized frequency bins [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]:
        </p>
        <p>q_i = (1 / √(2πσ²)) · exp(−(i − (HR_gt − HR_min))² / (2σ²)), (5)
where HR_min denotes the theoretical minimum HR. Let p̂ represent the softmax-normalized result of
the power spectral density (PSD) of the predicted rPPG signal. The KL divergence loss is defined as:</p>
        <p>ℒ_KL = ∑_i q_i · log(q_i / p̂_i). (6)</p>
        <p>
          Cross-Entropy Loss [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]: Let p̂_i represent the spectral energy at frequency bin i in the PSD of the
predicted rPPG, where j is the index of the bin closest to HR_gt. The cross-entropy loss is then:
        </p>
        <p>ℒ_CE = −log( exp(p̂_j) / ∑_{i=0}^{N−1} exp(p̂_i) ). (7)</p>
        <p>
          Negative Pearson Correlation Loss [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]: This loss optimizes temporal similarity between the
predicted signal and ground truth. Given predicted signal x and ground truth y, the loss is defined as:
        </p>
        <p>ℒ_Pearson = 1 − Cov(x, y) / (σ_x · σ_y), (8)
where Cov(x, y) represents the covariance between x and y, and σ_x, σ_y are the standard deviations of
the predicted signal and ground truth, respectively.</p>
        <p>Overall Loss: The final training objective combines the three losses with weighting coefficients:
ℒ_total = α · ℒ_Pearson + β · (ℒ_KL + ℒ_CE), (9)
where α and β control the relative importance of the temporal and frequency-domain losses.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>
          The training data comes from the VIPL-HR [
          <xref ref-type="bibr" rid="ref15">15, 16</xref>
          ] dataset, which includes paired RGB and NIR videos
from 107 subjects, along with synchronized ground truth PPG signals. It covers diverse real-world
scenarios such as talking, body movement, and varying lighting conditions.
        </p>
        <p>
          The test set consists of paired RGB-NIR clips from 100 subjects in the VIPL-V2 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] dataset and 100
subjects from the OBF [17] dataset. All videos are divided into 10 s segments for evaluation. VIPL-V2
provides diverse lighting conditions, while OBF includes subjects with varied skin tones, enabling a
comprehensive assessment of the model’s generalization ability.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation and Evaluation Metric</title>
        <p>The model proposed in this study is implemented using the PyTorch framework and trained on a
high-performance computing system equipped with eight NVIDIA GeForce RTX 4090 GPUs. Each input
sample consists of a facial video segment containing 160 frames, with each frame resized to a resolution
of 128 × 128 pixels. During training, the batch size is set to 8, and the total number of training epochs
is 50. The model is optimized using the Adam optimizer, with a weight decay coefficient of 5 × 10⁻⁵.
The initial learning rate is set to 1 × 10⁻⁴, and a StepLR learning rate scheduler is employed, which
halves the learning rate every 25 epochs.</p>
        <p>Performance is evaluated using the root mean square error (RMSE):
RMSE = √( (1/N) · ∑_{i=1}^{N} (HR_pred^i − HR_gt^i)² ), (10)
where N denotes the number of video segments, HR_gt^i represents the ground truth HR of the i-th video
segment, and HR_pred^i denotes the predicted HR of the i-th video segment.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p>As shown in Table 1, our team, IST (Nanjing University), achieved an RMSE of 12.31846 bpm in the HR
estimation task, ranking second in the final leaderboard of the 4th RePSS Challenge. Compared to the
winning team, HFUT-VUT, which achieved an RMSE of 11.89505 bpm, our result is only approximately
0.42 bpm higher, demonstrating the strong performance and competitiveness of our proposed method.</p>
        <p>Table 2 compares our method with PhysFormer under different input settings. PhysFormer achieved
an RMSE of 12.95383 bpm with RGB input, and a slightly improved 12.92008 bpm when using
concatenated RGB and NIR frames. In contrast, the proposed method, with a more effective RGB-NIR
fusion strategy, achieved a significantly lower RMSE of 12.31846 bpm, reflecting a relative improvement
of about 4.9% over the RGB-only PhysFormer. This demonstrates the importance of both modality
integration and effective fusion design to fully leverage the complementary information from RGB and
NIR inputs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper presents a multimodal HR estimation method that fuses RGB and NIR video information
using a TDT architecture. Modality-specific spatio-temporal encoders are first applied to extract robust
spatio-temporal features, providing a strong foundation for subsequent fusion. The TDT enables the
model to capture subtle physiological variations in facial regions while effectively fusing RGB and NIR
features. To further improve prediction accuracy, the model is trained with a composite loss function
combining time-domain and frequency-domain components, guiding it to focus on both dominant
frequency localization and waveform shape consistency. In the 4th RePSS Challenge, the proposed
method achieved second place on the official test set, demonstrating its effectiveness and competitiveness
in multimodal HR estimation. We believe that future improvements could lead to even better outcomes.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (62350068).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling
checking and for paraphrasing and rewording. After using this tool/service, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>McDuff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gontarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <article-title>Remote measurement of cognitive stress via heart rate variability, in: 2014 36th annual international conference of the IEEE engineering in medicine and biology society</article-title>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>2957</fpage>
          -
          <lpage>2960</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Efficient robustness assessment via adversarial spatial-temporal focus on videos</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>45</volume>
          (
          <year>2023</year>
          )
          <fpage>10898</fpage>
          -
          <lpage>10912</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.-Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          , S.-Y. Cho,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-B. Sohn</surname>
          </string-name>
          ,
          <article-title>A study of projection-based attentive spatial-temporal map for remote photoplethysmography measurement</article-title>
          ,
          <source>Bioengineering</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>638</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <article-title>Camera-based physiological measurement: Recent advances and future prospects</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>575</volume>
          (
          <year>2024</year>
          )
          <fpage>127282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Comas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sukno</surname>
          </string-name>
          ,
          <article-title>Efficient remote photoplethysmography with temporal derivative modules and time-shift invariant loss</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2182</fpage>
          -
          <lpage>2191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <article-title>Self-supervised augmented vision transformers for remote physiological measurement</article-title>
          ,
          <source>in: 2023 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>623</fpage>
          -
          <lpage>627</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.-X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.-M. Sun</surname>
            ,
            <given-names>R.-R.</given-names>
          </string-name>
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pan</surname>
          </string-name>
          , R.-S. Jia, Transphys:
          <article-title>Transformer-based unsupervised contrastive learning for remote heart rate measurement</article-title>
          ,
          <source>Biomedical Signal Processing and Control</source>
          <volume>86</volume>
          (
          <year>2023</year>
          )
          <fpage>105058</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kurihara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sugimura</surname>
          </string-name>
          , T. Hamamoto,
          <article-title>Non-contact heart rate estimation via adaptive rgb/nir signal fusion</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>6528</fpage>
          -
          <lpage>6543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Magdalena Nowara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Marks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veeraraghavan</surname>
          </string-name>
          , Sparseppg:
          <article-title>Towards driver monitoring using camera-based vital signs estimation in near-infrared</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1272</fpage>
          -
          <lpage>1281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. C.</surname>
          </string-name>
          den Brinker, G. De Haan,
          <article-title>Discriminative signatures for remote-ppg</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>67</volume>
          (
          <year>2019</year>
          )
          <fpage>1462</fpage>
          -
          <lpage>1473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Zhao,
          <article-title>Physformer: Facial video-based physiological measurement with temporal difference transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4186</fpage>
          -
          <lpage>4196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>B.-B. Gao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>C.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Geng</surname>
          </string-name>
          ,
          <article-title>Deep label distribution learning with label ambiguity</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>26</volume>
          (
          <year>2017</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2838</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, Autohr: A strong end-to-end baseline for remote heart rate measurement with neural searching</article-title>
          ,
          <source>IEEE Signal Processing Letters</source>
          <volume>27</volume>
          (
          <year>2020</year>
          )
          <fpage>1245</fpage>
          -
          <lpage>1249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Zhao,
          <article-title>Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          , H. Han,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Vipl-hr: A multi-modal database for pulse estimation from less-constrained face video</article-title>
          ,
          <source>in: Computer Vision - ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part V</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>562</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>2409</fpage>
          -
          <lpage>2423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G. Zhao,
          <article-title>The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection</article-title>
          ,
          <source>in: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>