<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UP-Phys: Exploring the Effect of Prior Knowledge in Unsupervised Remote Photoplethysmography</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yan</forename><surname>Jiang</surname></persName>
							<email>jiangyan@nuist.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Information Science and Technology</orgName>
								<address>
									<addrLine>219 Ningliu Road</addrLine>
									<postCode>210044</postCode>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mingyue</forename><surname>Cao</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Information Science and Technology</orgName>
								<address>
									<addrLine>219 Ningliu Road</addrLine>
									<postCode>210044</postCode>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hao</forename><surname>Yu</surname></persName>
							<email>yuhao@nuist.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Information Science and Technology</orgName>
								<address>
									<addrLine>219 Ningliu Road</addrLine>
									<postCode>210044</postCode>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xingyu</forename><surname>Liu</surname></persName>
							<email>xingyu@nuist.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Information Science and Technology</orgName>
								<address>
									<addrLine>219 Ningliu Road</addrLine>
									<postCode>210044</postCode>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xu</forename><surname>Cheng</surname></persName>
							<email>xcheng@nuist.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Nanjing University of Information Science and Technology</orgName>
								<address>
									<addrLine>219 Ningliu Road</addrLine>
									<postCode>210044</postCode>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UP-Phys: Exploring the Effect of Prior Knowledge in Unsupervised Remote Photoplethysmography</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AEDD171A33DB981484C057DC53F2ED41</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Remote Photoplethysmography</term>
					<term>Unsupervised Learning</term>
					<term>Prior Knowledge</term>
					<term>The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge &amp; Workshop</term>
					<term>ORCID 0009-0002-2031-5627 (Y. Jiang)</term>
					<term>0009-0005-7796-7484 (M. Cao)</term>
					<term>0000-0002-8298-7181 (H. Yu)</term>
					<term>0009-0009-6064-9104 (X. Liu)</term>
					<term>0000-0003-2355-9010 (X. Cheng)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Remote photoplethysmography (rPPG) is a non-contact method that estimates multiple physiological parameters from facial videos. Although existing supervised rPPG methods have achieved remarkable performance, their success depends largely on massive and expensive annotated data. Many unsupervised rPPG methods have emerged recently to address this issue. However, we find that existing unsupervised rPPG methods all learn from scratch. Many downstream tasks in deep learning have achieved great success using fine-tuning strategies in the past decade. Inspired by this, we explore the effect of prior knowledge in unsupervised rPPG and propose UP-Phys. Moreover, to guide the backbone to prioritize regions rich in rPPG information, we propose a plug-and-play representation augmentation module (RAM). RAM dynamically enhances salient temporal-spatial information derived from extracted features, effectively reducing the effect of noise caused by lighting, motion, etc. Experiments on two widely used rPPG datasets, UBFC-rPPG and PURE, demonstrate the superiority of our proposed method. In addition, our method achieves an RMSE of 15.79 in the 3rd RePSS.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Remote photoplethysmography (rPPG) estimates multiple physiological parameters that are important for healthcare, including heart rate (HR), respiration frequency (RF), and heart rate variability (HRV), from videos captured by cameras <ref type="bibr" target="#b0">[1]</ref>. Compared with traditional HR estimation approaches such as electrocardiogram (ECG) <ref type="bibr" target="#b1">[2]</ref> and photoplethysmography (PPG) <ref type="bibr" target="#b2">[3]</ref>, which require skin contact with subjects, rPPG is non-contact and thus avoids the discomfort and skin irritation caused by contact sensors. As a result, rPPG has been intensively researched in recent years and plays an increasingly pivotal role in remote healthcare <ref type="bibr" target="#b0">[1]</ref>, affective computing <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>, spoof detection <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, etc.</p><p>Existing rPPG methods <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref> have achieved remarkable performance with deep learning. However, this success stems mainly from supervised learning over massive human-labeled data. In fact, the process of collecting and annotating such data is prohibitively expensive, requiring not only subjects equipped with contact PPG or ECG sensors but also careful consideration of environmental factors such as lighting changes, motion, and gestures while capturing data. In addition, the performance of existing supervised rPPG methods is positively correlated with the scale of annotated data available, so they cannot exploit unlabeled data, which limits their applicability in real scenarios. Fortunately, some unsupervised rPPG methods have been proposed recently to address the cost of rPPG data annotation.</p><p>Existing unsupervised rPPG methods <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref> can be roughly divided into two categories: contrastive and non-contrastive. In the former category, Sun et al. <ref type="bibr" target="#b12">[13]</ref> pioneered the introduction of contrastive learning into unsupervised rPPG with their proposal of Contrast-Phys. This method is based on four key observations: spatial similarity in rPPG signals, temporal similarity in rPPG signals, dissimilarity in rPPG signals across different videos, and the HR range constraint. Crucially, Contrast-Phys eliminates the reliance on annotated data and achieves state-of-the-art performance on publicly available academic datasets. In the latter category, Speth et al. <ref type="bibr" target="#b13">[14]</ref> extended the contrastive line of unsupervised research to the non-contrastive setting and proposed SiNC, which learns by discovering periodic signals in video data. 
SiNC argues that periodicity alone suffices for learning the minuscule visual features corresponding to the blood volume pulse from unlabeled face videos, which brings new inspiration to the rPPG community.</p><p>Despite this encouraging progress, the aforementioned unsupervised rPPG methods all learn from scratch, as shown in Fig. <ref type="figure" target="#fig_0">1 (a)</ref>. This training strategy may introduce potential issues such as limited generalization, overfitting, and reliance on the scale of data. Moreover, because the network lacks effective supervision from labels, the quality of its predicted rPPG signals has become a pivotal obstacle to raising the performance ceiling of unsupervised rPPG. During the past decade, many downstream tasks in computer vision adopted the fine-tuning strategy <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref> and achieved significant success. This success is attributed to the prior knowledge acquired through pre-training, which enables the network to adapt to various datasets more efficiently and attain superior performance. Inspired by this, in this paper, we explore the effect of prior knowledge in unsupervised rPPG and propose UP-Phys, as shown in Fig. <ref type="figure" target="#fig_0">1 (b)</ref>. Specifically, we take Contrast-Phys pre-trained on the MMSE-HR <ref type="bibr" target="#b18">[19]</ref> dataset and fine-tune it on other datasets. Compared with the official training protocol of 30 epochs, our UP-Phys undergoes only 1 epoch of fine-tuning, resulting in significant time savings during training. Furthermore, we design a plug-and-play representation augmentation module (RAM) that dynamically enhances salient temporal-spatial information derived from extracted features. 
This augmentation enables the network to prioritize regions abundant in rPPG information, consequently reducing the effect of noise caused by lighting, motion, etc. The main contributions of this paper can be summarized as follows:</p><p>• We introduce a novel solution for unsupervised rPPG, termed UP-Phys, which leverages prior knowledge to notably reduce training time.</p><p>• We design a plug-and-play representation augmentation module (RAM) that dynamically enhances salient temporal-spatial information derived from extracted features for unsupervised rPPG.</p><p>• Experiments on the PURE and UBFC-rPPG datasets demonstrate that our UP-Phys significantly outperforms existing unsupervised rPPG methods and even surpasses some supervised counterparts. In addition, UP-Phys achieves an RMSE of 15.79 in the 3rd RePSS.</p><p>Figure 2: The pipeline of the proposed UP-Phys. RAM is our proposed Representation Augmentation Module. Tensor shapes labeled in the diagram: [C, T, H, W], [C, T, H, 1], [C, T, 1, W], and [C, T, H+W, 1].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>The overview of our proposed UP-Phys is shown in Fig. <ref type="figure">2</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Prior Knowledge</head><p>Over the past decade, deep learning has achieved significant success, with many downstream tasks showing impressive results through fine-tuning pre-trained weights. Inspired by this, we introduce the fine-tuning strategy into unsupervised rPPG, since existing methods learn from scratch. Specifically, we take Contrast-Phys pre-trained on the MMSE-HR dataset and fine-tune it for just one epoch on the UBFC-rPPG dataset to investigate the impact of prior knowledge, as shown in Tab. 1.</p><p>With 25 pre-training videos (index 2), the MAE decreases by 0.14 (from 0.64 to 0.50), but the RMSE increases by 0.47 (from 1.00 to 1.47). This indicates that prior knowledge helps reduce the average error, while the worse RMSE suggests that this small amount of prior knowledge still leaves some large errors. When we increase the number of pre-training videos to 50 (index 3), both metrics improve over the baseline, with an MAE of 0.42 and an RMSE of 0.96. Pre-training with 100 videos (index 4) performs best, with the lowest MAE of 0.33 and RMSE of 0.65. This indicates that more prior knowledge enhances the model's prediction accuracy and consistency. In summary, these results show a clear trend: as the number of pre-training videos increases, the MAE decreases steadily, and with 50 or more videos the RMSE falls below the baseline as well. This underlines the benefit of leveraging prior knowledge through pre-training for the performance of rPPG models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Representation Augmentation Module</head><p>Existing unsupervised rPPG methods mainly design novel training strategies to achieve robust training without annotated data. The quality of the rPPG signals predicted by these methods relies heavily on the features extracted by the backbone. These methods rely solely on 3DCNN and cannot accurately focus on regions with rich rPPG signals under complex conditions such as head movement and lighting changes, which makes it difficult to improve performance. Therefore, we propose a plug-and-play representation augmentation module (RAM) that dynamically enhances salient temporal-spatial information, helping the backbone focus on regions rich in rPPG information.</p><p>Specifically, given the input features $F \in \mathbb{R}^{C \times T \times H \times W}$, we first apply 3D AdaptiveMaxPool to extract the most salient rPPG knowledge in both the horizontal and vertical directions. Subsequently, we use a softmax function to transform this rPPG knowledge into a distribution ranging from 0 to 1. The two distributions are then multiplied to create the augmentation mask. Finally, this augmentation mask is added to the input features to enhance the rPPG information. It is written as follows:</p><formula xml:id="formula_1">\tilde{F} = F + \mathrm{Softmax}(\mathrm{AMP}_x(F)) \otimes \mathrm{Softmax}(\mathrm{AMP}_y(F)),<label>(1)</label></formula><p>where $\mathrm{AMP}_x$ and $\mathrm{AMP}_y$ denote the 3D AdaptiveMaxPool with pooling kernels $(T, H, 1)$ and $(T, 1, W)$, respectively, and $\otimes$ is the multiplication operation.</p><p>After that, the augmented features $\tilde{F}$ are processed by 3D AdaptiveAvgPool to obtain the directional rPPG knowledge. Then, we concatenate the two directional features along the spatial dimension to investigate the spatial rPPG information. 
In addition, a basic 3D convolutional block is employed to discover shared rPPG information and reduce the channel dimension, which can be expressed as:</p><formula xml:id="formula_2">\hat{F} = \mathrm{Conv}(\mathrm{Cat}(\mathrm{AAP}_x(\tilde{F}), \mathrm{AAP}_y(\tilde{F}))),<label>(2)</label></formula><p>where $\mathrm{AAP}_x$ and $\mathrm{AAP}_y$ denote the 3D AdaptiveAvgPool with pooling kernels $(T, H, 1)$ and $(T, 1, W)$, respectively. $\mathrm{Cat}(\cdot, \cdot)$ denotes concatenation along the height dimension. $\mathrm{Conv}$ denotes the basic 3D convolutional block consisting of a pointwise convolution, batch normalization, and ELU activation. Further, we split $\hat{F}$ along the spatial dimension to obtain $\hat{F}_h$ and $\hat{F}_w$. A pointwise convolution is applied to each of $\hat{F}_h$ and $\hat{F}_w$ to restore the channel dimension. Then, sigmoid normalization and multiplication are employed to generate a mask that discriminates rPPG information. Finally, the mask is multiplied element-wise with the input features to augment them, thereby guiding the backbone to concentrate on the regions rich in rPPG information:</p><formula xml:id="formula_3">F' = [\sigma(P_{1 \times 1}(\hat{F}_h)) \otimes \sigma(P_{1 \times 1}(\hat{F}_w))] \odot F,<label>(3)</label></formula><p>where $F' \in \mathbb{R}^{C \times T \times H \times W}$ denotes the augmented features, $\sigma$ denotes the sigmoid function, $P_{1 \times 1}$ denotes the pointwise convolution, and $\odot$ is the element-wise multiplication.</p></div>
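<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the additive mask in Eq. (1), the following pure-Python example processes a single channel and a single frame, i.e. one H × W slice. It is an illustration under stated assumptions rather than the authors' implementation: the names softmax and augment_slice are hypothetical, the pooling kernels are read as pooled output sizes (one maximum per row and one per column), and the softmax axis, which the text does not specify, is assumed to be the remaining spatial axis of each pooled profile.</p>
<p>
```python
import math


def softmax(values):
    """Numerically stable softmax over a 1-D list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def augment_slice(feature):
    """Apply the additive mask of Eq. (1) to one H x W feature slice.

    Assumption: each softmax runs along the remaining spatial axis of its
    pooled profile; the source text does not state the axis.
    """
    height, width = len(feature), len(feature[0])
    # Max over the width leaves one value per row (an H x 1 profile).
    row_profile = [max(row) for row in feature]
    # Max over the height leaves one value per column (a 1 x W profile).
    col_profile = [max(feature[h][w] for h in range(height)) for w in range(width)]
    row_weights = softmax(row_profile)
    col_weights = softmax(col_profile)
    # The outer product of both profiles is the H x W augmentation mask,
    # which is added to the input features.
    return [
        [feature[h][w] + row_weights[h] * col_weights[w] for w in range(width)]
        for h in range(height)
    ]


toy_slice = [[0.0, 1.0], [2.0, 3.0]]  # illustrative values, not from the paper
augmented = augment_slice(toy_slice)
```
</p>
<p>Under these assumptions each softmax sums to one, so the entries of the mask also sum to one, and the largest increment falls where a strongly responding row meets a strongly responding column.</p></div>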
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Setup and Evaluation Protocol</head><p>Datasets. We evaluate the proposed method on the two widely used rPPG datasets UBFC-rPPG <ref type="bibr" target="#b19">[20]</ref> and PURE <ref type="bibr" target="#b20">[21]</ref>. In addition, we pre-train our method on the MMSE-HR <ref type="bibr" target="#b18">[19]</ref> dataset. UBFC-rPPG contains 42 videos in which subjects' heart rates are varied by having them play mathematical games. Each video is recorded at 30 frames per second (fps), has a resolution of 640×480, and runs for approximately one minute. Ground truth data is collected synchronously using a CMS50E pulse oximeter at a sampling rate of 30 Hz. PURE records videos of 10 subjects across 6 different scenarios, including those with head movements. Each video lasts one minute, is captured at 30 fps, and has a resolution of 640×480. The ground-truth blood volume pulse (BVP) signal is recorded using a fingertip pulse oximeter at 60 Hz. MMSE-HR contains 102 videos from 40 subjects. Each video is recorded at 25 fps, and emotions are elicited from the subjects so that their heart rates change. Physiological data were collected by the Biopac MP150 data acquisition system at 1 kHz.</p><p>Evaluation Protocol. Following previous works <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, we adopt mean absolute error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (R) as the evaluation metrics.</p><p>Experimental Setup. We implement our UP-Phys in the PyTorch framework with two RTX 2080Ti GPUs. Contrast-Phys <ref type="bibr" target="#b12">[13]</ref> is used as our baseline. The proposed RAM is added after encoder 1 and encoder 2 of the backbone. We first pre-train our UP-Phys model on the MMSE-HR dataset, using the AdamW optimizer with a learning rate of $10^{-5}$ for 30 epochs. 
Subsequently, we fine-tune UP-Phys for only 1 epoch on the dataset to be evaluated. All other settings are kept consistent with those of Contrast-Phys.</p><p>RePSS Setup. We first pre-train our UP-Phys on 209 videos collected from the MMSE-HR and VIPL-HR <ref type="bibr" target="#b21">[22]</ref> datasets. Subsequently, we fine-tune our method on the UBFC-rPPG and PURE datasets for 1 epoch. We finally achieve an RMSE of 15.79 on the 3rd RePSS.</p></div>
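<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three metrics named in the evaluation protocol can be sketched in plain Python as follows. This is a minimal illustration using the standard definitions, not the evaluation code of the paper: the function names mae, rmse, and pearson_r are hypothetical, and the heart-rate values are made up for demonstration.</p>
<p>
```python
import math


def mae(pred, truth):
    """Mean absolute error between predicted and reference heart rates."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def rmse(pred, truth):
    """Root mean squared error; squaring weights large errors more heavily."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def pearson_r(pred, truth):
    """Pearson correlation coefficient between predictions and references."""
    mean_p = sum(pred) / len(pred)
    mean_t = sum(truth) / len(truth)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


# Illustrative heart rates in beats per minute (not data from the paper).
hr_pred = [70.0, 80.0, 90.0]
hr_true = [72.0, 80.0, 87.0]
print(mae(hr_pred, hr_true), rmse(hr_pred, hr_true), pearson_r(hr_pred, hr_true))
```
</p>
<p>Because RMSE squares each error before averaging, a few large errors raise it more than they raise MAE, so the two metrics can move in opposite directions on the same set of predictions.</p></div>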
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Intra-Dataset Testing</head><p>We compare against three representative supervised methods and three unsupervised methods.</p><p>Comparison with Unsupervised Methods. As reported in Tab. 2, our method surpasses the current leading unsupervised methods. More precisely, our UP-Phys achieves MAEs of 0.18 and 0.48 on the UBFC-rPPG and PURE datasets, respectively, reducing the MAE of SiNC <ref type="bibr" target="#b13">[14]</ref> by 0.41 and 0.13 on these two datasets. Note that although our UP-Phys is based on Contrast-Phys <ref type="bibr" target="#b12">[13]</ref>, it significantly outperforms Contrast-Phys. We attribute this to the pivotal role of prior knowledge and to the ability of UP-Phys to focus on regions abundant in rPPG information, which demonstrates the effectiveness of our proposed method.</p><p>Comparison with Supervised Methods. Supervised methods such as Dual-GAN <ref type="bibr" target="#b10">[11]</ref> perform well on both datasets, achieving in particular an MAE of 0.44 and an RMSE of 0.67 on UBFC-rPPG. This can be attributed to the ability of supervised methods to use the labels in the dataset for training, which helps the model learn accurate heart rate estimation patterns. However, even without label information, our proposed UP-Phys surpasses Dual-GAN, with an MAE of 0.18 and an RMSE of 0.45 on UBFC-rPPG. The performance of our method benefits from the design of the prior knowledge. Our method shows the potential of unsupervised rPPG methods, and we believe this design can bring new insights to the rPPG community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Ablation Study</head><p>To evaluate the contribution of each designed component, we conduct an ablation experiment on the UBFC-rPPG dataset, as shown in Tab. 3. The baseline in index 1 denotes that we directly train Contrast-Phys <ref type="bibr" target="#b12">[13]</ref>. The baseline reaches an MAE of 0.64 and an RMSE of 1.00, showing its limited capability to predict accurate HR.</p><p>Effectiveness of RAM. As shown in index 2, adding only the RAM slightly decreases the MAE to 0.58 but increases the RMSE to 1.50, indicating that the RAM improves the prediction accuracy of the model on some samples but introduces large errors on others. With the help of prior knowledge, as shown in index 4, the MAE further decreases to 0.18 and the RMSE to 0.45, the best performance in the table. This indicates that prior knowledge helps RAM significantly reduce prediction errors.</p><p>Effectiveness of Prior Knowledge. As shown in index 3, pre-training alone brings a significant improvement. Specifically, the MAE drops from 0.64 to 0.33 and the RMSE drops from 1.00 to 0.65. This accuracy already surpasses existing unsupervised rPPG methods, showing the effectiveness of prior knowledge.</p><p>Overall, the above observations and analysis demonstrate the effectiveness of our proposed components.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>This paper introduces a novel unsupervised method termed UP-Phys that leverages prior knowledge to notably reduce training time and improve HR estimation accuracy. Furthermore, we design a plug-and-play representation augmentation module (RAM) that dynamically enhances salient temporal-spatial information derived from extracted features. This augmentation enables the network to prioritize regions abundant in rPPG information, consequently reducing the effect of noise caused by lighting, motion, etc. Experiments on the PURE and UBFC-rPPG datasets demonstrate the effectiveness of our method. In addition, our method achieves an RMSE of 15.79 in the 3rd RePSS.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Motivation of our proposed method. Existing methods adopt the learn-from-scratch strategy, which may introduce potential issues such as limited generalization, overfitting, reliance on the scale of data, etc. To solve this issue, our method adopts a fine-tuning strategy that introduces prior knowledge in rPPG, enhancing the robustness and efficacy of the learning process.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Experiments on different prior knowledge. V denotes the number of pre-training videos.</figDesc><table><row><cell cols="3"></cell><cell cols="3">UBFC-rPPG</cell></row><row><cell>Index</cell><cell>V</cell><cell>Pretrain</cell><cell>MAE ↓</cell><cell>RMSE ↓</cell><cell>R ↑</cell></row><row><cell>1</cell><cell>0</cell><cell>✗</cell><cell>0.64</cell><cell>1.00</cell><cell>0.99</cell></row><row><cell>2</cell><cell>25</cell><cell>✓</cell><cell>0.50</cell><cell>1.47</cell><cell>0.99</cell></row><row><cell>3</cell><cell>50</cell><cell>✓</cell><cell>0.42</cell><cell>0.96</cell><cell>0.99</cell></row><row><cell>4</cell><cell>100</cell><cell>✓</cell><cell>0.33</cell><cell>0.65</cell><cell>0.99</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Preprocessing</head><p>To reduce background noise and interference from irrelevant areas, we adopt the OpenFace toolkit to preprocess the video. Specifically, we begin by determining the minimum and maximum horizontal and vertical coordinates of the generated landmarks to pinpoint the central facial point for each frame. The size of the bounding box is set to 1.2 times the range of the vertical coordinates of the landmarks from the first frame, and this size remains constant for subsequent frames. Then, we crop the face from each frame according to its central facial point and the bounding box, and resize it to 128 × 128. To minimize I/O overhead during training, we convert the video files into the Hierarchical Data Format (HDF5).</p></div>
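<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bounding-box rule of the preprocessing procedure (Section 2.1) can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function name face_boxes is hypothetical, the central facial point is assumed to be the midpoint of the landmark extremes, the box is assumed to be square, and landmark detection with OpenFace, cropping, and resizing to 128 × 128 are left out.</p>
<p>
```python
def face_boxes(landmarks_per_frame, scale=1.2):
    """Return one crop box (left, top, right, bottom) per frame.

    landmarks_per_frame: list of frames, each a list of (x, y) landmarks.
    Assumptions: square box, center at the midpoint of landmark extremes.
    """
    first_ys = [y for _, y in landmarks_per_frame[0]]
    # The box size is fixed once, from the first frame only:
    # scale (1.2 in the text) times the vertical range of the landmarks.
    size = scale * (max(first_ys) - min(first_ys))
    half = size / 2.0
    boxes = []
    for points in landmarks_per_frame:
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        # The center is recomputed for every frame from that frame's landmarks.
        cx = (min(xs) + max(xs)) / 2.0
        cy = (min(ys) + max(ys)) / 2.0
        boxes.append((cx - half, cy - half, cx + half, cy + half))
    return boxes


# Two toy frames with two landmarks each (illustrative coordinates only).
frames = [[(10, 20), (30, 60)], [(20, 30), (40, 70)]]
boxes = face_boxes(frames)
```
</p>
<p>Keeping the size fixed while the center follows the face means every crop has the same dimensions before resizing, so the face scale stays consistent across frames.</p></div>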
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Intra-dataset HR results. The best results are in bold. The Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson Correlation Coefficient (R) are reported.</figDesc><table><row><cell cols="2"></cell><cell cols="3">UBFC-rPPG</cell><cell cols="3">PURE</cell></row><row><cell>Method Types</cell><cell>Methods</cell><cell>MAE ↓</cell><cell>RMSE ↓</cell><cell>R ↑</cell><cell>MAE ↓</cell><cell>RMSE ↓</cell><cell>R ↑</cell></row><row><cell>Supervised</cell><cell>PhysNet [9]</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>2.10</cell><cell>2.60</cell><cell>0.99</cell></row><row><cell></cell><cell>PulseGAN [10]</cell><cell>1.19</cell><cell>2.10</cell><cell>0.98</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>Dual-GAN [11]</cell><cell>0.44</cell><cell>0.67</cell><cell>0.99</cell><cell>0.82</cell><cell>1.31</cell><cell>0.99</cell></row><row><cell>Unsupervised</cell><cell>Gideon2021 [12]</cell><cell>1.85</cell><cell>4.28</cell><cell>0.93</cell><cell>2.30</cell><cell>2.90</cell><cell>0.99</cell></row><row><cell></cell><cell>Contrast-Phys [13]</cell><cell>0.64</cell><cell>1.00</cell><cell>0.99</cell><cell>1.00</cell><cell>1.40</cell><cell>0.99</cell></row><row><cell></cell><cell>SiNC [14]</cell><cell>0.59</cell><cell>1.83</cell><cell>0.99</cell><cell>0.61</cell><cell>1.84</cell><cell>1.00</cell></row><row><cell></cell><cell>UP-Phys (Ours)</cell><cell>0.18</cell><cell>0.45</cell><cell>0.99</cell><cell>0.48</cell><cell>0.69</cell><cell>1.00</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Ablation studies for different components of the proposed UP-Phys on UBFC-rPPG. RAM denotes the proposed representation augmentation module. MAE, RMSE, and R are reported.</figDesc><table><row><cell>Index</cell><cell>Pretrain</cell><cell>RAM</cell><cell>MAE ↓</cell><cell>RMSE ↓</cell><cell>R ↑</cell></row><row><cell>1</cell><cell>✗</cell><cell>✗</cell><cell>0.64</cell><cell>1.00</cell><cell>0.99</cell></row><row><cell>2</cell><cell>✗</cell><cell>✓</cell><cell>0.58</cell><cell>1.50</cell><cell>0.99</cell></row><row><cell>3</cell><cell>✓</cell><cell>✗</cell><cell>0.33</cell><cell>0.65</cell><cell>0.99</cell></row><row><cell>4</cell><cell>✓</cell><cell>✓</cell><cell>0.18</cell><cell>0.45</cell><cell>0.99</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Acknowledgements</head><p>This research is funded in part by the National Natural Science Foundation of China (Grant No. 61802058, 61911530397), in part by the open Project Program of the State Key Laboratory of CAD&amp;CG, Zhejiang University (under Grant A2318), and in part by the Postgraduate Research &amp; Practice Innovation Program of Jiangsu Province (Grant No. KYCX24_1514).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Atrial fibrillation detection from face videos by fusing subtle variations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Alikhani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Seppänen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Circuits and Systems for Video Technology</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="2781" to="2795" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">iPhys: An open non-contact imaging-based physiological measurement toolbox</title>
		<author>
			<persName><forename type="first">D</forename><surname>McDuff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Blackford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 41st annual international conference of the IEEE engineering in medicine and biology society (EMBC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="6521" to="6524" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Photoplethysmography and its application in clinical physiological measurement</title>
		<author>
			<persName><forename type="first">J</forename><surname>Allen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Physiological measurement</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page">R1</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Facial-video-based physiological signal measurement: Recent advances and affective applications</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="50" to="58" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">UBFC-Phys: A multimodal database for psychophysiological studies of social stress</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Sabour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Benezeth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">De</forename><surname>Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chappe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="622" to="636" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">PATRON: Exploring respiratory signal derived from non-contact face videos for face anti-spoofing</title>
		<author>
			<persName><forename type="first">L</forename><surname>Birla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">187</biblScope>
			<biblScope unit="page">115883</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SUNRISE: Improving 3D mask face anti-spoofing for short videos using pre-emptive split and merge</title>
		<author>
			<persName><forename type="first">L</forename><surname>Birla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Dependable and Secure Computing</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">DeepPhys: Video-based physiological measurement using convolutional attention networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>McDuff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<meeting>the European Conference on Computer Vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="349" to="365" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.02419</idno>
		<title level="m">Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">PulseGAN: Learning to generate realistic pulse waveforms in remote photoplethysmography</title>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Biomedical and Health Informatics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="1373" to="1384" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Dual-GAN: Joint BVP and noise modeling for remote physiological measurement</title>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="12404" to="12413" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gideon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Stent</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3995" to="4004" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Contrast-Phys: Unsupervised video-based remote physiological measurement via spatiotemporal contrast</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="492" to="510" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Non-contrastive unsupervised learning of physiological signals from video</title>
		<author>
			<persName><forename type="first">J</forename><surname>Speth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vance</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Flynn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Czajka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="14464" to="14474" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">ST-Phys: Unsupervised spatio-temporal contrastive remote physiological measurement</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Biomedical and Health Informatics</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">TOPLight: Lightweight neural networks with task-oriented pretraining for visible-infrared recognition</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3541" to="3550" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Modality unifying network for visible-infrared person re-identification</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="11185" to="11195" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Differentiable auxiliary learning for sketch re-identification</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="3747" to="3755" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multimodal spontaneous emotion corpus for human behavior analysis</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Girard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ciftci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Canavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Horowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="3438" to="3446" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Unsupervised skin tissue segmentation for remote photoplethysmography</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bobbia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Macwan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Benezeth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dubois</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">124</biblScope>
			<biblScope unit="page" from="82" to="90" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Non-contact video-based pulse rate measurement on a mobile service robot</title>
		<author>
			<persName><forename type="first">R</forename><surname>Stricker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-M</forename><surname>Gross</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 23rd IEEE International Symposium on Robot and Human Interactive Communication</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1056" to="1062" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Image Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="2409" to="2423" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
