Ensemble deep learning for blood pressure estimation using facial videos

Wei Liu1, Bingjie Wu1, Menghan Zhou1, Xingjian Zheng1, Xingyao Wang1, Yiping Xie2, Chaoqi Luo3 and Liangli Zhen1,∗

1 Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore
2 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
3 School of Electrical Engineering, Southwest Jiaotong University, Chengdu, China

Abstract
Blood pressure (BP) estimation is a standard and critical component of routine health assessment, especially for patients with cardiac disease. Traditional methods typically require direct contact with the patient, which can cause discomfort and inconvenience. Remote photoplethysmography (rPPG), which enables non-contact measurement of the blood volume pulse from subtle cues in facial videos, has drawn attention for measuring vital signs. This paper presents an ensemble deep learning approach for estimating BP remotely from facial videos. Specifically, to address the vulnerabilities and biases in deep learning models for BP measurement, we emphasize both the accuracy of individual models and the diversity within the ensemble. We utilize advanced deep learning architectures to construct several regression models incorporating convolutional neural networks and transformer blocks, which learn the spatiotemporal relationships between different frames and locations. These trained models are then combined to produce BP readings. Additionally, to enhance the system's robustness under varying lighting conditions, data augmentation techniques are employed to generate more training data. The proposed method is tested on an unseen dataset, where it achieves an average root mean squared error (RMSE) of 12.95 mmHg, ranking 1st in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge.

Keywords
Blood pressure measurement, remote photoplethysmography, deep learning, ensemble learning

The 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge & Workshop, August 03–09, 2024, Jeju, South Korea
∗ Corresponding author: Liangli Zhen (email: llzhen@outlook.com)
liuw2@ihpc.a-star.edu.sg (W. Liu); wu_bingjie@ihpc.a-star.edu.sg (B. Wu); zhou_menghan@ihpc.a-star.edu.sg (M. Zhou); zheng_xingjian@ihpc.a-star.edu.sg (X. Zheng); wang_xingyao@ihpc.a-star.edu.sg (X. Wang); yipingx1123@gmail.com (Y. Xie); chaoqiluo7@gmail.com (C. Luo); llzhen@outlook.com (L. Zhen)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Blood pressure (BP) measurement is a fundamental diagnostic tool in medical practice, serving as a crucial indicator of cardiovascular health. For instance, elevated BP, or hypertension, is a significant risk factor for cardiovascular diseases, including stroke, heart attack, and renal failure, making accurate and timely measurement vital for early detection and management [1]. The gold standard for continuous BP monitoring is invasive arterial pressure monitoring, which is mainly adopted in critical care [2]. In addition, traditional noninvasive BP measurement methods rely on cuffs, which cause discomfort for patients receiving long-term or frequent monitoring and discourage real-time measurement [3]. Currently, cuffless BP monitoring methods are being explored for real-time measurement, providing convenience and comfort. There are two main approaches: pulse transit time (PTT) and pulse wave analysis (PWA) [4]. PTT requires two simultaneous physiological signals to calculate, such as electrocardiography (ECG), phonocardiography (PCG), and seismocardiography (SCG). Compared to PTT, PWA extracts features exclusively from PPG to estimate BP.
In recent years, machine learning and deep learning have also been employed to establish mapping relations between PPG and BP [5, 6, 7]. Note that these methods are contact-based and require specific devices, such as smart watches, to make the measurement. Over the past years, remote PPG (rPPG) techniques have been developed for vital sign measurement, especially for heart rate (HR) estimation [8, 9, 10, 11]. Compared to contact PPG, rPPG-based methods are contactless and can work with digital cameras, which are easily accessible nowadays. Beyond HR estimation, rPPG techniques have also been applied to BP estimation from facial videos [12, 13]. While rPPG provides a convenient and cost-effective method for BP estimation, its accuracy can be easily affected by factors such as lighting conditions, skin tones, and motion blur, making rPPG-based BP measurement extremely challenging. This paper proposes to achieve rPPG-based BP measurement with ensemble deep learning using facial videos. Specifically, to address the vulnerabilities and biases in deep learning models for BP measurement, we prioritize both the accuracy of individual models and the diversity within the ensemble. We construct individual regression models by adding a regression head to CNN- and transformer-based backbones. For training each model, we use not only the original RGB images but also features obtained by transforming the color space from RGB to YUV. To enhance the models' performance under varying lighting conditions, data augmentation techniques are employed. Finally, an aggregator is used to combine the outputs from these individual models.

2. Related Work

2.1. Invasive BP Monitoring

Invasive BP monitoring can provide continuous and accurate measurements and is therefore essential in certain clinical settings, particularly for patients under critical care or during surgery [14, 2]. This method involves the insertion of a catheter into a suitable artery, commonly the radial or femoral artery [15].
The catheter is connected to a pressure transducer, which converts the mechanical pressure exerted by the blood into an electrical signal that can be continuously displayed and monitored. In general, invasive methods provide accurate and continuous monitoring of BP but are used only in certain circumstances due to the significant discomfort they cause patients.

2.2. Cuff-based BP Estimation

Cuff-based BP measurement is the most common non-invasive method used in both clinical and home settings to assess arterial blood pressure (ABP) [16, 3]. This technique utilizes a sphygmomanometer, which includes a cuff that is wrapped around the upper arm and inflated to constrict blood flow. As the cuff deflates, measurements are taken either manually by auscultation (listening to the Korotkoff sounds through a stethoscope) or automatically by oscillometric monitors that detect blood flow vibrations [17]. Cuff-based methods provide the convenience of quick and easy readings and have been extensively validated for clinical use. However, they impose mild discomfort on patients, and their accuracy can be easily affected by factors such as cuff size, arm position, and patient movement.

2.3. PPG-based BP Estimation

PPG-based BP estimation has become more widely used with the emergence of deep learning algorithms and PPG sensors that can be placed on the finger, earlobe, or wrist [18]. Variations in light absorption during the cardiac cycle are measured, providing information about blood flow, heart rate, and other cardiovascular attributes. By analyzing these variations, algorithms can estimate systolic and diastolic BP values [6, 19]. PPG-based methods offer ease of use, the potential for continuous monitoring, and freedom from the discomfort of cuff-based methods. However, their accuracy is sensitive to motion artifacts and changes in sensor placement.

2.4. rPPG-based BP Estimation

Recently, rPPG-based methods have offered a non-contact way to estimate BP by using video cameras to detect blood volume changes in facial skin [20]. This technology, which can be implemented with standard RGB cameras found in common devices such as smartphones and tablets, captures subtle changes in light reflected off the skin due to pulsating blood flow [21, 22, 23]. rPPG-based methods are non-invasive and use widely accessible cameras, making them potentially cost-effective and convenient for regular BP checks. However, their accuracy can be compromised by factors such as motion and variable lighting conditions, posing challenges for use in dynamic or uncontrolled environments.

3. Methodology

The overall framework of our ensemble deep learning method is illustrated in Fig. 1, which shows multiple regression models. To introduce diversity, the models are trained using different input feature vectors, backbones, or random seeds. The outputs of the individual models are then fused with an aggregator.

3.1. Data Preprocessing

A short clip is extracted from the original full video and then partitioned into frames. It is worth pointing out that we select the clip closest to the time when BP is measured to mitigate the impact of BP fluctuation during video recording. If the video is recorded before the BP measurement, the last part of the video is selected, and vice versa for videos taken after the BP measurement. The face region of each frame is then cropped and resized to 128 × 128. To improve model performance under different lighting conditions, data augmentation techniques are applied during the training process. It has been demonstrated in [11, 24] that alternative color spaces derived from RGB videos are beneficial for better representation of the HR signal.
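The clip-selection rule above can be sketched as follows; the function name `select_clip` and its signature are illustrative assumptions, not part of the paper's released code:

```python
def select_clip(num_frames, clip_len, recorded_before_bp):
    """Return (start, end) frame indices of the clip closest in time
    to the BP measurement.

    If the video was recorded before the BP measurement, the last
    `clip_len` frames are selected; if recorded after, the first
    `clip_len` frames are selected.
    """
    if clip_len >= num_frames:  # video shorter than the desired clip
        return 0, num_frames
    if recorded_before_bp:
        return num_frames - clip_len, num_frames
    return 0, clip_len
```

For example, for a 30 s video at 30 FPS recorded before the BP reading, `select_clip(900, 300, True)` picks the final 10 s of frames.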
Other than the original RGB images, we also explored using the YUV color space for BP estimation.

Figure 1: Overall framework of the proposed method. SBP: systolic BP. DBP: diastolic BP.

Mathematically, the transformation from RGB to YUV can be calculated as

\begin{bmatrix} Y \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.5 \\ 0.5 & -0.419 & -0.081 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix} \qquad (1)

where 𝑅, 𝐺, and 𝐵 represent the red, green, and blue color components of an image, respectively. 𝑌 represents the luminance component, while 𝑈 and 𝑉 represent the chrominance components, capturing the color information minus the brightness.

3.2. Network Structure

3.2.1. Backbones

We utilize two state-of-the-art models as backbones for our BP estimation model: a 3D CNN model named PhysNet [25] and a transformer-based model named PhysFormer [8]. The output of both backbones is an estimated PPG signal, which has been used to recover ABP [6, 26, 27]. Therefore, we keep all the layers of the backbones so that the backbone output remains a PPG signal: a 1D signal with the same length as the number of input frames. The details of the backbones can be found in [25, 8].

3.2.2. Regression head

We stack a regression head with one hidden layer on top of the backbone; the regression head has two output nodes corresponding to SBP and DBP, respectively. The regression head can be formulated as

h = \sigma(W^{(1)} x + b^{(1)}), \qquad y = W^{(2)} h + b^{(2)} \qquad (2)

where 𝜎 is the standard sigmoid function, W and b are the weights and biases, respectively, x is the output signal from the backbone, h denotes the vector at the hidden layer, and y denotes the output vector consisting of DBP and SBP.
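The regression head in Eq. (2) is a one-hidden-layer MLP applied to the backbone's PPG output. A minimal NumPy sketch follows; the function name and the dimensions (a 160-frame input and 64 hidden units) are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid used in Eq. (2)
    return 1.0 / (1.0 + np.exp(-z))

def regression_head(x, W1, b1, W2, b2):
    """Eq. (2): h = sigmoid(W1 x + b1), y = W2 h + b2.
    x is the 1D PPG signal from the backbone; y has two entries
    corresponding to SBP and DBP."""
    h = sigmoid(W1 @ x + b1)
    return W2 @ h + b2

# Illustrative shapes: 160-frame PPG input, 64 hidden units, 2 outputs
rng = np.random.default_rng(0)
T, H = 160, 64
W1, b1 = rng.normal(size=(H, T)), np.zeros(H)
W2, b2 = rng.normal(size=(2, H)), np.zeros(2)
y = regression_head(rng.normal(size=T), W1, b1, W2, b2)
```

In practice the weights would be learned end-to-end together with the backbone rather than sampled randomly as in this sketch.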
3.3. Loss Function

The average RMSE of SBP and DBP is used as the loss function to train our models, defined as

L = 0.5 \times \sqrt{\frac{\sum_{i=1}^{N} (g_i^d - y_i^d)^2}{N}} + 0.5 \times \sqrt{\frac{\sum_{i=1}^{N} (g_i^s - y_i^s)^2}{N}} \qquad (3)

where g_i^d and g_i^s are the ground-truth DBP and SBP of the i-th sample, respectively; y_i^d and y_i^s are the predicted DBP and SBP of the i-th sample, respectively; and N is the number of samples.

3.4. Aggregation

As mentioned above, multiple individual models are trained with different input features (RGB or YUV), backbones (PhysNet or PhysFormer), and random seeds to introduce diversity into our ensemble method. Ensemble learning is used to aggregate the outputs of the individual models. For each sample, we discard the top-n and bottom-n results and then average the remaining outputs as

y_{\mathrm{ens}}^d = \frac{1}{N - 2n} \sum_{i=n+1}^{N-n} y_i^d, \qquad y_{\mathrm{ens}}^s = \frac{1}{N - 2n} \sum_{i=n+1}^{N-n} y_i^s \qquad (4)

where y_{\mathrm{ens}}^d and y_{\mathrm{ens}}^s are the aggregated predictions of DBP and SBP, respectively; y_i^d and y_i^s represent the predicted DBP and SBP of the i-th model when the predictions are arranged in ascending order; and N is the number of individual models. The top-n and bottom-n values are neglected.

4. Experimental Study

4.1. Experimental Setup

The proposed method is implemented in PyTorch and tested on a server equipped with an Intel(R) Xeon(R) Gold 6430 CPU and an RTX 4090 GPU. The models are trained for 150 epochs using the AdamW optimizer [28] with learning rate lr = 1 × 10^{-5} and weight decay 1 × 10^{-5}. The value of n in Eq. (4) is set to 3.

Table 1
Brief Summary of Datasets for Training. FPS: Frames Per Second.

Dataset             | # Subjects | # Videos | # BP labels | Video length (s) | FPS
VV-medium [29]      | 250        | 499      | 250         | 30               | 30
Our private dataset | 88         | 88       | 88          | 120              | 30

Figure 2: Distribution of (a) diastolic and (b) systolic BP of the two datasets used for training.

4.2. Datasets

Two datasets are used for model training and validation: the VV-medium dataset [29] and our private dataset.
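The aggregation rule in Eq. (4) above amounts to a trimmed mean over the sorted model outputs; a minimal NumPy sketch (the function name is ours) is:

```python
import numpy as np

def trimmed_mean(preds, n=3):
    """Aggregate model outputs per Eq. (4): sort the N predictions,
    drop the n smallest and n largest, and average the remainder."""
    preds = np.sort(np.asarray(preds, dtype=float))
    if 2 * n >= preds.size:
        raise ValueError("need more than 2n models to trim n from each end")
    return float(preds[n:preds.size - n].mean())
```

SBP and DBP are aggregated independently; for example, with eight model outputs and n = 1, the two extremes are discarded and the middle six averaged.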
A brief summary of the two datasets is reported in Table 1, and their distributions are illustrated in Fig. 2. The VV-medium dataset [29] has more videos than BP labels because each BP label corresponds to multiple videos. The BP of the VV-medium dataset [29] is more diversely distributed than that of our dataset. For testing, the OBF Database (Oulu BioFace Database) [30, 31], consisting of 100 subjects and 200 facial videos with DBP/SBP labels, is used for evaluation. Note that for testing, we only have access to the facial videos and not to the ground-truth BP labels.

4.3. Evaluation Metrics

Three metrics are used to evaluate model performance on the validation dataset: the root mean squared error (RMSE), the mean absolute error (MAE), and the Pearson correlation coefficient r, defined as

\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (g_i - y_i)^2}{N}}, \quad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |g_i - y_i|, \quad r = \frac{\sum_{i=1}^{N} (g_i - \bar{g})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (g_i - \bar{g})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}} \qquad (5)

where g_i and y_i are the ground-truth and predicted SBP/DBP, respectively, N is the number of samples, and \bar{g} and \bar{y} indicate the average values of the ground truth and predictions, respectively. For testing, the average RMSE of DBP and SBP is used to evaluate model performance, calculated as

\mathrm{RMSE}_{\mathrm{avg}} = 0.5 \times \mathrm{RMSE}_d + 0.5 \times \mathrm{RMSE}_s \qquad (6)

where RMSE_d and RMSE_s are the RMSE of DBP and SBP, respectively.

4.4. Experimental Results

Figure 3: The training and validation loss over epochs during training.

We randomly split the available data into a training set (80%) and a validation set (20%). The learning curve of one of our individual models is shown in Fig. 3; it shows that the model converges well within 150 epochs. The scatter plots on the validation set are illustrated in Fig. 4, where the RMSE, MAE, and r are also reported. The estimated and true BP are strongly correlated, and the errors of most samples are within ±10 mmHg. The RMSE of DBP and SBP are 8.93 mmHg and 11.03 mmHg, respectively.
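The metrics in Eqs. (5) and (6) can be computed directly with NumPy; the helper names below are ours:

```python
import numpy as np

def rmse(g, y):
    """Root mean squared error, Eq. (5)."""
    g, y = np.asarray(g, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((g - y) ** 2)))

def mae(g, y):
    """Mean absolute error, Eq. (5)."""
    g, y = np.asarray(g, float), np.asarray(y, float)
    return float(np.mean(np.abs(g - y)))

def pearson_r(g, y):
    """Pearson correlation coefficient, Eq. (5)."""
    g, y = np.asarray(g, float), np.asarray(y, float)
    gc, yc = g - g.mean(), y - y.mean()
    return float((gc * yc).sum() / np.sqrt((gc ** 2).sum() * (yc ** 2).sum()))

def rmse_avg(rmse_d, rmse_s):
    """Test score, Eq. (6): equal-weight average of DBP and SBP RMSE."""
    return 0.5 * rmse_d + 0.5 * rmse_s
```

For instance, DBP and SBP RMSEs of 8.93 mmHg and 11.03 mmHg would give an average RMSE of 9.98 mmHg under Eq. (6).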
The MAE and RMSE of SBP are larger than those of DBP because the SBP range is wider and more diversely distributed, as can be seen from Fig. 2. On the testing dataset, the average RMSE is 12.95 mmHg; a comparison is reported in Table 2. Our method outperforms the competing methods by more than 0.5 mmHg.

Figure 4: Scatter plots of ground-truth and predicted (a) diastolic and (b) systolic BP. The solid red line indicates that the prediction equals the ground truth. The dashed blue lines indicate errors of ±10 mmHg.

Table 2
A Comparison of Our Method and the Peer Methods [32].

Rank | Team              | RMSEavg
1    | Face AI (BP)-Ours | 12.95
2    | PCA_Vital         | 13.48
3    | Ryhthm            | 13.59
4    | SCUT_rPPG         | 15.06
5    | IAI-USTC          | 16.01
6    | NeuroAI           | 16.56

5. Conclusion

This paper presented an ensemble deep learning method for BP estimation using facial videos. To improve the diversity of the models used for ensemble learning, multiple models are built with different backbones and input feature vectors. In addition, data augmentation techniques are used to improve model performance under different lighting conditions. The outputs of the individual models are fused with an aggregator. Our method is tested on an unseen dataset in the RePSS challenge, and the average RMSE of SBP and DBP is 12.95 mmHg, outperforming all peer methods and indicating the effectiveness of the proposed method.

6. Acknowledgement

This work is supported by A*STAR Gap project Face AI (Phase 1) under project No. SC36/19-000801-A042 and A*STAR Career Development Fund under grant No. C233312006.

References

[1] F. D. Fuchs, P. K. Whelton, High blood pressure and cardiovascular disease, Hypertension 75 (2020) 285–292.
[2] B. Saugel, K. Kouz, A. S. Meidert, L. Schulte-Uentrop, S. Romagnoli, How to measure blood pressure using an arterial catheter: a systematic 5-step approach, Critical Care 24 (2020) 1–10.
[3] D. S. Picone, M. G. Schultz, P. Otahal, S. Aakhus, A. M. Al-Jumaily, J. A. Black, W. J. Bos, J. B. Chambers, C.-H. Chen, H.-M.
Cheng, et al., Accuracy of cuff-measured blood pressure: systematic reviews and meta-analyses, Journal of the American College of Cardiology 70 (2017) 572–586.
[4] R. Mukkamala, J.-O. Hahn, A. Chandrasekhar, Photoplethysmography in noninvasive blood pressure monitoring, in: Photoplethysmography, Elsevier, 2022, pp. 359–400.
[5] D. Konstantinidis, P. Iliakis, F. Tatakis, K. Thomopoulos, K. Dimitriadis, D. Tousoulis, K. Tsioufis, Wearable blood pressure measurement devices and new approaches in hypertension management: the digital era, Journal of Human Hypertension 36 (2022) 945–951.
[6] N. Ibtehaz, S. Mahmud, M. E. Chowdhury, A. Khandakar, M. Salman Khan, M. A. Ayari, A. M. Tahir, M. S. Rahman, PPG2ABP: Translating photoplethysmogram (PPG) signals to arterial blood pressure (ABP) waveforms, Bioengineering 9 (2022) 692.
[7] K. R. Vardhan, S. Vedanth, G. Poojah, K. Abhishek, M. N. Kumar, V. Vijayaraghavan, BP-Net: Efficient deep learning for continuous arterial blood pressure estimation using photoplethysmogram, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2021, pp. 1495–1500.
[8] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, G. Zhao, PhysFormer: Facial video-based physiological measurement with temporal difference transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4186–4196.
[9] Z. Yu, Y. Shen, J. Shi, H. Zhao, Y. Cui, J. Zhang, P. Torr, G. Zhao, PhysFormer++: Facial video-based physiological measurement with slowfast temporal difference transformer, International Journal of Computer Vision 131 (2023) 1307–1330.
[10] X. Liu, B. Hill, Z. Jiang, S. Patel, D. McDuff, EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5008–5017.
[11] H. Shao, L. Luo, J. Qian, S. Chen, C. Hu, J.
Yang, TranPhys: Spatiotemporal masked transformer steered remote photoplethysmography estimation, IEEE Transactions on Circuits and Systems for Video Technology (2023).
[12] H. Luo, D. Yang, A. Barszczyk, N. Vempala, J. Wei, S. J. Wu, P. P. Zheng, G. Fu, K. Lee, Z.-P. Feng, Smartphone-based blood pressure measurement using transdermal optical imaging technology, Circulation: Cardiovascular Imaging 12 (2019) e008857.
[13] Y. Zhou, H. Ni, Q. Zhang, Q. Wu, The noninvasive blood pressure measurement based on facial images processing, IEEE Sensors Journal 19 (2019) 10624–10634.
[14] H. L. Li-wei, M. Saeed, D. Talmor, R. Mark, A. Malhotra, Methods of blood pressure measurement in the ICU, Critical Care Medicine 41 (2013) 34–40.
[15] S. Romagnoli, Z. Ricci, D. Quattrone, L. Tofani, O. Tujjar, G. Villa, S. M. Romano, A. R. De Gaudio, Accuracy of invasive arterial pressure monitoring in cardiovascular patients: an observational study, Critical Care 18 (2014) 1–11.
[16] P. Palatini, R. Asmar, Cuff challenges in blood pressure measurement, The Journal of Clinical Hypertension 20 (2018) 1100–1103.
[17] M. Forouzanfar, H. R. Dajani, V. Z. Groza, M. Bolic, S. Rajan, I. Batkin, Oscillometric blood pressure estimation: past, present, and future, IEEE Reviews in Biomedical Engineering 8 (2015) 44–63.
[18] D. Castaneda, A. Esparza, M. Ghamari, C. Soltanpur, H. Nazeran, A review on wearable photoplethysmography sensors and their potential future applications in health care, International Journal of Biosensors & Bioelectronics 4 (2018) 195.
[19] M. Panwar, A. Gautam, D. Biswas, A. Acharyya, PP-Net: A deep learning framework for PPG-based blood pressure and heart rate estimation, IEEE Sensors Journal 20 (2020) 10000–10011.
[20] Y. Lu, C. Wang, M. Q.-H. Meng, Video-based contactless blood pressure estimation: A review, in: 2020 IEEE International Conference on Real-time Computing and Robotics (RCAR), IEEE, 2020, pp. 62–67.
[21] F. Schrumpf, P. Frenzel, C. Aust, G. Osterhoff, M.
Fuchs, Assessment of deep learning based blood pressure prediction from PPG and rPPG signals, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3820–3830.
[22] B.-F. Wu, B.-J. Wu, B.-R. Tsai, C.-P. Hsu, A facial-image-based blood pressure measurement system without calibration, IEEE Transactions on Instrumentation and Measurement 71 (2022) 1–13.
[23] Y. Chen, J. Zhuang, B. Li, Y. Zhang, X. Zheng, Remote blood pressure estimation via the spatiotemporal mapping of facial videos, Sensors 23 (2023) 2963.
[24] X. Niu, S. Shan, H. Han, X. Chen, RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation, IEEE Transactions on Image Processing 29 (2019) 2409–2423.
[25] Z. Yu, X. Li, G. Zhao, Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks, in: Proceedings of the British Machine Vision Conference, 2019.
[26] M. A. Mehrabadi, S. A. H. Aqajari, A. H. A. Zargari, N. Dutt, A. M. Rahmani, Novel blood pressure waveform reconstruction from photoplethysmography using cycle generative adversarial networks, in: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, 2022, pp. 1906–1909.
[27] L. N. Harfiya, C.-C. Chang, Y.-H. Li, Continuous blood pressure estimation using exclusively photoplethysmography by LSTM-based signal-to-signal translation, Sensors 21 (2021) 2952.
[28] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[29] P.-J. Toye, Vital videos: A dataset of videos with PPG and blood pressure ground truths, arXiv preprint arXiv:2306.11891 (2023).
[30] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, G.
Zhao, The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), IEEE, 2018, pp. 242–249.
[31] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 151–160.
[32] Z. Sun, The 3rd RePSS track 2, 2024. URL: https://kaggle.com/competitions/the-3rd-repss-t2.