HCMUS at MediaEval 2021: Facial Data De-identification with Adversarial Generation and Perturbation Methods

Minh-Khoi Pham*1,3, Thang-Long Nguyen-Ho*1,3, Trong-Thang Pham*1,3, Hai-Tuan Ho-Nguyen1,3, Hai-Dang Nguyen1,2,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh city, Vietnam
{pmkhoi,nhtlong,ptthang,nhhtuan,nhdang}@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn
* Equal contribution

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
The 2021 MediaEval Multimedia Evaluation introduces a new data de-identification task whose goal is to explore methods for obscuring driver identity in driver-facing video recordings while maintaining visible human behavioral information. Our HCMUS team explores different ideas to tackle this problem. We propose two novel approaches as our main contribution: one based on generative adversarial networks and the other based on adversarial attacks. Moreover, a specific combination of evaluation metrics is included for the latter method to enable a fair comparison. The source code is available at https://github.com/kaylode/mediaeval21-drsf

1 INTRODUCTION
Car accidents are an urgent problem for many countries today. Although vehicles are becoming safer and more durable, the number of accidents remains worrisome. In [3], it is shown that driver-related problems (e.g., distraction, emotional agitation, fatigue) account for 90% of crashes. Such findings help governments, driver educators, vehicle companies, and the public better understand the situation so that appropriate measures can be taken.

Studying driver behavior requires collecting more driving data, which raises privacy and security concerns. In the "Driving Road Safety Forward: Video Data Privacy" competition, we must perform de-identification on the SHRP2 dataset [5], which shows the drivers' faces, bodies, genders, and behaviors. We aim to de-identify the data while keeping as much information as possible for behavioral experts. In this work, we focus on hiding the driver's full face, the most important identifier in the given dataset. Specifically, we propose two approaches:

• A simple process that swaps the driver's face with an anonymous face while keeping most of the facial information.
• An adversarial pipeline that perturbs the face identity while preserving the main facial attributes in the form of embedding features. This approach is accompanied by specific evaluation functions for appropriate assessment.

2 METHOD
2.1 Run 01 - Face Swapping
In this run, we implement the idea of face swapping to hide the real face of the driver while keeping all other facial features such as gaze, eyes, mouth, nose, and head pose. The proposed procedure is shown in Figure 1 and sketched in code below. We first extract the face from the RetinaFace detection results [6]. Then, we use the swapping method of [7] to swap an anonymous face identity with the driver's face. Finally, we composite the swapped face back onto the original video using an overlay module; in our work, this is the overlay feature of a third-party tool (e.g., ffmpeg).

Figure 1: An illustration of the overall pipeline of our face-swapping method (FaceSwap followed by Overlay).
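The three stages above reduce to a small amount of glue code. The sketch below is illustrative rather than our exact implementation: `detect_face` and `swap_identity` are hypothetical stand-ins for the RetinaFace detector [6] and the swapping model [7], while the final step invokes ffmpeg's real `overlay` filter.

```python
import subprocess

def detect_face(video_path: str):
    """Hypothetical stand-in for RetinaFace detection [6]; returns (x, y, w, h)."""
    raise NotImplementedError

def swap_identity(video_path: str, anonymous_face: str, box) -> str:
    """Hypothetical stand-in for the motion-supervised swapper [7];
    returns the path of a face-only clip carrying the anonymous identity."""
    raise NotImplementedError

def deidentify_video(driver_video: str, anonymous_face: str, output_video: str):
    """Run 01 pipeline sketch: detect -> swap -> overlay."""
    x, y, w, h = detect_face(driver_video)
    face_clip = swap_identity(driver_video, anonymous_face, (x, y, w, h))
    # Composite the swapped face at its original position using ffmpeg's
    # overlay filter (the third-party overlay module mentioned above).
    subprocess.run([
        "ffmpeg", "-y",
        "-i", driver_video,   # base stream [0:v]
        "-i", face_clip,      # swapped-face stream [1:v]
        "-filter_complex", f"[0:v][1:v]overlay={x}:{y}",
        output_video,
    ], check=True)
```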
2.2 Run 02 - Adversarial Attack
This run focuses on hiding a person's identity in the image from human view and on preventing unauthorized deep vision algorithms from extracting useful information, while ensuring correct predictions for authorized algorithms only. In our research, we consider only the position and rotation of the human face and its eye-gaze vector as the principal information for studying the person's actions and behavior. The approach consists of two main steps:

(1) Safeguard identity information from being inferred by unauthorized models.
(2) Guarantee that a model with a defined set of weights can extract the information with low error.

In the first step, we apply simple identity-masking techniques, pixelation and blurring, to anonymize the driver's face. This step applies a strong perturbation to the original image so that the face can be identified neither by humans nor by any vision model. A sketch of this masking step follows.
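As a concrete illustration, the sketch below pixelates the detected face region by downscaling and re-upscaling it, then applies a Gaussian blur on top. The OpenCV calls are standard; the box coordinates are assumed to come from a face detector, and the block and kernel sizes are illustrative choices.

```python
import cv2
import numpy as np

def mask_face(frame: np.ndarray, box, blocks: int = 8, blur_ksize: int = 21) -> np.ndarray:
    """Pixelate then blur the face region (x, y, w, h) of a BGR frame.

    Minimal sketch of the identity-masking step; `box` is assumed to come
    from a face detector such as RetinaFace.
    """
    x, y, w, h = box
    face = frame[y:y + h, x:x + w]

    # Pixelation: shrink to a coarse grid, then scale back up with
    # nearest-neighbour interpolation so the blocks stay sharp.
    small = cv2.resize(face, (blocks, blocks), interpolation=cv2.INTER_LINEAR)
    face = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

    # Gaussian blur on top to further suppress residual identity cues.
    face = cv2.GaussianBlur(face, (blur_ksize, blur_ksize), 0)

    out = frame.copy()
    out[y:y + h, x:x + w] = face
    return out
```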
Second, to ensure that the hidden attributes, namely the bounding box, landmarks, and gaze vector of the face, can be revealed only to a model with a defined set of weights, we utilize the Iterative Fast Gradient Sign Method (I-FGSM) [4].

In general, I-FGSM exploits the gradient of the cost function with respect to the input image to create a new image that maximizes the loss, driving the model's outputs toward a desired target. In our case, I-FGSM is used in the opposite direction to this common goal: we modify the input image to minimize the loss, changing the model's prediction on the adversarial sample from incorrect to correct.

The targeted models (and their default weights) that we choose to attack are listed below:

• For face detection, we experiment with RetinaFace [2] and MTCNN [8].
• For facial landmark detection, we use 2D-FAN [1].
• For gaze estimation, we use a simple ResNet pretrained on the ETH-XGaze [9] and MPII-Gaze [10] datasets.

We carry out backpropagation to find the corresponding perturbation to apply to the image. Our objective is to minimize the following function:

$$L = L_B + L_{LM} + L_G$$

where:
• $L_B$ is the box proposal loss of the detection model.
• $L_{LM}$ is the L2 error between the predicted heatmap and the ground-truth heatmap of the landmark estimator.
• $L_G$ is the L2 error between the gaze vector predicted by the model and the true gaze vector.

With the proposed loss function, we compute the network gradient and use it as a perturbation to update the current input image, following [4]:

$$X^{adv}_{N+1} = \mathrm{Clip}_{X,\epsilon}\left\{X^{adv}_N + \alpha\,\mathrm{sign}(\nabla_X L)\right\}$$

Given $X^{adv}_0 = X$, the raw input image, we iteratively add the perturbation until $L$ becomes smaller than a predefined threshold or the number of iterations reaches its maximum.

In the setting described above, the goal is to ensure that the facial attributes extracted by authorized models from the original image and from its de-identified version deviate only slightly. Therefore, we propose the following assessment method, which indicates whether the attributes are well hidden:

$$\mathit{DiffScore}_m(v_a, v_o) = \frac{1}{N}\sum_{i=1}^{N}\Big[\big(1 - \mathit{IOU}_B(f_a^i, f_o^i)\big) + d_{LM}(f_a^i, f_o^i) + \Theta_G(f_a^i, f_o^i)\Big]$$

where $v_a$ and $v_o$ are the adversarial and raw video sequences consisting of $N$ frames, and $f_a^i$ and $f_o^i$ are the predictions of model $m$ on the adversarial and original frames at the same timestamp $i$, respectively. $\mathit{IOU}_B$ is the Intersection-over-Union between the two face locations, $d_{LM}$ is the Euclidean distance between the landmark points of the two faces, and $\Theta_G$ is the angle between the two gaze vectors. The final $\mathit{DiffScore}$ is obtained by accumulating these differences over all frame pairs across time. Consequently, we expect $\mathit{DiffScore}$ to be low for authorized algorithms and higher for unauthorized ones.
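To make the attack loop and the metric concrete, below is a minimal PyTorch sketch under stated assumptions: `loss_fn` is a hypothetical callable standing in for the combined loss of the three targeted models, the per-frame prediction dicts are hypothetical containers for box, landmarks, and gaze, and `epsilon`/`alpha` are illustrative budgets. The update subtracts the signed gradient because, unlike standard loss-maximizing I-FGSM, we descend the loss.

```python
import torch
from torchvision.ops import box_iou

def ifgsm_minimize(loss_fn, x, epsilon=8/255, alpha=1/255, max_iters=40, tol=1e-3):
    """Perturb image batch `x` (values in [0, 1]) to MINIMIZE the combined
    loss L = L_B + L_LM + L_G, so the targeted weights recover the attributes.

    Sketch only: `loss_fn(x_adv)` is a hypothetical scalar loss. The step is
    -alpha * sign(grad), flipped relative to the loss-maximizing I-FGSM
    update of [4]; the clamp implements Clip_{X,eps}.
    """
    x_adv = x.clone().detach()
    for _ in range(max_iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(x_adv)
        if loss.item() < tol:              # stop once L is below the threshold
            break
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                # descend L
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # valid image
        x_adv = x_adv.detach()
    return x_adv.detach()

def diff_score(preds_adv, preds_org):
    """DiffScore_m over paired per-frame predictions. Each prediction is a
    hypothetical dict: 'box' (float tensor [4], x1y1x2y2), 'landmarks'
    (float tensor [K, 2]), 'gaze' (float tensor [3])."""
    total = 0.0
    for fa, fo in zip(preds_adv, preds_org):
        iou = box_iou(fa["box"].unsqueeze(0), fo["box"].unsqueeze(0))[0, 0]
        total += 1.0 - iou                                       # 1 - IOU_B
        total += torch.dist(fa["landmarks"], fo["landmarks"])    # d_LM
        cos = torch.dot(fa["gaze"], fo["gaze"]) / (
            fa["gaze"].norm() * fo["gaze"].norm())
        total += torch.acos(cos.clamp(-1.0, 1.0))                # Theta_G (rad)
    return total / len(preds_org)
```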
3 EXPERIMENTS AND RESULTS
In the second run, we pass consecutive video frames in batches of 64 through our de-identification pipeline to generate perturbed videos whose facial attributes are hidden. We run the full adversarial pipeline on 720 videos from the SHRP2 dataset and show some visual results in Figure 2.

Figure 2: Adversarial de-identification results for Run 02. From left to right: original images, predictions on original images, de-identified images, unauthorized predictions on de-identified images, and authorized predictions on de-identified images. The green box, dots, 3D RGB axes, and yellow vector indicate head localization, face landmarks, head pose, and gaze vector, respectively.

4 CONCLUSION AND FUTURE WORKS
In conclusion, we present different strategies to address the data-privacy issues of the MediaEval 2021 challenge. In the future, we aim to study the performance of our adversarial-attack-based information-preservation method on more deep vision models for facial attributes. We also intend to analyze our proposed metrics in these experiments.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Adrian Bulat and Georgios Tzimiropoulos. 2017. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks). In 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.116
[2] Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, and Stefanos Zafeiriou. 2019. RetinaFace: Single-stage Dense Face Localisation in the Wild. arXiv:cs.CV/1905.00641
[3] Thomas A. Dingus, Feng Guo, Suzie Lee, Jonathan F. Antin, Miguel Perez, Mindy Buchanan-King, and Jonathan Hankey. 2016. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceedings of the National Academy of Sciences 113, 10 (2016), 2636–2641. https://doi.org/10.1073/pnas.1513271113
[4] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial examples in the physical world. arXiv:cs.CV/1607.02533
[5] Transportation Research Board of the National Academy of Sciences. 2013. The 2nd Strategic Highway Research Program Naturalistic Driving Study Dataset.
[6] Sefik Ilkin Serengil and Alper Ozpinar. 2021. HyperExtended LightFace: A Facial Attribute Analysis Framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET). IEEE.
[7] Aliaksandr Siarohin, Subhankar Roy, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2020. Motion Supervised Co-part Segmentation. arXiv preprint (2020).
[8] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct 2016), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342
[9] Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, and Otmar Hilliges. 2020. ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation. arXiv:cs.CV/2007.15837
[10] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation. arXiv:cs.CV/1711.09017