HCMUS at MediaEval 2021: Facial Data De-identification with Adversarial Generation and Perturbation Methods

Minh-Khoi Pham*1,3, Thang-Long Nguyen-Ho*1,3, Trong-Thang Pham*1,3, Hai-Tuan Ho-Nguyen1,3, Hai-Dang Nguyen1,2,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh city, Vietnam
{pmkhoi,nhtlong,ptthang,nhhtuan,nhdang}@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn
* Equal contribution

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
The 2021 MediaEval Multimedia Evaluation introduces a new data de-identification task whose goal is to explore methods for obscuring driver identity in driver-facing video recordings while maintaining visible human behavioral information. Our HCMUS team explores different ideas to tackle this problem. We propose two novel approaches as our main contribution: one based on generative adversarial networks and the other based on adversarial attacks. Moreover, a specific combination of evaluation metrics is included for the latter method to enable a fair comparison. The source code is available at https://github.com/kaylode/mediaeval21-drsf

1 INTRODUCTION
Car accidents are an urgent problem for many countries today. Although vehicles are becoming safer and more durable, the number of accidents remains worrisome. In [3], it is shown that driver-related problems (e.g., distraction, emotional agitation, fatigue) account for 90% of crashes. Such findings help governments, driver educators, vehicle companies, and the public better understand the situation so that appropriate measures can be taken.

Studying driver behavior requires collecting more driving data, which raises privacy and security concerns. In the "Driving Road Safety Forward: Video Data Privacy" competition, we must perform de-identification on the SHRP2 dataset [5], which shows the drivers' faces, bodies, genders, and behaviors. We aim to de-identify the data while keeping as much information as possible for behavioral experts. In this work, we focus on hiding the driver's full face, the most important identifier in the given dataset. Specifically, we propose two approaches:

• A simple process that swaps the driver's face with an anonymous face while keeping most of the facial information.
• An adversarial pipeline that perturbs the face identity while preserving the main facial attributes in the form of embedding features. This approach is accompanied by specific evaluation functions for appropriate assessment.

2 METHOD
2.1 Run 01 - Face Swapping
In this run, we implement the idea of face swapping to hide the real face of the driver while keeping all other facial features such as gaze, eyes, mouth, nose, and head pose. The proposed procedure is shown in Figure 1 and sketched in code below. We first extract the face from the RetinaFace detection results [6]. Then, we use the swapping method of [7] to swap an anonymous face identity with the driver's face. Finally, we composite the swapped face back onto the original video using an overlay module; in our work, this is the overlay feature of a third-party tool (e.g., ffmpeg).

Figure 1: An illustration of the overall pipeline of our face-swapping method (FaceSwap followed by Overlay).
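The three stages above reduce to a small amount of glue code. The sketch below is illustrative rather than our exact implementation: `detect_face` and `swap_identity` are hypothetical stand-ins for the RetinaFace detector [6] and the swapping model [7], while the final step invokes ffmpeg's real `overlay` filter.

```python
import subprocess

def detect_face(video_path: str):
    """Hypothetical stand-in for RetinaFace detection [6]; returns (x, y, w, h)."""
    raise NotImplementedError

def swap_identity(video_path: str, anonymous_face: str, box) -> str:
    """Hypothetical stand-in for the motion-supervised swapper [7];
    returns the path of a face-only clip carrying the anonymous identity."""
    raise NotImplementedError

def deidentify_video(driver_video: str, anonymous_face: str, output_video: str):
    """Run 01 pipeline sketch: detect -> swap -> overlay."""
    x, y, w, h = detect_face(driver_video)
    face_clip = swap_identity(driver_video, anonymous_face, (x, y, w, h))
    # Composite the swapped face at its original position using ffmpeg's
    # overlay filter (the third-party overlay module mentioned above).
    subprocess.run([
        "ffmpeg", "-y",
        "-i", driver_video,   # base stream [0:v]
        "-i", face_clip,      # swapped-face stream [1:v]
        "-filter_complex", f"[0:v][1:v]overlay={x}:{y}",
        output_video,
    ], check=True)
```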
2.2 Run 02 - Adversarial Attack
This run focuses on hiding a person's identity in the image from human view and on preventing unauthorized deep vision algorithms from extracting useful information, while ensuring correct predictions for authorized algorithms only. In our research, we consider only the position and rotation of the human face and its eye-gaze vector as the principal information for studying the person's actions and behavior. The approach consists of two main steps:

(1) Safeguard identity information from being inferred by unauthorized models.
(2) Guarantee that a model with a defined set of weights can extract the information with low error.

In the first step, we apply simple identity-masking techniques, pixelation and blurring, to anonymize the driver's face. This step applies a strong perturbation to the original image so that the face can be identified neither by humans nor by any vision model. A sketch of this masking step follows.
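As a concrete illustration, the sketch below pixelates the detected face region by downscaling and re-upscaling it, then applies a Gaussian blur on top. The OpenCV calls are standard; the box coordinates are assumed to come from a face detector, and the block and kernel sizes are illustrative choices.

```python
import cv2
import numpy as np

def mask_face(frame: np.ndarray, box, blocks: int = 8, blur_ksize: int = 21) -> np.ndarray:
    """Pixelate then blur the face region (x, y, w, h) of a BGR frame.

    Minimal sketch of the identity-masking step; `box` is assumed to come
    from a face detector such as RetinaFace.
    """
    x, y, w, h = box
    face = frame[y:y + h, x:x + w]

    # Pixelation: shrink to a coarse grid, then scale back up with
    # nearest-neighbour interpolation so the blocks stay sharp.
    small = cv2.resize(face, (blocks, blocks), interpolation=cv2.INTER_LINEAR)
    face = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

    # Gaussian blur on top to further suppress residual identity cues.
    face = cv2.GaussianBlur(face, (blur_ksize, blur_ksize), 0)

    out = frame.copy()
    out[y:y + h, x:x + w] = face
    return out
```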
Second, to ensure that the hidden attributes, namely the bounding box, landmarks, and gaze vector of the face, can be revealed only to a model with a defined set of weights, we utilize the Iterative Fast Gradient Sign Method (I-FGSM) [4].

In general, I-FGSM exploits the gradient of the cost function with respect to the input image to create a new image that maximizes the loss, driving the model's outputs toward a desired target. In our case, I-FGSM is used in the opposite direction to this common goal: we modify the input image to minimize the loss, changing the model's prediction on the adversarial sample from incorrect to correct.

The targeted models (and their default weights) that we choose to attack are listed below:

• For face detection, we experiment with RetinaFace [2] and MTCNN [8].
• For facial landmark detection, we use 2D-FAN [1].
• For gaze estimation, we use a simple ResNet pretrained on the ETH-XGaze [9] and MPII-Gaze [10] datasets.

We carry out backpropagation to find the corresponding perturbation to apply to the image. Our objective is to minimize the following function:

$$L = L_B + L_{LM} + L_G$$

where:
• $L_B$ is the box proposal loss of the detection model.
• $L_{LM}$ is the L2 error between the predicted heatmap and the ground-truth heatmap of the landmark estimator.
• $L_G$ is the L2 error between the gaze vector predicted by the model and the true gaze vector.

With the proposed loss function, we compute the network gradient and use it as a perturbation to update the current input image, following [4]:

$$X^{adv}_{N+1} = \mathrm{Clip}_{X,\epsilon}\left\{X^{adv}_N + \alpha\,\mathrm{sign}(\nabla_X L)\right\}$$

Given $X^{adv}_0 = X$, the raw input image, we iteratively add the perturbation until $L$ becomes smaller than a predefined threshold or the number of iterations reaches its maximum.

In the setting described above, the goal is to ensure that the facial attributes extracted by authorized models from the original image and from its de-identified version deviate only slightly. Therefore, we propose the following assessment method, which indicates whether the attributes are well hidden:

$$\mathit{DiffScore}_m(v_a, v_o) = \frac{1}{N}\sum_{i=1}^{N}\Big[\big(1 - \mathit{IOU}_B(f_a^i, f_o^i)\big) + d_{LM}(f_a^i, f_o^i) + \Theta_G(f_a^i, f_o^i)\Big]$$

where $v_a$ and $v_o$ are the adversarial and raw video sequences consisting of $N$ frames, and $f_a^i$ and $f_o^i$ are the predictions of model $m$ on the adversarial and original frames at the same timestamp $i$, respectively. $\mathit{IOU}_B$ is the Intersection-over-Union between the two face locations, $d_{LM}$ is the Euclidean distance between the landmark points of the two faces, and $\Theta_G$ is the angle between the two gaze vectors. The final $\mathit{DiffScore}$ is obtained by accumulating these differences over all frame pairs across time. Consequently, we expect $\mathit{DiffScore}$ to be low for authorized algorithms and higher for unauthorized ones.
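To make the attack loop and the metric concrete, below is a minimal PyTorch sketch under stated assumptions: `loss_fn` is a hypothetical callable standing in for the combined loss of the three targeted models, the per-frame prediction dicts are hypothetical containers for box, landmarks, and gaze, and `epsilon`/`alpha` are illustrative budgets. The update subtracts the signed gradient because, unlike standard loss-maximizing I-FGSM, we descend the loss.

```python
import torch
from torchvision.ops import box_iou

def ifgsm_minimize(loss_fn, x, epsilon=8/255, alpha=1/255, max_iters=40, tol=1e-3):
    """Perturb image batch `x` (values in [0, 1]) to MINIMIZE the combined
    loss L = L_B + L_LM + L_G, so the targeted weights recover the attributes.

    Sketch only: `loss_fn(x_adv)` is a hypothetical scalar loss. The step is
    -alpha * sign(grad), flipped relative to the loss-maximizing I-FGSM
    update of [4]; the clamp implements Clip_{X,eps}.
    """
    x_adv = x.clone().detach()
    for _ in range(max_iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(x_adv)
        if loss.item() < tol:              # stop once L is below the threshold
            break
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                # descend L
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # valid image
        x_adv = x_adv.detach()
    return x_adv.detach()

def diff_score(preds_adv, preds_org):
    """DiffScore_m over paired per-frame predictions. Each prediction is a
    hypothetical dict: 'box' (float tensor [4], x1y1x2y2), 'landmarks'
    (float tensor [K, 2]), 'gaze' (float tensor [3])."""
    total = 0.0
    for fa, fo in zip(preds_adv, preds_org):
        iou = box_iou(fa["box"].unsqueeze(0), fo["box"].unsqueeze(0))[0, 0]
        total += 1.0 - iou                                       # 1 - IOU_B
        total += torch.dist(fa["landmarks"], fo["landmarks"])    # d_LM
        cos = torch.dot(fa["gaze"], fo["gaze"]) / (
            fa["gaze"].norm() * fo["gaze"].norm())
        total += torch.acos(cos.clamp(-1.0, 1.0))                # Theta_G (rad)
    return total / len(preds_org)
```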
3 EXPERIMENTS AND RESULTS
In the second run, we pass consecutive video frames in batches of 64 through our de-identification pipeline to generate perturbed videos whose facial attributes are hidden. We run the full adversarial pipeline on 720 videos from the SHRP2 dataset and show some visual results in Figure 2.

Figure 2: Adversarial de-identification results for Run 02. From left to right: original images, predictions on original images, de-identified images, unauthorized predictions on de-identified images, and authorized predictions on de-identified images. The green box, dots, 3D RGB axes, and yellow vector indicate head localization, face landmarks, head pose, and gaze vector, respectively.

4 CONCLUSION AND FUTURE WORKS
In conclusion, we present different strategies to address the data-privacy issues of the MediaEval 2021 challenge. In the future, we aim to study the performance of our adversarial-attack-based information-preservation method on more deep vision models for facial attributes. We also intend to analyze our proposed metrics in these experiments.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Adrian Bulat and Georgios Tzimiropoulos. 2017. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks). In 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.116
[2] Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, and Stefanos Zafeiriou. 2019. RetinaFace: Single-stage Dense Face Localisation in the Wild. arXiv:cs.CV/1905.00641
[3] Thomas A. Dingus, Feng Guo, Suzie Lee, Jonathan F. Antin, Miguel Perez, Mindy Buchanan-King, and Jonathan Hankey. 2016. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceedings of the National Academy of Sciences 113, 10 (2016), 2636–2641. https://doi.org/10.1073/pnas.1513271113
[4] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial examples in the physical world. arXiv:cs.CV/1607.02533
[5] Transportation Research Board of the National Academy of Sciences. 2013. The 2nd Strategic Highway Research Program Naturalistic Driving Study Dataset.
[6] Sefik Ilkin Serengil and Alper Ozpinar. 2021. HyperExtended LightFace: A Facial Attribute Analysis Framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET). IEEE.
[7] Aliaksandr Siarohin, Subhankar Roy, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2020. Motion Supervised Co-part Segmentation. arXiv preprint (2020).
[8] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct 2016), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342
[9] Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, and Otmar Hilliges. 2020. ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation. arXiv:cs.CV/2007.15837
[10] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation. arXiv:cs.CV/1711.09017