MediaEval 2020: Maintaining Human-Imperceptibility of Image Adversarial Attack by using Human-Aware Sensitivity Map

Zhiqi Shen, Muhammad Furqan Habibi, Shaojing Fan, Mohan Kankanhalli
National University of Singapore
dcsshenz@nus.edu.sg, furqan.habibi@u.nus.edu, dcsfs@nus.edu.sg, mohan@comp.nus.edu.sg

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.

ABSTRACT

With the rapid rise of big data and the development of artificial intelligence, privacy has come under the spotlight. Adversarial attacks using image perturbations have recently been introduced to fool machines on pattern recognition tasks, and they have also been successfully employed to protect the privacy of images. However, only a few works consider how perceptible these perturbations are to humans. This report presents our submission to the Pixel Privacy task, in which we improve the imperceptibility of image perturbations by using a human-aware sensitivity map while protecting image privacy via adversarial attack techniques.

1 INTRODUCTION

The Pixel Privacy task [7] of MediaEval aims to protect personal privacy by embedding human-imperceptible noise in images so that the noise fools BIQA classifiers. The attack models use the InceptionResNetV2 architecture and are pre-trained on the KonIQ-10k dataset. The organizers evaluated performance in terms of attack success rate (accuracy) and imperceptibility of the perturbation.

Figure 1: Sensitivity map examples. The left column shows the original images and the right column shows the corresponding sensitivity maps. For example, in the first image the sensitivity map highlights the humans, indicating that noise added to the human region will be perceived more easily than noise added to the background.

Prior work usually applies the $L_2$ norm [1, 5, 6] in the loss function to improve the imperceptibility of perturbed images. However, the $L_2$ norm only guarantees that the overall noise is small, without considering the perceptual characteristics of individual regions. For example, observers perceive the same noise differently when it is added to a flat background versus a content-rich one. With this insight, we can apply a sensitivity map in the loss function that indicates which regions can be changed with the least chance of being noticed, so that the algorithm knows where to add the noise. Recent works [2, 4, 11] published after our earlier work [9] do take the human imperceptibility of perturbations into account; unlike our deep learning-based method, most of them compute human imperceptibility from texture information.

Our method is an optimization-based approach built on the CW attack [1]. We push each input image's model logits towards a target class and optimize the attack by modifying the input image to minimize a loss function. To improve human imperceptibility, we extend the loss function with human sensitivity maps learned from [9]. Experimental evaluation indicates that our approach achieves good results in terms of human imperceptibility.

2 APPROACH

2.1 Preliminaries

We denote an image by $I \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the frame height, frame width, and number of channels, respectively. The BIQA classifier $f(X, \theta)$ takes an image as input and produces the corresponding logits $l \in \mathbb{R}^{K}$, where $K$ is the number of classes. A softmax layer follows the network to transform the logits into class probabilities $y$, so the whole BIQA classifier is represented by $\mathrm{softmax}(f(X, \theta)) = y$.

An image adversarial attack aims to find an adversarial image $I_{adv}$ that maximizes the classification error. We denote by $I_s = I_{adv} - I$ the adversarial image perturbation.

We propose an optimization-based approach. The general idea of generating the perturbation for an image is captured by the following optimization problem:

$\arg\min_{I_s} \; \alpha D(I_s) + \ell(f(I + I_s, \theta), \hat{l})$   (1)

where $D(\cdot)$ is the perception regularization that keeps the perturbation small and imperceptible to humans, $\hat{l}$ is the target logits, and $\ell(\cdot, \cdot)$ is the loss function that measures the difference between the actual prediction and the target prediction. To obtain a high attack success rate, we minimize the distance between the actual logits and the target logits. $\alpha$ is a hyper-parameter that balances the two terms.
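For concreteness, the following is a minimal PyTorch sketch of the optimization loop behind Eq. (1). It is illustrative only and not the authors' released code: `biqa_model`, `perception_reg`, and `target_loss` are placeholder names for the pre-trained classifier $f$, the perception regularizer $D$ (Section 2.3), and the loss term $\ell$ (Section 2.2), and the loop handles a single image.

```python
# Illustrative sketch; parameter names and defaults are ours.
import torch

def generate_perturbed_image(image, target_logits, biqa_model, perception_reg,
                             target_loss, alpha=10.0, steps=200, lr=0.01):
    """Optimize the additive perturbation I_s of Eq. (1) for one image,
    given as a float tensor of shape (3, H, W) with values in [0, 1]."""
    perturbation = torch.zeros_like(image, requires_grad=True)  # I_s, initialized to zero
    optimizer = torch.optim.Adam([perturbation], lr=lr)

    for _ in range(steps):
        adv_image = (image + perturbation).clamp(0.0, 1.0)      # I + I_s, kept a valid image
        logits = biqa_model(adv_image.unsqueeze(0)).squeeze(0)   # f(I + I_s, theta)

        # Objective of Eq. (1): the perception regularizer balanced against
        # the distance between the current logits and the target logits.
        loss = alpha * perception_reg(perturbation) + target_loss(logits, target_logits)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (image + perturbation).clamp(0.0, 1.0).detach()
```

The number of steps, the learning rate, and the use of Adam are arbitrary choices for the sketch; only Ξ± corresponds to the hyper-parameter reported in Section 3.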
2.2 Loss to fool machines

We follow the loss in [1] to fool machines. For the sake of clarity, we write $L_C = \ell(f(I + I_s, \theta), \hat{l})$; the detailed formulation is as follows:

$L_C = \begin{cases} |\max(f(I + I_s)) - \max(\hat{l})| & \text{if } \arg\max f(I + I_s) \neq \arg\max \hat{l} \\ 0 & \text{otherwise} \end{cases}$   (2)

where $f(I + I_s)$ and $\hat{l}$ are the vectors holding the current logits and the desired (one-hot) target logits. The loss consists of two parts. The first part covers the situation in which the perturbed image has not yet been classified into our desired class; the loss value is then the absolute distance between the most confident class in the current logits and the desired class. The second part covers the situation in which the perturbed image has been classified into our desired class, in which case we set the loss value to zero.
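A direct reading of Eq. (2) in PyTorch might look as follows; the function name is ours, and `target_logits` is assumed to be the one-hot vector $\hat{l}$.

```python
import torch

def machine_fooling_loss(logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """L_C of Eq. (2): while the prediction has not reached the desired class,
    penalize the absolute gap between the top current logit and the top target
    score; once the desired class is predicted, the loss is zero."""
    if logits.argmax() != target_logits.argmax():
        return (logits.max() - target_logits.max()).abs()
    return torch.zeros((), device=logits.device)
```

This function can serve as the `target_loss` placeholder in the optimization loop sketched after Eq. (1).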
2.3 Loss to fool humans

We observed that the traditional norms (e.g., $L_0$, $L_2$, $L_\infty$) treat all pixels in an image equally, whereas humans have different priorities when viewing different image regions. More specifically, adding the same perturbation noise to different regions leads to different levels of human perceptibility. To quantify the perceptibility of each pixel, we integrate a sensitivity map into our loss function. The value of each pixel in the sensitivity map ranges from 0 to 1, and a larger value indicates a higher chance that noise added to that pixel will be perceived.

Human-aware sensitivity map. Human perception is a complex phenomenon that is not easily captured in a neat mathematical formulation. Therefore, we train a neural network to produce a spatially dense prediction of the human sensitivity score of each pixel. The network is designed based on a fully convolutional network (FCN) [8]. The backbone is a VGG-16 [10] model pre-trained on the ImageNet dataset, and a 1Γ—1 convolutional layer combines all feature maps extracted from VGG-16 into the final sensitivity map. The architecture is illustrated in Figure 2; a simplified code sketch is given at the end of this section.

Figure 2: The sensitivity map prediction network. The network is based on an FCN and uses VGG-16 as its backbone.

Embed sensitivity maps into the attack approach. For this workshop, we train the sensitivity map generation model on the EMOd dataset [3] and then apply it to the given Places365 test set. To integrate human perceptual sensitivity, we extend the $L_2$ norm by weighting the perturbation with the sensitivity map $s$, as shown below:

$D(I_s) = \beta \, \| s \odot I_s \|_2^2$   (3)

where $\odot$ denotes element-wise multiplication and $\beta$ is a scaling factor.
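The following sketch illustrates the two components above with our own naming and a deliberately simplified architecture: `SensitivityMapNet` stands in for the FCN-based predictor of Figure 2 (it uses only the final VGG-16 feature stage followed by a 1Γ—1 convolution, rather than the multi-stage fusion of a full FCN [8]), and `sensitivity_weighted_reg` implements the weighted regularizer of Eq. (3) as we read it.

```python
# Illustrative sketch; module and function names are ours, not the authors'.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SensitivityMapNet(nn.Module):
    """Simplified stand-in for the predictor of Figure 2: VGG-16 features
    (pre-trained on ImageNet) fused by a 1x1 convolution into a single-channel
    sensitivity map, upsampled back to the input resolution."""
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights="IMAGENET1K_V1").features  # torchvision >= 0.13
        self.fuse = nn.Conv2d(512, 1, kernel_size=1)              # 1x1 fusion layer

    def forward(self, x):                                         # x: (N, 3, H, W), ImageNet-normalized
        features = self.backbone(x)                               # (N, 512, H/32, W/32)
        sal = self.fuse(features)
        sal = F.interpolate(sal, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(sal)                                 # values in [0, 1]; larger = more perceivable

def sensitivity_weighted_reg(perturbation, sensitivity_map, beta=1.0):
    """D(I_s) of Eq. (3): squared L2 norm of the perturbation, weighted
    element-wise by the sensitivity map so that noise placed in sensitive
    regions is penalized more heavily."""
    return beta * (sensitivity_map * perturbation).pow(2).sum()
```

Binding the regularizer to a fixed per-image map, for example with `functools.partial(sensitivity_weighted_reg, sensitivity_map=s)`, yields the `perception_reg` placeholder used in the loop of Section 2.1; the single-channel map broadcasts over the image channels.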
3 RESULTS AND ANALYSIS

We submitted five runs to the Pixel Privacy task. The organizers selected the 20 images with the largest BIQA variance for human evaluation. They then placed the same image from all qualified runs in one folder and asked 7 experts to select the most appealing (i.e., "Best") three runs out of 17 runs; a run could therefore be selected as "Best" at most 140 times (20 images Γ— 7 experts).

Table 1: Evaluation of our five runs. The first run, with parameter Ξ± = 10, achieves a high attack success rate, and more than half of its perturbed images were selected as best.

Parameter (Ξ±)   Accuracy (%)   Number of times selected as "Best"
10              42.73          74
20              52.91          Not qualified
30              62.36          Not qualified
40              75.10          Not qualified
50              93.82          Not qualified

From Table 1, we observe that the accuracy for our first run (with parameter Ξ± = 10) dropped below random guessing (50%), meaning that our perturbed images fooled the machine's predictions. More importantly, more than half of its images were selected among the best three out of 17 runs. From the trend across parameters, we can see the potential of our algorithm: if we tried more parameter values (e.g., smaller than 10), the performance might be even better than the current one. The other runs did not achieve a good attack rate, because their parameter Ξ± is too large and forces the optimization to focus more on image quality during back-propagation.

4 CONCLUSION AND FUTURE WORK

This report introduces our approach for privacy protection, which integrates a human-aware sensitivity map into the loss function to improve the quality of perturbed images. The results demonstrate the effectiveness of the sensitivity map in maintaining noise imperceptibility. However, some aspects can be further improved. The current sensitivity map prediction network is trained on the EMOd dataset, which has only 698 images, and its network structure (FCN) is rudimentary. We expect that a more sophisticated structure, trained on a larger dataset, can improve the performance.

REFERENCES
[1] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 39–57.
[2] Francesco Croce and Matthias Hein. 2019. Sparse and imperceivable adversarial attacks. In Proceedings of the IEEE International Conference on Computer Vision. 4724–4732.
[3] Shaojing Fan, Zhiqi Shen, Ming Jiang, Bryan L. Koenig, Juan Xu, Mohan S. Kankanhalli, and Qi Zhao. 2018. Emotional attention: A study of image sentiment and visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7521–7531.
[4] Diego Gragnaniello, Francesco Marra, Giovanni Poggi, and Luisa Verdoliva. 2019. Perceptual quality-preserving black-box attack against deep learning image classifiers. arXiv preprint arXiv:1902.07776 (2019).
[5] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016).
[6] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2016. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 (2016).
[7] Zhuoran Liu, Zhengyu Zhao, Martha Larson, and Laurent Amsaleg. 2020. Exploring quality camouflage for social images. In Working Notes Proceedings of the MediaEval Workshop.
[8] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[9] Zhiqi Shen, Shaojing Fan, Yongkang Wong, Tian-Tsong Ng, and Mohan Kankanhalli. 2019. Human-imperceptible privacy protection against machines. In Proceedings of the 27th ACM International Conference on Multimedia. 1119–1128.
[10] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[11] Eric Wong, Frank R. Schmidt, and J. Zico Kolter. 2019. Wasserstein adversarial examples via projected Sinkhorn iterations. arXiv preprint arXiv:1902.07906 (2019).