HCMUS at Pixel Privacy 2020: Quality Camouflage with Back Propagation and Image Enhancement

Minh-Khoi Pham*1,3, Hai-Tuan Ho-Nguyen*1,3, Trong-Thang Pham1,3, Hung Vinh Tran1,3, Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM
2 John von Neumann Institute, VNU-HCM
3 Vietnam National University, Ho Chi Minh City, Vietnam
{pmkhoi,hnhtuan,ptthang,tvhung,nhdang}@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

*These authors contributed equally.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.

ABSTRACT
As the need to share our moments grows, more and more high-quality photos appear on the Internet, and it becomes increasingly likely that shared photos will be used for purposes their owners never intended. If the target photos are of high quality, an attacker may select them with an automatic quality criterion such as a Blind Image Quality Assessment (BIQA) model. Pixel Privacy 2020 aims to tackle this problem. For this challenge, we implement methods that combine image enhancement with an end-to-end attack. The final results show that all of our approaches successfully fool the BIQA model. In particular, our best run has 84 photos chosen among the top-3 most attractive while maintaining 100% attack accuracy on the assessment model, outperforming all other submissions.

1 INTRODUCTION
With the recent rapid development of social networks, the need to share images keeps increasing. Smartphones now ship with high-end cameras, which raises the quality of the images people post. These images, intended to be shared with friends, can be exploited by attackers for private data; for example, an attacker could use image quality as a filter to single out your honeymoon photos. We therefore consider the Pixel Privacy task vital in this age of connection.

In Pixel Privacy 2020 [5], we are given a set of images rated as high quality by a BIQA [9] model. Our goal is to fool the BIQA model so that it considers the modified images low quality, while the images remain attractive to human eyes. Specifically, this BIQA model was trained on the KonIQ-10k dataset [1], and the given images come from the Places365 validation set. The output images are JPEG-compressed (quality 90) before being evaluated.

We propose three approaches. The first is a vanilla end-to-end, image-to-image approach, in which we aim to learn a single network that both enhances image quality and protects that quality from being detected by the BIQA model. To stay flexible in the choice of enhancement method, we also propose a two-stage approach: the first stage enhances image quality, for which we experiment with several methods, and the second stage camouflages the enhanced image's quality. Three of our runs, namely Pillow, Cartoonization, and Retouch, follow this approach. For the last approach, we assume that if the enhancement model is good enough, it will preserve the image's attributes while the image is being protected from the BIQA model; the End-to-End with I-FGSM run applies this idea.

2 APPROACH

2.1 Vanilla End-to-End
In this run, we use an image-to-image network to reconstruct the input image and then forward the reconstructed result to the BIQA regressor. We choose U-Net [6] as the main network because it is one of the most popular baselines for image-to-image problems and is simple to implement.

[Figure 1: Vanilla End-to-End method]

In Fig. 1, the image x is fed to the U-Net, which outputs the image y. Simultaneously, x is enhanced to x' using simple transformations from standard computer vision libraries (the same as in Section 2.2.2). We then use the trained, frozen BIQA model to predict a score for y and generate a pseudo target score to attack the true score of y. The network minimizes two objectives: the reconstruction loss between x' and y, and the regression loss between the pseudo score and the true score.

Reconstruction loss: We experiment with both L2 loss and SSIM loss [10] and find that the model trained with L2 produces more visually appealing images.

Regression loss: We use L2 loss to measure the distance between the two scores. The pseudo score is generated by subtracting a margin A from the original score B. We experiment with A set to 30, 50, and B itself, and choose A = 30 for the submitted run.

We sum both losses and back-propagate through the U-Net so that it can learn. The network is trained end to end on the pp2020_dev dataset, with the U-Net initialized from scratch.
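To make the training objective concrete, the following is a minimal sketch of one training step in PyTorch-style code. It reflects our own assumptions rather than the exact implementation: unet, biqa, and enhance are hypothetical stand-ins for the U-Net, the frozen BIQA regressor, and a tensor-level version of the classical enhancement transform; images are assumed to be batched tensors; and the margin of 30 is the value chosen for this run.

    import torch
    import torch.nn.functional as F

    def training_step(x, unet, biqa, enhance, optimizer, margin=30.0):
        # One end-to-end update: reconstruct x, score the reconstruction with the
        # frozen BIQA regressor, and pull that score towards (original score - margin).
        x_enh = enhance(x)                     # enhanced target x' (classical transforms)
        y = unet(x)                            # reconstructed image
        score = biqa(y)                        # BIQA score of the reconstruction
        with torch.no_grad():
            pseudo = biqa(x) - margin          # pseudo target score (assumed derived from the input's score)

        recon_loss = F.mse_loss(y, x_enh)      # L2 reconstruction loss between y and x'
        reg_loss = F.mse_loss(score, pseudo)   # L2 regression loss between the two scores
        loss = recon_loss + reg_loss           # sum of both objectives

        optimizer.zero_grad()
        loss.backward()                        # gradients reach only the U-Net; BIQA stays frozen
        optimizer.step()
        return loss.item()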
2.2 Two-Stage Approaches

2.2.1 Attack Algorithm. In these approaches, we use the Iterative Fast Gradient Sign Method (I-FGSM) [4] to perform a white-box attack on the BIQA model after the images have been enhanced. Since BIQA is a regression model, we use the L2 loss function instead of cross-entropy, as in Section 2.1.

Our modified I-FGSM is described as follows, with X the input image, y the BIQA score of X^adv_N, and y' the attacking score. J(X, y, y') is the L2 cost function of the neural network, given image X, score y, and attacking score y', measuring the distance between y and y':

    X^adv_0 = X                                                              (1)
    X^adv_{N+1} = Clip_{X,ε}{ X^adv_N + α · sign(∇_X J(X^adv_N, y, y')) }    (2)

Given y, the predicted score on X^adv_N, we iteratively add the perturbation to X until y becomes smaller than y'. For all the runs below, we find that setting α = 0.05, ε = 0.05, and y' = 30 gives desirable results in most cases.

2.2.2 Image Enhancement Algorithms.
Pillow: We use several image enhancement operations provided by Pillow: adjusting color balance by a factor of 1.5, sharpness by 3.0, brightness by 1.0, and contrast by 1.5 (see the Pillow documentation for the meaning of these factors). We apply the same configuration to all images in the dataset.

Cartoonization: For this run, we apply the GAN-based White-box Cartoonization method [8] to convert input images into cartoon images in the styles of Shinkai Makoto, Miyazaki Hayao, and Hosoda Mamoru films.

Retouch: In this run, we also want to compare a deep learning "white-box" approach [2], which produces natural-looking enhancement, with the "black-box" methods. This method applies deep reinforcement learning and a GAN model to produce parameters for traditional image processing operations that improve image quality.

2.3 End-to-End with I-FGSM
Different from the previous approaches, here we integrate I-FGSM with an enhancement model. In each iteration, we first feed the image forward through the enhancement model and the BIQA model, then back-propagate the L2 loss to compute the gradient and apply the I-FGSM update to the input image. For this particular experiment, we choose EnlightenGAN [3] as the enhancement model.
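The update of Eqs. (1)-(2) can be sketched as follows. This is a simplified illustration under our own assumptions, not the exact submitted code: biqa is a hypothetical frozen BIQA regressor returning one score per image, pixel values are assumed to lie in [0, 1], and the step is written as a descent on J (the targeted form of the iterative attack) so that the predicted score is pulled down towards the attacking score y'. The optional enhance argument covers the End-to-End variant of Section 2.3, where EnlightenGAN sits inside the loop.

    import torch
    import torch.nn.functional as F

    def ifgsm_attack(x, biqa, target_score=30.0, alpha=0.05, eps=0.05,
                     max_iters=50, enhance=None):
        # Iteratively perturb x until the BIQA score drops below the attacking
        # score y'. Gradients are taken with respect to the input image only.
        x_adv = x.detach().clone()
        target = torch.full((x.size(0), 1), target_score, device=x.device)

        for _ in range(max_iters):
            x_adv.requires_grad_(True)
            out = enhance(x_adv) if enhance is not None else x_adv  # optional enhancement in the loop
            score = biqa(out)                                       # predicted quality score
            if (score <= target).all():                             # stop once every score is below y'
                break
            loss = F.mse_loss(score, target)                        # L2 cost J(X, y, y')
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                # descent step so the score moves towards the target, then clip
                # to the eps-ball around the original image and to valid pixels
                x_adv = x_adv - alpha * grad.sign()
                x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
        return x_adv.detach()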
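The Pillow run of Section 2.2.2 can be reproduced, up to implementation details such as the exact order of operations, with a few lines of PIL.ImageEnhance; the factors below are the ones listed above, applied identically to every image.

    from PIL import Image, ImageEnhance

    def pillow_enhance(in_path, out_path):
        # Fixed enhancement configuration used in the Pillow run.
        img = Image.open(in_path).convert("RGB")
        img = ImageEnhance.Color(img).enhance(1.5)       # color balance
        img = ImageEnhance.Sharpness(img).enhance(3.0)   # sharpness
        img = ImageEnhance.Brightness(img).enhance(1.0)  # brightness (1.0 = unchanged)
        img = ImageEnhance.Contrast(img).enhance(1.5)    # contrast
        img.save(out_path)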
3 EXPERIMENTS AND RESULTS
As can be seen in Table 1, "Accuracy (after JPEG 90)" is the accuracy of the BIQA model on the dataset after JPEG compression at quality 90 (lower is better). "Number of times selected as Best" (max. 140) is based on a human expert evaluation: the 20 images with the largest BIQA variance are shown to 7 human experts, who pick the best three out of all runs.

Table 1: Official evaluation result (provided by the organizers)

    Method                                Accuracy after JPEG 90 (%)   Times selected as "Best" (max. 140)
    End-to-End (U-Net)                    13.27                        11
    Retouch + I-FGSM                       0.18                        34
    Cartoonization + I-FGSM                1.27                        34
    Pillow + I-FGSM                       48.18                        60
    End-to-End (EnlightenGAN) + I-FGSM     0.00                        84

With these results, we successfully fool BIQA by more than 50% in all runs. EnlightenGAN proves to be the best method among our runs, followed by the enhancement based on Pillow's traditional image processing. This is an interesting result compared to our submission to the previous Pixel Privacy task [7]: last year, traditional methods still outperformed GAN-based approaches, but this year our proposed approach combined with a GAN method has proven better than a traditional image processing method. A possible explanation is that a GAN can perform more flexible, case-by-case image enhancement than traditional algorithms with hard-coded parameters.

[Figure 2: Sample outputs with BIQA scores — (a) Original (73.79), (b) EnlightenGAN (35.46), (c) Pillow + I-FGSM (49.9), (d) U-Net (57.88), (e) White-box Cartoonization (14.85), (f) Retouch (41.9)]
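For reference, the "Accuracy (after JPEG 90)" metric can be approximated offline with a sketch like the following; biqa_score and threshold are hypothetical placeholders, since the official scoring code and high-quality cut-off are not restated here.

    import io
    from PIL import Image

    def jpeg90_accuracy(image_paths, biqa_score, threshold):
        # Fraction of protected images that the BIQA model still rates as
        # high quality after JPEG compression at quality 90 (lower is better).
        survived = 0
        for path in image_paths:
            img = Image.open(path).convert("RGB")
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=90)   # JPEG-90 step applied before scoring
            buf.seek(0)
            compressed = Image.open(buf)
            if biqa_score(compressed) >= threshold:    # still classified as high quality
                survived += 1
        return 100.0 * survived / len(image_paths)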
4 CONCLUSION
All of our approaches are simple yet effective enough to fool the BIQA model while maintaining high image quality. The image-to-image method has the benefit that it does not require paired images for training, and it attacks at the feature level rather than on raw pixels, in contrast to the other methods; although it gives the worst result among our runs, we believe it can be further investigated and improved. The two-stage approaches, whose results are better, still leave clearly visible noise on the images. I-FGSM also shows its strength in white-box attacks, in both classification and regression settings. Our newly proposed approach, the combination of end-to-end enhancement and I-FGSM, is both efficient and effective at camouflaging and visually enhancing the images.

ACKNOWLEDGMENTS
This research is supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. 2020. KonIQ-10k: An Ecologically Valid Database for Deep Learning of Blind Image Quality Assessment. IEEE Transactions on Image Processing 29 (2020), 4041–4056. https://doi.org/10.1109/tip.2020.2967829
[2] Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. 2017. Exposure: A White-Box Photo Post-Processing Framework. CoRR abs/1709.09602 (2017). arXiv:1709.09602 http://arxiv.org/abs/1709.09602
[3] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. 2019. EnlightenGAN: Deep Light Enhancement without Paired Supervision. arXiv preprint arXiv:1906.06972 (2019).
[4] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial Examples in the Physical World. (2017). arXiv:cs.CV/1607.02533
[5] Zhuoran Liu, Zhengyu Zhao, Martha Larson, and Laurent Amsaleg. 2020. Exploring Quality Camouflage for Social Images. In Working Notes Proceedings of the MediaEval Workshop.
[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. (2015). arXiv:cs.CV/1505.04597
[7] Hung Vinh Tran, Trong-Thang Pham, Hai-Tuan Ho-Nguyen, Hoai-Lam Nguyen-Hy, Xuan-Vy Nguyen, Thang-Long Nguyen-Ho, and Minh-Triet Tran. 2019. HCMUS at Pixel Privacy 2019: Scene Category Protection with Back Propagation and Image Enhancement. (2019).
[8] Xinrui Wang and Jinze Yu. 2020. Learning to Cartoonize Using White-Box Cartoon Representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Xin Li. 2002. Blind Image Quality Assessment. In Proceedings. International Conference on Image Processing, Vol. 1. I–I. https://doi.org/10.1109/ICIP.2002.1038057
[10] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. 2016. Loss Functions for Image Restoration with Neural Networks. IEEE Transactions on Computational Imaging 3, 1 (2016), 47–57.