
Understanding the One-pixel Attack: Propagation Maps and Locality Analysis

Danilo Vasconcellos Vargas

vargas@kyushu-u.ac.jp

Jiawei Su

Deep neural networks have been shown to be vulnerable to single-pixel modifications. However, the reason behind this phenomenon has never been elucidated. Here, we propose propagation maps, which show the influence of the perturbation in each layer of the network. Propagation maps reveal that even in extremely deep networks such as Resnet, a modification of one pixel easily propagates to the last layer. In fact, this initially local perturbation is also shown to spread, becoming a global one and reaching absolute difference values that are close to the maximum value of the original feature maps in a given layer. Moreover, we conduct a locality analysis in which we demonstrate that pixels near the perturbed one in the one-pixel attack tend to share the same vulnerability, revealing that the main vulnerability lies in neither neurons nor pixels but in receptive fields. We hope that the analysis conducted in this work, together with the new technique of propagation maps, will shed light on the inner workings of other adversarial samples and serve as the basis for new defense systems.

Recently, a series of papers have shown that deep neural networks (DNNs) are vulnerable to various types of attacks [Szegedy, 2014], [Nguyen et al., 2015], [Moosavi-Dezfooli et al., 2017], [Brown et al., 2017], [Su et al., 2019], [Moosavi-Dezfooli et al., 2016], [Carlini and Wagner, 2017], [Kurakin et al., 2016], [Sharif et al., 2016], [Athalye and Sutskever, 2018]. However, the reasons underlying these vulnerabilities are still largely unknown. We argue that one of the most important facets of adversarial machine learning resides in its investigative nature. In other words, adversarial machine learning provides us with key tools to understand current DNNs. As much as attacks tell us about the security of DNNs, they also tell us to what extent DNNs can reason over data and what they understand to be, for example, the concept of a "car" or "horse".

In this paper, inspired by recent attacks and defenses, we propose a technique called propagation maps that should be able to explain most of them. Here, given the limited space, we focus on one attack which is puzzling and largely unexplained, the one-pixel attack. Propagation maps enable us to show how a one-pixel perturbation may grow in influence over the layers and spread over many pixels to cause a final change in class. Moreover, statistical properties of the propagation reveal many properties of the attacks as well as their distribution (Figure 1).

Additionally, to further understand the one-pixel attack, a locality analysis is performed. The locality analysis consists of executing the attack on pixels near a successful one-pixel attack, i.e., using the same pixel perturbation but a different pixel position. Indeed, attacks on nearby pixels are effective, and their success rate is similar across different neural networks (Figure 1), showing that rather than pixels or neurons, the vulnerability lies in some of the receptive fields. This reveals an interesting property shared among DNNs which is independent of the model or attack success rate.

Adversarial Samples and Different Types of Attacks

The samples that can make machine learning algorithms misclassify received the name of adversarial samples. Let $f(x) \in \mathbb{R}^k$ be the output of a machine learning algorithm in which $x \in \mathbb{R}^{m \times n}$ is the input of the algorithm for input and output of sizes $m \times n$ and $k$ respectively. It is possible to define adversarial samples $x'$ explicitly as follows:

$$x' = x + \epsilon_x, \qquad \{x' \in \mathbb{R}^{m \times n} \mid \operatorname{argmax}_j f(x')_j \neq \operatorname{argmax}_i f(x)_i\} \tag{1}$$

in which $\epsilon_x \in \mathbb{R}^{m \times n}$ is a small perturbation added to the input.
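As a concrete reading of Equation 1, the following minimal sketch checks whether a perturbed input is adversarial. The classifier `toy_model` (a fixed random linear map followed by a softmax) and the names `is_adversarial` and `eps_x` are illustrative assumptions and not part of the original work.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the maximum for numerical stability
    return e / e.sum()

def is_adversarial(f, x, eps_x):
    """Return True when argmax f(x + eps_x) differs from argmax f(x)."""
    return int(np.argmax(f(x + eps_x))) != int(np.argmax(f(x)))

# Hypothetical stand-in classifier: linear scores over a flattened 4x4 input.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))

def toy_model(x):
    return softmax(W @ x.ravel())

x = rng.uniform(size=(4, 4))
zero = np.zeros_like(x)                 # no perturbation: never adversarial
print(is_adversarial(toy_model, x, zero))
```

Any model that maps an input array to a vector of soft labels could be passed in place of `toy_model`.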

In adversarial machine learning, one searches for adversarial samples. For that, knowledge of the DNN in question can be used to craft samples, for example by using back-propagation to obtain gradient information and subsequently using it in a gradient-based update, as done by the "fast gradient sign" method proposed by I.J. Goodfellow et al. [Goodfellow et al., 2014a]. There is also the greedy perturbation searching method proposed by S.M. Moosavi-Dezfooli et al. [Moosavi-Dezfooli et al., 2016], and N. Papernot et al. use the Jacobian matrix of the function learned during training to create a saliency map which guides the search for adversarial samples [Papernot et al., 2016].

However, it is possible to search for adversarial samples without taking into account the internal characteristics of DNNs. This type of model-agnostic search is also called a black-box attack.

Regarding untargeted attacks, the objective function can be defined, for example, as the minimization of the soft label for the outputted class $f(x)_i$. The complete formulation is:

$$\underset{\epsilon_x}{\text{minimize}} \quad f(x + \epsilon_x)_i \qquad \text{subject to} \quad \|\epsilon_x\| \leq L \tag{2}$$

The minimization of the difference between the highest soft label and the second highest one is another possible objective function for untargeted attacks. There are many black-box attacks in the literature, for example [Narodytska and Kasiviswanathan, 2017], [Papernot et al., 2017], [Dang et al., 2017].

It is important to note that $\epsilon_x$ should be small enough to not allow an image to become a different class. Such a transformation would invalidate the adversarial sample because the result would not be a misclassification. Since most attacks use perturbations which span the whole image, $\|\epsilon_x\| \leq L$ is a good optimization constraint. However, it is also possible to look at the problem the other way around and perturb only a few dimensions. In this case, the constraint changes to an $L_0$ norm, which counts the dimensions of the perturbation, i.e., the total number of non-zero elements in the perturbation vector. The complete formulation is as follows:

$$\underset{\epsilon_x}{\text{minimize}} \quad f(x + \epsilon_x)_i \qquad \text{subject to} \quad \|\epsilon_x\|_0 \leq d \tag{3}$$

where $d$ is a small number of dimensions ($d = 1$ for the one-pixel attack).
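The difference between the two constraints can be illustrated with a short sketch that builds a dense perturbation and a one-pixel perturbation and measures both. Reporting the largest absolute value for the dense case is an assumption made for illustration, since the text leaves the norm of the whole-image constraint unspecified, and the helper names `l0_norm` and `one_pixel_perturbation` are hypothetical.

```python
import numpy as np

def l0_norm(eps_x):
    """Number of non-zero elements of the perturbation (its L0 'norm')."""
    return int(np.count_nonzero(eps_x))

def one_pixel_perturbation(shape, row, col, delta):
    """Perturbation that is zero everywhere except at one position."""
    eps_x = np.zeros(shape)
    eps_x[row, col] = delta
    return eps_x

shape = (32, 32)                                   # single-channel image
dense = np.full(shape, 0.01)                       # small change on every pixel
sparse = one_pixel_perturbation(shape, 5, 7, 0.9)  # large change on one pixel

print(l0_norm(dense), float(np.abs(dense).max()))    # 1024 0.01
print(l0_norm(sparse), float(np.abs(sparse).max()))  # 1 0.9
```

With colour images a single pixel spans several channel values, so a pixel-level count would first collapse the channel axis; that detail is left out here.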

Recent Advances in Attacks and Defenses

The question of whether machine learning is secure was asked some time ago [Barreno et al., 2006], [Barreno et al., 2010]. However, it was only in 2013 that the security of deep neural networks (DNNs) was seriously called into question [Szegedy, 2014]. C. Szegedy et al. demonstrated that by adding noise to an image it is possible to produce a visually identical image which can make DNNs misclassify. This was counter-intuitive, since the DNNs that misclassified had a very high test accuracy, rivaling even the accuracy of human beings.

Recently, the vulnerabilities of neural networks were shown to be even more severe. Universal adversarial perturbations, in which a single crafted perturbation is able to make a DNN misclassify multiple samples, were also shown to be possible [Moosavi-Dezfooli et al., 2017]. Moreover, in [Su et al., 2019] it was shown that even one pixel could make DNNs misclassify, indicating that although DNNs have a high accuracy in recognition tasks, their "understanding" of what is a "dog" or "cat" is still very different from that of human beings. In fact, adversarial samples can be used to evaluate the robustness of a DNN [Moosavi-Dezfooli et al., 2016], [Carlini and Wagner, 2017].

Although much of the research in adversarial machine learning is conducted under ideal conditions in a laboratory, the same techniques are not difficult to apply to real-world scenarios because printed-out adversarial samples still work, i.e., many adversarial samples are robust against different light conditions [Kurakin et al., 2016]. In fact, in [Athalye and Sutskever, 2018] the authors go a step further and verify the existence of 3D adversarial objects which can fool DNNs even when viewpoint, noise, and different light conditions are taken into consideration.

There are many works on attacks and defenses, but the reason behind such a lack of robustness in accurate classifiers is still largely unknown. In [Goodfellow et al., 2014b] it is argued that the linearity of DNNs is one of the main reasons. If this is the case, perhaps hybrid systems that can leverage the nonlinearity arising from complex models by using evolutionary optimization techniques, such as self-organizing classifiers [Vargas et al., 2013] and neuroevolution with unified neuron models [Vargas and Murata, 2017], would make for a promising investigation.

One-Pixel Attack

The one-pixel attack investigated the opposite extreme of most attacks to date. Instead of searching for a small perturbation spread over the image, it focuses on perturbing just one pixel. This vulnerability to one pixel is a totally different scenario, i.e., neural networks that are vulnerable to the usual attacks may not be vulnerable to the one-pixel attack and vice versa.

To achieve such an attack in a black-box scenario, the authors used differential evolution (DE), a simple yet effective evolutionary algorithm [Storn and Price, 1997]. A candidate solution is coded as a pixel position and its related perturbation. The DE searches for promising candidate solutions by minimizing the output label of the correct class (Equation 2). In this paper, we use the same DE settings as the original paper [Su et al., 2019]. However, here we define a successful attack as an adversarial attack made on a correctly classified sample. As a consequence, adversarial attacks on already misclassified samples are ignored.
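The search described above can be sketched as a small differential evolution loop. This is a simplified illustration and not the configuration of [Su et al., 2019]: the candidate encoding `(row, col, r, g, b)`, the population size, the number of generations, the scale factor, and the stand-in classifier `toy_model` are all assumptions chosen to keep the example short.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def apply_candidate(image, cand):
    """Decode (row, col, r, g, b) and overwrite that pixel in a copy."""
    h, w, _ = image.shape
    row = int(np.clip(np.rint(cand[0]), 0, h - 1))
    col = int(np.clip(np.rint(cand[1]), 0, w - 1))
    out = image.copy()
    out[row, col] = np.clip(cand[2:5], 0.0, 1.0)
    return out

def one_pixel_de(model, image, true_class, pop_size=40, generations=30,
                 scale=0.5, seed=0):
    """Minimise the soft label of true_class over one-pixel candidates."""
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    low = np.zeros(5)
    high = np.array([h - 1, w - 1, 1, 1, 1], dtype=float)
    pop = rng.uniform(low, high, size=(pop_size, 5))

    def fitness(cand):
        return model(apply_candidate(image, cand))[true_class]

    fit = np.array([fitness(c) for c in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct members other than i.
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.choice(others, size=3, replace=False)
            trial = np.clip(pop[r1] + scale * (pop[r2] - pop[r3]), low, high)
            trial_fit = fitness(trial)
            if trial_fit < fit[i]:          # keep the trial only if it is better
                pop[i], fit[i] = trial, trial_fit
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Hypothetical toy classifier: linear scores on a flattened 8x8 RGB image.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8 * 8 * 3))

def toy_model(img):
    return softmax(W @ img.ravel())

image = rng.uniform(size=(8, 8, 3))
true_class = int(np.argmax(toy_model(image)))
best, best_conf = one_pixel_de(toy_model, image, true_class)
```

An attack would count as successful in the sense used here only if `image` was correctly classified to begin with and the class predicted for `apply_candidate(image, best)` differs from `true_class`.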

Propagation Maps

Perturbation on the input image propagates throughout the neural network to change its class in adversarial samples. However, much of this process is unknown. In other words, how does this perturbation cause a change in the class label? What are the internal differences between adversarial attacks and failed attacks if any?

Here, as a step towards answering the questions above, we propose a technique called propagation maps, which can reveal the perturbation throughout the layers. Propagation maps consist of comparing the feature maps of both adversarial and original samples. Specifically, by calculating the difference between the feature maps and averaging them (or taking their maximum value) for each convolutional layer, the perturbation's influence can be estimated. Consider an element-wise maximum of a three-dimensional array $O$ for indices $a$, $b$ and $k$ to be described as:

$$M_{a,b} = \max_{k}\left(O_{a,b,0}, O_{a,b,1}, \ldots, O_{a,b,k}\right) \tag{4}$$

where $M$ is the resulting two-dimensional array.

Therefore, for a layer $i$, its respective propagation map $PM_i$ can be obtained by applying this maximum (Equation 5), or alternatively an average (Equation 6), over the kernels to the absolute difference between the feature maps of the original and adversarial samples.
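The element-wise maximum of Equation 4 corresponds to a reduction over the last axis of the array. A tiny worked example, with arbitrary illustration values and the assumption that the kernel index is stored last, is:

```python
import numpy as np

# O has shape (rows, cols, kernels); the values are arbitrary.
O = np.array([[[1.0, 4.0], [2.0, 0.5]],
              [[3.0, 3.5], [0.0, 7.0]]])

M = O.max(axis=-1)      # element-wise maximum over the kernel index k
print(M.tolist())       # [[4.0, 2.0], [3.5, 7.0]]
```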

Propagation Maps for the One-Pixel Attack

To investigate how a single pixel perturbation can cause changes in class, we will make use of the proposed propagation maps. This will allow us to visualize the perturbations in each of the layers of the neural network.

For the experiments, Resnet, which is one of the most accurate types of neural networks, is used. Each of the subsections below investigates a specific scenario.

Single Pixel Perturbations that Change Class

In Figure 2, the propagation map (PMmax) of a successful one-pixel attack is shown. The perturbation starts small and localized and then spreads in deeper layers. In the last layer, the perturbation has spread enough to strongly influence more than a quarter of the feature map. Because PMmax takes the element-wise maximum, it shows how strong the maximum difference across feature maps is at each position.

The propagation map based on averaged differences (PMavg) shows that the difference is concentrated in some feature maps (Figure 3). Moreover, this average difference stays more or less the same throughout the layers. In the case of PMmax, the difference had a sort of wave behavior, sometimes growing in strength, sometimes slightly fading away. All observed adversarial samples shared similar propagation map features. This is to be expected, since the perturbation needs enough influence to change the class.

Surprisingly, a one-pixel change can cause influences that spread over the entire feature map, especially in deeper layers. This also contradicts to some extent the expectation that high-level features are processed in deeper layers.

Single Pixel Perturbations that Do Not Change Class

Successful one-pixel attacks were shown to grow in influence throughout the layers, culminating in a strong and widespread influence in the last layers. Here we change the position of the pixel so that the attack no longer succeeds. Figure 4 shows that in such a case the intensity of the influence decreases. In fact, in the last layer the influence is almost imperceptible. However, this is not the rule. A counterexample is shown in Figure 5, in which a pixel is changed without changing the class label. This time, however, the perturbation propagates strongly, being as strong as, if not stronger than, the successful one-pixel attack observed in Figure 2. One might argue that the influence has caused the confidence to decrease, but not enough to cause the change. For this case, the confidence indeed decreases from 99% to 52%. However, Figure 6 shows similar behavior although the confidence decreases by only one percentage point (from 100% to 99%).

Thus, qualitatively speaking, unsuccessful one-pixel attacks do not necessarily fail to achieve a high influence in the last layer. The outcome depends strongly on the pixel position and the sample. This is in accordance with saliency maps, which show that different parts of the image have different importance in the recognition process [Simonyan et al., 2013].

For a layer $i$, the propagation map based on the element-wise maximum is computed as:

$$PM_i = \max_{k}\left(\left|FM_{i,k} - FM^{adv}_{i,k}\right|\right) \tag{5}$$

where $FM_{i,k}$ and $FM^{adv}_{i,k}$ are respectively the feature maps for layer $i$ and kernel $k$ of the natural (original) and adversarial samples.

Alternatively, one may wish to see the average over the filters, which exposes a slightly different influence diluted over the kernels in the same layer. It can be computed as follows:

$$PM_i = \frac{1}{n_k} \sum_{k} \left|FM_{i,k} - FM^{adv}_{i,k}\right| \tag{6}$$

where $n_k$ is the number of filters.

PMmax and PMavg will be used to differentiate between propagation maps using Equations 5 and 6, respectively. Notice that, in order to put the perturbation's influence on the same scale as the original feature map, the maximum scale value is set to that of the original feature map when plotting.
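Both variants can be sketched in a few lines once the feature maps of a layer are available as arrays. The function name `propagation_maps`, the `(height, width, n_k)` layout, and the synthetic feature maps below are assumptions for illustration; how the feature maps are extracted from a network is outside this sketch.

```python
import numpy as np

def propagation_maps(fm_orig, fm_adv):
    """Return (PM_max, PM_avg) for one layer.

    fm_orig, fm_adv: arrays of shape (height, width, n_k) holding the
    feature maps of the original and adversarial samples.
    """
    diff = np.abs(fm_orig - fm_adv)
    pm_max = diff.max(axis=-1)     # maximum over kernels
    pm_avg = diff.mean(axis=-1)    # average over kernels
    return pm_max, pm_avg

# Illustration data only: a 4x4 layer with 3 kernels, changed at one position.
rng = np.random.default_rng(0)
fm_orig = rng.uniform(size=(4, 4, 3))
fm_adv = fm_orig.copy()
fm_adv[1, 2] += np.array([0.3, 0.0, -0.6])

pm_max, pm_avg = propagation_maps(fm_orig, fm_adv)
scale = float(fm_orig.max())   # upper limit of the colour scale when plotting
```

In this toy case the maximum variant reports about 0.6 at position (1, 2) while the average variant reports about 0.3, which shows how averaging dilutes a difference that is concentrated in a few kernels. When plotting, `scale` could be passed as the upper limit of the colour range so that the map is shown on the scale of the original feature map.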

Statistical Evaluation of Propagation Maps

In previous sections, single attacks were analyzed in the light of examples and counterexamples. These experiments were important to investigate what happens in detail for each of these attacks. However, they do not tell much about the distribution of attacks. This section aims to fill this gap by investigating the attacks’ distribution and other statistically relevant data.

By evaluating the average of PMmean over successful attacks, we can observe the attack distribution over the entire feature map as well as the average spread over the layers (Figure 7). The successful attacks seem to concentrate mostly close to the center of the image. In deeper layers, the influence expands and increases in intensity, especially at its center. This reveals that the behavior observed in Figure 2 is usual for most of the attacks.

Given this distribution for successful attacks, it would be interesting to contrast them with failed attempts. This would enable us to further clarify the characteristics of a successful attack.

To clarify whether there is any explicit difference between successful and failed attacks, we calculated the mean over all the feature maps for the previous successful and unsuccessful attacks. The plot is shown in Figure 8. The average difference is also unable to distinguish between successful and failed attacks, with both showing very similar behavior.
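The two aggregations used in this section, a position-wise average of propagation maps over many attacks and a single mean value per layer, can be sketched as follows. The data are synthetic stand-ins and the helper names `average_map` and `layer_means` are hypothetical; the figures in this paper are computed from real attacks.

```python
import numpy as np

def average_map(maps):
    """Average a list of same-shaped propagation maps position by position."""
    return np.mean(np.stack(maps), axis=0)

def layer_means(maps_per_layer):
    """One scalar per layer: the mean over all attacks and positions."""
    return [float(np.mean(np.stack(maps))) for maps in maps_per_layer]

# Synthetic stand-in data: 2 layers, 5 attacks per group, 4x4 maps.
rng = np.random.default_rng(0)
successful = [[rng.uniform(size=(4, 4)) for _ in range(5)] for _ in range(2)]
failed = [[rng.uniform(size=(4, 4)) for _ in range(5)] for _ in range(2)]

mean_map_layer0 = average_map(successful[0])   # spatial distribution, layer 0
curve_success = layer_means(successful)        # one value per layer
curve_failed = layer_means(failed)
```

Plotting `curve_success` against `curve_failed` over the layer index gives the kind of comparison described for Figure 8.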

Position Sensitivity and Locality

The one-pixel attack works by searching for a one-pixel perturbation that changes the class (i.e., causes a misclassification). This search process is costly, but to what extent does the success of the attack depend on its position?

Here the position sensitivity of the attack is analyzed. First, we consider an attack in which a randomly chosen pixel is perturbed by the same amount that created an adversarial sample elsewhere in the image. The results, which are shown in Table 1, demonstrate that random pixel attacks have a very low success rate. This suggests that position is important and that, by disregarding it, a successful attack is almost impossible to achieve. Having said that, the attack on nearby pixels (i.e., the eight adjacent pixels) shows a positive result. In fact, considering that this attack does not conduct any search at all but only takes one random pixel, the results may be considered extremely positive.
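The nearby attack can be sketched as follows. The reading that one of the eight adjacent pixels is picked at random, the helper names, and the stand-in classifier `toy_model` are assumptions for illustration only.

```python
import numpy as np

def neighbours(row, col, height, width):
    """Coordinates of the (up to) eight pixels adjacent to (row, col)."""
    coords = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < height and 0 <= c < width:
                coords.append((r, c))
    return coords

def nearby_attack(model, image, true_class, row, col, value, rng):
    """Reuse a successful pixel value at one randomly chosen neighbour."""
    h, w = image.shape[:2]
    options = neighbours(row, col, h, w)
    r, c = options[rng.integers(len(options))]
    moved = image.copy()
    moved[r, c] = value
    return int(np.argmax(model(moved))) != true_class

# Hypothetical classifier: predicts class 1 as soon as any pixel is bright.
def toy_model(img):
    top = float(img.max())
    return np.array([1.0 - top, top])

rng = np.random.default_rng(0)
image = np.zeros((8, 8))
print(len(neighbours(0, 0, 8, 8)), len(neighbours(4, 4, 8, 8)))  # 3 8
print(nearby_attack(toy_model, image, 0, 4, 4, 1.0, rng))        # True
```

Averaging the boolean result over many successful one-pixel attacks would give a success rate of the kind reported in Table 1.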

The extremely positive results in Table 1, however, are in accordance with the receptive fields of convolutional layers. In other words, every neuron in a convolutional layer calculates a convolution of the kernel with a part of the input image which is of the same size as the kernel. The convolution itself is a linear function in which a change in one input affects the whole convolution. Thus, perturbing a nearby pixel that lies in the same receptive field affects the result of the convolution in a similar way. Consequently, this shows that the vulnerable parts of DNNs are neither neurons nor pixels but some receptive fields.
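The receptive-field argument can be checked on a toy example. The sketch below uses a single-channel image, one arbitrary 3x3 kernel with positive weights, no padding and no non-linearity, all of which are simplifying assumptions. It counts how many outputs a single-pixel change reaches and how many of those are shared with an adjacent pixel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation without padding (a linear operation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def influence(image, row, col, kernel, layers):
    """Boolean map of outputs changed by perturbing pixel (row, col)."""
    perturbed = image.copy()
    perturbed[row, col] += 1.0
    a, b = image, perturbed
    for _ in range(layers):
        a, b = conv2d_valid(a, kernel), conv2d_valid(b, kernel)
    return np.abs(a - b) > 1e-9

rng = np.random.default_rng(0)
image = rng.uniform(size=(12, 12))
kernel = rng.uniform(0.1, 1.0, size=(3, 3))

one_layer = influence(image, 6, 6, kernel, layers=1)
two_layers = influence(image, 6, 6, kernel, layers=2)
neighbour = influence(image, 6, 7, kernel, layers=1)

print(int(one_layer.sum()))                # 9 outputs see the pixel
print(int(two_layers.sum()))               # 25 outputs after a second layer
print(int((one_layer & neighbour).sum()))  # 6 outputs shared with a neighbour
```

Two adjacent pixels therefore feed mostly the same outputs, and the affected region widens with depth, which is in line with both the locality results and the spreading seen in the propagation maps.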

Interestingly, completely different networks have a very similar success rate for nearby attacks. This further demonstrates that the receptive field is the vulnerable part. Since neural networks with similar architectures share similar receptive field relationships, nearby attacks on similar networks should have similar success rates.

The Conflicting Salience Hypothesis

Propagation maps demonstrated that the influence of some pixels failed to reach the last layer (Figure 4) while others influenced the last layer enough to cause a change in class (Figure 2). This analysis closely resembles saliency maps, in which one wishes to discover which pixels are responsible for a class. In fact, since propagation maps measure the amount of influence from perturbations, it would be reasonable to assume that they may have a close relationship with disturbances in saliency maps. Consequently, adversarial samples would cause enough disturbance in saliency maps to cause a change in class. Thus, we raise here the hypothesis of a conflicting saliency from adversarial samples.

If this is true, then what adversarial machine learning is doing is not fooling DNNs but rather diverting their attention. It is like a magician who calls attention to his right hand while his left hand moves the magic ball. Or like a blinking light on the street that catches the attention of a driver, who then drives through a red traffic light.

Having said that, propagation maps are a feedforward-based technique to visualize and measure the influence of a perturbation, while saliency maps aim to investigate the salient pixels for a given class with backpropagated gradients. Therefore, the methods differ in many ways and their relationship, which might be more complex than what is stated here, goes beyond the scope of this paper. We leave this as an open question that should be worthwhile to investigate.

Conclusions

This paper proposed a novel technique called propagation maps and used it to analyze one of the most puzzling attacks, the one-pixel attack. The analysis showed how a pixel modification causes an influence throughout the layers, culminating in the change of the class. Moreover, a locality analysis revealed that receptive fields are the vulnerable parts of DNNs and that attacks on pixels near a successful one-pixel attack therefore have a high success rate. Lastly, a new hypothesis was proposed that could, together with the proposed propagation maps, help explain the reason behind adversarial attacks on DNNs.

Acknowledgements

This work was supported by JST, ACT-I Grant Number JP50243 and JSPS KAKENHI Grant Number JP20241216.

References

[Athalye and Sutskever, 2018] Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. In ICML, 2018.

[Barreno et al., 2006] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D Joseph, and J Doug Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pages 16-25. ACM, 2006.

[Barreno et al., 2010] Marco Barreno, Blaine Nelson, Anthony D Joseph, and J Doug Tygar. The security of machine learning. Machine Learning, 81(2):121-148, 2010.

[Brown et al., 2017] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39-57. IEEE, 2017.

[Dang et al., 2017] Hung Dang, Yue Huang, and Ee-Chien Chang. Evading classifiers by morphing in the dark. 2017.

[Goodfellow et al., 2014a] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[Goodfellow et al., 2014b] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[Kurakin et al., 2016] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[Moosavi-Dezfooli et al., 2016] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574-2582, 2016.

[Moosavi-Dezfooli et al., 2017] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 86-94. IEEE, 2017.

[Narodytska and Kasiviswanathan, 2017] Nina Narodytska and Shiva Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310-1318. IEEE, 2017.

[Nguyen et al., 2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427-436, 2015.

[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372-387. IEEE, 2016.

[Papernot et al., 2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506-519. ACM, 2017.

[Sharif et al., 2016] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528-1540. ACM, 2016.

[Simonyan et al., 2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[Storn and Price, 1997] Rainer Storn and Kenneth Price. Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11(4):341-359, 1997.

[Su et al., 2019] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.

[Szegedy, 2014] Christian Szegedy et al. Intriguing properties of neural networks. In ICLR. Citeseer, 2014.

[Vargas and Murata, 2017] Danilo Vasconcellos Vargas and Junichi Murata. Spectrum-diverse neuroevolution with unified neural models. IEEE transactions on neural networks and learning systems, 28(8):1759-1773, 2017.

[Vargas et al., 2013] Danilo Vasconcellos Vargas, Hirotaka Takano, and Junichi Murata. Self organizing classifiers: first steps in structured evolutionary machine learning. Evolutionary Intelligence, 6(2):57-72, 2013.