<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adversarial Examples Through Deep Neural Network's Classification Boundary and Uncertainty Regions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juan Shu</string-name>
          <email>shu30@purdue.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bowei Xi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Kamhoua</string-name>
          <email>charles.a.kamhoua.civ@army.mil</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Purdue University</institution>
          ,
          <addr-line>West Lafayette, IN 47907</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>US Army Research Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although AI is developing rapidly, AI's vulnerability under adversarial attacks remains an extraordinarily difficult problem. In this paper we study the root cause of adversarial examples through studying the deep neural network's (DNN) classification boundary. The existing attack algorithms can generate from a handful to a few hundred adversarial examples given one clean sample. We show there are a lot more adversarial examples given one clean sample, all within a small neighborhood of the clean sample. We then define DNN uncertainty regions and show the transferability of adversarial examples is not universal. The results lead to two conjectures regarding the size of the DNN uncertainty regions and where the DNN function becomes discontinuous. The conjectures offer a potential explanation for why the generalization error bound - the theoretical guarantee established for DNN - cannot adequately capture the phenomenon of adversarial examples.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Neural Network</kwd>
        <kwd>Adversarial Machine Learning</kwd>
        <kwd>Classification Boundary</kwd>
        <kwd>Uncertainty Region</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        After DNN gained popularity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], researchers noticed that adding targeted minor perturbations to a clean image can cause a DNN to misclassify the perturbed image [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. And often it is used in critical applications without fully understanding its vulnerabilities.
      </p>
      <p>Despite decades of theoretical research on DNN, there
are still many unanswered questions regarding DNN’s
properties. For example, we do not know the shape of
DNN classification boundary. There is also a discrepancy
between the established generalization error bounds for
DNN and the existence of adversarial examples.</p>
      <p>
        We know the shape of the decision boundary of many well known models, such as linear regression, generalized linear regression, non-parametric regression, and support vector machine (SVM), to name a few. Despite much work on building robust DNNs and evaluating DNN robustness, we are yet to know the shape of the DNN classification boundary. A lack of understanding of DNN's classification boundary naturally leads to the fact that we do not know where the regions containing the adversarial examples are. There are conflicting conjectures about the regions containing the adversarial examples. [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] believed adversarial examples lie in “dense pockets” in a lower dimensional manifold, caused by DNN's non-linearity. On the other hand, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] believed it is DNN's linear nature and the very high dimensional input space that give rise to adversarial examples.
      </p>
      <p>The major contributions of this paper are the following.
1. We show DNN classification boundary is highly fractured, unlike other classifiers. There are lower dimensional regions containing adversarial examples within a small neighborhood surrounding every clean image.
2. Our first conjecture is that the union of these lower dimensional bounded regions containing adversarial examples has zero probability mass. Our second conjecture is that a DNN function is discontinuous at the boundary of these lower dimensional bounded regions, and may be discontinuous inside some of these bounded regions. The two conjectures could be the reason that the theoretical guarantees established for DNN, such as the generalization error bounds, co-exist with the adversarial examples. Hence new theory is needed to evaluate DNN robustness.</p>
      <p>3. We show that transferability of adversarial examples is not universal, contrary to [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">2, 4, 5</xref>
        ], which suggested that adversarial examples generated against one DNN are misclassified by other DNNs, even if they have different model structures or are trained on different subsets of the training data.</p>
      <p>We show that adversarial examples against one DNN can be correctly classified by some other DNN models, simply by using different initial random seeds in the training process. This leads to our definition of DNN uncertainty regions.</p>
      <p>Besides the three major contributions, additional contributions of this paper are the following.</p>
      <p>1. Given one clean image, existing attack algorithms generate up to a few hundred adversarial examples. Sampling from the lower dimensional region leads to a stronger attack, generating a lot more adversarial examples given one clean image.
2. Far fewer pixels are perturbed to form these hyper-rectangles compared to the existing attack algorithms. Therefore we reduce the total amount of perturbations added to a clean image to create adversarial examples.</p>
      <p>The paper is organized as follows. Section 1.1 discusses the related work. Section 2 conducts experiments to establish the shape of the DNN classification boundary and introduces the concept of DNN uncertainty regions. Section 3 discusses the discrepancy between the theoretically proven DNN large sample property, its generalization error bound, and the existence of adversarial examples. Section 4 concludes this paper.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
          <p>There are two broad categories of attacks, poisoning attacks and evasion attacks [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Poisoning attacks inject malicious samples into the training data, to cause the resulting learning model to make a mistake with certain test samples. Assuming there is no easy access to the training process, evasion attacks generate test samples that the learning model cannot handle correctly. The adversarial examples generated to attack DNN belong to evasion attacks. Depending on adversaries' knowledge of a DNN model, there are white-box attacks and black-box attacks. For white-box attacks, adversaries know the true DNN model, including model structure and parameter values. For black-box attacks, adversaries don't know the true model. Instead, adversaries query the true model, build a local substitute model based on the queries, and attack the local model. A targeted attack generates adversarial examples that are misclassified into a pre-determined class, while an untargeted attack simply generates misclassified samples. Several survey papers are published, introducing the current state and the timeline of attacks and defenses, e.g., [
            <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
            ]. In general, the attack algorithms follow an optimization approach, i.e., generating adversarial examples through minimizing a loss function.
          </p>
          <p>
            Adversarial evasion attacks against DNN are the
earliest attacks. Recently there are attacks designed to break
graph neural network (e.g., [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]), recurrent neural network
(e.g., [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]) etc. In this paper, we examine the classification
boundary and uncertainty regions of CNN and MLP. In
our experiments we use Foolbox [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which implements
a large collection of adversarial attack algorithms.
          </p>
          <p>Let x be a clean image and x′ be an adversarial example. Let F be a trained DNN model that assigns a class label to x, with F(x′) ≠ F(x). x is a matrix for a gray-scale image, and a tensor for a color image. The size of the matrix/tensor is determined by the image resolution. The individual elements (pixels) in x represent the light level, having integer values ranging from 0 (no light) to 255 (maximum light). The pixels are rescaled to [0, 1] by dividing the pixel value by 255. x can be vectorized. Assume a vectorized x is d-dimensional, i.e., x ∈ [0, 1]^d. Attack algorithms that generate a single x′ or only a handful of x′s are not used in our experiments, because there are not enough adversarial examples to locate the region containing these x′s. We also exclude attack algorithms that need large perturbations to generate x′. Here are the attack algorithms that are used in our experiments: (1) Pointwise (PW) Attack; (2) Carlini &amp; Wagner L2 (CW2) Attack; (3) NewtonFool (NF) Attack; (4) Fast Gradient Sign Method (FGSM); (5) Basic Iterative Method (BIM) L1, L2, L∞ attacks; (6) Momentum Iterative (MI) Attack.</p>
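          <p>As an illustration of how such attacks can be run in practice, the sketch below uses Foolbox [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] with a PyTorch classifier. The stand-in model, the batch of random images, and the particular attack classes are assumptions for illustration only, not the exact code used in our experiments.</p>
          <preformat>
# Sketch: generating adversarial examples with Foolbox (assumed Foolbox 3.x API).
import torch
import torch.nn as nn
import foolbox as fb

# Stand-in classifier (the paper re-trains LeNet/MLP/MobileNet models instead).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).eval()
fmodel = fb.PyTorchModel(model, bounds=(0, 1))   # pixels rescaled to [0, 1]

images = torch.rand(8, 1, 28, 28)                # stand-in batch of clean images
labels = model(images).argmax(dim=1)             # labels the model currently assigns

attacks = {
    "CW2":    fb.attacks.L2CarliniWagnerAttack(),
    "FGSM":   fb.attacks.LinfFastGradientAttack(),
    "BIM-L2": fb.attacks.L2BasicIterativeAttack(),
}
for name, attack in attacks.items():
    # raw: perturbed images; clipped: perturbations clipped to the epsilon budget;
    # success: whether each perturbed image fools the target model.
    raw, clipped, success = attack(fmodel, images, labels, epsilons=1.5)
    print(name, "success rate:", success.float().mean().item())
</preformat>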
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. DNN’s Uncertainty Regions</title>
      <p>A DNN function is described by F(x) = y, where y is the object class assigned to image x by a trained DNN model used to classify the samples. In this section we use the existing attack algorithms to locate lower dimensional regions containing adversarial examples within a small neighborhood of a clean image, and we discover there exist multiple uncertainty regions inside a ball B(x, ε).</p>
      <sec id="sec-2-1">
        <title>DNN Model Structure</title>
        <p>Because we focus on studying the classification boundary of DNN, here the DNN model structure must strictly remain the same. We discover that even a minor change to the model structure, such as adding or removing a batch normalization layer, will lead to a different classification boundary.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Uncertainty Region Construction</title>
        <p>We compare an adversarial example x′ with the corresponding clean x. If a pixel value in x′ is different than that in x, it is perturbed by the attack. Given a clean x, we use one attack algorithm and generate a sufficient amount of adversarial examples that are all mis-classified into the same wrong object class. We examine how many pixels are perturbed by the attack. Then we compute the interval for each perturbed pixel (the original x has a single value for this pixel). The perturbed pixels are ordered by the interval sizes from the largest to the smallest. We then construct a hyper-rectangle starting from the largest interval, and stop where the subsequent intervals can be considered as nearly constant (which may not equal the original pixel value for clean x). The detailed procedure is described as follows.</p>
        <p>Assume F1 is the model under attack. For a given attack algorithm a and an object class j, j ≠ i, where i is the true object class of x, we combine the adversarial examples x′ from both the targeted attack and the untargeted attack, s.t. F1(x′) = j. We then construct the subspace spanned by the x′s. This step requires an attack algorithm to generate a sufficient amount of perturbed images, at least 80-100 images, given one clean image. We notice that different attack algorithms discover different regions containing the adversarial examples for one clean image. Only a handful of adversarial examples is not enough to locate a region containing these adversarial examples. The more adversarial examples an attack algorithm can generate, the better we can locate such a region. Let ℳ denote the collection of DNN models with the same structure, trained with different initial random seeds. An uncertainty region is a region that cannot be separated into disjoint regions, where at least two DNN models Fi and Fj disagree on the hard label of x.</p>
        <p>Definition 1. An uncertainty region is defined as U := { x : ∃ Fi, Fj ∈ ℳ, s.t. Fi(x) ≠ Fj(x) }, where U cannot be separated into disjoint regions in [0, 1]^d.</p>
        <p>We use the L2 distance between a clean image x and an adversarial image x′, d(x, x′) = ||x − x′||2. Let ε be the radius of an ε-ball with the clean image x at the center, ε &gt; 0.</p>
        <p>Denote the ε-ball by B(x, ε) := { x′ : d(x, x′) ≤ ε }. When ε is sufficiently small, the points in B(x, ε) are noisy versions of x and should be labeled to the same object class as x. Given a clean image x, we can determine the value of ε based on the amount of adversarial perturbations. We choose ε to be slightly larger than the minimum amount of adversarial perturbations calculated from a number of attack algorithms. Figure 2 (b) conceptually shows two separate uncertainty regions in B(x, ε).</p>
        <p>Clean natural sample: We consider a clean natural image as the result of taking a photo using a camera. Regarding how many clean images we can have, let's consider the volume of a d-dimensional ball B(x, ε), |B(x, ε)| = π^(d/2) ε^d / Γ(d/2 + 1). The volume of the feature space [0, 1]^d is 1. For a fixed ε and d, there are only a finite number of non-overlapping ε-balls in the feature space [0, 1]^d. However, as d ⟶ ∞, we have |B(x, ε)| ⟶ 0. Hence the feature space for higher resolution color images can contain increasingly more clean images.</p>
        <p>The intervals of the perturbed pixels are ranked by interval size as I_(j,1) ≥ I_(j,2) ≥ ⋯ ≥ I_(j,q). We construct a hyper-rectangle R_j^(a) in B(x, ε) using the k largest intervals, k ≤ q, as R_j^(a) = ⊗_{l=1}^{k} [min_j^(a)(x_(l)), max_j^(a)(x_(l))], where min_j^(a)(x_(l)) and max_j^(a)(x_(l)) are the smallest and the largest perturbed values of the l-th ranked pixel among the adversarial examples. R_j^(a) is the subspace based on the adversarial examples generated by attack a and misclassified to class j. We choose the number of intervals k such that the remaining interval sizes are very small and the perturbations added can be considered as approximately constant.</p>
        <p>We use PyTorch 1.5.0 and Cuda 10.2 to run all the experiments. The CPU is an Intel Xeon Silver 4114 and the GPU is an Nvidia Tesla P100. The code is posted on GitHub (https://github.com/juanshu30/Understanding-AdversarialExamples-Through-DNNs-Classification-Boundary-andUncertainty-Regions).</p>
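        <p>The construction can be summarized by the following sketch: given adversarial examples that the target model misclassifies into the same class j, compute the per-pixel [min, max] intervals, rank them by width, and keep the k widest as the hyper-rectangle R_j^(a). Function and variable names here are illustrative assumptions, not our released code.</p>
        <preformat>
import numpy as np

def build_hyper_rectangle(clean, advs, k):
    """clean: (d,) clean image; advs: (m, d) adversarial examples, all
    misclassified into the same wrong class; k: number of intervals kept."""
    lo, hi = advs.min(axis=0), advs.max(axis=0)       # per-pixel interval [min, max]
    perturbed = np.flatnonzero(np.any(advs != clean, axis=0))
    widths = hi[perturbed] - lo[perturbed]
    order = perturbed[np.argsort(-widths)]            # rank intervals, widest first
    keep = order[:k]                                  # the k largest intervals span R_j^(a)
    return keep, lo[keep], hi[keep]

def sample_from_rectangle(clean, keep, lo, hi, n):
    """Sample n images inside the hyper-rectangle; unperturbed pixels keep
    their clean values."""
    samples = np.tile(clean, (n, 1))
    samples[:, keep] = np.random.uniform(lo, hi, size=(n, keep.size))
    return samples
</preformat>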
        <sec id="sec-2-2-1">
          <title>2.1. MNIST CNN Experiment</title>
          <p>Here we conduct an experiment with the task to classify the MNIST dataset of 10 handwritten digits. MNIST has 60,000 training images and 10,000 test images. Each image has 28x28 gray-scale pixels. Our model structure is the PyTorch implementation [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] of LeNet [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], which has two convolutional layers. The model structure has been published previously. We re-train LeNet on MNIST to optimize the parameter values. The optimizer is SGD with learning rate 0.01. x is a vectorized MNIST image with pixels rescaled to [0, 1] in the PyTorch implementation. We have x ∈ [0, 1]^784. Table 1 shows the accuracy of 10 re-trained LeNet models on the MNIST test data using different initial seeds. F1 to F10 have similar performance on clean test data.
          </p>
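          <p>A minimal sketch of this re-training setup is shown below, assuming the published LeNet structure is available as a PyTorch module LeNet; the optimizer (SGD with learning rate 0.01) and the varying initial seeds follow the text, while the number of epochs and the batch size are illustrative.</p>
          <preformat>
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_lenet(seed, model_cls, epochs=10):
    torch.manual_seed(seed)                      # only the initial seed changes
    model = model_cls()                          # same structure for F1, ..., F10
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    train = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())   # pixels in [0, 1]
    loader = DataLoader(train, batch_size=64, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# models = [train_lenet(seed, LeNet) for seed in range(10)]   # F1 to F10
</preformat>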
          <p>
            Intuitively, the ten handwritten digits have distinct
features that facilitate the classification task. Hence LeNet
can achieve nearly 99% accuracy. We visualize the digits using t-Distributed Stochastic Neighbor Embedding (t-SNE) [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], a nonlinear dimension reduction
technique. Figure 3 provides a 2D projection of the ten
digits, based on 2000 sampled images. We observe 10
clusters of digits though some clusters overlap slightly. We
would expect a classifier to divide up the feature space,
and allow a digit class to occupy a portion of the
feature space. Then the points away from the classification
boundary and their surrounding neighborhoods would
all belong to the same object class. Unfortunately this
is not what we see from DNN. We need to draw DNN’s
classification boundary around every clean image, not
along the border between two object classes.
          </p>
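          <p>The 2D projection in Figure 3 can be reproduced along the following lines with scikit-learn's t-SNE; the sample size of 2000 follows the text, and the remaining settings are illustrative defaults.</p>
          <preformat>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=False, download=True,
                       transform=transforms.ToTensor())
idx = np.random.choice(len(mnist), 2000, replace=False)      # 2000 sampled images
X = np.stack([mnist[i][0].numpy().ravel() for i in idx])     # vectorized, in [0, 1]^784
y = np.array([mnist[i][1] for i in idx])

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
</preformat>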
          <p>We choose a clean image x, and generate adversarial examples using the attack algorithms listed in Section 1.1. We then construct the hyper-rectangles R_j^(a). We have studied many test images and training images, and have obtained similar results. Due to the limited space, here we show the results for a digit 1 from the test data. We run the attacks against F1. Table 2 shows the following information.
1. The number of intervals used to construct the hyper-rectangles. For example, CW2 1 → 6 means the digit 1 is mis-classified as 6. 150d means R_6^(CW2) is spanned by the largest 150 intervals.
2. The smallest interval size in R_j^(a), shown in column I^(a). For the PW attack, we use [0, 1] for the selected pixels, since the measured interval sizes are all close to 1. For all other attacks, the interval size is measured from the added perturbations.
3. We sample 1000 images from each R_j^(a), and report the misclassification rates by F1 to F10.</p>
          <p>Table 3 shows the 10 re-trained LeNet models' mis-classification rates against the original adversarial images generated by the attacks, and the number of perturbed pixels. The left three columns in Table 4 show the minimum amount of perturbations (min ||x − x′||2), the maximum amount of perturbations (max ||x − x′||2), and the average amount of perturbations (mean ||x − x′||2) of the 1000 sampled images in each hyper-rectangle R_j^(a). The right three columns in Table 4 show the same information for the adversarial examples generated by the corresponding attacks.</p>
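          <p>The sampling and misclassification-rate computation in item 3 of the Table 2 description can be sketched as follows, re-using the hypothetical sample_from_rectangle helper from Section 2; F1 to F10 stand for the re-trained LeNet models.</p>
          <preformat>
import torch

def misclassification_rates(models, samples, true_label):
    """samples: (n, 784) numpy array drawn from one hyper-rectangle R_j^(a);
    returns the fraction of samples each re-trained model labels incorrectly."""
    x = torch.from_numpy(samples).float().view(-1, 1, 28, 28)
    rates = []
    for model in models:                      # F1, ..., F10
        with torch.no_grad():
            pred = model(x).argmax(dim=1)
        rates.append((pred != true_label).float().mean().item())
    return rates

# Example usage (1000 samples per hyper-rectangle, as in the text):
# samples = sample_from_rectangle(clean, keep, lo, hi, n=1000)
# print(misclassification_rates([f1, f2, f3], samples, true_label=1))
</preformat>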
          <p>Figure 4 shows the adversarial examples generated by the attacks, and the adversarial images generated through sampling in the hyper-rectangles R_j^(a). Figure 5 is the corresponding clean image 1. Except for the PointWise attack, which changes a pixel value to 0 or 1, the rest of the intervals I^(a) have the maximum value 0.032, as shown in Table 2.</p>
          <p>This translates to 8 consecutive integer values on the
original 0 – 255 scale. They are very similar light
levels, and can be considered as approximately constant.</p>
          <p>If we add more dimensions to R_j^(a), the additional dimensions can be considered as moving the additional pixels to different values. Adding more dimensions does not change the shape and size of R_j^(a). Instead that moves a hyper-rectangle to a different location, increasing the amount of perturbation and away from the clean image x. The hyper-rectangles R_j^(a) in Table 2 perturbed far fewer pixels than the original attacks. From Table 4, we see that this leads to smaller perturbations to create adversarial examples. There are more such hyper-rectangles with the same shape and size, as we add more pixels identified by the attacks. Adding more pixels does not necessarily increase the mis-classification rates by all DNNs. For the Carlini &amp; Wagner L2 attack and FGSM, eventually the hyper-rectangle is moved to a place where the F1 mis-classification rate is close to 100% and F2 to F10 see near 0% mis-classification rate. This is the effect of the optimization approach used in the attack algorithms against F1. We observe three types of R_j^(a) in Table 2.
1. The target DNN mis-classifies most of the adversarial examples while there exists another DNN which correctly classifies the adversarial examples;
2. The target model correctly classifies the adversarial examples while another DNN mis-classifies most of the adversarial examples;
3. The transferable adversarial regions where all DNNs mis-classify a significant proportion of the adversarial examples.</p>
          <p>This phenomenon occurs for attacks adding both small and large perturbations. The first two types of R_j^(a) belong to DNN uncertainty regions. The existence of DNN uncertainty regions shows transferability of adversarial examples is not universal, contrary to [
            <xref ref-type="bibr" rid="ref2 ref4 ref5">2, 4, 5</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2. MNIST MLP Experiment</title>
          <p>Here we conduct an experiment with a MLP trained on MNIST. It is a fully connected network with 3 layers, 3x512 hidden neurons and ReLU activation. We vary the initial seeds and train 5 MLPs. The optimizer is SGD with learning rate 0.01. Table 5 shows the MLP mis-classification rates on the clean MNIST test data. In the interest of space, here we show two examples, a digit 5 and a digit 7, under the Carlini &amp; Wagner L2 attack. Table 6 shows the 5 MLPs' mis-classification rates in the hyper-rectangles. Table 7 shows mis-classification rates against the original adversarial images generated by the attack. Figure 6 shows an adversarial example generated by the Carlini &amp; Wagner L2 attack, and an adversarial example generated through sampling from the hyper-rectangle, based on the same clean image 5. Figure 9 (a) is the corresponding clean image 5. The hyper-rectangle for 7 → 2 lies in one DNN uncertainty region. Again the Carlini &amp; Wagner L2 attack has great success with the target model F1 but can be correctly classified by some other MLPs.</p>
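          <p>For reference, reading "3x512 hidden neurons" as three hidden layers of 512 units each, the network corresponds to a PyTorch module of the following form (a sketch, not the exact published structure).</p>
          <preformat>
import torch.nn as nn

# Three hidden layers of 512 units each with ReLU, for 28x28 MNIST inputs and 10 classes.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
# Five such MLPs are trained with SGD (lr = 0.01), varying only the initial seed.
</preformat>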
        </sec>
        <sec id="sec-2-2-3">
          <title>2.3. CIFAR10 MobileNet Experiment</title>
          <p>CIFAR10 [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] has 60,000 32x32 color images in 10 classes, with 50,000 training images and 10,000 test images. A vectorized CIFAR10 image is in [0, 1]^3072, combining three color channels. The dimensionality of a CIFAR10 image is almost 4 times that of a MNIST image. We use the MobileNet in this experiment. Similar to Section 2.1, the MobileNet model structure has been published previously [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], which has an initial convolution layer followed by 19 residual bottleneck layers. We re-train the MobileNet on CIFAR10 to optimize the parameter values. The optimizer is SGD with learning rate 0.01; momentum is 0.9; weight decay is 5e-4. The mis-classification rates of five re-trained MobileNet models on the clean CIFAR10 test data by varying initial seeds are in Table 8. In the interest of space, here we show an example with an airplane image under the BIM L2 attack. The attack success rates on the five re-trained MobileNet models are in Table 9. Note the BIM L2 attack perturbed 3071 dimensions and left 1 dimension untouched. The images are shown in Figure 8 and Figure 9 (b).
          </p>
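          <p>A sketch of the corresponding training configuration is given below; torchvision's mobilenet_v2 is used only as a stand-in for the published CIFAR10 MobileNetV2 structure [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], while the optimizer settings (SGD, learning rate 0.01, momentum 0.9, weight decay 5e-4) follow the text.</p>
          <preformat>
import torch
from torchvision import datasets, transforms
from torchvision.models import mobilenet_v2

# Stand-in for the published CIFAR10 MobileNetV2 structure; the initial seed is
# varied over the five re-trained models.
torch.manual_seed(0)
model = mobilenet_v2(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())  # images in [0, 1]^3072
</preformat>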
          <p>Figure 7 shows the mis-classification rates as we increase the dimensions of the hyper-rectangle. The largest interval size is 0.2 and the 2000th largest interval size is 0.017. F1 misclassifies all the sampled images starting from around 200 perturbed dimensions. F5 correctly classifies all the sampled images. We see F2 and F4 misclassification rates increase as more effective dimensions are included, then decrease as we include additional irrelevant dimensions. The 2000-dimensional hyper-rectangle lies in one MobileNet uncertainty region. As noted in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], the direction of adversarial perturbation is important. Adversarial examples cannot be generated by randomly sampling in the 3072-dimensional ball B(x, ε). The lower dimensional hyper-rectangles R_j^(a) containing infinitely many adversarial examples are discovered through the attack algorithms. Table 10 shows that the sampled adversarial images from the hyper-rectangle have much smaller perturbations than the original attack on CIFAR10.
          </p>
        </sec>
        <sec id="sec-2-2-4">
          <title>2.4. Uncertainty Regions vs. Transferable Adversarial Regions</title>
          <p>Due to the nature of the uncertainty regions, we have to
train multiple models. However the classification
boundary is established for one model – the model under attack
– not the ensemble of all the trained models. The output
of a DNN ensemble is either based on the majority vote,
or we take the average of the softmax layer outputs from
the DNN models in the ensemble. Hence a DNN ensemble has a different classification boundary compared with that of a single model used in the ensemble.</p>
          <p>For the MNIST CNN experiment, we use Figure 10 as a conceptual plot to show the classification boundary for F1. F1 is the model under attack. Let x be the digit 1 used in Section 2.1. Let ε = 6. Hence the hyper-rectangles with larger perturbations are excluded from B(x, ε).</p>
          <p>The blue dot in the center is the clean image. Inside the black circle, the solid line segments are part of the classification boundary for F1. There are two types, illustrated using two different colors. Type 1 regions are where the x′s are misclassified by F1 but can be correctly classified by some other model Fi; type 2 regions are where the x′s are misclassified by all the models, F1 to F10.</p>
          <p>The dashed lines inside the black circle are not part of F1's classification boundary, but they are model F1's uncertainty regions, because inside these regions the x′s are correctly classified by F1 but misclassified by some other model Fi. We call them the type 3 regions.</p>
          <p>Type 1 and 3 are the uncertainty regions. Type 2 are the transferable adversarial regions, which are more difficult to handle. Both the uncertainty regions and the transferable adversarial regions are lower dimensional small “cracks” inside the small neighborhood of a clean image. Only type 1 and type 2 regions, where F1 misclassifies the samples in the ε-ball B(x, ε), are part of model F1's classification boundary around the clean image x.</p>
          <p>The Shape and Size of Uncertainty Regions: In Table 10 we see a significant reduction of perturbation for the BIM L2 attack and the airplane image, because our hyper-rectangle perturbed far fewer pixels than the original attack (2000d vs. 3071d). On the other hand, in Table 3, we see only minor reduction for FGSM and the digit 1, because the dimension of the hyper-rectangle is close to the original attack (375d vs 403d). For NF and the digit 1, although our hyper-rectangle used 60d compared with 403d for the original attack, there is only a minor reduction in the total amount of perturbation. Since our approach relies on the existing attack algorithms to locate the regions, the dimensionality of the regions is related to the original attacks and the clean image itself. Furthermore, although we construct hyper-rectangles, the exact shape of an uncertainty/transferable region may not be a hyper-rectangle. It is important to further investigate how many uncertainty regions and transferable adversarial regions exist in the feature space [0, 1]^d, and the exact shape and dimensionality of such regions.</p>
          <p>Strategy for Robust Classification: If at least one DNN assigns a label that is different from another DNN, the image triggers an alert and requires additional screening, either involving a human operator or alternative classifiers. This strategy will improve the accuracy over the adversarial examples in DNN uncertainty regions, but won't solve the problem for transferable adversarial examples. Notice although an ensemble can achieve high predictive performance [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], a DNN ensemble can be attacked too. Meanwhile there is no guarantee about the number of DNNs that can make a correct decision over each uncertainty region. We also need to understand how to measure the size of DNN uncertainty regions vs. DNN transferable adversarial regions. We leave it to the future work.
          </p>
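          <p>A minimal sketch of this alert strategy follows: an input is flagged whenever at least two of the re-trained models disagree on its hard label (model names are placeholders).</p>
          <preformat>
import torch

def flag_uncertain(models, x):
    """Return True when at least two models disagree on the hard label of x,
    i.e., x falls in an uncertainty region and needs additional screening."""
    with torch.no_grad():
        labels = [model(x).argmax(dim=1) for model in models]
    labels = torch.stack(labels)                 # (num_models, batch)
    return (labels != labels[0]).any(dim=0)      # per-image disagreement flag

# flagged = flag_uncertain([f1, f2, f3], batch_of_images)
# Flagged images go to a human operator or alternative classifiers; transferable
# adversarial examples (all models agree on a wrong label) are not caught.
</preformat>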
        </sec>
      </sec>
    </sec>
    <sec id="sec-2-3">
      <title>3. Generalization Error Bound and Adversarial Examples</title>
      <p>The accuracy on clean test data is often used to measure a classifier's performance. However, in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], the authors argued the test data accuracy is not the most appropriate performance measure, because the variability due to the randomness of the training data needs to be taken into consideration besides that due to the test data. Let z = (x, y) denote a sample, where x ∈ [0, 1]^d is a d-dimensional vectorized image and y ∈ {1, ⋯, K} is the true object class. z is generated independently and identically from a distribution D over [0, 1]^d. We denote a training dataset with n sample points by Z_n = (z_1, ⋯, z_n). [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] defined generalization error as E(ℓ(f_{Z_n}, z_{n+1})), where z_{n+1} is a test sample, and ℓ(f_{Z_n}, z_{n+1}) is the loss of applying a classifier f trained on Z_n to z_{n+1}. If ℓ is a 0-1 loss, the generalization error is defined as the error probability P(f(x) ≠ y) as in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>There is an extensive literature on the theoretical generalization error bound for different types of classifiers including DNN. The generalization error bound for DNN is proven to be O(C(h, w_h)/√n), where C(h, w_h) refers to a constant based on the width and depth of a DNN model, e.g., [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
      </p>
      <p>We observe there is a discrepancy between the theoretically proven generalization error bound for DNN and the existence of adversarial examples. Following the theory, the generalization error on test data should decrease to 0 at a rate proportional to n^{-1/2}, where n is the training sample size. However, for every clean image we show there exists a large number of adversarial images in its neighborhood B(x, ε), for different network structures and datasets. Adversarial examples also exist for large DNN models trained on ImageNet with millions of training data, where the theoretical asymptotic behavior of DNN should already kick in. Here we have two conjectures. So far we see adversarial examples exist in much lower dimensional regions, leading to Conjecture 1.</p>
      <p>Conjecture 1: The union of these lower dimensional bounded uncertainty regions and transferable adversarial regions has zero probability mass.</p>
      <p>Conjecture 2: A DNN function is discontinuous at the boundary of these lower dimensional bounded regions, and may be discontinuous inside some of these bounded regions. Note Lipschitz continuity is an important assumption for proving the generalization error bound.</p>
      <p>The two conjectures with Theorem 1 offer a potential explanation for why such a discrepancy exists. Let R_i be a d_i-dimensional region in [0, 1]^d with d_i &lt; d. Let ℒ = ∪_{i=1}^{∞} R_i be the union of countably infinite non-overlapping lower dimensional regions R_i in [0, 1]^d, with all d_i &lt; d.</p>
      <p>Theorem 1. Let F1 and F2 be two DNN models trained on Z_n. Assume ∀ x ∈ [0, 1]^d − ℒ, F1(x) = F2(x), and assume ∃ x ∈ ℒ, s.t. F1(x) ≠ F2(x). We have E(ℓ(F1, z_{n+1})) = E(ℓ(F2, z_{n+1})).</p>
      <p>Proof: For any continuous distribution D on [0, 1]^d, D(ℒ) = 0, i.e., the lower dimensional ℒ has 0 probability mass. For two functions that differ only on a 0 probability region, we have E(ℓ(F1, z_{n+1})) = E(ℓ(F2, z_{n+1})). ■</p>
      <p>Remark 1: Theorem 1 means the definition of generalization error cannot tell the difference between a trained classifier that assigns correct labels to all the points in [0, 1]^d and a different classifier that assigns wrong labels only to countably infinite lower dimensional bounded regions. For example, let c_i, i = 1, 2, ..., s.t. c_i ≠ c_j if i ≠ j. Assume ℒ = ∪_{i=1}^{∞} [0, 1]^{d−1} ⊗ c_i is the union of countably infinite non-overlapping [0, 1]^{d−1} regions. A classifier can assign wrong labels to ℒ without any impact on its generalization error.</p>
      <p>Remark 2: Another definition of generalization error involves the empirical error on the training data. Let R̂(f_{Z_n}) = (1/n) Σ_{i=1}^{n} ℓ(f_{Z_n}, z_i) be the empirical risk estimated from the training data Z_n. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] defined generalization error as R*(f) = E(ℓ(f_{Z_n}, z_{n+1})) − R̂(f_{Z_n}), which is also used in some recent papers to establish DNN theoretical guarantees. Corollary 1 in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] states there exists a neural network with ReLU activations, depth 2, and 2n + d weights, that can fit exactly any function on Z_n in d-dimensional space. Assume F1 and F2 are two such models trained on Z_n. Hence R̂(F1) = R̂(F2) = 0. Consequently, by Theorem 1, we have R*(F1) = R*(F2).
      </p>
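      <p>As a quick numerical illustration of Theorem 1 (not part of the proof), the sketch below compares two classifiers that differ only on a lower dimensional set, here the hyperplane x_1 = 0.5; under a continuous sampling distribution that set is never hit, so the two estimated risks coincide.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1_000_000
x = rng.random((n, d))                       # test points from a continuous D on [0, 1]^d
y = (x.sum(axis=1) > d / 2).astype(int)      # ground-truth labels

f1 = (x.sum(axis=1) > d / 2).astype(int)     # classifier F1: correct everywhere
f2 = f1.copy()
on_L = x[:, 0] == 0.5                        # lower dimensional region L (a hyperplane)
f2[on_L] = 1 - f2[on_L]                      # F2 differs from F1 only on L

print("fraction of samples falling in L:", on_L.mean())   # 0.0 with probability one
print("0-1 risk of F1:", (f1 != y).mean())
print("0-1 risk of F2:", (f2 != y).mean())                 # identical to F1
</preformat>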
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>A limitation of our work is that we rely on the existing
attack algorithms to locate these hyper-rectangles. Also
our approach works with low resolution images. Again
we leave it to the future work to capture the shape of the
DNN classification boundary in very high dimensional
feature space.</p>
      <p>
        We gain important insights from this study. A DNN
model draws the classification boundary around every
image instead of along the border between the object
classes. This helps a DNN model to achieve high accuracy
and low generalization error for complex tasks but leaves
space for it to be attacked. How to seal these small cracks surrounding every image is a very difficult problem, as we witness the success of the adaptive attacks [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The insights gained from this study point to the problem that a robust DNN model should work on.
      </p>
      <p>
        Understanding the shape of DNN’s classification
boundary also provides insights to defend against the
backdoor attacks [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. As with many other classifiers,
we need to understand how the change in the training
data moves the classification boundary, in order to firmly
close the backdoor.
      </p>
      <p>We conclude that the adversarial examples stem from
a structural problem of DNN. DNN’s classification
boundary is unlike that of any other classifier. Current defense
strategies do not address this structural problem. We also
need new theory to describe the phenomenon of
adversarial examples and measure the robustness of DNN.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is supported in part by US Army Research Office award W911NF-17-1-0356 and US Army Research Laboratory.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems (NIPS) 25</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>in: The 2nd International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Crecchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Biggio</surname>
          </string-name>
          ,
          <article-title>Detecting adversarial examples through nonlinear dimensionality reduction</article-title>
          ,
          <source>in: 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>488</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Explaining and harnessing adversarial examples</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6572</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Boneh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          ,
          <article-title>The space of transferable adversarial examples</article-title>
          ,
          <source>arXiv preprint arXiv:1704.03453</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Tesla</surname>
            <given-names>AutopilotAI</given-names>
          </string-name>
          ,
          <source>Tesla Artificial Intelligence &amp; Autopilot</source>
          , https://www.tesla.com/autopilotAI,
          <year>2022</year>
          . Last accessed Feb.
          <volume>01</volume>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>B. Xi,</surname>
          </string-name>
          <article-title>Adversarial machine learning for cybersecurity and computer vision: Current developments and challenges</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <article-title>e1511</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Biggio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roli</surname>
          </string-name>
          ,
          <article-title>Wild patterns: Ten years after the rise of adversarial machine learning</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>84</volume>
          (
          <year>2018</year>
          )
          <fpage>317</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Transferring robustness for graph neural network against poisoning attacks</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Web Search and Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>600</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Soffa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>Black-box generation of adversarial text sequences to evade deep learning classifiers</article-title>
          ,
          <source>in: 2018 IEEE Security and Privacy Workshops (SPW)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          , W. Brendel,
          <article-title>Foolbox native: Fast adversarial attacks to benchmark the robustness of machine learning models in pytorch, tensorflow, and jax</article-title>
          ,
          <source>Journal of Open Source Software</source>
          <volume>5</volume>
          (
          <year>2020</year>
          )
          <fpage>2607</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Voichita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khatri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Draghici</surname>
          </string-name>
          ,
          <article-title>Identifying uncertainty regions in support vector machines using geometric margin and convex hulls</article-title>
          ,
          <source>in: IEEE International Joint Conference on Neural Networks</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>3319</fpage>
          -
          <lpage>3324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ratle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pacifici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Kanevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Emery</surname>
          </string-name>
          ,
          <article-title>Active learning methods for remote sensing image classification</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>47</volume>
          (
          <year>2009</year>
          )
          <fpage>2218</fpage>
          -
          <lpage>2232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] PyTorch, PyTorch: Adversarial Example Generation, https://pytorch.org/tutorials/beginner/fgsm_ tutorial.html,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , L. Bottou,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          ,
          <article-title>Gradientbased learning applied to document recognition</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <year>1998</year>
          )
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          , G. Hinton,
          <article-title>Visualizing data using t-sne.</article-title>
          ,
          <source>Journal of machine learning research 9</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          ,
          <source>Master's thesis</source>
          , University of Toronto (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          , L.-C. Chen,
          <article-title>MobilenetV2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A survey on ensemble learning</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>241</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Inference for the generalization error</article-title>
          ,
          <source>Machine Learning</source>
          <volume>52</volume>
          (
          <year>2003</year>
          )
          <fpage>239</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaariainen</surname>
          </string-name>
          ,
          <article-title>Generalization error bounds using unlabeled data</article-title>
          ,
          <source>in: International Conference on Computational Learning Theory</source>
          , Springer,
          <year>2005</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Telgarsky</surname>
          </string-name>
          ,
          <article-title>Spectrallynormalized margin bounds for neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Golowich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rakhlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shamir</surname>
          </string-name>
          ,
          <article-title>Sizeindependent sample complexity of neural networks</article-title>
          ,
          <source>in: Conference On Learning Theory, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Bengio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Understanding deep learning requires rethinking generalization</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>107</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Brendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          ,
          <article-title>On adaptive attacks to adversarial example defenses</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1633</fpage>
          -
          <lpage>1645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Targeted backdoor attacks on deep learning systems using data poisoning</article-title>
          ,
          <source>arXiv preprint arXiv:1712.05526</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>