=Paper=
{{Paper
|id=Vol-2640/paper_19
|storemode=property
|title=DeepSmartFuzzer: Reward Guided Test Generation For Deep Learning
|pdfUrl=https://ceur-ws.org/Vol-2640/paper_19.pdf
|volume=Vol-2640
|authors=Samet Demir,Hasan Ferit Eniser,Alper Sen
|dblpUrl=https://dblp.org/rec/conf/ijcai/DemirE020
}}
==DeepSmartFuzzer: Reward Guided Test Generation For Deep Learning==
Samet Demir(1), Hasan Ferit Eniser(2)*, Alper Sen(1)
(1) Department of Computer Engineering, Boğaziçi University, Turkey
(2) Max Planck Institute for Software Systems, Germany
samet.demir1@boun.edu.tr, hfeniser@mpi-sws.org, alper.sen@boun.edu.tr

*Most of the work was done while the author was at Boğaziçi University.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Testing Deep Neural Network (DNN) models has become more important than ever with the increasing usage of DNN models in safety-critical domains such as autonomous cars. Traditionally, DNN testing relies on performance on a dedicated subset of the available data, namely the test set. However, DNNs require more thorough testing approaches to exercise corner-case behaviors. Coverage-guided fuzzing (CGF), a common practice in software testing, aims to produce new test inputs by mutating existing ones to achieve high coverage on a test adequacy criterion. CGF has been an effective method for finding error-inducing inputs that satisfy a well-established criterion. In this paper, we propose a novel CGF algorithm for structural testing of DNNs. The proposed algorithm employs Monte Carlo Tree Search to drive the coverage-guided search. In our evaluation, we show that the inputs generated by our method result in higher coverage than the inputs produced by previously introduced CGF techniques on various criteria in a fixed amount of time.

1 Introduction

Given enough data and processing power, training a Deep Neural Network (DNN) is the most popular way of dealing with many hard computational problems such as image classification [Cireşan et al., 2012], natural language processing [Sutskever et al., 2014], and speech recognition [Hinton et al., 2012]. Impressive achievements in such tasks have raised expectations for deploying DNNs in real-world applications, including those in safety-critical domains.

Despite these remarkable achievements, recent works [Szegedy et al., 2013; Goodfellow et al., 2015] have demonstrated that DNNs are vulnerable to small perturbations on seed inputs, also called adversarial attacks. Considering the catastrophic results that can emerge from erroneous behaviors in safety-critical systems, DNNs must be characterized by a high degree of dependability before being deployed in such systems.

Testing is the primary practice for analyzing and evaluating the quality of a software system [Ammann and Offutt, 2016]. It helps reduce risk by finding and eliminating erroneous behaviors before the system is deployed. One of the most fundamental testing concepts is defining a coverage criterion for a given test set, also called a test adequacy criterion. A coverage criterion measures how much of the system's structure is exercised (covered) when the test inputs are provided. Having a test set that satisfies a coverage criterion provides a degree of dependability to the system under test.

Recent research in DNN testing introduces new DNN-specific coverage criteria such as neuron coverage [Pei et al., 2017] and its variants [Ma et al., 2018], an MC/DC-inspired criterion [Sun et al., 2018b], and other criteria such as surprise adequacy [Kim et al., 2019] and DeepImportance [Gerasimou et al., 2020]. Previous works [Pei et al., 2017; Ma et al., 2018; Kim et al., 2019] and future studies on coverage criteria for DNNs could be useful for exposing defects in DNNs, finding adversarial examples, or forming diverse test sets. On the other hand, satisfying a coverage criterion, or at least achieving a high coverage measurement, can be difficult without a structured methodology. Existing works [Xie et al., 2018; Odena and Goodfellow, 2018] leverage coverage-guided fuzzing (CGF) to achieve high coverage for a given criterion. However, both of these works apply mutations on inputs randomly; therefore, their effectiveness is limited, as shown in our experiments.

In this work, we introduce DeepSmartFuzzer, a novel CGF for achieving high coverage in DNNs under existing coverage criteria from the literature. Our ultimate goal is to help practitioners extend their test sets with new inputs so that new behaviors are covered. To that end, we leverage Monte Carlo Tree Search (MCTS) [Chaslot et al., 2008], a search algorithm for decision processes. In our method, MCTS is used to determine a series of mutations that results in the best coverage increase for a given input.

The contributions of this work are as follows:

* We introduce DeepSmartFuzzer, a novel Coverage-Guided Fuzzing (CGF) technique for testing DNNs. DeepSmartFuzzer is applicable to all existing coverage metrics.
* We show the effectiveness of our method for many popular coverage criteria and for many DNNs with different complexities.
* We compare the effectiveness of our method with existing CGF methods.
2 Related Work

Several DNN testing techniques have recently been developed in the literature. Among them are works that develop coverage criteria for DNNs. For example, DeepXplore [Pei et al., 2017] proposed neuron coverage (analogous to statement coverage in software). DeepGauge [Ma et al., 2018] proposed a set of fine-grained test coverage criteria. Kim et al. [2019] proposed surprise adequacy criteria based on a measure of the surprise caused by the inputs, and Gerasimou et al. [2020] presented a metric for composing a semantically diverse test set. Sun et al. [2018a] proposed the first concolic testing approach for DNNs.

We now discuss the studies that are closest to ours. TensorFuzz [Odena and Goodfellow, 2018] proposed the first CGF for neural networks, aiming to increase a novel coverage metric. DeepHunter [Xie et al., 2018] is another work exploring CGF for DNNs by leveraging techniques from software fuzzing, such as power scheduling. Our work differs from TensorFuzz and DeepHunter in that they apply random mutations on inputs, whereas we employ Monte Carlo Tree Search (MCTS) to select the mutations. Also, we apply mutations to a small subset of the features of a given input, whereas they apply mutations to all features. DeepSmartFuzzer is also similar to [Wicker et al., 2018] in that both works employ Monte Carlo Tree Search. However, the objective in [Wicker et al., 2018] is to find the nearest adversarial example, whereas our objective is to increase the value of a given coverage criterion.

Szegedy et al. [2013] first discovered the vulnerability of DNNs to adversarial attacks. Since then, numerous white-box and black-box adversarial attacks have been proposed. The most popular attacks include FGSM [Goodfellow et al., 2015], JSMA [Papernot et al., 2016], PGD [Madry et al., 2017], and C&W [Carlini and Wagner, 2017].

3 Background

Coverage Criterion in Software. A coverage criterion partitions the input space into equivalence classes [Ammann and Offutt, 2016] and is measured by dividing the number of equivalence classes that are sampled by at least one input in the test set by the total number of equivalence classes. For example, statement coverage in software measures the percentage of the statements in the program that are executed with the given set of test inputs.

Coverage-Guided Fuzzing. Coverage-guided fuzzing (CGF) performs systematic mutations on inputs and produces new test inputs to increase the number of covered cases for a target coverage metric. A typical CGF process starts by selecting a seed input from the seed pool and then mutating the selected seed a certain number of times. After that, the program under test is run with the mutated seed. If a mutated seed creates an increase in the number of covered cases, CGF keeps the mutated seed in the seed pool.

Monte Carlo Tree Search (MCTS). Monte Carlo Tree Search [Chaslot et al., 2008] is a search algorithm for decision processes such as game play. It represents a game as a tree in which each node corresponds to a particular state and has one child for each possible action; taking an action makes a transition from a node to one of its children. The MCTS algorithm aims to find the most promising action in an arbitrary state of a game. In other words, given an initial state, the objective is to find the best sequence of actions to win the game.

The MCTS process can be broken down into the following four steps. Selection: starting from the root node R, successively select child nodes according to their potentials until a leaf node L is reached. The potential of each child node is calculated using UCT (Upper Confidence Bound applied to Trees) [Kocsis and Szepesvári, 2006; Chaslot et al., 2008], defined as

    v + e × sqrt(ln N / n)

where v is the value of the node, n is the visit count of the node, N is the visit count of the parent of the node, and e is a hyperparameter determining the exploration-exploitation trade-off. Expansion: unless L is a terminal node (i.e., win/loss/draw), create at least one child node C (i.e., any valid action from node L) and take one of them. Simulation: play the game from node C by picking moves randomly until a terminal condition is reached. Backpropagation: propagate the result of the play back to update the values associated with the nodes on the path from C to R. The path containing the nodes with the highest values in each layer would be the optimal strategy in the game.
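The UCT rule used during Selection is easy to state in code. The following is a minimal sketch of ours, not code from the paper's implementation; sqrt(2) is shown only as a common default from the bandit literature, not the paper's setting of e:

```python
import math

def uct(v, n, N, e=math.sqrt(2)):
    """UCT score of a child node: v is the node's value, n its visit count,
    N the parent's visit count, and e the exploration-exploitation
    hyperparameter (sqrt(2) is a common default, not the paper's choice)."""
    if n == 0:
        return math.inf  # unvisited children are tried first
    return v + e * math.sqrt(math.log(N) / n)

# Selection walks down the tree by repeatedly taking
# max(children, key=lambda c: uct(c.value, c.visits, parent_visits)).
```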
4 Method: DeepSmartFuzzer

Let T = {[X_1, X_2, ...], [Y_1, Y_2, ...]} be a test set where (X_i, Y_i) is the input-output pair of the i-th test sample. Let I be a set of inputs called a batch. Let I', I'', ..., I^(n) be a sequence of mutated batches of the original batch I such that:

    I --IM(r', m')--> I' --IM(r'', m'')--> I'' --> ... --IM(r^(n), m^(n))--> I^(n)    (1)

where IM(r^(n), m^(n)) is the n-th input mutation, and r^(n) and m^(n) are the region and mutation indexes, respectively. Also, let I_best ∈ {I', I'', ...} be the batch that creates the greatest amount of coverage increase.
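To make Eq. (1) concrete, the sketch below (our illustration; `mutations` is a hypothetical list of in-place, region-wise image operations) shows how the sequence of mutated batches arises by composing complete actions:

```python
import numpy as np

def apply_IM(batch, r, m, mutations):
    """One input mutation IM(r, m): apply mutation m to region r of every
    image in the batch and return a new batch (hypothetical helper)."""
    out = batch.copy()
    for img in out:
        mutations[m](img, r)  # mutates region r of img in place
    return out

def mutate_sequence(I, actions, mutations):
    """Replay complete actions (r', m'), (r'', m''), ... on batch I,
    yielding the sequence I, I', I'', ..., I^(n) of Eq. (1)."""
    history = [I]
    for r, m in actions:
        history.append(apply_IM(history[-1], r, m, mutations))
    return history
```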
puts that increase the current level of coverage formed by tering, it selects a random cluster using the uniform random distribution. Finally, it samples a random batch of inputs from the selected cluster. We use sampling without replacement to avoid same inputs in a batch since we apply the same muta- tions to all inputs in the batch. In this work, we use k-means clustering as the clustering algorithm. 4.3 Mutation Selector The mutation selector takes a batch of inputs and sequentially selects parameters region index r and mutation index m. The selected mutations are sequentially applied to the selected re- gions by the input mutator and a sequence of mutated batches I 0 , I 00 , ..., I (n-1) , I (n) is generated. Note that I (n) contains all the mutations up to that point. This formulation of the prob- Figure 1: Workflow of DeepSmartFuzzer for an iteration lem has a sequential nature. Therefore, we decided to model the problem as a game, which consists of a sequence of ac- tions and rewards related to the actions. We use a two-player the initial test set. The core idea of our approach is to em- game model since two selections (one for region and one for ploy reward-guided exploration and exploitation in order to mutation) are made for each mutation. achieve high coverage scores. The workflow of the proposed Formulating the mutation selection as a two-player game method is illustrated in Figure 1. It is composed of an input Our proposed mutation selector is a two-player game such chooser, a coverage criterion, an input mutator, and a muta- that Player I selects the region to be mutated, and Player II tion selector, which is the essential part. selects the mutation to be made on the chosen region. Since For each iteration, the input chooser chooses a batch, which regions and mutations are enumerated, these are just integer is a set of inputs I. Then, the mutation selector determines selection problems. The game continues as Players I and II the mutation (r0 , m0 ) to be applied to the inputs. Note that, play iteratively so that multiple mutations could be applied applying one kind of mutation to all input features may be on top of each other and a sequence of mutated batches is too coarse because, for a given mutation, a subset of input generated as described above. Region selection and mutation features may increase coverage while others may decrease selection are considered as separate actions. We call a tuple coverage. Our fuzzer takes a finer approach and applies mu- of actions taken by Players I and II together as a complete tations to a subset of input features (i.e. regions) at each step. action (r, m). The input mutator takes the selected batch I and the selected mutation (r0 , m0 ), where the selected mutation is applied to Reward A naive reward for our problem is the coverage the selected batch of inputs such that the mutated inputs I 0 increase for each action. We use this reward to guide the IM(r 0 ,m0 ) search for mutations. In this study, the coverage increase cor- are formed (I −−−−−−−→ I 0 ). The mutated inputs are then given to the coverage criterion to calculate the coverage of the responds to the difference between the coverage for the cur- mutated inputs together with the test set. The coverage and rent test set and the coverage obtained by adding a new batch mutated inputs are given to the mutation selector such that it to the test set. 
4.3 Mutation Selector

The mutation selector takes a batch of inputs and sequentially selects the parameters region index r and mutation index m. The selected mutations are sequentially applied to the selected regions by the input mutator, and a sequence of mutated batches I', I'', ..., I^(n-1), I^(n) is generated. Note that I^(n) contains all the mutations up to that point. This formulation of the problem has a sequential nature. Therefore, we model the problem as a game consisting of a sequence of actions and rewards related to those actions. We use a two-player game model since two selections (one for the region and one for the mutation) are made for each mutation.

Formulating mutation selection as a two-player game. Our proposed mutation selector is a two-player game in which Player I selects the region to be mutated and Player II selects the mutation to be made on the chosen region. Since regions and mutations are enumerated, these are just integer selection problems. The game continues as Players I and II play iteratively, so that multiple mutations can be applied on top of each other and a sequence of mutated batches is generated as described above. Region selection and mutation selection are considered separate actions. We call the tuple of actions taken by Players I and II together a complete action (r, m).

Reward. A naive reward for our problem is the coverage increase for each action. We use this reward to guide the search for mutations. In this study, the coverage increase corresponds to the difference between the coverage of the current test set and the coverage obtained by adding the new batch to the test set. Overall, the purpose of the mutation selector is to find the sequence of mutations that results in the greatest amount of coverage increase.

End of the game. In order to avoid creating unrealistic inputs, and consequently the human intervention needed to eliminate them, we put constraints on the mutations. Generally, L_p norms are used for this purpose. These constraints are defined as ||X'_i - X_i||_p < ε, where X'_i is the mutated input, so that the distance between mutated inputs and original inputs is limited. In general form, let d(X'_i, X_i) be a distance metric; the game is over when the constraint d(X'_i, X_i) < ε is violated, where d and ε are hyperparameters.

Searching. We use Monte Carlo Tree Search (MCTS) in order to exploit the rewards and guide the search towards rewarded mutations. For this purpose, MCTS searches the game tree for the best mutations. The nodes in our game tree correspond to region and mutation selections. We continuously update the batch of inputs I_best that creates the best coverage increase, so that this batch can be added to the test set when MCTS is finished.
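The reward and the end-of-game condition fit in a few lines. The sketch below is our reading of the definitions above: `coverage` stands for any criterion that maps a set of inputs to a scalar, and the L-infinity default is only one possible choice of d:

```python
import numpy as np

def coverage_increase(coverage, T, batch):
    """Reward: coverage of the test set plus the mutated batch, minus the
    coverage of the test set alone (the paper's naive coverage-increase
    reward; `coverage` is any criterion returning a scalar)."""
    return coverage(np.concatenate([T, batch])) - coverage(T)

def game_over(X_orig, X_mut, eps, p=np.inf):
    """End of the game: stop once any mutated input drifts farther than eps
    from its seed under an L_p distance. d and eps are hyperparameters in
    the paper; L-infinity is shown here only as one common choice."""
    diffs = np.abs(X_mut - X_orig).reshape(len(X_orig), -1)
    d = diffs.max(axis=1) if p == np.inf else (diffs ** p).sum(axis=1) ** (1 / p)
    return bool((d >= eps).any())  # constraint d < eps violated
```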
4.4 Input Mutator

The input mutator mutates the input according to the region index and mutation index selected by the mutation selector. The availability of too many mutations potentially makes the job of the mutation selector harder. Therefore, we designed a general input mutator for images. It divides an image into local regions and provides general image mutations as mutation options for each region. These general mutations include, but are not limited to, brightness change, contrast change, and blur. When a region r and a mutation m are selected, it applies the selected mutation to the selected region.

[Figure 2: Regions. An example 3x3 division of an image into nine enumerated regions]

The number of regions and the set of available mutations are hyperparameters of the input mutator. With appropriate settings, we can obtain a pixel-level mutator, an image-level mutator, or something in between, which we think is best for practical reasons. We enumerate the regions and mutations so that the mutation selector can identify them by their indexes. Figure 2 shows an example division of an input into regions. Our proposed input mutator induces a bias towards locality since it applies mutations to regions of an image. Therefore, it is a natural fit for image problems and convolutional neural networks. A sketch of such a mutator is given below.
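This sketch is ours, written under assumed mutation magnitudes; the paper fixes the mutation types (brightness, contrast, blur) but not these exact parameters, and channel-last images in the 0-255 range are assumed:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def region_slices(shape, grid=3):
    """Enumerate the grid x grid regions of an image (Figure 2 shows 3x3)."""
    h, w = shape[:2]
    return [(slice(i * h // grid, (i + 1) * h // grid),
             slice(j * w // grid, (j + 1) * w // grid))
            for i in range(grid) for j in range(grid)]

MUTATIONS = [
    lambda patch: np.clip(patch + 30, 0, 255),        # brightness change
    lambda patch: np.clip(patch * 1.3, 0, 255),       # contrast change
    lambda patch: gaussian_filter(patch, sigma=1.0),  # blur
]  # magnitudes (30, 1.3, sigma=1.0) are illustrative assumptions

def input_mutator(batch, r, m, grid=3):
    """Apply mutation m to region r of every image in the batch."""
    sl = region_slices(batch.shape[1:], grid)[r]
    out = batch.astype(np.float32).copy()
    for img in out:
        img[sl] = MUTATIONS[m](img[sl])
    return out
```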
4.5 Algorithm

Algorithm 1: Algorithmic description of DeepSmartFuzzer

     1: procedure DeepSmartFuzzer(T, tc0, tc1, tc2, tc3)
     2:   while not tc0 do
     3:     I = Input_Chooser(T)
     4:     R = MCTS_Node(I)
     5:     best_cov, I_best = 0, I
     6:     while not tc1 do
     7:       while not tc2 do
     8:         L = MCTS_Selection(R)
     9:         C = MCTS_Expansion(L)
    10:         I^(n-1) = get_batch_corresponding_to_node(C)
    11:         r^(n), m^(n) = MCTS_Simulation(C)
    12:         I^(n) = Input_Mutator(I^(n-1), r^(n), m^(n))
    13:         if not tc3 then
    14:           cov_inc = Coverage(T ∪ I^(n)) - Coverage(T)
    15:           if cov_inc > best_cov then
    16:             best_cov, I_best = cov_inc, I^(n)
    17:           MCTS_Backpropagation(C, cov_inc)
    18:       R = select_child(R)
    19:     if best_cov > 0 then
    20:       T = T ∪ I_best
    21:   return T

We describe our method more formally in Algorithm 1. The while loop in line 2 iterates until a termination condition (tc0) is met, namely a timeout or reaching a target number of new inputs. In line 3, a batch of inputs I is sampled using the input chooser. The root node R is created in line 4, and the variables storing the best mutated batch I_best are initialized in line 5. The while loop in line 6 runs until a termination condition (tc1) that limits the search space by limiting the number of levels of the search tree that MCTS can explore. Next, the while loop in line 7 iterates until a termination condition (tc2) that determines the number of times the subtree of the root node R will be explored. In line 8, MCTS Selection, i.e., the selection of a path from the root node R to a leaf node L using the potentials of the nodes (calculated with UCT), is performed, resulting in a leaf node L. Then, in line 9, MCTS Expansion is applied, creating a new child node C of the leaf node L. In line 10, the mutations formed by the path from the root of the game tree to the node C are applied to the initial batch I, resulting in I^(n-1). Basically, I^(n-1) is the mutated batch resulting from MCTS Selection and Expansion. Then, MCTS Simulation plays the game until a complete action, so that r^(n) and m^(n) are assigned a region index and a mutation index, respectively (line 11). Our MCTS Simulation differs from the original: instead of playing the game randomly until the end, it plays randomly only until a complete action, since our game formulation produces a reward for every complete action. The input mutator then mutates the batch I^(n-1) according to the region index r^(n) and the mutation index m^(n), creating a new batch I^(n) (line 12). Termination condition (tc3) controls the realism of the generated batch by limiting the distance between the mutated batch I^(n) and the original batch I. If the new batch I^(n) does not violate tc3, it is considered a candidate batch of test inputs (line 13). The coverage increase is calculated as the difference between the coverage of the test set T together with the mutated batch I^(n) and the coverage of the test set T alone (line 14). If this is the greatest coverage increase seen for this batch I, the mutated batch I^(n) is stored as the best mutated batch I_best (lines 15-16). MCTS Backpropagation is then applied from the new child node C with the coverage increase as the reward (line 17). This concludes one iteration of the while loop with tc2, and the algorithm keeps exploring the root node until tc2 is reached. When termination condition tc2 is reached, the best child of the root R is set as the new root node (line 18). The best child is the node with the greatest value, i.e., the greatest average coverage increase (reward) found on the paths (sequences of mutations) that contain the node. After setting a child as the new root, an iteration of the while loop with tc1 is completed, and the loop continues by working on the subtree of the child node (now the root node R). After the while loop finishes, the best batch found, I_best, is added to the test set if it creates a coverage increase (lines 19-20). Here, we add the complete batch to the test set in order to avoid searching for the effective inputs within the batch, thereby speeding up the process. This concludes one complete iteration of MCTS on the batch I, and the algorithm continues iterating with new batches until termination condition tc0 is reached. When tc0 is reached, the final test set, which includes all mutated inputs found up to that point, is returned (line 21).

[Figure 3: Visualization of a snapshot of our method searching the mutation space for TFC and MNIST-LeNet5. (a) The game tree; (b) the selected mutations on a seed input. Action columns represent the potentials (the brighter, the higher) of each enumerated action in the search tree]

Figure 3 illustrates the algorithm in action: the method selects a region (one action) and a mutation (another action) so that the input mutator applies the mutation to the region, and this process is repeated while MCTS searches the game tree.
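The value update behind line 18's select_child deserves a note. The sketch below (ours, with hypothetical class and function names) keeps each node's value as the running average of the rewards backpropagated through it, which is exactly the quantity used to pick the best child:

```python
class MCTSNode:
    """Minimal node bookkeeping for Backpropagation (a sketch): `value`
    is the running average of the rewards (coverage increases) seen on
    paths through this node."""
    def __init__(self, parent=None):
        self.parent, self.children = parent, {}
        self.n, self.value = 0, 0.0

def backpropagate(node, reward):
    """Propagate a reward from the expanded node back to the root,
    updating visit counts and running-average values on the way."""
    while node is not None:
        node.n += 1
        node.value += (reward - node.value) / node.n  # incremental mean
        node = node.parent

def select_child(root):
    """Line 18 of Algorithm 1: the child with the greatest average reward."""
    return max(root.children.values(), key=lambda c: c.value)
```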
5 Experiments

The source code of our experiments and implementation is publicly available at https://github.com/hasanferit/DeepSmartFuzzer.

5.1 Setup

Datasets and DL Systems. We evaluate DeepSmartFuzzer on two popular, publicly available datasets, namely MNIST [LeCun, 1998] and CIFAR10 [Krizhevsky and Hinton, 2009] (referred to as CIFAR from now on). MNIST is a handwritten digit dataset with 60000 training and 10000 test inputs. Each input is a 28x28-pixel black-and-white image with a class label from 0 to 9. CIFAR is a 3-channel colored image dataset with 50000 training and 10000 test samples. Each input is a 3x32x32 image from one of ten classes (e.g., plane, ship, car). For the MNIST dataset, we study the LeNet1, LeNet4, and LeNet5 [LeCun et al., 1998] DNN architectures, three well-known and popular models in the literature. For the CIFAR dataset, we use a previously suggested convolutional neural network architecture [Wicker et al., 2018].

Compared Techniques and Coverage Criteria. We evaluate our tool by comparing its performance with two existing CGF frameworks for deep learning systems. The first, DeepHunter [Xie et al., 2018], aims to achieve high coverage by randomly selecting a batch of inputs and applying random mutations to them. DeepHunter also leverages various fuzzing techniques from software testing, such as power scheduling. However, the tool is not publicly available; we therefore use our own implementation of DeepHunter in the evaluation. The second, TensorFuzz [Odena and Goodfellow, 2018], uses the guidance of coverage to debug DNNs; for example, it finds numerical errors and disagreements between neural networks and quantized versions of those networks. The TensorFuzz code is publicly available, and we integrate it into our framework.

For an unbiased evaluation of DeepSmartFuzzer, we test our tool on various coverage criteria: DeepXplore's [Pei et al., 2017] neuron coverage (NC); DeepGauge's [Ma et al., 2018] k-multisection neuron coverage (KMN), neuron boundary coverage (NBC), and strong neuron activation coverage (SNAC); and TensorFuzz's coverage (TFC).

Hyperparameters. We set the neuron activation threshold to 0.75 for NC and the number of sections k to 10000 for KMN. For NBC and SNAC, we set the lower (upper) bound to the minimum (maximum) activation value encountered in the training set. These are the recommended settings in the original studies. On the other hand, we observed that the distance threshold used in the original TensorFuzz study was too small for the MNIST and CIFAR models, such that every little mutation could increase coverage. Therefore, we tune the TFC threshold for LeNet1, LeNet4, LeNet5, and the CIFAR CNN to 30^2, 13^2, 11^2, and 3^2, respectively.

The number of regions, the set of mutations, and the termination conditions (tc1, tc2, tc3) constitute the hyperparameters of DeepSmartFuzzer. The number of regions is set to 9, which corresponds to a 3x3 division of an image. The set of mutations is contrast change, brightness change, and blur. The first termination condition (tc1) prevents MCTS from going more than 8 levels deep in the game tree. The second termination condition (tc2) limits the number of iterations on each root to 25. For the last termination condition (tc3), we use the limits that DeepHunter [Xie et al., 2018] puts on the distance between mutated and seed inputs to avoid unrealistic mutated inputs.
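For reference, the settings above can be gathered into a single configuration. The key names below are our own, not necessarily those used in the released code:

```python
# Experimental settings from Section 5.1, collected in one place
# (key names are our own; the paper's code may organize them differently).
DSF_CONFIG = {
    "n_regions": 9,             # 3x3 division of each image
    "mutations": ["contrast", "brightness", "blur"],
    "tc1_max_depth": 8,         # max tree depth explored by MCTS
    "tc2_root_iterations": 25,  # iterations per root node
    "tc3": "DeepHunter's distance limits on mutated vs. seed inputs",
}
COVERAGE_CONFIG = {
    "NC_threshold": 0.75,
    "KMN_sections": 10_000,
    "TFC_distance_threshold": {"LeNet1": 30**2, "LeNet4": 13**2,
                               "LeNet5": 11**2, "CIFAR_CNN": 3**2},
}
```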
5.2 Results

Summary. We aim to show that DeepSmartFuzzer is able to generate good test inputs. First, we compare DeepSmartFuzzer with DeepHunter and TensorFuzz by comparing the coverage increases produced by approximately 1000 new inputs for each method, in combination with different DNN models and coverage criteria. The experimental results in Tables 1 and 3 show that the inputs generated by our method result in the greatest amount of coverage increase for all (DNN model, coverage criterion) pairs except for a few. This suggests that DeepSmartFuzzer creates better test inputs than DeepHunter and TensorFuzz with regard to coverage measurements. Figure 4 shows examples of mutated MNIST and CIFAR inputs created by DeepSmartFuzzer.

[Figure 4: Example inputs generated by DeepSmartFuzzer]

Comparison to DeepHunter and TensorFuzz. We focus on the inputs generated by DeepSmartFuzzer, DeepHunter, and TensorFuzz. For experimental integrity, we make each method generate approximately 1000 input samples. Only the inputs that induce a coverage increase are taken into account. We also impose a time limit in order to avoid unending runs caused by the inability to find any coverage increase for some (DNN model, coverage criterion) pairs. When a method could not produce the target number of inputs in time, yet created some coverage increase indicating further potential, the timeout limit was extended so that it could reach 1000 inputs. This extension is not applied to TensorFuzz, since it generates inputs one by one and could therefore take days to reach 1000 inputs in some cases. The timeout is initially set to 2 hours and is then gradually increased to 6, 12, and 24 hours to explore the full potential. Tables 1 and 3 show the coverage increases produced by approximately 1000 generated input samples from each method, over a diverse set of coverage criteria and DNN models for the MNIST and CIFAR datasets. In order to provide complete results, Tables 2 and 4 indicate exactly how many inputs were generated in each case. All results are given as the mean and standard deviation over three runs of the same experiment with different random seeds.

Table 1: Coverage increase achieved by each CGF for the MNIST-LeNet1 and MNIST-LeNet4 models.

MNIST-LeNet1:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 0 | 2.34 ± 0.03 % | 35.42 ± 2.76 % | 41.67 ± 4.17 % | 29.00 ± 3.61 |
| TensorFuzz | 0 | 1.83 ± 0.23 % | 0 | 0 | 0.33 ± 0.58 |
| DeepSmartFuzzer | 0 | 2.91 ± 0.11 % | 41.67 ± 4.77 % | 42.36 ± 6.36 % | 204.67 ± 8.50 |
| DeepSmartFuzzer (clustered) | 0 | 2.88 ± 0.04 % | 38.54 ± 0.00 % | 39.58 ± 7.51 % | 111.00 ± 14.53 |

MNIST-LeNet4:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 0 | 1.91 ± 0.04 % | 13.15 ± 2.34 % | 16.67 ± 0.81 % | 20.00 ± 2.00 |
| TensorFuzz | 0 | 1.26 ± 0.05 % | 0 | 0 | 0 |
| DeepSmartFuzzer | 1.41 ± 0.00 % | 2.07 ± 0.14 % | 11.50 ± 1.13 % | 16.90 ± 3.07 % | 64.33 ± 6.03 |
| DeepSmartFuzzer (clustered) | 1.41 ± 0.00 % | 2.02 ± 0.07 % | 11.50 ± 0.54 % | 15.02 ± 2.15 % | 53.33 ± 8.39 |

Table 2: Number of new inputs produced by each CGF for the MNIST-LeNet1 and MNIST-LeNet4 models. (* 2-hour timeout; † extended experiments with a 6-hour timeout)

MNIST-LeNet1:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 0* | 1051.00 ± 4.00 | 847.00 ± 159.74* | 724.67 ± 180.17* | 1029.67 ± 29.48 |
| TensorFuzz | 0* | 1023.33 ± 1.15 | 0* | 0* | 0.33 ± 0.58* |
| DeepSmartFuzzer | 0* | 1024.00 ± 0.00 | 1002.67 ± 36.95 | 533.33 ± 73.90* | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 0* | 1024.00 ± 0.00 | 896.00 ± 128.00* | 469.33 ± 97.76* | 1024.00 ± 0.00 |

MNIST-LeNet4:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 0* | 1051.00 ± 4.00 | 1036.00 ± 12.49 | 1033.67 ± 27.50 | 1026.67 ± 33.50 |
| TensorFuzz | 0* | 768.00 ± 0.00* | 0* | 0* | 0* |
| DeepSmartFuzzer | 128.00 ± 0.00* | 981.33 ± 73.90† | 1024.00 ± 0.00† | 789.33 ± 195.52* | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 128.00 ± 0.00* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 725.33 ± 97.76* | 1024.00 ± 0.00 |

Table 3: Coverage increase achieved by each CGF for the MNIST-LeNet5 and CIFAR-CNN models.

MNIST-LeNet5:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 0.51 ± 0.58 % | 1.77 ± 0.03 % | 6.23 ± 0.55 % | 8.40 ± 0.66 % | 19.00 ± 1.73 |
| TensorFuzz | 0.13 ± 0.22 % | 0.75 ± 0.06 % | 0.13 ± 0.22 % | 0 | 1.33 ± 0.58 |
| DeepSmartFuzzer | 2.16 ± 0.44 % | 1.99 ± 0.01 % | 7.82 ± 1.06 % | 9.03 ± 1.10 % | 76.33 ± 5.69 |
| DeepSmartFuzzer (clustered) | 2.29 ± 0.38 % | 1.92 ± 0.08 % | 7.89 ± 0.72 % | 8.40 ± 1.91 % | 76.00 ± 8.89 |

CIFAR-CNN:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 1.99 ± 0.19 % | 0.98 ± 0.03 % | 2.39 ± 0.64 % | 4.48 ± 0.90 % | 16.00 ± 2.65 |
| TensorFuzz | 0.93 ± 0.15 % | 0.13 ± 0.01 % | 1.54 ± 0.19 % | 2.92 ± 0.34 % | 0 |
| DeepSmartFuzzer | 3.51 ± 0.48 % | 1.38 ± 0.09 % | 2.39 ± 1.23 % | 4.91 ± 2.51 % | 42.33 ± 4.51 |
| DeepSmartFuzzer (clustered) | 3.51 ± 0.37 % | 1.33 ± 0.06 % | 3.83 ± 2.66 % | 8.80 ± 7.11 % | 48.67 ± 7.02 |

Table 4: Number of new inputs produced by each CGF for the MNIST-LeNet5 and CIFAR-CNN models. (* 2-hour timeout; † extended experiments with 6-, 12-, and 24-hour timeouts)

MNIST-LeNet5:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 86.33 ± 96.81* | 1051.00 ± 4.00 | 1021.00 ± 10.54 | 1021.67 ± 19.66 | 1034.00 ± 17.52 |
| TensorFuzz | 0.33 ± 0.58* | 448.00 ± 0.00* | 0.67 ± 1.15* | 0* | 1.33 ± 0.58* |
| DeepSmartFuzzer | 362.67 ± 73.90* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 725.33 ± 36.95* | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 362.67 ± 73.90* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 682.67 ± 36.95* | 1024.00 ± 0.00 |

CIFAR-CNN:
| CGF | NC | KMN | NBC | SNAC | TFC |
|---|---|---|---|---|---|
| DeepHunter | 1047.00 ± 6.24 | 1035.67 ± 13.58 | 1031.67 ± 12.34 | 1049.00 ± 10.54 | 1042.67 ± 5.51 |
| TensorFuzz | 7.33 ± 1.15* | 192.00 ± 0.00 | 21.00 ± 2.65 | 20.67 ± 3.21 | 0* |
| DeepSmartFuzzer | 1024.00 ± 0.00 | 1024.00 ± 0.00† | 320.00 ± 0.00 | 341.33 ± 36.95 | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 1024.00 ± 0.00 | 1024.00 ± 0.00† | 341.33 ± 36.95 | 341.33 ± 36.95 | 1024.00 ± 0.00 |
For most cases, DeepSmartFuzzer is better than the other two methods. Especially for TFC, DeepSmartFuzzer provides a substantial improvement over DeepHunter and TensorFuzz. This might be related to TFC being a layer-level coverage criterion, while the others are neuron-level criteria. Our solution also improves as model complexity increases, as suggested by the growing performance gap between our method and the others. Furthermore, DeepSmartFuzzer with clustering tends to be better than naive DeepSmartFuzzer as the complexity of the model increases.

On the other hand, in a few cases our approach fails to provide an improvement. For example, for neuron coverage (NC) with the LeNet1 model, we observe that all fuzzers fail to generate any coverage-increasing input. This is because, when no reward (i.e., coverage increase) can be found, our MCTS solution behaves like a random search. We believe this problem can be avoided with well-designed reward shaping, which we leave to future work. Also, for LeNet4 in combination with NBC, DeepHunter appears to be better than our method. This may indicate a need for further hyperparameter tuning, since it conflicts with the general trend.

In order to check the statistical significance of the results, we apply a one-tailed t-test. We check two hypotheses: whether DeepSmartFuzzer is better than TensorFuzz, and whether DeepSmartFuzzer is better than DeepHunter, in terms of coverage increase. For the statistical comparisons, we use all 60 experiments (with different models and different coverage criteria) for each CGF method as observations. The significance threshold is set at .05. We find P < .001 for the former hypothesis and P = .007 for the latter. Therefore, we accept both hypotheses. Overall, we conclude that DeepSmartFuzzer provides a significant improvement over existing coverage-guided fuzzers for DNNs.
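A sketch of this significance check follows. The paper does not state which t-test variant was used, so a paired test over the 60 matched observations is shown here as one plausible reading; the function name and arguments are ours:

```python
from scipy import stats

def one_tailed_ttest(dsf_cov_inc, baseline_cov_inc, alpha=0.05):
    """One-tailed t-test for H1: DeepSmartFuzzer's coverage increases
    exceed the baseline's. Whether the original test was paired is an
    assumption on our part; `alternative="greater"` needs scipy >= 1.6."""
    t, p = stats.ttest_rel(dsf_cov_inc, baseline_cov_inc,
                           alternative="greater")
    return p, p < alpha  # p-value and whether H0 is rejected at alpha
```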
6 Conclusion & Future Work

In this study, we introduced a novel Coverage-Guided Fuzzing (CGF) technique for DNNs that uses Monte Carlo Tree Search (MCTS) to explore and exploit coverage increase patterns. We experimentally showed that our method outperforms previous CGFs for DNNs in terms of satisfying the subject coverage criteria. Our results also show that MCTS methods can be promising for better DNN testing. We used the naive coverage increase as the reward; experimentation with reward shaping and with different reinforcement learning methods for this problem is left for future studies. Finally, we share the source code of our experiments and implementation publicly in order to provide a base for future studies.

Acknowledgements

This research was supported in part by Semiconductor Research Corporation under task 2020-AH-2970.

References

[Ammann and Offutt, 2016] Paul Ammann and Jeff Offutt. Introduction to Software Testing. Cambridge University Press, 2016.
[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39-57, 2017.
[Chaslot et al., 2008] Guillaume M. J.-B. Chaslot, Mark H. M. Winands, H. Jaap van den Herik, Jos W. H. M. Uiterwijk, and Bruno Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(03):343-357, 2008.
[Cireşan et al., 2012] Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3642-3649, 2012.
[Gerasimou et al., 2020] Simos Gerasimou, Hasan Ferit Eniser, Alper Sen, and Alper Cakan. Importance-driven deep learning system testing. In International Conference on Software Engineering (ICSE), 2020.
[Goodfellow et al., 2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
[Hinton et al., 2012] G. Hinton, L. Deng, D. Yu, G. E. Dahl, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.
[Kim et al., 2019] Jinhan Kim, Robert Feldt, and Shin Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering (ICSE), 2019.
[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In 17th European Conference on Machine Learning (ECML'06), pages 282-293. Springer-Verlag, 2006.
[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[LeCun, 1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist, 1998.
[Ma et al., 2018] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, et al. DeepGauge: Multi-granularity testing criteria for deep learning systems. In IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
[Madry et al., 2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[Odena and Goodfellow, 2018] A. Odena and I. Goodfellow. TensorFuzz: Debugging neural networks with coverage-guided fuzzing. arXiv preprint arXiv:1807.10875, 2018.
[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, et al. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), pages 372-387, 2016.
[Pei et al., 2017] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated whitebox testing of deep learning systems. In Symposium on Operating Systems Principles (SOSP), pages 1-18, 2017.
[Sun et al., 2018a] Y. Sun, M. Wu, W. Ruan, X. Huang, et al. Concolic testing for deep neural networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), pages 109-119, 2018.
[Sun et al., 2018b] Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, and Rob Ashmore. Testing deep neural networks, 2018.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In International Conference on Neural Information Processing Systems, pages 3104-3112, 2014.
[Szegedy et al., 2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[Wicker et al., 2018] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. Feature-guided black-box safety testing of deep neural networks. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 408-426, 2018.
[Xie et al., 2018] X. Xie, L. Ma, F. Juefei-Xu, H. Chen, et al. DeepHunter: Hunting deep neural network defects via coverage-guided fuzzing. arXiv preprint arXiv:1809.01266, 2018.