        DeepSmartFuzzer: Reward Guided Test Generation For Deep Learning

Samet Demir1, Hasan Ferit Eniser2*, Alper Sen1
1 Department of Computer Engineering, Boğaziçi University, Turkey
2 Max Planck Institute for Software Systems, Germany
                    samet.demir1@boun.edu.tr, hfeniser@mpi-sws.org, alper.sen@boun.edu.tr


Abstract

Testing Deep Neural Network (DNN) models has become more important than ever with the increasing usage of DNN models in safety-critical domains such as autonomous cars. Traditionally, DNN testing relies on the performance on a dedicated subset of the available data, namely the test set. However, DNNs require more thorough testing approaches to exercise corner-case behaviors. Coverage-guided fuzzing (CGF), a common practice in software testing, aims to produce new test inputs by mutating existing ones to achieve high coverage on a test adequacy criterion. CGF has been an effective method for finding error-inducing inputs while satisfying a well-established criterion. In this paper, we propose a novel CGF algorithm for structural testing of DNNs. The proposed algorithm employs Monte Carlo Tree Search to drive the coverage-guided search. In our evaluation, we show that the inputs generated by our method result in higher coverage than the inputs produced by previously introduced CGF techniques on various criteria in a fixed amount of time.

1   Introduction

Given enough data and processing power, training a Deep Neural Network (DNN) is the most popular way of dealing with many hard computational problems such as image classification [Cireşan et al., 2012], natural language processing [Sutskever et al., 2014], and speech recognition [Hinton et al., 2012]. Impressive achievements in such tasks have raised expectations for deploying DNNs in real-world applications, including those in safety-critical domains.

Despite these remarkable achievements, recent works [Szegedy et al., 2013; Goodfellow et al., 2015] have demonstrated that DNNs are vulnerable to small perturbations on seed inputs, also called adversarial attacks. Considering the catastrophic results that can emerge from erroneous behaviors in safety-critical systems, DNNs must be characterized by a high degree of dependability before being deployed in such systems.

Testing is the primary practice for analyzing and evaluating the quality of a software system [Ammann and Offutt, 2016]. It helps reduce risk by finding and eliminating erroneous behaviors before deployment of the system. One of the most fundamental testing concepts is defining a coverage criterion for a given test set, also called a test adequacy criterion. A coverage criterion measures how much of the system's structure is exercised (covered) when the test inputs are provided. Having a test set that satisfies a coverage criterion provides a degree of dependability to the system under test.

Recent research in DNN testing introduces new DNN-specific coverage criteria such as neuron coverage [Pei et al., 2017] and its variants [Ma et al., 2018], an MC/DC-inspired criterion [Sun et al., 2018b], and other criteria such as surprise adequacy [Kim et al., 2019] and DeepImportance [Gerasimou et al., 2020]. Previous works [Pei et al., 2017; Ma et al., 2018; Kim et al., 2019], as well as future studies on coverage criteria for DNNs, could be useful for exposing defects in DNNs, finding adversarial examples, or forming diverse test sets. On the other hand, satisfying a coverage criterion, or at least achieving a high coverage measurement, can be difficult without a structured methodology. Existing works [Xie et al., 2018; Odena and Goodfellow, 2018] leverage coverage-guided fuzzing (CGF) to achieve high coverage for a given criterion. However, both of these works apply mutations on inputs randomly. Therefore, their effectiveness is limited, as shown in our experiments.

In this work, we introduce DeepSmartFuzzer, a novel CGF for achieving high coverage in DNNs under existing coverage criteria in the literature. Our ultimate goal is to help practitioners extend their test sets with new inputs so that new behaviours are covered. To that end, we leverage Monte Carlo Tree Search (MCTS) [Chaslot et al., 2008], a search algorithm for decision processes. In our method, MCTS is used to determine a series of mutations that would result in the best coverage increase for a given input.

*Most of the work was done when the author was at Boğaziçi University.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The contributions of this work are as follows:

• We introduce DeepSmartFuzzer, a novel Coverage Guided Fuzzing (CGF) technique for testing DNNs. DeepSmartFuzzer is applicable to all existing coverage metrics.
• We show the effectiveness of our method for many popular coverage criteria and for many DNNs with different complexities.
• We compare the effectiveness of our method with existing CGF methods.

2   Related Work

Recently, several DNN testing techniques have been developed in the literature. Among these techniques, there exist works developing coverage criteria for DNNs. For example, DeepXplore [Pei et al., 2017] proposed neuron coverage (analogous to statement coverage in software). DeepGauge [Ma et al., 2018] proposed a set of fine-grained test coverage criteria. Kim et al. [2019] proposed surprise adequacy criteria based on a measure of surprise caused by the inputs, and Gerasimou et al. [2020] presented a metric for composing a semantically diverse test set. Sun et al. [2018a] proposed the first concolic testing approach for DNNs.

We now discuss the studies that are closest to ours. TensorFuzz [Odena and Goodfellow, 2018] proposed the first CGF for neural networks, aiming to increase a novel coverage metric. DeepHunter [Xie et al., 2018] is another work exploring CGF for DNNs by leveraging techniques from software fuzzing, such as power scheduling. Our work differs from TensorFuzz and DeepHunter in that they apply random mutations to inputs, whereas we employ Monte Carlo Tree Search (MCTS) to select mutations. Also, we apply mutations to a small subset of the features of a given input, whereas they apply mutations to all features. DeepSmartFuzzer is also similar to [Wicker et al., 2018] in that both works employ Monte Carlo Tree Search. However, the objective in [Wicker et al., 2018] is to find the nearest adversarial example, whereas our objective is to increase the value of a given coverage criterion.

Szegedy et al. [2013] first discovered the vulnerability of DNNs to adversarial attacks. Since then, numerous white-box and black-box adversarial attacks have been proposed. The most popular attacks include FGSM [Goodfellow et al., 2015], JSMA [Papernot et al., 2016], PGD [Madry et al., 2017], and C&W [Carlini and Wagner, 2017].

3   Background

Coverage Criterion in Software  A coverage criterion partitions the input space into equivalence classes [Ammann and Offutt, 2016]; it is measured by dividing the number of equivalence classes that are sampled by at least one input in the test set by the total number of equivalence classes. For example, statement coverage in software measures the percentage of the statements in the program that are executed with the given set of test inputs.
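As a concrete DNN analogue of this idea, the following sketch (ours, not taken from any of the cited tools) computes DeepXplore-style neuron coverage, treating a neuron as covered when at least one input drives its activation above a threshold:

```python
import numpy as np

def neuron_coverage(activations: np.ndarray, threshold: float = 0.75) -> float:
    """Fraction of neurons whose activation exceeds `threshold` for at
    least one input; `activations` has shape (num_inputs, num_neurons)."""
    covered = (activations > threshold).any(axis=0)
    return covered.sum() / covered.size

# Toy example: 2 of 4 neurons are driven above 0.75 by some input.
acts = np.array([[0.9, 0.1, 0.2, 0.0],
                 [0.3, 0.8, 0.4, 0.1]])
print(neuron_coverage(acts))  # 0.5
```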
Coverage Guided Fuzzing  Coverage Guided Fuzzing (CGF) performs systematic mutations on inputs and produces new test inputs to increase the number of covered cases for a target coverage metric. A typical CGF process starts with selecting a seed input from the seed pool, then continues with mutating the selected seed a certain number of times. After that, the program under test is run with the mutated seed. If a mutated seed creates an increase in the number of covered cases, CGF keeps the mutated seed in the seed pool.
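The loop below is a minimal sketch of this generic process; the `mutate` and `coverage` callables are assumed, problem-specific components, not part of any cited tool:

```python
import random

def cgf_loop(seed_pool, mutate, coverage, budget=1000):
    """Minimal coverage-guided fuzzing loop: select, mutate, and keep a
    mutant only if it increases the number of covered cases."""
    covered = coverage(seed_pool)
    for _ in range(budget):
        seed = random.choice(seed_pool)        # select a seed from the pool
        candidate = mutate(seed)               # mutate it
        new_covered = coverage(seed_pool + [candidate])
        if new_covered > covered:              # mutant adds coverage: keep it
            seed_pool.append(candidate)
            covered = new_covered
    return seed_pool
```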
Monte Carlo Tree Search (MCTS)  Monte Carlo Tree Search [Chaslot et al., 2008] is a search algorithm for decision processes such as game play. It represents a game as a tree in which each node has one child for each possible action, so that each node of the game tree represents a particular state of the game, and taking an action makes a transition from a node to one of its children. The MCTS algorithm aims to find the most promising action in an arbitrary state of a game. In other words, given an initial state, the objective is to find the best sequence of actions to win the game.

The MCTS process can be broken down into the following four steps. Selection: starting from the root node R, successively select child nodes according to their potentials until a leaf node L is reached. The potential of each child node is calculated using UCT (Upper Confidence Bound applied to Trees) [Kocsis and Szepesvári, 2006; Chaslot et al., 2008]. UCT is defined as

    v + e · √(ln N / n)

where v refers to the value of the node, n is the visit count of the node, and N is the visit count of the parent of the node; e is a hyperparameter determining the exploration-exploitation trade-off. Expansion: unless L is a terminal node (i.e. win/loss/draw), create at least one child node C (i.e. any valid action from node L) and take one of them. Simulation: play the game from node C by picking moves randomly until a terminal condition is reached. Backpropagation: propagate the result of the play back to update the values associated with the nodes on the path from C to R. The path containing the nodes with the highest values in each layer would be the optimal strategy in the game.
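A direct transcription of the UCT rule (a sketch; the infinite score for unvisited nodes is a common convention that the formula above does not spell out):

```python
import math

def uct(value: float, visits: int, parent_visits: int,
        e: float = math.sqrt(2)) -> float:
    """UCT potential v + e * sqrt(ln N / n) of a child node."""
    if visits == 0:
        return math.inf  # try unvisited children first
    return value + e * math.sqrt(math.log(parent_visits) / visits)

# During Selection, the child with the highest UCT score is followed.
```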
4   Method: DeepSmartFuzzer

Let T = {[X1, X2, ...], [Y1, Y2, ...]} be a test set, where (Xi, Yi) is the input-output pair of the i-th test sample. Let I be a set of inputs called a batch. Let I', I'', ..., I^(n) be a sequence of mutated batches of the original batch I such that

    I --IM(r',m')--> I' --IM(r'',m'')--> I'' --> ... --IM(r^(n),m^(n))--> I^(n)        (1)

where IM(r^(n), m^(n)) is the n-th input mutation, and r^(n) and m^(n) are the region and mutation indexes, respectively. Also, let I_best ∈ {I', I'', ...} be the best batch, i.e. the one that creates the greatest amount of coverage increase.

4.1   Overview

DeepSmartFuzzer is an MCTS-based coverage-guided fuzzer for DNNs. It can be classified as a grey-box testing method since it uses coverage information, which is related to the internal states of a DNN model.
Our method generates inputs that increase the current level of coverage formed by the initial test set. The core idea of our approach is to employ reward-guided exploration and exploitation in order to achieve high coverage scores. The workflow of the proposed method is illustrated in Figure 1. It is composed of an input chooser, a coverage criterion, an input mutator, and a mutation selector, which is the essential part.

[Figure 1: Workflow of DeepSmartFuzzer for an iteration]

For each iteration, the input chooser chooses a batch, which is a set of inputs I. Then, the mutation selector determines the mutation (r', m') to be applied to the inputs. Note that applying one kind of mutation to all input features may be too coarse because, for a given mutation, a subset of input features may increase coverage while others decrease it. Our fuzzer takes a finer approach and applies mutations to a subset of input features (i.e. regions) at each step. The input mutator takes the selected batch I and the selected mutation (r', m'), and the selected mutation is applied to the selected batch of inputs so that the mutated inputs I' are formed (I --IM(r',m')--> I'). The mutated inputs are then given to the coverage criterion to calculate the coverage of the mutated inputs together with the test set. The coverage and the mutated inputs are passed back to the mutation selector, so that it can use the coverage and continue working with I', generating new mutated inputs I'' (I' --IM(r'',m'')--> I''). This process continues until a termination condition such as an exploration limit or a mutation limit is reached. The best set of mutated inputs I_best is stored and updated in the meantime. If the mutated inputs I_best yield an increase in coverage, they are added to the test set. This concludes the iteration for the batch I. We continue iterating with different batches until a termination condition such as a target number of new inputs or a timeout is reached. Now, we detail each component.

4.2   Input Chooser

We use two types of input choosers for selecting inputs: the random input chooser and the clustered random input chooser. The random input chooser samples a batch of inputs using the uniform random distribution. The clustered random input chooser samples similar inputs together. It applies an off-the-shelf clustering algorithm; after clustering, it selects a random cluster using the uniform random distribution. Finally, it samples a random batch of inputs from the selected cluster. We use sampling without replacement to avoid duplicate inputs in a batch, since we apply the same mutations to all inputs in the batch. In this work, we use k-means as the clustering algorithm.
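A minimal sketch of the clustered random input chooser follows; the cluster count and the use of scikit-learn's k-means are our assumptions, as the text only fixes the overall procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_input_chooser(X, batch_size=64, n_clusters=10, seed=0):
    """Cluster the test inputs, pick one cluster uniformly at random, then
    sample a batch from it without replacement (so a batch never contains
    the same input twice)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X.reshape(len(X), -1))
    members = np.flatnonzero(labels == rng.integers(n_clusters))
    picks = rng.choice(members, size=min(batch_size, len(members)),
                       replace=False)
    return X[picks]
```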
4.3   Mutation Selector

The mutation selector takes a batch of inputs and sequentially selects the parameters region index r and mutation index m. The selected mutations are sequentially applied to the selected regions by the input mutator, and a sequence of mutated batches I', I'', ..., I^(n-1), I^(n) is generated. Note that I^(n) contains all the mutations up to that point. This formulation of the problem has a sequential nature. Therefore, we model the problem as a game consisting of a sequence of actions and rewards related to those actions. We use a two-player game model since two selections (one for the region and one for the mutation) are made for each mutation.

Formulating mutation selection as a two-player game  Our proposed mutation selector is a two-player game such that Player I selects the region to be mutated, and Player II selects the mutation to be made on the chosen region. Since regions and mutations are enumerated, these are just integer selection problems. The game continues as Players I and II play iteratively, so that multiple mutations can be applied on top of each other and a sequence of mutated batches is generated as described above. Region selection and mutation selection are considered separate actions. We call the tuple of actions taken by Players I and II together a complete action (r, m).

Reward  A naive reward for our problem is the coverage increase for each action. We use this reward to guide the search for mutations. In this study, the coverage increase corresponds to the difference between the coverage of the current test set and the coverage obtained by adding the new batch to the test set. Overall, the purpose of the mutation selector is to find the sequence of mutations that results in the greatest amount of coverage increase. A concrete sketch of this reward is given below.
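In code, the reward is simply a coverage delta (a sketch; `coverage` stands for any of the criteria discussed in Section 5):

```python
def reward(coverage, test_set, mutated_batch):
    """Coverage increase used to guide MCTS: coverage of the test set
    extended with the mutated batch, minus coverage of the test set alone."""
    return coverage(test_set + mutated_batch) - coverage(test_set)
```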
End of the game  In order to avoid creating unrealistic inputs, and consequently the need for human intervention to eliminate them, we put constraints on the mutations. Generally, L_p norms are used for this purpose; the constraint is then ||X'_i − X_i||_p < ε, where X'_i is the mutated input, so that the distance between the mutated inputs and the original inputs is limited. In general form, letting d(X'_i, X_i) be a distance metric, the game is over when the constraint d(X'_i, X_i) < ε is violated, where d and ε are hyperparameters.
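For instance, with d taken to be the L-infinity distance (one possible choice; the text leaves d and ε open), the end-of-game test looks like:

```python
import numpy as np

def game_over(X_mut, X_orig, eps):
    """True when some mutated input violates d(X'_i, X_i) < eps,
    using the L-infinity distance as d."""
    dist = np.abs(X_mut - X_orig).reshape(len(X_mut), -1).max(axis=1)
    return bool((dist >= eps).any())
```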
Searching  We use Monte Carlo Tree Search (MCTS) in order to exploit rewards and guide the search towards rewarded mutations. For this purpose, MCTS searches the game tree for the best mutations. The nodes in our game tree correspond to region and mutation selections. We continuously update the batch of inputs I_best that creates the best coverage increase, so that this batch can be added to the test set when MCTS is finished.
4.4   Input Mutator

The input mutator mutates the input according to the region index and the mutation index selected by the mutation selector. The availability of too many mutations would potentially make the job of the mutation selector harder. Therefore, we come up with a general input mutator for images. It divides an image into local regions and provides general image mutations as mutation options for each region. These general mutations include, but are not limited to, brightness change, contrast change, and blur. When a region r and a mutation m are selected, it applies the selected mutation to the selected region. The number of regions and the set of available mutations are hyperparameters of the input mutator. With appropriate settings, we can obtain a pixel-level mutator, an image-level mutator, or something in between, which we consider best for practical reasons. We enumerate the regions and mutations so that the mutation selector identifies them by index. Figure 2 shows an example division of an input into regions. Our proposed input mutator induces a bias towards locality, since it applies mutations to regions of an image. Therefore, it is a natural fit for image problems and convolutional neural networks. A sketch of such a mutator is given after this section.

[Figure 2: Regions — an image divided into a 3 × 3 grid of nine enumerated regions]
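The following sketch mirrors this design for H × W images with pixel values in [0, 255]; the specific brightness and contrast parameters are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

MUTATIONS = [
    lambda p: np.clip(p + 30.0, 0, 255),   # brightness change
    lambda p: np.clip(p * 1.3, 0, 255),    # contrast change
    lambda p: uniform_filter(p, size=3),   # blur
]

def input_mutator(image, r, m, grid=3):
    """Apply enumerated mutation m to enumerated region r (row-major cell
    of a grid x grid division) on a copy of the image."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    row, col = divmod(r, grid)
    out = np.array(image, dtype=float)
    ys = slice(row * h, (row + 1) * h)
    xs = slice(col * w, (col + 1) * w)
    out[ys, xs] = MUTATIONS[m](out[ys, xs])
    return out
```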
4.5   Algorithm

Algorithm 1  Algorithmic description of DeepSmartFuzzer
 1: procedure DeepSmartFuzzer(T, tc0, tc1, tc2, tc3)
 2:   while not tc0 do
 3:     I = Input_Chooser(T)
 4:     R = MCTS_Node(I)
 5:     best_cov, I_best = 0, I
 6:     while not tc1 do
 7:       while not tc2 do
 8:         L = MCTS_Selection(R)
 9:         C = MCTS_Expansion(L)
10:         I^(n-1) = get_batch_corresponding_to_node(C)
11:         r^(n), m^(n) = MCTS_Simulation(C)
12:         I^(n) = Input_Mutator(I^(n-1), r^(n), m^(n))
13:         if not tc3 then
14:           cov_inc = Coverage(T ∪ I^(n)) − Coverage(T)
15:           if cov_inc > best_cov then
16:             best_cov, I_best = cov_inc, I^(n)
17:           MCTS_Backpropagation(C, cov_inc)
18:       R = select_child(R)
19:     if best_cov > 0 then
20:       T = T ∪ I_best
21:   return T

We describe our method more formally in Algorithm 1. The while loop in line 2 iterates until a termination condition (tc0), which is a timeout or reaching a target number of new inputs. In line 3, a batch of inputs I is sampled using the input chooser. The root node R is created in line 4, and the variables storing the best mutated batch I_best are initialized in line 5. The while loop in line 6 loops until a termination condition (tc1) that bounds the search space by limiting the number of levels of the search tree that MCTS can descend. Next, the while loop in line 7 iterates until a termination condition (tc2) that determines the number of times the subtree of the root node R is explored. In line 8, MCTS Selection, which selects a path from the root node R to a leaf node L using the potentials of the nodes (calculated with UCT), yields a leaf node L. Then, in line 9, MCTS Expansion is applied, creating a new node C as a child of the leaf node L. In line 10, the mutations defined by the path from the root of the game tree to the node C are applied to the initial batch I, resulting in I^(n-1). In other words, I^(n-1) is the mutated batch produced by MCTS Selection and Expansion. Then, MCTS Simulation plays the game until a complete action, so that r^(n) and m^(n) are assigned a region index and a mutation index, respectively (line 11). Our MCTS Simulation differs from the original: instead of playing the game randomly until the end, it plays the game randomly only until a complete action, since our game formulation produces a reward for every complete action. The input mutator then mutates the batch I^(n-1) according to the region index r^(n) and the mutation index m^(n), creating a new batch I^(n) (line 12). Termination condition (tc3) controls the realism of the generated batch by limiting the distance between the mutated batch I^(n) and the original batch I. If the new batch I^(n) does not violate (tc3), it is considered a candidate batch of test inputs, and the coverage increase is calculated as the difference between the coverage of the test set T together with the mutated batch I^(n) and the coverage of T alone (lines 13-14). If this is the greatest coverage increase for this batch I, the mutated batch I^(n) is stored as the best mutated batch I_best (lines 15-16). MCTS Backpropagation is applied from the new child node C with the coverage increase as the reward (line 17). This concludes one iteration of the while loop with tc2, and the algorithm continues exploring the root node until tc2 is reached. When termination condition tc2 is reached, the best child of the root R becomes the new root node (line 18). The best child is the node with the greatest value, namely the average coverage increase (reward) found on the paths (sequences of mutations) that contain the node. After setting a child as the new root, an iteration of the while loop with tc1 is completed, and the loop continues by working on the subtree of the child node (now called the root node R). After the while loop has completely finished, the best batch found, I_best, is added to the test set if it creates a coverage increase (lines 19-20). Here, we add the complete batch to the test set in order to avoid searching for the effective inputs within the batch, thereby speeding up the process. This concludes a complete MCTS iteration on the batch I, and the algorithm continues iterating with new batches until termination condition tc0 is reached. When tc0 is reached, the final test set, which includes the mutated inputs found up to that point, is returned (line 21).
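To make the tree operations concrete, here is a compact node structure in the spirit of Algorithm 1 (our sketch, not the authors' implementation; levels alternate between region choices and mutation choices):

```python
import math
import random

class MCTSNode:
    """Game-tree node for the mutation-selection game."""
    def __init__(self, n_actions, parent=None):
        self.children = {}           # action index -> child node
        self.n_actions = n_actions
        self.parent = parent
        self.visits, self.value = 0, 0.0

    def select(self, e=math.sqrt(2)):
        """MCTS Selection: descend through fully expanded nodes by UCT."""
        node = self
        while node.children and len(node.children) == node.n_actions:
            node = max(node.children.values(),
                       key=lambda c: math.inf if c.visits == 0 else
                       c.value + e * math.sqrt(math.log(node.visits) / c.visits))
        return node

    def expand(self):
        """MCTS Expansion: attach one child for an untried action."""
        untried = [a for a in range(self.n_actions) if a not in self.children]
        action = random.choice(untried)
        self.children[action] = MCTSNode(self.n_actions, parent=self)
        return self.children[action]

    def backpropagate(self, reward):
        """MCTS Backpropagation: update running-average values up to the root."""
        node = self
        while node is not None:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
```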
[Figure 3: Visualization of a snapshot of our method searching the mutation space for TFC and MNIST-LeNet5. (a) The game tree, where the action columns represent the potentials (brighter means higher potential) of each enumerated action in the search tree. (b) The selected mutations on a seed input.]


Figure 3 illustrates the algorithm in action: a region (action) and a mutation (action) are selected, the input mutator applies the mutation to the region, and this process is repeated while MCTS searches the game tree.

5   Experiments

Source code: https://github.com/hasanferit/DeepSmartFuzzer

5.1   Setup

Datasets and DL Systems  We evaluate DeepSmartFuzzer on two popular publicly available datasets, namely MNIST [LeCun, 1998] and CIFAR10 [Krizhevsky and Hinton, 2009] (referred to as CIFAR from now on). MNIST is a handwritten digit dataset with 60000 training and 10000 testing inputs. Each input is a 28x28-pixel black-and-white image with a class label from 0 to 9. CIFAR is a 3-channel colored image dataset with 50000 training and 10000 testing samples. Each input is a 3x32x32 image in one of ten classes (e.g., plane, ship, car). For the MNIST dataset, we study the LeNet1, LeNet4, and LeNet5 [LeCun et al., 1998] DNN architectures, three well-known and popular models in the literature. For the CIFAR dataset, we make use of a previously suggested convolutional neural network architecture [Wicker et al., 2018].

Compared Techniques and Coverage Criteria Benchmarks  We evaluate our tool by comparing its performance with two existing CGF frameworks for deep learning systems. The first tool, DeepHunter [Xie et al., 2018], aims to achieve high coverage by randomly selecting a batch of inputs and applying random mutations to them. DeepHunter also leverages various fuzzing techniques from software testing, such as power scheduling. However, the tool is not publicly available; therefore, we use our own implementation of DeepHunter in the evaluation. The second tool, TensorFuzz [Odena and Goodfellow, 2018], uses the guidance of coverage to debug DNNs. For example, it finds numerical errors and disagreements between neural networks and quantized versions of those networks. The TensorFuzz code is publicly available, and we integrate it into our framework.

For an unbiased evaluation of DeepSmartFuzzer, we test our tool on various coverage criteria. We use DeepXplore's [Pei et al., 2017] neuron coverage (NC), DeepGauge's [Ma et al., 2018] k-multisection neuron coverage (KMN), neuron boundary coverage (NBC), and strong neuron activation coverage (SNAC), and TensorFuzz's coverage (TFC).

Hyperparameters  We set the neuron activation threshold to 0.75 in NC and the number of sections k to 10000 in KMN. For NBC and SNAC, we set as the lower (upper) bound the minimum (maximum) activation value encountered in the training set. These are the recommended settings in the original studies. On the other hand, we observed that the distance threshold used in the original TensorFuzz study was too small for the MNIST and CIFAR models, such that every little mutation could increase the coverage. Therefore, we tune the threshold of TFC for LeNet1, LeNet4, LeNet5, and the CIFAR CNN as 30², 13², 11², and 3², respectively.

The number of regions, the set of mutations, and the termination conditions (tc1, tc2, tc3) constitute the hyperparameters of DeepSmartFuzzer. The number of regions is set to 9, which corresponds to a 3 × 3 division of an image. The set of mutations is contrast change, brightness change, and blur. The first termination condition (tc1) limits MCTS from going more than 8 levels deep in the game tree. The second termination condition (tc2) limits the number of iterations on each root to 25. For the last termination condition (tc3), we use the limits that DeepHunter [Xie et al., 2018] puts on the distance between mutated and seed inputs to avoid unrealistic mutated inputs.
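Collected in one place, this configuration reads as follows (a sketch; the key names are ours, the values are those listed above):

```python
# DeepSmartFuzzer experimental configuration (values from the paper;
# key names are illustrative).
CONFIG = {
    "nc_activation_threshold": 0.75,    # neuron coverage (NC)
    "kmn_sections": 10_000,             # k for k-multisection coverage
    "tfc_distance_threshold": {         # tuned per model
        "LeNet1": 30**2, "LeNet4": 13**2, "LeNet5": 11**2, "CIFAR_CNN": 3**2,
    },
    "n_regions": 9,                     # 3 x 3 division of the image
    "mutations": ["contrast", "brightness", "blur"],
    "tc1_max_depth": 8,                 # max MCTS levels per game tree
    "tc2_iterations_per_root": 25,      # subtree explorations per root
    "tc3": "DeepHunter distance limits on mutated vs. seed inputs",
}
```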
5.2   Results

Summary  We aim to show that DeepSmartFuzzer is able to generate good test inputs. First, we compare DeepSmartFuzzer with DeepHunter and TensorFuzz by comparing the coverage increases created by approximately 1000 new inputs for each method, in combination with different DNN models and coverage criteria. The experimental results in Tables 1 and 3 show that the inputs generated by our method result in the greatest amount of coverage increase for all (DNN model, coverage criterion) pairs except for a few. This suggests that DeepSmartFuzzer creates better test inputs than DeepHunter and TensorFuzz with regard to the coverage measurements.
| CGF \ Model (Criterion) | LeNet1 (NC) | LeNet1 (KMN) | LeNet1 (NBC) | LeNet1 (SNAC) | LeNet1 (TFC) | LeNet4 (NC) | LeNet4 (KMN) | LeNet4 (NBC) | LeNet4 (SNAC) | LeNet4 (TFC) |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepHunter | 0 | 2.34 ± 0.03 % | 35.42 ± 2.76 % | 41.67 ± 4.17 % | 29.00 ± 3.61 | 0 | 1.91 ± 0.04 % | 13.15 ± 2.34 % | 16.67 ± 0.81 % | 20.00 ± 2.00 |
| TensorFuzz | 0 | 1.83 ± 0.23 % | 0 | 0 | 0.33 ± 0.58 | 0 | 1.26 ± 0.05 % | 0 | 0 | 0 |
| DeepSmartFuzzer | 0 | 2.91 ± 0.11 % | 41.67 ± 4.77 % | 42.36 ± 6.36 % | 204.67 ± 8.50 | 1.41 ± 0.00 % | 2.07 ± 0.14 % | 11.50 ± 1.13 % | 16.90 ± 3.07 % | 64.33 ± 6.03 |
| DeepSmartFuzzer (clustered) | 0 | 2.88 ± 0.04 % | 38.54 ± 0.00 % | 39.58 ± 7.51 % | 111.00 ± 14.53 | 1.41 ± 0.00 % | 2.02 ± 0.07 % | 11.50 ± 0.54 % | 15.02 ± 2.15 % | 53.33 ± 8.39 |

Table 1: Coverage increase achieved by each CGF for the MNIST-LeNet1 and MNIST-LeNet4 models.

| CGF \ Model (Criterion) | LeNet1 (NC) | LeNet1 (KMN) | LeNet1 (NBC) | LeNet1 (SNAC) | LeNet1 (TFC) | LeNet4 (NC) | LeNet4 (KMN) | LeNet4 (NBC) | LeNet4 (SNAC) | LeNet4 (TFC) |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepHunter | 0* | 1051.00 ± 4.00 | 847.00 ± 159.74* | 724.67 ± 180.17* | 1029.67 ± 29.48 | 0* | 1051.00 ± 4.00 | 1036.00 ± 12.49 | 1033.67 ± 27.50 | 1026.67 ± 33.50 |
| TensorFuzz | 0* | 1023.33 ± 1.15 | 0* | 0* | 0.33 ± 0.58* | 0* | 768.00 ± 0.00* | 0* | 0* | 0* |
| DeepSmartFuzzer | 0* | 1024.00 ± 0.00 | 1002.67 ± 36.95 | 533.33 ± 73.90* | 1024.00 ± 0.00 | 128.00 ± 0.00* | 981.33 ± 73.90† | 1024.00 ± 0.00† | 789.33 ± 195.52* | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 0* | 1024.00 ± 0.00 | 896.00 ± 128.00* | 469.33 ± 97.76* | 1024.00 ± 0.00 | 128.00 ± 0.00* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 725.33 ± 97.76* | 1024.00 ± 0.00 |

Table 2: Number of new inputs produced by each CGF for the MNIST-LeNet1 and MNIST-LeNet4 models. *2-hour timeout. †Extended experiments with a 6-hour timeout limit.

| CGF \ Model (Criterion) | LeNet5 (NC) | LeNet5 (KMN) | LeNet5 (NBC) | LeNet5 (SNAC) | LeNet5 (TFC) | CIFAR CNN (NC) | CIFAR CNN (KMN) | CIFAR CNN (NBC) | CIFAR CNN (SNAC) | CIFAR CNN (TFC) |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepHunter | 0.51 ± 0.58 % | 1.77 ± 0.03 % | 6.23 ± 0.55 % | 8.40 ± 0.66 % | 19.00 ± 1.73 | 1.99 ± 0.19 % | 0.98 ± 0.03 % | 2.39 ± 0.64 % | 4.48 ± 0.90 % | 16.00 ± 2.65 |
| TensorFuzz | 0.13 ± 0.22 % | 0.75 ± 0.06 % | 0.13 ± 0.22 % | 0 | 1.33 ± 0.58 | 0.93 ± 0.15 % | 0.13 ± 0.01 % | 1.54 ± 0.19 % | 2.92 ± 0.34 % | 0 |
| DeepSmartFuzzer | 2.16 ± 0.44 % | 1.99 ± 0.01 % | 7.82 ± 1.06 % | 9.03 ± 1.10 % | 76.33 ± 5.69 | 3.51 ± 0.48 % | 1.38 ± 0.09 % | 2.39 ± 1.23 % | 4.91 ± 2.51 | 42.33 ± 4.51 |
| DeepSmartFuzzer (clustered) | 2.29 ± 0.38 % | 1.92 ± 0.08 % | 7.89 ± 0.72 % | 8.40 ± 1.91 % | 76.00 ± 8.89 | 3.51 ± 0.37 % | 1.33 ± 0.06 % | 3.83 ± 2.66 % | 8.80 ± 7.11 | 48.67 ± 7.02 |

Table 3: Coverage increase achieved by each CGF for the MNIST-LeNet5 and CIFAR-CNN models.

| CGF \ Model (Criterion) | LeNet5 (NC) | LeNet5 (KMN) | LeNet5 (NBC) | LeNet5 (SNAC) | LeNet5 (TFC) | CIFAR CNN (NC) | CIFAR CNN (KMN) | CIFAR CNN (NBC) | CIFAR CNN (SNAC) | CIFAR CNN (TFC) |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepHunter | 86.33 ± 96.81* | 1051.00 ± 4.00 | 1021.00 ± 10.54 | 1021.67 ± 19.66 | 1034.00 ± 17.52 | 1047.00 ± 6.24 | 1035.67 ± 13.58 | 1031.67 ± 12.34 | 1049.00 ± 10.54 | 1042.67 ± 5.51 |
| TensorFuzz | 0.33 ± 0.58* | 448.00 ± 0.00* | 0.67 ± 1.15* | 0* | 1.33 ± 0.58* | 7.33 ± 1.15* | 192.00 ± 0.00 | 21.00 ± 2.65 | 20.67 ± 3.21 | 0* |
| DeepSmartFuzzer | 362.67 ± 73.90* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 725.33 ± 36.95* | 1024.00 ± 0.00 | 1024.00 ± 0.00 | 1024.00 ± 0.00† | 320.00 ± 0.00 | 341.33 ± 36.95 | 1024.00 ± 0.00 |
| DeepSmartFuzzer (clustered) | 362.67 ± 73.90* | 1024.00 ± 0.00† | 1024.00 ± 0.00† | 682.67 ± 36.95* | 1024.00 ± 0.00 | 1024.00 ± 0.00 | 1024.00 ± 0.00† | 341.33 ± 36.95 | 341.33 ± 36.95 | 1024.00 ± 0.00 |

Table 4: Number of new inputs produced by each CGF for the MNIST-LeNet5 and CIFAR-CNN models. *2-hour timeout. †Extended experiments with 6-, 12-, and 24-hour timeout limits.


Figure 4 shows examples of mutated MNIST and CIFAR inputs created by DeepSmartFuzzer.

[Figure 4: Example inputs generated by DeepSmartFuzzer]

Comparison to DeepHunter and TensorFuzz  We focus on the inputs generated by DeepSmartFuzzer, DeepHunter, and TensorFuzz. For experimental integrity, we make each method generate approximately 1000 input samples. Only the inputs that induce a coverage increase are taken into account. We also impose a time limit in order to avoid unending runs caused by an inability to find any coverage increase for some (DNN model, coverage criterion) pairs. When a method cannot produce the target number of inputs in time, yet creates some coverage increase and thus shows further potential, the timeout limit is extended so that it can reach 1000 inputs. This condition is not applied to TensorFuzz, since it generates inputs one by one and could therefore take days to reach 1000 inputs in some cases. The timeout is set to 2 hours initially. It is then gradually increased to 6, 12, and 24 hours to explore the full potential. Tables 1 and 3 show the coverage increases produced by approximately 1000 generated input samples from each method with a divergent set of coverage criteria and DNN models for the MNIST and CIFAR datasets. In order to provide complete results, Tables 2 and 4 indicate exactly how many inputs are generated in each case. All of these results are given as the mean and standard deviation over three runs of the same experiment with different random seeds.

For most of the cases, DeepSmartFuzzer is better than the other two. Especially in the case of TFC, DeepSmartFuzzer provides a substantial improvement over DeepHunter and TensorFuzz. This might be related to TFC being a layer-level coverage criterion, while the others are neuron-level coverage criteria. Our solution gets better as model complexity increases, as suggested by the growing performance gap between our method and the others. Furthermore, DeepSmartFuzzer with clustering tends to be better than naive DeepSmartFuzzer when the complexity of the model is increased.

On the other hand, for a few cases, our approach fails to provide an improvement. For example, in the neuron coverage (NC) with LeNet1 case, we observe that all fuzzers fail to generate any coverage-increasing input. This is because, when we cannot find any reward (i.e. coverage increase), our MCTS solution behaves similarly to a random search. However, we believe this problem can be avoided with well-designed reward shaping, which is left to future work. Also, in the case of LeNet4 in combination with NBC, DeepHunter seems to be better than ours. This may indicate a need for further hyperparameter tuning, since it conflicts with the general trend.
In order to check the statistical significance of the results, we apply a one-tailed t-test. We check two hypotheses: whether DeepSmartFuzzer is better than TensorFuzz, and whether DeepSmartFuzzer is better than DeepHunter, in terms of coverage increase. For the statistical comparisons, we use all 60 experiments (with different models and different coverage criteria) for each CGF method as observations. The significance threshold is set at .05. We find P < .001 for the former hypothesis and P = .007 for the latter. Therefore, we accept both hypotheses.
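Such a comparison can be reproduced along the following lines (a sketch; `ours` and `baseline` would hold the 60 per-experiment coverage increases, and the paired, one-tailed form shown is our reading of the setup):

```python
from scipy import stats

def one_tailed_paired_t(ours, baseline, alpha=0.05):
    """Test H1: mean(ours - baseline) > 0 with a paired one-tailed t-test."""
    t, p_two = stats.ttest_rel(ours, baseline)
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2  # one-tailed conversion
    return p_one, p_one < alpha
```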
Overall, we conclude that DeepSmartFuzzer provides a significant improvement over existing coverage-guided fuzzers for DNNs.
6   Conclusion & Future Work

In this study, we introduce a novel Coverage Guided Fuzzing (CGF) technique for DNNs that uses Monte Carlo Tree Search (MCTS) to explore and exploit coverage increase patterns. We experimentally show that our method is better than previous CGFs for DNNs in terms of satisfying the subject coverage criteria. Our results also show that MCTS methods can be promising for better DNN testing. We use the naive coverage increase as the reward; therefore, experimentation with reward shaping and with different reinforcement learning methods for this problem is left for future studies. Finally, we share the source code of our experiments and implementation publicly in order to provide a base for future studies.

Acknowledgements

This research was supported in part by the Semiconductor Research Corporation under task 2020-AH-2970.

References

[Ammann and Offutt, 2016] Paul Ammann and Jeff Offutt. Introduction to Software Testing. Cambridge University Press, 2016.
[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57, 2017.
[Chaslot et al., 2008] Guillaume M. J. B. Chaslot, Mark H. M. Winands, H. Jaap van den Herik, Jos W. H. M. Uiterwijk, and Bruno Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(03):343–357, 2008.
[Cireşan et al., 2012] Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3642–3649, 2012.
[Gerasimou et al., 2020] Simos Gerasimou, Hasan Ferit Eniser, Alper Sen, and Alper Cakan. Importance-driven deep learning system testing. In International Conference on Software Engineering (ICSE), 2020.
[Goodfellow et al., 2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
[Hinton et al., 2012] G. Hinton, L. Deng, D. Yu, G. E. Dahl, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[Kim et al., 2019] Jinhan Kim, Robert Feldt, and Shin Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering (ICSE), 2019.
[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In 17th European Conference on Machine Learning (ECML'06), pages 282–293. Springer-Verlag, 2006.
[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[LeCun, 1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist, 1998.
[Ma et al., 2018] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, et al. DeepGauge: Multi-granularity testing criteria for deep learning systems. In IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
[Madry et al., 2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[Odena and Goodfellow, 2018] A. Odena and I. Goodfellow. TensorFuzz: Debugging neural networks with coverage-guided fuzzing. arXiv preprint arXiv:1807.10875, 2018.
[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, et al. The limitations of deep learning in adversarial settings. In International Symposium on Security and Privacy (S&P), pages 372–387, 2016.
[Pei et al., 2017] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated whitebox testing of deep learning systems. In Symposium on Operating Systems Principles (SOSP), pages 1–18, 2017.
[Sun et al., 2018a] Y. Sun, M. Wu, W. Ruan, X. Huang, et al. Concolic testing for deep neural networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), pages 109–119, 2018.
[Sun et al., 2018b] Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, and Rob Ashmore. Testing deep neural networks, 2018.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In International Conference on Neural Information Processing Systems, pages 3104–3112, 2014.
[Szegedy et al., 2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[Wicker et al., 2018] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. Feature-guided black-box safety testing of deep neural networks. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 408–426, 2018.
[Xie et al., 2018] X. Xie, L. Ma, F. Juefei-Xu, H. Chen, et al. DeepHunter: Hunting deep neural network defects via coverage-guided fuzzing. arXiv preprint arXiv:1809.01266, 2018.