<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Gothenburg, Sweden. ACM, New York, NY, USA.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Utility of Neural Network Test Coverage Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rob Ashmore</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alec Banks</string-name>
          <email>abanksg@dstl.gov.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Defence Science and Technology Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this position paper, we are interested in what test coverage measures can, and cannot, tell us about neural networks. We begin with a review of the role of test coverage measures in traditional development approaches for safety-related software. We show how those coverage measures, in the neural network sense, cannot achieve the same aims as their equivalents in the traditional sense. We provide indications of approaches that can partially meet those aims. We also indicate the utility of current neural network coverage measures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Neural Networks (NNs) have demonstrated significant
utility in a range of safety and security related applications (e.g.
autonomous cars
        <xref ref-type="bibr" rid="ref33">(Tian et al. 2018)</xref>
        ). In such cases there is
a need to provide a commensurate level of assurance that a
particular NN implementation is correct. Approaches to NN
assurance have drawn inspiration from those used for
traditional safety-related software. The concept of test coverage
is one such example.
      </p>
      <p>NNs are different to traditional software. We contend that:
these differences fundamentally change the meaning of
several types of test coverage (e.g. requirements and structural
measures); there are approaches that can partially achieve
the intent of these traditional test coverage measures in an
NN context; and currently-proposed measures of NN test
coverage have utility in different ways.</p>
      <p>The remainder of this paper is structured as follows:
Section 2 provides an overview of the role of test coverage
measures in the development of traditional safety-related
software; Section 3 summarises a selection of NN test coverage
measures proposed in the literature; Section 4 outlines
NN-based approaches that can achieve some of the aims of
traditional test coverage; Section 5 summarises the value that can
be gained from current NN coverage measures; Section 6
concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Traditional Safety-Related Software Development</title>
      <p>
        Our discussion of safety-related software development is
based on DO-178C
        <xref ref-type="bibr" rid="ref26">(RTCA 2011)</xref>
        , which is commonly used
in the aerospace domain. This approach is generally
applicable, since the principles adopted in DO-178C are mirrored
elsewhere (e.g.
        <xref ref-type="bibr" rid="ref11">(Hawkins, Habli, and Kelly 2013)</xref>
        ).
      </p>
      <p>
        Before the discussion, we note that although a
significant amount of practical experience provides confidence in
DO-178C, and its predecessor, DO-178B, there is little, if
any, explicit evidence to support the specific coverage
criteria that are used
        <xref ref-type="bibr" rid="ref13">(Holloway 2013)</xref>
        . This limitation is not
restricted to DO-178C: it applies across many safety-related
software standards
        <xref ref-type="bibr" rid="ref10">(Graydon and Holloway 2015)</xref>
        .
Nevertheless, we believe a traditional safety-related software
development standard is an appropriate basis for this work.
This is partly because of the excellent safety record of
industries that use such standards and partly because many NN
test coverage measures are (implicitly or explicitly) based on
measures used in traditional software engineering.
      </p>
      <p>The development of safety-related software begins with
system-level requirements that have been allocated to
software. These high-level requirements are hierarchically
decomposed, in a traceable manner, from requirements that
detail ‘what’ behaviour is needed, down to requirements
that detail ‘how’ to achieve this. The software
requirements are also independently used to produce test cases in a
Requirements-Based Testing (RBT) approach. The tests are
run and the coverage of the code structure is measured
dynamically. RBT is an attractive approach to test design since
it acknowledges that exhaustive testing is (generally
speaking) infeasible and it focusses on the demonstration of the
intended behaviour.</p>
      <p>Assuming all tests pass, if structural coverage is
incomplete then one of the following three conditions holds, where
the notion of correctness directly relates to the intent of the
high-level requirements:</p>
      <list list-type="order">
        <list-item>
          <p>The software’s behaviour is correct, but the software-level requirements are incomplete;</p>
        </list-item>
        <list-item>
          <p>The software-level requirements are correct, but the software includes additional, unnecessary behaviour (we include unreachable as well as redundant code here);</p>
        </list-item>
        <list-item>
          <p>The software behaviour and software-level requirements are correct, but the test set is incomplete. This includes, for example, cases where development tools introduce code for runtime efficiency.</p>
        </list-item>
      </list>
      <p>Considering all three of these conditions provides
confidence that (once a suitable level of coverage has been
achieved) the code sufficiently implements the requirements
and, furthermore, it contains no undesirable additional
behaviour. The link between requirements and behaviour is
important, because requirements are the key mechanism for
communicating expectations about software behaviour with
actors in the development, test and integration of the
software.</p>
      <p>Structural coverage is typically measured at the software
unit level. Equivalently, at this level behaviour can be
defined and verified. Hierarchical decomposition of
requirements, supported by architectural design, provides
confidence that unit-level behaviour can be aggregated to provide
the required system-level behaviour. Not all behaviour can
be tested at this level, thus testing continues through the
integration processes.</p>
      <p>
        The type of structural coverage that is required depends
on the criticality of the software. In less-critical cases,
statement coverage suffices; more demanding cases require
branch coverage; and the most critical cases require
Modified Condition / Decision Coverage (MCDC)
        <xref ref-type="bibr" rid="ref26">(RTCA 2011)</xref>
        .
A useful tutorial on these types of structural coverage can be
found in (Hayhurst et al. 2001). It is apparent that all of these
coverage levels are based on code structure. This reflects an
implicit assumption that code structure, rather than data, is
the main influence on software behaviour. Cases where data
significantly affects behaviour would be expected to be
defined in a requirement and addressed by a test case.
      </p>
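      <p>As an illustration (not drawn from DO-178C itself), the following hypothetical two-condition decision shows how statement, branch and MCDC coverage impose progressively stronger test obligations; the function and test values are ours:</p>

```python
def alarm(temp_high: bool, override: bool) -> bool:
    # A single decision with two conditions.
    if temp_high and not override:
        return True
    return False

# Statement coverage: every statement executed, e.g. the tests
# (True, False) -> True and (False, False) -> False suffice.
# Branch coverage: both decision outcomes; the same two tests suffice.
# MCDC additionally requires each condition to be shown to
# independently flip the decision while the other is held fixed:
tests = [
    ((True, False), True),   # baseline: decision is True
    ((False, False), False), # temp_high flipped alone -> decision flips
    ((True, True), False),   # override flipped alone -> decision flips
]
for inputs, expected in tests:
    assert alarm(*inputs) == expected
```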
      <p>
        The combination of RBT and coverage measurement is
important. Even though behaviour is, typically, not strongly
data-dependent, there is still value in using suitably realistic
test values. Simply optimising a test set to achieve a given
level of coverage, with little or no consideration of the
requirements, is less effective than RBT
        <xref ref-type="bibr" rid="ref6">(Gay et al. 2015)</xref>
        . The
reverse approach, of testing against requirements without
measuring coverage, cannot provide confidence that the
requirements suitably encapsulate the software’s behaviour. It
can say that the software satisfies the requirements, but
provides no information on what else the software may do. This
information is an important part of assuring safety-related
software.
      </p>
      <p>In summary, for traditional software: hierarchical
decomposition of requirements means that software behaviour can
be understood at the unit level; independent interpretation
of requirements provides confidence that requirements have
been implemented correctly; and the combination of RBT
and structural test coverage provide confidence that the
requirements suitably describe the software’s behaviour, both
what it does and what it does not do.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Neural Network Coverage Measures</title>
      <p>Our main interest is in Artificial Intelligence (AI),
particularly AI implemented using Machine Learning (ML)
techniques. Due to their prevalence, we focus on NNs, but much
of what follows is applicable to other forms of ML.</p>
      <p>
        Historically, NN testing was based on measures like
precision and recall, calculated using a set of data that was held
back from training. The inadequacy of these measures is
demonstrated by adversarial inputs
        <xref ref-type="bibr" rid="ref32">(Szegedy et al. 2014)</xref>
        .
Networks that, based on these measures, perform extremely
well against verification data may display significant
undesirable behaviour when exposed to previously unseen, but
valid, inputs. This does not necessarily mean concepts like
precision and recall should be abandoned: it means that, in
isolation, they do not provide sufficient confidence for the
use of NNs in safety-related applications.
      </p>
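      <p>For concreteness, a minimal sketch of how precision and recall are computed over held-back data; the function name and the example labels are illustrative only:</p>

```python
def precision_recall(y_true, y_pred, positive=1):
    # Counts taken over a held-back verification set.
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# tp=1, fp=1, fn=1 -> precision 0.5, recall 0.5. A network can score
# highly on such measures yet still misbehave on adversarial inputs
# that never appear in the held-back data.
p, r = precision_recall([1, 1, 0, 0], [1, 0, 0, 1])
```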
      <p>More recently, a variety of coverage measures, based on
internal properties of an NN, have been proposed. It should
be noted that, to the best of our knowledge, use of these
measures has thus far been demonstrated using von Neumann
architecture hardware. For some, different, architectures it
may not be efficient or even possible to collect the
information on which these measures are based.</p>
      <p>Some notable example coverage measures are
summarised below.</p>
      <p>
        DeepXplore
        <xref ref-type="bibr" rid="ref24">(Pei et al. 2017)</xref>
        aims to systematically test
a collection of similar NNs for erroneous corner case
behaviours. It uses a gradient-guided local search that starts
from a seed input and solves a joint optimisation problem,
seeking to find new inputs that cause different neuron
activation patterns and lead the NNs to behave differently on
the same task. In the same paper, the notion of neuron
coverage is developed based on the fraction of neurons that are
activated for at least one test input, with a neuron
considered active if its output is above a threshold value (e.g. 0).
Equivalently, 100% coverage is achieved if each neuron is
activated at least once by the test set.
      </p>
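      <p>A minimal sketch of neuron coverage as defined above, assuming activations have already been recorded as a matrix of test inputs by neurons; the function name is ours, not DeepXplore's:</p>

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated (output above `threshold`) by at
    least one test input. `activations` is a list over test inputs of
    equal-length lists of neuron outputs."""
    n_neurons = len(activations[0])
    covered = set()
    for outputs in activations:
        for idx, value in enumerate(outputs):
            if value > threshold:
                covered.add(idx)
    return len(covered) / n_neurons

# Two test inputs over four neurons: neurons 0, 1 and 3 fire at
# least once; neuron 2 never does, so coverage is 3/4.
acts = [[0.9, 0.0, 0.0, 0.2],
        [0.0, 0.4, 0.0, 0.7]]
assert neuron_coverage(acts) == 0.75
```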
      <p>
        DeepGauge
        <xref ref-type="bibr" rid="ref20">(Ma et al. 2018)</xref>
        considers two levels of
coverage: neuron-level and layer-level.
      </p>
      <p>Neuron-level coverage splits the output range of each
neuron (established during the training phase) into k equal
sections. The k-multisection coverage of that neuron is
the fraction of sections that the neuron’s output falls into
across all inputs in the test set. The k-multisection
coverage of the NN is the average of these neuron coverages.
Layer-level coverage is based on the l neurons, and
combinations thereof, in each layer that have the greatest
output value. Top-l neuron coverage is the fraction of neurons
that are in the top l neurons in their layer for at least one
of the test inputs.</p>
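      <p>The two DeepGauge-style measures can be sketched as follows for a single layer; the function names and data layout are our assumptions, not DeepGauge's API, and we assume each training-derived range is non-degenerate:</p>

```python
def k_multisection_coverage(ranges, activations, k):
    # ranges[i] = (lo, hi) output range of neuron i, found during
    # training (assumed hi > lo). activations: list over test inputs
    # of per-neuron outputs. Values outside the range fall in no section.
    per_neuron = []
    for i, (lo, hi) in enumerate(ranges):
        width = (hi - lo) / k
        hit = set()
        for outputs in activations:
            v = outputs[i]
            if lo <= v <= hi:
                hit.add(min(int((v - lo) / width), k - 1))
        per_neuron.append(len(hit) / k)
    return sum(per_neuron) / len(per_neuron)

def top_l_coverage(activations, l):
    # Fraction of neurons appearing in the top-l of their layer
    # (a single layer here) for at least one test input.
    n = len(activations[0])
    tops = set()
    for outputs in activations:
        ranked = sorted(range(n), key=lambda i: outputs[i], reverse=True)
        tops.update(ranked[:l])
    return len(tops) / n
```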
      <p>
        DeepCT
        <xref ref-type="bibr" rid="ref22">(Ma et al. 2019)</xref>
        is inspired by combinatorial
testing (see
        <xref ref-type="bibr" rid="ref23">(NIST 2010)</xref>
        ). Two layer-level coverage
measures are defined:
      </p>
      <p>Sparse coverage considers the fraction of m-way subsets
in which all neurons are activated for at least one test
input. For example, a layer of four neurons will have six
2-way subsets.</p>
      <p>Dense coverage considers the fraction of m-way
activation patterns that are activated for at least one test input.
For example, a 2-way subset has four activation patterns.</p>
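      <p>A sketch of sparse and dense coverage for one layer, using boolean activation vectors; the function names are ours. With four neurons there are six 2-way subsets, each with four activation patterns, as in the examples above:</p>

```python
from itertools import combinations

def sparse_coverage(layer_activations, m):
    """Fraction of m-way neuron subsets whose members are all active
    together for at least one test input. `layer_activations`: list
    over test inputs of boolean activation vectors for one layer."""
    n = len(layer_activations[0])
    subsets = list(combinations(range(n), m))
    covered = sum(
        1 for subset in subsets
        if any(all(acts[i] for i in subset) for acts in layer_activations)
    )
    return covered / len(subsets)

def dense_coverage(layer_activations, m):
    # Fraction of (subset, on/off pattern) pairs observed in the test
    # set: each m-way subset has 2**m possible activation patterns.
    n = len(layer_activations[0])
    subsets = list(combinations(range(n), m))
    seen = set()
    for acts in layer_activations:
        for subset in subsets:
            seen.add((subset, tuple(acts[i] for i in subset)))
    return len(seen) / (len(subsets) * 2 ** m)

# Four neurons, two test inputs: subsets (0,1) and (2,3) are fully
# active together at least once, so sparse coverage is 2/6.
acts = [[True, True, False, False],
        [False, False, True, True]]
```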
      <p>
        DeepConcolic
        <xref ref-type="bibr" rid="ref31">(Sun et al. 2019)</xref>
        is based on a variety of
measures, loosely inspired by MCDC:
      </p>
      <p>Sign-sign coverage measures whether the change in
output sign (i.e. moving from zero to non-zero) of a neuron in
layer n independently affects the output sign of a specific
neuron in layer n + 1 (i.e. the subsequent layer).
Value-value coverage measures whether a change in
output value (e.g. similar to the k-sections used by
DeepGauge) of a neuron in layer n independently affects the
output value of a specific neuron in layer n + 1.
The notions of sign-value and value-sign coverage
naturally follow.</p>
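      <p>A sketch of checking one sign-sign condition between a pair of test inputs; the function and data layout are illustrative (not DeepConcolic's API), and 'sign' is reduced to active/inactive:</p>

```python
def sign(x):
    # Treat a positive output as one sign, zero/negative as the other.
    return x > 0.0

def sign_sign_covered(acts_a, acts_b, layer_n, i, layer_np1, j):
    """Check one sign-sign condition over a pair of test inputs.
    acts_a / acts_b map a layer index to that layer's neuron outputs
    for the two inputs. Neuron i in layer n must change sign between
    the pair while every other neuron in layer n keeps its sign, and
    neuron j in layer n+1 must change sign as a result."""
    a_n, b_n = acts_a[layer_n], acts_b[layer_n]
    others_fixed = all(
        sign(a) == sign(b)
        for k, (a, b) in enumerate(zip(a_n, b_n)) if k != i
    )
    i_flipped = sign(a_n[i]) != sign(b_n[i])
    j_flipped = sign(acts_a[layer_np1][j]) != sign(acts_b[layer_np1][j])
    return others_fixed and i_flipped and j_flipped
```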
      <p>These measures show a progression, from simple
aspects of individual neuron behaviour (e.g. activation or
nonactivation), through more complex aspects of individual
neuron behaviour, to the joint behaviour of combinations of
neurons, either in the same layer or in neighbouring layers.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Towards Traditional Coverage for Neural Networks</title>
      <sec id="sec-6-1">
        <title>Constraints</title>
        <p>
          For the purposes of this paper, we are primarily concerned
with structural coverage (although we remain interested in
requirements throughout). For completeness, we note that
other attributes will be important in an overall assurance
argument supporting the use of an NN. A key area is the NN
development process. This is covered in detail in
          <xref ref-type="bibr" rid="ref19 ref2">(Ashmore,
Calinescu, and Paterson 2019)</xref>
          and SCSC-153A
          <xref ref-type="bibr" rid="ref29">(SASWG
2020)</xref>
          .
        </p>
        <p>Recall that, for traditional software, structural coverage
measures, used in conjunction with RBT, provide a level of
confidence that software satisfies requirements and that
requirements cover software behaviour. When viewed in that light
it is apparent that the types of NN coverage measures
discussed above cannot achieve the same aims as traditional
software coverage measures. As illustrated in the following
paragraphs, none of the requisite building blocks, which
allow traditional coverage to work in this way, are present in
the NN context.</p>
        <p>We do not have a complete set of requirements. There is
often a good understanding of the purpose of an NN, for
example: ‘recognise hand-written digits’ or ‘determine
sentiment from social media messages’. However, in our
experience, there is rarely, if ever, a set of requirements that is
accurate, complete, unambiguous and verifiable.</p>
        <p>
          Not all NN requirements can be hierarchically
decomposed. Some level of methodical decomposition is often
possible; for example, ‘recognise hand-written digits’ could be
decomposed to explicitly cater for ‘triangular 4s’, ‘open 4s’
and ‘crossed 7s’. But, requirements like these cannot be
decomposed to a level that can be directly coded against. In
particular, in a safety-related environment, if we could
decompose to a directly-codeable level then there would be no
need to use an NN and traditional software would be the
preferred approach
          <xref ref-type="bibr" rid="ref27">(Salay and Czarnecki 2018)</xref>
          .
        </p>
        <p>
          We do not have a meaningful software unit level at which
software behaviour can be described. As indicated
previously, current NN coverage measures focus on neuron
behaviour, either individually or in patterns. Although this
behaviour controls the network’s output, it does not (and
cannot) describe software behaviour in a way that is meaningful
to a user. Some approaches to explainability may help, for
example: identifying pixels that positively weight towards a
particular class
          <xref ref-type="bibr" rid="ref25">(Ribeiro, Singh, and Guestrin 2016)</xref>
          or
visualising feature maps from a final convolutional layer
          <xref ref-type="bibr" rid="ref3">(Chattopadhay et al. 2018)</xref>
          . Whilst useful, these do not provide the
requisite understanding. Furthermore, they do not generalise
across all types of NNs or all NN applications.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Software Satisfies Requirements</title>
        <p>One of the strengths of NNs is their ability to generalise
from incomplete specifications. It may seem that asking for
a demonstration that an NN satisfies a set of requirements
negates this strength.</p>
        <p>From our perspective, the distinction is in the level at
which requirements are expressed. Placing, and verifying,
requirements on the network’s internals does not provide
traceability between requirements and behaviour. That
traceability might be achieved by placing (an appropriate set of)
requirements on the NN’s Input → Output behaviour.</p>
        <p>
          Adversarial examples
          <xref ref-type="bibr" rid="ref32">(Szegedy et al. 2014)</xref>
          are one aspect
of this behaviour. ‘Robustness’ to these examples has often
been suggested, sometimes implicitly, as a requirement that
an NN should meet. For feed forward networks, this can be
demonstrated, or a counter-example found, using techniques
based on Satisfiability Modulo Theories (SMT)
          <xref ref-type="bibr" rid="ref14">(Huang et al.
2017)</xref>
          .
        </p>
        <p>
          DeepPoly
          <xref ref-type="bibr" rid="ref30">(Singh et al. 2019)</xref>
          couples an abstract domain
(specifically, a combination of floating-point polyhedra with
intervals) with abstract transformers for common neural
network functions. This allows guarantees to be made, for
example, that all samples within an L1 ball will be classified
correctly, or that all image rotations (up to a given angle)
will be classified correctly. Note that, in these cases,
‘classified correctly’ means all samples in the region will be given
the same class. It is assumed that this behaviour is correct.
        </p>
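        <p>DeepPoly's floating-point polyhedra are considerably more precise than plain intervals, but the underlying idea of soundly propagating an input region through the network can be sketched with interval arithmetic; this is a simplification of ours, not DeepPoly's actual abstract transformer:</p>

```python
def interval_affine_relu(lower, upper, weights, biases):
    """Propagate per-input intervals [lower[i], upper[i]] through one
    affine layer followed by ReLU, using plain interval arithmetic.
    The result soundly over-approximates every output reachable from
    any concrete input in the region."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, biases):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # Each weight picks the interval end that minimises/maximises.
            lo += w * (l if w >= 0 else u)
            hi += w * (u if w >= 0 else l)
        out_lo.append(max(lo, 0.0))  # ReLU clips below zero
        out_hi.append(max(hi, 0.0))
    return out_lo, out_hi

# If the lower bound of the 'correct' logit exceeds the upper bounds
# of all other logits over the whole input region, every sample in
# the region receives the same class.
lo, hi = interval_affine_relu([0.0, 0.0], [1.0, 1.0],
                              [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
```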
        <p>
          Marabou (Katz et al. 2019) is an SMT-based tool that can
be used to check whether a particular NN satisfies a specific
property. If the property is not satisfied then a concrete input
for which the property is violated (i.e. a counter-example)
is provided. Amongst other things, the tool has been used
to prove properties about an Airborne Collision Avoidance
System (ACAS) for unmanned aircraft
          <xref ref-type="bibr" rid="ref17">(Julian et al. 2016)</xref>
          .
        </p>
        <p>
          Deep Learning with Differentiable Logic (DL2)
          <xref ref-type="bibr" rid="ref5">(Fischer
et al. 2018)</xref>
          can support training and querying of NNs. In the
training application, logical constraints are translated into
non-negative loss functions, which are incorporated into the
overall optimisation the training is attempting to complete.
The querying application allows constraints over properties
not directly computed by the network (e.g. constraints can
consider the likelihood of an input being in one of a set
of classes). These types of constraint can readily express
system-level requirements.
        </p>
        <p>
          Even if the (suitably-measured) performance of an NN
is adequate, this does not necessarily mean that an NN is
making decisions in the same way as a human would. There
is evidence that NNs use highly-predictive, but non-intuitive
(to a human) features to support classification
          <xref ref-type="bibr" rid="ref16">(Ilyas et al.
2019)</xref>
          . If the distinction between ‘good performance’ and
‘good performance, deciding like a human’ is important then
it should be captured as a requirement and explicitly tested.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>Requirements Cover Software Behaviour</title>
        <p>As noted earlier, a key question is whether there are aspects
of software behaviour that are not captured by the
requirements. That is, if a user fully understands the requirements
then will they ever be surprised by the software’s behaviour?
This latter formulation is helpful as it clarifies what we mean
by ‘behaviour’. In particular, we are interested in behaviour
that is externally observable. From the perspective of an
NN, we are primarily interested in behaviour in the sense
of Input → Output mappings.</p>
        <p>
          In the NN context, one way of categorising different
contributors to behaviour is by considering different types of
input. Four related spaces are defined in
          <xref ref-type="bibr" rid="ref19 ref2">(Ashmore, Calinescu,
and Paterson 2019)</xref>
          : the input domain space, which are
inputs that the NN can accept; the operational domain space,
which are inputs that the NN may receive when used
operationally; the failure domain space, which are inputs the NN
may receive if there are failures elsewhere in the system; and
the adversarial domain space, which are inputs the NN may
receive if it is being attacked by an adversary.
        </p>
        <p>
          Using this structure there are a number of techniques
that can help increase coverage of potential NN behaviour:
space-filling designs for computer experiments
          <xref ref-type="bibr" rid="ref27 ref28">(Santner,
Williams, and Notz 2018)</xref>
          are potentially relevant for the
input domain; the notion of situation coverage
          <xref ref-type="bibr" rid="ref1 ref10">(Alexander,
Hawkins, and Rae 2015)</xref>
          is potentially relevant for the
operational domain; Failure Modes and Effects Analysis (FMEA)
should be useful for the failure domain; and ‘red teaming’
          <xref ref-type="bibr" rid="ref18">(Kardos and Dexter 2017)</xref>
          should inform the adversarial
domain. Whilst they are useful, none of these techniques define
a precise boundary of testing sufficiency. Greater experience
of the practical use of NNs is likely to be required before
such boundaries can be set.
        </p>
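        <p>As one example of a space-filling design for the input domain, a simple Latin hypercube sample over the unit cube; the function is an illustrative sketch, not taken from the cited texts:</p>

```python
import random

def latin_hypercube(n_samples, n_dims, rng=None):
    """A simple Latin hypercube design over the unit cube: each
    dimension is split into n_samples equal strata and every stratum
    is sampled exactly once, spreading points across the domain."""
    rng = rng or random.Random(0)
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        # Jitter each point uniformly within its stratum.
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Ten points in two dimensions: every point lies in the unit square
# and each dimension's ten strata are each hit exactly once.
points = latin_hypercube(10, 2)
```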
        <p>These approaches provide a ‘forward-looking’ way of
understanding behaviour; they rely on choosing inputs to
invoke behaviour. This differs from the approach to traditional
software requirements and structural coverage, where
behaviour is invoked, in a sense, from within the behavioural
regimes defined in the low-level requirements structure.</p>
        <p>Generative Adversarial Networks (GANs) could help fill
this gap. For example, they could be used to find plausible
operational domain inputs that exhibit different behaviours
to those observed in the training data. These could be inputs
that are similar to training samples but result in different
outputs; for classification networks, this is the same as finding
adversarial inputs. Alternatively, they could be used to find
inputs that are sufficiently different from any sample in the
training data.</p>
        <p>Another approach involves looking for specific
undesirable behaviours in an NN. Detection and mitigation of
backdoor attacks (Wang et al. 2019) is one example.</p>
        <p>In some cases, specifically for feed-forward networks, it
may be possible to automatically infer formal properties
(Gopinath et al. 2019). This is a helpful way of
understanding aspects of the NN’s behaviour. However, there is no
guarantee that the inferred properties will be meaningful, in
the sense of system-level requirements.</p>
        <p>The way that NNs are constructed means test coverage
measures cannot perform the same function as they do for
traditional software. There are a number of approaches that can
be used to provide some confidence that the NN satisfies
requirements. There are also approaches that can provide some
confidence that the NN’s behaviour is understood.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 The Utility of NN Coverage Measures</title>
      <p>The previous discussion highlights a distinction between the
implicit motivation for NN coverage measures and their
utility. We propose four ways in which NN coverage measures,
like those discussed in Section 3, could have utility.</p>
      <p>Firstly, following the analogy that training data represent
the low-level behavioural requirements, the measures could
be used to optimise training data, for example, by
identifying whether a larger, or more diverse, training set was
exercising a larger portion of internal network behaviour. If
it is, then this is an argument for using an alternative,
possibly larger, data set, despite the additional overheads and
increased risk of overfitting.</p>
      <p>Secondly, the measures could be used to compare training
data, used during development, with verification data, used
by an independent team. Suppose, for example, that the full
set of training and verification data achieves greater
coverage than just the training data. This outcome would
demonstrate that the independent verification activity is exercising
additional types of internal NN behaviour to those observed
during the development phase. This could be evidence that
a suitably independent verification activity has been
conducted.</p>
      <p>Thirdly, many of the approaches used to measure NN
coverage can also be used to generate additional inputs that
would extend coverage. As such, they provide an
indication of ways in which training data could be meaningfully
extended. Obviously, for situations where the NN is being
developed using supervised learning, the appropriate output
needs to be produced for each of these new inputs.</p>
      <p>Fourthly, the measures can be used to choose between two
different NNs that otherwise offer similar levels of
performance. In such situations, the NN for which a fixed set of
training data achieves greater coverage might be preferred.
In general, this would be expected to be the NN with the
simpler structure.</p>
    </sec>
    <sec id="sec-8">
      <title>6 Conclusions</title>
      <p>NNs have demonstrated significant utility. Their use in
safety-related systems is predicated on confidence in their
behaviour, both what they do and what they do not do. For
traditional safety-related software much of this confidence
comes from test coverage measures in an RBT context. The
approaches used to measure test coverage for NNs cannot
provide an equivalent confidence.</p>
      <p>There are approaches that can provide some aspects of
this confidence. Appropriate consideration of different
input spaces can help, as can GAN-based methods for finding
‘new’ inputs. NN test coverage measures can provide value
in other ways. They can, for example, provide a principled,
structured way of choosing between different training data
sets, or between different trained models.</p>
      <p>In conclusion, NN test coverage measures can have
significant utility, but they represent different types of confidence
from those found in their traditional software testing forebears.
More work is required before a holistic and complete
understanding is achieved of the relationship between these
coverage measures and confidence in NN behaviour.</p>
    </sec>
    <sec id="sec-9">
      <title>Disclaimer</title>
      <p>This document is an overview of UK MOD sponsored research and
is released for informational purposes only. The contents of this
document should not be interpreted as representing the views of the
UK MOD, nor should it be assumed that they reflect any current or
future UK MOD policy. The information contained in this
document cannot supersede any statutory or contractual requirements
or liabilities and is offered without prejudice or commitment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Alexander, R.; Hawkins, H. R.; and Rae, A. J. 2015. Situation coverage - a coverage criterion for testing autonomous robots. University of York.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Ashmore, R.; Calinescu, R.; and Paterson, C. 2019. Assuring the machine learning lifecycle: Desiderata, methods, and challenges. arXiv preprint arXiv:1905.04223.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839-847. IEEE.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; and Vechev, M. 2018. DL2: Training and querying neural networks with logic.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Gay</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Staats</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Whalen</surname>, <given-names>M.</given-names></string-name>; and
          <string-name><surname>Heimdahl</surname>, <given-names>M.</given-names></string-name>
          <year>2015</year>.
          <article-title>The Risks of Coverage-Directed Test Case Generation</article-title>.
          <source>IEEE Transactions on Software Engineering</source>
          <volume>41</volume>(<issue>8</issue>): <fpage>803</fpage>-<lpage>819</lpage>. ISSN 0098-5589.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <year>2019</year>.
          <article-title>Property Inference for Deep Neural Networks</article-title>.
          arXiv:1904.13215v2.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Graydon</surname>, <given-names>P. J.</given-names></string-name>; and
          <string-name><surname>Holloway</surname>, <given-names>C. M.</given-names></string-name>
          <year>2015</year>.
          <article-title>Planning the Unplanned Experiment: Assessing the Efficacy of Standards for Safety Critical Software</article-title>.
          <source>Technical Report NASA/TM-2015-218804</source>, NASA.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Hawkins</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Habli</surname>, <given-names>I.</given-names></string-name>; and
          <string-name><surname>Kelly</surname>, <given-names>T.</given-names></string-name>
          <year>2013</year>.
          <article-title>The principles of software safety assurance</article-title>.
          <source>31st International System Safety Conference</source>, Boston, Massachusetts, USA.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <year>2001</year>.
          <article-title>A Practical Tutorial on Modified Condition/Decision Coverage</article-title>.
          <source>Technical report</source>, NASA.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Holloway</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Making the Implicit Explicit: Towards An Assurance Case for DO-178C</article-title>
          . NASA Langley Research Center.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name><surname>Huang</surname>, <given-names>X.</given-names></string-name>;
          <string-name><surname>Kwiatkowska</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>; and
          <string-name><surname>Wu</surname>, <given-names>M.</given-names></string-name>
          <year>2017</year>.
          <article-title>Safety Verification of Deep Neural Networks</article-title>.
          <source>In Proceedings of the 29th International Conference on Computer Aided Verification</source>,
          <fpage>3</fpage>-<lpage>29</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Santurkar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tsipras</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Engstrom</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Madry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Adversarial examples are not bugs, they are features</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <fpage>125</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name><surname>Julian</surname>, <given-names>K. D.</given-names></string-name>;
          <string-name><surname>Lopez</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Brush</surname>, <given-names>J. S.</given-names></string-name>;
          <string-name><surname>Owen</surname>, <given-names>M. P.</given-names></string-name>; and
          <string-name><surname>Kochenderfer</surname>, <given-names>M. J.</given-names></string-name>
          <year>2016</year>
          .
          <article-title>Policy compression for aircraft collision avoidance systems</article-title>
          .
          <source>In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name><surname>Kardos</surname>, <given-names>M.</given-names></string-name>; and
          <string-name><surname>Dexter</surname>, <given-names>P.</given-names></string-name>
          <year>2017</year>.
          <article-title>A Simple Handbook for Non-Traditional Red Teaming</article-title>.
          <source>Technical report</source>, Defence Science and Technology Group Edinburgh, SA.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <year>2019</year>
          .
          <article-title>The Marabou framework for verification and analysis of deep neural networks</article-title>
          .
          <source>In International Conference on Computer Aided Verification</source>
          ,
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name><surname>Ma</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Juefei-Xu</surname>, <given-names>F.</given-names></string-name>;
          <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Su</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Zhang</surname>, <given-names>F.</given-names></string-name>;
          <string-name><surname>Xue</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>B.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Zhao</surname>, <given-names>J.</given-names></string-name>; and
          <string-name><surname>Wang</surname>, <given-names>Y.</given-names></string-name>
          <year>2018</year>.
          <article-title>DeepGauge: Comprehensive and Multi-Granularity Testing Criteria for Gauging the Robustness of Deep Learning Systems</article-title>.
          arXiv:1803.07519.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name><surname>Ma</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Juefei-Xu</surname>, <given-names>F.</given-names></string-name>;
          <string-name><surname>Xue</surname>, <given-names>M.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>B.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>; and
          <string-name><surname>Zhao</surname>, <given-names>J.</given-names></string-name>
          <year>2019</year>.
          <article-title>DeepCT: Tomographic Combinatorial Testing for Deep Learning Systems</article-title>.
          <source>In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering</source>,
          <fpage>614</fpage>-<lpage>618</lpage>. IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          NIST. <year>2010</year>.
          <article-title>Practical Combinatorial Testing</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name><surname>Pei</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Cao</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Yang</surname>, <given-names>J.</given-names></string-name>; and
          <string-name><surname>Jana</surname>, <given-names>S.</given-names></string-name>
          <year>2017</year>.
          <article-title>DeepXplore: Automated Whitebox Testing of Deep Learning Systems</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name><surname>Ribeiro</surname>, <given-names>M. T.</given-names></string-name>;
          <string-name><surname>Singh</surname>, <given-names>S.</given-names></string-name>; and
          <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name>
          <year>2016</year>
          .
          <article-title>Why Should I Trust You?: Explaining the Predictions of Any Classifier</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          RTCA. <year>2011</year>.
          <source>DO-178C: Software Considerations in Airborne Systems and Equipment Certification</source>.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name><surname>Salay</surname>, <given-names>R.</given-names></string-name>; and
          <string-name><surname>Czarnecki</surname>, <given-names>K.</given-names></string-name>
          <year>2018</year>.
          <article-title>Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262</article-title>.
          arXiv:1808.01614.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name><surname>Santner</surname>, <given-names>T. J.</given-names></string-name>;
          <string-name><surname>Williams</surname>, <given-names>B. J.</given-names></string-name>; and
          <string-name><surname>Notz</surname>, <given-names>W. I.</given-names></string-name>
          <year>2018</year>.
          <article-title>Space-filling designs for computer experiments</article-title>.
          <source>In The Design and Analysis of Computer Experiments</source>,
          <fpage>145</fpage>-<lpage>200</lpage>. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          SASWG. <year>2020</year>.
          <article-title>SCSC-153A: Safety Assurance Objectives for Autonomous Systems</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name><surname>Singh</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Gehr</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Püschel</surname>, <given-names>M.</given-names></string-name>; and
          <string-name><surname>Vechev</surname>, <given-names>M.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>An abstract domain for certifying neural networks</article-title>
          .
          <source>Proceedings of the ACM on Programming Languages 3(POPL)</source>
          :
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name><surname>Sun</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Huang</surname>, <given-names>X.</given-names></string-name>;
          <string-name><surname>Kroening</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Sharp</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Hill</surname>, <given-names>M.</given-names></string-name>; and
          <string-name><surname>Ashmore</surname>, <given-names>R.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>Structural Test Coverage Criteria for Deep Neural Networks</article-title>
          .
          <source>ACM Transactions on Embedded Computing Systems (TECS)</source>
          <volume>18</volume>
          (
          <issue>5s</issue>
          ):
          <fpage>94</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name><surname>Szegedy</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Zaremba</surname>, <given-names>W.</given-names></string-name>;
          <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>;
          <string-name><surname>Bruna</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Erhan</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Goodfellow</surname>, <given-names>I.</given-names></string-name>; and
          <string-name><surname>Fergus</surname>, <given-names>R.</given-names></string-name>
          <year>2014</year>
          .
          <article-title>Intriguing Properties of Neural Networks</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Learning Representations</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name><surname>Tian</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Pei</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Jana</surname>, <given-names>S.</given-names></string-name>; and
          <string-name><surname>Ray</surname>, <given-names>B.</given-names></string-name>
          <year>2018</year>.
          <article-title>DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars</article-title>.
          <source>In ICSE '18: 40th International Conference on Software Engineering</source>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name><surname>Wang</surname>, <given-names>B.</given-names></string-name>;
          <string-name><surname>Yao</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Shan</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>H.</given-names></string-name>;
          <string-name><surname>Viswanath</surname>, <given-names>B.</given-names></string-name>;
          <string-name><surname>Zheng</surname>, <given-names>H.</given-names></string-name>; and
          <string-name><surname>Zhao</surname>, <given-names>B. Y.</given-names></string-name>
          <year>2019</year>.
          <article-title>Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks</article-title>.
          <source>In 2019 IEEE Symposium on Security and Privacy (SP)</source>,
          <fpage>707</fpage>-<lpage>723</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>