The Utility of Neural Network Test Coverage Measures

Rob Ashmore, Alec Banks
Defence Science and Technology Laboratory
{rdashmore,abanks}@dstl.gov.uk

Abstract

In this position paper, we are interested in what test coverage measures can, and cannot, tell us about neural networks. We begin with a review of the role of test coverage measures in traditional development approaches for safety-related software. We show how those coverage measures, in the neural network sense, cannot achieve the same aims as their equivalents in the traditional sense. We provide indications of approaches that can partially meet those aims. We also indicate the utility of current neural network coverage measures.

Crown Copyright © 2021. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural Networks (NNs) have demonstrated significant utility in a range of safety and security related applications (e.g. autonomous cars (Tian et al. 2018)). In such cases there is a need to provide a commensurate level of assurance that a particular NN implementation is correct. Approaches to NN assurance have drawn inspiration from those used for traditional safety-related software. The concept of test coverage is one such example.

NNs are different to traditional software. We contend that: these differences fundamentally change the meaning of several types of test coverage (e.g. requirements and structural measures); there are approaches that can partially achieve the intent of these traditional test coverage measures in an NN context; and currently-proposed measures of NN test coverage have utility in different ways.

The remainder of this paper is structured as follows: Section 2 provides an overview of the role of test coverage measures in the development of traditional safety-related software; Section 3 summarises a selection of NN test coverage measures proposed in the literature; Section 4 outlines NN-based approaches that can achieve some of the aims of traditional test coverage; Section 5 summarises the value that can be gained from current NN coverage measures; Section 6 concludes the paper.

2 Traditional Safety-Related Software Development

Our discussion of safety-related software development is based on DO-178C (RTCA 2011), which is commonly used in the aerospace domain. This approach is generally applicable, since the principles adopted in DO-178C are mirrored elsewhere (e.g. (Hawkins, Habli, and Kelly 2013)).

Before the discussion, we note that although a significant amount of practical experience provides confidence in DO-178C, and its predecessor, DO-178B, there is little, if any, explicit evidence to support the specific coverage criteria that are used (Holloway 2013). This limitation is not restricted to DO-178C: it applies across many safety-related software standards (Graydon and Holloway 2015). Nevertheless, we believe a traditional safety-related software development standard is an appropriate basis for this work. This is partly because of the excellent safety record of industries that use such standards and partly because many NN test coverage measures are (implicitly or explicitly) based on measures used in traditional software engineering.

The development of safety-related software begins with system-level requirements that have been allocated to software. These high-level requirements are hierarchically decomposed, in a traceable manner, from requirements that detail ‘what’ behaviour is needed, down to requirements that detail ‘how’ to achieve this. The software requirements are also independently used to produce test cases in a Requirements-Based Testing (RBT) approach. The tests are run and the coverage of the code structure is measured dynamically. RBT is an attractive approach to test design since it acknowledges that exhaustive testing is (generally speaking) infeasible and it focusses on the demonstration of the intended behaviour.

Assuming all tests pass, if structural coverage is incomplete then one of the following three conditions holds, where the notion of correctness directly relates to the intent of the high-level requirements:

• The software’s behaviour is correct, but the software-level requirements are incomplete;
• The software-level requirements are correct, but the software includes additional, unnecessary behaviour (we include unreachable as well as redundant code here);
• The software behaviour and software-level requirements are correct, but the test set is incomplete. This includes, for example, cases where development tools introduce code for runtime efficiency.

Considering all three of these conditions provides confidence that (once a suitable level of coverage has been achieved) the code sufficiently implements the requirements and, furthermore, it contains no undesirable additional behaviour. The link between requirements and behaviour is important, because requirements are the key mechanism for communicating expectations about software behaviour to actors in the development, test and integration of the software.

Structural coverage is typically measured at the software unit level. Equivalently, at this level behaviour can be defined, and verified. Hierarchical decomposition of requirements, supported by architectural design, provides confidence that unit-level behaviour can be aggregated to provide the required system-level behaviour. Not all behaviour can be tested at this level, thus testing continues through the integration processes.

The type of structural coverage that is required depends on the criticality of the software. In less-critical cases, statement coverage suffices; more demanding cases require branch coverage; and the most critical cases require Modified Condition/Decision Coverage (MCDC) (RTCA 2011).
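The difference between these coverage levels can be illustrated with a small sketch. The decision and its condition names below are hypothetical, chosen only to show why MCDC demands more than branch coverage:

```python
# Illustrative only: a hypothetical two-condition decision, used to
# contrast statement, branch and MCDC coverage.
def authorise(engaged: bool, altitude_ok: bool) -> bool:
    if engaged and altitude_ok:
        return True
    return False

# Branch coverage requires both outcomes of the decision, e.g.:
branch_tests = [(True, True), (False, False)]

# MCDC additionally requires each condition to be shown to
# independently affect the decision: flip one condition, hold the
# other fixed, and observe the decision change.
mcdc_tests = [
    (True, True),    # decision is True
    (False, True),   # flipping `engaged` alone flips the decision
    (True, False),   # flipping `altitude_ok` alone flips the decision
]
assert [authorise(*t) for t in mcdc_tests] == [True, False, False]
```

For a decision with n independent conditions, MCDC can typically be achieved with n + 1 test cases, rather than the 2**n needed for exhaustive condition testing.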
A useful tutorial on these types of structural coverage can be found in (Hayhurst et al. 2001). It is apparent that all of these coverage levels are based on code structure. This reflects an implicit assumption that code structure, rather than data, is the main influence on software behaviour. Cases where data significantly affects behaviour would be expected to be defined in a requirement and addressed by a test case.

The combination of RBT and coverage measurement is important. Even though behaviour is, typically, not strongly data-dependent, there is still value in using suitably realistic test values. Simply optimising a test set to achieve a given level of coverage, with little or no consideration of the requirements, is less effective than RBT (Gay et al. 2015). The reverse approach, of testing against requirements without measuring coverage, cannot provide confidence that the requirements suitably encapsulate the software’s behaviour. It can say that the software satisfies the requirements, but provides no information on what else the software may do. This information is an important part of assuring safety-related software.

In summary, for traditional software: hierarchical decomposition of requirements means that software behaviour can be understood at the unit level; independent interpretation of requirements provides confidence that requirements have been implemented correctly; and the combination of RBT and structural test coverage provides confidence that the requirements suitably describe the software’s behaviour, both what it does and what it does not do.

3 Neural Network Coverage Measures

Our main interest is in Artificial Intelligence (AI), particularly AI implemented using Machine Learning (ML) techniques. Due to their prevalence, we focus on NNs, but much of what follows is applicable to other forms of ML.

Historically, NN testing was based on measures like precision and recall, calculated using a set of data that was held back from training. The inadequacy of these measures is demonstrated by adversarial inputs (Szegedy et al. 2014). Networks that, based on these measures, perform extremely well against verification data may display significant undesirable behaviour when exposed to previously unseen, but valid, inputs. This does not necessarily mean concepts like precision and recall should be abandoned: it means that, in isolation, they do not provide sufficient confidence for the use of NNs in safety-related applications.

More recently, a variety of coverage measures, based on internal properties of an NN, have been proposed. It should be noted that, to the best of our knowledge, use of these measures has thus far been demonstrated using von Neumann architecture hardware. For some different architectures it may not be efficient, or even possible, to collect the information on which these measures are based.

Some notable example coverage measures are summarised below.

DeepXplore (Pei et al. 2017) aims to systematically test a collection of similar NNs for erroneous corner case behaviours. It uses a gradient-guided local search that starts from a seed input and solves a joint optimisation problem, seeking to find new inputs that cause different neuron activation patterns and lead the NNs to behave differently on the same task. In the same paper, the notion of neuron coverage is developed, based on the fraction of neurons that are activated for at least one test input, with a neuron considered active if its output is above a threshold value (e.g. 0). Equivalently, 100% coverage is achieved if each neuron is activated at least once by the test set.

DeepGauge (Ma et al. 2018) considers two levels of coverage: neuron-level and layer-level.

• Neuron-level coverage splits the output range of each neuron (established during the training phase) into k equal sections. The k-multisection coverage of that neuron is the fraction of sections that the neuron’s output falls into across all inputs in the test set. The k-multisection coverage of the NN is the average of these neuron coverages.
• Layer-level coverage is based on the l neurons, and combinations thereof, in each layer that have the greatest output value. Top-l neuron coverage is the fraction of neurons that are in the top l neurons in their layer for at least one of the test inputs.
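The two neuron-level notions above can be sketched in a few lines. This is an illustrative reconstruction from the definitions as summarised here, not the tools’ reference implementations; the data structures (a mapping from test input to per-neuron outputs, and per-neuron training-time ranges) are assumptions made for the sketch:

```python
from collections import defaultdict

def neuron_coverage(activations, threshold=0.0):
    """Neuron coverage in the DeepXplore sense: the fraction of neurons
    whose output exceeds `threshold` for at least one test input.
    `activations` maps each test input to a flat list of per-neuron
    outputs."""
    num_neurons = len(next(iter(activations.values())))
    covered = set()
    for outputs in activations.values():
        covered |= {i for i, v in enumerate(outputs) if v > threshold}
    return len(covered) / num_neurons

def k_multisection_coverage(activations, ranges, k=5):
    """k-multisection coverage in the DeepGauge sense: split each
    neuron's training-time range [lo, hi] into k equal sections and
    average, over neurons, the fraction of sections hit by the test
    set. Outputs outside the training-time range fall in no section."""
    hit = defaultdict(set)
    for outputs in activations.values():
        for i, v in enumerate(outputs):
            lo, hi = ranges[i]
            if lo <= v <= hi:
                hit[i].add(min(int((v - lo) / (hi - lo) * k), k - 1))
    return sum(len(s) for s in hit.values()) / (k * len(ranges))
```

For instance, activations `{0: [0.2, -0.3], 1: [0.9, 0.1]}` over two neurons give a neuron coverage of 1.0 at the default threshold, but only 0.5 at a threshold of 0.5; the same test set makes a coarser or finer impression on k-multisection coverage depending on the choice of k.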
DeepCT (Ma et al. 2019) is inspired by combinatorial testing (see (NIST 2010)). Two layer-level coverage measures are defined:

• Sparse coverage considers the fraction of m-way subsets in which all neurons are activated for at least one test input. For example, a layer of four neurons will have six 2-way subsets.
• Dense coverage considers the fraction of m-way activation patterns that are activated for at least one test input. For example, a 2-way subset has four activation patterns.
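The subset and pattern counts underlying these two measures follow from elementary combinatorics; a minimal sketch, using a hypothetical four-neuron layer:

```python
from itertools import combinations, product

# A hypothetical layer of four neurons.
layer = ["n1", "n2", "n3", "n4"]

# Sparse coverage ranges over the m-way subsets of the layer;
# for m = 2 there are C(4, 2) = 6 such subsets.
pairs = list(combinations(layer, 2))
assert len(pairs) == 6

# Dense coverage ranges over the activation patterns of each subset;
# a 2-way subset has 2**2 = 4 active/inactive patterns.
patterns = list(product([False, True], repeat=2))
assert len(patterns) == 4
```

The number of combinations to cover grows quickly with both layer width and m, which is why DeepCT, like combinatorial testing generally, concentrates on small values of m.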
DeepConcolic (Sun et al. 2019) is based on a variety of measures, loosely inspired by MCDC:

• Sign-sign coverage measures whether the change in output sign (i.e. moving from zero to non-zero) of a neuron in layer n independently affects the output sign of a specific neuron in layer n + 1 (i.e. the subsequent layer).
• Value-value coverage measures whether a change in output value (e.g. similar to the k-sections used by DeepGauge) of a neuron in layer n independently affects the output value of a specific neuron in layer n + 1.
• The notions of sign-value and value-sign coverage naturally follow.

These measures show a progression, from simple aspects of individual neuron behaviour (e.g. activation or non-activation), through more complex aspects of individual neuron behaviour, to the joint behaviour of combinations of neurons, either in the same layer or in neighbouring layers.

4 Towards Traditional Coverage for Neural Networks

Constraints

For the purposes of this paper, we are primarily concerned with structural coverage (although we remain interested in requirements throughout). For completeness, we note that other attributes will be important in an overall assurance argument supporting the use of an NN. A key area is the NN development process. This is covered in detail in (Ashmore, Calinescu, and Paterson 2019) and SCSC-153A (SASWG 2020).

Recall that traditional software structural test coverage measures, when used in conjunction with RBT, provide a level of confidence that software satisfies requirements and that requirements cover software behaviour. When viewed in that light, it is apparent that the types of NN coverage measures discussed above cannot achieve the same aims as traditional software coverage measures. As illustrated in the following paragraphs, none of the requisite building blocks, which allow traditional coverage to work in this way, are present in the NN context.

We do not have a complete set of requirements. There is often a good understanding of the purpose of an NN, for example: ‘recognise hand-written digits’ or ‘determine sentiment from social media messages’. However, in our experience, there is rarely, if ever, a set of requirements that is accurate, complete, unambiguous and verifiable.

Not all NN requirements can be hierarchically decomposed. Some level of methodical decomposition is often possible; for example, ‘recognise hand-written digits’ could be decomposed to explicitly cater for ‘triangular 4s’, ‘open 4s’ and ‘crossed 7s’. But requirements like these cannot be decomposed to a level that can be directly coded against. In particular, in a safety-related environment, if we could decompose to a directly-codeable level then there would be no need to use an NN and traditional software would be the preferred approach (Salay and Czarnecki 2018).

We do not have a meaningful software unit level at which software behaviour can be described. As indicated previously, current NN coverage measures focus on neuron behaviour, either individually or in patterns. Although this behaviour controls the network’s output, it does not (and cannot) describe software behaviour in a way that is meaningful to a user. Some approaches to explainability may help, for example: identifying pixels that positively weight towards a particular class (Ribeiro, Singh, and Guestrin 2016) or visualising feature maps from a final convolutional layer (Chattopadhay et al. 2018). Whilst useful, these do not provide the requisite understanding. Furthermore, they do not generalise across all types of NNs or all NN applications.

Software Satisfies Requirements

One of the strengths of NNs is their ability to generalise from incomplete specifications. It may seem that asking for a demonstration that an NN satisfies a set of requirements negates this strength. From our perspective, the distinction is in the level at which requirements are expressed. Placing, and verifying, requirements on the network’s internals does not provide traceability between requirements and behaviour. That traceability might be achieved by placing (an appropriate set of) requirements on the NN’s Input → Output behaviour.

Adversarial examples (Szegedy et al. 2014) are one aspect of this behaviour. ‘Robustness’ to these examples has often been suggested, sometimes implicitly, as a requirement that an NN should meet. For feed-forward networks, this can be demonstrated, or a counter-example found, using techniques based on Satisfiability Modulo Theories (SMT) (Huang et al. 2017).

DeepPoly (Singh et al. 2019) couples an abstract domain (specifically, a combination of floating-point polyhedra with intervals) with abstract transformers for common neural network functions. This allows guarantees to be made, for example, that all samples within an L∞ ball will be classified correctly, or that all image rotations (up to a given angle) will be classified correctly. Note that, in these cases, ‘classified correctly’ means all samples in the region will be given the same class. It is assumed that this behaviour is correct.

Marabou (Katz et al. 2019) is an SMT-based tool that can be used to check whether a particular NN satisfies a specific property. If the property is not satisfied then a concrete input for which the property is violated (i.e. a counter-example) is provided. Amongst other things, the tool has been used to prove properties about an Airborne Collision Avoidance System (ACAS) for unmanned aircraft (Julian et al. 2016).

Deep Learning with Differentiable Logic (DL2) (Fischer et al. 2018) can support training and querying of NNs. In the training application, logical constraints are translated into non-negative loss functions, which are incorporated into the overall optimisation the training is attempting to complete. The querying application allows constraints over properties not directly computed by the network (e.g. constraints can consider the likelihood of an input being in one of a set of classes). These types of constraint can readily express system-level requirements.
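To illustrate the training application, the following is a hedged sketch of how a logical constraint might become a non-negative loss. The constraint, the class names and the translation shown are simplified illustrations in the spirit of DL2, not the paper’s actual translation rules:

```python
def constraint_loss(p_cat: float, p_dog: float) -> float:
    """Loss for the hypothetical constraint p_cat + p_dog >= 0.9
    (e.g. 'the input is some kind of pet'). The loss is zero when the
    constraint holds and grows with the size of the violation, so it
    can be added to the ordinary training objective."""
    return max(0.0, 0.9 - (p_cat + p_dog))

assert constraint_loss(0.7, 0.3) == 0.0               # satisfied
assert abs(constraint_loss(0.5, 0.2) - 0.2) < 1e-9    # violated by 0.2
```

Note that the constraint ranges over class probabilities, a quantity the network computes, which mirrors the paper’s point that such constraints can express requirements at the system level rather than at the level of individual neurons.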
Even if the (suitably-measured) performance of an NN is adequate, this does not necessarily mean that the NN is making decisions in the same way as a human would. There is evidence that NNs use highly-predictive, but non-intuitive (to a human), features to support classification (Ilyas et al. 2019). If the distinction between ‘good performance’ and ‘good performance, deciding like a human’ is important then it should be captured as a requirement and explicitly tested.

Requirements Cover Software Behaviour

As noted earlier, a key question is whether there are aspects of software behaviour that are not captured by the requirements. That is, if a user fully understands the requirements, will they ever be surprised by the software’s behaviour? This latter formulation is helpful as it clarifies what we mean by ‘behaviour’. In particular, we are interested in behaviour that is externally observable. From the perspective of an NN, we are primarily interested in behaviour in the sense of Input → Output mappings.

In the NN context, one way of categorising different contributors to behaviour is by considering different types of input. Four related spaces are defined in (Ashmore, Calinescu, and Paterson 2019): the input domain space, which comprises inputs that the NN can accept; the operational domain space, which comprises inputs that the NN may receive when used operationally; the failure domain space, which comprises inputs the NN may receive if there are failures elsewhere in the system; and the adversarial domain space, which comprises inputs the NN may receive if it is being attacked by an adversary.

Using this structure, there are a number of techniques that can help increase coverage of potential NN behaviour: space-filling designs for computer experiments (Santner, Williams, and Notz 2018) are potentially relevant for the input domain; the notion of situation coverage (Alexander, Hawkins, and Rae 2015) is potentially relevant for the operational domain; Failure Modes and Effects Analysis (FMEA) should be useful for the failure domain; and ‘red teaming’ (Kardos and Dexter 2017) should inform the adversarial domain. Whilst they are useful, none of these techniques define a precise boundary of testing sufficiency. Greater experience of the practical use of NNs is likely to be required before such boundaries can be set.

These approaches provide a ‘forward-looking’ way of understanding behaviour; they rely on choosing inputs to invoke behaviour. This differs from the approach taken with traditional software requirements and structural coverage, where behaviour is invoked, in a sense, from within the behavioural regimes defined in the low-level requirements structure.

Generative Adversarial Networks (GANs) could help fill this gap. For example, they could be used to find plausible operational domain inputs that exhibit different behaviours to those observed in the training data. These could be inputs that are similar to training samples but result in different outputs; for classification networks, this is the same as finding adversarial inputs. Alternatively, they could be used to find inputs that are sufficiently different from any sample in the training data.

Another approach involves looking for specific undesirable behaviours in an NN. Detection and mitigation of backdoor attacks (Wang et al. 2019) is one example.

In some cases, specifically for feed-forward networks, it may be possible to automatically infer formal properties (Gopinath et al. 2019). This is a helpful way of understanding aspects of the NN’s behaviour. However, there is no guarantee that the inferred properties will be meaningful, in the sense of system-level requirements.

Summary

The way that NNs are constructed means test coverage measures cannot perform the same function as they do for traditional software. There are a number of approaches that can be used to provide some confidence that the NN satisfies requirements. There are also approaches that can provide some confidence that the NN’s behaviour is understood.

5 The Utility of NN Coverage Measures

The previous discussion highlights a distinction between the implicit motivation for NN coverage measures and their utility. We propose four ways in which NN coverage measures, like those discussed in Section 3, could have utility.

Firstly, following the analogy that training data represent the low-level behavioural requirements, the measures could be used to optimise training data, for example, by identifying whether a larger, or more diverse, training set was exercising a larger portion of internal network behaviour. If it is, then this is an argument for using the alternative, possibly larger, data set, despite the additional overheads and increased risk of overfitting.

Secondly, the measures could be used to compare training data, used during development, with verification data, used by an independent team. Suppose, for example, that the full set of training and verification data achieves greater coverage than the training data alone. This outcome would demonstrate that the independent verification activity is exercising additional types of internal NN behaviour to those observed during the development phase. This could be evidence that a suitably independent verification activity has been conducted.
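As an illustration of this kind of comparison, the sketch below treats ‘coverage’ as the set of neurons activated by a data set (the simplest measure from Section 3); the network, threshold and recorded activation values are all hypothetical:

```python
def covered_neurons(activations, threshold=0.0):
    """Set of neurons whose output exceeds `threshold` for at least
    one input; `activations` is a list of per-input output lists."""
    covered = set()
    for outputs in activations:
        covered |= {i for i, v in enumerate(outputs) if v > threshold}
    return covered

# Hypothetical recorded activations for a two-neuron network.
training_data = [[0.4, -0.2], [0.1, -0.5]]
verification_data = [[0.3, 0.6]]

cov_train = covered_neurons(training_data)
cov_full = covered_neurons(training_data + verification_data)

# The verification data exercised internal behaviour (neuron 1)
# that the training data did not.
assert cov_full - cov_train == {1}
```

The same set-difference view supports the fourth utility below: for a fixed data set, the model whose covered set is proportionally larger might be preferred.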
Thirdly, many of the approaches used to measure NN coverage can also be used to generate additional inputs that would extend coverage. As such, they provide an indication of ways in which training data could be meaningfully extended. Obviously, for situations where the NN is being developed using supervised learning, the appropriate output needs to be produced for each of these new inputs.

Fourthly, the measures can be used to choose between two different NNs that otherwise offer similar levels of performance. In such situations, the NN for which a fixed set of training data achieves greater coverage might be preferred. In general, this would be expected to be the NN with the simpler structure.

6 Conclusions

NNs have demonstrated significant utility. Their use in safety-related systems is predicated on confidence in their behaviour, both what they do and what they do not do. For traditional safety-related software, much of this confidence comes from test coverage measures in an RBT context. The approaches used to measure test coverage for NNs cannot provide an equivalent confidence.

There are approaches that can provide some aspects of this confidence. Appropriate consideration of different input spaces can help, as can GAN-based methods for finding ‘new’ inputs. NN test coverage measures can provide value in other ways. They can, for example, provide a principled, structured way of choosing between different training data sets, or between different trained models.

In conclusion, NN test coverage measures can have significant utility. They represent different types of confidence than is found in their traditional software testing forebears. However, more work is required before a holistic and complete understanding is achieved of the relationship between the coverage measures and confidence in NN behaviour.

References

Alexander, R.; Hawkins, H. R.; and Rae, A. J. 2015. Situation coverage – a coverage criterion for testing autonomous robots. University of York.

Ashmore, R.; Calinescu, R.; and Paterson, C. 2019. Assuring the machine learning lifecycle: Desiderata, methods, and challenges. arXiv preprint arXiv:1905.04223.

Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839–847. IEEE.

Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; and Vechev, M. 2018. DL2: Training and querying neural networks with logic.

Gay, G.; Staats, M.; Whalen, M.; and Heimdahl, M. 2015. The Risks of Coverage-Directed Test Case Generation. IEEE Transactions on Software Engineering 41(8): 803–819.

Gopinath, D.; Converse, H.; Pasareanu, C. S.; and Taly, A. 2019. Property Inference for Deep Neural Networks. arXiv 1904.13215v2.

Graydon, P. J.; and Holloway, C. M. 2015. Planning the Unplanned Experiment: Assessing the Efficacy of Standards for Safety Critical Software. Technical Report NASA/TM-2015-218804, NASA.

Hawkins, R.; Habli, I.; and Kelly, T. 2013. The principles of software safety assurance. 31st International System Safety Conference, Boston, Massachusetts, USA.

Hayhurst, K.; Veerhusen, D.; Chilenski, J.; and Rierson, L. 2001. A Practical Tutorial on Modified Condition/Decision Coverage. Technical report, NASA.

Holloway, C. M. 2013. Making the Implicit Explicit: Towards An Assurance Case for DO-178C. NASA Langley Research Center.

Huang, X.; Kwiatkowska, M.; Wang, S.; and Wu, M. 2017. Safety Verification of Deep Neural Networks. In Proceedings of the 29th International Conference on Computer Aided Verification, 3–29.

Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; and Madry, A. 2019. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, 125–136.

Julian, K. D.; Lopez, J.; Brush, J. S.; Owen, M. P.; and Kochenderfer, M. J. 2016. Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), 1–10. IEEE.

Kardos, M.; and Dexter, P. 2017. A Simple Handbook for Non-Traditional Red Teaming. Technical report, Defence Science and Technology Group Edinburgh, SA.

Katz, G.; Huang, D. A.; Ibeling, D.; Julian, K.; Lazarus, C.; Lim, R.; Shah, P.; Thakoor, S.; Wu, H.; Zeljić, A.; et al. 2019. The Marabou framework for verification and analysis of deep neural networks. In International Conference on Computer Aided Verification, 443–452. Springer.

Ma, L.; Juefei-Xu, F.; Sun, J.; Chen, C.; Su, T.; Zhang, F.; Xue, M.; Li, B.; Li, L.; Liu, Y.; Zhao, J.; and Wang, Y. 2018. DeepGauge: Comprehensive and Multi-Granularity Testing Criteria for Gauging the Robustness of Deep Learning Systems. arXiv 1803.07519.

Ma, L.; Juefei-Xu, F.; Xue, M.; Li, B.; Li, L.; Liu, Y.; and Zhao, J. 2019. DeepCT: Tomographic Combinatorial Testing for Deep Learning Systems. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, 614–618. IEEE.

NIST. 2010. Practical Combinatorial Testing.

Pei, K.; Cao, Y.; Yang, J.; and Jana, S. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. arXiv.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

RTCA. 2011. DO-178C: Software Considerations in Airborne Systems and Equipment Certification.

Salay, R.; and Czarnecki, K. 2018. Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262. arXiv 1808.01614.

Santner, T. J.; Williams, B. J.; and Notz, W. I. 2018. Space-filling designs for computer experiments. In The Design and Analysis of Computer Experiments, 145–200. Springer.

SASWG. 2020. SCSC-153A: Safety Assurance Objectives for Autonomous Systems.

Singh, G.; Gehr, T.; Püschel, M.; and Vechev, M. 2019. An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages 3(POPL): 1–30.

Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and Ashmore, R. 2019. Structural Test Coverage Criteria for Deep Neural Networks. ACM Transactions on Embedded Computing Systems (TECS) 18(5s): 94.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing Properties of Neural Networks. In Proceedings of the 2nd International Conference on Learning Representations, 1–10.

Tian, Y.; Pei, K.; Jana, S.; and Ray, B. 2018. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. In ICSE ’18: 40th International Conference on Software Engineering, Gothenburg, Sweden. ACM. arXiv:1708.08559v2.

Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; and Zhao, B. Y. 2019. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), 707–723. IEEE.

Disclaimer

This document is an overview of UK MOD sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.