The Utility of Neural Network Test Coverage Measures

Rob Ashmore, Alec Banks
Defence Science and Technology Laboratory
{rdashmore,abanks}@dstl.gov.uk

Abstract

In this position paper, we are interested in what test coverage measures can, and cannot, tell us about neural networks. We begin with a review of the role of test coverage measures in traditional development approaches for safety-related software. We show how those coverage measures, in the neural network sense, cannot achieve the same aims as their equivalents in the traditional sense. We provide indications of approaches that can partially meet those aims. We also indicate the utility of current neural network coverage measures.

Crown Copyright © 2021. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural Networks (NNs) have demonstrated significant utility in a range of safety and security related applications (e.g. autonomous cars (Tian et al. 2018)). In such cases there is a need to provide a commensurate level of assurance that a particular NN implementation is correct. Approaches to NN assurance have drawn inspiration from those used for traditional safety-related software. The concept of test coverage is one such example.

NNs are different to traditional software. We contend that: these differences fundamentally change the meaning of several types of test coverage (e.g. requirements and structural measures); there are approaches that can partially achieve the intent of these traditional test coverage measures in an NN context; and currently-proposed measures of NN test coverage have utility in different ways.

The remainder of this paper is structured as follows: Section 2 provides an overview of the role of test coverage measures in the development of traditional safety-related software; Section 3 summarises a selection of NN test coverage measures proposed in the literature; Section 4 outlines NN-based approaches that can achieve some of the aims of traditional test coverage; Section 5 summarises the value that can be gained from current NN coverage measures; Section 6 concludes the paper.

2 Traditional Safety-Related Software Development

Our discussion of safety-related software development is based on DO-178C (RTCA 2011), which is commonly used in the aerospace domain. This approach is generally applicable, since the principles adopted in DO-178C are mirrored elsewhere (e.g. (Hawkins, Habli, and Kelly 2013)).

Before the discussion, we note that although a significant amount of practical experience provides confidence in DO-178C, and its predecessor, DO-178B, there is little, if any, explicit evidence to support the specific coverage criteria that are used (Holloway 2013). This limitation is not restricted to DO-178C: it applies across many safety-related software standards (Graydon and Holloway 2015). Nevertheless, we believe a traditional safety-related software development standard is an appropriate basis for this work. This is partly because of the excellent safety record of industries that use such standards and partly because many NN test coverage measures are (implicitly or explicitly) based on measures used in traditional software engineering.

The development of safety-related software begins with system-level requirements that have been allocated to software. These high-level requirements are hierarchically decomposed, in a traceable manner, from requirements that detail ‘what’ behaviour is needed, down to requirements that detail ‘how’ to achieve this. The software requirements are also independently used to produce test cases in a Requirements-Based Testing (RBT) approach. The tests are run and the coverage of the code structure is measured dynamically. RBT is an attractive approach to test design since it acknowledges that exhaustive testing is (generally speaking) infeasible and it focusses on the demonstration of the intended behaviour.

Assuming all tests pass, if structural coverage is incomplete then one of the following three conditions holds, where the notion of correctness directly relates to the intent of the high-level requirements:

• The software’s behaviour is correct, but the software-level requirements are incomplete;
• The software-level requirements are correct, but the software includes additional, unnecessary behaviour (we include unreachable as well as redundant code here);
• The software behaviour and software-level requirements are correct, but the test set is incomplete. This includes, for example, cases where development tools introduce code for runtime efficiency.

Considering all three of these conditions provides confidence that (once a suitable level of coverage has been achieved) the code sufficiently implements the requirements and, furthermore, it contains no undesirable additional behaviour. The link between requirements and behaviour is important, because requirements are the key mechanism for communicating expectations about software behaviour to actors in the development, test and integration of the software.

Structural coverage is typically measured at the software unit level. Equivalently, at this level behaviour can be defined, and verified. Hierarchical decomposition of requirements, supported by architectural design, provides confidence that unit-level behaviour can be aggregated to provide the required system-level behaviour. Not all behaviour can be tested at this level, thus testing continues through the integration processes.

The type of structural coverage that is required depends on the criticality of the software. In less-critical cases, statement coverage suffices; more demanding cases require branch coverage; and the most critical cases require Modified Condition/Decision Coverage (MCDC) (RTCA 2011).
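The difference between these coverage levels can be illustrated with a small sketch. The decision and its condition names below are hypothetical, chosen only to show why MCDC demands more than branch coverage:

```python
# Illustrative only: a hypothetical two-condition decision, used to
# contrast statement, branch and MCDC coverage.
def authorise(engaged: bool, altitude_ok: bool) -> bool:
    if engaged and altitude_ok:
        return True
    return False

# Branch coverage requires both outcomes of the decision, e.g.:
branch_tests = [(True, True), (False, False)]

# MCDC additionally requires each condition to be shown to
# independently affect the decision: flip one condition, hold the
# other fixed, and observe the decision change.
mcdc_tests = [
    (True, True),    # decision is True
    (False, True),   # flipping `engaged` alone flips the decision
    (True, False),   # flipping `altitude_ok` alone flips the decision
]
assert [authorise(*t) for t in mcdc_tests] == [True, False, False]
```

For a decision with n independent conditions, MCDC can typically be achieved with n + 1 test cases, rather than the 2**n needed for exhaustive condition testing.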
A useful tutorial on these types of structural coverage can be found in (Hayhurst et al. 2001). It is apparent that all of these coverage levels are based on code structure. This reflects an implicit assumption that code structure, rather than data, is the main influence on software behaviour. Cases where data significantly affects behaviour would be expected to be defined in a requirement and addressed by a test case.

The combination of RBT and coverage measurement is important. Even though behaviour is, typically, not strongly data-dependent, there is still value in using suitably realistic test values. Simply optimising a test set to achieve a given level of coverage, with little or no consideration of the requirements, is less effective than RBT (Gay et al. 2015). The reverse approach, of testing against requirements without measuring coverage, cannot provide confidence that the requirements suitably encapsulate the software’s behaviour. It can say that the software satisfies the requirements, but provides no information on what else the software may do. This information is an important part of assuring safety-related software.

In summary, for traditional software: hierarchical decomposition of requirements means that software behaviour can be understood at the unit level; independent interpretation of requirements provides confidence that requirements have been implemented correctly; and the combination of RBT and structural test coverage provides confidence that the requirements suitably describe the software’s behaviour, both what it does and what it does not do.

3 Neural Network Coverage Measures

Our main interest is in Artificial Intelligence (AI), particularly AI implemented using Machine Learning (ML) techniques. Due to their prevalence, we focus on NNs, but much of what follows is applicable to other forms of ML.

Historically, NN testing was based on measures like precision and recall, calculated using a set of data that was held back from training. The inadequacy of these measures is demonstrated by adversarial inputs (Szegedy et al. 2014). Networks that, based on these measures, perform extremely well against verification data may display significant undesirable behaviour when exposed to previously unseen, but valid, inputs. This does not necessarily mean concepts like precision and recall should be abandoned: it means that, in isolation, they do not provide sufficient confidence for the use of NNs in safety-related applications.

More recently, a variety of coverage measures, based on internal properties of an NN, have been proposed. It should be noted that, to the best of our knowledge, use of these measures has thus far been demonstrated using von Neumann architecture hardware. For some different architectures it may not be efficient, or even possible, to collect the information on which these measures are based.

Some notable example coverage measures are summarised below.

DeepXplore (Pei et al. 2017) aims to systematically test a collection of similar NNs for erroneous corner case behaviours. It uses a gradient-guided local search that starts from a seed input and solves a joint optimisation problem, seeking to find new inputs that cause different neuron activation patterns and lead the NNs to behave differently on the same task. In the same paper, the notion of neuron coverage is developed, based on the fraction of neurons that are activated for at least one test input, with a neuron considered active if its output is above a threshold value (e.g. 0). Equivalently, 100% coverage is achieved if each neuron is activated at least once by the test set.

DeepGauge (Ma et al. 2018) considers two levels of coverage: neuron-level and layer-level.

• Neuron-level coverage splits the output range of each neuron (established during the training phase) into k equal sections. The k-multisection coverage of that neuron is the fraction of sections that the neuron’s output falls into across all inputs in the test set. The k-multisection coverage of the NN is the average of these neuron coverages.
• Layer-level coverage is based on the l neurons, and combinations thereof, in each layer that have the greatest output value. Top-l neuron coverage is the fraction of neurons that are in the top l neurons in their layer for at least one of the test inputs.
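The two neuron-level notions above can be sketched in a few lines. This is an illustrative reconstruction from the definitions as summarised here, not the tools’ reference implementations; the data structures (a mapping from test input to per-neuron outputs, and per-neuron training-time ranges) are assumptions made for the sketch:

```python
from collections import defaultdict

def neuron_coverage(activations, threshold=0.0):
    """Neuron coverage in the DeepXplore sense: the fraction of neurons
    whose output exceeds `threshold` for at least one test input.
    `activations` maps each test input to a flat list of per-neuron
    outputs."""
    num_neurons = len(next(iter(activations.values())))
    covered = set()
    for outputs in activations.values():
        covered |= {i for i, v in enumerate(outputs) if v > threshold}
    return len(covered) / num_neurons

def k_multisection_coverage(activations, ranges, k=5):
    """k-multisection coverage in the DeepGauge sense: split each
    neuron's training-time range [lo, hi] into k equal sections and
    average, over neurons, the fraction of sections hit by the test
    set. Outputs outside the training-time range fall in no section."""
    hit = defaultdict(set)
    for outputs in activations.values():
        for i, v in enumerate(outputs):
            lo, hi = ranges[i]
            if lo <= v <= hi:
                hit[i].add(min(int((v - lo) / (hi - lo) * k), k - 1))
    return sum(len(s) for s in hit.values()) / (k * len(ranges))
```

For instance, activations `{0: [0.2, -0.3], 1: [0.9, 0.1]}` over two neurons give a neuron coverage of 1.0 at the default threshold, but only 0.5 at a threshold of 0.5; the same test set makes a coarser or finer impression on k-multisection coverage depending on the choice of k.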
DeepCT (Ma et al. 2019) is inspired by combinatorial testing (see (NIST 2010)). Two layer-level coverage measures are defined:

• Sparse coverage considers the fraction of m-way subsets in which all neurons are activated for at least one test input. For example, a layer of four neurons will have six 2-way subsets.
• Dense coverage considers the fraction of m-way activation patterns that are activated for at least one test input. For example, a 2-way subset has four activation patterns.
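The subset and pattern counts underlying these two measures follow from elementary combinatorics; a minimal sketch, using a hypothetical four-neuron layer:

```python
from itertools import combinations, product

# A hypothetical layer of four neurons.
layer = ["n1", "n2", "n3", "n4"]

# Sparse coverage ranges over the m-way subsets of the layer;
# for m = 2 there are C(4, 2) = 6 such subsets.
pairs = list(combinations(layer, 2))
assert len(pairs) == 6

# Dense coverage ranges over the activation patterns of each subset;
# a 2-way subset has 2**2 = 4 active/inactive patterns.
patterns = list(product([False, True], repeat=2))
assert len(patterns) == 4
```

The number of combinations to cover grows quickly with both layer width and m, which is why DeepCT, like combinatorial testing generally, concentrates on small values of m.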
DeepConcolic (Sun et al. 2019) is based on a variety of measures, loosely inspired by MCDC:

• Sign-sign coverage measures whether the change in output sign (i.e. moving from zero to non-zero) of a neuron in layer n independently affects the output sign of a specific neuron in layer n + 1 (i.e. the subsequent layer).
• Value-value coverage measures whether a change in output value (e.g. similar to the k-sections used by DeepGauge) of a neuron in layer n independently affects the output value of a specific neuron in layer n + 1.
• The notions of sign-value and value-sign coverage naturally follow.

These measures show a progression, from simple aspects of individual neuron behaviour (e.g. activation or non-activation), through more complex aspects of individual neuron behaviour, to the joint behaviour of combinations of neurons, either in the same layer or in neighbouring layers.

4 Towards Traditional Coverage for Neural Networks

Constraints

For the purposes of this paper, we are primarily concerned with structural coverage (although we remain interested in requirements throughout). For completeness, we note that other attributes will be important in an overall assurance argument supporting the use of an NN. A key area is the NN development process. This is covered in detail in (Ashmore, Calinescu, and Paterson 2019) and SCSC-153A (SASWG 2020).

Recall that traditional software structural test coverage measures, when used in conjunction with RBT, provide a level of confidence that software satisfies requirements and that requirements cover software behaviour. When viewed in that light, it is apparent that the types of NN coverage measures discussed above cannot achieve the same aims as traditional software coverage measures. As illustrated in the following paragraphs, none of the requisite building blocks, which allow traditional coverage to work in this way, are present in the NN context.

We do not have a complete set of requirements. There is often a good understanding of the purpose of an NN, for example: ‘recognise hand-written digits’ or ‘determine sentiment from social media messages’. However, in our experience, there is rarely, if ever, a set of requirements that is accurate, complete, unambiguous and verifiable.

Not all NN requirements can be hierarchically decomposed. Some level of methodical decomposition is often possible; for example, ‘recognise hand-written digits’ could be decomposed to explicitly cater for ‘triangular 4s’, ‘open 4s’ and ‘crossed 7s’. But requirements like these cannot be decomposed to a level that can be directly coded against. In particular, in a safety-related environment, if we could decompose to a directly-codeable level then there would be no need to use an NN and traditional software would be the preferred approach (Salay and Czarnecki 2018).

We do not have a meaningful software unit level at which software behaviour can be described. As indicated previously, current NN coverage measures focus on neuron behaviour, either individually or in patterns. Although this behaviour controls the network’s output, it does not (and cannot) describe software behaviour in a way that is meaningful to a user. Some approaches to explainability may help, for example: identifying pixels that positively weight towards a particular class (Ribeiro, Singh, and Guestrin 2016) or visualising feature maps from a final convolutional layer (Chattopadhay et al. 2018). Whilst useful, these do not provide the requisite understanding. Furthermore, they do not generalise across all types of NNs or all NN applications.

Software Satisfies Requirements

One of the strengths of NNs is their ability to generalise from incomplete specifications. It may seem that asking for a demonstration that an NN satisfies a set of requirements negates this strength. From our perspective, the distinction is in the level at which requirements are expressed. Placing, and verifying, requirements on the network’s internals does not provide traceability between requirements and behaviour. That traceability might be achieved by placing (an appropriate set of) requirements on the NN’s Input → Output behaviour.

Adversarial examples (Szegedy et al. 2014) are one aspect of this behaviour. ‘Robustness’ to these examples has often been suggested, sometimes implicitly, as a requirement that an NN should meet. For feed-forward networks, this can be demonstrated, or a counter-example found, using techniques based on Satisfiability Modulo Theories (SMT) (Huang et al. 2017).

DeepPoly (Singh et al. 2019) couples an abstract domain (specifically, a combination of floating-point polyhedra with intervals) with abstract transformers for common neural network functions. This allows guarantees to be made, for example, that all samples within an L∞ ball will be classified correctly, or that all image rotations (up to a given angle) will be classified correctly. Note that, in these cases, ‘classified correctly’ means all samples in the region will be given the same class. It is assumed that this behaviour is correct.

Marabou (Katz et al. 2019) is an SMT-based tool that can be used to check whether a particular NN satisfies a specific property. If the property is not satisfied then a concrete input for which the property is violated (i.e. a counter-example) is provided. Amongst other things, the tool has been used to prove properties about an Airborne Collision Avoidance System (ACAS) for unmanned aircraft (Julian et al. 2016).

Deep Learning with Differentiable Logic (DL2) (Fischer et al. 2018) can support training and querying of NNs. In the training application, logical constraints are translated into non-negative loss functions, which are incorporated into the overall optimisation the training is attempting to complete. The querying application allows constraints over properties not directly computed by the network (e.g. constraints can consider the likelihood of an input being in one of a set of classes). These types of constraint can readily express system-level requirements.
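To illustrate the training application, the following is a hedged sketch of how a logical constraint might become a non-negative loss. The constraint, the class names and the translation shown are simplified illustrations in the spirit of DL2, not the paper’s actual translation rules:

```python
def constraint_loss(p_cat: float, p_dog: float) -> float:
    """Loss for the hypothetical constraint p_cat + p_dog >= 0.9
    (e.g. 'the input is some kind of pet'). The loss is zero when the
    constraint holds and grows with the size of the violation, so it
    can be added to the ordinary training objective."""
    return max(0.0, 0.9 - (p_cat + p_dog))

assert constraint_loss(0.7, 0.3) == 0.0               # satisfied
assert abs(constraint_loss(0.5, 0.2) - 0.2) < 1e-9    # violated by 0.2
```

Note that the constraint ranges over class probabilities, a quantity the network computes, which mirrors the paper’s point that such constraints can express requirements at the system level rather than at the level of individual neurons.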
Even if the (suitably-measured) performance of an NN is adequate, this does not necessarily mean that the NN is making decisions in the same way as a human would. There is evidence that NNs use highly-predictive, but non-intuitive (to a human), features to support classification (Ilyas et al. 2019). If the distinction between ‘good performance’ and ‘good performance, deciding like a human’ is important then it should be captured as a requirement and explicitly tested.

Requirements Cover Software Behaviour

As noted earlier, a key question is whether there are aspects of software behaviour that are not captured by the requirements. That is, if a user fully understands the requirements, will they ever be surprised by the software’s behaviour? This latter formulation is helpful as it clarifies what we mean by ‘behaviour’. In particular, we are interested in behaviour that is externally observable. From the perspective of an NN, we are primarily interested in behaviour in the sense of Input → Output mappings.

In the NN context, one way of categorising different contributors to behaviour is by considering different types of input. Four related spaces are defined in (Ashmore, Calinescu, and Paterson 2019): the input domain space, which comprises inputs that the NN can accept; the operational domain space, which comprises inputs that the NN may receive when used operationally; the failure domain space, which comprises inputs the NN may receive if there are failures elsewhere in the system; and the adversarial domain space, which comprises inputs the NN may receive if it is being attacked by an adversary.

Using this structure, there are a number of techniques that can help increase coverage of potential NN behaviour: space-filling designs for computer experiments (Santner, Williams, and Notz 2018) are potentially relevant for the input domain; the notion of situation coverage (Alexander, Hawkins, and Rae 2015) is potentially relevant for the operational domain; Failure Modes and Effects Analysis (FMEA) should be useful for the failure domain; and ‘red teaming’ (Kardos and Dexter 2017) should inform the adversarial domain. Whilst they are useful, none of these techniques define a precise boundary of testing sufficiency. Greater experience of the practical use of NNs is likely to be required before such boundaries can be set.

These approaches provide a ‘forward-looking’ way of understanding behaviour; they rely on choosing inputs to invoke behaviour. This differs from the approach taken with traditional software requirements and structural coverage, where behaviour is invoked, in a sense, from within the behavioural regimes defined in the low-level requirements structure.

Generative Adversarial Networks (GANs) could help fill this gap. For example, they could be used to find plausible operational domain inputs that exhibit different behaviours to those observed in the training data. These could be inputs that are similar to training samples but result in different outputs; for classification networks, this is the same as finding adversarial inputs. Alternatively, they could be used to find inputs that are sufficiently different from any sample in the training data.

Another approach involves looking for specific undesirable behaviours in an NN. Detection and mitigation of backdoor attacks (Wang et al. 2019) is one example.

In some cases, specifically for feed-forward networks, it may be possible to automatically infer formal properties (Gopinath et al. 2019). This is a helpful way of understanding aspects of the NN’s behaviour. However, there is no guarantee that the inferred properties will be meaningful, in the sense of system-level requirements.

Summary

The way that NNs are constructed means test coverage measures cannot perform the same function as they do for traditional software. There are a number of approaches that can be used to provide some confidence that the NN satisfies requirements. There are also approaches that can provide some confidence that the NN’s behaviour is understood.

5 The Utility of NN Coverage Measures

The previous discussion highlights a distinction between the implicit motivation for NN coverage measures and their utility. We propose four ways in which NN coverage measures, like those discussed in Section 3, could have utility.

Firstly, following the analogy that training data represent the low-level behavioural requirements, the measures could be used to optimise training data, for example, by identifying whether a larger, or more diverse, training set was exercising a larger portion of internal network behaviour. If it is, then this is an argument for using the alternative, possibly larger, data set, despite the additional overheads and increased risk of overfitting.

Secondly, the measures could be used to compare training data, used during development, with verification data, used by an independent team. Suppose, for example, that the full set of training and verification data achieves greater coverage than the training data alone. This outcome would demonstrate that the independent verification activity is exercising additional types of internal NN behaviour to those observed during the development phase. This could be evidence that a suitably independent verification activity has been conducted.
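As an illustration of this kind of comparison, the sketch below treats ‘coverage’ as the set of neurons activated by a data set (the simplest measure from Section 3); the network, threshold and recorded activation values are all hypothetical:

```python
def covered_neurons(activations, threshold=0.0):
    """Set of neurons whose output exceeds `threshold` for at least
    one input; `activations` is a list of per-input output lists."""
    covered = set()
    for outputs in activations:
        covered |= {i for i, v in enumerate(outputs) if v > threshold}
    return covered

# Hypothetical recorded activations for a two-neuron network.
training_data = [[0.4, -0.2], [0.1, -0.5]]
verification_data = [[0.3, 0.6]]

cov_train = covered_neurons(training_data)
cov_full = covered_neurons(training_data + verification_data)

# The verification data exercised internal behaviour (neuron 1)
# that the training data did not.
assert cov_full - cov_train == {1}
```

The same set-difference view supports the fourth utility below: for a fixed data set, the model whose covered set is proportionally larger might be preferred.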
Thirdly, many of the approaches used to measure NN coverage can also be used to generate additional inputs that would extend coverage. As such, they provide an indication of ways in which training data could be meaningfully extended. Obviously, for situations where the NN is being developed using supervised learning, the appropriate output needs to be produced for each of these new inputs.

Fourthly, the measures can be used to choose between two different NNs that otherwise offer similar levels of performance. In such situations, the NN for which a fixed set of training data achieves greater coverage might be preferred. In general, this would be expected to be the NN with the simpler structure.

6 Conclusions

NNs have demonstrated significant utility. Their use in safety-related systems is predicated on confidence in their behaviour, both what they do and what they do not do. For traditional safety-related software, much of this confidence comes from test coverage measures in an RBT context. The approaches used to measure test coverage for NNs cannot provide an equivalent confidence.

There are approaches that can provide some aspects of this confidence. Appropriate consideration of different input spaces can help, as can GAN-based methods for finding ‘new’ inputs. NN test coverage measures can provide value in other ways. They can, for example, provide a principled, structured way of choosing between different training data sets, or between different trained models.

In conclusion, NN test coverage measures can have significant utility. They represent different types of confidence than is found in their traditional software testing forebears. However, more work is required before a holistic and complete understanding is achieved of the relationship between the coverage measures and confidence in NN behaviour.

References

Alexander, R.; Hawkins, H. R.; and Rae, A. J. 2015. Situation coverage – a coverage criterion for testing autonomous robots. University of York.

Ashmore, R.; Calinescu, R.; and Paterson, C. 2019. Assuring the machine learning lifecycle: Desiderata, methods, and challenges. arXiv preprint arXiv:1905.04223.

Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839–847. IEEE.

Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; and Vechev, M. 2018. DL2: Training and querying neural networks with logic.

Gay, G.; Staats, M.; Whalen, M.; and Heimdahl, M. 2015. The Risks of Coverage-Directed Test Case Generation. IEEE Transactions on Software Engineering 41(8): 803–819.

Gopinath, D.; Converse, H.; Pasareanu, C. S.; and Taly, A. 2019. Property Inference for Deep Neural Networks. arXiv 1904.13215v2.

Graydon, P. J.; and Holloway, C. M. 2015. Planning the Unplanned Experiment: Assessing the Efficacy of Standards for Safety Critical Software. Technical Report NASA/TM-2015-218804, NASA.

Hawkins, R.; Habli, I.; and Kelly, T. 2013. The principles of software safety assurance. 31st International System Safety Conference, Boston, Massachusetts, USA.

Hayhurst, K.; Veerhusen, D.; Chilenski, J.; and Rierson, L. 2001. A Practical Tutorial on Modified Condition/Decision Coverage. Technical report, NASA.

Holloway, C. M. 2013. Making the Implicit Explicit: Towards An Assurance Case for DO-178C. NASA Langley Research Center.

Huang, X.; Kwiatkowska, M.; Wang, S.; and Wu, M. 2017. Safety Verification of Deep Neural Networks. In Proceedings of the 29th International Conference on Computer Aided Verification, 3–29.

Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; and Madry, A. 2019. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, 125–136.

Julian, K. D.; Lopez, J.; Brush, J. S.; Owen, M. P.; and Kochenderfer, M. J. 2016. Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), 1–10. IEEE.

Kardos, M.; and Dexter, P. 2017. A Simple Handbook for Non-Traditional Red Teaming. Technical report, Defence Science and Technology Group Edinburgh, SA.

Katz, G.; Huang, D. A.; Ibeling, D.; Julian, K.; Lazarus, C.; Lim, R.; Shah, P.; Thakoor, S.; Wu, H.; Zeljić, A.; et al. 2019. The Marabou framework for verification and analysis of deep neural networks. In International Conference on Computer Aided Verification, 443–452. Springer.

Ma, L.; Juefei-Xu, F.; Sun, J.; Chen, C.; Su, T.; Zhang, F.; Xue, M.; Li, B.; Li, L.; Liu, Y.; Zhao, J.; and Wang, Y. 2018. DeepGauge: Comprehensive and Multi-Granularity Testing Criteria for Gauging the Robustness of Deep Learning Systems. arXiv 1803.07519.

Ma, L.; Juefei-Xu, F.; Xue, M.; Li, B.; Li, L.; Liu, Y.; and Zhao, J. 2019. DeepCT: Tomographic Combinatorial Testing for Deep Learning Systems. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, 614–618. IEEE.

NIST. 2010. Practical Combinatorial Testing.

Pei, K.; Cao, Y.; Yang, J.; and Jana, S. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. arXiv.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

RTCA. 2011. DO-178C: Software Considerations in Airborne Systems and Equipment Certification.

Salay, R.; and Czarnecki, K. 2018. Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262. arXiv 1808.01614.

Santner, T. J.; Williams, B. J.; and Notz, W. I. 2018. Space-filling designs for computer experiments. In The Design and Analysis of Computer Experiments, 145–200. Springer.

SASWG. 2020. SCSC-153A: Safety Assurance Objectives for Autonomous Systems.

Singh, G.; Gehr, T.; Püschel, M.; and Vechev, M. 2019. An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages 3(POPL): 1–30.

Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and Ashmore, R. 2019. Structural Test Coverage Criteria for Deep Neural Networks. ACM Transactions on Embedded Computing Systems (TECS) 18(5s): 94.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing Properties of Neural Networks. In Proceedings of the 2nd International Conference on Learning Representations, 1–10.

Tian, Y.; Pei, K.; Jana, S.; and Ray, B. 2018. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. In ICSE ’18: 40th International Conference on Software Engineering, Gothenburg, Sweden. ACM. arXiv:1708.08559v2.

Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; and Zhao, B. Y. 2019. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), 707–723. IEEE.

Disclaimer

This document is an overview of UK MOD sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.