Requirements Assurance in Machine Learning

Alec Banks and Rob Ashmore
Defence Science and Technology Laboratory, Salisbury, United Kingdom
abanks@dstl.gov.uk, rdashmore@dstl.gov.uk

Abstract

Training data is an important aspect of approaches that use Machine Learning techniques. More precisely, we assert that training data captures the requirements that should be satisfied by the trained algorithm. Hence, for safety applications, any argument relating to behavioural correctness has to consider how those requirements are embodied within the training data. To support this, based on approaches for requirements assurance in traditional safety-related software, we develop nine specific areas where confidence is required in training data. These are illustrated using a fictional example.

Introduction

This paper outlines the challenges associated with the assurance of requirements in safety-related Machine Learning (ML) systems.

All software operates within the context of the system in which it is executed. In traditional safety-related systems the behavioural requirements are first established at the system level and then decomposed and refined until the developer is able to unambiguously transfer the associated requirement into code. In ML-based systems the software behaviour is not dictated by requirements that have been decomposed to that level (Ashmore and Lennon, 2017). Instead, those requirements are implicitly provided via the training data.

This paper briefly looks at the way requirements are treated in existing safety standards. It goes on to discuss the concept of requirements in developments based on ML techniques. Building from existing approaches, a series of assurance considerations for ML requirements is developed and illustrated using a fictional, indicative example.

Treatment of Requirements in Established Software Standards

There are a number of software standards that may be applied to safety-related applications. These all seek to avoid the introduction of errors and to foster their rigorous removal. Whilst they are often domain specific, all of these standards have common characteristics, which were distilled into the '4+1' principles of software safety engineering (Hawkins et al., 2013). To achieve all of these principles, software safety assurance must:

• P1. Identify safety requirements at the system level;
• P2. Maintain the intent of these requirements throughout decomposition;
• P3. Demonstrably satisfy safety-related requirements in the implementation;
• P4. Identify hazardous behaviours introduced by the software and mitigate them; and
• P4+1. Provide a level of confidence in software behaviour that is commensurate with its contribution to system-level risk.

In safety-related applications these principles usually drive the software requirement decomposition to two distinct levels. High-Level Requirements (HLR) detail 'what is required' in the design. These are then systematically decomposed into Low-Level Requirements (LLR), which provide coders with information on 'how to implement' that design. To minimize ambiguity, LLR often include pseudo-code or mathematical formulae; a sketch of the level of detail involved is given below.
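The fragment below illustrates the form such an LLR might take once it reaches code. It is a minimal sketch: the identifier LLR-EX-01, the limit value and the function name are invented for illustration and are not drawn from any cited standard.

```python
# Hypothetical LLR expressed in executable form. The identifier,
# limit value and names are invented for illustration only.
#
# LLR-EX-01: If the commanded rate exceeds RATE_LIMIT in magnitude,
# the output shall be clamped to +/- RATE_LIMIT; otherwise the
# commanded value shall be passed through unchanged.

RATE_LIMIT = 5.0  # deg/s; illustrative value


def limit_rate(commanded_rate: float) -> float:
    """Implements LLR-EX-01: clamp the commanded rate to the limit."""
    if commanded_rate > RATE_LIMIT:
        return RATE_LIMIT
    if commanded_rate < -RATE_LIMIT:
        return -RATE_LIMIT
    return commanded_rate
```

The value of this form is that every branch traces directly to the requirement text, so satisfaction (P3) can be demonstrated by inspection and test. As discussed next, it is precisely this unambiguous decomposition that ML-based development lacks.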
Requirements Definition in an ML Context

In ML applications, the requirements for the software can be considered from two parallel, but related, perspectives. There are, firstly, the requirements for the construction of the learning algorithm and, secondly, the general requirements for behaviour. We use the term 'general' purposefully here: if it were possible to detail the exact behaviours expected from the software, then ML approaches would arguably be inappropriate for safety-related applications (Salay and Czarnecki, 2018).

For construction of the learning algorithm (e.g. back-propagation) it is possible (and desirable) to develop HLR and to further decompose these into LLR and on to implementation, verification and validation. This aspect of ML software is therefore not considered further in this paper, although we note that work is being conducted in this area, for example (Srisakaokul et al., 2018).

Moving on to the more challenging area of the behaviour of the trained algorithm: in an ML-based approach this is dictated by the training data, which may be real and/or synthetically generated (e.g. Ekbatani et al., 2017, for computer vision problems), combined with the learning algorithm and the structure (e.g. the number of neurons and layers in an artificial neural network) to which it is applied.

From this short discussion, and returning to the '4+1' principles, it is apparent that in the case of ML, principles P1 and P4+1 are arguably the only ones that can, based on current practices, be adequately satisfied. Principles P2, P3 and P4 all suffer to a greater or lesser degree because the exact behaviour cannot be detailed in LLR (P2), assured through verification (P3), or sufficiently predicted to permit potential hazard identification (P4). We note that some of these shortfalls may be resolved through behavioural containment (e.g. the use of monitors and alternative control structures within the wider system architecture), but to do so could also negate the benefits of adopting ML in the design.

Due to the challenges of addressing each of these principles individually, we suggest a holistic perspective may be more beneficial. Whilst all of the principles relate to assurance of requirements and their implementation, discussing all of them would be too broad an endeavour for this position paper. Consequently, we focus on the decomposition of the HLR into LLR (i.e. assurance of the requirements rather than their initial derivation or final implementation).

Since it determines the algorithm's behaviour, we can consider the training data to be an abstract form of the LLR. Hence, assurance of training data is paramount to gaining confidence. Using RTCA DO-178C (RTCA, 2011) as an example[1], traditional requirements verification seeks to ensure LLR are:

• R1. Compliant with HLR;
• R2. Accurate and consistent;
• R3. Compatible with the target computer[2];
• R4. Verifiable;
• R5. Conforming to standards;
• R6. Traceable; and
• R7. Algorithmically correct.

[1] DO-178C is a key software safety document for aircraft. It is sufficiently general to apply to most developments of safety-related software.
[2] The target computer is the one on which the algorithm will run during operational use. This is often different from the host computer, which is the one used to develop the algorithm.

Whilst all of these considerations arguably apply to ML-based applications, some do not easily translate. To achieve the same intent through the training data we need to have confidence that the data:

• D1. Relates to the intent of the HLR;
• D2. Does not contain bias;
• D3. Is sufficient;
• D4. Is syntactically and semantically correct;
• D5. Addresses normal and robustness behaviours;
• D6. Is self-consistent;
• D7. Conforms to standards;
• D8. Is compatible with the target computer; and
• D9. Is verifiable.

For ease of reference, Table 1 summarises the relationship between requirements traditionally placed on LLR (R1 to R7) and areas where confidence is needed in the training data (D1 to D9).

Table 1: Relationship between traditional requirements and areas of confidence in training data

       | R1 | R2 | R3 | R4 | R5 | R6 | R7
    D1 | X  |    |    |    |    | X  |
    D2 |    | X  |    |    |    |    |
    D3 | X  |    |    |    |    |    |
    D4 |    | X  |    |    |    |    | X
    D5 | X  |    |    |    |    |    |
    D6 |    | X  |    |    |    |    |
    D7 |    |    |    |    | X  |    |
    D8 |    |    | X  |    |    |    |
    D9 |    |    |    | X  |    |    |

It is readily apparent that each of the traditional LLR requirements is covered by at least one area of training data confidence. This suggests the identified areas are necessary. It does not mean they are sufficient (or complete). For example, there are also several 'meta-considerations' (i.e. those not directly concerned with performance but which contribute to design confidence). These include, inter alia: the data source; its control; forensic auditability; and the extent of pre-processing required.

To provide confidence that these areas are suitably addressed, there are a number of processes, reviews, analyses and tests that can be applied, the results of which should support a safety argument. Some of these approaches will now be discussed in the context of an indicative example.
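The necessity claim above, that every traditional LLR requirement is covered by at least one data confidence area, is mechanical enough to check automatically. The sketch below encodes Table 1 as a mapping and verifies the coverage property; the encoding is purely illustrative and forms no part of DO-178C.

```python
# Table 1 as a mapping from each training data confidence area
# (D1-D9) to the traditional LLR requirements (R1-R7) it covers.
# Illustrative encoding only; Table 1 remains the authoritative view.
COVERAGE = {
    "D1": {"R1", "R6"},  # intent of HLR -> compliance, traceability
    "D2": {"R2"},        # freedom from bias -> accuracy/consistency
    "D3": {"R1"},        # sufficiency -> compliance with HLR
    "D4": {"R2", "R7"},  # syntactic/semantic -> accuracy, correctness
    "D5": {"R1"},        # normal and robustness behaviours
    "D6": {"R2"},        # self-consistency -> accuracy/consistency
    "D7": {"R5"},        # conformance to standards
    "D8": {"R3"},        # target computer compatibility
    "D9": {"R4"},        # verifiability
}

ALL_R = {f"R{i}" for i in range(1, 8)}

covered = set().union(*COVERAGE.values())
# An empty result demonstrates the necessity claim: every traditional
# requirement is addressed by at least one data confidence area.
print("Uncovered requirements:", ALL_R - covered or "none")
```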
Indicative Example

To colour our discussion we adopt a fictional landing system that detects whether a medium-size Unmanned Air Vehicle (UAV) has landed on an unprepared surface. The associated system requirement might be:

• SYS-01: Detect landing on unprepared strip.

This would then be decomposed into software and hardware requirements. Given the environmental constraint of the unprepared strip, the conventional 'Weight-On-Wheels' switch approach[3] would not be technically feasible. Therefore a design decision is made to use a combination of hardware sensor systems (e.g. horizontal and vertical accelerometers, altimeters, air data, etc.) as inputs to an Artificial Neural Network (ANN) that will classify the landing status into one of the following categories: in-air; on approach; landing; landed.

[3] Usually consisting of a proximity switch located on the undercarriage, which uses the movement induced when weight is applied to the wheel to make an electrical circuit.

The hardware requirements are not considered further, but the high-level software requirement (HLR) might be:

• SW-HLR-01: Classify landing status: {In_Air; On_Approach; Landing; Landed} based on data from: Inertial Navigation System {Vert_Acc; Horz_Acc}; Laser Altimeter {Height}; Air Data Computer {Airspeed}.

The actual behaviour (LLR) of the ANN would be determined by the training data (together with the learning algorithm and the structure to which it is applied). For the purposes of this discussion we assume data has been collected from a suitable light aircraft making a number of landings on a variety of surfaces in a range of environmental conditions.
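To make SW-HLR-01 concrete, the sketch below trains a small classifier of the kind the example envisages. It is a minimal sketch: scikit-learn's MLPClassifier is assumed as a stand-in for the ANN, and the file landings.csv, its column names and the network size are hypothetical choices rather than part of the example's specification.

```python
# Minimal sketch of the SW-HLR-01 classifier: four sensor inputs,
# four landing status classes. MLPClassifier is an assumed stand-in
# for the ANN; the file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

FEATURES = ["Vert_Acc", "Horz_Acc", "Height", "Airspeed"]

data = pd.read_csv("landings.csv")  # hypothetical training set
X = data[FEATURES].to_numpy()
y = data["Status"].to_numpy()  # In_Air / On_Approach / Landing / Landed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The structure (layers and neurons), together with the data and the
# learning algorithm, fixes the learned behaviour: the 'abstract LLR'.
ann = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000,
                    random_state=0)
ann.fit(X_train, y_train)
print("Held-out accuracy:", ann.score(X_test, y_test))
```

Held-out accuracy alone addresses none of D1 to D9 directly; it is the content of the training set feeding fit() that carries the requirements, which is what the following paragraphs interrogate.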
For synthetic data, it too must be verified and validated to ensure that its form (syntax) and its intent apropos the HLR (semantics) match the expectation of real data. Approaches to this are not expected to differ from those discussed herein to ensure data correctness.

In our example, the number of operationally-generated samples is likely to be limited, partly through cost and partly through practicality (e.g. it is unlikely that there would be many, if any, landings performed with systems deliberately set in failure modes, even if this were possible). Consequently, the recorded (operationally-generated) data is applied to a generative data modelling tool that provides a larger data set, which is used to train the ANN.

To determine whether the nine areas of training data confidence have been addressed, a series of reviews, analyses and tests is required. The following paragraphs discuss the most significant of these; space limitations prevent a complete analysis of all nine areas.

For operationally-generated samples, traceability in the intent of data (D1) is straightforward, although unusual outlier examples may need close examination to assure validity. For synthetically generated data, ensuring the intent of the data would rely on detailed knowledge of the behaviour of both the UAV and the data-generating tool. Additionally, most safety-related systems also feature robustness requirements, which would need to be implemented through the inclusion of sufficient examples in the data (robustness is also considered in the discussion of D5, below).

The diversity of potential environmental conditions, compared to those that might be available during training data collection, would be a prime breeding ground for unintentional bias (D2). When this limited dataset is then introduced into the generative model, the potential becomes amplified. Detecting bias in data is difficult, but there are a number of ways it could be detected in the trained algorithm (Tan et al., 2017).

Whilst clearly there is a numerical aspect to the sufficiency of training data (D3), we are also concerned with the diversity of input data. Identifying areas of sparsity is a notable concern. In our indicative example, there may be limitations on the environmental conditions in which data can be collected. Understanding this may lead to restrictions being placed on operational use of the UAV until further data can be collected.

If a formally-structured process is used to record operational data and strong configuration control is applied, then data taken from real systems should be syntactically and semantically correct (D4). However, since our example includes synthetic data, care needs to be applied to ensure that data distributions match real-world expectations. Syntactic aspects relate to the structure and ranges of data, which can be tested using simple data analysis tools; an illustrative check follows this discussion. The semantic aspects are more challenging and should include tests for data poisoning[4] and for unintentional examples where small shifts in inputs cause large changes in output. Guidance for the management of safety-related data has been produced by several organisations, e.g. (DSIWG, 2018), and can be looked to for assistance in minimizing the potential for data poisoning. However, the developing body of knowledge in this area indicates that a stronger argument may be possible through the application of data poisoning detection tools (Steinhardt et al., 2017).

[4] Where an attacker can alter a small fraction of the training data.
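Both the sparsity analysis (D3) and the syntactic checks (D4) are amenable to the simple data analysis tools mentioned above. The sketch below shows one hedged form these might take; the column names, the plausible ranges and the bin count are assumptions, and a real programme would derive them from the sensor specifications and the operating envelope.

```python
# Illustrative D3/D4 checks over a hypothetical training set.
# Ranges and bin counts are invented for the sketch.
import pandas as pd

RANGES = {                      # assumed physical ranges per sensor
    "Vert_Acc": (-5.0, 5.0),    # g
    "Horz_Acc": (-5.0, 5.0),    # g
    "Height":   (0.0, 5000.0),  # ft
    "Airspeed": (0.0, 200.0),   # kt
}

def check_syntax(df: pd.DataFrame) -> pd.DataFrame:
    """D4 (syntactic): flag records with missing values or values
    outside the declared sensor ranges."""
    bad = df.isna().any(axis=1)
    for col, (lo, hi) in RANGES.items():
        bad |= (df[col] < lo) | (df[col] > hi)
    return df[bad]

def sparsity_report(df: pd.DataFrame, bins: int = 10) -> pd.Series:
    """D3 (sufficiency): count samples per Height/Airspeed cell to
    expose sparse regions of the input space."""
    h = pd.cut(df["Height"], bins)
    a = pd.cut(df["Airspeed"], bins)
    return df.groupby([h, a], observed=False).size()

data = pd.read_csv("landings.csv")  # hypothetical file, as before
print(check_syntax(data))           # candidate syntactic violations
print(sparsity_report(data).sort_values().head())  # sparsest cells
```

Sparse cells do not automatically fail the data; they identify regions where, as noted above, operational restrictions may be needed until further data can be collected.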
A safety-related system that only contains requirements addressing normal-range behaviour has the potential to become unsafe under abnormal conditions. As in traditional system requirements and software design, all reasonable failure conditions should be considered. In our indicative example, robustness cases (D5) would include sensors providing inaccurate readings as well as total failures (which may mean no reading is available). Inaccurate readings include data that is credible but incorrect, as well as data that is incredible. As with traditional designs, a conscious decision needs to be made about the extent to which the system can be expected to deal with failure conditions. Our indicative landing system takes four inputs; it might be reasonable to expect that one input could fail at any given time, so data that represents landings where Height = 100 ft but all other sensors are indicative of a landing would be a reasonable robustness case. However, simultaneous failure of three inputs would be unreasonable. Synthetic data is expected to be invaluable in generating sufficient data for training systems to behave robustly, since real data may be dangerous to collect.

Self-consistency (D6) can be more challenging than it might first appear. The large number of landings in multiple environments means that some results might be seemingly contradictory; equivalently, based on the available data, the landing classes may not be separable. For example, a really smooth landing in wet conditions may lead to input signals similar to In_Air (e.g. the laser altimeter might read a high altitude due to spurious reflections, and the accelerometers might read low Vert_Acc and Horz_Acc) but with a label of Landing. If that is the case, the underlying features of the contradictions may need to be explored, and the system may need to be redesigned (e.g. to provide additional sensor inputs that introduce new variables) for greater robustness. Fortunately, statistical analysis of the data may help: data that is distributionally dissimilar but has the same classification may be inconsistent. In many ways this can be seen as the inverse of the problem of detecting adversarial examples, where the distribution is similar but the classification is erroneously different.
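One simple statistical screen of the kind just described is sketched below: for each labelled class it standardises the features and flags samples that sit far from their class centroid, i.e. distributionally dissimilar data carrying the same classification. The distance threshold and column names are assumptions; this is a screen to prompt review, not a definitive inconsistency test.

```python
# Hedged D6 screen: flag samples far from their own class centroid
# in standardised feature space. The threshold is an assumption and
# flagged records are candidates for review, not proven errors.
import numpy as np
import pandas as pd

FEATURES = ["Vert_Acc", "Horz_Acc", "Height", "Airspeed"]
THRESHOLD = 3.0  # standardised distance; illustrative choice

def self_consistency_screen(df: pd.DataFrame) -> pd.DataFrame:
    flagged = []
    for _, group in df.groupby("Status"):
        x = group[FEATURES].to_numpy(dtype=float)
        mu = x.mean(axis=0)
        sigma = x.std(axis=0) + 1e-9  # guard against zero variance
        dist = np.linalg.norm((x - mu) / sigma, axis=1)
        flagged.append(group[dist > THRESHOLD])
    return pd.concat(flagged)

data = pd.read_csv("landings.csv")  # hypothetical file, as before
suspects = self_consistency_screen(data)
# A smooth wet landing whose sensors resemble In_Air but whose label
# is Landing should surface here for engineering review.
print(f"{len(suspects)} records flagged for review")
```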
The final three considerations (D7, D8 and D9) are regarded as unchanged from traditional safety-related software and are not discussed further. However, we note that the concept of verifiability is potentially different for ML-based systems, in that training data seeks to implement general behaviours and direct verification may not be possible.

Summary and Conclusion

This short position paper has established the concept that training data provides the functional requirements for a safety-related system developed using ML-based approaches.

It has shown that it may be possible to make claims that the intent of the HLR, passed down from system-level requirements, has been correctly maintained and implemented through the training data. Using traditional assurance concepts as a basis, we have developed a series of training data considerations that we argue could form the basis of an assurance activity.

These considerations can be addressed by a combination of sound data management and a collection of reviews, tests and analyses. Some of these are currently under development, but further work is required to produce a comprehensive toolset that may be used across a wide range of data sets.

In conclusion, we assert that any assurance claims regarding the requirements aspects of an ML-based safety-related system would, as a bare minimum, need to address the nine areas developed in this paper. We also encourage the safety-related ML community to test these areas, developing them further as required.

References

Ashmore, R., and Lennon, E. 2017. Progress Towards the Assurance of Non-Traditional Software. In Developments in System Safety Engineering: Proceedings of the Twenty-fifth Safety-Critical Systems Symposium. Bristol, UK.

Certification Authorities Software Team (CAST). 2003. Merging High-Level and Low-Level Requirements. Position Paper CAST-15, February 2003.

Data Safety Initiative Working Group (DSIWG). 2018. Data Safety Guidance, SCSC-127B. ISBN 978-1540887481.

Ekbatani, H. K., Pujol, O., and Segui, S. 2017. Synthetic Data Generation for Deep Learning in Counting Pedestrians. In 6th International Conference on Pattern Recognition Applications and Methods. SCITEPRESS - Science and Technology Publications.

Hawkins, R., Habli, I., and Kelly, T. 2013. The Principles of Software Safety Assurance. In 31st International System Safety Conference. Boston, Massachusetts, USA.

RTCA. 2011. Software Considerations in Airborne Systems and Equipment Certification. DO-178C.

Salay, R., and Czarnecki, K. 2018. Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262. arXiv preprint arXiv:1808.01614.

Srisakaokul, S., Wu, Z., Astorga, A., Alebiosu, O., and Xie, T. 2018. Multiple-Implementation Testing of Supervised Learning Software. In Proc. AAAI-18 Workshop on Engineering Dependable and Secure Machine Learning Systems (EDSMLS).

Steinhardt, J., Koh, P. W. W., and Liang, P. S. 2017. Certified Defenses for Data Poisoning Attacks. In Advances in Neural Information Processing Systems, 3517-3529.

Tan, S., Caruana, R., Hooker, G., and Lou, Y. 2017. Detecting Bias in Black-Box Models using Transparent Model Distillation. arXiv preprint arXiv:1710.06169.

Disclaimer

This article is an overview of UK MOD sponsored research and is released for informational purposes only. The contents of this article should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this article cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.

© Crown copyright (2019), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk.