Requirements Assurance in Machine Learning

Alec Banks and Rob Ashmore
Defence Science and Technology Laboratory, Salisbury, United Kingdom
abanks@dstl.gov.uk, rdashmore@dstl.gov.uk

Abstract

Training data is an important aspect of approaches that use Machine Learning techniques. More precisely, we assert that training data captures the requirements that should be satisfied by the trained algorithm. Hence, for safety applications, any argument relating to behavioural correctness has to consider how those requirements are embodied within the training data. To support this, based on approaches for requirements assurance in traditional safety-related software, we develop nine specific areas where confidence is required in training data. These are illustrated using a fictional example.

Introduction

This paper outlines the challenges associated with the assurance of requirements in safety-related Machine Learning (ML) systems.

All software operates within the context of the system in which it is executed. In traditional safety-related systems the behavioural requirements are first established at the system level and then decomposed and refined until the developer is able to unambiguously transfer the associated requirement into code. In ML-based systems the software behaviour is not dictated by requirements that have been decomposed to that level (Ashmore and Lennon, 2017). Instead, those requirements are implicitly provided via the training data.

This paper briefly looks at the way requirements are treated in existing safety standards. It goes on to discuss the concept of requirements in developments based on ML techniques. Building from existing approaches, a series of assurance considerations for ML requirements is developed and illustrated using a fictional, indicative example.

Treatment of Requirements in Established Software Standards

There are a number of software standards that may be applied to safety-related applications. These all seek to avoid the introduction of errors and to foster their rigorous removal. Whilst they are often domain specific, all of these standards have common characteristics, which were distilled into the '4+1' principles of software safety engineering (Hawkins et al., 2013). To achieve all of these principles, software safety assurance must:

• P1. Identify safety requirements at the system level;
• P2. Maintain the intent of these requirements throughout decomposition;
• P3. Demonstrably satisfy safety-related requirements in the implementation;
• P4. Identify hazardous behaviours introduced by the software and mitigate them; and
• P4+1. Provide a level of confidence in software behaviour that is commensurate with its contribution to system-level risk.

In safety-related applications these principles usually drive the software requirement decomposition to two distinct levels. High-Level Requirements (HLR) detail 'what is required' in the design. These are then systematically decomposed into Low-Level Requirements (LLR), which provide coders with information on 'how to implement' that design. To minimize ambiguity, LLR often include pseudo-code or mathematical formulae; a sketch of the level of detail involved is given below.
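The fragment below illustrates the form such an LLR might take once it reaches code. It is a minimal sketch: the identifier LLR-EX-01, the limit value and the function name are invented for illustration and are not drawn from any cited standard.

```python
# Hypothetical LLR expressed in executable form. The identifier,
# limit value and names are invented for illustration only.
#
# LLR-EX-01: If the commanded rate exceeds RATE_LIMIT in magnitude,
# the output shall be clamped to +/- RATE_LIMIT; otherwise the
# commanded value shall be passed through unchanged.

RATE_LIMIT = 5.0  # deg/s; illustrative value


def limit_rate(commanded_rate: float) -> float:
    """Implements LLR-EX-01: clamp the commanded rate to the limit."""
    if commanded_rate > RATE_LIMIT:
        return RATE_LIMIT
    if commanded_rate < -RATE_LIMIT:
        return -RATE_LIMIT
    return commanded_rate
```

The value of this form is that every branch traces directly to the requirement text, so satisfaction (P3) can be demonstrated by inspection and test. As discussed next, it is precisely this unambiguous decomposition that ML-based development lacks.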
Requirements Definition in an ML Context

In ML applications, the requirements for the software can be considered from two parallel, but related, perspectives. There are, firstly, the requirements for the construction of the learning algorithm and, secondly, the general requirements for behaviour. We use the term 'general' purposefully here: if it were possible to detail the exact behaviours expected from the software, then ML approaches would arguably be inappropriate for safety-related applications (Salay and Czarnecki, 2018).

For construction of the learning algorithm (e.g. back-propagation) it is possible (and desirable) to develop HLR and to further decompose these into LLR and on to implementation, verification and validation. This aspect of ML software is therefore not considered further in this paper, although we note that work is being conducted in this area, for example (Srisakaokul et al., 2018).

Moving on to the more challenging area of the behaviour of the trained algorithm: in an ML-based approach this is dictated by the training data, which may be real and/or synthetically generated (e.g. Ekbatani et al., 2017, for computer vision problems), combined with the learning algorithm and the structure (e.g. the number of neurons and layers in an artificial neural network) to which it is applied.

From this short discussion, and returning to the '4+1' principles, it is apparent that in the case of ML, principles P1 and P4+1 are arguably the only ones that can, based on current practices, be adequately satisfied. Principles P2, P3 and P4 all suffer to a greater or lesser degree because the exact behaviour cannot be detailed in LLR (P2), assured through verification (P3), or sufficiently predicted to permit potential hazard identification (P4). We note that some of these shortfalls may be resolved through behavioural containment (e.g. the use of monitors and alternative control structures within the wider system architecture), but to do so could also negate the benefits of adopting ML in the design.

Due to the challenges of addressing each of these principles individually, we suggest a holistic perspective may be more beneficial. Whilst all of the principles relate to assurance of requirements and their implementation, discussing all of them would be too broad an endeavour for this position paper. Consequently, we focus on the decomposition of the HLR into LLR (i.e. assurance of the requirements rather than their initial derivation or final implementation).

Since it determines the algorithm's behaviour, we can consider the training data to be an abstract form of the LLR. Hence, assurance of training data is paramount to gaining confidence. Using RTCA DO-178C (RTCA, 2011) as an example[1], traditional requirements verification seeks to ensure LLR are:

• R1. Compliant with HLR;
• R2. Accurate and consistent;
• R3. Compatible with the target computer[2];
• R4. Verifiable;
• R5. Conforming to standards;
• R6. Traceable; and
• R7. Algorithmically correct.

[1] DO-178C is a key software safety document for aircraft. It is sufficiently general to apply to most developments of safety-related software.
[2] The target computer is the one on which the algorithm will run during operational use. This is often different from the host computer, which is the one used to develop the algorithm.

Whilst all of these considerations arguably apply to ML-based applications, some do not easily translate. To achieve the same intent through the training data we need to have confidence that the data:

• D1. Relates to the intent of the HLR;
• D2. Does not contain bias;
• D3. Is sufficient;
• D4. Is syntactically and semantically correct;
• D5. Addresses normal and robustness behaviours;
• D6. Is self-consistent;
• D7. Conforms to standards;
• D8. Is compatible with the target computer; and
• D9. Is verifiable.

For ease of reference, Table 1 summarises the relationship between requirements traditionally placed on LLR (R1 to R7) and areas where confidence is needed in the training data (D1 to D9).

Table 1: Relationship between traditional requirements and areas of confidence in training data

       | R1 | R2 | R3 | R4 | R5 | R6 | R7
    D1 | X  |    |    |    |    | X  |
    D2 |    | X  |    |    |    |    |
    D3 | X  |    |    |    |    |    |
    D4 |    | X  |    |    |    |    | X
    D5 | X  |    |    |    |    |    |
    D6 |    | X  |    |    |    |    |
    D7 |    |    |    |    | X  |    |
    D8 |    |    | X  |    |    |    |
    D9 |    |    |    | X  |    |    |

It is readily apparent that each of the traditional LLR requirements is covered by at least one area of training data confidence. This suggests the identified areas are necessary. It does not mean they are sufficient (or complete). For example, there are also several 'meta-considerations' (i.e. those not directly concerned with performance but which contribute to design confidence). These include, inter alia: the data source; its control; forensic auditability; and the extent of pre-processing required.

To provide confidence that these areas are suitably addressed, there are a number of processes, reviews, analyses and tests that can be applied, the results of which should support a safety argument. Some of these approaches will now be discussed in the context of an indicative example.
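The necessity claim above, that every traditional LLR requirement is covered by at least one data confidence area, is mechanical enough to check automatically. The sketch below encodes Table 1 as a mapping and verifies the coverage property; the encoding is purely illustrative and forms no part of DO-178C.

```python
# Table 1 as a mapping from each training data confidence area
# (D1-D9) to the traditional LLR requirements (R1-R7) it covers.
# Illustrative encoding only; Table 1 remains the authoritative view.
COVERAGE = {
    "D1": {"R1", "R6"},  # intent of HLR -> compliance, traceability
    "D2": {"R2"},        # freedom from bias -> accuracy/consistency
    "D3": {"R1"},        # sufficiency -> compliance with HLR
    "D4": {"R2", "R7"},  # syntactic/semantic -> accuracy, correctness
    "D5": {"R1"},        # normal and robustness behaviours
    "D6": {"R2"},        # self-consistency -> accuracy/consistency
    "D7": {"R5"},        # conformance to standards
    "D8": {"R3"},        # target computer compatibility
    "D9": {"R4"},        # verifiability
}

ALL_R = {f"R{i}" for i in range(1, 8)}

covered = set().union(*COVERAGE.values())
# An empty result demonstrates the necessity claim: every traditional
# requirement is addressed by at least one data confidence area.
print("Uncovered requirements:", ALL_R - covered or "none")
```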
Indicative Example

To colour our discussion we adopt a fictional landing system that detects whether a medium-size Unmanned Air Vehicle (UAV) has landed on an unprepared surface. The associated system requirement might be:

• SYS-01: Detect landing on unprepared strip.

This would then be decomposed into software and hardware requirements. Given the environmental constraint of the unprepared strip, the conventional 'Weight-On-Wheels' switch approach[3] would not be technically feasible. Therefore a design decision is made to use a combination of hardware sensor systems (e.g. horizontal and vertical accelerometers, altimeters, air data, etc.) as inputs to an Artificial Neural Network (ANN) that will classify the landing status into one of the following categories: in-air; on approach; landing; landed.

[3] Usually consisting of a proximity switch located on the undercarriage, which uses the movement induced when weight is applied to the wheel to make an electrical circuit.

The hardware requirements are not considered further, but the high-level software requirement (HLR) might be:

• SW-HLR-01: Classify landing status: {In_Air; On_Approach; Landing; Landed} based on data from: Inertial Navigation System {Vert_Acc; Horz_Acc}; Laser Altimeter {Height}; Air Data Computer {Airspeed}.

The actual behaviour (LLR) of the ANN would be determined by the training data (together with the learning algorithm and the structure to which it is applied). For the purposes of this discussion we assume data has been collected from a suitable light aircraft making a number of landings on a variety of surfaces in a range of environmental conditions.
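To make SW-HLR-01 concrete, the sketch below trains a small classifier of the kind the example envisages. It is a minimal sketch: scikit-learn's MLPClassifier is assumed as a stand-in for the ANN, and the file landings.csv, its column names and the network size are hypothetical choices rather than part of the example's specification.

```python
# Minimal sketch of the SW-HLR-01 classifier: four sensor inputs,
# four landing status classes. MLPClassifier is an assumed stand-in
# for the ANN; the file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

FEATURES = ["Vert_Acc", "Horz_Acc", "Height", "Airspeed"]

data = pd.read_csv("landings.csv")  # hypothetical training set
X = data[FEATURES].to_numpy()
y = data["Status"].to_numpy()  # In_Air / On_Approach / Landing / Landed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The structure (layers and neurons), together with the data and the
# learning algorithm, fixes the learned behaviour: the 'abstract LLR'.
ann = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000,
                    random_state=0)
ann.fit(X_train, y_train)
print("Held-out accuracy:", ann.score(X_test, y_test))
```

Held-out accuracy alone addresses none of D1 to D9 directly; it is the content of the training set feeding fit() that carries the requirements, which is what the following paragraphs interrogate.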
For synthetic data, it too must be verified and validated to ensure that its form (syntax) and its intent apropos the HLR (semantics) match the expectation of real data. Approaches to this are not expected to differ from those discussed herein to ensure data correctness.

In our example, the number of operationally-generated samples is likely to be limited, partly through cost and partly through practicality (e.g. it is unlikely that there would be many, if any, landings performed with systems deliberately set in failure modes, even if this were possible). Consequently, the recorded (operationally-generated) data is applied to a generative data modelling tool that provides a larger data set, which is used to train the ANN.

To determine whether the nine areas of training data confidence have been addressed, a series of reviews, analyses and tests is required. The following paragraphs discuss the most significant of these; space limitations prevent a complete analysis of all nine areas.

For operationally-generated samples, traceability in the intent of data (D1) is straightforward, although unusual outlier examples may need close examination to assure validity. For synthetically generated data, ensuring the intent of the data would rely on detailed knowledge of the behaviour of both the UAV and the data-generating tool. Additionally, most safety-related systems also feature robustness requirements, which would need to be implemented through the inclusion of sufficient examples in the data (robustness is also considered in the discussion of D5, below).

The diversity of potential environmental conditions, compared to those that might be available during training data collection, would be a prime breeding ground for unintentional bias (D2). When this limited dataset is then introduced into the generative model, the potential becomes amplified. Detecting bias in data is difficult, but there are a number of ways it could be detected in the trained algorithm (Tan et al., 2017).

Whilst clearly there is a numerical aspect to the sufficiency of training data (D3), we are also concerned with the diversity of input data. Identifying areas of sparsity is a notable concern. In our indicative example, there may be limitations on the environmental conditions in which data can be collected. Understanding this may lead to restrictions being placed on operational use of the UAV until further data can be collected.

If a formally-structured process is used to record operational data and strong configuration control is applied, then data taken from real systems should be syntactically and semantically correct (D4). However, since our example includes synthetic data, care needs to be applied to ensure that data distributions match real-world expectations. Syntactic aspects relate to the structure and ranges of data, which can be tested using simple data analysis tools; an illustrative check follows this discussion. The semantic aspects are more challenging and should include tests for data poisoning[4] and for unintentional examples where small shifts in inputs cause large changes in output. Guidance for the management of safety-related data has been produced by several organisations, e.g. (DSIWG, 2018), and can be looked to for assistance in minimizing the potential for data poisoning. However, the developing body of knowledge in this area indicates that a stronger argument may be possible through the application of data poisoning detection tools (Steinhardt et al., 2017).

[4] Where an attacker can alter a small fraction of the training data.
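Both the sparsity analysis (D3) and the syntactic checks (D4) are amenable to the simple data analysis tools mentioned above. The sketch below shows one hedged form these might take; the column names, the plausible ranges and the bin count are assumptions, and a real programme would derive them from the sensor specifications and the operating envelope.

```python
# Illustrative D3/D4 checks over a hypothetical training set.
# Ranges and bin counts are invented for the sketch.
import pandas as pd

RANGES = {                      # assumed physical ranges per sensor
    "Vert_Acc": (-5.0, 5.0),    # g
    "Horz_Acc": (-5.0, 5.0),    # g
    "Height":   (0.0, 5000.0),  # ft
    "Airspeed": (0.0, 200.0),   # kt
}

def check_syntax(df: pd.DataFrame) -> pd.DataFrame:
    """D4 (syntactic): flag records with missing values or values
    outside the declared sensor ranges."""
    bad = df.isna().any(axis=1)
    for col, (lo, hi) in RANGES.items():
        bad |= (df[col] < lo) | (df[col] > hi)
    return df[bad]

def sparsity_report(df: pd.DataFrame, bins: int = 10) -> pd.Series:
    """D3 (sufficiency): count samples per Height/Airspeed cell to
    expose sparse regions of the input space."""
    h = pd.cut(df["Height"], bins)
    a = pd.cut(df["Airspeed"], bins)
    return df.groupby([h, a], observed=False).size()

data = pd.read_csv("landings.csv")  # hypothetical file, as before
print(check_syntax(data))           # candidate syntactic violations
print(sparsity_report(data).sort_values().head())  # sparsest cells
```

Sparse cells do not automatically fail the data; they identify regions where, as noted above, operational restrictions may be needed until further data can be collected.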
A safety-related system that only contains requirements addressing normal-range behaviour has the potential to become unsafe under abnormal conditions. As in traditional system requirements and software design, all reasonable failure conditions should be considered. In our indicative example, robustness cases (D5) would include sensors providing inaccurate readings as well as total failures (which may mean no reading is available). Inaccurate readings include data that is credible but incorrect, as well as data that is incredible. As with traditional designs, a conscious decision needs to be made about the extent to which the system can be expected to deal with failure conditions. Our indicative landing system takes four inputs; it might be reasonable to expect that one input could fail at any given time, so data that represents landings where Height = 100 ft but all other sensors are indicative of a landing would be a reasonable robustness case. However, simultaneous failure of three inputs would be unreasonable. Synthetic data is expected to be invaluable in generating sufficient data for training systems to behave robustly, since real data may be dangerous to collect.

Self-consistency (D6) can be more challenging than it might first appear. The large number of landings in multiple environments means that some results might be seemingly contradictory; equivalently, based on the available data, the landing classes may not be separable. For example, a really smooth landing in wet conditions may lead to input signals similar to In_Air (e.g. the laser altimeter might read a high altitude due to spurious reflections, and the accelerometers might read low Vert_Acc and Horz_Acc) but with a label of Landing. If that is the case, the underlying features of the contradictions may need to be explored, and the system may need to be redesigned (e.g. to provide additional sensor inputs that introduce new variables) for greater robustness. Fortunately, statistical analysis of the data may help: data that is distributionally dissimilar but has the same classification may be inconsistent. In many ways this can be seen as the inverse of the problem of detecting adversarial examples, where the distribution is similar but the classification is erroneously different.
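One simple statistical screen of the kind just described is sketched below: for each labelled class it standardises the features and flags samples that sit far from their class centroid, i.e. distributionally dissimilar data carrying the same classification. The distance threshold and column names are assumptions; this is a screen to prompt review, not a definitive inconsistency test.

```python
# Hedged D6 screen: flag samples far from their own class centroid
# in standardised feature space. The threshold is an assumption and
# flagged records are candidates for review, not proven errors.
import numpy as np
import pandas as pd

FEATURES = ["Vert_Acc", "Horz_Acc", "Height", "Airspeed"]
THRESHOLD = 3.0  # standardised distance; illustrative choice

def self_consistency_screen(df: pd.DataFrame) -> pd.DataFrame:
    flagged = []
    for _, group in df.groupby("Status"):
        x = group[FEATURES].to_numpy(dtype=float)
        mu = x.mean(axis=0)
        sigma = x.std(axis=0) + 1e-9  # guard against zero variance
        dist = np.linalg.norm((x - mu) / sigma, axis=1)
        flagged.append(group[dist > THRESHOLD])
    return pd.concat(flagged)

data = pd.read_csv("landings.csv")  # hypothetical file, as before
suspects = self_consistency_screen(data)
# A smooth wet landing whose sensors resemble In_Air but whose label
# is Landing should surface here for engineering review.
print(f"{len(suspects)} records flagged for review")
```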
The final three considerations (D7, D8 and D9) are regarded as unchanged from traditional safety-related software and are not discussed further. However, we note that the concept of verifiability is potentially different for ML-based systems, in that training data seeks to implement general behaviours and direct verification may not be possible.

Summary and Conclusion

This short position paper has established the concept that training data provides the functional requirements for a safety-related system developed using ML-based approaches.

It has shown that it may be possible to make claims that the intent of the HLR, passed down from system-level requirements, has been correctly maintained and implemented through the training data. Using traditional assurance concepts as a basis, we have developed a series of training data considerations that we argue could form the basis of an assurance activity.

These considerations can be addressed by a combination of sound data management and a collection of reviews, tests and analyses. Some of these are currently under development, but further work is required to produce a comprehensive toolset that may be used across a wide range of data sets.

In conclusion, we assert that any assurance claims regarding the requirements aspects of an ML-based safety-related system would, as a bare minimum, need to address the nine areas developed in this paper. We also encourage the safety-related ML community to test these areas, developing them further as required.

References

Ashmore, R., and Lennon, E. 2017. Progress Towards the Assurance of Non-Traditional Software. In Developments in System Safety Engineering: Proceedings of the Twenty-fifth Safety-Critical Systems Symposium. Bristol, UK.

Certification Authorities Software Team (CAST). 2003. Merging High-Level and Low-Level Requirements. Position Paper CAST-15, February 2003.

Data Safety Initiative Working Group (DSIWG). 2018. Data Safety Guidance, SCSC-127B. ISBN 978-1540887481.

Ekbatani, H. K., Pujol, O., and Segui, S. 2017. Synthetic Data Generation for Deep Learning in Counting Pedestrians. In 6th International Conference on Pattern Recognition Applications and Methods. SCITEPRESS - Science and Technology Publications.

Hawkins, R., Habli, I., and Kelly, T. 2013. The Principles of Software Safety Assurance. In 31st International System Safety Conference. Boston, Massachusetts, USA.

RTCA. 2011. Software Considerations in Airborne Systems and Equipment Certification. DO-178C.

Salay, R., and Czarnecki, K. 2018. Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262. arXiv preprint arXiv:1808.01614.

Srisakaokul, S., Wu, Z., Astorga, A., Alebiosu, O., and Xie, T. 2018. Multiple-Implementation Testing of Supervised Learning Software. In Proc. AAAI-18 Workshop on Engineering Dependable and Secure Machine Learning Systems (EDSMLS).

Steinhardt, J., Koh, P. W. W., and Liang, P. S. 2017. Certified Defenses for Data Poisoning Attacks. In Advances in Neural Information Processing Systems, 3517-3529.

Tan, S., Caruana, R., Hooker, G., and Lou, Y. 2017. Detecting Bias in Black-Box Models using Transparent Model Distillation. arXiv preprint arXiv:1710.06169.

Disclaimer

This article is an overview of UK MOD sponsored research and is released for informational purposes only. The contents of this article should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this article cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.

© Crown copyright (2019), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk.