Using Complementary Risk Acceptance Criteria to Structure Assurance Cases for Safety-Critical AI Components Michael Kläs, Rasmus Adler, Lisa Jöckel, Janek Groß, Jan Reich Fraunhofer IESE, Kaiserslautern, Germany {michael.klaes, rasmus.adler, lisa.joeckel, janek.gross, jan.reich}@iese.fraunhofer.de Abstract Unfortunately, existing safety standards are difficult to ap- ply in the context of AI [Salay and Czarnecki, 2018] and re- Artificial Intelligence (AI), particularly current Ma- visions are still ongoing [ISO/IEC, 2021]. Therefore, we cur- chine Learning approaches, promises new and inno- rently do not have any standards that we can easily apply for vative solutions also for realizing safety-critical certifying AI. functions. Assurance cases can support the potential Argument safety claims with assurance cases (ACs) as an certification of such AI applications by providing an established approach in safety engineering may provide an assessable, structured argument explaining why alternative basis for audits and certification in the context of safety is achieved. Existing proposals and patterns AI [BSI, 2021]. They could structure the arguments for those for structuring the safety argument help to structure parts of a solution that are individual and highly innovative. safety measures, but guidance for explaining in a Moreover, they could establish the basis for upcoming evi- concrete use case why the safety measures are actu- dence-based standards for AI certification. ally sufficient is limited. In this paper, we investi- Initial proposals on how to apply the concept of ACs to AI gate this and other challenges and propose solutions. can be found in the literature. A prominent strategy is to argue In particular, we propose considering two comple- the safety objectives and safety requirements [Gauerhof et al., mentary types of risk acceptance criteria as assur- 2020]. As the proposed strategy and patterns abstract from ance objectives and provide, for each objective, a specific safety objectives and derived safety requirements, structure for the supporting argument. We illustrate such approaches also largely abstract from AI-specific safety our proposal on an excerpt of an automated guided concerns and required safety measures. Guidance for achiev- vehicle use case and close with questions triggering ing and arguing safety is thus inherently limited. further discussions on how to best use assurance One approach for overcoming this limitation is to argue us- cases in the context of AI certification. ing known AI-related safety concerns and how they are ad- dressed by AI-specific safety measures [Schwalbe et al., 1 Introduction 2020]. A disadvantage is that it is hard to argue completeness AI, which in this paper we understand as complex data-driven for the identified and addressed safety concerns. Further- models provided by Machine Learning (ML), promises im- more, such approaches can not explain yet what safety proved or additional functionalities that are essential for au- measures and metrics with the respective thresholds need to tonomous systems, e.g., perception for self-driving vehicles. be applied to achieve a defined level of safety. To give just In many cases, such functionalities are safety-critical, so it is one example, neither practical experience nor empirical evi- highly likely that AI becomes safety-critical as well, meaning dence exists on defining a specific neuron coverage level that that its failure can contribute to accidents. There are already would be considered as sufficient when testing a deep neural various reports on fatal accidents due to AI-related failures in network for a concrete application. autonomous vehicles [Pietsch, 2021; Wakabayashi, 2018]. We think that the concepts and ideas introduced in existing In consequence, regulation [European Commission, 2021] AC proposals can be aligned in a more comprehensible and and certification for AI in safety-critical components is being convincing argumentation if the risk acceptance criteria on proposed. Regulation and certification are powerful means to which the question of ‘How safe is safe enough?’ is founded, prevent the market introduction of unsafe products. This con- is made explicit in the AC structure itself. We will show that tributes not only to safety but also to the economy as a few this allows, on the one hand, becoming explicit with respect unsafe products could affect user acceptance of all similar to AI-specific safety measures and, on the other hand, products. The predictability of legal decisions can thus con- soundly arguing higher-level safety-objectives. tribute to economic success as long as liability risk and costs Contribution. Specifically, we propose using an AC struc- for complying with regulations and standards are not unrea- ture that splits at an early stage into two main claims and sonably high and hinder meaningful innovations. related arguments. The first claim refers to the achievement Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) of a probabilistic target value with a certain level of confi- context. Quality modeling approaches can contribute to a dence derived from applying a quantitative risk acceptance more comprehensive list of quality requirements [Mayr et al., criteria. The second claim is that the risk due to “failures” 2012]. Siebert et al. [2021] propose a systematic approach for caused by the AI is as low as reasonably practicable due to building such a quality model for a concrete AI-based system safety measures applied during the AI lifecycle. In the ab- that defines the required aspects for each entity of the AI- sence of evidence-based target values for specific safety based system and how they can be measured. Still, further measures, we propose to monitor quality assurance activities research is needed to better understand (1) to which extent an on a cost-benefit base and define respective stop criteria. evidence generated by a certain method contributes to argu- This ensures, on the one hand, that quantitative objectives ing safety, (2) what suitable performance indicators for the are explicitly argued and underpinned with evidences. On the evidences are, and (3) when a certain method should be pre- other hand, the argumentation over the proposed lifecycle ferred over another for a given context. stages contributes to a more comprehensive and justifiable derivation of reasonable safety measures but without the need 2.2 Assurance Cases for predefine targets for specific safety measures. The aim of ACs are heavily used in practice to assure safety. In particu- this paper is to stimulate the discussion about how to argue lar, if it is very challenging to argue safety, as in the case of safety for AI-based functions by rethinking traditional AC autonomous systems. In recent years, standards like UL 4600 patterns and strategies. [UL, 2021] or reports [Zenzic, 2020] have addressed the de- Structure. The remainder of this paper is structured as fol- velopment of such AC. The application rule VDE-AR-E lows: First, we give some background on quality assurance in 2842-61 [VDE, 2020] already proposes using ACs also for the context of AI and introduce the concept of ACs as applied other critical aspects of trustworthiness, such as fairness, as in safety engineering (Sec. 2). Next, we discuss existing pro- illustrated by Hauer et al. [2021]. posals on how ACs could be used in the context of AI (Sec. An AC is defined as a reasoned, auditable created artifact 3). Then we introduce an example use case and illustrate our that supports the contention that its top-level claim (or set of proposal for structuring ACs (Sec. 4). Finally, we discuss a claims) is satisfied, including systematic argumentation and selection of open question (Sec. 5) and conclude the paper the underlying evidence and explicit assumptions that support with an outlook on possible implications (Sec. 6). the claim(s) [ISO/IEC/IEEE, 2019]. The left part of Fig. 1 illustrates the three main building 2 Background blocks of an AC: (1) its top-level claims typically referring to achieved objectives or fulfilled constraints, (2) an argumen- 2.1 Quality Assurance for AI tation supporting the top-level claims, and (3) evidences on AI-based software components raise new challenges for qual- which the argument is based. The right part illustrates the ar- ity assurance due to their functionality being derived from gumentation in a tree structure and its assumptions. The tree data. Commonly, challenges and safety concerns like lack of is built from reasoning steps that connect lower-level claims specification or interpretability are described [Adler et al., with a higher claim that can be concluded from these lower- 2019; Ashmore et al., 2019; Felderer and Ramler, 2021; level claims. If the conclusion is only valid under some as- Sämann et al., 2020; Willers et al., 2020]. Several papers col- sumptions, these assumptions shall be made explicit. lect existing methods and map them to mentioned challenges There are different languages for modeling ACs, like the [Adler et al., 2019; Sämann et al., 2020; Schwalbe and Schels Goals Structuring Notation (GSN) [SCSC, 2018] or Claim 2019; Willers et al., 2020]. This raises two questions: whether Argument Evidence Notation [Adelard LLP, 2021]. The the list of safety concerns is complete, and to which extent common meta-model of these languages is defined in the the available methods sufficiently address the safety concerns Structured Assurance Case Metamodel (SACM) [OMG, [Adler et al., 2019]. We are currently not aware of any work 2020]. This paper do not refer to a specific language but focus that could provide a sufficient answer on these questions. on the fundamental idea of structuring the argument. Another approach is to structure possible quality assurance activities and measures according to the phases of the AI lifecycle in which they are applied. Studer et al. [2021] pro- pose, for example, a process model based on CRISP-DM, which is often used in data analysis projects, introducing a quality assurance methodology for each project phase. Ash- more et al. [2019] provide a survey of quality assurance meth- ods generating evidences for key assurance requirements be- ing met in each phase of the AI lifecycle. Here, there is a need to show that the quality assurance methods applied during a phase address all assurance requirements related to this phase, and that the list of assurance requirements is complete. However, it is difficult to obtain a complete list of quanti- tative quality assurance requirements. These strongly depend Figure 1: Building blocks and general structure of an AC on the task of the AI-based component and its application 3 Related Work the claim of sufficient safety for the given application in a convincing manner. From a safety perspective, ACs are considered a promising We pinpointed as potential cause of these problems the fact approach for arguing safety for AI-based systems, and vari- that the risk acceptance criterion underlying the top-level ous authors have already proposed strategies and patterns. claim on which the argumentation is based is either implicit Picardi et al. [2019] presented an AC pattern for ML mod- or different criteria are mixed and are thus not easy to distin- els in clinical diagnosis systems, which they later refined and guish during refinement. We therefore claim that a clear dif- supplemented by a process for generating evidences during ferentiation will allow more specific argumentation patterns the ML lifecycle [Picardi et al., 2020]. The activities and de- and better attribution of evidences to sub-claims. siderata during the ML lifecycle are referred from Ashmore et al. [2019]. The ML assurance claim is argued based on ML safety requirements, operating environment, ML model, de- 4 Building Safety Assurance Cases for AI velopment, and test data. In this context, the link between sys- In this section, we will first introduce the example we will tem safety requirements and ML safety requirements is ad- use to illustrate our concepts. Then we will motivate the con- dressed [Gauerhof et al., 2020]. In the recently published sideration of a combination of two risk acceptance criteria to AMLAS report, Hawkins et al. [2021] also provide generic ar- structure ACs for AI. Finally, we will introduce a lifecycle gument patterns and a process for ML safety assurance scop- model and use it to argue completeness of the provided re- ing, ML safety requirements, ML data, model learning, finement. model verification, and model deployment. Wozniak et al. [2020] propose an argument pattern for 4.1 Background of the Selected Example safety assurance that is aligned with the reasoning for soft- Automated guided vehicles (AGV) are driverless vehicles ware and hardware in ISO 26262. They argue satisfaction of that transport material. They are used in industrial applica- an ML safety requirement over correctly decomposing the tions for realizing the flow of material and their safety con- safety requirements into sub-requirements and their satisfac- cepts do not rely on AI [DIN, 2017]. However, their applica- tion, appropriate data acquisition, model design, as well as tion is limited due to limited understanding of the environ- implementation and training of the ML model. ment and their safety concept. Autonomous mobile robots A strategy that does not argue the fulfillment of ML safety (AMR) overcome these disadvantages compared to operator- requirements is provided by Gauerhof et al. [2018]. They ar- controlled vehicle by using more sensors and AI. However, gue that the intended functionality is met by a sufficient re- the goal of achieving similar performance and flexibility as duction of the root causes of functional insufficiencies, which an operator-controlled vehicle is hard to realize without using encompass underspecification, semantic and deductive gap. AI in safety-critical functions like collision avoidance. Oper- Based on previous works [Schwalbe and Schels, 2019; ators of forklifts adapt their speed and safety distance accord- 2020], Schwalbe et al. [2020] propose arguing the sufficient ing to various aspects of the persons at risk, including speed, absence of risk for deep neural networks (DNN) arising from motion path, eye contact, hand gestures signaling right of the insufficiencies they see in their black-box nature, simple way, etc. To implement a conservative version of such a hu- performance issues, incorrect internal logic, and instability. man-like collision avoidance system, the AMR needs an AI- They propose a collection of measures to address these insuf- based component that understands whether a person at risk ficiencies, which include V&V as well as best practices dur- has recognized the AMR and gives way to it. A critical failure ing the creation of DNNs and on the system level. in this context is that the AMR falsely detects the signaling In summary, our review indicates that existing work is of right of way. Such safety-critical false detections have to driven by the safety community, which adapts established be avoided sufficiently to assure that the AMR drives as least safety patterns and concepts to AI. However, the presented as safe as an operator. patterns are still on a rather abstract level, and the applicabil- ity on a concrete use case comprehensively illustrated from 4.2 What does sufficient mean? the top-level claim down toward the evidences has not been The answer to the question of what sufficient means to pre- described yet so far. This might indicate that transferring tra- vent a safety-critical failure like ‘false detections of a human ditional patterns to AI-based systems proves to be difficult. gesture’ depends on the related risks and the risk acceptance We observed two major challenges in argumentation for criteria, as safety is defined as acceptable risk [IEC, 2010]. which existing strategies and patterns still provide insuffi- We should keep two important aspects in mind when dis- cient support. (1) Completeness in the refinement of claims cussing criteria for risk acceptance in settings where AI is part in sub-claims appears difficult to show, especially, when ap- of a safety-critical function: (a) AI is an emerging technology proaches argue over the refinement of safety requirements to that is still heavily in flux, with unforeseeable developments AI/ML requirements or about addressing ML insufficiencies. and improvements in the upcoming years. Thus, coming up For example, if we have a (most likely) incomplete list of in- with a fixed set of safety measures does not appear to be rea- sufficiencies, we cannot argue about addressing each insuffi- sonable. The argument that these safety measures minimize ciency. (2) Considering the current state of AI quality assur- risks as far as reasonably practicable easily becomes invalid. ance, the proposed patterns commonly struggle with bridging Besides, it would be hard to argue that these measures are as the gap between a low-level quantitative evidence, e.g., effective as existing ones in safety standards for traditional achieving a specific neuron coverage during AI testing, and software. (b) AI is also mainly applied to realize functions context, but not specific for AI. We do also not discuss how that cannot be provided yet by traditional technological solu- to get from the target failure rate to a target upper boundary tions. on the uncertainty for AI outcomes. Instead, we assume in A risk acceptance criterion that seems reasonable to apply our example that we would end up with a PRB-derived upper in the context of AI – considering (a) – states that the residual boundary on safety-related uncertainty (u) that we could ac- risk after the application of safety measures should be As Low cept for the AI outcomes: ‘The AI must not falsely detect a As Reasonably Practicable (ALARP). The meaning of ‘rea- signal for the right of way that was not actually given in more sonably practicable’ is not static but depends on the state of than one of N cases’. the technology and the intended application, including the un- Fig. 2 illustrates the relationship between ALARP and a derlying business case and related practical restrictions. Con- target-based criterion such as MEM, GAMAB, or PRB when sidering ALARP as part of the argumentation assures that providing arguments that an AI-based solution is safe. when progress in technology allows for safer solutions, we ALARP can be considered as requesting a certain alpha will see progress in safety. given by the ratio between the reduction of safety-related un- However, doing one’s best to avoid and mitigate risks is certainty in the AI outcomes and the required effort/cost. obviously not enough to argue that the best was sufficient. Given the business case for the planned solution and the state Accordingly, ALARP is only used in an ALARP region, of technology, this alpha might vary and is achieved in Fig. 2 which is the region between an upper tolerance limit marking at point B. Simply speaking, we request that as long as safety unacceptable risk and a lower tolerability limit. Having this measures exist that would increase safety with reasonable in- in mind is of crucial importance when applying ALARP to vestment, they are carried out. How this rather abstract con- AI since the current state of AI technology might not be ad- straint can be further refined will be discussed in the context vanced enough to realize a given application in a sufficiently of the AI lifecycle presented in Sec. 4.3. safe manner. For example, a state-of-the-art traffic sign The upper boundary on acceptable safety-related uncer- recognition algorithm might get one of 200 stop signs wrong tainty u that is derived from the target-based criterion is illus- [INI, 2019]. If used as part of an autonomous vehicle, it may, trated in Fig. 2 as a horizontal line. We consequently need to as a result, regularly ignore someone's right of way. The al- argue that we are confident that the actual safety-related un- gorithm might be as good as reasonably practicable but is still certainty is below u. Please note that this is not achieved at not sufficiently safe to be applied in this specific application. point A, but first at C, which we will discuss further, includ- Thus, we need at least a second risk acceptance criterion ing its implications when talking about testing in Sec. 4.3. that gives us a fixed limit. Finally, we will always end up in one of two kinds of situ- Most existing products have been developed according to ations: a situation where the target-based criterion dominates, functional safety standards that follow the risk acceptance i.e., it defines the required investment (cf. Fig. 2), or a situa- criterion Minimum Endogenous Mortality (MEM). The idea tion where ALARP dominates the required investment. An of MEM is that a technical system must not create a signifi- interesting question, which is, however, not directly related cant risk compared to globally existing risks. For example, a to safety, is whether a solution requiring more investment product should cause a minimal increase in overall death rates than reasonably practicable should actually be targeted. compared to the existing population death rates. This idea leads to very challenging safety requirements and low target Uncertainty of failure rates. Depending on the specific task, such low failure AI Failure rates might be hard to achieve in practice if AI is involved. An alternative criterion given a fixed target is Globalement au moins aussi bon (GAMAB), which says that new technical systems shall be at least as safe as comparable existing ones. However, due to (b) it is hardly applicable in case of many PRB  u: maximum acceptable uncertainty (A) AI-based functions because no technical systems exist yet satisfied on confidence level cl (C) that provide similar functions. An approach related to GAMAB is the idea of having a ‘positive risk balance’ (PRB). PRB is defined in ISO/TR 4804 as the ‘benefit of sufficiently mitigating residual risk of traffic participation due to automated vehicles’ together with Note 1 ‘This includes the expectation that automated vehicles cause less crashes (3.7) on average compared to those made u by drivers’ [ISO/TR, 2020]. The idea of comparing the new A B C AI-based solution with the existing sociotechnical system can lead to less challenging target failure rates compared to Effort/ MEM. This opens up new opportunities for arguing safety. Cost ALARP  α: reasonable Investment required In this paper, we do not discuss how to use this opportunity cost-benefit ratio (B) to satisfy ALARP + PRB to derive a target failure rate for an AI-based safety-critical function, as this is very specific for the function and its usage Figure 2: Implications of considering two risk acceptance criteria 4.3 Arguing considering the AI lifecycle resources. Although the AI specification has some special- As illustrated above, it seems reasonable to argue two sepa- ties, activities are largely AI-independent. Nevertheless, it is rate risk acceptance criteria. It is also advisable to argue each a key phase for both types of risk acceptance criteria. A suf- criterion independently. Important for the argumentation, es- ficiently complete and correct specification is a prerequisite pecially for the argumentation of ALARP, are strategies that to assuring that the safety risk will be as low as reasonable practicable by proving guardrails for the subsequent phases, assure that the refinement of the claims into sub-claims is complete. An accepted way, which we also consider as most but it also constitutes the AI-specific safety target and the scope in which this target has to be achieved. For example, promising, is to use a lifecycle model to argue completeness and localize safety measures. the AI must not falsely detect a signal for the right of way that was not actually given in more than one of N cases in its pre- The lifecycle model for AI components presented in Fig. 3 builds on existing work, in particular on the work of Ashmore viously defined target application scope. et al. [2019] and Gauerhof et al. [2020]. We adapted their Construction is an AI-specific phase. Its objective is to proposals. The objective was to achieve an even clearer sep- build a model from a training dataset that is able to fulfill the AI task in the target application scope considering the re- aration and better assignability of datasets, objectives and corresponding safety measures to the individual phases. In quirements and constraints defined in the specification. During construction, many design decisions have to be addition, we tried to keep the phases sufficiently generic to be applicable for the various development processes in data made, e.g., on the kind of model and its hyperparameters in- science projects that we are aware of. cluding topology, learning algorithm, stop criteria, etc. Many of these decisions are based on trial and error, taking We distinguish between specification, construction, anal- ysis, testing, and operation. The proposed lifecycle model ex- into account experience, so construction is a highly iterative plicitly does not include a 'data' phase. Subsuming data-re- process in a close feedback loop with the analysis phase. We will not be able to show during construction that we lated activities in single phase neither matches reality nor gives weight to the topic of data, which is at the core of any achieved a certain quantitative target, since we focus on fit- ting but not soundly testing the model. Thus, this phase plays AI lifecycle. Especially since different data with different no role in arguing regarding the target-based risk acceptance qualities are consumed in different phases, we modeled indi- vidual data lifecycles as parallel streams that provide the criteria. However, considering quality assurance measures foundation for the evidences created in the AI lifecycle. during construction is important to argue ALARP. The ap- plied quality assurance measures should be guided by the Top level safety target, but also by other quality requirements and con- Sufficiently safe AI-related claim straints identified during the specification. Commonly, it is Top level Argue about a not possible to define fixed success criteria for the different AI-related strategy combination of risk quality measures. For example, in most cases, it would not be acceptance criteria reasonable to enforce a specific type of model or topology, or Target is Claims making Safety risk is as low as Quantitative safety derived request a maximum batch size m and run at least e epochs. risk acceptance criteria explicit reasonable practicable target is satisfied from PRB Instead, we propose analyzing and monitoring the efficiency or MEM of the measures carried out and stopping in accordance with Strategy to argue completeness of Argue about relevant Argue about relevant ALARP if a reasonable saturation is achieved. For example, lifecycle phases lifecycle phases refinement if performing a random search on appropriate hyperparameter AI lifecycle values, the search shall continue as long as the model shows phases reasonable improvements. Analysis is also an AI-specific phase that is performed in Specification Construction Analysis Testing Operation a close feedback loop with construction to provide guardrails for improving construction and indicating the achievement of with quality assurance Data Flow measures (founded on saturation for constructive quality assurance measures. Anal- appropriate datasets) ysis comprises besides means for explainability also “testing” Training Analysis Test Data lifecycle with Data Data Data the model on validation data to estimate and monitor the measures to assure appropriate data Training Data Analysis Data Test Data model performance with respect to the safety target. How- Existing Data Lifecycle Lifecycle Lifecycle ever, although techniques are applied that are similar to the techniques applied in the testing phase, the analysis phase dif- Figure 3: AI lifecycle phases with mapping to risk criteria fers from the testing phase in that objective is to gather in- The proposed separation also results in the fact that certain sides to further improve the AI model rather than provide ev- phases exclusively contribute either to argue ALARP or the idence for the achievement of the specified safety target. Therefore, the quality assurance measures in the analysis target-based risk acceptance criterion, as we will show. phase help to argue ALARP but do not contribute to arguing Specification considers, among other things, the definition regarding the target-based risk acceptance criteria. In analogy of the AI task, the target application scope [Kläs and Vollmer, to the construction phase, it is difficult to define a priori tar- 2018], which is comparable to the operation design domain in gets for most quality measures in the analysis phase. Rather, automotive, and safety-relevant as well as other quality re- their effect and thus their potential contribution to the safety quirements including system constraints like computational target must be monitored and continuously evaluated. and to use market monitoring to strengthen the argument. Testing is also commonly considered to be AI-specific. This approach is described already by the Safety Performance Unlike analysis, the objective of the testing phase is to gen- Indicator [Koopman and Wagner, 2019] or GQM+Strategies erate evidences on the achievement of the quantitative safety [Basili et al., 2010] but it needs to be tailored to the focused target. In providing these evidences, testing depends on the argument for AI. By evaluating the reasoning with data, a ma- specification, including the definition of the AI task and the ture body of knowledge can be developed over time and re- target application scope. Moreover, it relies on specific qual- flected in safety standards for AI. ities of the test data that are not so relevant, for example, for Considering standardization, we see three options for using training data, such as that the data fulfills some representa- ACs. The first is to demand in a safety standard the develop- tiveness criteria and that it was not used previously during ment of an AC for the considered product. The second is to construction or analysis. Since a test dataset can always pro- describe in product- or domain-specific safety standards a ge- vide only a sample of all possible cases in the target applica- neric AC that shall be instantiated. The third is to develop a tion scope, we need to underpin the evidence on satisfying product- or domain-specific AC and use this AC to develop a the safety target with some statistical confidence (cf. Fig. 2) checklist-based safety standard where safety measures are [Kläs and Sembach, 2019]. The confidence level cl, which is chosen depending on the specific criticality/integrity level. independent of the target, may be set based on criticality or Considering certification, we see two main aspects. The requested integrity. For example, we might request that the first is that the AC needs to comply with the standard describ- probability that we falsely confirm our target ‘The AI must ing what the AC should look like. The second and more im- not falsely detect a signal for the right of way that was not portant aspect is that the AC itself needs to be sound, so that actually given in more than one of N cases in its previously it can be accepted by the certification body. The challenge defined target application scope.’ is less than 1-cl = 0.0001. here is that the review of the AC becomes easily more elabo- Moreover, it is important to understand that quality assur- rative than a checklist-based approach, meaning the certifica- ance measures in the testing phase are not applied to further tion body needs much greater expertise. Furthermore, the cer- improve the AI solution and thus does not provide evidences tification body can no longer give up responsibility for the to argue ALARP. Instead, they help argue that we are confi- safety of the system by saying that it is only responsible for dent that we have met the quantitative safety target. compliance with standards but not for system safety. How- Operation in the sense considered here comprises deploy- ever, this aspect is not specific for AI and is generally true for ment, usage, maintenance, and retirement. Although most as- the certification of complex systems by means of ACs. pects are not AI-specific, some are and need to be addressed with appropriate safety measures. On the one hand, measures 6 Conclusion for assuring ALARP include the collection of relevant infor- mation during operation to further improve the AI solution as We conclude that ACs have the potential to justify the usage part of maintenance. Moreover, situations have to be detected of AI in safety-critical systems. A prerequisite is, however, in which the AI solution can only provide outcomes with high that they argue that risks are as low as reasonably practicable uncertainty, in order to allow for appropriate countermeas- (ALARP) and that a reasonable target based on a quantitative ures to be taken on the system level to improve the overall risk acceptance criterion has been chosen and is fulfilled. We safety. Such situations may include settings where lighting presented the first approach for explicitly augmenting the conditions make falsely detecting a signal for the right of way achievement of these complementary objectives for AI. much more likely. On the other hand, evidence on satisfying We also see the potential of the proposed structure for tra- the safety target obtained from testing strongly relies on as- ditional software as it would enforce claims about the effec- sumptions regarding the target application scope; if the AI tiveness of safety measures. It would put into question solution is applied in a different setting or relevant character- whether one is really following the ALARP principle when istics of the application change, this evidence is no longer choosing safety measures according to recommendations valid. Therefore, safety measures have to be taken during op- given by safety standards. It would also raise the question of eration to detect such deviation between the target application how effective software safety measures are and call for em- scope and the actual application scope. pirical evidences about their effectiveness. Last but not least, we advocate that the concept of ACs from the safety community should be carried over to the AI 5 Discussion community. In particular, researchers with a background in We proposed a strategy for arguing the safety of an AI-based empirical studies and data quality need to be involved in the safety function combining two risk acceptance criteria. The development and review of AI-related ACs. structure can help to come up with a sound argument but there are ways of how one could attack this argument. A possible Acknowledgments attack on the ALARP argument is that the body of knowledge concerning the effectiveness and the best combinations of Parts of this work have been funded by the Observatory for measures is not mature enough. A possible attack on the Artificial Intelligence in Work and Society (KIO) of the quantitative claim based on PRB or MEM is that there is not Denkfabrik Digitale Arbeitsgesellschaft in the project "KI enough practical experience and empirical evidence. A pos- Testing & Auditing". We would also like to thank Sonnhild sible response to this attack is to collect data during operation Namingha for an initial review of this paper. References [Mayr et al., 2012] A. Mayr, R. Plösch, M. Kläs, C. Lampasona, and M. Saft. A Comprehensive Code-Based Quality Model for [Adelard LLP, 2021] Adelard LLP. CAE FRAMEWORK, 2021, Embedded Systems: Systematic Development and Validation https://claimsargumentsevidence.org/. Accessed 10 May 2021. by Industrial Projects. In ISSRE 2012, pp. 281-290, 2012. [Adler et al., 2019] R. Adler, M. N. Akram, P. Bauer, p: Feth, P. [OMG, 2020] Object Management Group. About the Structured Gerber, A. Jedlitschka, L. Jöckel, M. Kläs, and D. Schneider. Assurance Case Metamodel Specification Version 2.1, 2020. Hardening of Artificial Neural Networks for Use in Safety- [Koopman and Wagner, 2019] P. Koopman, and M. Wagner. Posi- Critical Applications - A Mapping Study, 2019. tive Trust Balance for Self-driving Car Deployment. In Proc. https://arxiv.org/abs/1909.03036. of SAFECOMP 2020 Workshops, pp. 351–357, 2019. [Ashmore et al., 2019] R. Ashmore, R. Calinescu, and C. Paterson. [Picardi et al., 2019] C. Picardi, R. Hawkins, C. Paterson, and I. Assuring the Machine Learning Lifecycle: Desiderata, Meth- Habli. A Pattern for Arguing the Assurance of Machine ods, and Challenges, ACM Computing Surveys, 2019. Learning in Medical Diagnosis Systems. In Proc. of [Basili et al., 2010] V. R. Basili, et al. Linking Software Develop- SAFECOMP 2019, pp. 165–179, 2019. ment and Business Strategy Through Measurement. Com- [Picardi et al., 2020] C. Picardi, C. Paterson, R. Hawkins, R. Calin- puter. 43(4):57–65, 2010. escu, and I. Habli. Assurance Argument Patterns and Pro- [BSI, 2021] BSI, Fraunhofer HHI, Verband der TÜV. Towards Au- cesses for Machine Learning in Safety-Related Systems. In ditable AI Systems, 2021. Proc. of SafeAI 2020, pp. 23-30, 2020. [DIN, 2017] DIN EN ISO 3691-1:2017 – Industrial trucks - Safety [Pietsch, 2021] B. Pietsch. 2 Killed in Driverless Tesla Car Crash, requirements and verification, 2017. Officials Say. The New York Times, 2021. [European Commission, 2021] European Commission. Proposal [SCSC, 2018] Safety-Critical Systems Club. GSN Community for a Regulation laying down harmonised rules on artificial Standard Version 2 Draft 1, 2018. intelligence (Artificial Intelligence Act), 2021. https://ec.eu- [Salay and Czarnecki, 2018] R. Salay, and K. Czarnecki. Using ropa.eu/newsroom/dae/redirection/item/709090. Machine Learning Safely in Automotive Software: An Assess- [Felderer and Ramler, 2021] M. Felderer, and R. Ramler. Quality ment and Adaption of Software Process Requirements in ISO Assurance for AI-based Systems: Overview and Challenges. 26262, 2018. https://arxiv.org/abs/1808.01614. In Software Quality: Future Perspectives on Software Engi- [Sämann et al., 2020] T. Sämann, P. Schlicht, and F. Hüger. Strat- neering Quality. 2021. egy to Increase the Safety of a DNN-based Perception for [Gauerhof et al., 2020] L. Gauerhof, R. Hawkins, C. Picardi, C. HAD Systems, 2020. https://arxiv.org/abs/2002.08935. Paterson, and I. Habli. Assuring the Safety of Machine Learn- [Schwalbe et al., 2020] G. Schwalbe, et al. Structuring the Safety ing for Pedestrian Detection at Crossings. In Proc. of Argumentation for Deep Neural Network Based Perception in SAFECOMP 2020. Springer, pp. 197–212, 2020. Automotive Applications. In Proc. of SAFECOMP 2020, pp. [Gauerhof et al., 2018] L. Gauerhof, P. Munk, and S. Burton. 383–394, 2020. Structuring Validation Targets of a Machine Learning Func- [Schwalbe and Schels, 2020] G. Schwalbe, and M. Schels. A Sur- tion Applied to Automated Driving. In Proc. of SAFECOMP vey on Methods for the Safety Assurance of Machine Learn- 2018, pp. 45–58, 2018. ing Based Systems. In: Proc. of European Congress on Em- [Hauer et al., 2021] Hauer, M., Adler, R., and Zweig, K. Assuring bedded Real Time Software and Systems, 2020. Fairness of Algorithmic Decision Making (ITEQS 2021). In [Schwalbe and Schels, 2019] G. Schwalbe, and M. Schels. Strate- Proc. of Int. Conf. on Software Testing, 2021. gies for Safety Goal Decomposition for Neural Networks. In [Hawkins et al., 2021] R. Hawkins, C. Paterson, C. Picardi, Y. Jia, Proc. of ACM Computer Science in Cars Symposium, 2020. R. Calinescu, and I. Habli. Guidance on the Assurance of Ma- [Siebert et al., 2021] J. Siebert, L. Joeckel, J. Heidrich, A. chine Learning in Autonomous Systems (AMLAS), 2021. Trendowicz, K. Nakamichi, K. Ohashi, I. Namba., R. Yama- [IEC 2010] IEC 61508-5:2010 – Functional Safety of Electri- moto, and M. Aoyama. Construction of a Quality Model for cal/Electronic/Programmable Electronic Safety-related Sys- Machine Learning Systems, Software Quality Journal. Special tems, 2010. Issue Information Systems Quality, 2021. [INI, 2019] Institut für Neuroinformatik. German Traffic Sign [UL, 2021] Underwriters Laboratories. Presenting the Standard for Benchmarks, 2019. https://benchmark.ini.rub.de/gtsrb_re- Safety for the Evaluation of Autonomous Vehicles and Other sults.html. Accessed 10 May 2021. Products. https://ul.org/UL4600. Accessed 10 May 2021. [ISO/TR, 2020] ISO/TR 4804:2020 – Road vehicles — Safety and [VDE, 2020] VDE-AR-E 2842-61-1:2020-07 – Development and cybersecurity for automated driving systems — Design, verifi- trustworthiness of autonomous/cognitive systems, 2020. cation and validation, 2020. [Wakabayashi, 2018] D. Wakabayashi. Self-Driving Uber Car [ISO/IEC, 2021] ISO/IEC AWI TR 5469 – Artificial intelligence Kills Pedestrian in Arizona. The New York Times, 2018. — Functional safety and AI systems, 2021. [Willers et al., 2020] O. Willers, S. Sudholt, S. Raafatnia, and S. [ISO/IEC/IEEE, 2019] ISO/IEC/IEEE 15026-1:2019 – Systems Abrecht. Safety Concerns and Mitigation Approaches Regard- and software engineering - Systems and software assurance - ing the Use of Deep Learning in Safety-Critical Perception Part 1: Concepts and vocabulary, 2019. Tasks. In Proc. of SAFECOMP 2020, pp. 336–350, 2020. [Kläs and Sembach, 2019] M. Kläs, and L. Sembach. Uncertainty [Wozniak et al., 2020] E. Wozniak, C. Cârlan, E. Acar-Celik, and Wrappers for Data-Driven Models. In Proc. of SAFECOMP H. Putzer. A Safety Case Pattern for Systems with Machine 2019. Springer, pp. 358–364, 2019. Learning Components. In Proc. of SAFECOMP 2020. [Kläs and Vollmer, 2018] M. Kläs, and A.M. Vollmer. Uncertainty Springer, pp. 370-382, 2020. in Machine Learning Applications: A Practice-Driven Classi- [Zenzic, 2020] Zenzic-UK Ltd. Zenzic-Safety-Framework-Report- fication of Uncertainty. In Proc. of SAFECOMP 2019, 2019. 2.0-final, 2020. https://zenzic.io/reports-and-resources/safety- case-framework/. Accessed 10 May 2021.