                  Using Complementary Risk Acceptance Criteria to Structure
                      Assurance Cases for Safety-Critical AI Components
                   Michael Kläs, Rasmus Adler, Lisa Jöckel, Janek Groß, Jan Reich
                                Fraunhofer IESE, Kaiserslautern, Germany
            {michael.klaes, rasmus.adler, lisa.joeckel, janek.gross, jan.reich}@iese.fraunhofer.de



                                                     Abstract

     Artificial Intelligence (AI), particularly current Machine Learning approaches, promises new and innovative solutions also for realizing safety-critical functions. Assurance cases can support the potential certification of such AI applications by providing an assessable, structured argument explaining why safety is achieved. Existing proposals and patterns for structuring the safety argument help to structure safety measures, but guidance for explaining in a concrete use case why the safety measures are actually sufficient is limited. In this paper, we investigate this and other challenges and propose solutions. In particular, we propose considering two complementary types of risk acceptance criteria as assurance objectives and provide, for each objective, a structure for the supporting argument. We illustrate our proposal on an excerpt of an automated guided vehicle use case and close with questions triggering further discussions on how to best use assurance cases in the context of AI certification.

1   Introduction
AI, which in this paper we understand as complex data-driven models provided by Machine Learning (ML), promises improved or additional functionalities that are essential for autonomous systems, e.g., perception for self-driving vehicles. In many cases, such functionalities are safety-critical, so it is highly likely that AI becomes safety-critical as well, meaning that its failure can contribute to accidents. There are already various reports on fatal accidents due to AI-related failures in autonomous vehicles [Pietsch, 2021; Wakabayashi, 2018].
   In consequence, regulation [European Commission, 2021] and certification for AI in safety-critical components are being proposed. Regulation and certification are powerful means to prevent the market introduction of unsafe products. This contributes not only to safety but also to the economy, as a few unsafe products could affect user acceptance of all similar products. The predictability of legal decisions can thus contribute to economic success as long as liability risks and the costs for complying with regulations and standards are not unreasonably high and do not hinder meaningful innovations.
   Unfortunately, existing safety standards are difficult to apply in the context of AI [Salay and Czarnecki, 2018], and revisions are still ongoing [ISO/IEC, 2021]. Therefore, we currently do not have any standards that we can easily apply for certifying AI.
   Arguing safety claims with assurance cases (ACs), an established approach in safety engineering, may provide an alternative basis for audits and certification in the context of AI [BSI, 2021]. ACs could structure the arguments for those parts of a solution that are individual and highly innovative. Moreover, they could establish the basis for upcoming evidence-based standards for AI certification.
   Initial proposals on how to apply the concept of ACs to AI can be found in the literature. A prominent strategy is to argue the safety objectives and safety requirements [Gauerhof et al., 2020]. As the proposed strategy and patterns abstract from specific safety objectives and derived safety requirements, such approaches also largely abstract from AI-specific safety concerns and required safety measures. Guidance for achieving and arguing safety is thus inherently limited.
   One approach for overcoming this limitation is to argue using known AI-related safety concerns and how they are addressed by AI-specific safety measures [Schwalbe et al., 2020]. A disadvantage is that it is hard to argue completeness for the identified and addressed safety concerns. Furthermore, such approaches cannot yet explain which safety measures and metrics, with which thresholds, need to be applied to achieve a defined level of safety. To give just one example, neither practical experience nor empirical evidence exists for defining a specific neuron coverage level that would be considered sufficient when testing a deep neural network for a concrete application.
   We think that the concepts and ideas introduced in existing AC proposals can be aligned in a more comprehensible and convincing argumentation if the risk acceptance criteria on which the answer to the question 'How safe is safe enough?' is founded are made explicit in the AC structure itself. We will show that this allows, on the one hand, becoming explicit with respect to AI-specific safety measures and, on the other hand, soundly arguing higher-level safety objectives.
   Contribution. Specifically, we propose using an AC structure that splits at an early stage into two main claims and related arguments. The first claim refers to the achievement of a probabilistic target value with a certain level of confidence, derived from applying a quantitative risk acceptance criterion. The second claim is that the risk due to "failures" caused by the AI is as low as reasonably practicable due to the safety measures applied during the AI lifecycle. In the absence of evidence-based target values for specific safety measures, we propose monitoring quality assurance activities on a cost-benefit basis and defining respective stop criteria.




Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
   This ensures, on the one hand, that quantitative objectives are explicitly argued and underpinned with evidence. On the other hand, the argumentation over the proposed lifecycle stages contributes to a more comprehensive and justifiable derivation of reasonable safety measures, but without the need to predefine targets for specific safety measures. The aim of this paper is to stimulate the discussion about how to argue safety for AI-based functions by rethinking traditional AC patterns and strategies.
   Structure. The remainder of this paper is structured as follows: First, we give some background on quality assurance in the context of AI and introduce the concept of ACs as applied in safety engineering (Sec. 2). Next, we discuss existing proposals on how ACs could be used in the context of AI (Sec. 3). Then we introduce an example use case and illustrate our proposal for structuring ACs (Sec. 4). Finally, we discuss a selection of open questions (Sec. 5) and conclude the paper with an outlook on possible implications (Sec. 6).

2   Background

2.1 Quality Assurance for AI
AI-based software components raise new challenges for quality assurance due to their functionality being derived from data. Commonly, challenges and safety concerns like lack of specification or interpretability are described [Adler et al., 2019; Ashmore et al., 2019; Felderer and Ramler, 2021; Sämann et al., 2020; Willers et al., 2020]. Several papers collect existing methods and map them to the mentioned challenges [Adler et al., 2019; Sämann et al., 2020; Schwalbe and Schels, 2019; Willers et al., 2020]. This raises two questions: whether the list of safety concerns is complete, and to what extent the available methods sufficiently address the safety concerns [Adler et al., 2019]. We are currently not aware of any work that could provide a sufficient answer to these questions.
   Another approach is to structure possible quality assurance activities and measures according to the phases of the AI lifecycle in which they are applied. Studer et al. [2021] propose, for example, a process model based on CRISP-DM, which is often used in data analysis projects, introducing a quality assurance methodology for each project phase. Ashmore et al. [2019] provide a survey of quality assurance methods generating evidence for key assurance requirements being met in each phase of the AI lifecycle. Here, there is a need to show that the quality assurance methods applied during a phase address all assurance requirements related to this phase, and that the list of assurance requirements is complete.
   However, it is difficult to obtain a complete list of quantitative quality assurance requirements. These strongly depend on the task of the AI-based component and its application context. Quality modeling approaches can contribute to a more comprehensive list of quality requirements [Mayr et al., 2012]. Siebert et al. [2021] propose a systematic approach for building such a quality model for a concrete AI-based system that defines the required aspects for each entity of the AI-based system and how they can be measured. Still, further research is needed to better understand (1) to what extent evidence generated by a certain method contributes to arguing safety, (2) what suitable performance indicators for this evidence are, and (3) when a certain method should be preferred over another for a given context.

2.2 Assurance Cases
ACs are heavily used in practice to assure safety, in particular when it is very challenging to argue safety, as in the case of autonomous systems. In recent years, standards like UL 4600 [UL, 2021] and reports [Zenzic, 2020] have addressed the development of such ACs. The application rule VDE-AR-E 2842-61 [VDE, 2020] already proposes using ACs also for other critical aspects of trustworthiness, such as fairness, as illustrated by Hauer et al. [2021].
   An AC is defined as a reasoned, auditable artifact created to support the contention that its top-level claim (or set of claims) is satisfied, including systematic argumentation and the underlying evidence and explicit assumptions that support the claim(s) [ISO/IEC/IEEE, 2019].
   The left part of Fig. 1 illustrates the three main building blocks of an AC: (1) its top-level claims, typically referring to achieved objectives or fulfilled constraints, (2) an argumentation supporting the top-level claims, and (3) evidence on which the argument is based. The right part illustrates the argumentation in a tree structure and its assumptions. The tree is built from reasoning steps that connect lower-level claims with a higher claim that can be concluded from these lower-level claims. If the conclusion is only valid under some assumptions, these assumptions shall be made explicit.
   There are different languages for modeling ACs, like the Goal Structuring Notation (GSN) [SCSC, 2018] or the Claims Arguments Evidence notation [Adelard LLP, 2021]. The common meta-model of these languages is defined in the Structured Assurance Case Metamodel (SACM) [OMG, 2020]. This paper does not refer to a specific language but focuses on the fundamental idea of structuring the argument.

          Figure 1: Building blocks and general structure of an AC
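To make the described tree structure tangible, the following minimal Python sketch (our illustration only, not the SACM meta-model or any of the notations named above) represents claims, assumptions, and evidence as a recursive data structure and collects the claims that still lack support; all names are illustrative:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Claim:
        """A claim, either directly supported by evidence or concluded
        from lower-level claims via a reasoning step."""
        text: str
        sub_claims: List["Claim"] = field(default_factory=list)
        assumptions: List[str] = field(default_factory=list)  # must hold for the reasoning step
        evidence: List[str] = field(default_factory=list)     # leaves of the argument tree

    def undeveloped(claim: Claim) -> List[Claim]:
        """Return all claims backed neither by evidence nor by sub-claims."""
        if not claim.sub_claims and not claim.evidence:
            return [claim]
        return [c for sub in claim.sub_claims for c in undeveloped(sub)]

    # Illustrative top-level split following the structure proposed in Sec. 4
    ac = Claim(
        "AI-based collision avoidance is sufficiently safe",
        sub_claims=[
            Claim("Residual risk is as low as reasonably practicable (ALARP)"),
            Claim("Quantitative safety target derived from PRB is satisfied"),
        ],
        assumptions=["AMR operates within the defined target application scope"],
    )
    print([c.text for c in undeveloped(ac)])  # both sub-claims still need support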
3   Related Work
From a safety perspective, ACs are considered a promising approach for arguing safety for AI-based systems, and various authors have already proposed strategies and patterns.
   Picardi et al. [2019] presented an AC pattern for ML models in clinical diagnosis systems, which they later refined and supplemented with a process for generating evidences during the ML lifecycle [Picardi et al., 2020]. The activities and desiderata during the ML lifecycle are taken from Ashmore et al. [2019]. The ML assurance claim is argued based on ML safety requirements, operating environment, ML model, development, and test data. In this context, the link between system safety requirements and ML safety requirements is addressed [Gauerhof et al., 2020]. In the recently published AMLAS report, Hawkins et al. [2021] also provide generic argument patterns and a process for ML safety assurance scoping, ML safety requirements, ML data, model learning, model verification, and model deployment.
   Wozniak et al. [2020] propose an argument pattern for safety assurance that is aligned with the reasoning for software and hardware in ISO 26262. They argue the satisfaction of an ML safety requirement over the correct decomposition of the safety requirements into sub-requirements and their satisfaction, appropriate data acquisition, model design, as well as implementation and training of the ML model.
   A strategy that does not argue the fulfillment of ML safety requirements is provided by Gauerhof et al. [2018]. They argue that the intended functionality is met by a sufficient reduction of the root causes of functional insufficiencies, which encompass underspecification as well as the semantic and the deductive gap.
   Based on previous works [Schwalbe and Schels, 2019; 2020], Schwalbe et al. [2020] propose arguing the sufficient absence of risk for deep neural networks (DNNs) arising from the insufficiencies they see in their black-box nature, simple performance issues, incorrect internal logic, and instability. They propose a collection of measures to address these insufficiencies, which include V&V as well as best practices during the creation of DNNs and on the system level.
   In summary, our review indicates that existing work is driven by the safety community, which adapts established safety patterns and concepts to AI. However, the presented patterns are still on a rather abstract level, and their applicability to a concrete use case, comprehensively illustrated from the top-level claim down to the evidence, has not been described yet. This might indicate that transferring traditional patterns to AI-based systems proves to be difficult.
   We observed two major challenges in argumentation for which existing strategies and patterns still provide insufficient support. (1) Completeness in the refinement of claims into sub-claims appears difficult to show, especially when approaches argue over the refinement of safety requirements into AI/ML requirements or about addressing ML insufficiencies. For example, if we have a (most likely) incomplete list of insufficiencies, we cannot argue safety by addressing each insufficiency. (2) Considering the current state of AI quality assurance, the proposed patterns commonly struggle with bridging the gap between low-level quantitative evidence, e.g., achieving a specific neuron coverage during AI testing, and the claim of sufficient safety for the given application in a convincing manner.
   We pinpoint as a potential cause of these problems the fact that the risk acceptance criterion underlying the top-level claim on which the argumentation is based is either implicit, or different criteria are mixed and are thus not easy to distinguish during refinement. We therefore claim that a clear differentiation will allow more specific argumentation patterns and a better attribution of evidences to sub-claims.

4   Building Safety Assurance Cases for AI
In this section, we will first introduce the example we will use to illustrate our concepts. Then we will motivate the consideration of a combination of two risk acceptance criteria to structure ACs for AI. Finally, we will introduce a lifecycle model and use it to argue completeness of the provided refinement.

4.1 Background of the Selected Example
Automated guided vehicles (AGVs) are driverless vehicles that transport material. They are used in industrial applications for realizing the flow of material, and their safety concepts do not rely on AI [DIN, 2017]. However, their application is limited by their restricted understanding of the environment and by their safety concept. Autonomous mobile robots (AMRs) overcome these disadvantages compared to operator-controlled vehicles by using more sensors and AI. However, the goal of achieving performance and flexibility similar to an operator-controlled vehicle is hard to realize without using AI in safety-critical functions like collision avoidance. Operators of forklifts adapt their speed and safety distance according to various aspects of the persons at risk, including their speed, motion path, eye contact, hand gestures signaling right of way, etc. To implement a conservative version of such a human-like collision avoidance system, the AMR needs an AI-based component that understands whether a person at risk has recognized the AMR and gives way to it. A critical failure in this context is that the AMR falsely detects the signaling of right of way. Such safety-critical false detections have to be avoided sufficiently to assure that the AMR drives at least as safely as an operator-controlled vehicle.

4.2 What does sufficient mean?
The answer to the question of what 'sufficient' means for preventing a safety-critical failure like the false detection of a human gesture depends on the related risks and the risk acceptance criteria, as safety is defined as acceptable risk [IEC, 2010].
   We should keep two important aspects in mind when discussing criteria for risk acceptance in settings where AI is part of a safety-critical function: (a) AI is an emerging technology that is still heavily in flux, with unforeseeable developments and improvements in the upcoming years. Thus, coming up with a fixed set of safety measures does not appear to be reasonable. The argument that these safety measures minimize risks as far as reasonably practicable easily becomes invalid. Besides, it would be hard to argue that these measures are as effective as the existing ones in safety standards for traditional software.
(b) AI is also mainly applied to realize functions that cannot yet be provided by traditional technological solutions.
   A risk acceptance criterion that, considering (a), seems reasonable to apply in the context of AI states that the residual risk after the application of safety measures should be As Low As Reasonably Practicable (ALARP). The meaning of 'reasonably practicable' is not static but depends on the state of the technology and the intended application, including the underlying business case and related practical restrictions. Considering ALARP as part of the argumentation assures that when progress in technology allows for safer solutions, we will see progress in safety.
   However, doing one's best to avoid and mitigate risks is obviously not enough to argue that this best was sufficient. Accordingly, ALARP is only used in an ALARP region, which is the region between an upper tolerance limit marking unacceptable risk and a lower tolerability limit. Having this in mind is of crucial importance when applying ALARP to AI, since the current state of AI technology might not be advanced enough to realize a given application in a sufficiently safe manner. For example, a state-of-the-art traffic sign recognition algorithm might get one of 200 stop signs wrong [INI, 2019]. If used as part of an autonomous vehicle, it may, as a result, regularly ignore someone's right of way. The algorithm might be as good as reasonably practicable but still not sufficiently safe to be applied in this specific application.
   Thus, we need at least a second risk acceptance criterion that gives us a fixed limit.
   Most existing products have been developed according to functional safety standards that follow the risk acceptance criterion of Minimum Endogenous Mortality (MEM). The idea of MEM is that a technical system must not create a significant risk compared to globally existing risks. For example, a product should cause only a minimal increase in the overall death rate compared to the existing population death rate. This idea leads to very challenging safety requirements and low target failure rates. Depending on the specific task, such low failure rates might be hard to achieve in practice if AI is involved.
   An alternative criterion providing a fixed target is Globalement au moins aussi bon (GAMAB), which says that new technical systems shall be at least as safe as comparable existing ones. However, due to (b), it is hardly applicable in the case of many AI-based functions because no technical systems exist yet that provide similar functions.
   An approach related to GAMAB is the idea of having a 'positive risk balance' (PRB). PRB is defined in ISO/TR 4804 as the 'benefit of sufficiently mitigating residual risk of traffic participation due to automated vehicles', together with Note 1: 'This includes the expectation that automated vehicles cause less crashes (3.7) on average compared to those made by drivers' [ISO/TR, 2020]. The idea of comparing the new AI-based solution with the existing sociotechnical system can lead to less challenging target failure rates compared to MEM. This opens up new opportunities for arguing safety.
   In this paper, we do not discuss how to use this opportunity to derive a target failure rate for an AI-based safety-critical function, as this is very specific for the function and its usage context, but not specific for AI. We also do not discuss how to get from the target failure rate to a target upper boundary on the uncertainty for AI outcomes. Instead, we assume in our example that we would end up with a PRB-derived upper boundary on safety-related uncertainty (u) that we could accept for the AI outcomes: 'The AI must not falsely detect a signal for the right of way that was not actually given in more than one of N cases'.
   Fig. 2 illustrates the relationship between ALARP and a target-based criterion such as MEM, GAMAB, or PRB when providing arguments that an AI-based solution is safe.
   ALARP can be considered as requesting a certain ratio α between the reduction of safety-related uncertainty in the AI outcomes and the required effort/cost. Given the business case for the planned solution and the state of technology, this α might vary; it is achieved in Fig. 2 at point B. Simply speaking, we request that as long as safety measures exist that would increase safety with reasonable investment, they are carried out. How this rather abstract constraint can be further refined will be discussed in the context of the AI lifecycle presented in Sec. 4.3.
   The upper boundary on acceptable safety-related uncertainty u that is derived from the target-based criterion is illustrated in Fig. 2 as a horizontal line. We consequently need to argue that we are confident that the actual safety-related uncertainty is below u. Please note that this is not achieved already at point A, but only at point C, which we will discuss further, including its implications, when talking about testing in Sec. 4.3.
   Finally, we will always end up in one of two kinds of situations: a situation where the target-based criterion dominates, i.e., defines the required investment (cf. Fig. 2), or a situation where ALARP dominates the required investment. An interesting question, which is, however, not directly related to safety, is whether a solution requiring more investment than reasonably practicable should actually be targeted.

[Figure 2 sketches the uncertainty of AI failure over effort/cost: ALARP requests a reasonable cost-benefit ratio α, reached at point B; the target-based criterion (PRB) yields a maximum acceptable uncertainty u, reached at point A and satisfied on confidence level cl only at point C; the investment required is the one satisfying both ALARP and PRB.]

   Figure 2: Implications of considering two risk acceptance criteria
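The two assurance objectives can also be written down compactly; the following is merely one possible formalization in our own notation (the paper itself stays informal), with u, N, cl, and α as introduced above and Δu_m, ΔC_m denoting the uncertainty reduction and effort/cost of an additional safety measure m:

    % Target-based criterion (here derived from PRB): the safety-related
    % uncertainty of the AI outcome must stay below the acceptable bound u,
    % demonstrated with statistical confidence cl:
    P(\text{falsely detected right-of-way signal}) \le u = \frac{1}{N}
    \quad \text{shown at confidence level } cl

    % ALARP as a cost-benefit constraint: a further safety measure m is
    % applied as long as it is reasonably practicable, i.e., as long as
    \frac{\Delta u_m}{\Delta C_m} \ge \alpha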
4.3 Arguing considering the AI lifecycle
As illustrated above, it seems reasonable to argue two separate risk acceptance criteria. It is also advisable to argue each criterion independently. Important for the argumentation, especially for the argumentation of ALARP, are strategies that assure that the refinement of the claims into sub-claims is complete. An accepted way, which we also consider the most promising, is to use a lifecycle model to argue completeness and to localize safety measures.
   The lifecycle model for AI components presented in Fig. 3 builds on existing work, in particular on the work of Ashmore et al. [2019] and Gauerhof et al. [2020]. We adapted their proposals with the objective of achieving an even clearer separation and better assignability of datasets, objectives, and corresponding safety measures to the individual phases. In addition, we tried to keep the phases sufficiently generic to be applicable to the various development processes in data science projects that we are aware of.
   We distinguish between specification, construction, analysis, testing, and operation. The proposed lifecycle model explicitly does not include a 'data' phase. Subsuming data-related activities in a single phase neither matches reality nor gives appropriate weight to the topic of data, which is at the core of any AI lifecycle. Especially since different data with different qualities are consumed in different phases, we modeled individual data lifecycles as parallel streams that provide the foundation for the evidences created in the AI lifecycle.

[Figure 3 shows the proposed AC structure mapped to the lifecycle: the top-level AI-related claim 'Sufficiently safe' is refined via the top-level strategy 'Argue about a combination of risk acceptance criteria' into two claims making the risk acceptance criteria explicit: 'Safety risk is as low as reasonably practicable' and 'Quantitative safety target is satisfied' (the target being derived from PRB or MEM). Both claims are refined via the strategy 'Argue about relevant lifecycle phases' over the AI lifecycle phases Specification, Construction, Analysis, Testing, and Operation, each with quality assurance measures founded on appropriate datasets and underpinned by parallel Training, Analysis, and Test Data lifecycles with measures to assure appropriate data.]

          Figure 3: AI lifecycle phases with mapping to risk criteria

   The proposed separation also means that certain phases contribute exclusively to arguing either ALARP or the target-based risk acceptance criterion, as we will show.
   Specification considers, among other things, the definition of the AI task, the target application scope [Kläs and Vollmer, 2018], which is comparable to the operational design domain in automotive, and safety-relevant as well as other quality requirements, including system constraints like computational resources. Although the AI specification has some specialties, its activities are largely AI-independent. Nevertheless, it is a key phase for both types of risk acceptance criteria. A sufficiently complete and correct specification is a prerequisite for assuring that the safety risk will be as low as reasonably practicable by providing guardrails for the subsequent phases, but it also constitutes the AI-specific safety target and the scope in which this target has to be achieved. For example, the AI must not falsely detect a signal for the right of way that was not actually given in more than one of N cases in its previously defined target application scope.
   Construction is an AI-specific phase. Its objective is to build a model from a training dataset that is able to fulfill the AI task in the target application scope considering the requirements and constraints defined in the specification.
   During construction, many design decisions have to be made, e.g., on the kind of model and its hyperparameters, including topology, learning algorithm, stop criteria, etc. Many of these decisions are based on trial and error, taking into account experience, so construction is a highly iterative process in a close feedback loop with the analysis phase.
   We will not be able to show during construction that we achieved a certain quantitative target, since we focus on fitting but not on soundly testing the model. Thus, this phase plays no role in arguing the target-based risk acceptance criterion. However, considering quality assurance measures during construction is important to argue ALARP. The applied quality assurance measures should be guided by the safety target, but also by other quality requirements and constraints identified during the specification. Commonly, it is not possible to define fixed success criteria for the different quality measures. For example, in most cases, it would not be reasonable to enforce a specific type of model or topology, or to request a maximum batch size m and at least e training epochs. Instead, we propose analyzing and monitoring the efficiency of the measures carried out and stopping, in accordance with ALARP, when reasonable saturation is achieved. For example, when performing a random search for appropriate hyperparameter values, the search shall continue as long as the model shows reasonable improvements, as sketched below.
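The following sketch illustrates one possible implementation of such a saturation-based stop criterion; it is only an illustration, and sample_config and evaluate are placeholders standing in for the project-specific hyperparameter sampling and the training/validation run:

    import random

    def alarp_random_search(sample_config, evaluate, min_gain=1e-3, patience=20):
        """Keep sampling hyperparameter configurations as long as the best
        validation score still improves noticeably; stop once `patience`
        consecutive samples yield less than `min_gain` improvement
        (i.e., saturation is reached)."""
        best_score, best_config, stale = float("-inf"), None, 0
        while stale < patience:
            config = sample_config()           # e.g., draw learning rate, depth, ...
            score = evaluate(config)           # e.g., validation score of the model
            if score > best_score + min_gain:  # reasonable improvement -> continue
                best_score, best_config, stale = score, config, 0
            else:
                stale += 1                     # no notable improvement this time
        return best_config, best_score

    # Toy usage with a synthetic objective (stands in for training + validation)
    best, score = alarp_random_search(
        sample_config=lambda: random.uniform(0.0, 1.0),
        evaluate=lambda x: 1.0 - (x - 0.3) ** 2,
    )
    print(best, score)

The design choice here mirrors the ALARP reasoning of the paper: the stop criterion is defined over the observed cost-benefit development of the measure rather than over a fixed, a priori quality threshold.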
   Analysis is also an AI-specific phase; it is performed in a close feedback loop with construction to provide guardrails for improving construction and to indicate the achievement of saturation for constructive quality assurance measures. Besides means for explainability, analysis comprises "testing" the model on validation data to estimate and monitor the model performance with respect to the safety target. However, although techniques are applied that are similar to those of the testing phase, the analysis phase differs from the testing phase in that its objective is to gather insights for further improving the AI model rather than to provide evidence for the achievement of the specified safety target. Therefore, the quality assurance measures in the analysis phase help to argue ALARP but do not contribute to arguing the target-based risk acceptance criterion. In analogy to the construction phase, it is difficult to define a priori targets for most quality measures in the analysis phase. Rather, their effect, and thus their potential contribution to the safety target, must be monitored and continuously evaluated.
   Testing is also commonly considered to be AI-specific. Unlike analysis, the objective of the testing phase is to generate evidence on the achievement of the quantitative safety target. In providing this evidence, testing depends on the specification, including the definition of the AI task and the target application scope. Moreover, it relies on specific qualities of the test data that are not as relevant, for example, for training data, such as that the data fulfills some representativeness criteria and that it was not used previously during construction or analysis. Since a test dataset can always provide only a sample of all possible cases in the target application scope, we need to underpin the evidence on satisfying the safety target with some statistical confidence (cf. Fig. 2) [Kläs and Sembach, 2019]. The confidence level cl, which is independent of the target, may be set based on criticality or requested integrity. For example, we might request that the probability that we falsely confirm our target 'The AI must not falsely detect a signal for the right of way that was not actually given in more than one of N cases in its previously defined target application scope.' is less than 1-cl = 0.0001.
   Moreover, it is important to understand that quality assurance measures in the testing phase are not applied to further improve the AI solution and thus do not provide evidence to argue ALARP. Instead, they help argue that we are confident that we have met the quantitative safety target.
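To give an impression of the resulting testing effort, the following sketch computes the required number of test cases under simplifying assumptions that the paper does not prescribe: independent, representative test cases, a decision rule that confirms the target only if no safety-critical false detection occurs at all, and an N chosen purely for illustration:

    import math

    def required_test_cases(N, cl):
        """Number of independent test cases n such that the probability of
        falsely confirming the target 'failure rate below 1/N' is at most
        1 - cl, assuming the target is confirmed only if zero safety-critical
        false detections occur among the n cases: (1 - 1/N)**n <= 1 - cl."""
        return math.ceil(math.log(1 - cl) / math.log(1 - 1 / N))

    # Illustrative numbers only (the paper leaves N open): confirming
    # 'at most one false detection in N = 10,000 cases' with
    # 1 - cl = 0.0001 requires roughly 9.2 * N failure-free test cases.
    print(required_test_cases(N=10_000, cl=0.9999))  # ~92,100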
   Operation in the sense considered here comprises deployment, usage, maintenance, and retirement. Although most aspects are not AI-specific, some are and need to be addressed with appropriate safety measures. On the one hand, measures for assuring ALARP include the collection of relevant information during operation to further improve the AI solution as part of maintenance. Moreover, situations have to be detected in which the AI solution can only provide outcomes with high uncertainty, in order to allow appropriate countermeasures to be taken at the system level to improve overall safety. Such situations may include settings where lighting conditions make falsely detecting a signal for the right of way much more likely. On the other hand, the evidence on satisfying the safety target obtained from testing strongly relies on assumptions regarding the target application scope; if the AI solution is applied in a different setting or relevant characteristics of the application change, this evidence is no longer valid. Therefore, safety measures have to be taken during operation to detect such deviations between the target application scope and the actual application scope.
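Such an operation-time guardrail could, for instance, follow the spirit of uncertainty wrappers [Kläs and Sembach, 2019]; the following sketch is purely illustrative, with hypothetical names, outcomes, and thresholds:

    def gated_right_of_way(model_output, uncertainty, u_max):
        """Only act on a detected right-of-way signal if the estimated
        uncertainty of the outcome is below the acceptable bound u_max;
        otherwise trigger a conservative system-level countermeasure."""
        if uncertainty > u_max:
            return "FALLBACK_SLOW_DOWN"  # countermeasure for, e.g., bad lighting
        return "PROCEED" if model_output == "right_of_way_given" else "WAIT"

    # High uncertainty overrides the detected signal -> FALLBACK_SLOW_DOWN
    print(gated_right_of_way("right_of_way_given", uncertainty=0.02, u_max=0.01))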
5   Discussion
We proposed a strategy for arguing the safety of an AI-based safety function that combines two risk acceptance criteria. The structure can help to come up with a sound argument, but there are ways in which one could attack this argument. A possible attack on the ALARP argument is that the body of knowledge concerning the effectiveness and the best combinations of measures is not mature enough. A possible attack on the quantitative claim based on PRB or MEM is that there is not enough practical experience and empirical evidence. A possible response to this attack is to collect data during operation and to use market monitoring to strengthen the argument. This approach is already described by the Safety Performance Indicator [Koopman and Wagner, 2019] and GQM+Strategies [Basili et al., 2010], but it needs to be tailored to the focused argument for AI. By evaluating the reasoning with data, a mature body of knowledge can be developed over time and reflected in safety standards for AI.
   Considering standardization, we see three options for using ACs. The first is to demand in a safety standard the development of an AC for the considered product. The second is to describe in product- or domain-specific safety standards a generic AC that shall be instantiated. The third is to develop a product- or domain-specific AC and use this AC to develop a checklist-based safety standard where safety measures are chosen depending on the specific criticality/integrity level.
   Considering certification, we see two main aspects. The first is that the AC needs to comply with the standard describing what the AC should look like. The second and more important aspect is that the AC itself needs to be sound, so that it can be accepted by the certification body. The challenge here is that the review of the AC easily becomes more elaborate than a checklist-based approach, meaning that the certification body needs much greater expertise. Furthermore, the certification body can no longer give up responsibility for the safety of the system by saying that it is only responsible for compliance with standards but not for system safety. However, this aspect is not specific to AI and is generally true for the certification of complex systems by means of ACs.

6   Conclusion
We conclude that ACs have the potential to justify the usage of AI in safety-critical systems. A prerequisite is, however, that they argue that risks are as low as reasonably practicable (ALARP) and that a reasonable target based on a quantitative risk acceptance criterion has been chosen and is fulfilled. We presented a first approach for explicitly arguing the achievement of these complementary objectives for AI.
   We also see potential in the proposed structure for traditional software, as it would enforce claims about the effectiveness of safety measures. It would put into question whether one is really following the ALARP principle when choosing safety measures according to the recommendations given by safety standards. It would also raise the question of how effective software safety measures are and call for empirical evidence about their effectiveness.
   Last but not least, we advocate that the concept of ACs from the safety community should be carried over to the AI community. In particular, researchers with a background in empirical studies and data quality need to be involved in the development and review of AI-related ACs.

Acknowledgments
Parts of this work have been funded by the Observatory for Artificial Intelligence in Work and Society (KIO) of the Denkfabrik Digitale Arbeitsgesellschaft in the project "KI Testing & Auditing". We would also like to thank Sonnhild Namingha for an initial review of this paper.
References
[Adelard LLP, 2021] Adelard LLP. CAE Framework, 2021. https://claimsargumentsevidence.org/. Accessed 10 May 2021.
[Adler et al., 2019] R. Adler, M. N. Akram, P. Bauer, P. Feth, P. Gerber, A. Jedlitschka, L. Jöckel, M. Kläs, and D. Schneider. Hardening of Artificial Neural Networks for Use in Safety-Critical Applications - A Mapping Study, 2019. https://arxiv.org/abs/1909.03036.
[Ashmore et al., 2019] R. Ashmore, R. Calinescu, and C. Paterson. Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. ACM Computing Surveys, 2019.
[Basili et al., 2010] V. R. Basili, et al. Linking Software Development and Business Strategy Through Measurement. Computer, 43(4):57–65, 2010.
[BSI, 2021] BSI, Fraunhofer HHI, Verband der TÜV. Towards Auditable AI Systems, 2021.
[DIN, 2017] DIN EN ISO 3691-1:2017 – Industrial trucks – Safety requirements and verification, 2017.
[European Commission, 2021] European Commission. Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), 2021. https://ec.europa.eu/newsroom/dae/redirection/item/709090.
[Felderer and Ramler, 2021] M. Felderer, and R. Ramler. Quality Assurance for AI-based Systems: Overview and Challenges. In Software Quality: Future Perspectives on Software Engineering Quality, 2021.
[Gauerhof et al., 2018] L. Gauerhof, P. Munk, and S. Burton. Structuring Validation Targets of a Machine Learning Function Applied to Automated Driving. In Proc. of SAFECOMP 2018, pp. 45–58, 2018.
[Gauerhof et al., 2020] L. Gauerhof, R. Hawkins, C. Picardi, C. Paterson, and I. Habli. Assuring the Safety of Machine Learning for Pedestrian Detection at Crossings. In Proc. of SAFECOMP 2020. Springer, pp. 197–212, 2020.
[Hauer et al., 2021] M. Hauer, R. Adler, and K. Zweig. Assuring Fairness of Algorithmic Decision Making (ITEQS 2021). In Proc. of Int. Conf. on Software Testing, 2021.
[Hawkins et al., 2021] R. Hawkins, C. Paterson, C. Picardi, Y. Jia, R. Calinescu, and I. Habli. Guidance on the Assurance of Machine Learning in Autonomous Systems (AMLAS), 2021.
[IEC, 2010] IEC 61508-5:2010 – Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems, 2010.
[INI, 2019] Institut für Neuroinformatik. German Traffic Sign Benchmarks, 2019. https://benchmark.ini.rub.de/gtsrb_results.html. Accessed 10 May 2021.
[ISO/IEC, 2021] ISO/IEC AWI TR 5469 – Artificial intelligence – Functional safety and AI systems, 2021.
[ISO/IEC/IEEE, 2019] ISO/IEC/IEEE 15026-1:2019 – Systems and software engineering – Systems and software assurance – Part 1: Concepts and vocabulary, 2019.
[ISO/TR, 2020] ISO/TR 4804:2020 – Road vehicles – Safety and cybersecurity for automated driving systems – Design, verification and validation, 2020.
[Kläs and Sembach, 2019] M. Kläs, and L. Sembach. Uncertainty Wrappers for Data-Driven Models. In Proc. of SAFECOMP 2019. Springer, pp. 358–364, 2019.
[Kläs and Vollmer, 2018] M. Kläs, and A. M. Vollmer. Uncertainty in Machine Learning Applications: A Practice-Driven Classification of Uncertainty. In Proc. of SAFECOMP 2018 Workshops, 2018.
[Koopman and Wagner, 2019] P. Koopman, and M. Wagner. Positive Trust Balance for Self-driving Car Deployment. In Proc. of SAFECOMP 2020 Workshops, pp. 351–357, 2019.
[Mayr et al., 2012] A. Mayr, R. Plösch, M. Kläs, C. Lampasona, and M. Saft. A Comprehensive Code-Based Quality Model for Embedded Systems: Systematic Development and Validation by Industrial Projects. In ISSRE 2012, pp. 281–290, 2012.
[OMG, 2020] Object Management Group. About the Structured Assurance Case Metamodel Specification Version 2.1, 2020.
[Picardi et al., 2019] C. Picardi, R. Hawkins, C. Paterson, and I. Habli. A Pattern for Arguing the Assurance of Machine Learning in Medical Diagnosis Systems. In Proc. of SAFECOMP 2019, pp. 165–179, 2019.
[Picardi et al., 2020] C. Picardi, C. Paterson, R. Hawkins, R. Calinescu, and I. Habli. Assurance Argument Patterns and Processes for Machine Learning in Safety-Related Systems. In Proc. of SafeAI 2020, pp. 23–30, 2020.
[Pietsch, 2021] B. Pietsch. 2 Killed in Driverless Tesla Car Crash, Officials Say. The New York Times, 2021.
[Salay and Czarnecki, 2018] R. Salay, and K. Czarnecki. Using Machine Learning Safely in Automotive Software: An Assessment and Adaption of Software Process Requirements in ISO 26262, 2018. https://arxiv.org/abs/1808.01614.
[Sämann et al., 2020] T. Sämann, P. Schlicht, and F. Hüger. Strategy to Increase the Safety of a DNN-based Perception for HAD Systems, 2020. https://arxiv.org/abs/2002.08935.
[Schwalbe and Schels, 2019] G. Schwalbe, and M. Schels. Strategies for Safety Goal Decomposition for Neural Networks. In Proc. of ACM Computer Science in Cars Symposium, 2019.
[Schwalbe and Schels, 2020] G. Schwalbe, and M. Schels. A Survey on Methods for the Safety Assurance of Machine Learning Based Systems. In Proc. of European Congress on Embedded Real Time Software and Systems, 2020.
[Schwalbe et al., 2020] G. Schwalbe, et al. Structuring the Safety Argumentation for Deep Neural Network Based Perception in Automotive Applications. In Proc. of SAFECOMP 2020, pp. 383–394, 2020.
[SCSC, 2018] Safety-Critical Systems Club. GSN Community Standard Version 2 Draft 1, 2018.
[Siebert et al., 2021] J. Siebert, L. Jöckel, J. Heidrich, A. Trendowicz, K. Nakamichi, K. Ohashi, I. Namba, R. Yamamoto, and M. Aoyama. Construction of a Quality Model for Machine Learning Systems. Software Quality Journal, Special Issue Information Systems Quality, 2021.
[Studer et al., 2021] S. Studer, T. B. Bui, C. Drescher, A. Hanuschkin, L. Winkler, S. Peters, and K.-R. Müller. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Machine Learning and Knowledge Extraction, 3(2):392–413, 2021.
[UL, 2021] Underwriters Laboratories. Presenting the Standard for Safety for the Evaluation of Autonomous Vehicles and Other Products. https://ul.org/UL4600. Accessed 10 May 2021.
[VDE, 2020] VDE-AR-E 2842-61-1:2020-07 – Development and trustworthiness of autonomous/cognitive systems, 2020.
[Wakabayashi, 2018] D. Wakabayashi. Self-Driving Uber Car Kills Pedestrian in Arizona. The New York Times, 2018.
[Willers et al., 2020] O. Willers, S. Sudholt, S. Raafatnia, and S. Abrecht. Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks. In Proc. of SAFECOMP 2020, pp. 336–350, 2020.
[Wozniak et al., 2020] E. Wozniak, C. Cârlan, E. Acar-Celik, and H. Putzer. A Safety Case Pattern for Systems with Machine Learning Components. In Proc. of SAFECOMP 2020. Springer, pp. 370–382, 2020.
[Zenzic, 2020] Zenzic-UK Ltd. Zenzic Safety Framework Report 2.0, 2020. https://zenzic.io/reports-and-resources/safety-case-framework/. Accessed 10 May 2021.