<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI for Safety: How to use Explainable Machine Learning Approaches for Safety Analyses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iwo Kurzidem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Burton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schleiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Cognitive Systems IKS</institution>
          ,
          <addr-line>Hansastraße 32, D-80686 Munich</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Current research in machine learning (ML) and safety focuses on safety assurance of ML. We, however, show how to interpret the results of explainable ML approaches for safety. We investigate how individual evaluations of data clusters in specific explainable, outside-model estimators can be analyzed to identify insufficiencies at different levels, such as (1) input feature, (2) data or (3) the ML model itself. Additionally, we link our findings to required artifacts of safety within the automotive domain, such as unknown unknowns from ISO 21448 or equivalence classes as mentioned in ISO/TR 4804. In our case study we analyze and evaluate the results from an explainable, outside-model estimator (i.e., white-box model) by performance evaluation, decision tree visualization, data distribution and input feature correlation. As explainability is key for safety analyses, the utilized model is a random forest, with extensions via boosting and multi-output regression. The model training is based on an introspective data set, optimized for reliable safety estimation. Our results show that technical limitations can be identified via homogeneous data clusters and assigned to a corresponding equivalence class. For unknown unknowns, each level of insufficiency (input, data and model) must be analyzed separately and systematically narrowed down by process of elimination. In our case study we identify “Fog density” as an unknown unknown input feature for the introspective model.</p>
      </abstract>
      <kwd-group>
        <kwd>safety analysis</kwd>
        <kwd>safety engineering</kwd>
        <kwd>explainable machine learning</kwd>
        <kwd>outside-model estimator</kwd>
        <kwd>safety validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Figure 1: Levels at which uncertainty manifests in ML — input feature, data and ML model — together with the resulting specification and performance insufficiencies.]</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The use of artificial intelligence (AI) and especially machine learning (ML) in safety-critical applications, such as autonomous driving (AD), is still a vivid research area, as many state-of-the-art ML methodologies create end-to-end trained (i.e., black-box) models for object detection and localization [<xref ref-type="bibr" rid="ref1">1</xref>]. Encoded into these black-box models are performance and specification insufficiencies that cause epistemic and/or aleatoric uncertainties [<xref ref-type="bibr" rid="ref2">2</xref>].
      </p>
      <p>
        Identifying, estimating and, if possible, mitigating
uncertainties is required for a convincing safety assurance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Figure 1 provides an overview of different uncertainty manifestations typical for ML: • Input feature: Is the ML model’s decision process based on the correct input factors from the complex environment? • Data: Does the collected data (training &amp; test) include enough and proper samples with an appropriate distribution? • ML model: Is the selected ML methodology appropriate for the desired task?</p>
      <p>Finding and understanding the root cause(s) of uncertainty and identifying the corresponding insufficiency is not a trivial task, as typical results from quantitative tests of the ML model do not allow a straightforward mapping between a measured lack of performance and a specific insufficiency, due to complex interdependencies and correlations between the causes.</p>
      <p>The main contribution of this paper is an approach to identify specific insufficiencies and eventually link the analysis results to required artifacts of automotive safety standards, for example related to unknown unknowns from ISO 21448 - Safety of the intended functionality (SOTIF) [<xref ref-type="bibr" rid="ref2">2</xref>] or equivalence classes for validation from ISO/TR 4804 [<xref ref-type="bibr" rid="ref4">4</xref>]. In doing so, we present a solution to address open issues in ML safety assurance regarding safety tests, such as how many tests have to be performed within which operational design domain (ODD) [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>In previous work we presented a conceptual framework to create an explainable, introspective model (i.e., white-box) from a deep neural network (i.e., black-box), cf. Fig. 2. In a case study, we used the approach to estimate the safety and reliability of the black-box via the white-box for object detection in the automotive domain. While the developed white-box models showed some promising results, such as providing estimated distributions for successful and failed detections, their unrestricted usage for safety assessment is currently not possible; for details see [<xref ref-type="bibr" rid="ref6">6</xref>]. In this contribution we use the developed models for safety analyses to identify specific insufficiencies. We leverage the fact that random forests (RFs) contain interpretable decision trees (DTs) and analyze the obtained DTs with regard to split criteria and data clustering.</p>
      <p>Figure 2: From Black-box to White-box. Adapted from [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>This paper is organized as follows. Section 2 provides an overview of relevant and related works. We continue by introducing our approach and its basic premise in Section 3. Next, in Section 4, we demonstrate our approach and perform the corresponding analyses. Finally, in Section 5, we conclude the paper by summarizing our results and discussing future work.</p>
      <p>The IJCAI-2023 AISafety and SafeRL Joint Workshop. * Corresponding author: iwo.kurzidem@iks.fraunhofer.de (I. Kurzidem); simon.burton@iks.fraunhofer.de (S. Burton); philipp.schleiss@iks.fraunhofer.de (P. Schleiss). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>2. Related Works</title>
      <sec id="sec-4-1">
        <title>-</title>
        <p>Currently, most research on AI for AD focuses on improving the safety-related aspects of the ML models themselves, either by means of conventional (i.e., non-ML) analysis methods [<xref ref-type="bibr" rid="ref7">7</xref>] or by methods directly enhancing the ML model [8]. These conventional safety methods include hazard and risk analysis [9], simulation [10, 11], (stochastic) fault tree analysis [12] etc., while ML-specific methods for safety cover uncertainty quantification [8] and robustification [13], among others. However, conventional safety methods are not particularly well suited for safety considerations regarding AI, such as the definition of equivalence classes of safe or unsafe behavior or the discovery of unknown unknowns, as these characteristics manifest themselves differently in ML-based systems, due to correlation of input to output instead of causality of data processing. Enhancing ML models requires modification of the base network, without providing traceable safety artifacts. Therefore, new safety analysis methods are needed, including approaches leveraging ML itself, similar to [14], which uses a Bayesian network to identify novel triggering conditions, as required by SOTIF.</p>
        <p>The German Federal Ministry for Economic Affairs and Climate Protection initiated the project “KI-Absicherung” (KI-A), consisting of 24 partners from industry and academia, to address the complex topic of AI and safety in the mobility market [15]. The main focus of KI-A was the development of a methodology for safety assurance for ML algorithms, in particular for object detection and instance segmentation. Most of the approaches used for safety included conventional methods, such as visual analytics [16], combinatorial testing [17], data augmentation [13] and others. All these methods work within a well-defined, limited semantic space. A couple of methods in KI-A used ML techniques, such as principal component analysis (PCA) [15] and search-based testing [18], to specifically analyze and search for insufficiencies in data. However, all of these methods require some insight or a-priori knowledge about the root cause of the specific insufficiency to be applied successfully. Our approach does not assume any specific insufficiency from the outset; instead, each layer of uncertainty (cf. Fig. 1) is analyzed by itself and the root cause is identified by process of elimination.</p>
        <p>Besides KI-A and beyond AD, ML has successfully been used for data clustering and analysis, such as PCA, k-means or Latin hypercube sampling, to define relevant sceneries and reduce the effort of verification and validation [19]. Again, none of the mentioned methods explores all the different possible insufficiencies due to input, data or model, but instead already knows where to look.</p>
        <p>In [<xref ref-type="bibr" rid="ref6">6</xref>] we introduced a framework to create explainable, introspective white-box models, derived from black-box model test evaluation, to predict different safety-related aspects of the deep neural network (DNN) object detector. Unfortunately, the measured performance of the white-box models did not allow for an unrestricted use as reliable safety monitors. In this contribution we investigate whether we can use the white-box models themselves to analyze certain safety properties and link the obtained results to insufficiencies within different layers, cf. Fig. 1. Put differently: can we use the semantic input of the white-box to characterize the black-box regarding safe, unsafe and unknown behavior?</p>
        <p>On the one hand, we examine the single DTs of the RF white-box models to identify possible equivalence classes. This enables us to possibly define an efficient test strategy for verification and validation further down the ML development cycle. On the other hand, we investigate whether contradictory samples within DT leafs indicate unknown unknowns. Here, unknown unknowns represent previously unconsidered parameters from the complex environment, not part of the initial problem space.</p>
        <p>Regarding results, the analysis of DT leafs might not end conclusively for either equivalence classes or unknown unknowns. This does not mean there are definitely no such cases to be found, but instead that, given the input space, equivalence classes or unknown unknowns are unlikely to be found within these data.</p>
        <p>In principle, the proposed approach can be applied to any kind of ML data; however, it greatly benefits from certain restrictions to be usable in safety. Firstly, the input dimensions should have a semantic description, meaning they have a humanly interpretable representation in the real world. For instance, a semantic dimension may refer to an object’s attribute (e.g. size) or environmental conditions (e.g. rain), whereas non-semantic descriptions include technical aspects (such as pixel intensity, blur, etc.). Secondly, the input space should be limited. The aggregation and interpretation of multiple and different input parameters may result in cases too complex to be analyzed and used in a safety argumentation.</p>
        <p>The basic concept of DTs is data partitioning [20]. To this end, the input space of the data is repeatedly partitioned into disjoint, smaller subsets, such that each subset is consistent with regard to the desired output. A visualization of a simple DT is given in Fig. 3. As can be seen, the input data is partitioned into subsets by splitting at each node, using the most suitable input feature (in conjunction with a specified error function, details in section 3.1). The final data clusters, i.e., the leafs of the DTs (from now on, we will use the terms interchangeably), represent the “most consistent” partitioning given the defined hyperparameters and provided data. The collection of multiple DTs together is an RF, and this ensemble provides its final output by aggregating the predictions of each single DT. There are different versions of RFs, such as bagging and boosting extensions, that differ in the way the DTs are created from the provided data (see section 4.2). The mathematical fundamentals to create DTs, such as suitable split criteria, and their interpretation for safety analyses are given in the following sections.</p>
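        <p>The ensemble structure described above can be inspected directly in code. The following sketch (our own illustration with invented toy data, not the paper’s introspective data set) fits a small RF with scikit-learn and reads out the node and leaf structure of one of its DTs:</p>

```python
# Illustrative sketch: fit a small random forest on toy data and inspect the
# structure of one of its decision trees via scikit-learn's tree_ attribute.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(200, 1))            # toy input, e.g. object distance in m
y = 1.0 / (1.0 + np.exp(0.2 * (X[:, 0] - 25)))   # toy target, e.g. softmax confidence

forest = RandomForestRegressor(n_estimators=10, min_samples_split=10,
                               min_impurity_decrease=0.001, random_state=0)
forest.fit(X, y)

tree = forest.estimators_[0].tree_               # first DT of the ensemble
is_leaf = tree.children_left == -1               # leaf nodes have no children
print("nodes:", tree.node_count, "leafs:", int(is_leaf.sum()))
# Each internal node stores its split criterion s: a feature index and a threshold.
print("root split: feature", int(tree.feature[0]), "threshold", float(tree.threshold[0]))
```

        <p>The RF’s final output is the aggregation (here: the mean) of the predictions of all single DTs, which is what makes each individual DT inspectable on its own.</p>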
        <p>[Figure 3: Visualization of a simple DT — nodes n with split criteria s (branches ≤ and &gt;) leading to leafs.]</p>
        <p>The underlying methodology of DTs creates disjoint subsets of inputs that produce the same output (while minimizing variance) [21]. This is very similar to the definition of equivalence class from ISO/TR 4804 [<xref ref-type="bibr" rid="ref4">4</xref>], which states that equivalence classes are based on the division of inputs and outputs, such that a (single) representative test can be defined. Therefore, we use the leafs of DTs to define an equivalence class. In addition, we use the quantitative split criteria {s1, ..., sn} of the DT’s nodes to define the boundaries (i.e., limits) of the corresponding equivalence class, cf. Fig. 3.</p>
        <p>The foundation of DTs is data partitioning by (binary) splits, to uncover complex patterns. For each possible binary split value s at node n the resulting decrease in impurity ∆i(s, n) is determined by [21]:</p>
        <p>∆i(s, n) = i(n) − (N_nL / N_n) · i(nL) − (N_nR / N_n) · i(nR),   (1)</p>
        <p>with N_n denoting the size of the training data for node n, N_nL and N_nR representing the samples from the whole training data assigned to the left child and right child respectively, and i as the impurity function. The maximization of the decrease in impurity can be understood as the best possible split s for node n into its two children (nL and nR). For regression tasks, typically the squared error loss is computed with Eq. (1) to determine the error during training. Therefore, i(n) calculates the local squared error loss, i.e. for a specific node n, via [21]:</p>
        <p>i(n) = (1 / N_n) · Σ_{x_j, y_j ∈ S_n} (y_j − ȳ_n)².   (2)</p>
        <p>In Eq. (2), x_j denotes a specific input feature and y_j the corresponding output from the subset of learning samples S_n; ȳ_n and y_j are the model output (the node’s arithmetic mean) and the desired output respectively. Both equations, (1) and (2), essentially split the data into clusters that produce the most similar output. Figure 4 shows an example of data splitting, containing measurement samples for object distance (input feature x) and the corresponding softmax confidence (y). The best split s divides S_n into the two clusters, S_nL and S_nR, that have the highest decrease in impurity. The horizontal lines within the left (S_nL) and right (S_nR) cluster indicate the arithmetic mean for each of them. Any other split, for instance s* (cf. Fig. 4), yields:</p>
        <p>∆i(s, n) &gt; ∆i(s*, n).   (3)</p>
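        <p>Equations (1) and (2) can be made concrete with a small, self-contained re-implementation; the sample values below are invented for illustration and are not measurements from the case study:</p>

```python
# Toy re-implementation of Eq. (1) and (2): squared-error impurity i(n) and
# decrease in impurity ∆i(s, n) for a one-dimensional binary split.
import numpy as np

def impurity(y):
    """Eq. (2): mean squared deviation from the node's arithmetic mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def impurity_decrease(x, y, s):
    """Eq. (1): ∆i(s, n) = i(n) - (N_nL/N_n) i(nL) - (N_nR/N_n) i(nR)."""
    left, right = y[x <= s], y[x > s]
    n = len(y)
    return impurity(y) - len(left) / n * impurity(left) - len(right) / n * impurity(right)

# Hypothetical samples: object distance x vs. softmax confidence y.
x = np.array([5, 8, 11, 14, 30, 33, 36, 40], dtype=float)
y = np.array([0.9, 0.88, 0.92, 0.9, 0.2, 0.25, 0.18, 0.22])

candidates = (x[:-1] + x[1:]) / 2   # candidate splits: midpoints between samples
best = max(candidates, key=lambda s: impurity_decrease(x, y, s))
print(best)  # → 22.0, the split separating the two homogeneous clusters
```

        <p>Any other candidate split s* yields a smaller decrease in impurity, which is exactly the relation stated in Eq. (3).</p>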
        <p>[Figure 4: Example of data splitting with the best split s, an alternative split s*, and the resulting clusters S_nL and S_nR. Figure 5: (a) a homogeneous leaf; (b) an inhomogeneous leaf containing contradictory samples.]</p>
        <p>A leaf alone, however, does not guarantee an equivalence class. The methodology of RFs and DTs requires some hyperparameters to be set that influence the splitting and, therefore, the resulting clusters. Most important for our considerations are: • the threshold τ for the minimum decrease in impurity, i.e., ∆i(s, n) &lt; τ, and • the minimum amount of samples κ to allow further splits, i.e., N_n &gt; κ.</p>
        <p>The first threshold τ prevents overfitting, as no threshold would allow the splitting of virtually identical values as long as there is any decrease in impurity. Referring to Fig. 4, nearly all measured softmax confidence values will be different after some decimal places (dependent on the precision of the data). Therefore, even splitting samples that vary only after several digits will decrease impurity, eventually creating DTs with one single data point per leaf. The second parameter κ also prevents overfitting. Let us assume that κ is set to the smallest possible value, which is 2. Given a small enough τ, each single leaf will then converge on single data points. Therefore, both τ and κ together influence the resulting clusters and whether meaningful equivalence classes can be defined. Please note that there are more hyperparameters to prevent overfitting, but they are not relevant for this contribution. Please also note that during our analyses (Section 4) we did inspect all of the possible hyperparameters that could in principle provide an explanation for the seen results, e.g. tree_depth, to make sure they are not responsible for it.</p>
        <p>In order to define an equivalence class, the DT leaf must contain more samples than κ, i.e., N &gt; κ. The basic reasoning is the following: if a leaf contains more samples than κ, a split could have been possible; however, it was not necessary, as τ has not been exceeded. To put it differently, there are no more disjoint subsets within these data, cf. Fig. 5(a). The only other possibility is that a further split was not possible although κ allowed for it, given the model, data and input features. Such a leaf can indicate unknown unknowns, cf. Fig. 5(b).</p>
        <p>Do note that there are numerous leafs per DT that are endpoints due to the thresholds of τ or κ being reached. These clusters can be interpreted neither as equivalence class nor as unknown unknowns. Remember that τ and κ primarily prevent overfitting. On the one hand, smaller and smaller values for τ and κ will converge on clusters with single data points, consequently creating equivalence classes which are correct from a safety point of view but carry no useful information. On the other hand, larger values will always serve as limits for the clusters, and it is impossible to know whether additional clusters were not necessary or not possible; as such, they offer no information about potential unknown unknowns.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Unknown unknowns</title>
        <p>The goal of SOTIF is to identify potential unknown hazardous scenarios, arising from the interaction between the system and its complex environment, and to mitigate their effects. To achieve this, SOTIF recommends to search for triggering conditions that lead to potentially hazardous scenarios. Unfortunately, there is no established approach or method to identify such triggering conditions for all possible systems and environments. Furthermore, the nature of some of these triggering conditions can be defined as unknown unknowns, i.e., something we are not even aware that we do not know. In our context it refers to a feature of the input space that was not considered when approximating the factors that influence the performance of the black-box model.</p>
        <p>The key idea is to identify and use inconclusive, yet interpretable data clusters and, by process of elimination, show that the only possible explanation for their existence is an unknown unknown. In the previous section 3.1, we examined the mathematical foundation for data clustering via DTs. In particular, equations (1) and (2) partition the available data into the best possible disjoint and coherent clusters. However, in some cases the resulting, final clusters still have high impurity, although further splitting, in principle, is allowed. Simply put, the cluster contains contradictory data, which cannot be split meaningfully anymore within the defined scope, cf. Fig. 5(b).</p>
        <p>How can this be interpreted? Given that the hyperparameters τ and κ are not exceeded, either the input, data or model did not allow for any further optimization. Now each single layer (cf. Fig. 1) and potential insufficiency must be analyzed on its own to identify the root cause. To clearly uncover an unknown unknown, neither data nor model shall be the root cause of the impure data clustering. Only if a “seemingly” new input feature can resolve the contradiction is an unknown unknown plausible. “Seemingly”, as it is yet unknown, even by the process of elimination, whether such a semantic feature can be identified and, if so, which one it is specifically. Regarding the modelling, only explainable or interpretable models are useful for the presented approach, as only those allow comprehensible equivalence classes to be defined. To investigate whether the modelling itself is responsible for the inhomogeneous clustering of data, alternative or modified modelling approaches should be deployed and compared. For the data, the corresponding distributions of the input features {f1, ..., fn} within the boundaries of the potential unknown unknown must be investigated.</p>
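        <p>In scikit-learn terms (the library used in Section 4), the two hyperparameters plausibly correspond to min_impurity_decrease (τ) and min_samples_split (κ) — an assumed mapping for illustration, not confirmed by the paper. The sketch below fits a DT on invented data and flags the leafs with N &gt; κ, the only leafs that can indicate either an equivalence class (low impurity) or an unknown unknown (high impurity):</p>

```python
# Sketch with an assumed mapping: τ -> min_impurity_decrease, κ -> min_samples_split.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

KAPPA, TAU = 10, 0.001
rng = np.random.default_rng(1)
X = rng.uniform(0, 50, size=(300, 2))              # toy features, e.g. distance, occlusion
y = (X[:, 0] < 25).astype(float) + rng.normal(0, 0.05, 300)

dt = DecisionTreeRegressor(min_samples_split=KAPPA,
                           min_impurity_decrease=TAU, random_state=1).fit(X, y)
t = dt.tree_
leafs = np.flatnonzero(t.children_left == -1)      # leaf nodes have no children
for leaf in leafs:
    if t.n_node_samples[leaf] > KAPPA:             # a further split was allowed but not taken
        print(f"leaf {leaf}: N={t.n_node_samples[leaf]}, impurity={t.impurity[leaf]:.4f}")
```

        <p>Leafs passing this filter with low impurity are equivalence-class candidates; those with high impurity are the inconclusive clusters analyzed as potential unknown unknowns.</p>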
      </sec>
      <sec id="sec-4-3">
        <title>4. Safety Analyses</title>
        <p>Based on the results from [<xref ref-type="bibr" rid="ref6">6</xref>], we conduct our safety analyses and demonstrate the presented methodology via a case study. In our previous contribution we recognized that the reliability of the RF models is not sufficient for an unrestricted usage for safety. Therefore, we specifically analyzed the model regarding its single DTs, including their split criteria and leaf clusters, to explain the mixed performance results. The model estimates the reliability of the softmax confidence provided by a DNN object detector. Please note that, in order to use the model as a safety predictor, specific input features from the complex environment, which are arguably safety-relevant, have been pre-selected.</p>
        <p>To create the model, the implementation from scikit-learn was used, with thresholds κ = 10 and τ = 0.001. For other hyperparameters, please refer to [<xref ref-type="bibr" rid="ref6">6</xref>]. An investigation of the model revealed strong similarities between the single DTs within the model. Additionally, the DTs occasionally expressed leaf clusters similar to the ones shown in Fig. 5(a) and (b). A further analysis of all leafs from the DTs revealed three basic cases: 1. Leafs that show little variance in data and fulfill N = κ, 2. Leafs that show little variance in data and fulfill N &gt; κ, 3. Leafs that show high variance in data and fulfill N &gt; κ.</p>
        <p>Following the identification of the three basic cases, the most promising leafs for both overall performance and safety significance are leafs that accumulate many similar data points without surpassing any of the defined limits of the hyperparameters τ and κ. Therefore, if N &gt; κ is true, at least one input feature is a coherent predictor. These clusters can be identified by searching the final number of samples per leaf and comparing it to κ.</p>
        <p>The methodology of RFs creates each DT from a subset of the complete training data. Therefore, all DTs are based on slightly different data sets, and identified potential equivalence classes may only exist within one single DT and not represent an overall equivalence class. In order to verify a potential equivalence class, the aggregated split criteria {s1, ..., sn} should be applied to the complete data set. If all the samples show a similar output, an equivalence class can, in principle, be defined. For our presented analysis we selected the most promising equivalence class, i.e. the least restrictive one regarding its split criteria {s1, ..., sn}, out of all potential candidates. Table 1 shows an identified equivalence class that also represents a technical limitation of the trained black-box object detector: all objects with a detection area smaller than 3.6233 m², at a noise level of at least 74%, do not have a softmax confidence higher than 0.1, cf. Fig. 6.</p>
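        <p>The three basic cases can be detected programmatically. The helper below is our own illustrative sketch on toy data (the variance bound var_eps for “little variance” is an assumed parameter, not a value from the paper):</p>

```python
# Illustrative sketch: sort a fitted DT's leafs into the three basic cases,
# using N (samples per leaf), κ, and an assumed variance bound for "little variance".
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def classify_leafs(tree, kappa, var_eps=0.01):
    cases = {1: [], 2: [], 3: []}
    for node in np.flatnonzero(tree.children_left == -1):  # leaf nodes
        n, var = int(tree.n_node_samples[node]), float(tree.impurity[node])
        if var <= var_eps and n <= kappa:
            cases[1].append(node)  # case 1: little variance, no further split allowed
        elif var <= var_eps:
            cases[2].append(node)  # case 2: little variance although N > κ (early stop)
        elif n > kappa:
            cases[3].append(node)  # case 3: high variance despite N > κ
    return cases

# Toy demonstration data with two clearly separated output clusters.
rng = np.random.default_rng(2)
X = rng.uniform(0, 50, size=(200, 1))
y = np.where(X[:, 0] < 25, 0.9, 0.1) + rng.normal(0, 0.02, 200)
dt = DecisionTreeRegressor(min_samples_split=10, min_impurity_decrease=0.001,
                           random_state=2).fit(X, y)
cases = classify_leafs(dt.tree_, kappa=10)
print({k: len(v) for k, v in cases.items()})
```

        <p>Leafs matching none of the patterns (high variance with N ≤ κ) are the endpoint leafs that, as discussed above, offer no information for either analysis.</p>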
      </sec>
      <sec id="sec-4-4">
        <title>4.1. Equivalence class of equal DNN behavior</title>
        <p>The first case is the most common one. According to equations (1) and (2), together with suitable τ and κ, the RF methodology created the best possible leafs while preventing overfitting. These clusters represent a reasonable model, but no useful information for safety can be extracted from them. The second case is an interesting abnormality, as it signifies an early stopping: given τ, it was not necessary to create additional child clusters, as the decrease in impurity is insignificant. In brief, all data expressed the same output behavior without colliding with the hyperparameter thresholds. This case is discussed in detail in this section. The last case shows impure clusters, although the defined hyperparameters do not account for this. Therefore, the root cause for this inhomogeneous data must lie within one of the different layers, as shown in Fig. 1. This is the subject of Section 4.2. With these analyses we try to identify insufficiencies and link our findings to safety artifacts from ISO 21448 and ISO/TR 4804.</p>
        <p>Effectively, none of the objects fulfilling Table 1 are detected by the black-box object detector, independent of distance or occlusion. In terms of an ISO/TR 4804 equivalence class this means that, for all samples fulfilling Table 1, one singular test is sufficient to verify the black-box system’s response.</p>
        <p>Figure 7: (left) Detection of multiple objects under ideal conditions. (right) The noise level has been increased to 74%; only the object with area 3.753 m² is still detected.</p>
        <p>Apart from such a successful equivalence class, some of the potential clusters do not exhibit the same behavior over all corresponding samples. The split criteria {s1, ..., sn} do not represent an equivalence class if they are only true within specific DTs, but not for the complete data. Figure 6 shows a verification of two potential equivalence classes. The first plot (eq. class) visualizes the softmax confidence for all samples complying with Table 1. This equivalence class has been derived from multiple DTs, on average with N = 15. In contrast, the second plot (invalid cluster) is an example of a potential equivalence class that is not homogeneous for all samples within the identified {s1, ..., sn}.</p>
        <p>Figure 6: Examples of (un)successful equivalence classes.</p>
        <p>Besides the verification via sample outliers, the equivalence classes that showed homogeneous output in all data have also been “qualitatively” verified by testing the black-box detector. In this context, qualitatively means that the simulation environment of CARLA [22] does not allow a specific object size to be set; instead, predefined assets can be selected and deployed. However, the precise object area (within a frame) needs to be derived and transformed (incl. rounding and translation errors) to fit the developed safety framework [23]. Therefore, the exact object area of 3.6233 m² as limit could not be verified beyond any doubt.</p>
        <p>For the equivalence class provided by Table 1, a set of test cases has been created. One such scenario, with multiple objects and detection areas smaller and larger than ∼3.6233 m², has been generated and tested, cf. Fig. 7. Indeed, the verification results of the different test cases confirm this combination of noise variance and object area as a credible detection limit. However, the verification also revealed that this equivalence class represents the upper (or lower) limit. For instance, objects are sometimes lost before the limits have been reached. Within this contribution we did not investigate whether these results could be used to refine the limits of the identified equivalence class into fine-grained subcategories (cf. Table 1), especially since transformation and translation errors could not be ruled out entirely.</p>
        <p>During the safety analysis to positively identify equivalence classes, almost all results converged on a combination of factors representing a technical limitation of the system, such as robustness against noise and area of the object, or maximum detection distance. The remaining cases that are seemingly not technical limitations, but do show convergence, are still under investigation regarding their meaning (as they require very accurate CARLA simulation and transformation).</p>
        <p>Due to the abstraction of the input space by the methodology of [<xref ref-type="bibr" rid="ref6">6</xref>], the identified equivalence class can be used as a logical scenario, see [24], for ISO/TR 4804 validation efforts.</p>
        <p>4.2. Unknown unknowns (of white-box)</p>
        <p>Another anomaly within the DT structure are leafs that show high variance in data, but seem not to gain anything from additional splits, i.e., N &gt; κ. Equations (1) and (2) ensure that the best possible data clusters are created, except if this is impossible given either model, data or input. One such instance is shown in Fig. 5(b). We selected this cluster specifically, as it appears to be the most suitable due to its comparatively broad limits {s1, ..., sn} for the input features. Similar leafs have been identified as a reoccurring pattern across multiple DTs. After aggregation of the split criteria, the leafs in question converge on the criteria presented in Table 2. The appearance of such clusters is one explanation for the mixed performance results of the model, as reported in [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
        <p>Table 2: Inhomogeneous cluster of a DT and its boundaries.
Input feature | Interval | Unit
Object distance | 18.85 ≤ d ≤ 31.25 | [m]
Object area | 2.018 ≤ a | [m²]
Object occlusion | all | [%]
Noise variance | 62 ≤ v ≤ 78 | [%]</p>
        <p>According to Fig. 1, the first layer in which to investigate a performance insufficiency is the ML model itself. In order to determine whether the modelling approach itself is responsible, modified approaches have been implemented and analyzed. Specifically, we used the RF extensions of boosting (via LightGBM) and multi-output regression (via XGBoost) for Python. Boosting (by weighing) uses a combination of bootstrap and evaluated test data to train the successive DT [25]. The idea is that this methodology explicitly tackles high-variance leafs, as it penalizes misclassification by weighing the entire training set accordingly. With multi-output regression, several output variables are predicted simultaneously [21]. In [<xref ref-type="bibr" rid="ref6">6</xref>] we trained three different models for three different target variables. Via multi-output regression we hope to leverage some dependencies between these output variables, such as a correlation between softmax confidence and bounding box size shifts. The minimization of impurity, Eq. (1), via the squared error, Eq. (2), is fundamental to all of the extensions. Please remember that the selection of suitable approaches is limited by the necessity for explainability.</p>
        <p>The evaluation of the overall performance for all models reveals that the measured performance converges, see Fig. 8. All three models display a relatively high amount of correct predictions for very low and high softmax confidences. Be aware that the Multi-output model has a slightly smaller test set, as for its sequential model building process the samples with false negatives cannot be used. In terms of quantitative values, the Mean</p>
        <p>For Fig. 9, samples outside the boundaries of Table 2 for object distance and area have been filtered out. Also, the corresponding softmax score is divided into low and high. One distinctive feature of Fig. 9 is the relatively high amount of data points that show high and low confidence at the same time for object distances of around 21 m. This contradiction can seemingly not be resolved by recruiting additional input features, such as object occlusion. The existence of such data points provides one plausible explanation why the cluster is inhomogeneous, despite N &gt; κ. Additionally, there exists an almost straight line of low confidence scores at 23 m. This most likely indicates a technical limit, but as this cluster could not be split further by the available input dimensions, it may not be represented well in the available data. On the whole, the displayed section, limited by Table 2, could not be split into homogeneous clusters by any of the available input dimensions.</p>
        <p>Taking into consideration the distribution of Fig. 5(b), more data samples will most likely not enforce another split into more homogeneous clusters, as N &gt; κ already indicates that this is not the root cause. The only case where additional samples help is if the underlying data distribution within the other input features is not appropriate, i.e. imbalanced, as this represents skewed information. An investigation of the data revealed that the distributions for object occlusion and object area are not entirely balanced. The reasons are that objects have fixed sizes and that, for occlusion, at least two objects are required, one of which is definitely not occluded, while with three or more objects multiple ones are fully occluded, thus creating small biases. However, all in all the data distribution is considered sufficiently good to rule it out as
Squared Error (MSE) and Mean Absolute Error (MAE) root cause for the poor DT data clustering.
show maximum improvements of ∆ MSE ≤ 1.22e-2 and If great data imbalance is not evident, there are only
∆ MAE ≤ 2.32e-2 between the new models and RF base two possible impacts additional samples can have. Either,
(with MSE= 2.25e-2 and MAE= 8.00e-2). Unfortunately, (a) extra measurement samples skew the distribution into
this means no model performs significantly better than a certain direction (basically creating a bias), but still, the
the others. Due to the diferent training approaches
between the models, a detailed comparison of leafs and
structure is not possible without extensive efort. This 1
result either indicate, that these kind of models are
inherently unable to predict the black-box behavior or that 0.8
there are specification insuficiencies in data and/or input
that cause this response. The outcome of all of this is, 0.6
changing the model does not seem to resolve
inhomogeneous clustering, as outliers are apparent for all models 0.4
(cf. Fig. 8). On account of this, we continue by
investigating possible unknown unknowns by visualizing the 0.2
relevant data distribution.</p>
        <p>By the process of elimination, to rule out implausible 00 0.2 0.4 0.6 0.8 1
root causes, we arrive at the collected data. We continue
by highlighting the data distribution given by Table 2. Figure 8: Diferent explainable models and their performance
Figure 9 displays the data points for the relevant input (diagonal line shows ideal behavior). None of them shows a
features of Obj. distance and Noise variance, narrowed definitive advantage over the others, suggesting a root cause
down by the specific split criteria { 1, ..., 4}. For a con- independent of the selected ML methodology.
venient visualization, the data points of 2.018m2 &lt; Obj.
cluster would remain collectively inhomogeneous, or, (b) the new input feature should not correlate with either
the new samples alone can be partitioned into its own of these, as they do not carry any useful information to
cluster (split by the remaining, available input features). disentangle the data, cf. Fig. 9. We also excluded biases.
Although investigating cases (a) and (b) could provide During our initial inspection of the data we already
additional information, no experiments have been car- identified one irregularity, namely, data points that have
ried out within this contribution, as the expected results low softmax confidence at a specific Obj. distance
would not impact the next analysis.  = 23m across virtually all noise levels. Although this is</p>
        <p>All the previous analyses lead us to the only plausible not completely uncommon, see areas outside the
highconclusion: The introspective data set does not include lighted cluster in Fig. 9(left), in this particular case,
howall the necessary data dimensions. ever, none of the other input features could be recruited to</p>
        <p>
          The next step involves reviewing the input features . separate these outliers. Subsequently, we examined the
In order to act as an explainable, introspective model, the corresponding frames in order to determine a potential
input space for the white-box model has been reduced efect that could cause such a hard limit.
to certain input features, called safety features, in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Our review revealed, that for all these frames
Following the process of elimination, neither the model the carla.WeatherParameters contained a nonzero
nor the data provide any convincing evidence that they value for fog_density. In our initial setup to create an
cause this observed inconsistency, cf. Fig. 5(b). Therefore, explainable, introspective model we introduced “Noise
only the input features remain as possible root cause. variance” as technical implementation for all
environThe input features in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] have carefully been selected mental disturbances, such as rain, fog or white-noise. So
based on two principles that ensure safety-relevance and these efects cannot be distinguished from each other
redundancy: within the introspective data set. As a result, they act as
1. The feature is safety-relevant, i.e., factors that unknown unknowns within this system’s environment
typically cause trafic accidents in human driving, (introspective model). Although rain and fog produce
similar visual efects in CARLA, fog acts as a limitation
2. The feature must be measurable via a diferent for the maximum field-of-view distance and therefore
sensor, i.e., independent of the black-box predic- also limits the capabilities of the black-box object
detection. tor. With regards to the (safety) principles 1. and 2., the
These principles still apply, so possible new features must feature “Fog density” can definitely be classified as
safetyadhere to these principles to be useful for a reliable safety relevant and also be detected via other sensors. A linear
monitor. correlation analysis has been carried out to determine
        </p>
        <p>The basic strategy to discover possibly new, important the dependence of Fog density with other input features,
input features revolves around the idea to use the evalu- see Fig. 10. As this table shows, a strong positive
correlaated analysis results from the previous tests. According tion exists between noise and fog, as the basic simulation
to the split criteria of Table 2, occlusion efects are unim- efect is similar. It can also be seen, that both, noise and
portant and object’s area only requires a minimum value fog, show minimal correlations with the other, remaining
for detection (given the noise level interval). Therefore, input features. This indicates a good candidate for a now
(un)known unknown.</p>
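        <p>Such a linear correlation check can be sketched in a few lines. The snippet below is only an illustration with synthetic data: the feature names and the coupling between noise and fog are assumptions made for the example, not the actual introspective data set.

```python
import numpy as np

# Synthetic stand-in for the introspective data set (assumed feature names).
# In the simulation, fog and sensor noise produce similar visual effects,
# so the two disturbance features are coupled here on purpose.
rng = np.random.default_rng(0)
n = 500
noise_variance = rng.uniform(0, 100, n)                    # [%]
fog_density = 0.8 * noise_variance + rng.normal(0, 10, n)  # coupled disturbance
object_distance = rng.uniform(0, 70, n)                    # [m], independent

# Pearson correlation matrix over the candidate safety features.
corr = np.corrcoef(np.vstack([noise_variance, fog_density, object_distance]))
print(corr.round(2))
```

In such a matrix, a strong noise–fog entry combined with near-zero entries against the remaining features is exactly the pattern that marks the candidate as informative but not redundant.</p>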
        <p>Based on these findings, we separated “Fog density”
from Noise variance and included the meta-data in the
introspective data set as new input feature. The preliminary
experiments indeed show an improvement. Since a new
input feature was introduced, the resulting DTs cannot
simply be compared. It is, however, possible to filter for
all the leafs that fall within the previous boundaries of
Table 2. This inspection showed that additional, improved
sub-clusters have been created, see Fig. 11. By
identifying and including a previously unknown unknown input
feature, the previously inconsistent data cluster could
successfully be subdivided into more balanced leafs, showing
the relevance of this input dimension for the
introspective model. Please be aware that the new sub-clusters
can still result in any of the three basic cases for DT leafs
(cf. Section 4), so the analysis might not end conclusively
every time.</p>
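        <p>The leaf-level inspection used throughout this analysis can be illustrated with a minimal sketch. Everything below is hypothetical: a toy data set in which the target depends on a withheld input dimension, so that some leafs keep a high impurity no matter how the available feature is split, mirroring the inhomogeneous cluster of Table 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy introspective data: the softmax-like score depends on a hidden
# "fog" flag that is deliberately withheld from the model's inputs.
rng = np.random.default_rng(1)
n = 1000
distance = rng.uniform(0, 70, n)
hidden_fog = rng.integers(0, 2, n)  # unobserved input dimension
score = np.where(distance < 30, 0.9, 0.2) * np.where(hidden_fog == 1, 0.3, 1.0)

X = distance.reshape(-1, 1)  # fog is not part of the feature space
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, score)

# Flag inhomogeneous leafs: leafs whose impurity (MSE) stays high even
# after all available splits -- candidates for a missing input feature.
t = tree.tree_
leaf_ids = np.where(t.children_left == -1)[0]
inhomogeneous = [int(i) for i in leaf_ids if t.impurity[i] > 0.01]
print(inhomogeneous)
```

Because the decisive dimension is withheld, at least one leaf cannot be made homogeneous by any split on the visible feature; adding the hidden dimension to X would let the tree resolve it, analogous to adding “Fog density”.</p>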
      </sec>
      <sec id="sec-4-5">
        <title>Acknowledgments</title>
        <p>This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Future Work</title>
      <sec id="sec-6-1">
        <p>The work presented in this paper shows how explainable ML can help and guide us to discover equivalence classes (ISO/TR 4804) and unknown unknowns (SOTIF).</p>
        <p>The developed approach makes use of the mathematical
foundation of DTs to identify leafs and interpret their
meaning, with respect to defined thresholds and their
degree of data variance. We successfully use the
methodology to define an equivalence class (Table 1) and uncover
an unknown unknown (Fig. 11) for the application of an
explainable, outside-model estimator.</p>
        <p>Some questions, however, do remain. While some equivalence classes can be identified and meaningfully interpreted, other cases beyond system (capability) limitations are difficult to humanly comprehend. Within the described use case we were able to identify one unknown unknown by disentangling one inconsistent data cluster.</p>
        <p>surance of Machine Learning for Chassis Control Functions, in: Proc. Int. Conf. on Comp. Safety, Reliability, and Security, Cham, 2021, pp. 149–16.
[8] A. Schwaiger, P. Sinhamahapatra, J. Gansloser, K. Roscher, Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?, in: Proc. AISafety@IJCAI, 2020.
[9] S. Khastgir, H. Sivencrona, G. Dhadyalla, P. Billing, S. Birrell, P. Jennings, Introducing ASIL inspired dynamic tactical safety decision framework for automated vehicles, in: Proc. 2017 IEEE 20th Int. Conf. on Intelligent Transportation Systems (ITSC), 2017, pp. 1–6.
[10] P. Koopman, M. Wagner, Toward a Framework for Highly Automated Vehicle Safety Validation, Technical Report, SAE Technical Paper, 2018.
[11] D. Rao, P. Pathrose, F. Huening, J. Sid, An approach for validating safety of perception software in autonomous driving systems, in: Proc. Model-Based Safety and Assessment: 6th Int. Symp., IMBSA 2019, Thessaloniki, Greece, October 16–18, 2019, Proc. 6, 2019, pp. 303–316.
[12] M. Ghadhab, S. Junges, J.-P. Katoen, M. Kuntz, M. Volk, Model-based safety analysis for vehicle guidance systems, in: Proc. Comp. Safety, Reliability, and Security: 36th Int. Conf., SAFECOMP, 2017, pp. 3–19.
[13] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, B. Lakshminarayanan, AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty, arXiv:1912.02781 [cs, stat] (2020).
[14] A. Adee, R. Gansch, P. Liggesmeyer, C. Glaeser, F. Drews, Discovery of Perception Performance Limiting Triggering Conditions in Automated Driving, in: Proc. 2021 5th Int. Conf. on System Reliability and Safety (ICSRS), 2021, pp. 248–257.
[15] T. Fingscheidt, H. Gottschalk, S. Houben, Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety, Springer, 2022.
[16] E. Haedecke, M. Mock, M. Akila, ScrutinAI: A Visual Analytics Approach for the Semantic Analysis of Deep Neural Network Predictions, in: Proc. EuroVis Workshop on Visual Analytics (EuroVA), 2022, pp. 73–77.
[17] C. Gladisch, C. Heinzemann, M. Herrmann, M. Woehrle, Leveraging combinatorial testing for safety-critical computer vision datasets, in: Proc. 2020 IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2020, pp. 1314–1321.
[18] C. Gladisch, T. Heinz, C. Heinzemann, J. Oehlerking, A. von Vietinghoff, T. Pfitzer, Experience paper: Search-based testing in automated driving control applications, in: Proc. 2019 34th IEEE/ACM Int. Conf. on Automated Software Engineering (ASE), 2019, pp. 26–37.
[19] T. Wuellner, S. Feuerstack, A. Hahn, Clustering environmental conditions of historical accident data to efficiently generate testing sceneries for maritime systems, in: Proc. Model-Based Safety and Assessment: 6th Int. Symp., IMBSA 2019, Thessaloniki, Greece, October 16–18, 2019, Proc. 6, 2019, pp. 349–362.
[20] W.-Y. Loh, Classification and regression trees, Wiley interdisciplinary reviews: data mining and knowledge discovery 1 (2011) 14–23.
[21] G. Louppe, Understanding Random Forests: From Theory to Practice, Ph.D. thesis, University of Liège - Faculty of Applied Sciences, 2014.
[22] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An open urban driving simulator, in: Proc. 1st Annual Conf. on Robot Learning, volume 78, 2017, pp. 1–16.
[23] I. Kurzidem, A. Saad, P. Schleiss, A Systematic Approach to Analyzing Perception Architectures in Autonomous Vehicles, in: Proc. 7th Int. Symp. on Model-Based Safety and Assessment (IMBSA), Lisbon, 2020, pp. 149–162.
[24] T. Menzel, G. Bagschik, M. Maurer, Scenarios for development, test and validation of automated vehicles, in: Proc. 2018 IEEE Intelligent Vehicles Symp. (IV), 2018, pp. 1821–1827.
[25] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Machine learning 40 (2000) 139–157.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yurtsever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carballo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <article-title>A survey of autonomous driving: Common practices and emerging technologies</article-title>
          ,
          <source>in: Proc. IEEE Access</source>
          , volume
          <volume>8</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>58443</fpage>
          -
          <lpage>58469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>International</given-names>
            <surname>Organization</surname>
          </string-name>
          for Standardization,
          <source>Safety Of The Intended Functionality - SOTIF (ISO/- PAS 21448)</source>
          , ISO,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Herd</surname>
          </string-name>
          ,
          <article-title>Addressing uncertainty in the safety assurance of machine-learning, Frontiers in Computer Science Hypothesis and theory article (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>International</given-names>
            <surname>Organization</surname>
          </string-name>
          for Standardization,
          <article-title>Road vehicles - Safety and cybersecurity for automated driving systems - Design, verification and validation</article-title>
          (ISO/TR 4804:
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hagiwara</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kurzidem</surname>
          </string-name>
          ,
          <article-title>Towards the Quantitative Verification of Deep Learning for Safe Perception</article-title>
          ,
          <source>in: Proc. 2022 IEEE Int. Symp. on Software Reliability Engineering Workshops (ISSREW)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kurzidem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Misik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          , Safety Assessment:
          <article-title>From Black-Box to White-Box</article-title>
          ,
          <source>in: Proc. 2022 IEEE Int. Symp. on Software Reliability Engineering Workshops (ISSREW)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>295</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kurzidem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Unterreiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graeber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Becker</surname>
          </string-name>
          , Safety As-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>