<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI for Safety: How to use Explainable Machine Learning Approaches for Safety Analyses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iwo Kurzidem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Burton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schleiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Cognitive Systems IKS</institution>
          ,
          <addr-line>Hansastraße 32, D-80686 Munich</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Current research in machine learning (ML) and safety focuses on safety assurance of ML. We, however, show how to interpret the results of explainable ML approaches for safety. We investigate how individual evaluations of data clusters in specific explainable, outside-model estimators can be analyzed to identify insufficiencies at different levels, such as (1) input feature, (2) data or (3) the ML model itself. Additionally, we link our findings to required artifacts of safety within the automotive domain, such as unknown unknowns from ISO 21448 or equivalence classes as mentioned in ISO/TR 4804. In our case study we analyze and evaluate the results from an explainable, outside-model estimator (i.e., white-box model) by performance evaluation, decision tree visualization, data distribution and input feature correlation. As explainability is key for safety analyses, the utilized model is a random forest, with extensions via boosting and multi-output regression. The model training is based on an introspective data set, optimized for reliable safety estimation. Our results show that technical limitations can be identified via homogeneous data clusters and assigned to a corresponding equivalence class. For unknown unknowns, each level of insufficiency (input, data and model) must be analyzed separately and systematically narrowed down by process of elimination. In our case study we identify “Fog density” as an unknown unknown input feature for the introspective model.</p>
      </abstract>
      <kwd-group>
        <kwd>safety analysis</kwd>
        <kwd>safety engineering</kwd>
        <kwd>explainable machine learning</kwd>
        <kwd>outside-model estimator</kwd>
        <kwd>safety validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Figure 1: Levels at which uncertainty manifests in ML — input feature, data and ML model — together with the resulting specification and performance insufficiencies.]</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The use of artificial intelligence (AI) and especially machine learning (ML) in safety-critical applications, such as autonomous driving (AD), is still a vivid research area, as many state-of-the-art ML methodologies create end-to-end trained (i.e., black-box) models for object detection and localization [<xref ref-type="bibr" rid="ref1">1</xref>]. Encoded into these black-box models are performance and specification insufficiencies that cause epistemic and/or aleatoric uncertainties [<xref ref-type="bibr" rid="ref2">2</xref>].
      </p>
      <p>
        Identifying, estimating and, if possible, mitigating
uncertainties is required for a convincing safety assurance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Figure 1 provides an overview of different uncertainty manifestations typical for ML: • Input feature: Is the ML model’s decision process based on the correct input factors from the complex environment? • Data: Does the collected data (training &amp; test) include enough and proper samples with an appropriate distribution? • ML model: Is the selected ML methodology appropriate for the desired task?</p>
      <p>Finding and understanding the root cause(s) of uncertainty and identifying the corresponding insufficiency is not a trivial task, as typical results from quantitative tests of the ML model do not allow a straightforward mapping between a measured lack of performance and a specific insufficiency, due to complex interdependencies and correlations between the causes.</p>
      <p>The main contribution of this paper is an approach to identify specific insufficiencies and eventually link the analysis results to required artifacts of automotive safety standards, for example related to unknown unknowns from ISO 21448 - Safety of the intended functionality (SOTIF) [<xref ref-type="bibr" rid="ref2">2</xref>] or equivalence classes for validation from ISO/TR 4804 [<xref ref-type="bibr" rid="ref4">4</xref>]. In doing so, we present a solution to address open issues in ML safety assurance regarding safety tests, such as how many tests have to be performed within which operational design domain (ODD) [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>In previous work we presented a conceptual framework to create an explainable, introspective model (i.e., white-box) from a deep neural network (i.e., black-box), cf. Fig. 2. In a case study, we used the approach to estimate the safety and reliability of the black-box via the white-box for object detection in the automotive domain. While the developed white-box models showed some promising results, such as providing estimated distributions for successful and failed detections, their unrestricted usage for safety assessment is currently not possible; for details see [<xref ref-type="bibr" rid="ref6">6</xref>]. In this contribution we use the developed models for safety analyses to identify specific insufficiencies. We leverage the fact that random forests (RFs) contain interpretable decision trees (DTs) and analyze the obtained DTs with regard to split criteria and data clustering.</p>
      <p>Figure 2: From Black-box to White-box. Adapted from [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>This paper is organized as follows. Section 2 provides an overview of relevant and related works. We continue by introducing our approach and its basic premise in Section 3. Next, in Section 4, we demonstrate our approach and perform the corresponding analyses. Finally, in Section 5, we conclude the paper by summarizing our results and discussing future work.</p>
      <p>The IJCAI-2023 AISafety and SafeRL Joint Workshop. * Corresponding author: iwo.kurzidem@iks.fraunhofer.de (I. Kurzidem); simon.burton@iks.fraunhofer.de (S. Burton); philipp.schleiss@iks.fraunhofer.de (P. Schleiss). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>2. Related Works</title>
      <sec id="sec-4-1">
        <title>-</title>
        <p>Currently, most research on AI for AD focuses on improving the safety-related aspects of the ML models themselves, either by means of conventional (i.e., non-ML) analysis methods [<xref ref-type="bibr" rid="ref7">7</xref>] or by methods directly enhancing the ML model [8]. These conventional safety methods include hazard and risk analysis [9], simulation [10, 11], (stochastic) fault tree analysis [12] etc., while ML-specific methods for safety cover uncertainty quantification [8] and robustification [13], among others. However, conventional safety methods are not particularly well suited for safety considerations regarding AI, such as the definition of equivalence classes of safe or unsafe behavior or the discovery of unknown unknowns, as these characteristics manifest themselves differently in ML-based systems, due to correlation of input to output instead of causality of data processing. Enhancing ML models requires modification of the base network, without providing traceable safety artifacts. Therefore, new safety analysis methods are needed, including approaches leveraging ML itself, similar to [14], which uses a Bayesian network to identify novel triggering conditions, as required by SOTIF.</p>
        <p>The German Federal Ministry for Economic Affairs and Climate Protection initiated the project “KI-Absicherung” (KI-A), consisting of 24 partners from industry and academia, to address the complex topic of AI and safety in the mobility market [15]. The main focus of KI-A was the development of a methodology for safety assurance for ML algorithms, in particular for object detection and instance segmentation. Most of the approaches used for safety included conventional methods, such as visual analytics [16], combinatorial testing [17], data augmentation [13] and others. All these methods work within a well-defined, limited semantic space. A couple of methods in KI-A used ML techniques, such as principal component analysis (PCA) [15] and search-based testing [18], to specifically analyze and search for insufficiencies in data. However, all of these methods require some insight or a-priori knowledge about the root cause of the specific insufficiency to be applied successfully. Our approach does not assume any specific insufficiency from the outset; instead, each layer of uncertainty (cf. Fig. 1) is analyzed by itself and the root cause is identified by process of elimination.</p>
        <p>Besides KI-A and beyond AD, ML has successfully been used for data clustering and analysis, such as PCA, k-means or Latin hypercube sampling, to define relevant sceneries and reduce the effort of verification and validation [19]. Again, none of the mentioned methods explores all the different possible insufficiencies due to input, data or model, but instead already knows where to look.</p>
        <p>In [<xref ref-type="bibr" rid="ref6">6</xref>] we introduced a framework to create explainable, introspective white-box models, derived from black-box model test evaluation, to predict different safety-related aspects of the deep neural network (DNN) object detector. Unfortunately, the measured performance of the white-box models did not allow for an unrestricted use as reliable safety monitors. In this contribution we investigate whether we can use the white-box models themselves to analyze certain safety properties and link the obtained results to insufficiencies within different layers, cf. Fig. 1. Put differently: can we use the semantic input of the white-box to characterize the black-box regarding safe, unsafe and unknown behavior?</p>
        <p>On the one hand, we examine the single DTs of the RF white-box models to identify possible equivalence classes. This enables us to possibly define an efficient test strategy for verification and validation further down the ML development cycle. On the other hand, we investigate whether contradictory samples within DT leafs indicate unknown unknowns. Here, unknown unknowns represent previously unconsidered parameters from the complex environment, not part of the initial problem space.</p>
        <p>Regarding results, the analysis of DT leafs might not end conclusively for either equivalence classes or unknown unknowns. This does not mean there are definitely no such cases to be found, but instead that, given the input space, equivalence classes or unknown unknowns are unlikely to be found within these data.</p>
        <p>In principle, the proposed approach can be applied to any kind of ML data; however, it greatly benefits from certain restrictions to be usable in safety. Firstly, the input dimensions should have a semantic description, meaning they have a humanly interpretable representation in the real world. For instance, a semantic dimension may refer to an object’s attribute (e.g. size) or environmental conditions (e.g. rain), whereas non-semantic descriptions include technical aspects (such as pixel intensity, blur, etc.). Secondly, the input space should be limited. The aggregation and interpretation of multiple and different input parameters may result in cases too complex to be analyzed and used in a safety argumentation.</p>
        <p>The basic concept of DTs is data partitioning [20]. To this end, the input space of the data is repeatedly partitioned into disjoint, smaller subsets, such that each subset is consistent with regard to the desired output. A visualization of a simple DT is given in Fig. 3. As can be seen, the input data is partitioned into subsets by splitting at each node, using the most suitable input feature (in conjunction with a specified error function, details in section 3.1). The final data clusters, i.e., the leafs of the DTs (from now on, we will use the terms interchangeably), represent the “most consistent” partitioning given the defined hyperparameters and provided data. The collection of multiple DTs together is an RF, and this ensemble provides its final output by aggregating the predictions of each single DT. There are different versions of RFs, such as bagging and boosting extensions, that differ in the way the DTs are created from the provided data (see section 4.2). The mathematical fundamentals to create DTs, such as suitable split criteria, and their interpretation for safety analyses are given in the following sections.</p>
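        <p>The ensemble structure described above can be inspected directly in code. The following sketch (our own illustration with invented toy data, not the paper’s introspective data set) fits a small RF with scikit-learn and reads out the node and leaf structure of one of its DTs:</p>

```python
# Illustrative sketch: fit a small random forest on toy data and inspect the
# structure of one of its decision trees via scikit-learn's tree_ attribute.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(200, 1))            # toy input, e.g. object distance in m
y = 1.0 / (1.0 + np.exp(0.2 * (X[:, 0] - 25)))   # toy target, e.g. softmax confidence

forest = RandomForestRegressor(n_estimators=10, min_samples_split=10,
                               min_impurity_decrease=0.001, random_state=0)
forest.fit(X, y)

tree = forest.estimators_[0].tree_               # first DT of the ensemble
is_leaf = tree.children_left == -1               # leaf nodes have no children
print("nodes:", tree.node_count, "leafs:", int(is_leaf.sum()))
# Each internal node stores its split criterion s: a feature index and a threshold.
print("root split: feature", int(tree.feature[0]), "threshold", float(tree.threshold[0]))
```

        <p>The RF’s final output is the aggregation (here: the mean) of the predictions of all single DTs, which is what makes each individual DT inspectable on its own.</p>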
        <p>[Figure 3: Visualization of a simple DT — nodes n with split criteria s (branches ≤ and &gt;) leading to leafs.]</p>
        <p>The underlying methodology of DTs creates disjoint subsets of inputs that produce the same output (while minimizing variance) [21]. This is very similar to the definition of equivalence class from ISO/TR 4804 [<xref ref-type="bibr" rid="ref4">4</xref>], which states that equivalence classes are based on the division of inputs and outputs, such that a (single) representative test can be defined. Therefore, we use the leafs of DTs to define an equivalence class. In addition, we use the quantitative split criteria {s1, ..., sn} of the DT’s nodes to define the boundaries (i.e., limits) of the corresponding equivalence class, cf. Fig. 3.</p>
        <p>The foundation of DTs is data partitioning by (binary) splits, to uncover complex patterns. For each possible binary split value s at node n the resulting decrease in impurity ∆i(s, n) is determined by [21]:</p>
        <p>∆i(s, n) = i(n) − (N_nL / N_n) · i(nL) − (N_nR / N_n) · i(nR),   (1)</p>
        <p>with N_n denoting the size of the training data for node n, N_nL and N_nR representing the samples from the whole training data assigned to the left child and right child respectively, and i as the impurity function. The maximization of the decrease in impurity can be understood as the best possible split s for node n into its two children (nL and nR). For regression tasks, typically the squared error loss is computed with Eq. (1) to determine the error during training. Therefore, i(n) calculates the local squared error loss, i.e. for a specific node n, via [21]:</p>
        <p>i(n) = (1 / N_n) · Σ_{x_j, y_j ∈ S_n} (y_j − ȳ_n)².   (2)</p>
        <p>In Eq. (2), x_j denotes a specific input feature and y_j the corresponding output from the subset of learning samples S_n; ȳ_n and y_j are the model output (the node’s arithmetic mean) and the desired output respectively. Both equations, (1) and (2), essentially split the data into clusters that produce the most similar output. Figure 4 shows an example of data splitting, containing measurement samples for object distance (input feature x) and the corresponding softmax confidence (y). The best split s divides S_n into the two clusters, S_nL and S_nR, that have the highest decrease in impurity. The horizontal lines within the left (S_nL) and right (S_nR) cluster indicate the arithmetic mean for each of them. Any other split, for instance s* (cf. Fig. 4), yields:</p>
        <p>∆i(s, n) &gt; ∆i(s*, n).   (3)</p>
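        <p>Equations (1) and (2) can be made concrete with a small, self-contained re-implementation; the sample values below are invented for illustration and are not measurements from the case study:</p>

```python
# Toy re-implementation of Eq. (1) and (2): squared-error impurity i(n) and
# decrease in impurity ∆i(s, n) for a one-dimensional binary split.
import numpy as np

def impurity(y):
    """Eq. (2): mean squared deviation from the node's arithmetic mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def impurity_decrease(x, y, s):
    """Eq. (1): ∆i(s, n) = i(n) - (N_nL/N_n) i(nL) - (N_nR/N_n) i(nR)."""
    left, right = y[x <= s], y[x > s]
    n = len(y)
    return impurity(y) - len(left) / n * impurity(left) - len(right) / n * impurity(right)

# Hypothetical samples: object distance x vs. softmax confidence y.
x = np.array([5, 8, 11, 14, 30, 33, 36, 40], dtype=float)
y = np.array([0.9, 0.88, 0.92, 0.9, 0.2, 0.25, 0.18, 0.22])

candidates = (x[:-1] + x[1:]) / 2   # candidate splits: midpoints between samples
best = max(candidates, key=lambda s: impurity_decrease(x, y, s))
print(best)  # → 22.0, the split separating the two homogeneous clusters
```

        <p>Any other candidate split s* yields a smaller decrease in impurity, which is exactly the relation stated in Eq. (3).</p>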
        <p>[Figure 4: Example of data splitting with the best split s, an alternative split s*, and the resulting clusters S_nL and S_nR. Figure 5: (a) a homogeneous leaf; (b) an inhomogeneous leaf containing contradictory samples.]</p>
        <p>A leaf alone, however, does not guarantee an equivalence class. The methodology of RFs and DTs requires some hyperparameters to be set that influence the splitting and, therefore, the resulting clusters. Most important for our considerations are: • the threshold τ for the minimum decrease in impurity, i.e., ∆i(s, n) &lt; τ, and • the minimum amount of samples κ to allow further splits, i.e., N_n &gt; κ.</p>
        <p>The first threshold τ prevents overfitting, as no threshold would allow the splitting of virtually identical values as long as there is any decrease in impurity. Referring to Fig. 4, nearly all measured softmax confidence values will be different after some decimal places (dependent on the precision of the data). Therefore, even splitting samples that vary only after several digits will decrease impurity, eventually creating DTs with one single data point per leaf. The second parameter κ also prevents overfitting. Let us assume that κ is set to the smallest possible value, which is 2. Given a small enough τ, each single leaf will then converge on single data points. Therefore, both τ and κ together influence the resulting clusters and whether meaningful equivalence classes can be defined. Please note that there are more hyperparameters to prevent overfitting, but they are not relevant for this contribution. Please also note that during our analyses (Section 4) we did inspect all of the possible hyperparameters that could in principle provide an explanation for the seen results, e.g. tree_depth, to make sure they are not responsible for it.</p>
        <p>In order to define an equivalence class, the DT leaf must contain more samples than κ, i.e., N &gt; κ. The basic reasoning is the following: if a leaf contains more samples than κ, a split could have been possible; however, it was not necessary, as τ has not been exceeded. To put it differently, there are no more disjoint subsets within these data, cf. Fig. 5(a). The only other possibility is that a further split was not possible although κ allowed for it, given the model, data and input features. Such a leaf can indicate unknown unknowns, cf. Fig. 5(b).</p>
        <p>Do note that there are numerous leafs per DT that are endpoints due to the thresholds of τ or κ being reached. These clusters can be interpreted neither as equivalence class nor as unknown unknowns. Remember that τ and κ primarily prevent overfitting. On the one hand, smaller and smaller values for τ and κ will converge on clusters with single data points, consequently creating equivalence classes which are correct from a safety point of view but carry no useful information. On the other hand, larger values will always serve as limits for the clusters, and it is impossible to know whether additional clusters were not necessary or not possible; as such, they offer no information about potential unknown unknowns.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Unknown unknowns</title>
        <p>The goal of SOTIF is to identify potential unknown hazardous scenarios, arising from the interaction between the system and its complex environment, and to mitigate their effects. To achieve this, SOTIF recommends to search for triggering conditions that lead to potentially hazardous scenarios. Unfortunately, there is no established approach or method to identify such triggering conditions for all possible systems and environments. Furthermore, the nature of some of these triggering conditions can be defined as unknown unknowns, i.e., something we are not even aware that we do not know. In our context it refers to a feature of the input space that was not considered when approximating the factors that influence the performance of the black-box model.</p>
        <p>The key idea is to identify and use inconclusive, yet interpretable data clusters and, by process of elimination, show that the only possible explanation for their existence is an unknown unknown. In the previous section 3.1, we examined the mathematical foundation for data clustering via DTs. In particular, equations (1) and (2) partition the available data into the best possible disjoint and coherent clusters. However, in some cases the resulting, final clusters still have high impurity, although further splitting, in principle, is allowed. Simply put, the cluster contains contradictory data, which cannot be split meaningfully anymore within the defined scope, cf. Fig. 5(b).</p>
        <p>How can this be interpreted? Given that the hyperparameters τ and κ are not exceeded, either the input, data or model did not allow for any further optimization. Now each single layer (cf. Fig. 1) and potential insufficiency must be analyzed on its own to identify the root cause. To clearly uncover an unknown unknown, neither data nor model shall be the root cause of the impure data clustering. Only if a “seemingly” new input feature can resolve the contradiction is an unknown unknown plausible. “Seemingly”, as it is yet unknown, even by the process of elimination, whether such a semantic feature can be identified and, if so, which one it is specifically. Regarding the modelling, only explainable or interpretable models are useful for the presented approach, as only those allow comprehensible equivalence classes to be defined. To investigate whether the modelling itself is responsible for the inhomogeneous clustering of data, alternative or modified modelling approaches should be deployed and compared. For the data, the corresponding distributions of the input features {f1, ..., fn} within the boundaries of the potential unknown unknown must be investigated.</p>
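        <p>In scikit-learn terms (the library used in Section 4), the two hyperparameters plausibly correspond to min_impurity_decrease (τ) and min_samples_split (κ) — an assumed mapping for illustration, not confirmed by the paper. The sketch below fits a DT on invented data and flags the leafs with N &gt; κ, the only leafs that can indicate either an equivalence class (low impurity) or an unknown unknown (high impurity):</p>

```python
# Sketch with an assumed mapping: τ -> min_impurity_decrease, κ -> min_samples_split.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

KAPPA, TAU = 10, 0.001
rng = np.random.default_rng(1)
X = rng.uniform(0, 50, size=(300, 2))              # toy features, e.g. distance, occlusion
y = (X[:, 0] < 25).astype(float) + rng.normal(0, 0.05, 300)

dt = DecisionTreeRegressor(min_samples_split=KAPPA,
                           min_impurity_decrease=TAU, random_state=1).fit(X, y)
t = dt.tree_
leafs = np.flatnonzero(t.children_left == -1)      # leaf nodes have no children
for leaf in leafs:
    if t.n_node_samples[leaf] > KAPPA:             # a further split was allowed but not taken
        print(f"leaf {leaf}: N={t.n_node_samples[leaf]}, impurity={t.impurity[leaf]:.4f}")
```

        <p>Leafs passing this filter with low impurity are equivalence-class candidates; those with high impurity are the inconclusive clusters analyzed as potential unknown unknowns.</p>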
      </sec>
      <sec id="sec-4-3">
        <title>4. Safety Analyses</title>
        <p>Based on the results from [<xref ref-type="bibr" rid="ref6">6</xref>], we conduct our safety analyses and demonstrate the presented methodology via a case study. In our previous contribution we recognized that the reliability of the RF models is not sufficient for an unrestricted usage for safety. Therefore, we specifically analyzed the model regarding its single DTs, including their split criteria and leaf clusters, to explain the mixed performance results. The model estimates the reliability of the softmax confidence provided by a DNN object detector. Please note that, in order to use the model as a safety predictor, specific input features from the complex environment, which are arguably safety-relevant, have been pre-selected.</p>
        <p>To create the model, the implementation from scikit-learn was used, with thresholds κ = 10 and τ = 0.001. For other hyperparameters, please refer to [<xref ref-type="bibr" rid="ref6">6</xref>]. An investigation of the model revealed strong similarities between the single DTs within the model. Additionally, the DTs occasionally expressed leaf clusters similar to the ones shown in Fig. 5(a) and (b). A further analysis of all leafs from the DTs revealed three basic cases: 1. Leafs that show little variance in data and fulfill N = κ, 2. Leafs that show little variance in data and fulfill N &gt; κ, 3. Leafs that show high variance in data and fulfill N &gt; κ.</p>
        <p>Following the identification of the three basic cases, the most promising leafs for both overall performance and safety significance are leafs that accumulate many similar data points without surpassing any of the defined limits of the hyperparameters τ and κ. Therefore, if N &gt; κ is true, at least one input feature is a coherent predictor. These clusters can be identified by searching the final number of samples per leaf and comparing it to κ.</p>
        <p>The methodology of RFs creates each DT from a subset of the complete training data. Therefore, all DTs are based on slightly different data sets, and identified potential equivalence classes may only exist within one single DT and not represent an overall equivalence class. In order to verify a potential equivalence class, the aggregated split criteria {s1, ..., sn} should be applied to the complete data set. If all the samples show a similar output, an equivalence class can, in principle, be defined. For our presented analysis we selected the most promising equivalence class, i.e. the least restrictive one regarding its split criteria {s1, ..., sn}, out of all potential candidates. Table 1 shows an identified equivalence class that also represents a technical limitation of the trained black-box object detector: all objects with a detection area smaller than 3.6233 m², at a noise level of at least 74%, do not have a softmax confidence higher than 0.1, cf. Fig. 6.</p>
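        <p>The three basic cases can be detected programmatically. The helper below is our own illustrative sketch on toy data (the variance bound var_eps for “little variance” is an assumed parameter, not a value from the paper):</p>

```python
# Illustrative sketch: sort a fitted DT's leafs into the three basic cases,
# using N (samples per leaf), κ, and an assumed variance bound for "little variance".
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def classify_leafs(tree, kappa, var_eps=0.01):
    cases = {1: [], 2: [], 3: []}
    for node in np.flatnonzero(tree.children_left == -1):  # leaf nodes
        n, var = int(tree.n_node_samples[node]), float(tree.impurity[node])
        if var <= var_eps and n <= kappa:
            cases[1].append(node)  # case 1: little variance, no further split allowed
        elif var <= var_eps:
            cases[2].append(node)  # case 2: little variance although N > κ (early stop)
        elif n > kappa:
            cases[3].append(node)  # case 3: high variance despite N > κ
    return cases

# Toy demonstration data with two clearly separated output clusters.
rng = np.random.default_rng(2)
X = rng.uniform(0, 50, size=(200, 1))
y = np.where(X[:, 0] < 25, 0.9, 0.1) + rng.normal(0, 0.02, 200)
dt = DecisionTreeRegressor(min_samples_split=10, min_impurity_decrease=0.001,
                           random_state=2).fit(X, y)
cases = classify_leafs(dt.tree_, kappa=10)
print({k: len(v) for k, v in cases.items()})
```

        <p>Leafs matching none of the patterns (high variance with N ≤ κ) are the endpoint leafs that, as discussed above, offer no information for either analysis.</p>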
      </sec>
      <sec id="sec-4-4">
        <title>4.1. Equivalence class of equal DNN behavior</title>
        <p>The first case is the most common one. According to equations (1) and (2), together with suitable τ and κ, the RF methodology created the best possible leafs while preventing overfitting. These clusters represent a reasonable model, but no useful information for safety can be extracted from them. The second case is an interesting abnormality, as it signifies an early stopping: given τ, it was not necessary to create additional child clusters, as the decrease in impurity is insignificant. In brief, all data expressed the same output behavior without colliding with the hyperparameter thresholds. This case is discussed in detail in this section. The last case shows impure clusters, although the defined hyperparameters do not account for this. Therefore, the root cause for this inhomogeneous data must lie within one of the different layers, as shown in Fig. 1. This is the subject of Section 4.2. With these analyses we try to identify insufficiencies and link our findings to safety artifacts from ISO 21448 and ISO/TR 4804.</p>
        <p>Effectively, none of the objects fulfilling Table 1 are detected by the black-box object detector, independent of distance or occlusion. In terms of an ISO/TR 4804 equivalence class this means that, for all samples fulfilling Table 1, one singular test is sufficient to verify the black-box system’s response.</p>
        <p>Figure 7: (left) Detection of multiple objects under ideal conditions. (right) The noise level has been increased to 74%; only the object with area 3.753 m² is still detected.</p>
        <p>Apart from such a successful equivalence class, some of the potential clusters do not exhibit the same behavior over all corresponding samples. The split criteria {s1, ..., sn} do not represent an equivalence class if they are only true within specific DTs, but not for the complete data. Figure 6 shows a verification of two potential equivalence classes. The first plot (eq. class) visualizes the softmax confidence for all samples complying with Table 1. This equivalence class has been derived from multiple DTs, on average with N = 15. In contrast, the second plot (invalid cluster) is an example of a potential equivalence class that is not homogeneous for all samples within the identified {s1, ..., sn}.</p>
        <p>Figure 6: Examples of (un)successful equivalence classes.</p>
        <p>Besides the verification via sample outliers, the equivalence classes that showed homogeneous output in all data have also been “qualitatively” verified by testing the black-box detector. In this context, qualitatively means that the simulation environment of CARLA [22] does not allow a specific object size to be set; instead, predefined assets can be selected and deployed. However, the precise object area (within a frame) needs to be derived and transformed (incl. rounding and translation errors) to fit the developed safety framework [23]. Therefore, the exact object area of 3.6233 m² as limit could not be verified beyond any doubt.</p>
        <p>For the equivalence class provided by Table 1, a set of test cases has been created. One such scenario, with multiple objects and detection areas smaller and larger than ∼3.6233 m², has been generated and tested, cf. Fig. 7. Indeed, the verification results of the different test cases confirm this combination of noise variance and object area as a credible detection limit. However, the verification also revealed that this equivalence class represents the upper (or lower) limit. For instance, objects are sometimes lost before the limits have been reached. Within this contribution we did not investigate whether these results could be used to refine the limits of the identified equivalence class into fine-grained subcategories (cf. Table 1), especially since transformation and translation errors could not be ruled out entirely.</p>
        <p>During the safety analysis to positively identify equivalence classes, almost all results converged on a combination of factors representing a technical limitation of the system, such as robustness against noise and area of the object, or maximum detection distance. The remaining cases that are seemingly not technical limitations, but do show convergence, are still under investigation regarding their meaning (as they require very accurate CARLA simulation and transformation).</p>
        <p>Due to the abstraction of the input space by the methodology of [<xref ref-type="bibr" rid="ref6">6</xref>], the identified equivalence class can be used as a logical scenario, see [24], for ISO/TR 4804 validation efforts.</p>
        <p>4.2. Unknown unknowns (of white-box)</p>
        <p>Another anomaly within the DT structure are leafs that show high variance in data, but seem not to gain anything from additional splits, i.e., N &gt; κ. Equations (1) and (2) ensure that the best possible data clusters are created, except if this is impossible given either model, data or input. One such instance is shown in Fig. 5(b). We selected this cluster specifically, as it appears to be the most suitable due to its comparatively broad limits {s1, ..., sn} for the input features. Similar leafs have been identified as a reoccurring pattern across multiple DTs. After aggregation of the split criteria, the leafs in question converge on the criteria presented in Table 2. The appearance of such clusters is one explanation for the mixed performance results of the model, as reported in [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
        <p>Table 2: Inhomogeneous cluster of a DT and its boundaries.
Input feature | Interval | Unit
Object distance | 18.85 ≤ d ≤ 31.25 | [m]
Object area | 2.018 ≤ a | [m²]
Object occlusion | all | [%]
Noise variance | 62 ≤ v ≤ 78 | [%]</p>
        <p>According to Fig. 1, the first layer in which to investigate a performance insufficiency is the ML model itself. In order to determine whether the modelling approach itself is responsible, modified approaches have been implemented and analyzed. Specifically, we used the RF extensions of boosting (via LightGBM) and multi-output regression (via XGBoost) for Python. Boosting (by weighing) uses a combination of bootstrap and evaluated test data to train the successive DT [25]. The idea is that this methodology explicitly tackles high-variance leafs, as it penalizes misclassification by weighing the entire training set accordingly. With multi-output regression, several output variables are predicted simultaneously [21]. In [<xref ref-type="bibr" rid="ref6">6</xref>] we trained three different models for three different target variables. Via multi-output regression we hope to leverage some dependencies between these output variables, such as a correlation between softmax confidence and bounding box size shifts. The minimization of impurity, Eq. (1), via the squared error, Eq. (2), is fundamental to all of the extensions. Please remember that the selection of suitable approaches is limited by the necessity for explainability.</p>
        <p>The evaluation of the overall performance for all models reveals that the measured performance converges, see Fig. 8. All three models display a relatively high amount of correct predictions for very low and high softmax confidences. Be aware that the Multi-output model has a slightly smaller test set, as for its sequential model building process the samples with false negatives cannot be used. In terms of quantitative values, the Mean</p>
        <p>For Fig. 9, samples outside the boundaries of Table 2 for object distance and area have been filtered out. Also, the corresponding softmax score is divided into low and high. One distinctive feature of Fig. 9 is the relatively high amount of data points that show high and low confidence at the same time for object distances of around 21 m. This contradiction can seemingly not be resolved by recruiting additional input features, such as object occlusion. The existence of such data points provides one plausible explanation why the cluster is inhomogeneous, despite N &gt; κ. Additionally, there exists an almost straight line of low confidence scores at 23 m. This most likely indicates a technical limit, but as this cluster could not be split further by the available input dimensions, it may not be represented well in the available data. On the whole, the displayed section, limited by Table 2, could not be split into homogeneous clusters by any of the available input dimensions.</p>
        <p>Taking into consideration the distribution of Fig. 5(b), more data samples will most likely not enforce another split into more homogeneous clusters, as N &gt; κ already indicates that this is not the root cause. The only case where additional samples help is if the underlying data distribution within the other input features is not appropriate, i.e. imbalanced, as this represents skewed information. An investigation of the data revealed that the distributions for object occlusion and object area are not entirely balanced. The reasons are that objects have fixed sizes and that, for occlusion, at least two objects are required, one of which is definitely not occluded, while with three or more objects multiple ones are fully occluded, thus creating small biases. However, all in all the data distribution is considered sufficiently good to rule it out as
Squared Error (MSE) and Mean Absolute Error (MAE) root cause for the poor DT data clustering.
show maximum improvements of ∆ MSE ≤ 1.22e-2 and If great data imbalance is not evident, there are only
∆ MAE ≤ 2.32e-2 between the new models and RF base two possible impacts additional samples can have. Either,
(with MSE= 2.25e-2 and MAE= 8.00e-2). Unfortunately, (a) extra measurement samples skew the distribution into
this means no model performs significantly better than a certain direction (basically creating a bias), but still, the
the others. Due to the diferent training approaches
between the models, a detailed comparison of leafs and
structure is not possible without extensive efort. This 1
result either indicate, that these kind of models are
inherently unable to predict the black-box behavior or that 0.8
there are specification insuficiencies in data and/or input
that cause this response. The outcome of all of this is, 0.6
changing the model does not seem to resolve
inhomogeneous clustering, as outliers are apparent for all models 0.4
(cf. Fig. 8). On account of this, we continue by
investigating possible unknown unknowns by visualizing the 0.2
relevant data distribution.</p>
        <p>By the process of elimination, to rule out implausible 00 0.2 0.4 0.6 0.8 1
root causes, we arrive at the collected data. We continue
by highlighting the data distribution given by Table 2. Figure 8: Diferent explainable models and their performance
Figure 9 displays the data points for the relevant input (diagonal line shows ideal behavior). None of them shows a
features of Obj. distance and Noise variance, narrowed definitive advantage over the others, suggesting a root cause
down by the specific split criteria { 1, ..., 4}. For a con- independent of the selected ML methodology.
venient visualization, the data points of 2.018m2 &lt; Obj.
cluster would remain collectively inhomogeneous, or, (b) the new input feature should not correlate with either
the new samples alone can be partitioned into its own of these, as they do not carry any useful information to
cluster (split by the remaining, available input features). disentangle the data, cf. Fig. 9. We also excluded biases.
Although investigating cases (a) and (b) could provide During our initial inspection of the data we already
additional information, no experiments have been car- identified one irregularity, namely, data points that have
ried out within this contribution, as the expected results low softmax confidence at a specific Obj. distance
would not impact the next analysis.  = 23m across virtually all noise levels. Although this is</p>
        <p>All the previous analyses lead us to the only plausible not completely uncommon, see areas outside the
highconclusion: The introspective data set does not include lighted cluster in Fig. 9(left), in this particular case,
howall the necessary data dimensions. ever, none of the other input features could be recruited to</p>
        <p>
          The next step involves reviewing the input features . separate these outliers. Subsequently, we examined the
In order to act as an explainable, introspective model, the corresponding frames in order to determine a potential
input space for the white-box model has been reduced efect that could cause such a hard limit.
to certain input features, called safety features, in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Our review revealed, that for all these frames
Following the process of elimination, neither the model the carla.WeatherParameters contained a nonzero
nor the data provide any convincing evidence that they value for fog_density. In our initial setup to create an
cause this observed inconsistency, cf. Fig. 5(b). Therefore, explainable, introspective model we introduced “Noise
only the input features remain as possible root cause. variance” as technical implementation for all
environThe input features in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] have carefully been selected mental disturbances, such as rain, fog or white-noise. So
based on two principles that ensure safety-relevance and these efects cannot be distinguished from each other
redundancy: within the introspective data set. As a result, they act as
1. The feature is safety-relevant, i.e., factors that unknown unknowns within this system’s environment
typically cause trafic accidents in human driving, (introspective model). Although rain and fog produce
similar visual efects in CARLA, fog acts as a limitation
2. The feature must be measurable via a diferent for the maximum field-of-view distance and therefore
sensor, i.e., independent of the black-box predic- also limits the capabilities of the black-box object
detection. tor. With regards to the (safety) principles 1. and 2., the
These principles still apply, so possible new features must feature “Fog density” can definitely be classified as
safetyadhere to these principles to be useful for a reliable safety relevant and also be detected via other sensors. A linear
monitor. correlation analysis has been carried out to determine
        </p>
        <p>The basic strategy to discover possibly new, important the dependence of Fog density with other input features,
input features revolves around the idea to use the evalu- see Fig. 10. As this table shows, a strong positive
correlaated analysis results from the previous tests. According tion exists between noise and fog, as the basic simulation
to the split criteria of Table 2, occlusion efects are unim- efect is similar. It can also be seen, that both, noise and
portant and object’s area only requires a minimum value fog, show minimal correlations with the other, remaining
for detection (given the noise level interval). Therefore, input features. This indicates a good candidate for a now
(un)known unknown.</p>
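        <p>Such a linear correlation check can be sketched in a few lines. The snippet below is only an illustration with synthetic data: the feature names and the coupling between noise and fog are assumptions made for the example, not the actual introspective data set.

```python
import numpy as np

# Synthetic stand-in for the introspective data set (assumed feature names).
# In the simulation, fog and sensor noise produce similar visual effects,
# so the two disturbance features are coupled here on purpose.
rng = np.random.default_rng(0)
n = 500
noise_variance = rng.uniform(0, 100, n)                    # [%]
fog_density = 0.8 * noise_variance + rng.normal(0, 10, n)  # coupled disturbance
object_distance = rng.uniform(0, 70, n)                    # [m], independent

# Pearson correlation matrix over the candidate safety features.
corr = np.corrcoef(np.vstack([noise_variance, fog_density, object_distance]))
print(corr.round(2))
```

In such a matrix, a strong noise–fog entry combined with near-zero entries against the remaining features is exactly the pattern that marks the candidate as informative but not redundant.</p>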
        <p>Based on these findings, we separated “Fog density”
from Noise variance and included the meta-data in the
introspective data set as new input feature. The preliminary
experiments indeed show an improvement. Since a new
input feature was introduced, the resulting DTs cannot
simply be compared. It is, however, possible to filter for
all the leafs that fall within the previous boundaries of
Table 2. This inspection showed that additional, improved
sub-clusters have been created, see Fig. 11. By
identifying and including a previously unknown unknown input
feature, the previously inconsistent data cluster could
successfully be subdivided into more balanced leafs, showing
the relevance of this input dimension for the
introspective model. Please be aware that the new sub-clusters
can still result in any of the three basic cases for DT leafs
(cf. Section 4), so the analysis might not end conclusively
every time.</p>
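        <p>The leaf-level inspection used throughout this analysis can be illustrated with a minimal sketch. Everything below is hypothetical: a toy data set in which the target depends on a withheld input dimension, so that some leafs keep a high impurity no matter how the available feature is split, mirroring the inhomogeneous cluster of Table 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy introspective data: the softmax-like score depends on a hidden
# "fog" flag that is deliberately withheld from the model's inputs.
rng = np.random.default_rng(1)
n = 1000
distance = rng.uniform(0, 70, n)
hidden_fog = rng.integers(0, 2, n)  # unobserved input dimension
score = np.where(distance < 30, 0.9, 0.2) * np.where(hidden_fog == 1, 0.3, 1.0)

X = distance.reshape(-1, 1)  # fog is not part of the feature space
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, score)

# Flag inhomogeneous leafs: leafs whose impurity (MSE) stays high even
# after all available splits -- candidates for a missing input feature.
t = tree.tree_
leaf_ids = np.where(t.children_left == -1)[0]
inhomogeneous = [int(i) for i in leaf_ids if t.impurity[i] > 0.01]
print(inhomogeneous)
```

Because the decisive dimension is withheld, at least one leaf cannot be made homogeneous by any split on the visible feature; adding the hidden dimension to X would let the tree resolve it, analogous to adding “Fog density”.</p>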
      </sec>
      <sec id="sec-4-5">
        <title>Acknowledgments</title>
        <p>This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Future Work</title>
      <sec id="sec-6-1">
        <p>The work presented in this paper shows how explainable ML can help and guide us to discover equivalence classes (ISO/TR 4804) and unknown unknowns (SOTIF).</p>
        <p>The developed approach makes use of the mathematical
foundation of DTs to identify leafs and interpret their
meaning, with respect to defined thresholds and their
degree of data variance. We successfully use the
methodology to define an equivalence class (Table 1) and uncover
an unknown unknown (Fig. 11) for the application of an
explainable, outside-model estimator.</p>
        <p>Some questions, however, do remain. While some equivalence classes can be identified and meaningfully interpreted, other cases beyond system (capability) limitations are difficult to humanly comprehend. Within the described use case we were able to identify one unknown unknown by disentangling one inconsistent data cluster.</p>
        <p>surance of Machine Learning for Chassis Control Functions, in: Proc. Int. Conf. on Comp. Safety, Reliability, and Security, Cham, 2021, pp. 149–16.
[8] A. Schwaiger, P. Sinhamahapatra, J. Gansloser, K. Roscher, Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?, in: Proc. AISafety@IJCAI, 2020.
[9] S. Khastgir, H. Sivencrona, G. Dhadyalla, P. Billing, S. Birrell, P. Jennings, Introducing ASIL inspired dynamic tactical safety decision framework for automated vehicles, in: Proc. 2017 IEEE 20th Int. Conf. on Intelligent Transportation Systems (ITSC), 2017, pp. 1–6.
[10] P. Koopman, M. Wagner, Toward a Framework for Highly Automated Vehicle Safety Validation, Technical Report, SAE Technical Paper, 2018.
[11] D. Rao, P. Pathrose, F. Huening, J. Sid, An approach for validating safety of perception software in autonomous driving systems, in: Proc. Model-Based Safety and Assessment: 6th Int. Symp., IMBSA 2019, Thessaloniki, Greece, October 16–18, 2019, Proc. 6, 2019, pp. 303–316.
[12] M. Ghadhab, S. Junges, J.-P. Katoen, M. Kuntz, M. Volk, Model-based safety analysis for vehicle guidance systems, in: Proc. Comp. Safety, Reliability, and Security: 36th Int. Conf., SAFECOMP, 2017, pp. 3–19.
[13] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, B. Lakshminarayanan, AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty, arXiv:1912.02781 [cs, stat] (2020).
[14] A. Adee, R. Gansch, P. Liggesmeyer, C. Glaeser, F. Drews, Discovery of Perception Performance Limiting Triggering Conditions in Automated Driving, in: Proc. 2021 5th Int. Conf. on System Reliability and Safety (ICSRS), 2021, pp. 248–257.
[15] T. Fingscheidt, H. Gottschalk, S. Houben, Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety, Springer, 2022.
[16] E. Haedecke, M. Mock, M. Akila, ScrutinAI: A Visual Analytics Approach for the Semantic Analysis of Deep Neural Network Predictions, in: Proc. EuroVis Workshop on Visual Analytics (EuroVA), 2022, pp. 73–77.
[17] C. Gladisch, C. Heinzemann, M. Herrmann, M. Woehrle, Leveraging combinatorial testing for safety-critical computer vision datasets, in: Proc. 2020 IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2020, pp. 1314–1321.
[18] C. Gladisch, T. Heinz, C. Heinzemann, J. Oehlerking, A. von Vietinghoff, T. Pfitzer, Experience paper: Search-based testing in automated driving control applications, in: Proc. 2019 34th IEEE/ACM Int. Conf. on Automated Software Engineering (ASE), 2019, pp. 26–37.
[19] T. Wuellner, S. Feuerstack, A. Hahn, Clustering environmental conditions of historical accident data to efficiently generate testing sceneries for maritime systems, in: Proc. Model-Based Safety and Assessment: 6th Int. Symp., IMBSA 2019, Thessaloniki, Greece, October 16–18, 2019, Proc. 6, 2019, pp. 349–362.
[20] W.-Y. Loh, Classification and regression trees, Wiley interdisciplinary reviews: data mining and knowledge discovery 1 (2011) 14–23.
[21] G. Louppe, Understanding Random Forests: From Theory to Practice, Ph.D. thesis, University of Liège - Faculty of Applied Sciences, 2014.
[22] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An open urban driving simulator, in: Proc. 1st Annual Conf. on Robot Learning, volume 78, 2017, pp. 1–16.
[23] I. Kurzidem, A. Saad, P. Schleiss, A Systematic Approach to Analyzing Perception Architectures in Autonomous Vehicles, in: Proc. 7th Int. Symp. on Model-Based Safety and Assessment (IMBSA), Lisbon, 2020, pp. 149–162.
[24] T. Menzel, G. Bagschik, M. Maurer, Scenarios for development, test and validation of automated vehicles, in: Proc. 2018 IEEE Intelligent Vehicles Symp. (IV), 2018, pp. 1821–1827.
[25] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Machine learning 40 (2000) 139–157.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yurtsever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carballo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <article-title>A survey of autonomous driving: Common practices and emerging technologies</article-title>
          ,
          <source>in: Proc. IEEE Access</source>
          , volume
          <volume>8</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>58443</fpage>
          -
          <lpage>58469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>International</given-names>
            <surname>Organization</surname>
          </string-name>
          for Standardization,
          <source>Safety Of The Intended Functionality - SOTIF (ISO/- PAS 21448)</source>
          , ISO,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Herd</surname>
          </string-name>
          ,
          <article-title>Addressing uncertainty in the safety assurance of machine-learning, Frontiers in Computer Science Hypothesis and theory article (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>International</given-names>
            <surname>Organization</surname>
          </string-name>
          for Standardization,
          <article-title>Road vehicles - Safety and cybersecurity for automated driving systems - Design, verification and validation</article-title>
          (ISO/TR 4804:
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hagiwara</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kurzidem</surname>
          </string-name>
          ,
          <article-title>Towards the Quantitative Verification of Deep Learning for Safe Perception</article-title>
          ,
          <source>in: Proc. 2022 IEEE Int. Symp. on Software Reliability Engineering Workshops (ISSREW)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kurzidem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Misik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          , Safety Assessment:
          <article-title>From Black-Box to White-Box</article-title>
          ,
          <source>in: Proc. 2022 IEEE Int. Symp. on Software Reliability Engineering Workshops (ISSREW)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>295</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kurzidem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Unterreiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graeber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Becker</surname>
          </string-name>
          , Safety As-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>