
Assessing the Reliability of Deep Learning Classifiers Through Robustness Evaluation and Operational Profiles

Xingyu Zhao

Wei Huang

Alec Banks

Victoria Cox

vcoxg@dstl.gov.uk 0

David Flynn

ynn@hw.ac.uk 1

Sven Schewe

Xiaowei Huang 2

0 Defence Science and Technology Laboratory, Salisbury, SP4 0JQ, U.K.
1 Heriot-Watt University, Edinburgh, EH14 4AS, U.K.
2 University of Liverpool, Liverpool, L69 3BX, U.K.

The utilisation of Deep Learning (DL) is advancing into increasingly sophisticated applications. While it shows great potential to provide transformational capabilities, DL also raises new challenges regarding its reliability in critical functions. In this paper, we present a model-agnostic reliability assessment method for DL classifiers, based on evidence from robustness evaluation and the operational profile (OP) of a given application. We partition the input space into small cells and then “assemble” their robustness (to the ground truth) according to the OP, where estimators on the cells' robustness and OPs are provided. Reliability estimates in terms of the probability of misclassification per input (pmi) can be derived together with confidence levels. A prototype tool is demonstrated with simplified case studies. Model assumptions and extensions to real-world applications are also discussed. While our model easily uncovers the inherent difficulties of assessing DL dependability (e.g. lack of data with ground truth and scalability issues), we provide preliminary/compromised solutions to advance in this research direction.

1 Introduction

Industry is adopting increasingly advanced big data analysis methodologies to enhance the operational performance, safety, and lifespan of their products and services. For many products and systems, high in-service reliability and safety are key targets to ensure customer satisfaction and regulatory compliance, respectively. AI and Deep Learning (DL) have steadily grown in interest and applications. Key industrial foresight reviews have identified that the biggest obstacle to reaping the benefits of DL-powered robots is the assurance and regulation of their safety and reliability [Lane et al., 2016]. Thus, there is an urgent need to develop methods to enable the dependable use of AI/DL in critical applications [Robu et al., 2018] and, more importantly, to assess and demonstrate the dependability for certification and regulation.

For traditional systems, safety and reliability analysis is guided by established standards, and supported by mature development processes and verification and validation (V&V) tools and techniques. The situation is different for systems that utilise DL: they require new and advanced analysis reflective of the complex requirements in their safe and reliable function. Such analysis also needs to be tailored to fully evaluate the inherent character of DL [Bloomfield et al., 2019], despite the progress made recently [Huang et al., 2020].

Since DL classifiers are subject to robustness concerns, reliability models that do not consider robustness evidence are not convincing. Reliability, as a user-centred property, depends on the end-users’ behaviours [Littlewood and Strigini, 2000]. The operational profile (OP) information (quantifying how the software will be operated [Musa, 1993]) should therefore be explicitly modelled in the assessment. However, to the best of our knowledge, there is no dedicated reliability assessment model (RAM) taking into account both the OP and robustness evidence, which motivates this research.

In [Zhao et al., 2020a], we proposed a safety case framework tailored for DL, in which we described an initial idea of combining robustness verification and operational testing for reliability claims. In this paper, we implement this idea as a RAM, inspired by partition-based testing [Hamlet and Taylor, 1990], operational-profile testing [Strigini and Littlewood, 1997; Zhao et al., 2020c] and DL robustness evaluation [Carlini and Wagner, 2017; Webb et al., 2019]. It is model-agnostic and designed for pre-trained DL models, yielding upper bounds on the probability of misclassification per input (pmi)¹ with confidence levels. Although our RAM is theoretically sound, we discover some issues in our case studies (e.g. scalability and lack of data) that we believe represent the inherent difficulties of assessing/assuring DL dependability.

The key contributions of this work are:

a) A first RAM for DL classifiers based on the OP information and robustness evidence.

b) Discussions on model assumptions and extension to real-world applications, highlighting the inherent difficulties of assessing DL dependability uncovered by our model.

c) A prototype tool² of our RAM with preliminary and compromised solutions to those uncovered difficulties.

¹ This reliability measure is similar to the conventional probability of failure per demand (pfd), but retrofitted for classifiers.

Related Work

In recent years, there have been extensive efforts in verifying DL robustness, evaluating generalisation errors, and detecting adversarial examples (AEs). They are normally based on formal methods [Huang et al., 2017; Katz et al., 2019] or statistical approaches [Webb et al., 2019; Weng et al., 2019]. A comprehensive review of those techniques can be found in recent survey papers [Huang et al., 2020; Zhang et al., 2020]. To the best of our knowledge, the only papers on testing DL for assessment within an operational context are [Li et al., 2019; Guerriero et al., 2021]. In [Li et al., 2019], novel stratified sampling methods are used to improve the operational testing efficiency. Similarly, [Guerriero et al., 2021] presents a sampling method from the operational dataset leveraging “auxiliary information for misclassification”, so that it provides unbiased statistical assessment while exposing as many misclassifications as possible. However, neither of them considers robustness evidence in their assessment models.

At the higher level of whole-systems utilising DL, although there are RAMs based on operational data, knowledge from low-level DL components is usually ignored, e.g., [Kalra and Paddock, 2016]. In [Zhao et al., 2020c], we improved [Kalra and Paddock, 2016] by providing a Bayesian mechanism to combine such knowledge, but did not show where to obtain the knowledge. In that sense, this paper is also a follow-up of [Zhao et al., 2020c], forming the prior knowledge required.

Organisation of the paper

We first present preliminaries on OP-based software reliability assessment and DL robustness. Then Section 3 describes the RAM in detail with a running example. We conduct case studies in Section 4, and discuss the model assumptions and extensions in Section 5. Finally, we conclude in Section 6 with future work.

2 Preliminaries

2.1 OP Based Software Reliability Assessment

The delivered reliability, as a user-centred and probabilistic property, requires modelling the end-users’ behaviours (in the running environments) and a formal definition by a quantitative metric [Littlewood and Strigini, 2000]. Without loss of generality, we focus on pmi as a generic metric for DL classifiers, where inputs are, e.g., facial images uploaded by users for facial recognition. We discuss later how pmi can be redefined to cope with real-world applications like traffic sign detection. If we denote the unknown pmi as a variable $\lambda$, then

$$\lambda := \int_{x \in \mathcal{X}} I_{\{x \text{ causes a misclassification}\}}(x)\, Op(x)\, dx \tag{1}$$

where $x$ is an input in the input domain³ $\mathcal{X}$, and $I_S$ is an indicator function: it is equal to 1 when $S$ is true and 0 otherwise. $Op(x)$ returns the probability that $x$ is the next random input, i.e. the OP [Musa, 1993], a notion used in software engineering to quantify how the software will be operated. Mathematically, the OP is a probability density function (PDF) defined over $\mathcal{X}$.

Assuming independence between successive inputs defined in our pmi, we may use the Bernoulli process as the mathematical abstraction of the failure process (common for such “on-demand” type of systems), which implies a Binomial likelihood. Normally for traditional software, upon establishing the likelihood, RAMs on estimating $\lambda$ vary case by case, from the basic Maximum Likelihood Estimation (MLE) to Bayesian estimators tailored for certain scenarios, e.g., when seeing no failures [Bishop et al., 2011], when inferring ultra-high reliability [Zhao et al., 2020c], with certain forms of prior knowledge like perfection [Strigini and Povyakalo, 2013], and with vague prior knowledge expressed in imprecise probabilities [Walter and Augustin, 2009; Zhao et al., 2019].

OP-based RAMs designed for traditional software fail to consider new characteristics of DL, e.g., unrobustness and high-dimensional input space. Specifically, it is quite hard to obtain the prior knowledge required by those Bayesian RAMs, while frequentist RAMs would require a large sample size to gain enough confidence in the estimates due to the extremely large population size (high-dimensional pixel space), especially for a highly reliable DL model where misclassifications are rare events. As an example, the usual accuracy testing of DL classifiers is essentially an MLE estimate against the test set. It not only assumes the test set statistically represents the OP (our Assumption 3 later), but also requires a large number of samples to claim high reliability with sufficient confidence.
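To give a feel for the sample sizes involved, the sketch below computes how many failure-free, independent operational tests would be needed before a frequentist bound pmi ≤ p can be claimed with a given confidence, using the standard relation (1 − p)^n ≤ 1 − c implied by the Binomial likelihood. The function name and the example bounds are illustrative assumptions and are not part of the RAM itself.

```python
import math


def failure_free_tests_needed(pmi_bound: float, confidence: float) -> int:
    """Smallest n with (1 - pmi_bound)**n <= 1 - confidence.

    Assumes independent inputs drawn from the OP and zero observed
    misclassifications (the Bernoulli/Binomial setting described above).
    """
    if not (0.0 < pmi_bound < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("pmi_bound and confidence must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - pmi_bound))


if __name__ == "__main__":
    # Hypothetical reliability targets, chosen only for illustration.
    for p in (1e-2, 1e-4, 1e-6):
        print(f"pmi <= {p:g} at 95% confidence needs "
              f"{failure_free_tests_needed(p, 0.95)} failure-free tests")
```

The required number of tests grows roughly in proportion to 1/p, which is why rare misclassifications make purely test-based claims expensive.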

2.2 DL Robustness and the R-Separation Property

DL is known not to be robust. Robustness requires that the decision of the DL model $M$ is invariant against small perturbations on inputs. That is, all inputs in a region $\eta \subset \mathcal{X}$ have the same prediction label, where usually the region is a small norm ball (in an $L_p$-norm distance⁴) of radius $\epsilon$ around an input $x$. Inside $\eta$, if an input $x'$ is classified differently to $x$ by $M$, then $x'$ is an AE. Robustness can be defined either as a binary metric (whether there exists any adversarial example in $\eta$) or as a probabilistic metric (how likely the event of seeing an adversarial example in $\eta$ is). The former aligns with formal verification, e.g. [Huang et al., 2017], while the latter is normally used in statistical approaches, e.g. [Webb et al., 2019]. The former “verification approach” is the binary version of the latter “stochastic approach”⁵.

Similar to [Webb et al., 2019], we adopt the more general probabilistic definition of the robustness of the model $M$ (in a region $\eta$ and to a target label $y$):

$$R_M(\eta, y) := \sum_{x \in \eta} I_{\{M(x) \text{ predicts label } y\}}(x)\, Op(x \mid x \in \eta) \tag{2}$$

where $Op(x \mid x \in \eta)$ is the conditional OP of region $\eta$ (precisely the “input model” defined in [Webb et al., 2019] and also used in [Weng et al., 2019]).

² Available at https://github.com/havelhuang/ReAsDL.

³ We assume continuous $\mathcal{X}$ in this paper. For discrete $\mathcal{X}$, the integral in Eq. (1) reduces to a sum and the OP is a probability mass function.

⁴ Distance mentioned in this paper is defined in $L_\infty$.

⁵ Thus, we use the more general term robustness “evaluation” rather than robustness “verification” throughout the paper.

We highlight the following two remarks regarding robustness:

Remark 1 (astuteness). Reliability assessment only concerns the robustness to the ground truth label, rather than an arbitrary label $y$ in $R_M(\eta, y)$. When $y$ is such a ground truth, robustness becomes astuteness [Yang et al., 2020], which is also the conditional reliability in the region $\eta$.

Astuteness is a special case of robustness⁶. An extreme example showing why we introduce the concept of astuteness is: a perfectly robust classifier that always outputs “dogs” for any given input is unreliable. Thus, robustness evidence cannot directly support reliability claims unless the ground truth label is used in $R_M(\eta, y)$.
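The “always dogs” example can be made concrete with a few lines of code. The sketch below estimates the probabilistic robustness of Eq. (2) by sampling a small norm ball, once against the predicted label and once against a ground truth label. The constant classifier, the ball centre and radius, the uniform sampling and the label names are all illustrative assumptions.

```python
import numpy as np


def robustness(model, samples, target_label):
    """Sampled estimate of the share of inputs in a region predicted as target_label."""
    predictions = np.array([model(x) for x in samples])
    return float(np.mean(predictions == target_label))


def always_dog(x):
    """Hypothetical classifier that ignores its input entirely."""
    return "dog"


rng = np.random.default_rng(0)
centre, radius = np.array([0.5, 0.5]), 0.01
# Uniform samples from the L-infinity ball around the centre (an assumption).
samples = centre + rng.uniform(-radius, radius, size=(1000, 2))

robust_to_prediction = robustness(always_dog, samples, "dog")  # 1.0: perfectly robust
astuteness_for_cat = robustness(always_dog, samples, "cat")    # 0.0: never correct
```

If the region actually contains cats, the classifier is perfectly robust yet has zero astuteness, which is why only robustness to the ground truth can support a reliability claim.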

Remark 2 (r-separation). For real-world image datasets, any data-points with different ground truth labels are at least a distance $2r$ apart in the input space $\mathcal{X}$ (i.e., pixel space), and $r$ is bigger than the usual norm ball radius in robustness studies.

The r-separation property was first observed by [Yang et al., 2020]: the real-world image datasets studied by the authors imply that $r$ is normally 3 to 7 times bigger than the radius (denoted $\epsilon$) of norm balls commonly used in robustness studies. Intuitively it says that, although the classification boundary is highly non-linear, there is a minimum distance between two real-world objects of different classes (cf. Figure 1 for a conceptual illustration). Moreover, such a minimum distance is bigger than the usual norm ball size in robustness studies.

3 A RAM for Deep Learning Classifiers

3.1 The Running Example

To better demonstrate our RAM, we take the Challenge of AI Dependability Assessment raised by Siemens Mobility⁷ as a running example. Basically, the challenge is to first train a DL model to classify a dataset generated on the unit square $[0, 1]^2$ according to some unknown distribution. The collected data-points (training set) are shown in Figure 2 (lhs). Then we need to build a RAM to claim an upper bound on the probability that the next random point is misclassified, i.e. pmi. If the 2D-points represent traffic lights, then we have 2 types of misclassifications: safety-critical ones when a red data-point is labelled green, and performance-related ones otherwise. For brevity, we only focus on misclassifications here, while our RAM can cope with sub-types of misclassifications.

⁶ Thus, later in this paper, we may use “robustness” to refer to astuteness for brevity when it is clear from the context.

⁷ https://ecosystem.siemens.com/ai-da-sc/

3.2 The Framework

Inspired by [Pietrantuono et al., 2020], the general idea of our RAM is to partition the input domain into $m$ small cells, subject to the r-separation property. Then, for each cell $c_i$ (with a single ground truth $y_i$), we estimate:

$$\lambda_i := 1 - R_M(c_i, y_i) \quad \text{and} \quad Op_i := \sum_{x \in c_i} Op(x) \tag{3}$$

which are the unastuteness and pooled OP of the cell $c_i$, respectively; we introduce estimators for both later. Then Eq. (1) can be written as the weighted sum of the cell-wise unastuteness (i.e. the conditional pmi of each cell⁸), where the weights are the pooled OP of cells:

$$\lambda = \sum_{i=1}^{m} Op_i\, \lambda_i \tag{4}$$

Eq. (4) represents an ideal case in which we know those $\lambda_i$s and $Op_i$s with certainty. In practice, we can only estimate them with imperfect estimators yielding, e.g., a point estimate with variance capturing the measure of trust. To propagate the confidence in the estimates of $\lambda_i$s and $Op_i$s, we assume:

Assumption 1. All $\lambda_i$s and $Op_i$s are independent unknown variables under estimation.

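Then, the estimate of $\lambda$ and its variance are:

$$E[\lambda] = \sum_{i=1}^{m} E[\lambda_i Op_i] = \sum_{i=1}^{m} E[\lambda_i]\, E[Op_i] \tag{5}$$

$$V[\lambda] = \sum_{i=1}^{m} V[\lambda_i Op_i] = \sum_{i=1}^{m} \left( E[\lambda_i]^2\, V[Op_i] + E[Op_i]^2\, V[\lambda_i] + V[\lambda_i]\, V[Op_i] \right) \tag{6}$$

Note that, for the variance, the covariance terms are dropped due to the independence assumption.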
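For completeness, the per-cell term in Eq. (6) follows from the standard identity for the variance of a product of two independent random variables. The short derivation below writes $X$ for $\lambda_i$ and $Y$ for $Op_i$; this shorthand is introduced here only for readability.

```latex
\begin{align*}
V[XY] &= E[X^2 Y^2] - \left(E[XY]\right)^2 \\
      &= E[X^2]\,E[Y^2] - E[X]^2\,E[Y]^2
         && \text{(independence of } X \text{ and } Y\text{)} \\
      &= \left(V[X] + E[X]^2\right)\left(V[Y] + E[Y]^2\right) - E[X]^2\,E[Y]^2 \\
      &= E[X]^2\,V[Y] + E[Y]^2\,V[X] + V[X]\,V[Y].
\end{align*}
```

Summing these terms over cells without cross terms additionally relies on independence between different cells, as stated in Assumption 1.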

⁸ We use “cell unastuteness” and “cell pmi” interchangeably later.

Depending on the specific estimators adopted, certain parametric families of the distribution of $\lambda$ can be assumed, from which any quantile of interest (e.g. 95%) can be derived as our confidence bound in reliability. For instance, as readers will see later, we may assume $\lambda \sim \mathcal{N}(E[\lambda], V[\lambda])$ since all $\lambda_i$s and $Op_i$s are normally distributed variables after applying the Central Limit Theorem (CLT) in our chosen estimators. Then, an upper bound with $1-\alpha$ confidence is

$$Ub_{1-\alpha} = E[\lambda] + z_{1-\alpha} \sqrt{V[\lambda]} \tag{7}$$

where $Pr(Z \le z_{1-\alpha}) = 1-\alpha$, and $Z \sim \mathcal{N}(0, 1)$ is a standard normal variable.
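The assembly step can be sketched in a few lines: given per-cell means and variances of the unastuteness and of the pooled OP, the code applies Eqs. (5) and (6) and then takes a normal quantile as the upper bound. The function and argument names are illustrative assumptions, and the snippet is a sketch rather than the implementation in the prototype tool.

```python
from statistics import NormalDist

import numpy as np


def assemble_pmi(e_lam, v_lam, e_op, v_op, alpha=0.025):
    """Combine cell-level estimates into (E[pmi], V[pmi], upper bound).

    e_lam, v_lam: per-cell mean and variance of the unastuteness.
    e_op, v_op:   per-cell mean and variance of the pooled OP.
    alpha:        1 - alpha is the confidence level of the upper bound.
    """
    e_lam, v_lam, e_op, v_op = (
        np.asarray(a, dtype=float) for a in (e_lam, v_lam, e_op, v_op)
    )
    mean = float(np.sum(e_lam * e_op))                          # Eq. (5)
    var = float(np.sum(e_lam**2 * v_op + e_op**2 * v_lam + v_lam * v_op))  # Eq. (6)
    z = NormalDist().inv_cdf(1.0 - alpha)                       # standard normal quantile
    upper = mean + z * float(np.sqrt(var))                      # normal-approximation bound
    return mean, var, upper


# Two hypothetical cells with equal OP weight; only the first one misclassifies.
mean, var, ub = assemble_pmi([0.1, 0.0], [0.01, 0.0], [0.5, 0.5], [0.0, 0.0])
```

With `alpha=0.025` the bound corresponds to a 97.5% one-sided confidence level.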

Now the problem is reduced to how to obtain the estimates $E[\lambda_i]$s and $V[\lambda_i]$s, which we discuss as follows with reference to the running example.

Partition of the Input Domain $\mathcal{X}$

As per Remark 1, the astuteness evaluation of a cell requires its ground truth label. To leverage the r-separation property and Assumption 4, we partition the input space by choosing a cell radius $\epsilon$ so that $\epsilon < r$. Although we concur with Remark 2 (first observed by [Yang et al., 2020]) and believe that there should exist an r-stable ground truth (which means that the ground truth is stable in such a cell) for any real-world DL classification application, it is hard to estimate such an $r$ (the estimate is denoted by $\hat{r}$) and the best we can do is to assume:

Assumption 2. There is an r-stable ground truth (as defined in Remark 2) for any real-world classification problem, and it can be sufficiently estimated from the existing dataset.

That said, we get $\hat{r} = 0.004013$ by iteratively calculating the minimum distance between data-points with different labels in the running example. Then we choose a cell radius⁹ $\epsilon = 0.004$ and partition the unit square $\mathcal{X}$ into $250 \times 250$ cells.
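A minimal sketch of this step is given below: it computes the smallest $L_\infty$ distance between any two data-points with different labels and maps points to square grid cells. How this distance is converted into $\hat{r}$ (for instance whether it is halved, following the “2r apart” wording of Remark 2) is left open here, and the function names and the toy data are assumptions for illustration.

```python
import numpy as np


def min_interclass_distance(points, labels):
    """Smallest L-infinity distance between two points with different labels."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    best = np.inf
    for i in range(len(points) - 1):
        differs = labels[i + 1:] != labels[i]
        if np.any(differs):
            # Chebyshev (L-infinity) distance from point i to later points of other classes.
            dists = np.max(np.abs(points[i + 1:][differs] - points[i]), axis=1)
            best = min(best, float(dists.min()))
    return best


def cell_index(points, side):
    """Integer grid coordinates of the square cell (side length `side`) containing each point."""
    return np.floor(np.asarray(points, dtype=float) / side).astype(int)


# Toy data, not the running example's dataset.
pts = [[0.1, 0.1], [0.2, 0.1], [0.5, 0.5]]
labs = [0, 0, 1]
d_min = min_interclass_distance(pts, labs)
cells = cell_index(pts, side=0.25)
```

The pairwise loop is quadratic in the number of points, which is acceptable for a small 2D dataset but would need a spatial index for larger ones.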

Cell OP Approximation

Given a dataset $(X, Y)$, we estimate the pooled OP of cell $c_i$ to get $E[Op_i]$ and $V[Op_i]$. We use the well-established Kernel Density Estimation (KDE) to fit a $\widehat{Op}(x)$ to approximate the OP.

Assumption 3. The existing dataset $(X, Y)$ is randomly sampled from the OP, and thus statistically represents the OP.

This assumption may not hold in practice: training data is normally collected in a balanced way, since the DL model is expected to perform well in all categories of inputs, especially when the OP is unknown at the time of training and/or expected to change in future. Although our model can relax this assumption (cf. Section 5), we adopt it for brevity in demonstrating the running example.

Then given a set of (unlabelled) data-points $(X_1, \ldots, X_n)$ from the existing dataset $(X, Y)$, KDE yields

$$\widehat{Op}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right) \tag{8}$$

where $K$ is the kernel function (e.g. Gaussian or exponential kernels), and $h > 0$ is a smoothing parameter called the bandwidth, cf. [Silverman, 1986] for guidelines on tuning $h$. The approximated OP¹⁰ is shown in Figure 2 (rhs).

⁹ Radius in $L_\infty$, which is the side length of our square cell in $L_2$.

Since our cells are small and all of equal size, instead of calculating $\int_{x \in c_i} \widehat{Op}(x)\, dx$, we may approximate $Op_i$ as

$$\widehat{Op}_i = \widehat{Op}(x_{c_i})\, v_c \tag{9}$$

where $\widehat{Op}(x_{c_i})$ is the probability density at the cell’s central point $x_{c_i}$, and $v_c$ is the constant cell volume ($1.6 \times 10^{-5}$ in the running example).
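The KDE of Eq. (8) and the cell approximation of Eq. (9) can be sketched with NumPy as follows. The snippet assumes an isotropic Gaussian kernel and uses the d-dimensional normalisation $1/(n h^d)$, which reduces to the $1/(nh)$ of Eq. (8) when $d = 1$. Function names are illustrative, and the toy inputs are not the running example's data.

```python
import numpy as np


def gaussian_kde(x, data, h):
    """Gaussian KDE with scalar bandwidth h.

    x:    query points, shape (q, d)
    data: observed (unlabelled) points, shape (n, d)
    Returns the estimated density at each query point, shape (q,).
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    u = (x[:, None, :] - data[None, :, :]) / h              # scaled differences, (q, n, d)
    kernel = np.exp(-0.5 * np.sum(u**2, axis=2)) / (2.0 * np.pi) ** (d / 2.0)
    return kernel.sum(axis=1) / (n * h**d)


def pooled_cell_op(centres, data, h, cell_volume):
    """Eq. (9): density at each cell centre times the constant cell volume."""
    return gaussian_kde(centres, data, h) * cell_volume


# Toy check with a single observation at the query point.
density = gaussian_kde([[0.5, 0.5]], [[0.5, 0.5]], h=0.1)[0]
op_cell = pooled_cell_op([[0.5, 0.5]], [[0.5, 0.5]], h=0.1, cell_volume=1.6e-5)[0]
```

The broadcasted difference array has size q × n × d, so for many cells and data-points the centres would need to be processed in batches.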

Now if we introduce new variables $W_j = \frac{1}{h} K\!\left(\frac{x - X_j}{h}\right)$, the KDE evaluated at $x$ is actually the sample mean of $W_1, \ldots, W_n$. Then by CLT, we have $\widehat{Op}(x) \sim \mathcal{N}(\mu_W, \sigma_W^2 / n)$, where the mean and variance of $\widehat{Op}(x)$ are known results:

$$E[\widehat{Op}(x)] = \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right) \tag{10}$$

$$V[\widehat{Op}(x)] = \frac{f(x) \int K^2(u)\, du}{nh} + O\!\left(\frac{1}{nh}\right) \approx \hat{\sigma}_B^2(x) \tag{11}$$

where the last step of Eq. (11) says that $V[\widehat{Op}(x)]$ can be approximated using a bootstrap variance $\hat{\sigma}_B^2(x)$ [Chen, 2017] (cf. Appendix A for details).

Upon establishing Eqs. (10) and (11), together with Eq. (9), we know for a given cell $c_i$ (knowing its central point $x_{c_i}$):

$$E[Op_i] = v_c\, E[\widehat{Op}(x_{c_i})], \qquad V[Op_i] = v_c^2\, V[\widehat{Op}(x_{c_i})] \tag{12}$$

which are the cell OP estimates for Eqs. (5) and (6).
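One plausible way to obtain the two quantities in Eq. (12) is a plain nonparametric bootstrap: resample the dataset with replacement, re-evaluate the KDE at the cell centre, and use the spread of those values as the variance. The sketch below does this under an isotropic Gaussian kernel. It is a generic bootstrap written for illustration and may differ from the procedure in Appendix A; all names and default values are assumptions.

```python
import numpy as np


def kde_at(x, data, h):
    """Gaussian KDE (scalar bandwidth h) evaluated at a single point x."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    u = (data - np.asarray(x, dtype=float)) / h
    kernel = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return float(kernel.sum() / (n * h**d))


def bootstrap_cell_op(centre, data, h, cell_volume, n_boot=200, seed=0):
    """Return (E[Op_i], V[Op_i]) for one cell, following the form of Eq. (12)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(data)
    point_estimate = kde_at(centre, data, h)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]   # draw n points with replacement
        boot[b] = kde_at(centre, resample, h)
    density_variance = float(boot.var(ddof=1))        # bootstrap variance of the KDE
    return cell_volume * point_estimate, cell_volume**2 * density_variance
```

A usage example would call `bootstrap_cell_op(centre, data, h=0.2, cell_volume=1.6e-5)` once per cell, with the number of bootstrap rounds chosen to fit the computational budget.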

Cell Astuteness Evaluation

As a corollary of Remark 2 and Assumption 2, we may confidently assume:

Assumption 4. If the radius of $c_i$ is smaller than $r$, all data-points in the region $c_i$ share a single ground truth label.

Now, to determine the ground truth label of a cell ci, we can classify our cells into three types:

a) Normal cells: a normal cell contains data-points sharing the same ground truth label, which is then determined as the ground truth label of the cell.

b) Empty cells: a cell is “empty” in the sense that no data-point has been observed in it. Due to the lack of data, it is hard to determine an empty cell’s ground truth. For now, we do voting based on the predicted labels (by the DL model) of random samples from the cell, assuming:

Assumption 5. The accuracy of the DL model is better than a classifier doing random classifications in any given cell.

Essentially, the above assumption relates to the oracle problem of DL testing, which some recent efforts, e.g. [Guerriero, 2020], may relax.

c) Cross-boundary cells: our estimate of $r$ is imperfect, thus we may still observe data-points with different labels in one cell. Such cells cross the classification boundary. If our estimate of $r$ is sufficiently accurate, they should be very rare. Thus, without the need to determine the ground truth label of a cross-boundary cell, we simply and conservatively set the cell unastuteness to 1.

¹⁰ With a Gaussian kernel and $h = 0.2$, optimised by cross-validated grid-search [Bergstra and Bengio, 2012].
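The three cell types can be told apart with a simple pass over the labelled data, as in the sketch below. It assumes the input domain is the unit hypercube partitioned into an evenly spaced grid; the function name, the returned tuple format and the toy data are illustrative choices.

```python
from collections import defaultdict

import numpy as np


def classify_cells(points, labels, side, n_per_axis):
    """Map every grid cell to ("normal", label), ("empty", None) or ("cross-boundary", None)."""
    points = np.asarray(points, dtype=float)
    # Grid coordinates of each point; points on the upper border go to the last cell.
    idx = np.minimum(np.floor(points / side).astype(int), n_per_axis - 1)
    seen = defaultdict(set)
    for row, label in zip(idx, labels):
        seen[tuple(int(v) for v in row)].add(label)

    result = {}
    for key in np.ndindex(*(n_per_axis,) * points.shape[1]):
        observed = seen.get(key, set())
        if not observed:
            result[key] = ("empty", None)            # ground truth decided later by voting
        elif len(observed) == 1:
            result[key] = ("normal", next(iter(observed)))
        else:
            result[key] = ("cross-boundary", None)   # unastuteness conservatively set to 1
    return result


# Toy data on a 2 x 2 grid over the unit square.
toy = classify_cells(
    [[0.1, 0.1], [0.15, 0.12], [0.6, 0.6], [0.62, 0.61]], [0, 0, 0, 1],
    side=0.5, n_per_axis=2,
)
```

Enumerating every cell explicitly is only feasible in low dimensions, which connects to the scalability discussion later in the paper.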

So far, the problem is reduced to: given a normal or empty cell $c_i$ with the known ground truth label $y_i$, evaluate the misclassification probability upon a random input $x \in c_i$, i.e. $E[\lambda_i]$, and its variance $V[\lambda_i]$. This is essentially a statistical problem that has been studied in [Webb et al., 2019] using Multilevel Splitting Sampling, while we use the Simple Monte Carlo method for brevity in the running example:

$$\hat{\lambda}_i = \frac{1}{n} \sum_{j=1}^{n} I_{\{M(x_j) \neq y_i\}}$$

The CLT tells us $\hat{\lambda}_i \sim \mathcal{N}(\mu, \sigma^2 / n)$ when $n$ is large, where $\mu$ and $\sigma^2$ are the population mean and variance of $I_{\{M(x_j) \neq y_i\}}$, which can be approximated with the sample mean $\hat{\mu}_n$ and sample variance $\hat{\sigma}_n^2$. Finally, we can get

$$E[\lambda_i] = \hat{\mu}_n, \qquad V[\lambda_i] = \frac{\hat{\sigma}_n^2}{n}$$

Notably, to solve the above statistical problem with sampling methods, we need to assume how the inputs in the cell are distributed, i.e., a distribution for the conditional OP $Op(x \mid x \in c_i)$. Without loss of generality, we assume:

Assumption 6. The inputs in a small region like cells are uniformly distributed.

This assumption is not uncommon (e.g., in [Webb et al., 2019; Weng et al., 2019]) and can be easily replaced by other distributions if there is supporting evidence for such action.
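A minimal sketch of the Simple Monte Carlo estimator for one cell is shown below, with inputs drawn uniformly from the cell as per Assumption 6. The toy model, the cell bounds and the sample size are assumptions made for illustration; `model` is expected to map an array of inputs to an array of predicted labels.

```python
import numpy as np


def cell_unastuteness(model, lower, upper, true_label, n=10_000, seed=0):
    """Simple Monte Carlo estimate of (E[lambda_i], V[lambda_i]) for one cell.

    lower, upper: opposite corners of the axis-aligned cell.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    samples = rng.uniform(lower, upper, size=(n, lower.size))   # uniform conditional OP
    wrong = (model(samples) != true_label).astype(float)        # indicator of misclassification
    mean = float(wrong.mean())                                  # sample mean
    variance_of_mean = float(wrong.var(ddof=1) / n)             # sample variance divided by n
    return mean, variance_of_mean


def toy_model(x):
    """Hypothetical classifier: label 1 to the right of x0 = 0.5, label 0 otherwise."""
    return (x[:, 0] > 0.5).astype(int)


# A cell straddling the toy decision boundary, with assumed ground truth 0.
lam, lam_var = cell_unastuteness(toy_model, [0.4, 0.4], [0.6, 0.6], true_label=0)
```

For cells where misclassifications are very rare, most samples contribute nothing to the estimate, which is the motivation for the more efficient rare-event samplers mentioned above.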

4 Case Studies

In addition to the running example, we conduct experiments on two synthetic datasets as shown in Figure 3, representing the scenarios with sparse and dense training data respectively. All modelling details and results after applying our RAM on those three datasets are summarised in Table 1, based on which we compare the testing error, Average Cell Unastuteness (ACU) and our RAM results ($E[\lambda]$ and $Ub_{97.5\%}$).

In the running example, we first observe that the ACU is much lower than the test error, meaning the underlying DL model is a robust one. Since our RAM is mainly based on the robustness evidence, its results are close to the ACU but not exactly the same because of the nonuniform OP, cf. Figure 2 (rhs). Moreover, from Figure 2 (lhs), we know the classification boundary is near the middle of the unit square input space where misclassifications tend to happen (say, the “buggy area”), which is also the high-density area of the OP. Thus, the contribution to unreliability from the “buggy area” is weighted higher by the OP, which explains why our RAM results are worse than the ACU. In contrast, because of the “flat” OP in DS-1 (cf. Figure 3 (lhs)), our RAM results are very close to the ACU. With denser data in DS-2, the r-distance is much smaller and leads to more cells. Thanks to the rich data in this case, all three results are more consistent. We note that, given the nature of the three 2D-point datasets, DL models trained on them are much more robust than those trained on image datasets. This is why all ACUs are better than test errors, and our RAM finds a middle point representing reliability according to the OP. Later we apply the RAM on two unrobust DL models trained on image datasets where the ACUs are worse than the test errors; this confirms our aforementioned observations.

To gain insights on how to extend our method to high-dimensional/real-world datasets, we also conduct experiments on the popular MNIST and CIFAR10 datasets. Instead of implementing the exact steps in Section 3.2, we take a few compromised solutions to tackle the scalability issues raised by “the curse of dimensionality”. We articulate these steps in the following paragraph, while detailed discussions on their impact on our results are presented in Section 5.

First, we train Variational Auto-Encoders (VAE) on the MNIST and CIFAR10 datasets and project all inputs into the low-dimensional latent spaces of the VAEs (with 8 and 16 dimensions respectively). Then we apply the proposed RAM on the compressed dataset, i.e., partitioning the latent space, learning the OP in latent space and evaluating the “latent-cell unastuteness”. Astuteness (a special case of robustness) is by definition associated with the input pixel space. By “latent-cell unastuteness”, we mean the average unastuteness of norm balls (in the input space) around a large number of samples from a “latent-cell”. The norm ball radius is determined by the r-separation distance in the input space. Taking the computational cost into consideration, we rank the OP of all latent-cells, and choose the top k cells with the highest OP for astuteness evaluation. We adopt the existing robustness estimator in [Webb et al., 2019], where the authors omitted the result of $V[\lambda_i]$; we therefore also omit the variance in our experiments for simplicity.

5 Discussion

In this section, we summarise the model assumptions made in our RAM, and discuss if/how they can be validated and what new assumptions/compromised solutions are needed to cope with high-dimensional/real-world applications. Finally, we list the inherent difficulties of assessing DL uncovered by our RAM.

Independent $\lambda_i$s and $Op_i$s

As per Assumption 1, we assume all $\lambda_i$s and $Op_i$s are independent when “assembling” their estimates via Eq. (5) and deriving the variance via Eq. (6). Largely, this assumption is for mathematical tractability when propagating the confidence in individual estimates at the cell-level to the whole system pmi. Although this independence assumption is hard to justify in practice, it is not unusual in reliability models that use partitions, e.g. in [Pietrantuono et al., 2020; Miller et al., 1992]. We believe that RAMs are still useful as a first approximation under this assumption, while we envisage that Bayesian estimators leveraging joint priors and conjugacy may relax it.

R-separation and its estimation

Assumption 2 derives from Remark 2. We concur with [Yang et al., 2020] and believe that, for any real-world DL classification application where the inputs are data-points with “physical meanings”, there should always exist an r-stable ground truth. Such an $r$ varies between applications, and the smaller $r$ is, the harder the classification problem inherently is; i.e., $r$ is a difficulty indicator for the given classification problem.

For real-world applications, what really determines the label of an image are its features rather than pixels. Thus, we envisage that some latent space (of, e.g., a VAE) capturing only the feature-wise information can be explored for high-dimensional data. That is, we

• first do r-separation based partition in the latent space to learn the OP;
• then determine the ground truth labels of cells in the latent space;
• map the learned OP and ground truth labels back to the input pixel space;
• do astuteness evaluation in the input pixel space and “assemble” the results according to the OP.

Indeed, it is hard to estimate $r$ (in either the input or the latent space), and the best we can do is to estimate it from the existing dataset. One way of solving the problem is to keep monitoring the $r$ estimates as more labelled data is collected, and redo the cell partition when the estimated $r$ has changed significantly.

Approximation of the OP

Assumption 3 says that the collected dataset statistically represents the OP, which may not hold for many practical reasons; e.g., the future OP is uncertain at the training stage and thus data is collected in a balanced way to perform well in all categories of inputs. Although we demonstrate our RAM under this assumption for simplicity, it can be easily relaxed. Essentially, we try to fit a density function over the input space from an “operational dataset” (representing the OP). Data-points in this set can be unlabelled raw data generated from historical data of previous applications or from simulations, and manually scaled based on expert knowledge. Obtaining such an operational dataset is an application-specific engineering problem, and is tractable thanks to the fact that it does not require labelled data.

Notably, the OP may also be approximated at runtime based on the data stream of operational data. Efficient KDE for data streams [Qahtan et al., 2017] can be used. If the OP was subject to sudden changes, change-point detectors like [Zhao et al., 2020b] should also be paired with the runtime estimator to robustly approximate the OP.

However, we may encounter technical challenges when fitting the PDF from high-dimensional real-world datasets. There are two known major challenges when applying multivariate KDE to high-dimensional data: i) the choice of the bandwidth matrix $H$, which represents the covariance matrix and most strongly impacts the estimation accuracy; ii) scalability issues in terms of storing intermediate data structures (e.g. data-points in hash tables) and the query time when estimating the density at a given input. For the first challenge, the bandwidth matrix can be chosen using rules of thumb [Silverman, 1986; Scott, 2015] or cross-validation [Bergstra and Bengio, 2012]. For the second, there is dedicated research on improving the efficiency of multivariate KDE; e.g., [Backurs et al., 2019] presented a framework for multivariate KDE with provably sub-linear query time, and with space and pre-processing time linear in the dimension.

Determination of the ground truth of a cell

Assumptions 4 and 5 are essentially about how to determine the ground truth label for a given cell, which relates to the oracle problem of testing DL [Guerriero, 2020]. While it is still challenging, we partially solve it by leveraging the r-separation property.

Thanks to $r$, it is easy to determine a cell’s ground truth when we see it contains labelled data-points. However, for an empty cell, it is non-trivial. We assume the overall performance of the DL model is fairly good (e.g., better than a classifier doing random classifications), thus misclassifications within an empty cell are relatively rare events. Then we can determine the ground truth label of the cell by majority voting of predictions. Indeed, this is a strong assumption when there are some “failure regions” in the input space that perform really badly (even worse than random labelling). In this case, we need to invent a new mechanism to detect such “really bad failure regions” and spend more budget on invoking, say, humans to do the labelling.
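The majority-voting step for an empty cell can be sketched as follows. Inputs are sampled uniformly from the cell (as in Assumption 6), the model's predictions are counted, and the most frequent label is taken as the cell's ground truth. The returned vote share is an extra diagnostic added here for illustration; a low share could serve as one possible flag for cells needing human labelling. The function name, the toy model and the sample size are assumptions.

```python
from collections import Counter

import numpy as np


def vote_ground_truth(model, lower, upper, n=1000, seed=0):
    """Majority vote over model predictions on uniform samples from one cell.

    Returns (winning_label, share_of_votes_for_that_label).
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    samples = rng.uniform(lower, upper, size=(n, lower.size))
    predictions = np.asarray(model(samples)).tolist()
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / n


def toy_model(x):
    """Hypothetical classifier: label 1 to the right of x0 = 0.5, label 0 otherwise."""
    return (x[:, 0] > 0.5).astype(int)


# An empty cell lying mostly on the label-0 side of the toy boundary.
voted_label, vote_share = vote_ground_truth(toy_model, [0.3, 0.3], [0.52, 0.52])
```

The vote is only as trustworthy as Assumption 5: in a region where the model is systematically wrong, the majority label will be wrong as well.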

Conditional OP of a cell

We assume the distribution of inputs (i.e., the conditional OP) within each cell is uniform by Assumption 6. Although we conjecture that this is the common case due to the small size of cells (i.e., those very close/similar inputs within a small region are only subject to noise factors that can be modelled uniformly), the real situation may vary; this requires justification in safety cases.

For a real-world dataset, the conditional OP represents “natural variations” [Zhong et al., 2021], e.g. lighting conditions, which obey certain distributions. The conditional OP of cells should faithfully capture the distribution of such natural variations. Recent advances in measuring natural/realistic AEs [Harel-Canada et al., 2020] are highly related to this assumption and may relax it.

Explosion of the number of cells

The number of cells for which astuteness must be evaluated is exponential in the dimensions of the data. For high-dimensional data, it is impossible to explore all cells in the input space¹¹ as we did for the running example.

A compromised solution is to find the first k cells that dominate the OP. That is, we rank the cells by their pooled OP, and only evaluate the top-k cells where the sum of these k cells’ OPs is greater than a threshold, e.g. 99%. Then, we can conservatively set the cell pmi of the rest to a worst-case bound (e.g. 1) or an empirical/average bound based on the first k cells. Certainly, the price to pay is to sacrifice estimation accuracy. The best we can do for now is to increase the budgets for a larger k. Technically, finding the first k cells dominating the OP is in fact to calculate the modes of the KDE function. The work of [Lee et al., 2019] gives us a hint on how to quickly calculate the modes of Gaussian KDE when the data dimension is high.
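The top-k compromise can be sketched as below: cells are ranked by pooled OP, the smallest prefix whose cumulative OP reaches the threshold is kept, and the remaining OP mass is charged at a worst-case cell pmi. Function names, the toy OP values and the choice of a worst-case bound of 1 are illustrative assumptions.

```python
import numpy as np


def top_k_cells(op, threshold=0.99):
    """Indices of the fewest cells, ranked by pooled OP, whose OP mass reaches the threshold."""
    op = np.asarray(op, dtype=float)
    order = np.argsort(op)[::-1]                          # cells by descending pooled OP
    cumulative = np.cumsum(op[order])
    k = int(np.searchsorted(cumulative, threshold * op.sum(), side="left")) + 1
    return order[: min(k, op.size)]


def pmi_with_bounded_rest(op, lam_top, top_idx, rest_bound=1.0):
    """Weighted pmi where unevaluated cells are assigned the cell pmi `rest_bound`."""
    op = np.asarray(op, dtype=float)
    op = op / op.sum()                                    # normalise the OP weights
    covered = op[top_idx]
    return float(np.sum(covered * np.asarray(lam_top, dtype=float))
                 + rest_bound * (1.0 - covered.sum()))


# Four hypothetical cells; evaluate only those covering at least 90% of the OP.
op_values = [0.5, 0.3, 0.15, 0.05]
chosen = top_k_cells(op_values, threshold=0.9)
pmi_estimate = pmi_with_bounded_rest(op_values, [0.1, 0.0, 0.0], chosen)
```

With a worst-case bound of 1 the result is conservative: the uncovered OP mass is added to the estimate in full, so the threshold directly limits how small the reported pmi can be.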

This discussion relates to the cost of our RAM, thus a pertinent question is: what is the real cost of conducting DL testing? Is it the human labour of generating labels or timing constraints? A likely answer is: both. Our RAM has partially solved the former (cf. earlier discussions), while the latter is less costly nowadays and can be solved by harnessing the fast growth of computational power and parallel computing.

Efficiency of cell robustness evaluation We have demonstrated the evaluation of cell robustness via the Simple Monte Carlo method in the running example. It is well known that Simple Monte Carlo is not computationally efficient for estimating rare events (such as AEs in our case) in high-dimensional spaces. Thus, instead of Simple Monte Carlo, we applied a more advanced and efficient sampling approach, the Adaptive Multi-level Splitting method [Webb et al., 2019], in our case studies on image datasets. We are confident that other statistical sampling methods designed for rare events would also meet our needs.
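To make the Simple Monte Carlo baseline concrete, the sketch below estimates a cell's pmi by uniform sampling. It is an illustration under stated assumptions, not the code used in the case studies: the cell is taken to be an axis-aligned box around a centre point, the conditional OP is uniform, and `classify`, `smc_cell_pmi` and the toy classifier are hypothetical names introduced here.

```python
import random
from typing import Callable, Sequence


def smc_cell_pmi(classify: Callable[[Sequence[float]], int],
                 centre: Sequence[float],
                 radius: float,
                 label: int,
                 n_samples: int = 10_000,
                 seed: int = 0) -> float:
    """Estimate the fraction of inputs in a cell that are misclassified.

    The cell is assumed to be the box centre +/- radius in every dimension,
    sampled uniformly (the uniform conditional OP assumed in the text).
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        x = [c + rng.uniform(-radius, radius) for c in centre]
        if classify(x) != label:
            failures += 1
    return failures / n_samples


def toy_classifier(x: Sequence[float]) -> int:
    # Hypothetical stand-in for a DL model: label 1 only when x[0] > 0.9.
    return 1 if x[0] > 0.9 else 0


estimate = smc_cell_pmi(toy_classifier, centre=[0.5, 0.5], radius=0.5, label=0)
```

In this toy cell about 10% of inputs are misclassified, so a few thousand samples suffice. When the failure region is much rarer than roughly one in `n_samples`, the estimate is usually exactly zero, which is the inefficiency that motivates rare-event methods such as multi-level splitting.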

In addition to the statistical approach, verification techniques based on formal methods can also be applied to assess a cell’s pmi, e.g. [Huang et al., 2017]. They provide formal guarantees on whether the DL model will misclassify any input inside a small region. Although such a “robust region” proved by formal methods is normally smaller than our cells, the estimate $\hat{\lambda}_i$ can be conservatively set to the proportion of robust region covered in $c_i$ in this case.

¹¹ Although dimension reduction methods like VAE may ease the problem of learning the OP, they cannot reduce the number of cells to be evaluated, since robustness by definition has to be evaluated in the input space.

We note that the cell robustness estimator in our RAM is “hot-swappable”: any new and more efficient robustness estimator can be easily incorporated. Thus, improving the efficiency of cell robustness estimation is outside the scope of our RAM.

Inherent difficulties Finally, based on our RAM and the discussions above, we summarise the inherent difficulties of assessing DL reliability as the following questions:

• How to accurately build the OP in the high-dimensional input space?
• How to build an accurate oracle leveraging the existing human-labels in the training dataset?
• What is the local distribution (conditional OP) over a small input region that captures the natural variations of physical conditions?
• How to efficiently evaluate the robustness of a small region given AEs are rare events?
• How to sample small regions from a large population (high-dimensional space) to test robustness in an unbiased and efficient way?

We try to provide preliminary/compromise solutions in our RAM, but these questions remain challenging in practice. At this stage, we doubt that other DL RAMs exist that achieve the same level of rigour as ours under weaker assumptions.

6 Conclusion & Future Work

In this paper, we present a preliminary RAM for DL classifiers. It is the first DL RAM that explicitly considers both the OP information and robustness evidence. It uncovers some inherently difficult questions in assessing DL reliability, for which preliminary/compromise solutions are discussed, implemented and demonstrated with case studies.

An intuitive way of viewing our RAM, compared with the usual accuracy testing, is that we enlarge the testing dataset with more test cases around “seeds” (original data points in the test set). We determine the oracle of a new test case according to its seed’s label and the r-distance. The results of these additional tests form the robustness evidence, and how much each contributes to the overall reliability is proportional to its OP. Consequently, because the model is exposed to more tests (robustness evaluation) that are more representative of how it will be used (the OP), our RAM is more trustworthy.

In line with the gist of our RAM, we believe that DL reliability should follow the conceptualised equation: DL reliability = generalisability × robustness. In a nutshell, when assessing DL reliability, we should be concerned not only with how the model generalises to a new data point (according to the future OP), but also with the local robustness around it. In line with this insight, a “naive/over-simplified” version of our RAM would average the local astuteness of all data points in the test set, which is less rigorous (e.g., in determining the norm ball size) and requires stronger assumptions (e.g., that the test set is equal to the operational set).
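The difference between the OP-weighted view and the naive average can be seen in a small numerical sketch. The function names and the toy numbers are assumptions for illustration only; local pmi is used here as one minus local astuteness.

```python
from typing import Sequence


def op_weighted_pmi(op: Sequence[float], local_pmi: Sequence[float]) -> float:
    """Combine local pmi values according to their (normalised) OP weights."""
    total = sum(op)
    return sum(w * p for w, p in zip(op, local_pmi)) / total


def naive_pmi(local_pmi: Sequence[float]) -> float:
    """Plain average, i.e. implicitly assuming every region is equally likely."""
    return sum(local_pmi) / len(local_pmi)


# Hypothetical example: the least robust region is rarely visited in operation.
op = [0.7, 0.2, 0.1]
local_pmi = [0.01, 0.05, 0.30]

weighted = op_weighted_pmi(op, local_pmi)
naive = naive_pmi(local_pmi)
```

Here the naive average is more than twice the OP-weighted value because it gives the rarely encountered, fragile region the same weight as the dominant one. The two coincide only when the weights are uniform, which mirrors the stronger assumption that the test set matches the operational set.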

Improving the scalability of our RAM and experimenting with more real-world datasets form important future work. We presume a trained DL model for our assessment purpose. A natural question next is how to actually improve the reliability when our RAM results are not good enough. As described in [Zhao et al., 2021], we plan to investigate DL debug testing (e.g. [Huang et al., 2021]) and retraining methods [Bai et al., 2021], together with the RAM, to form a closed loop of debugging-improving-assessing.

Acknowledgments & Disclaimer

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123. This work is partially supported by the UK EPSRC (through the Offshore Robotics for Certification of Assets [EP/R026173/1] and End-to-End Conceptual Guarding of Neural Architectures [EP/T026995/1]) and the UK Dstl (through the project of Safety Argument for Learning-enabled Autonomous Underwater Vehicles). Xingyu Zhao and Alec Banks’ contribution to the work is partially supported through Fellowships at the Assuring Autonomy International Programme. We thank Lorenzo Strigini for insightful comments on earlier versions of the paper.

This document is an overview of UK MOD (part) sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the UK MOD, nor should it be assumed that they reflect any current or future UK MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment. Content includes material subject to © Crown copyright (2018), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk.

A KDE bootstrapping

Bootstrapping is a statistical approach that estimates a sampling distribution by random resampling. We sample with replacement from the original data points $(X, Y)$ to obtain a new bootstrap dataset $(X^b, Y^b)$ and train the KDE on the bootstrap dataset. Assume the bootstrap process is repeated $B$ times, leading to $B$ bootstrap KDEs, denoted as $\widehat{Op}^{(1)}(x), \ldots, \widehat{Op}^{(B)}(x)$. Then we can estimate the variance of $\widehat{Op}(x)$ by the sample variance of the bootstrap KDEs [Chen, 2017]:

$$\hat{\sigma}_B^2(x) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \widehat{Op}^{(b)}(x) - \hat{\mu}_B(x) \right)^2$$

where $\hat{\mu}_B(x)$ can be approximated by the sample mean of the bootstrap KDEs:

$$\hat{\mu}_B(x) = \frac{1}{B} \sum_{b=1}^{B} \widehat{Op}^{(b)}(x)$$
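The bootstrap procedure above can be sketched in a few lines. This is a simplified illustration under stated assumptions: a one-dimensional Gaussian KDE with a fixed bandwidth is written out by hand in place of the KDE over $(X, Y)$, and the names `gaussian_kde` and `bootstrap_kde_variance` are introduced here rather than taken from the prototype tool.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def gaussian_kde(data: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a 1-D Gaussian KDE fitted to data with a fixed bandwidth."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


def bootstrap_kde_variance(data: Sequence[float], x: float, bandwidth: float,
                           n_boot: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Return the bootstrap mean and sample variance of the KDE value at x."""
    rng = random.Random(seed)
    estimates: List[float] = []
    for _ in range(n_boot):
        # Sample with replacement to form one bootstrap dataset.
        resample = rng.choices(data, k=len(data))
        estimates.append(gaussian_kde(resample, bandwidth)(x))
    mean = sum(estimates) / n_boot
    # Sample variance with the 1/(B-1) normalisation.
    var = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return mean, var


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
mean, var = bootstrap_kde_variance(data, x=0.0, bandwidth=0.4, n_boot=100)
```

The returned variance quantifies how much the density estimate at `x` would change under resampling of the data, which is the uncertainty in the OP estimate that the appendix is concerned with. A multivariate KDE would follow the same loop with only the density function replaced.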

References

[Backurs et al., 2019] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In NeurIPS'19, pages 15773-15782, 2019.

[Bai et al., 2021] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. Recent Advances in Adversarial Training for Adversarial Robustness. In IJCAI'21, 2021.

[Bergstra and Bengio, 2012] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. of Machine Learning Research, 13(2):281-305, 2012.

[Bishop et al., 2011] Peter Bishop, Robin Bloomfield, Bev Littlewood, Andrey Povyakalo, and David Wright. Toward a formalism for conservative claims about the dependability of software-based systems. IEEE Transactions on Software Engineering, 37(5):708-717, 2011.

[Bloomfield et al., 2019] Robin Bloomfield, Heidy Khlaaf, Philippa Ryan Conmy, and Gareth Fletcher. Disruptive innovations and disruptive assurance: Assuring machine learning and autonomy. Computer, 52(9):82-89, 2019.

[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards Evaluating the Robustness of Neural Networks. In IEEE Symp. on Security and Privacy (SP), pages 39-57, San Jose, CA, USA, 2017. IEEE.

[Chen, 2017] Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161-187, 2017.

[Guerriero et al., 2021] Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. Operation is the hardest teacher: estimating DNN accuracy looking for mispredictions. In ICSE'21, Madrid, Spain, 2021.

[Guerriero, 2020] Antonio Guerriero. Reliability Evaluation of ML systems, the oracle problem. In ISSREW'20, pages 127-130, Coimbra, Portugal, 2020. IEEE.

[Hamlet and Taylor, 1990] Hamlet and Taylor. Partition testing does not inspire confidence. IEEE Tran. on Software Engineering, 16(12):1402-1411, 1990.

[Harel-Canada et al., 2020] Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. Is neuron coverage a meaningful measure for testing deep neural networks? In ESEC/FSE'20, pages 851-862, 2020.

[Huang et al., 2017] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In CAV'17, pages 3-29. Springer, 2017.

[Huang et al., 2020] Xiaowei Huang, Daniel Kroening, Wenjie Ruan, et al. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review, 37:100270, 2020.

[Huang et al., 2021] Wei Huang, Youcheng Sun, Xingyu Zhao, James Sharp, Wenjie Ruan, Jie Meng, and Xiaowei Huang. Coverage guided testing for recurrent neural networks. IEEE Tran. on Reliability, 2021. In press.

[Kalra and Paddock, 2016] Nidhi Kalra and Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182-193, 2016.

[Katz et al., 2019] Guy Katz, Derek A. Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, David L. Dill, Mykel J. Kochenderfer, and Clark Barrett. The Marabou Framework for Verification and Analysis of Deep Neural Networks. In CAV'19, volume 11561 of LNCS, pages 443-452, Cham, 2019. Springer.

[Lane et al., 2016] David Lane, David Bisset, Rob Buckingham, Geoff Pegman, and Tony Prescott. New foresight review on robotics and autonomous systems. Technical Report No. 2016.1, LRF, 2016.

[Lee et al., 2019] Jasper CH Lee, Jerry Li, Christopher Musco, Jeff M Phillips, and Wai Ming Tai. Finding the mode of a kernel density estimate. arXiv preprint arXiv:1912.07673, 2019.

[Li et al., 2019] Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. Boosting operational DNN testing efficiency through conditioning. In ESEC/FSE'19, pages 499-509. ACM, 2019.

[Littlewood and Strigini, 2000] Bev Littlewood and Lorenzo Strigini. Software reliability and dependability: A roadmap. In ICSE 2000, pages 175-188, 2000.

[Miller et al., 1992] Keith W. Miller, Larry J. Morell, Robert E. Noonan, Stephen Park, David M. Nicol, Branson W. Murrill, and M. Voas. Estimating the probability of failure when testing reveals no failures. IEEE Transactions on Software Engineering, 18(1):33-43, 1992.

[Musa, 1993] John Musa. Operational profiles in software-reliability engineering. IEEE Software, 10(2):14-32, 1993.

[Pietrantuono et al., 2020] Roberto Pietrantuono, Peter Popov, and Stefano Russo. Reliability assessment of service-based software under operational profile uncertainty. Reliability Engineering & System Safety, 204:107193, 2020.

[Qahtan et al., 2017] Abdulhakim Qahtan, Suojin Wang, and Xiangliang Zhang. KDE-Track: An Efficient Dynamic Density Estimator for Data Streams. IEEE Transactions on Knowledge and Data Engineering, 29(3):642-655, 2017.

[Robu et al., 2018] Valentin Robu, David Flynn, and David Lane. Train robots to self-certify as safe. Nature, 553(7688):281-281, 2018.

[Scott, 2015] David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.

[Silverman, 1986] Bernard W Silverman. Density estimation for statistics and data analysis, volume 26. CRC press, 1986.

[Strigini and Littlewood, 1997] Lorenzo Strigini and Bev Littlewood. Guidelines for statistical testing. Technical report, City, University of London, 1997.

[Strigini and Povyakalo, 2013] Lorenzo Strigini and Andrey Povyakalo. Software fault-freeness and reliability predictions. In SafeComp'13, volume 8153 of LNCS, pages 106-117, Berlin, Heidelberg, 2013. Springer.

[Walter and Augustin, 2009] Gero Walter and Thomas Augustin. Imprecision and prior-data conflict in generalized Bayesian inference. Journal of Statistical Theory & Practice, 3(1):255-271, 2009.

[Webb et al., 2019] Stefan Webb, Tom Rainforth, Yee Whye Teh, and M. Pawan Kumar. A statistical approach to assessing neural network robustness. In ICLR'19, New Orleans, LA, USA, 2019.

[Weng et al., 2019] Lily Weng, Pin-Yu Chen, Lam Nguyen, Mark Squillante, Akhilan Boopathy, Ivan Oseledets, and Luca Daniel. PROVEN: Verifying robustness of neural networks with a probabilistic approach. In ICML'19, volume 97, pages 6727-6736. PMLR, 2019.

[Yang et al., 2020] Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika Chaudhuri. A Closer Look at Accuracy vs. Robustness. In NeurIPS'20, Vancouver, Canada, 2020.

[Zhang et al., 2020] J. M. Zhang, M. Harman, L. Ma, and Y. Liu. Machine learning testing: Survey, landscapes and horizons. IEEE Tran. on Software Engineering, 2020.

[Zhao et al., 2019] Xingyu Zhao, Valentin Robu, David Flynn, Fateme Dinmohammadi, Michael Fisher, and Matt Webster. Probabilistic model checking of robots deployed in extreme environments. In AAAI'19, volume 33, pages 8076-8084, 2019.

[Zhao et al., 2020a] Xingyu Zhao, Alec Banks, James Sharp, Valentin Robu, David Flynn, Michael Fisher, and Xiaowei Huang. A safety framework for critical systems utilising deep neural networks. In SafeComp'20, volume 12234 of LNCS, pages 244-259. Springer, 2020.

[Zhao et al., 2020b] Xingyu Zhao, Radu Calinescu, Simos Gerasimou, Valentin Robu, and David Flynn. Interval change-point detection for runtime probabilistic model checking. In ASE'20, pages 163-174. IEEE/ACM, 2020.

[Zhao et al., 2020c] Xingyu Zhao, Kizito Salako, Lorenzo Strigini, Valentin Robu, and David Flynn. Assessing safety-critical systems from operational testing: A study on autonomous vehicles. Information and Software Technology, 128:106393, 2020.

[Zhao et al., 2021] Xingyu Zhao, Wei Huang, Sven Schewe, Yi Dong, and Xiaowei Huang. Detecting operational adversarial examples for reliable deep learning. In 51st Annual IEEE-IFIP Int. Conf. on Dependable Systems and Networks (DSN'21), volume Fast Abstract, 2021.

[Zhong et al., 2021] Ziyuan Zhong, Yuchi Tian, and Baishakhi Ray. Understanding Local Robustness of Deep Neural Networks under Natural Variations. In FASE'21, pages 313-337, 2021.