Assessing the Reliability of Deep Learning Classifiers Through Robustness Evaluation and Operational Profiles Xingyu Zhao1 , Wei Huang1 , Alec Banks2 , Victoria Cox2 , David Flynn3 , Sven Schewe1 and Xiaowei Huang1 1 University of Liverpool, Liverpool, L69 3BX, U.K. 2 Defence Science and Technology Laboratory, Salisbury, SP4 0JQ, U.K. 3 Heriot-Watt University, Edinburgh, EH14 4AS, U.K. {xingyu.zhao,w.huang23,sven.schewe,xiaowei.huang}@liverpool.ac.uk {abanks,vcox}@dstl.gov.uk, d.flynn@hw.ac.uk Abstract the dependable use of AI/DL in critical applications [Robu et al., 2018] and, more importantly, to assess and demonstrate The utilisation of Deep Learning (DL) is advancing the dependability for certification and regulation. into increasingly more sophisticated applications. For traditional systems, safety and reliability analysis is While it shows great potential to provide trans- guided by established standards, and supported by mature de- formational capabilities, DL also raises new chal- velopment processes and verification and validation (V&V) lenges regarding its reliability in critical functions. tools and techniques. The situation is different for systems In this paper, we present a model-agnostic reliabil- that utilise DL: they require new and advanced analysis re- ity assessment method for DL classifiers, based on flective of the complex requirements in their safe and reliable evidence from robustness evaluation and the oper- function. Such analysis also needs to be tailored to fully eval- ational profile (OP) of a given application. We par- uate the inherent character of DL [Bloomfield et al., 2019], tition the input space into small cells and then “as- despite the progress made recently [Huang et al., 2020]. semble” their robustness (to the ground truth) ac- DL classifiers are subject to robustness concerns, reliabil- cording to the OP, where estimators on the cells’ ity models without considering robustness evidence are not robustness and OPs are provided. Reliability es- convincing. Reliability, as a user-centred property, depends timates in terms of the probability of misclassifi- on the end-users’ behaviours [Littlewood and Strigini, 2000]. cation per input (pmi) can be derived together with The operational profile (OP) information (quantifying how confidence levels. A prototype tool is demonstrated the software will be operated [Musa, 1993]) should therefore with simplified case studies. Model assumptions be explicitly modelled in the assessment. However, to the best and extension to real-world applications are also of our knowledge, there is no dedicated reliability assessment discussed. While our model easily uncovers the in- model (RAM) taking into account both the OP and robustness herent difficulties of assessing the DL dependabil- evidence, which motivates this research. ity (e.g. lack of data with ground truth and scalabil- ity issues), we provide preliminary/compromised In [Zhao et al., 2020a], we propose a safety case frame- solutions to advance in this research direction. work tailored for DL, in which we describe an initial idea of combining robustness verification and operational testing for reliability claims. In this paper, we implement this idea as a 1 Introduction RAM, inspired by partition-based testing [Hamlet and Taylor, 1990], operational-profile testing [Strigini and Littlewood, Industry is adopting increasingly more advanced big data 1997; Zhao et al., 2020c] and DL robustness evaluation [Car- analysis methodologies to enhance the operational perfor- lini and Wagner, 2017; Webb et al., 2019]. It is model- mance, safety, and lifespan of their products and services. agnostic and designed for pretrained DL models, yielding up- For many products and systems high in-service reliability and per bounds on the probability of miss-classifications per input safety are key targets to ensure customer satisfaction and reg- (pmi)1 with confidence levels. Although our RAM is theoret- ulatory compliance, respectively. AI and Deep Learning (DL) ically sound, we discover some issues in our case studies (e.g. have steadily grown in interest and applications. Key indus- scalability and lack of data) that we believe represent the in- trial foresight reviews have identified that the biggest obstacle herent difficulties of assessing/assuring DL dependability. to reap the benefits of DL-powered robots is the assurance and The key contributions of this work are: regulation of their safety and reliability [Lane et al., 2016]. Thus, there is an urgent need to develop methods to enable a) A first RAM for DL classifiers based on the OP infor- mation and robustness evidence. Copyright © 2021 for this paper by its authors. Use permitted 1 under Creative Commons License Attribution 4.0 International (CC This reliability measure is similar to the conventional probabil- BY 4.0). ity of failure per demand (pfd), but retrofitted for classifiers. b) Discussions on model assumptions and extension to erwise. The Op(x) returns the probability that x is the next real-world applications, highlighting the inherent difficulties random input, the OP [Musa, 1993], a notion used in soft- of assessing DL dependability uncovered by our model. ware engineering to quantify how the software will be oper- b) A prototype tool2 of our RAM with preliminary and ated. Mathematically, the OP is a probability density function compromised solutions to those uncovered difficulties. (PDF) defined over X . Related Work In recent years, there has been extensive Assuming independence between successive inputs de- efforts in verifying DL robustness, evaluating generalisa- fined in our pmi, we may use the Bernoulli process as the tion errors, and detecting adversarial examples (AEs). They mathematical abstraction of the failure process (common for are normally based on formal methods [Huang et al., 2017; such “on-demand” type of systems), which implies a Bino- Katz et al., 2019] or statistical approaches [Webb et al., 2019; mial likelihood. Normally for traditional software, upon es- Weng et al., 2019]. A comprehensive review of those tech- tablishing the likelihood, RAMs on estimating λ vary case niques can be sourced from recent survey papers [Huang et by case—from the basic Maximum Likelihood Estimation al., 2020; Zhang et al., 2020]. To the best of our knowledge, (MLE) to Bayesian estimators tailored for certain scenar- the only papers on testing DL for assessment within an oper- ios when, e.g., seeing no failure [Bishop et al., 2011], in- ational context are [Li et al., 2019; Guerriero et al., 2021]. In ferring ultra-high reliability [Zhao et al., 2020c], with cer- [Li et al., 2019], novel stratified sampling methods are used to tain forms of prior knowledge like perfectioness [Strigini improve the operational testing efficiency. Similarly, [Guer- and Povyakalo, 2013], and with vague prior knowledge that riero et al., 2021] presents a sampling method from the oper- expressed in imprecise probabilities [Walter and Augustin, ational dataset leveraging “auxiliary information for misclas- 2009; Zhao et al., 2019]. sification”, so that it provides unbiased statistical assessment OP based RAMs designed for traditional software fail to while exposing as many misclassifications as possible. How- consider new characteristics of DL, e.g., unrobustness and ever, neither of them considers robustness evidence in their high-dimensional input space. Specifically, it is quite hard to assessment models. have the required prior knowledge in those Bayesian RAMs. At the higher level of whole-systems utilising DL, although While frequentist RAMs would require a large sample size to there are RAMs based on operational data, knowledge from gain enough confidence in the estimates due to the extremely low-level DL components is usually ignored, e.g., [Kalra and large population size (high-dimensional pixel space), espe- Paddock, 2016]. In [Zhao et al., 2020c], we improved [Kalra cially for a high-reliable DL model where misclassifications and Paddock, 2016] by providing a Bayesian mechanism to are rare-events. As an example, the usual accuracy testing of combine such knowledge, but did not show where to obtain DL classifiers is essentially an MLE estimate against the test the knowledge. In that sense, this paper is also a follow up of set. It not only assumes the test set statistically represents the [Zhao et al., 2020c], forming the prior knowledge required. OP (our Assumption 3 later), but also requires a large number of samples to claim high reliability with sufficient confidence. Organisation of the paper We first present preliminaries on OP-based software reliability assessment and DL robust- 2.2 DL Robustness and the R-Separation Property ness. Then Section 3 describes the RAM in details with a run- DL is known not to be robust. Robustness requires that the ning example. We conduct case studies in Section 4, while decision of the DL model M is invariant against small per- discuss the model assumptions and extensions in Section 5. turbations on inputs. That is, all inputs in a region η ⊂ X Finally, we conclude in Section 6 with future work. have the same prediction label, where usually the region η is a small norm ball (in a Lp -norm distance4 ) of radius  around 2 Preliminaries an input x. Inside η, if an input x0 is classified differently to x 2.1 OP Based Software Reliability Assessment by M, then x0 is an AE. Robustness can be defined either as The delivered reliability, as a user-centred and probabilistic a binary metric (if there exists any adversarial example in η) property, requires to model the end-users’ behaviours (in the or as a probabilistic metric (how likely the event of seeing an running environments) and to be formally defined by a quan- adversarial example in η is). The former aligns with formal titative metric [Littlewood and Strigini, 2000]. Without loss verification, e.g. [Huang et al., 2017], while the latter is nor- of generality, we focus on pmi as a generic metric for DL mally used in statistical approaches, e.g. [Webb et al., 2019]. classifiers, where inputs are, e.g., facial images uploaded by The former “verification approach” is the binary version of users for facial recognition. We discuss later how pmi can be the latter “stochastic approach”5 . Similar to [Webb et al., 2019], we adopt the more general redefined to cope with real-world applications like traffic sign probabilistic definition on the robustness of the model M (in detection. If we denote the unknown pmi as a variable λ, then a region η and to a target label y): Z X λ := I{x causes a misclassification} (x)Op(x) dx (1) RM (η, y) := I{M(x) predicts label y} (x) × Op(x | x ∈ η) (2) x∈η x∈X where x is an input in the input domain3 X , and IS is an where Op(x | x ∈ η) is the conditional OP of region η (pre- indicator function—it is equal to 1 when S is true and 0 oth- cisely the “input model” defined in [Webb et al., 2019] and 2 4 Available at https://github.com/havelhuang/ReAsDL. Distance mentioned in this paper is defined in L∞ . 3 5 We assume continuous X in this paper. For discrete X , the inte- Thus, we use the more general term robustness “evaluation” gral in Eq. (1) reduces to sum and OP is a probability mass function. rather than robustness “verification” throughout the paper. also used in [Weng et al., 2019]). is miss-classified, i.e. pmi. If the 2D-points represent traf- We highlight the follow two remarks regarding robustness: fic lights, then we have 2 types of misclassifications—safety- critical ones when red data-point is labelled green, and per- Remark 1 (astuteness). Reliability assessment only concerns formance related otherwise. For brevity, we only focus on the robustness to the ground truth label, rather than an arbi- misclassifications here, while our RAM can cope with sub- trary label y in RM (η, y). When y is such a ground truth, types of misclassifications. robustness becomes astuteness [Yang et al., 2020], which is also the conditional reliability in the region η. Astuteness is a special case of robustness6 . An extreme ex- ample showing why we introduce the concept of astuteness is: a perfectly robust classifier that always outs “dogs” for any given input is unreliable. Thus, robustness evidence can- not directly support reliability claims unless the ground truth label is used in RM (η, y). Remark 2 (r-separation). For real-world image datasets, any data-points with different ground truth are at least dis- tance 2r apart in the input space X (i.e., pixel space), and r is bigger than usual norm ball radius in robustness studies. Figure 2: The 2D-point dataset (lhs), and its approximated OP (rhs). The r-separation property was first observed by [Yang et al., 2020]: real-world image datasets studied by the authors implies that r is normally 3 ∼ 7 times bigger than the radius 3.2 The Model (denoted ) of norm balls commonly used in robustness stud- The Framework Inspired by [Pietrantuono et al., 2020], ies. Intuitively it says that, although the classification bound- the general idea of our RAM is to partition the input domain ary is highly non-linear, there is a minimum distance between into m small cells, subject to the r-separation property. Then, two real-world objects of different classes (cf. Figure 1 for a for each cell ci (with a single ground truth yi ), we estimate: conceptual illustration). Moreover, such minimum distance X is bigger than the usual norm ball size in robustness studies. λi := 1 − RM (ci , yi ) and Opi := Op(x) (3) x∈ci which are the unastuteness and pooled OP, respectively, es- timates of the cell ci —we introduce estimators for both later. Then Eq. (1) can be written as the weighted sum of the cell-wise unastuteness (i.e. the conditional pmi of each cell8 ) where the weights are the pooled OP of cells: m X λ= Opi λi (4) i=1 Eq. (4) represents an ideal case in which we know those λi s and Opi s with certainty. In practice, we can only estimate Figure 1: Illustration of the r-separation property. them with imperfect estimators yielding, e.g., a point estimate with variance capturing the measure of trust. To propagate the confidence in the estimates of λi s and Opi s, we assume: Assumption 1. All λi s and Opi s are independent unknown 3 A RAM for Deep Learning Classifiers variables under estimations. 3.1 The Running Example Then, the estimate of λ and its variance are: m m To better demonstrate our RAM, we take the Challenge of X X E[λ] = E[λi Opi ] = E[λi ]E[Opi ] (5) AI Dependability Assessment raised by the Siemens Mobil- i=1 i=1 ity7 as a running example. Basically, the challenge is to m firstly train a DL model to classify a dataset generated on X V[λ] = V[λi Opi ] the unit square [0, 1]2 according to some unknown distribu- i=1 tion. The collected data-points (training set) are shown in Xm Figure 2 (lhs). Then we need to build a RAM to claim an = E[λi ]2 V[Opi ] + E[Opi ]2 V[λi ] + V[λi ]V[Opi ] (6) upper bound on the probability that the next random point i=1 6 Note that, for the variance, the covariance terms are dropped Thus, later in this paper, we may refer robustness to astuteness out due to the independence assumption. for brevity when it is clear from the context. 7 8 https://ecosystem.siemens.com/ai-da-sc/ We use “cell unastuteness” and “cell pmi” interchangeably later. Depending on the specific estimators adopted, certain para- bandwidth, cf. [Silverman, 1986] for guidelines on tuning h. metric families of the distribution of λ can be assumed, from The approximated OP10 is shown in Figure 2 (rhs). which any quantile of interest (e.g. 95%) can be derived as Since our cells are small and all equal size, instead of cal- our confidence bound in reliability. For instance, as readers R culating x∈ci Op(x)dx, c we may approximate Opi as will see later, we may assume λ ∼ N (E[λ], V[λ]) since all λi s and Opi s are normal distributed variables after applying Op c i = Op c (xc ) vc (9) i the Central Limit Theorem (CLT) in our chosen estimators. Then, an upper bound with 1 − α confidence is where Op(xc c ) is the probability density at the cell’s central i p point xci , and vc is the constant cell volume (1.6e−5 in the Ub 1−α = E[λ] + z1−α V[λ] (7) running example). x−X where P r(Z ≤ z1−α ) = 1 − α, and Z ∼ N (0, 1) is a stan- Now if we introduce new variables Wj = h1 K( h j ), dard normal distribution. the KDE evaluated at x is actually the sample mean of σ2 Now the the problem is reduced to how to obtain the esti- W1 , . . . , Wn . Then by CLT, we have Op(x) c ∼ N (µW , nW ) mates E[λi ]s and V[λi ]s, for which we will discuss as follows where the mean and variance of Op(x) c are known results: referring to the running example. n Partition of the Input Domain X As per Remark 1, the 1 X x − Xj E[Op(x)] c = K( ) (10) astuteness evaluation of a cell requires its ground truth label. nh j=1 h To leverage the r-separation property and Assumption 4, we f (x) K 2 (u)du R partition the input space by choosing a cell radius  so that 1 2 V[Op(x)] = c + O( ) ≈ σ̂B (x) (11)  < r. Although we concur with Remark 2 (first observed nh nh by [Yang et al., 2020]) and believe that there should exist an r-stable ground truth (which means that the ground truth where the last step of Eq. (11) says that V[Op(x)] c can be 2 is stable in such a cell) for any real-world DL classification approximated using a bootstrap variance σ̂B (x) [Chen, 2017] applications, it is hard to estimate such an r (denoted by r̂) (cf. the Appendix A for details). and the best we can do is to assume: Upon establishing Eq.s (10) and (11), together with Eq. (9), we know for a given cell ci (knowing its central point xci ): Assumption 2. There is a r-stable ground truth (as defined in Remark 2) for any real-world classification problems, and E[Opi ] = vc E[Op(x c c )], i V[Opi ] = vc2 V[Op(x c c )] (12) i it can be sufficiently estimated from the existing dataset. which are the cell OP estimates for Eq.s (5) and (6). That said, we get r̂ = 0.004013 by iteratively calculating the minimum distance of different labels in the running ex- Cell Astuteness Evaluation As a corollary of Remark 2 ample. Then we choose a cell radius9  = 0.004 and partition and Assumption 2, we may confidently assume: the unit square X into 250 × 250 cells. Assumption 4. If the radius of ci is smaller than r, all data- points in the region ci share a single ground truth label. Cell OP Approximation Given a dataset (X, Y ), we esti- mate the pooled OP of cell ci to get E[Opi ] and V[Opi ]. We Now, to determine the ground truth label of a cell ci , we use the well-established Kernel Density Estimation (KDE) to can classify our cells into three types: fit a Op(x) c to approximate the OP. a) Normal cells: a normal cell contains data-points sharing a same ground truth label, which is then determined as the Assumption 3. The existing dataset (X, Y ) are randomly ground truth label of the cell. sampled from the OP, thus statistically represents the OP. b) Empty cells: a cell is “empty” in the sense that no data- This assumption may not hold in practice: training data is point that has been observed in it. Due to the lack of data, it is normally collected in a balanced way, since the DL model hard to determine an empty cell’s ground truth. For now, we is expected to perform well in all categories of inputs, espe- do voting based on the predicted labels (by the DL model) of cially when the OP is unknown at the time of training and/or random samples from the cell, assuming: expected to change in future. Although our model can re- Assumption 5. The accuracy of the DL model is better than lax this assumption (cf. Section 5), we adopt it for brevity in a classifier doing random classifications in any given cell. demonstrating the running example. Essentially the above assumption relates to the oracle prob- Then given a set of (unlabelled) data-points (X1 , . . . , Xn ) lem of DL testing, that we see some recent efforts, e.g. [Guer- from the existing dataset (X, Y ), KDE yields riero, 2020], may relax it. 1 X n x − Xj c) Cross-boundary cells: our estimate on r is imperfect, Op(x) c = K( ) (8) thus we may still observe data-points with different labels in nh j=1 h one cell. Such cells are crossing the classification boundary. If our estimate on r is sufficiently accurate, they should be where K is the kernel function (e.g. Gaussian or exponen- very rare. Thus, without the need to determine the ground tial kernels), and h > 0 is a smoothing parameter called the 10 With a Gaussian kernel and h = 0.2 that optimised by cross- 9 Radius in L∞ which is the side length of our square cell in L2 . validated grid-search [Bergstra and Bengio, 2012]. truth label of a cross boundary cell, we simply and conserva- In the running example, we first observe that the ACU tively set the cell unastuteness to 1. is much lower than the test error, meaning the underlying So far, the problem is reduced to: given a normal or empty DL model is a robust one. Since our RAM is mainly based cell ci with the known ground truth label yi , evaluate the on the robustness evidence, its results are close to ACU but miss-classification probability upon a random input x ∈ ci , not exactly the same because of the nonuniform OP, cf. Fig- i.e. E[λi ] and its variance V[λi ]. This is essentially a sta- ure 2 (rhs). Moreover, from Figure 2 (lhs), we know the clas- tistical problem that has been studied in [Webb et al., 2019] sification boundary is near the middle of the unit square input using Multilevel Splitting Sampling, while we use the Simple space where misclassifications tend to happen (say “buggy Monte Carlo method for brevity in the running example: area”), which is also the high density area on the OP. Thus, n the contribution to unreliability from the “buggy area” is 1X weighted higher by the OP, which explains why our RAM re- λ̂i = I{M (xj )6=yi } n j=1 sults are worse than ACU. In contrast, because of the “flat” OP in the DS-1 (cf. Figure 3 (lhs)), our RAM results are 2 The CLT tells us λ̂i ∼ N (µ, σn ), when n is large, where very close to the ACU. With more dense data in DS-2, the r-distance is much smaller and leads to more cells. Thanks to µ and σ 2 are population mean and variance of I{M(xj )6=yi } the rich data in this case, all three results are more consistent. that can be approximated with sample mean µ̂n and sample We note that, given the nature of the three 2D-point datasets, variance σ̂n2 /n. Finally, we can get DL models trained on them are much more robust than image 1X n datasets. This is why all ACUs are better than test errors, and E[λi ] = µ̂n = I{M(xj )6=yi } (13) our RAM finds a middle point representing reliability accord- n j=1 ing to the OP. Later we apply the RAM on two unrobust DL n models trained on image-datasets where the ACUs are worse σ̂n2 1 X V[λi ] = = (I{M(xj )6=yi } − µ̂n )2 (14) than test error; it confirms our aforementioned observations. n (n − 1)n j=1 To gain insights on how to extend our method for high- dimensional/real-world datasets, we also conduct experi- Notably, to solve the above statistical problem with sam- ments on the popular MNIST and CIFAR10 datasets. Instead pling methods, we need to assume how the inputs in the of implementing the exact steps in Section 3.2, we take a few cell are distributed, i.e., a distribution for the conditional OP compromised solutions to tackle the scalability issues raised Op(x | x ∈ ci ). Without loss of generality, we assume: by “the curse of dimensionality”. We articulate these steps in Assumption 6. The inputs in a small region like cells are the following paragraph, while detailed discussions on their uniformly distributed. impact on our results are presented in Section 5. which is not uncommon (e.g., in [Webb et al., 2019; Weng et First, we train Variational Auto-Encoders (VAE) on the al., 2019]) and can be easily replaced by other distributions if MNIST and CIFAR10 datasets and project all inputs into the there is supporting evidence for such action. low dimensional latent spaces of VAE (with 8 and 16 di- mensions respectively). Then we apply the proposed RAM on the compressed dataset, i.e., partitioning the latent space, 4 Case Studies learning the OP in latent space and evaluating the “latent- In addition to the running example, we conduct experiments cell unastuteness”. Astuteness (a special case of robustness), on two synthetic datasets as shown in Figure 3, representing by definition is associated with the input pixel space. By the scenarios with sparse and dense training data respectively. “latent-cell unastuteness”, we mean the average unastuteness All modelling details and results after applying our RAM of norm balls (in the input space) around a large number of on those three datasets are summarised in Table 1, based on samples from a “latent-cell”. The norm ball radius is deter- which we compare the testing error, Average Cell Unastute- mined by the r-separation distance in the input space. Taking ness (ACU) and our RAM results (E[λ] and U b97.5% ). the computational cost into consideration, we rank the OP of all latent-cells, and choose the top k cells with highest OP for astuteness evaluation. We adopt the existing robustness estimator in [Webb et al., 2019], where the authors omitted the result of V[λi ]; we therefore also omit the variance in our experiments for simplicity. 5 Discussions In this section, we summarise the model assumptions made in our RAM, and discuss if/how they can be validated and what new-assumptions/compromised-solutions are needed to cope with high-dimensional/real-world applications. Fi- Figure 3: Synthetic datasets DS-1 (lhs) and DS-2 (rhs) representing nally, we list the inherent difficulties of assessing DL un- relatively sparse and dense training data respectively. covered by our RAM. Table 1: The RAM details and results. For image datasets, the r,  and # are associated with latent spaces. Time is in seconds per cell. train/test error r-separation cell radius  # of cells ACU E[λ] V[λ] U b97.5% time The run. exp. 0.0005/0.0180 0.004013 0.004 250 × 250 0.002982 0.004891 0.000004 0.004899 0.04 Synth. DS-1 0.0037/0.0800 0.004392 0.004 250 × 250 0.008025 0.008290 0.000014 0.008319 0.03 Synth. DS-2 0.0004/0.0079 0.002001 0.002 500 × 500 0.004739 0.005249 0.000002 0.005252 0.04 MNIST 0.0051/0.0235 0.1003 0.100 top-170000 0.106615 0.036517 / / 0.43 CIFAR10 0.0199/0.0853 0.1947 0.125 top-23000 0.238138 0.234419 / / 6.74 Independent λi s and Opi s As per Assumption 1, we as- fit a density function over the input space from an “opera- sume all λi s and Opi s are independent when “assembling” tional dataset” (representing the OP). Data-points in this set their estimates via Eq. (5) and deriving the variance via can be unlabelled raw data generated from historical data of Eq. (6). Largely this assumption is for the mathematical previous applications, simulations and manually scaled based tractability when propagating the confidence in individual es- on expert knowledge. Obtaining such operational dataset timates at the cell-level to the whole system pmi. Although is an application-specific engineering problem, and tractable this independence assumption is hard to justify in practice, thanks to the fact that it does not require labelled data. it is not unusual in reliability models that use partition, e.g. Notably, the OP may also be approximated at runtime in [Pietrantuono et al., 2020; Miller et al., 1992]. We be- based on the data stream of operational data. Efficient KDE lieve that RAMs are still useful as a first approximation under for data streams [Qahtan et al., 2017] can be used. If the OP this assumption, while we envisage that Bayesian estimators was subject to sudden changes, change-point detectors like leveraging joint priors and conjugacy may relax it. [Zhao et al., 2020b] should also be paired with the runtime R-separation and its estimation Assumption 2 derives estimator to robustly approximate the OP. from Remark 2. We concur with [Yang et al., 2020] and be- However, we may encounter technical challenges when lieve that, for any real-world DL classification applications fitting the PDF from high-dimensional real-world datasets. where the inputs are data-points with “physical meanings”, There are two known major challenges when applying multi- there should always exist an r-stable ground truth. Such variate KDE to high-dimensional data: i) the choice of band- r varies between applications, and the smaller the r is, the width H represents the covariance matrix that mostly im- harder the inherent difficulty of the classification problem is; pacts the estimation accuracy; ii) scalability issues in terms of i.e., r is a difficulty indicator for the given classification prob- storing intermediate data structure (e.g. data-points in hash- lem. tables) and querying times made when estimating the density For real-world applications, what really determines the la- at a given input. For the first challenge, the optimal calcula- bel of an image are its features rather than pixels. Thus, tion of bandwidth matrix can refer to some rule of thumb [Sil- we envisage some latent space (of, e.g., VAE) capturing verman, 1986; Scott, 2015] and the cross-validation [Bergstra only the feature-wise information can be explored for high- and Bengio, 2012]. While there are dedicated research on dimensional data. That is, we improving the efficiency of multivariate KDE, e.g., [Backurs et al., 2019] presented a framework for multivariate KDE in • first do r-separation based partition in the latent space to provably sub-linear query time with linear space and linear learn the OP; pre-processing time to the dimensions. • then determine the ground truth labels of cells in the la- Determination of the ground truth of a cell Assumptions tent space; 4 and 5 are essentially on how to determine the ground truth • map the learned OP and ground truth labels back to the label for a given cell, that relates to the oracle problem of input pixel space; testing DL [Guerriero, 2020]. While it is still challenging, we partially solve it by leveraging the r-separation property. • do astuteness evaluation in the input pixel space and “as- Thanks to r, it is easy to determine a cell’s ground truth semble” the results according to the OP. when we see it contains labelled data-points. However, for Indeed, it is hard to estimate the r (neither in the input nor an empty cell, it is non-trivial. We assume the overall perfor- the latent space), while the best we can do is to estimate it mance of the DL model is fairly good (e.g., better than a clas- from the existing dataset. One way of solving the problem sifier doing random classifications), thus miss-classifications is to keep monitoring the r estimates as more labelled data is within an empty cell are relatively rare events. Then we can collected, and redo the cell partition when the estimated r has determine the ground truth label of the cell by majority vot- changed significantly. ing of predictions. Indeed, this is a strong assumption when Approximation of the OP Assumption 3 says that the col- there are some “failure regions” in the input space that per- lected dataset statistically represents the OP, which may not form really badly (even worse than random labelling). In this hold for many practical reasons; e.g., the future OP is un- case, we need to invent a new mechanism to detect such “re- certain at the training stage and thus data is collected in a ally bad failure regions” and spend more budget on invoking, balanced way to perform well in all categories of inputs. Al- say, humans to do the labelling. though we demonstrate our RAM under this assumption for Conditional OP of a cell We assume the distribution of simplicity, it can be easily relaxed. Essentially, we try to inputs (i.e., the conditional OP) within each cell is uniform by Assumption 6. Although we conjecture that this is the can be conservatively set to the proportion of robust region common case due to the small size of cells (i.e., those very covered in ci in this case. close/similar inputs within a small region are only subject to We would like to note that the cell robustness estimator noise factors that can be modelled uniformly), the real situa- in our RAM works in a “hot-swappable” manner: any new tion may vary; this requires justification in safety cases. and more efficient robustness estimator can be easily incor- For a real-world dataset, the conditional OP represents cer- porated. Thus, how to improve the efficiency of cell’s robust- tain distributions of “natural variations” [Zhong et al., 2021], ness estimation is out of the scope of our RAM. e.g. lighting conditions, obey certain distributions. The con- Inherent difficulties Finally, based on our RAM and the ditional OP of cells should faithfully capture the distribution discussions above, we summarise the inherent difficulties of of such natural variations. Recent advance on measuring the assessing DL reliability as the following questions: natural/realistic AEs [Harel-Canada et al., 2020] highly re- lates to this assumption and may relax it. • How to accurately build the OP in the high-dimensional input space? Explosion of the number of cells The number of cells to evaluate the astuteness is exponential in the dimensions of • How to build an accurate oracle leveraging the existing data. For high-dimensional data, it is impossible to explore all human-labels in the training dataset? cells in the input space11 as we did for the running example. • What is the local distribution (conditional OP) over a A compromised solution is to find the first k cells that dom- small input region that captures the natural variations of inate the OP. That is, we rank the cells by their pooled OP, and physical conditions? only evaluate the top-k cells where the sum of these k cells’ • How to efficiently evaluate the robustness of a small re- OPs is greater than a threshold, e.g. 99%. Then, we can con- gion given AEs are rare events? servatively set the cell pmi of the rest to a worst-case bound (e.g. 1) or an empirical/average bound based on the first k • How to sample small regions from a large population cells. Certainly, the price to pay is to sacrifice estimation ac- (high-dimensional space) to test robustness in an unbi- curacy. The best we can do for now is to increase the budgets ased and efficient way? for a larger k. Technically, finding the first k cells dominat- We try to provide preliminary/compromised solutions in ing the OP is in fact to calculate the modes of the KDE func- our RAM, while the questions are still challenging in prac- tion. The work of [Lee et al., 2019] gives us a hint on how to tice. We doubt the existence of other DL RAMs with weaker quickly calculate the modes of Gaussian KDE when the data assumptions achieving the same level of rigorousness as ours, dimension is high. at this stage. This discussion relates to the cost of our RAM, thus a per- tinent question is—what is the real cost of conducting DL testing? Is it the the human labour generating labels or timing 6 Conclusion & Future Work constraints? A likely answer is: both. Our RAM has partially In this paper, we present a preliminary RAM for DL clas- solved the former (cf. earlier discussions), while the latter is sifiers. It is the first DL RAM explicitly considers both the less costly nowadays and can be solved by harnessing the fast OP information and robustness evidence. It uncovers some growth of computational power and parallel computing. inherent difficult questions when assessing DL reliability, while preliminary/compromised solutions are discussed, im- Efficiency of cell robustness evaluation We have demon- plemented and demonstrated with case studies. strated via the Simple Monte Carlo method to evaluate cell An intuitive way of perceiving our RAM, comparing with robustness in the running example. It is well-known that the usual accuracy testing, is that we enlarge the testing Simple Monte Carlo is not a computationally efficient tech- dataset with more test cases around “seeds” (original data- nique to estimate rare-events (such as AEs in our case) in points in the test set). We determine the oracle of a new test high-dimensional space. Thus, instead of applying Simple case according to its seed’s label and the r-distance. Those Monte Carlo, the more advanced and efficient sampling ap- enlarged test results form the robustness evidence, and how proach, the Adaptive Multi-level Splitting method [Webb et much they contribute to the overall reliability is proportional al., 2019], has been applied in our case studies on image to its OP. Consequently, exposing to more tests (robustness datasets. We are confident that other statistical sampling evaluation) and being more representative of how it will be methods designed for rare-events may also suffice our need. used (the OP), our RAM is more trustworthy. In addition to the statistical approach, formal method based In line with the gist of our RAM, we believe the DL relia- verification techniques can also be applied to assess a cell’s bility should follow the conceptualised equation of: pmi, e.g. [Huang et al., 2017]. They provide formal guaran- tees on whether the DL model will miss-classify any input DL reliability = generalisability × robustness. inside a small region. Although such “robust region” proved by formal methods is normally smaller than our cells, the λ̂i In a nutshell, when assessing the DL reliability, we should not only concern how it generalises to a new data-point (ac- 11 Although dimension reduction methods like VAE may ease the cording to the future OP), but also the local robustness around problem of learning OP, they cannot reduce the number of cells to it. Align with this insight, indeed, a “naive/over-simplified” be evaluated. Since robustness by definition has to be evaluated in version of our RAM would be averaging all local astuteness the input space. of data-points in the test set, which is less rigorous (e.g., on determining the norm ball size) and requires stronger assump- References tions (e.g., the test set is equal to the operational set). [Backurs et al., 2019] Arturs Backurs, Piotr Indyk, and Tal Improving the scalability of our RAM and experimenting Wagner. Space and time efficient kernel density estimation with more real-world datasets form important future work. in high dimensions. In NeurIPS’19, pages 15773–15782, We presume a trained DL model for our assessment purpose. 2019. A natural question next is how to actually improve the reli- ability when our RAM results are not good enough. As de- [Bai et al., 2021] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, scribed in [Zhao et al., 2021], we plan to investigate DL de- and Qian Wang. Recent Advances in Adversarial Training bug testing (e.g. [Huang et al., 2021]) and retraining methods for Adversarial Robustness. In IJCAI’21, 2021. [Bai et al., 2021], together with the RAM, to form a closed [Bergstra and Bengio, 2012] James Bergstra and Yoshua loop of debugging-improving-assessing. Bengio. Random search for hyper-parameter optimization. J. of Machine Learning Research, 13(2):281–305, 2012. Acknowledgments & Disclaimer [Bishop et al., 2011] Peter Bishop, Robin Bloomfield, Bev This project has received funding from the European Littlewood, Andrey Povyakalo, and David Wright. Toward Union’s Horizon 2020 research and innovation programme a formalism for conservative claims about the dependabil- under grant agreement No 956123. This work is partially sup- ity of software-based systems. IEEE Transactions on Soft- ported by the UK EPSRC (through the Offshore Robotics for ware Engineering, 37(5):708–717, 2011. Certification of Assets [EP/R026173/1] and End-to-End Con- [Bloomfield et al., 2019] Robin Bloomfield, Heidy Khlaaf, ceptual Guarding of Neural Architectures [EP/T026995/1]) Philippa Ryan Conmy, and Gareth Fletcher. Disruptive and the UK Dstl (through the project of Safety Argument innovations and disruptive assurance: Assuring machine for Learning-enabled Autonomous Underwater Vehicles). learning and autonomy. Computer, 52(9):82–89, 2019. Xingyu Zhao and Alec Banks’ contribution to the work is partially supported through Fellowships at the Assuring Au- [Carlini and Wagner, 2017] Nicholas Carlini and David tonomy International Programme. We thank Lorenzo Strigini Wagner. Towards Evaluating the Robustness of Neural for insightful comments on earlier versions of the paper. Networks. In IEEE Symp. on Security and Privacy (SP), This document is an overview of UK MOD (part) spon- pages 39–57, San Jose, CA, USA, 2017. IEEE. sored research and is released for informational purposes [Chen, 2017] Yen-Chi Chen. A tutorial on kernel density es- only. The contents of this document should not be inter- timation and recent advances. Biostatistics & Epidemiol- preted as representing the views of the UK MOD, nor should ogy, 1(1):161–187, 2017. it be assumed that they reflect any current or future UK [Guerriero et al., 2021] Antonio Guerriero, Roberto Pietran- MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements tuono, and Stefano Russo. Operation is the hardest teacher: or liabilities and is offered without prejudice or commit- estimating DNN accuracy looking for mispredictions. In ment. Content includes material subject to © Crown copy- ICSE’21, Madrid, Spain, 2021. right (2018), Dstl. This material is licensed under the terms of [Guerriero, 2020] Antonio Guerriero. Reliability Evaluation the Open Government Licence except where otherwise stated. of ML systems, the oracle problem. In ISSREW’20, pages To view this licence, visit http://www.nationalarchives.gov. 127–130, Coimbra, Portugal, 2020. IEEE. uk/doc/open-government-licence/version/3 or write to the In- [Hamlet and Taylor, 1990] D. Hamlet and R. Taylor. Parti- formation Policy Team, The National Archives, Kew, London tion testing does not inspire confidence. IEEE Tran. on TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk. Software Engineering, 16(12):1402–1411, 1990. A KDE bootstrapping [Harel-Canada et al., 2020] Fabrice Harel-Canada, Lingx- iao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Bootstrapping is a statistical approach to estimate any sam- Miryung Kim. Is neuron coverage a meaningful measure pling distribution by random sampling method. We sample for testing deep neural networks? In ESEC/FSE’20, pages with replacement from the original data points (X, Y ) to ob- 851–862, 2020. tain a new bootstrap dataset (X b , Y b ) and train the KDE on the bootstrap dataset. Assume the bootstrap process is re- [Huang et al., 2017] Xiaowei Huang, Marta Kwiatkowska, peated B times, leading to B bootstrap KDEs, denoted as Sen Wang, and Min Wu. Safety verification of deep neural 1 B networks. In CAV’17, pages 3–29. Springer, 2017. Op c (x), . . . , Op c (x). Then we can estimate the variance of [Huang et al., 2020] Xiaowei Huang, Daniel Kroening, fˆ(x) by the sample variance of the bootstrap KDE [Chen, Wenjie Ruan, and et al. A survey of safety and trustworthi- 2017]: B ness of deep neural networks: Verification, testing, adver- 2 1 X cb sarial attack and defence, and interpretability. Computer σ̂B (x) = (Op (x) − µB )2 B−1 Science Review, 37:100270, 2020. b=1 where the µB can be approximated by [Huang et al., 2021] Wei Huang, Youcheng Sun, Xingyu B Zhao, James Sharp, Wenjie Ruan, Jie Meng, and Xiaowei 1 X cb Huang. Coverage guided testing for recurrent neural net- µ̂B (x) = Op (x). B works. IEEE Tran. on Reliability, 2021. In press. b=1 [Kalra and Paddock, 2016] Nidhi Kalra and Susan M. Pad- [Strigini and Littlewood, 1997] Lorenzo Strigini and Bev dock. Driving to safety: How many miles of driving Littlewood. Guidelines for statistical testing. Technical would it take to demonstrate autonomous vehicle reliabil- report, City, University of London, 1997. ity? Transportation Research Part A: Policy and Practice, [Strigini and Povyakalo, 2013] Lorenzo Strigini and Andrey 94:182 – 193, 2016. Povyakalo. Software fault-freeness and reliability predic- [Katz et al., 2019] Guy Katz, Derek A. Huang, Duligur Ibel- tions. In SafeComp’13, volume 8153 of LNCS, pages 106– ing, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth 117, Berlin, Heidelberg, 2013. Springer. Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, [Walter and Augustin, 2009] Gero Walter and Thomas Au- David L. Dill, Mykel J. Kochenderfer, and Clark Barrett. gustin. Imprecision and prior-data conflict in generalized The Marabou Framework for Verification and Analysis Bayesian inference. Journal of Statistical Theory & Prac- of Deep Neural Networks. In CAV’19, volume 11561 of tice, 3(1):255–271, 2009. LNCS, pages 443–452, Cham, 2019. Springer. [Webb et al., 2019] Stefan Webb, Tom Rainforth, Yee Whye [Lane et al., 2016] David Lane, David Bisset, Rob Bucking- Teh, and M. Pawan Kumar. A statistical approach to as- ham, Geoff Pegman, and Tony Prescott. New foresight sessing neural network robustness. In ICLR’19, New Or- review on robotics and autonomous systems. Technical leans, LA, USA, 2019. Report No. 2016.1, LRF, 2016. [Weng et al., 2019] Lily Weng, Pin-Yu Chen, Lam Nguyen, [Lee et al., 2019] Jasper CH Lee, Jerry Li, Christopher Mark Squillante, Akhilan Boopathy, Ivan Oseledets, and Musco, Jeff M Phillips, and Wai Ming Tai. Finding Luca Daniel. PROVEN: Verifying robustness of neural the mode of a kernel density estimate. arXiv preprint networks with a probabilistic approach. In ICML’19, vol- arXiv:1912.07673, 2019. ume 97, pages 6727–6736. PMLR, 2019. [Yang et al., 2020] Yao-Yuan Yang, Cyrus Rashtchian, [Li et al., 2019] Zenan Li, Xiaoxing Ma, Chang Xu, Chun Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika Cao, Jingwei Xu, and Jian Lü. Boosting operational DNN Chaudhuri. A Closer Look at Accuracy vs. Robustness. testing efficiency through conditioning. In ESEC/FSE’19, In NeurIPS’20, Vancouver, Canada, 2020. pages 499–509. ACM, 2019. [Zhang et al., 2020] J. M. Zhang, M. Harman, L. Ma, and [Littlewood and Strigini, 2000] Bev Littlewood and Lorenzo Y. Liu. Machine learning testing: Survey, landscapes and Strigini. Software reliability and dependability: A horizons. IEEE Tran. on Software Engineering, 2020. roadmap. In ICSE 2000, pages 175–188, 2000. [Zhao et al., 2019] Xingyu Zhao, Valentin Robu, David [Miller et al., 1992] Keith W. Miller, Larry J. Morell, Flynn, Fateme Dinmohammadi, Michael Fisher, and Matt Robert E. Noonan, Stephen K. Park, David M. Nicol, Webster. Probabilistic model checking of robots deployed Branson W. Murrill, and M Voas. Estimating the probabil- in extreme environments. In AAAI’19, volume 33, pages ity of failure when testing reveals no failures. IEEE Trans- 8076–8084, 2019. actions on Software Engineering, 18(1):33–43, 1992. [Zhao et al., 2020a] Xingyu Zhao, Alec Banks, James Sharp, [Musa, 1993] John Musa. Operational profiles in software- Valentin Robu, David Flynn, Michael Fisher, and Xiaowei reliability engineering. IEEE Software, 10(2):14–32, Huang. A safety framework for critical systems utilising 1993. deep neural networks. In SafeComp’20, volume 12234 of LNCS, pages 244–259. Springer, 2020. [Pietrantuono et al., 2020] Roberto Pietrantuono, Peter Popov, and Stefano Russo. Reliability assessment of [Zhao et al., 2020b] Xingyu Zhao, Radu Calinescu, Simos service-based software under operational profile un- Gerasimou, Valentin Robu, and David Flynn. Interval certainty. Reliability Engineering & System Safety, change-point detection for runtime probabilistic model 204:107193, 2020. checking. In ASE’20, pages 163–174. IEEE/ACM, 2020. [Qahtan et al., 2017] Abdulhakim Qahtan, Suojin Wang, and [Zhao et al., 2020c] Xingyu Zhao, Kizito Salako, Lorenzo Xiangliang Zhang. KDE-Track: An Efficient Dynamic Strigini, Valentin Robu, and David Flynn. Assessing Density Estimator for Data Streams. IEEE Transactions on safety-critical systems from operational testing: A study Knowledge and Data Engineering, 29(3):642–655, 2017. on autonomous vehicles. Information and Software Tech- nology, 128:106393, 2020. [Robu et al., 2018] Valentin Robu, David Flynn, and David [Zhao et al., 2021] Xingyu Zhao, Wei Huang, Sven Schewe, Lane. Train robots to self-certify as safe. Nature, Yi Dong, and Xiaowei Huang. Detecting operational ad- 553(7688):281–281, 2018. versarial examples for reliable deep learning. In 51th An- [Scott, 2015] David W Scott. Multivariate density estima- nual IEEE-IFIP Int. Conf. on Dependable Systems and tion: theory, practice, and visualization. John Wiley & Networks (DSN’21), volume Fast Abstract, 2021. Sons, 2015. [Zhong et al., 2021] Ziyuan Zhong, Yuchi Tian, and [Silverman, 1986] Bernard W Silverman. Density estimation Baishakhi Ray. Understanding Local Robustness of Deep for statistics and data analysis, volume 26. CRC press, Neural Networks under Natural Variations. In FASE’21, 1986. pages 313–337, 2021.