Assessing the Reliability of Deep Learning Classifiers Through Robustness
                         Evaluation and Operational Profiles
               Xingyu Zhao1 , Wei Huang1 , Alec Banks2 , Victoria Cox2 , David Flynn3 ,
                                    Sven Schewe1 and Xiaowei Huang1
                             1
                               University of Liverpool, Liverpool, L69 3BX, U.K.
                  2
                    Defence Science and Technology Laboratory, Salisbury, SP4 0JQ, U.K.
                           3
                             Heriot-Watt University, Edinburgh, EH14 4AS, U.K.
                   {xingyu.zhao,w.huang23,sven.schewe,xiaowei.huang}@liverpool.ac.uk
                                {abanks,vcox}@dstl.gov.uk, d.flynn@hw.ac.uk

                          Abstract                                  the dependable use of AI/DL in critical applications [Robu et
                                                                    al., 2018] and, more importantly, to assess and demonstrate
     The utilisation of Deep Learning (DL) is advancing             the dependability for certification and regulation.
     into increasingly more sophisticated applications.                For traditional systems, safety and reliability analysis is
     While it shows great potential to provide trans-               guided by established standards, and supported by mature de-
     formational capabilities, DL also raises new chal-             velopment processes and verification and validation (V&V)
     lenges regarding its reliability in critical functions.        tools and techniques. The situation is different for systems
     In this paper, we present a model-agnostic reliabil-           that utilise DL: they require new and advanced analysis re-
     ity assessment method for DL classifiers, based on             flective of the complex requirements in their safe and reliable
     evidence from robustness evaluation and the oper-              function. Such analysis also needs to be tailored to fully eval-
     ational profile (OP) of a given application. We par-           uate the inherent character of DL [Bloomfield et al., 2019],
     tition the input space into small cells and then “as-          despite the progress made recently [Huang et al., 2020].
     semble” their robustness (to the ground truth) ac-
                                                                       DL classifiers are subject to robustness concerns, reliabil-
     cording to the OP, where estimators on the cells’
                                                                    ity models without considering robustness evidence are not
     robustness and OPs are provided. Reliability es-
                                                                    convincing. Reliability, as a user-centred property, depends
     timates in terms of the probability of misclassifi-
                                                                    on the end-users’ behaviours [Littlewood and Strigini, 2000].
     cation per input (pmi) can be derived together with
                                                                    The operational profile (OP) information (quantifying how
     confidence levels. A prototype tool is demonstrated
                                                                    the software will be operated [Musa, 1993]) should therefore
     with simplified case studies. Model assumptions
                                                                    be explicitly modelled in the assessment. However, to the best
     and extension to real-world applications are also
                                                                    of our knowledge, there is no dedicated reliability assessment
     discussed. While our model easily uncovers the in-
                                                                    model (RAM) taking into account both the OP and robustness
     herent difficulties of assessing the DL dependabil-
                                                                    evidence, which motivates this research.
     ity (e.g. lack of data with ground truth and scalabil-
     ity issues), we provide preliminary/compromised                   In [Zhao et al., 2020a], we propose a safety case frame-
     solutions to advance in this research direction.               work tailored for DL, in which we describe an initial idea of
                                                                    combining robustness verification and operational testing for
                                                                    reliability claims. In this paper, we implement this idea as a
1   Introduction                                                    RAM, inspired by partition-based testing [Hamlet and Taylor,
                                                                    1990], operational-profile testing [Strigini and Littlewood,
Industry is adopting increasingly more advanced big data            1997; Zhao et al., 2020c] and DL robustness evaluation [Car-
analysis methodologies to enhance the operational perfor-           lini and Wagner, 2017; Webb et al., 2019]. It is model-
mance, safety, and lifespan of their products and services.         agnostic and designed for pretrained DL models, yielding up-
For many products and systems high in-service reliability and       per bounds on the probability of miss-classifications per input
safety are key targets to ensure customer satisfaction and reg-     (pmi)1 with confidence levels. Although our RAM is theoret-
ulatory compliance, respectively. AI and Deep Learning (DL)         ically sound, we discover some issues in our case studies (e.g.
have steadily grown in interest and applications. Key indus-        scalability and lack of data) that we believe represent the in-
trial foresight reviews have identified that the biggest obstacle   herent difficulties of assessing/assuring DL dependability.
to reap the benefits of DL-powered robots is the assurance and
                                                                       The key contributions of this work are:
regulation of their safety and reliability [Lane et al., 2016].
Thus, there is an urgent need to develop methods to enable             a) A first RAM for DL classifiers based on the OP infor-
                                                                    mation and robustness evidence.
    Copyright © 2021 for this paper by its authors. Use permitted
                                                                        1
under Creative Commons License Attribution 4.0 International (CC          This reliability measure is similar to the conventional probabil-
BY 4.0).                                                            ity of failure per demand (pfd), but retrofitted for classifiers.
   b) Discussions on model assumptions and extension to                 erwise. The Op(x) returns the probability that x is the next
real-world applications, highlighting the inherent difficulties         random input, the OP [Musa, 1993], a notion used in soft-
of assessing DL dependability uncovered by our model.                   ware engineering to quantify how the software will be oper-
   b) A prototype tool2 of our RAM with preliminary and                 ated. Mathematically, the OP is a probability density function
compromised solutions to those uncovered difficulties.                  (PDF) defined over X .
Related Work In recent years, there has been extensive                     Assuming independence between successive inputs de-
efforts in verifying DL robustness, evaluating generalisa-              fined in our pmi, we may use the Bernoulli process as the
tion errors, and detecting adversarial examples (AEs). They             mathematical abstraction of the failure process (common for
are normally based on formal methods [Huang et al., 2017;               such “on-demand” type of systems), which implies a Bino-
Katz et al., 2019] or statistical approaches [Webb et al., 2019;        mial likelihood. Normally for traditional software, upon es-
Weng et al., 2019]. A comprehensive review of those tech-               tablishing the likelihood, RAMs on estimating λ vary case
niques can be sourced from recent survey papers [Huang et               by case—from the basic Maximum Likelihood Estimation
al., 2020; Zhang et al., 2020]. To the best of our knowledge,           (MLE) to Bayesian estimators tailored for certain scenar-
the only papers on testing DL for assessment within an oper-            ios when, e.g., seeing no failure [Bishop et al., 2011], in-
ational context are [Li et al., 2019; Guerriero et al., 2021]. In       ferring ultra-high reliability [Zhao et al., 2020c], with cer-
[Li et al., 2019], novel stratified sampling methods are used to        tain forms of prior knowledge like perfectioness [Strigini
improve the operational testing efficiency. Similarly, [Guer-           and Povyakalo, 2013], and with vague prior knowledge that
riero et al., 2021] presents a sampling method from the oper-           expressed in imprecise probabilities [Walter and Augustin,
ational dataset leveraging “auxiliary information for misclas-          2009; Zhao et al., 2019].
sification”, so that it provides unbiased statistical assessment           OP based RAMs designed for traditional software fail to
while exposing as many misclassifications as possible. How-             consider new characteristics of DL, e.g., unrobustness and
ever, neither of them considers robustness evidence in their            high-dimensional input space. Specifically, it is quite hard to
assessment models.                                                      have the required prior knowledge in those Bayesian RAMs.
   At the higher level of whole-systems utilising DL, although          While frequentist RAMs would require a large sample size to
there are RAMs based on operational data, knowledge from                gain enough confidence in the estimates due to the extremely
low-level DL components is usually ignored, e.g., [Kalra and            large population size (high-dimensional pixel space), espe-
Paddock, 2016]. In [Zhao et al., 2020c], we improved [Kalra             cially for a high-reliable DL model where misclassifications
and Paddock, 2016] by providing a Bayesian mechanism to                 are rare-events. As an example, the usual accuracy testing of
combine such knowledge, but did not show where to obtain                DL classifiers is essentially an MLE estimate against the test
the knowledge. In that sense, this paper is also a follow up of         set. It not only assumes the test set statistically represents the
[Zhao et al., 2020c], forming the prior knowledge required.             OP (our Assumption 3 later), but also requires a large number
                                                                        of samples to claim high reliability with sufficient confidence.
Organisation of the paper We first present preliminaries
on OP-based software reliability assessment and DL robust-              2.2    DL Robustness and the R-Separation Property
ness. Then Section 3 describes the RAM in details with a run-           DL is known not to be robust. Robustness requires that the
ning example. We conduct case studies in Section 4, while               decision of the DL model M is invariant against small per-
discuss the model assumptions and extensions in Section 5.              turbations on inputs. That is, all inputs in a region η ⊂ X
Finally, we conclude in Section 6 with future work.                     have the same prediction label, where usually the region η is
                                                                        a small norm ball (in a Lp -norm distance4 ) of radius  around
2       Preliminaries                                                   an input x. Inside η, if an input x0 is classified differently to x
2.1      OP Based Software Reliability Assessment                       by M, then x0 is an AE. Robustness can be defined either as
The delivered reliability, as a user-centred and probabilistic          a binary metric (if there exists any adversarial example in η)
property, requires to model the end-users’ behaviours (in the           or as a probabilistic metric (how likely the event of seeing an
running environments) and to be formally defined by a quan-             adversarial example in η is). The former aligns with formal
titative metric [Littlewood and Strigini, 2000]. Without loss           verification, e.g. [Huang et al., 2017], while the latter is nor-
of generality, we focus on pmi as a generic metric for DL               mally used in statistical approaches, e.g. [Webb et al., 2019].
classifiers, where inputs are, e.g., facial images uploaded by          The former “verification approach” is the binary version of
users for facial recognition. We discuss later how pmi can be           the latter “stochastic approach”5 .
                                                                           Similar to [Webb et al., 2019], we adopt the more general
redefined to cope with real-world applications like traffic sign        probabilistic definition on the robustness of the model M (in
detection. If we denote the unknown pmi as a variable λ, then           a region η and to a target label y):
              Z                                                                          X
        λ :=        I{x causes a misclassification} (x)Op(x) dx (1)       RM (η, y) :=         I{M(x) predicts label y} (x) × Op(x | x ∈ η)   (2)
                                                                                         x∈η
               x∈X
where x is an input in the input domain3 X , and IS is an               where Op(x | x ∈ η) is the conditional OP of region η (pre-
indicator function—it is equal to 1 when S is true and 0 oth-           cisely the “input model” defined in [Webb et al., 2019] and
    2                                                                      4
     Available at https://github.com/havelhuang/ReAsDL.                      Distance mentioned in this paper is defined in L∞ .
    3                                                                      5
     We assume continuous X in this paper. For discrete X , the inte-        Thus, we use the more general term robustness “evaluation”
gral in Eq. (1) reduces to sum and OP is a probability mass function.   rather than robustness “verification” throughout the paper.
also used in [Weng et al., 2019]).                                       is miss-classified, i.e. pmi. If the 2D-points represent traf-
   We highlight the follow two remarks regarding robustness:             fic lights, then we have 2 types of misclassifications—safety-
                                                                         critical ones when red data-point is labelled green, and per-
Remark 1 (astuteness). Reliability assessment only concerns              formance related otherwise. For brevity, we only focus on
the robustness to the ground truth label, rather than an arbi-           misclassifications here, while our RAM can cope with sub-
trary label y in RM (η, y). When y is such a ground truth,               types of misclassifications.
robustness becomes astuteness [Yang et al., 2020], which is
also the conditional reliability in the region η.
   Astuteness is a special case of robustness6 . An extreme ex-
ample showing why we introduce the concept of astuteness
is: a perfectly robust classifier that always outs “dogs” for
any given input is unreliable. Thus, robustness evidence can-
not directly support reliability claims unless the ground truth
label is used in RM (η, y).
Remark 2 (r-separation). For real-world image datasets,
any data-points with different ground truth are at least dis-
tance 2r apart in the input space X (i.e., pixel space), and r
is bigger than usual norm ball radius in robustness studies.             Figure 2: The 2D-point dataset (lhs), and its approximated OP (rhs).
   The r-separation property was first observed by [Yang et
al., 2020]: real-world image datasets studied by the authors
implies that r is normally 3 ∼ 7 times bigger than the radius            3.2      The Model
(denoted ) of norm balls commonly used in robustness stud-              The Framework Inspired by [Pietrantuono et al., 2020],
ies. Intuitively it says that, although the classification bound-        the general idea of our RAM is to partition the input domain
ary is highly non-linear, there is a minimum distance between            into m small cells, subject to the r-separation property. Then,
two real-world objects of different classes (cf. Figure 1 for a          for each cell ci (with a single ground truth yi ), we estimate:
conceptual illustration). Moreover, such minimum distance                                                             X
is bigger than the usual norm ball size in robustness studies.                  λi := 1 − RM (ci , yi ) and Opi :=          Op(x)     (3)
                                                                                                                                  x∈ci

                                                                         which are the unastuteness and pooled OP, respectively, es-
                                                                         timates of the cell ci —we introduce estimators for both later.
                                                                         Then Eq. (1) can be written as the weighted sum of the
                                                                         cell-wise unastuteness (i.e. the conditional pmi of each cell8 )
                                                                         where the weights are the pooled OP of cells:
                                                                                                               m
                                                                                                               X
                                                                                                          λ=         Opi λi                      (4)
                                                                                                               i=1

                                                                            Eq. (4) represents an ideal case in which we know those
                                                                         λi s and Opi s with certainty. In practice, we can only estimate
          Figure 1: Illustration of the r-separation property.           them with imperfect estimators yielding, e.g., a point estimate
                                                                         with variance capturing the measure of trust. To propagate the
                                                                         confidence in the estimates of λi s and Opi s, we assume:
                                                                         Assumption 1. All λi s and Opi s are independent unknown
3       A RAM for Deep Learning Classifiers                              variables under estimations.
3.1      The Running Example                                             Then, the estimate of λ and its variance are:
                                                                                       m                    m
To better demonstrate our RAM, we take the Challenge of                                X                    X
                                                                           E[λ] =            E[λi Opi ] =         E[λi ]E[Opi ]                  (5)
AI Dependability Assessment raised by the Siemens Mobil-                               i=1                  i=1
ity7 as a running example. Basically, the challenge is to                               m
firstly train a DL model to classify a dataset generated on
                                                                                       X
                                                                           V[λ] =            V[λi Opi ]
the unit square [0, 1]2 according to some unknown distribu-                            i=1
tion. The collected data-points (training set) are shown in                            Xm
Figure 2 (lhs). Then we need to build a RAM to claim an                            =         E[λi ]2 V[Opi ] + E[Opi ]2 V[λi ] + V[λi ]V[Opi ]   (6)
upper bound on the probability that the next random point                              i=1


    6
                                                                         Note that, for the variance, the covariance terms are dropped
      Thus, later in this paper, we may refer robustness to astuteness   out due to the independence assumption.
for brevity when it is clear from the context.
    7                                                                       8
      https://ecosystem.siemens.com/ai-da-sc/                                   We use “cell unastuteness” and “cell pmi” interchangeably later.
   Depending on the specific estimators adopted, certain para-            bandwidth, cf. [Silverman, 1986] for guidelines on tuning h.
metric families of the distribution of λ can be assumed, from             The approximated OP10 is shown in Figure 2 (rhs).
which any quantile of interest (e.g. 95%) can be derived as                 Since our cells are small and all equal size, instead of cal-
our confidence bound in reliability. For instance, as readers
                                                                                  R
                                                                          culating x∈ci Op(x)dx,
                                                                                         c        we may approximate Opi as
will see later, we may assume λ ∼ N (E[λ], V[λ]) since all
λi s and Opi s are normal distributed variables after applying                                  Op
                                                                                                c i = Op
                                                                                                      c (xc ) vc                        (9)
                                                                                                           i
the Central Limit Theorem (CLT) in our chosen estimators.
Then, an upper bound with 1 − α confidence is                             where Op(xc c ) is the probability density at the cell’s central
                                                                                           i
                                        p                                 point xci , and vc is the constant cell volume (1.6e−5 in the
                Ub 1−α = E[λ] + z1−α V[λ]                  (7)            running example).
                                                                                                                                   x−X
where P r(Z ≤ z1−α ) = 1 − α, and Z ∼ N (0, 1) is a stan-                    Now if we introduce new variables Wj = h1 K( h j ),
dard normal distribution.                                                 the KDE evaluated at x is actually the sample mean of
                                                                                                                                     σ2
   Now the the problem is reduced to how to obtain the esti-              W1 , . . . , Wn . Then by CLT, we have Op(x)
                                                                                                                  c       ∼ N (µW , nW )
mates E[λi ]s and V[λi ]s, for which we will discuss as follows           where the mean and variance of Op(x)
                                                                                                            c       are known results:
referring to the running example.
                                                                                                n
Partition of the Input Domain X As per Remark 1, the                                     1 X      x − Xj
                                                                             E[Op(x)]
                                                                               c      =        K(        )                             (10)
astuteness evaluation of a cell requires its ground truth label.                        nh j=1       h
To leverage the r-separation property and Assumption 4, we
                                                                                        f (x) K 2 (u)du
                                                                                             R
partition the input space by choosing a cell radius  so that                                                1     2
                                                                             V[Op(x)] =
                                                                               c                        + O( ) ≈ σ̂B (x)               (11)
 < r. Although we concur with Remark 2 (first observed                                        nh           nh
by [Yang et al., 2020]) and believe that there should exist
an r-stable ground truth (which means that the ground truth               where the last step of Eq. (11) says that V[Op(x)]
                                                                                                                         c       can be
                                                                                                                     2
is stable in such a cell) for any real-world DL classification            approximated using a bootstrap variance σ̂B  (x) [Chen, 2017]
applications, it is hard to estimate such an r (denoted by r̂)            (cf. the Appendix A for details).
and the best we can do is to assume:                                         Upon establishing Eq.s (10) and (11), together with Eq. (9),
                                                                          we know for a given cell ci (knowing its central point xci ):
Assumption 2. There is a r-stable ground truth (as defined
in Remark 2) for any real-world classification problems, and                E[Opi ] = vc E[Op(x
                                                                                           c c )],
                                                                                                i         V[Opi ] = vc2 V[Op(x
                                                                                                                          c c )] (12)
                                                                                                                               i
it can be sufficiently estimated from the existing dataset.
                                                                          which are the cell OP estimates for Eq.s (5) and (6).
   That said, we get r̂ = 0.004013 by iteratively calculating
the minimum distance of different labels in the running ex-               Cell Astuteness Evaluation As a corollary of Remark 2
ample. Then we choose a cell radius9  = 0.004 and partition              and Assumption 2, we may confidently assume:
the unit square X into 250 × 250 cells.                                   Assumption 4. If the radius of ci is smaller than r, all data-
                                                                          points in the region ci share a single ground truth label.
Cell OP Approximation Given a dataset (X, Y ), we esti-
mate the pooled OP of cell ci to get E[Opi ] and V[Opi ]. We                 Now, to determine the ground truth label of a cell ci , we
use the well-established Kernel Density Estimation (KDE) to               can classify our cells into three types:
fit a Op(x)
      c     to approximate the OP.                                           a) Normal cells: a normal cell contains data-points sharing
                                                                          a same ground truth label, which is then determined as the
Assumption 3. The existing dataset (X, Y ) are randomly                   ground truth label of the cell.
sampled from the OP, thus statistically represents the OP.                   b) Empty cells: a cell is “empty” in the sense that no data-
This assumption may not hold in practice: training data is                point that has been observed in it. Due to the lack of data, it is
normally collected in a balanced way, since the DL model                  hard to determine an empty cell’s ground truth. For now, we
is expected to perform well in all categories of inputs, espe-            do voting based on the predicted labels (by the DL model) of
cially when the OP is unknown at the time of training and/or              random samples from the cell, assuming:
expected to change in future. Although our model can re-                  Assumption 5. The accuracy of the DL model is better than
lax this assumption (cf. Section 5), we adopt it for brevity in           a classifier doing random classifications in any given cell.
demonstrating the running example.
                                                                          Essentially the above assumption relates to the oracle prob-
   Then given a set of (unlabelled) data-points (X1 , . . . , Xn )
                                                                          lem of DL testing, that we see some recent efforts, e.g. [Guer-
from the existing dataset (X, Y ), KDE yields
                                                                          riero, 2020], may relax it.
                            1 X
                                    n
                                     x − Xj                                  c) Cross-boundary cells: our estimate on r is imperfect,
                   Op(x)
                   c     =        K(        )                      (8)    thus we may still observe data-points with different labels in
                           nh j=1       h                                 one cell. Such cells are crossing the classification boundary.
                                                                          If our estimate on r is sufficiently accurate, they should be
where K is the kernel function (e.g. Gaussian or exponen-                 very rare. Thus, without the need to determine the ground
tial kernels), and h > 0 is a smoothing parameter called the
                                                                             10
                                                                                With a Gaussian kernel and h = 0.2 that optimised by cross-
   9
       Radius in L∞ which is the side length of our square cell in L2 .   validated grid-search [Bergstra and Bengio, 2012].
truth label of a cross boundary cell, we simply and conserva-            In the running example, we first observe that the ACU
tively set the cell unastuteness to 1.                                is much lower than the test error, meaning the underlying
   So far, the problem is reduced to: given a normal or empty         DL model is a robust one. Since our RAM is mainly based
cell ci with the known ground truth label yi , evaluate the           on the robustness evidence, its results are close to ACU but
miss-classification probability upon a random input x ∈ ci ,          not exactly the same because of the nonuniform OP, cf. Fig-
i.e. E[λi ] and its variance V[λi ]. This is essentially a sta-       ure 2 (rhs). Moreover, from Figure 2 (lhs), we know the clas-
tistical problem that has been studied in [Webb et al., 2019]         sification boundary is near the middle of the unit square input
using Multilevel Splitting Sampling, while we use the Simple          space where misclassifications tend to happen (say “buggy
Monte Carlo method for brevity in the running example:                area”), which is also the high density area on the OP. Thus,
                                n
                                                                      the contribution to unreliability from the “buggy area” is
                             1X                                       weighted higher by the OP, which explains why our RAM re-
                     λ̂i =         I{M (xj )6=yi }
                             n j=1                                    sults are worse than ACU. In contrast, because of the “flat”
                                                                      OP in the DS-1 (cf. Figure 3 (lhs)), our RAM results are
                                      2
The CLT tells us λ̂i ∼ N (µ, σn ), when n is large, where             very close to the ACU. With more dense data in DS-2, the
                                                                      r-distance is much smaller and leads to more cells. Thanks to
µ and σ 2 are population mean and variance of I{M(xj )6=yi }          the rich data in this case, all three results are more consistent.
that can be approximated with sample mean µ̂n and sample              We note that, given the nature of the three 2D-point datasets,
variance σ̂n2 /n. Finally, we can get                                 DL models trained on them are much more robust than image
                     1X
                         n                                            datasets. This is why all ACUs are better than test errors, and
    E[λi ] = µ̂n =         I{M(xj )6=yi }                     (13)    our RAM finds a middle point representing reliability accord-
                     n j=1                                            ing to the OP. Later we apply the RAM on two unrobust DL
                                n                                     models trained on image-datasets where the ACUs are worse
               σ̂n2       1    X
    V[λi ] =        =              (I{M(xj )6=yi } − µ̂n )2   (14)    than test error; it confirms our aforementioned observations.
                n     (n − 1)n j=1                                       To gain insights on how to extend our method for high-
                                                                      dimensional/real-world datasets, we also conduct experi-
   Notably, to solve the above statistical problem with sam-          ments on the popular MNIST and CIFAR10 datasets. Instead
pling methods, we need to assume how the inputs in the                of implementing the exact steps in Section 3.2, we take a few
cell are distributed, i.e., a distribution for the conditional OP     compromised solutions to tackle the scalability issues raised
Op(x | x ∈ ci ). Without loss of generality, we assume:               by “the curse of dimensionality”. We articulate these steps in
Assumption 6. The inputs in a small region like cells are             the following paragraph, while detailed discussions on their
uniformly distributed.                                                impact on our results are presented in Section 5.
which is not uncommon (e.g., in [Webb et al., 2019; Weng et              First, we train Variational Auto-Encoders (VAE) on the
al., 2019]) and can be easily replaced by other distributions if      MNIST and CIFAR10 datasets and project all inputs into the
there is supporting evidence for such action.                         low dimensional latent spaces of VAE (with 8 and 16 di-
                                                                      mensions respectively). Then we apply the proposed RAM
                                                                      on the compressed dataset, i.e., partitioning the latent space,
4    Case Studies                                                     learning the OP in latent space and evaluating the “latent-
In addition to the running example, we conduct experiments            cell unastuteness”. Astuteness (a special case of robustness),
on two synthetic datasets as shown in Figure 3, representing          by definition is associated with the input pixel space. By
the scenarios with sparse and dense training data respectively.       “latent-cell unastuteness”, we mean the average unastuteness
All modelling details and results after applying our RAM              of norm balls (in the input space) around a large number of
on those three datasets are summarised in Table 1, based on           samples from a “latent-cell”. The norm ball radius is deter-
which we compare the testing error, Average Cell Unastute-            mined by the r-separation distance in the input space. Taking
ness (ACU) and our RAM results (E[λ] and U b97.5% ).                  the computational cost into consideration, we rank the OP of
                                                                      all latent-cells, and choose the top k cells with highest OP
                                                                      for astuteness evaluation. We adopt the existing robustness
                                                                      estimator in [Webb et al., 2019], where the authors omitted
                                                                      the result of V[λi ]; we therefore also omit the variance in our
                                                                      experiments for simplicity.


                                                                      5   Discussions
                                                                      In this section, we summarise the model assumptions made
                                                                      in our RAM, and discuss if/how they can be validated and
                                                                      what new-assumptions/compromised-solutions are needed to
                                                                      cope with high-dimensional/real-world applications. Fi-
Figure 3: Synthetic datasets DS-1 (lhs) and DS-2 (rhs) representing   nally, we list the inherent difficulties of assessing DL un-
relatively sparse and dense training data respectively.               covered by our RAM.
  Table 1: The RAM details and results. For image datasets, the r,  and # are associated with latent spaces. Time is in seconds per cell.

                   train/test error   r-separation   cell radius        # of cells     ACU        E[λ]        V[λ]      U b97.5%     time
  The run. exp.    0.0005/0.0180       0.004013         0.004           250 × 250     0.002982   0.004891    0.000004    0.004899     0.04
  Synth. DS-1      0.0037/0.0800       0.004392         0.004           250 × 250     0.008025   0.008290    0.000014    0.008319     0.03
  Synth. DS-2      0.0004/0.0079       0.002001         0.002           500 × 500     0.004739   0.005249    0.000002    0.005252     0.04
    MNIST          0.0051/0.0235         0.1003         0.100          top-170000     0.106615   0.036517        /           /        0.43
   CIFAR10         0.0199/0.0853         0.1947         0.125           top-23000     0.238138   0.234419        /           /        6.74


Independent λi s and Opi s As per Assumption 1, we as-                     fit a density function over the input space from an “opera-
sume all λi s and Opi s are independent when “assembling”                  tional dataset” (representing the OP). Data-points in this set
their estimates via Eq. (5) and deriving the variance via                  can be unlabelled raw data generated from historical data of
Eq. (6). Largely this assumption is for the mathematical                   previous applications, simulations and manually scaled based
tractability when propagating the confidence in individual es-             on expert knowledge. Obtaining such operational dataset
timates at the cell-level to the whole system pmi. Although                is an application-specific engineering problem, and tractable
this independence assumption is hard to justify in practice,               thanks to the fact that it does not require labelled data.
it is not unusual in reliability models that use partition, e.g.               Notably, the OP may also be approximated at runtime
in [Pietrantuono et al., 2020; Miller et al., 1992]. We be-                based on the data stream of operational data. Efficient KDE
lieve that RAMs are still useful as a first approximation under            for data streams [Qahtan et al., 2017] can be used. If the OP
this assumption, while we envisage that Bayesian estimators                was subject to sudden changes, change-point detectors like
leveraging joint priors and conjugacy may relax it.                        [Zhao et al., 2020b] should also be paired with the runtime
R-separation and its estimation Assumption 2 derives                       estimator to robustly approximate the OP.
from Remark 2. We concur with [Yang et al., 2020] and be-                      However, we may encounter technical challenges when
lieve that, for any real-world DL classification applications              fitting the PDF from high-dimensional real-world datasets.
where the inputs are data-points with “physical meanings”,                 There are two known major challenges when applying multi-
there should always exist an r-stable ground truth. Such                   variate KDE to high-dimensional data: i) the choice of band-
r varies between applications, and the smaller the r is, the               width H represents the covariance matrix that mostly im-
harder the inherent difficulty of the classification problem is;           pacts the estimation accuracy; ii) scalability issues in terms of
i.e., r is a difficulty indicator for the given classification prob-       storing intermediate data structure (e.g. data-points in hash-
lem.                                                                       tables) and querying times made when estimating the density
   For real-world applications, what really determines the la-             at a given input. For the first challenge, the optimal calcula-
bel of an image are its features rather than pixels. Thus,                 tion of bandwidth matrix can refer to some rule of thumb [Sil-
we envisage some latent space (of, e.g., VAE) capturing                    verman, 1986; Scott, 2015] and the cross-validation [Bergstra
only the feature-wise information can be explored for high-                and Bengio, 2012]. While there are dedicated research on
dimensional data. That is, we                                              improving the efficiency of multivariate KDE, e.g., [Backurs
                                                                           et al., 2019] presented a framework for multivariate KDE in
   • first do r-separation based partition in the latent space to          provably sub-linear query time with linear space and linear
     learn the OP;                                                         pre-processing time to the dimensions.
   • then determine the ground truth labels of cells in the la-            Determination of the ground truth of a cell Assumptions
     tent space;                                                           4 and 5 are essentially on how to determine the ground truth
   • map the learned OP and ground truth labels back to the                label for a given cell, that relates to the oracle problem of
     input pixel space;                                                    testing DL [Guerriero, 2020]. While it is still challenging, we
                                                                           partially solve it by leveraging the r-separation property.
   • do astuteness evaluation in the input pixel space and “as-               Thanks to r, it is easy to determine a cell’s ground truth
     semble” the results according to the OP.                              when we see it contains labelled data-points. However, for
   Indeed, it is hard to estimate the r (neither in the input nor          an empty cell, it is non-trivial. We assume the overall perfor-
the latent space), while the best we can do is to estimate it              mance of the DL model is fairly good (e.g., better than a clas-
from the existing dataset. One way of solving the problem                  sifier doing random classifications), thus miss-classifications
is to keep monitoring the r estimates as more labelled data is             within an empty cell are relatively rare events. Then we can
collected, and redo the cell partition when the estimated r has            determine the ground truth label of the cell by majority vot-
changed significantly.                                                     ing of predictions. Indeed, this is a strong assumption when
Approximation of the OP Assumption 3 says that the col-                    there are some “failure regions” in the input space that per-
lected dataset statistically represents the OP, which may not              form really badly (even worse than random labelling). In this
hold for many practical reasons; e.g., the future OP is un-                case, we need to invent a new mechanism to detect such “re-
certain at the training stage and thus data is collected in a              ally bad failure regions” and spend more budget on invoking,
balanced way to perform well in all categories of inputs. Al-              say, humans to do the labelling.
though we demonstrate our RAM under this assumption for                    Conditional OP of a cell We assume the distribution of
simplicity, it can be easily relaxed. Essentially, we try to               inputs (i.e., the conditional OP) within each cell is uniform
by Assumption 6. Although we conjecture that this is the              can be conservatively set to the proportion of robust region
common case due to the small size of cells (i.e., those very          covered in ci in this case.
close/similar inputs within a small region are only subject to           We would like to note that the cell robustness estimator
noise factors that can be modelled uniformly), the real situa-        in our RAM works in a “hot-swappable” manner: any new
tion may vary; this requires justification in safety cases.           and more efficient robustness estimator can be easily incor-
   For a real-world dataset, the conditional OP represents cer-       porated. Thus, how to improve the efficiency of cell’s robust-
tain distributions of “natural variations” [Zhong et al., 2021],      ness estimation is out of the scope of our RAM.
e.g. lighting conditions, obey certain distributions. The con-        Inherent difficulties Finally, based on our RAM and the
ditional OP of cells should faithfully capture the distribution       discussions above, we summarise the inherent difficulties of
of such natural variations. Recent advance on measuring the           assessing DL reliability as the following questions:
natural/realistic AEs [Harel-Canada et al., 2020] highly re-
lates to this assumption and may relax it.                                • How to accurately build the OP in the high-dimensional
                                                                            input space?
Explosion of the number of cells The number of cells to
evaluate the astuteness is exponential in the dimensions of               • How to build an accurate oracle leveraging the existing
data. For high-dimensional data, it is impossible to explore all            human-labels in the training dataset?
cells in the input space11 as we did for the running example.             • What is the local distribution (conditional OP) over a
   A compromised solution is to find the first k cells that dom-            small input region that captures the natural variations of
inate the OP. That is, we rank the cells by their pooled OP, and            physical conditions?
only evaluate the top-k cells where the sum of these k cells’
                                                                          • How to efficiently evaluate the robustness of a small re-
OPs is greater than a threshold, e.g. 99%. Then, we can con-
                                                                            gion given AEs are rare events?
servatively set the cell pmi of the rest to a worst-case bound
(e.g. 1) or an empirical/average bound based on the first k               • How to sample small regions from a large population
cells. Certainly, the price to pay is to sacrifice estimation ac-           (high-dimensional space) to test robustness in an unbi-
curacy. The best we can do for now is to increase the budgets               ased and efficient way?
for a larger k. Technically, finding the first k cells dominat-          We try to provide preliminary/compromised solutions in
ing the OP is in fact to calculate the modes of the KDE func-         our RAM, while the questions are still challenging in prac-
tion. The work of [Lee et al., 2019] gives us a hint on how to        tice. We doubt the existence of other DL RAMs with weaker
quickly calculate the modes of Gaussian KDE when the data             assumptions achieving the same level of rigorousness as ours,
dimension is high.                                                    at this stage.
   This discussion relates to the cost of our RAM, thus a per-
tinent question is—what is the real cost of conducting DL
testing? Is it the the human labour generating labels or timing       6    Conclusion & Future Work
constraints? A likely answer is: both. Our RAM has partially          In this paper, we present a preliminary RAM for DL clas-
solved the former (cf. earlier discussions), while the latter is      sifiers. It is the first DL RAM explicitly considers both the
less costly nowadays and can be solved by harnessing the fast         OP information and robustness evidence. It uncovers some
growth of computational power and parallel computing.                 inherent difficult questions when assessing DL reliability,
                                                                      while preliminary/compromised solutions are discussed, im-
Efficiency of cell robustness evaluation We have demon-
                                                                      plemented and demonstrated with case studies.
strated via the Simple Monte Carlo method to evaluate cell
                                                                         An intuitive way of perceiving our RAM, comparing with
robustness in the running example. It is well-known that
                                                                      the usual accuracy testing, is that we enlarge the testing
Simple Monte Carlo is not a computationally efficient tech-
                                                                      dataset with more test cases around “seeds” (original data-
nique to estimate rare-events (such as AEs in our case) in
                                                                      points in the test set). We determine the oracle of a new test
high-dimensional space. Thus, instead of applying Simple
                                                                      case according to its seed’s label and the r-distance. Those
Monte Carlo, the more advanced and efficient sampling ap-
                                                                      enlarged test results form the robustness evidence, and how
proach, the Adaptive Multi-level Splitting method [Webb et
                                                                      much they contribute to the overall reliability is proportional
al., 2019], has been applied in our case studies on image
                                                                      to its OP. Consequently, exposing to more tests (robustness
datasets. We are confident that other statistical sampling
                                                                      evaluation) and being more representative of how it will be
methods designed for rare-events may also suffice our need.
                                                                      used (the OP), our RAM is more trustworthy.
   In addition to the statistical approach, formal method based
                                                                         In line with the gist of our RAM, we believe the DL relia-
verification techniques can also be applied to assess a cell’s
                                                                      bility should follow the conceptualised equation of:
pmi, e.g. [Huang et al., 2017]. They provide formal guaran-
tees on whether the DL model will miss-classify any input                  DL reliability = generalisability × robustness.
inside a small region. Although such “robust region” proved
by formal methods is normally smaller than our cells, the λ̂i         In a nutshell, when assessing the DL reliability, we should
                                                                      not only concern how it generalises to a new data-point (ac-
   11
      Although dimension reduction methods like VAE may ease the      cording to the future OP), but also the local robustness around
problem of learning OP, they cannot reduce the number of cells to     it. Align with this insight, indeed, a “naive/over-simplified”
be evaluated. Since robustness by definition has to be evaluated in   version of our RAM would be averaging all local astuteness
the input space.                                                      of data-points in the test set, which is less rigorous (e.g., on
determining the norm ball size) and requires stronger assump-      References
tions (e.g., the test set is equal to the operational set).        [Backurs et al., 2019] Arturs Backurs, Piotr Indyk, and Tal
   Improving the scalability of our RAM and experimenting             Wagner. Space and time efficient kernel density estimation
with more real-world datasets form important future work.             in high dimensions. In NeurIPS’19, pages 15773–15782,
We presume a trained DL model for our assessment purpose.             2019.
A natural question next is how to actually improve the reli-
ability when our RAM results are not good enough. As de-           [Bai et al., 2021] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen,
scribed in [Zhao et al., 2021], we plan to investigate DL de-         and Qian Wang. Recent Advances in Adversarial Training
bug testing (e.g. [Huang et al., 2021]) and retraining methods        for Adversarial Robustness. In IJCAI’21, 2021.
[Bai et al., 2021], together with the RAM, to form a closed        [Bergstra and Bengio, 2012] James Bergstra and Yoshua
loop of debugging-improving-assessing.                                Bengio. Random search for hyper-parameter optimization.
                                                                      J. of Machine Learning Research, 13(2):281–305, 2012.
Acknowledgments & Disclaimer                                       [Bishop et al., 2011] Peter Bishop, Robin Bloomfield, Bev
     This project has received funding from the European              Littlewood, Andrey Povyakalo, and David Wright. Toward
Union’s Horizon 2020 research and innovation programme                a formalism for conservative claims about the dependabil-
under grant agreement No 956123. This work is partially sup-          ity of software-based systems. IEEE Transactions on Soft-
ported by the UK EPSRC (through the Offshore Robotics for             ware Engineering, 37(5):708–717, 2011.
Certification of Assets [EP/R026173/1] and End-to-End Con-         [Bloomfield et al., 2019] Robin Bloomfield, Heidy Khlaaf,
ceptual Guarding of Neural Architectures [EP/T026995/1])
                                                                      Philippa Ryan Conmy, and Gareth Fletcher. Disruptive
and the UK Dstl (through the project of Safety Argument
                                                                      innovations and disruptive assurance: Assuring machine
for Learning-enabled Autonomous Underwater Vehicles).
                                                                      learning and autonomy. Computer, 52(9):82–89, 2019.
Xingyu Zhao and Alec Banks’ contribution to the work is
partially supported through Fellowships at the Assuring Au-        [Carlini and Wagner, 2017] Nicholas Carlini and David
tonomy International Programme. We thank Lorenzo Strigini             Wagner. Towards Evaluating the Robustness of Neural
for insightful comments on earlier versions of the paper.             Networks. In IEEE Symp. on Security and Privacy (SP),
   This document is an overview of UK MOD (part) spon-                pages 39–57, San Jose, CA, USA, 2017. IEEE.
sored research and is released for informational purposes          [Chen, 2017] Yen-Chi Chen. A tutorial on kernel density es-
only. The contents of this document should not be inter-              timation and recent advances. Biostatistics & Epidemiol-
preted as representing the views of the UK MOD, nor should            ogy, 1(1):161–187, 2017.
it be assumed that they reflect any current or future UK
                                                                   [Guerriero et al., 2021] Antonio Guerriero, Roberto Pietran-
MOD policy. The information contained in this document
cannot supersede any statutory or contractual requirements            tuono, and Stefano Russo. Operation is the hardest teacher:
or liabilities and is offered without prejudice or commit-            estimating DNN accuracy looking for mispredictions. In
ment. Content includes material subject to © Crown copy-              ICSE’21, Madrid, Spain, 2021.
right (2018), Dstl. This material is licensed under the terms of   [Guerriero, 2020] Antonio Guerriero. Reliability Evaluation
the Open Government Licence except where otherwise stated.            of ML systems, the oracle problem. In ISSREW’20, pages
To view this licence, visit http://www.nationalarchives.gov.          127–130, Coimbra, Portugal, 2020. IEEE.
uk/doc/open-government-licence/version/3 or write to the In-       [Hamlet and Taylor, 1990] D. Hamlet and R. Taylor. Parti-
formation Policy Team, The National Archives, Kew, London             tion testing does not inspire confidence. IEEE Tran. on
TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk.                   Software Engineering, 16(12):1402–1411, 1990.
A    KDE bootstrapping                                             [Harel-Canada et al., 2020] Fabrice Harel-Canada, Lingx-
                                                                      iao Wang, Muhammad Ali Gulzar, Quanquan Gu, and
Bootstrapping is a statistical approach to estimate any sam-          Miryung Kim. Is neuron coverage a meaningful measure
pling distribution by random sampling method. We sample               for testing deep neural networks? In ESEC/FSE’20, pages
with replacement from the original data points (X, Y ) to ob-         851–862, 2020.
tain a new bootstrap dataset (X b , Y b ) and train the KDE on
the bootstrap dataset. Assume the bootstrap process is re-         [Huang et al., 2017] Xiaowei Huang, Marta Kwiatkowska,
peated B times, leading to B bootstrap KDEs, denoted as               Sen Wang, and Min Wu. Safety verification of deep neural
    1               B                                                 networks. In CAV’17, pages 3–29. Springer, 2017.
Op
 c (x), . . . , Op
                 c (x). Then we can estimate the variance of
                                                                   [Huang et al., 2020] Xiaowei Huang, Daniel Kroening,
fˆ(x) by the sample variance of the bootstrap KDE [Chen,
                                                                      Wenjie Ruan, and et al. A survey of safety and trustworthi-
2017]:
                               B                                      ness of deep neural networks: Verification, testing, adver-
                2        1 X cb                                       sarial attack and defence, and interpretability. Computer
             σ̂B  (x) =          (Op (x) − µB )2
                        B−1                                           Science Review, 37:100270, 2020.
                              b=1
where the µB can be approximated by                                [Huang et al., 2021] Wei Huang, Youcheng Sun, Xingyu
                                 B                                    Zhao, James Sharp, Wenjie Ruan, Jie Meng, and Xiaowei
                              1 X cb                                  Huang. Coverage guided testing for recurrent neural net-
                  µ̂B (x) =       Op (x).
                              B                                       works. IEEE Tran. on Reliability, 2021. In press.
                                b=1
[Kalra and Paddock, 2016] Nidhi Kalra and Susan M. Pad-          [Strigini and Littlewood, 1997] Lorenzo Strigini and Bev
  dock. Driving to safety: How many miles of driving                Littlewood. Guidelines for statistical testing. Technical
  would it take to demonstrate autonomous vehicle reliabil-         report, City, University of London, 1997.
  ity? Transportation Research Part A: Policy and Practice,      [Strigini and Povyakalo, 2013] Lorenzo Strigini and Andrey
  94:182 – 193, 2016.                                               Povyakalo. Software fault-freeness and reliability predic-
[Katz et al., 2019] Guy Katz, Derek A. Huang, Duligur Ibel-         tions. In SafeComp’13, volume 8153 of LNCS, pages 106–
  ing, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth          117, Berlin, Heidelberg, 2013. Springer.
  Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić,          [Walter and Augustin, 2009] Gero Walter and Thomas Au-
  David L. Dill, Mykel J. Kochenderfer, and Clark Barrett.          gustin. Imprecision and prior-data conflict in generalized
  The Marabou Framework for Verification and Analysis               Bayesian inference. Journal of Statistical Theory & Prac-
  of Deep Neural Networks. In CAV’19, volume 11561 of               tice, 3(1):255–271, 2009.
  LNCS, pages 443–452, Cham, 2019. Springer.                     [Webb et al., 2019] Stefan Webb, Tom Rainforth, Yee Whye
[Lane et al., 2016] David Lane, David Bisset, Rob Bucking-          Teh, and M. Pawan Kumar. A statistical approach to as-
  ham, Geoff Pegman, and Tony Prescott. New foresight               sessing neural network robustness. In ICLR’19, New Or-
  review on robotics and autonomous systems. Technical              leans, LA, USA, 2019.
  Report No. 2016.1, LRF, 2016.                                  [Weng et al., 2019] Lily Weng, Pin-Yu Chen, Lam Nguyen,
[Lee et al., 2019] Jasper CH Lee, Jerry Li, Christopher             Mark Squillante, Akhilan Boopathy, Ivan Oseledets, and
  Musco, Jeff M Phillips, and Wai Ming Tai. Finding                 Luca Daniel. PROVEN: Verifying robustness of neural
  the mode of a kernel density estimate. arXiv preprint             networks with a probabilistic approach. In ICML’19, vol-
  arXiv:1912.07673, 2019.                                           ume 97, pages 6727–6736. PMLR, 2019.
                                                                 [Yang et al., 2020] Yao-Yuan Yang, Cyrus Rashtchian,
[Li et al., 2019] Zenan Li, Xiaoxing Ma, Chang Xu, Chun
                                                                    Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika
   Cao, Jingwei Xu, and Jian Lü. Boosting operational DNN
                                                                    Chaudhuri. A Closer Look at Accuracy vs. Robustness.
   testing efficiency through conditioning. In ESEC/FSE’19,
                                                                    In NeurIPS’20, Vancouver, Canada, 2020.
   pages 499–509. ACM, 2019.
                                                                 [Zhang et al., 2020] J. M. Zhang, M. Harman, L. Ma, and
[Littlewood and Strigini, 2000] Bev Littlewood and Lorenzo          Y. Liu. Machine learning testing: Survey, landscapes and
   Strigini.  Software reliability and dependability: A             horizons. IEEE Tran. on Software Engineering, 2020.
   roadmap. In ICSE 2000, pages 175–188, 2000.
                                                                 [Zhao et al., 2019] Xingyu Zhao, Valentin Robu, David
[Miller et al., 1992] Keith W. Miller, Larry J. Morell,             Flynn, Fateme Dinmohammadi, Michael Fisher, and Matt
  Robert E. Noonan, Stephen K. Park, David M. Nicol,                Webster. Probabilistic model checking of robots deployed
  Branson W. Murrill, and M Voas. Estimating the probabil-          in extreme environments. In AAAI’19, volume 33, pages
  ity of failure when testing reveals no failures. IEEE Trans-      8076–8084, 2019.
  actions on Software Engineering, 18(1):33–43, 1992.            [Zhao et al., 2020a] Xingyu Zhao, Alec Banks, James Sharp,
[Musa, 1993] John Musa. Operational profiles in software-           Valentin Robu, David Flynn, Michael Fisher, and Xiaowei
  reliability engineering. IEEE Software, 10(2):14–32,              Huang. A safety framework for critical systems utilising
  1993.                                                             deep neural networks. In SafeComp’20, volume 12234 of
                                                                    LNCS, pages 244–259. Springer, 2020.
[Pietrantuono et al., 2020] Roberto Pietrantuono, Peter
   Popov, and Stefano Russo. Reliability assessment of           [Zhao et al., 2020b] Xingyu Zhao, Radu Calinescu, Simos
   service-based software under operational profile un-             Gerasimou, Valentin Robu, and David Flynn. Interval
   certainty.   Reliability Engineering & System Safety,            change-point detection for runtime probabilistic model
   204:107193, 2020.                                                checking. In ASE’20, pages 163–174. IEEE/ACM, 2020.
[Qahtan et al., 2017] Abdulhakim Qahtan, Suojin Wang, and        [Zhao et al., 2020c] Xingyu Zhao, Kizito Salako, Lorenzo
  Xiangliang Zhang. KDE-Track: An Efficient Dynamic                 Strigini, Valentin Robu, and David Flynn. Assessing
  Density Estimator for Data Streams. IEEE Transactions on          safety-critical systems from operational testing: A study
  Knowledge and Data Engineering, 29(3):642–655, 2017.              on autonomous vehicles. Information and Software Tech-
                                                                    nology, 128:106393, 2020.
[Robu et al., 2018] Valentin Robu, David Flynn, and David        [Zhao et al., 2021] Xingyu Zhao, Wei Huang, Sven Schewe,
  Lane. Train robots to self-certify as safe. Nature,               Yi Dong, and Xiaowei Huang. Detecting operational ad-
  553(7688):281–281, 2018.                                          versarial examples for reliable deep learning. In 51th An-
[Scott, 2015] David W Scott. Multivariate density estima-           nual IEEE-IFIP Int. Conf. on Dependable Systems and
   tion: theory, practice, and visualization. John Wiley &          Networks (DSN’21), volume Fast Abstract, 2021.
   Sons, 2015.                                                   [Zhong et al., 2021] Ziyuan Zhong, Yuchi Tian, and
[Silverman, 1986] Bernard W Silverman. Density estimation           Baishakhi Ray. Understanding Local Robustness of Deep
   for statistics and data analysis, volume 26. CRC press,          Neural Networks under Natural Variations. In FASE’21,
   1986.                                                            pages 313–337, 2021.