<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the Reliability of Deep Learning Classifiers Through Robustness Evaluation and Operational Profiles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xingyu Zhao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Huang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alec Banks</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Cox</string-name>
          <email>vcoxg@dstl.gov.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Flynn</string-name>
          <email>ynn@hw.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Schewe</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowei Huang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Defence Science and Technology Laboratory</institution>
          ,
          <addr-line>Salisbury, SP4 0JQ</addr-line>
          ,
          <country country="UK">U.K</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heriot-Watt University</institution>
          ,
          <addr-line>Edinburgh, EH14 4AS</addr-line>
          ,
          <country country="UK">U.K</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>Liverpool, L69 3BX</addr-line>
          ,
          <country country="UK">U.K</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The utilisation of Deep Learning (DL) is advancing into increasingly more sophisticated applications. While it shows great potential to provide transformational capabilities, DL also raises new challenges regarding its reliability in critical functions. In this paper, we present a model-agnostic reliability assessment method for DL classifiers, based on evidence from robustness evaluation and the operational profile (OP) of a given application. We partition the input space into small cells and then “assemble” their robustness (to the ground truth) according to the OP, where estimators on the cells' robustness and OPs are provided. Reliability estimates in terms of the probability of misclassification per input (pmi) can be derived together with confidence levels. A prototype tool is demonstrated with simplified case studies. Model assumptions and extension to real-world applications are also discussed. While our model easily uncovers the inherent difficulties of assessing the DL dependability (e.g. lack of data with ground truth and scalability issues), we provide preliminary/compromised solutions to advance in this research direction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Industry is adopting increasingly advanced big data
analysis methodologies to enhance the operational
performance, safety, and lifespan of its products and services.
For many products and systems, high in-service reliability and
safety are key targets to ensure customer satisfaction and
regulatory compliance, respectively. AI and Deep Learning (DL)
have steadily grown in interest and applications. Key
industrial foresight reviews have identified that the biggest obstacle
to reaping the benefits of DL-powered robots is the assurance and
regulation of their safety and reliability [Lane et al., 2016].
Thus, there is an urgent need to develop methods to enable
the dependable use of AI/DL in critical applications [Robu et
al., 2018] and, more importantly, to assess and demonstrate
the dependability for certification and regulation.</p>
      <p>For traditional systems, safety and reliability analysis is
guided by established standards, and supported by mature
development processes and verification and validation (V&amp;V)
tools and techniques. The situation is different for systems
that utilise DL: they require new and advanced analysis
reflective of the complex requirements in their safe and reliable
function. Such analysis also needs to be tailored to fully
evaluate the inherent character of DL [Bloomfield et al., 2019],
despite the progress made recently [Huang et al., 2020].</p>
      <p>
        Because DL classifiers are subject to robustness concerns,
reliability models that do not consider robustness evidence are not
convincing. Reliability, as a user-centred property, depends
on the end-users’ behaviours [Littlewood and Strigini, 2000].
The operational profile (OP) information
        <xref ref-type="bibr" rid="ref22">(quantifying how
the software will be operated [Musa, 1993])</xref>
        should therefore
be explicitly modelled in the assessment. However, to the best
of our knowledge, there is no dedicated reliability assessment
model (RAM) taking into account both the OP and robustness
evidence, which motivates this research.
      </p>
      <p>In [Zhao et al., 2020a], we proposed a safety case
framework tailored for DL, in which we described an initial idea of
combining robustness verification and operational testing for
reliability claims. In this paper, we implement this idea as a
RAM, inspired by partition-based testing [Hamlet and Taylor,
1990], operational-profile testing [Strigini and Littlewood,
1997; Zhao et al., 2020c] and DL robustness evaluation
[Carlini and Wagner, 2017; Webb et al., 2019]. It is
model-agnostic and designed for pretrained DL models, yielding
upper bounds on the probability of misclassification per input
(pmi; see footnote 1) with confidence levels. Although our RAM is
theoretically sound, we discovered some issues in our case studies (e.g.
scalability and lack of data) that we believe represent the
inherent difficulties of assessing/assuring DL dependability.</p>
      <p>The key contributions of this work are:
a) A first RAM for DL classifiers based on the OP
information and robustness evidence.
b) Discussions on model assumptions and extension to
real-world applications, highlighting the inherent difficulties
of assessing DL dependability uncovered by our model.
c) A prototype tool (footnote 2) of our RAM with preliminary and
compromised solutions to those uncovered difficulties.</p>
      <p>Footnote 1: This reliability measure is similar to the conventional
probability of failure per demand (pfd), but retrofitted for classifiers.</p>
      <p>Related Work. In recent years, there have been extensive
efforts in verifying DL robustness, evaluating
generalisation errors, and detecting adversarial examples (AEs). They
are normally based on formal methods [Huang et al., 2017;
Katz et al., 2019] or statistical approaches [Webb et al., 2019;
Weng et al., 2019]. A comprehensive review of those
techniques can be found in recent survey papers [Huang et
al., 2020; Zhang et al., 2020]. To the best of our knowledge,
the only papers on testing DL for assessment within an
operational context are [Li et al., 2019; Guerriero et al., 2021]. In
[Li et al., 2019], novel stratified sampling methods are used to
improve the operational testing efficiency. Similarly,
[Guerriero et al., 2021] presents a method for sampling from the
operational dataset that leverages “auxiliary information for
misclassification”, so that it provides unbiased statistical assessment
while exposing as many misclassifications as possible.
However, neither of them considers robustness evidence in its
assessment model.</p>
      <p>At the higher level of whole systems utilising DL, although
there are RAMs based on operational data, knowledge from
low-level DL components is usually ignored, e.g., [Kalra and
Paddock, 2016]. In [Zhao et al., 2020c], we improved [Kalra
and Paddock, 2016] by providing a Bayesian mechanism to
combine such knowledge, but did not show where to obtain
the knowledge. In that sense, this paper is also a follow-up to
[Zhao et al., 2020c], forming the prior knowledge required.</p>
      <p>Organisation of the paper. We first present preliminaries
on OP-based software reliability assessment and DL
robustness. Then Section 3 describes the RAM in detail with a
running example. We conduct case studies in Section 4, and
discuss the model assumptions and extensions in Section 5.
Finally, we conclude in Section 6 with future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>2.1 OP Based Software Reliability Assessment</title>
        <p>The delivered reliability, as a user-centred and probabilistic
property, requires modelling the end-users’ behaviours (in the
running environments) and a formal definition by a
quantitative metric [Littlewood and Strigini, 2000]. Without loss
of generality, we focus on pmi as a generic metric for DL
classifiers, where inputs are, e.g., facial images uploaded by
users for facial recognition. We discuss later how pmi can be
redefined to cope with real-world applications like traffic sign
detection. If we denote the unknown pmi as a variable $\lambda$, then
$$\lambda := \int_{x \in \mathcal{X}} I_{\{x \text{ causes a misclassification}\}}(x)\, Op(x)\, dx \tag{1}$$
where $x$ is an input in the input domain $\mathcal{X}$ (footnote 3), and $I_{S}$ is an
indicator function: it is equal to 1 when $S$ is true and 0
otherwise. The function $Op(x)$ returns the probability that $x$ is the next
random input, i.e. the OP [Musa, 1993], a notion used in
software engineering to quantify how the software will be
operated. Mathematically, the OP is a probability density function
(PDF) defined over $\mathcal{X}$.</p>
        <p>Assuming independence between successive inputs
defined in our pmi, we may use the Bernoulli process as the
mathematical abstraction of the failure process (common for
such “on-demand” types of systems), which implies a
Binomial likelihood. Normally for traditional software, upon
establishing the likelihood, RAMs for estimating $\lambda$ vary case
by case: from the basic Maximum Likelihood Estimation
(MLE) to Bayesian estimators tailored for certain
scenarios, e.g., when seeing no failure [Bishop et al., 2011],
inferring ultra-high reliability [Zhao et al., 2020c], with
certain forms of prior knowledge like perfection [Strigini
and Povyakalo, 2013], and with vague prior knowledge that
is expressed in imprecise probabilities [Walter and Augustin,
2009; Zhao et al., 2019].</p>
        <p>OP-based RAMs designed for traditional software fail to
consider new characteristics of DL, e.g., unrobustness and
high-dimensional input space. Specifically, it is quite hard to
obtain the prior knowledge required by those Bayesian RAMs,
while frequentist RAMs would require a large sample size to
gain enough confidence in the estimates due to the extremely
large population size (high-dimensional pixel space),
especially for a highly reliable DL model where misclassifications
are rare events. As an example, the usual accuracy testing of
DL classifiers is essentially an MLE estimate against the test
set. It not only assumes the test set statistically represents the
OP (our Assumption 3 later), but also requires a large number
of samples to claim high reliability with sufficient confidence.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 DL Robustness and the R-Separation Property</title>
        <p>DL is known not to be robust. Robustness requires that the
decision of the DL model $M$ is invariant against small
perturbations on inputs. That is, all inputs in a region $\eta \subset \mathcal{X}$
have the same prediction label, where usually the region $\eta$ is
a small norm ball (in an $L_p$-norm distance; see footnote 4) of radius $\epsilon$ around
an input $x$. Inside $\eta$, if an input $x'$ is classified differently to $x$
by $M$, then $x'$ is an AE. Robustness can be defined either as
a binary metric (whether there exists any adversarial example in $\eta$)
or as a probabilistic metric (how likely the event of seeing an
adversarial example in $\eta$ is). The former aligns with formal
verification, e.g. [Huang et al., 2017], while the latter is
normally used in statistical approaches, e.g. [Webb et al., 2019].
The former “verification approach” is the binary version of
the latter “stochastic approach” (footnote 5).</p>
        <p>Similar to [Webb et al., 2019], we adopt the more general
probabilistic definition of the robustness of the model $M$ (in
a region $\eta$ and to a target label $y$):
$$R_M(\eta, y) := \sum_{x \in \eta} I_{\{M(x) \text{ predicts label } y\}}(x)\, Op(x \mid x \in \eta) \tag{2}$$
where $Op(x \mid x \in \eta)$ is the conditional OP of region $\eta$
(precisely the “input model” defined in [Webb et al., 2019] and
also used in [Weng et al., 2019]).</p>
        <p>Footnote 2: Available at https://github.com/havelhuang/ReAsDL.</p>
        <p>Footnote 3: We assume continuous $\mathcal{X}$ in this paper. For discrete $\mathcal{X}$, the
integral in Eq. (1) reduces to a sum and the OP is a probability mass function.</p>
        <p>Footnote 4: Distance mentioned in this paper is defined in $L_\infty$.</p>
        <p>Footnote 5: Thus, we use the more general term robustness “evaluation”
rather than robustness “verification” throughout the paper.</p>
        <p>We highlight the following two remarks regarding robustness:
Remark 1 (astuteness). Reliability assessment only concerns
the robustness to the ground truth label, rather than an
arbitrary label $y$ in $R_M(\eta, y)$. When $y$ is such a ground truth,
robustness becomes astuteness [Yang et al., 2020], which is
also the conditional reliability in the region $\eta$.</p>
        <p>Astuteness is a special case of robustness (footnote 6). An extreme
example showing why we introduce the concept of astuteness
is: a perfectly robust classifier that always outputs “dog” for
any given input is unreliable. Thus, robustness evidence
cannot directly support reliability claims unless the ground truth
label is used in $R_M(\eta, y)$.</p>
        <p>Remark 2 (r-separation). For real-world image datasets,
any data-points with different ground truth are at least
distance $2r$ apart in the input space $\mathcal{X}$ (i.e., pixel space), and $r$
is bigger than the usual norm ball radius in robustness studies.</p>
        <p>The r-separation property was first observed by [Yang et
al., 2020]: the real-world image datasets studied by the authors
imply that $r$ is normally 3 to 7 times bigger than the radius
(denoted $\epsilon$) of norm balls commonly used in robustness
studies. Intuitively it says that, although the classification
boundary is highly non-linear, there is a minimum distance between
two real-world objects of different classes (cf. Figure 1 for a
conceptual illustration). Moreover, such a minimum distance
is bigger than the usual norm ball size in robustness studies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A RAM for Deep Learning Classifiers</title>
      <sec id="sec-3-1">
        <title>The Running Example</title>
        <p>To better demonstrate our RAM, we take the Challenge of
AI Dependability Assessment raised by Siemens
Mobility (footnote 7) as a running example. Basically, the challenge is
first to train a DL model to classify a dataset generated on
the unit square $[0, 1]^2$ according to some unknown
distribution. The collected data-points (training set) are shown in
Figure 2 (lhs). Then we need to build a RAM to claim an
upper bound on the probability that the next random point
is misclassified, i.e. the pmi. If the 2D-points represent
traffic lights, then we have 2 types of
misclassifications: safety-critical ones when a red data-point is labelled green, and
performance-related ones otherwise. For brevity, we only focus on
misclassifications in general here, while our RAM can cope with
sub-types of misclassifications.</p>
        <p>Footnote 6: Thus, later in this paper, we may use “robustness” to mean astuteness
for brevity when it is clear from the context.</p>
        <p>Footnote 7: https://ecosystem.siemens.com/ai-da-sc/</p>
        <p>The Framework. Inspired by [Pietrantuono et al., 2020],
the general idea of our RAM is to partition the input domain
into $m$ small cells, subject to the r-separation property. Then,
for each cell $c_i$ (with a single ground truth $y_i$), we estimate:
$$\lambda_i := 1 - R_M(c_i, y_i) \quad \text{and} \quad Op_i := \sum_{x \in c_i} Op(x) \tag{3}$$
which are the unastuteness and pooled OP of the cell $c_i$,
respectively; we introduce estimators for both later.
Then Eq. (1) can be written as the weighted sum of the
cell-wise unastuteness (i.e. the conditional pmi of each cell; see footnote 8)
where the weights are the pooled OP of cells:
$$\lambda = \sum_{i=1}^{m} Op_i \lambda_i \tag{4}$$
Eq. (4) represents an ideal case in which we know those
$\lambda_i$s and $Op_i$s with certainty. In practice, we can only estimate
them with imperfect estimators yielding, e.g., a point estimate
with variance capturing the measure of trust. To propagate the
confidence in the estimates of $\lambda_i$s and $Op_i$s, we assume:
Assumption 1. All $\lambda_i$s and $Op_i$s are independent unknown
variables under estimation.</p>
        <p>Footnote 8: We use “cell unastuteness” and “cell pmi” interchangeably later.</p>
        <p>Then, the estimate of $\lambda$ and its variance are:
$$E[\lambda] = \sum_{i=1}^{m} E[\lambda_i Op_i] = \sum_{i=1}^{m} E[\lambda_i]\, E[Op_i] \tag{5}$$
$$V[\lambda] = \sum_{i=1}^{m} V[\lambda_i Op_i] = \sum_{i=1}^{m} \left( E[\lambda_i]^2\, V[Op_i] + E[Op_i]^2\, V[\lambda_i] + V[\lambda_i]\, V[Op_i] \right) \tag{6}$$
Note that, for the variance, the covariance terms are dropped
due to the independence assumption.</p>
        <p>Depending on the specific estimators adopted, certain
parametric families of the distribution of $\lambda$ can be assumed, from
which any quantile of interest (e.g. 95%) can be derived as
our confidence bound in reliability. For instance, as readers
will see later, we may assume $\lambda \sim \mathcal{N}(E[\lambda], V[\lambda])$ since all
$\lambda_i$s and $Op_i$s are normally distributed variables after applying
the Central Limit Theorem (CLT) in our chosen estimators.
Then, an upper bound with $1 - \alpha$ confidence is
$$Ub_{1-\alpha} = E[\lambda] + z_{1-\alpha} \sqrt{V[\lambda]} \tag{7}$$
where $Pr(Z \le z_{1-\alpha}) = 1 - \alpha$, and $Z \sim \mathcal{N}(0, 1)$ is a
standard normal distribution.</p>
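        <p>The assembly step can be sketched as follows: given per-cell means and variances of the unastuteness and the pooled OP, the sketch accumulates Eqs. (5) and (6) and then applies the normal-approximation upper bound. The function name assemble_pmi and the tuple layout of a cell are hypothetical choices made for illustration only.</p>
        <preformat><![CDATA[
```python
import math
from statistics import NormalDist


def assemble_pmi(cells, confidence=0.975):
    """Combine per-cell estimates into (E[pmi], V[pmi], upper bound).

    cells: iterable of (e_lam, v_lam, e_op, v_op), i.e. the mean and variance
    of the cell unastuteness and of the cell's pooled OP (hypothetical layout).
    Independence between all cell-level estimates is assumed (Assumption 1).
    """
    mean = 0.0
    var = 0.0
    for e_lam, v_lam, e_op, v_op in cells:
        mean += e_lam * e_op                                          # Eq. (5)
        var += e_lam ** 2 * v_op + e_op ** 2 * v_lam + v_lam * v_op   # Eq. (6)
    z = NormalDist().inv_cdf(confidence)      # standard normal quantile
    return mean, var, mean + z * math.sqrt(var)


if __name__ == "__main__":
    toy_cells = [(0.10, 1e-4, 0.5, 1e-4), (0.30, 1e-4, 0.5, 1e-4)]
    print(assemble_pmi(toy_cells))
```
]]></preformat>
        <p>The returned bound is not clipped to [0, 1]; with very uncertain cell estimates it may need to be truncated before being reported.</p>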
        <p>Now the problem is reduced to how to obtain the
estimates $E[\lambda_i]$ and $V[\lambda_i]$, and likewise $E[Op_i]$ and $V[Op_i]$,
which we discuss next with reference to the running example.</p>
        <p>
          Partition of the Input Domain $\mathcal{X}$. As per Remark 1, the
astuteness evaluation of a cell requires its ground truth label.
To leverage the r-separation property and Assumption 4, we
partition the input space by choosing a cell radius $\epsilon$ so that
$\epsilon &lt; r$. Although we concur with Remark 2
          <xref ref-type="bibr" rid="ref11 ref13 ref23 ref33 ref34 ref36 ref37 ref38">(first observed
by [Yang et al., 2020])</xref>
          and believe that there should exist
an r-stable ground truth (which means that the ground truth
is stable in such a cell) for any real-world DL classification
application, it is hard to estimate such an $r$ (the estimate is denoted by $\hat{r}$)
and the best we can do is to assume:
Assumption 2. There is an r-stable ground truth (as defined
in Remark 2) for any real-world classification problem, and
it can be sufficiently estimated from the existing dataset.
        </p>
        <p>That said, we get $\hat{r} = 0.004013$ by iteratively calculating
the minimum distance between data-points with different labels in the running
example. Then we choose a cell radius (footnote 9) $\epsilon = 0.004$ and partition
the unit square $\mathcal{X}$ into $250 \times 250$ cells.</p>
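        <p>A brute-force version of this step might look like the sketch below. It is only illustrative: the helper names are hypothetical, the pairwise search is quadratic in the dataset size, and halving the minimum distance follows our reading of Remark 2 (differently labelled points are at least 2r apart), which is an assumption of the sketch rather than a description of the tool.</p>
        <preformat><![CDATA[
```python
import math
from itertools import combinations


def linf_distance(a, b):
    """L-infinity distance between two points given as equal-length tuples."""
    return max(abs(u - v) for u, v in zip(a, b))


def estimate_r(points, labels):
    """Half the smallest L-infinity distance between differently labelled points.

    Assumption of this sketch: the observed minimum distance estimates 2r.
    """
    gaps = [
        linf_distance(p, q)
        for (p, lp), (q, lq) in combinations(zip(points, labels), 2)
        if lp != lq
    ]
    if not gaps:
        raise ValueError("need at least two points with different labels")
    return min(gaps) / 2.0


def cell_index(x, eps):
    """Index of the axis-aligned cell of side length eps that contains x."""
    return tuple(math.floor(v / eps) for v in x)


if __name__ == "__main__":
    pts = [(0.10, 0.10), (0.50, 0.10), (0.10, 0.15)]
    lbs = [0, 1, 0]
    r_hat = estimate_r(pts, lbs)
    eps = 0.9 * r_hat  # any cell radius below r_hat would do
    print(r_hat, [cell_index(p, eps) for p in pts])
```
]]></preformat>
        <p>Grouping data-points by cell_index then gives the per-cell label sets needed to tell apart normal, empty and cross-boundary cells.</p>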
        <p>Cell OP Approximation. Given a dataset $(X, Y)$, we
estimate the pooled OP of cell $c_i$ to get $E[Op_i]$ and $V[Op_i]$. We
use the well-established Kernel Density Estimation (KDE) to
fit a $\widehat{Op}(x)$ to approximate the OP.</p>
        <p>Assumption 3. The existing dataset $(X, Y)$ is randomly
sampled from the OP, and thus statistically represents the OP.
This assumption may not hold in practice: training data is
normally collected in a balanced way, since the DL model
is expected to perform well in all categories of inputs,
especially when the OP is unknown at the time of training and/or
expected to change in future. Although our model can
relax this assumption (cf. Section 5), we adopt it for brevity in
demonstrating the running example.</p>
        <p>Then given a set of (unlabelled) data-points $(X_1, \ldots, X_n)$
from the existing dataset $(X, Y)$, KDE yields
$$\widehat{Op}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right) \tag{8}$$
where $K$ is the kernel function (e.g. Gaussian or
exponential kernels), and $h &gt; 0$ is a smoothing parameter called the
bandwidth, cf. [Silverman, 1986] for guidelines on tuning $h$.
The approximated OP (footnote 10) is shown in Figure 2 (rhs).</p>
        <p>Footnote 9: Radius in $L_\infty$, which is the side length of our square cell in $L_2$.</p>
        <p>Since our cells are small and all of equal size, instead of
calculating $\int_{x \in c_i} \widehat{Op}(x)\, dx$, we may approximate $Op_i$ as
$$\widehat{Op}_i = \widehat{Op}(x_{c_i})\, v_c \tag{9}$$
where $\widehat{Op}(x_{c_i})$ is the probability density at the cell’s central
point $x_{c_i}$, and $v_c$ is the constant cell volume ($1.6 \times 10^{-5}$ in the
running example).</p>
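        <p>As an illustration of the two steps above, the sketch below evaluates a KDE at a cell's central point and multiplies it by the cell volume. It is a plain-Python stand-in and not the implementation in our tool: the function names are hypothetical, and the d-dimensional form with a single scalar bandwidth and an isotropic Gaussian kernel is an assumed generalisation of Eq. (8).</p>
        <preformat><![CDATA[
```python
import math


def gaussian_kde(x, data, h):
    """KDE at point x with an isotropic Gaussian kernel and scalar bandwidth h.

    Assumed d-dimensional generalisation of Eq. (8):
    (1 / (n * h**d)) * sum_j K((x - X_j) / h).
    """
    d = len(x)
    norm = (2.0 * math.pi) ** (-d / 2.0)  # Gaussian normalising constant
    total = 0.0
    for xj in data:
        sq = sum(((a - b) / h) ** 2 for a, b in zip(x, xj))
        total += norm * math.exp(-0.5 * sq)
    return total / (len(data) * h ** d)


def cell_op(centre, data, h, cell_volume):
    """Pooled OP of a cell: density at the cell centre times the cell volume."""
    return gaussian_kde(centre, data, h) * cell_volume


if __name__ == "__main__":
    sample = [(0.45, 0.50), (0.55, 0.52), (0.20, 0.80)]
    print(cell_op((0.5, 0.5), sample, h=0.2, cell_volume=1.6e-5))
```
]]></preformat>
        <p>For large datasets a library KDE with tree-based queries would normally replace the explicit loop; the loop is kept here only to make the formula visible.</p>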
        <p>Now if we introduce new variables $W_j = \frac{1}{h} K\!\left(\frac{x - X_j}{h}\right)$,
the KDE evaluated at $x$ is actually the sample mean of
$W_1, \ldots, W_n$. Then by CLT, we have $\widehat{Op}(x) \sim \mathcal{N}(\mu_W, \sigma_W^2 / n)$
where the mean and variance of $\widehat{Op}(x)$ are known results:
$$E[\widehat{Op}(x)] = \frac{1}{nh} \sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right) \tag{10}$$
$$V[\widehat{Op}(x)] = \frac{f(x) \int K^2(u)\, du}{nh} + O\!\left(\frac{1}{nh}\right) \approx \hat{\sigma}_B^2(x) \tag{11}$$
where the last step of Eq. (11) says that $V[\widehat{Op}(x)]$ can be
approximated using a bootstrap variance $\hat{\sigma}_B^2(x)$ [Chen, 2017]
(cf. Appendix A for details).</p>
        <p>Upon establishing Eqs. (10) and (11), together with Eq. (9),
we know for a given cell $c_i$ (knowing its central point $x_{c_i}$):
$$E[Op_i] = v_c\, E[\widehat{Op}(x_{c_i})], \qquad V[Op_i] = v_c^2\, V[\widehat{Op}(x_{c_i})] \tag{12}$$
which are the cell OP estimates for Eqs. (5) and (6).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Cell Astuteness Evaluation</title>
        <p>As a corollary of Remark 2 and Assumption 2, we may confidently assume:
Assumption 4. If the radius of $c_i$ is smaller than $r$, all
data-points in the region $c_i$ share a single ground truth label.</p>
        <p>Now, to determine the ground truth label of a cell $c_i$, we
can classify our cells into three types:</p>
        <p>a) Normal cells: a normal cell contains data-points sharing
the same ground truth label, which is then determined as the
ground truth label of the cell.</p>
        <p>b) Empty cells: a cell is “empty” in the sense that no
data-point has been observed in it. Due to the lack of data, it is
hard to determine an empty cell’s ground truth. For now, we
do voting based on the predicted labels (by the DL model) of
random samples from the cell, assuming:
Assumption 5. The accuracy of the DL model is better than
a classifier doing random classifications in any given cell.
Essentially, the above assumption relates to the oracle
problem of DL testing, and some recent efforts, e.g.
[Guerriero, 2020], may relax it.</p>
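        <p>The voting step for an empty cell can be sketched as below. The model is any callable returning a label, and the function name, the default sample count and the uniform sampling inside the cell are assumptions made for the illustration.</p>
        <preformat><![CDATA[
```python
import random
from collections import Counter


def vote_ground_truth(model, lower, upper, n_samples=100, rng=None):
    """Majority vote of model predictions over random points in a cell.

    lower/upper: per-dimension bounds of the axis-aligned cell.
    Points are drawn uniformly inside the cell (an assumption of this sketch).
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    votes = Counter()
    for _ in range(n_samples):
        x = tuple(rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
        votes[model(x)] += 1
    # most_common(1) returns [(label, count)] for the most frequent label
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    def toy_model(x):
        return "red" if x[0] > 0.5 else "green"

    print(vote_ground_truth(toy_model, lower=(0.6, 0.0), upper=(0.7, 0.1)))
```
]]></preformat>
        <p>The vote is only as trustworthy as Assumption 5: in a region where the model is mostly wrong, the majority label is wrong as well.</p>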
        <p>c) Cross-boundary cells: our estimate of $r$ is imperfect,
thus we may still observe data-points with different labels in
one cell. Such cells are crossing the classification boundary.
If our estimate of $r$ is sufficiently accurate, they should be
very rare. Thus, without the need to determine the ground
truth label of a cross-boundary cell, we simply and
conservatively set the cell unastuteness to 1.</p>
        <p>Footnote 10: With a Gaussian kernel and $h = 0.2$, which is optimised by
cross-validated grid search [Bergstra and Bengio, 2012].</p>
        <p>So far, the problem is reduced to: given a normal or empty
cell $c_i$ with the known ground truth label $y_i$, evaluate the
misclassification probability upon a random input $x \in c_i$,
i.e. $E[\lambda_i]$ and its variance $V[\lambda_i]$. This is essentially a
statistical problem that has been studied in [Webb et al., 2019]
using Multilevel Splitting Sampling, while we use the Simple
Monte Carlo method for brevity in the running example:
$$\hat{\lambda}_i = \frac{1}{n} \sum_{j=1}^{n} I_{\{M(x_j) \neq y_i\}}$$
The CLT tells us $\hat{\lambda}_i \sim \mathcal{N}(\mu, \sigma^2 / n)$ when $n$ is large, where $\mu$
and $\sigma^2$ are the population mean and variance of $I_{\{M(x_j) \neq y_i\}}$,
which can be approximated with the sample mean $\hat{\mu}_n$ and sample
variance $\hat{\sigma}_n^2$. Finally, we can get
$$E[\lambda_i] = \hat{\mu}_n, \qquad V[\lambda_i] = \frac{\hat{\sigma}_n^2}{n}$$</p>
        <p>
          Notably, to solve the above statistical problem with
sampling methods, we need to assume how the inputs in the
cell are distributed, i.e., a distribution for the conditional OP
$Op(x \mid x \in c_i)$. Without loss of generality, we assume:
Assumption 6. The inputs in a small region like a cell are
uniformly distributed.
This assumption is not uncommon
          <xref ref-type="bibr" rid="ref1 ref1 ref16 ref16 ref18 ref18 ref19 ref19 ref31 ref31 ref32 ref32 ref35 ref35 ref5 ref5">(e.g., in [Webb et al., 2019; Weng et
al., 2019])</xref>
          and can be easily replaced by other distributions if
there is supporting evidence for such action.
        </p>
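        <p>A Simple Monte Carlo estimate of a cell's unastuteness under uniform sampling might be sketched as follows. The function name and defaults are hypothetical, the model is any callable returning a label, and the variance returned is the sample variance of the 0/1 misclassification indicator divided by n, in line with the CLT argument above.</p>
        <preformat><![CDATA[
```python
import random


def cell_unastuteness(model, label, lower, upper, n=1000, rng=None):
    """Simple Monte Carlo estimate of (E[lambda_i], V[lambda_i]) for one cell.

    Inputs are drawn uniformly in the axis-aligned cell [lower, upper]
    (Assumption 6). Requires n >= 2.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    fails = 0
    for _ in range(n):
        x = tuple(rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
        if model(x) != label:
            fails += 1
    mean = fails / n                                   # sample mean of the indicator
    sample_var = mean * (1.0 - mean) * n / (n - 1)     # unbiased variance of a 0/1 variable
    return mean, sample_var / n                        # variance of the sample mean


if __name__ == "__main__":
    def toy_model(x):
        return 1 if x[0] > 0.5 else 0

    print(cell_unastuteness(toy_model, 1, lower=(0.4, 0.0), upper=(0.6, 0.1)))
```
]]></preformat>
        <p>For highly robust cells, where misclassifications are rare events, plain sampling of this kind needs very large n, which is the motivation for multilevel splitting in [Webb et al., 2019].</p>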
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Case Studies</title>
      <p>In addition to the running example, we conduct experiments
on two synthetic datasets as shown in Figure 3, representing
the scenarios with sparse and dense training data respectively.
All modelling details and results after applying our RAM
on those three datasets are summarised in Table 1, based on
which we compare the testing error, Average Cell
Unastuteness (ACU) and our RAM results ($E[\lambda]$ and $Ub_{97.5\%}$).</p>
      <p>In the running example, we first observe that the ACU
is much lower than the test error, meaning the underlying
DL model is a robust one. Since our RAM is mainly based
on the robustness evidence, its results are close to ACU but
not exactly the same because of the nonuniform OP, cf.
Figure 2 (rhs). Moreover, from Figure 2 (lhs), we know the
classification boundary is near the middle of the unit square input
space where misclassifications tend to happen (say “buggy
area”), which is also the high density area on the OP. Thus,
the contribution to unreliability from the “buggy area” is
weighted higher by the OP, which explains why our RAM
results are worse than ACU. In contrast, because of the “flat”
OP in the DS-1 (cf. Figure 3 (lhs)), our RAM results are
very close to the ACU. With more dense data in DS-2, the
r-distance is much smaller and leads to more cells. Thanks to
the rich data in this case, all three results are more consistent.
We note that, given the nature of the three 2D-point datasets,
DL models trained on them are much more robust than image
datasets. This is why all ACUs are better than test errors, and
our RAM finds a middle point representing reliability
according to the OP. Later we apply the RAM on two unrobust DL
models trained on image-datasets where the ACUs are worse
than test error; it confirms our aforementioned observations.</p>
      <p>To gain insights on how to extend our method to
high-dimensional/real-world datasets, we also conduct
experiments on the popular MNIST and CIFAR10 datasets. Instead
of implementing the exact steps in Section 3.2, we take a few
compromised solutions to tackle the scalability issues raised
by “the curse of dimensionality”. We articulate these steps in
the following paragraph, while detailed discussions on their
impact on our results are presented in Section 5.</p>
      <p>First, we train Variational Auto-Encoders (VAE) on the
MNIST and CIFAR10 datasets and project all inputs into the
low-dimensional latent spaces of the VAE (with 8 and 16
dimensions respectively). Then we apply the proposed RAM
on the compressed dataset, i.e., partitioning the latent space,
learning the OP in latent space and evaluating the
“latent-cell unastuteness”. Astuteness (a special case of robustness)
is by definition associated with the input pixel space. By
“latent-cell unastuteness”, we mean the average unastuteness
of norm balls (in the input space) around a large number of
samples from a “latent-cell”. The norm ball radius is
determined by the r-separation distance in the input space. Taking
the computational cost into consideration, we rank the OP of
all latent-cells, and choose the top $k$ cells with the highest OP
for astuteness evaluation. We adopt the existing robustness
estimator in [Webb et al., 2019], where the authors omitted
the result of $V[\lambda_i]$; we therefore also omit the variance in our
experiments for simplicity.</p>
      <p>In this section (Section 5), we summarise the model assumptions made
in our RAM, and discuss if/how they can be validated and
what new assumptions/compromised solutions are needed to
cope with high-dimensional/real-world applications.
Finally, we list the inherent difficulties of assessing DL
uncovered by our RAM.</p>
      <p>Independent $\lambda_i$s and $Op_i$s. As per Assumption 1, we
assume all $\lambda_i$s and $Op_i$s are independent when “assembling”
their estimates via Eq. (5) and deriving the variance via
Eq. (6). Largely, this assumption is for mathematical
tractability when propagating the confidence in individual
estimates at the cell level to the whole-system pmi. Although
this independence assumption is hard to justify in practice,
it is not unusual in reliability models that use partitioning, e.g.
in [Pietrantuono et al., 2020; Miller et al., 1992]. We
believe that RAMs are still useful as a first approximation under
this assumption, while we envisage that Bayesian estimators
leveraging joint priors and conjugacy may relax it.</p>
      <p>R-separation and its estimation. Assumption 2 derives
from Remark 2. We concur with [Yang et al., 2020] and
believe that, for any real-world DL classification application
where the inputs are data-points with “physical meanings”,
there should always exist an r-stable ground truth. Such an
$r$ varies between applications, and the smaller the $r$ is, the
harder the classification problem inherently is;
i.e., $r$ is a difficulty indicator for the given classification
problem.</p>
      <p>For real-world applications, what really determines the
label of an image are its features rather than its pixels. Thus,
we envisage that some latent space (of, e.g., a VAE) capturing
only the feature-wise information can be explored for
high-dimensional data. That is, we
• first do r-separation based partition in the latent space to
learn the OP;
• then determine the ground truth labels of cells in the
latent space;
• map the learned OP and ground truth labels back to the
input pixel space;
• do astuteness evaluation in the input pixel space and
“assemble” the results according to the OP.</p>
      <p>Indeed, it is hard to estimate $r$ (in either the input or
the latent space), and the best we can do is to estimate it
from the existing dataset. One way of addressing the problem
is to keep monitoring the $r$ estimates as more labelled data is
collected, and redo the cell partition when the estimated $r$ has
changed significantly.</p>
      <p>Approximation of the OP. Assumption 3 says that the
collected dataset statistically represents the OP, which may not
hold for many practical reasons; e.g., the future OP is
uncertain at the training stage and thus data is collected in a
balanced way to perform well in all categories of inputs.
Although we demonstrate our RAM under this assumption for
simplicity, it can be easily relaxed. Essentially, we try to
fit a density function over the input space from an
“operational dataset” (representing the OP). Data-points in this set
can be unlabelled raw data generated from historical data of
previous applications or from simulations, and manually scaled based
on expert knowledge. Obtaining such an operational dataset
is an application-specific engineering problem, and tractable
thanks to the fact that it does not require labelled data.</p>
      <p>Notably, the OP may also be approximated at runtime
based on the data stream of operational data. Efficient KDE
for data streams [Qahtan et al., 2017] can be used. If the OP
was subject to sudden changes, change-point detectors like
[Zhao et al., 2020b] should also be paired with the runtime
estimator to robustly approximate the OP.</p>
      <p>However, we may encounter technical challenges when
fitting the PDF from high-dimensional real-world datasets.
There are two known major challenges when applying
multivariate KDE to high-dimensional data: i) the choice of
the bandwidth matrix H, which plays the role of the kernel's covariance
matrix and has the largest impact on estimation accuracy; ii) scalability
issues in terms of storing intermediate data structures (e.g. data-points in
hash tables) and the query time when estimating the density
at a given input. For the first challenge, the bandwidth matrix
can be chosen by a rule of thumb
[Silverman, 1986; Scott, 2015] or by cross-validation [Bergstra
and Bengio, 2012]. For the second, there is dedicated research on
improving the efficiency of multivariate KDE; e.g., [Backurs
et al., 2019] presented a framework for multivariate KDE with
provably sub-linear query time, with linear space and linear
pre-processing time in the dimensions.</p>
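      <p>As a rough illustration of the rule-of-thumb option for the first challenge, the sketch below computes per-dimension bandwidths for a Gaussian product kernel using the standard scaling factors n^(-1/(d+4)) (Scott) and (n(d+2)/4)^(-1/(d+4)) (Silverman). It assumes a diagonal bandwidth matrix H, which is a simplification and not necessarily the setup used in our experiments; the function names and the synthetic two-dimensional sample are purely illustrative.</p>
      <preformat>
```python
import math
import random
import statistics


def scott_bandwidths(data):
    """Scott's rule: h_j = sigma_j * n ** (-1 / (d + 4)) for each dimension j."""
    n, d = len(data), len(data[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev(col) * factor for col in zip(*data)]


def silverman_bandwidths(data):
    """Silverman's rule: h_j = sigma_j * (n * (d + 2) / 4) ** (-1 / (d + 4))."""
    n, d = len(data), len(data[0])
    factor = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
    return [statistics.stdev(col) * factor for col in zip(*data)]


def kde_density(x, data, bandwidths):
    """Gaussian product-kernel KDE at point x (diagonal bandwidth matrix)."""
    norm = math.prod(h * math.sqrt(2.0 * math.pi) for h in bandwidths)
    total = 0.0
    for point in data:
        expo = sum(((xi - pi) / h) ** 2 for xi, pi, h in zip(x, point, bandwidths))
        total += math.exp(-0.5 * expo)
    return total / (len(data) * norm)


# Synthetic "operational dataset": 500 unlabelled points from a 2-D standard normal.
rng = random.Random(0)
sample = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]

h = scott_bandwidths(sample)
density_at_origin = kde_density((0.0, 0.0), sample, h)
print("bandwidths:", h, "estimated density at origin:", density_at_origin)
```
      </preformat>
      <p>For d = 2 the two rules give the same bandwidths; they diverge in other dimensions. Neither addresses the scalability challenge: the naive evaluation above touches every data-point on each query.</p>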
      <sec id="sec-4-1">
        <title>Determination of the ground truth of a cell</title>
        <p>Assumptions 4 and 5 essentially concern how to determine the ground truth
label for a given cell, which relates to the oracle problem of
testing DL [Guerriero, 2020]. While this remains challenging, we
partially solve it by leveraging the r-separation property.</p>
        <p>Thanks to r-separation, it is easy to determine a cell’s ground truth
when the cell contains labelled data-points. For
an empty cell, however, it is non-trivial. We assume that the overall
performance of the DL model is fairly good (e.g., better than a
classifier making random classifications), so that misclassifications
within an empty cell are relatively rare events. Then we can
determine the ground truth label of the cell by majority
voting over the predictions. Admittedly, this is a strong assumption when
there are “failure regions” in the input space in which the model
performs really badly (even worse than random labelling). In this
case, we need to invent a new mechanism to detect such
“really bad failure regions” and spend more budget on invoking,
say, humans to do the labelling.</p>
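        <p>The majority-voting idea for an empty cell can be sketched as follows. This is a minimal illustration under stated assumptions: the cell is taken to be a hyper-rectangle, inputs are drawn uniformly from it, and predict is a hypothetical stand-in for the trained DL model; none of these names come from our implementation.</p>
        <preformat>
```python
import random
from collections import Counter


def majority_label(predict, lower, upper, n_samples=200, seed=0):
    """Vote over model predictions at points drawn uniformly from a cell.

    predict      -- hypothetical callable mapping an input (list of floats) to a label
    lower, upper -- per-dimension bounds of the (hyper-rectangular) cell
    Returns the winning label and its share of the votes.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        votes[predict(x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_model(x):
    # Toy classifier standing in for the DL model: label 1 above the line x0 + x1 = 1.
    return 1 if x[0] + x[1] > 1.0 else 0


# An "empty" cell lying entirely on one side of the toy decision boundary.
label, share = majority_label(toy_model, lower=[0.6, 0.6], upper=[0.7, 0.7])
print("assumed ground truth:", label, "vote share:", share)
```
        </preformat>
        <p>Note that a unanimous vote does not rule out a failure region in which the model is consistently wrong, which is exactly the case where the assumption above breaks down.</p>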
        <p>Conditional OP of a cell. We assume that the distribution of
inputs (i.e., the conditional OP) within each cell is uniform
by Assumption 6. Although we conjecture that this is the
common case due to the small size of cells (i.e., very
close/similar inputs within a small region are only subject to
noise factors that can be modelled uniformly), the real
situation may vary; this requires justification in safety cases.</p>
        <p>For a real-world dataset, the conditional OP reflects
“natural variations” [Zhong et al., 2021],
e.g. lighting conditions, which obey certain distributions. The
conditional OP of cells should faithfully capture the distribution
of such natural variations. Recent advances in measuring how
natural/realistic AEs are [Harel-Canada et al., 2020] are closely
related to this assumption and may help relax it.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Explosion of the number of cells</title>
        <p>The number of cells for which astuteness must be evaluated is exponential in the dimension of the
data. For high-dimensional data, it is impossible to explore all
cells in the input space<sup>11</sup> as we did for the running example.</p>
        <p>A compromise solution is to find the first k cells that
dominate the OP. That is, we rank the cells by their pooled OP, and
only evaluate the top-k cells, where k is chosen so that the sum of these k cells’
OPs exceeds a threshold, e.g. 99%. Then, we can
conservatively set the cell pmi of the remaining cells to a worst-case bound
(e.g. 1) or to an empirical/average bound based on the first k
cells. The price to pay is estimation
accuracy; the best we can do for now is to increase the budget
to allow a larger k. Technically, finding the first k cells
dominating the OP amounts to calculating the modes of the KDE
function. The work of [Lee et al., 2019] gives a hint on how to
quickly calculate the modes of a Gaussian KDE when the data
dimension is high.</p>
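        <p>The top-k compromise can be sketched as below. The sketch assumes that the overall estimate is the OP-weighted sum of the cell pmi values, and that the OP of each cell is already available as a dictionary; the cell names, OP values and per-cell pmi values are made up for illustration, and the worst-case bound of 1 is applied to the unevaluated tail.</p>
        <preformat>
```python
def select_top_k_cells(cell_op, threshold=0.99):
    """Return the highest-OP cells whose cumulative OP first reaches the threshold."""
    ranked = sorted(cell_op.items(), key=lambda item: item[1], reverse=True)
    chosen, mass = [], 0.0
    for cell, op in ranked:
        if mass >= threshold:
            break
        chosen.append(cell)
        mass += op
    return chosen, mass


def conservative_pmi(cell_op, cell_pmi, threshold=0.99, tail_bound=1.0):
    """OP-weighted pmi over the top-k cells plus a bound for the unevaluated rest.

    cell_pmi   -- callable returning the evaluated pmi of a cell
    tail_bound -- pmi assumed for every cell outside the top-k (1.0 is worst case)
    """
    chosen, mass = select_top_k_cells(cell_op, threshold)
    evaluated = sum(cell_op[c] * cell_pmi(c) for c in chosen)
    return evaluated + (1.0 - mass) * tail_bound


# Illustrative numbers only: four cells, the last one carries 0.5% of the OP.
cell_op = {"c1": 0.6, "c2": 0.3, "c3": 0.095, "c4": 0.005}
measured = {"c1": 0.01, "c2": 0.02, "c3": 0.10}

chosen, mass = select_top_k_cells(cell_op)
estimate = conservative_pmi(cell_op, measured.get)
print("evaluated cells:", chosen, "conservative pmi:", estimate)
```
        </preformat>
        <p>Here only three of the four cells are evaluated, and the remaining 0.5% of OP mass contributes its full weight to the estimate. Replacing tail_bound with an average of the evaluated cells gives the less conservative, empirical variant.</p>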
        <p>This discussion relates to the cost of our RAM, so a
pertinent question is: what is the real cost of conducting DL
testing? Is it the human labour of generating labels, or timing
constraints? A likely answer is: both. Our RAM has partially
solved the former (cf. earlier discussions), while the latter is
less costly nowadays and can be addressed by harnessing the fast
growth of computational power and parallel computing.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Efficiency of cell robustness evaluation</title>
        <p>We have demonstrated the use of the Simple Monte Carlo method to evaluate cell
robustness in the running example. It is well known that
Simple Monte Carlo is not a computationally efficient
technique for estimating rare events (such as AEs in our case) in
high-dimensional space. Thus, instead of Simple
Monte Carlo, a more advanced and efficient sampling
approach, the Adaptive Multi-level Splitting method [Webb et
al., 2019], has been applied in our case studies on image
datasets. We are confident that other statistical sampling
methods designed for rare events may also meet our needs.</p>
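        <p>To make the inefficiency concrete, the following sketch implements a plain Simple Monte Carlo estimator of a cell's misclassification probability (not the Adaptive Multi-level Splitting method used in the case studies). It assumes a hyper-rectangular cell with a uniform conditional OP; toy_model and its failure region are hypothetical and chosen only so that failures are rare.</p>
        <preformat>
```python
import random


def smc_cell_pmi(predict, true_label, lower, upper, n_samples=10000, seed=0):
    """Fraction of uniformly sampled inputs in the cell that are misclassified."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        if predict(x) != true_label:
            failures += 1
    return failures / n_samples


def toy_model(x):
    # Misclassifies only a thin corner of the unit square (area 0.01 * 0.1 = 0.001).
    return 1 if (x[0] > 0.99 and x[1] > 0.9) else 0


estimate = smc_cell_pmi(toy_model, true_label=0, lower=[0.0, 0.0], upper=[1.0, 1.0])
print("Simple Monte Carlo estimate of the cell pmi:", estimate)
```
        </preformat>
        <p>With a true failure probability of 0.001, ten thousand samples hit the failure region only about ten times on average, so the relative error of the estimate is large; rarer events would need proportionally more model queries, which motivates the rare-event samplers mentioned above.</p>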
        <p>In addition to the statistical approach, formal-methods-based
verification techniques can also be applied to assess a cell’s
pmi, e.g. [Huang et al., 2017]. They provide formal
guarantees on whether the DL model will misclassify any input
inside a small region. Although such a “robust region” proved
by formal methods is normally smaller than our cells, the estimate
for cell c<sub>i</sub> can be conservatively set to the proportion of the robust region
covered in c<sub>i</sub> in this case.<fn><label>11</label><p>Although dimension reduction methods like VAE may ease the
problem of learning the OP, they cannot reduce the number of cells to
be evaluated, since robustness by definition has to be evaluated in
the input space.</p></fn></p>
        <p>We would like to note that the cell robustness estimator
in our RAM works in a “hot-swappable” manner: any new
and more efficient robustness estimator can be easily
incorporated. Thus, how to improve the efficiency of cell’s
robustness estimation is out of the scope of our RAM.</p>
        <p>Inherent difficulties. Finally, based on our RAM and the
discussions above, we summarise the inherent difficulties of
assessing DL reliability as the following questions:
• How to accurately build the OP in the high-dimensional
input space?
• How to build an accurate oracle leveraging the existing
human-labels in the training dataset?
• What is the local distribution (conditional OP) over a
small input region that captures the natural variations of
physical conditions?
• How to efficiently evaluate the robustness of a small
region given AEs are rare events?
• How to sample small regions from a large population
(high-dimensional space) to test robustness in an
unbiased and efficient way?</p>
        <p>We try to provide preliminary/compromise solutions in
our RAM, while the questions remain challenging in
practice. At this stage, we doubt the existence of other DL RAMs
that achieve the same level of rigour as ours under weaker
assumptions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion &amp; Future Work</title>
      <p>In this paper, we present a preliminary RAM for DL
classifiers. It is the first DL RAM that explicitly considers both the
OP information and robustness evidence. It uncovers some
inherently difficult questions in assessing DL reliability,
for which preliminary/compromise solutions are discussed,
implemented and demonstrated with case studies.</p>
      <p>An intuitive way of perceiving our RAM, compared with
the usual accuracy testing, is that we enlarge the testing
dataset with more test cases around “seeds” (original
data-points in the test set). We determine the oracle of a new test
case according to its seed’s label and the r-distance. These
additional test results form the robustness evidence, and how
much they contribute to the overall reliability is proportional
to their OP. Consequently, because it draws on more tests (robustness
evaluation) that are more representative of how the model will be
used (the OP), our RAM is more trustworthy.</p>
      <p>In line with the gist of our RAM, we believe that DL
reliability should follow the conceptualised equation:
<disp-formula>DL reliability = generalisability × robustness.</disp-formula>
In a nutshell, when assessing DL reliability, we should
consider not only how the model generalises to a new data-point
(according to the future OP), but also the local robustness around
it. In line with this insight, a “naive/over-simplified”
version of our RAM would average the local astuteness
of all data-points in the test set, which is less rigorous (e.g., in
determining the norm ball size) and requires stronger
assumptions (e.g., that the test set is equal to the operational set).</p>
      <p>
        Improving the scalability of our RAM and experimenting
with more real-world datasets form important future work.
We presume a trained DL model for our assessment purpose.
A natural question next is how to actually improve the
reliability when our RAM results are not good enough. As
described in [Zhao et al., 2021], we plan to investigate DL
debug testing
        <xref ref-type="bibr" rid="ref14 ref2 ref39 ref40 ref8">(e.g. [Huang et al., 2021])</xref>
        and retraining methods
[Bai et al., 2021], together with the RAM, to form a closed
loop of debugging-improving-assessing.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments &amp; Disclaimer</title>
      <p>This project has received funding from the European
Union’s Horizon 2020 research and innovation programme
under grant agreement No 956123. This work is partially
supported by the UK EPSRC (through the Offshore Robotics for
Certification of Assets [EP/R026173/1] and End-to-End
Conceptual Guarding of Neural Architectures [EP/T026995/1])
and the UK Dstl (through the project of Safety Argument
for Learning-enabled Autonomous Underwater Vehicles).
Xingyu Zhao and Alec Banks’ contribution to the work is
partially supported through Fellowships at the Assuring
Autonomy International Programme. We thank Lorenzo Strigini
for insightful comments on earlier versions of the paper.</p>
      <p>This document is an overview of UK MOD (part)
sponsored research and is released for informational purposes
only. The contents of this document should not be
interpreted as representing the views of the UK MOD, nor should
it be assumed that they reflect any current or future UK
MOD policy. The information contained in this document
cannot supersede any statutory or contractual requirements
or liabilities and is offered without prejudice or
commitment. Content includes material subject to © Crown
copyright (2018), Dstl. This material is licensed under the terms of
the Open Government Licence except where otherwise stated.
To view this licence, visit
http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3
or write to the Information Policy Team, The National Archives, Kew, London
TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk.</p>
    </sec>
    <sec id="sec-7">
      <title>KDE bootstrapping</title>
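      <p>Bootstrapping is a statistical approach to estimating a
sampling distribution by random resampling. We sample
with replacement from the original data points <inline-formula><tex-math>(X, Y)</tex-math></inline-formula> to
obtain a new bootstrap dataset <inline-formula><tex-math>(X^b, Y^b)</tex-math></inline-formula> and train the KDE on
the bootstrap dataset. Assume the bootstrap process is
repeated B times, leading to B bootstrap KDEs, denoted as
<inline-formula><tex-math>\widehat{Op}^{1}(x), \dots, \widehat{Op}^{B}(x)</tex-math></inline-formula>. Then we can estimate the variance of
<inline-formula><tex-math>\hat{f}(x)</tex-math></inline-formula> by the sample variance of the bootstrap KDEs [Chen,
2017]:
<disp-formula><tex-math>\hat{\sigma}_B^2(x) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \widehat{Op}^{b}(x) - \hat{\mu}_B(x) \right)^2</tex-math></disp-formula>
where <inline-formula><tex-math>\hat{\mu}_B(x)</tex-math></inline-formula> can be approximated by
<disp-formula><tex-math>\hat{\mu}_B(x) = \frac{1}{B} \sum_{b=1}^{B} \widehat{Op}^{b}(x)</tex-math></disp-formula></p>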
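      <p>A minimal sketch of this bootstrap procedure is given below for a one-dimensional Gaussian KDE. It is illustrative only: the fixed bandwidth, the number of bootstrap rounds and the synthetic data are assumptions, and the variance uses the usual sample-variance denominator B - 1.</p>
      <preformat>
```python
import math
import random
import statistics


def gaussian_kde_1d(x, data, h):
    """Gaussian KDE evaluated at x with a fixed bandwidth h."""
    coef = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return coef * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)


def bootstrap_kde_variance(x, data, h, n_boot=200, seed=0):
    """Mean and sample variance of the KDE at x over n_boot bootstrap datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # sampling with replacement
        estimates.append(gaussian_kde_1d(x, resample, h))
    mean = statistics.fmean(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return mean, variance


# Synthetic one-dimensional data standing in for the original data points.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]

mean, variance = bootstrap_kde_variance(0.0, data, h=0.3)
print("bootstrap mean:", mean, "bootstrap variance:", variance)
```
      </preformat>
      <p>Each bootstrap round refits the KDE on a resampled dataset of the same size, so the cost grows linearly with the number of rounds; in high dimensions the per-round KDE evaluation dominates.</p>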
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Backurs et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Arturs</given-names>
            <surname>Backurs</surname>
          </string-name>
          , Piotr Indyk, and
          <string-name>
            <given-names>Tal</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <article-title>Space and time efficient kernel density estimation in high dimensions</article-title>
          .
          <source>In NeurIPS'19</source>
          , pages
          <fpage>15773</fpage>
          -
          <lpage>15782</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Bai et al.,
          <year>2021</year>
          ]
          <string-name>
            <given-names>Tao</given-names>
            <surname>Bai</surname>
          </string-name>
          , Jinqi Luo,
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bihan</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qian</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Recent Advances in Adversarial Training for Adversarial Robustness</article-title>
          .
          <source>In IJCAI'21</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bergstra and Bengio,
          <year>2012</year>
          ]
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>J. of Machine Learning Research</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Bishop et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Bishop</surname>
          </string-name>
          , Robin Bloomfield, Bev Littlewood, Andrey Povyakalo, and
          <string-name>
            <given-names>David</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>Toward a formalism for conservative claims about the dependability of software-based systems</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          ,
          <volume>37</volume>
          (
          <issue>5</issue>
          ):
          <fpage>708</fpage>
          -
          <lpage>717</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Bloomfield et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Bloomfield</surname>
          </string-name>
          , Heidy Khlaaf, Philippa Ryan Conmy, and
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Fletcher</surname>
          </string-name>
          .
          <article-title>Disruptive innovations and disruptive assurance: Assuring machine learning and autonomy</article-title>
          .
          <source>Computer</source>
          ,
          <volume>52</volume>
          (
          <issue>9</issue>
          ):
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Carlini and Wagner,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Carlini</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <article-title>Towards Evaluating the Robustness of Neural Networks</article-title>
          .
          <source>In IEEE Symp. on Security and Privacy (SP)</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>57</lpage>
          , San Jose, CA, USA,
          <year>2017</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Chen,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Yen-Chi</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>A tutorial on kernel density estimation and recent advances</article-title>
          .
          <source>Biostatistics &amp; Epidemiology</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>187</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Guerriero et al.,
          <year>2021</year>
          ] Antonio Guerriero, Roberto Pietrantuono, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <article-title>Operation is the hardest teacher: estimating DNN accuracy looking for mispredictions</article-title>
          .
          <source>In ICSE'21</source>
          , Madrid, Spain,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Guerriero,
          <year>2020</year>
          ] Antonio Guerriero.
          <article-title>Reliability Evaluation of ML systems, the oracle problem</article-title>
          .
          <source>In ISSREW'20</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>130</lpage>
          , Coimbra, Portugal,
          <year>2020</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Hamlet and Taylor,
          <year>1990</year>
          ]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hamlet</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Partition testing does not inspire confidence</article-title>
          .
          <source>IEEE Tran. on Software Engineering</source>
          ,
          <volume>16</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1402</fpage>
          -
          <lpage>1411</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Harel-Canada et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Fabrice</given-names>
            <surname>Harel-Canada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lingxiao</given-names>
            <surname>Wang</surname>
          </string-name>
          , Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim.
          <article-title>Is neuron coverage a meaningful measure for testing deep neural networks?</article-title>
          <source>In ESEC/FSE'20</source>
          , pages
          <fpage>851</fpage>
          -
          <lpage>862</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Huang et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Huang</surname>
          </string-name>
          , Marta Kwiatkowska,
          <string-name>
            <given-names>Sen</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Min</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Safety verification of deep neural networks</article-title>
          .
          <source>In CAV'17</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>29</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Huang et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Huang</surname>
          </string-name>
          , Daniel Kroening, Wenjie Ruan, et al.
          <article-title>A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability</article-title>
          .
          <source>Computer Science Review</source>
          ,
          <volume>37</volume>
          :
          <fpage>100270</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Huang et al.,
          <year>2021</year>
          ]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Huang</surname>
          </string-name>
          , Youcheng Sun,
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Sharp</surname>
          </string-name>
          , Wenjie Ruan, Jie Meng, and
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Coverage guided testing for recurrent neural networks</article-title>
          .
          <source>IEEE Tran. on Reliability</source>
          ,
          <year>2021</year>
          . In press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Kalra and Paddock,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Nidhi</given-names>
            <surname>Kalra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan M.</given-names>
            <surname>Paddock</surname>
          </string-name>
          .
          <article-title>Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?</article-title>
          <source>Transportation Research Part A: Policy and Practice</source>
          ,
          <volume>94</volume>
          :
          <fpage>182</fpage>
          -
          <lpage>193</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Katz et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Guy</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Derek A.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, David L. Dill,
          <string-name>
            <given-names>Mykel J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Clark</given-names>
            <surname>Barrett</surname>
          </string-name>
          .
          <article-title>The Marabou Framework for Verification and Analysis of Deep Neural Networks</article-title>
          .
          <source>In CAV'19</source>
          , volume
          <volume>11561</volume>
          <source>of LNCS</source>
          , pages
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          , Cham,
          <year>2019</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Lane et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>David</given-names>
            <surname>Lane</surname>
          </string-name>
          , David Bisset,
          <string-name>
            <given-names>Rob</given-names>
            <surname>Buckingham</surname>
          </string-name>
          , Geoff Pegman, and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Prescott</surname>
          </string-name>
          .
          <article-title>New foresight review on robotics and autonomous systems</article-title>
          .
          Technical Report No. 2016.1, LRF,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Lee et al.,
          <year>2019</year>
          ]
          Jasper C. H. Lee, Jerry Li, Christopher Musco, Jeff M. Phillips, and Wai Ming Tai.
          <article-title>Finding the mode of a kernel density estimate</article-title>
          .
          <source>arXiv preprint arXiv:1912.07673</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Li et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Zenan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoxing</given-names>
            <surname>Ma</surname>
          </string-name>
          , Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü.
          <article-title>Boosting operational DNN testing efficiency through conditioning</article-title>
          .
          <source>In ESEC/FSE'19</source>
          , pages
          <fpage>499</fpage>
          -
          <lpage>509</lpage>
          . ACM,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Littlewood and Strigini,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Bev</given-names>
            <surname>Littlewood</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lorenzo</given-names>
            <surname>Strigini</surname>
          </string-name>
          .
          <article-title>Software reliability and dependability: A roadmap</article-title>
          .
          <source>In ICSE 2000</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Miller et al.,
          <year>1992</year>
          ]
          <string-name>
            <given-names>Keith W.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Larry J.</given-names>
            <surname>Morell</surname>
          </string-name>
          , Robert E. Noonan,
          <string-name>
            <given-names>Stephen K.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David M.</given-names>
            <surname>Nicol</surname>
          </string-name>
          , Branson W. Murrill, and
          <string-name>
            <given-names>M</given-names>
            <surname>Voas</surname>
          </string-name>
          .
          <article-title>Estimating the probability of failure when testing reveals no failures</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>33</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Musa,
          <year>1993</year>
          ]
          <string-name>
            <given-names>John</given-names>
            <surname>Musa</surname>
          </string-name>
          .
          <article-title>Operational profiles in software-reliability engineering</article-title>
          .
          <source>IEEE Software</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>14</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Pietrantuono et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Pietrantuono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Popov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <article-title>Reliability assessment of service-based software under operational profile uncertainty</article-title>
          .
          <source>Reliability Engineering &amp; System Safety</source>
          ,
          <volume>204</volume>
          :
          <fpage>107193</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Qahtan et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Abdulhakim</given-names>
            <surname>Qahtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suojin</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Xiangliang Zhang.
          <article-title>KDE-Track: An Efficient Dynamic Density Estimator for Data Streams</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ):
          <fpage>642</fpage>
          -
          <lpage>655</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Robu et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Robu</surname>
          </string-name>
          , David Flynn, and
          <string-name>
            <given-names>David</given-names>
            <surname>Lane</surname>
          </string-name>
          .
          <article-title>Train robots to self-certify as safe</article-title>
          .
          <source>Nature</source>
          ,
          <volume>553</volume>
          (
          <issue>7688</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>281</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Scott,
          <year>2015</year>
          ]
          <string-name>
            <given-names>David W</given-names>
            <surname>Scott</surname>
          </string-name>
          .
          <source>Multivariate density estimation: theory, practice, and visualization</source>
          . John Wiley &amp; Sons,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Silverman,
          <year>1986</year>
          ]
          <string-name>
            <given-names>Bernard W</given-names>
            <surname>Silverman</surname>
          </string-name>
          .
          <source>Density estimation for statistics and data analysis</source>
          , volume
          <volume>26</volume>
          . CRC press,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Strigini and Littlewood,
          <year>1997</year>
          ]
          <string-name>
            <given-names>Lorenzo</given-names>
            <surname>Strigini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bev</given-names>
            <surname>Littlewood</surname>
          </string-name>
          .
          <article-title>Guidelines for statistical testing</article-title>
          .
          <source>Technical report</source>
          , City, University of London,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Strigini and Povyakalo,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Lorenzo</given-names>
            <surname>Strigini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrey</given-names>
            <surname>Povyakalo</surname>
          </string-name>
          .
          <article-title>Software fault-freeness and reliability predictions</article-title>
          .
          <source>In SafeComp'13</source>
          , volume
          <volume>8153</volume>
          <source>of LNCS</source>
          , pages
          <fpage>106</fpage>
          -
          <lpage>117</lpage>
          , Berlin, Heidelberg,
          <year>2013</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Walter and Augustin,
          <year>2009</year>
          ]
          <string-name>
            <given-names>Gero</given-names>
            <surname>Walter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Augustin</surname>
          </string-name>
          .
          <article-title>Imprecision and prior-data conflict in generalized Bayesian inference</article-title>
          .
          <source>Journal of Statistical Theory &amp; Practice</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>255</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [Webb et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Webb</surname>
          </string-name>
          , Tom Rainforth, Yee Whye Teh, and
          <string-name>
            <given-names>M. Pawan</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>A statistical approach to assessing neural network robustness</article-title>
          .
          <source>In ICLR'19</source>
          ,
          New Orleans, LA, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [Weng et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Lily</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pin-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lam Nguyen, Mark Squillante, Akhilan Boopathy, Ivan Oseledets, and Luca Daniel.
          <article-title>PROVEN: Verifying robustness of neural networks with a probabilistic approach</article-title>
          .
          <source>In ICML'19</source>
          , volume
          <volume>97</volume>
          , pages
          <fpage>6727</fpage>
          -
          <lpage>6736</lpage>
          . PMLR,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Yang et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Yao-Yuan</given-names>
            <surname>Yang</surname>
          </string-name>
          , Cyrus Rashtchian, Hongyang Zhang, Ruslan Salakhutdinov, and
          <string-name>
            <given-names>Kamalika</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>A Closer Look at Accuracy vs. Robustness</article-title>
          .
          <source>In NeurIPS'20</source>
          , Vancouver, Canada,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [Zhang et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          , L. Ma, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Machine learning testing: Survey, landscapes and horizons</article-title>
          .
          <source>IEEE Trans. on Software Engineering</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [Zhao et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Robu</surname>
          </string-name>
          , David Flynn,
          <string-name>
            <given-names>Fateme</given-names>
            <surname>Dinmohammadi</surname>
          </string-name>
          , Michael Fisher, and
          <string-name>
            <given-names>Matt</given-names>
            <surname>Webster</surname>
          </string-name>
          .
          <article-title>Probabilistic model checking of robots deployed in extreme environments</article-title>
          .
          <source>In AAAI'19</source>
          , volume
          <volume>33</volume>
          , pages
          <fpage>8076</fpage>
          -
          <lpage>8084</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [Zhao et al., 2020a]
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alec</given-names>
            <surname>Banks</surname>
          </string-name>
          , James Sharp, Valentin Robu, David Flynn, Michael Fisher, and
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>A safety framework for critical systems utilising deep neural networks</article-title>
          .
          <source>In SafeComp'20</source>
          , volume
          <volume>12234</volume>
          <source>of LNCS</source>
          , pages
          <fpage>244</fpage>
          -
          <lpage>259</lpage>
          . Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [Zhao et al., 2020b]
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Radu</given-names>
            <surname>Calinescu</surname>
          </string-name>
          , Simos Gerasimou, Valentin Robu, and David Flynn.
          <article-title>Interval change-point detection for runtime probabilistic model checking</article-title>
          .
          <source>In ASE'20</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          . IEEE/ACM,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [Zhao et al., 2020c]
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kizito</given-names>
            <surname>Salako</surname>
          </string-name>
          , Lorenzo Strigini, Valentin Robu, and
          <string-name>
            <given-names>David</given-names>
            <surname>Flynn</surname>
          </string-name>
          .
          <article-title>Assessing safety-critical systems from operational testing: A study on autonomous vehicles</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <volume>128</volume>
          :
          <fpage>106393</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [Zhao et al.,
          <year>2021</year>
          ]
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Huang</surname>
          </string-name>
          , Sven Schewe, Yi Dong, and
          <string-name>
            <given-names>Xiaowei</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Detecting operational adversarial examples for reliable deep learning</article-title>
          .
          <source>In 51st Annual IEEE-IFIP Int. Conf. on Dependable Systems and Networks (DSN'21)</source>
          , volume Fast Abstract,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [Zhong et al.,
          <year>2021</year>
          ]
          <string-name>
            <given-names>Ziyuan</given-names>
            <surname>Zhong</surname>
          </string-name>
          , Yuchi Tian, and
          <string-name>
            <given-names>Baishakhi</given-names>
            <surname>Ray</surname>
          </string-name>
          .
          <article-title>Understanding Local Robustness of Deep Neural Networks under Natural Variations</article-title>
          .
          <source>In FASE'21</source>
          , pages
          <fpage>313</fpage>
          -
          <lpage>337</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>