<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Increasing Trustworthiness of Deep Neural Networks via Accuracy Monitoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhihui Shao</string-name>
          <email>zshao006@ucr.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianyi Yang</string-name>
          <email>jyang239@ucr.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaolei Ren</string-name>
          <email>sren@ece.ucr.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Riverside</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Inference accuracy of deep neural networks (DNNs) is a crucial performance metric, but it can vary greatly in practice subject to the actual test dataset and is typically unknown due to the lack of ground-truth labels. This has raised significant concerns about the trustworthiness of DNNs, especially in safety-critical applications. In this paper, we address the trustworthiness of DNNs by using post-hoc processing to monitor the true inference accuracy on a user's dataset. Concretely, we propose a neural network-based accuracy monitor model, which only takes the deployed DNN's softmax probability output as its input and directly predicts if the DNN's prediction result is correct or not, thus leading to an estimate of the true inference accuracy. The accuracy monitor model can be pre-trained on a dataset relevant to the target application of interest, and only needs to actively label a small portion (1% in our experiments) of the user's dataset for model transfer. For estimation robustness, we further employ an ensemble of monitor models based on the Monte-Carlo dropout method. We evaluate our approach on different deployed DNN models for image classification and traffic sign detection over multiple datasets (including adversarial samples). The results show that our accuracy monitor model provides a close-to-true accuracy estimate and outperforms the existing baseline methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction</title>
      <p>Deep neural networks (DNNs) have achieved
unprecedentedly high classification accuracy and found success in
numerous applications, including image classification, speech
recognition, and natural language processing. Nonetheless,
training an error-free or 100% accurate DNN is impossible
in most practical cases. Inference accuracy is a crucial
metric for quantifying the performance of DNNs. Typically, the
reported inference accuracy of a DNN is measured offline
on test datasets with labels, but this can significantly differ
from the true accuracy on a user’s dataset because of, e.g.,
data distribution shift away from the training dataset or even
adversarial modification to the user’s data [Che et al., 2019;
Kull et al., 2019; Malinin and Gales, 2018]. Moreover,
obtaining the true accuracy is very challenging in practice due
to the lack of ground-truth labels.</p>
      <p>The unknown inference accuracy has further decreased
the transparency of already hard-to-explain DNNs and raised
significant concerns with their trustworthiness, especially in
safety-critical applications. Consequently, studies on
increasing trustworthiness of DNNs have been proliferating. For
example, many studies have considered out-of-distribution
(OOD) detection and adversarial sample detection, since
OOD and adversarial samples often dramatically decrease
inference accuracy of DNNs [Hendrycks and Gimpel, 2017;
Che et al., 2019; Lee et al., 2018; Liang et al., 2018]. While
these efforts can offer an increased assurance of DNNs to
users to some extent, they do not provide a quantitative
measure of actual classification accuracy, which is a more
direct and sensible measure of the target DNN’s performance.
Some other studies propose (post-hoc) processing to
quantify/estimate the prediction confidence of a DNN [Guo et al.,
2017; Kull et al., 2019; Snoek et al., 2019]. Nonetheless, they
typically require the target DNN’s training/validation dataset
to train a (sometimes complicated) new transformation model
for confidence calibration, and do not transfer well to new
unseen datasets. The accuracy of a target DNN on a user’s
operational dataset can also be estimated via selective random
sampling, but it can suffer from a high estimation variance
[Li et al., 2019].</p>
      <p>Contribution. In this paper, we propose a simple yet
effective post-hoc method — accuracy monitoring — which
increases the trustworthiness of DNN classification results by
estimating the true inference accuracy on an actual (possibly
OOD/adversarial) dataset. Concretely, as shown in Fig. 1,
we propose a neural network-based accuracy monitor model,
which only takes the deployed DNN’s softmax probability
output as its input and directly predicts if the DNN’s
prediction result is correct or not. Thus, over a sequence of
prediction samples from a user’s dataset, our accuracy monitor
can form an estimate of the target DNN’s true inference
accuracy. Furthermore, we employ an ensemble of monitor
models based on the Monte-Carlo dropout method, providing
a robust estimate of the target DNN’s true accuracy.</p>
      <p>[Figure 1: Overview of accuracy monitoring, showing the user, the data, the model provider, and the accuracy monitor: (1) training of the monitor models; (2) the deployed DNN’s softmax probabilities on the user’s data are fed to the monitor models, which output the estimated accuracy.]</p>
      <p>Utilizing as little information as the target DNN’s
softmax probability output for accuracy estimation provides
better transferability than more complicated calibration methods
[Kull et al., 2019]. Specifically, we can pre-train an
accuracy monitor model based on a labeled dataset relevant to
the target application of interest (e.g., public datasets for
image classification). Then, for model transfer, we can
selectively label a small amount (1% in our work) of data from the
user’s test dataset with active learning via an entropy
acquisition function [Beluch et al., 2018], and re-train our monitor
models on the selectively labeled data using transfer
learning. In addition, without the need of accessing the target
DNN’s training/validation datasets, our accuracy monitoring
method can be easily applied as a plug-in module on top of
the target DNN to monitor its runtime performance on a
variety of datasets. Thus, our method is not restricted to the
DNN providers themselves; instead, even an end user can
employ our method to monitor the target DNN’s accuracy
performance on its own, further increasing the trustworthiness
of accuracy monitoring.</p>
<p>To evaluate the effectiveness of our accuracy monitoring
method, we consider different target DNN models for image
classification (10 classes and 1000 classes) and for traffic sign
detection in autonomous driving, respectively. Our results
show that, by only utilizing the prediction class and softmax
probability output of the deployed DNN model and labeling
1% of the user’s dataset, our method can monitor the health
of the target DNN models, providing a remarkably accurate
estimate of the true classification accuracy on a variety of
user’s datasets.</p>
    </sec>
      <sec id="sec-10-1">
        <title>2 Related Works</title>
        <sec id="sec-10-1-1">
          <title>Prediction uncertainty estimation. Several methods have</title>
          <p>been proposed to estimate DNN prediction uncertainty. In
[Schulam and Saria, 2019], the model uncertainty is
estimated with ensemble models via re-sampling the original
DNN model parameters based on the Hessian matrix and
gradient matrix on the training data. Additionally, [Jiang et al.,
2018] estimates model uncertainty via the similarity between
the test data and training data. However, it requires not only
the training data but also a white-box target DNN model.
Other methods (e.g., MC dropout, ensembles, stochastic
variational Bayesian inference, prior networks) are summarized
in [Malinin and Gales, 2018; Snoek et al., 2019], which also
require a white-box model and/or the original training dataset.
By contrast, our post-hoc processing method only needs the
target DNN’s softmax probability output and applies to a
variety of datasets, including OOD and adversarial samples.</p>
          <p>Concept/distribution drift detection. After model
deployment, some studies indirectly tackle the problem of
model accuracy monitoring via concept/data distribution drift
detection in the absence of labels. In [Pinto et al., 2019],
an automatic concept drift detection algorithm SAMM is
developed with no labeled test data by utilizing the feature
distance between test data and reference data. Other
approaches include ML Health [Ghanta et al., 2019b] and MD3
[Sethi and Kantardzic, 2017]. Moreover, [Che et al., 2019;
Liang et al., 2018; Lee et al., 2018] study OOD and
adversarial detection by setting a threshold to decide if an input data
is sufficiently similar to the pre-learnt in-distribution or
nonadversarial data distribution. These approaches do not offer
a measure of the actual accuracy. Moreover, they require
access to the original training and/or validation datasets, which
are not needed by our accuracy monitor.</p>
          <p>Accuracy estimation for the target model. Secondary
models are trained to estimate the accuracy of the primary
model, but they are trained on the same dataset as the primary
model and require either the original input data [Ghanta
et al., 2019a] or saliency maps [Mohseni et al., 2019]. In
[Nguyen et al., 2018], an active testing framework is
proposed to estimate model accuracy, with a focus on noisy
labeled datasets instead of unlabeled datasets that we consider.
Our problem is also related to operational testing [Li et al.,
2019], which uses selective random sampling to provide an
accuracy estimate for a target DNN on an actual operational
dataset prior to DNN deployment. The work [Istrate et al.,
2019] predicts the accuracy of a target DNN architecture on
a given dataset, while [Unterthiner et al., 2020] predicts
accuracy based on the target DNN’s weights. These studies
require a large number of DNN training experiments.</p>
          <p>Prediction confidence via softmax probability. A related
study [Hendrycks and Gimpel, 2017] utilizes the maximum
softmax probability of the target DNN for misclassification
detection, whereas our approach exploits the softmax
probabilities for all classes. Further, an abnormality module is
designed to detect OOD data in [Hendrycks and Gimpel, 2017],
for which a decoder is required and trained with a white-box
target model. In [Guo et al., 2017], temperature scaling is
proposed to calibrate the original softmax probability, but a
labeled validation set is required to learn the hyperparameter
T . Likewise, [Kull et al., 2019] advances the temperature
scaling method by training a sophisticated Dirichlet
distribution for better confidence calibration. These methods are
sensitive to, and do not transfer well to, a user’s datasets with
OOD/adversarial samples.</p>
      </sec>
      <sec id="sec-10-2">
<title>3 Problem Formulation</title>
        <p>We consider a deployed target DNN model that performs
classification tasks with C classes. The DNN provides
softmax probabilities denoted as p(x) = M_{θ_d}(x), where x
represents the input data, θ_d denotes the target DNN’s parameters
(not required by the accuracy monitor), and p(x) ∈ R^C.
Thus, the predicted class is ỹ = argmax_{k ∈ {1,2,...,C}} p_k(x).</p>
        <p>[Figure 2: The accuracy monitoring pipeline: monitor models are pre-trained with labeled data (images, labels, the deployed model’s softmax probabilities, and the correct/wrong outcomes), transferred to the user’s data with active learning via an entropy acquisition function, and applied as an MC-dropout ensemble of monitor models to produce the estimated accuracy.]</p>
        <p>The empirical accuracy Acc of a deployed DNN model
M_{θ_d} on a user’s dataset (x_i, y_i) ∈ D_U can be calculated as
follows:

Acc = (1/|D_U|) Σ_{(x_i, y_i) ∈ D_U} I(y_i = ỹ_i),   (1)

where I(·) is the Boolean indicator function. The exact value
of Acc cannot possibly be obtained without knowing all the
true classes y_i, which is often the case in practice (e.g., a user
employs a classifier due to the high cost of manually
labeling its data). It can also significantly differ from the
accuracy value evaluated on the DNN model provider’s test
dataset due to data distribution disparity.</p>
        <p>In this paper, we leverage a simple plug-in accuracy
monitor model to estimate the empirical accuracy Acc without
all the true labels for the user’s dataset. Specifically, the
neural network-based monitor model s(p(x)) = M_{θ_a}(p(x)),
parameterized by θ_a, takes the target DNN’s softmax
probabilities p(x) = M_{θ_d}(x) as its input and outputs a softmax
probability/score s(p(x)) to indicate the likelihood of correct
classification for data x. Then, if the probability of correct
classification is greater than or equal to a threshold th_s, the
target DNN’s classification is considered correct, and
otherwise wrong. By default, we use s(p(x)) ≥ th_s = 0.5 as the
condition for a classification result to be considered correct. Thus,
the accuracy of the deployed DNN on the user’s dataset
estimated by our monitor model is

Âcc = (1/|D_U|) Σ_{(x_i, y_i) ∈ D_U} I[s(p(x_i)) ≥ th_s].   (2)</p>
        <p>Our problem formulation is similar to that of the
existing confidence calibration techniques [Kull et al., 2019;
Guo et al., 2017], which focus on estimating the
probability of correct/wrong prediction for each individual sample.
Nonetheless, our key goal is to make the estimated average
accuracy Âcc as close to the true empirical accuracy Acc as
possible. This allows the application of our method to even
OOD/adversarial datasets, while still offering an important
view of the average accuracy performance of the target DNN.</p>
        <p>Note finally that our accuracy monitoring method does not
require a white-box target DNN model and can be applied on
top of the target DNN to monitor its accuracy performance,
either by the DNN model provider or by an end user (provided
that it has access to a relevant labeled dataset, not necessarily
the target DNN’s training/validation dataset).</p>
      </sec>
      <sec id="sec-10-3">
<title>4 Design of DNN Accuracy Monitoring</title>
        <p>Our proposed accuracy monitoring method, illustrated in
Fig. 2, includes three phases. First, monitor models are pre-trained
over a labeled dataset that shares the same application as the
user’s dataset. Then, the monitor models are re-trained with a
small t% of labeled data from the user’s dataset using active
learning. Finally, multiple monitor models are provided to
approximate Bayesian neural networks via MC dropout,
achieving a more robust accuracy estimation. Algorithm 1 describes
the steps of our proposed method. Next, we provide details
of the three phases for accuracy monitoring.</p>
        <p>Training phase. To pre-train the initial monitor models, the
accuracy monitor can leverage a labeled dataset D_R, which
can be the target DNN’s training/validation dataset (if the
DNN provider wants to monitor its own model’s accuracy)
or a different dataset relevant to the target application (if the
DNN user wants to monitor the accuracy by itself but does not
have the target DNN’s original training/validation dataset).
For example, if the target DNN is developed by one entity
but later provided to another user as a black-box model for
image classification, CIFAR-10, CINIC-10 or ImageNet2012
can be used by the user to pre-train its own accuracy monitor
models. We run the target DNN on the labeled dataset and
obtain the prediction softmax probabilities p_R(x) produced by the
target DNN. Meanwhile, the correct/wrong result CW_R(x) of
the target DNN can also be obtained by comparing the DNN’s
predicted class with the true data label. Then, based on p_R(x)
and CW_R, we can train the B monitor models M_{θ_a}^(b).</p>
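        <p>The training-phase bookkeeping described above is straightforward; below is a hypothetical sketch assuming a tf.keras target model and an in-memory labeled dataset (x_R, y_R). The helper name is ours, not the authors’.</p>
        <preformat>
import numpy as np

def build_monitor_training_set(target_model, x_R, y_R):
    """Run the deployed DNN on the labeled dataset D_R and collect
    (softmax probabilities, correct/wrong flags) for monitor training."""
    p_R = target_model.predict(x_R)            # softmax outputs, shape (N, C)
    y_hat = p_R.argmax(axis=1)                 # target DNN's predicted classes
    cw_R = (y_hat == y_R).astype(np.float32)   # CW_R(x): 1 if correct, 0 if wrong
    return p_R, cw_R

# The B monitor models are then fit on (p_R, cw_R) with binary
# cross-entropy, each with different initial weights and dropout layers.
        </preformat>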
        <sec id="sec-10-3-1">
          <title>Transfer with active learning. Due to the possible distri</title>
          <p>a
bution differences between the chosen labeled dataset D
R and</p>
          <p>U , the monitor models pre-trained</p>
          <p>R may not provide a satisfactory accuracy for
solely on the D
the user’s actual dataset D
the target DNN as shown in Section 5. To address this
issue, we need to transfer the monitor models into the user’s
dataset. In the transfer learning phase, we freeze the weights
of all layers in the monitor models except for the last two
layers. Only the weights of the last two layers will be updated
during transfer learning. Due to expensive labeling cost, we
Algorithm 1: DNN Accuracy Monitoring
Input: A labeled dataset DR, user dataset DU , target
model M d (x), the MC dropout model number B,
data labeling budget t%.
1. Obtain softmax probabilities for DR and DU .</p>
          <p>pR(x) M d (x) for x 2 DR;
CW R(x) I (y~ = y) for (x; y) 2 DR;
pU (x) M d (x) for x 2 DU ;
2. Train monitor models with pR and CW R.
for b = 1 to B do
only sample a small amount of user’s dataset (denoted as DsU )
from DU , and only DsU are manually labeled. To minimize
the size of DsU , entropy-based active learning [Beluch et al.,
2018] is utilized during the transfer. Specifically, we calculate
the average entropy of softmax probabilities produced by the
monitor models, and label t% of user’s data with the greatest
entropy.</p>
          <p>Note that while labeling user’s data, only the user’s data
label y and deployed DNN’s softmax probabilities p(x)
(instead of the raw data x) are utilized by the monitor
models. Moreover, by doing so, the accuracy monitor actually
performs accuracy estimation of the target DNN model over
a low-dimension softmax probability representation of x,
which effectively facilities transfer learning to user’s dataset.
As shown in our experiments, by labeling only 1% of the
user’s dataset, the monitor models can produce a highly
accurate estimation of the target DNN’s average accuracy.</p>
        </sec>
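        <p>A sketch of the entropy acquisition step, under our own naming and the assumption that the monitors’ correct-classification scores are collected into a NumPy array; only the rule of labeling the t% of user data with the greatest average entropy comes from the text.</p>
        <preformat>
import numpy as np

def select_for_labeling(monitor_scores, t=0.01):
    """monitor_scores: (B, N) array of scores s(p(x)) from the B monitor
    models on the user's N samples. Returns the indices of the t-fraction
    of samples with the highest average binary entropy, i.e., the samples
    on which the monitors are least certain."""
    eps = 1e-12
    s = np.clip(monitor_scores, eps, 1.0 - eps)
    entropy = -(s * np.log(s) + (1.0 - s) * np.log(1.0 - s))  # per model/sample
    avg_entropy = entropy.mean(axis=0)                        # average over models
    n_label = max(1, int(t * s.shape[1]))
    return np.argsort(-avg_entropy)[:n_label]                 # top-entropy indices
        </preformat>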
        <sec id="sec-10-3-2">
          <title>Robust accuracy estimation with MC dropout. Estimat</title>
          <p>ing accuracy for the target DNN by a single monitor model
may not be robust because of the indispensable uncertainty
in deep learning. Based on [Gal and Ghahramani, 2016], we
employ the MC dropout method to approximate a Bayesian
neural network and provide more robust accuracy estimation.
Specifically, we train an ensemble of monitor models in the
training phase using the same labeled dataset but different
initialized weights and dropout layers. Then, we transfer the
trained models using the same dataset DsU . When estimating
the target DNN’s classification accuracy, multiple estimated
accuracies can be obtained from the ensemble. The mean
of the results is considered as the monitor’s assessment on
the deployed DNN’s classification accuracy over the user’s
dataset. Moreover, the standard deviation (std) can also be
provided to represent the uncertainty of estimated accuracy
by the ensemble of monitor models.
5</p>
        </sec>
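        <p>With B transferred monitor models in hand, the ensemble estimate and its uncertainty reduce to a mean and standard deviation over per-model estimates of Eq. (2); a sketch under our own naming:</p>
        <preformat>
import numpy as np

def ensemble_accuracy_estimate(monitor_models, p_user, th_s=0.5):
    """monitor_models: list of B Keras models mapping softmax probabilities
    to s(p(x)). p_user: (N, C) softmax outputs of the target DNN on the
    user's data. Returns (mean, std) of the B per-model accuracy estimates."""
    estimates = []
    for m in monitor_models:
        s = m.predict(p_user).ravel()         # correct-classification scores
        estimates.append(np.mean(s >= th_s))  # Eq. (2) for this monitor model
    estimates = np.asarray(estimates)
    return estimates.mean(), estimates.std()
        </preformat>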
      </sec>
      <sec id="sec-10-4">
<title>5 Experiments</title>
        <p>We first evaluate the effectiveness of our accuracy
monitoring method on two image classification applications:
small-scale image classification with 10 classes, and large-scale
image classification with 1000 classes. Then, we consider a
mission-critical application — traffic sign detection for
autonomous driving.</p>
        <sec id="sec-10-4-1">
          <title>5.1 Setup</title>
          <p>Our accuracy monitor model is trained as a neural network
with dropout layers using TensorFlow and Keras [Abadi et
al., 2016]. The weight parameter θ_a is trained by
minimizing the binary cross-entropy loss using Adam [Kingma and Ba,
2015] with a learning rate of 0.001. The input of the
monitor model is the softmax probabilities p(x) produced by the
target DNN, while the output indicates whether the classification
is correct or not for an input image x via a softmax score
s(p(x)), which is then averaged over multiple samples
to form an estimate of the average accuracy.</p>
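          <p>A minimal Keras sketch of such a monitor model; the binary cross-entropy loss, the Adam optimizer, the 0.001 learning rate, and the softmax-probability input come from the text, while the layer widths and dropout rate are placeholders. A single sigmoid output is used as an equivalent of the two-way softmax score s(p(x)).</p>
          <preformat>
import tensorflow as tf

def build_monitor(num_classes, hidden=64, dropout_rate=0.5):
    """Monitor model: the target DNN's softmax probabilities in, the
    probability of correct classification out. Two hidden dense layers and
    one dropout layer as in Fig. 3(a); the widths here are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_classes,)),            # p(x) from the target DNN
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # s(p(x))
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy")
    return model
          </preformat>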
          <p>Dataset. The datasets include CIFAR-10 [Krizhevsky,
2009], CINIC-10 [Darlow et al., 2018], STL-10 [Coates
et al., 2011], ImageNet2012 [Russakovsky et al., 2015]
and German Traffic Sign Detection (GTSD) [Houben et
al., 2013]. In addition, we also consider a user’s dataset
with adversarial images for 10-class classification and GTSD
classification, denoted as AD-10 and GTSD-AD,
respectively. The adversarial images are generated using the DeepFool
attack [Moosavi-Dezfooli et al., 2016] with the “Foolbox”
package [Rauber et al., 2017].</p>
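          <p>For reference, adversarial samples of this kind can be generated along the following lines; this is our own sketch assuming Foolbox 3.x and a tf.keras target model, and the epsilon value is illustrative (the paper does not report one).</p>
          <preformat>
import foolbox as fb

def make_adversarial(target_model, images, labels):
    """images/labels: eager tensors of clean samples scaled to [0, 1]."""
    fmodel = fb.TensorFlowModel(target_model, bounds=(0.0, 1.0))
    attack = fb.attacks.L2DeepFoolAttack()   # DeepFool under the L2 norm
    # epsilons bounds the allowed perturbation size (illustrative value)
    raw, clipped, success = attack(fmodel, images, labels, epsilons=1.0)
    return clipped, success
          </preformat>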
          <p>Target DNN model. The target DNN model for 10-class
image classification is VGG16 [Simonyan and Zisserman,
2015], while MobileNet [Howard et al., 2017] and
ResNet-50 [He et al., 2016] are used as the target DNNs for
1000-class image classification. The target model for GTSD is
a convolutional neural network (CNN) trained on the
classification accuracy achieved by these DNNs on the above
datasets (which can be OOD with respect to the DNNs’
original training datasets).</p>
        </sec>
        <sec id="sec-10-4-2">
          <title>5.2 Baseline Approaches and Metrics</title>
          <p>The following baselines and metrics are considered.</p>
          <p>RS: With random sampling (RS), u% of the user’s data is
randomly sampled and manually labeled. Then, the accuracy on
the sampled user’s data is taken as the overall
accuracy. We also run RS 100 times, and highlight the
accuracy range achieved over the 100 runs. Note, however, that in
practice RS is only performed once for each test dataset.</p>
          <p>MP and MP*: In the MP approach considered
in [Hendrycks and Gimpel, 2017], no manual labeling
is needed; instead, the maximum softmax probability
MP(x) = max_{k ∈ {1,2,...,C}} p_k(x) produced by the target
DNN model is utilized: if MP(x) ≥ th_MP, where th_MP is
a threshold, then the classification for x is considered correct,
and otherwise wrong. In our experiment, the threshold th_MP
is determined based on the same labeled dataset used to train
our monitor models so as to achieve the best accuracy estimation
during validation. Alternatively, we can also use the average
maximum softmax probability on the user’s dataset to estimate the
target DNN’s accuracy as MP* = (1/|D_U|) Σ_{(x,y) ∈ D_U} MP(x), and
we use MP* to denote this approach.</p>
          <p>Entropy: The prediction entropy Entropy(p(x)) can be
calculated from the softmax probability of the target DNN model.
Then, the target DNN’s classification for x is considered
correct if Entropy(p(x)) &lt; th_En, where th_En is the entropy threshold
decided by the monitor according to its chosen labeled dataset,
and wrong otherwise.</p>
          <p>Temperature scaling (TS): With temperature scaling,
the softmax probability can be calibrated from the logits with
a hyper-parameter T. According to [Guo et al., 2017], given
the logit output z_i, the model accuracy can be estimated as
TS = max SM(z_i/T), where SM is the softmax function
and T is called the temperature. Usually, the temperature T
is obtained by minimizing the negative log-likelihood (NLL)
on the target DNN’s validation set. Here, we use the actively
labeled user’s data samples as the validation set.</p>
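          <p>For concreteness, the two label-free baselines reduce to a few lines of NumPy; a sketch with our own naming (th_en must be chosen on a labeled dataset, as described above):</p>
          <preformat>
import numpy as np

def mp_star_estimate(p):
    """MP*: average maximum softmax probability over the user's data.
    p: (N, C) softmax outputs of the target DNN."""
    return p.max(axis=1).mean()

def entropy_estimate(p, th_en):
    """Entropy baseline: a sample counts as correctly classified when its
    prediction entropy falls below the threshold th_en."""
    eps = 1e-12
    h = -(p * np.log(p + eps)).sum(axis=1)   # per-sample prediction entropy
    return np.mean(h &lt; th_en)
          </preformat>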
          <p>
            Perfect confidence calibration: This is an oracle that
gives the true accuracy of the target DNN, which no practical
confidence calibration method
            <xref ref-type="bibr" rid="ref15 ref19 ref21 ref24 ref27 ref3 ref33 ref8 ref9">(e.g., [Kull et al., 2019])</xref>
            can outperform.
          </p>
          <p>Metrics: Our main performance metric is the estimated
average accuracy of the target DNN. Additionally, we also
consider AUPR (Area Under the Precision-Recall Curve) to
isolate the effects of different thresholds th_s. The value of
the threshold-free AUPR ranges from the positive class ratio
(random guess) to 1.0 (perfect classification), and measures a
model’s capability of distinguishing between correct and wrong
classifications. The higher the AUPR, the better.</p>
        </sec>
        <sec id="sec-10-4-4">
<title>5.3 Result on 10-class Image Classification</title>
          <p>For 10-class image classification, the target model is a
VGG16 model trained on CIFAR-10 [Geifmany, 2018]. We
evaluate the performance of our proposed method on four
datasets shown in Table 1. The dataset sizes are 10k
(CIFAR-10), 90k (CINIC-10), 8k (STL-10) and 10k (AD-10). The
reported inference accuracy of the target VGG16 model is
93.56% measured on CIFAR-10, while the inference
accuracies for the other datasets are 76.17% (CINIC-10), 63.04%
(STL-10), and 37.80% (AD-10), indicating a significant accuracy
degradation due to OOD/adversarial data. First, we train an
ensemble of 20 monitor models on 9000 images from a
public dataset (i.e., CINIC-10 training dataset in our experiment)
and the structure of the monitor model is shown in Fig. 3(a),
including two hidden dense layers and one dropout layer. In
the training phase, each monitor model is trained over 200
epochs with the Adam optimizer. Fig. 3(b) shows the training and
validation loss for a monitor model in the training phase. Then,
two hidden layers are frozen to perform transfer learning as
shown in Fig. 3(a). To improve the transfer efficiency, an
active learning approach is utilized to select the 1% of samples with
the highest entropy from the user’s test dataset. In the
prediction phase, robust estimation and its uncertainty are provided
by the ensemble of monitor models.</p>
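          <p>Putting the phases together for this experiment, a hypothetical end-to-end sketch reusing the helpers from the earlier snippets; the ensemble size of 20, the 200 epochs, and the 1% labeling budget come from the text, while label_manually stands in for the human labeling step.</p>
          <preformat>
import numpy as np

# Pre-training on the labeled public dataset (CINIC-10 here).
p_R, cw_R = build_monitor_training_set(target_model, cinic_x, cinic_y)
monitors = [build_monitor(num_classes=10) for _ in range(20)]
for m in monitors:
    m.fit(p_R, cw_R, epochs=200, verbose=0)

# Transfer: entropy acquisition on the user's data, then fine-tune only
# the last two layers on the actively labeled 1%.
p_U = target_model.predict(user_x)
scores = np.stack([m.predict(p_U).ravel() for m in monitors])
idx = select_for_labeling(scores, t=0.01)
y_sel = label_manually(user_x[idx])            # hypothetical human labeling
cw_U = (p_U[idx].argmax(axis=1) == y_sel).astype(np.float32)
for m in monitors:
    for layer in m.layers[:-2]:
        layer.trainable = False                # freeze all but last two layers
    m.compile(optimizer="adam", loss="binary_crossentropy")  # recompile
    m.fit(p_U[idx], cw_U, epochs=200, verbose=0)
          </preformat>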
          <p>The estimated accuracy results are summarized in Table 1,
compared with the baseline approaches. Our method provides
a much more accurate estimate of the target DNN’s inference
accuracy on the user’s test datasets. While our monitor models
are trained on CINIC-10, with transfer learning on only 1%
of the user’s dataset, we can still accurately estimate the target
DNN’s inference accuracy when the user’s dataset is STL-10.</p>
          <p>The inference accuracy via the RS approach exhibits a
large variance with 1% labeled data, and at least 10% labeled
samples are required to achieve a small estimation error. For
the temperature scaling method, the estimated accuracy still
deviates from the true accuracy by a large gap. For the MP-based
and entropy-based approaches, the estimated accuracy varies
greatly with the threshold values. Although in theory one can
always find a threshold with which the resulting estimated
accuracy coincides with the true accuracy (z% = Acc), such
a threshold is not very meaningful, since it simply says that the
samples with the top z% maximum softmax probability or entropy
are correct.</p>
          <p>As shown in Table 1, our method has an AUPR value similar
to the baseline approaches, demonstrating that the
overall capability of distinguishing correct/wrong classification is
comparable among the different methods. Nonetheless, AUPR
is not as intuitive a metric as average accuracy, which our
accuracy monitor is specifically designed for. Also, AUPR
is only applicable to methods with variable thresholds (e.g.,
MP, Entropy and TS), as provided in Table 1.</p>
          <p>In addition, we also evaluate the performance of the monitor
models on small-batch datasets to see if our monitor models
can track the true empirical accuracy of the target DNN on
a user’s time-varying datasets. Specifically, we randomly
select 500 images as a batch from STL-10 with replacement,
and repeat this to obtain a total of 100 batches, each with 500
images. The results in Fig. 4 demonstrate that our
accuracy monitor can closely track the empirical true accuracy,
whereas the baseline approaches cannot. Even 20% RS (i.e.,
randomly labeling 100 images per batch) and temperature
scaling cannot provide a good accuracy estimate.</p>
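          <p>A sketch of this batch experiment (our own code; the batch size of 500, the 100 batches, and sampling with replacement come from the text, and ensemble_accuracy_estimate is the earlier helper):</p>
          <preformat>
import numpy as np

rng = np.random.default_rng(0)
batch_estimates, batch_truths = [], []
for _ in range(100):                                      # 100 batches
    idx = rng.choice(len(stl_x), size=500, replace=True)  # with replacement
    p = target_model.predict(stl_x[idx])
    est_mean, est_std = ensemble_accuracy_estimate(monitors, p)
    batch_estimates.append(est_mean)
    # Ground truth per batch, available only for evaluation purposes.
    batch_truths.append(np.mean(p.argmax(axis=1) == stl_y[idx]))
          </preformat>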
        </sec>
        <sec id="sec-10-4-5">
          <title>5.4 Result on 1000-class Image Classification</title>
          <p>For image classification with 1000 classes, two target models
(MobileNet and ResNet-50) are applied on ImageNet2012’s
validation dataset. The original validation set includes 50k
images. We randomly split ImageNet2012 into 3 datasets: a
training dataset with 20k images, ImageNet A with 20k images, and
ImageNet B with 10k images. The reported accuracies on the
ImageNet dataset are 70.40% for MobileNet and 74.90% for
ResNet-50, respectively. For MobileNet, the true accuracies
on test datasets are 68.59% (ImageNet A) and 67.91%
(ImageNet B), respectively. For ResNet-50, the true accuracies
are 68.36% (ImageNet A) and 67.47% (ImageNet B),
respectively. The true accuracies vary due to the distribution shift.</p>
          <p>For the 1000-class target model, the softmax probability
p(x) includes 1000 values. Therefore, the monitor model
structure is changed accordingly with 1000 input nodes and
1000 hidden nodes in hidden layers. Other settings remain
the same. The results for MobileNet and ResNet-50 on
ImageNet-A and ImageNet-B are summarized in Table 2.</p>
          <p>[Figure 4: Estimated model accuracy versus the index of the data batch, panels (a) and (b), comparing RS (1%, 10%, 20%), Entropy, MP, MP*, TS, our monitor, and the true accuracy.]</p>
          <p>They demonstrate that the monitor model also outperforms the
baseline approaches for large-scale image classification.
Similarly, the RS’s estimated accuracy exhibits a high variation
and at least 10% labeled data are required to achieve a
similar performance as the monitor model. Due to the distribution
similarity between the training dataset and ImageNet A/B,
which are all selected from ImageNet2012, the MP-based and
entropy-based approaches (with thresholds optimized based
on the training dataset) offer a reasonable estimate of the true
accuracy, but they are still worse than our monitor model.
Similarly, temperature scaling has a higher estimation error
due to limited (1%) labeled samples. Our accuracy
monitor exhibits a slightly larger estimation error on 1000-class
models than the 10-class case. One possible reason is the
higher dimensions in the softmax probability, which may
require more complex feature extraction layers instead of
simple fully-connected layers in our current experiment.</p>
        </sec>
        <sec id="sec-10-4-6">
<title>5.5 Result on Traffic Sign Detection</title>
          <p>We now consider traffic sign detection in safety-critical
autonomous driving on the GTSD dataset, including 40k
samples grouped into 43 categories/classes [Houben et al., 2013].
We train a CNN on the GTSD training dataset (27k samples)
for 50 epochs with the Adam optimizer. The CNN includes
convolution layers, dropout layers, and fully connected layers.</p>
          <p>We evaluate the proposed method and baseline approaches
on four test datasets generated from GTSD, including the
original test dataset (GTSD-D1), augmented test dataset
(GTSD-D2), out-of-distribution dataset (GTSD-OOD), and
adversarial dataset (GTSD-AD). Specifically, GTSD-D1
includes 10k samples randomly selected from the GTSD test
dataset, while GTSD-D2 includes 10k augmented samples
from the GTSD test dataset. The augmentation operations
and parameters for GTSD-D2 are random rotation within
[−10, 10] degrees and random vertical/horizontal shifts within
[−0.1, 0.1]. As for GTSD-OOD, it includes 12k OOD
samples from CIFAR-10 and 18k samples from the augmented
dataset GTSD-D2. The OOD samples from CIFAR-10 are
resized to (30, 30, 3) using the tf.image.resize function with
default parameters, and they are assigned a NULL label,
indicating that they do not belong to any of the 43 classes in GTSD. The
GTSD-AD dataset includes 15k normal samples and 15k
adversarial samples. The normal samples are randomly selected
from the augmented dataset GTSD-D2, while the adversarial
samples are generated with DeepFool. The reported inference
accuracy measured on GTSD-D1 is 97.34%, while the
inference accuracies for the other datasets are 84.01%
(GTSD-D2), 51.47% (GTSD-OOD) and 42.91% (GTSD-AD),
respectively.</p>
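          <p>The GTSD-D2 augmentation and the OOD resizing map directly onto standard TensorFlow/Keras preprocessing; a sketch in which only the rotation range, shift ranges, and target size come from the text:</p>
          <preformat>
import tensorflow as tf

# Random rotation within [-10, 10] degrees and vertical/horizontal shifts
# within [-0.1, 0.1] of the image size, as described for GTSD-D2.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
# gtsd_test_x: assumed array of GTSD test images.
batches = augmenter.flow(gtsd_test_x, batch_size=32, shuffle=False)

# OOD samples: resize CIFAR-10 images to (30, 30, 3) with default settings.
ood = tf.image.resize(cifar_images, (30, 30))
          </preformat>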
          <p>For the target DNN, the softmax probability vector p(x)
includes 43 elements. The structure of our monitor model
in Fig. 3(a) is modified to include 100 and 50 hidden nodes
in two hidden layers, respectively. First, we pre-train an
ensemble of 20 monitor models on a public dataset (for which
we choose GTSD-D2 in our evaluation). In the training
phase, each monitor model is trained over 200 epochs with
the Adam optimizer. Then, when applied to different datasets,
the weights in the first two layers are frozen to perform
transfer learning. Other settings remain the same.</p>
          <p>The estimated accuracies by different methods are
summarized in Table 3. The results show that our method still
outperforms the considered baselines, providing a much more
accurate estimate of the target DNN’s inference accuracy on
user’s datasets. With pre-trained monitor models and only
1% labeled data, we can accurately estimate the target DNN’s
inference accuracy when applied to different datasets
(GTSD-OOD or GTSD-AD). Also, the estimated accuracy by RS
exhibits a high variation with 1% labeled data and at least 10%
labeled data is required to achieve a similar performance as
our method. Additionally, the baselines provide an estimated
accuracy with a large error. For instance, the MP* and TS
methods often provide a higher estimated accuracy than the
true accuracy.</p>
          <p>To sum up, the results on GTSD further demonstrate the
effectiveness of our proposed method for accuracy estimation
of a target DNN.</p>
        </sec>
      </sec>
      <sec id="sec-10-5">
<title>6 Conclusion</title>
        <p>In this paper, to increase the trustworthiness of DNN
classification results, we propose a post-hoc method for monitoring
the prediction performance of a target DNN model and
estimating its empirical inference accuracy on a user’s (possibly
OOD/adversarial) dataset. The monitor model only takes the
softmax probability produced by the target DNN model as its
input. Thus, it can be easily employed as a plug-in module
on top of a target DNN to monitor its accuracy. Importantly,
via active learning with a small amount of labeled data from the
user’s dataset, our monitor model can produce a very
accurate estimate of the inference accuracy of the target DNN model.
Our experimental results on different datasets validate the
effectiveness and efficiency of the proposed method for image
classification and traffic sign detection.</p>
      </sec>
      <sec id="sec-10-6">
        <title>Acknowledgments</title>
        <p>This work was supported in part by the U.S. National Science
Foundation under grants CNS-1551661, ECCS-1610471, and
CNS-1910208.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Abadi et al.,
          <year>2016</year>
          ] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Matthieu</given-names>
            <surname>Devin</surname>
          </string-name>
          . Tensorflow:
          <article-title>Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1603.04467</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Beluch et al.,
          <year>2018</year>
          ] William H Beluch, Tim Genewein, Andreas Nürnberger, and
          <string-name>
            <given-names>Jan M</given-names>
            <surname>Köhler</surname>
          </string-name>
          .
          <article-title>The power of ensembles for active learning in image classification</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Che et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Tong</given-names>
            <surname>Che</surname>
          </string-name>
          , Xiaofeng Liu,
          <string-name>
            <given-names>Site</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yubin</given-names>
            <surname>Ge</surname>
          </string-name>
          , Ruixiang Zhang, Caiming Xiong, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Deep verifier networks: Verification of deep discriminative models with deep generative models</article-title>
          .
          <source>arXiv preprint arXiv:1911.07421</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Coates et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Coates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Honglak</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>An analysis of single-layer networks in unsupervised feature learning</article-title>
          .
          <source>In AISTATS</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Darlow et al.,
          <year>2018</year>
          ] Luke N Darlow,
          <string-name>
            <surname>Elliot J Crowley</surname>
          </string-name>
          ,
          <source>Antreas Antoniou, and Amos J Storkey. CINIC-10 is not ImageNet or CIFAR-10</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Gal and Ghahramani</source>
          , 2016]
          <string-name>
            <given-names>Yarin</given-names>
            <surname>Gal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoubin</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Dropout as a bayesian approximation: Representing model uncertainty in deep learning</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Geifmany, 2018] Geifmany.
          <article-title>VGG16 models for CIFAR-10 and CIFAR-100</article-title>
          . https://github.com/geifmany/cifar-vgg,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Ghanta et al., 2019a]
          <string-name>
            <given-names>Sindhu</given-names>
            <surname>Ghanta</surname>
          </string-name>
          , Sriram Subramanian,
          <string-name>
            <surname>Khermosh</surname>
          </string-name>
          , et al.
          <article-title>MMP: Model performance predictor</article-title>
          .
          <source>In OpML</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Ghanta et al., 2019b]
          <string-name>
            <given-names>Sindhu</given-names>
            <surname>Ghanta</surname>
          </string-name>
          , Sriram Subramanian, Lior Khermosh, Swaminathan Sundararaman, Harshil Shah, Yakov Goldberg, Drew Roselli, and
          Nisha Talagala
          .
          <article-title>ML health: Fitness tracking for production models</article-title>
          .
          <source>arXiv preprint arXiv:1902.02808</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Guo et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Chuan</given-names>
            <surname>Guo</surname>
          </string-name>
          , Geoff Pleiss,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <surname>Kilian Q Weinberger</surname>
          </string-name>
          .
          <article-title>On calibration of modern neural networks</article-title>
          .
          <source>In Proceedings of the International Conference on Machine Learning</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [He et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Hendrycks and Gimpel</source>
          , 2017]
          <string-name>
            <given-names>Dan</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Gimpel</surname>
          </string-name>
          .
          <article-title>A baseline for detecting misclassified and out-ofdistribution examples in neural networks</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Houben et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Houben</surname>
          </string-name>
          , Johannes Stallkamp, Jan Salmen,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Schlipsing</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Igel</surname>
          </string-name>
          .
          <article-title>Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark</article-title>
          .
          <source>In IJCNN</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Howard et al.,
          <year>2017</year>
          ] Andrew G Howard,
          <article-title>Menglong Zhu</article-title>
          , Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and
          <string-name>
            <given-names>Hartwig</given-names>
            <surname>Adam</surname>
          </string-name>
          . Mobilenets:
          <article-title>Efficient convolutional neural networks for mobile vision applications</article-title>
          .
          <source>arXiv preprint arXiv:1704.04861</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Istrate et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Roxana</given-names>
            <surname>Istrate</surname>
          </string-name>
          , Florian Scheidegger, Giovanni Mariani, Dimitrios Nikolopoulos, Constantine Bekas, and
          <article-title>Adelmo Cristiano Innocenza Malossi</article-title>
          . TAPAS:
          <article-title>Train-less accuracy predictor for architecture search</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Jiang et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Heinrich</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Been Kim, Melody Guan, and
          <string-name>
            <given-names>Maya</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>To trust or not to trust a classifier</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Kingma and Ba</source>
          , 2015]
          <article-title>Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization</article-title>
          .
          <source>ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Krizhevsky</source>
          , 2009]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          .
          <article-title>Learning multiple layers of features from tiny images</article-title>
          .
          <source>Technical report</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Kull et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Meelis</given-names>
            <surname>Kull</surname>
          </string-name>
          , Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Flach</surname>
          </string-name>
          .
          <article-title>Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration</article-title>
          .
          <source>In NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>[Lee</surname>
          </string-name>
          et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Kimin</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kibok</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Honglak</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Jinwoo</given-names>
            <surname>Shin</surname>
          </string-name>
          .
          <article-title>A simple unified framework for detecting out-of-distribution samples and adversarial attacks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>[Li</surname>
          </string-name>
          et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Zenan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoxing</given-names>
            <surname>Ma</surname>
          </string-name>
          , Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü.
          <article-title>Boosting operational DNN testing efficiency through conditioning</article-title>
          .
          <source>In ESEC/FSE</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Liang et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Shiyu</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yixuan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Rayadurgam</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Enhancing the reliability of out-of-distribution image detection in neural networks</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>[Malinin and Gales</source>
          , 2018]
          <string-name>
            <given-names>Andrey</given-names>
            <surname>Malinin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Gales</surname>
          </string-name>
          .
          <article-title>Predictive uncertainty estimation via prior networks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Mohseni et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Sina</given-names>
            <surname>Mohseni</surname>
          </string-name>
          , Akshay Jagadeesh, and
          <string-name>
            <given-names>Zhangyang</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Predicting model failure using saliency maps in autonomous driving systems</article-title>
          .
          <source>In ICML Workshop on Uncertainty and Robustness in Deep Learning</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [
          <string-name>
            <surname>Moosavi-Dezfooli</surname>
          </string-name>
          et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Seyed-Mohsen</given-names>
            <surname>Moosavi-Dezfooli</surname>
          </string-name>
          , Alhussein Fawzi, and
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Frossard</surname>
          </string-name>
          .
          <article-title>Deepfool: A simple and accurate method to fool deep neural networks</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Nguyen et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Phuc</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Deva Ramanan, and
          <string-name>
            <given-names>Charless</given-names>
            <surname>Fowlkes</surname>
          </string-name>
          .
          <article-title>Active testing: An efficient and robust framework for estimating accuracy</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Pinto et al.,
          <year>2019</year>
          ]
          Fábio Pinto, Marco OP Sampaio, and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Bizarro</surname>
          </string-name>
          .
          <article-title>Automatic model monitoring for data streams</article-title>
          .
          <source>In KDD-ADF</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Rauber et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jonas</given-names>
            <surname>Rauber</surname>
          </string-name>
          , Wieland Brendel, and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Bethge</surname>
          </string-name>
          .
          <article-title>Foolbox: A python toolbox to benchmark the robustness of machine learning models</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Russakovsky et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , Jia Deng, Hao Su,
          <string-name>
            <surname>Alexander C. Berg</surname>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
          </string-name>
          Fei-Fei.
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>[Schulam and Saria</source>
          , 2019]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Schulam</surname>
          </string-name>
          and
          <string-name>
            <given-names>Suchi</given-names>
            <surname>Saria</surname>
          </string-name>
          .
          <article-title>Can you trust this prediction? Auditing pointwise reliability after learning</article-title>
          .
          <source>In AISTATS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>[Sethi and Kantardzic</source>
          , 2017]
          <article-title>Tegjyot Singh Sethi</article-title>
          and
          <string-name>
            <given-names>Mehmed</given-names>
            <surname>Kantardzic</surname>
          </string-name>
          .
          <article-title>On the reliable detection of concept drift from streaming unlabeled data</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>[Simonyan and Zisserman</source>
          , 2015]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Snoek et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Jasper</given-names>
            <surname>Snoek</surname>
          </string-name>
          , Yaniv Ovadia, Emily Fertig,
          <string-name>
            <surname>Ren</surname>
          </string-name>
          , et al.
          <article-title>Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift</article-title>
          .
          <source>In NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [Unterthiner et al.,
          <year>2020</year>
          ] Thomas Unterthiner, Daniel Keysers, Sylvain Gelly, Olivier Bousquet, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Tolstikhin</surname>
          </string-name>
          .
          <article-title>Predicting neural network accuracy from weights</article-title>
          .
          <source>arXiv preprint arXiv:2002.11448</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>