=Paper= {{Paper |id=Vol-2675/paper3 |storemode=property |title=Uncertainty Quantification in Chest X-Ray Image Classification using Bayesian Deep Neural Networks |pdfUrl=https://ceur-ws.org/Vol-2675/paper3.pdf |volume=Vol-2675 |authors=Yumin Liu,Claire Zhao,Jonathan Rubin |dblpUrl=https://dblp.org/rec/conf/ecai/LiuZR20 }} ==Uncertainty Quantification in Chest X-Ray Image Classification using Bayesian Deep Neural Networks== https://ceur-ws.org/Vol-2675/paper3.pdf
Uncertainty Quantification in Chest X-Ray Image Classification using Bayesian Deep Neural Networks

Yumin Liu (1), Claire Zhao (2) and Jonathan Rubin (3)

(1) Northeastern University, USA, email: yuminliu@ece.neu.edu
(2) Philips Research North America, USA, email: claire.zhao@philips.com
(3) Philips Research North America, USA, email: jonathan.rubin@philips.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Deep neural networks (DNNs) have proven their effectiveness on numerous tasks. However, research into the reliability of DNNs falls behind their successful applications and remains to be further investigated. In addition to prediction, it is also important to evaluate how confident a DNN is about its predictions, especially when those predictions are being used within medical applications. In this paper, we quantify the uncertainty of DNNs for the task of Chest X-Ray (CXR) image classification. We investigate uncertainties of several commonly used DNN architectures including ResNet, ResNeXt, DenseNet and SENet. We then propose an uncertainty-based evaluation strategy that retains subsets of held-out test data ordered via uncertainty quantification. We analyze the impact of this strategy on the classifier performance. In addition, we also examine the impact of setting uncertainty thresholds on the performance. Results show that utilizing uncertainty information may improve DNN performance for some metrics and observations.

1 INTRODUCTION

Neural networks have been very successful in many fields such as natural language processing [41, 23], computer vision [18, 8], speech recognition [15, 5], machine translation [6], control systems [36], autonomous driving [4] and so on. However, there is much less research available on how reliable neural network predictions are. A common criticism of neural networks is that they are a black box that can perform very well on many tasks yet lacks interpretability. On the other hand, it is very important to ensure the reliability of a system involved in high-risk fields, including stock-market analysis, self-driving cars and medical imaging [28]. With the rapid development of machine learning and artificial intelligence, especially deep learning, these methods are finding more and more applications in health areas including disease diagnosis [9, 10], drug discovery [25, 30] and medical imaging [7, 16, 33]. Rather than just being told a final result by a machine learning algorithm, stakeholders (doctors, physicians, radiologists, etc.) would like to know how “confident” a neural network model is, so that they can take different actions according to different confidence levels. For example, in a medical image classification scenario, a neural network model is applied to detect whether a patient has a certain type of lung pathology by classifying his/her chest X-ray images. Ideally, physicians can trust the result of the neural network if it is highly confident (low uncertainty) about its prediction. Conversely, if the neural network gives a prediction with low confidence (high uncertainty), then the prediction cannot be trusted and the patient’s scan should be further examined by a radiologist. Applying this mechanism is beneficial since many X-ray images are acquired every day but radiologist resources are limited. It can help prioritize X-ray images for radiologists to examine, direct more attention to low-confidence instances and support treatment recommendations for highly confident instances.

Neural network-based deep learning algorithms are also becoming popular for medical X-ray image processing [27, 1, 35]. It is therefore necessary to examine the uncertainty of neural network models in medical X-ray image processing. The confidence of a prediction by a machine learning method can be measured by the uncertainty of the method's outputs. A typical way to estimate uncertainty is through Bayesian learning [2], which regards the parameters of a method as random variables, attempts to obtain the posterior distribution of the parameters during training, and marginalizes out the parameters to obtain the distribution of the prediction during inference. Bayesian learning is well developed in traditional non-neural network machine learning frameworks [2].

2 RELATED WORKS

In recent years, Bayesian learning and estimation of prediction uncertainty have gained more and more attention in the neural network context due to the wide application of deep neural networks in many areas [11, 3, 12, 13, 22, 14, 32, 24, 40, 26, 31].

The authors in [3] introduced a method called “Bayes By Backprop” to learn the posterior distribution on the weights of neural networks and obtain weight uncertainty. Essentially, this method assumes the weights come from a multivariate Gaussian distribution and updates the mean and covariance of the Gaussian instead of the weight samples during training. During inference the network weights are drawn from the learned distribution. This method is mathematically grounded, backpropagation-compatible and can learn the distribution of network weights directly, but it cannot utilize pre-trained models and a corresponding model has to be built for every neural network architecture. [13] reformulated dropout in neural networks as approximate Bayesian inference in deep Gaussian processes and can thus estimate uncertainty in neural networks with dropout layers. This method requires dropout layers applied before every weight layer. During inference, the dropout layers, with random 0-1 values drawn from a Bernoulli distribution, mask out some weights so that only a subset of the weights learned during the training phase is used to make a prediction. In [22], the authors further proposed that there are two types of uncertainty and showed the benefits of explicitly formulating these two uncertainties separately. The first type is called aleatoric uncertainty (or data uncertainty), which is due to the noise in the data and cannot be eliminated, while the other type is called epistemic uncertainty (or model uncertainty), which accounts for uncertainty in the model and can be eliminated given enough data. The network architectures have to be modified to add extra outputs in order to model these uncertainties. [24] adopted this typing of uncertainty, but modified the formulation of aleatoric and epistemic uncertainty to avoid the requirement of extra outputs.

[26] proposed a method called “Stochastic Weight Averaging Gaussian (SWAG)” to approximate the posterior distribution over the weights of neural networks as a Gaussian distribution by utilizing information in Stochastic Gradient Descent (SGD). This method has the advantage that it can be applied to almost all existing neural networks without modifying their original architectures and can directly leverage pre-trained models. [34] also decomposed predictive uncertainty in deep learning into two components and modeled them separately. They showed that quantifying the uncertainty can help to improve the predictive performance in medical image super resolution. [39] investigated the relationship between uncertain labels in the CheXpert [21] and Chest X-ray14 [37] data sets and the estimated uncertainty for corresponding instances using a Bayesian neural network, and suggested that utilizing uncertain labels helped prevent over-confidence for ambiguous instances.

Despite the above works in Bayesian deep neural network learning and uncertainty quantification, there are few works on evaluating the effects of uncertainty-based evaluation strategies for medical image classification. To the best of our knowledge, we are the first to apply uncertainty quantification strategies for chest X-ray image classification using deep neural networks and evaluate their impact on performance. The main contributions of this paper are:

• We apply uncertainty quantification to five deep neural network models for chest X-ray image classification and analyze their performances.
• We investigate the impact that uncertainty information has on classification task performance by evaluating subsets of held-out test data ordered via uncertainty quantification.

3 METHOD

In this section, we introduce the basic ideas of Bayesian Neural Networks and one of its approximations, SWAG [26], which is used in this paper. We also describe the uncertainty quantification method used in this paper.

3.1 Bayesian Neural Network

In ordinary deterministic neural networks, we obtain a point estimate of the network weights $w$, which are regarded as fixed values and are not changed after training. During inference, for each input $x_i$ we get one deterministic prediction $p(y_i \mid x_i) = p(y_i \mid x_i, w)$ without any uncertainty information.

In the Bayesian neural network setting, in addition to the target prediction, we also want the uncertainty of the prediction. To do so we regard the neural network weights as random variables that follow some form of distribution and try to estimate the posterior distribution of the network weights given the training data during training. We then integrate out the weights and obtain the distribution over the prediction during inference. From the prediction distribution we can further calculate the prediction output and the corresponding uncertainty. More specifically, let $D = \{(X, Y)\}$ and $w$ be the training data and weights of a neural network, respectively. The ordinary deterministic neural network methods try to get a point estimate of $w$ by either the maximum likelihood estimator (MLE), $w^* = \arg\max_w p(D \mid w)$, or the maximum a posteriori (MAP) estimator, $w^* = \arg\max_w p(w \mid D)$, where $p(w \mid D) = \frac{p(w)\,p(D \mid w)}{p(D)} \propto p(w)\,p(D \mid w)$. The weights $w^*$ are fixed after training and used for inference on new data. In Bayesian learning, we estimate the posterior distribution $p(w \mid D)$ during training and marginalize out $w$ during inference to get a probability distribution of the prediction:

$$p(y \mid x, D) = \mathbb{E}_{w \sim p(w \mid D)}\left[p(y \mid x, w)\right] = \int p(y \mid x, w)\, p(w \mid D)\, dw \qquad (1)$$

After obtaining $p(y \mid x)$, we can calculate the statistical moments of the predicted variable and regard the first and second moment (i.e., mean and variance) as the prediction and uncertainty, respectively.

However, in practice there are two major difficulties. The first is that $p(D) = \int p(w)\, p(D \mid w)\, dw$ is usually intractable, so we cannot get the exact $p(w \mid D)$. The second is that Eq. (1) is also usually intractable for neural networks. One common approach to the first difficulty is to use a simpler form of distribution $q(w \mid \theta)$ with hyperparameters $\theta$ to approximate $p(w \mid D)$ by minimizing the Kullback-Leibler (KL) divergence between $q(w \mid \theta)$ and $p(w \mid D)$. This turns the problem into an easier optimization problem:

$$\theta^* = \arg\min_\theta \mathrm{KL}\left[q(w \mid \theta)\,\|\,p(w \mid D)\right] = \arg\min_\theta \int q(w \mid \theta) \log \frac{q(w \mid \theta)}{p(w \mid D)}\, dw \qquad (2)$$

For the second difficulty, the usual approach is to use sampling to estimate Eq. (1), which becomes

$$p(y \mid x) \approx \mathbb{E}_{w \sim q(w \mid \theta^*)}\left[p(y \mid x, w)\right] \approx \frac{1}{T} \sum_{i=1}^{T} p(y \mid x, w^{(i)}) \qquad (3)$$

where $w^{(i)} \sim q(w \mid \theta^*)$.

Different methods have been proposed to approximate the posterior $p(w \mid D)$ or to obtain samples of $w$ [26, 3, 12, 13].

3.2 Stochastic Weight Averaging Gaussian (SWAG)

The basic idea of SWAG [26] is to regard the weights of the neural network as random variables and obtain their statistical moments through training with SGD. These moments are then used to fit a multivariate Gaussian that approximates the posterior distribution of the weights. After the original training process in which we get the optimal weights, we continue to train the model using the same training data with SGD and get $T$ samples of the weights $w_1, w_2, \cdots, w_t, \cdots, w_T$. The mean of those samples is $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$. The mean of the squares is $\overline{w^2} = \frac{1}{T}\sum_{t=1}^{T} w_t^2$, and we define a diagonal matrix $\Sigma_{diag} = \mathrm{diag}(\overline{w^2} - \bar{w}^2)$ and a deviation matrix $R = [R_1, \cdots, R_t, \cdots, R_T]$ whose columns are $R_t = w_t - \bar{w}_t$, where $\bar{w}_t = \frac{1}{t}\sum_{j=1}^{t} w_j$ is the running average of the first $t$ weight samples. In the original paper, the authors used the last $K$ columns of $R$ to get a low-rank approximation of $R$. The $K$-rank approximation is $\hat{R} = [R_{T-K+1}, \cdots, R_T]$. The mean and covariance matrix of the fitted Gaussian are then given by:

$$w_{SWA} = \bar{w} \qquad (4)$$

$$\Sigma_{SWA} = \frac{1}{2}\Sigma_{diag} + \frac{1}{2(K-1)}\hat{R}\hat{R}^{T} \qquad (5)$$

During inference, for each input (image) $x_i$, sample the weights from the Gaussian, $w_s \sim \mathcal{N}(w_{SWA}, \Sigma_{SWA})$, then update the batch norm statistics by performing one epoch of forward pass, and then
the sample prediction is given by $p(\hat{y}_{is} \mid x_i) = p(y_i \mid x_i, w_s)$. Repeat the procedure $S$ times and we get $S$ predictions $\hat{y}_{i1}, \hat{y}_{i2}, \cdots, \hat{y}_{is}, \cdots, \hat{y}_{iS}$ for the same input $x_i$. By using these $S$ predictions we can get the final prediction and uncertainty. For a regression problem, the final prediction will be $\hat{y}_i = \frac{1}{S}\sum_{s=1}^{S} \hat{y}_{is}$.

3.3 Uncertainty Quantification

Several methods have been proposed to quantify the uncertainty in classification [24, 22]. Here we adopt the method proposed by [24] since it does not require extra outputs and does not need to modify the network architectures.

For a classification problem, suppose there are $C$ classes, and denote by $p_s \triangleq [p_{s1}, p_{s2}, \cdots, p_{sC}] = p(y \mid x, \theta_s)$, $s \in \{1, 2, \cdots, S\}$, the softmax output (or sigmoid output in the binary case $C = 2$) of the neural network for the same input $x$ repeated $S$ times, where $\theta_s$ denotes the $s$-th sampled set of network weights (written $w_s$ above). The predicted “probability” is the average of those $S$ sample outputs, $\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p_s$. The predicted class label index is $\hat{y} = \arg\max_c \bar{p}$. The aleatoric uncertainty $U_a$ and the epistemic uncertainty $U_e$ are

$$U_a = \frac{1}{S}\sum_{s=1}^{S}\left[\mathrm{diag}(p_s) - p_s p_s^{T}\right], \qquad U_e = \frac{1}{S}\sum_{s=1}^{S}(p_s - \bar{p})(p_s - \bar{p})^{T}$$

The total uncertainty is $U_{total} = U_a + U_e$. For binary classification, the sigmoid output is a scalar and the uncertainty equations reduce to

$$U_a = \frac{1}{S}\sum_{s=1}^{S} p_s(1 - p_s) \qquad (6)$$

$$U_e = \frac{1}{S}\sum_{s=1}^{S} (p_s - \bar{p})^2 \qquad (7)$$

where $\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p_s$ and $p_s = p(y = 1 \mid x, \theta_s) = 1 - p(y = 0 \mid x, \theta_s)$. The predicted label is:

$$\hat{y} = \begin{cases} 1, & \bar{p} \ge 0.5 \\ 0, & \bar{p} < 0.5 \end{cases} \qquad (8)$$

In this way, we can get uncertainties for all the instances.

3.4 Transfer Learning

Transfer learning is a widely used technique to help improve the performance of deep neural networks in image classification. Here we can also benefit from transfer learning by loading neural network models pre-trained on the ImageNet (http://image-net.org) dataset. The SWAG method has the advantageous characteristic that it does not require modifying the architecture of the original neural networks, and therefore we can fully utilize models pre-trained on the ImageNet dataset to speed up the training process and get better predictions. In the initialization stage, we download the pre-trained model parameters and use them to initialize the models to be trained.

3.5 Procedure

We follow the method in [26] to approximate the Bayesian neural network and the formulas in [24] to quantify the uncertainty of the models. The overall algorithm for SWAG and uncertainty quantification is shown in Algorithm 1. We initialize the model with the corresponding pre-trained model, and then fine-tune it by training using chest X-ray images and observation labels. After that we perform the SWAG algorithm by continuing training using Stochastic Gradient Descent for $T$ epochs and calculate the statistics $\bar{w}$, $\overline{w^2}$, $\Sigma_{diag}$ and $\hat{R}$, from which we can get $w_{SWA}$ and $\Sigma_{SWA}$ using Eq. (4) and (5). Then we fit a multivariate Gaussian using $w_{SWA}$ as mean and $\Sigma_{SWA}$ as covariance and get an approximate distribution for the neural network weights. When making a prediction, an input chest X-ray image is repeatedly fed into the network $S$ times, each time with a new set of weights sampled from the Gaussian distribution. The $S$ output probabilities are used to calculate the final predicted label $\hat{y}_i$ and uncertainty $U_{total} = U_a + U_e$. It is worthwhile to note that, after drawing sample weights, the network batch norm statistics need to be updated for the models that use batch normalization. This can be achieved by running one epoch with part or all of the training set $D$. A more detailed justification of this necessity is given in the original paper [26].

Algorithm 1 Uncertainty Quantification
 1: Input:
      $D = \{(X, Y)\}$ / $X_i$: training / evaluating chest X-ray images and corresponding observation labels
 2: Initialization:
      load neural network (NN) models pre-trained on ImageNet
 3: Training:
      fine-tune NN models using the CheXpert dataset
 4: Perform SWAG:
      Continue training with SGD
        i) train NN models using SGD for some epochs with $D$
        ii) save statistics of the weights for those epochs
        iii) calculate $w_{SWA}$ and $\Sigma_{SWA}$ using Eq. (4) and (5)
        iv) fit a Gaussian using $w_{SWA}$ as mean and $\Sigma_{SWA}$ as covariance
      Prediction
        for s from 1 to S
            draw weights $w_s \sim \mathcal{N}(w_{SWA}, \Sigma_{SWA})$
            update batch norm statistics using $D$
            $p(y_{is} \mid X_i) = p(y_{is} \mid X_i, w_s)$
        end for
 5: Calculate Outputs:
      $p(y_i \mid X_i) = \frac{1}{S}\sum_{s=1}^{S} p(y_{is} \mid X_i)$
      calculate $\hat{y}_i$, $U_a$ and $U_e$ using Eq. (8), (6) and (7)
      $U_{total} = U_a + U_e$
 6: Return:
      $\hat{y}_i$, $U_a$, $U_e$, $U_{total}$

4 DATASET

We perform experiments using the CheXpert data set [21]. CheXpert is a large chest X-ray dataset released by researchers at Stanford University. This dataset consists of 224,316 chest radiographs of 65,240 patients. Each data instance contains a chest X-ray image and a vector label describing the presence of 14 observations (pathologies) as positive, negative, or uncertain. The labels were extracted from radiology reports using natural language processing approaches. For our experiments we focus on 5 observations, namely Cardiomegaly, Edema, Atelectasis, Consolidation and Pleural Effusion. As [21] pointed out, these 5 observations were selected based on their clinical importance and prevalence in this dataset. In their experiment they also used these 5 observations to evaluate the labeling approaches. A sample image for each observation is shown in Figure 1.

Figure 1: Sample image for each observation. From left to right: no finding (all negative), cardiomegaly, edema, consolidation, atelectasis and pleural effusion

The original dataset consists of a training set and a validation set, and we do not have access to the test set. The labels for the training set were generated by an automated rule-based labeler which extracts information from radiology reports. This was done by the Stanford research group who released the dataset. There are three possible values for the label of an instance for a given observation, i.e., 1, 0 and −1. 1 means the observation is positive (present), 0 means negative (absent), and −1 means it is not certain whether the observation is present. The labels for the validation set were determined by the majority vote of three board-certified radiologists and only contain positive (1) or negative (0) values. The original paper [21] investigated several different ways to deal with the uncertain labels (−1), such as regarding them as positive (1), negative (0), the same as the majority class, or a separate class. They found that the optimal way to deal with the uncertain labels differs between observations, and they gave the replacement for the 5 observations mentioned above. Based on the results from [21] and for simplicity, we replace the uncertain labels with 0 or 1 for different observations.

Specifically, the uncertain labels of cardiomegaly, consolidation and pleural effusion are replaced with 0, while those of edema and atelectasis are replaced with 1. The problem therefore becomes a multi-label binary image classification problem. The predicted result is a five-dimensional vector with each element being 1 or 0, where 1 means that the network predicts the presence of the corresponding observation while 0 means the network predicts its absence. We follow the official training set / validation set split given by the data set provider. After removing invalid instances, we get a total of 223,414 instances for training and 234 instances for validation. We first initialize the neural network’s parameters with the corresponding downloaded pre-trained model parameters, and then train the neural network using the training set and test its performance on the validation set. We use the original training set as the training set and the original validation set as the evaluation set in our experiments.

In Figure 2 we show the patient statistics of the 5 observations after replacing the uncertain labels in the training set. The prevalence is the ratio of the number of positive instances to the total number of instances. From the figure we can see that all five observations are imbalanced, with prevalence under 50%. Besides, there is a gap in the prevalence between the training and evaluation sets for all observations, which will probably affect the performance of the neural network models.

Figure 2: Patient statistics. (a) Prevalence of observations (b) Gender proportion (c) Training set age histogram (d) Validation set age histogram

5 EXPERIMENT

In this section, we perform experiments and present the results of uncertainty quantification and the uncertainty-based strategy on five different neural network models using a PyTorch implementation. These neural networks are DenseNet [20] with 121 layers (denoted DenseNet121), DenseNet with 201 layers (denoted DenseNet201), ResNet [17] with 152 layers (denoted ResNet152), ResNeXt [38] with 101 layers (denoted ResNeXt101) and the Squeeze-and-Excitation network [19] with 154 layers (denoted SENet154). ResNet uses skip connections to mitigate the vanishing gradient problem and was the winner of the ILSVRC 2015 [29] and COCO 2015 (http://cocodataset.org) competitions. ResNeXt is a variant of ResNet and won 2nd place in the ILSVRC 2016 classification task. DenseNet further utilizes the concept of skip connections by connecting the output of each layer to all its subsequent layers, forming “dense” skip connections. DenseNet further alleviates the vanishing gradient problem, reduces the number of parameters and reuses intermediate features, and has been widely used since it was proposed. SENet uses squeeze-and-excitation blocks to model image channel interdependencies and won the ILSVRC 2017 competition for the classification task.

All networks are trained as binary classifiers for multi-label classification instead of training separate models for each class.

The pipeline of the experiment is shown in Figure 4. We use a PyTorch implementation. The neural network models and pre-trained parameters are from torchvision (except SENet154, which is from pretrainedmodels, https://github.com/Cadene/pretrained-models.pytorch).

In our experiment we set the number of sample weights $T = 5$, the number of columns of the deviation matrix $K = 10$ and the number of repeated prediction samples $S = 10$. During training, we use the Adam optimizer with a weight decay regularizer and the ReduceLROnPlateau learning rate scheduler. The initial learning rate is $1 \times 10^{-5}$ and the weight decay coefficient is 0.005. The maximum number of fine-tuning epochs is 50. The original chest X-ray images are resized and randomly cropped to 256 × 256 (except for SENet154, which has a fixed input size of 224 × 224). We stop fine-
tuning the model when the AUC (explained below) does not increase for 10 consecutive epochs and save
the model with the best AUC as the optimal trained model.

  Figure 3: Comparison of performance between original deterministic network and Bayesian neural network
  with uncertainty strategy. The neural network is DenseNet with 201 layers.

  Figure 4: Pipeline of the experiment

   We use four metrics to evaluate the network classification performance: area under curve (AUC),
sensitivity, specificity and precision. These metrics are widely used in the machine learning and
medical communities. The AUC is often used to measure the quality of a classifier and is defined as the
area under the Receiver Operating Characteristic (ROC) curve, which plots the sensitivity against the
false positive rate. The sensitivity (or true positive rate or recall) is defined as the ratio of the
number of correctly predicted positive instances over the number of total positive instances. The
specificity is defined as the ratio of the number of correctly predicted negative instances over the
total number of negative instances. The precision is defined as the ratio of the number of correctly
predicted positive instances over the number of instances that are predicted as positive.

5.1    Without Strategy

First we compare the AUC of the original ordinary deterministic neural networks with the AUC of the
corresponding neural networks after performing SWAG but before applying any uncertainty strategies. The
results are shown in Table 1. The “Average” column is the average over all 5 observations. The bold
font indicates better performance. For edema and pleural effusion, the original neural network performs
better than SWAG for most of the networks. For cardiomegaly, consolidation and atelectasis, the
performances are mixed. This may be because edema and pleural effusion are harder to detect and more
sensitive to perturbation of the network weights. On the whole the SWAG algorithm does not outperform
the original neural network. A possible explanation is that SWAG uses a Gaussian to approximate the
distribution over the optimal weights and then draws sample weights from the approximated Gaussian
distribution, so the sampled weights may deviate from the optimal weights if the approximation is
inaccurate. Therefore we need to adopt some strategy to prevent the performance from deteriorating.
The benefit is that we can get an uncertainty estimate for each prediction while keeping similar or
even better prediction results.

                                         Table 1: Original AUC vs SWAG AUC
              Average           Cardiomegaly      Edema             Consolidation     Atelectasis       Pleural Effusion
Networks      Original  SWAG    Original  SWAG    Original  SWAG    Original  SWAG    Original  SWAG    Original  SWAG
Resnet152     0.8831   0.8786   0.8376   0.8149   0.9123   0.8713   0.8927   0.9234   0.8543   0.8537   0.9184   0.9298
ResNext101    0.8807   0.8726   0.8013   0.8339   0.9212   0.8748   0.9250   0.9311   0.8246   0.8162   0.9314   0.9071
SEnet154      0.8794   0.8695   0.8203   0.8040   0.9195   0.8702   0.9216   0.9187   0.8056   0.8553   0.9301   0.8992
Densenet121   0.8842   0.8942   0.8436   0.8752   0.9264   0.8940   0.9139   0.9512   0.8153   0.8489   0.9220   0.9016
Densenet201   0.8793   0.8356   0.8259   0.8397   0.9165   0.8796   0.9313   0.7739   0.7936   0.7714   0.9294   0.9132

5.2    With Coverage Strategy

Next we utilize the uncertainty quantification information to determine if the performances can be
improved. One strategy is to sort instances according to uncertainty in ascending order, and then take
those instances with less uncertainty into consideration and discard the rest. In clinical practice, the
discarded instances could be flagged for further evaluation by a physician.
   Ideally we would expect a decreasing trend for the metrics when data coverage increases, as shown in
Figure 6. The horizontal axis “Data coverage” is the percentage of instances being considered. For
example, a data coverage of 20% means that only the twenty percent least uncertain (or most confident)
instances are taken into consideration and the rest are discarded.
   Figure 3 shows the comparison of performances with regard to the four metrics (AUC, sensitivity,
specificity and precision) between
the original deterministic networks and Bayesian neural networks with uncertainty strategy. The solid
lines are the Bayesian neural network with uncertainty strategy, while the dashed lines are the original
ordinary deterministic networks without any uncertainty strategy. Different colors represent different
observations.

  Figure 6: Expected ideal performance. The metric decreases as data coverage increases.

   From Figure 3 we can see that for edema and pleural effusion, the AUC decreases as the coverage
increases, and is above the corresponding original AUC until around 45% and 90% coverage, respectively.
This means that applying the uncertainty strategy can improve AUC for these two observations. The
highest AUC gain can be 8% and 6% for edema and pleural effusion, respectively. We also observe a
similar trend in sensitivity, specificity and precision for both edema and pleural effusion. Three
observations (cardiomegaly, atelectasis and consolidation) have low sensitivity as most of the
predictions are negative. On the contrary the specificity is high.
   The highest gains from applying the uncertainty strategy are shown in Table 2. The effect of the
uncertainty strategy over the five observations with the model DenseNet201 is summarized in Table 3.
The symbols √, ×, ◦ and - represent helpful, not helpful, mixed behavior and missing value,
respectively. For edema and pleural effusion, applying the uncertainty strategy is beneficial for
improving all four metrics. However, for other observations, it shows no benefit or only limited
benefit for some metrics. The reason why it shows varied behavior may be interesting and needs further
investigation. Similarly, we summarize the effect of applying the uncertainty strategy for different
neural network architectures and the results are shown in Table 4 to Table 7. From the tables we can
see that applying the uncertainty strategy will help to improve some performance metrics for all four
neural network models.

Table 2: Performance gain for edema and pleural effusion. The values are the absolute and relative gains
Gain (% Gain)      AUC              Sensitivity       Specificity      Precision
Edema              0.0835 (9.11%)   0.3778 (60.71%)   0.0476 (5.00%)   0.2432 (32.14%)
Pleural effusion   0.0706 (7.60%)   0.2687 (36.73%)   0.0778 (8.44%)   0.2097 (26.53%)

Table 3: Effect of uncertainty strategy for DenseNet201.
Densenet201        AUC   Sens.   Spec.   Prec.
Cardiomegaly       ×     ×       √       ◦
Edema              √     √       √       √
Consolidation      ×     ×       ◦       -
Atelectasis        ◦     ×       ◦       ×
Pleural effusion   √     √       √       √
√: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

Table 4: Effect of uncertainty strategy for different networks
ResNet152          AUC   Sens.   Spec.   Prec.
Cardiomegaly       √     ×       -       -
Edema              ×     ×       √       √
Consolidation      ×     ×       -       -
Atelectasis        ×     ×       √       √
Pleural effusion   √     √       √       √
√: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

Table 5: Effect of uncertainty strategy for different networks
SENet154           AUC   Sens.   Spec.   Prec.
Cardiomegaly       √     -       -       -
Edema              ×     ×       √       ×
Consolidation      √     -       -       -
Atelectasis        ◦     -       ×       -
Pleural effusion   √     ×       √       √
√: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

Table 6: Effect of uncertainty strategy for different networks
ResNext101         AUC   Sens.   Spec.   Prec.
Cardiomegaly       √     ×       ×       ×
Edema              ×     ×       √       √
Consolidation      √     ×       √       -
Atelectasis        ◦     √       ×       ×
Pleural effusion   √     √       √       √
√: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

   Although for some observations (e.g., pleural effusion) several metrics benefit considerably from
applying the uncertainty strategy, we should also notice that the strategy does not help to improve
performance for some other observations with regard to these metrics, and in some cases even degrades
the performance. The reasons may vary and need more investigation. For example, it may be that the
neural network weight distribution approximated by the SWAG algorithm does not capture the true
distribution, or even that the uncertainty quantification formulas are inappropriate.

  Figure 5: Estimated total uncertainty (aleatoric + epistemic) histogram for each observation. Panels:
  (a) Cardiomegaly, (b) Edema, (c) Consolidation, (d) Atelectasis, (e) Pleural effusion.

  Figure 7: Comparison of performance between original deterministic network and Bayesian neural network
  with uncertainty threshold.

5.3       With Absolute Threshold Strategy

We also plot the total uncertainty distribution for each observation, as shown in Figure 5. From the
figure we can see that for cardiomegaly, the estimated uncertainty tends to be smaller, while for edema,
atelectasis and pleural effusion, the proportion of larger estimated uncertainty is higher.
Consolidation has a relatively even distribution of estimated uncertainty. This suggests that edema,
atelectasis and pleural effusion are more prone to be affected by setting an uncertainty threshold.
Combining this finding with the results in Table 2, we set thresholds for both edema and pleural
effusion to check the influence on metric performance. We only consider the instances whose estimated
uncertainty is smaller than the threshold to compute the performance metrics. We vary the threshold
from 0.2 to 0.24 in steps of 0.01 and the results are shown in Figure 7. The black dashed line is the
average metric value of the original deterministic neural network, the thin solid colored lines are the
metric values for each observation, and the thick brown line is the average metric value over all five
observations after applying the threshold only to edema and pleural effusion. Comparing the thick brown
line with the dashed black line, we can see that the average specificity and precision have improved
while the average AUC and sensitivity remain roughly the same. This means that applying an uncertainty
threshold to edema and pleural effusion is beneficial.
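
The two selection rules used in Sections 5.2 and 5.3, keeping a fixed fraction of the least uncertain
instances and keeping only instances whose uncertainty is below an absolute threshold, can be sketched
in plain Python as follows. This is a minimal illustration and not the authors' implementation: the
function names, the list-based inputs, the rounding of the retained count and the choice to return
None for an undefined ratio are all assumptions made for the example. The per-instance uncertainty
values and binary predictions are taken as given.

```python
from typing import Dict, Iterable, List, Optional, Sequence


def select_by_coverage(uncertainty: Sequence[float], coverage: float) -> List[int]:
    """Indices of the least uncertain instances, keeping a fraction `coverage`."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Sort instance indices by ascending uncertainty (most confident first).
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    # Assumption: round to the nearest count and keep at least one instance.
    n_keep = max(1, round(coverage * len(order)))
    return order[:n_keep]


def select_by_threshold(uncertainty: Sequence[float], threshold: float) -> List[int]:
    """Indices of instances whose uncertainty is strictly below `threshold`."""
    return [i for i, u in enumerate(uncertainty) if u < threshold]


def confusion_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], keep: Iterable[int]
) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity and precision over the retained instances."""
    tp = fp = tn = fn = 0
    for i in keep:
        if y_pred[i] == 1 and y_true[i] == 1:
            tp += 1
        elif y_pred[i] == 1 and y_true[i] == 0:
            fp += 1
        elif y_pred[i] == 0 and y_true[i] == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num: int, den: int) -> Optional[float]:
        # Assumption: an undefined ratio (zero denominator) is reported as None.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Toy data (hypothetical values, for illustration only).
uncertainty = [0.30, 0.05, 0.22, 0.10, 0.40]
y_true = [1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1]

kept_cov = select_by_coverage(uncertainty, coverage=0.6)
kept_thr = select_by_threshold(uncertainty, threshold=0.24)
print(confusion_metrics(y_true, y_pred, kept_cov))
print(confusion_metrics(y_true, y_pred, kept_thr))
```

Instances left out of the returned index lists correspond to the discarded cases that, as noted in
Section 5.2, could be flagged for review by a physician.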



Table 7: Effect of uncertainty strategy for different networks
DenseNet121        AUC   Sens.   Spec.   Prec.
Cardiomegaly       ×     ×       -       ◦
Edema              ×     ×       √       √
Consolidation      √     -       -       -
Atelectasis        √     ◦       √       √
Pleural effusion   √     √       √       √
√: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

6   CONCLUSION

In this paper we investigate uncertainty quantification in medical image classification using Bayesian
deep neural networks. We train five different deep neural network models on the CheXpert X-ray image
data for five clinical observations and quantify the model uncertainty. Then we analyze the performance
of the networks with and without applying an uncertainty strategy. The results show that uncertainty
quantification and the uncertainty strategy improve several performance metrics for some observations.
This suggests that uncertainty quantification is helpful in medical image classification using neural
networks. However, the results also show that in some cases the strategy is not helpful, or can even
degrade the performance. Further analysis is needed to examine this phenomenon.
REFERENCES                                                                             bayesian deep learning for computer vision?’, in Advances in neural
                                                                                       information processing systems, pp. 5574–5584, (2017).
                                                                                [23]   Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Brad-
 [1] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and               bury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard
     Hayit Greenspan, ‘Chest pathology detection using deep learning with              Socher, ‘Ask me anything: Dynamic memory networks for natural lan-
     non-medical training’, in 2015 IEEE 12th International Symposium on               guage processing’, in International conference on machine learning,
     Biomedical Imaging (ISBI), pp. 294–297. IEEE, (2015).                             pp. 1378–1387, (2016).
 [2] Christopher M Bishop, Pattern recognition and machine learning,            [24]   Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee Cho
     springer, 2006.                                                                   Paik, ‘Uncertainty quantification using bayesian neural networks in
 [3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan                   classification: Application to ischemic stroke lesion segmentation’,
     Wierstra, ‘Weight uncertainty in neural networks’, arXiv preprint                 Medical Imaging with Deep Learning, 2018, (2018).
     arXiv:1505.05424, (2015).                                                  [25]   Antonio Lavecchia, ‘Machine-learning approaches in drug discovery:
 [4] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard                  methods and applications’, Drug discovery today, 20(3), 318–331,
     Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Mon-                 (2015).
     fort, Urs Muller, Jiakai Zhang, et al., ‘End to end learning for self-     [26]   Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and
     driving cars’, arXiv preprint arXiv:1604.07316, (2016).                           Andrew Gordon Wilson, ‘A simple baseline for bayesian uncertainty in
 [5] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, ‘Listen, at-            deep learning’, arXiv preprint arXiv:1902.02476, (2019).
     tend and spell: A neural network for large vocabulary conversational       [27]   Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel
     speech recognition’, in 2016 IEEE International Conference on Acous-              Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie
     tics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE,                 Shpanskaya, et al., ‘Chexnet: Radiologist-level pneumonia detection
     (2016).                                                                           on chest x-rays with deep learning’, arXiv preprint arXiv:1711.05225,
 [6] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio, ‘A character-                   (2017).
     level decoder without explicit segmentation for neural machine transla-    [28]   Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib, ‘Deep learn-
     tion’, arXiv preprint arXiv:1603.06147, (2016).                                   ing for medical image processing: Overview, challenges and the future’,
 [7] Marleen de Bruijne. Machine learning approaches in medical image                  in Classification in BioApps, 323–350, Springer, (2018).
     analysis: From detection to diagnosis, 2016.                               [29]   Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
 [8] Chao Dong, Chen Change Loy, and Xiaoou Tang, ‘Accelerating the                    Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
     super-resolution convolutional neural network’, in European confer-               Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, ‘ImageNet
     ence on computer vision, pp. 391–407. Springer, (2016).                           Large Scale Visual Recognition Challenge’, International Journal of
 [9] Meherwar Fatima and Maruf Pasha, ‘Survey of machine learning algo-                Computer Vision (IJCV), 115(3), 211–252, (2015).
     rithms for disease diagnostic’, Journal of Intelligent Learning Systems    [30]   Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik, ‘Inverse molec-
     and Applications, 9(01), 1, (2017).                                               ular design using machine learning: Generative models for matter engi-
[10] Konstantinos P Ferentinos, ‘Deep learning models for plant disease de-            neering’, Science, 361(6400), 360–365, (2018).
     tection and diagnosis’, Computers and Electronics in Agriculture, 145,     [31]   Peter Schulam and Suchi Saria, ‘Can you trust this prediction? auditing
     311–318, (2018).                                                                  pointwise reliability after learning’, in The 22nd International Confer-
[11] Yarin Gal, Uncertainty in deep learning, Ph.D. dissertation, PhD thesis,          ence on Artificial Intelligence and Statistics, pp. 1022–1031, (2019).
     University of Cambridge, 2016.                                             [32]   Murat Sensoy, Lance Kaplan, and Melih Kandemir, ‘Evidential deep
[12] Yarin Gal and Zoubin Ghahramani, ‘Bayesian convolutional neural net-              learning to quantify classification uncertainty’, in Advances in Neural
     works with bernoulli approximate variational inference’, arXiv preprint           Information Processing Systems, pp. 3179–3189, (2018).
     arXiv:1506.02158, (2015).                                                  [33]   Kenji Suzuki, ‘Overview of deep learning in medical imaging’, Radio-
[13] Yarin Gal and Zoubin Ghahramani, ‘Dropout as a bayesian approxima-                logical physics and technology, 10(3), 257–273, (2017).
     tion: Representing model uncertainty in deep learning’, in international   [34]   Ryutaro Tanno, Daniel Worrall, Enrico Kaden, Aurobrata Ghosh,
     conference on machine learning, pp. 1050–1059, (2016).                            Francesco Grussu, Alberto Bizzi, Stamatios N Sotiropoulos, Anto-
[14] Yarin Gal, Riashat Islam, and Zoubin Ghahramani, ‘Deep bayesian ac-               nio Criminisi, and Daniel C Alexander, ‘Uncertainty quantification
     tive learning with image data’, in Proceedings of the 34th International          in deep learning for safer neuroimage enhancement’, arXiv preprint
     Conference on Machine Learning-Volume 70, pp. 1183–1192. JMLR.                    arXiv:1907.13418, (2019).
     org, (2017).                                                               [35]   Demetri Terzopoulos et al., ‘Semi-supervised multi-task learning with
[15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, ‘Speech                   chest x-ray images’, in International Workshop on Machine Learning
     recognition with deep recurrent neural networks’, in 2013 IEEE inter-             in Medical Imaging, pp. 151–159. Springer, (2019).
     national conference on acoustics, speech and signal processing, pp.        [36]   Huanqing Wang, Peter Xiaoping Liu, Shuai Li, and Ding Wang, ‘Adap-
     6645–6649. IEEE, (2013).                                                          tive neural output-feedback control for a class of nonlower triangular
[16] Hayit Greenspan, Bram Van Ginneken, and Ronald M Summers,                         nonlinear systems with unmodeled dynamics’, IEEE transactions on
     ‘Guest editorial deep learning in medical imaging: Overview and future            neural networks and learning systems, 29(8), 3658–3668, (2017).
     promise of an exciting new technique’, IEEE Transactions on Medical        [37]   Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi
     Imaging, 35(5), 1153–1159, (2016).                                                Bagheri, and Ronald M Summers, ‘Chestx-ray8: Hospital-scale chest x-
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep resid-               ray database and benchmarks on weakly-supervised classification and
     ual learning for image recognition’, in Proceedings of the IEEE confer-           localization of common thorax diseases’, in Proceedings of the IEEE
     ence on computer vision and pattern recognition, pp. 770–778, (2016).             conference on computer vision and pattern recognition, pp. 2097–2106,
[18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,                      (2017).
     Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam,             [38]   Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming
     ‘Mobilenets: Efficient convolutional neural networks for mobile vision            He, ‘Aggregated residual transformations for deep neural networks’, in
     applications’, arXiv preprint arXiv:1704.04861, (2017).                           Proceedings of the IEEE conference on computer vision and pattern
[19] Jie Hu, Li Shen, and Gang Sun, ‘Squeeze-and-excitation networks’, in              recognition, pp. 1492–1500, (2017).
     Proceedings of the IEEE conference on computer vision and pattern          [39]   Hao-Yu Yang, Junling Yang, Yue Pan, Kunlin Cao, Qi Song, Feng Gao,
     recognition, pp. 7132–7141, (2018).                                               and Youbing Yin, ‘Learn to be uncertain: Leveraging uncertain labels
[20] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-                 in chest x-rays with bayesian neural networks’, in Proceedings of the
     berger, ‘Densely connected convolutional networks’, in Proceedings of             IEEE Conference on Computer Vision and Pattern Recognition Work-
     the IEEE conference on computer vision and pattern recognition, pp.               shops, pp. 5–8, (2019).
     4700–4708, (2017).                                                         [40]   Jiayu Yao, Weiwei Pan, Soumya Ghosh, and Finale Doshi-Velez, ‘Qual-
[21] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana                    ity of uncertainty quantification for bayesian neural network inference’,
     Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn                 arXiv preprint arXiv:1906.09686, (2019).
     Ball, Katie Shpanskaya, et al., ‘Chexpert: A large chest radiograph        [41]   Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria,
     dataset with uncertainty labels and expert comparison’, arXiv preprint            ‘Recent trends in deep learning based natural language processing’,
     arXiv:1901.07031, (2019).                                                         ieee Computational intelligenCe magazine, 13(3), 55–75, (2018).
[22] Alex Kendall and Yarin Gal, ‘What uncertainties do we need in