=Paper=
{{Paper
|id=Vol-2675/paper3
|storemode=property
|title=Uncertainty Quantification in Chest X-Ray Image Classification using Bayesian Deep Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2675/paper3.pdf
|volume=Vol-2675
|authors=Yumin Liu,Claire Zhao,Jonathan Rubin
|dblpUrl=https://dblp.org/rec/conf/ecai/LiuZR20
}}
==Uncertainty Quantification in Chest X-Ray Image Classification using Bayesian Deep Neural Networks==
Yumin Liu<sup>1</sup>, Claire Zhao<sup>2</sup> and Jonathan Rubin<sup>3</sup>

<sup>1</sup> Northeastern University, USA, email: yuminliu@ece.neu.edu

<sup>2</sup> Philips Research North America, USA, email: claire.zhao@philips.com

<sup>3</sup> Philips Research North America, USA, email: jonathan.rubin@philips.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

'''Abstract.''' Deep neural networks (DNNs) have proven their effectiveness on numerous tasks. However, research into the reliability of DNNs falls behind their successful applications and remains to be further investigated. In addition to prediction, it is also important to evaluate how confident a DNN is about its predictions, especially when those predictions are being used within medical applications. In this paper, we quantify the uncertainty of DNNs for the task of Chest X-Ray (CXR) image classification. We investigate uncertainties of several commonly used DNN architectures including ResNet, ResNeXt, DenseNet and SENet. We then propose an uncertainty-based evaluation strategy that retains subsets of held-out test data ordered via uncertainty quantification. We analyze the impact of this strategy on the classifier performance. In addition, we also examine the impact of setting uncertainty thresholds on the performance. Results show that utilizing uncertainty information may improve DNN performance for some metrics and observations.

===1 INTRODUCTION===
Neural networks have been very successful in many fields such as natural language processing [41, 23], computer vision [18, 8], speech recognition [15, 5], machine translation [6], control systems [36], autonomous driving [4] and so on. However, there is much less research available on how reliable neural network predictions are. A common criticism of neural networks is that they are black boxes that can perform very well on many tasks yet lack interpretability. On the other hand, it is very important to ensure the reliability of a system involved in high-risk fields, including stock-market analysis, self-driving cars and medical imaging [28]. With the rapid development of machine learning and artificial intelligence, especially deep learning, these methods are finding more and more applications in health areas including disease diagnosis [9, 10], drug discovery [25, 30] and medical imaging [7, 16, 33]. Rather than just being told a final result by a machine learning algorithm, stakeholders (doctors, physicians, radiologists, etc.) would like to know how “confident” a neural network model is, so that they can take different actions according to different confidence levels. For example, in a medical image classification scenario, a neural network model is applied to detect whether a patient has a certain type of lung pathology by classifying his/her chest X-ray images. An ideal situation would be that physicians can trust the result of the neural network if it is highly confident (low uncertainty) about its prediction. On the contrary, if the neural network gives a prediction with low confidence (or high uncertainty), then the prediction cannot be trusted and the patient’s scan should be further examined by a radiologist. Applying this mechanism is beneficial since there are many X-ray images every day but limited radiologist resources. It can help prioritize X-ray images for radiologists to examine, direct more attention to low-confidence instances and support treatment recommendations for highly confident instances.

Neural network-based deep learning algorithms are also becoming popular for medical X-ray image processing [27, 1, 35]. It is therefore necessary to examine the uncertainty of neural network models in medical X-ray image processing. The confidence of a prediction by a machine learning method can be measured by the uncertainty of the method’s outputs. A typical way to estimate uncertainty is through Bayesian learning [2], which regards the parameters of a method as random variables and attempts to get the posterior distribution of the parameters during training while marginalizing out the parameters to get the distribution of the prediction during inference. Bayesian learning is well developed in the traditional non-neural-network machine learning framework [2].

===2 RELATED WORKS===
In recent years Bayesian learning and estimation of prediction uncertainty have gained more and more attention in the neural network context due to the wide application of deep neural networks in many areas [11, 3, 12, 13, 22, 14, 32, 24, 40, 26, 31].

The authors in [3] introduced a method called “Bayes By Backprop” to learn the posterior distribution on the weights of neural networks and get weight uncertainty. Essentially this method assumes the weights come from a multivariate Gaussian distribution and updates the mean and covariance of the Gaussian instead of the weight samples during training. During inference the network weights are drawn from the learned distribution. This method is mathematically grounded, backpropagation-compatible and can learn the distribution of network weights directly, but it cannot utilize pre-trained models and a corresponding model has to be built for every neural network architecture. [13] reformulated dropout in neural networks as approximate Bayesian inference in deep Gaussian processes and thus can estimate uncertainty in neural networks with dropout layers. This method requires dropout layers applied before every weight layer. During inference, the dropout layers, with random 0-1 masks drawn from a Bernoulli distribution, mask out some weights so that only a subset of the weights learned during the training phase is used to make a prediction. In [22], the authors further proposed that there are two types of uncertainties and they showed the benefits of explicitly formulating these two uncertainties separately. The first type is called aleatoric uncertainty (or data uncertainty), which is due to the noise in the data and cannot be eliminated, while the other type is called epistemic uncertainty (or model uncertainty), which accounts for uncertainty in the model and can be eliminated given enough data. The network architectures have to be modified to add extra outputs in order to model these uncertainties. [24] adopted this typing of uncertainty, but modified the formulation of aleatoric and epistemic uncertainty to avoid the requirement of extra outputs.

[26] proposed a method called “Stochastic Weight Averaging Gaussian (SWAG)” to approximate the posterior distribution over the weights of neural networks as a Gaussian distribution by utilizing information in Stochastic Gradient Descent (SGD). This method has an advantage in that it can be applied to almost all existing neural networks without modifying their original architectures and can directly leverage pre-trained models. [34] also decomposed predictive uncertainty in deep learning into two components and modeled them separately. They showed that quantifying the uncertainty can help to improve the predictive performance in medical image super resolution. [39] investigated the relationship between uncertain labels in the CheXpert [21] and Chest X-ray14 [37] data sets and the estimated uncertainty for corresponding instances using a Bayesian neural network and suggested that utilizing uncertain labels helped prevent over-confidence for ambiguous instances.

Despite the above works in Bayesian deep neural network learning and uncertainty quantification, there are few works on evaluating the effects of uncertainty-based evaluation strategies for medical image classification. To the best of our knowledge, we are the first to apply uncertainty quantification strategies for chest X-ray image classification using deep neural networks and evaluate their impacts on performance. The main contributions of this paper are:

* We apply uncertainty quantification to five deep neural network models for chest X-ray image classification and analyze their performances.
* We investigate the impact that uncertainty information has on classification task performance by evaluating subsets of held-out test data ordered via uncertainty quantification.

===3 METHOD===
In this section, we introduce the basic ideas of Bayesian Neural Networks and one of its approximations, SWAG [26], which is used in this paper. We also describe the uncertainty quantification method used in this paper.

====3.1 Bayesian Neural Network====
In ordinary deterministic neural networks, we get a point estimate of the network weights w, which are regarded as fixed values and will not be changed after training. During inference, for each input <math>x_i</math> we get one deterministic prediction <math>p(y_i|x_i) = p(y_i|x_i, w)</math> without getting the uncertainty information.

In the Bayesian neural network setting, in addition to the target prediction, we also want to get the uncertainty for the prediction. To do so we regard the neural network weights as random variables that are subject to some form of distribution and try to estimate the posterior distribution of the network weights given the training data during training. We then integrate out the weights and get the distribution over the prediction during inference. From the prediction distribution we can further calculate the prediction output and corresponding uncertainty. More specifically, let <math>D = \{(X, Y)\}</math> and w be the training data and weights of a neural network, respectively. The ordinary deterministic neural network methods try to get a point estimate of w by either the maximum likelihood estimator (MLE) <math>w^* = \arg\max_w p(D|w)</math> or the maximum a posteriori (MAP) estimator <math>w^* = \arg\max_w p(w|D)</math>, where <math>p(w|D) = \frac{p(w)p(D|w)}{p(D)} \propto p(w)p(D|w)</math>. The <math>w^*</math> are fixed after training and used for inference on new data. In Bayesian learning, we estimate the posterior distribution <math>p(w|D)</math> during training and marginalize out w during inference to get a probability distribution of the prediction:

:<math>p(y|x, D) = E_{w \sim p(w|D)}[p(y|x, w)] = \int p(y|x, w)\, p(w|D)\, dw</math> (1)

After getting <math>p(y|x)</math>, we can calculate the statistical moments of the predicted variable and regard the first and second moments (i.e., mean and variance) as the prediction and uncertainty, respectively.

However, in practice there are two major difficulties. The first one is that <math>p(D) = \int p(w)p(D|w)\, dw</math> is usually intractable and thus we cannot get the exact <math>p(w|D)</math>. The second is that Eq. (1) is also usually intractable for neural networks. One common approach to deal with the first difficulty is to use a simpler form of distribution <math>q(w|\theta)</math> with hyperparameters <math>\theta</math> to approximate <math>p(w|D)</math> by minimizing the Kullback-Leibler (KL) divergence between <math>q(w|\theta)</math> and <math>p(w|D)</math>. This turns the problem into an easier optimization problem:

:<math>\theta^* = \arg\min_\theta KL[q(w|\theta)\,||\,p(w|D)] = \arg\min_\theta \int q(w|\theta) \log \frac{q(w|\theta)}{p(w|D)}\, dw</math> (2)

For the second difficulty, the usual approach is to use sampling to estimate Eq. (1), and it becomes

:<math>p(y|x) \approx E_{w \sim q(w|\theta^*)}[p(y|x, w)] \approx \frac{1}{T} \sum_{i=1}^{T} p(y|x, w^{(i)})</math> (3)

where <math>w^{(i)} \sim q(w|\theta^*)</math>.

Different methods have been proposed to approximate the posterior <math>p(w|D)</math> or to get samples of w [26, 3, 12, 13].

====3.2 Stochastic Weight Averaging Gaussian (SWAG)====
The basic idea of SWAG [26] is to regard the weights of the neural network as random variables and get their statistical moments through training with SGD. These moments are then used to fit a multivariate Gaussian that serves as the posterior distribution of the weights. After the original training process in which we get the optimal weights, we continue to train the model using the same training data with SGD and get T samples of the weights <math>w_1, w_2, \cdots, w_t, \cdots, w_T</math>. The mean of those samples is <math>\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t</math>. The mean of the squares is <math>\overline{w^2} = \frac{1}{T}\sum_{t=1}^{T} w_t^2</math>, and we define a diagonal matrix <math>\Sigma_{diag} = diag(\overline{w^2} - \bar{w}^2)</math> and a deviation matrix <math>R = [R_1, \cdots, R_t, \cdots, R_T]</math> whose columns are <math>R_t = w_t - \bar{w}_t</math>, where <math>\bar{w}_t</math> is the running average of the first t weight samples, <math>\bar{w}_t = \frac{1}{t}\sum_{j=1}^{t} w_j</math>. In the original paper, the authors used the last K columns of R to get the low-rank approximation of R. The K-rank approximation is <math>\hat{R} = [R_{T-K+1}, \cdots, R_T]</math>. Then the mean and covariance matrix for the fitted Gaussian are given by:

:<math>w_{SWA} = \bar{w}</math> (4)

:<math>\Sigma_{SWA} = \frac{1}{2}\Sigma_{diag} + \frac{1}{2(K-1)}\hat{R}\hat{R}^T</math> (5)

During inference, for each input (image) <math>x_i</math>, we sample the weights from the Gaussian <math>w_s \sim N(w_{SWA}, \Sigma_{SWA})</math>, then update the batch norm statistics by performing one epoch of forward passes, and then the sample prediction is given by <math>p(\hat{y}_{is}|x_i) = p(y_i|x_i, w_s)</math>. Repeating the procedure S times gives S predictions <math>\hat{y}_{i1}, \hat{y}_{i2}, \cdots, \hat{y}_{is}, \cdots, \hat{y}_{iS}</math> for the same input <math>x_i</math>. By using these S predictions we can get the final prediction and uncertainty. For a regression problem, the final prediction will be <math>\hat{y}_i = \frac{1}{S}\sum_{s=1}^{S} \hat{y}_{is}</math>.

====3.3 Uncertainty Quantification====
Some methods have been proposed to quantify the uncertainty in classification [24, 22]. Here we adopt the method proposed by [24] since it does not require extra outputs and does not need to modify the network architectures.

For a classification problem, suppose there are C classes and denote by <math>p_s = [p_{s1}, p_{s2}, \cdots, p_{sC}] = p(y|x, \theta_s)</math>, <math>s \in \{1, 2, \cdots, S\}</math>, the softmax (or sigmoid in the binary case C = 2) output of the neural network for the same input x repeated S times. Then the predicted “probability” is the average of those S sample outputs, <math>\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p_s</math>. The predicted class label index is <math>\hat{y} = \arg\max_c \bar{p}_c</math>. The aleatoric uncertainty <math>U_a</math> and the epistemic uncertainty <math>U_e</math> are

:<math>U_a = \frac{1}{S}\sum_{s=1}^{S} [diag(p_s) - p_s p_s^T], \quad U_e = \frac{1}{S}\sum_{s=1}^{S} (p_s - \bar{p})(p_s - \bar{p})^T</math>

The total uncertainty is <math>U_{total} = U_a + U_e</math>. For binary classification, the sigmoid output is a scalar and the uncertainty equations reduce to

:<math>U_a = \frac{1}{S}\sum_{s=1}^{S} p_s(1 - p_s)</math> (6)

:<math>U_e = \frac{1}{S}\sum_{s=1}^{S} (p_s - \bar{p})^2</math> (7)

where <math>\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p_s</math> and <math>p_s = p(y = 1|x, \theta_s) = 1 - p(y = 0|x, \theta_s)</math>. The predicted label is:

:<math>\hat{y} = \begin{cases} 1 & \bar{p} \geq 0.5 \\ 0 & \bar{p} < 0.5 \end{cases}</math> (8)

In this way, we can get uncertainties for all the instances.

====3.4 Transfer Learning====
Transfer learning is a widely used technique to help improve performance for deep neural networks in image classification. Here we can also benefit from transfer learning by loading pre-trained neural network models trained on the ImageNet (http://image-net.org) dataset. The SWAG method has the advantageous characteristic that it does not require modifying the architecture of the original neural networks, and therefore we can fully utilize models pre-trained on ImageNet to speed up the training process and get better predictions. In the initialization stage, we download the pre-trained model parameters and use them to initialize the models to be trained.

====3.5 Procedure====
Basically we follow the method in [26] to approximate the Bayesian neural network and the formulas in [24] to quantify the uncertainty of the models. The overall algorithm for SWAG and uncertainty quantification is shown in Algorithm 1. We initialize the model with the corresponding pre-trained model, and then fine-tune it by training on chest X-ray images and observation labels. After that we perform the SWAG algorithm by continuing training using Stochastic Gradient Descent for T epochs and calculate the statistics <math>\bar{w}</math>, <math>\overline{w^2}</math>, <math>\Sigma_{diag}</math> and <math>\hat{R}</math>, from which we can get <math>w_{SWA}</math> and <math>\Sigma_{SWA}</math> using Eq. 4 and 5. Then we fit a multivariate Gaussian using <math>w_{SWA}</math> as mean and <math>\Sigma_{SWA}</math> as covariance and get an approximated distribution for the neural network weights. When doing a prediction, an input chest X-ray image is repeatedly fed into the network S times, each time with a new set of weights sampled from the Gaussian distribution. The S output probabilities are used to calculate the final predicted label <math>\hat{y}_i</math> and uncertainty <math>U_{total} = U_a + U_e</math>. It is worthwhile to note that, after drawing sample weights, the network batch norm statistics need to be updated for the models that use batch normalization. This can be achieved by running one epoch with part of or the full training set D. A more detailed justification for the necessity is given in the original paper [26].

'''Algorithm 1''' Uncertainty Quantification
 1: Input:
      D = {(X, Y)} / X_i: training / evaluating chest X-ray images and corresponding observation labels
 2: Initialization: load neural network (NN) models pre-trained on ImageNet
 3: Training: fine-tune NN models using the CheXpert dataset
 4: Perform SWAG:
      Continue training with SGD
        i) train NN models using SGD for some epochs with D
        ii) save statistics of the weights for those epochs
        iii) calculate w_SWA and Σ_SWA using Eq. 4 and 5
        iv) fit a Gaussian using w_SWA as mean and Σ_SWA as covariance
      Prediction
        for s from 1 to S
          draw weights w_s ∼ N(w_SWA, Σ_SWA)
          update batch norm statistics using D
          p(y_is|X_i) = p(y_is|X_i, w_s)
        end for
 5: Calculate Outputs:
      p(y_i|X_i) = (1/S) Σ_{s=1..S} p(y_is|X_i)
      Calculate ŷ_i, U_a and U_e using Eq. (8), (6) and (7).
      U_total = U_a + U_e
 6: Return:
      ŷ_i, U_a, U_e, U_total

===4 DATASET===
We perform experiments using the CheXpert data set [21]. CheXpert is a large chest X-ray dataset released by researchers at Stanford University. This dataset consists of 224,316 chest radiographs of 65,240 patients. Each data instance contains a chest X-ray image and a vector label describing the presence of 14 observations (pathologies) as positive, negative, or uncertain. The labels were extracted from radiology reports using natural language processing approaches. For our experiments we focus on 5 observations, namely Cardiomegaly, Edema, Atelectasis, Consolidation and Pleural Effusion. As pointed out in [21], these 5 observations were selected based on their clinical importance and prevalence in this dataset. In their experiment they also used these 5 observations to evaluate the labeling approaches. A sample image for each observation is shown in Figure 1.

Figure 1: Sample image for each observation. From left to right: no finding (all negative), cardiomegaly, edema, consolidation, atelectasis and pleural effusion.

The original dataset consists of a training set and a validation set, and we do not have access to the test set. The labels for the training set were generated by an automated rule-based labeler which extracts information from radiology reports. This was done by the Stanford research group who released the dataset. There are three possible values for the label of an instance for a given observation, i.e., 1, 0 and −1. 1 means the observation is positive (or exists), 0 means negative (or does not exist), and −1 means it is not certain whether the observation exists. The labels for the validation set were determined by the majority vote of three board-certified radiologists and only contain positive (1) or negative (0) values. The original paper [21] investigated several different ways to deal with the uncertain labels (−1), such as regarding them as positive (1), negative (0), the same as the majority class, or a separate class. They found that the optimal way to deal with the uncertain labels differs between observations, and they gave the replacement for the 5 observations mentioned above. Based on the results from [21] and for simplicity, we replace the uncertain labels with 0 or 1 for different observations.
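As a concrete illustration of this per-observation replacement (the text of this section sets uncertain labels to 0 for cardiomegaly, consolidation and pleural effusion, and to 1 for edema and atelectasis), the following is a minimal sketch. The names `UNCERTAIN_REPLACEMENT` and `resolve_uncertain` are hypothetical and are not taken from the authors' code; only the replacement policy itself comes from the paper.

```python
# Minimal sketch of the uncertain-label replacement policy described in the text.
# The helper names are hypothetical; only the 0/1 policy comes from the paper.

UNCERTAIN = -1  # label value meaning "not certain whether the observation exists"

# Value that replaces an uncertain label, per observation.
UNCERTAIN_REPLACEMENT = {
    "Cardiomegaly": 0,
    "Consolidation": 0,
    "Pleural Effusion": 0,
    "Edema": 1,
    "Atelectasis": 1,
}


def resolve_uncertain(labels):
    """Map {observation: 1 | 0 | -1} to binary labels {observation: 1 | 0}."""
    resolved = {}
    for name, value in labels.items():
        if value == UNCERTAIN:
            resolved[name] = UNCERTAIN_REPLACEMENT[name]
        else:
            resolved[name] = value
    return resolved


example = {
    "Cardiomegaly": -1,
    "Edema": -1,
    "Atelectasis": 0,
    "Consolidation": 1,
    "Pleural Effusion": -1,
}
print(resolve_uncertain(example))
```

After this step every instance carries a five-dimensional binary label vector, which is what makes the task a multi-label binary classification problem.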
Specifically, the uncertain labels of cardiomegaly, consolidation and pleural effusion are replaced with 0, while those of edema and atelectasis are replaced with 1. Therefore the problem becomes a multi-label binary image classification problem. The predicted result is a five-dimensional vector with element values 1 or 0, where 1 means that the network predicts the presence of the corresponding observation and 0 means that the network predicts its absence. We follow the official training set / validation set split given by the data set provider. After removing invalid instances, we get a total of 223,414 instances for training and 234 instances for validation. We first initialize the neural network’s parameters with the corresponding downloaded pre-trained model parameters, then train the neural network using the training set and test its performance on the validation set. We use the original training set as the training set and the original validation set as the evaluation set in our experiments.

In Figure 2 we show the patient statistics of the 5 observations after replacing the uncertain labels in the training set. The prevalence is the ratio of the number of positive instances to the total number of instances. From the figure we can see that all five observations are imbalanced, as the prevalence is under 50%. Besides, there is a gap in the prevalence between the training and evaluation sets for all observations, which will probably affect the performance of the neural network models.

Figure 2: Patient statistics. Panels: (a) Prevalence of observations; (b) Gender proportion; (c) Training set age histogram; (d) Validation set age histogram.

===5 EXPERIMENT===
In this section, we perform experiments and present the investigation results of uncertainty quantification and strategy on five different neural network models using a PyTorch implementation. These neural networks are DenseNet [20] with 121 layers (denoted as DenseNet121), DenseNet with 201 layers (denoted as DenseNet201), ResNet [17] with 152 layers (denoted as ResNet152), ResNeXt [38] with 101 layers (denoted as ResNeXt101) and the Squeeze-and-Excitation network [19] with 154 layers (denoted as SENet154). ResNet uses skip connections to mitigate the vanishing gradient problem and was the winner of the ILSVRC 2015 [29] and COCO 2015 (http://cocodataset.org) competitions. ResNeXt is a variant of ResNet and won 2nd place in the ILSVRC 2016 classification task. DenseNet further utilizes the concept of skip connections by connecting each layer’s output to all its subsequent layers, forming “dense” skip connections. DenseNet further alleviates the vanishing gradient problem, reduces the number of parameters and reuses intermediate features, and has been widely used since it was proposed. SENet uses squeeze-and-excitation blocks to model image channel interdependencies and won the ILSVRC 2017 competition for the classification task.

All networks are trained as binary classifiers for multi-label classification instead of training separate models for each class.

The pipeline of the experiment is shown in Figure 4. We use a PyTorch implementation. The neural network models and pre-trained parameters are from torchvision (except SENet154, which is from pretrainedmodels, https://github.com/Cadene/pretrained-models.pytorch).

In our experiment we set the number of sample weights T = 5, the number of columns of the deviation matrix K = 10 and the number of repeated prediction samples S = 10. During training, we use the Adam optimizer with a weight decay regularizer and the ReduceLROnPlateau learning rate scheduler. The initial learning rate is 1 × 10<sup>−5</sup> and the weight decay coefficient is 0.005. The maximum number of fine-tuning epochs is 50. The original chest X-ray images are resized and randomly cropped to 256 × 256 (except for SENet154, which has a fixed input size of 224 × 224).
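With S repeated stochastic predictions per image and observation (S = 10 in this experiment), the binary prediction and the two uncertainty terms of Eqs. (6) to (8) can be computed as in the minimal sketch below. The function name `binary_uncertainty` and the plain-list input are assumptions made for illustration; the paper does not show its implementation.

```python
# Minimal sketch of Eqs. (6)-(8) for one image and one observation.
# `probs` holds the S sigmoid outputs p_s obtained with S sampled weight sets.
# The function name and the plain-list interface are illustrative assumptions.

def binary_uncertainty(probs):
    """Return (label, mean probability, aleatoric, epistemic, total uncertainty)."""
    s = len(probs)
    if s == 0:
        raise ValueError("need at least one sampled prediction")
    p_bar = sum(probs) / s                                 # averaged probability
    aleatoric = sum(p * (1.0 - p) for p in probs) / s      # Eq. (6)
    epistemic = sum((p - p_bar) ** 2 for p in probs) / s   # Eq. (7)
    label = 1 if p_bar >= 0.5 else 0                       # Eq. (8)
    return label, p_bar, aleatoric, epistemic, aleatoric + epistemic


label, p_bar, u_a, u_e, u_total = binary_uncertainty([0.2, 0.4])
print(label, p_bar, u_a, u_e, u_total)
```

In the binary case the two terms add up to p̄(1 − p̄), because the squared terms in Eqs. (6) and (7) cancel, so the total uncertainty of a single observation cannot exceed 0.25.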
We stop fine-tuning the model when the AUC (explained below) does not increase for 10 consecutive epochs and save the model with the best AUC as the optimal trained model.

Figure 4: Pipeline of the experiment

We use four metrics to evaluate the network classification performance: area under the curve (AUC), sensitivity, specificity and precision. Those metrics are widely used in the machine learning and medical communities. The AUC is often used to measure the quality of a classifier and is defined as the area under the Receiver Operating Characteristic (ROC) curve, which plots the sensitivity against the false positive rate. The sensitivity (or true positive rate, or recall) is defined as the ratio of the number of correctly predicted positive instances to the total number of positive instances. The specificity is defined as the ratio of the number of correctly predicted negative instances to the total number of negative instances. The precision is defined as the ratio of the number of correctly predicted positive instances to the number of instances that are predicted as positive.

====5.1 Without Strategy====
First we compare the AUC of the original ordinary deterministic neural networks with the AUC of the corresponding neural networks after performing SWAG but before applying any uncertainty strategies. The results are shown in Table 1. The “Average” column is the average over all 5 observations. The bold font indicates better performance. For edema and pleural effusion, the original neural network performs better than SWAG for most of the networks. For cardiomegaly, consolidation and atelectasis, the performances are mixed. This may be because edema and pleural effusion are harder to detect and more sensitive to perturbation of the network weights. On the whole the SWAG algorithm does not outperform the original neural network. This may be explained by the fact that SWAG uses a Gaussian to approximate the distribution over the optimal weights and then draws sample weights from the approximated Gaussian distribution, so the samples may deviate from the optimal weights if the approximation is inaccurate. Therefore we need to adopt some strategy to prevent the performance from deteriorating. The benefit is that we can get an uncertainty estimate for each prediction while keeping similar or even better prediction results.

{| class="wikitable"
|+ Table 1: Original AUC vs SWAG AUC
! rowspan="2" | Network
! colspan="2" | Average
! colspan="2" | Cardiomegaly
! colspan="2" | Edema
! colspan="2" | Consolidation
! colspan="2" | Atelectasis
! colspan="2" | Pleural Effusion
|-
! Original !! SWAG !! Original !! SWAG !! Original !! SWAG !! Original !! SWAG !! Original !! SWAG !! Original !! SWAG
|-
| ResNet152 || '''0.8831''' || 0.8786 || '''0.8376''' || 0.8149 || '''0.9123''' || 0.8713 || 0.8927 || '''0.9234''' || '''0.8543''' || 0.8537 || 0.9184 || '''0.9298'''
|-
| ResNeXt101 || '''0.8807''' || 0.8726 || 0.8013 || '''0.8339''' || '''0.9212''' || 0.8748 || 0.9250 || '''0.9311''' || '''0.8246''' || 0.8162 || '''0.9314''' || 0.9071
|-
| SENet154 || '''0.8794''' || 0.8695 || '''0.8203''' || 0.8040 || '''0.9195''' || 0.8702 || '''0.9216''' || 0.9187 || 0.8056 || '''0.8553''' || '''0.9301''' || 0.8992
|-
| DenseNet121 || 0.8842 || '''0.8942''' || 0.8436 || '''0.8752''' || '''0.9264''' || 0.8940 || 0.9139 || '''0.9512''' || 0.8153 || '''0.8489''' || '''0.9220''' || 0.9016
|-
| DenseNet201 || '''0.8793''' || 0.8356 || 0.8259 || '''0.8397''' || '''0.9165''' || 0.8796 || '''0.9313''' || 0.7739 || '''0.7936''' || 0.7714 || '''0.9294''' || 0.9132
|}

====5.2 With Coverage Strategy====
Next we utilize the uncertainty quantification information to determine whether the performance can be improved. One strategy is to sort instances according to uncertainty in ascending order, and then take those instances with less uncertainty into consideration and discard the rest. In clinical practice, the discarded instances could be flagged for further evaluation by a physician.

Ideally we would expect a decreasing trend for the metrics as data coverage increases, as shown in Figure 6. The horizontal axis “Data coverage” is the percentage of instances being considered. For example, a data coverage of 20% means that only the twenty percent least uncertain (or most confident) instances are taken into consideration and the rest are discarded.

Figure 6: Expected ideal performance. The metric decreases as data coverage increases.

Figure 3 shows the comparison of performance with regard to the four metrics (AUC, sensitivity, specificity and precision) between the original deterministic networks and the Bayesian neural networks with the uncertainty strategy. The solid lines are the Bayesian neural network with the uncertainty strategy, while the dashed lines are the original ordinary deterministic networks without any uncertainty strategy. Different colors represent different observations.

Figure 3: Comparison of performance between original deterministic network and Bayesian neural network with uncertainty strategy. The neural network is DenseNet with 201 layers.

From Figure 3 we can see that for edema and pleural effusion, the AUC decreases as the coverage increases, and stays above the corresponding original AUC until around 45% and 90% coverage, respectively. This means that applying the uncertainty strategy can improve the AUC for these two observations. The highest AUC gain can be 8% and 6% for edema and pleural effusion, respectively. We also observe a similar trend in sensitivity, specificity and precision for both edema and pleural effusion. Three observations (cardiomegaly, atelectasis and consolidation) have low sensitivity as most of the predictions are negative. On the contrary, the specificity is high.

The highest gains from applying the uncertainty strategy are shown in Table 2. The effect of the uncertainty strategy over the five observations with the model DenseNet201 is summarized in Table 3. The symbols √, ×, ◦ and - represent helpful, not helpful, mixed behavior and missing value, respectively. For edema and pleural effusion, applying the uncertainty strategy is beneficial for improving all four metrics. However, for other observations, it shows no benefit or only limited benefit for some metrics. The reason why it shows varied behavior may be interesting and needs further investigation. Similarly, we summarize the effect of applying the uncertainty strategy for different neural network architectures, and the results are shown in Table 4 to Table 7. From the tables we can see that applying the uncertainty strategy helps to improve some performance metrics for all four neural network models.

{| class="wikitable"
|+ Table 2: Performance gain for edema and pleural effusion. The values are the absolute and relative gains.
! Gain (% gain) !! AUC !! Sensitivity !! Specificity !! Precision
|-
| Edema || 0.0835 (9.11%) || 0.3778 (60.71%) || 0.0476 (5.00%) || 0.2432 (32.14%)
|-
| Pleural effusion || 0.0706 (7.60%) || 0.2687 (36.73%) || 0.0778 (8.44%) || 0.2097 (26.53%)
|}

Table 3: Effect of uncertainty strategy for DenseNet201 (columns: AUC, Sens., Spec., Prec.; √: helpful; ×: not helpful; ◦: mixed behavior; -: missing value). Cell marks as extracted: √ Cardiomegaly × √ × √ √ ◦ √ Edema Consolidation × × ◦ - Atelectasis ◦ √ × √ ◦ √ × √ Pleural effusion

Table 4: Effect of uncertainty strategy for different networks, ResNet152 (columns: AUC, Sens., Spec., Prec.; √: helpful; ×: not helpful; ◦: mixed behavior; -: missing value). Cell marks as extracted: √ Cardiomegaly × - √ - √ Edema × × Consolidation × × - √ - √ Atelectasis × √ × √ √ √ Pleural effusion

Table 5: Effect of uncertainty strategy for different networks, SENet154 (columns: AUC, Sens., Spec., Prec.; √: helpful; ×: not helpful; ◦: mixed behavior; -: missing value). Cell marks as extracted: √ Cardiomegaly - - √ - Edema × √ × × Consolidation - - - Atelectasis ◦ - × - √ √ √ Pleural effusion ×

Table 6: Effect of uncertainty strategy for different networks, ResNeXt101 (columns: AUC, Sens., Spec., Prec.; √: helpful; ×: not helpful; ◦: mixed behavior; -: missing value). Cell marks as extracted: √ Cardiomegaly × × √ × √ Edema × √ × √ Consolidation × √ - Atelectasis ◦ √ √ × √ × √ Pleural effusion

Although for some observations (e.g., pleural effusion) several metrics benefit a lot from applying the uncertainty strategy, we should also notice that the strategy does not help to improve performance for some other observations with regard to these metrics, and in some cases it even degrades the performance. The reasons behind this may be varied and need more investigation. For example, it may be that the neural network weight distribution approximated by the SWAG algorithm does not capture the true distribution, or even that the uncertainty quantification formulas are inappropriate.

====5.3 With Absolute Threshold Strategy====
Figure 5: Estimated total uncertainty (aleatoric + epistemic) histogram for each observation. Panels: (a) Cardiomegaly; (b) Edema; (c) Consolidation; (d) Atelectasis; (e) Pleural effusion.

Figure 7: Comparison of performance between original deterministic network and Bayesian neural network with uncertainty threshold.

We also plot the total uncertainty distribution for each observation, as shown in Figure 5. From the figure we can see that for cardiomegaly, the estimated uncertainty tends to be smaller, while for edema, atelectasis and pleural effusion, the proportion of larger estimated uncertainty is higher. Consolidation has a relatively even distribution of estimated uncertainty. This suggests that edema, atelectasis and pleural effusion are more prone to be affected by setting an uncertainty threshold. Combining this finding with the results in Table 2, we set thresholds for both edema and pleural effusion to check the influence on metric performance. We only consider the instances whose estimated uncertainty is smaller than the threshold when computing the performance metrics. We vary the threshold from 0.2 to 0.24 in steps of 0.01, and the results are shown in Figure 7. The black dashed line is the average metric value of the original deterministic neural network, while the thin solid colored lines are the metric values for each observation, and the thick brown line is the average metric value of all five observations after applying the threshold only to edema and pleural effusion.
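The two selection rules used in Sections 5.2 and 5.3 (keep the least uncertain fraction of instances, or keep the instances whose uncertainty is below a threshold) and the metrics computed on the retained subset can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the rounding rule for the coverage size is an assumption, and AUC is left out because it needs the predicted probabilities rather than the hard labels.

```python
# Minimal sketch of the coverage and threshold selection rules for one observation.
# Function names and the rounding rule are illustrative assumptions.

def select_by_coverage(uncertainty, coverage):
    """Indices of the `coverage` fraction of instances with the lowest uncertainty."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n = len(uncertainty)
    keep = max(1, round(coverage * n))  # assumption: retain at least one instance
    ranked = sorted(range(n), key=lambda i: uncertainty[i])  # ascending uncertainty
    return sorted(ranked[:keep])


def select_by_threshold(uncertainty, threshold):
    """Indices of instances whose uncertainty is smaller than `threshold`."""
    return [i for i, u in enumerate(uncertainty) if u < threshold]


def confusion_metrics(y_true, y_pred, indices):
    """Sensitivity, specificity and precision on the retained instances.

    A metric is None when its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for i in indices:
        if y_pred[i] == 1:
            if y_true[i] == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y_true[i] == 1:
                fn += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    precision = tp / (tp + fp) if tp + fp else None
    return sensitivity, specificity, precision


# Toy values, not data from the paper.
uncertainty = [0.05, 0.24, 0.10, 0.21]
y_true = [1, 0, 0, 1]
y_pred = [1, 1, 0, 0]

kept = select_by_coverage(uncertainty, 0.5)
print(kept, confusion_metrics(y_true, y_pred, kept))
print(select_by_threshold(uncertainty, 0.22))
```

Instances that are not retained are not scored; in the clinical reading given in Section 5.2 they would be the ones flagged for review by a physician.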
Comparing the thick brown line with the dashed black line, we can see that the average specificity and precision improve while the average AUC and sensitivity stay roughly the same. This means that applying an uncertainty threshold to edema and pleural effusion is beneficial.

Table 7: Effect of uncertainty strategy for different networks (DenseNet121). Columns: AUC, Sens., Spec., Prec. Rows: cardiomegaly, edema, consolidation, atelectasis, pleural effusion. √: helpful; ×: not helpful; ◦: mixed behavior; -: missing value

6 CONCLUSION

In this paper we investigate uncertainty quantification in medical image classification using Bayesian deep neural networks. We train five different deep neural network models on the CheXpert X-ray image data for five clinical observations and quantify the model uncertainty. Then we analyze the performance of the networks with and without applying the uncertainty strategy. The results show that uncertainty quantification and the uncertainty strategy improve several performance metrics for some observations. This suggests that uncertainty quantification is helpful in medical image classification using neural networks. However, the results also show that in some cases the strategy is not helpful, or can even degrade performance. Further analysis is needed to examine this phenomenon.

REFERENCES

[1] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and Hayit Greenspan, ‘Chest pathology detection using deep learning with non-medical training’, in 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 294–297. IEEE, (2015).
[2] Christopher M Bishop, Pattern recognition and machine learning, Springer, 2006.
[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra, ‘Weight uncertainty in neural networks’, arXiv preprint arXiv:1505.05424, (2015).
[4] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al., ‘End to end learning for self-driving cars’, arXiv preprint arXiv:1604.07316, (2016).
[5] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, ‘Listen, attend and spell: A neural network for large vocabulary conversational speech recognition’, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE, (2016).
[6] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio, ‘A character-level decoder without explicit segmentation for neural machine translation’, arXiv preprint arXiv:1603.06147, (2016).
[7] Marleen de Bruijne. Machine learning approaches in medical image analysis: From detection to diagnosis, 2016.
[8] Chao Dong, Chen Change Loy, and Xiaoou Tang, ‘Accelerating the super-resolution convolutional neural network’, in European conference on computer vision, pp. 391–407. Springer, (2016).
[9] Meherwar Fatima and Maruf Pasha, ‘Survey of machine learning algorithms for disease diagnostic’, Journal of Intelligent Learning Systems and Applications, 9(01), 1, (2017).
[10] Konstantinos P Ferentinos, ‘Deep learning models for plant disease detection and diagnosis’, Computers and Electronics in Agriculture, 145, 311–318, (2018).
[11] Yarin Gal, Uncertainty in deep learning, PhD thesis, University of Cambridge, 2016.
[12] Yarin Gal and Zoubin Ghahramani, ‘Bayesian convolutional neural networks with Bernoulli approximate variational inference’, arXiv preprint arXiv:1506.02158, (2015).
[13] Yarin Gal and Zoubin Ghahramani, ‘Dropout as a Bayesian approximation: Representing model uncertainty in deep learning’, in International conference on machine learning, pp. 1050–1059, (2016).
[14] Yarin Gal, Riashat Islam, and Zoubin Ghahramani, ‘Deep Bayesian active learning with image data’, in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. JMLR.org, (2017).
[15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, ‘Speech recognition with deep recurrent neural networks’, in 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. IEEE, (2013).
[16] Hayit Greenspan, Bram Van Ginneken, and Ronald M Summers, ‘Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique’, IEEE Transactions on Medical Imaging, 35(5), 1153–1159, (2016).
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
[18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, ‘MobileNets: Efficient convolutional neural networks for mobile vision applications’, arXiv preprint arXiv:1704.04861, (2017).
[19] Jie Hu, Li Shen, and Gang Sun, ‘Squeeze-and-excitation networks’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, (2018).
[20] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, ‘Densely connected convolutional networks’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, (2017).
[21] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al., ‘CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison’, arXiv preprint arXiv:1901.07031, (2019).
[22] Alex Kendall and Yarin Gal, ‘What uncertainties do we need in Bayesian deep learning for computer vision?’, in Advances in neural information processing systems, pp. 5574–5584, (2017).
[23] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher, ‘Ask me anything: Dynamic memory networks for natural language processing’, in International conference on machine learning, pp. 1378–1387, (2016).
[24] Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee Cho Paik, ‘Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation’, Medical Imaging with Deep Learning, 2018, (2018).
[25] Antonio Lavecchia, ‘Machine-learning approaches in drug discovery: methods and applications’, Drug discovery today, 20(3), 318–331, (2015).
[26] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson, ‘A simple baseline for Bayesian uncertainty in deep learning’, arXiv preprint arXiv:1902.02476, (2019).
[27] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al., ‘CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning’, arXiv preprint arXiv:1711.05225, (2017).
[28] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib, ‘Deep learning for medical image processing: Overview, challenges and the future’, in Classification in BioApps, 323–350, Springer, (2018).
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, ‘ImageNet Large Scale Visual Recognition Challenge’, International Journal of Computer Vision (IJCV), 115(3), 211–252, (2015).
[30] Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik, ‘Inverse molecular design using machine learning: Generative models for matter engineering’, Science, 361(6400), 360–365, (2018).
[31] Peter Schulam and Suchi Saria, ‘Can you trust this prediction? Auditing pointwise reliability after learning’, in The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1022–1031, (2019).
[32] Murat Sensoy, Lance Kaplan, and Melih Kandemir, ‘Evidential deep learning to quantify classification uncertainty’, in Advances in Neural Information Processing Systems, pp. 3179–3189, (2018).
[33] Kenji Suzuki, ‘Overview of deep learning in medical imaging’, Radiological physics and technology, 10(3), 257–273, (2017).
[34] Ryutaro Tanno, Daniel Worrall, Enrico Kaden, Aurobrata Ghosh, Francesco Grussu, Alberto Bizzi, Stamatios N Sotiropoulos, Antonio Criminisi, and Daniel C Alexander, ‘Uncertainty quantification in deep learning for safer neuroimage enhancement’, arXiv preprint arXiv:1907.13418, (2019).
[35] Demetri Terzopoulos et al., ‘Semi-supervised multi-task learning with chest x-ray images’, in International Workshop on Machine Learning in Medical Imaging, pp. 151–159. Springer, (2019).
[36] Huanqing Wang, Peter Xiaoping Liu, Shuai Li, and Ding Wang, ‘Adaptive neural output-feedback control for a class of nonlower triangular nonlinear systems with unmodeled dynamics’, IEEE transactions on neural networks and learning systems, 29(8), 3658–3668, (2017).
[37] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers, ‘ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106, (2017).
[38] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, ‘Aggregated residual transformations for deep neural networks’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, (2017).
[39] Hao-Yu Yang, Junling Yang, Yue Pan, Kunlin Cao, Qi Song, Feng Gao, and Youbing Yin, ‘Learn to be uncertain: Leveraging uncertain labels in chest x-rays with Bayesian neural networks’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 5–8, (2019).
[40] Jiayu Yao, Weiwei Pan, Soumya Ghosh, and Finale Doshi-Velez, ‘Quality of uncertainty quantification for Bayesian neural network inference’, arXiv preprint arXiv:1906.09686, (2019).
[41] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria, ‘Recent trends in deep learning based natural language processing’, IEEE Computational Intelligence Magazine, 13(3), 55–75, (2018).