<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Diversity in Neural Architectures for Safety</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michał Filipiuk</string-name>
          <email>mfilipiuk@nvidia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasu Singh</string-name>
          <email>vasus@nvidia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NVIDIA</institution>
          ,
          <addr-line>Einsteinstraße 172, Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Apart from the predominant convolutional neural networks (CNNs), several new architectures like Vision Transformers (ViTs) and MLP-Mixers have recently been proposed. Research also shows that these architectures learn differently. Ensembles based on different state-of-the-art neural architectures thus provide diversity, an important characteristic in designing safety-critical systems. To quantify the benefit of ensembles, we investigate different metrics, like error consistency and diversity metrics, that have been proposed in the literature. We observe that, with comparable individual performance, an ensemble of diverse architectures performs not only more accurately than an ensemble of a single architecture, but also more robustly under diverse input corruptions.</p>
      </abstract>
      <kwd-group>
        <kwd>diversity</kwd>
        <kwd>ensemble</kwd>
        <kwd>safety</kwd>
        <kwd>deep learning</kwd>
        <kwd>image classification</kwd>
        <kwd>robustness</kwd>
        <kwd>safety-critical systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The development of safety-critical systems relies on stringent safety methodologies, designs, and analyses to prevent hazards during operation. Automotive safety standards like ISO 26262 [<xref ref-type="bibr" rid="ref1">1</xref>] and ISO/PAS 21448 [<xref ref-type="bibr" rid="ref2">2</xref>] mandate methodologies for system, hardware, and software development for automotive systems. Diversity is an important concept in safety-critical systems that protects against common cause failures. For example, diversity in hardware is provided through lockstep execution across different HW engines; diversity in software is guaranteed through diverse algorithmic implementations.</p>
      <p>Deep neural networks [<xref ref-type="bibr" rid="ref3">3</xref>] based on convolutional neural networks (CNNs) are well-known for vision tasks using machine learning. These include safety-critical applications like autonomous driving and robotics, where CNN models are used for object detection and image segmentation as perception units to process sensor data. Over the last few years, new neural architectures have disrupted the dominance of CNNs in vision tasks. Vision Transformers (ViTs) [<xref ref-type="bibr" rid="ref4">4</xref>], inspired by the transformer model [<xref ref-type="bibr" rid="ref5">5</xref>] that was originally proposed for natural language processing (NLP) tasks, leverage self-attention layers instead of convolution layers to process the input split into a set of non-overlapping patches. Similarly, MLP-Mixers [<xref ref-type="bibr" rid="ref6">6</xref>] have been proposed as a competitive but conceptually simple alternative that, instead of convolutions or self-attention, is based entirely on multi-layer perceptrons (MLPs) repeatedly applied across either spatial locations or feature channels.</p>
      <p>To improve the confidence in predictions, ensembles [7] of neural networks are commonly used. Multiple models are trained on the same data; each trained model then makes a prediction, and the predictions are combined in some way to create the final prediction. Ensembles have also been shown to reduce variance [<xref ref-type="bibr" rid="ref8">8</xref>]. The inherent diversity in an ensemble has been shown to be a key factor in its superior performance. Different diversity metrics have been proposed in the machine learning literature. Error consistency [9], based on Cohen's kappa, measures the similarity of classifications normalized by the chance of common predictions. The framework of [10] allows diversity metrics to be defined based on different loss functions.</p>
      <p>The objective of our work is to quantify the diversity of ensembles created using different models and to evaluate their benefits. We choose two CNNs, two ViTs, and two MLP-Mixers, and create 30 ensembles in total by averaging the models' outputs. Our results show that ensembles created from different architectures are more diverse than ensembles of the same architecture. We show that an ensemble of different architectures with similar accuracy further improves the performance. In our experiments, we observe the best ensemble results for a CNN and a ViT.</p>
      <p>The paper is organized as follows. Section 2 describes the properties of CNNs, Vision Transformers, and MLP-Mixers, how they compare to each other (including a summary of related work), and an overview of different diversity metrics. Section 3 provides our experimental results. Section 4 concludes the paper with a summary of our ongoing work and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <p>We describe the evolution of the different neural architectures and their strengths and weaknesses.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Neural architectures</title>
          <p>Convolutional Neural Networks. The convolution operation predates the first convolutional neural networks. With hand-engineered features, it was used in classical computer vision applications many years before it appeared in the first neural networks in the 1980s. However, the rise of CNNs started with AlexNet in 2012, which defeated the other, non-neural approaches in the ImageNet competition by a large margin. Over the last 10 years, we have seen multiple improvements to this architecture, but they were more evolutionary than revolutionary.</p>
          <p>The fact that convolutions managed to stay in the spotlight for such a long time may seem surprising, but an analysis of their properties gives the answer. Convolutions have two key inductive biases that allow them to excel at high-dimensional data with strong spatial correlation, like images: the spatial inductive bias allows them to focus on local information in the input images, and applying the same kernel over the whole image results in translation equivariance, as input translations result only in a shifted output of the convolutional layers. The convolution is also a very simple and compute-efficient operation. Its memory usage is not only small but also constant with regard to the image size, which, combined with the possibility of applying it in parallel, makes it feasible on virtually any hardware.</p>
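          <p>The translation equivariance described above can be checked numerically: shifting the input shifts the convolution output by the same amount. A minimal NumPy sketch (illustrative only, not code from the paper):</p>

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (no padding, stride 1)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

# Shift the input one pixel down and right; the output shifts the same way.
shifted = np.roll(np.roll(image, 1, axis=0), 1, axis=1)
out = conv2d_valid(image, kernel)
out_shifted = conv2d_valid(shifted, kernel)
# Away from the wrapped-around border, the shifted output equals the
# unshifted output moved by one pixel: translation equivariance.
assert np.allclose(out[:-1, :-1], out_shifted[1:, 1:])
```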
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>MLP-Mixer</title>
        <p>Presented in 2021, MLP-Mixers [<xref ref-type="bibr" rid="ref6">6</xref>] provide an alternative to CNNs and ViTs that uses neither convolutions nor self-attention. Mixers use two types of MLP layers: channel-mixing and token-mixing MLPs. The channel-mixing MLPs are applied to every patch separately, exchanging information between channels, while the token-mixing MLPs work on one channel but across all patches, allowing communication between the patches.</p>
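        <p>The interplay of the two MLP types can be sketched in a few lines of NumPy. This is an illustrative simplification (LayerNorm is omitted and ReLU stands in for GELU), not the reference implementation:</p>

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron; ReLU used here for brevity instead of GELU."""
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

def mixer_block(tokens, params):
    """One Mixer block on a (num_patches, channels) token table.
    Token-mixing acts across patches per channel; channel-mixing acts
    across channels per patch. Residual connections as in the paper;
    LayerNorm omitted for brevity."""
    # Token-mixing: transpose so the MLP runs along the patch dimension.
    y = tokens + mlp(tokens.T, *params["token"]).T
    # Channel-mixing: the MLP runs along the channel dimension.
    return y + mlp(y, *params["channel"])

rng = np.random.default_rng(0)
P, C, H = 16, 32, 64  # patches, channels, hidden width (arbitrary toy sizes)
params = {
    "token":   (rng.standard_normal((P, H)) * 0.02, np.zeros(H),
                rng.standard_normal((H, P)) * 0.02, np.zeros(P)),
    "channel": (rng.standard_normal((C, H)) * 0.02, np.zeros(H),
                rng.standard_normal((H, C)) * 0.02, np.zeros(C)),
}
out = mixer_block(rng.standard_normal((P, C)), params)
assert out.shape == (P, C)  # the block preserves the token-table shape
```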
        <p>Matrix multiplications in MLPs are a simpler operation than a convolution, which requires more specialized hardware or a costly conversion to a matrix multiplication.</p>
        <p>As MLP-Mixers resemble Vision Transformers at the level of encoder layers, they have similar properties: both architectures have global perception fields, and both lack translation equivariance due to the use of image patches as input. Regarding the differences between the two architectures, MLP-Mixers do not need a position encoding, as the MLP layers differentiate between the elements of their input, in contrast to the multi-head attention in ViTs.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Vision Transformer</title>
        <p>The Transformer architecture [<xref ref-type="bibr" rid="ref5">5</xref>] was initially introduced in 2017 for NLP tasks. In 2020, this architecture was applied to the image classification problem and called the Vision Transformer (ViT) [<xref ref-type="bibr" rid="ref4">4</xref>]. Here, an input image is split into a set of non-overlapping patches which, after being embedded, are provided to the ViT encoder blocks. ViTs have much less image-specific inductive bias than CNNs. In CNNs, locality and translation equivariance are inherent to the convolutional layers throughout the whole model; in a ViT, the self-attention layers are global, and only the MLP layers act locally and translation-equivariantly at the patch level. The two-dimensional neighborhood is not present in the network architecture, as transformers treat the input as an unordered set; this information has to be supplied to the first layer as position embeddings together with the image patches.</p>
        <p>Reducing the inductive biases has twofold consequences. Transformers have to learn properties that would otherwise be inherited from the convolution operation and that proved to be successful: invariance to input shifts and a balance of local and global perception in the encoder blocks. At the same time, they can improve upon them: they can leverage the global perception to their advantage and discover their own priors from the data, which results in performing the task distinctly and brings diversity of solutions to the field.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.2. Related Work</title>
        <p>As the three architectures present different approaches to image classification (using convolutions, multi-head attention, or multilayer perceptrons to process the input), the comparison between them should not be restricted to experimental accuracy, e.g. on a single dataset like ImageNet. It should also include experiments analyzing in detail the different aspects of the image classification problem (e.g. robustness to input corruptions or to transformations like translations and rotations) and the internal properties of each model. Bhojanapalli et al. [11] conduct multiple experiments assessing the robustness of Vision Transformers to multiple corruptions with regard to model size and pre-training dataset, in comparison to various ResNet models. They show that (1) adversarial attacks like the Fast Gradient Sign Method and Projected Gradient Descent influence ViTs and CNNs similarly, and (2) adversarially corrupted images are not transferable between the architectures, resulting in only a modest drop of a few percentage points, while they are transferable between models of the same architecture. Regarding less artificial corruptions and distribution shifts, present in the ImageNet-C, -R, and -A datasets, the performance of the different architectures seems to be similar. One important conclusion is how the accuracy changes with the size of the pretraining dataset: for ILSVRC-2012, ViTs perform worse than CNNs, while for ImageNet-21k and JFT-300M the performance is comparable. Under a closer inspection of ImageNet-C, ViTs and CNNs perform significantly differently on the various corruptions: e.g. on glass blur, Vision Transformers perform significantly better than CNNs, while they perform worse on the contrast corruption at the highest severity level. This observation is crucial for the research presented in this paper.</p>
        <p>Naseer et al. [12] extend this comparison to e.g. input occlusions and input patch permutations, where ViTs perform much more robustly than CNNs. They also investigate the shape-texture bias of these architectures and show that transformers are less biased towards local textures than CNNs.</p>
        <p>In [13], the authors analyze the information that every layer processes, what the receptive fields look like for Transformers (which are not restricted by the convolution operation), and how the different layers learn depending on the dataset size. Their research shows that CNNs and ViTs perform their computation significantly differently; it also briefly describes how MLP-Mixers behave closer to ViTs with regard to the intermediate features learned.</p>
        <p>There have also been architectures that combine CNNs and ViTs. For example, CvT [14] applies convolutions over the input image and the intermediate feature token maps, which are next processed by a transformer block. While the Swin Transformer [15] does not feature convolution layers, it introduces the hierarchical approach of CNNs and the locality of convolutions to transformers: it applies multi-head attention to small, local sets of patches (windows), while the patches are merged into bigger patches as we progress deeper into the model. To support information propagation between patches, the model shifts the windows with every layer to overlap with the previously used windows. These changes can also be introduced to MLP-Mixers, resulting in a performance improvement.</p>
        <p>The results of the aforementioned research inspire us to investigate how the variety of these three architectures, demonstrated by multiple experiments, can be leveraged for improving diversity in safety-critical systems.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.3. Diversity metrics</title>
        <p>While the intuition behind diversity may be straightforward, quantifying it is not. We present below three distinct metrics from the literature that try to capture models' diversity.</p>
        <p>Ortega et al. [10] provide a metric of diversity for different loss functions like the 0/1 loss, cross-entropy loss, and squared loss. As we are focused on the classification problem, we use the 0/1 and cross-entropy (CE) diversities: D_0/1 = E[V(1(h(x; theta) != y))] and D_CE = E[V(sqrt(p(y | x, theta) / (2 max p(y | x, theta))))], where E and V stand for the expected value over the whole data-generating distribution (which is approximated using a dataset) and the variance of the predictions of the models that the ensemble consists of. The formulas are derived from a loss analysis of every classifier and their ensemble, where the diversity upper-bounds the difference between the averaged loss of the classifiers and the loss of their ensemble. In summary, these metrics measure how diverse the predictions of the different models are on a dataset by calculating the variance of the predictions, averaged over every data point; in the case of CE diversity, the predictions are additionally scaled to the [0,1] range.</p>
        <p>From our perspective, the CE diversity should be the more interesting one, as we ensemble models by averaging their predictions; however, CE diversity is more complex than 0/1 diversity, and we eventually evaluate models using accuracy, which binarizes their outputs into correct and incorrect classifications. At the same time, CE diversity can provide more information, e.g. when both models classify identically but with different probabilities assigned.</p>
        <p>Error consistency [9] is a metric measuring how much the errors of two classifiers coincide. It computes the fraction of items classified either correctly or incorrectly by both models and compares it to the expected rate of equal responses if the two models were statistically independent: kappa = (c_obs - c_exp) / (1 - c_exp), where c_obs stands for the observed fraction of equal classifications (either both correct or both incorrect) and c_exp is the expected rate of equal responses, calculated from the models' accuracies p_1 and p_2 as c_exp = p_1 p_2 + (1 - p_1)(1 - p_2). This metric can only compare two models, in contrast to the diversity metrics, which do not have such a restriction.</p>
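        <p>The error-consistency formula above can be implemented in a few lines. A minimal sketch (illustrative names, not the authors' code):</p>

```python
def error_consistency(correct1, correct2):
    """Cohen's-kappa-style agreement of two classifiers' correctness patterns.
    correct1, correct2: equal-length lists of booleans (True = correct)."""
    n = len(correct1)
    # Observed fraction of items where both models are right or both wrong.
    c_obs = sum(a == b for a, b in zip(correct1, correct2)) / n
    p1 = sum(correct1) / n
    p2 = sum(correct2) / n
    # Expected agreement rate if the two models were independent.
    c_exp = p1 * p2 + (1 - p1) * (1 - p2)
    return (c_obs - c_exp) / (1 - c_exp)

# Two classifiers that err on exactly the same inputs are maximally consistent.
a = [True, True, False, True, False]
assert error_consistency(a, a) == 1.0
# Classifiers that disagree everywhere score below zero.
assert error_consistency(a, [not x for x in a]) < 0
```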
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>Setup</title>
        <p>Model selection. We have chosen the best performing models that were available to us at the time of conducting the research, pretrained on ImageNet-21k and fine-tuned to ImageNet-1k; we considered the arguments raised in the previous section to determine the size of the pretraining dataset. This choice has the best potential to perform robustly on ImageNet-C [16], which we use to compare the architectures. ImageNet-C is a dataset created by artificially applying various corruptions (blurs, noises, digital corruptions, and weather conditions), each at several severity levels, to the ImageNet (ILSVRC-2012) validation set.</p>
        <p>Ensembles are created by averaging the returned softmax outputs of two models. We use only two models at a time to observe how ensembles of different architectures perform compared to the single models that build them; using more models per ensemble would also prevent us from using the error consistency metric. Averaging the softmax outputs is the simplest way of building ensembles; while it has its disadvantages (e.g. models are calibrated differently, and overconfident models can dominate under-confident ones with their predictions), we choose it for its simplicity, leaving potential improvements to future work. The models are as follows.</p>
        <sec id="sec-3-1-1">
          <title>Models</title>
          <p>Vision Transformers:
• Vision Transformer B/8 (86M parameters)1
• Vision Transformer L/16 (307M parameters)2
Convolutional Neural Networks [17]:
• ConvNeXt-Base (89M parameters)3
• ConvNeXt-XLarge (350M parameters)4
MLP-Mixers:
• MLP-Mixer B/16 (59M parameters)5
• MLP-Mixer L/16 (207M parameters)6</p>
          <p>Using six distinct models allows us to create 30 different ensembles that are used for the experiments. We do not create the ensemble of a model with itself.</p>
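          <p>The softmax-averaging scheme used to build the ensembles can be sketched as follows (illustrative NumPy code with mock logits standing in for real model outputs; not the experimental code):</p>

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average the two models' softmax outputs, then take the argmax."""
    probs = (softmax(logits_a) + softmax(logits_b)) / 2
    return probs.argmax(axis=-1)

# Two mock 'models' over 3 classes on a batch of 2 inputs; they disagree
# on the first input and agree on the second.
la = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]])
lb = np.array([[0.0, 1.0, 2.2], [0.0, 2.5, 0.0]])
preds = ensemble_predict(la, lb)
assert preds.shape == (2,)
assert preds[1] == 1  # both models agree on class 1 for the second input
```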
          <p>To compare the models, apart from the diversity metrics, we use Top-10 accuracy and the retention metric [18] (the accuracy on the corrupted dataset divided by the accuracy on the original data). We picked Top-10 accuracy to smooth out the achieved scores, as some images from ImageNet may contain multiple objects of different classes, which introduces variance into the accuracy measurement.</p>
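          <p>Both evaluation metrics are easy to state in code. A minimal sketch (illustrative toy probabilities, not the experimental pipeline):</p>

```python
import numpy as np

def topk_accuracy(probs, labels, k=10):
    """Fraction of samples whose true label is among the k most probable classes."""
    topk = np.argsort(probs, axis=-1)[:, -k:]  # indices of the k largest probs
    return np.mean([label in row for row, label in zip(topk, labels)])

def retention(acc_corrupted, acc_clean):
    """Retention metric of [18]: corrupted-data accuracy over clean-data accuracy."""
    return acc_corrupted / acc_clean

# Toy example: 2 samples, 3 classes.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
labels = np.array([2, 0])
assert topk_accuracy(probs, labels, k=1) == 0.5  # only the 2nd sample's top-1 hits
assert topk_accuracy(probs, labels, k=2) == 1.0  # both labels are in the top-2
assert retention(0.5, 0.8) == 0.625
```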
          <p>Model checkpoints, available here: (1) https://storage.googleapis.com/vit_models/augreg/B_8-i21k-300ep-lr_0.001-aug_medium2-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz; (2) https://storage.googleapis.com/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_medium2-wd_0.03-do_0.1-sd_0.1--imagenet2012-steps_20k-lr_0.01-res_224; (3) https://tfhub.dev/sayakpaul/convnext_base_21k_1k_224/1; (4) https://tfhub.dev/sayakpaul/convnext_xlarge_21k_1k_224/1; (5) https://tfhub.dev/sayakpaul/mixer_b16_i21k_classification/1; (6) https://tfhub.dev/sayakpaul/mixer_l16_i21k_classification/1.</p>
          <p>[Figure 2 panels: (b) 0-1 Diversity, (c) CE Diversity, (d) Error consistency, (e) 0-1 Diversity Components]</p>
          <p>Results. While values averaged over different corruptions do not play a key role in our comparison, they allow us to comprehend a broader picture of the research subject. In Figure 1, solid lines represent the retention of specific architectures (a mean of the two models using that architecture), while dashed lines show the retention of the different ensembles (also averaged over all ensembles of each kind). We clearly see that MLP-Mixers perform significantly worse than ViTs and CNNs. However, when MLP-Mixers are combined with ViTs or CNNs, the ensembles (brown and grey dashed lines) perform only slightly worse than single ViT or CNN models, respectively. When we look at the top performing ensembles, ViT+CNN ensembles are followed by pure CNN and pure ViT ensembles. This suggests that mixing different architectures is beneficial for robustness. The next experiments will support these two hypotheses with more concrete examples and results.</p>
          <p>Figure 2 presents accuracy, the diversity metrics, and error consistency calculated on the original ImageNet data. Each cell represents the metric value scored by an ensemble created from the models in the corresponding column and row; on the diagonal, we have the scores of the single models. The last, non-triangular plot, called 0-1 Diversity Components (0-1 Diversity is calculated by averaging the two values of this plot located symmetrically across the diagonal), presents the fraction of images that are classified correctly by one model (the one in the row) and incorrectly by the second one (the column model).</p>
          <p>Starting with the accuracy plot, we see that the best performing model is ConvNeXt-XLarge, followed by ViT Base, ViT Large, MLP-Mixer Base, and MLP-Mixer Large. In the cases of ViTs and MLP-Mixers, the smaller models perform better than their bigger counterparts; this might be an artifact of insufficient training. Regarding their ensembles, it is not surprising that the best accuracy is achieved by the ensemble of the best performing models (ViT-B and ConvNeXt-XL). We also observe that ensemble performance deteriorates only slightly when one of its components (e.g. MLP-Mixer Large) performs significantly worse than the other.</p>
          <p>When we analyze all diversity metrics, we see that MLP-Mixers stand out from the other models, especially the Large one. That is caused by their much lower accuracy: the 0-1 Diversity Components plot shows that the Mixers misclassify a significant fraction of images. We also see that diversity is higher for CNN+ViT ensembles than for intra-architecture ensembles. This allows CNN+ViT ensembles to perform better, e.g. ConvNeXt-B + ViT-B performs better than ConvNeXt-B + ConvNeXt-XL, although ViT-B has lower accuracy.</p>
          <p>Another interesting insight: MLP-Mixer L ensembles perform slightly better than all MLP-Mixer B ensembles, while MLP-Mixer L has lower accuracy than MLP-Mixer B by 5 p.p. One possible explanation is that MLP-Mixer L and B are not that different, although they have significantly different accuracies (which results in different 0-1 diversity and error consistency values with regard to all other models): the CE diversity between the Mixers is as low as between the ViTs (which classify very similarly, without a 5 p.p. gap). Further evidence that these models behave similarly is that they have similar CE diversity values with all other models.</p>
          <p>To keep the paper concise, we investigate in detail two specific corruptions from ImageNet-C: Gaussian Blur at severity 5 and Contrast at severity 4. We have chosen them as they exemplify how the different architectures perform on various corruptions. The results for these corruptions are presented in Figures 3 and 4. Next to the accuracy plots, we also present the retention values.</p>
          <p>The Gaussian blur corruption favors the Vision Transformers, as ViTs perform better than their CNN and MLP-Mixer counterparts. However, this time the best performing model is ViT-Large instead of ViT-Base, which suggests that while its training was not sufficient to make it perform better than the smaller model on clean data, it was sufficient to make it perform robustly (ViT-Large is thrice as big as ViT-B). When we look at the metrics, the highest (or, in the case of error consistency, lowest) values belong to the MLP-Mixers, which perform poorly in comparison to ViTs and CNNs, so we may expect that this diversity comes mostly from their misclassification. We see this in the 0-1 Diversity Components, which show that the Mixers classify around 30-40% of images incorrectly in contrast to the other models. Regarding ViT and CNN ensembles, pure CNN ensembles are less diverse than ViT+CNN or pure ViT ensembles. If we focus on the ConvNeXt-B+XL ensemble and compare it to ConvNeXt-B+ViT-B, we see that the latter performs slightly better, while ViT-B is less accurate than ConvNeXt-XL; and while it is not the most diverse CNN+ViT pair, it is, according to all metrics, more diverse than the pure CNN ensemble. Another interesting comparison is ViT-L+B vs. ViT-L+ConvNeXt-B: we substitute the Base ViT with a worse performing CNN, which creates a better performing and more diverse ensemble.</p>
          <p>Regarding the contrast corruption in Figure 4, CNNs dominate the performance with only a modest drop in accuracy, while the other models perform much worse, especially the Mixers. The highest diversity values are related to the worst performing MLP-Mixers. But at the same time, Mixers ensembled with CNNs perform similarly to ViT+CNN: the worst performing model, MLP-Mixer Base, which is almost 20 p.p. worse than ViT-L, performs marginally better when ensembled with ConvNeXt-XL, which we find intriguing.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>While our approach of combining the inherent diversity across models by an ensemble is simple, it manages to capture a synergy that arises from the use of different architectures. The ViT+CNN ensemble has proven to perform not only better on average than the other combinations, but also satisfactorily regardless of the corruption type.</p>
      <p>The diversity metrics and error consistency provide valuable quantitative tools to compare models and quantify the differences in their classifications. However, they only allow us to understand the relationships between the models when inferring on a specific input. Unfortunately, these metrics may be deceiving when one of two models performs significantly worse than the other: high diversity then does not translate to an improved ensemble performance, which might seem counter-intuitive. The metrics capture how diversely the models classify, not the potential of the ensemble of the two models. These two objectives coincide when the models perform similarly on the accuracy metric, while a discrepancy in accuracies causes them to misalign. This behavior requires a careful analysis of the metrics on every corruption separately.</p>
      <p>We list several possible extensions to our work. The first is an improvement of the diversity metrics into metrics assessing the ensemble potential. Secondly, our research was limited to three different architectures; while the results look promising, to fully evaluate and quantify how an ensemble aggregates the robustness of various models, more experiments should be run, involving more models of different architectures, pretrained on different datasets and of different sizes. Another direction is to improve the ensembling technique: potential improvements span from a weighted ensemble that would average the models e.g. based on their individual performance, to a mixture of experts that could predict which model will perform better on a given input and thus precisely leverage the advantages of each particular model to tackle particular corruptions. Such a mixture-of-experts solution would also be viable in resource-constrained environments, where running multiple models simultaneously may be unacceptable. The last direction is to continue this research for more complex problems like object detection and image segmentation: we need to define diversity metrics for these problems and then investigate the quality of ensembles created using different neural architectures.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6402–6413.</p>
      <p>[9] R. Geirhos, K. Meding, F. A. Wichmann, Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency, 2020. URL: https://proceedings.neurips.cc/paper/2020/file/9f6992966d4c363ea0162a056cb45fe5-Paper.pdf.</p>
      <p>[10] L. A. Ortega, R. Cabañas, A. R. Masegosa, Diversity and generalization in neural network ensembles, 2021. URL: https://arxiv.org/abs/2110.13786. doi:10.48550/ARXIV.2110.13786.</p>
      <p>[11] S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, A. Veit, Understanding robustness of transformers for image classification, 2021. URL: https://arxiv.org/abs/2103.14586. arXiv:2103.14586.</p>
      <p>[12] M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, Intriguing properties of vision transformers, 2021. arXiv:2105.10497.</p>
      <p>[13] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, A. Dosovitskiy, Do vision transformers see like convolutional neural networks?, 2021. arXiv:2108.08810.</p>
      <p>[14] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: Introducing convolutions to vision transformers, 2021. URL: https://arxiv.org/abs/2103.15808. arXiv:2103.15808.</p>
      <p>[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, 2021.</p>
      <p>[16] D. Hendrycks, T. Dietterich, Benchmarking neural network robustness to common corruptions and perturbations, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=HJz6tiCqYm.</p>
      <p>[17] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, 2022. URL: https://arxiv.org/abs/2201.03545. arXiv:2201.03545.</p>
      <p>[18] D. Zhou, Z. Yu, E. Xie, C. Xiao, A. Anandkumar, J. Feng, J. M. Alvarez, Understanding the robustness in vision transformers, 2022. URL: https://arxiv.org/abs/2204.12451. doi:10.48550/ARXIV.2204.12451.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] International Standards Organization, ISO 26262: Road Vehicles - Functional Safety, Parts 1 to 11, Second Edition, 2018-12.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] International Standards Organization, ISO/PAS 21448: Road Vehicles - Safety of the Intended Functionality, 2019-01.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Tolstikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keysers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <article-title>MLP-Mixer: An all-MLP architecture for vision</article-title>
          ,
          <year>2021</year>
          . URL: https://proceedings.neurips.cc/paper/2021/file/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <article-title>Ensemble learning: A survey</article-title>
          ,
          <source>Wiley Interdiscip. Rev. Data Min. Knowl. Discov</source>
          .
          <volume>8</volume>
          (
          <year>2018</year>
          ). URL: https://doi.org/10.1002/widm.1249. doi:10.1002/widm.1249.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lakshminarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <article-title>Simple and scalable predictive uncertainty estimation using deep ensembles</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>