Exploring Diversity in Neural Architectures for Safety
Michał Filipiuk¹, Vasu Singh¹

¹ NVIDIA, Einsteinstraße 172, Munich, Germany


Abstract
Apart from the predominant convolutional neural networks (CNNs), several new architectures like Vision Transformers (ViTs) and MLP-Mixers have recently been proposed. Research also shows that these architectures learn differently. Ensembles based on different state-of-the-art neural architectures thus provide diversity, an important characteristic in designing safety-critical systems. To quantify the benefit of ensembles, we investigate different metrics, such as error consistency and diversity metrics, that have been proposed in the literature. We observe that, with comparable individual performance, an ensemble of diverse architectures performs not only more accurately than an ensemble of a single architecture, but also more robustly under diverse input corruptions.

Keywords
diversity, ensemble, safety, deep learning, image classification, robustness, safety-critical systems



1. Introduction

The development of safety-critical systems relies on stringent safety methodologies, designs, and analyses to prevent hazards during operation. Automotive safety standards like ISO 26262 [1] and ISO/PAS 21448 [2] mandate methodologies for system, hardware, and software development for automotive systems. Diversity is an important concept in safety-critical systems that protects against common-cause failures. For example, diversity in hardware is provided through lockstep execution across different HW engines. Diversity in software is guaranteed through diverse algorithmic implementations.

Deep neural networks [3] based on convolutional neural networks (CNNs) are well known for vision tasks using machine learning. These include safety-critical applications like autonomous driving and robotics, where CNN models are used for object detection and image segmentation as perception units to process sensor data. Over the last few years, new neural architectures have disrupted the dominance of CNNs in vision tasks: Vision Transformers (ViTs) [4], inspired by the transformer model [5] originally proposed for natural language processing (NLP) tasks, leverage self-attention layers instead of convolution layers to process the input split into a set of non-overlapping patches. Similarly, MLP-Mixers [6] have been proposed as a competitive but conceptually simple alternative that, instead of convolutions or self-attention, is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels.

To improve the confidence in prediction, ensembles [7] of neural networks are commonly used. Multiple models are trained on the same data; each of the trained models is then used to make a prediction, and the predictions are combined in some way to create the final prediction. Ensembles have also been shown to reduce variance [8]. The inherent diversity in an ensemble has been shown to be a key factor for their superior performance. Different diversity metrics have been proposed in the machine learning literature. Error consistency [9], based on Cohen's kappa, measures the similarity of classification normalized by the chance of common prediction. Diversity [10] allows one to define diversity metrics based on different loss functions.

The objective of our work is to quantify the diversity of ensembles created using different models and to evaluate their benefits. We choose two CNNs, two ViTs, and two MLP-Mixers, and create 30 ensembles in total by averaging the models' outputs. Our results show that ensembles created using different architectures are more diverse than ensembles from the same architecture. We show that an ensemble of different architectures with similar accuracy further improves the performance. In our experiments, we observe the best ensemble results for a CNN and a ViT.

The paper is organized as follows. Section 2 describes the properties of CNNs, Vision Transformers, and MLP-Mixers, how they compare to each other including a summary of related work, and an overview of different diversity metrics. Section 3 provides our experimental results. Section 4 concludes the paper with a summary of our ongoing work and future directions.
The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety
(AISafety 2022), July 24–25, 2022, Vienna, Austria
mfilipiuk@nvidia.com (M. Filipiuk); vasus@nvidia.com (V. Singh)
ORCID 0000-0003-4926-8449 (M. Filipiuk)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

2. Background

We describe the evolution of different neural architectures and their strengths and weaknesses.
2.1. Neural architectures

Convolutional Neural Networks. The convolution operation predates the first convolutional neural networks. With hand-engineered features, it was used in classical computer vision applications many years before it appeared in the first neural networks in the 1980s. However, the rise of CNNs started with AlexNet in 2012, which defeated the other, non-neural approaches in the ImageNet competition by a large margin. Over the last 10 years, we have seen multiple improvements to this architecture, but they were more evolutionary than revolutionary.

The fact that convolutions managed to stay in the spotlight for such a long time may seem quite surprising; however, an analysis of their properties gives us the answer. Convolutions have two key inductive biases that allow them to excel at high-dimensional data with strong spatial correlation, like images: the spatial inductive bias allows them to focus on local information in the input images, and applying the same kernel over the whole image results in translation equivariance, as input translations result only in a shifted output of the convolutional layers. The convolution operation is also very simple and compute-efficient. Its memory usage is not only small but also constant with regard to the size of the image, which, combined with the possibility of applying it in parallel, makes it feasible on practically any hardware.

Vision Transformer. The Transformer architecture [5] was initially introduced in 2017 for NLP tasks. In 2020, this architecture was applied to the image classification problem and called the Vision Transformer (ViT) [4]. Here, an input image is split into a set of non-overlapping patches, which, after being embedded, are provided to the ViT encoder blocks. ViTs have much less image-specific inductive bias than CNNs. In CNNs, locality and translation equivariance are inherent to the convolutional layers throughout the whole model. In a ViT, the self-attention layers are global, and only the MLP layers are applied locally and translation-equivariantly at the patch level. The two-dimensional neighborhood is not present in the network architecture, as transformers treat the input as an unordered set; this information needs to be provided to the first layer in the form of a position embedding together with the image patches.

Reducing the inductive biases has twofold consequences. On the one hand, Transformers have to learn the properties that would otherwise be inherited from the convolution operation and that proved successful: invariance to input shifts and a balance between local and global perception in the encoding blocks. On the other hand, they can improve upon these properties, leverage the global perception to their advantage, and discover their own priors from the data, which results in performing the task distinctly and brings a diversity of solutions to the field.

MLP-Mixer. Presented in 2021, MLP-Mixers [6] provide an alternative to CNNs and ViTs that uses neither convolutions nor self-attention. Mixers use two types of MLP layers: channel-mixing and token-mixing MLPs. The channel-mixing MLPs are applied to every patch separately, exchanging information between channels, while the token-mixing MLPs work on one channel but across all patches, allowing communication between the patches.

Matrix multiplications in MLPs are a simpler operation than convolutions, which require more specialized hardware or a costly conversion to a matrix multiplication.

As MLP-Mixers perform similarly to Vision Transformers at the level of encoder layers, they have similar properties: both architectures have global perception fields, and both lack translation equivariance due to the use of image patches as input. Regarding the differences between the two architectures, MLP-Mixers do not need a position encoding, as the MLP layers differentiate between the different elements of their input, in contrast to the multi-head attention in ViTs.

2.2. Related Work

As the three architectures present different approaches to image classification (using convolutions, multi-head attention, or multilayer perceptrons to process the input), the comparison between them should not be restricted to experimental accuracy, e.g. on a single dataset like ImageNet, but should also include further experiments analyzing in detail the different aspects of the image classification problem (e.g. robustness to input corruptions or to transformations like translations or rotations) and the internal properties of each model. Bhojanapalli et al. [11] conduct multiple experiments assessing the robustness of Vision Transformers to multiple corruptions with regard to model sizes and their pre-training datasets, in comparison to various ResNet models. They show that (1) adversarial attacks like the Fast Gradient Sign Method and Projected Gradient Descent influence ViTs and CNNs similarly, and (2) adversarially corrupted images are not transferable across architectures, resulting in only a modest drop of a few percentage points between the architectures, while they are transferable between models of the same architecture. Regarding the less artificial corruptions and distribution shifts present in the ImageNet-C, -R, and -A datasets, the performance of the different architectures seems to be similar. One important conclusion is how the accuracy changes with the size of the pretraining dataset: for ILSVRC-2012, ViTs perform worse than CNNs, but for ImageNet-21k and JFT-300M the performance is comparable. Under closer inspection of the ImageNet-C dataset, ViTs and CNNs perform significantly differently on various ImageNet-C corruptions: e.g. on glass blur, Vision Transformers perform significantly better than CNNs, while they perform worse on the contrast corruption at the highest level of severity. This observation is crucial for the research presented in this paper. Naseer et al. [12] extend this comparison to e.g. input occlusions or input patch permutations, where ViTs perform much more robustly than CNNs. They also investigate the shape-texture bias of these architectures and show that transformers are less biased towards local textures than CNNs.

In [13], the authors analyze the information that every layer processes, what the receptive fields look like for Transformers (which are not restricted by the convolution operation), and how the different layers learn depending on the dataset size. Their research shows that CNNs and ViTs perform their computation significantly differently. It also briefly describes how MLP-Mixers behave closer to ViTs with regard to the intermediate features learned.

There have also been architectures that combine CNNs and ViTs. For example, CvT [14] applies convolutions over the input image and the intermediate feature token maps, which are next processed by a transformer block. While the Swin Transformer [15] does not feature convolution layers, it introduces the hierarchical approach of CNNs and the locality of convolutions to transformers: it applies multi-head attention (MHA) to small, local sets of patches (windows), while the patches are merged into bigger patches as we progress deeper into the model. To support information propagation between patches, the model shifts the windows with every layer to overlap with the previously used windows. These changes can also be introduced to MLP-Mixers, resulting in a performance improvement.

The results of the aforementioned research inspire us to investigate how the variety of these three architectures, demonstrated by multiple experiments, can be leveraged for improving diversity in safety-critical systems.

2.3. Diversity metrics

While the intuition behind diversity may be straightforward, quantifying it is not. We present below three distinct metrics from the literature that try to capture models' diversity.

Ortega et al. [10] provide a metric of diversity for different loss functions like the 0/1 loss, cross-entropy loss, and squared loss. As we are focused on the classification problem, we use the 0/1 and cross-entropy losses, whose formulas are presented below:

    D_0/1(ρ) = E_ν [ V_ρ ( 1(h_W(x; θ) ≠ y) ) ]

    D_ce(ρ) = E_ν [ V_ρ ( p(y | x, θ) / (√2 · max_θ p(y | x, θ)) ) ]

where E_ν denotes the expected value over the whole data-generating distribution ν (which is approximated using a dataset) and V_ρ the variance over the predictions of the models that the ensemble consists of. The formulas are derived from a loss analysis of every classifier and their ensemble, where the diversity upper-bounds the difference between the averaged loss of the classifiers and the loss of their ensemble. In summary, these metrics measure how diverse the predictions of the different models are for a dataset by calculating the variance of the predictions, averaged over every data point. In the case of CE diversity, the predictions are additionally scaled to the [0, 1] range.

From our perspective, the CE loss diversity should be more interesting, as we are going to ensemble models by averaging their predictions; however, CE loss diversity is more complex than 0/1 diversity, and eventually we evaluate models using accuracy, which binarizes their outputs into correct and incorrect classifications. At the same time, CE loss diversity is able to provide us with more information, e.g. when both models classify identically but with different probabilities assigned.

Error consistency [9] is a metric measuring how much the errors of two classifiers coincide. It calculates the fraction of items classified either correctly or incorrectly by both models and compares it to the expected rate of equal responses if the two models were statistically independent. The exact formula is as follows:

    κ = (c_obs − c_exp) / (1 − c_exp)

where c_obs stands for the fraction of equal classifications (either correct or incorrect) and c_exp is the expected rate of equal responses, calculated using the models' accuracies: c_exp = acc₁ · acc₂ + (1 − acc₁)(1 − acc₂). This metric can only compare two models, in contrast to the diversity metrics, which do not have such a restriction.

3. Experiments

Model selection and setup. We have chosen the best performing models that were available to us at the time of conducting the research, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k. We considered the arguments raised in the previous section to determine the size of the pretraining dataset, as this has the best potential for robust performance on ImageNet-C [16], which we use to compare the architectures. ImageNet-C is a dataset created by artificially applying various corruptions (blurs, noises, digital corruptions, and weather conditions), featuring different severity levels, to the ImageNet (ILSVRC2012) validation set. The models are as follows, and ensembles are created by averaging the returned softmax outputs of two models. We use only two at a time to observe how ensembles of different architectures perform compared
to the single models that build them. Also, using more models in the ensembles would prevent us from using the error consistency metric. Ensembles are created by averaging the softmax outputs, as it is the simplest way of building ensembles. While it has its disadvantages (e.g. the models are calibrated differently, and overconfident ones can dominate under-confident ones with their predictions), we choose it for its simplicity, leaving potential improvements to future work.
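The averaging scheme described above can be sketched as follows. This is a minimal sketch, not the paper's actual pipeline: `logits_a` and `logits_b` are hypothetical toy outputs standing in for two of the pretrained classifiers listed below.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average the softmax outputs of two models and take the argmax."""
    probs = 0.5 * (softmax(logits_a) + softmax(logits_b))
    return probs, probs.argmax(axis=-1)

# Toy example: two models, a batch of two inputs, three classes.
logits_a = np.array([[2.0, 1.0, 0.1], [0.2, 0.3, 2.5]])
logits_b = np.array([[1.5, 1.4, 0.2], [0.1, 2.0, 1.9]])
probs, preds = ensemble_predict(logits_a, logits_b)
```

Because the probabilities (not the argmax votes) are averaged, a confidently wrong model can outvote a mildly confident correct one, which is exactly the calibration caveat mentioned above.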
Vision Transformers:
      • Vision Transformer B/8 (86M parameters)¹
      • Vision Transformer L/16 (307M parameters)²
Convolutional Neural Networks [17]:
      • ConvNeXt-Base (89M parameters)³
      • ConvNeXt-XLarge (350M parameters)⁴
MLP-Mixers:
      • MLP-Mixer B/16 (59M parameters)⁵
      • MLP-Mixer L/16 (207M parameters)⁶

Using six distinct models allows us to create 30 different ensembles that are used for the experiments. We do not create the ensemble of a model with itself.

To compare the models, apart from the diversity metrics, we use Top10 accuracy and the retention metric [18] (the accuracy on a corrupted dataset divided by the accuracy on the original data). We picked Top10 accuracy to smooth out the achieved scores, as some images from ImageNet may contain multiple objects of different classes, which introduces variance to the accuracy estimate.

Figure 1: Retention curves with regard to severity, averaged over all ImageNet-C corruptions

Figure 2: Metrics performance on original data (panels: (a) Top10 Accuracy, (b) 0-1 Diversity, (c) CE Diversity, (d) Error consistency, (e) 0-1 Div. Components)

¹ available here: https://storage.googleapis.com/vit_models/augreg/B_8-i21k-300ep-lr_0.001-aug_medium2-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz
² https://storage.googleapis.com/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_medium2-wd_0.03-do_0.1-sd_0.1--imagenet2012-steps_20k-lr_0.01-res_224
³ https://tfhub.dev/sayakpaul/convnext_base_21k_1k_224/1
⁴ https://tfhub.dev/sayakpaul/convnext_xlarge_21k_1k_224/1
⁵ https://tfhub.dev/sayakpaul/mixer_b16_i21k_classification/1
⁶ https://tfhub.dev/sayakpaul/mixer_l16_i21k_classification/1
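The 0/1 and CE diversity values reported in Figure 2 can be approximated as in the sketch below. This is our reading of the formulas from Section 2.3 in numpy, not the reference implementation of [10], and the prediction arrays are hypothetical stand-ins for real model outputs.

```python
import numpy as np

def d01_diversity(preds, labels):
    """0/1 diversity: variance across models of the 0/1 error
    indicator 1(h(x) != y), averaged over the dataset.
    preds: (n_models, n_samples) predicted labels."""
    errors = (preds != labels[None, :]).astype(float)
    return errors.var(axis=0).mean()

def ce_diversity(p_true):
    """CE diversity: variance across models of the true-class
    probability, scaled by sqrt(2) times the per-sample maximum
    over models, averaged over the dataset.
    p_true: (n_models, n_samples) probabilities of the true class."""
    scaled = p_true / (np.sqrt(2) * p_true.max(axis=0, keepdims=True))
    return scaled.var(axis=0).mean()

# Two hypothetical models on four samples; they disagree in error
# on exactly one sample, so D_0/1 = 0.25 / 4 = 0.0625.
preds = np.array([[0, 1, 2, 1],
                  [0, 2, 2, 0]])
labels = np.array([0, 1, 2, 2])
d01 = d01_diversity(preds, labels)

# CE diversity sees a difference even when both models would be
# counted as equally correct, because the probabilities differ.
p_true = np.array([[0.9, 0.6],
                   [0.7, 0.8]])
dce = ce_diversity(p_true)
```

Note that `ce_diversity` returns zero for models with identical true-class probabilities, which matches the remark that CE diversity captures information that the binarized 0/1 view discards.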
Figure 3: Metrics performance on Gaussian Blur 5 corruption (panels: (a) Top10 Accuracy, (b) Top10 Retention, (c) 0-1 Diversity, (d) CE Diversity, (e) Error consistency, (f) 0-1 Div. Components)
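The remaining two quantities behind these figures, error consistency (Section 2.3) and the retention metric, can be sketched as below, again on hypothetical correctness vectors rather than real model outputs.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Cohen's-kappa-style error consistency between two classifiers.
    correct_a, correct_b: boolean arrays, True where the model is right.
    Assumes c_exp < 1, i.e. the models are not both trivially perfect."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    c_obs = np.mean(correct_a == correct_b)            # observed equal responses
    acc_a, acc_b = correct_a.mean(), correct_b.mean()
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)  # expected under independence
    return (c_obs - c_exp) / (1 - c_exp)

def retention(corrupted_accuracy, clean_accuracy):
    """Retention metric: accuracy on corrupted data relative to clean data."""
    return corrupted_accuracy / clean_accuracy

# Two models that always agree yield kappa = 1; disagreement lowers it.
a = np.array([True, True, False, True])
b = np.array([True, False, True, True])
kappa_same = error_consistency(a, a)
kappa_diff = error_consistency(a, b)
```

For example, a model scoring 45% on a corruption and 90% on clean data has a retention of 0.5, regardless of its absolute accuracy, which is what makes the retention curves in Figure 1 comparable across architectures.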



Results. While values averaged over the different corruptions do not play a key role in our comparison, they allow us to comprehend a broader picture of the research subject. In Figure 1, solid lines represent the retention of specific architectures (the mean of the two models of that architecture), while dashed lines show the retention of the different ensembles (also averaged over all ensembles of each kind). We clearly see that MLP-Mixers perform significantly worse than ViTs and CNNs. However, when MLP-Mixers are combined with ViTs or CNNs, the ensembles (brown and grey dashed lines) perform only slightly worse than single ViT or CNN models, respectively. When we look at the top performing ensembles, ViT+CNN ensembles are followed by pure CNN and pure ViT ensembles. This suggests that mixing different architectures is beneficial for robustness. The next experiments will support these two hypotheses with more concrete examples and results.

Figure 2 presents the accuracy, diversity metrics, and error consistency calculated on the original ImageNet data. Each cell represents the metric value scored by the ensemble created from the models in the corresponding column and row. On the diagonal, we have the scores of the single models. The last, non-triangular plot, called 0-1 Diversity Components (0-1 Diversity is calculated by averaging the two values from this plot located symmetrically to the diagonal), presents the fraction of images that are classified correctly by one model (the one in the row) and incorrectly by the second one (the column model).

Starting with the accuracy plot, we see that the best performing model is ConvNeXt-XLarge, followed by ViT Base, ViT Large, MLP-Mixer Base, and MLP-Mixer Large. In the cases of ViTs and MLP-Mixers, the smaller models perform better than their bigger counterparts; this might be an artifact of insufficient training. Regarding their ensembles, it is not surprising that the best accuracy is achieved by the ensemble of the best performing models (ViT-B and ConvNeXt-XL). We also observe that ensemble performance deteriorates only slightly when one of its components (e.g. MLP-Mixer Large) performs significantly worse than the other.

Figure 4: Metrics performance on Contrast 4 corruption (panels: (a) Top10 Accuracy, (b) Top10 Retention, (c) 0-1 Diversity, (d) CE Diversity, (e) Error consistency, (f) 0-1 Div. Components)

When we analyze all diversity metrics, we see that the MLP-Mixers stand out from the other models, especially the Large one. This is caused by much lower accuracy than the others: the 0-1 Diversity Components plot shows that Mixers misclassify a significant fraction of the images. We also see that the diversity is higher for CNN+ViT ensembles than for intra-architecture ensembles. This allows CNN+ViT ensembles to perform better, e.g. ConvNeXt-B + ViT-B performs better than ConvNeXt-B + ConvNeXt-XL, although ViT-B has lower accuracy.

Another interesting insight: MLP-Mixer L ensembles perform slightly better than all MLP-Mixer B ensembles, while MLP-Mixer L has lower accuracy than MLP-Mixer B by 5 p.p. One possible explanation is that MLP-Mixer L and B are not that different, although they have significantly different accuracies (which results in different 0-1 diversity and error consistency values with regard to all other models): the CE diversity between the Mixers is as low as between the ViTs (which classify very similarly, without a 5 p.p. gap). Further evidence that these models behave similarly is that they have similar CE diversity values with all other models.

To keep the paper concise, we investigate in detail two selected corruptions from ImageNet-C: Gaussian Blur at severity 5 and Contrast at severity 4. We have chosen them as they exemplify how the different architectures perform on various corruptions. The results for these corruptions are presented in Figures 3 and 4. Next to the accuracy plots, we also present the retention values.

The Gaussian blur corruption favors the Vision Transformers, as ViTs perform better than their CNN and MLP-Mixer counterparts. However, this time the best performing model is ViT-Large instead of Base, which suggests that while its training was not sufficient for it to outperform the smaller model, it was sufficient for it to learn to perform robustly (ViT-Large is thrice as big as ViT-B).

When we look at the metrics, the highest (or, in the case of error consistency, lowest) values belong to the MLP-Mixers, which perform poorly in comparison to ViTs and CNNs, so we may expect that this diversity comes mostly from their misclassifications. We see it in the 0-1 diversity components, which show that Mixers classify around 30-40% of the images incorrectly, in contrast to the other models. Regarding ViT and CNN ensembles, pure CNN ensembles are less diverse than ensembles of ViTs and CNNs
or pure ViT ensembles. If we focus on the ConvNeXt-B+XL ensemble and compare it to ConvNeXt-B+ViT-B, we see that the latter performs slightly better, while ViT-B is less accurate than ConvNeXt-XL. While it is not the most diverse pair between CNNs and ViTs, it is, according to all metrics, more diverse than the pure CNN ensemble. Another interesting comparison is ViT-L+B vs. ViT-L+ConvNeXt-B: we substitute the Base ViT with a worse performing CNN, which creates an ensemble that performs better and is more diverse.

Regarding the contrast corruption in Figure 4, CNNs dominate the performance with only a modest drop in accuracy, while the other models perform much worse, especially the Mixers. The highest diversity values are related to the worst performing MLP-Mixers. But at the same time, Mixers ensembled with CNNs perform similarly to ViT+CNN ensembles: the worst performing MLP-Mixer Base, which is almost 20 p.p. worse than ViT-L, performs marginally better when ensembled with ConvNeXt-XL, which we find intriguing.

4. Conclusions

While our approach of combining the inherent diversity across models by an ensemble is simple, it manages to capture a synergy that arises from the use of different architectures. The ViT+CNN ensemble has proven not only to perform better on average than the other combinations but also to perform satisfactorily regardless of the corruption type.

The diversity metrics and error consistency provide valuable quantitative tools to compare models and quantify the differences in classifications. However, they only allow us to understand the relationships between the models when they are inferred on a specific input. Unfortunately, these metrics may be deceiving in the case of two models, where one performs significantly worse than

and of different sizes. Another direction is to improve the ensemble technique. The potential improvements span from a weighted ensemble that would average the models, e.g. based on their individual performance, to a mixture of experts that could predict which model will perform better on a given input, and thus precisely leverage the advantages of each particular model to tackle particular corruptions. Such a mixture-of-experts solution would also be viable in a resource-constrained environment, where running multiple models simultaneously may be unacceptable. The last direction is to continue this research for more complex problems like object detection and image segmentation. We need to define diversity metrics for these problems and then investigate the quality of ensembles created using different neural architectures.

References

[1] International Standards Organization, ISO 26262: Road vehicles - functional safety, parts 1 to 11, in: Road Vehicles - Functional Safety, Second Edition, 2018-12.
[2] International Standards Organization, ISO/PAS 21448: Road vehicles - safety of the intended functionality, in: Road Vehicles - Safety of the Intended Functionality, 2019-01.
[3] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale (2020).
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, 2017.
[6] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer,
the other. High diversity does not translate to an im-               X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Key-
proved performance of their ensemble which might seem                sers, J. Uszkoreit, M. Lucic, A. Dosovitskiy, Mlp-
counter-intuitive. The metrics capture how diversely                 mixer: An all-mlp architecture for vision, 2021.
models classify, not the potential of the ensemble of the            URL: https://proceedings.neurips.cc/paper/2021/
two models. These two objectives coincide when mod-                  file/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Paper.pdf.
els perform similarly on the accuracy metric, while a            [7] O. Sagi, L. Rokach, Ensemble learning: A survey,
discrepancy in accuracies causes them to misalign. This              Wiley Interdiscip. Rev. Data Min. Knowl. Discov.
behavior requires a careful analysis of the metric on every          8 (2018). URL: https://doi.org/10.1002/widm.1249.
corruption separately.                                               doi:10.1002/widm.1249.
   We list several possible extensions to our work. The          [8] B. Lakshminarayanan, A. Pritzel, C. Blundell, Sim-
first one is an improvement on diversity metrics to met-             ple and scalable predictive uncertainty estimation
rics assessing the ensemble potential. Secondly, our re-             using deep ensembles, in: I. Guyon, U. von Luxburg,
search was limited to three different architectures. While           S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vish-
the results look promising, to fully evaluate and quantify           wanathan, R. Garnett (Eds.), Advances in Neural
how ensemble aggregates robustness of various models,                Information Processing Systems 30: Annual Con-
more experiments should be run, involving more models                ference on Neural Information Processing Systems
of different architectures, pretrained on different datasets,
     2017, 4-9 December 2017, Long Beach, CA, USA,
     2017, pp. 6402–6413.
 [9] R. Geirhos, K. Meding, F. A. Wichmann, Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency, 2020. URL: https://proceedings.neurips.cc/paper/2020/file/9f6992966d4c363ea0162a056cb45fe5-Paper.pdf.
[10] L. A. Ortega, R. Cabañas, A. R. Masegosa, Diversity and generalization in neural network ensembles, 2021. URL: https://arxiv.org/abs/2110.13786. doi:10.48550/ARXIV.2110.13786.
[11] S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, A. Veit, Understanding robustness of transformers for image classification, 2021. URL: https://arxiv.org/abs/2103.14586. arXiv:2103.14586.
[12] M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, Intriguing properties of vision transformers, 2021. arXiv:2105.10497.
[13] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, A. Dosovitskiy, Do vision transformers see like convolutional neural networks?, 2021. arXiv:2108.08810.
[14] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: Introducing convolutions to vision transformers, 2021. URL: https://arxiv.org/abs/2103.15808. arXiv:2103.15808.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, 2021.
[16] D. Hendrycks, T. Dietterich, Benchmarking neural network robustness to common corruptions and perturbations, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=HJz6tiCqYm.
[17] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, 2022. URL: https://arxiv.org/abs/2201.03545. arXiv:2201.03545.
[18] D. Zhou, Z. Yu, E. Xie, C. Xiao, A. Anandkumar, J. Feng, J. M. Alvarez, Understanding the robustness in vision transformers, 2022. URL: https://arxiv.org/abs/2204.12451. doi:10.48550/ARXIV.2204.12451.