<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Deep learning.
nature</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparing Vision Transformers and Convolutional Nets for Safety Critical Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michał Filipiuk</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasu Singh</string-name>
        </contrib>
        <aff>NVIDIA, Munich, Germany</aff>
        <email>{mfilipiuk, vasusg}@nvidia.com</email>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>521</volume>
      <issue>7553</issue>
      <fpage>436</fpage>
      <lpage>444</lpage>
      <abstract>
        <p>Transformer-based architectures like vision transformers (ViTs) are improving on the state of the art established by convolutional neural networks (CNNs) for computer vision tasks. Recent research shows that ViTs learn differently from CNNs, which makes them an appealing choice for developers of safety-critical applications seeking redundant design. Moreover, ViTs have been shown to be robust to image perturbations. In this position paper, we analyze the properties of ViTs and compare them to CNNs. We create an ensemble of a CNN and a ViT and compare its performance to that of the individual models. On the ImageNet benchmark, the ensemble shows minor improvements in accuracy relative to the individual models. On the image corruption benchmark ImageNet-C, the ensemble shows up to 10% improvement over the individual models, and generally performs as well as the better of the two individual networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Machine learning plays an important role in computer vision
applications. Safety critical applications like autonomous
vehicles and robotics increasingly depend on machine
learning for computer vision. Deep neural networks (LeCun,
Bengio, and Hinton 2015) based on Convolutional
Neural Networks (CNN) are well-known and widely used for
their powerful representation. For instance, in the field of
autonomous driving, various CNN models have been used
for object detection and image segmentation algorithms that
serve as perception units to process camera (e.g.,
PilotNet
        <xref ref-type="bibr" rid="ref5">(Bojarski et al. 2016)</xref>
        , Fast RCNN (Wang, Shrivastava,
and Gupta 2017)) and Lidar (e.g., VoxelNet (Zhou and Tuzel
2018)) data. However, the success of CNNs comes at the
cost of restricting computation to spatially local data,
a consequence of using convolutional layers. The performance of
CNNs has gradually saturated, and the ML research
community has been exploring alternative architectures for
computer vision.
      </p>
      <p>
        One of these alternative architectures that challenges
the dominance of CNNs is based on the transformer
model (Vaswani et al. 2017). First proposed for natural
language processing tasks, transformers have been adapted for
computer vision tasks by different approaches. For example,
the vision transformer model (ViT)
        <xref ref-type="bibr" rid="ref7">(Dosovitskiy et al. 2020)</xref>
        uses self-attention layers instead of convolutional layers,
effectively removing the spatial inductive bias introduced by the
convolution operation and enabling the network to use the full
image data to its advantage.
      </p>
      <p>
        The development of safety critical systems relies on
stringent safety methodologies, designs, and analyses to prevent
hazards during operation. Automotive safety standards like
ISO 26262
        <xref ref-type="bibr" rid="ref12">(International Standards Organization 2018-12)</xref>
        and ISO/PAS 21448
        <xref ref-type="bibr" rid="ref13">(International Standards Organization
2019-01)</xref>
        mandate methodologies for system, hardware, and
software development for automotive systems. Furthermore,
these standards have been extended with best practices to
use machine learning based components in safety critical
systems. Ashmore et al.
        <xref ref-type="bibr" rid="ref11 ref2">(Ashmore, Calinescu, and
Paterson 2019)</xref>
        describe the ML safety lifecycle, which establishes
best practices across the ML development cycle, from data
management and model selection through training to deployment.
Similarly, the industry whitepaper titled Safety First for
Automated Driving
        <xref ref-type="bibr" rid="ref1">(Aptiv et al. 2019)</xref>
        specifies techniques for
developing, deploying, and monitoring neural networks for
safety critical systems. In general, these guidelines
recommend identifying common causes of failures, avoiding
overfitting to training data, quantifying uncertainty in prediction,
and making networks robust to natural perturbations.
      </p>
      <p>
        Based on these proposals for using machine learning in
safety critical systems, we investigate the behavior of ViTs
in comparison and conjunction with CNNs. We explore
whether CNNs and ViTs could be combined into an
ensemble
        <xref ref-type="bibr" rid="ref6">(Dietterich 2000)</xref>
        for better accuracy. The fact that
ViTs are based on a different architecture than CNNs is
valuable for developing safety-critical systems, as it
allows reasoning about two independent network models, one
based on convolution and the other on self-attention. We
also investigate the robustness of the ensemble compared
to individual networks. We consider the ImageNet-C
benchmark
        <xref ref-type="bibr" rid="ref11">(Hendrycks and Dietterich 2019)</xref>
        to create
perturbations like Gaussian noise, defocus blur, artificially added fog,
and lowered image contrast. The ensemble shows an
improvement of up to 10% compared to the individual networks
on these robustness benchmarks.
      </p>
      <p>The paper is organized as follows. Section 2 describes
the properties of vision transformers, motivating their use
in safety-critical applications and comparing them to CNNs.
Section 3 provides a quantitative analysis of CNNs and ViTs
(and their ensemble) on image classification. Section 4
discusses related work. Section 5 concludes the paper with a
summary of our ongoing work and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>ViTs for Safety</title>
      <p>
        The original transformer model (Vaswani et al. 2017)
for natural language processing takes a sequence of
one-dimensional token embeddings as input, and relies on
self-attention to capture long-range data dependencies.
Transformers have become the dominant network architecture for
natural language tasks. A straightforward application of the
transformer model to computer vision tasks would require
attention between every pair of pixels, which does not scale to
realistic image sizes due to the quadratic cost in the number of
pixels. The ViT model
        <xref ref-type="bibr" rid="ref7">(Dosovitskiy et al. 2020)</xref>
        avoids this
limitation by reshaping an image into a sequence of flattened
patches of size P × P, reducing the effective input sequence
length by a factor of P². Generally, the patch size P is chosen to
be 16 or 32. We now describe desirable properties of neural
network architectures for use in safety-critical applications.
      </p>
      <p>
        Reusability. Transfer learning (Pan and Yang 2009) is a
commonly used technique for training ML models, where
a model is trained for a particular context and then reused
in a different context with limited training data. It is a
powerful technique that reduces computational effort and increases
confidence in the trained model. CNNs are well suited for
transfer learning since convolutional layers allow features
in the input space to be encoded. It has been shown
        <xref ref-type="bibr" rid="ref7">(Dosovitskiy
et al. 2020)</xref>
        that ViTs also attain excellent results when they
are trained at sufficient scale and then transferred to new
tasks with relatively few data points. Especially when
pretraining data is abundant and transfer data is scarce
(few-shot learning), ViTs outperform state-of-the-art CNNs.
      </p>
      <p>
        Robustness. For use in safety-critical applications, it is
important that the network is robust against image
perturbations. For example, an automotive perception network
trained under sunny weather conditions should also perform well
in rain and snow. To simulate such
effects, several image corruption benchmarks
        <xref ref-type="bibr" rid="ref11">(Hendrycks and
Dietterich 2019)</xref>
        , (Michaelis et al. 2019) have been created
and the performance of different network architectures has been
studied. Bhojanapalli et al.
        <xref ref-type="bibr" rid="ref3">(Bhojanapalli et al. 2021)</xref>
        investigate
the performance of ViTs and CNNs on images with
corruptions like noise and blur. They demonstrate that with a
sufficiently large pre-training dataset, ViTs are at least
as robust as CNNs, and sometimes more robust, on
artificially corrupted data. Similarly, Naseer et al. (Naseer et al.
2021) show that ViTs are more robust than state-of-the-art CNNs
against severe occlusions of foreground objects and
random patch locations.
      </p>
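      <p>The patch-based input described at the start of this section can be illustrated with a minimal NumPy sketch. It is not taken from the ViT implementation; the function names <monospace>num_patches</monospace> and <monospace>patchify</monospace> are hypothetical and only show how an image becomes a sequence of flattened P × P patches and why the sequence length shrinks by a factor of P².</p>
      <preformat><![CDATA[
```python
import numpy as np


def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles in an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    n_h, n_w = h // patch, w // patch
    # Split both spatial axes into (tile index, offset inside the tile).
    tiles = image.reshape(n_h, patch, n_w, patch, c)
    # Bring the two tile indices to the front, then flatten each tile.
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    return tiles.reshape(n_h * n_w, patch * patch * c)


image = np.zeros((384, 384, 3), dtype=np.float32)
tokens = patchify(image, patch=16)
print(tokens.shape)                             # (576, 768)
print(384 * 384 // num_patches(384, 384, 16))   # 256, i.e. P squared
```
]]></preformat>
      <p>For a 384x384 input and P = 16, attention is computed over 576 patch tokens instead of 147456 pixels.</p>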
      <p>
        Detection of Distribution Shift. In addition to robustness
against image perturbations, it is also essential that the
network can identify distribution shift during deployment, i.e.,
scenarios where the network observes data that differs
from its training data, because unseen data might
result in incorrect predictions. For example, an automotive
perception network trained to detect pedestrians should be
able to distinguish cyclists as being different from
pedestrians. Fort et al.
        <xref ref-type="bibr" rid="ref8">(Fort, Ren, and Lakshminarayanan 2021)</xref>
        show that pre-trained transformers perform better than CNNs at
detecting out-of-distribution (OOD) samples. Transformers
are also better suited than CNNs to few-shot outlier exposure,
where a network is shown a few outlier samples
in order to improve its detection of distribution shift.
      </p>
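      <p>One of the simplest OOD scores, and one of the metrics named in our future-work discussion, is the maximum softmax probability. The sketch below is only an illustration of that score under our own assumptions: the function names and the threshold value are hypothetical, and it is not the procedure used by Fort et al.</p>
      <preformat><![CDATA[
```python
import numpy as np


def max_softmax_score(probs: np.ndarray) -> np.ndarray:
    """Confidence score per sample: the largest class probability."""
    return probs.max(axis=-1)


def flag_out_of_distribution(probs: np.ndarray, threshold: float) -> np.ndarray:
    """Mark samples whose confidence falls below a chosen threshold."""
    return max_softmax_score(probs) < threshold


probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # diffuse prediction
])
# The threshold is a made-up value; in practice it would be tuned on held-out data.
flags = flag_out_of_distribution(probs, threshold=0.5)
print(flags.tolist())  # [False, True]
```
]]></preformat>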
      <p>
        Redundancy. Self-attention allows ViTs to integrate global
information about the image even in the lower layers. ViTs
have more uniform representations across layers, preserving
input spatial information. This is in contrast to CNNs, where
global information is available only in higher layers.
Moreover, CNNs have an intrinsic local neighborhood structure
in each layer. The translational invariance in CNNs also
introduces an inductive bias. Raghu et al. (Raghu et al. 2021)
show that ViTs and CNNs indeed learn differently, and this
is reflected in their internal structures after training. This
fundamental difference in how ViTs and CNNs learn
provides a powerful tool for redundant design of safety critical
applications. In addition, independent models can be
pretrained on different datasets and executed on different
hardware platforms at runtime. The safety argument for an
ensemble of a CNN and a ViT can rest on the fact that the two
individual architectures differ in their detection mechanism
(convolution and self-attention, respectively), so that the
ensemble is less exposed to common-cause failures than an
ensemble of multiple CNNs or multiple ViTs would be. For
example, Bhojanapalli et al.
        <xref ref-type="bibr" rid="ref3">(Bhojanapalli et al. 2021)</xref>
        show
that adversarial perturbations do not transfer across ViTs and
CNNs.
      </p>
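      <p>A simple way to probe the common-cause argument empirically is to count how often both models fail on the same input. The following sketch is our own illustration rather than an analysis from this paper; the function name <monospace>failure_overlap</monospace> and the toy predictions are made up.</p>
      <preformat><![CDATA[
```python
import numpy as np


def failure_overlap(pred_a: np.ndarray, pred_b: np.ndarray, labels: np.ndarray) -> dict:
    """Return each model's error rate and the rate at which both fail together."""
    wrong_a = pred_a != labels
    wrong_b = pred_b != labels
    return {
        "error_a": float(wrong_a.mean()),
        "error_b": float(wrong_b.mean()),
        "both_wrong": float((wrong_a & wrong_b).mean()),
    }


labels = np.array([0, 1, 2, 3, 4, 5, 6, 7])
pred_cnn = np.array([0, 1, 2, 3, 9, 9, 6, 7])   # wrong on samples 4 and 5
pred_vit = np.array([0, 1, 9, 3, 4, 9, 6, 7])   # wrong on samples 2 and 5
stats = failure_overlap(pred_cnn, pred_vit, labels)
print(stats)
```
]]></preformat>
      <p>A joint failure rate well below the individual error rates would support the claim that the two architectures fail on different inputs.</p>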
    </sec>
    <sec id="sec-3">
      <title>Quantitative Analysis</title>
      <p>We start by investigating whether CNNs and ViTs can
be combined for more accurate detection. We compare the
accuracy of the individual architectures with that of the ensemble
model.</p>
      <p>
        We choose the Vision Transformer (ViT)
        <xref ref-type="bibr" rid="ref7">(Dosovitskiy
et al. 2020)</xref>
        and Big Transfer (BiT)
        <xref ref-type="bibr" rid="ref14">(Kolesnikov et al. 2019)</xref>
        models for our comparison. The specific ViT that we picked
for our experiments is the largest publicly available one, ViT-L
using patches of 16x16 pixels, pretrained on ImageNet21K
and fine-tuned on ImageNet2012 images at a resolution of
384x384<sup>1</sup>. It consists of 307M trainable parameters. The Big
Transfer model is a CNN architecture based on the well-known
ResNet networks with a few improvements that make it
transferable between datasets, similarly to ViTs. Here,
we also picked the biggest available checkpoint, called
BiT-M, based on ResNet152x4, pretrained on ImageNet21K and
fine-tuned on ImageNet2012<sup>2</sup>. BiT consists of 937M
parameters. Using these two models, we create an ensemble
model. It takes the outputs of the CNN and the ViT and
treats them as individual probability distributions over all
1000 ImageNet classes. The distributions are then multiplied
element-wise to obtain values proportional to each class’s
likelihood. This approach is just one of many possible
techniques for combining the results of the models’ inferences. We plan
to research other ensemble models in future work.
      </p>
      <p><sup>1</sup> Available here: gs://vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz</p>
      <p>
        Figure 1: (a) Correct label: Mountain bike. (b) Correct label: Remote control.
      </p>
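      <p>The element-wise product described above can be sketched in a few lines of NumPy. This is a minimal illustration, not our evaluation code: the function name <monospace>product_ensemble</monospace> and the toy probabilities are made up, and the final renormalization is an optional step that does not change the ranking of the classes.</p>
      <preformat><![CDATA[
```python
import numpy as np


def product_ensemble(p_cnn: np.ndarray, p_vit: np.ndarray) -> np.ndarray:
    """Multiply two class distributions element-wise and renormalize."""
    scores = p_cnn * p_vit
    # Renormalizing is optional: it keeps the result a distribution
    # but leaves the class ranking unchanged.
    return scores / scores.sum(axis=-1, keepdims=True)


# Toy 3-class example with made-up softmax outputs.
p_cnn = np.array([0.50, 0.45, 0.05])   # CNN prefers class 0
p_vit = np.array([0.05, 0.45, 0.50])   # ViT prefers class 2
p_ens = product_ensemble(p_cnn, p_vit)
print(p_cnn.argmax(), p_vit.argmax(), p_ens.argmax())  # 0 2 1
```
]]></preformat>
      <p>In this toy case the ensemble selects a class that neither model ranks first, because it is the only class to which both models assign substantial probability.</p>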
      <p>Table 1 shows the Top-1, Top-5, and Top-10 accuracy of
the CNN, the ViT, and the ensemble. The CNN and the ViT perform
very similarly, with the ViT being consistently slightly better,
while the ensemble performs better across all metrics,
significantly at Top-1.</p>
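      <p>For reference, Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below shows one way to compute it with NumPy; the function name <monospace>top_k_accuracy</monospace> and the toy data are hypothetical and not part of our evaluation pipeline.</p>
      <preformat><![CDATA[
```python
import numpy as np


def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of rows whose true label is among the k highest scores."""
    # argpartition places the k largest entries (unordered) in the last k columns.
    top_k = np.argpartition(probs, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Made-up scores for three samples over four classes.
probs = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.35, 0.30, 0.10],
])
labels = np.array([0, 2, 0])
print(top_k_accuracy(probs, labels, k=1))
print(top_k_accuracy(probs, labels, k=2))
```
]]></preformat>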
      <p>Next, we investigate examples where the ensemble of
CNN and ViT predicts the correct class (Top-1), while the
individual models do not. Figure 1 shows two such
examples. Figure 1a shows a bicycle handlebar where a large part
of the bike is outside the image. This makes it challenging to
comprehend the image correctly. Figure 1b shows an old
remote control for an Apple device. We see its back, which is an
unusual way of presenting a remote control. It is also
uncommon to see such remotes, as most of us associate Apple
and its logo with iPhones and MacBooks.</p>
      <p>Table 2 provides the softmax probabilities for the top-5
predictions per network for these examples. We observe that
the two networks predict different top-1 classes in both cases,
while the correct prediction appears in the top-5 of both
networks. We also observe that for the remote control, the CNN
predicts a hard disc with a high probability
(0.48), whereas the ViT does not include hard disc in
its top-5 predictions.</p>
      <p>Robustness. Next, we ask the question: how robust are
ViTs to perturbations for image classification in comparison
to CNNs, and can we leverage their different ways
of comprehending the image to our advantage with the
ensemble?</p>
      <p><sup>2</sup> Available here: https://tfhub.dev/google/bit/m-r152x4/ilsvrc2012_classification/1</p>
      <p>Table 2 (surviving entries): top-5 predictions with softmax probabilities.
Correct label Mountain bike: (joystick, 0.19682), (mountain ..., 0.18778),
(microphone, 0.15267), (stopwatch, 0.03436), (mountain bike, 0.02978).
Correct label Remote control: (screw, 0.06855), (joystick, 0.02686).</p>
      <p>We continue to work with the ViT and BiT models
mentioned earlier, but we choose a different checkpoint for the
ViT<sup>3</sup>, as we change the image resolution from 384x384 to
224x224. The respective BiT model in our analysis has
degraded performance, as there is no checkpoint available for
smaller images.</p>
      <p>
        To validate the performance on corrupted data, we
have chosen the ImageNet-C dataset
        <xref ref-type="bibr" rid="ref11">(Hendrycks and Dietterich
2019)</xref>
        and selected a few corruptions: Gaussian noise,
defocus blur, contrast, and fog. For each corruption, we
preprocessed the data with the highest level of corruption
severity (sample images can be seen in Figure 2). The corruption
was applied to the original ImageNet images by TensorFlow
Datasets<sup>4</sup>. We use the first 10% of the ImageNet validation data
(5000 images) with the aforementioned corruptions added. We
took the pre-trained checkpoints mentioned in the section above
and ran inference on an NVIDIA Quadro A6000 GPU.
      </p>
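      <p>To give an intuition for two of the selected corruptions, the sketch below applies additive Gaussian noise and a contrast reduction to an image with values in [0, 1]. It is written in the spirit of ImageNet-C but is not the benchmark's implementation: our experiments used the pre-corrupted data, and the function names and strength parameters here are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
import numpy as np


def gaussian_noise(image: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    noisy = image + rng.normal(scale=sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def reduce_contrast(image: np.ndarray, factor: float) -> np.ndarray:
    """Shrink pixel values toward the per-channel mean; factor in (0, 1]."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    return np.clip((image - mean) * factor + mean, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.uniform(size=(224, 224, 3))              # stand-in for a real image
noisy = gaussian_noise(image, sigma=0.3, rng=rng)    # sigma is illustrative
flat = reduce_contrast(image, factor=0.1)            # factor is illustrative
print(noisy.shape, float(flat.std()) < float(image.std()))
```
]]></preformat>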
      <p>Table 3 compares the Top-1, Top-5, and Top-10 accuracy
of BiT, ViT, and their ensemble. For every corruption except
Gaussian noise, the ensemble is superior to both the CNN and
the ViT working separately, while in the case of Gaussian noise
it is slightly worse. It is also interesting how the individual
models perform on the various corruptions: the ViT is much better
at Gaussian noise and fog, while it seems to be on par with
the CNN on defocus blur, and performs worse on the contrast
corruption.</p>
      <p><sup>3</sup> Available here: gs://vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz</p>
      <p><sup>4</sup> https://www.tensorflow.org/datasets/catalog/imagenet2012_corrupted</p>
    </sec>
    <sec id="sec-4">
      <title>Related work</title>
      <p>
        Since the introduction of ViTs in 2020, their properties have
been extensively studied. Naseer et al. (Naseer et al. 2021)
observe that ViTs are resilient to domain shifts and
occlusions. Bhojanapalli et al.
        <xref ref-type="bibr" rid="ref3">(Bhojanapalli et al. 2021)</xref>
        show
robustness of ViTs against adversarial and natural
perturbations. Fort et al.
        <xref ref-type="bibr" rid="ref8">(Fort, Ren, and Lakshminarayanan 2021)</xref>
        study the performance of vision transformers on
out-of-distribution detection. Ranftl et al. (Ranftl, Bochkovskiy,
and Koltun 2021) use vision transformers for
monocular depth estimation and semantic segmentation. Raghu et
al. (Raghu et al. 2021) investigate the differences between the
learned representations of ViTs and CNNs. There have also
been architectures that combine CNNs and ViTs. For
example, CNNs meet Transformers (CMT)
        <xref ref-type="bibr" rid="ref10">(Guo et al. 2021)</xref>
        is
an architecture where the input image is fed into a sequence
of convolutional blocks for fine-grained feature extraction,
followed by CMT blocks (transformer with depth-wise
convolution) for representation learning.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We investigated how the introduction of vision transformers
as an alternative to CNNs impacts network design in safety
critical systems for computer vision. We compared CNNs and
ViTs as well as their ensemble on image classification tasks.
We also showed that for many common image corruptions,
ViTs are relatively more resilient than CNNs. Moreover,
an ensemble of a CNN and a ViT provides up to 10%
higher accuracy than the individual networks on the corruptions
provided in the ImageNet-C benchmark.</p>
      <p>Ongoing and future work. Vision transformers are an
exciting development for safety-critical applications of
computer vision. Not only do they learn differently from
CNNs, but they are also more robust against natural and
adversarial perturbations. We believe that vision transformers,
alone or in conjunction with CNNs, need to be extensively
investigated to develop stronger safety guarantees for ML-based
components. This is ongoing work, and we continue
to investigate the following:</p>
      <list list-type="bullet">
        <list-item>
          <p>How does the performance of transformers compare
to CNNs on object detection benchmarks with and
without corruption? We plan to use the corruption
datasets (Michaelis et al. 2019) corresponding to
common object detection benchmarks like Pascal, COCO, and
Cityscapes, and compare the performance of transformer-based
object detection networks like Swin (Liu et al.
2021) to CNNs. We also plan to combine Swin with CNN
models and use ensemble techniques for object
detection (Wei, Ball, and Anderson 2018).</p>
        </list-item>
        <list-item>
          <p>How good are vision transformers at detecting
distribution shift after deployment? We are investigating the
performance of transformers on automotive benchmarks for
OOD detection (Nitsch et al. 2021) with different
metrics like maximum over softmax probabilities and
Mahalanobis distance. We plan to compare zero-shot versus
few-shot OOD detection for different architectures.</p>
        </list-item>
        <list-item>
          <p>How to implement redundant design in
resource-constrained systems? The current performant ViTs are
large, and challenges exist in scaling performance to
smaller models (Liu et al. 2021) that are robust as well
as suitable for resource-constrained domains like
automotive and robotics. In future work, we plan to address
these challenges.</p>
        </list-item>
      </list>
      <p>Figure 2: (a) Original image. (b) Gaussian noise. (c) Fog. (d) Defocus blur.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>Accuracy on original and corrupted data (surviving entries for the CNN + ViT ensemble).</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Model</th>
              <th colspan="2">Original data</th>
              <th colspan="2">Gaussian noise</th>
              <th colspan="2">Defocus blur</th>
              <th colspan="2">Contrast</th>
            </tr>
            <tr>
              <th>Top-1</th>
              <th>Top-5</th>
              <th>Top-1</th>
              <th>Top-5</th>
              <th>Top-1</th>
              <th>Top-5</th>
              <th>Top-1</th>
              <th>Top-5</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>CNN + ViT</td>
              <td>0.8326</td>
              <td></td>
              <td>0.5330</td>
              <td>0.7532</td>
              <td>0.3966</td>
              <td>0.5852</td>
              <td>0.1980</td>
              <td>0.6076</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Wei, P.; Ball, J. E.; and Anderson, D. T. 2018. Fusion of an
Ensemble of Augmented Image Detectors for Robust Object
Detection. Sensors, 18(3): 894.</p>
      <p>Zhou, Y.; and Tuzel, O. 2018. VoxelNet: End-to-End Learning
for Point Cloud Based 3D Object Detection. In CVPR.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Aptiv; Audi; Baidu; BMW; Continental; Daimler; FCA; Here; Infineon; Intel; and Volkswagen.
          <year>2019</year>
          .
          <article-title>Safety First For Automated Driving</article-title>
          .
          <source>In Safety First For Automated Driving</source>
          ,
          <fpage>116</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ashmore</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Calinescu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Paterson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges</article-title>
          .
          <source>CoRR</source>
          , abs/1905.04223.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bhojanapalli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Glasner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Veit</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Understanding Robustness of Transformers for Image Classification</article-title>
          . arXiv:2103.14586.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Bojarski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Del Testa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dworakowski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Firner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Flepp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jackel</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Monfort</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; et al.
          <year>2016</year>
          .
          <article-title>End to end learning for self-driving cars</article-title>
          . arXiv preprint arXiv:1604.07316.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dietterich</surname>
            ,
            <given-names>T. G.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Ensemble Methods in Machine Learning</article-title>
          .
          <source>In Multiple Classifier Systems</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-45014-6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dosovitskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kolesnikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Weissenborn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dehghani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Minderer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Heigold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gelly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; et al.
          <year>2020</year>
          .
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          . arXiv preprint arXiv:2010.11929.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Fort</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Exploring the Limits of Out-of-Distribution Detection</article-title>
          . arXiv:2106.03004.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>CMT: Convolutional Neural Networks Meet Vision Transformers</article-title>
          . arXiv:2107.06263.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Hendrycks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dietterich</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Benchmarking Neural Network Robustness to Common Corruptions and Perturbations</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          International Standards Organization.
          <year>2018</year>
          -12.
          <article-title>ISO 26262: Road Vehicles - Functional Safety, Parts 1 to 11</article-title>
          .
          <source>In Road Vehicles - Functional Safety, Second Edition</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          International Standards Organization.
          <year>2019</year>
          -01.
          <article-title>ISO/PAS 21448: Road Vehicles - Safety of the intended functionality</article-title>
          .
          <source>In Road Vehicles - Safety of the intended functionality</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Kolesnikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Puigcerver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yung</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gelly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Houlsby</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Large Scale Learning of General Visual Representations for Transfer</article-title>
          .
          <source>CoRR</source>
          , abs/1912.11370.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>