<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploring Diversity in Neural Architectures for Safety</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Michał</forename><surname>Filipiuk</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">NVIDIA</orgName>
								<address>
									<addrLine>Einsteinstraße 172</addrLine>
									<settlement>Munich</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vasu</forename><surname>Singh</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">NVIDIA</orgName>
								<address>
									<addrLine>Einsteinstraße 172</addrLine>
									<settlement>Munich</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploring Diversity in Neural Architectures for Safety</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">43CA59E74A6717D6AE01559DC3FED25B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:22+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>diversity</term>
					<term>ensemble</term>
					<term>safety</term>
					<term>deep learning</term>
					<term>image classification</term>
					<term>robustness</term>
					<term>safety-critical systems</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Apart from the predominant convolutional neural networks (CNNs), several new architectures like Vision Transformers (ViTs) and MLP-Mixers have recently been proposed. Research also shows that these architectures learn differently. Ensembles based on different state-of-the-art neural architectures thus provide diversity, an important characteristic in designing safety-critical systems. To quantify the benefit of ensembles, we investigate different metrics proposed in the literature, such as error consistency and the diversity metric. We observe that, with comparable individual performance, an ensemble of diverse architectures is not only more accurate than an ensemble of a single architecture, but also more robust to diverse input corruptions.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The development of safety-critical systems relies on stringent safety methodologies, designs, and analyses to prevent hazards during operation. Automotive safety standards like ISO 26262 <ref type="bibr" target="#b0">[1]</ref> and ISO/PAS 21448 <ref type="bibr" target="#b1">[2]</ref> mandate methodologies for system, hardware, and software development in automotive systems. Diversity is an important concept in safety-critical systems that protects against common-cause failures. For example, diversity in hardware is provided through lockstep execution across different hardware engines, and diversity in software is achieved through diverse algorithmic implementations.</p><p>Deep neural networks <ref type="bibr" target="#b2">[3]</ref> based on convolutional neural networks (CNNs) are well established for vision tasks in machine learning. These include safety-critical applications like autonomous driving and robotics, where CNN models serve as perception units for object detection and image segmentation on sensor data. Over the last few years, new neural architectures have disrupted the dominance of CNNs in vision tasks: Vision Transformers (ViTs) <ref type="bibr" target="#b3">[4]</ref>, inspired by the transformer model <ref type="bibr" target="#b4">[5]</ref> originally proposed for natural language processing (NLP) tasks, replace convolution layers with self-attention layers to process the input, which is split into a set of non-overlapping patches. 
Similarly, MLP-Mixers <ref type="bibr" target="#b5">[6]</ref> have been proposed as a competitive yet conceptually simple alternative that, instead of convolutions or self-attention, is based entirely on multi-layer perceptrons (MLPs) applied repeatedly across either spatial locations or feature channels.</p><p>The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria mfilipiuk@nvidia.com (M. Filipiuk); vasus@nvidia.com (V. Singh) 0000-0003-4926-8449 (M. Filipiuk)</p><p>To improve confidence in predictions, ensembles <ref type="bibr" target="#b6">[7]</ref> of neural networks are commonly used: multiple models are trained on the same data, each trained model makes a prediction, and the predictions are combined in some way into a final prediction. Ensembles have also been shown to reduce variance <ref type="bibr" target="#b7">[8]</ref>. The inherent diversity in an ensemble has been shown to be a key factor in its superior performance. Different diversity metrics have been proposed in the machine learning literature. Error consistency <ref type="bibr" target="#b8">[9]</ref>, based on Cohen's kappa, measures the similarity of classifications normalized by the chance of common predictions. Diversity <ref type="bibr" target="#b9">[10]</ref> defines diversity metrics based on different loss functions.</p><p>The objective of our work is to quantify the diversity of ensembles created from different models and to evaluate their benefits. We choose two CNNs, two ViTs, and two MLP-Mixers, and create 30 ensembles in total by averaging the models' outputs. Our results show that ensembles of different architectures are more diverse than ensembles of a single architecture, and that an ensemble of different architectures with similar accuracy further improves performance. 
In our experiments, we observe the best ensemble results for a CNN combined with a ViT.</p><p>The paper is organized as follows. Section 2 describes the properties of CNNs, Vision Transformers, and MLP-Mixers, compares them to each other with a summary of related work, and gives an overview of different diversity metrics. Section 3 provides our experimental results. Section 4 concludes the paper with a summary of our ongoing work and future directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Neural architectures</head><p>Convolutional Neural Networks. The convolution operation predates the first convolutional neural networks. With hand-engineered features, it was used in classical computer vision applications many years before it appeared in the first neural networks in the 1980s. The rise of CNNs, however, started with AlexNet in 2012, which defeated the other, non-neural approaches in the ImageNet competition by a large margin. Over the last 10 years, we have seen multiple improvements to this architecture, but they were more evolutionary than revolutionary.</p><p>That convolutions remained in the spotlight for so long may seem surprising, but an analysis of their properties explains it: convolutions have two key inductive biases that allow them to excel at high-dimensional data with strong spatial correlation, such as images. The spatial inductive bias allows them to focus on local information in the input images, and applying the same kernel over the whole image results in translation equivariance, as input translations only shift the output of convolutional layers. The convolution is also a simple and compute-efficient operation. Its memory usage is not only small but also constant with regard to the image size, which, combined with the possibility of applying it in parallel, makes it feasible on virtually any hardware.</p><p>Vision Transformer. The Transformer architecture <ref type="bibr" target="#b4">[5]</ref> was initially introduced in 2017 for NLP tasks. In 2020, this architecture was applied to the image classification problem as the Vision Transformer (ViT) <ref type="bibr" target="#b3">[4]</ref>. Here, an input image is split into a set of non-overlapping patches, which are embedded and then fed to the ViT encoder blocks. ViTs have much less image-specific inductive bias than CNNs. 
In CNNs, locality and translation equivariance are inherent to the convolutional layers throughout the whole model. In ViTs, the self-attention layers are global; only the MLP layers operate locally and are translation equivariant at the patch level. The two-dimensional neighborhood is not represented in the network architecture, as transformers treat the input as an unordered set; this information has to be supplied to the first layer in the form of position embeddings together with the image patches.</p><p>Reducing the inductive biases has twofold consequences. Transformers have to learn properties that convolutions provide for free and that have proved successful: invariance to input shifts and a balance of local and global perception in the encoding blocks. At the same time, they can improve upon these properties, leverage global perception to their advantage, and discover their own priors from data, which results in performing the task distinctly and brings a diversity of solutions to the field. Presented in 2021, MLP-Mixers <ref type="bibr" target="#b5">[6]</ref> provide an alternative to CNNs and ViTs that uses neither convolutions nor self-attention. Mixers use two types of MLP layers: channel-mixing and token-mixing MLPs. The channel-mixing MLPs are applied to every patch separately, exchanging information between channels, while the token-mixing MLPs work on one channel, but across all patches, allowing communication between the patches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MLP-Mixer</head><p>Matrix multiplications in MLPs are a simpler operation than convolutions, which require more specialized hardware or a costly conversion to a matrix multiplication.</p><p>As MLP-Mixers perform similarly to Vision Transformers at the level of encoder layers, they have similar properties: both architectures have global perception fields, and both lack translation equivariance due to the use of image patches as input. Regarding the differences between the two architectures: MLP-Mixers do not need position encoding, as MLP layers differentiate between the elements of their input, in contrast to the multi-head attention in ViTs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Related Work</head><p>As the three architectures take different approaches to image classification (using convolutions, multi-head attention, or multi-layer perceptrons to process the input), a comparison between them should not be restricted to experimental accuracy, e.g. on a single dataset like ImageNet, but should also include experiments analyzing in detail the different aspects of the image classification problem (e.g. robustness to input corruptions or to transformations like translations or rotations) and the internal properties of each model. Bhojanapalli et al. <ref type="bibr" target="#b10">[11]</ref> conduct multiple experiments assessing the robustness of Vision Transformers to multiple corruptions with regard to model sizes and pre-training datasets, in comparison to various ResNet models. They show that (1) adversarial attacks like the Fast Gradient Sign Method and Projected Gradient Descent influence ViTs and CNNs similarly, and (2) adversarially corrupted images are not transferable between architectures, resulting in only a modest drop of a few percentage points, while they are transferable between models of the same architecture. Regarding less artificial corruptions and distribution shifts, present in the ImageNet-C, -R, and -A datasets, the performance of the different architectures seems to be similar. One important conclusion is how accuracy changes with the size of the pretraining dataset: for ILSVRC-2012, ViTs perform worse than CNNs, while for ImageNet-21k and JFT-300M performance is comparable. Under closer inspection of the ImageNet-C dataset, ViTs and CNNs perform significantly differently on the various corruptions: e.g. on glass blur, Vision Transformers perform significantly better than CNNs, while they perform worse on contrast corruption at the highest severity level; this observation is crucial for the research presented in this paper. Naseer et al. 
<ref type="bibr" target="#b11">[12]</ref> extend this comparison to, e.g., input occlusions and input patch permutations, where ViTs perform much more robustly than CNNs. They also investigate the shape-texture bias of these architectures and show that transformers are less biased towards local textures than CNNs.</p><p>In <ref type="bibr" target="#b12">[13]</ref>, the authors analyze the information that every layer processes, what the receptive fields of Transformers look like (which are not restricted by the convolution operation), and how different layers learn depending on the dataset size. Their research shows that CNNs and ViTs perform their computation significantly differently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>It also briefly describes how MLP-Mixers behave closer to ViTs with regard to the intermediate features learned.</head><p>There have also been architectures that combine CNNs and ViTs. For example, CvT <ref type="bibr" target="#b13">[14]</ref> introduces convolutions to vision transformers by applying them over the input image and the intermediate feature token maps, which are then processed by a transformer block. While the Swin Transformer <ref type="bibr" target="#b14">[15]</ref> does not feature convolution layers, it introduces the hierarchical approach of CNNs and the locality of convolutions to transformers: it applies multi-head attention to small, local sets of patches (windows), while the patches are merged into bigger patches deeper in the model. To support information propagation between patches, the model shifts the windows with every layer so that they overlap with the previously used windows. These changes can also be introduced to MLP-Mixers, resulting in performance improvements.</p><p>The results of the aforementioned research inspire us to investigate how the variety of these three architectures, demonstrated by multiple experiments, can be leveraged to improve diversity in safety-critical systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Diversity metrics</head><p>While the intuition behind diversity may be straightforward, quantifying it is not. We present below three distinct metrics from the literature that try to capture models' diversity.</p><p>Ortega et al. <ref type="bibr" target="#b9">[10]</ref> provide a metric of diversity for different loss functions like the 0/1 loss, cross-entropy loss, and squared loss. As we focus on the classification problem, we use the 0/1 and cross-entropy losses, whose formulas are presented below:</p><formula xml:id="formula_0">D_{0/1}(\rho) = \mathbb{E}_{\nu}\left[\mathbb{V}_{\rho}\left(\mathbb{1}\left(h_W(x;\theta) \neq y\right)\right)\right] \qquad D_{ce}(\rho) = \mathbb{E}_{\nu}\left[\mathbb{V}_{\rho}\left(\frac{p(y \mid x, \theta)}{\sqrt{2}\,\max_{\theta} p(y \mid x, \theta)}\right)\right]</formula><p>where 𝔼_𝜈 denotes the expected value over the whole data-generating distribution 𝜈 (approximated using a dataset) and 𝕍_𝜌 denotes the variance of the predictions of the models that form the ensemble. The formulas are derived from a loss analysis of each classifier and their ensemble, in which the diversity upper-bounds the difference between the averaged loss of the classifiers and the loss of their ensemble. In summary, these metrics measure how diverse the predictions of the models are on a dataset by calculating the variance of the predictions, averaged over all data points. In the case of CE diversity, the predictions are additionally scaled to the [0,1] range.</p><p>From our perspective, the CE loss diversity should be the more interesting one, as we ensemble models by averaging their predictions; however, it is more complex than the 0/1 diversity, and we eventually evaluate models using accuracy, which binarizes their outputs into correct and incorrect classifications. At the same time, the CE loss diversity can provide more information, e.g. 
in the case when both models classify identically but assign different probabilities.</p><p>Error consistency <ref type="bibr" target="#b8">[9]</ref> is a metric measuring how much the errors of two classifiers coincide. It calculates the number of items classified either correctly or incorrectly by both models and compares it to the expected rate of equal responses if the two models were statistically independent. The exact formula is as follows:</p><formula xml:id="formula_1">\kappa = \frac{c_{obs} - c_{exp}}{1 - c_{exp}}</formula><p>where c_obs stands for the fraction of equal classifications (either both correct or both incorrect) and c_exp is the expected rate of equal responses, calculated from the models' accuracies: c_exp = acc1 · acc2 + (1 − acc1)(1 − acc2). This metric can only compare two models, in contrast to the diversity metrics, which do not have such a restriction.</p></div>
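The metrics above reduce to a few lines of array arithmetic. The following sketch is our illustration of the formulas, not the authors' code; it assumes prediction correctness is stored as boolean arrays of shape (n_models, n_samples) and true-class probabilities as float arrays of the same shape:

```python
import numpy as np

def zero_one_diversity(correct):
    """0/1 diversity: variance (over models) of the 0/1 error indicator
    1(h(x) != y), averaged over all data points.
    correct: bool array of shape (n_models, n_samples)."""
    errors = (~correct).astype(float)      # the 0/1 loss indicator
    return errors.var(axis=0).mean()       # V_rho over models, E_nu over data

def ce_diversity(probs_true):
    """CE diversity: variance (over models) of the true-class probability,
    scaled by sqrt(2) times the per-sample maximum over models.
    probs_true: (n_models, n_samples) probabilities assigned to the true class."""
    scaled = probs_true / (np.sqrt(2) * probs_true.max(axis=0, keepdims=True))
    return scaled.var(axis=0).mean()

def error_consistency(correct1, correct2):
    """Cohen's-kappa-style error consistency of two classifiers."""
    acc1, acc2 = correct1.mean(), correct2.mean()
    c_obs = (correct1 == correct2).mean()              # fraction of equal outcomes
    c_exp = acc1 * acc2 + (1 - acc1) * (1 - acc2)      # expected under independence
    return (c_obs - c_exp) / (1 - c_exp)
```

For two classifiers with identical outputs, error consistency is 1; for statistically independent ones it tends to 0, matching the normalization in the formula above.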
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>Model selection and setup. We have chosen the best-performing models available to us at the time of conducting the research, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k. Following the arguments raised in the previous section regarding the size of the pretraining dataset, this choice has the best potential to perform robustly on ImageNet-C <ref type="bibr" target="#b15">[16]</ref>, which we use to compare the architectures. ImageNet-C is a dataset created by artificially applying various corruptions (blurs, noises, digital corruptions, and weather conditions), each with several severity levels, to the ImageNet (ILSVRC2012) validation set. The models are listed below; ensembles are created by averaging the softmax outputs of two models. We use only two models at a time to observe how ensembles of different architectures perform compared to the single models that compose them; using more models in an ensemble would also prevent us from using the error consistency metric. We choose averaging of softmax outputs as it is the simplest way of building ensembles. While it has its disadvantages (e.g. models are calibrated differently, and overconfident ones can dominate under-confident ones with their predictions), we choose it for its simplicity, leaving potential improvements to future work.</p></div>
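The ensembling step described above can be sketched as follows. This is a minimal illustration of softmax averaging, not the authors' implementation; it assumes each model's raw outputs are available as logit arrays of shape (n_samples, n_classes):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average the softmax outputs of two models and take the argmax class."""
    probs = (softmax(logits_a) + softmax(logits_b)) / 2.0
    return probs.argmax(axis=-1)
```

Because the probabilities are averaged before the argmax, a confident model can outvote an uncertain one, which is exactly the calibration caveat mentioned above.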
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Vision Transformers:</head><p>• Vision Transformer B/8 (86M parameters) <ref type="foot" target="#foot_0">1</ref>• Vision Transformer L/16 (307M parameters)<ref type="foot" target="#foot_1">2</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Convolutional Neural Networks [17]:</head><p>• ConvNeXt-Base (89M parameters) <ref type="foot" target="#foot_2">3</ref>• ConvNeXt-XLarge (350M parameters)<ref type="foot" target="#foot_3">4</ref> MLP-Mixers:</p><p>• MLP-Mixer B/16 (59M parameters) <ref type="foot" target="#foot_4">5</ref>• MLP-Mixer L/16 (207M parameters)<ref type="foot" target="#foot_5">6</ref> Using six distinct models allows us to create 30 different ensembles for the experiments. We do not create the ensemble of a model with itself.</p><p>To compare the models, apart from the diversity metrics, we use Top-10 accuracy and the retention metric <ref type="bibr" target="#b17">[18]</ref> (the accuracy on the corrupted dataset divided by the accuracy on the original data). We picked Top-10 accuracy to smooth out the achieved scores, as some ImageNet images contain multiple objects of different classes, which introduces variance into the accuracy measurement. In Figure <ref type="figure" target="#fig_0">1</ref>, solid lines represent the retention of specific architectures (the mean of the two models of that architecture), while dashed lines show the retention of the different ensembles (also averaged over all ensembles of each kind). We clearly see that MLP-Mixers perform significantly worse than ViTs and CNNs. However, when MLP-Mixers are combined with ViTs or CNNs, the ensembles (brown and grey dashed lines) perform only slightly worse than single ViT or CNN models, respectively. Looking at the top-performing ensembles, ViT+CNN ensembles are followed by pure CNN and pure ViT ensembles. This suggests that mixing different architectures is beneficial for their robustness. The next experiments support these two hypotheses with more concrete examples and results. Figure <ref type="figure" target="#fig_1">2</ref> presents accuracy, diversity metrics, and error consistency calculated on the original ImageNet data. 
Each cell represents the metric value scored by the ensemble created from the models in the corresponding column and row.</p><p>On the diagonal, we have the scores of the single models. The last, non-triangular plot, called 0-1 Diversity components (the 0-1 Diversity is calculated by averaging the two values from this plot located symmetrically about the diagonal), presents the fraction of images that are classified correctly by one model (the row model) and incorrectly by the other (the column model).</p><p>Starting with the accuracy plot, we see that the best-performing model is ConvNeXt-XLarge, followed by ViT Base, ViT Large, MLP-Mixer Base, and MLP-Mixer Large. In the case of ViTs and MLP-Mixers, the smaller models perform better than their bigger counterparts; this might be an artifact of insufficient training. Regarding their ensembles, it is not surprising that the best accuracy is achieved by the ensemble of the best-performing models (ViT-B and ConvNeXt-XL). We also observe that ensemble performance deteriorates only slightly when one of its components (e.g. MLP-Mixer Large) performs significantly worse than the other.</p><p>When we analyze all the diversity metrics, we see that MLP-Mixers stand out from the other models, especially the Large one, which is caused by their much lower accuracy compared to the other models. The Gaussian blur corruption favors the Vision Transformers, as ViTs perform better than their CNN and MLP-Mixer counterparts. This time, however, the best-performing model is ViT-Large instead of Base, which suggests that while its training was not sufficient for it to outperform the smaller model, it was sufficient for it to learn to perform robustly (ViT-L is roughly three times the size of ViT-B).</p><p>Looking at the metrics, the highest (or, in the case of error consistency, the lowest) values belong to the MLP-Mixers, which perform poorly in comparison to ViTs and CNNs, so we may expect that this diversity comes mostly from their misclassifications. 
We see this in the 0-1 diversity components, which show that Mixers classify around 30-40% of images incorrectly in contrast to the other models. Regarding the ViT and CNN ensembles, pure CNN ensembles are less diverse than ensembles of ViTs and CNNs or pure ViT ensembles. If we focus on the ConvNeXt-B+XL ensemble and compare it to ConvNeXt-B+ViT-B, we see that the latter performs slightly better, even though ViT-B is less accurate than ConvNeXt-XL. While it is not the most diverse CNN-ViT pair, it is, according to all metrics, more diverse than the pure CNN ensemble. Another interesting comparison is ViT-L+ViT-B vs. ViT-L+ConvNeXt-B: substituting the Base ViT with a worse-performing CNN creates an ensemble that is both better performing and more diverse.</p><p>Regarding the contrast corruption in Figure <ref type="figure" target="#fig_4">4</ref>, CNNs dominate in performance with only a modest drop in accuracy, while the other models perform much worse, especially the Mixers. The highest diversity values are related to the worst-performing MLP-Mixers. At the same time, Mixers ensembled with CNNs perform similarly to ViT+CNN: even the worst-performing MLP-Mixer Base, which is almost 20 p.p. worse than ViT-L, performs marginally better when ensembled with ConvNeXt-XL, which we find intriguing.</p></div>
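The two evaluation metrics used in these experiments are straightforward to compute. The sketch below is an illustrative implementation under the definitions given in the text (array shapes are our assumption), not the authors' evaluation code:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=10):
    """Fraction of samples whose true label is among the k classes
    with the highest predicted probability.
    probs: (n_samples, n_classes), labels: (n_samples,)."""
    topk = np.argsort(probs, axis=-1)[:, -k:]   # indices of the k largest probabilities
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

def retention(acc_corrupted, acc_clean):
    """Retention metric [18]: accuracy on the corrupted dataset
    divided by the accuracy on the original data."""
    return acc_corrupted / acc_clean
```

A retention of 1.0 means the model loses no accuracy under corruption; the retention curves in Figure 1 plot this ratio against severity.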
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>While our approach of combining the inherent diversity across models through an ensemble is simple, it manages to capture the synergy that arises from the use of different architectures. The ViT+CNN ensemble has proven to perform better than the other combinations not only on average, but also to perform satisfactorily regardless of the corruption type.</p><p>The diversity metrics and error consistency provide valuable quantitative tools to compare models and quantify the differences in their classifications. However, they only let us understand the relationships between models on specific inputs. Unfortunately, these metrics may be deceiving in the case of two models where one performs significantly worse than the other: high diversity does not necessarily translate into improved ensemble performance, which might seem counter-intuitive. The metrics capture how diversely the models classify, not the potential of their ensemble. These two objectives coincide when the models achieve similar accuracy, while a discrepancy in accuracy causes them to misalign. This behavior requires a careful analysis of the metrics on every corruption separately.</p><p>We list several possible extensions of our work. The first is improving the diversity metrics into metrics that assess ensemble potential. Secondly, our research was limited to three architectures; while the results look promising, fully evaluating and quantifying how an ensemble aggregates the robustness of various models requires more experiments, involving more models of different architectures, pretrained on different datasets, and of different sizes. Another direction is to improve the ensemble technique. The potential improvements span from a weighted ensemble that would average the models, e.g. 
based on their individual performance, to a mixture of experts that could predict which model will perform better on a given input, and thus precisely leverage the advantages of each particular model to tackle particular corruptions. Such a mixture-of-experts solution would also be viable in a resource-constrained environment, where running multiple models simultaneously may be unacceptable. The last extension is to continue this research for more complex problems like object detection and image segmentation; we would need to define diversity metrics for these problems and then investigate the quality of ensembles created using different neural architectures.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Retention curves with regard to severity, averaged over all ImageNet-C corruptions</figDesc><graphic coords="4,96.20,401.29,187.51,145.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Metrics performance on original data</figDesc><graphic coords="4,335.01,518.96,134.50,135.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Metrics performance on Gaussian Blur 5 corruption</figDesc><graphic coords="5,92.48,219.15,168.12,169.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Metrics performance on Contrast 4 corruption</figDesc><graphic coords="6,92.48,219.15,168.12,169.80" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">available here: https://storage.googleapis.com/vit_models/ augreg/B_8-i21k-300ep-lr_0.001-aug_medium2-wd_0.1-do_0. 0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://storage.googleapis.com/vit_models/augreg/L_ 16-i21k-300ep-lr_0.001-aug_medium2-wd_0.03-do_0.1-sd_0. 1--imagenet2012-steps_20k-lr_0.01-res_224</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://tfhub.dev/sayakpaul/convnext_base_21k_1k_224/1</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://tfhub.dev/sayakpaul/convnext_xlarge_21k_1k_224/1</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://tfhub.dev/sayakpaul/mixer_b16_i21k_classification/1</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://tfhub.dev/sayakpaul/mixer_l16_i21k_classification/1</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m">International Standards Organization, ISO 26262: Road vehicles -functional safety, parts 1 to 11</title>
				<imprint>
			<biblScope unit="page" from="2018" to="2030" />
		</imprint>
	</monogr>
	<note>Road Vehicles -Functional Safety, Second Edition</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m">International Standards Organization, ISO/PAS 21448: Road vehicles -safety of the intended functionality, in: Road Vehicles -Safety of the intended functionality</title>
				<imprint>
			<biblScope unit="page" from="2019" to="2020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Deep learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">nature</title>
		<imprint>
			<biblScope unit="volume">521</biblScope>
			<biblScope unit="page">436</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">O</forename><surname>Tolstikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Keysers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2021/file/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Paper.pdf" />
		<title level="m">Mlpmixer: An all-mlp architecture for vision</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Ensemble learning: A survey</title>
		<author>
			<persName><forename type="first">O</forename><surname>Sagi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rokach</surname></persName>
		</author>
		<idno type="DOI">10.1002/widm.1249</idno>
		<ptr target="https://doi.org/10.1002/widm.1249.doi:10.1002/widm.1249" />
	</analytic>
	<monogr>
		<title level="j">Wiley Interdiscip. Rev. Data Min. Knowl. Discov</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Simple and scalable predictive uncertainty estimation using deep ensembles</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pritzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Blundell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">V N</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-12">December 2017. 2017</date>
			<biblScope unit="page" from="6402" to="6413" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Beyond accuracy: quantifying trial-bytrial behaviour of cnns and humans by measuring error consistency</title>
		<author>
			<persName><forename type="first">R</forename><surname>Geirhos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Meding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Wichmann</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/file/9f6992966d4c363ea0162a056cb45fe5-Paper.pdf" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Diversity and generalization in neural network ensembles</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cabañas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Masegosa</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2110.13786</idno>
		<ptr target="https://arxiv.org/abs/2110.13786.doi:10.48550/ARXIV.2110.13786" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Bhojanapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Glasner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veit</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2103.14586.arXiv:2103.14586" />
		<title level="m">Understanding robustness of transformers for image classification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Naseer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hayat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-H</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.10497</idno>
		<title level="m">Intriguing properties of vision transformers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Do vision transformers see like convolutional neural networks?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.08810</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Codella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2103.15808.arXiv:2103.15808" />
		<title level="m">Cvt: Introducing convolutions to vision transformers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<title level="m">Swin transformer: Hierarchical vision transformer using shifted windows</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Benchmarking neural network robustness to common corruptions and perturbations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dietterich</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=HJz6tiCqYm" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">A convnet for the 2020s</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Feichtenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2201.03545.arXiv:2201.03545" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Understanding the robustness in vision transformers</title>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anandkumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Alvarez</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2204.12451</idno>
		<ptr target="https://arxiv.org/abs/2204.12451.doi:10.48550/ARXIV.2204.12451" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
