<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fine-Grained ImageNet Classification in the Wild</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Maria</forename><surname>Lymperaiou</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical and Computer Engineering</orgName>
								<orgName type="laboratory">AILS Lab</orgName>
								<orgName type="institution">National Technical University of Athens</orgName>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Konstantinos</forename><surname>Thomas</surname></persName>
							<email>kthomas@islab.ntua.gr</email>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical and Computer Engineering</orgName>
								<orgName type="laboratory">AILS Lab</orgName>
								<orgName type="institution">National Technical University of Athens</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giorgos</forename><surname>Stamou</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical and Computer Engineering</orgName>
								<orgName type="laboratory">AILS Lab</orgName>
								<orgName type="institution">National Technical University of Athens</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Fine-Grained ImageNet Classification in the Wild</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">30BE68D1E314FEB36C4C90E2FEA1B76A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-12-29T08:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Image Classification</term>
					<term>Knowledge Graphs</term>
					<term>Robustness</term>
					<term>Explainable Evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Image classification has been one of the most popular tasks in Deep Learning, seeing an abundance of impressive implementations each year. However, there is a lot of criticism tied to promoting complex architectures that continuously push performance metrics higher and higher. Robustness tests can uncover several vulnerabilities and biases which go unnoticed during the typical model evaluation stage. So far, model robustness under distribution shifts has mainly been examined within carefully curated datasets. Nevertheless, such approaches do not test the real response of classifiers in the wild, e.g. when uncurated web-crawled image data of corresponding classes are provided. In our work, we perform fine-grained classification on closely related categories, which are identified with the help of hierarchical knowledge. Extensive experimentation on a variety of convolutional and transformer-based architectures reveals model robustness in this novel setting. Finally, hierarchical knowledge is again employed to evaluate and explain misclassifications, providing an information-rich evaluation scheme adaptable to any classifier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>ImageNet <ref type="bibr" target="#b0">[1]</ref> has been one of the most popular image classification datasets in the literature, inspiring a variety of popular model implementations for over a decade. The first significant breakthrough in ImageNet classification was marked by AlexNet <ref type="bibr" target="#b1">[2]</ref>, a convolutional neural network (CNN) for image classification that greatly outperformed its competitors. Ever since, various CNN-based implementations have continued pushing accuracy scores even higher <ref type="bibr" target="#b2">[3]</ref>.</p><p>The local nature of convolutional filters, which cannot capture long-range visual dependencies, was suspected to hinder further improvements in performance, demanding the exploration of alternative architectural choices. To this end, attention mechanisms that have successfully served Natural Language Processing <ref type="bibr" target="#b3">[4]</ref> appear as a promising substitute for convolutions, as they are able to detect spatially distant concepts and assign appropriate importance weights to them. Indeed, the adaptation of the Transformer <ref type="bibr" target="#b3">[4]</ref> for visual tasks led to the introduction of the Vision Transformer (ViT) <ref type="bibr" target="#b4">[5]</ref>, which divides the image into visual patches and processes them similarly to how the original Transformer handles words. 
Consequently, transformer-based image classifiers emerged <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>, reaching unprecedented state-of-the-art results.</p><p>Even though substantial effort is invested in perpetually improving model performance through ever more refined architectures and techniques, inevitably increasing the computational resources required for training, open questions remain regarding the ability of such models to properly handle distribution shifts. Distribution shifts refer to testing an already trained model on a data distribution that diverges from the one the model was trained on. The analysis of distribution shifts has gained interest in recent years <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref>, as a crucial step towards enhancing model robustness. Most of these endeavors apply pixel-level perturbations to artificially influence the distribution under investigation. Nevertheless, the highly constrained setting of artificial distribution shifts excludes various real-world scenarios, impeding robust generalization of image classifiers. In this case, natural shifts <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref> are more representative. They usually require the creation of a curated dataset containing image variations such as changes in viewpoint or object background, rotations, and other minor changes. 
Both synthetic and natural shifts can serve as data augmentation techniques, which aid the development of robust models when incorporated during training <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref>.</p><p>So far, there is no approach testing image classification 'in the wild', where uncurated images corresponding to pre-defined dataset labels are encountered. We argue that this is a real-world, user-oriented scenario, where entirely new images corresponding to ImageNet labels need to be appropriately classified. For example, an image of a cat found on the web may significantly differ from ImageNet cat instances, even when popular distribution shifts are taken into account. Even though a human can identify a cat present in an image with satisfactory confidence, we question whether an image classifier can do so; the unrestricted space of possible variations of uncurated images demands advanced generalization capabilities to properly understand the real discriminative characteristics of an ImageNet class without getting distracted by extraneous features.</p><p>The problem of classification 'in the wild' becomes even more difficult when fine-grained classification needs to be performed, as distinguishing between closely related categories relies on detailed discriminative characteristics, which may be less prevalent in uncurated settings. For example, siamese and persian cat breeds present many visual similarities, increasing the potential risk of learning and reproducing dataset biases, especially when distribution shifts are present. We can attribute this risk to the fact that existing classifiers lack external or domain knowledge, which can help humans discriminate between closely related categories.</p><p>To sum up, in our current paper we aspire to answer the following questions:</p><p>1. 
How do different models, pre-trained on ImageNet or web images, behave on uncurated image sets crawled from Google images (given ImageNet labels as Google queries)? We target this question by producing a novel natural distribution shift based on uncurated web images upon which we evaluate various image classifiers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">How does hierarchical knowledge help with evaluating classification results, since several ImageNet categories are hierarchically related?</head><p>We attempt to verify to what extent the assumption that the lack of external knowledge limits the generalization capabilities of classifiers holds. Thus, we leverage WordNet <ref type="bibr" target="#b22">[23]</ref> to discover neighbors of given terms and test whether classifiers struggle with discriminating between closely related classes. 3. Can evaluation of classification be explainable? Knowledge sources such as WordNet can reveal the semantic relationships between concepts (ImageNet classes), providing possible paths connecting frequently confused classes.</p><p>Our code can be found at https://github.com/marialymperaiou/classification-in-the-wild.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Image classifiers With the outburst of neural architectures for classification tasks, Computer Vision has been one of the fields that benefited most from recent developments. Convolutional neural network (CNN) classifiers are a well-established backbone, with the first successful endeavors <ref type="bibr" target="#b1">[2]</ref> already paving the way for more refined architectures, such as VGG <ref type="bibr" target="#b23">[24]</ref>, Inception <ref type="bibr" target="#b24">[25]</ref>, ResNet <ref type="bibr" target="#b25">[26]</ref>, Xception <ref type="bibr" target="#b26">[27]</ref>, InceptionResnet <ref type="bibr" target="#b27">[28]</ref> and others <ref type="bibr" target="#b2">[3]</ref>. There is some criticism around the usage of CNNs for image classification, even though contemporary endeavors such as ConvNeXt <ref type="bibr" target="#b28">[29]</ref> revisit the classic paradigm and achieve strong performance. The rapid advancements that the Transformer framework <ref type="bibr" target="#b3">[4]</ref> brought via the usage of self-attention mechanisms, widely replacing prior architectures for Natural Language Processing applications, inspired the usage of similar models for Computer Vision as an answer to the aforementioned criticism <ref type="bibr" target="#b29">[30]</ref>. Thus, Vision Transformers (ViTs) <ref type="bibr" target="#b4">[5]</ref>, built upon <ref type="bibr" target="#b3">[4]</ref>, set a new baseline in the literature; ever since, several related architectures have emerged. In general, transformer-based models rely on an abundance of training data to ensure proper generalization. This requirement was relaxed in DeiT <ref type="bibr" target="#b30">[31]</ref>, enabling learning on medium-sized datasets. 
Further development introduced novel transformer-based architectures, such as BeiT <ref type="bibr" target="#b8">[9]</ref>, Swin <ref type="bibr" target="#b31">[32]</ref> and RegNets <ref type="bibr" target="#b32">[33]</ref>, which introduce specific refinements to boost performance. Overall, it has been shown that ViTs are more robust than classic CNN image classifiers <ref type="bibr" target="#b33">[34]</ref>. In our work, we verify the degree to which this claim holds by testing CNN and transformer-based classifiers in the uncurated fine-grained setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Robustness under distribution shifts</head><p>The generalization capabilities of existing image classifiers have been a crucial problem <ref type="bibr" target="#b34">[35]</ref>, currently addressed from a few different viewpoints.</p><p>Artificial corruptions <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b36">37,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b10">11]</ref> or natural shifts <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b37">38]</ref> on curated data have already exposed biases and architectural vulnerabilities. Adversarial robustness <ref type="bibr" target="#b38">[39,</ref><ref type="bibr" target="#b39">40,</ref><ref type="bibr" target="#b40">41,</ref><ref type="bibr" target="#b41">42,</ref><ref type="bibr" target="#b42">43]</ref> is a related field where models are tested against adversarial examples, which introduce imperceptible though influential perturbations on images. Contrary to such attempts, we concentrate on naturally occurring distribution shifts stemming from uncurated image data. Regarding architectural choices, many studies perform robustness tests attempting to resolve the CNN vs Transformer contest <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b43">44,</ref><ref type="bibr" target="#b44">45]</ref>, while other ventures focus on interpreting and understanding model robustness <ref type="bibr" target="#b45">[46,</ref><ref type="bibr" target="#b46">47,</ref><ref type="bibr" target="#b47">48]</ref>. In our approach, by experimenting with both CNN and transformer-based architectures, we adapt such research efforts to the uncurated setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>The general workflow of our method (Figure <ref type="figure" target="#fig_0">1</ref>) consists of three stages. First, the dataset is constructed by gathering common terms (queries) and their subcategories which exist as ImageNet classes. Images corresponding to those terms are crawled from Google search. In the second stage, various pre-trained classifiers are utilized to classify the crawled images. The hierarchical relationships between the given classes are reported to enrich the evaluation process. Finally, all semantic relationships between misclassified samples are gathered to extract explanations and quantify how much falsely predicted classes diverge from their ground truth. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset creation</head><p>We start by gathering user-defined common words regarding visual concepts as queries, which will act as starting points towards extracting subcategories. The WordNet hierarchy <ref type="bibr" target="#b22">[23]</ref> is used to provide the subcategories, via the hypernym-hyponym (IsA) relationships, which refer to more general or more specific concepts respectively. For example, given the query 'car', its hypernym is 'motor vehicle' ('car' IsA 'motor vehicle'), while its hyponyms are 'limousine' ('limousine' IsA 'car'), 'sports car' ('sports car' IsA 'car') and other specific car types. Therefore, we map queries on WordNet to obtain all their immediate hyponyms, constructing a hyponyms set 𝐻. We then filter out any hyponyms not belonging to ImageNet class labels.</p><p>The filtered categories of 𝐻, together with the initial query, are provided as search terms to a web crawler suitable for searching Google images. We set a predefined threshold 𝑘 for the number of Google images returned, so that we evaluate classifiers on categories containing almost equal numbers of samples. This is necessary since some popular categories may return far more Google images than others. We experiment with several values of 𝑘, thus influencing the tradeoff between relevance to the keyword and adequate dataset size. The retrieved images comprise a labeled dataset 𝐷, with the keywords as labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Classification</head><p>We consider a variety of image classifiers to test their ability for fine-grained classification on uncurated web images. We commence our experimentation with convolution-based models as baselines, which have generally been considered less robust against distribution shifts and other perturbations <ref type="bibr" target="#b33">[34]</ref>, and we proceed with recent transformer-based architectures. We perform no further training or fine-tuning on the selected models.</p><p>For each model, we perform inference on the crawled images that constitute our dataset, as explained in the previous paragraph. We implement a rich evaluation scheme to capture various insights into the classification process. Accuracy is useful as a benchmark metric to compare our findings with expected classification results. WordNet similarity functions offer valuable information about misclassifications; for example, suppose that the true label of a sample is 'cat' and the classifier predicts the label 'dog' in one case and the label 'airplane' in another. Intuitively, we hypothesize that a 'cat' is more closely related to a 'dog' than to an 'airplane', since cats and dogs are both animals. This human intuition is reflected in the WordNet hierarchy, thus assigning a different penalty depending on the concept relevance within the hierarchy.</p><p>This concept-based evaluation can be realized using the following WordNet functions: path similarity, Leacock-Chodorow Similarity (LCS), and Wu-Palmer Similarity (WUPS). Path similarity evaluates how similar two concepts are, based on the shortest path that connects them within the WordNet hierarchy. It provides values between 0 and 1, with 1 denoting the maximum possible similarity score. LCS also uses the shortest path between two concepts but additionally considers the depth of the taxonomy. 
Specifically, equation 1 mathematically describes LCS between two concepts 𝑐_1 and 𝑐_2:</p><formula xml:id="formula_0">𝐿𝐶𝑆 = − log(𝑝𝑎𝑡ℎ(𝑐_1, 𝑐_2) / (2 • 𝑑))<label>(1)</label></formula><p>where 𝑝𝑎𝑡ℎ(𝑐_1, 𝑐_2) denotes the shortest path connecting 𝑐_1 and 𝑐_2, and 𝑑 refers to the taxonomy depth. Higher LCS values indicate higher similarity between concepts. WUPS takes into account the depth at which the two concepts 𝑐_1 and 𝑐_2 appear in the WordNet taxonomy, as well as the depth of their most specific common ancestor node, called the Least Common Subsumer. Higher WUPS scores refer to more similar concepts. For each of the path similarity, LCS, and WUPS metrics we obtain an average value over the total number of samples of the constructed dataset 𝐷. Moreover, we report the percentage of sibling concepts among misclassifications. Two concepts are considered to be siblings if they share an immediate (1-hop) parent. For example, the concepts 'tabby cat' and 'egyptian cat' share the same parent node ('domestic cat'). It is highly likely that a classifier is more easily confused between two sibling classes, thus providing false positive (FP) predictions closely related to the ground truth (GT) label. Therefore, a lower sibling percentage indicates that a model's misclassifications are less closely related to the ground truth than those of a model with a higher sibling percentage.</p><p>Explanations are provided during the evaluation stage, aiming to answer why a pre-trained classifier cannot correctly classify uncurated images belonging to a class 𝑐.</p><p>FP predictions contain valuable information regarding which classes are confused with the GT. The per-class misclassification frequency (MF) refers to the percentage of occurrences of each false positive class 𝑓 within the total number of false positive instances. 
Thus, given a dataset with 𝑁 classes, 𝑐 as the ground truth class and 𝑓 as one of the false positive classes, the misclassification frequency for the 𝑐 → 𝑓 misclassification is:</p><formula xml:id="formula_1">𝑀𝐹_𝑐 = (𝐹𝑃_{𝑖=𝑓} / ∑_{𝑖=0}^{𝑁} 𝐹𝑃_𝑖) • 100%<label>(2)</label></formula><p>𝑀𝐹 scores can be extracted for all FP classes 𝑓 ≠ 𝑐, so that the most influential misclassifications are discovered. Higher 𝑀𝐹 scores denote a tendency of the classifier to choose the FP class over the GT one, therefore indicating either a classifier bias or an annotation error in the dataset. Specifically, a classifier bias refers to consistently classifying samples from class 𝑐 as samples of class 𝑓, given that the annotation is the best possible. Of course, such a requirement cannot always be satisfied, especially when expert annotators are needed, as may happen in the case of fine-grained classification. On the other hand, since our explainable evaluation approach is able to capture such misclassification patterns, it is not necessary to attribute the source of misclassification beforehand. Human annotators can be employed at a later stage, identifying and verifying the source of misclassifications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>In all following experiments, we selected a threshold of 𝑘 = 50 crawled images per class. We present results on a single initial query as a proof of concept to demonstrate our findings. To this end, we provide the query 'cat', which returns the following WordNet hyponyms (also corresponding to ImageNet labels):</p><p>𝐻={'angora cat', 'cougar cat', 'egyptian cat', 'leopard cat', 'lynx cat', 'persian cat', 'siamese cat', 'tabby cat', 'tiger cat'}</p><p>The same experimentation can be replicated for other selected queries, as long as they can be mapped on WordNet.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Convolutional classifiers</head><p>We leveraged the following CNN classifiers: VGG16/19 <ref type="bibr" target="#b23">[24]</ref>, ResNet50/101/152 <ref type="bibr" target="#b25">[26]</ref>, InceptionV3 <ref type="bibr" target="#b24">[25]</ref>, InceptionResnetV2 <ref type="bibr" target="#b27">[28]</ref>, Xception <ref type="bibr" target="#b26">[27]</ref>, MobileNetV2 <ref type="bibr" target="#b48">[49]</ref>, NasNet-Large <ref type="bibr" target="#b49">[50]</ref>, DenseNet121/169/201 <ref type="bibr" target="#b50">[51]</ref>, EfficientNet-B7 <ref type="bibr" target="#b51">[52]</ref>, ConvNeXt <ref type="bibr" target="#b28">[29]</ref>. We present results for CNN classifiers in Table <ref type="table" target="#tab_0">1</ref>. Bold instances denote lower accuracy than the best ImageNet accuracy of each model, as reported by the respective authors <ref type="foot" target="#foot_0">1</ref>. Underlined cells indicate the best accuracy/sibling percentage scores for each category. The absence of models or keywords from Table <ref type="table" target="#tab_0">1</ref> means that they correspond to zero accuracy scores. For example, we observe the complete absence of models such as InceptionV3, InceptionResNetV2, Xception, NASNetLarge and DenseNet121/169/201, meaning that they are completely unable to properly classify the crawled images, even those belonging to categories that show satisfactory accuracy when other classifiers are deployed. MobileNetV2 also shows deteriorated performance for all categories. 
We will investigate later whether hierarchical knowledge can help extract any meaningful information regarding this surprisingly low performance.</p><p>Another observation from Table <ref type="table" target="#tab_0">1</ref> is that some categories can be easily classified ('siamese cat', 'lynx cat', 'cougar cat', 'persian cat', 'cat'), contrary to others ('tabby cat', 'tiger cat', 'egyptian cat', 'leopard cat', 'angora cat'). Since we have no specific knowledge of animal species, we will once again leverage WordNet to obtain explanations regarding this behavior. Sibling percentages offer a first glance at the degree of confusion between similar classes in the fine-grained setting. For example, even though the 'siamese cat' and 'cougar cat' classes demonstrate high accuracy scores, we observe a completely different behavior regarding the sibling percentages: most CNN classifiers return some sibling false positives for the 'siamese cat' ground truth label, while the opposite happens for the 'cougar cat' ground truth label, which mostly receives zero sibling misclassifications. This behavior indicates that for 'siamese cat', if a sample is misclassified, it is likely that it belongs to a conceptually similar class, while for 'cougar cat' misclassifications, false positives belong to more semantically distant categories.</p><p>Regarding model capabilities, we observe that for both the 'siamese' and 'cougar cat' classes, all ResNet50 false positives belong to non-sibling classes, contrary to EfficientNet false positives, which all belong to sibling classes. By also looking at other categories, we observe that in general, EfficientNet achieves a higher sibling percentage than ResNet50, meaning that EfficientNet misclassifications are more justified than ResNet50 misclassifications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Transformer-based classifiers</head><p>The following transformer-based image classifiers were used: ViT <ref type="bibr" target="#b4">[5]</ref>, Regnet-x <ref type="bibr" target="#b32">[33]</ref>, DeiT <ref type="bibr" target="#b30">[31]</ref>, BeiT <ref type="bibr" target="#b8">[9]</ref>, CLIP <ref type="bibr" target="#b52">[53]</ref>, Swin Transformer V2 <ref type="bibr" target="#b31">[32]</ref>. Results for transformer-based classifiers are provided in Table <ref type="table" target="#tab_1">2</ref>. We spot a similar pattern regarding the categories upon which models struggle to make predictions: instances belonging to the 'tabby cat', 'tiger cat' and 'egyptian cat' categories are classified with low accuracy compared to 'siamese cat', 'lynx cat', 'cougar cat', 'persian cat', 'cat', 'angora cat' and 'leopard cat'. We suspect that there is a common reason behind this behavior, probably attributed to unavoidable inter-class similarities present in the fine-grained classification setting.</p><p>As for model performance, we examine sibling percentage apart from exclusively evaluating accuracy. The behavior of transformer-based models regarding sibling misclassification is harder to interpret than that of CNN models, because models that return high sibling percentages for some categories may present low sibling percentages on other categories and vice versa. For example, BeiT scores low on sibling percentages for 'tabby cat' (3.45%), 'siamese cat' (0%) and 'persian cat' (10%) compared to other models for the same classes; on the other hand, it returns the best sibling scores for 'leopard cat' (78.72%), 'tiger cat' (22.45%) and 'egyptian cat' (22.50%). Further explainability results are provided in Section 4.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Explaining misconceptions</head><p>In Tables <ref type="table" target="#tab_3">3, 4</ref> &amp; 5 we report the top-3 misclassifications per ground truth (GT) category and per model, as well as the misclassification frequency (MF) for each false positive (FP) label. The GT column refers to cat species exclusively, even if the word 'cat' is omitted (for example, the 'tiger' GT entry refers to 'tiger cat'). We highlight in red irrelevant FP classes, which are semantically distant from the GT label, while misconceptions involving sibling classes are highlighted in blue. Moreover, magenta indicates that an FP is actually an immediate (1-hop) hypernym of the GT. Due to space constraints, we present here all transformer-based models, but only a subset of the CNN models tested in total; more results can be found in the Appendix. Interestingly, we can spot some surprising frequent misconceptions, such as confusing cat species with the 'mexican hairless' dog breed. For CNN classifiers, we spot this peculiarity for all models under investigation: 10.53% of ResNet50 FPs for the 'egyptian cat' GT label belong to the 'mexican hairless' class; the same applies to 14.29% of ResNet101 FPs, 18.18% of ResNet152 FPs and 8.33% of VGG16 FPs. More animals such as 'wallaby', 'jaguar', 'sea lion', 'cheetah', 'arctic fox', 'coyote' etc. appear as frequent FPs.</p><p>For transformer models, the 'egyptian cat' → 'mexican hairless' abnormality is observed for all classifiers when the 'egyptian cat' GT label is provided, resulting in the following 'mexican hairless' FP percentages: 26.67% for CLIP, 10% for BeiT, 15.62% for DeiT, 15.38% for xRegNet, 20.83% for Swin, and 16.33% for ViT. Evidently, regardless of whether a CNN or a transformer classifier is used, images of 'egyptian cats' are often erroneously perceived as 'mexican hairless dogs'. 
A qualitative comparison between 'egyptian cat' images and 'mexican hairless dog' images indicates that these animals are clearly distinct, even though they share similar ear shapes and rather hairless, thin bodies. Therefore, we can assume that the transformer-based classifiers are biased towards texture, verifying relevant observations reported for CNNs <ref type="bibr" target="#b10">[11]</ref>. Also, ear shape acts as a confounding factor, overshadowing other actually distinct animal characteristics. There are more misclassifications involving animals, such as 'armadillo', 'chihuahua', 'soft-coated wheaten terrier', 'kelpie', and others.</p><p>Even more surprising are misclassifications not involving animal species. For example, CNN classifiers predict 'web site' instead of 'tabby cat', 'hatched' instead of 'persian cat', 'barbershop' instead of 'cat', 'menu' instead of 'cougar', etc. All of ResNet50/101/152 and VGG16 make at least one such misclassification, which strongly calls into question which features of cat species contribute to such predictions.</p><p>Misclassifications involving non-animal classes using transformers (Tables <ref type="table" target="#tab_4">4, 5</ref>) provide the following interesting abnormalities: 'cat' is classified as 'fur coat' for 50% of the FP instances using DeiT. This non-negligible misclassification rate once again verifies the aforementioned texture bias. In a similar sense, xRegNet classifies 'egyptian cat' images as 'mask' and as 'comic book', each for 7.69% of the FPs. Such categories had also appeared in CNN misclassifications. We cannot provide a human-interpretable explanation for the 'mask' misclassification, since the term 'mask' may refer to many different objects. We hypothesize that 'mask' ImageNet instances may contain carnival masks looking similar to cats, so the lack of context confused xRegNet. 
'Comic book' appears in 9.38% of the cases where an 'egyptian cat' image is misclassified by DeiT, 33.33% of the cases where a 'cat' photo is misclassified by xRegNet, and 16.67% of the cases where an 'egyptian cat' is misclassified by Swin. This can be attributed to the fact that crawled images may contain cartoon-like instances, which cannot be clearly regarded as cats. Other interesting misclassifications involving irrelevant categories are 'cat'→'washer' (25% of FPs using ViT), 'leopard cat'→'web site' (2.27% of FPs using ViT, 15% of FPs using DeiT), 'persian cat'→'plastic bag' (25% of FPs using ViT), 'cat'→'jersey' (25% of FPs using Swin), 'egyptian cat'→'table lamp' (8.33% of FPs using Swin), 'cat'→'tub' (33.33% of FPs using xRegNet), and others.</p><p>An interesting observation revolves around the 'egyptian cat' label. For CNN models, almost all top-3 FPs of the 'egyptian cat' GT label correspond to irrelevant ImageNet categories. On the contrary, 'tabby cat', 'angora cat', and 'tiger cat' present more sensible FPs, which usually involve sibling categories (highlighted in blue). As for transformer models, we observe that the 'egyptian cat' label is always confused with at least one irrelevant ImageNet category, while 'angora cat' is only confused with other cat species, and not with conceptually distant classes. Thus, 'egyptian cat' crawled images seem to contain some misleading visual features that frequently derail the classification process. Indeed, when viewing 'egyptian cat' crawled images, some of them are drawings or photos of cat souvenirs; however, misconceptions such as 'table lamp' or 'armadillo' cannot be visually explained by human inspectors, raising further questions on the topic. 
A comparison between CNN classifiers (Table <ref type="table" target="#tab_2">3</ref>) and transformer-based classifiers (Tables <ref type="table" target="#tab_3">4</ref>, 5) indicates that transformers are more capable of retrieving categories similar to the GT; this becomes obvious from the higher number of irrelevant misclassifications highlighted in red for CNNs, compared to the transformer results.</p><p>By combining Tables <ref type="table" target="#tab_3">3, 4</ref> &amp; 5 with Tables 1 &amp; 2, we obtain some very interesting findings: how are low classification metric scores connected to the relevance between misclassified categories? We start with categories presenting low accuracy scores ('tabby cat', 'tiger cat', 'egyptian cat'), and compare them with categories yielding frequent extraneous misclassifications ('egyptian cat' and 'cat', followed by 'tabby cat' and 'lynx'). Classifying 'egyptian cat' images both yields low classification scores and returns irrelevant false positives. On the other hand, even though 'cat' images present high accuracy scores, misclassifications are highly unrelated when they do happen. 'Tiger cat' scores low in accuracy; however, its misclassifications are rather justified, since other cat species are returned. Surprisingly, 'tiger cat' also scores low in siblings percentage, indicating that false positives are not immediately related to the GT 'tiger cat' class. In this case, we assume that the false positives ('egyptian cat', 'tabby cat', 'leopard cat', etc.) belong to more distant relatives of the 'tiger cat' concept class, even though they bear some similar features.</p><p>Overall, this analysis demonstrates that classification accuracy alone cannot reveal the whole truth behind the way classifiers behave; to this end, knowledge sources are able to shed some light on the inner workings of this process. 
By analyzing a constrained family of related ImageNet labels (cat species), we have already disentangled classification accuracy from classification relevance: false positives can be highly relevant to the ground truth (such as 'tiger cat' misclassifications) or not ('cat' misclassifications). We therefore argue that fine-grained classification also demands fine-grained evaluation, which can provide insightful information when driven by knowledge. The human-interpretable insights of Tables <ref type="table" target="#tab_4">4, 5</ref> are quantified and verified in the next Section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Knowledge-driven metrics</head><p>The aforementioned claim regarding the need for fine-grained evaluation is supported by results using knowledge-driven metrics based on conceptual distance, as provided by WordNet (Tables <ref type="table" target="#tab_6">6 &amp; 7)</ref>. Since higher path similarity, LCH, and WUPS scores are better, we denote the best (highest) scores for each category in bold. By comparing the path similarity, LCH, and WUPS metrics across categories, we observe that categories with a large number of irrelevant FPs (marked in red in Tables <ref type="table" target="#tab_4">4, 5</ref>), such as 'cougar cat' and 'lynx cat', followed by 'egyptian cat' and 'cat', also present low knowledge-driven metric scores in Tables 6, 7, as expected. Other categories, such as 'angora cat', 'leopard cat', and 'tiger cat', whose misclassifications involve related (sibling or parent) categories, present higher knowledge-driven metric scores. Therefore, we can safely assume that knowledge-driven metrics for evaluating fine-grained classification results correlate highly with human-interpretable notions of similarity, and are therefore trustworthy. Model performance is rather clear when examining CNN classifiers: EfficientNet predicts more relevant FP categories than the other classifiers for the majority of categories. On the other hand, it is harder to draw a similar conclusion for transformer-based classifiers, as different models perform better for different categories; however, compared to CNN classifiers, the knowledge-driven metric results are the same or higher for most categories. Even though this difference is not impressive, transformer-based models showcase an improved capability of predicting more relevant classes when failing to return the GT one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we implemented a novel distribution shift involving uncurated web images, upon which we tested convolutional and transformer-based image classifiers. The selection of closely related categories for classification is guided by hierarchical knowledge, which is also employed to evaluate the quality of the results. We show that accuracy-related metrics can only scratch the surface of classification evaluation, since they cannot capture semantic relationships between misclassified samples and ground truth labels. To this end, we propose an explainable, knowledge-driven evaluation scheme able to quantify misclassification relevance by providing the semantic distance between false positive and ground truth labels. The same scheme is also used to compare the classification capabilities of CNN and transformer-based models on the implemented distribution shift. As future work, we plan to extend our analysis to more query terms in order to examine the extent of our current findings, and to combine the uncurated image classification setting with artificial corruptions to enhance our insights.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Outline of our method.</figDesc><graphic coords="4,99.71,147.75,395.87,305.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Classification results using CNNs. Bold entries denote lower accuracy compared to best model accuracy.</figDesc><table><row><cell>Model</cell><cell>Label</cell><cell cols="2">Accuracy↑ Siblings↑</cell><cell>Label</cell><cell cols="2">Accuracy↑ Siblings↑</cell></row><row><cell>ResNet50</cell><cell></cell><cell>50.00%</cell><cell>24.00%</cell><cell></cell><cell>90.00%</cell><cell>0.00%</cell></row><row><cell>ResNet101</cell><cell></cell><cell>52.00%</cell><cell>41.67%</cell><cell></cell><cell>88.00%</cell><cell>16.67%</cell></row><row><cell>ResNet152</cell><cell></cell><cell>50.00%</cell><cell>12.00%</cell><cell></cell><cell>90.00%</cell><cell>20.00%</cell></row><row><cell>VGG16 VGG19</cell><cell>tabby cat</cell><cell>38.00% 50.00%</cell><cell>38.71% 32.00%</cell><cell>siamese cat</cell><cell>82.00% 88.00%</cell><cell>11.11% 16.67%</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>2.00%</cell><cell>2.04%</cell><cell></cell><cell>4.00%</cell><cell>0.00%</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>10.00%</cell><cell>33.33%</cell><cell></cell><cell>96.00%</cell><cell>100.00%</cell></row><row><cell>ConvNext</cell><cell></cell><cell>60.00%</cell><cell>15.00%</cell><cell></cell><cell>92.00%</cell><cell>75.00%</cell></row><row><cell>ResNet50</cell><cell></cell><cell>82.00%</cell><cell>0.00%</cell><cell></cell><cell>84.00%</cell><cell>0.00%</cell></row><row><cell>ResNet101</cell><cell></cell><cell>84.00%</cell><cell>0.00%</cell><cell></cell><cell>78.00%</cell><cell>0.00%</cell></row><row><cell>ResNet152</cell><cell></cell><cell>86.00%</cell><cell>0.00%</cell><cell></cell><cell>88.00%</cell><cell>0.00%</cell></row><row><cell>VGG16</cell><cell>lynx cat</cell><cell>82.00%</cell><cell>0.00%</cell><cell>cougar 
cat</cell><cell>86.00%</cell><cell>0.00%</cell></row><row><cell>VGG19</cell><cell></cell><cell>80.00%</cell><cell>0.00%</cell><cell></cell><cell>78.00%</cell><cell>0.00%</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>90.00%</cell><cell>0.00%</cell><cell></cell><cell>98.00%</cell><cell>100.00%</cell></row><row><cell>ConvNext</cell><cell></cell><cell>92.00%</cell><cell>0.00%</cell><cell></cell><cell>98.00%</cell><cell>0.00%</cell></row><row><cell>ResNet50</cell><cell></cell><cell>18.33%</cell><cell>0.00%</cell><cell></cell><cell>92.00%</cell><cell>25.00%</cell></row><row><cell>ResNet101</cell><cell></cell><cell>23.33%</cell><cell>0.00%</cell><cell></cell><cell>88.00%</cell><cell>16.67%</cell></row><row><cell>ResNet152</cell><cell></cell><cell>26.67%</cell><cell>0.00%</cell><cell></cell><cell>88.00%</cell><cell>33.33%</cell></row><row><cell>VGG16</cell><cell>tiger cat</cell><cell>20.00%</cell><cell>0.00%</cell><cell>persian cat</cell><cell>86.00%</cell><cell>14.29%</cell></row><row><cell>VGG19</cell><cell></cell><cell>28.33%</cell><cell>0.00%</cell><cell></cell><cell>80.00%</cell><cell>10.00%</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>1.67%</cell><cell>0.00%</cell><cell></cell><cell>8.00%</cell><cell>2.17%</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>36.67%</cell><cell>0.00%</cell><cell></cell><cell>98.00%</cell><cell>100.00%</cell></row><row><cell>ConvNext</cell><cell></cell><cell>26.67%</cell><cell>0.00%</cell><cell></cell><cell>98.00%</cell><cell>100.00%</cell></row><row><cell>ResNet50</cell><cell></cell><cell>12.00%</cell><cell>18.18%</cell><cell></cell><cell>12.00%</cell><cell>50.00%</cell></row><row><cell>ResNet101</cell><cell></cell><cell>12.00%</cell><cell>15.91%</cell><cell></cell><cell>20.00%</cell><cell>50.00%</cell></row><row><cell>ResNet152 VGG16</cell><cell>leopard cat</cell><cell>4.00% 10.00%</cell><cell>14.58% 6.67%</cell><cell>angora cat</cell><cell>20.00% 10.00%</cell><cell>62.50% 
46.67%</cell></row><row><cell>VGG19</cell><cell></cell><cell>10.00%</cell><cell>6.67%</cell><cell></cell><cell>8.00%</cell><cell>54.35%</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>2.00%</cell><cell>16.33%</cell><cell></cell><cell>4.00%</cell><cell>95.83%</cell></row><row><cell>ConvNext</cell><cell></cell><cell>16.00%</cell><cell>16.67%</cell><cell></cell><cell>10.00%</cell><cell>88.89%</cell></row><row><cell>ResNet50</cell><cell></cell><cell>24.00%</cell><cell>2.63%</cell><cell></cell><cell>82.05%</cell><cell>0.00%</cell></row><row><cell>ResNet101</cell><cell></cell><cell>30.00%</cell><cell>2.86%</cell><cell></cell><cell>82.05%</cell><cell>0.00%</cell></row><row><cell>ResNet152 VGG16</cell><cell>egyptian cat</cell><cell>34.00% 28.00%</cell><cell>6.06% 0.00%</cell><cell>cat</cell><cell>79.49% 87.18%</cell><cell>0.00% 0.00%</cell></row><row><cell>VGG19</cell><cell></cell><cell>26.00%</cell><cell>2.70%</cell><cell></cell><cell>76.92%</cell><cell>0.00%</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>0.00%</cell><cell>0.00%</cell><cell></cell><cell>2.56%</cell><cell>0.00%</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>70.00%</cell><cell>0.00%</cell><cell></cell><cell>92.31%</cell><cell>0.00%</cell></row><row><cell>ConvNext</cell><cell></cell><cell>52.00%</cell><cell>0.00%</cell><cell></cell><cell>94.87%</cell><cell>0.00%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Classification results using Transformers. Bold entries denote lower accuracy compared to best model accuracy, underlined metrics indicate best metric performance per class.</figDesc><table><row><cell>Model</cell><cell>Label</cell><cell cols="2">Accuracy↑ Siblings↑</cell><cell>Label</cell><cell cols="2">Accuracy↑ Siblings↑</cell></row><row><cell>ViT</cell><cell></cell><cell>44.00%</cell><cell>42.86%</cell><cell></cell><cell>92.00%</cell><cell>50.00%</cell></row><row><cell>BeiT</cell><cell></cell><cell>42.00%</cell><cell>3.45%</cell><cell></cell><cell>94.00%</cell><cell>0.00%</cell></row><row><cell>DeiT</cell><cell>tabby cat</cell><cell>60.00%</cell><cell>30.00%</cell><cell>siamese cat</cell><cell>94.00%</cell><cell>33.33%</cell></row><row><cell>Swin</cell><cell></cell><cell>48.00%</cell><cell>30.77%</cell><cell></cell><cell>94.00%</cell><cell>100.00%</cell></row><row><cell>xRegNet</cell><cell></cell><cell>52.00%</cell><cell>25.00%</cell><cell></cell><cell>92.00%</cell><cell>50.00%</cell></row><row><cell>CLIP</cell><cell></cell><cell>30.00%</cell><cell>28.57%</cell><cell></cell><cell>96.00%</cell><cell>50.00%</cell></row><row><cell>ViT</cell><cell></cell><cell>90.00%</cell><cell>0.00%</cell><cell></cell><cell>96.00%</cell><cell>0.00%</cell></row><row><cell>BeiT</cell><cell></cell><cell>26.00%</cell><cell>0.00%</cell><cell></cell><cell>92.00%</cell><cell>0.00%</cell></row><row><cell>DeiT</cell><cell>lynx cat</cell><cell>92.00%</cell><cell>0.00%</cell><cell>cougar 
cat</cell><cell>96.00%</cell><cell>0.00%</cell></row><row><cell>Swin</cell><cell></cell><cell>86.00%</cell><cell>0.00%</cell><cell></cell><cell>96.00%</cell><cell>0.00%</cell></row><row><cell>xRegNet</cell><cell></cell><cell>90.00%</cell><cell>0.00%</cell><cell></cell><cell>96.00%</cell><cell>50.00%</cell></row><row><cell>CLIP</cell><cell></cell><cell>86.00%</cell><cell>0.00%</cell><cell></cell><cell>92.00%</cell><cell>0.00%</cell></row><row><cell>ViT</cell><cell></cell><cell>18.33%</cell><cell>0.00%</cell><cell></cell><cell>92.00%</cell><cell>75.00%</cell></row><row><cell>BeiT</cell><cell></cell><cell>18.33%</cell><cell>22.45%</cell><cell></cell><cell>80.00%</cell><cell>10.00%</cell></row><row><cell>DeiT</cell><cell>tiger cat</cell><cell>15.00%</cell><cell>0.00%</cell><cell>persian cat</cell><cell>96.00%</cell><cell>50.00%</cell></row><row><cell>Swin</cell><cell></cell><cell>21.67%</cell><cell>0.00%</cell><cell></cell><cell>96.00%</cell><cell>50.00%</cell></row><row><cell>xRegNet</cell><cell></cell><cell>35.00%</cell><cell>0.00%</cell><cell></cell><cell>98.00%</cell><cell>100.00%</cell></row><row><cell>CLIP</cell><cell></cell><cell>46.67%</cell><cell>0.00%</cell><cell></cell><cell>96.00%</cell><cell>50.00%</cell></row><row><cell>ViT</cell><cell></cell><cell>12.00%</cell><cell>2.27%</cell><cell></cell><cell>6.00%</cell><cell>89.36%</cell></row><row><cell>BeiT</cell><cell></cell><cell>6.00%</cell><cell>78.72%</cell><cell></cell><cell>62.00%</cell><cell>52.63%</cell></row><row><cell>DeiT</cell><cell>leopard cat</cell><cell>10.00%</cell><cell>11.11%</cell><cell>angora 
cat</cell><cell>0.00%</cell><cell>94.00%</cell></row><row><cell>Swin</cell><cell></cell><cell>14.00%</cell><cell>9.30%</cell><cell></cell><cell>8.00%</cell><cell>95.65%</cell></row><row><cell>xRegNet</cell><cell></cell><cell>6.00%</cell><cell>21.28%</cell><cell></cell><cell>0.00%</cell><cell>76.00%</cell></row><row><cell>CLIP</cell><cell></cell><cell>10.00%</cell><cell>55.56%</cell><cell></cell><cell>8.00%</cell><cell>91.30%</cell></row><row><cell>ViT</cell><cell></cell><cell>38.00%</cell><cell>3.23%</cell><cell></cell><cell>89.74%</cell><cell>0.00%</cell></row><row><cell>BeiT</cell><cell></cell><cell>20.00%</cell><cell>22.50%</cell><cell></cell><cell>53.85%</cell><cell>0.00%</cell></row><row><cell>DeiT</cell><cell>egyptian cat</cell><cell>36.00%</cell><cell>3.12%</cell><cell>cat</cell><cell>94.87%</cell><cell>0.00%</cell></row><row><cell>Swin</cell><cell></cell><cell>52.00%</cell><cell>0.00%</cell><cell></cell><cell>89.74%</cell><cell>0.00%</cell></row><row><cell>xRegNet</cell><cell></cell><cell>48.00%</cell><cell>0.00%</cell><cell></cell><cell>92.31%</cell><cell>0.00%</cell></row><row><cell>CLIP</cell><cell></cell><cell>70.00%</cell><cell>6.67%</cell><cell></cell><cell>69.23%</cell><cell>0.00%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Common misclassifications for selected GT cat classes and misclassification frequency (CNNs).</figDesc><table><row><cell></cell><cell></cell><cell>Top-1</cell><cell></cell><cell>Top-2</cell><cell></cell><cell>Top-3</cell><cell></cell></row><row><cell>Model</cell><cell>GT</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell>32.00%</cell><cell>egyptian cat</cell><cell>24.00%</cell><cell>web site</cell><cell>8.00</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>34.00%</cell><cell>arctic fox</cell><cell>11.36%</cell><cell>lynx</cell><cell>9.09%</cell></row><row><cell></cell><cell>lynx</cell><cell>coyote</cell><cell>22.22%</cell><cell>tabby cat</cell><cell cols="2">11.11% egyptian cat</cell><cell>11.11%</cell></row><row><cell></cell><cell>siamese</cell><cell>great dane</cell><cell>20.00%</cell><cell>hare</cell><cell>20.00%</cell><cell>american</cell><cell>20.00%</cell></row><row><cell>Res</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>egret</cell><cell></cell></row><row><cell>Net50</cell><cell>tiger</cell><cell>tabby cat</cell><cell>40.82%</cell><cell>egyptian cat</cell><cell>20.41%</cell><cell>tiger</cell><cell>14.29%</cell></row><row><cell></cell><cell>persian</cell><cell>old English</cell><cell>25.00%</cell><cell>siamese cat</cell><cell>25.00%</cell><cell>hatchet</cell><cell>25.00%</cell></row><row><cell></cell><cell></cell><cell>sheepdog</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>cougar</cell><cell>lynx</cell><cell>25.00%</cell><cell>malinois</cell><cell>25.00%</cell><cell>wallaby</cell><cell>25.00%</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>30.00%</cell><cell>tiger 
cat</cell><cell>16.00%</cell><cell>jaguar</cell><cell>12.00%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>10.53%</cell><cell>mask</cell><cell>5.26%</cell><cell>comic book</cell><cell>5.26%</cell></row><row><cell></cell><cell>cat</cell><cell>fur coat</cell><cell>14.29%</cell><cell>carton</cell><cell cols="2">14.29% book jacket</cell><cell>14.29%</cell></row><row><cell></cell><cell>tabby</cell><cell>egyptian cat</cell><cell>41.67%</cell><cell>tiger cat</cell><cell>29.17%</cell><cell>web site</cell><cell>8.33%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>32.50%</cell><cell>egyptian cat</cell><cell>12.50</cell><cell>lynx</cell><cell>10.00%</cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell>12.50%</cell><cell>egyptian cat</cell><cell>12.50%</cell><cell>cheetah</cell><cell>12.50%</cell></row><row><cell>Res Net 101</cell><cell>siamese tiger persian cougar</cell><cell>Boston bull tabby cat keeshond lynx</cell><cell>16.67% 34.78% 16.67% 45.45%</cell><cell>egyptian cat tiger cat guinea pig meerkat</cell><cell cols="2">16.67% 17.39% egyptian cat hare 16.67% collie 9.09% dhole</cell><cell>16.67% 15.22% 16.67% 9.09%</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>36.00%</cell><cell>tiger cat</cell><cell>14.00%</cell><cell>leopard</cell><cell>12.00%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>14.29%</cell><cell>mask</cell><cell>8.57%</cell><cell>sea lion</cell><cell>5.71%</cell></row><row><cell></cell><cell>cat</cell><cell>macaque</cell><cell>14.29%</cell><cell>barbershop</cell><cell>14.29%</cell><cell>Pembroke</cell><cell>14.29%</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell>40.00%</cell><cell>egyptian cat</cell><cell>12.00%</cell><cell>lynx</cell><cell>12.00%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>35.00%</cell><cell>siamese 
cat</cell><cell>10.00%</cell><cell>shower</cell><cell>10.00%</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>curtain</cell><cell></cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell>42.86%</cell><cell>coyote</cell><cell>14.29%</cell><cell>norwich</cell><cell>14.29</cell></row><row><cell>Res Net 152</cell><cell>siamese tiger persian</cell><cell>whippet tabby cat siamese cat</cell><cell>20.00% 34.09% 33.33%</cell><cell>egyptian cat egyptian cat collie</cell><cell>20.00% 18.18% 16.67%</cell><cell>terrier angora cat tiger fur coat</cell><cell>20.00% 15.91% 16.67%</cell></row><row><cell></cell><cell>cougar</cell><cell>menu</cell><cell>16.67%</cell><cell>wild boar</cell><cell>16.67%</cell><cell>wallaby</cell><cell>16.67%</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>28.00%</cell><cell>lynx</cell><cell>22.00%</cell><cell>jaguar</cell><cell>16.00%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>18.18%</cell><cell>web site</cell><cell>9.09%</cell><cell>tabby cat</cell><cell>6.06%</cell></row><row><cell></cell><cell>cat</cell><cell>macaque</cell><cell>12.50%</cell><cell>Pembroke</cell><cell>12.50%</cell><cell>chihuahua</cell><cell>12.50%</cell></row><row><cell></cell><cell>tabby</cell><cell>egyptian cat</cell><cell>38.71%</cell><cell>tiger cat</cell><cell cols="2">22.58% wood rabbit</cell><cell>3.23%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>26.67%</cell><cell>egyptian cat</cell><cell>15.56%</cell><cell>lynx</cell><cell>8.89%</cell></row><row><cell></cell><cell>lynx</cell><cell>coyote</cell><cell>33.33</cell><cell>egyptian cat</cell><cell cols="2">22.22% madagascar</cell><cell>11.11%</cell></row><row><cell></cell><cell cols="2">siamese mexican hairless</cell><cell>22.22%</cell><cell>whippet</cell><cell>11.11%</cell><cell>fur 
coat</cell><cell>11.11%</cell></row><row><cell>VGG</cell><cell>tiger</cell><cell>tabby cat</cell><cell>33.33%</cell><cell>egyptian cat</cell><cell>20.83%</cell><cell>tiger</cell><cell>16.67%</cell></row><row><cell>16</cell><cell>persian</cell><cell>arctic fox</cell><cell>14.29%</cell><cell>angora cat</cell><cell>14.29%</cell><cell>lynx</cell><cell>14.29%</cell></row><row><cell></cell><cell>cougar</cell><cell>lynx</cell><cell>42.86%</cell><cell>coyote</cell><cell>28.57%</cell><cell>menu</cell><cell>14.29%</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>42.00%</cell><cell>lynx</cell><cell>18.00%</cell><cell>jaguar</cell><cell>10.00%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>8.33%</cell><cell>lynx</cell><cell>5.56%</cell><cell>sombrero</cell><cell>5.56%</cell></row><row><cell></cell><cell>cat</cell><cell>norwich terrier</cell><cell>20.00%</cell><cell>schipperke</cell><cell>20.00%</cell><cell>kit fox</cell><cell>20.00%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Common misclassifications for selected GT cat classes and misclassification frequency (Transformers).</figDesc><table><row><cell></cell><cell></cell><cell>Top-1</cell><cell></cell><cell>Top-2</cell><cell></cell><cell>Top-3</cell><cell></cell></row><row><cell>Model</cell><cell>GT</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell></row><row><cell></cell><cell>tabby</cell><cell>madagascar</cell><cell>40.00%</cell><cell>egyptian cat</cell><cell>22.86%</cell><cell>tiger cat</cell><cell>11.43</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>78.26%</cell><cell>madagascar</cell><cell>6.52%</cell><cell>siamese cat</cell><cell>6.52%</cell></row><row><cell></cell><cell>lynx</cell><cell>madagascar</cell><cell>14.29%</cell><cell>leopard cat</cell><cell>14.29%</cell><cell>grey fox</cell><cell>14.29%</cell></row><row><cell></cell><cell>siamese</cell><cell>polecat</cell><cell>50.00%</cell><cell>persian cat</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell>CLIP</cell><cell>tiger persian</cell><cell>egyptian cat madagascar</cell><cell>30.77% 50.00%</cell><cell>madagascar siamese cat</cell><cell cols="2">19.23% leopard cat 50.00% -</cell><cell>15.38% -</cell></row><row><cell></cell><cell>cougar</cell><cell>lynx</cell><cell>75.00%</cell><cell>madagascar</cell><cell>25.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>leopard</cell><cell>tiger cat</cell><cell>55.56%</cell><cell>madagascar</cell><cell cols="2">17.78% egyptian cat</cell><cell>11.11%</cell></row><row><cell></cell><cell cols="2">egyptian mexican 
hairless</cell><cell>26.67%</cell><cell>madagascar</cell><cell>26.67%</cell><cell>armadillo</cell><cell>6.67%</cell></row><row><cell></cell><cell>cat</cell><cell>madagascar</cell><cell>66.67%</cell><cell>orange</cell><cell>16.67%</cell><cell>bib</cell><cell>8.33%</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell>17.24%</cell><cell>cat</cell><cell>13.79%</cell><cell>domestic</cell><cell>13.79%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>42.11%</cell><cell>domestic</cell><cell cols="2">21.05% quadruped</cell><cell>5.26%</cell></row><row><cell></cell><cell>lynx</cell><cell>common lynx</cell><cell>59.46%</cell><cell>Canada lynx</cell><cell>16.22%</cell><cell>bobcat</cell><cell>5.41%</cell></row><row><cell></cell><cell>siamese</cell><cell>kitten</cell><cell>66.67%</cell><cell>feline</cell><cell>33.33%</cell><cell>-</cell><cell>-</cell></row><row><cell>BeiT</cell><cell>tiger persian</cell><cell>tabby cat domestic</cell><cell>23.40% 20.00%</cell><cell>margay angora cat</cell><cell cols="2">14.89% 10.00% breadwinner domestic</cell><cell>6.38% 10.00%</cell></row><row><cell></cell><cell>cougar</cell><cell>feline</cell><cell>25.00%</cell><cell>big cat</cell><cell>25.00%</cell><cell>cub</cell><cell>25.00%</cell></row><row><cell></cell><cell>leopard</cell><cell>margay</cell><cell>42.55%</cell><cell>ocelot</cell><cell cols="2">21.28% spotted lynx</cell><cell>8.51%</cell></row><row><cell></cell><cell>egyptian</cell><cell>Abyssinian</cell><cell cols="2">15.00% mexican hairless</cell><cell>10.00%</cell><cell>mouser</cell><cell>5.00%</cell></row><row><cell></cell><cell>cat</cell><cell>feline</cell><cell>33.33%</cell><cell>kitten</cell><cell>22.22%</cell><cell>caterer</cell><cell>11.11%</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell>35.00%</cell><cell>egyptian cat</cell><cell>30.00%</cell><cell>web site</cell><cell>15.00%</cell></row><row><cell></cell><cell>angora</cell><cell>persian 
cat</cell><cell>62.00%</cell><cell>egyptian cat</cell><cell>28.00%</cell><cell>tabby cat</cell><cell>2.00%</cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell>75.00%</cell><cell>coyote</cell><cell>25.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>siamese</cell><cell>egyptian cat</cell><cell cols="2">33.33% mexican hairless</cell><cell>33.33%</cell><cell>lynx</cell><cell>33.33%</cell></row><row><cell></cell><cell>tiger</cell><cell>tabby cat</cell><cell>37.50%</cell><cell>egyptian cat</cell><cell cols="2">27.50% leopard cat</cell><cell>12.50%</cell></row><row><cell>DeiT</cell><cell>persian</cell><cell>wheaten terrier soft-coated</cell><cell>50.00%</cell><cell>siamese cat</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>cougar</cell><cell>web site</cell><cell>50.00%</cell><cell>dingo</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>48.89%</cell><cell>lynx</cell><cell>22.22%</cell><cell>tiger cat</cell><cell>11.11%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>15.62%</cell><cell>comic book</cell><cell>9.38%</cell><cell>kelpie</cell><cell>3.12%</cell></row><row><cell></cell><cell>cat</cell><cell>fur coat</cell><cell>50.00%</cell><cell>chihuahua</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell>62.50%</cell><cell>egyptian cat</cell><cell>20.83%</cell><cell>menu</cell><cell>4.17%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell>48.00%</cell><cell>egyptian cat</cell><cell>18.00%</cell><cell>lynx</cell><cell>8.00%</cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell>40.00%</cell><cell>tiger cat</cell><cell cols="2">20.00% egyptian cat</cell><cell>20.00%</cell></row><row><cell></cell><cell>siamese</cell><cell>egyptian 
cat</cell><cell>50.00%</cell><cell>polecat</cell><cell>25.00%</cell><cell>lynx</cell><cell>25.00%</cell></row><row><cell>xReg</cell><cell>tiger</cell><cell>tabby cat</cell><cell>40.00%</cell><cell>egyptian cat</cell><cell>25.71%</cell><cell>lynx</cell><cell>11.43%</cell></row><row><cell>Net</cell><cell>persian</cell><cell>siamese cat</cell><cell>100.0%</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>cougar</cell><cell>tiger cat</cell><cell>50.00%</cell><cell>lynx</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>40.43%</cell><cell>tiger cat</cell><cell>21.28%</cell><cell>lynx</cell><cell>17.02%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>15.38%</cell><cell>mask</cell><cell>7.69%</cell><cell>comic book</cell><cell>7.69%</cell></row><row><cell></cell><cell>cat</cell><cell>comic book</cell><cell>33.33%</cell><cell>tub</cell><cell>33.33%</cell><cell>drake</cell><cell>33.33%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 (</head><label>5</label><figDesc>Continuation of Tab 4). Common misclassifications and misclassification frequency.</figDesc><table><row><cell></cell><cell></cell><cell>Top-1</cell><cell></cell><cell>Top-2</cell><cell></cell><cell>Top-3</cell><cell></cell></row><row><cell>Model</cell><cell>GT</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell><cell>FP</cell><cell>MF</cell></row><row><cell></cell><cell>tabby</cell><cell>tiger cat</cell><cell cols="2">57.69% egyptian cat</cell><cell>30.77%</cell><cell>web site</cell><cell>7.69%</cell></row><row><cell></cell><cell>angora</cell><cell>persian cat</cell><cell cols="2">58.70% egyptian cat</cell><cell>26.09%</cell><cell>tabby cat</cell><cell>10.87%</cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell>57.14%</cell><cell>fur coat</cell><cell>14.29%</cell><cell>timber wolf</cell><cell>14.29%</cell></row><row><cell></cell><cell>siamese</cell><cell>egyptian cat</cell><cell>100.0%</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>Swin</cell><cell>tiger persian</cell><cell>tabby cat siamese cat</cell><cell cols="2">35.00% egyptian cat 50.00% hand blower</cell><cell>32.50% 50.00%</cell><cell>leopard cat -</cell><cell>12.50% -</cell></row><row><cell></cell><cell>cougar</cell><cell>web site</cell><cell>50.00%</cell><cell>Irish</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell>wolfhound</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell>44.19%</cell><cell>lynx</cell><cell>37.21%</cell><cell>tiger cat</cell><cell>9.30%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell cols="2">20.83% comic book</cell><cell>16.67%</cell><cell>table lamp</cell><cell>8.33%</cell></row><row><cell></cell><cell>cat</cell><cell>fur 
coat</cell><cell>25.00%</cell><cell>jersey</cell><cell>25.00%</cell><cell></cell><cell>25.00%</cell></row><row><cell></cell><cell>tabby</cell><cell>egyptian cat</cell><cell>42.86%</cell><cell>tiger cat</cell><cell>32.14%</cell><cell>web site</cell><cell>10.71%</cell></row><row><cell></cell><cell>angora</cell><cell>egyptian cat</cell><cell>48.94%</cell><cell>persian cat</cell><cell>38.30%</cell><cell>tabby cat</cell><cell>2.13%</cell></row><row><cell></cell><cell>lynx</cell><cell>tabby cat</cell><cell cols="2">40.00% egyptian cat</cell><cell>40.00%</cell><cell>timber wolf</cell><cell>20.00%</cell></row><row><cell></cell><cell>siamese</cell><cell>egyptian cat</cell><cell>50.00%</cell><cell>chihuahua</cell><cell cols="2">25.00% mexican hairless</cell><cell></cell></row><row><cell>ViT</cell><cell>tiger persian</cell><cell>egyptian cat plastic bag</cell><cell cols="2">48.78% 25.00% egyptian cat tabby cat</cell><cell>26.83% 25.00%</cell><cell>leopard cat siamese cat</cell><cell>14.63% 25.00%</cell></row><row><cell></cell><cell>cougar</cell><cell>egyptian cat</cell><cell>50.00%</cell><cell>malinois</cell><cell>50.00%</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>leopard</cell><cell>egyptian cat</cell><cell cols="2">86.36% snow leopard</cell><cell>2.27%</cell><cell>web site</cell><cell>2.27%</cell></row><row><cell></cell><cell cols="2">egyptian mexican hairless</cell><cell>16.13%</cell><cell>pedestal</cell><cell>12.90%</cell><cell>vase</cell><cell>6.45%</cell></row><row><cell></cell><cell>cat</cell><cell>washer</cell><cell>25.00%</cell><cell>fur coat</cell><cell cols="2">25.00% mexican hairless</cell><cell>25.00%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Conceptual metrics based on WordNet distances using CNN classifiers.</figDesc><table><row><cell>Model</cell><cell>Label</cell><cell cols="4">Path sim↑ LCH↑ WUPS↑ Label</cell><cell cols="3">Path sim↑ LCH↑ WUPS↑</cell></row><row><cell>ResNet50</cell><cell></cell><cell>0.18</cell><cell>1.79</cell><cell>0.69</cell><cell></cell><cell>0.10</cell><cell>1.25</cell><cell>0.57</cell></row><row><cell>ResNet101</cell><cell></cell><cell>0.22</cell><cell>1.99</cell><cell>0.75</cell><cell></cell><cell>0.15</cell><cell>1.60</cell><cell>0.72</cell></row><row><cell>ResNet152</cell><cell>tabby</cell><cell>0.16</cell><cell>1.59</cell><cell>0.62</cell><cell>siamese</cell><cell>0.17</cell><cell>1.71</cell><cell>0.67</cell></row><row><cell>VGG16</cell><cell>cat</cell><cell>0.21</cell><cell>1.85</cell><cell>0.70</cell><cell>cat</cell><cell>0.16</cell><cell>1.65</cell><cell>0.65</cell></row><row><cell>VGG19</cell><cell></cell><cell>0.18</cell><cell>1.68</cell><cell>0.63</cell><cell></cell><cell>0.17</cell><cell>1.73</cell><cell>0.70</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>0.09</cell><cell>1.13</cell><cell>0.39</cell><cell></cell><cell>0.09</cell><cell>1.15</cell><cell>0.40</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>0.24</cell><cell>2.17</cell><cell>0.86</cell><cell></cell><cell>0.33</cell><cell>2.54</cell><cell>0.88</cell></row><row><cell>ResNet50</cell><cell></cell><cell>0.05</cell><cell>0.53</cell><cell>0.11</cell><cell></cell><cell>0.08</cell><cell>0.97</cell><cell>0.49</cell></row><row><cell>ResNet101</cell><cell></cell><cell>0.05</cell><cell>0.62</cell><cell>0.13</cell><cell></cell><cell>0.08</cell><cell>0.90</cell><cell>0.41</cell></row><row><cell>ResNet152 VGG16</cell><cell>lynx cat</cell><cell>0.05 0.04</cell><cell>0.56 0.46</cell><cell>0.09 0.08</cell><cell>cougar cat</cell><cell>0.08 0.08</cell><cell>1.01 0.88</cell><cell>0.49 
0.37</cell></row><row><cell>VGG19</cell><cell></cell><cell>0.04</cell><cell>0.48</cell><cell>0.08</cell><cell></cell><cell>0.07</cell><cell>0.86</cell><cell>0.38</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>0.05</cell><cell>0.54</cell><cell>0.09</cell><cell></cell><cell>0.33</cell><cell>2.54</cell><cell>0.94</cell></row><row><cell>ResNet50</cell><cell></cell><cell>0.15</cell><cell>1.61</cell><cell>0.70</cell><cell></cell><cell>0.16</cell><cell>1.59</cell><cell>0.60</cell></row><row><cell>ResNet101</cell><cell></cell><cell>0.14</cell><cell>1.47</cell><cell>0.64</cell><cell></cell><cell>0.17</cell><cell>1.73</cell><cell>0.66</cell></row><row><cell>ResNet152 VGG16</cell><cell>tiger cat</cell><cell>0.13 0.14</cell><cell>1.43 1.51</cell><cell>0.61 0.65</cell><cell>persian cat</cell><cell>0.17 0.12</cell><cell>1.57 1.27</cell><cell>0.56 0.50</cell></row><row><cell>VGG19</cell><cell></cell><cell>0.13</cell><cell>1.43</cell><cell>0.61</cell><cell></cell><cell>0.13</cell><cell>1.45</cell><cell>0.57</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>0.07</cell><cell>0.90</cell><cell>0.41</cell><cell></cell><cell>0.09</cell><cell>1.19</cell><cell>0.43</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>0.13</cell><cell>1.41</cell><cell>0.60</cell><cell></cell><cell>0.33</cell><cell>2.54</cell><cell>0.88</cell></row><row><cell>ResNet50</cell><cell></cell><cell>0.17</cell><cell>1.62</cell><cell>0.67</cell><cell></cell><cell>0.22</cell><cell>1.94</cell><cell>0.72</cell></row><row><cell>ResNet101</cell><cell></cell><cell>0.17</cell><cell>1.62</cell><cell>0.67</cell><cell></cell><cell>0.22</cell><cell>1.93</cell><cell>0.71</cell></row><row><cell>ResNet152</cell><cell>leopard</cell><cell>0.16</cell><cell>1.55</cell><cell>0.64</cell><cell>angora</cell><cell>0.24</cell><cell>2.05</cell><cell>0.73</cell></row><row><cell>VGG16</cell><cell>cat</cell><cell>0.15</cell><cell>1.49</cell><cell>0.62</cell><cell>cat</cell><cell>0.21</cell><cell>1.90</cell><cell
>0.72</cell></row><row><cell>VGG19</cell><cell></cell><cell>0.15</cell><cell>1.53</cell><cell>0.64</cell><cell></cell><cell>0.23</cell><cell>2.01</cell><cell>0.75</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>0.18</cell><cell>1.68</cell><cell>0.68</cell><cell></cell><cell>0.32</cell><cell>2.48</cell><cell>0.86</cell></row><row><cell>ResNet50</cell><cell></cell><cell>0.11</cell><cell>1.32</cell><cell>0.51</cell><cell></cell><cell>0.11</cell><cell>1.29</cell><cell>0.56</cell></row><row><cell>ResNet101</cell><cell></cell><cell>0.11</cell><cell>1.32</cell><cell>0.50</cell><cell></cell><cell>0.11</cell><cell>1.34</cell><cell>0.63</cell></row><row><cell>ResNet152 VGG16</cell><cell>egyptian cat</cell><cell>0.12 0.09</cell><cell>1.40 1.15</cell><cell>0.55 0.41</cell><cell>cat</cell><cell>0.11 0.14</cell><cell>1.35 1.67</cell><cell>0.61 0.80</cell></row><row><cell>VGG19</cell><cell></cell><cell>0.10</cell><cell>1.21</cell><cell>0.45</cell><cell></cell><cell>0.12</cell><cell>1.42</cell><cell>0.64</cell></row><row><cell>MobileNetV2</cell><cell></cell><cell>0.08</cell><cell>1.06</cell><cell>0.36</cell><cell></cell><cell>0.07</cell><cell>0.88</cell><cell>0.34</cell></row><row><cell>EfficientNet</cell><cell></cell><cell>0.10</cell><cell>1.24</cell><cell>0.49</cell><cell></cell><cell>0.12</cell><cell>1.52</cell><cell>0.74</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7 (</head><label>7</label><figDesc>Continuation of Tab 6). Conceptual metrics based on WordNet distances using transformers.</figDesc><table><row><cell>Model</cell><cell>Label</cell><cell cols="4">Path sim↑ LCH↑ WUPS↑ Label</cell><cell cols="3">Path sim↑ LCH↑ WUPS↑</cell></row><row><cell>ViT</cell><cell></cell><cell>0.23</cell><cell>2.02</cell><cell>0.76</cell><cell></cell><cell>0.23</cell><cell>2.01</cell><cell>0.71</cell></row><row><cell>BeiT</cell><cell></cell><cell>0.24</cell><cell>2.02</cell><cell>0.75</cell><cell></cell><cell>0.18</cell><cell>1.88</cell><cell>0.77</cell></row><row><cell>DeiT</cell><cell>tabby</cell><cell>0.20</cell><cell>1.83</cell><cell>0.70</cell><cell>siamese</cell><cell>0.19</cell><cell>1.72</cell><cell>0.58</cell></row><row><cell>Swin</cell><cell>cat</cell><cell>0.23</cell><cell>2.07</cell><cell>0.82</cell><cell>cat</cell><cell>0.33</cell><cell>2.54</cell><cell>0.88</cell></row><row><cell>xRegNet</cell><cell></cell><cell>0.22</cell><cell>2.00</cell><cell>0.79</cell><cell></cell><cell>0.21</cell><cell>1.84</cell><cell>0.66</cell></row><row><cell>CLIP</cell><cell></cell><cell>0.18</cell><cell>1.78</cell><cell>0.75</cell><cell></cell><cell>0.24</cell><cell>2.12</cell><cell>0.84</cell></row><row><cell>ViT</cell><cell></cell><cell>0.05</cell><cell>0.55</cell><cell>0.09</cell><cell></cell><cell>0.15</cell><cell>1.63</cell><cell>0.79</cell></row><row><cell>BeiT</cell><cell></cell><cell>0.04</cell><cell>0.33</cell><cell>0.07</cell><cell></cell><cell>0.21</cell><cell>1.94</cell><cell>0.79</cell></row><row><cell>DeiT Swin</cell><cell>lynx cat</cell><cell>0.05 0.05</cell><cell>0.54 0.56</cell><cell>0.09 0.09</cell><cell>cougar cat</cell><cell>0.09 0.07</cell><cell>1.13 0.97</cell><cell>0.54 
0.51</cell></row><row><cell>xRegNet</cell><cell></cell><cell>0.04</cell><cell>0.50</cell><cell>0.08</cell><cell></cell><cell>0.19</cell><cell>1.44</cell><cell>0.50</cell></row><row><cell>CLIP</cell><cell></cell><cell>0.04</cell><cell>0.51</cell><cell>0.10</cell><cell></cell><cell>0.06</cell><cell>0.62</cell><cell>0.24</cell></row><row><cell>ViT</cell><cell></cell><cell>0.15</cell><cell>1.60</cell><cell>0.69</cell><cell></cell><cell>0.27</cell><cell>2.19</cell><cell>0.76</cell></row><row><cell>BeiT</cell><cell></cell><cell>0.21</cell><cell>1.89</cell><cell>0.76</cell><cell></cell><cell>0.22</cell><cell>1.87</cell><cell>0.66</cell></row><row><cell>DeiT Swin</cell><cell>tiger cat</cell><cell>0.13 0.14</cell><cell>1.42 1.47</cell><cell>0.61 0.63</cell><cell>persian cat</cell><cell>0.24 0.21</cell><cell>2.12 1.85</cell><cell>0.80 0.65</cell></row><row><cell>xRegNet</cell><cell></cell><cell>0.14</cell><cell>1.50</cell><cell>0.64</cell><cell></cell><cell>0.33</cell><cell>2.54</cell><cell>0.88</cell></row><row><cell>CLIP</cell><cell></cell><cell>0.12</cell><cell>1.39</cell><cell>0.62</cell><cell></cell><cell>0.22</cell><cell>1.99</cell><cell>0.80</cell></row><row><cell>ViT</cell><cell></cell><cell>0.17</cell><cell>1.76</cell><cell>0.75</cell><cell></cell><cell>0.31</cell><cell>2.39</cell><cell>0.83</cell></row><row><cell>BeiT</cell><cell></cell><cell>0.31</cell><cell>2.47</cell><cell>0.93</cell><cell></cell><cell>0.30</cell><cell>2.22</cell><cell>0.73</cell></row><row><cell>DeiT</cell><cell>leopard</cell><cell>0.15</cell><cell>1.47</cell><cell>0.60</cell><cell>angora</cell><cell>0.32</cell><cell>2.48</cell><cell>0.86</cell></row><row><cell>Swin</cell><cell>cat</cell><cell>0.13</cell><cell>1.26</cell><cell>0.50</cell><cell>cat</cell><cell>0.32</cell><cell>2.49</cell><cell>0.86</cell></row><row><cell>xRegNet</cell><cell></cell><cell>0.18</cell><cell>1.70</cell><cell>0.69</cell><cell></cell><cell>0.28</cell><cell>2.22</cell><cell>0.78</cell></row><row><cell>CLIP</cell><cell></
cell><cell>0.22</cell><cell>1.87</cell><cell>0.74</cell><cell></cell><cell>0.31</cell><cell>2.44</cell><cell>0.86</cell></row><row><cell>ViT</cell><cell></cell><cell>0.11</cell><cell>1.33</cell><cell>0.50</cell><cell></cell><cell>0.09</cell><cell>1.15</cell><cell>0.50</cell></row><row><cell>BeiT</cell><cell></cell><cell>0.16</cell><cell>1.63</cell><cell>0.60</cell><cell></cell><cell>0.23</cell><cell>1.81</cell><cell>0.68</cell></row><row><cell>DeiT Swin</cell><cell>egyptian cat</cell><cell>0.11 0.11</cell><cell>1.31 1.34</cell><cell>0.50 0.52</cell><cell>cat</cell><cell>0.05 0.05</cell><cell>0.72 0.73</cell><cell>0.29 0.29</cell></row><row><cell>xRegNet</cell><cell></cell><cell>0.10</cell><cell>1.24</cell><cell>0.46</cell><cell></cell><cell>0.06</cell><cell>0.89</cell><cell>0.39</cell></row><row><cell>CLIP</cell><cell></cell><cell>0.15</cell><cell>1.65</cell><cell>0.70</cell><cell></cell><cell>0.12</cell><cell>1.42</cell><cell>0.66</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://paperswithcode.com/sota/image-classification-on-imagenet</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. More CNN misclassifications</head><p>In Table <ref type="table">8</ref>, we present the continuation of the results reported in Table <ref type="table">3</ref> for the remaining CNN models with non-zero accuracy. It becomes evident that the capacity of the classifier plays an important role in identifying relevant FP: MobileNetV2, which already demonstrated low accuracy scores, also fails to retrieve semantically related FP classes. This is easily observed from the numerous red entries corresponding to this model.</p><p>Otherwise, the results agree with the observations analyzed in Table <ref type="table">3</ref>, where the 'egyptian cat' label demonstrated many irrelevant FP, in contrast to the 'tabby cat' and 'tiger cat' labels.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Imagenet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2009.5206848</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Imagenet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS&apos;12</title>
				<meeting>the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS&apos;12<address><addrLine>Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1097" to="1105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2004.02806</idno>
		<idno type="arXiv">arXiv:2004.02806</idno>
		<ptr target="https://arxiv.org/abs/2004.02806" />
		<title level="m">A survey of convolutional neural networks: Analysis, applications, and prospects</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1706.03762</idno>
		<ptr target="https://arxiv.org/abs/1706.03762" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2010.11929</idno>
		<ptr target="https://arxiv.org/abs/2010.11929" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Coca: Contrastive captioners are image-text foundation models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yeung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seyedhosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2205.01917</idno>
		<ptr target="https://arxiv.org/abs/2205.01917" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ilharco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Y</forename><surname>Gadre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Roelofs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gontijo-Lopes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Morcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Namkoong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Carmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v162/wortsman22a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 39th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Jegelka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Song</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Szepesvari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<meeting>the 39th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page" from="23965" to="23998" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2111.09883</idno>
		<ptr target="https://arxiv.org/abs/2111.09883" />
		<title level="m">Swin transformer v2: Scaling up capacity and resolution</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Beit: Bert pre-training of image transformers</title>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Piao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2106.08254</idno>
		<ptr target="https://arxiv.org/abs/2106.08254" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Certified adversarial robustness via randomized smoothing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rosenfeld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kolter</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v97/cohen19c.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="1310" to="1320" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness</title>
		<author>
			<persName><forename type="first">R</forename><surname>Geirhos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rubisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Michaelis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bethge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Wichmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Brendel</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1811.12231</idno>
		<ptr target="https://arxiv.org/abs/1811.12231" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Invariance-inducing regularization using worstcase transformations suffices to boost accuracy and spatial robustness</title>
		<author>
			<persName><forename type="first">F</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Heinze-Deml</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2019/file/1d01bd2e16f57892f0954902899f0692-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Alché-Buc</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">32</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">A large-scale study of representation learning with the visual task adaptation benchmark</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Puigcerver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ruyssen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Riquelme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Djolonga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschannen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Michalski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1910.04867</idno>
		<ptr target="https://arxiv.org/abs/1910.04867" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Benchmarking neural network robustness to common corruptions and perturbations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dietterich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.12261</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Measuring robustness to natural distribution shifts in image classification</title>
		<author>
			<persName><forename type="first">R</forename><surname>Taori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Shankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Carlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Recht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2007.00644</idno>
		<ptr target="https://arxiv.org/abs/2007.00644" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The many faces of robustness: A critical analysis of out-of-distribution generalization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dorundo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Parajuli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gilmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8320" to="8329" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barbu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mayo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alverio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gutfreund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tenenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Katz</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2019/file/97af07a14cacba681feacf3012730892-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>D&apos;Alché-Buc</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">32</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Natural adversarial examples</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1907.07174</idno>
		<ptr target="https://arxiv.org/abs/1907.07174" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Improved regularization of convolutional neural networks with cutout</title>
		<author>
			<persName><forename type="first">T</forename><surname>Devries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1708.04552</idno>
		<ptr target="https://arxiv.org/abs/1708.04552" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Improving the robustness of deep neural networks via stability training</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1604.04326</idno>
		<ptr target="https://arxiv.org/abs/1604.04326" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Improving deep learning using generic data augmentation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nitschke</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1708.06020</idno>
		<ptr target="https://arxiv.org/abs/1708.06020" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Data augmentation can improve robustness</title>
		<author>
			<persName><forename type="first">S.-A</forename><surname>Rebuffi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gowal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Calian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Stimberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Wiles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mann</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2111.05328</idno>
		<ptr target="https://arxiv.org/abs/2111.05328" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Fellbaum</surname></persName>
		</author>
		<title level="m">WordNet: An electronic lexical database</title>
				<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno>CoRR abs/1409.1556</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Rethinking the inception architecture for computer vision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.308</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2818" to="2826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.90</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Xception: Deep learning with depthwise separable convolutions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1800" to="1807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Inception-v4, Inception-ResNet and the impact of residual connections on learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Alemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI&apos;17</title>
				<meeting>the Thirty-First AAAI Conference on Artificial Intelligence, AAAI&apos;17</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4278" to="4284" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">A ConvNet for the 2020s</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Feichtenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<idno>CoRR abs/2201.03545</idno>
		<ptr target="https://arxiv.org/abs/2201.03545" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Transformers in vision: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Naseer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hayat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Zamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</author>
		<idno type="DOI">10.1145/3505244</idno>
		<ptr target="https://doi.org/10.1145/3505244" />
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Training data-efficient image transformers &amp; distillation through attention</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Meila</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 38th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="10347" to="10357" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Swin transformer v2: Scaling up capacity and resolution</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR52688.2022.01170</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="11999" to="12009" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Designing network design spaces</title>
		<author>
			<persName><forename type="first">I</forename><surname>Radosavovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Kosaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Vision transformers are robust learners</title>
		<author>
			<persName><forename type="first">S</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Do ImageNet classifiers generalize to ImageNet?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Recht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Roelofs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Shankar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Generalisation in humans and deep neural networks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Geirhos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R M</forename><surname>Temme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rauber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Schütt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bethge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Wichmann</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1808.08750</idno>
		<ptr target="https://arxiv.org/abs/1808.08750" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Using synthetic corruptions to measure robustness to natural distribution shifts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Laugros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Caplier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ospici</surname></persName>
		</author>
		<idno>ArXiv abs/2107.12052</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Natural adversarial examples</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1907.07174</idno>
		<ptr target="https://arxiv.org/abs/1907.07174" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Evasion attacks against machine learning at test time</title>
		<author>
			<persName><forename type="first">B</forename><surname>Biggio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Corona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maiorca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Nelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Šrndić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Laskov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Giacinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Roli</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-642-40994-3_25</idno>
		<ptr target="https://doi.org/10.1007/978-3-642-40994-3_25" />
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="387" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Benchmarking adversarial robustness on image classification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q.-A</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR42600.2020.00040</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="318" to="328" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<title level="m" type="main">Robustness may be at odds with accuracy</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Engstrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.1805.12152</idno>
		<ptr target="https://arxiv.org/abs/1805.12152" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<title level="m" type="main">Selection of source images heavily influences the effectiveness of adversarial attacks</title>
		<author>
			<persName><forename type="first">U</forename><surname>Ozbulak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">T</forename><surname>Anzaku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">D</forename><surname>Neve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Messem</surname></persName>
		</author>
		<idno>ArXiv abs/2106.07141</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<title level="m" type="main">Adversarial robustness is at odds with lazy training</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ullah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mianjy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Arora</surname></persName>
		</author>
		<idno>ArXiv abs/2207.00411</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">An impartial take to the CNN vs transformer robustness contest</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H S</forename><surname>Torr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Dokania</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<title level="m" type="main">Can CNNs be more robust than transformers?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xie</surname></persName>
		</author>
		<idno>ArXiv abs/2206.03452</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<title level="m" type="main">Understanding robustness of transformers for image classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bhojanapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Glasner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veit</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">On the strong correlation between model invariance and generalization</title>
		<author>
			<persName><forename type="first">W</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gould</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<idno>ArXiv abs/2207.07065</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">On interaction between augmentations and corruptions in natural corruption robustness</title>
		<author>
			<persName><forename type="first">E</forename><surname>Mintun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kirillov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2021/file/1d49780520898fe37f0cd6b41c5311bf-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="3571" to="3583" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Sandler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zhmoginov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-C</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00474</idno>
		<title level="m">MobileNetV2: Inverted residuals and linear bottlenecks</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4510" to="4520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<title level="m" type="main">Learning transferable architectures for scalable image recognition</title>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00907</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="8697" to="8710" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">Densely connected convolutional networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.243</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<title level="m" type="main">EfficientNet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
